Over a year ago, I wrote a blog post entitled SGML on the Web: A Failed Dream? where I asked whether the original vision of XML had failed. Below are excerpts from that post:

The people who got together to produce the XML 1.0 recommendation were motivated to do this because they saw a need for SGML on the Web. Specifically, their discussions focused on two general areas:
  • Classes of software applications for which HTML was an inadequate information format
  • Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology

The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases.

The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web.

And thus XML was born.

The W3C's attempts to get people to author XML directly on the Web have mostly failed, as can be seen from the dismal adoption rate of XHTML. In fact, many [including myself] have come to the conclusion that the benefits of adopting XHTML, compared to the costs, are too low if not non-existent. There was once an expectation that content producers would be able to place documents conformant to their own XML vocabularies on the Web and that display would be handled entirely by stylesheets, but this has yet to become widespread. In fact, at least one member of a W3C working group has called this a bad practice, since it means that User Agents that aren't sophisticated enough to understand style sheets are left out in the cold.

Interestingly enough, although XML has not been as successful as its originators initially expected as a markup language for authoring documents on the Web, it has found significant success as the successor to the Comma Separated Value (CSV) File Format. XML's primary usage on the Web and even within internal networks is for exchanging machine-generated, structured data between applications. Speculatively, the largest usage of XML on the Web today is RSS, and it conforms to this pattern.

These thoughts were rekindled when I read Tim Bray's recent post Don’t Invent XML Languages, in which he argues that people should stop designing new XML formats. For designing new data formats for the Web, Tim Bray advocates the use of Microformats instead of XML.

The vision behind microformats is completely different from the XML vision. The original XML inventors started with the premise that HTML is not expressive enough to describe every possible document type that would be exchanged on the Web. Proponents of microformats argue that one can embed additional semantics over HTML and thus HTML is expressive enough to represent every possible document type that could be exchanged on the Web. I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. However, given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.
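To make the idea concrete, here is a rough sketch of what a consumer of such a page might do: walk the HTML and pull out the text of every element carrying a given class name. It reuses the illustrative class names from the paragraph above (rss:item, atom:title), which are not part of any real microformat, and the ClassTextExtractor and extract_by_class names are hypothetical helpers invented for this example. It uses only Python's standard html.parser module and assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of each element whose class attribute lists wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.results = []
        self._depth = 0    # nesting depth inside the current matching element
        self._buffer = []  # text fragments gathered for the current match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            # Already inside a match: just track nesting.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_by_class(html, wanted_class):
    """Return the text of every element marked with wanted_class, in document order."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    page = (
        '<div id="sidebar">Blogroll</div>'
        '<div class="rss:item"><h3 class="atom:title">First <em>post</em></h3></div>'
        '<div class="rss:item"><h3 class="atom:title">Second post</h3></div>'
    )
    print(extract_by_class(page, "atom:title"))
```

The sidebar and any other presentational markup are simply skipped, which is the appeal of the approach. The sketch does depend on start and end tags being balanced inside the marked element.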


Thursday, 12 January 2006 22:44:20 (GMT Standard Time, UTC+00:00)
One way of viewing microformats is as a reinvention of architectural forms. From that perspective defining a new microformat *is* defining a new vocabulary.

Architectural Forms are as old as the hills, so maybe it's not a case of moving forward, just that everything that was old is now new again. Or maybe we just needed to pare down SGML into smaller and smaller pieces until the value was uncovered.


Thursday, 12 January 2006 22:59:13 (GMT Standard Time, UTC+00:00)
Interesting. I never considered XML to be a replacement for CSV; I always considered it to be a replacement for ASN.1, complete with all the issues that ASN.1 has.

Friday, 13 January 2006 02:41:55 (GMT Standard Time, UTC+00:00)
OK, so they're saying: don't create new XML languages - instead, create new HTML languages. Because if you can't get people to stop separating presentation from data, hijack the presentation!

There's something to be said for freeing the data to sit inside web pages. It makes it much easier to deploy microformatted data, rather than waiting for XML Islands to really happen. But in that case why not deploy it inside script tags? Or any other particular hacked tag?

I'm not even a hardcore data vs. presentation guy and I still think the use of DIVs and classes to approximate structured storage seems pretty dumb. You're not solving the "Metcalfe's law" usability issue stated in Tim's article by creating a handful of strictly defined elements, because it's the complexity of the data that prevents re-use, not the hosting. As RSS shows, once there's a real use for a one-way transmission of structured data, people will use it even if it's not embedded within the HTML page.

They might say that it's cool that you can use CSS to mark up the same data islands that services are consuming. But that's another hack. CSS still isn't fully separated from the data (you need to specify divs in order, can't do certain things easily without hacks or tables, etc.) Will all the microformat readers need to be able to abstract away the extra TABLE, etc. garbage people have added just so their Resume, Contact, etc. look good?

Thumbs down.
Friday, 13 January 2006 02:43:25 (GMT Standard Time, UTC+00:00)
Sorry, an edit: I meant, "if you can't get people to separate presentation from data"
Friday, 13 January 2006 04:01:45 (GMT Standard Time, UTC+00:00)
I'm definitely not in the "don't invent an XML format, hack up a microformat" camp in general, but I've always wondered why the use cases for RSS and HTML are so different as to justify a separate format. Had the microformat meme been in circulation a few years ago, I don't think RSS would have taken off -- people would have added a bit of syntax sugar to the HTML of their blogs and news feeds to aid screen scrapers, religious wars would have been fought over which conventions were better, aggregators would support most flavors of these microformats, and so on. I think we'd be more or less where we are today ... minus all those confusing XML/RSS/Atom icons and redundant URLs, minus the full text feed debate, and minus a few terabytes of stuff that claims to be XML but isn't. I agree that microformat-assisted screen scraping is an egregious hack, but, ahem, so is an awful lot of stuff on the Web that validates the depressing truth that "worse is better."

But RSS is here to stay, and social pressure (and maybe the momentum of Atom standardization) is getting the average quality of the XML up to fairly reasonable levels, so that issue is probably not worth raising. I guess the more operative question is whether stuff that is being done with microformats such as address card entries, calendar entries, outlines, reviews, etc. really "should" be done with new XML markup (presumably in a namespace rather than a separate file) rather than microformats. I'm pretty hard pressed to come up with an argument for simply adding semantics to primarily human-readable HTML documents using XML rather than microformats. It's odd that this was what "SGML on the Web" was supposed to enable, but in the same 10 years since that initiative was launched, we've gone from using voice telephone infrastructure to handle internet traffic to using broadband internet infrastructure to handle voice traffic. That's kind of odd too, but it illustrates how the intent of inventors doesn't constrain the creativity of the next generation.
Friday, 13 January 2006 13:17:32 (GMT Standard Time, UTC+00:00)
I think microformats are interesting, but they fail utterly as a replacement for XML vocabularies for authoring. See http://norman.walsh.name/2005/09/05/microformats for a brief discussion of my concerns.

That said, they may be the stepping stone that the semantic web folks need: some way to get machine readable data onto the web. Assuming, of course, that the data is valid, which microformats discourage.
Friday, 13 January 2006 15:45:42 (GMT Standard Time, UTC+00:00)
Actually Mike I'd say the *primary* benefit of RSS (after its popularity and thus compatibility of course) is the extra URL, not anything about XML in particular. I don't think that authors would ever want a periodic updater - some as frequent as 5 min - to download the entire content of their HTML website just to scrape the interesting data out of it. At the very least microformat RSS would have required a dynamic URL variable to turn it on and off, provide full vs. partial vs. different preferred formats, etc.

Once you're turning on embedded RSS you might as well turn off the rest of the bytes. Even easier, just put it on its own page.

Now that it's on its own page you might as well use whatever format you want - they chose XML but I guess they could have chosen an HTML microformat - after all, XML w/ styling is just finally making human-readable feeds, something that I would have wanted right up front had I designed 'em.

Either way the popular bloggers who contributed to RSS success would probably not have been happy adding a few screen-scrape annotations to their existing pages.

Same discussion applies to the current microformats idea: I can understand why they want to deliver in HTML but "separateness" makes data more usable. Is a "contact manager" -- anything as simple as an XSL transform to a full-fledged READ/WRITE tool -- really going to replace all the TABLE,TD,P nodes you put in to make your contacts pretty? Or on the other hand, are you really going to just dump your contacts into a page without anything but CSS styling? I predict either problem is too much inertia to overcome, so there will either be no published content, or no good tools, and each case causes the other, too.

They need a stronger separation of data, even if it exists in the same page. I don't particularly care about the format once it's separated.

(Note that once CSS becomes better supported and HTML "islands" really can be separated they have another shot at it but they'll still need to use some form of namespaces to prevent class conflicts between data islands and their containers)
Saturday, 14 January 2006 06:35:43 (GMT Standard Time, UTC+00:00)
XML enables your pieces of data to be first-class elements. Microformats are a sort of tunneling through HTML, in the same way SOAP services are sort of tunneled through HTTP by using it as just a transport protocol whereas a REST style uses HTTP methods as a first-class interface. Or how SQL-in-strings are tunneled in some other programming language as opposed to .NET 3.0's first-class query constructs. Or Regexes-in-strings vs. Perl or JavaScript's regex literals.
Monday, 16 January 2006 22:46:50 (GMT Standard Time, UTC+00:00)
Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.


I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatically steal content from full-content feeds and republish it (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.
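The title-attribute technique for dates mentioned in point 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the class name "published", the use of an abbr element, and the TitleDateExtractor and extract_dates names are all hypothetical choices for this example, and the title is assumed to hold an ISO 8601 timestamp that Python's datetime.fromisoformat can read.

```python
from datetime import datetime
from html.parser import HTMLParser


class TitleDateExtractor(HTMLParser):
    """Read machine-readable dates from the title attribute of marked elements."""

    def __init__(self, date_class="published"):
        super().__init__()
        self.date_class = date_class  # assumed class name, not taken from any spec here
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        title = attr_map.get("title")
        if self.date_class in classes and title:
            try:
                self.dates.append(datetime.fromisoformat(title))
            except ValueError:
                # The title was not an ISO 8601 timestamp; skip it.
                pass


def extract_dates(html, date_class="published"):
    """Return every parseable date found on elements marked with date_class."""
    parser = TitleDateExtractor(date_class)
    parser.feed(html)
    parser.close()
    return parser.dates


if __name__ == "__main__":
    snippet = (
        '<p>Posted <abbr class="published" '
        'title="2006-01-16T22:46:50+00:00">Monday, 16 January 2006</abbr></p>'
    )
    for found in extract_dates(snippet):
        print(found.isoformat())
```

The human reader sees whatever the author typed, in whatever language or style, while the parser reads only the attribute.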

All in all, a mixed bag, but definitely a debate worth having (again).