Microformats vs. XML: Pros and Cons - Dare Obasanjo's weblog

January 18, 2006

@ 12:39 PM

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats over XML formats. The real-world test case I use when thinking about choosing microformats over XML is whether, instead of having an HTML web page for my blog plus an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. To me this seems like a gross hack, but I've seen lots of people comment that it seems like a great idea to them. Given that I hadn't encountered universal disdain for this idea, I decided to explore further and look for technical arguments for and against both approaches.
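
To make the test case concrete, here is a rough sketch of what a consumer of such a page would have to do: walk the HTML and pull out the text of elements carrying a given class token. The sample page, the ClassTextExtractor class and the extract_titles helper are all made up for illustration; only the two class names come from the example above.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class attribute contains a token."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.results = []
        self._depth = 0    # nesting depth inside a matching element; 0 means outside
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        # The class attribute is a space-separated list of tokens.
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical page: one post plus the kind of page furniture a feed would omit.
PAGE = """
<html><body>
<div id="sidebar">Blogroll and other page furniture</div>
<div class="rss:item">
  <h3 class="atom:title">Microformats vs. XML</h3>
  <p>Post body goes here.</p>
</div>
</body></html>
"""


def extract_titles(html):
    parser = ClassTextExtractor("atom:title")
    parser.feed(html)
    parser.close()
    return parser.results


titles = extract_titles(PAGE)
print(titles)
```

Note that the consumer still has to download and skip the sidebar and every other part of the page that is not an item.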

I found quite a few discussions of how and why microformats came about in articles such as The Microformats Primer in Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However, I hadn't seen many in-depth technical arguments for why they were better than XML formats until recently.

In a comment responding to my post Microformats vs. XML: Was the XML Vision Wrong?, Mark Pilgrim wrote:

Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.

http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication

I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.
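
The date trick in point 4 can be sketched in a few lines. This is only an illustration under assumptions: the "published" class name, the sample markup, and the TitleDateParser and parse_dates names are hypothetical, and the title value is assumed to be an ISO 8601 timestamp.

```python
from datetime import datetime
from html.parser import HTMLParser


class TitleDateParser(HTMLParser):
    """Read machine-readable dates from the title attribute of <abbr> elements."""

    def __init__(self, date_class):
        super().__init__()
        self.date_class = date_class
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "abbr" and self.date_class in classes and attr_map.get("title"):
            # Assumption: the title holds an ISO 8601 timestamp.
            self.dates.append(datetime.fromisoformat(attr_map["title"]))


# The browser shows "January 18, 2006"; a parser reads the title instead.
SNIPPET = (
    '<p>Posted <abbr class="published" title="2006-01-18T12:39:00">'
    "January 18, 2006</abbr></p>"
)


def parse_dates(html, date_class="published"):
    parser = TitleDateParser(date_class)
    parser.feed(html)
    parser.close()
    return parser.dates


dates = parse_dates(SNIPPET)
print(dates[0].isoformat())
```

The human-readable text can then be in any language or style (even Latin) without affecting what the parser sees.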

I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is wasted bandwidth since web pages tend to contain more stuff than feeds and change more often.

Norm Walsh raises a few other good points on the trade-offs being made when choosing microformats over XML in his post Supporting Microformats, where he writes:

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain and the fallback is natural and obvious, new code can be written to use this “extra” information without any change being required to existing applications, they just ignore it.

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using <div> and <span> to mark up the individual appointments, so I can't add hCalendar to them.

I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.

They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So the general case, where you want to say that the content of one of these special elements is “an <abbr> with class="dtstart" interleaved with an <abbr> with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.

Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.
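
To give a feel for what rule-based validation means here, the following is a minimal sketch in plain Python standing in for Schematron, not Schematron itself. It assumes well-formed XHTML input and an event container marked with the class "vevent"; the validate_events function, the sample markup and the choice of required classes are made up for illustration. The rule is the kind that Norm says a grammar has trouble expressing: the required elements are identified by class value and may appear in any order among other content.

```python
import xml.etree.ElementTree as ET


def has_class(element, token):
    """True if the element's space-separated class attribute contains token."""
    return token in element.get("class", "").split()


def validate_events(xhtml, container="vevent", required=("dtstart", "dtend")):
    """Return a list of error strings; an empty list means every rule passed."""
    root = ET.fromstring(xhtml)  # assumes the input is well-formed XML
    errors = []
    events = [el for el in root.iter() if has_class(el, container)]
    for index, event in enumerate(events, start=1):
        for token in required:
            # Order does not matter: any descendant with the class satisfies the rule.
            found = any(has_class(d, token) for d in event.iter() if d is not event)
            if not found:
                errors.append(f"event {index}: missing element with class '{token}'")
    return errors


# dtend appears before dtstart, with unrelated markup in between.
GOOD = """<div class="vevent">
  <abbr class="dtend" title="2006-01-19">Jan 19</abbr>
  <span>until</span>
  <abbr class="dtstart" title="2006-01-18">Jan 18</abbr>
</div>"""

# The container carries a second, unrelated class token and lacks a dtend.
BAD = """<div class="vevent trip">
  <abbr class="dtstart" title="2006-01-18">Jan 18</abbr>
</div>"""

good_errors = validate_events(GOOD)
bad_errors = validate_events(BAD)
print(good_errors)
print(bad_errors)
```

A check like this is easy to write but, as Norm points out, it is an extra step that sits outside the grammar-based validation most of us run by habit.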

The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun.

I'll continue to pay attention to this discussion but for now microformats will remain in the "gross hack" bucket for me.

Categories: XML


Wednesday, 18 January 2006 15:25:09 (GMT Standard Time, UTC+00:00)

re: The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun.

This is mixing apples and oranges. Norm is saying that one can't easily validate microformats with technologies like DTDs, RELAX NG, etc. The feedvalidator doesn't rely on any of these approaches.

Sam Ruby

Wednesday, 18 January 2006 16:52:36 (GMT Standard Time, UTC+00:00)

Not that a validator is useless, but I hardly think lack of one would stop anyone from using microformats. On the other hand, can you imagine the code necessary to make a tool for editing your data stored in microformats? If you dare to mix any presentation tags in with your data then it will have to preserve that - leading to all the same complaints that Visual Studio 2003 users have about "Design Mode". And if you don't mix presentation, you'll have to do presentation entirely with CSS, which is not "possible" with CSS2 (w.r.t. what people want to do - you can get 80/20 but not 100% and it's certainly not easy for the average user)

Another real problem is Norm's #2 point about consuming the "class" tag although I suppose most browsers will support multiple class tags by the time microformats have any traction.

Steve

Wednesday, 18 January 2006 18:02:58 (GMT Standard Time, UTC+00:00)

Norm is wrong about the class attribute, and admits it in the comments.

His point about architectural forms is well-taken, though. Given that one of the tenets of the microformats religion is building on successful approaches, I'm surprised no one has explained why this variety of architectural forms will be more successful than they have in the past.

Also, microformats seem to rely on detailed crawler coverage of HTML pages. How in the heck are these things supposed to be discovered? Feeds do stick out when encountered on the Web, for better or worse. If microformats need browsers to monitor the parse tree for their presence, we can bet it will be a while before they become useful in end-user programs like IE. To some extent, syndication took off because it routed around the over-engineering of HTML and Web browsers like Netscape 4 and IE5.

Robert Sayre

Wednesday, 18 January 2006 18:07:43 (GMT Standard Time, UTC+00:00)

Sam,
You're right about the apples-to-oranges comparison. However, Norm's point about the difficulty of using an off-the-shelf schema language or validation tool to validate microformats is still valid.

Dare Obasanjo

Wednesday, 18 January 2006 19:12:12 (GMT Standard Time, UTC+00:00)

From my understanding, I don't think the microformat community is really promoting things like embedding RSS into your XHTML web page so you can keep just one web page (unless it just happens to work out better that way).

I think they are mostly recommending using the XHTML language instead of inventing your own. So instead of embedding RSS tags into your XHTML Web page, still create another file for your feed (that just contains the items), but use XHTML as the format rather than another language like RSS or Atom.

The one benefit would be that users could view your RSS/XHTML feed in a browser without any special transformations (i.e., a stylesheet).

Joe

Wednesday, 18 January 2006 22:02:52 (GMT Standard Time, UTC+00:00)

For the simpler compound microformats (i.e. not hCard) it should be possible to use an XMDP profile to at least perform cursory validation. XMDP-based validators get mentioned every once in a while on the mf-discuss mailing list, but I don't think there's anything concrete actually out there.

The microformats group would be doing themselves (and me) a favour if they'd release one of these (or even something Schematron-based).

Phil Wilson

Monday, 30 January 2006 20:48:57 (GMT Standard Time, UTC+00:00)

While points raised on both sides are reasonable, my view is that "if there is a good reason to carry a ship up a mountain, people will do it." So far, I don't see one beyond some vague desires and promises for Semantic Web. Someone needs to bury a pot of gold at the end of the microformat rainbow.

Don Park

Tuesday, 31 January 2006 04:59:02 (GMT Standard Time, UTC+00:00)

What Joe said -- the concept of making your homepage machine-readable and throwing away your feed isn't being seriously considered by the microformats people. It's just one of those "hey, cool, look what this could enable if we wanted to do something totally weird" things.

Most microformats coexist nicely with feeds. Look at what we're doing with the Structured Blogging plugins (structuredblogging.org), for example. If you publish a book review, it is marked up in the blog's HTML pages using hReview class attributes, and the exact same HTML gets published in the feeds.

Phillip Pearson
