I've been reading some of the hype around microformats in certain blogs with some amusement. I had been ignoring microformats, but now I see that some of their proponents have started claiming that using XML on the Web is bad and that HTML is the only markup language we'll ever need.

In her post Why generic XML on the Web is a bad idea Anne van Kesteren writes

Of course, using XML or even RDF serialized as XML you can describe your content much better and in far more detail, but there is no search engine out there that will understand you. For RDF there is a chance one day they will. Generic XML on the other hand will always fail to work. (Semantics will not be extracted.)

An example that shows the difference more clearly:

<em>Look at me when I talk to you!</em>

… and:

<angry>Look at me when I talk to you!</angry>

The latter element describes the content probably more accurately, but on ‘the web’ it means close to nothing. Because on the web it’s not humans who come by and try to parse the text, they already know how to read something correctly. No, software comes along and tries to make something meaningful of the above. As the latter is in a namespace no software will know and the latter is also not specified somewhere in a specification it will be ignored. The former however has been here since the beginning of HTML — even before it’s often wrongly considered presentational equivalent I — and will be recognized by software.
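To make the quoted argument concrete, here is a minimal Python sketch of a consumer that only understands a fixed vocabulary. The `KNOWN_SEMANTICS` table and the `describe` helper are hypothetical names made up for illustration; this is not how any particular search engine is known to work.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of the elements this consumer was built to understand.
KNOWN_SEMANTICS = {"em": "emphasis", "strong": "strong emphasis"}

def describe(fragment: str) -> str:
    """Return what a vocabulary-bound consumer can say about one element."""
    element = ET.fromstring(fragment)
    meaning = KNOWN_SEMANTICS.get(element.tag)
    if meaning is None:
        # Unknown tag: the text is still readable, but its meaning is lost.
        return f"unknown element <{element.tag}>: {element.text!r}"
    return f"{meaning}: {element.text!r}"

print(describe("<em>Look at me when I talk to you!</em>"))
print(describe("<angry>Look at me when I talk to you!</angry>"))
```

The parser handles both fragments equally well; the difference is only in whether the consumer has been told what the tag means.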

This post in itself isn't that bad; if anything, it is just somewhat misguided. However, Tantek Çelik followed it up with his post Avoiding plain XML and presentational markup, which boggled my mind. Tantek wrote

The marketing message of XML has been for people to develop their own tags to express whatever they wanted, rather than being stuck with the limited predefined tag set in HTML. This approach has often been labeled "plain XML" or "generic XML" or "SGML, but easier, better, and designed just for the Web".

The problem with this approach is that while having the freedom to make up all your own tags and attributes sounds like a huge improvement over the (mostly perceived) limits of HTML, making up your own XML has numerous problems, both for the author, and for users / readers, especially when sharing with others (e.g. anything you publish on the Web) is important.

This post by no means contains a complete set of arguments against plain/generic XML and presentational markup, nor are the arguments presented as definitive proofs. Mostly I wanted to share a bunch of reinforcing resources in one place. Readers are encouraged to improve upon the arguments made here.

The original impetus for creating XML was to enable SGML on the Web. People had become frustrated with the limited tag set in HTML, and the solution was to create a language that enabled content creators to create their own tags yet still have them readable in browsers via stylesheet technologies (e.g. CSS). Over time, XML has failed to take off as a generic document format used by content authors for creating human-readable documents on the Web, but it has become popular as a data format used for machine-to-machine communications on the Web (RSS, XML-RPC, SOAP, etc.).

Thus any argument against XML usage on the Web today is really an argument against using XML as a data format, since it isn't really used as a document format except for XHTML [and even that is only by markup geeks like Tantek & Anne].

Anyway let's look at some of Tantek's arguments against using XML on the Web...

Tower of Babel Problem

If everyone invents their own tags and attributes, pretty soon you get people calling the same thing by different names and different things by the same name. While avoiding both of those occurrences completely is very difficult (many of the microformats principles are designed to help avoid those problems), downright encouraging authors to make up their own tags and attributes makes it much worse and all you end up with are a bunch of documents that give you the illusion of self-description.

Didn't the XML world solve this with XML namespaces like six or seven years ago?
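As a rough illustration of the point, the Python sketch below parses a document in which two vocabularies both define a `title` element. The namespace URIs, prefixes and sample values are placeholders invented for this example, not real schemas.

```python
import xml.etree.ElementTree as ET

# Two made-up vocabularies that both define a <title> element.
# The namespace URIs are placeholders, not published schemas.
DOC = """
<catalog xmlns:book="http://example.org/ns/book"
         xmlns:job="http://example.org/ns/job">
  <book:title>A Sample Book</book:title>
  <job:title>Program Manager</job:title>
</catalog>
"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix to {namespace-uri}localname, so the two
# <title> elements have different qualified names and cannot be confused.
book_title = root.find("{http://example.org/ns/book}title").text
job_title = root.find("{http://example.org/ns/job}title").text

print(book_title)
print(job_title)
```

Namespaces deal with different things sharing the same name; the other half of the problem, the same thing under different names, still comes down to people agreeing on a shared vocabulary.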

Temptation of Presentational Markup

What happens all too often when authors or developers make up their own tags is that they choose tags that are tightly tied to a specific presentation rather than abstracting them with semantics. Quite similar to the phenomenon of authors picking presentational class names.

As a casual user of HTML, I personally haven't seen a good explanation of why <strong> is better than <b>, so arguments whose entire basis is "presentational markup is evil" don't carry much weight in my book. If I come up with a custom markup format and it has a <bold> element, is that really so evil? I'm pretty sure that the XML formats used by OpenOffice or Microsoft Office contain markup that is presentational in nature, whether it is setting font sizes, text colors or paragraph alignment. Are they evil, or does the fact that they aren't intended for the Web give them a pass?

Preferring Semantic Richness

Sometimes something is a bad idea not just in absolute terms, but also relative to other approaches and solutions.

A while ago I wrote about a semantic richness spectrum on the www-style mailing list which went into a bit more detail. Håkon Wium Lie wrote a paper that both predated my rough summary by a couple of years, and provided a much more thorough analysis.

Languages with well-known semantics are preferred to proprietary/made-up XML. This is for many reasons, including accessibility, cross-device support, and future user agent support.

This seems to be arguing that instead of cooking up your own custom format you should pick an established format with the semantics you want if one exists. This is regularly practiced in the XML world especially when it comes to the Web so I don't see how this is an argument against using XML.

--

Seriously, I feel like I am in some bizarre alternate universe if having aggregators subscribe to HTML web pages is being advocated as being a better idea than using specialized XML formats like RSS & Atom.

That's it...I'm going back to my vacation. The world has gone too loopy for me.


 

Thursday, 28 July 2005 10:46:33 (GMT Daylight Time, UTC+01:00)
"In her post Why generic XML on the Web is a bad idea Anne van Kesteren writes"

Anne is for some reason a name for guys in the Netherlands. Replace 'her' with 'his'.
Thursday, 28 July 2005 14:37:20 (GMT Daylight Time, UTC+01:00)
"Seriously, I feel like I am in some bizarre alternate universe if having aggregators subscribe to HTML web pages is being advocated as being a better idea than using specialized XML formats like RSS & Atom."

Not a bad idea! :)

The primary "value-add" of most syndication formats is that they offer a channel with a list of links, each with title and description (often containing escaped HTML). There isn't much there that can't be done with HTML -- *if* people could agree on uniform semantics.

On that issue RSS has a first-mover advantage, because it's an agreed-upon format... Or is it? 1.0? 0.9x? 2.0? Atom? :)
Jeoff Wilks
Thursday, 28 July 2005 16:31:17 (GMT Daylight Time, UTC+01:00)
Every time something cool gets invented, invariably some people try to use it to fly to the moon, make pancakes, and clean their room. While there's nothing wrong with microformat payloads in RSS/Atom feeds, it's stupid to talk of using XHTML as a general-purpose meta language for syndication. XHTML is itself a microformat whose purpose is to render a view of information. Why can't people simply select the right tool for each job?
Thursday, 28 July 2005 17:15:51 (GMT Daylight Time, UTC+01:00)
I don't understand why HTML microformats rather than RSS is such a bad idea. I've long wondered about what Jeoff Wilks said: "There isn't much [in RSS] that can't be done with HTML -- *if* people could agree on uniform semantics." What's the deep reason why the world needs both HTML pages *and* full-text RSS feeds for news items, blog entries, etc.?

"the right tool for each job" is a reasonable principle but I still wonder whether RSS wasn't the right tool for an age where bandwidth was so precious that full-text feeds were impractical.

Then again, my perspective may be biased because I use server-side aggregators (especially Bloglines). The difference in user experience between reading a feed in an aggregator and in a browser is lost on me. I would love a feature in either a client-side or server-side aggregator that let me subscribe to straight HTML, relying on heuristics in the aggregator (or a widely adopted microformat convention) to figure out which HTML markup patterns map onto which feed semantics.
Thursday, 28 July 2005 18:03:25 (GMT Daylight Time, UTC+01:00)
The way Atom can have XHTML does make you wonder if HTML could have been used in the first place for RSS and Atom. I like how you (Dare) take the time to respond to issues even if they are from a bizarre alternate universe; you seem very open-minded.
Thursday, 28 July 2005 18:45:18 (GMT Daylight Time, UTC+01:00)
Tantek's idea of web pages doubling as APIs is at least interesting. It's really just an evolution of screen scraping. It's kind of funny that screen scraping remains the best and sometimes only way to get API-like functionality from most service providers. If they just spit out XHTML with some basic classing and didn't change it all the time, we'd see a lot of cool stuff.

BTW, 1) the default theme on this blog is lousy and 2) crappy .net errors when I try to switch it.
pwb
Thursday, 28 July 2005 21:01:52 (GMT Daylight Time, UTC+01:00)
+1 for lousy theme, would prefer simpler without resizing issues (article gets knocked below inset all the time). And the location of comments on separate link is a real pain -- I like Larry Osterman's blog system MUCH better.
Thursday, 28 July 2005 21:51:33 (GMT Daylight Time, UTC+01:00)
Mike,
You should separate the end user need from technical considerations. I agree that I should be able to subscribe to any page on the Web I find interesting. The question is whether it is more correct or even more likely that we can get there by evangelizing RSS or evangelizing people use special class names and special span/div tags in their HTML.

pwb,
I agree that for a number of online services the best way to get API-like functionality from them is screen scraping. However, the reason a lot of services don't have APIs has more to do with business reasons than technical considerations, so I don't see microformats as being a panacea there.
Thursday, 28 July 2005 23:02:29 (GMT Daylight Time, UTC+01:00)
XHTML is semantically rich in conveying *presentation* information; to use it as some kind of loosely-typed general-purpose data payload device by piggy-backing/misusing the namespace is verging on the charlatan.

In fact, loose typing in general isn't that robust an architectural decision (especially for data interchange) and can lead to blindness, just look at Javascript.
John Gunning
Friday, 29 July 2005 16:54:04 (GMT Daylight Time, UTC+01:00)
Precisely my point. An XHTML microformat (instead of RSS) may well have been all you needed to subscribe to a web page's content in a browser-based environment, but it's awfully short-sighted to evangelize this as a general-purpose syndication format.

First, conveying via HTML the complicated semantic meaning that you might find in a feed used in, say, a CMS application or a server or network monitoring application could easily lead to unnecessary verbosity and semantic abuse. For me, the coolness of the Atom and RSS syndication formats goes way beyond the browser experience.
Friday, 29 July 2005 19:17:01 (GMT Daylight Time, UTC+01:00)
"just look at Javascript."

Hah, hah, good one. Just look at JavaScript, the most widely used programming language!!
pwb
Friday, 29 July 2005 20:38:57 (GMT Daylight Time, UTC+01:00)
Dare,
Sorry for the misunderstanding, but I was certainly not suggesting that microformats *replace* Atom and RSS. What I am suggesting is that for extensions to Atom and RSS, microformats should be considered, and preferred if practical, over namespaced elements in a feed. For example, if you wanted to syndicate calendar information, syndicated hCalendar info would be more accessible than putting that calendar info into special namespaced elements in the entry. The former is still human readable; the latter requires a bit of ocean boiling before it is visible in every aggregator.
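As a rough sketch of what consuming syndicated hCalendar might involve, the Python below pulls the `summary`, `dtstart` and `location` properties out of an HTML fragment by class name. The sample event is invented, the `HCalendarSketch` class is a hypothetical name, and the code handles only this simplest shape rather than the full hCalendar parsing rules.

```python
from html.parser import HTMLParser

# Invented sample event marked up with hCalendar class names.
SAMPLE = """
<div class="vevent">
  <span class="summary">XML Summit</span> on
  <abbr class="dtstart" title="2005-08-15">August 15</abbr> at
  <span class="location">Redmond</span>
</div>
"""

WANTED = {"summary", "dtstart", "location"}

class HCalendarSketch(HTMLParser):
    """Collect a few hCalendar properties from one event (simplified)."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self._current = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        hit = set((attrs.get("class") or "").split()) & WANTED
        if not hit:
            return
        prop = hit.pop()
        if tag == "abbr" and attrs.get("title"):
            # For <abbr>, the machine-readable value is in the title attribute.
            self.event[prop] = attrs["title"]
        else:
            self._current = prop
            self.event[prop] = ""

    def handle_data(self, data):
        if self._current:
            self.event[self._current] += data

    def handle_endtag(self, tag):
        self._current = None

parser = HCalendarSketch()
parser.feed(SAMPLE)
print(parser.event)
```

The same fragment renders as an ordinary sentence in any aggregator that displays HTML, which is the accessibility argument being made here.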
Friday, 29 July 2005 21:15:21 (GMT Daylight Time, UTC+01:00)
Joe,
As an aggregator author I am confused as to how you came to that conclusion. I'd rather process extensions to an RSS item or Atom entry as XML elements that were children of the item/atom:entry element than have to grovel around in the [X]HTML.

The only benefit I can see is that you reduce the duplication of content by having some event information in the HTML and having the same information in the feed. Somewhat like how some podcast feeds have both an enclosure element and also place a link to the podcast in the HTML.

I don't see that as a good enough reason to make the leap to advocating that all RSS extensions might as well just be embedded in the HTML.
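For comparison, here is a minimal Python sketch of reading an extension as XML children of the item element. The `ev` prefix, its namespace URI, the element names and the sample data are all placeholders for a hypothetical calendar extension, not a published RSS module.

```python
import xml.etree.ElementTree as ET

# Invented feed; "ev" and its namespace URI stand in for a hypothetical
# calendar extension.
FEED = """
<rss version="2.0" xmlns:ev="http://example.org/ns/event">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>XML Summit</title>
      <link>http://example.org/summit</link>
      <ev:startdate>2005-08-15</ev:startdate>
      <ev:location>Redmond</ev:location>
    </item>
  </channel>
</rss>
"""

NS = {"ev": "http://example.org/ns/event"}

def read_items(feed_xml):
    """Return one dict per <item>, including the extension elements."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "start": item.findtext("ev:startdate", namespaces=NS),
            "location": item.findtext("ev:location", namespaces=NS),
        })
    return items

print(read_items(FEED))
```

Each extension value is a direct lookup on the item, with no need to search through the entry's HTML content for class names.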
Friday, 29 July 2005 22:04:37 (GMT Daylight Time, UTC+01:00)
Dare,
I came to that conclusion by thinking not about the impact on me, the aggregator author, but about the impact on the users of aggregators (customers).
Saturday, 30 July 2005 00:54:01 (GMT Daylight Time, UTC+01:00)
dare, the reason for microformats is that they are easy on authors, and also to make your blog easier to spider by technorati.
glorp
Sunday, 31 July 2005 13:18:25 (GMT Daylight Time, UTC+01:00)
You kids need to do two things:

1) Take a markup language history class that goes back more than one year
2) Learn more about how to use something rather than jump to false conclusions based on ignorant assumptions due to a lack of #1 above.
Sunday, 31 July 2005 13:23:53 (GMT Daylight Time, UTC+01:00)
Glorp, "Easy" is a relative term. Easy how? Easy because you don't have to learn anything new? That just means that you'll find it harder to express anything that wasn't included in the original spec.

I'm with Dare, I don't know what just happened to the world. We suddenly swung the pendulum from the ivory tower elitist specification club all the way to the amateur ignorance hour. I think we need to get that pendulum back to the center real soon before we do something stupid.
Monday, 01 August 2005 03:46:06 (GMT Daylight Time, UTC+01:00)
"We suddenly swung the pendulum from the ivory tower elitist specification club all the way to the amateur ignorance hour. I think we need to get that pendulum back to the center real soon before we do something stupid."

I think it's much too late for that, but with luck the microformats movement will make it possible for us to successfully recover and re-center the pendulum.
Eric
Wednesday, 03 August 2005 14:54:12 (GMT Daylight Time, UTC+01:00)
Relax, people. I wasn't saying HTML can do everything that arbitrary XML does. All I'm saying is, it happens to be pretty good at providing links that have titles and descriptions, which is the 20% of RSS/Atom that provides 80% of the benefit.

Now, as for semantics:

"...evangelizing people use special class names and special span/div tags in their HTML."

That response assumes div/span/class semantics, and that's what I mean when I say people would have to agree on the semantics. An alternate HTML representation that uses already-standardized HTML semantics might be:

* /html/head/title is the feed title
* all /html/body//a are feed items
* for each a,
  * the @href is the link [1]
  * the @title is the item title [2]
  * the text() is the description [3]
* optional meta tags [4] to identify the page as an HTML feed

Too simple? Perhaps. Yet it manages to accomplish the most useful functionality in RSS/Atom (a standardized way of publishing new links with some basic accompanying metadata). The rest is just geek frosting. And HTML has some geek frosting of its own; some examples can be found on the page linked to below.

[1] @href
See the A element definition at http://www.w3.org/TR/REC-html40/struct/links.html#h-12.2

[2] @title
See title attribute at http://www.w3.org/TR/REC-html40/struct/links.html#h-12.1.4

[3] text()
If you wanted to fill the A element text with stuff that isn't allowed by the HTML spec, then hey, you could always borrow an RSS hack and put escaped HTML inside. :)

[4] meta tags
Optionally, you could have meta tags to tip off the user-agent that this is in fact a feed page.
<meta name="feed-info" content="html feed">
With this meta tag in place, the user-agent knows that this is a feed, and it can read "a" elements to extract links, titles, descriptions, etc.
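The mapping proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HtmlFeedSketch` class and the sample page are made up, and the code ignores the escaped-HTML hack from note [3].

```python
from html.parser import HTMLParser

class HtmlFeedSketch(HTMLParser):
    """Read <title> as the feed title and each body <a> as a feed item."""

    def __init__(self):
        super().__init__()
        self.feed_title = ""
        self.is_feed = False   # set when the feed-info meta tag is seen
        self.items = []
        self._in_title = False
        self._in_body = False
        self._item = None      # the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and not self._in_body:
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "feed-info":
            self.is_feed = True
        elif tag == "body":
            self._in_body = True
        elif tag == "a" and self._in_body and "href" in attrs:
            self._item = {"link": attrs["href"],
                          "title": attrs.get("title", ""),
                          "description": ""}

    def handle_data(self, data):
        if self._in_title:
            self.feed_title += data
        elif self._item is not None:
            self._item["description"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._item is not None:
            self.items.append(self._item)
            self._item = None

# Invented sample page following the proposed conventions.
PAGE = """<html><head><title>Example links</title>
<meta name="feed-info" content="html feed"></head>
<body><a href="http://example.org/1" title="First post">A short description.</a></body></html>"""

parser = HtmlFeedSketch()
parser.feed(PAGE)
print(parser.feed_title, parser.is_feed, parser.items)
```

Note that every link in the body becomes an item, including navigation links, which is where agreed-upon semantics would have to do the real work.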