I've been reading some of the hype around microformats in certain blogs with some amusement. I had been ignoring microformats, but now I see that some of their proponents have started claiming that using XML on the Web is bad and that HTML is the only markup language we'll ever need.

In his post Why generic XML on the Web is a bad idea, Anne van Kesteren writes

Of course, using XML or even RDF serialized as XML you can describe your content much better and in far more detail, but there is no search engine out there that will understand you. For RDF there is a chance one day they will. Generic XML on the other hand will always fail to work. (Semantics will not be extracted.)

An example that shows the difference more clearly:

<em>Look at me when I talk to you!</em>

… and:

<angry>Look at me when I talk to you!</angry>

The latter element describes the content probably more accurately, but on ‘the web’ it means close to nothing. Because on the web it’s not humans who come by and try to parse the text, they already know how to read something correctly. No, software comes along and tries to make something meaningful of the above. As the latter is in a namespace no software will know and the latter is also not specified somewhere in a specification it will be ignored. The former however has been here since the beginning of HTML — even before it’s often wrongly considered presentational equivalent I — and will be recognized by software.

This post in itself isn't that bad; if anything, it is just somewhat misguided. However, Tantek Celik followed it up with his post Avoiding plain XML and presentational markup, which boggled my mind. Tantek wrote

The marketing message of XML has been for people to develop their own tags to express whatever they wanted, rather than being stuck with the limited predefined tag set in HTML. This approach has often been labeled "plain XML" or "generic XML" or "SGML, but easier, better, and designed just for the Web".

The problem with this approach is that while having the freedom to make up all your own tags and attributes sounds like a huge improvement over the (mostly perceived) limits of HTML, making up your own XML has numerous problems, both for the author, and for users / readers, especially when sharing with others (e.g. anything you publish on the Web) is important.

This post by no means contains a complete set of arguments against plain/generic XML and presentational markup, nor are the arguments presented as definitive proofs. Mostly I wanted to share a bunch of reinforcing resources in one place. Readers are encouraged to improve upon the arguments made here.

The original impetus for creating XML was to enable SGML on the Web. People had become frustrated with the limited tag set in HTML, and the solution was to create a language that let content creators define their own tags yet still have them readable in browsers via stylesheet technologies (e.g. CSS). Over time, XML has failed to take off as a generic document format used by content authors for creating human-readable documents on the Web, but it has become popular as a data format for machine-to-machine communication on the Web (RSS, XML-RPC, SOAP, etc.).

Thus any argument against XML usage on the Web today is really an argument against using XML as a data format, since it isn't really used as a document format except for XHTML [and even that is only by markup geeks like Tantek & Anne].

Anyway let's look at some of Tantek's arguments against using XML on the Web...

Tower of Babel Problem

If everyone invents their own tags and attributes, pretty soon you get people calling the same thing by different names and different things by the same name. While avoid both of those occurences completely is very difficult (many of the microformats principles are design to help avoid those problems), downright encouraging authors to make up their own tags and attributes makes it much worse and all you end up with are a bunch of documents that give you the illusion of self-description.

Didn't the XML world solve this with XML namespaces like six or seven years ago?
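For anyone who hasn't run into namespaces, here is a rough sketch in Python of what I mean. The two namespace URIs and the element names are made up for illustration; the point is only that two vocabularies can both define a title element and software can still tell them apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this example.
BOOK_NS = "http://example.org/ns/book"
PERSON_NS = "http://example.org/ns/person"

# Both vocabularies use the local name "title" for different things.
doc = f"""
<record xmlns:b="{BOOK_NS}" xmlns:p="{PERSON_NS}">
  <b:title>Learning XML</b:title>
  <p:title>Dr.</p:title>
</record>
"""

root = ET.fromstring(doc)

# ElementTree addresses elements as {namespace-uri}localname, so the
# two "title" elements never collide even though the names match.
book_title = root.find(f"{{{BOOK_NS}}}title").text
person_title = root.find(f"{{{PERSON_NS}}}title").text

print(book_title)
print(person_title)
```

Namespaces only cover the "different things by the same name" half of the complaint. Two people calling the same thing by different names still need to agree on a shared vocabulary, which is a social problem whatever the markup language.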

Temptation of Presentational Markup

What happens all too often when authors or developers make up their own tags is that they choose tags that are tightly tied to a specific presentation rather than abstracting them with semantics. Quite similar to the phenomenon of authors picking presentational class names.

As a casual user of HTML, I personally haven't seen a good explanation of why <strong> is better than <b>, so arguments whose entire basis is "presentational markup is evil" don't carry much weight in my book. If I come up with a custom markup format and it has a <bold> element, is that really so evil? I'm pretty sure that the XML formats used by OpenOffice or Microsoft Office contain markup that is presentational in nature, whether it is setting font sizes, text colors or paragraph alignment. Are they evil, or does the fact that they aren't intended for the Web give them a pass?

Preferring Semantic Richness

Sometimes something is a bad idea not just in absolute terms, but also relative to other approaches and solutions.

A while ago I wrote about a semantic richness spectrum on the www-style mailing list which went into a bit more detail. Håkon Wium Lie wrote a paper that both predated my rough summary by a couple of years, and provided a much more thorough analysis.

Languages with well-known semantics are preferred to proprietary/made-up XML. This is for many reasons, including accessibility, cross-device support, and future user agent support.

This seems to be arguing that instead of cooking up your own custom format, you should pick an established format with the semantics you want if one exists. This is regularly practiced in the XML world, especially when it comes to the Web, so I don't see how this is an argument against using XML.

--

Seriously, I feel like I am in some bizarre alternate universe when having aggregators subscribe to HTML web pages is advocated as a better idea than using specialized XML formats like RSS & Atom.
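To make the point concrete, here is a minimal sketch, again in Python, of how little an aggregator has to do to pull entries out of an RSS 2.0 feed. The feed content and the read_items helper are invented for this example, and a real aggregator would also have to deal with fetching, dates, encodings and malformed feeds.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 feed; real feeds carry more fields per item.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/second</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    # The format itself says where entries live, so there is no need
    # to guess which parts of a page are posts and which are navigation.
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in read_items(FEED):
    print(title, link)
```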

That's it...I'm going back to my vacation. The world has gone too loopy for me.