September 30, 2005
@ 08:14 PM

There have been a number of amusing discussions in the recent back and forth between Robert Scoble and several others on whether OPML is a crappy XML format. In posts such as OPML "crappy" Robertson says and More on crappy formats Robert defends OPML. I've seen some really poor arguments made as people rushed to bash Dave Winer and OPML but  none made me want to join the discussion until this morning.

In the post Some one has to say it again… brainwagon writes

Take for example Mark Pilgrim's comments:

I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML.

The reason just isn't that programmers are lazy (we are, but we also like stuff to work). The fact is that the specification itself is ambiguous and weak enough that nobody really knows what it means. As a result, there are all sorts of flavors of RSS out there, and parsing them is a big hassle.

The promise of XML was that you could ignore the format and manipulate data using standard off-the-shelf-tools. But that promise is largely negated by the ambiguity in the specification, which results in ill-formed RSS feeds, which cannot be parsed by standard XML feeds. Since Dave Winer himself managed to get it wrong as late as the date of the above article (probably due to an error that I myself have done, cutting and pasting unsafe text into Wordpress) we really can't say that it's because people don't understand the specification unless we are willing to state that Dave himself doesn't understand the specification.

As someone who has (i) written a moderately popular RSS reader and (ii) worked on the XML team at Microsoft for three years, I know a thing or two about XML-related specifications. Blaming malformed XML in RSS feeds on the RSS specification is silly. That's like blaming the large number of HTML pages that don't validate on the W3C's HTML specification instead of on the fact that instead of erroring on invalid web pages web browsers actually try to render them. If web browsers didn't render invalid web pages then they wouldn't exist on the Web.

Similarly, if every aggregator rejected invalid feeds then they wouldn't exist. However, just like in the browser wars, aggregator authors consider it a competitive advantage to be able to handle malformed feeds. This has nothing to do with the quality of the RSS specification [or the HTML specification], this is all about applications trying to get marketshare.  

As for whether OPML is a crappy spec? I've had to read a lot of technology specifications in my day from W3C recommendations and IETF RFCs to API documentation and informal specs. They all suck in their own ways. However experience has thought me that the bigger the spec, the more it sucks. Given that, I'd rather have a short, human readable spec that sucks a little (e.g. RSS, XML-RPC, OPML etc.) than a large, jargon filled specificaton which sucks a whole lot more (e.g. WSDL, XML Schema, C++, etc). Then there's the issue of using the right tool for the job but I'll leave that rant for another day.