On Crappy XML Formats - Dare Obasanjo's weblog

September 30, 2005

@ 08:14 PM

There have been a number of amusing discussions in the recent back and forth between Robert Scoble and several others on whether OPML is a crappy XML format. In posts such as OPML "crappy" Robertson says and More on crappy formats Robert defends OPML. I've seen some really poor arguments made as people rushed to bash Dave Winer and OPML but none made me want to join the discussion until this morning.

In the post Some one has to say it again… brainwagon writes

Take for example Mark Pilgrim's comments:

I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML.

The reason just isn't that programmers are lazy (we are, but we also like stuff to work). The fact is that the specification itself is ambiguous and weak enough that nobody really knows what it means. As a result, there are all sorts of flavors of RSS out there, and parsing them is a big hassle.

The promise of XML was that you could ignore the format and manipulate data using standard off-the-shelf-tools. But that promise is largely negated by the ambiguity in the specification, which results in ill-formed RSS feeds, which cannot be parsed by standard XML feeds. Since Dave Winer himself managed to get it wrong as late as the date of the above article (probably due to an error that I myself have done, cutting and pasting unsafe text into Wordpress) we really can't say that it's because people don't understand the specification unless we are willing to state that Dave himself doesn't understand the specification.

As someone who has (i) written a moderately popular RSS reader and (ii) worked on the XML team at Microsoft for three years, I know a thing or two about XML-related specifications. Blaming malformed XML in RSS feeds on the RSS specification is silly. That's like blaming the W3C's HTML specification for the large number of HTML pages that don't validate, when the real cause is that web browsers try to render invalid pages instead of raising an error. If web browsers refused to render invalid pages, such pages wouldn't exist on the Web.

Similarly, if every aggregator rejected invalid feeds, invalid feeds wouldn't exist. However, just as in the browser wars, aggregator authors consider it a competitive advantage to be able to handle malformed feeds. This has nothing to do with the quality of the RSS specification [or the HTML specification]; it is all about applications trying to gain market share.
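To make the strict-versus-lenient distinction concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample feeds and the helper names `is_well_formed` and `escape_bare_ampersands` are made up for illustration; the regex repair is just one plausible heuristic a forgiving aggregator might apply, not how any particular aggregator actually works.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feeds: the second has an unescaped ampersand,
# one of the problems mentioned in the quoted comments above.
GOOD_FEED = '<rss version="2.0"><channel><title>News &amp; Views</title></channel></rss>'
BAD_FEED = '<rss version="2.0"><channel><title>News & Views</title></channel></rss>'


def is_well_formed(xml_text: str) -> bool:
    """Strict behaviour: accept the feed only if a standard XML parser does."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def escape_bare_ampersands(xml_text: str) -> str:
    """Lenient behaviour (illustrative heuristic): escape any '&' that does
    not already start something shaped like an entity or character reference."""
    return re.sub(r"&(?!#?\w+;)", "&amp;", xml_text)


if __name__ == "__main__":
    print(is_well_formed(GOOD_FEED))
    print(is_well_formed(BAD_FEED))
    print(is_well_formed(escape_bare_ampersands(BAD_FEED)))
```

A strict aggregator stops at `is_well_formed` and reports the error to the user; a lenient one quietly runs something like `escape_bare_ampersands` and carries on, which is exactly why publishers of broken feeds never notice the problem.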

As for whether OPML is a crappy spec: I've had to read a lot of technology specifications in my day, from W3C recommendations and IETF RFCs to API documentation and informal specs. They all suck in their own ways. However, experience has taught me that the bigger the spec, the more it sucks. Given that, I'd rather have a short, human-readable spec that sucks a little (e.g. RSS, XML-RPC, OPML, etc.) than a large, jargon-filled specification which sucks a whole lot more (e.g. WSDL, XML Schema, C++, etc.). Then there's the issue of using the right tool for the job, but I'll leave that rant for another day.

Categories: XML


Friday, 30 September 2005 21:53:55 (GMT Daylight Time, UTC+01:00)

Dare, I believe the beef that most people have with OPML is its ambiguity and usefulness given the lack of any idea what to put into the type elements. If the spec leaves something so prone to abuse (do I use text or string? what do you use?) how can it work? It's fine for a limited domain (where you can ignore type because you know it is a list of feeds or you agree in person with the person you are working with) but does it really scale well?

Ross

Friday, 30 September 2005 22:11:08 (GMT Daylight Time, UTC+01:00)

Ross,
I honestly haven't looked at OPML usage within the particular domain in question in the current debate. I know OPML is good for some things and bad at others which is where my statement about using the right tool for the job comes from. However I haven't taken a look to see whether the feature being debated by Scoble and co. falls into one or the other situation.

Dare Obasanjo

Friday, 30 September 2005 22:21:28 (GMT Daylight Time, UTC+01:00)

Dare,

I know in the past I've implemented features based on common usage without reference to a spec (I know, naughty AND dangerous) but you should take a look at the OPML spec. It is so ambiguous as to be dangerous; I think I would have preferred a single paragraph and an XSD to that whole spec. If I were to suggest you implement a particular XML dialect and then didn't specify what you could put in one of the key attributes, I'm guessing you'd be worried. Questions about namespace-qualifying attributes I can handle; not knowing valid values for attributes worries me.

As for the RSS criticism, it wasn't just the entity encoding (as far as I can see); it was also questions such as how many enclosure elements you are allowed in a single entry.

Ross

Saturday, 01 October 2005 17:02:27 (GMT Daylight Time, UTC+01:00)

Dare,

Your post has been copied at http://www.kapoorsolutions.com/Reblogger/AspDotNetNewsPost10312.aspx
but I can't see your name anywhere on it. Thought you should know.

Ross

Comments are closed.
