Mark Pilgrim has a fairly interesting post entitled There are no exceptions to Postel’s Law which contains the following gem:

There have been a number of unhelpful suggestions recently on the Atom mailing list...

Another suggestion was that we do away with the Atom autodiscovery <link> element and “just” use an HTTP header, because parsing HTML is perceived as being hard and parsing HTTP headers is perceived as being simple. This does not work for Bob either, because he has no way to set arbitrary HTTP headers. It also ignores the fact that the HTML specification explicitly states that all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element. So instead of requiring clients to parse HTML, we should “just” require them to parse HTTP headers... and HTML.

Given that I am the one who made this unhelpful suggestion on the ATOM list, it only seems fair that I clarify it. The current proposal for how an ATOM client (for example, a future version of RSS Bandit) locates the ATOM feed for a website, or posts a blog entry or comment, is Mark Pilgrim's ATOM autodiscovery RFC, which basically boils down to parsing the webpage for <link> tags that point to the ATOM feed or web service endpoints. This is very similar to RSS autodiscovery, which has been a feature of RSS Bandit for several months.

The problem with this approach is that an ATOM client has to know how to parse HTML on the Web in all its screwed-up glory, including broken XHTML documents that aren't even well-formed XML, documents that use incorrect encodings, and other forms of tag soup. Thankfully, on major platforms developers don't have to worry about rewriting the equivalent of the Internet Explorer or Mozilla parser themselves, because others have done so and made the libraries freely available. For Java there's John Cowan's TagSoup parser, while for C# there's Chris Lovett's SgmlReader (speaking of which, it looks like he updated it just a few days ago, meaning I need to upgrade the version used by RSS Bandit). In RSS Bandit I use SgmlReader, which in general works fine until confronted with weirdness such as the completely broken HTML produced by old versions of Microsoft Word, including tags such as

<?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office" />

Over time I've figured out how to work around the markup that SgmlReader can't handle, but it's been a pain to track down the problem cases, and I often ended up finding out about them via bug reports from frustrated users. Now Mark Pilgrim is proposing that ATOM clients go through the same problems faced by folks like me who've had to deal with RSS autodiscovery.
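For illustration, here is a minimal sketch of the kind of <link>-tag scanning an autodiscovery client has to do, using Python's standard html.parser. This is a hypothetical example, not RSS Bandit's actual SgmlReader-based code, and it assumes reasonably clean markup, which is exactly what you can't assume in practice:

```python
from html.parser import HTMLParser

class AtomLinkFinder(HTMLParser):
    """Collect href values of <link> tags that advertise an Atom feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # Autodiscovery links look like:
        # <link rel="alternate" type="application/atom+xml" href="...">
        if (a.get("rel", "").lower() == "alternate"
                and a.get("type", "").lower() == "application/atom+xml"
                and "href" in a):
            self.feeds.append(a["href"])

page = ('<html><head><title>My Blog</title>'
        '<link rel="alternate" type="application/atom+xml" href="/atom.xml">'
        '</head><body>...</body></html>')
finder = AtomLinkFinder()
finder.feed(page)
print(finder.feeds)  # ['/atom.xml']
```

A parser like this copes with valid pages; the complaint in the post is that real-world pages are frequently not valid, which is where the tag soup libraries come in.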

So I proposed an alternative: instead of every ATOM client requiring an HTML parser, this information could be provided in a custom HTTP header returned by the website. Custom HTTP headers are commonplace on the World Wide Web and are widely supported by most web development technologies. The most popular extension header I've seen is the X-Powered-By header, although the most entertaining is the X-Bender header returned by Slashdot, which contains a quote from Futurama's Bender. You can test for yourself which sites return custom HTTP headers by trying out Rex Swain's HTTP Viewer. Not only is generating custom headers widely supported by web development technologies like PHP and ASP.NET, but extracting them from an HTTP response is also fairly trivial on most platforms, since practically every HTTP library gives you a handy way to extract the headers from a response as a collection or similar data structure.
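To show how trivial the client side is, here is a sketch using Python's standard email.parser to pull a header out of a raw HTTP response header block (the X-Atom-Feed header name is purely hypothetical, used for illustration; no such header was standardized):

```python
from email.parser import HeaderParser

# A raw HTTP response header block, as returned by a web server.
# X-Atom-Feed is a made-up header name for this illustration.
raw_headers = (
    "Content-Type: text/html\r\n"
    "X-Powered-By: PHP/4.3.4\r\n"
    "X-Atom-Feed: http://example.org/atom.xml\r\n"
    "\r\n"
)

headers = HeaderParser().parsestr(raw_headers)

# Header lookup is case-insensitive and needs no HTML parsing at all.
print(headers["X-Atom-Feed"])  # http://example.org/atom.xml
```

Compare this handful of lines with maintaining a tag soup parser: the header grammar is simple enough that every HTTP library already handles it for you.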

If ATOM autodiscovery used a custom header instead of requiring clients to use an HTML parser, it would make the process more reliable (no more worrying about malformed [X]HTML borking the process), which is good for users, as I can attest from my experiences with RSS Bandit, and it would reduce the complexity of client applications (no dependence on a tag soup parsing library).

Reading Mark Pilgrim's post, the only major objection he raises seems to be that the average user (Bob) doesn't know how to add custom HTTP headers to his site. This is a fallacious argument, given that the average user similarly doesn't know how to generate an XML feed from his weblog either. The expectation is that Bob's blogging software should do this, not that Bob will be generating this stuff by hand.

Mark also incorrectly states that the HTML spec says “all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element”. The HTML specification actually states:

META and HTTP headers

The http-equiv attribute can be used in place of the name attribute and has a special significance when documents are retrieved via the Hypertext Transfer Protocol (HTTP). HTTP servers may use the property name specified by the http-equiv attribute to create an [RFC822]-style header in the HTTP response. Please see the HTTP specification ([RFC2616]) for details on valid HTTP headers.

The following sample META declaration:

<META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">

will result in the HTTP header:

Expires: Tue, 20 Aug 1996 14:25:27 GMT

That's right: the HTML spec says that authors can put <meta http-equiv="..."> in their HTML documents, and when a web server gets a request for a document it should parse out these tags and use them to add HTTP headers to the response. In reality this turned out to be infeasible, because it would be highly inefficient, requiring web servers to run a tag soup parser over a file each time they served it up just to determine which headers to send in the response. So what ended up happening is that certain browsers support a limited subset of HTTP headers if they appear as <meta http-equiv="..."> in the document.
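To make the inefficiency concrete, this is roughly the scan a spec-conforming server would have to run over every document on every request (a hypothetical sketch using Python's stdlib html.parser, not any real server's code):

```python
from html.parser import HTMLParser

class HttpEquivFinder(HTMLParser):
    """Collect (header, value) pairs from <meta http-equiv="..."> tags."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "http-equiv" in a and "content" in a:
                self.headers.append((a["http-equiv"], a["content"]))

doc = ('<html><head>'
       '<meta http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">'
       '</head><body>...</body></html>')
finder = HttpEquivFinder()
finder.feed(doc)
print(finder.headers)  # [('Expires', 'Tue, 20 Aug 1996 14:25:27 GMT')]
```

A server doing this parse per request, for every file it serves, instead of just streaming bytes off disk, is exactly why the spec's behavior was never widely implemented on the server side.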

It is unsurprising that Mark mistakes what ended up being implemented by the major browsers and web servers for what was in the spec; after all, he who writes the code makes the rules.

At this point I'd definitely like to see an answer to the questions Dave Winer asked on the atom-syntax list about its decision-making process. So far it's seemed like there's a bunch of discussion on the mailing list or on the Wiki, which afterwards may be ignored by the powers that be who end up writing the specs (he who writes the spec makes the rules). The choice of <link> tags over RSD for ATOM autodiscovery is just one of many examples of this occurrence. It'd be nice to see some documentation of the actual process, as opposed to the anarchy and “might is right” approach that currently exists.


Sunday, January 11, 2004 4:03:06 AM (GMT Standard Time, UTC+00:00)
Custom HTTP headers cannot be configured for static files on some hosted accounts. Although the web server might support custom headers for static files, the host might not have enabled the functionality for its users. In the case of Apache, the host must AllowOverride FileInfo.
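For context, setting a custom header on static files under Apache would look something like the following .htaccess fragment, which only works if the host permits it (the X-Atom-Feed header name and feed URL are hypothetical, for illustration):

```apache
# .htaccess — requires mod_headers to be loaded, and the host's config
# must grant "AllowOverride FileInfo" for this directory.
# X-Atom-Feed is a made-up header name used here for illustration.
Header set X-Atom-Feed "http://example.org/atom.xml"
```

Without that AllowOverride grant from the host, the directive is simply ignored (or triggers a server error), which is the commenter's point about hosted accounts.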

Before accepting the custom header approach, some research will need to be done to find out how many users can configure custom headers for static files.

Note that static files are used by many popular blogging tools including Radio, Blogger and MT.
Sunday, January 11, 2004 5:21:18 PM (GMT Standard Time, UTC+00:00)
"... client has to know how to parse HTML on the Web in all it's screwed up glory including broken XHTML documents that aren't even wellformed XML..."

Documents aren't XHTML documents if they aren't well-formed... it's one of the requirements of XHTML. I think it's fair, in this day and age, to fail parsing of an XHTML document that is not well-formed. It is the author's duty to ensure well-formedness, and if he/she doesn't validate that before publishing said document, then they only have themselves to blame when people can't view that document. If we keep writing software that attempts to interpret people's broken documents, we're never going to get anywhere.
Sunday, January 11, 2004 9:20:18 PM (GMT Standard Time, UTC+00:00)
I prefer the use of the 'link' element to indicate alternate representations of a site, whether they be RSS or what-have-you. Feed parsers should simply give up on any page that isn't well-formed. If a page author cannot create proper HTML/XHTML, I doubt they're going to be emitting proper RSS, etc. As Drew said, bending over backwards to adapt to bad markup encourages the use of bad markup.

On a side note, I, too, am baffled by the PEA (Pie|Echo|Atom) decision process. I had tried to follow the Wiki discussion, but, given real-world constraints, could not keep up on it every day. And one day, while poking around to see what was new, I found that the name "Atom" had been picked; I could swear I had checked the site no more than a day or so before, and the name discussion was still going. It was then I got the feeling the decision process was not meant to be inclusive, but was instead designed to gather ideas and suggestions from an interested and informed community before someone or other issued a sudden decree.

Tuesday, January 13, 2004 12:37:28 AM (GMT Standard Time, UTC+00:00)
Errm, correct me if I'm wrong, but wasn't it my unhelpful suggestion?
Tuesday, January 13, 2004 2:37:33 PM (GMT Standard Time, UTC+00:00)
You did make the formal proposal for using an HTTP header but I initially questioned the autodiscovery proposal. I assume that Mark probably meant that my questioning which led to your proposal was unhelpful. If you want me to edit my post I'll go ahead and do that.
Thursday, January 15, 2004 3:32:00 AM (GMT Standard Time, UTC+00:00)
You're conveniently ignoring the fact that an HTTP-header-only solution doesn't work for our target audience, which can be summed up as "anyone who can run Movable Type".

Also, your entire argument against autodiscovery-via-link-tag is ridiculous in the face of your position on rejecting XML that is not well-formed. At least be consistent and reject XHTML that is not well-formed when you're looking for link elements. Your smear tactic of calling it "the tag soup autodiscovery spec" is similarly ridiculous. The spec says nothing about parsing tag soup. The accompanying test suite contains only valid HTML and XHTML documents. Yet you go on and on about the lengths you go to in order to parse tag soup.

Why the double standard? Are you really so pissed that Ted reported an RSS Bandit bug in public?
Comments are closed.