Mark Pilgrim has a fairly interesting post entitled There are no exceptions to Postel’s Law which contains the following gem

There have been a number of unhelpful suggestions recently on the Atom mailing list...

Another suggestion was that we do away with the Atom autodiscovery <link> element and “just” use an HTTP header, because parsing HTML is perceived as being hard and parsing HTTP headers is perceived as being simple. This does not work for Bob either, because he has no way to set arbitrary HTTP headers. It also ignores the fact that the HTML specification explicitly states that all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element. So instead of requiring clients to parse HTML, we should “just” require them to parse HTTP headers... and HTML.

Given that I am the one who made this unhelpful suggestion on the ATOM list, it only seems fair that I clarify my suggestion. The current proposal for how an ATOM client (for example, a future version of RSS Bandit) determines how to locate the ATOM feed for a website or post a blog entry or comment is via Mark Pilgrim's ATOM autodiscovery RFC, which basically boils down to parsing the webpage for <link> tags that point to the ATOM feed or web service endpoints. This is very similar to RSS autodiscovery, which has been a feature of RSS Bandit for several months.

The problem with this approach is that it means that an ATOM client has to know how to parse HTML on the Web in all its screwed up glory, including broken XHTML documents that aren't even well-formed XML, documents that use incorrect encodings and other forms of tag soup. Thankfully on major platforms developers don't have to worry about figuring out how to rewrite the equivalent of the Internet Explorer or Mozilla parser themselves because others have done so and made the libraries freely available. For Java there's John Cowan's TagSoup parser while for C# there's Chris Lovett's SgmlReader (speaking of which it looks like he just updated it a few days ago, meaning I need to upgrade the version used by RSS Bandit). In RSS Bandit I use SgmlReader, which in general works fine until confronted with weirdness such as the completely broken HTML produced by old versions of Microsoft Word, including tags such as

<?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office" />

Over time I've figured out how to work past the markup that SgmlReader can't handle, but it's been a pain to track down what the problem constructs were and I often ended up finding out about them via bug reports from frustrated users. Now Mark Pilgrim is proposing that ATOM clients must go through the same problems faced by folks like me who've had to deal with RSS autodiscovery.
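To make the parsing burden concrete, here is a minimal Python sketch of <link>-based autodiscovery built on the standard library's tolerant `html.parser`. It is an illustration only, not RSS Bandit's code or the algorithm from the autodiscovery RFC: the names `FeedLinkFinder` and `find_feed_links` are made up, and it assumes feeds are advertised with rel="alternate" and the media type application/atom+xml.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link> tags that advertise an Atom feed."""

    # Assumption for this sketch: feeds use rel="alternate" plus this type.
    FEED_TYPE = "application/atom+xml"

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        link_type = attributes.get("type", "").strip().lower()
        if ("alternate" in rel_values
                and link_type == self.FEED_TYPE
                and attributes.get("href")):
            self.feed_urls.append(attributes["href"])


def find_feed_links(html_text):
    """Return every advertised feed URL found in a (possibly messy) page."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.feed_urls
```

Even this small version has to cope with upper-case tags, unquoted attributes and stray processing instructions, and it still says nothing about wrong encodings or relative URLs, which is where most of the real pain comes from.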

So I proposed an alternative: instead of every ATOM client having to include an HTML parser, this information is provided in a custom HTTP header that is returned by the website. Custom HTTP headers are commonplace on the World Wide Web and are widely supported by most web development technologies. The most popular extension header I've seen is the X-Powered-By header, although I'd say the most entertaining is the X-Bender header returned by Slashdot, which contains a quote from Futurama's Bender. You can test for yourself which sites return custom HTTP headers by trying out Rex Swain's HTTP Viewer. Not only is generating custom headers widely supported by web development technologies like PHP and ASP.NET, but extracting them from an HTTP response is also fairly trivial on most platforms, since practically every HTTP library gives you a handy way to extract the headers from a response in a collection or similar data structure.
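As a rough sketch of how little client code the header approach needs, the Python below pulls a feed URL out of a block of response headers. The header name `X-Atom-Feed` and the function `feed_url_from_headers` are hypothetical, chosen only for illustration; no header name was ever agreed on.

```python
from email.parser import Parser

# Hypothetical header name, used purely for illustration.
FEED_HEADER = "X-Atom-Feed"


def feed_url_from_headers(raw_headers):
    """Return the feed URL advertised in a block of HTTP header lines.

    raw_headers holds the header lines only (no status line).
    Returns None when the header is absent.
    """
    headers = Parser().parsestr(raw_headers, headersonly=True)
    value = headers.get(FEED_HEADER)  # header lookup is case-insensitive
    return value.strip() if value else None
```

In practice you would rarely parse the raw text yourself: for example, the object returned by `urllib.request.urlopen(url)` exposes a `headers` attribute that supports the same case-insensitive `get` lookup.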

If ATOM autodiscovery used a custom header as opposed to requiring clients to use an HTML parser it would make the process more reliable (no more worry about malformed [X]HTML borking the process) which is good for users as I can attest from my experiences with RSS Bandit and reduce the complexity of client applications (no dependence on a tag soup parsing library).  

Reading Mark Pilgrim's post, the only major objection he raises seems to be that the average user (Bob) doesn't know how to add custom HTTP headers to their site, which is a fallacious argument given that the average user similarly doesn't know how to generate an XML feed from their weblog either. The expectation is that Bob's blogging software should do this, not that Bob will be generating this stuff by hand.

Mark also incorrectly states that the HTML spec says that “all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element”. The HTML specification actually states

META and HTTP headers

The http-equiv attribute can be used in place of the name attribute and has a special significance when documents are retrieved via the Hypertext Transfer Protocol (HTTP). HTTP servers may use the property name specified by the http-equiv attribute to create an [RFC822]-style header in the HTTP response. Please see the HTTP specification ([RFC2616]) for details on valid HTTP headers.

The following sample META declaration:

<META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">

will result in the HTTP header:

Expires: Tue, 20 Aug 1996 14:25:27 GMT

That's right, the HTML spec says that authors can put <meta http-equiv="..."> in their HTML documents and that when a web server gets a request for a document it should parse out these tags and use them to add HTTP headers to the response. In reality this turned out to be infeasible because it would be highly inefficient and require web servers to run a tag soup parser over a file each time they served it up to determine which headers to send in the response. So what ended up happening is that certain browsers support a limited subset of the HTTP headers if they appear as <meta http-equiv="..."> in the document.

It is unsurprising that Mark mistakes what ended up being implemented by the major browsers and web servers for what was in the spec; after all, he who writes the code makes the rules.

At this point I'd definitely like to see an answer to the questions Dave Winer asked on the atom-syntax list about its decision making process. So far it's seemed like there's a bunch of discussion on the mailing list or on the Wiki which afterwards may be ignored by the powers that be who end up writing the specs (he who writes the spec makes the rules). The choice of <link> tags over using RSD for ATOM autodiscovery is just one of many examples of this occurrence. It'd be nice to see some documentation of the actual process as opposed to the anarchy and “might is right” approach that currently exists.


Categories: XML

January 10, 2004
@ 07:58 PM

The National Pork Board would like to remind you self-righteous, holier-than-thou beef-eatin' snobs there's never been a single case of "mad pig" disease.

Pork: The other white meat, Bee-yotch!

The Boondocks comic is consistently funny unlike other online comics that have recently started falling off *cough*Sluggy*cough*. Also it is the only other regular newspaper comic that decries the insanity of the current situation in the US.


Slashdot ran yet another article on outsourcing today, this one about how Tech Firms Defend Moving Jobs Overseas. It had the usual comments one's come to expect from such stories. It's been quite interesting watching the attitudes of the folks on Slashdot over the past few years. I started reading the site around the time of the RedHat IPO when everyone was cocky and folks used to brag about getting cars as signing bonuses. Then came the beginning of the downturn, when the general sentiment was that only those who couldn't hack it were getting fired. Then the feeling that the job loss was more commonplace started to spread and the xenophobic phase began with railings against H1Bs. Now it seems every other poster is either out of work or just got a job after being out of work for a couple of months. The same folks who used to laugh at the problems the RIAA had dealing with the fact that "their business model was obsolete in a digital world" now seek protectionist government policies to deal with the fact that their IT careers are obsolete in a global economy.

Anyway, I digress. I found an interesting link in one of the posts to an article on FastCompany entitled The Wal-Mart You Don't Know. It begins

A gallon-sized jar of whole pickles is something to behold. The jar is the size of a small aquarium. The fat green pickles, floating in swampy juice, look reptilian, their shapes exaggerated by the glass. It weighs 12 pounds, too big to carry with one hand. The gallon jar of pickles is a display of abundance and excess; it is entrancing, and also vaguely unsettling. This is the product that Wal-Mart fell in love with: Vlasic's gallon jar of pickles.

Wal-Mart priced it at $2.97--a year's supply of pickles for less than $3! "They were using it as a 'statement' item," says Pat Hunn, who calls himself the "mad scientist" of Vlasic's gallon jar. "Wal-Mart was putting it before consumers, saying, This represents what Wal-Mart's about. You can buy a stinkin' gallon of pickles for $2.97. And it's the nation's number-one brand."

Therein lies the basic conundrum of doing business with the world's largest retailer. By selling a gallon of kosher dills for less than most grocers sell a quart, Wal-Mart may have provided a service for its customers. But what did it do for Vlasic? The pickle maker had spent decades convincing customers that they should pay a premium for its brand. Now Wal-Mart was practically giving them away. And the fevered buying spree that resulted distorted every aspect of Vlasic's operations, from farm field to factory to financial statement.

and has this somewhere in the middle

Wal-Mart has also lulled shoppers into ignoring the difference between the price of something and the cost. Its unending focus on price underscores something that Americans are only starting to realize about globalization: Ever-cheaper prices have consequences. Says Steve Dobbins, president of thread maker Carolina Mills: "We want clean air, clear water, good living conditions, the best health care in the world--yet we aren't willing to pay for anything manufactured under those restrictions."

which is particularly interesting given the various points I've seen raised about outsourcing in the IT field. The US is definitely headed for interesting times.


January 9, 2004
@ 05:27 AM

For the last couple of months I've noticed a rather annoying bug with my cellphone, an LG 5350. Whenever I enter a new contact it also copies the person's number over the number of an existing contact. If I later delete the new entry it also deletes the copied over number from the other contact. I've lost a couple of folks' phone numbers due to this annoyance. I'm now in the market for a new phone.

The main features I want besides the standard cell phone features (makes calls, addressbook) are the ability to sync with my calendar in Outlook and perhaps the ability to get information on traffic conditions as well.   

I'm currently torn between getting a SmartPhone and a Pocket PC Phone Edition. Too bad stores don't let you test drive cellphones like they do cars.


Categories: Ramblings

January 6, 2004
@ 09:28 PM

In response to my previous post David Orchard provides a link to his post entitled XQuery: Meet the Web where he writes

In fact, this separation of the private and more general query mechanism from the public facing constrained operations is the essence of the movement we made years ago to 3 tier architectures. SQL didn't allow us to constrain the queries (subset of the data model, subset of the data, authorization) so we had to create another tier to do this.

What would it take to bring the generic functionality of the first tier (database) into the 2nd tier, let's call this "WebXQuery" for now. Or will XQuery be hidden behind Web and WSDL endpoints?

Every way I try to interpret this it seems like a step back to me. It seems like in general the software industry decided that exposing your database and query language directly to client applications was the wrong way to build software, and 2-tier client-server architectures giving way to N-tier architectures was an indication of this trend. I fail to see why one would think it is a good idea to allow clients to issue arbitrary XQuery queries but not think the same for SQL. From where I sit there is little if any difference between the two choices for queries. Note that although SQL also has a Data Definition Language (DDL) and Data Manipulation Language (DML) as well as a query language, for the purposes of this discussion I'm only considering the query aspects of SQL.

David then puts forth some questions about this idea that I can't help offering my opinions on

If this is an interesting idea, of providing generic and specific query interfaces to applications, what technology is necessary? I've listed a number of areas that I think need examination before we can get to XQuery married to the Web and to make a generic second tier.

1. How to express that a particular schema is queryable and the related bindings and endpoint references to send and receive the queries. Some WSDL extensions would probably do the trick.

One thing lacking in the XML Web Services world is the simple REST-like notion of GET and POST. In the RESTful HTTP world one would simply specify a URI on which one could perform an HTTP GET and get back an XML document. One could then either use the hierarchy of the URI to select subsets of the document or perhaps use HTTP POST to send more complex queries. There is all this indirection with WSDL files and SOAP headers, yet functionality such as what Yahoo has done with their Yahoo! News Search RSS feeds isn't straightforward. I agree that WSDL annotations would do the trick but then you have to deal with the fact that WSDLs themselves are not discoverable. *sigh* Yet more human intervention is needed instead of loosely coupled application building.

2. Limit the data set returned in a query. There's simply no way an large provider of data is going to let users retrieve the data set from a query. Amazon is just not going to let "select * from *" happen. Perhaps fomal support in XQuery for ResultSets to be layered on any query result would do the trick. A client would then need to iterate over the result set to get all the results, and so a provider could more easily limit the # of iterations. Another mechanism is to constrain the Return portion of XQuery. Amazon might specify that only book descriptions with reviews are returnable.

This is just a difficult problem. Some queries are complex and computationally intensive but return few results. In some cases it is hard to tell by just looking at the query how badly it'll perform. A notion of returning result sets makes sense in a mid-tier application that's talking to a database but not to a client app half-way across the world talking to a website.

3. Subset the Xquery functionality. Xquery is a very large and complicated specification. There's no need for all that functionality in every application. This would make implementation of XQuery more wide spread as well. Probably the biggest subset will be Read versus Update.

Finally something I agree with, although David shows some ignorance of XQuery by assuming that there is an update aspect to XQuery when DML was shelved for the 1.0 version. XQuery is just a query language. However it is an extremely complex query language whose specification runs to hundreds of pages. The most relevant specs from the W3C XML Query page are linked to below

I probably should also link to the W3C XML Schema: Structures and W3C XML Schema: Datatypes specs since they are the basis of the type system of XQuery. My personal opinion is that XQuery is probably too complex to use as the language for such an endeavor, since you want something that is simple to implement and fairly straightforward so that there can be ubiquitous implementations and therefore lots of interoperability (unlike the current situation with W3C XML Schema). I personally would start with XPath 1.0 and subset or modify that instead of XQuery.

4. Data model subsets. Particular user subsets will only be granted access to a subset of the data model. For example, Amazon may want to say that book publishers can query all the reviews and sales statistics for their books but users can only query the reviews. Maybe completely separate schemas for each subset. The current approach seems to be to do an extract of the data subset accoring to each subset, so there's a data model for publishers and a data model for users. Maybe this will do for WebXQuery.

5. Security. How to express in the service description (wsdl or policy?) that a given class of users can perform some subset of the functionality, either the query, the data model or the data set. Some way of specifying the relationship between the set of data model, query functionality, data set and authorization.

I'd say the above two features are tied together. You need some way to restrict what the sys admin vs. the regular user executing such a query over the wire can do as well as a way to authenticate them.  

6. Performance. The Web has a great ability to increase performance because resources are cachable. The design of URIs and HTTP specifically optimizes for this. The ability to compare URIs is crucial for caching., hence why so much work went into specifying how they are absolutized and canonically compared. But clearly XQuery inputs are not going to be sent in URIs, so how do we have cachable XQueries gven that the query will be in a soap header? There is a well defined place in URIs for the query, but there isn't such a thing in SOAP. There needs to be some way of canonicalizing an Xquery and knowing which portions of the message contain the query. Canonicalizing a query through c14n might do the trick, though I wonder about performance. And then there's the figuring out of which header has the query. There are 2 obvious solutions: provide a description annotation or an inline marker. I don't think that requiring any "XQuery cache" engine to parse the WSDL for all the possible services is really going to scale, so I'm figuring a well-defined SOAP header is the way to go.

Sounds like overthinking the problem and yet not being general enough. The first problem is that there should be standard ways that proxies and internet caches know how to cache XML Web Service results in the same way that they know how to cache results of HTTP GET requests today. After that figuring out how to canonicalize a query expression (I'm not even sure what that means- will /root/child and /root/*[local-name()='child'] be canonicalized into the same thing?) is probably a couple of Ph.D theses of work.

Then there's just the fact that allowing clients to ship arbitrary queries to the server is a performance nightmare waiting to happen...

Your thoughts? Is WebXQuery an interesting idea and what are the hurdles to overcome?

It's an interesting idea but I suspect not a very practical or useful one outside of certain closed applications with strict limitations on the number of users or the type of queries issued.  

Anyway, I'm off to play in the snow. I just saw someone skiing down the street. Snow storms are fun.


Categories: XML

Jon Udell recently wrote in his post entitled XML for the rest of us

By the way, Adam Bosworth said a great many other interesting things in his XML 2003 talk. For those of you not inclined to watch this QuickTime clip -- and in particular for the search crawlers -- I would like to enter the following quote into the public record.

The reason people get scared of queries is that it's hard to say 'You can send me this kind of query, but not that kind of query.' And therefore it's hard to have control, and people end up building other systems. It's not clear that you always want query. Sometimes people can't handle arbitrary statements. But we never have queries. I don't have a way to walk up to Salesforce and Siebel and say tell me everything I know about the customer -- in the same way. I don't even have a way to say tell me everything about the customers who meet the following criteria. I don't have a way to walk up to Amazon and Barnes and Noble and in a consistent way say 'Find me all the books reviewed by this person,' or even, 'Find me the reviews for this book.' I can do that for both, but not in the same way. We don't have an information model. We don't have a query model. And for that, if you remember the dream we started with, we should be ashamed.

I think we can fix this. I think we can take us back to a world that's a simple world. I think we can go back to a world where there are just XML messages flowing back and forth between...resources. <snipped />

Three things jump out at me from that passage. First, the emphasis on XML query. My instincts have been leading me in that direction for a while now, and much of my own R&D in 2003 was driven by a realization that XPath is now a ubiquitous technology with huge untapped potential. Now, of course, XQuery is coming on like a freight train.

When Don and I hung out over the holidays this was one of the things we talked about. Jon's post has been sitting flagged for follow up in my aggregator for a while. Here are my thoughts...  

The main problem is that there are a number of websites which have the same information but do not provide a uniform way to access it, and when access mechanisms are provided they do not allow ad-hoc queries. So the first thing that is needed is a shared view (or schema) of what this information looks like, which is the shared information model Adam talks about. There are two routes you can take with this: one is to define a shared data model with the transfer syntax being secondary (i.e. use RDF) while another is to define a shared data model and transfer syntax (i.e. use XML). In most cases, people have tended to pick the latter.

Once an XML representation of the information users are interested in has been designed (i.e. the XML schema for books, reviews and wishlists that could be exposed by sites like Amazon or Barnes & Noble) the next technical problem to be solved is uniform access mechanisms: the eternal REST vs. SOAP vs. XML-RPC debate that has plagued a number of online discussions. Then there's deployment, adoption and evangelism.

Besides the fact that I've glossed over the significant political and business reasons that may or may not make such an endeavor fruitful we still haven't gotten to Adam's Nirvana. We still need a way to process the data exposed by these web services in arbitrary ways. How does one express a query such as "Find all the CDs released between 1990 and 1999 that Dare Obasanjo rated higher than 3 stars"? Given the size of the databases hosted by such sites would it make more sense to ship the documents to the client or some mid-tier which then performs the post-processing of the raw data instead of pushing such queries down to the database? What are the performance ramifications of exposing your database to anyone with a web browser and allowing them to run ad-hoc queries instead of specially optimized, canned queries? 

At this point, if you are like me, you might suspect that having the web service endpoints return the results of canned queries, which can then be post-processed by the client, may be more practical than expecting to be able to ship arbitrary SQL/XML, XQuery or XPath queries to web service endpoints.
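A minimal Python sketch of that division of labor follows: the service answers one canned query ("all of this reviewer's ratings") and the client does the ad-hoc filtering. The XML layout (a `ratings` root with `cd` elements carrying `title`, `released` and `stars` attributes) and the function name `highly_rated_cds` are invented for illustration; no real site exposes exactly this format.

```python
import xml.etree.ElementTree as ET


def highly_rated_cds(xml_text, first_year, last_year, min_stars):
    """Filter a canned-query result on the client.

    Returns the titles of CDs released between first_year and last_year
    (inclusive) that were rated strictly higher than min_stars.
    """
    root = ET.fromstring(xml_text)
    titles = []
    for cd in root.iter("cd"):  # hypothetical element and attribute names
        year = int(cd.get("released"))
        stars = float(cd.get("stars"))
        if first_year <= year <= last_year and stars > min_stars:
            titles.append(cd.get("title"))
    return titles
```

Calling `highly_rated_cds(document, 1990, 1999, 3)` would answer the example question above without the server ever seeing an arbitrary query, at the cost of shipping the whole canned result to the client.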

The main problem with what I've described is that it takes a lot of effort. Coming up with standardized schema(s) and distributed computing architecture for a particular industry then driving adoption is hard even when there's lots of cooperation let alone in highly competitive markets.

In an ideal world, this degree of boot strapping would be unnecessary. After all, people can already build the kinds of applications Adam described today by screen scraping [X]HTML although they tend to be brittle. What the software industry should strive for is a way to build such applications in a similarly loosely connected manner in the XML Web Services world without requiring the heavy investment of human organizational effort that is currently needed. This was the initial promise of XML Web Services which like Adam I am ashamed has not come to pass. Instead many seem to be satisfied with reinventing DCOM/CORBA/RMI with angle-brackets (then replacing it with "binary infosets"). Unfortunate...


Categories: XML

January 6, 2004
@ 03:21 PM

The first day of work yesterday after most people being gone for 2 or 3 weeks was refreshing. I attended a few meetings about design issues we've been having and people involved seemed to be coalescing around decisions that make sense. A good start to the new year. The main problem seems to be not making the changes but avoiding breaking folks as we make them. Processing XML in the next version of the .NET Framework is going to kick the llama's azz.

I already have lots of follow up meetings scheduled for tomorrow (not today since we're supposed to have a record snowfall for this area today and most people here don't do snow driving). I'll probably be working from home; the streets look dangerous from the vantage point of my bedroom window.




Categories: Life in the B0rg Cube

I feel somewhat hypocritical today. Although I've complained about bandwidth consumed by news aggregators polling RSS feeds and have implemented support for both HTTP conditional GET and gzip compression over HTTP in RSS Bandit (along with a number of other aggregators), it turns out that I don't put my money where my mouth is, since my blogging software utilizes neither of the aforementioned bandwidth saving techniques.

Ted Leung lists a few feeds from his list of subscriptions that don't use gzip compression and mine is an example of one that doesn't. He also mentions in a previous post that

As far as gzipped feeds go, about 10% of the feeds in my NNW (about 900) are gzipped. That's a lot worse than I expected. I understand that this can be tough -- the easiest way to implement gzipping is todo what Brent suggested, shove it off to Apache. That means that people who are being hosted somewhere need to know enough Apache config to turn gzip on. Not likely. Or have enlighted hosting admins that automatically turn it on, but that' doesn't appear to be the case. So blogging software vendors could help a lot by turning gzip support on in the software.

What's even more depressing is that for HTTP conditional get, the figure is only about 33% of feeds. And this is something that the blogging software folks should do. We are doing it in pyblosxom.

This is quite unfortunate. Every few months I read some "sky is falling" post about the bandwidth costs of polling RSS feeds, yet many blogging tools don't support existing technologies that could reduce the bandwidth cost of RSS polling by an order of magnitude. I am guilty of this as well.
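For readers who haven't seen the two techniques side by side, here is a small Python sketch of what the server side of a feed request could look like. It is not how dasBlog, pyblosxom or Apache implement it: the function name `build_feed_response` is made up, and the sketch deliberately simplifies by comparing a single ETag from If-None-Match and ignoring both If-Modified-Since and q-values in Accept-Encoding.

```python
import gzip


def build_feed_response(request_headers, feed_body, etag):
    """Return (status, headers, body) for a feed request.

    request_headers: dict keyed by lower-cased header names.
    feed_body: the feed as bytes; etag: validator for the current version.
    """
    response_headers = {"ETag": etag, "Vary": "Accept-Encoding"}

    # Conditional GET: the client already holds this version, so send no body.
    if request_headers.get("if-none-match") == etag:
        return 304, response_headers, b""

    body = feed_body
    accepted = [
        token.split(";")[0].strip().lower()
        for token in request_headers.get("accept-encoding", "").split(",")
    ]
    # Compress only when the client said it understands gzip.
    if "gzip" in accepted:
        body = gzip.compress(feed_body)
        response_headers["Content-Encoding"] = "gzip"

    response_headers["Content-Length"] = str(len(body))
    return 200, response_headers, body
```

An aggregator that polls hourly and sends back the ETag it last saw gets an empty 304 most of the time, and a compressed body the rest of the time, which is where the savings come from.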

I'll investigate how difficult it'll be to practice what I preach in the next few weeks. It looks like an upgrade to my version of dasBlog is in the works.


Categories: Technology

January 6, 2004
@ 02:23 PM

The Bad Fads Museum provides an entertaining glimpse into past [and current] fads in Fashion, Collectibles, Activities and Events. To show that the site isn't mean spirited, the "About BadFads" section of the site contains the following text

While the name of this site is BAD FADS, please note that this is neither an indictment nor an endorsement of any of the fads mentioned. As you know, during the '70s the word "bad" could alternately mean "good!" Thus, this site was created to take a fun and nostalgic look at fashions, collectibles, activities and events which are cherished by some and ridiculed by others.

It's all in good fun. The best writeups are on fashion fads like Tattoos and Tie Dye T-Shirts as well as fads in collectibles such as Pet Rocks and Troll Dolls.


January 5, 2004
@ 04:26 PM

Last night I posted about bugs in RSS Bandit that others have but that aren't reproducible on my machine. Ted Leung was nice enough to post some details about the problems he had. So it turns out that RSS Bandit is at fault. The two primary ways of adding feeds to your subscription list have usability bugs.

  1. If you click the "New Feed" button and specify a feed URL without providing the URI scheme, RSS Bandit assumes the URI is a local URI (i.e. file://). Actually, it's worse: sometimes it just throws an exception.

  2. The "Locate Feed" button that uses Mark Pilgrim's Ultra-liberal RSS autodiscovery algorithm to locate a feed for a site shows a weird error message if it couldn't make sense of the HTML because it was too malformed (i.e. tag soup). There are a bunch of things I could change here, from using a better error message to falling back to using Syndic8 to find the feed.
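One plausible way to handle the first bug is to normalize whatever the user typed before treating it as a URI. The Python sketch below shows the idea only; it is not the fix that went into RSS Bandit, and the function name `normalize_feed_url` and the choice of http as the default scheme are assumptions.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "." or "-", then "://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def normalize_feed_url(text, default_scheme="http"):
    """Return the feed URL with a scheme, adding default_scheme if missing."""
    url = text.strip()
    if not url:
        raise ValueError("empty feed URL")
    if _SCHEME.match(url):
        return url  # the user supplied a scheme; leave the URL alone
    return f"{default_scheme}://{url}"
```

With this in place, input lacking a scheme is treated as a web address, while an explicit file:// URI is still respected for people who really do subscribe to local files.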

I'll fix both of these bugs before heading out to work today. Hopefully this should take care of the problems various people have had and probably never mentioned with adding feeds to RSS Bandit.

Categories: RSS Bandit