Thursday, 08 January 2004 - Dare Obasanjo's weblog

January 6, 2004

@ 09:28 PM

In response to my previous post David Orchard provides a link to his post entitled XQuery: Meet the Web where he writes

In fact, this separation of the private and more general query mechanism from the public facing constrained operations is the essence of the movement we made years ago to 3 tier architectures. SQL didn't allow us to constrain the queries (subset of the data model, subset of the data, authorization) so we had to create another tier to do this.

What would it take to bring the generic functionality of the first tier (database) into the 2nd tier, let's call this "WebXQuery" for now. Or will XQuery be hidden behind Web and WSDL endpoints?

Every way I try to interpret this it seems like a step back to me. The software industry in general decided that exposing your database and query language directly to client applications was the wrong way to build software, and 2-tier client-server architectures giving way to N-tier architectures was an indication of this trend. I fail to see why one would think it is a good idea to allow clients to issue arbitrary XQuery queries but not think the same for SQL. From where I sit there is little if any difference between the two choices for queries. Note that although SQL also has a Data Definition Language (DDL) and a Data Manipulation Language (DML) as well as a query language, for the purposes of this discussion I'm only considering the query aspects of SQL.

David then puts forth some questions about this idea that I can't help offering my opinions on

If this is an interesting idea, of providing generic and specific query interfaces to applications, what technology is necessary? I've listed a number of areas that I think need examination before we can get to XQuery married to the Web and to make a generic second tier.

1. How to express that a particular schema is queryable and the related bindings and endpoint references to send and receive the queries. Some WSDL extensions would probably do the trick.

One thing lacking in the XML Web Services world is the simple REST-like notion of GET and POST. In the RESTful HTTP world one would simply specify a URI which one could perform an HTTP GET on and get back an XML document. One could then either use the hierarchy of the URI to select subsets of the document or perhaps use HTTP POST to send more complex queries. There's all this indirection with WSDL files and SOAP headers, yet functionality such as what Yahoo has done with their Yahoo! News Search RSS feeds isn't straightforward. I agree that WSDL annotations would do the trick but then you have to deal with the fact that WSDLs themselves are not discoverable. *sigh* Yet more human intervention is needed instead of loosely coupled application building.
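As a rough illustration of the URI-hierarchy idea, here is a minimal Python sketch that maps the path of a GET request onto a subset of an XML document. The catalog document, the example.org URIs and the `select_subset` helper are all hypothetical; none of this is an existing service's API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical catalog document that a site might expose at /catalog.
CATALOG = ET.fromstring(
    "<catalog>"
    "<books><book id='1'><title>Book A</title></book>"
    "<book id='2'><title>Book B</title></book></books>"
    "<cds><cd id='9'><title>Album A</title></cd></cds>"
    "</catalog>"
)

def select_subset(root, uri):
    """Treat each path segment after the root element's name as a child step."""
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if not segments or segments[0] != root.tag:
        return []  # the URI does not address this document
    if len(segments) == 1:
        return [root]  # the URI addresses the whole document
    return root.findall("/".join(segments[1:]))

books = select_subset(CATALOG, "http://example.org/catalog/books/book")
print([b.findtext("title") for b in books])
```

Anything more involved than walking the hierarchy (filters, joins, sorting) is where the HTTP POST option mentioned above would come in.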

2. Limit the data set returned in a query. There's simply no way a large provider of data is going to let users retrieve the data set from a query. Amazon is just not going to let "select * from *" happen. Perhaps formal support in XQuery for ResultSets to be layered on any query result would do the trick. A client would then need to iterate over the result set to get all the results, and so a provider could more easily limit the # of iterations. Another mechanism is to constrain the Return portion of XQuery. Amazon might specify that only book descriptions with reviews are returnable.

This is just a difficult problem. Some queries are complex, computationally intensive but return few results. In some cases it is hard to tell by just looking at the query how badly it'll perform. A notion of returning result sets makes sense in a mid-tier application that's talking to a database but not to client app half-way across the world talking to a website.

3. Subset the Xquery functionality. Xquery is a very large and complicated specification. There's no need for all that functionality in every application. This would make implementation of XQuery more wide spread as well. Probably the biggest subset will be Read versus Update.

Finally something I agree with, although David shows some ignorance of XQuery by assuming that there is an update aspect to XQuery when DML was shelved for the 1.0 version. XQuery is just a query language. However it is an extremely complex query language whose specification runs to hundreds of pages. The most relevant specs from the W3C XML Query page are listed below

XQuery 1.0 and XPath 2.0 Data Model (updated, LAST CALL), last release 12 November 2003
XSLT 2.0 and XQuery 1.0 Serialization, (updated, LAST CALL), last release 12 November 2003
XQuery 1.0 and XPath 2.0 Formal Semantics (updated), last release 12 November 2003
XQuery 1.0: An XML Query Language (updated, LAST CALL), [DIFF with previous version], last release 12 November 2003
XML Syntax for XQuery 1.0 (XQueryX), updated, last release 19 December 2003
XQuery 1.0 and XPath 2.0 Functions and Operators (updated, LAST CALL), [DIFF with previous version], last release 12 November 2003

I probably should also link to the W3C XML Schema: Structures and W3C XML Schema: Datatypes specs since they are the basis of the type system of XQuery. My personal opinion is that XQuery is probably too complex to use as the language for such an endeavor since you want something that is simple to implement and fairly straightforward so that there can be ubiquitous implementations and therefore lots of interoperability (unlike the current situation with W3C XML Schema). I personally would start with XPath 1.0 and subset or modify that instead of XQuery.

4. Data model subsets. Particular user subsets will only be granted access to a subset of the data model. For example, Amazon may want to say that book publishers can query all the reviews and sales statistics for their books but users can only query the reviews. Maybe completely separate schemas for each subset. The current approach seems to be to do an extract of the data subset according to each subset, so there's a data model for publishers and a data model for users. Maybe this will do for WebXQuery.

5. Security. How to express in the service description (wsdl or policy?) that a given class of users can perform some subset of the functionality, either the query, the data model or the data set. Some way of specifying the relationship between the set of data model, query functionality, data set and authorization.

I'd say the above two features are tied together. You need some way to restrict what the sys admin vs. the regular user executing such a query over the wire can do as well as a way to authenticate them.

6. Performance. The Web has a great ability to increase performance because resources are cachable. The design of URIs and HTTP specifically optimizes for this. The ability to compare URIs is crucial for caching, hence why so much work went into specifying how they are absolutized and canonically compared. But clearly XQuery inputs are not going to be sent in URIs, so how do we have cachable XQueries given that the query will be in a soap header? There is a well defined place in URIs for the query, but there isn't such a thing in SOAP. There needs to be some way of canonicalizing an Xquery and knowing which portions of the message contain the query. Canonicalizing a query through c14n might do the trick, though I wonder about performance. And then there's the figuring out of which header has the query. There are 2 obvious solutions: provide a description annotation or an inline marker. I don't think that requiring any "XQuery cache" engine to parse the WSDL for all the possible services is really going to scale, so I'm figuring a well-defined SOAP header is the way to go.

Sounds like overthinking the problem and yet not being general enough. The first problem is that there should be standard ways that proxies and internet caches know how to cache XML Web Service results in the same way that they know how to cache results of HTTP GET requests today. After that, figuring out how to canonicalize a query expression (I'm not even sure what that means: will /root/child and /root/*[local-name()='child'] be canonicalized into the same thing?) is probably a couple of Ph.D. theses of work.

Then there's just the fact that allowing clients to ship arbitrary queries to the server is a performance nightmare waiting to happen...
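To make the canonicalization worry concrete, here is a small Python sketch built around the /root/child versus /root/*[local-name()='child'] question. The sample document is made up, and because ElementTree's XPath subset has no local-name() function, the second query is emulated by hand with a hypothetical `local_name` helper.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free sample document.
DOC = ET.fromstring("<root><child>a</child><other>b</other><child>c</child></root>")

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

# Query 1: /root/child
by_name = DOC.findall("child")

# Query 2: /root/*[local-name()='child'], emulated by hand (see note above).
by_local_name = [e for e in DOC.findall("*") if local_name(e.tag) == "child"]

# Same nodes come back for this document...
same_results = [e.text for e in by_name] == [e.text for e in by_local_name]
# ...but a cache keyed on the query text sees two unrelated requests.
same_cache_key = "/root/child" == "/root/*[local-name()='child']"
print(same_results, same_cache_key)
```

Note that the two expressions only agree here because the document uses no namespaces; with namespaced elements they select different nodes, which is part of why a general canonical form is hard to define.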

Your thoughts? Is WebXQuery an interesting idea and what are the hurdles to overcome?

It's an interesting idea but I suspect not a very practical or useful one outside of certain closed applications with strict limitations on the number of users or the type of queries issued.

Anyway, I'm off to play in the snow. I just saw someone skiing down the street. Snow storms are fun.

Categories: XML

January 6, 2004

@ 04:17 PM

Comments [2]

XML For You and Me, Your Mama and Your Cousin Too

Jon Udell recently wrote in his post entitled XML for the rest of us

By the way, Adam Bosworth said a great many other interesting things in his XML 2003 talk. For those of you not inclined to watch this QuickTime clip -- and in particular for the search crawlers -- I would like to enter the following quote into the public record.

The reason people get scared of queries is that it's hard to say 'You can send me this kind of query, but not that kind of query.' And therefore it's hard to have control, and people end up building other systems. It's not clear that you always want query. Sometimes people can't handle arbitrary statements. But we never have queries. I don't have a way to walk up to Salesforce and Siebel and say tell me everything I know about the customer -- in the same way. I don't even have a way to say tell me everything about the customers who meet the following criteria. I don't have a way to walk up to Amazon and Barnes and Noble and in a consistent way say 'Find me all the books reviewed by this person,' or even, 'Find me the reviews for this book.' I can do that for both, but not in the same way. We don't have an information model. We don't have a query model. And for that, if you remember the dream we started with, we should be ashamed.

I think we can fix this. I think we can take us back to a world that's a simple world. I think we can go back to a world where there are just XML messages flowing back and forth between...resources. <snipped />

Three things jump out at me from that passage. First, the emphasis on XML query. My instincts have been leading me in that direction for a while now, and much of my own R&D in 2003 was driven by a realization that XPath is now a ubiquitous technology with huge untapped potential. Now, of course, XQuery is coming on like a freight train.

When Don and I hung out over the holidays this was one of the things we talked about. Jon's post has been sitting flagged for follow up in my aggregator for a while. Here are my thoughts...

The main problem is that there are a number of websites which have the same information but do not provide a uniform way to access it, and when access mechanisms are provided they do not allow ad-hoc queries. So the first thing that is needed is a shared view (or schema) of what this information looks like, which is the shared information model Adam talks about. There are two routes you can take with this: one is to define a shared data model with the transfer syntax being secondary (i.e. use RDF) while another is to define a shared data model and transfer syntax (i.e. use XML). In most cases, people have tended to pick the latter.

Once an XML representation of the information users are interested in has been designed (i.e. the XML schema for books, reviews and wishlists that could be exposed by sites like Amazon or Barnes & Noble) the next technical problem to be solved is uniform access mechanisms: the eternal REST vs. SOAP vs. XML-RPC debate that has plagued a number of online discussions. Then there's deployment, adoption and evangelism.

Besides the fact that I've glossed over the significant political and business reasons that may or may not make such an endeavor fruitful we still haven't gotten to Adam's Nirvana. We still need a way to process the data exposed by these web services in arbitrary ways. How does one express a query such as "Find all the CDs released between 1990 and 1999 that Dare Obasanjo rated higher than 3 stars"? Given the size of the databases hosted by such sites would it make more sense to ship the documents to the client or some mid-tier which then performs the post-processing of the raw data instead of pushing such queries down to the database? What are the performance ramifications of exposing your database to anyone with a web browser and allowing them to run ad-hoc queries instead of specially optimized, canned queries?

At this point, if you are like me, you might suspect that having the web service endpoints return the results of canned queries, which the client can then post-process, may be more practical than expecting to be able to ship arbitrary SQL/XML, XQuery or XPath queries to web service endpoints.
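As a sketch of what that client-side post-processing could look like, the Python below takes the result of a hypothetical canned query ("everything this user has rated") and answers the CDs-from-the-1990s question locally. The XML shape, the titles and the `cds_rated_above` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response from a canned "all ratings by this user" query.
RATINGS = ET.fromstring(
    "<ratings>"
    "<item type='cd' released='1994' stars='5'>Album A</item>"
    "<item type='cd' released='2001' stars='4'>Album B</item>"
    "<item type='book' released='1995' stars='4'>Book C</item>"
    "<item type='cd' released='1996' stars='3'>Album D</item>"
    "</ratings>"
)

def cds_rated_above(root, min_stars, first_year, last_year):
    """Filter canned results on the client: CDs in a year range rated above min_stars."""
    return [
        item.text
        for item in root.findall("item")
        if item.get("type") == "cd"
        and first_year <= int(item.get("released")) <= last_year
        and int(item.get("stars")) > min_stars
    ]

print(cds_rated_above(RATINGS, 3, 1990, 1999))
```

The server only ever runs the one query it was tuned for; the cost of the ad-hoc part lands on the client, at the price of shipping more data over the wire.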

The main problem with what I've described is that it takes a lot of effort. Coming up with standardized schema(s) and distributed computing architecture for a particular industry then driving adoption is hard even when there's lots of cooperation let alone in highly competitive markets.

In an ideal world, this degree of bootstrapping would be unnecessary. After all, people can already build the kinds of applications Adam described today by screen scraping [X]HTML, although they tend to be brittle. What the software industry should strive for is a way to build such applications in a similarly loosely connected manner in the XML Web Services world without requiring the heavy investment of human organizational effort that is currently needed. This was the initial promise of XML Web Services which, like Adam, I am ashamed has not come to pass. Instead many seem to be satisfied with reinventing DCOM/CORBA/RMI with angle-brackets (then replacing it with "binary infosets"). Unfortunate...

Categories: XML

January 6, 2004

@ 03:21 PM

Comments [0]

Forward Motion

The first day of work yesterday, after most people had been gone for 2 or 3 weeks, was refreshing. I attended a few meetings about design issues we've been having and the people involved seemed to be coalescing around decisions that make sense. A good start to the new year. The main problem seems to be not making the changes but avoiding breaking folks as we make them. Processing XML in the next version of the .NET Framework is going to kick the llama's azz.

I already have lots of follow-up meetings scheduled for tomorrow (not today since we're supposed to have a record snowfall for this area today and most people here don't do snow driving). I'll probably be working from home; the streets look dangerous from the vantage point of my bedroom window.

Categories: Life in the B0rg Cube

January 6, 2004

@ 02:52 PM

Comments [2]

Blogging Tools and the RSS Bandwidth Problem

I feel somewhat hypocritical today. Although I've complained about bandwidth consumed by news aggregators polling RSS feeds and have implemented support for both HTTP conditional GET and gzip compression over HTTP in RSS Bandit (along with a number of other aggregators), it turns out that I don't put my money where my mouth is: I use blogging software that utilizes neither of the aforementioned bandwidth-saving techniques.

Ted Leung lists a few feeds from his list of subscriptions that don't use gzip compression and mine is an example of one that doesn't. He also mentions in a previous post that

As far as gzipped feeds go, about 10% of the feeds in my NNW (about 900) are gzipped. That's a lot worse than I expected. I understand that this can be tough -- the easiest way to implement gzipping is to do what Brent suggested, shove it off to Apache. That means that people who are being hosted somewhere need to know enough Apache config to turn gzip on. Not likely. Or have enlightened hosting admins that automatically turn it on, but that doesn't appear to be the case. So blogging software vendors could help a lot by turning gzip support on in the software.

What's even more depressing is that for HTTP conditional get, the figure is only about 33% of feeds. And this is something that the blogging software folks should do. We are doing it in pyblosxom.

This is quite unfortunate. Every few months I read some "sky is falling" post about the bandwidth costs of polling RSS feeds, yet many blogging tools don't support existing technologies that could reduce the bandwidth cost of RSS polling by an order of magnitude. I am guilty of this as well.
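For reference, the client half of both techniques is small. The Python sketch below is not RSS Bandit's code; it only shows, using the standard library, how an aggregator might ask for a gzipped response and send back the validators (ETag and Last-Modified) it saved from the previous poll. The function names are my own.

```python
import gzip
import urllib.error
import urllib.request

def build_feed_request(url, etag=None, last_modified=None):
    """Build a GET request that asks for gzip and sends cache validators if we have them."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Encoding", "gzip")
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request

def decode_body(body, content_encoding):
    """Undo gzip compression when the server says it applied it."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body

def fetch_feed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server answers 304."""
    request = build_feed_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            headers = response.headers
            body = decode_body(response.read(), headers.get("Content-Encoding"))
            return body, headers.get("ETag"), headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
```

None of this helps unless the server or blogging tool actually emits the validators and honors them with a 304, which is the half that is missing on my own weblog.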

I'll investigate how difficult it'll be to practice what I preach in the next few weeks. It looks like an upgrade to my version of dasBlog is in the works.

Categories: Technology

January 6, 2004

@ 02:23 PM

Comments [0]

The Bad Fads Museum

The Bad Fads Museum provides an entertaining glimpse at past [and current] fads in Fashion, Collectibles, Activities and Events. To show that the site isn't mean spirited, the "About BadFads" section of the site contains the following text

While the name of this site is BAD FADS, please note that this is neither an indictment nor an endorsement of any of the fads mentioned. As you know, during the '70s the word "bad" could alternately mean "good!" Thus, this site was created to take a fun and nostalgic look at fashions, collectibles, activities and events which are cherished by some and ridiculed by others.

It's all in good fun. The best writeups are on fashion fads like Tattoos and Tie Dye T-Shirts as well as fads in collectibles such as Pet Rocks and Troll Dolls.

Categories: Mindless Link Propagation

January 5, 2004

@ 04:26 PM

Comments [1]

Bugs Bugs Everywhere

Last night I posted about bugs others have hit in RSS Bandit that aren't reproducible on my machine. Ted Leung was nice enough to post some details about the problems he had. So it turns out that RSS Bandit is at fault. The two primary ways of adding feeds to your subscription list have ~~usability issues~~ bugs.

If you click the "New Feed" button and specify a feed URL without providing the URI scheme (e.g. enter www.example.org instead of http://www.example.org) then RSS Bandit assumes the URI is a local URI (i.e. file://www.example.org). Actually, it's worse: sometimes it just throws an exception.
The "Locate Feed" button that uses Mark Pilgrim's Ultra-liberal RSS autodiscovery algorithm to locate a feed for a site shows a weird error message if it couldn't make sense of the HTML because it was too malformed (i.e. tag soup). There are a bunch of things I could change here, from using a better error message to falling back to using Syndic8 to find the feed.
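The obvious fix for the first bug is to assume HTTP when the user leaves the scheme off. The Python below is only a sketch of that rule, not the actual RSS Bandit change; the `normalize_feed_url` name and the choice of http as the default are assumptions.

```python
def normalize_feed_url(text, default_scheme="http"):
    """Prepend a default scheme when the user typed a bare host name."""
    candidate = text.strip()
    if not candidate:
        raise ValueError("empty feed URL")
    if "://" in candidate:
        return candidate  # a scheme was supplied, leave it alone
    return f"{default_scheme}://{candidate}"

print(normalize_feed_url("www.example.org"))
```

Checking for "://" rather than parsing the input avoids misreading something like www.example.org:8080/rss.xml, where the host name would otherwise look like a scheme.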

I'll fix both of these bugs before heading out to work today. Hopefully this should take care of the problems various people have had and probably never mentioned with adding feeds to RSS Bandit.

Categories: RSS Bandit

January 5, 2004

@ 08:37 AM

Comments [1]

More Stuff That Doesn't Repro

A few hours after my recent post about Roy's problems subscribing to feeds not being reproducible on my machine I stumbled on the following excerpt from a post by Ted Leung

I've played with RSS Bandit and there were some recent laudatory posts about the latest versions, so this morning I downloaded a copy (after doing Windows update for Win2K on the Thinkpad, rebooting, installing a newer version of the .NET framework, and rebooting...) and installed it. Things seemed fine, at least until I started adding feeds. The first two feeds I added were hers and mine. RSS Bandit choked on both. Now we have a internal/external network setup, complete with split DNS and a whole bunch of other stuff. I figured that might be a problem, and started tweaking. The deeper I got, the more I realized it wasn't going to work. I foresaw many pleas for technical support followed by frustration -- I mean, *I* was frustrated. So I dumped that and went for Plan B, as it were.

What's truly weird about his post is that I was reading it in RSS Bandit which means reading his feed works fine for me on my machine but for some reason didn't work with his. In fact, I just checked his wife's blog and again no problems reading it in RSS Bandit. *sigh*

I suspect gremlins are to blame for this...

Categories: RSS Bandit

January 5, 2004

@ 08:22 AM

Comments [8]

Porous Thinking

Nick Bradbury recently posted an entry entitled On Piracy which read

Many people who use pirated products justify it by claiming they're only stealing from rich mega-corporations that screw their customers, but this conveniently overlooks the fact that the people who are hurt the most by piracy are people like me.

Shareware developers are losing enormous amounts of money to piracy, and we're mostly helpless to do anything about it. We can't afford to sue everyone who steals from us, let alone track down people in countries such as Russia who host web sites offering pirated versions of our work...Some would argue that we should just accept piracy as part of the job, but chances are the people who say this aren't aware of how widespread piracy really is. A quick look at my web server logs would be enough to startle most people, since the top referrers are invariably warez sites that link to my site (yes, not only do they steal my software, but they also suck my bandwidth).

A couple of years ago I wanted to get an idea of how many people were using pirated versions of TopStyle, so I signed up for an anonymous email account (using a "kewl" nickname, of course) and started hanging out in cracker forums. After proving my cracker creds, I created a supposedly cracked version of TopStyle and arranged to have it listed on a popular warez site....This cracked version pinged home the first time it was run, providing a way for me to find out how many people were using it. To my dismay, in just a few weeks more people had used this cracked version than had ever purchased it. I knew piracy was rampant, but I didn't realize how widespread it was until this test.

The proliferation of software piracy isn't anything new. The primary reason I'm bothering to post about it is that Aaron Swartz posted an obnoxious response to Nick's post entitled On Piracy, or, Nick Bradbury is an Amazing Idiot which besides containing a "parody" which is part Slippery Slope and part False Analogy ends with the following gems

Nick has no innate right to have people pay for his software, just as I have no right to ask people to pay for use of my name.

Even if he did, most people who pirate his software probably would never use it anyway, so they aren't costing him any money and they're providing him with free advertising.

And of course it makes sense that lots of people who see some interesting new program available for free from a site they're already at will download it and try it out once, just as more people will read an article I wrote in the New York Times than on my weblog.

...

Yes, piracy probably does take some sales away from Nick, but I doubt it's very many. If Nick wants to sell more software, maybe he should start by not screaming at his potential customers. What's next? Yelling at people who use his software on friends computers? Or at the library?

Aaron's arguments are so silly they boggle the mind but let's take them one at a time. Human beings have no innate rights. Concepts such as "unalienable rights" and documents such as the Bill of Rights have been agreed upon by some societies as the law but this doesn't mean they are universal or would mean anything if not backed up by the law and its enforcers. Using Aaron's argument, Aaron has no innate right to live in a house he paid for, eat food he bought or use his computer if some physically superior person or armed thug decides he covets his possessions. The primary thing preventing this from being the normal state of affairs is the law, the same law that states that software piracy is illegal. Western society has decided that Capitalism is the way to go (i.e. a party provides goods or services for sale and consumers of said goods and services pay for them). So for whatever definition of "rights" Aaron is using Nick has a right to not have his software pirated.

Secondly, Aaron claims that if people illegally utilizing your software can't afford it then it's OK for them to do so. This argument is basically, "It's OK to steal if what you want is beyond your purchasing power". Truly, why work hard and save for what you want when you can just steal it? Note that this handy rule of Aaron's also applies to all sorts of real life situations. Why not shoplift? After all, big department store chains can afford it and in fact they factor that into their prices. Why not steal cars or rob jewellery stores if you can't afford them? After all, it's all insured anyway, right? The instant gratification generation truly is running amok.

The best part of Aaron's post is that even though Nick states that there are more people using pirated versions of his software than have paid for it, Aaron dismisses this by saying that in his personal opinion there wouldn't have been many sales lost to piracy. Then it devolves into a slippery slope argument about whether people should pay for using Nick's software on a friend's computer or at the library. Of course, the simple answer to this question is that by purchasing the software the friend or the library can let anyone use it, the same way that I can carry anyone in my car after purchasing it.

My personal opinion is that if you think software is too expensive then (a) use cheaper alternatives, (b) write your own or (c) do without it; after all, no one needs software. Don't steal it then try and justify your position with inane arguments that sound like the childish "information wants to be free" rants that used to litter Slashdot during the dotbomb era.

Categories: Ramblings

January 5, 2004

@ 07:39 AM

Comments [17]

Request For Comments: Synchronization of Information Aggregators using Markup (SIAM)

I've just finished the first draft of a specification for Synchronization of Information Aggregators using Markup (SIAM) which is the result of a couple of weeks of discussion between myself and a number of other authors of news aggregators. From the introduction

A common problem for users of desktop information aggregators is that there is currently no way to synchronize the state of information aggregators used on different machines in the same way that can be done with email clients today. The most common occurrence of this is a user that uses an information aggregator at home and at work or at school who'd like to keep the state of each aggregator synchronized independent of whether the same aggregator is used on both machines.

The purpose of this specification is to define an XML format that can be used to describe the state of an information aggregator which can then be used to synchronize another information aggregator instance to the same state. The "state" of an information aggregator includes information such as which feeds are currently subscribed to by the user and which news items have been read by the user.

This specification assumes that an information aggregator is software that consumes an XML syndication feed in one of the following formats: ATOM, [RSS0.91], [RSS1.0] or [RSS2.0]. If more syndication formats gain prominence then this specification will be updated to take them into account.

This draft owes a lot of its polish to comments from Luke Hutteman (author of SharpReader), Brent Simmons (author of NetNewsWire) and Kevin Hemenway aka Morbus Iff (author of AmphetaDesk). There are no implementations out there yet, although once enough feedback has been gathered about the current spec I'll definitely add this to RSS Bandit and deprecate the existing mechanisms for subscription harmonization.

Brent Simmons has a post entitled The challenges of synching which highlights some of the various issues that came up in our discussions.
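To give a feel for what "synchronizing state" involves, here is a deliberately naive Python sketch that merges the state of two aggregator instances. It is not the SIAM format or anything from the spec; the `AggregatorState` class and the union-only merge rule are assumptions made for illustration, and the merge ignores unsubscribes, which is one of the harder cases.

```python
from dataclasses import dataclass, field

@dataclass
class AggregatorState:
    """Hypothetical in-memory state: subscribed feed URLs and read item ids per feed."""
    feeds: set = field(default_factory=set)
    read_items: dict = field(default_factory=dict)  # feed URL -> set of item ids

def merge_states(local, remote):
    """Union subscriptions and read marks; deletions are deliberately ignored here."""
    merged = AggregatorState(feeds=local.feeds | remote.feeds)
    for feed in merged.feeds:
        merged.read_items[feed] = (
            local.read_items.get(feed, set()) | remote.read_items.get(feed, set())
        )
    return merged

home = AggregatorState({"http://example.org/a.xml"}, {"http://example.org/a.xml": {"1", "2"}})
work = AggregatorState(
    {"http://example.org/a.xml", "http://example.org/b.xml"},
    {"http://example.org/a.xml": {"2", "3"}},
)
merged = merge_states(home, work)
print(sorted(merged.feeds), sorted(merged.read_items["http://example.org/a.xml"]))
```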

Categories: Technology | XML

January 5, 2004

@ 05:53 AM

Comments [1]

Request For Comments: The "feed" URI Scheme [final draft]

I've written what should be the final draft of the specification for the "feed" URI scheme. From the abstract

This document specifies the "feed" URI (Uniform Resource Identifier) scheme for identifying data feeds used for syndicating news or other content from an information source such as a weblog or news website. In practice, such data feeds will most likely be XML documents containing a series of news items representing updated information from a particular news source.

The primary change from the previous version was to incorporate feedback from Graham Parks about compliance with RFC 2396. The current grammar for the "feed" URI scheme is

feedURI = 'feed:' absoluteURI | 'feed://' hier_part

where absoluteURI and hier_part are defined in section 3 of RFC 2396. One-click subscription to syndication feeds via this URI scheme is supported in the following news aggregators: SharpReader, RSS Bandit, NewsGator (in next release), NetNewsWire, Shrook, WinRSS and Vox Lite.
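An aggregator registered as the handler for this scheme has to turn the "feed" URI back into something it can fetch. The Python sketch below covers the two forms in the grammar above; the function name is made up, and mapping the 'feed://' form to http is my assumption about the intended behavior rather than something quoted from the spec.

```python
def feed_uri_to_url(feed_uri):
    """Map a "feed" URI to the URL of the underlying feed document."""
    if feed_uri.startswith("feed://"):
        # Assumption: the 'feed://' form stands for an HTTP resource.
        return "http://" + feed_uri[len("feed://"):]
    if feed_uri.startswith("feed:"):
        # The 'feed:' absoluteURI form wraps a complete URI.
        return feed_uri[len("feed:"):]
    raise ValueError(f"not a feed URI: {feed_uri!r}")

print(feed_uri_to_url("feed:http://www.example.org/rss.xml"))
print(feed_uri_to_url("feed://www.example.org/rss.xml"))
```

The 'feed://' check has to come first, since every 'feed://' URI also starts with 'feed:'.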

The next step will be to find somewhere more permanent to host the spec.

Categories: Technology
