January 6, 2004
@ 09:28 PM

In response to my previous post, David Orchard provides a link to his post entitled XQuery: Meet the Web, where he writes:

In fact, this separation of the private and more general query mechanism from the public facing constrained operations is the essence of the movement we made years ago to 3 tier architectures. SQL didn't allow us to constrain the queries (subset of the data model, subset of the data, authorization) so we had to create another tier to do this.

What would it take to bring the generic functionality of the first tier (database) into the 2nd tier, let's call this "WebXQuery" for now. Or will XQuery be hidden behind Web and WSDL endpoints?

Every way I try to interpret this, it seems like a step back to me. The software industry in general decided that exposing your database and query language directly to client applications was the wrong way to build software, and 2-tier client-server architectures giving way to N-tier architectures was an indication of this trend. I fail to see why one would think it is a good idea to allow clients to issue arbitrary XQuery queries but not think the same for SQL. From where I sit, there is little if any difference between the two choices when it comes to queries. Note that although SQL also has a Data Definition Language (DDL) and a Data Manipulation Language (DML) as well as a query language, for the purposes of this discussion I'm only considering the query aspects of SQL.

David then puts forth some questions about this idea that I can't help offering my opinions on:

If this is an interesting idea, of providing generic and specific query interfaces to applications, what technology is necessary? I've listed a number of areas that I think need examination before we can get to XQuery married to the Web and to make a generic second tier.

1. How to express that a particular schema is queryable and the related bindings and endpoint references to send and receive the queries. Some WSDL extensions would probably do the trick.

One thing lacking in the XML Web Services world is the simple REST-like notion of GET and POST. In the RESTful HTTP world one would simply specify a URI on which one could perform an HTTP GET and get back an XML document. One could then either use the hierarchy of the URI to select subsets of the document or perhaps use HTTP POST to send more complex queries. For all the indirection of WSDL files and SOAP headers, functionality such as what Yahoo has done with their Yahoo! News Search RSS feeds still isn't straightforward. I agree that WSDL annotations would do the trick, but then you have to deal with the fact that WSDLs themselves are not discoverable. *sigh* Yet more human intervention is needed instead of loosely coupled application building.
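To show what I mean by using the hierarchy of the URI to select subsets of a document, here is a minimal sketch in Python. The catalog document, the example.org URI and the select_by_uri function are all made up for illustration; a real service would choose its own mapping from paths to data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical document that a server might expose at a single URI.
CATALOG = ET.fromstring(
    "<catalog>"
    "<book id='1'><title>XML Basics</title><review>Good</review></book>"
    "<book id='2'><title>Query Languages</title></book>"
    "</catalog>"
)


def select_by_uri(uri: str, root: ET.Element = CATALOG) -> list:
    """Map the path segments of a URI onto element names.

    The first segment must name the root element; each later segment
    narrows the selection to child elements with that tag.
    """
    segments = [s for s in urlparse(uri).path.split("/") if s]
    if not segments or segments[0] != root.tag:
        return []
    current = [root]
    for name in segments[1:]:
        # Iterating over an Element yields its direct children.
        current = [child for node in current for child in node if child.tag == name]
    return current


titles = select_by_uri("http://example.org/catalog/book/title")
print([t.text for t in titles])
```

A plain HTTP GET on such a URI is all a client needs, which is the kind of simplicity I find missing once WSDL and SOAP headers enter the picture.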

2. Limit the data set returned in a query. There's simply no way a large provider of data is going to let users retrieve the data set from a query. Amazon is just not going to let "select * from *" happen. Perhaps formal support in XQuery for ResultSets to be layered on any query result would do the trick. A client would then need to iterate over the result set to get all the results, and so a provider could more easily limit the # of iterations. Another mechanism is to constrain the Return portion of XQuery. Amazon might specify that only book descriptions with reviews are returnable.

This is just a difficult problem. Some queries are complex and computationally intensive but return few results. In some cases it is hard to tell by just looking at the query how badly it'll perform. A notion of returning result sets makes sense in a mid-tier application that's talking to a database but not to a client app half-way across the world talking to a website.
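For what it's worth, the iteration limit David describes could look roughly like the sketch below. The paged and expensive_query functions are hypothetical names of my own, and the sketch only caps what is handed back to the client. It does nothing about how much work the server does to produce the rows, which is the part I consider hard.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def paged(results: Iterable[str], page_size: int, max_pages: int) -> Iterator[List[str]]:
    """Yield at most max_pages pages of at most page_size results each."""
    it = iter(results)
    for _ in range(max_pages):
        page = list(islice(it, page_size))
        if not page:
            return
        yield page


def expensive_query() -> Iterator[str]:
    # Stand-in for a query whose cost is unrelated to the size of its result.
    for i in range(10):
        yield f"row-{i}"


# The provider allows two pages of three rows; the remaining rows are never sent.
pages = list(paged(expensive_query(), page_size=3, max_pages=2))
print(pages)
```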

3. Subset the XQuery functionality. XQuery is a very large and complicated specification. There's no need for all that functionality in every application. This would make implementation of XQuery more widespread as well. Probably the biggest subset will be Read versus Update.

Finally, something I agree with, although David shows some ignorance of XQuery by assuming that there is an update aspect to XQuery when DML was shelved for the 1.0 version. XQuery is just a query language. However, it is an extremely complex query language whose specification runs to hundreds of pages. The most relevant specs from the W3C XML Query page are linked to below.

I probably should also link to the W3C XML Schema: Structures and W3C XML Schema: Datatypes specs since they are the basis of the type system of XQuery. My personal opinion is that XQuery is probably too complex to use as the language for such an endeavor, since you want something that is simple to implement and fairly straightforward so that there can be ubiquitous implementations and therefore lots of interoperability (unlike the current situation with W3C XML Schema). I personally would start with XPath 1.0 and subset or modify that instead of XQuery.
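As a rough illustration of how small such a subset could be, the sketch below accepts only absolute paths made of plain element names with an optional positional predicate, and rejects everything else before it ever reaches a query engine. The grammar and the is_allowed name are assumptions of mine, not part of any spec, and the name pattern is deliberately narrower than real XML names.

```python
import re

# One location step: "/" + a simple element name + an optional [n] with n >= 1.
# This is an assumed, deliberately tiny subset, not the XPath 1.0 grammar.
_STEP = r"/[A-Za-z_][A-Za-z0-9_.-]*(?:\[[1-9][0-9]*\])?"
_ALLOWED = re.compile(rf"(?:{_STEP})+")


def is_allowed(expr: str) -> bool:
    """Return True only if the whole expression fits the tiny subset."""
    return _ALLOWED.fullmatch(expr) is not None


for expr in ["/catalog/book[2]/title", "//book", "/catalog/*", "/catalog/book[price>10]"]:
    print(expr, is_allowed(expr))
```

A server that only accepts expressions like these can reason about their cost up front, which is much harder to do for arbitrary XQuery.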

4. Data model subsets. Particular user subsets will only be granted access to a subset of the data model. For example, Amazon may want to say that book publishers can query all the reviews and sales statistics for their books but users can only query the reviews. Maybe completely separate schemas for each subset. The current approach seems to be to do an extract of the data subset according to each subset, so there's a data model for publishers and a data model for users. Maybe this will do for WebXQuery.


5. Security. How to express in the service description (wsdl or policy?) that a given class of users can perform some subset of the functionality, either the query, the data model or the data set. Some way of specifying the relationship between the set of data model, query functionality, data set and authorization.

I'd say the above two features are tied together. You need some way to restrict what the sys admin vs. the regular user executing such a query over the wire can do as well as a way to authenticate them.  
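Here is a minimal sketch of how the two fit together, using David's publishers-versus-users example. The role names, the VISIBLE table and the view_for function are invented for illustration, and authentication (working out which role a caller really has) is assumed to have happened already.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical policy: which child elements each role may see.
VISIBLE = {
    "publisher": {"title", "review", "sales"},
    "user": {"title", "review"},
}

BOOK = ET.fromstring(
    "<book><title>XML Basics</title><review>Good</review><sales>1200</sales></book>"
)


def view_for(role: str, record: ET.Element = BOOK) -> ET.Element:
    """Build a copy of the record containing only what the role may see."""
    allowed = VISIBLE.get(role, set())  # unknown roles see nothing
    view = ET.Element(record.tag, dict(record.attrib))
    for child in record:
        if child.tag in allowed:
            view.append(copy.deepcopy(child))
    return view


print([c.tag for c in view_for("user")])
print([c.tag for c in view_for("publisher")])
```

Any query a client sends would then run against the view for its role rather than against the full record.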

6. Performance. The Web has a great ability to increase performance because resources are cachable. The design of URIs and HTTP specifically optimizes for this. The ability to compare URIs is crucial for caching, hence why so much work went into specifying how they are absolutized and canonically compared. But clearly XQuery inputs are not going to be sent in URIs, so how do we have cachable XQueries given that the query will be in a SOAP header? There is a well defined place in URIs for the query, but there isn't such a thing in SOAP. There needs to be some way of canonicalizing an XQuery and knowing which portions of the message contain the query. Canonicalizing a query through c14n might do the trick, though I wonder about performance. And then there's the figuring out of which header has the query. There are 2 obvious solutions: provide a description annotation or an inline marker. I don't think that requiring any "XQuery cache" engine to parse the WSDL for all the possible services is really going to scale, so I'm figuring a well-defined SOAP header is the way to go.

Sounds like overthinking the problem and yet not being general enough. The first problem is that there should be standard ways that proxies and internet caches know how to cache XML Web Service results in the same way that they know how to cache results of HTTP GET requests today. After that, figuring out how to canonicalize a query expression (I'm not even sure what that means: will /root/child and /root/*[local-name()='child'] be canonicalized into the same thing?) is probably a couple of Ph.D. theses of work.

Then there's just the fact that allowing clients to ship arbitrary queries to the server is a performance nightmare waiting to happen...
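To make the canonicalization worry concrete, here is a sketch of the obvious cache key: normalize whitespace and hash the query text. The naive_cache_key function is a hypothetical name of my own. The two expressions from my question select the same nodes in a document that uses no namespaces, yet they get different keys, so a cache built this way would treat them as unrelated queries.

```python
import hashlib


def naive_cache_key(query: str) -> str:
    """Hash the query text after collapsing runs of whitespace."""
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


a = naive_cache_key("/root/child")
b = naive_cache_key("/root/*[local-name()='child']")
c = naive_cache_key("  /root/child  ")

print(a == b)  # False: equivalent queries, different keys
print(a == c)  # True: only trivial textual differences are absorbed
```

Getting a and b to match means reasoning about what the queries mean rather than how they are spelled, which is where the thesis-sized work comes in.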

Your thoughts? Is WebXQuery an interesting idea and what are the hurdles to overcome?

It's an interesting idea but I suspect not a very practical or useful one outside of certain closed applications with strict limitations on the number of users or the type of queries issued.  

Anyway, I'm off to play in the snow. I just saw someone skiing down the street. Snow storms are fun.