Dare Obasanjo's weblog

Scalability: I Don't Think That Word Means What You Think It Does

Via Mark Pilgrim I stumbled on an article by Scott Loganbill entitled Google’s Open Source Protocol Buffers Offer Scalability, Speed which contains the following excerpt

The best way to explore Protocol Buffers is to compare it to its alternative. What do Protocol Buffers have that XML doesn’t? As the Google Protocol Buffer blog post mentions, XML isn’t scalable:

"As nice as XML is, it isn’t going to be efficient enough for [Google’s] scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy."

We’ve never had to deal with XML in a scale where programming for it would become unwieldy, but we’ll take Google’s word for it.

Perhaps the biggest value-add of Protocol Buffers to the development community is as a method of dealing with scalability before it is necessary. The biggest developing drain of any start-up is success. How do you prepare for the onslaught of visitors companies such as Google or Twitter have experienced? Scaling for numbers takes critical development time, usually at a juncture where you should be introducing much-needed features to stay ahead of competition rather than paralyzing feature development to keep your servers running.

Over time, Google has tackled the problem of communication between platforms with Protocol Buffers and data storage with Big Table. Protocol Buffers is the first open release of the technology making Google tick, although you can utilize Big Table with App Engine.

It is unfortunate that it is now commonplace for people to throw around terms like "scaling" and "scalability" in technical discussions without actually explaining what they mean. Having a Web application that scales means that your application can handle becoming popular or being more popular than it is today in a cost effective manner. Depending on your class of Web application, there are different technologies that have been proven to help Web sites handle significantly higher traffic than they normally would. However there is no silver bullet.

The fact that Google uses MapReduce and BigTable to solve problems in a particular problem space does not mean those technologies work well in others. MapReduce isn't terribly useful if you are building an instant messaging service. Similarly, if you are building an email service you want an infrastructure based on message queuing not BigTable. A binary wire format like Protocol Buffers is a smart idea if your applications bottleneck is network bandwidth or CPU used when serializing/deserializing XML. As part of building their search engine Google has to cache a significant chunk of the World Wide Web and then perform data intensive operations on that data. In Google's scenarios, the network bandwidth utilized when transferring the massive amounts of data they process can actually be the bottleneck. Hence inventing a technology like Protocol Buffers became a necessity. However, that isn't Twitter's problem so a technology like Protocol Buffers isn't going to "help them scale". Twitter's problems have been clearly spelled out by the development team and nowhere is network bandwidth called out as a culprit.

Almost every technology that has been loudly proclaimed as unscalable by some pundit on the Web is being used by a massively popular service in some context. Relational databases don't scale? Well, eBay seems to be doing OK. PHP doesn't scale? I believe it scales well enough for Facebook. Microsoft technologies aren't scalable? MySpace begs to differ. And so on…

If someone tells you "technology X doesn't scale" without qualifying that statement, it often means the person either doesn't know what he is talking about or is trying to sell you something. Technologies don't scale, services do. Thinking you can just sprinkle a technology on your service and make it scale is the kind of thinking that led Blaine Cook (former architect at Twitter) to publish a presentation on Scaling Twitter which claimed their scaling problems where solved with their adoption of memcached. That was in 2007. In 2008, let's just say the Fail Whale begs to differ.

If a service doesn't scale it is more likely due to bad design than to technology choice. Remember that.

Now Playing: Zapp & Roger - Computer Love

Categories: Web Development | XML

July 2, 2008

@ 01:56 PM

In Defense of XML

Jeff Atwood recently published two anti-XML rants in his blog entitled XML: The Angle Bracket Tax and Revisiting the XML Angle Bracket Tax. The source of his beef with XML and his recommendations to developers are excerpted below

Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails.

I'm deeply ambivalent about XML. I'm reminded of this Winston Churchill quote:

It has been said that democracy is the worst form of government except all the others that have been tried.

XML is like democracy. Sometimes it even works. On the other hand, it also means we end up with stuff like this:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
  SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
…
You could do worse than XML. It's a reasonable choice, and if you're going to use XML, then at least learn to use it correctly. But consider:

Should XML be the default choice?

Is XML the simplest possible thing that can work for your intended use?

Do you know what the XML alternatives are?

Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs?

I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it. Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. <CleverEndQuote>Again.</CleverEndQuote>

The question of if and when to use XML is one I am intimately familiar with given that I spent the first 2.5 years of my professional career at Microsoft working on the XML team as the “face of XML” on MSDN.

My problem with Jeff’s articles is that they take a very narrow view of how to evaluate a technology. No one should argue that XML is the simplest or most efficient technology to satisfy the uses it has been put to today. It isn’t. The value of XML isn’t in its simplicity or its efficiency. It is in the fact that there is a massive ecosystem of knowledge and tools around working with XML.

If I decide to use XML for my data format, I can be sure that my data will be consumable using a variety off-the-shelf tools on practically every platform in use today. In addition, there are a variety of tools for authoring XML, transforming it to HTML or text, parsing it, converting it to objects, mapping it to database schemas, validating it against a schema, and so on. Want to convert my XML config file into a pretty HTML page? I can use XSLT or CSS. Want to validate my XML against a schema? I have my choice of Schematron, Relax NG and XSD. Want to find stuff in my XML document? XPath and XQuery to the rescue. And so on.

No other data format hits a similar sweet spot when it comes to ease of use, popularity and breadth of tool ecosystem.

So the question you really want to ask yourself before taking on the “Angle Bracket Tax” as Jeff Atwood puts it, is whether the benefits of avoiding XML outweigh the costs of giving up the tool ecosystem of XML and the familiarity that practically every developer out there has with the technology? In some cases this might be true such as when deciding whether to go with JSON over XML in AJAX applications (I’ve given two reasons in the past why JSON is a better choice). On the other hand, I can’t imagine a good reason to want to roll your own data format for office documents or application configuration files as opposed to using XML.

Problem #1: Collaborative Document Editing

So a bunch are working on a document that is due today. Yesterday I wanted to edit the document but found out I could not because the software claimed someone else was currently editing the document. So I opened it in read-only mode, copied out some data, edited it and then sent my changes in an email to the person who was in charge of the document. As if that wasn’t bad enough…

This morning, as I'm driving to the gym for my morning work out, I glance at my phone to see that I've received mail from several co-workers because it I've "locked" the document and no one can make their changes. When I get to work, I find out that I didn’t close the document within the application and this was the reason none of my co-workers could edit it. Wow.

The notion that only one person at a time can edit a document or that if one is viewing a document, it cannot be edited seems archaic in today’s globally networked world. Why is software taking so long to catch up?

Problem #2: Loosely Coupled XML Web Services

While I was driving to the office I noticed another email from one of the services that integrates with ours via a SOAP-based XML Web Service. As part of the design to handle a news scenario we added a new type that was going to be returned by one of our methods (e.g. imagine that there was a GetFruit() method which used to return apples and oranges which now returns apples, oranges and bananas) . This change was crashing the applications that were invoking our service because they weren’t expecting us to return bananas.

However, the insidious thing is that the failure wasn’t because their application was improperly coded to fail if it saw a fruit it didn’t know, it was because the platform they built on was statically typed. Specifically, the Web Services platform automatically converted the XML to objects by looking at our WSDL file (i.e. the interface definition language which stated up front which types are returned by our service) . So this meant that any time new types were added to our service, our WSDL file would be updated and any application invoking our service which was built on a Web services platform that performed such XML<->object mapping and was statically typed would need to be recompiled. Yes, recompiled.

Now, consider how many potentially different applications that could be accessing our service. What are our choices? Come up with GetFruitEx() or GetFruit2() methods so we don’t break old clients? Go over our web server logs and try to track down every application that has accessed our service? Never introduce new types?

It’s sad that as an industry we built a technology on an eXtensible Markup Language (XML) and our first instinct was to make it as inflexible as technology that is two decades old which was never meant to scale to a global network like the World Wide Web.

Software should solve problems, not create new ones which require more technology to fix.

Now playing: Young Jeezy - Bang (feat. T.I. & Lil Scrappy)

Categories: Technology | XML | XML Web Services

May 2, 2007

@ 12:28 AM

Microsoft's Astoria and Jasper data access projects

Andy Conrad, who I used to work with back on the XML team, has two blog posts about Project Astoria and Project Jasper from Microsoft's Data Programmability team. Both projects are listed as data access incubation projects on MSDN. Below are the descriptions of the projects

Project Codename “Astoria”
The goal of Microsoft Codename Astoria is to enable applications to expose data as a data service that can be consumed by web clients within a corporate network and across the internet. The data service is reachable over regular HTTP requests, and standard HTTP verbs such as GET, POST, PUT and DELETE are used to perform operations against the service. The payload format for the service is controllable by the application, but all options are simple, open formats such as plan XML and JSON. Web-friendly technologies make Astoria an ideal data back-end for AJAX-style applications, and other applications that need to operate against data that is across the web.

To learn more about Project Astoria or download the CTP, visit the Project Astoria website at http://astoria.mslivelabs.com.

Project Codename “Jasper”
Project Jasper is geared towards iterative and agile development. You can start interacting with the data in your database without having to create mapping files or define classes. You can build user interfaces by naming controls according to your model without worrying about binding code. Project Jasper is also extensible, allowing you to provide your own business logic and class model. Since Project Jasper is built on top of the ADO.NET Entity Framework, it supports rich queries and complex mapping.

To learn more about Project Jasper visit the ADO.NET Blog at http://blogs.msdn.com/adonet

I was called in a few weeks ago by an architect on the Data Programmability team to give some advice about Project Astoria. The project is basically a way to create RESTful endpoints on top of a SQL Server database then retrieve the relational data as plain XML, JSON or a subset of RDF+XML using HTTP requests. The reason I was called in was to give some of my thoughts on exposing relational data as RSS/Atom feeds. My feedback was that attempting to map arbitrary relational data to RSS/Atom feeds seemed unnatural and was bordering on abuse of an XML syndication format. Although this feature was not included in the Project Astoria CTP, it seems that mapping relational data to RSS/Atom feeds is still something the team thinks is interesting based on the Project Astoria FAQ. You can find out more in the Project Astoria overview documentation.

REST is totally sweeping Microsoft.

Categories: XML | XML Web Services

February 1, 2007

@ 01:19 AM

Miguel de Icaza on OOXML vs. ODF

Miguel de Icaza of Gnumeric, GNOME and Ximian fame has weighed in with his thoughts on the FUD war that is ODF vs. OOXML. In his blog post entitled The EU Prosecutors are Wrong Miguel writes

Open standards and the need for public access to information was a strong message. This became a key component of promoting open office, and open source software. This posed two problems:
First, those promoting open standards did not stress the importance of having a fully open source implementation of an office suite. Second, it assumed that Microsoft would stand still and would not react to this new change in the market.
And that is where the strategy to promote the open source office suite is running into problems. Microsoft did not stand still. It reacted to this new requirement by creating a file format of its own, the OOXML.
...
The Size of OOXML

A common objection to OOXML is that the specification is "too big", that 6,000 pages is a bit too much for a specification and that this would prevent third parties from implementing support for the standard. Considering that for years we, the open source community, have been trying to extract as much information about protocols and file formats from Microsoft, this is actually a good thing.
For example, many years ago, when I was working on Gnumeric, one of the issues that we ran into was that the actual descriptions for functions and formulas in Excel was not entirely accurate from the public books you could buy.
OOXML devotes 324 pages of the standard to document the formulas and functions. The original submission to the ECMA TC45 working group did not have any of this information. Jody Goldberg and Michael Meeks that represented Novell at the TC45 requested the information and it eventually made it into the standards. I consider this a win, and I consider those 324 extra pages a win for everyone (almost half the size of the ODF standard).
Depending on how you count, ODF has 4 to 10 pages devoted to it. There is no way you could build a spreadsheet software based on this specification.
...
I have obviously not read the entire specification, and am biased towards what I have seen in the spreadsheet angle. But considering that it is impossible to implement a spreadsheet program based on ODF, am convinced that the analysis done by those opposing OOXML is incredibly shallow, the burden is on them to prove that ODF is "enough" to implement from scratch alternative applications.
...
The real challenge today that open source faces in the office space is that some administrations might choose to move from the binary office formats to the OOXML formats and that "open standards" will not play a role in promoting OpenOffice.org nor open source.
What is worse is that even if people manage to stop OOXML from becoming an ISO standard it will be an ephemeral victory.
We need to recognize that this is the problem. Instead of trying to bury OOXML, which amounts to covering the sun with your finger.

I think there is an interesting bit of insight in Miguel's post which I highlighted in red font. IBM and the rest of the ODF opponents lobbied governments against Microsoft's products by arguing that its file formats where not open. However they did not expect that Microsoft would turn around and make those very file formats open and instead compete on innovation in the user experience.

Now ODF proponents like Rob Weir who've been trumpeting the value of open standards now find themselves in the absurd position of arguing that is a bad thing for Microsoft to open up its file formats and provide exhaustive documentation for them. Instead they demand that Microsoft should either abandon backwards compatibility with the billions of documents produced by Microsoft Office in the past decade or that it should embrace and extend ODF to meet its needs. Neither of which sounds like a good thing for customers.

I guess it's like Tim Bray said, life gets complicated when there are billion$ of dollars on the line. I'm curious to see how Rob Weir responds to Miguel's post. Ideally, we'll eventually move away from these absurd discussions about whether it is a bad thing for Microsoft to open up its file formats and hand them over to an international standards body to talking about how we office productivity software can improve the lives of workers by innovating on features especially with regards to collaboration in the workplace. After all everyone knows that single user, office productivity software is dead. Right?

Categories: Technology | XML

January 23, 2007

@ 06:34 PM

Comments [9]

What is Rob Weir (and IBM's) Agenda with the OOXML Bashing?

In response to my recent post entitled ODF vs. OOXML on Wikipedia one of my readers pointed out

Well, many of Weir's points are not about OOXML being a "second", and therefore unnecessary, standard. Many of them, I think, are about how crappy the standard actually is.

Since I don't regularly read Rob Weir's blog this was interesting to me. I wondered why someone who identifies himself as working for IBM on various ODF technical topics would be spending a lot of his time attacking a related standard as opposed to talking about the technology he worked. I assumed my reader was mistaken and decided to subscribe to his feed and see how many of his recent posts were about OOXML. Below is a screenshot of what his feed looks like when I subscribed to it in RSS Bandit a few minutes ago

Of his 24 most recent posts, 16 of them are explicitly about OOXML while 7 of them are about ODF.

Interesting. I wonder why a senior technical guy at IBM is spending more time attacking a technology whose proponents have claimed is not competitive with it instead of talking about the technology he works on? Reading the blogs of Microsoft folks like Raymond Chen, Jensen Harris or Brian Jones you don't see them dedicating two thirds of their blog postings to bash rival products or technologies.

From my perspective as an outsider in this debate it seems to me that OOXML is an overspecified description of an open XML document format that is backwards compatible with the billions of documents produced in Microsoft Office formats over the past decade. On the other hand, ODF is an open XML document format that aims to be a generic format for storing business documents that isn't tied to any one product which still needs some work to do in beefing up the specification in certain areas if interoperability is key.

In an ideal world both of these efforts would be trying to learn from each other. However it seems that for whatever reasons IBM has decided that it would rather that Microsoft failed at its attempt to open up the XML formats behind the most popular office productivity software in the world. How this is a good thing for Microsoft's customers or IBM's is lost on me.

Having a family member who is in politics, I've learned that whenever you see what seems like a religious fundamentalism there usually is a quest for money and/or power behind it. Reading articles such as Reader Beware as ODF News Coverage Increases it seems clear that IBM has a lot of money riding on being first to market with ODF-enabled products while simultaneously encouraging governments to only mandate ODF. The fly in the ointment is that the requirement of most governments is that the document format is open, not that it is ODF. Which explains IBM's unfortunate FUD campaign.

Usually, I wouldn't care about something like this since this is Big Business and Politics 101, but there was something that Rick Jellife wrote in his post An interesting offer: get paid to contribute to Wikipedia which is excerpted below

So I think there are distinguishing features for OOXML, and one of the more political issues is do we want to encourage and reward MS for taking the step of opening up their file formats, at last?

The last thing I'd personally hate is for this experience to have soured Microsoft from opening up its technologies so I thought I'd throw my hat in the ring at least this once.

PS: It's pretty impressive that a Google search for "ooxml" pulls up a bunch of negative blog posts and the wikipedia article as the first couple of hits. It seems the folks on the Microsoft Office team need to do some SEO to fix that pronto.

Categories: Competitors/Web Companies | XML

January 22, 2007

@ 09:44 PM

ODF vs. OOXML on Wikipedia

This morning I stumbled upon an interestingly titled post by Rick Jellife which piqued my interest entitled An interesting offer: get paid to contribute to Wikipedia where he writes

I’m not a Microsoft hater at all, its just that I’ve swum in a different stream. Readers of this blog will know that I have differing views on standards to some Microsoft people at least.
...
So I was a little surprised to receive email a couple of days ago from Microsoft saying they wanted to contract someone independent but friendly (me) for a couple of days to provide more balance on Wikipedia concerning ODF/OOXML. I am hardly the poster boy of Microsoft partisanship! Apparently they are frustrated at the amount of spin from some ODF stakeholders on Wikipedia and blogs.

I think I’ll accept it: FUD enrages me and MS certainly are not hiring me to add any pro-MS FUD, just to correct any errors I see.
...
Just scanning quickly the Wikipedia entry I see one example straight away: The OOXML specification requires conforming implementations to accept and understand various legacy office applications . But the conformance section to the ISO standard (which is only about page four) specifies conformance in terms of being able to accept the grammar, use the standard semantics for the bits you implement, and document where you do something different. The bits you don’t implement are no-one’s business. So that entry is simply wrong. The same myth comes up in the form “You have to implement all 6000 pages or Microsoft will sue you.” Are we idiots?
Now I certainly think there are some good issues to consider with ODF versus OOXML, and it is good that they come out an get discussed. For example, the proposition that “ODF and OOXML are both office document formats: why should there be two standards?” is one that should be discussed. As I have mentioned before on this blog, I think OOXML has attributes that distinguish it: ODF has simply not been designed with the goal of being able to represent all the information possible in an MS Office document; this makes it poorer for archiving but paradoxically may make it better for level-playing-field, inter-organization document interchange. But the archiving community deserves support just as much as the document distribution community. And XHTML is better than both for simple documents. And PDF still has a role. And specific markup trumps all of them, where it is possible. So I think there are distinguishing features for OOXML, and one of the more political issues is do we want to encourage and reward MS for taking the step of opening up their file formats, at last?

I'm glad to hear that Rick Jellife is considering taking this contract. Protecting your brand on Wikipedia, especially against well-funded or organized detractors is unfortunately a full time job and one that really should be performed by an impartial party not a biased one. It's great to see that Microsoft isn't only savvy enough to realize that keeping an eye on Wikipedia entries about itself is important but also is seeking objective 3rd parties to do the policing.

It looks to me that online discussion around XML formats for business documents has significantly detoriorated. When I read posts like Rob Weir's A Foolish Inconsistency and The Vast Blue-Wing Conspiracy or Brian Jones's Passing the OpenXML standard over to ISO it seems clear that rational technical discussion is out the windows and the parties involved are in full mud slinging mode. It reminds me of watching TV during U.S. election years. I'm probably a biased party but I think the "why should we have two XML formats for business documents" line that is being thrown around by IBM is crap. The entire reason for XML's existence is so that we can build different formats that satisfy different needs. After all, no one asks them why the ODF folks had to invent their own format when PDF and [X]HTML already exist. The fact that ODF and OOXML exist yet have different goals is fine. What is important is that they both are non-proprietary, open standards which prevents customers from being locked-in which is what people really want.

And I thought the RSS vs. Atom wars were pointless.

PS: On the issue of Wikipedia now using nofollow links, I kinda prefer Shelley Powers's idea in her post Wikipedia and nofollow that search engines treat Wikipedia specially as an 'instant answer' (MSN speak) or OneBox result (Google speak) instead of including it in the organic search results page. It has earned its place on the Web and should be treated specially including the placement of disclaimers warning Web n00bs that it's information should be taken with a grain of salt.

Categories: XML

January 3, 2007

@ 11:25 PM

Comments [17]

Updated: XML Has Too Many Architecture Astronauts

Joel Spolsky has an seminal article entitled Don't Let Architecture Astronauts Scare You where he wrote

A recent example illustrates this. Your typical architecture astronaut will take a fact like "Napster is a peer-to-peer service for downloading music" and ignore everything but the architecture, thinking it's interesting because it's peer to peer, completely missing the point that it's interesting because you can type the name of a song and listen to it right away.

All they'll talk about is peer-to-peer this, that, and the other thing. Suddenly you have peer-to-peer conferences, peer-to-peer venture capital funds, and even peer-to-peer backlash with the imbecile business journalists dripping with glee as they copy each other's stories: "Peer To Peer: Dead!"

The Architecture Astronauts will say things like: "Can you imagine a program like Napster where you can download anything, not just songs?" Then they'll build applications like Groove that they think are more general than Napster, but which seem to have neglected that wee little feature that lets you type the name of a song and then listen to it -- the feature we wanted in the first place. Talk about missing the point. If Napster wasn't peer-to-peer but it did let you type the name of a song and then listen to it, it would have been just as popular.

This article is relevant because I recently wrote a series of posts explaining why Web developers have begun to favor JSON over XML in Web Services. My motivation for writing this article were conversations I'd had with former co-workers who seemed intent on "abstracting" the discussion and comparing whether JSON was a better data format than XML in all the cases that XML is used today instead of understanding the context in which JSON has become popular.

In the past two weeks, I've seen three different posts from various XML heavy hitters committing this very sin

JSON and XML by Tim Bray - This kicked it off and starts off by firing some easily refutable allegations about the extensibility and unicode capabilities of JSON as a general data transfer format.
Tim Bray on JSON and XML by Don Box - Refutes the allegations by Tim Bray above but still misses the point.
All markup ends up looking like XML by David Megginson - argues that XML is just like JSON except with the former we use angle brackets and in the latter we use curly braces + square brackets. Thus they are "Turing" equivalent. Academically interesting but not terribly useful information if you are a Web developer trying to get things done.

This is my plea to you, if you are an XML guru and you aren't sure why JSON seems to have come out of nowhere to threaten your precious XML, go read JSON vs. XML: Browser Security Model and JSON vs. XML: Browser Programming Models then let's have the discussion.

If you're too busy to read them, here's the executive summary. JSON is a better fit for Web services that power Web mashups and AJAX widgets due to the fact ~~it gets around the cross domain limitations put in place by browsers that hamper XMLHttpRequest and~~ that it is essentially serialized Javascript objects which makes it fit better client side scripting which is primarily done in Javascript. That's it. XML will never fit the bill as well for these scenarios without changes to the existing browser ecosystem which I doubt are forthcoming anytime soon.

Update: See comments by David Megginson and Steve Marx below.

Categories: XML

January 2, 2007

@ 07:55 PM

JSON vs. XML: Browser Programming Models

Over the holidays I had a chance to talk to some of my old compadres from the XML team at Microsoft and we got to talking about the JSON as an alternative to XML. I concluded that there are a small number of key reasons that JSON is now more attractive than XML for kinds of data interchange that powers Web-based mashups and Web gadgets widgets. This is the second in a series of posts on what these key reasons are.

In my previous post, I mentioned that getting around limitations in cross domain requests imposed by modern browsers has been a key reason for the increased adoption of JSON. However this is only part of the story.

Early on in the adoption of AJAX techniques across various Windows Live services I noticed that even for building pages with no cross domain requirements, our Web developers favored JSON to XML. One response that kept coming up is the easier programming model when processing JSON responses on the client than with XML. I'll illustrate this difference in ease of use via a JScript code that shows how to process a sample document in both XML and JSON formats taken from the JSON website. Below is the code sample

var json_menu = '{"menu": {' + '\n' +
'"id": "file",' + '\n' +
'"value": "File",' + '\n' +
'"popup": {' + '\n' +
'"menuitem": [' + '\n' +
'{"value": "New", "onclick": "CreateNewDoc()"},' + '\n' +
'{"value": "Open", "onclick": "OpenDoc()"},' + '\n' +
'{"value": "Close", "onclick": "CloseDoc()"}' + '\n' +
']' + '\n' +
'}' + '\n' +
'}}';

var xml_menu = '<menu id="file" value="File">' + '\n' +
'<popup>' + '\n' +
'<menuitem value="New" onclick="CreateNewDoc()" />' + '\n' +
'<menuitem value="Open" onclick="OpenDoc()" />' + '\n' +
'<menuitem value="Close" onclick="CloseDoc()" />' + '\n' +
'</popup>' + '\n' +
'</menu>';

WhatHappensWhenYouClick_Xml(xml_menu);
WhatHappensWhenYouClick_Json(json_menu);

function WhatHappensWhenYouClick_Json(data){

var j = eval("(" + data + ")");

WScript.Echo("When you click the " + j.menu.value + " menu, you get the following options");

for(var i = 0; i < j.menu.popup.menuitem.length; i++){
   WScript.Echo((i + 1) + "." + j.menu.popup.menuitem[i].value
    + " aka " + j.menu.popup.menuitem[i].onclick);
}

}

function WhatHappensWhenYouClick_Xml(data){

var x = new ActiveXObject( "Microsoft.XMLDOM" );
x.loadXML(data);

WScript.Echo("When you click the " + x.documentElement.getAttribute("value")
                + " menu, you get the following options");

var nodes = x.documentElement.selectNodes("//menuitem");

for(var i = 0; i < nodes.length; i++){
   WScript.Echo((i + 1) + "." + nodes[i].getAttribute("value") + " aka " + nodes[i].getAttribute("onclick"));
}
}

When comparing both sample functions, it seems clear that the XML version takes more code and requires a layer of mental indirection as the developer has to be knowledgeable about XML APIs and their idiosyncracies. We should dig a little deeper into this.

A couple of people have already replied to my previous post to point out that any good Web application should process JSON responses to ensure they are not malicious. This means my usage of eval() in the code sample, should be replaced with JSON parser that only accepts 'safe' JSON responses. Given that that there are JSON parsers available that come in under 2KB that particular security issue is not a deal breaker.

On the XML front, there is no off-the-shelf manner to get a programming model as straightforward and as flexible as that obtained from parsing JSON directly into objects using eval(). One light on the horizon is that E4X becomes widely implemented in Web browsers . With E4X, the code for processing the XML version of the menu document above would be

function WhatHappensWhenYouClick_E4x(data){

var e = new XML(data);

WScript.Echo("When you click the " + j.menu.value + " menu, you get the following options");

foreach(var m in e.menu.popup.menuitem){
WScript.Echo( m.@value + " aka " + m.@onclick);
}

}

However as cool as the language seems to be it is unclear whether E4X will ever see mainstream adoption. There is an initial implementation of E4X in the engine that powers the Firefox browser which seems to be incomplete. On the other hand, there is no indication that either Opera or Internet Explorer will support E4X in the future.

Another option for getting the simpler object-centric programming models out of XML data could be to adopt a simple XML serialization format such as XML-RPC and providing off-the-shelf Javascript parsers for this data format. A trivial implementation could be for the parser to convert XML-RPC to JSON using XSLT then eval() the results. However it is unlikely that people would go through that trouble when they can just use JSON.

This may be another nail in the coffin of XML on the Web.

Categories: Web Development | XML | XML Web Services

January 2, 2007

@ 05:46 PM

JSON vs. XML: Browser Security Model

Over the holidays I had a chance to talk to some of my old compadres from the XML team at Microsoft and we got to talking about the JSON as an alternative to XML. I concluded that there are a small number of key reasons that JSON is now more attractive than XML for kinds of data interchange that powers Web-based mashups and Web gadgets widgets. This is the first in a series of posts on what these key reasons are.

The first "problem" that chosing JSON over XML as the output format for a Web service solves is that it works around security features built into modern browsers that prevent web pages from initiating certain classes of communication with web servers on domains other than the one hosting the page. This "problem" is accurately described in the XML.com article Fixing AJAX: XMLHttpRequest Considered Harmful which is excerpted below

But the kind of AJAX examples that you don't see very often (are there any?) are ones that access third-party web services, such as those from Amazon, Yahoo, Google, and eBay. That's because all the newest web browsers impose a significant security restriction on the use of XMLHttpRequest. That restriction is that you aren't allowed to make XMLHttpRequests to any server except the server where your web page came from. So, if your AJAX application is in the page http://www.yourserver.com/junk.html, then any XMLHttpRequest that comes from that page can only make a request to a web service using the domain www.yourserver.com. Too bad -- your application is on www.yourserver.com, but their web service is on webservices.amazon.com (for Amazon). The XMLHttpRequest will either fail or pop up warnings, depending on the browser you're using.

On Microsoft's IE 5 and 6, such requests are possible provided your browser security settings are low enough (though most users will still see a security warning that they have to accept before the request will proceed). On Firefox, Netscape, Safari, and the latest versions of Opera, the requests are denied. On Firefox, Netscape, and other Mozilla browsers, you can get your XMLHttpRequest to work by digitally signing your script, but the digital signature isn't compatible with IE, Safari, or other web browsers.

This restriction is a significant annoyance for Web developers because it eliminates a number of compelling end user applications due to the limitations it imposes on developers. However, there are a number of common workarounds which are also listed in the article

Solutions Worthy of Paranoia

There is hope, or rather, there are gruesome hacks, that can bring the splendor of seamless cross-browser XMLHttpRequests to your developer palette. The three methods currently in vogue are:

Application proxies. Write an application in your favorite programming language that sits on your server, responds to XMLHttpRequests from users, makes the web service call, and sends the data back to users.
Apache proxy. Adjust your Apache web server configuration so that XMLHttpRequests can be invisibly re-routed from your server to the target web service domain.
Script tag hack with application proxy (doesn't use XMLHttpRequest at all). Use the HTML script tag to make a request to an application proxy (see #1 above) that returns your data wrapped in JavaScript. This approach is also known as On-Demand JavaScript.

Although the first two approaches work, there are a number of problems with them. The first is that it adds a requirement that the owner of the page also have Web master level access to a Web server and either tweak its configuration settings or be a savvy enough programmer to write an application to proxy requests between a user's browser and the third part web service. A second problem is that it significantly increases the cost and scalability impact of the page because the Web page author now has to create a connection to the third party Web service for each user viewing their page instead of the user's browser making the connection. This can lead to a bottleneck especially if the page becomes popular. A final problem is that if the third party service requires authentication [via cookies] then there is no way to pass this information through the Web page author's proxy due to browser security models.

The third approach avoids all of these problems without a significant cost to either the Web page author or the provider of the Web service. An example of how this approach is utilized in practice is described in Simon Willison's post JSON and Yahoo!’s JavaScript APIs where he writes

As of today, JSON is supported as an alternative output format for nearly all of Yahoo!’s Web Service APIs. This is a Really Big Deal, because it makes Yahoo!’s APIs available to JavaScript running anywhere on the web without any of the normal problems caused by XMLHttpRequest’s cross domain security policy.

Like JSON itself, the workaround is simple. You can append two arguments to a Yahoo! REST Web Service call:
&output=json&callback=myFunction
The page returned by the service will look like this:
myFunction({ JSON data here });
You just need to define myFunction in your code and it will be called when the script is loaded. To make cross-domain requests, just dynamically create your script tags using the DOM:

var script = document.createElement('script');
script.type = 'text/javascript';
script.src = '...' + '&output=json&callback=myFunction';
document.getElementsByTagName('head')[0].appendChild(script);

People who are security minded will likely be shocked that this technique involves Web pages executing arbitrary code they retrieve from a third party site since this seems like a security flaw waiting to happen especially if the 3rd party site becomes compromised. One might also wonder what's the point of browsers restricting cross-domain HTTP requests if pages can load and run arbitrary Javascript code [not just XML data] from any domain.

However despite these concerns, it gets the job done with minimal cost to all parties involved and more often than not that is all that matters.

Postscript: When reading articles like Tim Bray's JSON and XML which primarily compares both data formats based on their physical qualities, it is good to keep the above information in mind since it explains a key reason JSON is popular on the Web today which turns out to be independent of any physical qualities of the data format.

Categories: Web Development | XML | XML Web Services

December 15, 2006

@ 02:57 AM

Versioning Does Not Make Validation Irrelevant

Mark Baker has a blog post entitled Validation considered harmful where he writes

We believe that virtually all forms of validation, as commonly practiced, are harmful; an anathema to use at Web scale. Specifically, our argument is this;
Tests of validity which are a function of time make the independent evolution of software problematic.

Why? Consider the scenario of two parties on the Web which want to exchange a certain kind of document. Party A has an expensive support contract with BigDocCo that ensures that they’re always running the latest-and-greatest document processing software. But party B doesn’t, and so typically lags a few months behind. During one of those lags, a new version of the schema is released which relaxes an earlier stanza in the schema which constrained a certain field to the values “1″, “2″, or “3″; “4″ is now a valid value. So, party B, with its new software, happily fires off a document to A as it often does, but this document includes the value “4″ in that field. What happens? Of course A rejects it; it’s an invalid document, and an alert is raised with the human adminstrator, dramatically increasing the cost of document exchange. All because evolvability wasn’t baked in, because a schema was used in its default mode of operation; to restrict rather than permit.

This doesn't seem like a very good argument to me. The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions. First of all, every system does some form of validation because it cannot process arbitrary documents. For example an RSS reader cannot do anything reasonable with an XBRL or ODF document, no matter how liberal it is in what it accepts. Now that we have accepted that there are certain levels validation that are no-brainers the next question is to ask what happens if there are no constraints on the values of elements and attributes in an input document. Let's say we have a purchase order format which in v1 has a <currency> element which can have a value of "U.S. dollars" or "Canadian dollars" then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?

As in all things in software, there are no hard and fast rules as to what is right and what is wrong. In general, it is better to be flexible rather than not as the success of HTML and RSS have shown us but this does not mean that it is acceptable in every situation. And it comes with its own set of costs as the success of HTML and RSS have shown us. :)

Sam Ruby puts it more eloquently than I can in his blog post entitled Tolerance.

Categories: XML | XML Web Services

December 11, 2006

@ 02:03 PM

My Good Deed for the Day

Edd Dumbill has a blog post entitled Afraid of the POX? where he writes

The other day I had was tinkering with that cute little poster child of Web 2.0, Flickr. Looking for a lightweight way to incorporate some photos into a web site, I headed to their feeds page to find some XML to use.
...
The result was interesting. Flickr have a variety of outputs in RSS dialects, but you just can't get at the raw data using XML. The bookmarking service del.icio.us is another case in point. My friend Matt Biddulph recently had to resort to screenscraping in order to write his tag stemmer, until some kind soul pointed out there's a JSON feed.

Both of these services support XML output, but only with the semantics crammed awkwardly into RSS or Atom. Neither have plain XML, but do support serialization via other formats. We don't really have "XML on the Web". We have RSS on the web, plus a bunch of mostly JSON and YAML for those who didn't care for pointy brackets.

Interesting set of conclusions but unfortunately based on faulty data. Flickr provides custom XML output from their Plain Old XML over HTTP APIs at http://www.flickr.com/services/api as does del.icio.us from its API at http://del.icio.us/help/api. If anything, this seems to indicate that old school XML heads like Edd have a different set of vocabulary from the Web developer crowd. It seems Edd did searches for "XML feeds" from these sites then came off irritated that the data was in RSS/Atom and not custom XML formats. However once you do a search for "API" with the appropriate service name, you find their POX/HTTP APIs which provide custom XML output.

The morale of this story is that "XML feeds" pretty much means RSS/Atom feeds these days and is not a generic term for XML being provided by a website.

PS: This should really be a comment on Edd's blog but it doesn't look like his blog supports comment.

Categories: XML

December 6, 2006

@ 02:37 AM

Miguel De Icaza on the Novell's OpenOffice "Fork"

If you are a reggular reader of Slashdot you probably stumbled on a link to the Groklaw article Novell "Forking" OpenOffice.org by Pamela Jones. In the article, she berates Novell for daring to provide support for the Office Open XML formats in their version of OpenOffice.

Miguel De Icaza, a Novell employee, has posted a response entitled OpenOffice Forks? where he writes

Facts barely matter when they get in the way of a good smear. The comments over at Groklaw are interesting, in that they explore new levels of ignorance.
Let me explain.
We have been working on OpenOffice.Org for longer than anyone else has. We were some of the earliest contributors to OpenOffice, and we are the largest external contributor to actual code to OpenOffice than anyone else.
...
Today we ship modified versions of OpenOffice to integrate GStreamer, 64-bit fixes, integrate with the GNOME and KDE file choosers, add SVG importing support, add OpenDMA support, add VBA support, integrate Mono, integrate fontconfig, fix bugs, improve performance and a myriad of others. The above url contains some of the patches that are pending, but like every other open source project, we have published all of those patches as part of the src.rpm files that we shipped, and those patches have eventually ended up in every distribution under the sun.
But the problem of course is not improving OpenOffice, the problem is improving OpenOffice in ways that PJ disapproves of. Improving OpenOffice to support an XML format created by Microsoft is tantamount to treason.
And of course, the code that we write to interop with Office XML is covered by the Microsoft Open Specification Promise (Update: this is a public patent agreement, this has nothing to do with the Microsoft/Novell agreement, and is available to anyone; If you still want to email me, read the previous link, and read it twice before hitting the send button).
I would reply to each individual point from PJ, but she either has not grasped how open source is actually delivered to people or she is using this as a rallying cry to advance her own ideological position on ODF vs OfficeXML.
Debating the technical merits of one of those might be interesting, but they are both standards that are here to stay, so from an adoption and support standpoint they are a no-brainer to me. The ideological argument on the other hand is a discussion as interesting as watching water boil. Am myself surprised at the spasms and epileptic seizures that folks are having over this.

I've been a fan of Miguel ever since I was a good lil' Slashbot in college. I've always admired his belief in "Free" [as in speech] Software and the impact it has on people's lives as well as the fact that he doesn't let geeky religious battles get in the way of shipping code. When Miguel saw good ideas in Microsoft's technologies, he incorporated the ideas into Bonobo and Mono as a way to improve the Linux software landscape instead of resorting to Not Invented Here syndrome.

Unfortunately, we don't have enough of that in the software industry today.

Categories: Mindless Link Propagation | XML

November 28, 2006

@ 08:56 PM

Should You Choose RELAX Now?

Tim Bray has a blog post entitled Choose RELAX Now where he writes

Elliotte Rusty Harold’s RELAX Wins may be a milestone in the life of XML. Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. To Elliotte’s list of important XML applications that are RELAX-based, I’d add the Atom Syndication Format and, pretty soon now, the Atom Publishing Protocol. It’s a pity; when XSD came out people thought that since it came from the W3C, same as XML, it must be the way to go, and it got baked into a bunch of other technology before anyone really had a chance to think it over. So now lots of people say “Well, yeah, it sucks, but we’re stuck with it.” Wrong! The time has come to declare it a worthy but failed experiment, tear down the shaky towers with XSD in their foundation, and start using RELAX for all significant XML work.

In a past life I was the PM for XML schema technologies at Microsoft so I obviously have an opinion here. What Tim Bray and Elliotte Rusty Harold gloss over in their advocacy is that there are actually two reasons one would choose an XML schema technology. I covered both reasons in my article XML Schema Design Patterns: Is Complex Type Derivation Unnecessary? for XML.com a few years ago. The relevant part of the article is excerpted below

As usage of XML and XML schema languages has become more widespread, two primary usage scenarios have developed around XML document validation and XML schemas.
Describing and enforcing the contract between producers and consumers of XML documents: An XML schema ordinarily serves as a means for consumers and producers of XML to understand the structure of the document being consumed or produced. Schemas are a fairly terse and machine readable way to describe what constitutes a valid XML document according to a particular XML vocabulary. Thus a schema can be thought of as contract between the producer and consumer of an XML document. Typically the consumer ensures that the XML document being received from the producer conforms to the contract by validating the received document against the schema.

This description covers a wide array of XML usage scenarios from business entities exchanging XML documents to applications that utilize XML configuration files.
Creating the basis for processing and storing typed data represented as XML documents: As XML became popular as a way to represent rigidly structured, strongly typed data, such as the content of a relational database or programming language objects, the ability to to describe the datatypes within an XML document became important. This led to Microsoft's XML Data and XML Data-Reduced schema languages, which ultimately led to WXS. These schema languages are used to convert an input XML infoset into a type annotated infoset (TAI) where element and attribute information items are annotated with a type name.

WXS describes the creation of a type annotated infoset as a consequence of document validation against a schema. During validation against a WXS, an input XML infoset is converted into a post schema validation infoset (PSVI), which among other things contains type annotations. However practical experience has shown that one does not need to perform full document validation to create type annotated infosets; in general many applications that use XML schemas to create strongly typed XML such as XML<->object mapping technologies do not perform full document validation, since a number of WXS features do not map to concepts in the target domain.

RELAX NG is good at #1 but not #2 which is by design. Most of the folks who are interested in XSD are either WS-* folks who are building toolkits that map XML on the wire to in-memory objects or database folks implementing XQuery who also have to deal with strongly typed data. Neither category of developers/vendors are interested in RELAX NG because it wasn't designed to meet their needs. On the other hand, if you are designing an XML format from scratch and need a language/toolkit for validating the structure and correctness of your documents you definitely need to strongly consider using RELAX NG over XSD.

Categories: XML

July 18, 2006

@ 05:38 PM

On Microsoft Not Joining the OpenDocument Format (ODF) Committee

Brian Jones has a blog post entitled Politics behind standardization where he writes

We ultimately need to prioritize our standardization efforts, and as the Ecma Office Open XML spec is clearly further along in meeting the goal of full interoperability with the existing set of billions of Office documents, that is where our focus is. The Ecma spec is only a few months away from completion, while the OASIS committee has stated they believe they have at least another year before they are even able to define spreadsheet formulas. If the OASIS Open Document committee is having trouble meeting the goal of compatibility with the existing set of Office documents, then they should be able to leverage the work done by Ecma as the draft released back in the spring is already very detailed and the final draft should be published later this year.

To be clear, we have taken a 'hands off' approach to the OASIS technical committees because: a) we have our hands full finishing a great product (Office 2007) and contributing to Ecma TC45, and b) we do not want in any way to be perceived as slowing down or working against ODF. We have made this clear during the ISO consideration process as well. The ODF and Open XML projects have legitimate differences of architecture, customer requirements and purpose. This Translator project and others will prove that the formats can coexist with a certain tolerance, despite the differences and gaps.

No matter how well-intentioned our involvement might be with ODF, it would be perceived to be self-serving or detrimental to ODF and might come from a different perception of requirements. We have nothing against the different ODF committees' work, but just recognize that our presence and input would tend to be misinterpreted and an inefficient use of valuable resources. The Translator project we feel is a good productive 'middle ground' for practical interoperability concerns to be worked out in a transparent way for everyone, rather than attempting to swing one technical approach and set of customer requirements over to the other.

As someone who's watched standards committees from the Microsoft perspective while working on the XML team, I agree with everything Brian writes in his post. Trying to merge a bunch of contradictory requirements often results in a complex technology that causes more problems than it solves (e.g. W3C XML Schema). In addition, Microsoft showing up and trying to change the direction of the project to supports its primary requirement (an XML file format compatible with the legacy Microsoft Office file formats) would not be well received.

Unfortunately, the ODF discussion has seemed to be more political than technical which often obscures the truth. Microsoft is making moves to ensure that Microsoft Office not only provides the best features for its customers but ensures that they can exchange documents in a variety of document formats from those owned by Microsoft to PDF and ODF. I've seen a lot of customers acknowledge this truth and commend the company for it. At the end of the day, that matters a lot more than what competitors and detractors say. Making our customers happy is job #1.

Categories: XML

July 6, 2006

@ 06:04 PM

Microsoft Announces ODF Support for Office

The Office team continues to impress me how savvy they are about the changing software landscape. In his blog post entitled Open XML Translator project announced (ODF support for Office) Brian Jones writes

Today we are announcing the creation of the Open XML Translator project that will help translate between the Office Open XML formats and the OpenDocument format. We've talked a lot about the value the Open XML formats bring, and one of them of course is the ability to filter it down into other formats. While we still aren't seeing a strong demand for ODF support from our corporate or consumer customers, it's now a bit different with governments. We've had some governments request that we help build solutions so that can use ODF for certain situations, so that's why we are creating the Open XML Translator project. I think it's going to be really beneficial to a number of folks and for a number of reasons.
There has been a push in Microsoft for better interoperability and this is another great step in that direction. We already have the PDF and XPS support for Office 2007 users that unfortunately had to be separated out of the product and instead offered as a free download. There will be a menu item in the Office applications that will point people to the downloads for XPS, PDF, and now ODF. So you'll have the ability to save to and open ODF files directly within Office (just like any other format).

For me, one of the really cool parts of this project is that it will be open source and located up on SourceForge, which means everyone will have the ability to see how to leverage the open architectures of both the Office Open XML formats and ODF. We're developing the tools with the help of Clever Age (based in France) and a few other folks like Aztecsoft (based in India) and Dialogika (based in Germany). There should actually be a prototype of the first translator (for Word 2007) posted up on SourceForge later on today (http://sourceforge.net/projects/odf-converter). It's going to be made available under the BSD license, and anyone can provide feedback, submit bugs, and of course directly contribute to the project. The Word tool should be available by the end of this year, with the Excel and PPT versions following in 2007.

This announcement is cool on so many levels. The coolest being that the projects will not only be Open Source but will be hosted on SourceForge. That is sweet. It is interesting to note that it is government customers and not businesses that are interested in ODF support in Office. I guess that makes sense if you consider which parties have been expressing interest in Open Office.

There already some great analyst responses to this move such as Stephen O'Grady of Redmonk who in his post Microsoft Office to Support ODF: The Q&A has some great insights. My favorite insight is excerpted below

Q: How about Microsoft's competitors?
A: Well, this is a bittersweet moment for them. For those like Corel that have eschewed ODF support, it's a matter of minor importance - at least until Microsoft is able to compete in public sector markets that mandate ODF and they are not.

But for those vendors that have touted ODF support as a diffentiator, this is a good news/bad news deal. The good news is that they can and almost certainly will point to Microsoft's support as validation of further ODF traction and momentum, they will now be competing - at least in theory, remember the limitation - with an Office suite that is frankly the most capable on the market. I've said for years that packages like OpenOffice.org are more than good enough for the majority of users, and that's been validated by our own usage of the product over the past few years; but Microsoft's suite is better than good enough. I'm interested to see if there's any fallout from the UI overhaul, but for now Office remains the undisputed champ of the Office arena. This means that commercial packages like StarOffice and Workplace, not to mention open source projects such as Abiword, KOffice, and OpenOffice.org will have to compete more on features and innovation and less on their support for formats such as ODF or PDF.

It'll be good to see the debate migrate away from support for file formats back to exactly which product's features provides the best value for customers. Everybody wins. Mad props to the Office team for making this decision. Rock on.

Categories: XML

June 23, 2006

@ 09:38 PM

Mike Champion on Why We Need XLinq

Mike Champion has a blog post entitled Why does the world need another XML API? where he writes

One basic question keeps coming up, something like: "We have SAX, DOM, XmlReader/Writer APIs (and the Java people have a bunch more), we have XSLT, we have XQuery ... why do you think we need Yet Another XML API?"
...
XmlReader / XmlWriter can't go away because XLinq uses them to parse and serialize between XLinq objects and XML text. Also, while we are making XLinq as streaming-friendly as possible (see the XStreamingElement class in the CTP release for a taste of where we are going), we're only aiming at hitting the 80/20 point...
DOM can't go away because there are important use cases for API-level interoperability, most notably in the browser...DOM doesn't make code truly interoperable across implementations (especially on other languages), but there is enough conceptual similarity that porting is generally not terribly difficult...
XSLT definitely won't go away. The Microsoft XML team was promoting XQuery as a "better XSLT than XSLT 2.0" a few years ago (before I came, don't hurt me!), and got set straight by the large and vocal XSLT user community on why this is not going to fly. While it may be true in some abstract way that XQuery or XLinq might logically be able to do everything that XSLT does, as a practical matter it won't...

XQuery won't go away, at least for its original use case as a database query language. Microsoft supports a draft of XQuery in SQL Server 2005, contributes to the work of the XQuery working group at W3C, and will continue to invest in finalizing the XQuery Recommendation and implementing it in our DBMS..
we believe that the overall LINQ story is going to have a pretty profound impact on data programmability, and we want to make sure that LINQ has a good story for XML...For XML users, I see a few really powerful implications:
The ability to query data by declaraing the characterics of the result set rather than imperatively navigating through and filtering out all the data...
The ability to join across diverse data sources, be they XML documents, objects, or DBMS queries
The ability to "functionally" reshape data within the same language as the application is written. XSLT pioneered the functional transformation approach to XML processing, but it is difficult for many developers to learn and requires a processing pipeline architecture to combine XSLT transforms with conventional application logic...

This brings back memories of my days on the XML team at Microsoft. We went back and forth a lot about building the "perfect XML API", the one problem we had was that there one too many diverse user bases which had different ideas of what was important to expose in an API. We were always caught between a rock and a hard place when it came to customer requests for fixing our APIs. To some people (e.g. Microsoft Office) XML was a markup format for documents while to others (e.g. Windows Communications Foundation aka Indigo) it was simply a serialization format for programming language objects. Some of our customers were primarily interested in processing XML in a streaming fashion (e.g. Biztalk) while others (e.g. Internet Explorer) always worked on in-memory XML documents. Then there were the teams whose primarily interest was in strongly typed XML (e.g. SQL Server, ADO.NET) since it would be stored in relational database columns.

In trying to solve all of these problems with a single set of APIs, we went down the road of prematurely declaring the death of certain XML APIs and technologies such as the DOM (see Ode to the XML DOM) and XSLT (see XSLT 2.0 Sir? or Why You Won't See XSLT 2.0 or XPath 2.0 in the Next Version of the .NET Framework). At the end of the day we saw the light and we eventually changed our tune by not deprecating the System.Xml.XmlDocument class and by reconsidering whether replacing XSLT with XQuery was the right way forward.

When I was on the team there was a lot of infatuation with XQuery which eventually turned to frustration. There were a number of technical and non-technical reasons for this such as its dependence on W3C XML Schema which significantly complicated its type system and how long the spec was taking to become a standard (over 5 years and counting as I write this post). Since then a bunch of folks who were were enamored with XQuery have taken some of its lessons (e.g. declaritiveness, simple model for XML generation, etc) and integrated it into a mainstream programming environment with the XLinq project. XML geeks everywhere should read Erik Meijer's paper, XLinq: XML Programming Refactored (The Return Of The Monoids), it is a thing of beauty if angle brackets are your thing. And even better, if you are one you are one of those that chants rabid slogans like "XML is the assembly language of Web 2.0", you'll still like XLinq because it provides a easier and richer level of abstraction for working with XML.

Enjoy.

Categories: XML

May 17, 2006

@ 02:35 PM

On the C# 3.0 Preview: Some Thoughts on LINQ

If you're a regular reader of Don Box's weblog then you probably know that Microsoft has made available another Community Technical Preview (CTP) of Language Integrated Query (LINQ) aka C# 3.0. I think the notion of integrating data access and query languages into programming languages is the next natural evolution in programming language design. A large number of developers write code that performs queries over rich data structures of some sort whether they are relational databases, XML files or just plain old objects in memory. In all three cases, the code tends to be verbose and more cumbersome than it needs to be. The goal of the LINQ project is to try to simplify and unify data access in programming languages built on the .NET Framework.

When I used to work on the XML team, we also used to salivate about the power that developers would get if they could get rich query over their data stores in a consistent manner. I was the PM for the IXPathNavigable interface and the related XPathNavigator class which we hoped people would implement over their custom stores to enable them to use XPath to query them. Some developers did do exactly that such as Steve Saxon with the ObjectXPathNavigator which allows you to use XPath to query a graph of in-memory objects. The main problem with this approach is that implementing IXPathNavigable for custom data stores is non-trivial especially given the impedence mismatch between XML and other data models. In fact, I've been wanting to do something like this in RSS Bandit for a while but the complexity of implementing my own custom XPathNavigator class over our internal data structures is something I've balked at doing.

According to Matt Warren's blog post Oops, we did it again it looks like the LINQ folks have similar ideas but are making it easier than we did on the XML team. He writes

What's the coolest new feature? IMHO, its IQueryable<T>.

DLINQ's query mechanism has been generalized and available for all to use as part of System.Query. It implements the Standard Query Operators for you using expression nodes to represent the query. Your queries can now be truly polymorphic, written over a common abstraction and translated into the target environment only when you need it to.

    public int CustomersInLondon(IQueryable<Customer> customers) {

        int count = (from c in customers

                     where c.City == "London"

                     select c).Count();

        return count;

    }

Now you can define a function like this and it can operate on either an in memory collection or a remote DLINQ collection (or you own IQueryable for that matter.) The query is then either run entirely locally or remotely depending on the target.

If its a DLINQ query a count query is sent to the database.

SELECT COUNT(*) AS [value]

FROM [Customers] AS [t0]

WHERE [t0].[City] = @p0

If its a normal CLR collection, the query is executed locally, using the System.Query.Sequence classes definitions of the standard query operators. All you need to do is turn your IEnumerable<Customer> into IQueryable<Customer>. This is accomplished easily with a built-in ToQueryable() method.

List<Customer> customers = ...;

CustomersInLondon(customers.ToQueryable());

Wow! That was easy. But, how is this done? How can you possible turn my List<T> into some queryable thingamabob?

Good question. Glad you asked.

Check out this little gem:

Expression<Func<Customer,bool>> predicate = c => c.City == "London";

Func<Customer,bool> d = predicate.Compile();

Now you can compile lambda expressions directly into IL at runtime!

ToQueryable() wraps your IEnumerable<T> in IQueryable<T> clothing, uses the Queryable infrastructure to let you build up your own expression tree queries, and then when you enumerate it, the expression is rebound to refer to your IEnumerable<T> directly, the operators rebound to refer to System.Query.Sequence, and the resulting code is compiled using the built-in expression compiler. That code is then invoked producing your results.
Amazing, but true.

I think it's pretty amazing that all I have to do as a developer is implement a simple iterator over my data structures (i.e. IEnumerable) and then I get all the power of Linq for free. Of course, if I want the queries to be performant it would make sense to implement IQueryable directly but the fact that the barrier to entry is so low if my perf needs aren't high is goodness.

For more information on LINQ, read the Linq project overview. If you are like me and are primarily interested in XLinq then check out XLinq: XML Programming Refactored (The Return Of The Monoids) which has the fingerprints of my former team all over it. Way to go guys!

Categories: XML

April 11, 2006

@ 11:55 PM

W3C Publishes XMLHttpRequest Object specification

I just noticed that last week the W3C published a working draft specification for The XMLHttpRequest Object. I found the end of the working draft somewhat interesting. Read through the list of references and authors of the specifcation below

References

This section is normative

DOM3
Document Object Model (DOM) Level 3 Core Specification, Arnaud Le Hors (IBM), Philippe Le Hégaret (W3C), Lauren Wood (SoftQuad, Inc.), Gavin Nicol (Inso EPS), Jonathan Robie (Texcel Research and Software AG), Mike Champion (Arbortext and Software AG), and Steve Byrne (JavaSoft).
RFC2119
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner.
RFC2616
Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding (UC Irvine), J. Gettys (Compaq/W3C), J. Mogul (Compaq), H. Frystyk (W3C/MIT), L. Masinter (Xerox), P. Leach (Microsoft), and T. Berners-Lee (W3C/MIT).

B. Authors

This section is informative

The authors of this document are the members of the W3C Web APIs Working Group.

Robin Berjon, Expway (Working Group Chair)
Ian Davis, Talis Information Limited
Gorm Haug Eriksen, Opera Software
Marc Hadley, Sun Microsystems
Scott Hayman, Research In Motion
Ian Hickson, Google
Björn Höhrmann, Invited Expert
Dean Jackson, W3C
Christophe Jolif, ILOG
Luca Mascaro, HTML Writers Guild
Charles McCathieNevile, Opera Software
T.V. Raman, Google
Arun Ranganathan, AOL
John Robinson, AOL
Doug Schepers, Vectoreal
Michael Shenfield, Research In Motion
Jonas Sicking, Mozilla Foundation
Stéphane Sire, IntuiLab
Maciej Stachowiak, Apple Computer
Anne van Kesteren, Opera Software

Thanks to all those who have helped to improve this specification by sending suggestions and corrections. (Please, keep bugging us with your issues!)

Interesting. A W3C specification that documents a proprietary Microsoft API which not only does not include a Microsoft employee as a spec author but doesn't even reference any of the IXMLHttpRequest documentation on MSDN.

I'm sure there's a lesson in there somewhere. ;)

Categories: Web Development | XML

January 24, 2006

@ 07:28 PM

WordPerfect to Support Microsoft Office Open XML Formats

Brian Jones has a blog post entitled Corel to support Microsoft Office Open XML Formats which begins

Corel has stated that they will support the new XML formats in Wordperfect once we release Office '12'. We've already seen other applications like OpenOffice and Apple's TextEdit support the XML formats that we built in Office 2003. Now as we start providing the documentation around the new formats and move through Ecma we'll see more and more people come on board and support these new formats. Here is a quote from Jason Larock of Corel talking about the formats they are looking to support in coming versions (http://labs.pcw.co.uk/2006/01/new_wordperfect_1.html):

Larock said no product could match Wordperfect's support for a wide variety of formats and Corel would include OpenXML when Office 12 is released. "We work with Microsoft now and we will continue to work with Microsoft, which owns 90 percent of the market. We would basically cut ouirselves off if you didn't support the format."

But he admitted that X3 does not support the Open Document Format (ODF), which is being proposed as a rival standard, "because no customer that we are currently dealing with as asked us to do so."

X3 does however allow the import and export of portable document format (pdf) files, something Microsoft has promised for Office 12.

I mention this article because I wanted to again stress that even our competitors will now have clear documentation that allows them to read and write our formats. That isn't really as big of a deal though as the fact that any solution provider can do this. It means that the documents can now be easily accessed 100 years from now, and start to play a more meaningful role in business processes.

Again I want to extend my kudos to Brian and the rest of the folks on the Office team who have been instrumental in the transition of the Microsoft Office file formats from proprietary binary formats to open XML formats.

Categories: Mindless Link Propagation | XML

January 21, 2006

@ 01:01 PM

Metadata Quality and Mapping Between Domain Languages

One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs but allows them to map between languages using technologies like XSLT. However XML technologies like XSLT focus on mapping one kind of syntax for another. There is another school of thought from proponents of Semantic Web technologies like RDF, OWL, and DAML+OIL, etc that higher level mapping between the semantics of languages is a better approach.

In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactical mappings and semantic mappings. A great example of this is shown in the post entitled On the Quality of Metadata... by Stefano Mazzocchi where he writes

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: everytime we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function on how much thinking you put in the data mappings or ontology crosswalks, and talking to Ben helped me understand why.

First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata.

Note how the is nothing but a syllogism, yet, it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".

The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?".. deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently and they are happy campers.

The problem Stefano has pointed out is that just being able to say that two items are semantically identical (i.e. an artist field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (i.e. alter artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.

The example I tend to cull from in my personal experience is mapping between different XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between both formats isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or that <atom:author> owl:sameAs <author> . In both cases, an application that understands how to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax of the equivalent elements in the other (e.g. processing RFC 3339 dates as opposed to RFC 822 dates).

Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world. I've seen some claims that simply using XML technologies for mapping between XML vocabularies means you will need N² transforms as opposed to needing 2N transforms if using SW technologies (Stefano mentions this in his post as has Ken Macleod in his post XML vs. RDF :: N × M vs. N + M). The explicit assumption here is that these vocabularies have similar data models and semantics which should be true otherwise a mapping wouldn't be possible. However the implicit assumption is that the syntax of each vocabulary is practically identical (e.g. same naming conventions, same date formats, etc) which this post provides a few examples where this is not the case.

What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...

Categories: Web Development | XML

January 18, 2006

@ 12:39 PM

Microformats vs. XML: Pros and Cons

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought in the appropriateness of using microformats over XML formats. The real-world test case I use when thinking about choosing microformats over XML is whether instead of having an HTML web page for my blog and an Atom/RSS feed, I should instead have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. To me this seems like a gross hack but I've seen lots of people comment on how this seems like a great idea to them. Given that I hadn't encountered universal disdain for this idea, I decided to explore further and look for technical arguments for and against both approaches.

I found quite a few discussions on the how and why microformats came about in articles such as The Microformats Primer in the Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However I hadn't seen many in-depth technical arguments of why they were better than XML formats until recently.

In a comment in response to my Microformats vs. XML: Was the XML Vision Wrong?, Mark Pilgrim wrote

Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.

http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication

I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can be embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.

I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is wasted bandwidth since web pages tend to contain more stuff than feeds and change more often.

Norm Walsh raises a few other good points on the trade offs being made when choosing microformats over XML in his post Supporting Microformats where he writes

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain and the fallback is natural and obvious, new code can be written to use this “extra” information without any change being required to existing applications, they just ignore it.

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using <div> and <span> to markup the individual appointments, so I can't add hCalendar to them.

I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.

They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So the general case, where you want to say that the content of one of these special elements is “an <abbr> with class="dtstart" interleaved with an <abbr> with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.

Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.

The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun.

I'll continue to pay attention to this discussion but for now microformats will remain in the "gross hack" bucket for me.

Categories: XML

January 12, 2006

@ 09:58 PM

Comments [9]

Microformats vs. XML: Was the XML Vision Wrong?

Over a year ago, I wrote a blog post entitled SGML on the Web: A Failed Dream? where I asked whether the original vision of XML had failed. Below are excerpts from that post

The people who got together to produce the XML 1.0 recommendation where motivated to do this because they saw a need for SGML on the Web. Specifically
their discussions focused on two general areas:
Classes of software applications for which HTML was an inadequate information format
Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology

The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases.

The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web.

And thus XML was born.
...
The W3C's attempts to get people to author XML directly on the Web have mostly failed as can be seen by the dismal adoption rate of XHTML and in fact many [including myself] have come to the conclusion that the costs of adopting XHTML compared to the benefits are too low if not non-existent. There was once an expectation that content producers would be able to place documents conformant to their own XML vocabularies on the Web and then display would entirely be handled by stylesheets but this is yet to become widespread. In fact, at least one member of a W3C working group has called this a bad practice since it means that User Agents that aren't sophisticated enough to understand style sheets are left out in the cold.

Interestingly enough although XML has not been as successfully as its originators initially expected as a markup language for authoring documents on the Web it has found significant success as the successor to the Comma Separated Value (CSV) File Format. XML's primary usage on the Web and even within internal networks is for exchanging machine generated, structured data between applications. Speculatively, the largest usage of XML on the Web today is RSS and it conforms to this pattern.

These thoughts were recently rekindled when reading Tim Bray's recent post Don’t Invent XML Languages where Tim Bray argues that people should stop designing new XML formats. For designing new data formats for the Web, Tim Bray advocates the use of Microformats instead of XML.

The vision behind microformats is completely different from the XML vision. The original XML inventers started with the premise that HTML is not expressive enough to describe every possible document type that would be exchanged on the Web. Proponents of microformats argue that one can embed additional semantics over HTML and thus HTML is expressive enough to represent every possible document type that could be exchanged on the Web. I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, instead I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it instead. However given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.

Categories: XML

December 15, 2005

@ 06:25 PM

Don Demsak on XSLT 2.0 and Microsoft

Don Demsak has a post entitled XSLT 2.0, Microsoft, and the future of System.Xml which has some insightful perspectives on the future of XML in the .NET Framework

Oleg accidentally restarted the XSLT 2.0 on .Net firestorm by trying to startup an informal survey. Dare chimed in with his view of how to get XSLT 2.0 in .Net. M. David (the guy behind Saxon.Net which let .Net developers use Saxon on .Net) jumped in with his opinion.
...

One of the things that I’ve struggled with in System.Xml is how hard it is sometimes to extend the core library. The XML MVPs have done a good job with some things, but other things (like implementing XSLT 2.0 on top of the XSLT 1.0 stuff) are impossible because so much of the library is buried in internal classes. When building a complex library like System.Xml, there are 2 competing schools of thought:

Make the library easy to use and create a very small public facing surface area.
Make the library more of a core library with most classes and attributes public, and let others build easy (and very specific) object models on top of it.

The upside of the first methodology is that it is much easier to test, and the library just works out of the box. The downside is that it very hard to extend the library, so it can only be used in very specific ways.

The upside of the second methodology is that you don’t have to trying to envision all the ways the library should be used. Over time others will extend it to accomplish things that the original developers never thought of. The downside is that you have a much larger surface area to test, and you are totally reliant on other projects to make your library useful. This goes for both projects internal to Microsoft and external projects like the Mvp.Xml lib.

The System.Xml team has tended to use the first methodology, where the ASP.Net team tends to build their core stuff according to the second methodology, and then have a sub-team create another library using the first methodology, so developers have something to use right out of the box (think of System.Web.UI.HtmlControls as the low level API and System.Web.UI.WebControls as the higher level API). The ASP.Net team builds their API this way because, from the beginning, they have always envisioned 3rd parties extending their library. At the moment, this is not the case for the System.Xml library. But the question is, should System.Xml be revamped and become a lower level API, and then rely on 3rd parties (like the Mvp.Xml project) to create more specific and easier to use APIs? Obviously this is not something to be taken lightly. It will be more costly to expose more of the internals of System.Xml. But, if only the lower level API was part of the core .Net framework, it may then be possible to roll out newer, higher level, APIs on a release schedule different then the .Net framework schedule. This way projects like XSLT 2.0 could be released without having to what for the next version of the framework.

I’ve always been of the opinion that XSLT 2.0 does not need to be part of the core .Net framework. Oleg doesn’t believe that the .Net open source community is as passionate as some of the other communities, so he would like to see Microsoft build XSLT 2.0. I’d rather see the transformation of the System.Xml team into more of an ASP.Net like team. If .Net is the future of development on the Windows platform, and XML is the future of Microsoft, then the System.Xml team needs to grow beyond its legacy as just an offshoot of the SQL Server team. The System.Xml team still resides in the SQL Server building. Back before .Net, the System.Xml was known as the SQL Web Data team, and unfortunately, still carries some of that mentality. Folks like Joshua Allen and Dare (who are both not on the team anymore) fought to bring the team out from the shadows of SQL Server. With new XML related groups, like XLinq and Windows Communication Framework, popping up within the company the System.Xml group is at a major crossroads. They will either grow (in status and budget) and become more like the ASP.Net or they will get absorbed into one of the new groups.

I’d prefer to see the System.Xml team grow and become full partners with teams like ASP.Net and the CLR team. I’d like to see the XML based languages become first class programming languages within the Visual Studio IDE. That means not only using things like XSLT and XML Schema as dynamic languages, but also be able to compile them down to IL and compiled with the other .Net languages. I want to be able to create projects that contain not only VB or C#, but also XSLT and XML Schema (to name a couple), and have them compile into one executable. Then developers can use things like XSLT 2.0, or the next in vogue XML based language, and take advantage of that language’s unique benefits, without having to choose between a compiled procedural language (like C# or VB) and dynamic functional languages like XSLT. Linq is starting to bring in more of the functional programming style to the average procedural programmer, so I can start to see the rise public awareness of functional programming. It is only a matter of time before the average programmer feels as comfortable with functional programming as they do with procedural programming, so we need to look towards including these languages within the Visual Studio IDE (which then leads into my discussion about evolving Visual Studio into more of an IDE Framework, and extended with add-ins.)

There is a lot of stuff which I agree with in Don's post which is why I forwarded it to some of the folks on the XML team. I'll be having lunch over there today to talk about some of the topics it raised

Don does gloss over something when it comes to the decision between whether Microsoft should implement a technology like XSLT 2.0 or whether we should just make it easy for third parties to do so. The truth is that Microsoft now has a large number of products which utilize XML-related technologies. For example, implementing something like XSLT 2.0 isn't just about providing a new version of the System.Xml.Xsl.XslCompiledTransform Class in the .NET Framework. It's also about deciding whether to update the XSLT engine used by Internet Explorer to support XSLT 2.0 (which is an entirely different code base), it's about adding support to the XSLT debugger in Visual Studio to support XSLT 2.0, and maybe even updating the Biztalk Mapper. Users of Microsoft products expect a cohesive and comprehensive experience. In many cases, only Microsoft can provide that experience (e.g. supporting XSLT 2.0 across our entire breadth of technologies and products that use XSLT). It was a really tough balance deciding what we should make extensible and what was too expensive to make extensible since we'd probably be the only ones who could take proper advantage of it when I was on the XML team. I'm glad to see that some of our MVPs understand how delicate of a balancing act shipping platform technologies can be.

Categories: Life in the B0rg Cube | XML

December 11, 2005

@ 05:45 PM

Comments [13]

XSLT 2.0 and Microsoft

I've been following a series of posts on Oleg Tkachenko's blog with some bemusement. In his post A business case for XSLT 2.0? he writes

If you are using XSLT and you think that XSLT 2.0 would provide you some real benefits, please drop a line of comment with a short explanation pleeeease. I'm collecting some arguments for XSLT 2.0, some real world scenarios that are hard with XSLT 1.0, some business cases when XSLT 2.0 would provide an additional value. That's really important if we want to have more than a single XSLT 2.0 implementation...

PS. Of course I've read Kurt's "The Business Case for XSLT 2.0" already.

Update: I failed to stress it enough that it's not me who needs such kind of arguments. We have sort of unique chance to persuade one of software giants (guess which one) to support XSLT 2.0 now.

In a follow up post entitled XSLT 2.0 and Microsoft Unofficial Survey he reveals which of the software giants he is trying to convince to implement XSLT 2.0 where he writes

Moving along business cases Microsoft seeks to implement XSLT 2.0 I'm trying to gather some opinion statistics amongs developers working with XML and XSLT. So I'm holding this survey at the XML Lab site:

Would you like to have XSLT 2.0 implementation in the .NET Framework?

The possible answers are:

Yes, I need XSLT 2.0
Yes, that would be nice to have
No, continue improving XSLT 1.0 impl instead
No, XSLT 1.0 is enough for me

...
Take your chance to influence Microsoft's decision on XSLT 2.0 and win XSLT 2.0 book!

My advice to Oleg, if you want to see XSLT 2.0 in the .NET Framework then gather some like minded souls and build it yourself. Efforts like the MVP.XML library for the .NET Framework shows that there are a bunch of talented developers building cool enhancements to the basic XML story Microsoft provides in the .NET Framework.

I'm not sure how an informal survey in a blog would convince Microsoft one way or the other about implementing a technology. A business case to convince a product team to do something usually involves showing them that they will lose or gain significant marketshare or revenue by making a technology choice. A handful of XML geeks who want to see the latest and greatest XML specs implemented by Microsoft does not a business case make. Unfortunately, this means that Microsoft will tend to be a follower and not a leader in such cases because customer demand and competitive pressure don't occur until other people have implemented and are using the technology. Thus if you want Microsoft to implement XSLT 2.0, you're best bet is to actually have people using it on other platforms or on Microsoft platforms who will clamor for better support instead of relying on informal surveys and comments in your blog.

Just my $0.02 as someone who used to work on the XML team at Microsoft.

Categories: XML

November 28, 2005

@ 06:17 PM

Comments [14]

Tim Bray's Hypocrisy and Competing XML Formats

Tim Bray has a post entitled Thought Experiments where he writes

To keep things short, let’s call OpenDocument Format 1.0 "ODF" and the Office 12 XML File Formats "O12X".

Alternatives · In ODF we have a format that’s already a stable OASIS standard and has multiple shipping implementations. In O12X we have a format that will become a stable ECMA standard with one shipping implementation sometime a year or two from now, depending on software-development and standards-process timetables. ODF is in the process of working its way through ISO, and O12X will apparently be sent down that road too, which should put ISO in an interesting situation.

On the technology side, the two formats are really more alike than they are different. But, there are differences: O12X's design center, Microsoft has said repeatedly, is capturing the exact semantics of the billions of existing Microsoft Office documents. ODF’s design center is general-purpose reusability, and leveraging existing standards like SVG and MathML and so on.

Which do you like better? I know which one I’d pick. But I think we’re missing the point.

Why Are There Two? · Almost all office documents are just paragraphs of text, with some bold and some italics and some lists and some tables and some pictures. Almost all spreadsheets are numbers and labels, with some sums and averages and pivots and simple algebra. Almost all presentations are lists of bullet points with occasional pictures.

The capabilities of ODF and O12X are essentially identical for all this basic stuff. So why in the flaming hell does the world need two incompatible formats to express it? The answer, obviously, is, "it doesn’t".

I find it extremely ironic that one of the driving forces behind creating a redundant and duplicative XML format for website syndication would be one of the first to claim that we only need one XML format to solve any problem. For those who aren't in the know, Tim Bray is one of the chairs of the Atom Working Group in the IETF whose primary goal is to create a competing format to RSS 2.0 which does basically the same thing. In fact Tim Bray has written a decent number of posts attempting to explain why we need multiple XML formats for syndicating blog posts, news and enclosures on the Web.

But let's ignore the messenger and focus on the message. Tim Bray's question is quite fair and in fact he answers it later on in his blog entry. As Tim Bray writes, "Microsoft wants there to be an office-document XML format that covers their billions of legacy documents". That's it in a nutshell. Microsoft created XML versions of its binary document formats like .doc and .xls that had full fidelity with the features of these formats. That way a user can convert a 'legacy' binary Office document to a more interoperable Office XML document without worrying about losing data, formatting or embedded rich media. This is a very important goal for the Microsoft Office team and very different from the goal of the designers of the OpenDocument format.

Is it technically possible to create a 'common shared office-XML dialect for the basics' as Tim Bray suggests? It is. It'll probably take several years (e.g. the Atom syndication format which is simply a derivative of RSS has taken over two years to come to fruition) and once it is done, Microsoft will have to 'embrace and extend' it to meet its primary goal of 100% backwards compatibility with its legacy formats. And that doesn't answer the question of what Microsoft should ship in the meantime with regards to file formats in its Office products. After all, Office 12 is scheduled to ship in the second half of 2006.

There is no simple technical solution on the horizon that will change the fact that there are be multiple XML formats for Office documents. What we need to agree on is the best way forward, not attempt to demonize each other for trying to do what's best for our customers.

Disclaimer: I work at Microsoft. However I do not work in any area related to the Office XML formats. The above is my personal opinion and should not be construed as an expression of the opinions, intents or strategies of my employer.

Categories: XML

November 9, 2005

@ 09:12 PM

Adam Bosworth's "Learning From the Web"

Last week Andrew Conrad told me to check out a recent article by Adam Bosworth in the ACM Queue because he wondered what I thought about. I was rather embarassed to note that althought I'd seen some mention of it online, I hadn't read it. I read it today and as usual, Adam Bosworth is on point.

The article is entitled Learning from THE WEB and it begins by listing eight "unintuitive lessons" we have learned from the Web. The lessons are listed below

Simple, relaxed, sloppily extensible text formats and protocols often work better than complex and efficient binary ones.
It is worth making things simple enough that one can harness Moore’s law in parallel.
It is acceptable to be stale much of the time.
The wisdom of crowds works amazingly well.
People understand a graph composed of tree-like documents (HTML) related by links (URLs).
Pay attention to physics.
Be as loosely coupled as possible.
KISS. Keep it (the design) simple and stupid.

Where the paper gets interesting is that then tries to apply these lessons to XML. Remember that Adam was one of the founder of the XML team at Microsoft and knows a thing or two about it. So he writes

In my humble opinion, however, we ignored or forgot lessons 3, 4, and 5. Lesson 3 tells us that elements in XML with values that are unlikely to change for some known period of time (or where it is acceptable that they are stale for that period of time, such as the title of a book) should be marked to say this. XML has no such model.
...
Lesson 4 says that we shouldn’t over-invest in making schemas universally understood.
...
Lessons 1 and 5 tell us that XML should be easy to understand without schemas

I totally agree with his assessment of the lessons learned from lessons 4 & 5. However the issue of being able to mark an element in an XML file as 'relatively unchanging' in a generic way seems to be lost on me. He then goes on to point out more of the problems with XML [and the Semantic Web/RDF]

There are some interesting implications in all of this.

One is that the Semantic Web is in for a lot of heartbreak. It has been trying for five years to convince the world to use it. It actually has a point. XML is supposed to be self-describing so that loosely coupled works. If you require a shared secret on both sides, then I’d argue the system isn’t loosely coupled, even if the only shared secret is a schema. What’s more, XML itself has three serious weaknesses in this regard:

It doesn’t handle binary data well.

It doesn’t handle links.

XML documents tend to be monolithic.

Now it's gotten pretty interesting and at this point, Adam throws the curve ball.

Recently, an opportunity has arisen to transcend these limitations. RSS 2.0 has become an extremely popular format on the Web. RSS 2.0 and Atom (which is essentially isomorphic) both support a base schema that provides a model for sets. Atom’s general model is a container (a <feed>) of <entry> elements in which each <entry> may contain any namespace scoped elements it chooses (thus any XML), must contain a small number of required elements (<id>, <updated>, and <title>), and may contain some other well-known ones in the Atom namespace such as <link>s. Even better, Atom clearly says that the order doesn’t matter.This immediately gives a simple model for sets missing in XML.
...
Atom also supports links of other sorts, such as comments, so clearly an Atom entry can contain links to related feeds (e.g., Reviews for a Restaurant or Complaints for a Customer) or links to specific posts. This gives us the network and graph model that is missing in XML. Atom contains a simple HTTP-based way to INSERT, DELETE, and REPLACE s within a . There is a killer app for all these documents because the browsers already can view RSS 2.0 and Atom and, hopefully, will soon natively support the Atom protocol as well, which would mean read and write capabilities.

Now that's deep. Why not move up one level of abstraction from exchanging XML documents to exchanging Web Feeds (RSS/Atom documents)? Adam ends his article by throwing a challenge out to database vendors who he believes have failed to learn the lessons of the Web by writing

All of this has profound implications for databases. Today databases violate essentially every lesson we have learned from the Web.

Are simple relaxed text formats and protocols supported? No.

Have databases enabled people to harness Moore’s law in parallel? This would mean that databases could scale more or less linearly to handle both the volume of the requests coming in and even the complexity. The answer is no.

Do databases optimize caching when it is OK to be stale? No.

Do databases let schemas evolve for a set of items using a bottom-up consensus/tipping point? Obviously not.

Do databases handle flexible graphs (or trees) well? No, they do not.

Have the databases learned from the Web and made their queries simple and flexible? No, just ask a database if it has anyone who, if they have an age, are older than 40; and if they have a city, live in New York; and if they have an income, earn more than $100,000. This is a nightmare because of all the tests for NULL.

The article ends by arguing that database vendors should add native support for the Atom Protocol and wire format. I find this interesting since based on conversations on the atom-protocol list, it is clear that Google is very interested in the Atom API. Perhaps they have already built this Atom store that Adam is arguing for and will expose the Atom API as a way to interact with it. Perhaps this Atom store accessible via Atom feeds and the Atom API is Google Base? Speculation is fun.

As for me, I tend to agree with Adam that moving up layers of abstraction is a good idea. We've all agreed on XML, the next thing to do is to agree on applications of XML. We've all agreed on RSS, the next thing to do is figure out what scenarios are enabled by the subscribe model. This is one of the reasons why I disliked the unnecessary fragmentation caused by the RSS vs. Atom battles. As for whether we need to start seeing databases with native RSS/Atom support, I think it's too early in the game to jump there. Heck, RDF has been around for a while but we are just know seeing some decent things happening with SPARQL and various RDF stores. Similarly with XML and XQuery. I don't think enough lessons have been learned from either to start thinking about what it would mean to have a native RSS/Atom store. It is an interesting idea though.

Categories: XML

October 18, 2005

@ 07:08 PM

The Myth of the Office XML Binary Key

A recent comment on the Groklaw blog entitled Which Binary Key? claims that one needs a "binary key" to consume XML produced by Microsoft Office 2003. Specifically the post claims

No_Axe speaks as if MS Office 12 had already been released and everyone was using it. He assumes everyone knows the binary key is gone. Yet Microsoft is saying that MS Office 12 is more or less a year away from release. So who really knows when and if the binary key has been dropped? All i know is that MSXML 12 is not available today. And that MSXML 2003 has a binary key in the header of every file.
...
So let me close with this last comment on the fabled “binary key”. In March of 2005, when phase II of the ODF TC work was complete, and the specification had been prepared for both OASIS and ISO ratification, the ODF TC took up the issue of “compliance and conformance” testing. Specifically, we decided to start work on a compliance testing suite that would be useful for developers and application providers to perfect their implementations of ODF. Guess who's XML file format was the first test target? Right. And guess what the problem is with MSXML? Right. It's the binary key. We can't do even a simple transformation between MSXML and ODF!

As someone who's used the XML features of Excel and Word, I know for a fact that you don't need a "binary key" to process the files using traditional XML tools. Brian Jones, who works on a number of the XML features in Office, has a post entitled The myth of the Binary Key where he mentions various parts of the Office XML formats that may confuse one into thinking they are some sort of "binary key" such as namespace URIs, processing instructions and Base64 encoded binary data. All of these are standard aspects of XML which one may typically doesn't see in simple uses of the technology such as in RSS feeds.

Being that I used to work on the XML team there is one thing I want to add the Brian's list which often confuses people trying to process XML; the unicode byte order mark (BOM). This is often at the beginning of documents saved in UTF-16 or UTF-8 encoding on Windows. However as the Wikipedia entry on BOM's states

In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string, to indicate that the encoded characters that follow it use big-endian byte order; or it is expressed as the byte sequence FF FE to indicate little-endian order.

Whilst UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.

I wouldn't be surprised if the alleged "binary key" was just a byte order mark which caused problems when trying to process the XML file using non-Unicode savvy tools. I suspect some of the ODF folks who had problems with the XML file would get some use out of Sam Ruby's Just Use XML talk at this year's XML 2005 conference.

Categories: XML

September 30, 2005

@ 08:14 PM

On Crappy XML Formats

There have been a number of amusing discussions in the recent back and forth between Robert Scoble and several others on whether OPML is a crappy XML format. In posts such as OPML "crappy" Robertson says and More on crappy formats Robert defends OPML. I've seen some really poor arguments made as people rushed to bash Dave Winer and OPML but none made me want to join the discussion until this morning.

In the post Some one has to say it again… brainwagon writes

Take for example Mark Pilgrim's comments:

I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML.

The reason just isn't that programmers are lazy (we are, but we also like stuff to work). The fact is that the specification itself is ambiguous and weak enough that nobody really knows what it means. As a result, there are all sorts of flavors of RSS out there, and parsing them is a big hassle.

The promise of XML was that you could ignore the format and manipulate data using standard off-the-shelf-tools. But that promise is largely negated by the ambiguity in the specification, which results in ill-formed RSS feeds, which cannot be parsed by standard XML feeds. Since Dave Winer himself managed to get it wrong as late as the date of the above article (probably due to an error that I myself have done, cutting and pasting unsafe text into Wordpress) we really can't say that it's because people don't understand the specification unless we are willing to state that Dave himself doesn't understand the specification.

As someone who has (i) written a moderately popular RSS reader and (ii) worked on the XML team at Microsoft for three years, I know a thing or two about XML-related specifications. Blaming malformed XML in RSS feeds on the RSS specification is silly. That's like blaming the large number of HTML pages that don't validate on the W3C's HTML specification instead of on the fact that instead of erroring on invalid web pages web browsers actually try to render them. If web browsers didn't render invalid web pages then they wouldn't exist on the Web.

Similarly, if every aggregator rejected invalid feeds then they wouldn't exist. However, just like in the browser wars, aggregator authors consider it a competitive advantage to be able to handle malformed feeds. This has nothing to do with the quality of the RSS specification [or the HTML specification], this is all about applications trying to get marketshare.

As for whether OPML is a crappy spec? I've had to read a lot of technology specifications in my day from W3C recommendations and IETF RFCs to API documentation and informal specs. They all suck in their own ways. However experience has thought me that the bigger the spec, the more it sucks. Given that, I'd rather have a short, human readable spec that sucks a little (e.g. RSS, XML-RPC, OPML etc.) than a large, jargon filled specificaton which sucks a whole lot more (e.g. WSDL, XML Schema, C++, etc). Then there's the issue of using the right tool for the job but I'll leave that rant for another day.

Categories: XML

September 18, 2005

@ 04:15 AM

Questioning RDF

I've been a long time skeptic when it comes to RDF and the Semantic Web. Every once in a while I wonder if perhaps what I have a problem with is the W3C's vision of the Semantic Web as opposed to RDF itself. However in previous attempts to explore RDF I've been surprised to find that its proponents seem to ignore some of the real world problems facing developers when trying to use RDF as a basis for information integration.

Recently I've come across blog posts by RDF proponents who've begun to question the technology. The first is the blog post entitled Crises by Ian Davis where he wrote

We were discussing the progress of the Dublin Core RDF task force and there were a number of agenda items under discussion. We didn’t get past the first item though - it was so hairy and ugly that no-one could agree on the right approach. The essence of the problem is best illustrated by the dc:creator term. The current definition says An entity primarily responsible for making the content of the resource. The associated comments states Typically, the name of a Creator should be used to indicate the entity and this is exactly the most common usage. Most people, most of the time use a person’s name as the value of this term. That’s the natural mode if you write it in an HTML meta tag and it’s the way tens or hundreds of thousands of records have been written over the past six years...Of course, us RDFers, with our penchant for precision and accuracy take issue with the notion of using a string to denote an “entity”. Is it an entity or the name of an entity. Most of us prefer to add some structure to dc:creator, perhaps using a foaf:Person as the value. It lets us make more assertions about the creator entity.

The problem, if it isn’t immediately obvious, is that in RDF and RDFS it’s impossible to specify that a property can have a literal value but not a resource or vice versa. When I ask “what is the email address of the creator of this resource?” what should the (non-OWL) query engine return when the value of creator is a literal? It isn’t a new issue, and is discussed in-depth on the FOAF wiki.

There are several proposals for dealing with this. The one that seemed to get the most support was to recommend the latter approach and make the first illegal. That means making hundreds of thousands of documents invalid. A second approach was to endorse current practice and change the semantics of the dc:creator term to explictly mean the name of the creator and invent a new term (e.g. creatingEntity) to represent the structured approach.
...
That’s when my crisis struck. I was sitting at the world’s foremost metadata conference in a room full of people who cared deeply about the quality of metadata and we were discussing scraping data from descriptions! Scraping metadata from Dublin Core! I had to go check the dictionary entry for oxymoron just in case that sentence was there! If professional cataloguers are having these kinds of problems with RDF then we are fucked.

It says to me that the looseness of the model is introducing far too much complexity as evidenced by the difficulties being experienced by the Dublin Core community and the W3C HTML working group. A simpler RDF could take a lot of this pain away and hit a sweet spot of simplicity versus expressivity.

Ian Davis isn't the only RDF head wondering whether there is too much complexity involved when trying to use RDF to get things done. Uche Ogbuji also has a post entitled Is RDF moving beyond the desperate hacker? And what of Microformats? where he writes

I've always taken a desperate hacker approach to RDF. I became a convert to the XML way of expressing documents right away, in 1997. As I started building systems that managed collections of XML documents I was missing a good, declarative means for binding such documents together. I came across RDF, and I was sold. I was never really a Semantic Web head. I used RDF more as a desperate hacker with problems in a fairly well-contained domain.
...
I've developed an overall impression of dismay at the latest RDF model semantics specs. I've always had a problem with Topic Maps because I think that they complicate things in search of an unnecessary level of ontological purity. Well, it seems to me that RDF has done the same thing. I get the feeling that in trying to achieve the ontological purity needed for the Semantic Web, it's starting to leave the desperate hacker behind. I used to be confident I could instruct people on almost all of RDF's core model in an hour. I'm no longer so confident, and the reality is that any technology that takes longer than that to encompass is doomed to failure on the Web. If they think that Web punters will be willing to make sense of the baroque thicket of lemmas (yes, "lemmas", mi amici docte) that now lie at the heart of RDF, or to get their heads around such bizarre concepts as assigning identity to literal values, they are sorely mistaken. Now I hear the argument that one does not need to know hedge automata to use RELAX NG, and all that, but I don't think it applies in the case of RDF. In RDF, the model semantics are the primary reason for coming to the party. I don't see it as an optional formalization. Maybe I'm wrong about that and it's the need to write a query language for RDF (hardly typical for the Web punter) that is causing me to gurgle in the muck. Assuming it were time for a desperate hacker such as me to move on (and I'm not necessarily saying that I am moving on), where would he go from here?

Uche is one of the few RDF heads whose opinions seem grounded in practicality (Joshua Allen is another) so it is definitely interesting to see him begin to question whether RDF is the right path.

I definitely think there is some merit to disconnecting RDF from the Semantic Web and seeing if it can hang on its own from that perspective. For example, XML as a Web document format is mostly dead-on-arrival but it has found a wide variety of uses as a general data interchange format instead. I've wondered if there is similar usefulness lurking within RDF once it loses its Semantic Web baggage.

Categories: Web Development | XML

September 16, 2005

@ 05:27 PM

XLinq and Visual Basic 9

The announcements from about Microsoft's Linq project just keep getting better and better. In his post XML, Dynamic Languages, and VB, Mike Champion writes

Thursday at PDC saw lots of details being put out about another big project our team has been working on -- the deep support for XML in Visual Basic 9...On the VB9 front, the big news is that two major features beyond and on top of LINQ will be supported in VB9:

"XML Literals" is the ability to embed XML syntax directly into VB code. For example,

Dim ele as XElement = <Customer/>

Is translated by the compiler to

Dim ele as XElement = new XElement("Customer")

The syntax further allows "expression holes" much like those in ASP.NET where computed values can be inserted.

"Late Bound XML" is the ability to reference XML elements and attributes directly in VB syntax rather than having to call navigation functions. For example

Dim books as IEnumerable(Of XElement) = bib.book

Is translated by the compiler to

Dim books as IEnumerable(Of XElement) = bib.Elements("book")

We believe that these features will make XML even more accessible to Visual Basic's core audience. Erik Meijer, a hard core languages geek who helped devise the Haskell functional programming language and the experimental XML processing languages X#, Xen, and C-Omega, now touts VB9 as his favorite.

Erik Meijer and I used to spend a lot of time talking about XML integration into popular programming languages back when I was on the XML team. In fact, all the patent cubes on my desk are related to work we did together in this area. I'm glad to see that some of the ideas we tossed around are going to make it out to developers in the near future. This is great news.

Categories: XML

September 15, 2005

@ 03:43 PM

Integrating XML into Programming Languages: Diving Into XLinq

You know you're a geek when it's not even 7AM but you've already spent half the morning reading a whitepaper about Microsoft's plans to integrate XML and relational query language functionality into the .NET Framework with Linq. C# 3.0 is going to be hot.

Like it's forefather X# ~~Xen~~ Cω, XLinq does an amazing job of integrating XML directly into the Common Language Runtime and the C#/VB.NET programming languages. Below are some code samples to whet your appetite until I can get around to writing an article later this year

Creating an XML document

XDocument contactsDoc =
    new XDocument(
      new XDeclaration("1.0", "UTF-8", "yes"),
      new XComment("XLinq Contacts XML Example"),
      new XProcessingInstruction("MyApp", "123-44-4444"),
        new XElement("contacts",
          new XElement("contact",
            new XElement("name","Patrick Hines"),
            new XElement("phone", "206-555-0144"),
            new XElement("address",
           new XElement("street1", "123 Main St"),
             new XElement("city", "Mercer Island"),
             new XElement("state", "WA"),
             new XElement("postal", "68042")
                        )
                      )
                    )
                  );
Creating an XML element in the "http://example.com" namespace

XElement contacts = new XElement("{http://example.com}contacts");
Loading an XML element from a file

XElement contactsFromFile = XElement.Load(@"c:\myContactList.xml");
Writing out an array of Person objects as an XML file

class Person {
        public string Name;
        public string[] PhoneNumbers;
}

var persons = new [] { new Person
                                                            {Name="Patrick Hines",
                               PhoneNumbers = new string[]
                                                                                    {"206-555-0144", "425-555-0145"}
                               },
                       new Person {Name="Gretchen Rivas",
                                   PhoneNumbers = new string[]
                                                                                    {"206-555-0163"}
                               }
                      };

XElement contacts = new XElement("contacts",
                        from p in persons
                        select new XElement("contact",
                            new XElement("name", p.Name),
                            from ph in p.PhoneNumbers
                            select new XElement("phone", ph)
                        )
                    );

Console.WriteLine(contacts);
Print out all the element nodes that are children of the <contact> element

foreach (x in contact.Elements()) {
Console.WriteLine(x);
}
Print all the <phone> elements that are children of the <contact> element

foreach (x in contact.Elements("phone")) {
Console.WriteLine(x);
}
Adding a <phone> element as a child of the <contact> element

XElement mobilePhone = new XElement("phone", "206-555-0168");
contact.Add(mobilePhone);
Adding a <phone> element as a sibling of another <phone> element

XElement mobilePhone = new XElement("phone", "206-555-0168");
XElement firstPhone = contact.Element("phone");
firstPhone.AddAfterThis(mobilePhone);
Adding an <address> element as a child of the <contact> element

contact.Add(new XElement("address",
              new XElement("street", "123 Main St"),
              new XElement("city", "Mercer Island"),
              new XElement("state", "WA"),
              new XElement("country", "USA"),
              new XElement("postalCode", "68042")
            ));
Deleting all <phone> elements under a <contact> element

contact.Elements("phone").Remove();
Delete all children of the <address> element which is a child of the <contact> element

contacts.Element("contact").Element("address").RemoveContent();
Replacing the content of the <phone> element under a <contact> element

contact.Element("phone").ReplaceContent("425-555-0155");
Alternate technique for replacing the content of the <phone> element under a <contact> element

contact.SetElement("phone", "425-555-0155");
Creating a contact element with attributes multiple phone number types

XElement contact =
      new XElement("contact",
            new XElement("name", "Patrick Hines"),
            new XElement("phone",
                  new XAttribute("type", "home"),
                  "206-555-0144"
            ),
            new XElement("phone",
                  new XAttribute("type", "work"),
                  "425-555-0145"
            )
      );
Printing the value of the <phone> element whose type attribute has the value "home"

foreach (p in contact.Elements("phone")) {
            if ((string)p.Attribute("type") == "home")
                Console.Write("Home phone is: " + (string)p);
   }
Deleting the type attribute of the first <phone> element under the <contact> element

contact.Elements("phone").First().Attribute("type").Remove();
Transforming our original <contacts> element to a new <contacts> element containing a list of <contact> elements whose children are <name> and <phoneNumbers>

new XElement("contacts",
      from c in contacts.Elements("contact")
      select new XElement("contact",
            c.Element("name"),
            new XElement("phoneNumbers", c.Elements("phone"))
      )
);
Retrieving the names of all the contacts from Washington, sorted alphabetically

from c in contacts.Elements("contact")
where (string) c.Element("address").Element("state") == "WA"
orderby (string) c.Element("name")
select (string) c.Element("name");

All examples were taken from the XLinq: .NET Language Integrated Query for XML Data white paper.

Categories: XML

September 13, 2005

@ 11:02 PM

Microsoft announces LINQ (and XLinq)

The former co-workers (the Microsoft XML team) have been hard at work with the C# language team to bring the XML query integration into the core languages for the .NET Framework. From Dave Remy's post Anders unveils LINQ! (and XLinq) we learn

In Jim Allchin's keynote At PDC2005 today Anders Hejlsberg showed the LINQ project for the first time. LINQ stands for Language Integrated Query. The big idea behind LINQ is to provide a consistent query experience across different "LINQ enabled" data access technologies AND to allow querying these different data access technologies in a single query. Out of the box there are three LINQ enabled data access technologies that are being shown at PDC. The first is any in-memory .NET collection that you foreach over (any .NET collection that implements IEnumerable<T>). The second is DLinq which provides LINQ over a strongly typed relational database layer. The third, which I have been working on for the last 6 months or so (along with Anders and others on the WebData XML team), is XLinq, a new in-memory XML programming API that is Language Integerated Query enabled. It is great to get the chance to get this technology to the next stage of development and get all of you involved. The LINQ Preview bits (incuding XLinq and DLinq) are being made available to PDC attendees. More information on the LINQ project (including the preview bits) are also available online at http://msdn.microsoft.com/netframework/future/linq

This is pretty innovative stuff and I definitely can't wait to download the bits when I get some free time. Perhaps I need to write an article exploring LINQ for XML.com the way I did with my Introducing C-Omega article? Then again, I still haven't updated my C# vs. Java comparison to account for C# 2.0 and Java 1.5. It looks like I'll be writing a bunch of programming language articles this fall.

Which article would you rather see?

Categories: XML

August 8, 2005

@ 01:47 PM

Microformats vs. XML vs. RDF

In response to my post Using XML on the Web is Evil, Since When? Tantek updated his post Avoiding Plain XML and Presentational Markup. Since I'm the kind of person who can't avoid a good debate even when I'm on vacation I've decided to post a response to Tantek's response. Tantek wrote

The sad thing is that while namespaces theoretically addressed one of the problems I pointed out (calling different things by the same name), it actually WORSENED the other problem: calling the same thing by different names. XML Namespaces encouraged document/data silos, with little or no reuse, probably because every person/political body defining their elements wanted "control" over the definition of any particular thing in their documents. The <svg:a> tag is the perfect example of needless duplication.

And if something was theoretically supposed to have solved something but effectively hasn't 6-7 years later, then in our internet-time-frame, it has failed.

This is a valid problem in the real world. For example, for all intents an purposes an <atom:entry> element in an Atom feed is semantically equivalent to an <item> element in an RSS feed to every feed reader that supports both. However we have two names for what is effectively the same thing as far as an aggregator developer or end user is concerned.

The XML solution to this problem has been that it is OK to have myriad formats as long as we have technologies for performing syntactic translations between XML vocabularies such as XSLT. The RDF solution is for us to agree on the semantics of the data in the format (i.e. a canonical data model for that problem space) in which case alternative syntaxes are fine and we performs translations using RDF-based mapping technologies like DAML+OIL or OWL. The microformat solution which Tantek espouses is that we all agree on a canonical data model and a canonical syntax (typically some subset of [X]HTML).

So far the approach that has gotten the most traction in the real world is XML. From my perspective, the reason for this is obvious; it doesn't require that everyone has to agree on a single data model or a single format for that problem space.

Microformats don't solve the problem of different entities coming up with the different names for the same concept. Instead its proponents are ignoring the reasons why the problem exists in the first place and then offering microformats as a panacea when they are not.

I personally haven't seen a good explanation of why <strong> is better than <b>...

A statement like that begs some homework. The accessibility, media independence, alternative devices, and web design communities have all figured this out years ago. This is Semantic (X)HTML 101. Please read any modern web design book like those on my SXSW Required Reading List, and we'll continue the discussion afterwards.

I can see the reasons for a number of the semantic markup guidelines in the case of HTML. What I don't agree with is jumping to the conclusion that markup languages should never have presentational markup. This is basically arguing that every markup language that may be used as a presentation format should use CSS or invent a CSS equivalent. I think that is a stretch.

Finally, one has to seriously cast doubt on XML opinions on a page that is INVALID markup. I suppose following the XML-way, I should have simply stopped reading Dare's post as soon as I ran into the first well-formedness error. Only 1/2 ;)

The original permalink to Tantek's article was broken after he made teh edit. I guess since I couldn't find it, it doesn't exist. ;)

Categories: Web Development | XML

July 28, 2005

@ 09:48 AM

Comments [19]

Using XML on the Web is Evil, Since When?

I've been reading some of the hype around microformats in certain blogs with some amusement. I have been ignoring microformats but now I see that some of its proponents have started claiming that using XML on the Web is bad and instead HTML is the only markup language we'll ever need.

In her post Why generic XML on the Web is a bad idea Anne van Kesteren writes

Of course, using XML or even RDF serialized as XML you can describe your content much better and in far more detail, but there is no search engine out there that will understand you. For RDF there is a chance one day they will. Generic XML on the other hand will always fail to work. (Semantics will not be extracted.)

An example that shows the difference more clearly:
<em>Look at me when I talk to you!</em>
… and:
<angry>Look at me when I talk to you!</angry>
The latter element describes the content probably more accurately, but on ‘the web’ it means close to nothing. Because on the web it’s not humans who come by and try to parse the text, they already know how to read something correctly. No, software comes along and tries to make something meaningful of the above. As the latter is in a namespace no software will know and the latter is also not specified somewhere in a specification it will be ignored. The former however has been here since the beginning of HTML — even before it’s often wrongly considered presentational equivalent I — and will be recognized by software.

This post in itself isn't that bad, if anything it is just somewhat misguided. However Tantek Celik followed it up with his post Avoiding plain XML and presentational markup which boggled my mind. Tantek wrote

The marketing message of XML has been for people to develop their own tags to express whatever they wanted, rather than being stuck with the limited predefined tag set in HTML. This approach has often been labeled "plain XML" or "generic XML" or "SGML, but easier, better, and designed just for the Web".

The problem with this approach is that while having the freedom to make up all your own tags and attributes sounds like a huge improvement over the (mostly perceived) limits of HTML, making up your own XML has numerous problems, both for the author, and for users / readers, especially when sharing with others (e.g. anything you publish on the Web) is important.

This post by no means contains a complete set of arguments against plain/generic XML and presentational markup, nor are the arguments presented as definitive proofs. Mostly I wanted to share a bunch of reinforcing resources in one place. Readers are encouraged to improve upon the arguments made here.

The original impetus for creating XML was to enable SGML on the Web. People had become frustrated with the limited tag set in HTML and the solution was to create a language that enabled content creators to create their own tags yet have them still readable in browsers via stylesheet technologies (e.g. CSS). Over time, XML has failed to take off as a generic document format used by content authors for creating human readable documents on the Web but has become popular as a data format used for machine to machine communications on the Web(RSS, XML-RPC, SOAP, etc) .

Thus any arguments against XML usage on the Web today are really arguing about using XML as a data format since it isn't really used as a document format except for XHTML [and even that is only by markup geeks like Tantek & Anne].

Anyway let's look at some of Tantek's arguments against using XML on the Web...

Tower of Babel Problem

If everyone invents their own tags and attributes, pretty soon you get people calling the same thing by different names and different things by the same name. While avoid both of those occurences completely is very difficult (many of the microformats principles are design to help avoid those problems), downright encouraging authors to make up their own tags and attributes makes it much worse and all you end up with are a bunch of documents that give you the illusion of self-description.

Didn't the XML world solve this with XML namespaces like six or seven years ago?

Temptation of Presentational Markup

What happens all too often when authors or developers make up their own tags is that they choose tags that are tightly tied to a specific presentation rather than abstracting them with semantics. Quite similar to the phenomenon of authors picking presentational class names.

As a casual user of HTML, I personally haven't seen a good explanation of why <strong> is better than <b> so arguments whose entire basis is "presentational markup is evil" don't carry much weight in my book. If I come up with a custom markup format and it has a <bold> element, is that really so evil? I'm pretty sure that the XML formats used by OpenOffice or Microsoft Office contain markup that is presentational in nature whether it is setting font sizes, text colors or paragraph alignemnt. Are they evil or does the fact that they aren't intended for the Web give them a pass?

Preferring Semantic Richness

Sometimes something is a bad idea not just in absolute terms, but also relative to other approaches and solutions.

A while ago I wrote about a semantic richness spectrum on the www-style mailing list which went into a bit more detail. Håkon Wium Lie wrote a paper that both predated my rough summary by a couple of years, and provided a much more thorough analysis.

Languages with well-known semantics are preferred to proprietary/made-up XML. This is for many reasons, including accessibility, cross-device support, and future user agent support.

This seems to be arguing that instead of cooking up your own custom format you should pick an established format with the semantics you want if one exists. This is regularly practiced in the XML world especially when it comes to the Web so I don't see how this is an argument against using XML.

--

Seriously, I feel like I am in some bizarre alternate universe if having aggregators subscribe to HTML web pages is being advocated as being a better idea than using specialized XML formats like RSS & Atom.

That's it...I'm going back to my vacation. The world has gone too loopy for me.

Categories: XML

July 13, 2005

@ 01:36 PM

Comments [11]

Hacking MSN Virtual Earth

I stumbled on Bus Monster last week and even though I don't take the bus I thought it was a pretty cool application. There's a mapping application that I've been wanting for a few years and I instantly realized that given the Google Maps API I could just write it myself.

Before starting I shot a mail off to Chandu and Steve on the MSN Virtual Earth team and asked if their API would be able to support building the application I wanted. They were like "Hell Yeah" and instead of working on my review I started hacking on Virtual Earth. In an afternoon hacking session, I discovered that I could build the app I wanted and learned new words like geocoding.

My hack should be running internally on my web server at Microsoft before the end of the week. Whenever Virtual Earth goes live I'll move the app to my personal web site. I definitely learned something new with this application and will consider Hacking MSN Virtual Earth as a possible topic for a future Extreme XML column on MSDN. Would anyone be interested in that?

Categories: MSN | Web Development | XML

June 28, 2005

@ 05:08 PM

Apple Embraces and Extends RSS with iTunes 4.9

Today I learned that Apple brings podcasts into iTunes which is excellent news. This will definitely push subscribing to music and videos via RSS feeds into the mainstream. I wonder how long it'll take MTV to start providing podcast feeds.

One interesting aspect of the announcement which I didn't see in any of the mainstream media coverage was pointed out to me in Danny Ayers's post Apple - iTunes - Podcasting where he wrote

Apple - iTunes - Podcasting and another RSS 2.0 extension (PDF). There are about a dozen new elements (or “tags” as they quaintly describe them) but they don’t seem to add anything new. I think virtually everything here is either already covered by RSS 2.0 itself, except maybe tweaked to apply to the podcast rather than the item.
They’ve got their own little category taxonomy and this delightful thing:

<itunes :explicit>
This tag should be used to note whether or not your Podcast contains explicit material.
There are 2 possible values for this tag: Yes or No

I wondered at first glance whether this was so you could tell when you were dealing with good data or pure tag soup. However, the word has developed a new meaning:

If you populate this tag with “Yes”, a parental advisory tag will appear next to your Podcast cover art on the iTunes Music Store
This tag is applicable to both Channel & Item elements.

So, in summary it’s a bit of a proprietary thing, released as a fait accompli. Ok if you’re targetting for iTunes, for anything else use Yahoo! Media RSS . I wonder where interop went.

This sounds interesting. So now developers of RSS readers that want to consume podcasts have to know how to consume the RSS 2.0 <enclosure> element, Yahoo!'s extensions to RSS and Apple's extensions to RSS to make sure they cover all the bases. Similarly publishers of podcasts also have to figure out which ones they want to publish as well.

I guess all that's left is for Real Networks and Microsoft to publish their own extensions to RSS for dealing with providing audio and video metadata in RSS feeds to make it all complete. This definitely complicates my plans for adding podcasting support to RSS Bandit. And I thought the RSS 1.0 vs. RSS 2.0 vs. Atom discussions were exciting. Welcome to the world of syndication.

PS: The title of this post is somewhat tongue in cheek. It was inspired by Slashdot's headline over the weekend titled Microsoft To Extend RSS about Microsoft's creation of an RSS module for making syndicating lists work better in RSS. Similar headlines haven't been run about Yahoo! or Apple's extensions to RSS but that's to be expected since we're Microsoft. ;)

Categories: Syndication Technology | XML

June 20, 2005

@ 04:08 PM

Joe Wilcox on Microsoft's Office Open XML formats

Joe Wilcox has a post that has me scratching my head today. In his post Even More on New Office File Formats, he writes

Friday's eWeek story about Microsoft XML-based formats certainly raises some questions about how open they really are. Assuming reporter Pater Galli has his facts straight, Microsoft's formats license "is incompatible with the GNU General Public License and will thus prevent many free and open-source software projects from using the formats." Earlier this month, I raised different concerns about the new formats openness.

To reiterate a point I made a few weeks ago: Microsoft's new Office formats are not XML. The company may call them "Microsoft Office Open XML Fromats," but they are XML-based, which is nowhere the same as being XML or open, as has been widely misreported by many blogsites and news outlets.

There are two points I'd like to make here. The first is that "being GPL compatible" isn't a definition of 'open' that I've ever heard anyone make. It isn't even the definition of Open Source or Free Software (as in speech). Heck, even the GNU website has a long list of Open Source licenses that are incompatible with the GPL. You'll notice that this list includes the original BSD license, the Apache license, the Zope license, and the Mozilla public license. I doubt that EWeek will be writing articles about how Apache and Mozilla are not 'open' because they aren't GPL compatible.

Secondly, it's completely unclear to me what distinction Joe Wilcox is making between being XML and being XML-based. The Microsoft Office Open XML formats are XML formats. They are stored on the hard drive as compressed XML files using standard compression techniques that are widely available on most platforms. Compressing an XML file doesn't change the fact that it is XML. Reading his linked posts doesn't provide any insight into whether this is the distinction Joe Wilcox is making or whether there is another. Anyone have any ideas about this?

Categories: XML

June 8, 2005

@ 02:41 PM

.NET vs. J2EE Performance Benchmarks for XML and XML Web Services

About a year ago, the folks at Sun Microsystems came up with a bunch of benchmarks that showed that XML parsing in Java was much faster than on the .NET Framework. On the XML team at Microsoft we took this as a challenge to do much better in the next version of the .NET Framework. To see how much we improved, you can check out A comparison of XML performance on .NET 2.0 Beta2, .NET 1.1, and Sun Java 1.5 Platforms which is available on MSDN.

In the three test cases, Java 1.5 XML APIs are faster than those in the .NET Framework v1.1 both of which are about half as fast as the XML APIs in .NET Framework v2.0. The source code for the various tests is available so individuals can confirm the results for themselves on their own configurations.

A lot of the improvements in XML parsing on the .NET Framework are due to the excellent work of Helena Kupkova. She is also the author of the excellent XmlBookMarkReader. Great stuff.

For the XML web services geeks there is also a similar comparison of XML Web Services Performance for .NET 2.0 Beta2, .NET 1.1, Sun JWSDP 1.5 and IBM WebSphere 6.0.

Categories: XML | XML Web Services

June 4, 2005

@ 08:07 AM

Microsoft Office and XML: Why not the OASIS OpenOffice.org XML format?

Since the recent announcement that the next version of Microsoft Office would move to open XML formats as the default file format in the next version, I've seen some questions raised about why the OpenOffice.org XML formats which were standardized with OASIS weren't used. This point is addressed in a comment by Jean Paoli in the article Microsoft to 'Open' Office File Formats which is excerpted below

"We have legacy here," Jean Paoli, Senior Microsoft XML Architect, told BetaNews. "It is our responsibility to our users to provide a full fidelity format. We didn't see any alternative; believe me we thought about it. Without backward compatibility we would have other problems."

"Yes this is proprietary and not defined by a standards body, but it can be used by and interoperable with others. They don't need Microsoft software to read and write. It is not an open standard but an open format," Paoli explained.

When asked why Microsoft did not use the OASIS (Organization for the Advancement of Structured Information Standards) OpenOffice.org XML file format, Paoli answered, "Sun standardized their own. We could have used a format from others and shoehorned in functionality, but our design needs to be different because we have 400 million legacy users. Moving 400 million users to XML is a complex problem."

There is also somewhat of a double standard at play here. The fact that we are Microsoft means that we will get beaten up by detractors no matter what we do. When Sun announced Java ~~1.5~~ 5.0 with a feature set that looked a lot like those in C#, I don't remember anyone asking why they continued to invest in their proprietary programming language and platform instead of just using C# and the CLI which have been standardized by both ECMA and the ISO. If Microsoft had modified the OpenOffice.org XML file format so that it was 100% backwards compatible with the previous versions of Microsoft Office it is likely that same people would be yelling "embrace and extend". I'm glad the Office guys went the route they chose instead. Use the right tool for the job instead of trying to turn a screwdriver into a hammer.

It's a really powerful thing that the most popular Office productivity suite on the planet is wholeheartedly embracing open formats and XML. It's unfortunate that some want to mar this announcement with partisan slings and arrows instead of recognizing the goodness that will come from ending the era of closed binary document formats on the desktop.

Categories: XML

June 2, 2005

@ 12:08 PM

Next Version of Office Will Use XML as the Default File Format

About two and half years ago, I was hanging out with several members of the Office team as they gave the details about how Office 2003 would support XML file formats at XML 2002. Now that it's 2005, juicy information like that is now transmitted using blogs.

Brian Jones has a blog post entitled New default XML formats in the next version of Office were he reveals some of the details of XML support in the next version of Office. He writes

Open XML Formats Overview

To summarize really quickly what’s going on, there will be new XML formats for Word, Excel, and PowerPoint in the next version of Office, and they will be the default for each. Without getting too technical, here are some basic points I think are important:

Open Format: These formats use XML and ZIP, and they will be fully documented. Anyone will be able to get the full specs on the formats and there will be a royalty free license for anyone that wants to work with the files.

Compressed: Files saved in these new XML formats are less than 50% the size of the equivalent file saved in the binary formats. This is because we take all of the XML parts that make up any given file, and then we ZIP them. We chose ZIP because it’s already widely in use today and we wanted these files to be easy to work with. (ZIP is a great container format. Of course I’m not the only one who thinks so… a number of other applications also use ZIP for their files too.)

Robust: Between the usage of XML, ZIP, and good documentation the files get a lot more robust. By compartmentalizing our files into multiple parts within the ZIP, it becomes a lot less likely that an entire file will be corrupted (instead of just individual parts). The files are also a lot easier to work with, so it’s less likely that people working on the files outside of Office will cause corruptions.

Backward compatible: There will be updates to Office 2000, XP, and 2003 that will allow those versions to read and write this new format. You don’t have to use the new version of Office to take advantage of these formats. (I think this is really cool. I was a big proponent of doing this work)

Binary Format support: You can still use the current binary formats with the new version of Office. In fact, people can easily change to use the binary formats as the default if that’s what they’d rather do.

New Extensions: The new formats will use new extensions (.docx, .pptx, .xlsx) so you can tell what format the files you are dealing with are, but to the average end user they’ll still just behave like any other Office file. Double click & it opens in the right application.

...

Whitepapers

The Microsoft Office Open XML Formats: New File Formats for "Office 12"

http://download.microsoft.com/download/c/2/9/c2935f83-1a10-4e4a-a137-c1db829637f5/Office12NewFileFormatsWP.doc

The Microsoft Office Open XML Formats: Preview for Developers

http://download.microsoft.com/download/c/2/9/c2935f83-1a10-4e4a-a137-c1db829637f5/Office12FileFormatDevPreviewWP.doc

This is totally awesome news. I remember asking, back in 2002, why Powerpoint didn't have an XML file format and the answer was that it was due to schedule constraints but it would be fixed in the next version. Not only did the Office guys keep their word but they went above and beyond.

This should make Sam Ruby happy.

Categories: XML

May 18, 2005

@ 02:01 PM

Jonathan Marsh On XInclude and XML Schema

It seems Jonathan Marsh has joined the blogosphere with his new blog Design By Committee. If you don't know Jonathan Marsh, he's been one of Microsoft's representatives at the W3C for several years and has been an editor of a variety of W3C specifications including XML:Base, XPointer Framework, and XInclude.

In his post XML Base and open content models Jonathan writes

There is a current controversy about XInclude adding xml:base attributes whenever an inclusion is done. If your schema doesn't allow those attributes to appear, you're document won't validate. This surprises some people, since the invalid attributes were added by a previous step in the processing chain (in this case XInclude), rather than by hand. As if that makes a difference to the validator!

Norm Walsh , after a false start, correctly points out this behavior was intentional. But he doesn't go the next step to say that this behavior is vital! The reason xml:base attributes are inserted is to keep references and links from breaking. If the included content has a relative URI, and the xml:base attribute is omitted, the link will no longer resolve - or worse, it will resolve to the wrong thing. Can you say "security hole"?

Sure it's inconvenient to fail validation when xml:base attributes are added, especially when there are no relative URIs in the included content (and thus the xml:base attributes are unnecessary.) But hey, if you wanted people or processes to add attributes to your content model, you should have allowed them in the schema!

I agree that the working group tried to address a valid concern. But this seems to me to be a case of the solution being worse than the problem. To handle a situation for which workarounds will exist in practice (i.e. document authors should use absolute URIs instead of relative URIs in documents) the XInclude working group handicapped using XInclude as part of the processing chain for documents that will be validated by XML Schema.

Since the problem they were trying to solve exists in instance documents, even if the document author don't follow a general guideline of favoring absolute URIs over relative URIs, these URIs can be expanded in a single pass using XSLT before being processed up the chain by XInclude. On the other hand if a schema doesn't allow xml:base elements everywhere (basically every XML format in existence) then one cannot use XInclude as part of the pipeline that creates the document if the final document will undergo schema validation.

I think the working group optimized for an edge case but ended up breaking a major scenario. Unfortunately this happens a lot more than it should in W3C specifications.

Categories: XML

May 16, 2005

@ 04:24 PM

XInclude and W3C XML Schema Will Play Nice Together in .NET Framework v2.0

Stan Kitsis, who replaced me as the XML Schema program manager on the XML team, has a blog post about XInclude and schema validation where he writes

A lot of people are excited about XInclude and want to start using it in their projects. However, there is an issue with using both XInclude and xsd validation at the same time. The issue is that XInclude adds xml:* attributes to the instance documents while xsd spec forces you to explicitly declare these attributes in your schema. Daniel Cazzulino, an XML MVP, blogged about this a few months ago: "W3C XML Schema and XInclude: impossible to use together???"

To solve this problem, we are introducing a new system.xml validation flag AllowXmlAttributes in VS2005. This flag instructs the engine to allow xml:* attributes in the instance documents even if they are not defined in the schema. The attributes will be validated based on their data type.

This design flaw in the aforementioned XML specifications is a showstopper that prevents one from performing schema validation using XSD on documents that were pre-processed with XInclude unless the schema designer decided up front that they want their format to be used with XInclude. This is fundamentally broken. The sad fact is that as Norm Walsh pointed out in his post XInclude, xml:base and validation this was a problem the various standards groups were aware of but decided to dump on implementers and users anyway. I'm glad the Microsoft XML team decided to take this change and fix a problem that was ignored by the W3C standards groups involved.

Categories: XML

May 10, 2005

@ 06:23 PM

UPDATED: Things to note about foreach and System.Xml.XPath.XPathNodeIterator

Oleg Tkachenko has a post about one of the changes I was involved in while the program manager for XML programming models in the .NET Framework. In the post foreach and XPathNodeIterator - finally together Oleg writes

This one little improvement in System.Xml 2.0 Beta2 is sooo cool anyway: XPathNodeIterator class at last implements IEnumerable! Such unification with .NET iteration model means we can finally iterate over nodes in an XPath selection using standard foreach statement:
XmlDocument doc = new XmlDocument();
doc.Load("orders.xml");
XPathNavigator nav = doc.CreateNavigator();
foreach (XPathNavigator node in nav.Select("/orders/order"))
    Console.WriteLine(node.Value);
Compare this to what we have to write in .NET 1.X:
XmlDocument doc = new XmlDocument();
doc.Load("../../source.xml");
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator ni = nav.Select("/orders/order");
while (ni.MoveNext())      
  Console.WriteLine(ni.Current.Value);
Needless to say - that's the case when just a dozen lines of code can radically simplify a class's usage and improve overall developer's productivity. How come this wasn't done in .NET 1.1 I have no idea.

One of the reasons we were hesitant in adding support for the IEnumerable interface to the XPathNodeIterator class is that the IEnumerator returned by the IEnumerable.GetEnumerator method has to have a Reset method. However a run of the mill XPathNodeIterator does not have a way to reset its current position. That means that code like the following has problems

XmlDocument doc = new XmlDocument();
doc.Load("orders.xml");
XPathNodeIterator it = doc.CreateNavigator().Select("/root/*");

foreach (XPathNavigator node in it) 
  Console.WriteLine(node.Name);
						
foreach (XPathNavigator node in it) 
 Console.WriteLine(node.Value);

The problem is that after the first loop the XPathNodeIterator is positioned at the end of the sequence of nodes so the second loop should not print any values. However this violates the contract of IEnumerable and the behavior of practically every other class that implements the interface. We considered adding an abstract Reset() method to the XPathNodeIterator class in Whideby but this would have broken implementations of that class written against previous versions of the .NET Framework.

Eventually we decided that even though the implementation of IEnumerable on the XPathNodeIterator would behave incorrectly when looping over the class multiple times, this was an edge case that shouldn't prevent us from improving the usability of the class. Of course, it is probable that someone may eventually be bitten by this weird behavior but we felt the improved usability was worth the trade off.

Yes, backwards compatibility is a pain.

UPDATE: Andrew Kimball, who's one of the developers of working on XSLT and XPath technologies in System.Xml posted a comment that corrected some of my statements. It seems that some different implementation decisions were made after I left the team. He writes

"You know how I hate to contradict you, but the example you give actually does work correctly in 2.0. The implementation of IEnumerable saves a Clone of the XPathNodeIterator so that Reset() can simply reset to the saved Clone. There were a couple of limitations/problems, but neither was serious enough to forego implementing IEnumerable:

1. Performance -- 2 clones of the XPathNodeIterator must be taken, one in case Reset is called, and one to iterate over. In addition, getting the Current property must clone the current navigator so that the navigator's position is independent of the iterator's position.

2. Mid-Iteration Weirdness -- If MoveNext() is called several times on the XPathNodeIterator, and *then* GetEnumerator() is called, the enumerator will only enumerate the remaining nodes, not the ones that were skipped over. Basically, users should either use the XPathNodeIterator methods to iterate, *or* the IEnumerable/IEnumerator methods, not both."

I guess it just goes to show how quickly knowledge can get obsoleted in the technology game. :)

Categories: XML

May 9, 2005

@ 03:09 PM

Microsoft licensed Mvp.Xml library

A little while ago I noticed a post by Oleg Tkachenko entitled Microsoft licensed Mvp.Xml library where he wrote

On behalf of the Mvp.Xml project team our one and the only lawyer - XML MVP Daniel Cazzulino aka kzu has signed a license for Microsoft to use and distribute the Mvp.Xml library. That effectively means Microsoft can (and actually wants to) use and distribute XInclude.NET and the rest Mvp.Xml goodies in their products. Wow, I'm glad XML MVPs could come up with something so valuable than Microsoft decided to license it.

Mvp.Xml project is developed by Microsoft MVPs in XML technologies and XML Web Services worldwide. It is aimed at supplementing .NET framework functionality available through the System.Xml namespace and related namespaces such as System.Web.Services. Mvp.Xml library version 1.0 released at January 2005 includes Common, XInclude.NET and XPointer.NET modules.

As a matter of interest - Mvp.Xml is an open-source project hosted at SourceForge.

Since Oleg wrote this I've seen other Microsoft XML MVPs mention the event including Don Demsak and Daniel Cazzulino. I think this is very cool and something I feel somewhat proud of since I helped get the XML MVP program started.

A few years ago, as part of my duties as the program manager responsible for putting together a community outreach plan for the XML team I decided that we needed an MVP category for XML. I remember the first time I brought it up with the folks who run the Microsoft MVP program and they thought it was such a weird idea since there were already categories for languages like C# and VB but XML was just a config file format and didn't need enough significant expertise. I was persistent and pointed out that a developer could be a guru at XSLT, XPath, XSD, DOM, etc without necessarily being a C# or VB expert. Eventually they buckled and an MVP category for XML was created. Besides getting the XML Developer Center on MSDN launched, getting the XML MVP program started was probably my most significant achievement as part of my community outreach duties while on the XML team at Microsoft.

Now it is quite cool to see this community of developers contributing so much value to the .NET XML community that Microsoft has decided to license their technologies.

I definitely want to build a similar developer community around the stuff we are doing at MSN once we start shipping our APIs. I can't wait to get our next release out the door so I can start talking about stuff in more detail.

Categories: Life in the B0rg Cube | XML

April 27, 2005

@ 12:52 PM

Adam Bosworth's Web of Data: Is RSS the only API your Website Needs?

Daniel Steinberg has a an article entitled Bosworth's Web of Data where he discusses some of the ideas Adam Bosworth evangelized in his keynote at the MySQL Users Conference 2005. Daniel writes,

Bosworth explained that the key factors that enabled the web began with simplicity. HTTP was simple enough that any "P" language or JavaScript programmer could build applications. On the consumption side, web browsers such as Internet Explorer 4 were committed to rendering whatever they got. This meant that people could be sloppy and they didn't need to be high priests of syntax. Because it was a sloppy standard, people who otherwise couldn't have authored content did. The fact that it was a standard allowed this single, simple, sloppy, open wire format to run on every platform.
...
The challenge is to take a database and do for the web what was done for content. Bosworth explained that you "need a model that allows for massively linear scalability and federation of information that can spread effortlessly across a federated web."

Solutions that were suggested were to use XML and XQuery. The problem with XML is that unlike HTML, there is not a single grammar. This removed the simple and sloppy aspects of the web. The problem with XQuery is the time it took to finish the specification. Bosworth noted that it took more than four years and that "anything that takes four years is not worth doing. It is over-designed. Intead, take six months and learn from customers."
...
The next solution used web services, which began as an easy idea: you send an XML request and you get XML back. Instead, the collection of WS-* specs were huge and again, overly complicated. Bosworth said that this was a deliberate effort on the part of the companies that control the specs, like IBM and Microsoft, which deliberately made the specification hard, because then only they could deliver technology to do it.
...
Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything.

There are lots of interesting ideas here. I won't dwell on the criticisms of XQuery & WS-* mainly because I tend to agree that they are both overdesigned and complicated. I also wont dwell on the apparent contradiction inherent in claiming that the Semantic Web is doomed because it requires people to agree on the same schema for everything then proposing that everyone agree on using RSS as the schema for all data on the Web. I have a suspicion of what he sees as the difference but I'll wait for a blog post from him clarifying that.

What I find very interesting is using RSS is the data access format for the Web. RSS gained popularity as a way to syndicate blog posts and news sites but its turned out to be a lot more versatile than that. Sites like Feedster and Amazon's OpenSearch technology show you can use RSS as a mechanism for providing search results and integrating search engines respectively. Podcasting shows you can use RSS to syndicate digital media content instead of just plain old text or HTML. With Amazon's syndicated feeds one can keep abreast of when new CDs, books and more are released.

Over the weekend I wrote the MSN Spaces photo album browser page which displays slideshows of all the photos in the various albums on a particular user's MSN Spaces space. This page also can display the photos on a randomly selected space. This webpage is entirely powered by RSS. The photos are obtained from the RSS feed for the Space and the list of random spaces is obtained by querying MSN search with the query "site:spaces.msn.com photo album" and requesting the results as RSS. In fact, the information from the MSN Spaces RSS feeds is enough to build something like the Flickr related tags browser, where instead of showing related tags one could show spaces related to the user from the information in their blog roll which happens to also be provided in the RSS feed. Pretty nifty and all without requiring building a REST, SOAP or XML-RPC API.

In situations where one simply wants to expose read-only data via a service on the Web, it's looking like RSS is the technology to beat. As more and more information is exposed as RSS feeds, there will be even more interesting things people will be able to do with this technology. At Microsoft we definitely are gung ho about exposing as much data as possible via RSS and I've been amazed at how much enthusiasm there is around the opportunites in this area.

Side Note: Yesterday while at the Microsoft Research Social Computing Symposium I was chatting with Randy Farmer, who's one of the guys behind Yahoo! 360° and Yahoo's purchase of Flickr, and I mentioned that it seemed like 2003 was the year that RSS really started to take off. This was also the year that Dave Winer froze the RSS 2.0 spec and Sam Ruby gathered all the malcontents in the XML syndication space and gave them a shiny new toy to play with in Atom. Coincidence?

Categories: Syndication Technology | XML

April 24, 2005

@ 11:16 PM

Fun with XMLHttpRequest and RSS: Browsing Photo Albums on MSN Spaces

I mentioned in a recent post that I was considering writing an article entitled Using Javascript, XMLHttpRequest and RSS to create an MSN Spaces photo album browser. It actually took less work than I thought to build a webpage that allows you to browse the photo albums in a particular person's Space or from a randomly chosen Space.

I haven't used Javascript in about 5 years but it didn't take much to put the page together thanks mostly to the wealth of information available on the Web.

You can find the page at http://www.25hoursaday.com/spaces/photobrowser.html

The page requires Javascript and currently only works in Internet Explorer but I'm sure that some intrepid soul could make it work in Firefox in a couple of minutes. If you can, please send me whatever changes are necessary.

Categories: MSN | XML

April 22, 2005

@ 03:05 PM

Contract-First XML Web Service Design is No Panacea

Every once in a while I see articles like Aaron Skonnard's Contract-First Service Development which make me shake my head in sorrow. His intentions are good but quite often advising people to design their XML Web services starting from an XSD/WSDL file instead of a more restricted model leads to more problems than what some have labelled the "code-first" approach.

For example, take this recent post to the XML-DEV mailing list entitled incompatible uses of XML Schema

I just got a call from a bespoke client (the XML guru in a large bank)
asking whether I knew of any XML Schema refactoring tools.

His problem is that one of their systems (from a big company)
does not handle recursive elements. Another one of their
systems (from another big company) does not handle substitution
groups (or, at least, dynamic use of xsi:type.) Another of their
systems (from a third big company) does not handle wildcards.
(Some departments also used another tool that generated ambiguous
schemas.)

This is causing them a major headache: they are having to
refactor 7,000 element schemas by hand to munge them into
forms suited for each system.

Their schema-centricism has basically stuffed up the ready
interoperability they thought they were buying into with XML,
on a practical level. This is obviously a trap: moving to a
services-oriented architecture means that the providers can
say "we provide XML with a schema" and the pointy-headed bosses
can say "you service-user: this tool accepts XML with a schema
so you must use that!" and the service-user has little recourse.

This is one of the problems of contract first development that many of the consultants, vendors and pundits who are extolling its virtues fail to mention. A core fact of building XML Web services that use WSDL/XSD as the contract is that most people will use object<->XML mapping technologies to either create or consume the web services. However there are fundamental impedance mismatches between the W3C XML Schema Definition (XSD) Language and objects in a traditional object oriented programming language that ensure that these mappings will be problematic. I have written about these impedance mismatches several times over the past few years including posts such as The Impedence Mismatch between W3C XML Schema and the CLR.

Every XML Web Service toolkit that consumes WSDL/XSD and generates objects has different parts of the XSD spec that they either fail to handle or handle inadequately. Many of the folks encouraging contract first development are refusing to acknowledge that if developers build schemas by hand for use in XML Web Services, it is likely they will end up using capabilities of XSD that are not supported by one or more of their consuming applications. The post from XML-DEV is just one example of this happening. When I was the program manager for XML Schema technologies in the .NET Framework I regularly had to help customers who had to deal with the interoperability problems they encountered because they'd read some article extolling the virtues of schema first design which failed to acknowledge the realities of the XML Web Service landscape.

From my experience "contract first" design is actually more likely to lead to interoperability problems than "code first" design. The only time this isn't the case is when the schema designer actually pays attention to use a minimal subset of XSD as opposed to using its full capabilities. This is one of the reasons I have tried to provide some guidance on what XSD features to avoid in my XML Schema Design guidelines series on XML.com.

However it is far easier to avoid these missteps if one starts from objects instead of XSD/WSDL since the expressiveness of objects is less than that of XSD which automatically means the web service contracts are less complicated. I remember getting this insight from Don Box and Doug Purdy a couple of months ago and rejecting it at the time since it seemed anti-XML but now I realize that it is actually the most practical thing to do.

Categories: XML | XML Web Services

April 19, 2005

@ 04:49 PM

Comments [12]

Ideas for my next Extreme XML column on MSDN

It looks like I didn't get an Extreme XML column out last month. Work's been hectic but I think I should be able to start on a column by the end of the week and get it done before the end of the month. I have a couple of ideas I'd like to write about but as usual I'm curious as to what folks would be interested in reading about. Below are three article ideas in order of preference.

Using Javascript, XMLHttpRequest and RSS to create an MSN Spaces photo album browser: The RSS feed for a space on MSN Spaces contains information about the most recent updates to a user's blog, photo album and lists. RSS items containing lists are indicated by using the msn:type element with the value "photoalbum". It is possible to build a photo album browser for various spaces by using a combination of Javascript for dynamic display and XMLHttpRequest for consuming the RSS feed. Of course, my code sample will be nowhere near as cool as the Flickr related tag browser.
Fun with operator overloading and XML: This would be a follow up piece to my Overview of Cω article. This article explores how one could simulate adding XML specific language extensions by overloading various operators on the System.Xml.XmlNode class.
Processing XML in the Real World: 10 Things To Worry About When Processing RSS feeds on the Web: This will be an attempt to distill the various things I've learned over the 2 years I've been working on RSS Bandit. It will cover things like how to properly use the System.Xml.XmlReader class for processing RSS feeds in a streaming fashion, bandwidth saving tips from GZip encoding to sending If-Modified-Since/If-None-Match headers in the request, dealing with proxy servers and authentication.

Which ones would you like to see and/or what is your order of preference?

Categories: XML

April 13, 2005

@ 02:33 PM

Interested in Improving XML Support in Internet Explorer?

My friend Derek, who's the dev lead for MSXML (the XML toolkit used by practically every Microsoft application from Office to Internet Explorer), has a blog post entitled XML use in the browser where he writes

C|Net has an article on what people have started calling AJAX. 'A'synchronous JavaScript and Xml. I have seen people using MSXML to build these kinds of web-apps for years, but only recently have people really pulled it all together enough, such as GMail or Outlook Web-Access (OWA). In fact, MSXML's much copied XMLHTTP (a.k.a. IXMLHttpRequest) (Copied by Apple and Mozilla/Firefox) was actually created basically to support the first implementation of OWA.

I've been thinking about what our customers want in future versions of MSXML. What kind of new functionality would enable easier/faster developement of new AJAX style web applications? XForms has some interesting ideas... I've been thinking about what we might add to MSXML to make it easier to develop rich DHtml applications. XForms is an interesting source of ideas, but I worry that it removes too much control. I don't think you could build GMail on XForms, for example.

The most obvious idea, would be to add some rich data-binding. Msxml already has some _very_ limited XML data-binding support. I have not looked much into how OWA or GMail work, but I bet that a significant part of the client-side jscript is code to regenerate the UI from the XML data behind the current page. Anyone who has used ASP/PHP/etc is used to the idea of some sort of loop to generate HTML from some data. What if the browser knew how to do that for you? And knew how to push back changes from editable controls? You can do that today with ADO.

Any other ideas? For those of you playing with 'AJAX' style design. What are the pain points? (Beside browser compatibility... )

If you are building applications that use XML in the browser and would like to influence the XML framework that will be used by future versions of Microsoft applications from Microsoft Office to Internet Explorer then you should go over to Derek's blog and let him know what you think.

Categories: XML

March 22, 2005

@ 02:50 PM

Comments [17]

SOA, AJAX and REST: The Software Industry Devolves into the Fashion Industry

Ever since the article Ajax: A New Approach to Web Applications unleashed itself on the Web I've seen the cacophony of hype surrounding Asynchronous JavaScript + XML (aka AJAX reach thunderous levels. The introduction to the essay already should make one wary about the article, it begins

Ajax isnt a technology. Its really several technologies, each flourishing in its own right, coming together in powerful new ways. Ajax incorporates:

standards-based presentation using XHTML and CSS;

dynamic display and interaction using the Document Object Model ;

data interchange and manipulation using XML and XSLT ;

asynchronous data retrieval using XMLHttpRequest ;

and JavaScript binding everything together.

So AJAX is using Javascript and XML with the ~~old~~ new twist being that one communicates with a server using Microsoft's proprietary XmlHttpRequest object. AJAX joins SOA in ignominy as yet another buzzword created by renaming existing technologies which becomes a way for some vendors to sell more products without doing anything new. I agree with Ian Hixie's rant Call an apple an apple where he wrote

Several years ago, HTML was invented, and a few years later, JavaScript (then LiveScript, later officially named ECMAScript) and the DOM were invented, and later CSS. After people had been happily using those technologies for a while, people decided to call the combination of HTML, scripting and CSS by a new name: DHTML. DHTML wasn't a new technology it was just a new label for what people were already doing.

Several years ago, HTTP was invented, and the Web came to be. HTTP was designed so that it could be used for several related tasks, including:

Obtaining a representation of a resource from a remote host using that resource's identifier (GET requests).

Executing a procedure call on a remote host using a structured set of arguments (POST requests).

Uploading a resource to a remote host (PUT requests).

Deleting a resource from a remote host (DELETE requests).

People used this for many years, and then suddenly XML-RPC and SOAP were invented. XML-RPC and SOAP are complicated ways of executing remote procedure calls on remote hosts using a structured set of arguments, all performed over HTTP.

Of course you'll notice HTTP can already do that on its own, it didn't need a new language. Other people noticed this too, but instead of saying "hey everyone, HTTP already does all this, just use HTTP", they said, "hey everyone, you should use REST!". REST is just a name that was coined for the kind of architecture on which HTTP is based, and, on the Web, simply refers to using HTTP requests.

Several years ago, Microsoft invented XMLHttpRequest. People used it, along with JavaScript and XML. Google famously used it in some of their Web pages, for instance GMail. All was well, another day saved... then someone invented a new name for it: Ajax.
...
So I have a request: could people please stop making up new names for existing technologies? Just call things by their real name! If the real name is too long (the name Ajax was apparently coined because "HTTP+XML+HTML+XMLHttpRequest+JavaScript+CSS" was too long) then just mention the important bits. For example, instead of REST, just "HTTP"; instead of DHTML just "HTML and script", and instead of Ajax, "XML and script".

What I find particularly disappointing about the AJAX hype is that it has little to do with the technology and more to do with the quality of developers building apps at Google. If Google builds their next UI without the use of XML but only Javascript and HTML will we be inundiated with hype about the new JUDO approach (Javascript and Unspecified DOm methods) because it uses proprietary DOM extensions not in the W3C standard?

The software industry perplexes me. One minute people are complaining about standards compliance in various websites and browsers but the next minute Google ships websites built on proprietary Microsoft APIs and it births a new buzzword. I doubt that even the fashion industry is this fickle and inconsistent.

Postscript: I wasn't motivated to post about this topic until I saw the comments to the post Outlook Web Access should be noted as AJAX pioneer by Robert Scoble. It seems some people felt that Outlook Web Access did not live up to the spirit of AJAX. Considering that the distinguishing characteristic of the AJAX buzzword is using XmlHttpRequest and Outlook Web Access is the reason it exists (the first version was written by the Exchange team) I find this highly disingenious. Others have pointed this out as well, such as Robert Sayre in his post Ever Wonder Why It's Called "XMLHTTPRequest"?

Categories: XML

February 11, 2005

@ 04:57 AM

The Fallacy of "XML Will Save Us"

Steve Vinoski has a blog posting entitled Focus on the contract where he writes

Tim offers some extremely excellent advice (as usual) regarding what really matters when you write your services. If I may paraphrase what he says and perhaps embellish it a bit, starting from the implementation language and generating your contracts from it is just plain wrong, wrong, wrong, at least for systems of any appreciable magnitude, reach, or longevity. Instead, focusing on the contracts first is the way to go. I've written about this for many years now, starting well over a decade ago.

When you start with the code rather than the contract, you are almost certainly going to slip up and allow style or notions or idioms particular to that programming language into your service contract. You might not notice it, or you might notice it but not care. However, the guy on the other side trying to consume your service from a different implementation language for which your style or notions or idioms don't work so well will care.

Although Steve Vinoski's argument sounds convincing, there is one problem with it. It is actually much easier to make an uninteroperable Web service if one starts with the service contract instead of with object oriented code. The reason for this is quite simple and one I've harped on several times in the past; the impedance mismatch between XSD and objects is quite significant. There are several constructs in W3C XML Schema which simply have no counterpart in traditional object oriented languages which cause current XML Web service toolkits to barf when consuming them. For example, the XmlSerializer class in the .NET Framework supports about half the constructs in W3C XML Schema. Most XML Web Service toolkits support a similar number [but different set] of features of W3C XML Schema.

This isn't theoretical, more than once while I was the program manager for XML Schema technologies in the .NET Framework I had to take conference calls with customers who'd been converted to the 'contract first' religion only to find out that toolkits simply couldn't handle a lot of the constructs they were putting in their schemas.Those conversations were never easy.

The main thing people fail to realize when they go down the 'contract first' route is that it is quite likely that they have also gone down the 'XML first' route which most of them don't actually want to take. Folks like Tim Ewald don't mind the fact that sometimes going 'contract first' may mean they can't use traditional XML Web Service toolkits but instead have to resort to SAX, DOM and XSLT. However for many XML Web Service developers this is actually a problem instead of a solution.

Categories: XML | XML Web Services

February 9, 2005

@ 03:05 PM

On the Complexity of XML APIs

David Megginson (the creator of SAX) has a post entitled The complexity of XML parsing APIs where he writes

Dare Obasanjo recently posted a message to the xml-dev mailing list as part of the ancient and venerable binary XML permathread (just a bit down the list from attributes vs. elements, DOM vs. SAX, and why use CDATA?). His message including the following:

I don’t understand this obsession with SAX and DOM. As APIs go they both suck[0,1]. Why would anyone come up with a simplified binary format then decide to cruft it up by layering a crufty XML API on it is beyond me.

[0] http://www.megginson.com/blogs/quoderat/archives/2005/01/31/sax-the-bad-the-good-and-the-controversial/

[1] http://www.artima.com/intv/dom.html

I supposed that I should rush to SAX’s defense. I can at least point to my related posting about SAX’s good points, but to be fair, I have to admit that Dare is absolutely right – building complex applications that use SAX and DOM is very difficult and usually results in messy, hard-to-maintain code.

I think this is a pivotal part of the binary XML debate. The primary argument for binary serializations of XML is that certain parties want to get the benefit of the wide array of technologies for processing XML yet retain the benefits of a binary format such as reduced size on the wire and processing time. Basically having one's cake and eating it too.

For me, the problem is that XML is already being pulled from too many directions as it is. In retrospect I realize it was foolish for me to think that the XML team could come up with a single API that would satisfy people processing business documents written in wordprocessingML, people building distributed computing applications using SOAP or developers reading & writing to application configuration files. All of these scenarios use intersecting subsets of the full functionality of the XML specification. The SOAP specs go as far as banning some features of XML while others are simply frowned upon based on the fact that the average SOAP toolkit simply doesn't know what to do with them. One man's meat (e.g. mixed content) is another man's poison.

What has ended up happening is that we have all these XML APIs that expose a lot of cruft of XML that most developers don't need or even worse make things difficult in the common scenarios because they want to support all the functionality of XML. This is the major failing of APIs such as the .NET Framework's pull model parser class, System.Xml.XmlReader, DOM and SAX. The DOM also has issues with the fact that it tries to support conflicting data models (DOM vs. XPath) and serialization formats (XML 1.0 & XML 1.0 + XML namespaces). At the other extreme we have APIs that try to simplify XML by only supporting specific subsets of its expressivity such as the System.Data.DataSet and the System.Xml.XmlSerializer classes in the .NET Framework. The problem with these APIs is that the developer is dropped of a cliff once they reach the limits of the XML support of the API and have to either use a different API or resort to gross hacks to get what they need done.

Unfortunately one of the problems we had to deal with when I was on the XML team was that we already had too many XML APIs as it was. Introducing more would create developer confusion but trying to change the existing ones would break backwards compatibility. Personally I'd rather see efforts being to create better toolkits and APIs for the various factions that use XML to make it easier for them to get work done than constantly churning the underlying format thus fragmenting it.

Categories: XML

February 3, 2005

@ 05:44 PM

Square Pegs in Round Holes: W3C XML Schema and XInclude

One of the biggest assumptions I had about software development was shattered when I started working on the XML team at Microsoft. This assumption was that standards bodies know what they are doing and produce specifications that are indisputable. However I've come to realize that the problems of design by committee affects illustrious names such as the W3C and IETF just like everyone else. These problems become even more pernicious when trying to combine technologies defined in multiple specifications to produce a coherent end to end application.

An example of the problem caused by contradictions in core specifications of the World Wide Web is summarized in Mark Pilgrim's article, XML on the Web Has Failed. The issue raised in his article is that determining the encoding to use when processing an XML document retrieved off the Web via HTTP, such as an RSS feed, is defined in at least three specifications which contradict each other somewhat; XML 1.0, HTTP 1.0/1.1 and RFC 3023. The bottom line being that most XML processors including those produced by Microsoft ignore one or more of these specifications. In fact, if applications suddenly started following all these specifications to the letter a large number of the XML documents on the Web would be considered invalid. In Mark Pilgrim's article, 40% of 5,000 RSS feeds chosen at random would be considered invalid even though they'd work in almost all RSS aggregators and be considered well-formed by most XML parsers including the System.Xml.XmlTextReader class in the .NET Framework and MSXML.

The newest example, of XML specifications that should work together but instead become a case of putting square pegs in round holes is Daniel Cazzulino's article, W3C XML Schema and XInclude: impossible to use together??? which points out

The problem stems from the fact that XInclude (as per the spec) adds the xml:base attribute to included elements to signal their origin, and the same can potentially happen with xml:lang. Now, the W3C XML Schema spec says:

3.4.4 Complex Type Definition Validation Rules

Validation Rule: Element Locally Valid (Complex Type)
...

3 For each attribute information item in the element information item's [attributes] excepting those whose [namespace name] is identical to http://www.w3.org/2001/XMLSchema-instance and whose [local name] is one of type, nil, schemaLocation or noNamespaceSchemaLocation, the appropriate case among the following must be true:

And then goes on to detailing that everything else needs to be declared explicitly in your schema, including xml:lang and xml:base, therefore :S:S:S.

So, either you modify all your schemas to that each and every element includes those attributes (either by inheriting from a base type or using an attribute group reference), or you validation is bound to fail if someone decides to include something. Note that even if you could modify all your schemas, sometimes it means you will also have to modify the semantics of it, as a simple-typed element which you may have (with the type inheriting from xs:string for example), now has to become a complex type with simple content model only to accomodate the attributes. Ouch!!! And what's worse, if you're generating your API from the schema using tools such as xsd.exe or the much better XsdCodeGen custom tool, the new API will look very different, and you may have to make substancial changes to your application code.

This is an important issue that should be solved in .NET v2, or XInclude will be condemned to poor adoption in .NET. I don't know how other platforms will solve the W3C inconsistency, but I've logged this as a bug and I'm proposing that a property is added to the XmlReaderSettings class to specify that XML Core attributes should be ignored for validation, such as XmlReaderSettings.IgnoreXmlCoreAttributes = true. Note that there are a lot of Ignore* properties already so it would be quite natural.

I believe this is a significant bug in W3C XML Schema that it requires schema authors to declare up front in their schemas where xml:lang, xml:base or xml:base may occur in their documents. Since I used to be the program manager for XML Schema technologies in the .NET Framework this issue would have fallen on my plate. I spoke to Dave Remy who toke over my old responsibilities and he's posted his opinion about the issue in his post XML Attributes and XML Schema . Based on the discussion in the comments to his post it seems the members of my old team are torn on whether to go with a flag or try to push an errata through the W3C. My opinion is that they should do both. Pushing an errata through the W3C is a time consuming process and in the meantime using XInclude in combination with XML Schema is signficantly crippled on the .NET Framework (or on any other platform that supports both technologies). Sometimes you have to do the right thing for customers instead of being ruled by the letter of standards organizations especially when it is clear they have made a mistake.

Please vote for this bug on the MSDN Product Feedback Center.

Categories: XML

January 28, 2005

@ 12:13 AM

Article Idea: Processing XML in the Real World

Coincidentally just as I finished reading a post by Tim Bray about Private Syndication, I got two bug reports filed almost simultaneously about RSS Bandit's support for secure RSS feeds. The first was SSL challenge for non-root certs where the user complained that instead of prompting the user when there is a problem with an SSL certificate like browsers do we simply fail. One could argue that this is the right thing to do especially when you have folks like Tim Bray suggesting that bank transactions and medical records should be flowing through RSS. However given the precedent set by web browsers we'll probably be changing our behavior. The second bug was that RSS Bandit doesn't support cookies. Many services use cookies to track authenticated users as well as provide individual views tailored to a user. Although there are a number of folks who tend to consider cookies a privacy issue, most popular websites use them and they are mostly harmless. I'll likely fix this bug in the next couple of weeks.

These bug reports in combination with a couple more issues I've had to deal with while writing code to process RSS feeds in RSS Bandit has givn me my inspiration for my next Extreme XML column. I suspect there's a lot of mileage that can be obtained from an article that dives deep into the various issues one deals with while processing XML on the Web (DTDs, proxies, cookies, SSL, unexpected markup, etc) which uses RSS as the concrete example. Any of the readers of my column care to comment on whether they'd like to see such an article and if so what they'd want to see covered?

Categories: RSS Bandit | XML

January 5, 2005

@ 03:54 PM

Getting Back in the Saddle

I finished my first article since I switching jobs this weekend. It's tentatively titled Integrating XML into Popular Programming Languages: An Overview of Cω and should show up on both XML.com and my Extreme XML column on MSDN at the end of the month. I had initially planned to do the overview of Cω(C-Omega) for MSDN and do a combined article about ECMAScript for XML (E4X) & Cω for XML.com but it turned out that just an article on Cω was already fairly long. My plan is to follow up with an E4X piece in a couple of months. For the geeks in the audience who are a little curious as to exactly what the heck Cω is, here's an introduction to one of the sections of the article to whet your appetite.

The Cω Type System

The goal of the Cω type system is to bridge the gap between Relational, Object and XML data access by creating a type system that is a combination of all three data models. Instead of adding built-in XML or relation types to the C# language, the approach favored by the Cω type system has been to add certain general changes to the C# type system that make it more conducive for programming against both structured relational data and semi structured XML data.

A number of the changes to C# made in Cω make it more conducive for programming against strongly typed XML, specifically XML constrained using W3C XML Schema. Several concepts from XML and XML Schema have analogous features in Cω. Concepts such as document order, the distinction between elements and attributes, having multiple fields with the same name but different values, and content models that specify a choice of types for a given field exist in Cω. A number of these concepts are handled in traditional Object<->XML mapping technologies but it is often with awkwardness. Cω aims at makes programming against strongly typed XML as natural as programming against arrays or strings in traditional programming languages.

I got a lot of good feedback on the article from a couple of excellent reviewers including the father of the X#/Xen himself, Erik Meijer. For those not in the know, X#/Xen was merged with Polyphonic C# to create Cω. Almost all of my article focuses on the aspects of Cω inherited from X#/Xen.

Categories: XML

December 22, 2004

@ 05:27 PM

The Value of RSS

It seems there's been some recent hubbub in the world of podcasting about how to attach multiple binary files to a single post in an RSS feed In a post entitled Multiple-enclosures on RSS items? Dave Winer weighs in on the issue. He writes

This question comes up from time to time, and I've resisted answering it directly, thinking that anyone who really read the spec would come to the conclusion that RSS allows zero or one enclosures per item, and no more. The same is true for all other sub-elements of item, except category, where multiple elements are explicitly allowed. The spec refers to "the enclosure" in the singular. Regardless, some people persist in thinking that you may have more than one enclosure per item.

Okay, let's play it out. So if I have more than one enclosure per item, how do I specify the publication date for each enclosure? How do I specify the title, author, a link to comments, a description perhaps, or a guid? The people who want multiple enclosures suggest schemes that are so complicated that they're reduced to hand-waving before they get to the spec, which I would love to read, if it could be written. Some times some things are just too hard to do. This is one of them.

And there's a reason why it's too hard. Because you're throwing out the value of RSS and then trying to figure out how to bring it back. There's no need for items any more, so you might as well get rid of them. At the top level of channel would be a series of enclosures, and then underneath each enclosure, all the meta-data. Voila, problem solved. Only what have you actually solved? You've just re-created RSS, but instead of calling the main elements "item" we now call them "enclosure".

The value of RSS is fairly self evident to me but it seems that given the amount of people who keep wanting to reinvent the wheel it may not be as clear to others. As someone who used to work on core XML technologies at Microsoft, the value of XML was obvious to me. It allowed developers to agree to use the same data format for information interchange which led to a proliferation of a wide and uniform set of tools for working with data formats. XML is not an optimal format for most of the tasks it is used for but it more than makes up for this with the plethora of tools and technologies that exist for processing XML.

My expectation about XML was always that the software industry would move on to agreeing on other higher level protocols built on XML for application information interchange. So I've always been frustrated to see many attempts by various parties, including the W3C with efforts such as XML 1.1 and binary XML, take us steps back by wanting to fragment the interoperability promise of XML.

RSS is a wonderful example of the higher level of interoperability that can be built upon XML formats. Instead of information sources using various incompatible mechanisms for providing information to end users such as NOAA's SOAP web service and the Microsoft.com web services which each require a separate custom application to consume them, sites can all standardize on RSS. This standardization creates an ecosystem of applications that produce and consume RSS feeds which is a lot larger than what would exist for each site specific web services or market specific XML syndication formats. Specifically, it allows for the evolution of the digital information hub where users can view data from the various information sources they care about (blogs, news, weather reports, etc) in their choice of applications.

Additionally, RSS is extensible. This means that even if the core elements and attributes do not satisfy all the requirements of a particular problem domain, then domain-specific information can be added to the feed. This allows for regular consumers of RSS to still be able to consume the content while domain specific applications can give users a richer experience. This is a much better solution for both content producers and consumers than coming up with domain specific applications.

As a user I want less formats not more. I want my email to come in my RSS aggregator, I want my favorite newsgroups to show up in my RSS aggregator, I'm tired of having a separate application for what is essentially the same kind of data. In fact, it seems Google agrees with me as evidenced by them exposing XML feeds for your GMail inbox and for USENET newsgroups via Google Groups. Unfortunately, if you have a plain old RSS reader, you can't view these feeds and instead have to find an aggregator that supports Atom 0.3. Two steps forward, one step back.

We need less data interchange formats not more. It is better for content producers, better for end users and better for developers of applications that use these formats. Existing problems in syndication should focus on how to make the existing formats work for us instead of inventing new formats.

Vive la RSS.

Categories: Syndication Technology | XML

December 10, 2004

@ 04:57 PM

Interested in XML Integration in Conventional Programming Languages?

Looking at the calendar I realized that I have two articles due this month, one for my Extreme XML column on MSDN and another that I promised XML.com a few months ago. I was reminded of this by the following excerpt from Ed Dumbill's recent article On Folly where he wrote

Champion cited two developments of particular interest. The first is E4X, the addition of native XML capabilities to ECMAScript. An implementation of this in the Mozilla project is currently coming to fruition. The second development is "Comega" (aka "Cω"), an extension of C# including native XML data types. (Editor's Note: Watch XML.com for a forthcoming introduction to Comega from Dare Obasanjo.)

So I'm on the hook for an overview of Cω. I started to wonder whether it wouldn't be cool if my Extreme XML column focused deeply on Cω while my XML.com article was an overview of both E4X and Cω. This would save me some effort in coming up with a separate topic for my Extreme XML column but should provide interesting information for both XMl.com readers and MSDN readers. What do you think?

Categories: XML

December 8, 2004

@ 04:41 PM

XML Heavyweights Migrate to Redmond

Since I left the XML team at Microsoft they've gone on an impressive hiring spree. Two people who I'd have loved to work with who have joined up after I left are

David Remy: former Director of Product Engineering at Bea Systems, he was responsible for security, web services, and XML for Bea's Weblogic Workshop product line.

Mike Champion: formerly a research and development specialist at Software AG. He was on the W3C's Document Object Model (DOM) Working Group for over three years and was the co-chair of the Web Services Architecture Working Group.

I have to see how quickly I can nag both of them to start blogging. David used to be in charge of the folks who produced XML Beans so I suspect he has all sorts of interesting perspectives on XML programming models. Mike and I have been exchanging mail on XML-DEV for years and now that we can have some of these conversations in person, I work across campus. Irony indeed. :)

Categories: Life in the B0rg Cube | XML

November 30, 2004

@ 06:04 AM

XSD, RELAX NG and Why We Didn't Ship System.Xml.IXmlType

Tim Bray has a post entitled More Relax where he writes

I often caution people against relying too heavily on schema validation. “After all,” I say, “there is lots of obvious run-time checking that schemas can’t do, for example, verifying a part number.” It turns out I was wrong; with a little extra work, you can wire in part-number validation—or pretty well anything else—to RelaxNG. Elliotte Rusty Harold explains how. Further evidence, if any were required, that RelaxNG is the world’s best schema language, and that anyone who who’s using XML but not RelaxNG should be nervous.

Elliote Rusty Harold's article shows how to plug in custom datatype validation into Java RELAX NG validators. This enables one to enforce complex constraints on simple types such as such as "the content of an element is correctly spelled, as determined by consulting a dictionary file" or "the number is prime" to take examples from ERH's article.

Early in the design of the version 2.0 of the System.Xml namespace in the .NET Framework we considered creating a System.Xml.IXmlType interface. This interface would basically represent the logic for plugging one's custom types into the XSD validation engine. After a couple of months and a number of interesting discussions between myself, Joshua and Andy we got rid of it.

There were two reasons we got rid of this functionality. The simple reason was that we didn't have much demand for this functionality. Whenever we had people complaining about the limitations of XSD validation it was usually due to its inability to define co-occurence constraints (i.e. if some element or attribute has a certain value then the expected content should be blah) and other aspects of complex type validation than needing more finer grained simple type validation. The other reason was that the primary usage of XSD for many of our technologies is primarily as a type system not as a validation language. There's already the fact that XSD schemas are used to generate .NET Framework classes via the System.Xml.Serialization.XmlSerializer and relational tables via the System.Data.DataSet. However there were already impedence mismatches between these domains and XSD, for example if one defined a type as xs:nonNegativeInteger this constraint was honored in the generated C#/VB.NET classes created by the XmlSerializer or in the relational tables created by the DataSet. Then there was the additional wrinkle that at the time we were working on XQuery which used XSD as its typoe system and we had to factor in the fact that if people could add their own simple types we didn't just have to worry about validation but also how query operators would work on them. What would addition, multiplication or subtraction of such types mean? How would type promotion, casting or polymorphism work with some user's custom type defined outside the rules of XSD?

Eventually we scrapped the class as having too much cost for too little benefit.

This reminds me of Bob DuCharme's XML 2004 talk Documents vs. Data, Schemas vs. Schemas where he advised people on how to view RELAX NG and XSD. He advised viewing RELAX NG as a document validation language and considering XSD as a datatyping language. I tend to agree although I'd probably have injected something in there about using XSD + Schematron for document validation so one could get the best of both worlds.

Categories: XML

November 19, 2004

@ 09:40 AM

Some Thoughts on Adam Bosworth's ISCOC04 Talk

Adam Bosworth has posted his ISCOC04 talk on his weblog. The post is interesting although I disagreed with various bits and pieces of it. Below are some comments in response to various parts of his talk

On the one hand we have RSS 2.0 or Atom. The documents that are based on these formats are growing like a bay weed. Nobody really cares which one is used because they are largely interoperable. Both are essentially lists of links to content with interesting associated metadata. Both enable a model for capturing reputation, filtering, stand-off annotation, and so on. There was an abortive attempt to impose a rich abstract analytic formality on this community under the aegis of RDF and RSS 1.0. It failed. It failed because it was really too abstract, too formal, and altogether too hard to be useful to the shock troops just trying to get the job done. Instead RSS 2.0 and Atom have prevailed and are used these days to put together talk shows and play lists (podcasting) photo albums (Flickr), schedules for events, lists of interesting content, news, shopping specials, and so on. There is a killer app for it, Blogreaders/RSS Viewers.

Although it is clear that RSS 2.0 seems to be edging out RSS 1.0, I wouldn't say it has failed per se. I definitely wouldn't say it failed for being too formal and abstract. In my opinion it failed because it was more complex with no tangible benefit. This is the same reason XHTML has failed when compared to HTML. This doesn't necessarily mean that more rigid sysems will fail to take hold when compared to less rigid systems, if so we'd never have seen the shift from C to C++ then from C++ to C#/Java.

Secondly, it is clear It seems Adam is throwing out some Google spin here by trying to lump the nascent and currently in-progress Atom format in the same group as RSS 2.0. In fact, if not for Google jumping on the Atom bandwagon it would even be more of an intellectual curiousity than RSS 1.0.

As I said earlier, I remember listening many years ago to someone saying contemptuously that HTML would never succeed because it was so primitive. It succeeded, of course, precisely because it was so primitive. Today, I listen to the same people at the same companies say that XML over HTTP can never succeed because it is so primitive. Only with SOAP and SCHEMA and so on can it succeed. But the real magic in XML is that it is self-describing. The RDF guys never got this because they were looking for something that has never been delivered, namely universal truth. Saying that XML couldn't succeed because the semantics weren't known is like saying that Relational Databases couldn't succeed because the semantics weren't known or Text Search cannot succeed for the same reason. But there is a germ of truth in this assertion. It was and is hard to tell anything about the XML in a universal way. It is why Infopath has had to jump through so many contorted hoops to enable easy editing. By contrast, the RSS model is easy with an almost arbitrary set of known properties for an item in a list such as the name, the description, the link, and mime type and size if it is an enclosure. As with HTML, there is just enough information to be useful. Like HTML, it can be extended when necessary, but most people do it judiciously. Thus Blogreaders and aggregators can effortlessly show the content and understanding that the value is in the information. Oh yes, there is one other difference between Blogreaders and Infopath. They are free. They understand that the value is in the content, not the device.

Lots of stuff to agree with and disagree with here. Taking it from the top, the assertion that XML is self-describing is a myth. XML is a way to attach labels to islands of data, the labels are only useful if you know what they mean. Where XML shines is that one can start with a limited set of labels that are widely understood (title, link, description) but attach data with labels that are less likely to be understood (wfw:commentRss, annotate:reference, ent:cloud) without harming the system. My recent talk at XML 2004, Designing XML Formats: Versioning vs. Extensibility, was on the importance of this and how to bring this flexibility to the straitjacketed world of XML Schema.

I also wonder who the people are that claim that XML over HTTP will never succeed. XML over HTTP already has in a lot of settings. However I'd question that it is all you need. The richer the set of interactions allowed by the web site the more an API is needed. Google, Amazon and eBay all have XML-based APIs. Every major blogging tool has an XML-based API even though those same tools are using vanilla XML over HTTP for serving RSS feeds. XML over HTTP can succeed in a lot of settings but as the richness of the interaction between client and server grows so also does the need for a more powerful infrastructure.

The issue is knowing how to pick right tool for the job. You don't need the complexity of the entire WS-* stack to build a working system. I know a number of people at Microsoft realize that this message needs to get out more which is why you've begun to see things like Don Box's WS-Why Talk and the WS Kernel.

What has been new is information overload. Email long ago became a curse. Blogreaders only exacerbate the problem. I can't even imagine the video or audio equivalent because it will be so much harder to filter through. What will be new is people coming together to rate, to review, to discuss, to analyze, and to provide 100,000 Zagat's, models of trust for information, for goods, and for services. Who gives the best buzz cut in Flushing' We see it already in eBay. We see it in the importance of the number of deals and the ratings for people selling used books on Amazon. As I said in my blog, My mother never complains that she needs a better client for Amazon. Instead, her interest is in better community tools, better book lists, easier ways to see the book lists, more trust in the reviewers, librarian discussions since she is a librarian, and so on.
This is what will be new. In fact it already is. You want to see the future. Don't look at Longhorn. Look at Slashdot. 500,000 nerds coming together everyday just to manage information overload. Look at BlogLines. What will be the big enabler' Will it be Attention.XML as Steve Gillmor and Dave Sifry hope' Or something else less formal and more organic' It doesn't matter. The currency of reputation and judgment is the answer to the tragedy of the commons and it will find a way. This is where the action will be. Learning Avalon or Swing isn't going to matter. Machine learning and inference and data mining will. For the first time since computers came along, AI is the mainstream.

I tend to agree with most of this although I'm unsure why he feels the need to knock Longhorn and Java. What he seems to be overlooking is that part of the information overload problem is the prevalance of poor data visualization and user interface metaphors for dealing with significant amounts of data. I know believe that one of the biggest mistakes I made in the initial design of RSS Bandit was modelling it after mail readers like Outlook even though I knew lots of people who had difficulty managing the flood of email they get using them. This is why the next version of RSS Bandit will borrow a leaf from FeedDemon along with some other tricks I have up my sleeve.

A lot of what I do in RSS Bandit is made easy due to the fact that it's built on the .NET Framework and not C++/MFC so I wouldn't be as quick to knock next generation GUI frameworks as Adam is. Of course, now that he works for a Web company the browser is king.

Categories: Syndication Technology | XML

November 19, 2004

@ 08:33 AM

Poppin' Them Thangs at XML 2004

My XML in the .NET Framework: Past, Present & Future talk went well yesterday. The room was full and people seemed to like what they heard. The audience was most enamored with the upcoming System.Xml.Schema.XmlSchemaInference class that provides the ability to generate schemas from sample documents and the new XSLT debugger.

It was nice having people walk up to me yesterday to tell me how much they liked my talk from the previous day. There were even a couple of RSS Bandit users who walked up to me to tell me how much they liked it. This was definitely my best XML conference experience.

Arpan did comment on the irony of me giving more talks about XML after leaving the XML team at Microsoft than when I was on the team. :)

Categories: Ramblings | XML

November 18, 2004

@ 07:12 PM

The Tyranny of MustUnderstand

My XML 2004 talk, Designing XML Formats: Versioning vs. Extensibility, went over well yesterday. Lots of interesting questions were asked during the Q&A session for my talk and the following talk by Dave Orchard, Achieving Distributed Extensibility and Versioning.

One issue that came up during the discussions after our talk was the cost/benefit of using a mustUnderstand construct in an XML format similar to the SOAP mustUnderstand attribute. The primary benefit of the having such a construct is that it enables third parties to create mandatory extensions to an XML format. However there a number of costs to having such a construct

Entire Element or Document Must Be Read: A processor that just wants to extract a subset of the data in the document still has to parse the entire document and see if there are any mustUnderstand constructs before it can process the document. This increases the cost of processing instances of the format.
Ambiguity as to what is Meant by 'Understand': The concept of what it means to "understand" an XML vocabulary is context specific. For example, should a stylesheet that pretty prints an XML document fail because the format contains a mustUnderstand construct that is not explicitly handled by the stylesheet? A mustUnderstand construct is particularly limiting since it forces all consumers to fail even though there may be some consumers that can still use the format even if they don't explicitly understand certain elements or attribute in the document.
Causes Confusion for Intermediaries: In certain cases, a format may be processed by an intermediary on the way to the client from the server. For example, HTTP requests often pass through proxy servers and there are also web-based aggregators of RSS/Atom feeds such as Feedster & PubSub which can then be subscribed to by other aggregators. In such cases, it is ambiguous whether intermediaries are expected to fail if a construct which isn't explicitly handled is labelled as mustUnderstand or whether they are expected to pass it on with that label to third party aggregators. In fact certain formats thus have separate mustUnderstand constructs for hop-to-hop versus end-to-end transmission.

From my perspective, the cost of having a mustUnderstand construct is often not worth the benefits provided. This wasn't explicitly in my talk but is a conclusion I came to recently which I expanded upon during the Q&A session.

Categories: XML

November 4, 2004

@ 06:10 PM

XML Specs That Give You Nightmares

Many times when implementing XML specifications I've come across I've come up against feature that just seem infeasible or impractical to implement. However none of them have given me nightmares as they have my friend Mike Vernal, a program manager on the Indigo team at Microsoft. In his post could you stop the noise, i'm trying to get some rest ... he talks about spending nights tossing and turning having nightmares about how the SOAP mustUnderstand header attribute should be processed. In Mike's post More SOAP Sleepness he mentions having sleepless nights worrying about the behavior of SOAP intermediaries as described in Section 2.7: Relaying SOAP Messages.

This isn't to say I didn't have sleepless nights over implementing XML specifications when I worked on the XML team at Microsoft. One of the issues that consumed a lot more of my time than is reasonable is explained in Derek Denny-Brown's post Loving and Hating XML Namespaces

Namespaces and your XML store
For example, load this document into your favorite XML store API (DOM/XmlBeans/etc)
<book title='Loving and Hating XML Namespaces'>
   <author>Derek Denny-Brown</author>
</book>
Then add the attribute named "xmlns" with value "http://book" to the <book> element. What should happen? Should that change the namespaces of the <book> and <author> elements? Then what happens if someone adds the element <ISBN> (with no namespace) under <book>? Should the new element automatically acquire the namespace "http://book", or should the fact that you added it with no namespace mean that it preserves it's association with the empty namespace?

In MSXML, we tried to completely disallow editing of namespace declarations, and mostly succeeded. There was one case, which I missed, and we have never been able to fix it because so many people found it an exploited it. The W3C's XML DOM spec basically says that element/attribute namespaces are assigned when the nodes are created, and never change, but is not clear about what happens when a namespace declaration is edited.

Then there is the problem of edits that introduce elements in a namespace that does not have an existing namespace declaration:
<a xmlns:p="http://p/">
  <b>
    ...
      <c p:x="foo"/>
    ...
  </b>
</a>
If you add attribute "p:z" in namespace "bar" to element <b>, what should happen to the p:x attribute on <c>? Should the implementations scan the entire content of <b> just in case there is a reference to prefix "p"?

Or what about conflicts? Add attribute "p:z" in namespace "bar" to the below sample... what should happen?
<a xmlns:p="http://p/" p:a="foo"/>

This problem really annoyed me while I was the PM for the System.Xml.XmlDocument class and the short-lived System.Xml.XPath.XPathDocument2. In the former, I found out that once you started adding, modifying and deleting namespace declarations the results would most likely be counter-intuitive and just plain wrong. Of course, the original W3C DOM spec existed before XML namespaces and trying to merge them in after the fact was probably a bad idea. With the latter class, it seemed the best we could do was try and prevent editing namespace nodes as much as possible. This is the track we decided to follow with the newly editable System.Xml.XPath.XPathNavigator class in the .NET Framework.

This isn't the most sleep depriving issue I had to deal with when trying to reflect the decisions in various XML specifications in .NET Framework APIs. Unsurprisingly, the spec that caused the most debate amongst our team when trying to figure out how to implement its features over an XML store was the W3C XML Schema Recommendation part 1: Structures. The specific area was the section on contributions to the Post Schema Validation Infoset and the specific infoset contribution which caused so much consternation was the validity property.

After schema validation an XML element or attribute should have additional metadata added to it related to validation such as it's type, its default value specified in the schema if any and whether it is valid or not according to its type. Although the validity property is trivial to implement on a read-only API such as the System.Xml.XmlReader class, it was unclear what would be the right way to expose this in other representations of XML data such as the System.Xml.XmlDocument class. The basic problem is "What happens to the validity propety of the element or attribute those of all its ancestors once the node is updated?". Once I change the value of an age element which is typed as an integer from 17 to seventeen what should happen. Should the DOM test every edit to make sure it is valid for that type then reject it otherwise? Should the edit be allowed but the validity property of the element and all its ancestors be changed? What if there is a name element with required first and last elements and the user wants to delete the first element and replace it with a different one? How would that be reflected with regards to the validity property of the name element?

None of the answers to the question we came up with satisfactory. In the end, we were stuck between a rock and a hard place so we made the compromise choice. I believe we debated this issue every other month for about a year.

Categories: XML

October 26, 2004

@ 11:47 AM

Don Box's WS-Why Talk and the WS Kernel

At Chris Sells' XML DevCon conference* Don Box gave a talk called WS-Why which is described below

"Why? This talk will make sense of why various WS-* specs came to life and which ones every developer should ignore. Naturally, the size of this set is non-zero, however, it is not the entire universe. Hopefully, the audience will be left with a mental model for what to ignore going forward as the WS-* machine continues to move forward."

I got to hang out with Don before the conference as well as read the slides for his talk and although I liked the direction of the talk I wished it could have been more prescriptive. Before continuing to read it's a good idea to read a summary of Don's talk such as the one at Jeff Barr's blog post AXDC - Don Box and WS-Why?.

In his talk Don breaks XML Web Services specs into

WS-DesertIsland - specs that are a must have that form the core XML Web Service specs. These include XML, SOAP, WS-Addressing, WS-MetadataExchange & XSD+WSDL
WS-IslandResort - the next layer of important specs after the core. These include WS-Security, WS-Trust, WS-ReliableMessaging & WS-Policy
WS-NewZealand - specs you'd probably need once in a lifetime. These include WS-Eventing, WS-Enumeration & WS-AtomicTransaction
WS-IslandOfDoctorMoreau - the ugly step children of the WS-* spec family. These include UDDI, WS-Transfer, WS-BusinessActivity, MTOM and BPEL4WS
WS-FantasyIsland - specs Don would love to see. These include WS-Data (XPath/SQL-like query for web services), SOAP over TCP, XSD with better support for versioning, binary XML & WSDL based on RELAX NG.

As can be expected there have been folks who've done a deeper analysis of Don's talk than what I've done above. The most significant so far has been Steve Maine's post The Web Services Kernel which gives an overview of the 5 specs in Don's WS-DesertIsland.

In general I agree with the direction Don took with the talk. However although it was an appropriate talk for the audience of the XML Devcon, a bunch of implementers and industry experts, I don't see it as significant guidance to developers trying to make sense of the mess of WS-* specs. Don's talk is best seen through the lens of "If I was an implementer which specs should I implement in my XML Web Services toolkit", not "If I was a developer which specs should I use from my XML Web Services toolkit". This is an important distinction. This is why specs like WS-Addressing are in Don's WS kernel even though it is only important if you aren't using HTTP as your transport which most developers would be.

The talk I'd most love to see next from Don or whoever else in Indigo is going to be doing the conference route next is WS-Which: How to decide what XML Web Services specs are right for your application. As someone who now has the responsibility of designing XML Web Service end points within an intranet (aren't job switches grand?) and perhaps beyond I'm interested in guidance that explains when I should use WS-Security versus SOAP + HTTPS or whether there are any scenarios where using MTOM or WS-Transfer actually make sense.

Don's talk was a good start by the Indigo team for providing guidance for future users of the XML Web Services family of specs but there's still a lot of guidance that is currently missing from them to the industry. More importantly, just because some spec is or isn't in Don's WS kernel doesn't indicate how significant or not it will be to a particular class of application developers. Perhaps I can get Doug to give this talk next year.

* Someone really needs to explain to Chris Sells how the Web works. Constantly changing the content of the page at http://www.sellsbrothers.com/conference logically breaks links to the site

Categories: XML

October 22, 2004

@ 04:35 AM

SAX for .NET 1.0 and SAXExpat.NET 1.0 Released

Karl Waclaweck has released version 1.0 of the SAX for .NET project. In the announcement on XML-DEV Karl writes

This is the first production release of the C#/.NET port of the SAX API.
It should be compatible with MS.NET 1.1 ~~and Mono 1.0.2~~.

Since the API alone is not enough, a SAX parser implementation has been
released as well: SAXExpat.NET 1.0. It is based on the Expat parser, and
will work on MS.NET 1.1. Currently Mono 1.0.2 is not able to run it,
but this will hopefully change with future Mono releases.

Another implementation based on a port of the AElfred parser will
be available soon. It should work under both, MS.NET and Mono.

The project page is here: http://saxdotnet.sourceforge.net/

It's good to see more Open Source XML projects showing up for the .NET Framework. I haven't missed using SAX that much but I imagine people coming to the .NET Framework from the Java world would love to be able to keep using their favorite push model XML parsing API.

Categories: XML

October 17, 2004

@ 04:56 PM

Hindsight is 20/20: Three Things XML Got Wrong

Derek Denny-Brown, the dev lead for both MSXML & System.Xml, who's been involved with XML before it even had a name has finally started a blog. Derek's first XML-related post is Where XML goes astray... which points out three features of XML that turn out to have caused significant problems for users and implementers of XML technologies. He writes

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then existing common usage patterns. Most of its creators saw XML and evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human readable document, and delineate portions according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process in defining XML based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what are valid characters in Names, and even in "</tagname>" being required rather than just "</>".
...
Allowed Characters
The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says are reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. All sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include characters which are control characters that the XML specification disallowed? XML did not provide any escaping mechanism, and if you ask many XML experts they will tell you to base64 encode your data if it may include invalid characters. It gets worse.

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. Only 2 problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone.
...
Whitespace
When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, just there so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems to our users. Consider this example:
<customer>
           <name>Joe Schmoe</name>
           <addr>123 Seattle Ave</addr>
  </customer>
A customer coming to XML from a database back ground would normally expect that the first child of the <customer> element would be the <name> element. I can’t explain how many times I had to explain that it was actually a text node with the value newline+tab.
...
XML Namespaces
Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.

Then there is the issue of the 'default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML.

My experiences as the program manager for the majority of the XML programming model in the .NET Framework agree with this list. The above list hits the 3 most common areas people seem to have problems with working with XML in the .NET Framework. His blog post makes a nice companion piece to my The XML Litmus Test: Understanding When and Why to Use XML article on MSDN.

Categories: XML

October 13, 2004

@ 06:24 AM

Upcoming Changes to System.Xml in .NET Framework 2.0 Beta 2

We are in the process of locking down System.Xml for Beta 2 of the .NET Framework 2.0 and Visual Studio 2005. In the past few months we have received customer feedback about our feature set previewed in the Whidbey Alpha & Whidbey Beta 1 and this has guided our decision making process as to where to focus our energies to ensure a comprehensive feature set.

Below is the list of changes to System.Xml and subsidiary namespaces that have occurred between Beta 1 and Beta 2 of the .NET Framework 2.0 release.

ADDITIONS

XmlSchemaValidator

The XmlSchemaValidator class provides a push model API for W3C XML Schema validation. The primary scenario for using the XmlSchemaValidator is for validating an XML infoset in-place without having to serialize it as an XML document then reparse the document using a validating XML reader.

CHANGES

XmlReader

Added overloads to the static Create() method that take XmlParserContext
ReadValueAsXXX() methods renamed to ReadContentAsXXX(). Also reduced the number of ReadContentAsXXX() methods relative to the number of ReadValueAsXXX() methods in Whidbey beta 1.
Added ReadElementContentAsXXX() methods which are specific to obtaining the value of element nodes
Added methods for reading large streams of text or binary data embedded in an XML document in a streaming fashion

public virtual bool CanReadValueChunk { get; }

public virtual int ReadValueChunk (byte[] buffer, int startIndex, int count);

public virtual bool CanReadBinaryContent { get; }

public virtual int ReadContentAsBase64 (byte[] buffer, int startIndex, int count);

public virtual int ReadContentAsBinHex (byte[] buffer, int startIndex, int count);

public virtual int ReadElementContentAsBase64(byte[] buffer, int startIndex, int count);

public virtual int ReadElementContentAsBinHex(byte[] buffer, int startIndex, int count);

Added ReadToFollowing(string localname, string namespaceURI) which moves to the next occurrence of the named element in document order.

XmlReaderSettings

Added XmlSchemaValidationFlags enumeration to replace the following flags; IgnoreInlineSchema, IgnoreSchemaLocation, IgnoreValidationWarnings and IgnoreIdentityConstraints
Added existing ValidationType enumeration to replace to replace the following flags; DtdValidate and XsdValidate

XmlWriter

Reduced number of WriteValue() methods
Removed overloads of WriteStartElement and WriteStartAttribute that took an IXmlSchemaInfo parameter

XPathDocument

Reverted the XPathDocument to what it was in version 1.1 of the .NET Framework. It is no longer editable and no longer implements the IChangeTracking, IXmlSerializable, IXPathChangeNavigable, IXPathEditable and IRevertibleChangeTracking interfaces. The rationale for this decision is at More Information on the XPathDocument/XmlDocument Change in Whidbey beta 2

XPathNavigator and XPathEditableNavigator

The XPathEditableNavigator has been merged into the XPathNavigator, making it an editable XML cursor model API.
The XPathNavigator is the preferred API for exposing data as XML. This has been incorporated into the design guidelines for using XML in the .NET Framework

XmlDocument

The XPathNavigator returned by the CreateNavigator() method now allows one to edit the XmlDocument through the cursor model API.
The XmlDocument now supports XML schema validation of the entire subtree or partial validation of nodes in the document using the Validate() method
The following property added to XmlDocument

public XmlSchemaSet Schemas { get; set; }

The following property added to XmlNode

public virtual IXmlSchemaInfo SchemaInfo { get; }

XsltCommand

The XslTransform class was obsoleted in Whidbey Beta 1 and replaced by the System.Xml.Query.XsltCommand class. In Beta 2, we decided to revamp the XsltCommand API in order to make migration from XslTransform simpler. This effort also resulted in the renaming of the XsltCommand class to System.Xml.Xsl.XslCompiledTransform.
XslCompiledTransform compiles XSLT to MSIL for significantly improved performance at the cost of increased (yet still small) compile times.
Supports the MSXML XSLT extension functions such as format-date, format-time etc.

Inference

This class has been renamed to XmlSchemaInference

XPathExpression -

Added static Compile() method enables one to compile an input string containing an XPath query into an XPathExpression object

REMOVALS

XmlArgumentList

To reduce the cost of churn caused by the obsoletion of XslTransform this class has been removed. In its place the XsltArgumentList from v1.1 can be used

XQueryCommand

Microsoft has decided not to ship a client side XQuery implementation in .NET Framework 2.0 as our customers expect us to ship an implementation that meets the following criteria:

Compliant with the W3C standards
Functionally addresses key scenarios

As a core platform component in Windows, they also expect us to ship a product that meets the high bar of not breaking their applications when future updates are released. After talking to key customers and partners, we have determined it is important that we cross this high bar before shipping a full implementation of XQuery in the platform.

The best estimates tell us that ETA for XQuery to become a W3C recommendation is end of 2005 which does not fit with the .NET Framework 2.0 product release cycle.

In the meantime, we are shipping a well-defined small subset of XQuery in SQL Server 2005 to query information stored natively as XML data type. This will enable new customer scenarios in SQL Server for storing and retrieving semi-structured data.

In the NET Framework 2.0 RTM timeframe, we recommend that our customers continue to use XSLT and XPath on the client side to solve their key client side filtering and transformation scenarios. With this in mind, we have made significant improvements to our client side story including:

Performance improvements - making the .NET Framework XSLT processor the best performing processor.
Functional improvements - improving the usability and feature set of the existing .NET Framework processor

Note: As a result of not shipping XQuery, XML Views using mapping and XQuery to query SQL Server 2005 and the XmlAdapter to perform updates that were originally previewed in the PDC Alpha release of .NET V2.0 have also been removed. These were removed in the Beta 1 release.

Categories: Life in the B0rg Cube | XML

October 11, 2004

@ 01:35 AM

The XML Litmus Test: Understanding When and Why to Use XML

I just finished writing last month's Extreme XML column* entitled The XML Litmus Test: Understanding When and Why to Use XML. The article is a more formal write up from my weblog post The XML Litmus Test expanded to contain examples of appropriate and inappropriate uses of XML as well as with some of the criteria for choosing XML fleshed out. Below is an excerpt from the article which contains the core bits that I hope everyone who reads it remembers

XML is the appropriate tool for the job if the following criteria are satisfied by choosing XML as the data representation format for a given application.

1.      there is a need to interoperate across multiple software platforms

2.      one or more of the off-the-shelf tools for dealing with XML can be leveraged when producing or consuming the data

3.      parsing performance is not critical

4.      the content is not primarily binary content such as a music or image file

5.      the content does not contain control characters or any other characters that are illegal in XML

If the expected usage scenario does not satisfy most or all of the above criteria then it doesn't make much sense to use XML as the data representation format for the situation in question.

As the program manager responsible for XML programming models and schema validation in the .NET Framework I've seen lots and lots of inappropriate usage of XML both from internal teams and our customers. Hopefully once this article is published I can stop repeating myself and just send people links to it next time I see someone asking how to escape control characters in XML or see another online discussion of "binary" XML.

* Yes, it's late

Categories: XML

October 8, 2004

@ 05:58 PM

WS-* Specs are like JSRs Redux

In his post Debating WS-* Geoff Arnold writes

Tim Bray continues to discuss the relevance of the so-called WS-* stack: the collection of specifications related to XML-based web services. I'm not going to dive into the technology or business issues here; however Tim referred to a piece by Dare Obasanjo which argues that WS-* Specs are like JSRs. I tried to add a comment to this, but Dare's blog engine collapsed in a mess of XML, so I'll just post it here. Hopefully you'll be able to get back to read the original piece if you're interested. [Update: It looks as if my comment made it into Dare's blog after all.]

Just out of curiosity... if WS-* are like JSRs, what's the equivalent of the JCP? Where's the process documented, and what's the governance model? The statement "A JSR is basically a way for various Java vendors to standardize on a mechanism for solving a particular customer problem" ignores the fact that it's not just any old "way"; it's a particular "way" that has been publically codified, ratified by the community, and evolved to meet the needs of participants.

Microsoft isn't trying to compete with standards organizations. The JCP process falls out of the fact that Sun decided not to submit Java to a standards body but got pushback from customers and other Java vendors for something similar. So Sun manufactured an organization and process quite similar to a standards body with itself at the head. Microsoft isn't trying to get into this game.

The WS-* strategy that Microsoft is pursuing is informed from a lot of experience in the world of XML and standards. In the early days of XML, the approach to designing XML standards [especially at the W3C] was to throw together a bunch of preliminary ideas and competing draft specs without implementation experience then try to merge that into a coherent whole. This has been problematic as I wrote a few months ago

In recent times the way the W3C produces a spec is to either hold a workshop where different entities can submit proposals and then form a working group based on coming up with a unification of the various proposals or forming a working group to find come up with a unification of various W3C Notes submitted by member companies. Either way the primary mechanism the W3C uses to produce technology specs is to take a bunch of contradictory and conflictiong proposals then have a bunch of career bureaucrats try to find some compromise that is a union of all the submitted specs. There are two things that fall out of this process. The first is that the process takes a long time, for example the XML Query workshop was in 1998 and six years later the XQuery spec is still a working draft. Also XInclude proposal was originally submitted to the W3C in 1999 but five years later it is just a candidate recommendation. Secondly, the specs that are produced tend to be too complex yet minimally functionaly since they compromise between too many wildly differing proposals. For example, W3C XML Schema was created by unifying the ideas behind DCD, DDML, SOX, and XDR. This has lead to a dysfunctional specification that is too complex for the simple scenarios and nigh impossible to use in defining complex XML vocabularies.

The WS-* process Microsoft has engaged the industry in aims at preventing this problems from crippling the world of XML Web Services as it has the XML world. Initial specs are written by the vendors planning who'll primarily be implementing the functionality then they are revised based on the results of various feedback and interoperability workshops. As a result of these workshops some specs are updated while others turn out to be infeasible and are deprecated. Some people such as Simon Fell, in his post WS-Gone, have complained that these leads to a situation where things are too much in flux but I think this is a lot better than publishing standards which turn out to contain features that are either infeasible to implement or are just plain wrong. Working in the world of XML technologies over the past three years I've seen both.

The intention is that eventually the specs that show that they are the fittest will end up in the standards process. This exactly what has happened with WS-Security (OASIS) and WS-Addressing (W3C). I expect more to follow in the future.

Categories: Technology | XML

October 2, 2004

@ 05:57 AM

Why I Love XML-DEV

From Len Bullard

>What's the silver bullet?

It's a bar in Phoenix.

From Tim Bray

I disagree with virtually every technical argument Ted Nelson has ever
made and (in most cases) the implementations are on my side, but it
doesn't matter; Ted's place in history is secure because he asked more
important questions than just about anybody. I think he usually
offered the wrong answers, but questions are more important.

The thread that produced these gems is Ted Nelson's "XML is Evil" which revisits Ted Nelson's classic rant Embedded Markup Considered Harmful.

Categories: XML

September 20, 2004

@ 05:57 PM

Frustrated with the Limitations of XSD for XML Document Validation? Try Schematron!!!

My article Improving XML Document Validation with Schematron is finally up on MSDN. It provides a brief introduction to Schematron, shows how to embed Schematron assertions in a W3C XML Schema document for improved validation capabilities and how to get the power of Schematron in the .NET Framework today. The introduction of the article is excerpted below

Currently the most popular XML schema language is the W3C XML Schema Definition language (XSD). Although XSD is capable of satisfying scenarios involving type annotated infosets it is fairly limited when it comes to describing constraints on the structure of an XML document. There are many examples of situations where common idioms in XML vocabulary design are impossible to express using the constraints available in W3C XML Schema. The three most commonly requested constraints that are incapable of being described by W3C XML Schema are:

The ability to specify a choice of attributes. For example, a server-status element should either have a server-uptime attribute or a server-downtime attribute.

The ability to group elements and attributes into model groups. Although one can group elements using compositors such as xs:sequence, xs:choice, and xs:all, the same cannot be done with both elements and attributes. For example, one cannot create a choice between one set of elements and attributes and another.

The ability to vary the content model based on the value of an element or attribute. For example, if the value of the status attribute is "available" then the element should have an uptime child element; otherwise it should have a downtime child element. The technical name for such constraints is co-occurrence constraints.

Although these idioms are widely used in XML vocabularies it isn't possible to describe them using W3C XML Schema, which makes it difficult to rely on schema validation for enforcing the message contract. This article describes how to layer such functionality on top of the W3C XML Schema language using Schematron.

Embedding Schematron assertions in a W3C XML Schema document allows you to have your cake and eat it to.

Categories: XML

September 19, 2004

@ 10:14 PM

WS-* Specs Are Like JSRs

Tim Bray has another rant on the prolifieration of WS-* specs in the XML Web Services world. In his post The Loyal WS Opposition he writes

I Still Dont Buy It No matter how hard I try, I still think the WS-* stack is bloated, opaque, and insanely complex. I think its going to be hard to understand, hard to implement, hard to interoperate, and hard to secure.

I look at Google and Amazon and EBay and Salesforce and see them doing tens of millions of transactions a day involving pumping XML back and forth over HTTP, and I cant help noticing that they dont seem to need much WS-apparatus.

One way to view the various WS-* specifications is that they are akin to Java Specification Requests (JSRs) in the Java world. A JSR is basically a way for various Java vendors to standardize on a mechanism for solving a particular customer problem. Usually this mechanism takes the form of an Application Programming Interface (API). Some JSRs are widely adopted and have become an integral aspect of programming on the Java platform (e.g. the JAXP JSR). Some JSRs are pushed by certain vendors while being ignored by others leading to overlap (e.g. the JDO JSR which was voted against by BEA, IBM and Oracle but supported by Macromedia and Sun). Then there's Enterprise Java Beans which is generally decried as a bloated and unnecessarily complex solution to business problems. Again that was the product of the JSR process.

The various WS-* specs are following the same pattern as JSRs which isn't much of a surprise since a number of the players are the same (e.g. Sun & IBM). Just like Tim Bray points out that one can be productive without adopting any of the WS-* family of specifications it is similarly true that one can be productive in Java without relying on the products of JSRs but instead rolling one's own solutions. However this doesn't mean there aren't benefits of standardizing on the high level mechanisms for solving various business problems besides saying "We use XML and HTTP so we should interop".

Omri Gazitt, the Produce Unit Manager of the Advanced XML Web Services team has a post on WS-Transfer and WS-Enumeration which should hit close to home for Tim Bray since he is the co-chair of the Atom working group

WS-Transfer is a spec that Don has wanted to publish for a year now. It codifies the simple CRUD pattern for Web services (the operations are named after their HTTP equivalents - GET, PUT, DELETE, and there is also a CREATE pattern. The pattern of manipulating resources using these simple verbs is quite prevalent (Roy Fielding's REST is the most common moniker for it), and of course it underlies the HTTP protocol. Of course, you could implement this pattern before WS-Transfer, but it does help to write this down so people can do this over SOAP in a consistent way. One interesting existing application of this pattern is Atom (a publishing/blogging protocol built on top of SOAP). Looking at the Atom WSDL, it looks very much like WS-Transfer - a GET, PUT, DELETE, and POST (which is the CREATE verb specific to this application). So Atom could easily be built on top of WS-Transfer. What would be the advantage of that? The same advantage that comes with any kind of consistent application of a technology - the more the consistent pattern is applied, the more value it accrues. Just the value of baking that pattern into various toolsets (e.g. VS.NET) makes it attractive to use the pattern.

I personally think WS-Transfer is very interesting because it allows SOAP based applications to model themselves as REST Web Services and get explicit support for this methodolgy from toolkits. I talked about WS-Transfer with Don a few months ago and I've had to bite my tongue for a while whenever I hear people complain that SOAP and SOAP based toolkits don't encourage building RESTful XML Web Services.

I'm not as impressed with WS-Enumeration but I find it interesting that it also covers another use case of the ATOM API which is a mechanism for pulling down the content archive from a weblog or similar system in a sequential manner.

Categories: Technology | XML

September 14, 2004

@ 08:45 AM

EXSLT.NET 1.1 released

Oleg has just announced a new release of EXSLT, his post is excerpted below

Here we go again - I'm pleased to announce EXSLT.NET 1.1 release. It's ready for download. The blurb goes here:

EXSLT.NET library is community-developed free open-source implementation of the EXSLT extensions to XSLT for the .NET platform. EXSLT.NET fully implements the following EXSLT modules: Dates and Times, Common, Math, Random, Regular Expressions, Sets and Strings. In addition EXSLT.NET library provides proprietary set of useful extension functions.

Download EXSLT.NET 1.1 at the EXSLT.NET Workspace home - http://workspaces.gotdotnet.com/exslt
EXSLT.NET online documentation - http://www.xmland.net/exslt

EXSLT.NET Features:

65 supported EXSLT extension functions
13 proprietary extension functions
Support for XSLT multiple output via exsl:document extension element
Can be used not only in XSLT, but also in XPath-only environment
Thoroughly optimized for speed implementation of set functions

Here is what's new in this release:

New EXSLT extension functions has been implemented: str:encode-uri(), str:decode-uri(), random:random-sequence().
New EXSLT.NET extension functions has been implemented: dyn2:evaluate(), which allows to evaluate a string as an XPath expression, date2:day-name(), date2:day-abbreviation(), date2:month-name() and date2:month-abbreviation() - these functions are culture-aware versions of the appropriate EXSLT functions.
Support for time zone in date-time functions has been implemented.
Multithreading issue with ExsltTransform class has been fixed. Now ExsltTransform class is thread-safe for Transform() method calls just like the System.Xml.Xsl.XslTransform class.
Lots of minor bugs has been fixed. See EXSLT.NET bug tracker for more info.
We switched to Visual Studio .NET 2003, so building of the project has been greatly simplified.
Complete suite of NUnit tests for each extension function has been implemented (ExsltTest project).

The EXSLT.NET project has come quite some way since I started it last year. Oleg has done excellent work with this release. It's always great to see the .NET Open Source community come together this way.

Categories: XML

September 8, 2004

@ 03:23 PM

7 Fallacies of Validation

Roger Costello recently started a discussion thread on the XML-DEV mailing list about the common misconceptions people have about XML document validation and schemas. He recently summarized the discussion thread in his post Fallacies of Validation, version #3. His post begins

The purpose of documenting the below "fallacies" is to identify erroneous common thought that many people have with regards to validation and its role in a system architecture. Perhaps "assumptions" would be a better term to use than "fallacies". In any case, the desire of this writeup (which is a compilation of discussions on the xml-dev list) is to provoke new ways of thinking about validation, and reject limiting and static views on validation.

Fallacies of Validation

1. Fallacy of "THE Schema"

2. Fallacy of Schema Locality

3. Fallacy of Requisite Validation

4. Fallacy of Validation as a Pass/Fail Operation

5. Fallacy of a Universal Validation Language

6. Fallacy of Closed System Validation

7. Fallacy that Validation is Exclusively for Constraint Checking

I mostly agree with the fallacies as described in his post.

Fallacy #1 has been a favorite topic of Tim Ewald over the past year. It isn't necessarily true that there is one canonical schema for an XML vocabulary. Instead the schema for the vocabulary may depend on the context the XML document is being used in. A classic example of this is XHTML which has 3 schemas (DTDs) for a single format.

I consider Fallacy #2 to be more of a common mistake than a fallacy. Many people create validation systems that work in a local environment such as creating specific patterns or structures for addresses or telephone numbers which may work in a local system but break down when used in a global environment like the World Wide Web. This common mistake isn't limited to XML validation but applies to all arenas where user input is validated before being stored or processed

Fallacy #3 is interesting to me because I wonder how often it occurs in the wild. Are there really that many people who believe they have to validate XML documents against a schema?

Fallacy #4 is definitely a good one. However I disagree with the quotes he uses to butress the main point for this fallacy. I especially don't like the fact that he uses a generalization from Rick Jellife about bugs in a few schema validators as a core part of his argument. The important point is that schema validation should not always be viewed as a PASS/FAIL operation and in fact schema languages like W3C XML Schema go out of their way to define how one can view an XML document as being part valid, part invalid.

One size doesn't fit all is the message of Fallacy #5 to which I heartily cheer "Hear! Hear!". I agree 100%. There is no one XML schema language that satisfies every validation scenario.

I don't really understand Fallacy #6 without seeing some examples so I won't comment on it. I'll see if I can dig up the discussion threads about this on XML-DEV later.

Fallacy #7 is another one where I agree with the message but mostly disagree with how he argues the point. All of his examples are all variations of using schemas for constraint checking, they just differ on how the document is processed does after constraint checking is done. To me, the prime example of the fact that schema validation is not just for constraint checking is that many technologies actually using schemas for creating typed XML documents or for translating XML from one domain to another (e.g. Object<->XML, Relational<-> XML),

Everything said, this was a good list. Excellent work from Roger as usual.

Categories: XML

September 3, 2004

@ 05:47 AM

More Information on the XPathDocument/XmlDocument Change in Whidbey beta 2

In my recent post entitled The MSDN Camp vs. The Raymond Chen Camp I wrote

Our team [and myself directly] has gone through a process of rethinking a number of decisions we made in this light. Up until very recently we were planning to ship the System.Xml.XPath.XPathDocument class as a replacement for the System.Xml.XmlDocument class
...
The problem was that the XPathDocument had a radically different programming model than the XmlDocument meaning that anyone who'd written code using the XmlDocument against our v1.0/v1.1 bits would have to radically rewrite their code to get performance improvements and new features. Additionally any developers migrating to the .NET Framework from native code (MSXML) or from the Java world would already be familiar with the XML DOM API but not the cursor-based model used by the XPathDocument. This was really an untenable situation. For this reason we've reverted the XPathDocument to what it was in v1.1 while new functionality and perf improvements will be made to the XmlDocument. Similarly we will keep the new and improved ~~XPathEditableNavigator~~ XPathNavigator class which will be the API for programming against XML data sources where one wants to abstract away what the underlying store actually is. We've shown the power of this model with examples such as the ObjectXPathNavigator and the DataSetNavigator.

I've seen some concerned statements about this posts from XML developers who use System.Xml such as Oleg Tkachenko, Fumiaki Yoshimatsu and Tomas Restrepo so it seems I should clarify some of the decisions we made and why we made them.

In version 1.0 of the .NET Framework we provided two primary classes for interacting with XML; the XmlDocument and XmlReader. The XmlReader provided an abstract interface for interacting with a stream of XML. One can create an XmlReader over textual XML using the XmlTextReader or over virtual XML data sources such as is done with the XmlCsvReader. On the other hand with the XmlDocument we decided to eschew the approach favored by the Java world which used interfaces. Instead we created a single concrete implementation. This turned out to be a bad idea. It tied the the interface for programming against XML in a random access manner with a concrete implementation of an XML store. This made it difficult for developers who wanted to expose their data sources as XML stores and led to inefficient solutions such as the XmlDataDocument.

To rectify this we needed to separate the programming model for accessing XML data sources from our concrete implementation of the XmlDocument. We chose to do this by extending the cursor based programming model we introduced in v1 with the XPathNavigator instead of moving to an interface based approach with XmlDocument. The reason for choosing to go with a cursor based model over a tree based model is summed up in this quote from my article Can One Size Fit All?

In A Survey of APIs and Techniques for Processing XML, I pointed out that cursor-model APIs could be used to traverse in-memory XML documents just as well as tree-model APIs. Cursor-model APIs have an added advantage over tree-model APIs in that an XML cursor need not require the heavyweight interface of a traditional tree-model API where every significant token in the underlying XML must map to an object.

So in Whidbey, the XPathNavigator will be the programming model for working with XML data sources when one wants to abstract away from the underlying source. The XPathNavigator will be changed from the v1.0 model in the following ways (i) it will be editable and (ii) it will expose the post schema validation infoset. I've already worked with Krzysztof Cwalina on updating the Design Guidelines for Exposing XML data in WinFX to account for this change in affairs.

As for the XPathDocument, it is what it always has been. A class optimized for use in XPath and XSLT. If you need 10% - 25% better perf [depending on your scenario] when running XPath over an XML document or running XSLT over in-memory XML then this class should be preferred to the XmlDocument.

Categories: Life in the B0rg Cube | XML

August 25, 2004

@ 04:32 PM

Comments [9]

The MSDN Camp vs. The Raymond Chen Camp

A few months ago in Joel Spolsky's How Microsoft Lost the API War he wrote

There are two opposing forces inside Microsoft, which I will refer to, somewhat tongue-in-cheek, as The Raymond Chen Camp and The MSDN Magazine Camp.
...
The Raymond Chen Camp believes in making things easy for developers by making it easy to write once and run anywhere (well, on any Windows box). The MSDN Magazine Camp believes in making things easy for developers by giving them really powerful chunks of code which they can leverage, if they are willing to pay the price of incredibly complicated deployment and installation headaches, not to mention the huge learning curve. The Raymond Chen camp is all about consolidation. Please, don't make things any worse, let's just keep making what we already have still work. The MSDN Magazine Camp needs to keep churning out new gigantic pieces of technology that nobody can keep up with.

When I first read the above paragraphs I disagreed with them because I was in denial. But as the months have passed and I've looked at various decisions my team has made in recent years I see the pattern. The patterns repeats itself in the actions of other product teams and divisions at Microsoft. I know realize this is an unfortunate and poisonous aspect of Microsoft's culture which doesn't work in the best interest of our customers. A few months ago I found some advice given to Ward Cunningham on joining Microsoft which read

Take a running start and don't look back

Recognize that your wonderful inventiveness is the most valuable thing you will own in a culture that values its employees solely by their latest contributions. In a spartan culture like this, you will rise quickly.

Keep spewing ideas, even when those ideas are repeatedly misunderstood, implemented poorly, and excised from products for reasons that have nothing to do with the quality of the idea. When you give up on communicating your new ideas, you will just go insane waiting to vest.

The Microsoft culture is about creating the newest, latest greatest thing that 'changes the world' not improving what is already out there and working for customers. When I read various Microsoft blogs and MSDN headlines about how even though we've made paradigm shifts in developer technologies in the recent years we aren't satisfied and want to introduce radically new and different technologies all over again. This bothers me. I hate the fact that 'you have to rewrite a lot of your code' is a common answer to questions a customer might ask about how to leverage new or upcoming functionality in a developer technology.

Our team [and myself directly] has gone through a process of rethinking a number of decisions we made in this light. Up until very recently we were planning to ship the System.Xml.XPath.XPathDocument class as a replacement for the System.Xml.XmlDocument class. One of the driving reasons for doing this was XPath and XSLT performance. The mismatch between the DOM data model and that of XPath meant that XPath queries or XSLT transformations over the XmlDocument would never be as fast as XPathDocument. Another reason we were doing this was that since the XmlDocument is not an interface based design there isn't a way for people who implement their own XML document-like classes to plug-in to our world. So we decided to de-emphasize (but not deprecate) the XmlDocument by not adding any new functionality or performance improvements to it and focused all our energy on XPathDocument.

The problem was that the XPathDocument had a radically different programming model than the XmlDocument meaning that anyone who'd written code using the XmlDocument against our v1.0/v1.1 bits would have to radically rewrite their code to get performance improvements and new features. Additionally any developers migrating to the .NET Framework from native code (MSXML) or from the Java world would already be familiar with the XML DOM API but not the cursor-based model used by the XPathDocument. This was really an untenable situation. For this reason we've reverted the XPathDocument to what it was in v1.1 while new functionality and perf improvements will be made to the XmlDocument. Similarly we will keep the new and improved ~~XPathEditableNavigator~~ XPathNavigator class which will be the API for programming against XML data sources where one wants to abstract away what the underlying store actually is. We've shown the power of this model with examples such as the ObjectXPathNavigator and the DataSetNavigator.

It's good to be back in the Raymond Chen camp.

Categories: Life in the B0rg Cube | XML

August 22, 2004

@ 03:30 AM

My Current Writing Queue

Every once in a while I like to post a list of articles I'm either in the process of writing or considering writing to get feedback from people on what they'd like to see or whether the topics are even worthwhile. Below is a list of the next couple of articles I'm either in the process of writing or plan to write over the next few months.

An Introduction to Validating XML Documents with Schematron (MSDN) : An introduction to Schematron including examples showing how one can augment a W3C XML Schema document using Schematron thus creating an extremely powerful XML schema language. Code samples will use Schematron.NET
Designing XML Formats: Versioning vs. Extensibility (XML 2004 Conference) : This is the presentation and paper for my XML 2004 talk. It will basically be the ideas in my article On Designing Extensible, Versionable XML Formats with more examples and less fluff.
The XML Litmus Test - Deciding When and Why To Use XML (MSDN) : After seeing more and more people at work who seem to not understand what XML is good for or what the decision making process should be for adopting XML I decided to put this article together. This will basically be an amalgamation of my XML Litmus Test blog post and my Understanding XML article on MSDN.
XML in Cw (XML.com) : An overview of the XML based features of Cw. The Cw type system contains several constructs that reduce the impedance mismatch between OO and XSD by introducing concepts such as anonymous types, choices [aka union types], nullable types and constructing classes from XML literals into the .NET world. The ability to process such strongly typed XML objects using rich query constructs based on SQL's select operator will also be covered.
A Comparison of Microsoft's C# Programming Language to Sun Microsystems' Java Programming Language 2nd edition : About 3 years ago I wrote a C# vs. Java comparison while I was still in school which has become the most popular comparison of both languages on the Web. I still get mail on a semi-regular basis from people who've been able to transition between both languages due to the information in my comparison document. I plan to update this article to reflect the proposed changes announced in Java 1.5 and C# 2.0

On top of this I've been approached twice in the past few months about writing a technology book. Based on watching the experiences of others my gut feel is that it isn't worth the effort. I'd be interested in any feedback on the above article ideas or even suggestions for new articles that you'd be interested in seeing on MSDN or XML.com from me.

Categories: Technology | XML

August 20, 2004

@ 06:13 AM

RELAX NG, XSD and XML Web Services

In his post I Want RELAX NG! Tim Ewald writes

This recent post on Mark Nottingham's site pushed me over the edge. I agree with Sean's comment: I want Relax NG. Can I make systems work with XSD? Yes, sort of. But it adds a ludicrous amount of complexity. First you have to know how it works, then what not to do because it's too complicated (like complicated type or element substitution models), then figure out how to contort your schema to do what you want (like extensibility and versioning). Relax NG is much simpler and much closer to how XML actually works. And yes, you can still map it to /from objects if you want to.

I can't help but wonder why, if WS-* and SOAP 1.2 keep XSD at arms length (referencing simple types only and providing non-normative schema definitions) and WSDL 2.0 defines its own simple types, everyone assumes I want to use XSD to define my Web service interface. Pretty much everyone I know who works in this space agrees that Relax NG is a better choice. What is stopping us from making this change?

This is one of those times where I both agree and disagree with Tim. To explain why, I first need to list the ~~two~~ three reasons people tend to write schemas.

To provide a way to annotate an XML document with type information and thus created a type annotated infoset.
To provide a means to ensure that an XML documents satisfies the constraints of a given message contract
To provide terse, human readable documentation of an XML format.

In most developer scenarios [including XML Web Services] the most popular use case is the first from the list above. An XML Schema is used primarily for mapping the contents of an XML document either into relational tables (e.g. SQLXML, ADO.NET DataSet) or into a set of programming language objects (e.g. System.Xml.Serialization.XmlSerializer). Every XML Web Service toolkit I have encountered emphasizes this scenario and in fact most customers do not use XML schemas for validation of business documents for either performance reasons or the fact that their business rules cannot be adequately described using an XML schema. The main problem with XSD for this use case is that it is actually too expressive and has a richer type system than either the relational model or traditional object oriented programming languages. This leads to impedance mismatches which makes it hard for XML Web Service stacks to map schema declarations to objects thus leading to calls from folks like the WS-I to propose creating a subset or profile of XSD.

On the other hand, XSD is notoriously bad at dealing with the second use case described above. The language makes either makes it hard to describe common XML idioms (see the hoops I have to jump through in my Designing Extensible, Versionable XML Formats article) or impossible (e.g. if an attribute has a certain value then the element should have a certain content model or the providing a choice of attributes). This is where RELAX NG shines. Of course, being more expressive than XSD means that the impedance mismatch between it and the relational and OO models is even more significant.

In practice today, most XML Web Services need an XML schema language for creating type annotated infosets not for validating message structure. This means that for their use cases XSD is preferable to RELAX NG. Ideally, a simple language that just allowed creating named structures and primitive types such as Microsoft's now-obsolete XML Data Reduced (XDR) would be even more optimal.

Of course, the XML Web Services world could one day evolve to the point where being able to validate incoming messages against a schema is deemed more important than being able to deserialize the XML into objects and vice versa. In which case, Aaron Skonnard's statement in his post Could RelaxNG Replace XSD? which describe the existing industry inertia around XSD is also a point to consider.

Categories: XML

August 18, 2004

@ 10:16 AM

Women in XML

I saw the following excerpt in Shelley Powers's post entitled Differences of Humor where she wrote

Sam Ruby has posted a note about the upcoming Applied XML Conference put on by Chris Sells. When I looked at the agenda and realized that the conference managed to put together two days worth of presentations without one woman speaker,

Knowing the nature of Chris Sells's conferences this is unsurprising. They seem to mostly be an opportunity for Chris's DevelopMentor clique and their buddies to hang out. However Shelley's post did make me start thinking about how many women I knew who worked with XML and just like the time I started to keep a list of Seinfeld episodes in which at least one African or African American appeared in (don't ask) I started tracking down the number of women I knew off who worked on XML technologies who's works I'd rather see present than at least one of the presentations currently on the roster. Here is my list

Non-Microsoft

Eve Maler - Sun's most notable XML geek after Jon Bosak and Tim Bray. She's worked on SAML and UBL. I meet her at XML 2003 where we chatted about versioning in UBL and what truly meant by polymorphic XML processing.
Jeni Tennison - the most knowledgeable person on the planet about W3C XML Schema. I've lost count of the amount of times I've seen her school members of the W3C XML Schema working group about the technology on various mailing lists. Also an XSLT and XPath guru. She's always pushing boundaries in the XML world such as with her work on layered hierarchies in markup vocabularies with LMNL
Priscilla Walmsley - the author of Definitive XML Schema which is probably the best book on W3C XML Schema on the market. She's also co-written a book on XML in Office 2003 which I haven't read but would love to get a presentation on especially with regard to some finer details on how Office uses XML schemas.
Amelia Lewis - a co-author of the WS-ReliableMessaging specification and the author of an excellent critique of the W3C XML Schema primitive types in her article Not My Type: Sizing Up W3C XML Schema Primitives

Microsoft

Elena - the Microsoft XML Web Service stack rests on her shoulders. What makes Visual Studio.NET an awesome XML Web Service environment is that there is functionality that lets you point at a WSDL and automatically you get handy dandy .NET classes generated for you. Elena owns the meat of this code, a lot of which resides in the XmlSerializer class
Denise Draper - an architect on our team who in a past life has been a member of the XQuery working group, worked an XML data integration suite for Nimble Technology and worked in the AI field.
Priya Lakshminarayanan - the developer for the W3C XML Schema validation technology in the .NET Framework. She's the most knowledgeable about the technology at Microsoft, I'm a distant second to her breadth of knowledge about this somewhat arcane and cryptic technology. She the first person I've seen implement a tool for generating sample XML documents from XML schemas that didn't suck.
Helena Kupkova - the developer for the XML parser in the .NET Framework. She completely gutted our old implementation and doubled the perf in some scenarios. A totally impressive developer. More impressive is that she ships stuff like the XML Diff and Patch demo on GotDotNet in her spare time.
Nithya Sampathkumar - the developer on the XML schema inference technology in the .NET Framework. Once I took over as the program manager for this technology I grew to understand the subtleties involved in trying to infer a schema for arbitrary XML documents. A presentation on the techniques used in her implementation and the limitations of XML schema inference would be quite interesting.
Neetu Rajpal - the program manager for XML tools in Visual Studio. I've overheard some interesting conversations involving her discussing some of the trickiness involved in implementing an XSLT debugger. An in-depth presentation about what the XML tools team is planning to ship and the issues they encountered would be killer.
Vinita - the program manager for MSXML which is the most widely deployed XML library on the planet. Even without shipping in Internet Explorer, Windows and Office they still get millions of downloads a year.
Tejal Joshi - works on the XML tools in Visual Studio. At last year's XML 2003 conference I enjoyed hearing James Clark discuss implementation strategies for his nxml-mode in Emacs. I'm sure Tejal would have similarly interesting stories to tell.
Lanqing Dai - used to be developer for the XmlDocument class but has moved on to WinFS. I'd love to hear a her thoughts on how working in an XML-centric world compares to living in the item-centric world of WinFS.

There are more women I know off in the XML field both within and outside Microsoft but these are the ones whose presentations I'd rather see than something like XML as a Better COM (for example). Maybe next time Chris Sells should look around the usual XML hang outs both online (like the xml-dev mailing list) and within Microsoft internally for conference speakers instead of announcing them in his blog. It may lead to a more diverse list of topics and speakers.

I need to go watch Berserk. Talk to you guys later.

Categories: XML

August 14, 2004

@ 09:48 PM

Chris Anderson: Developers Hate XML

According to the current version of the Chris Sells XML DevCon page (don't bother bookmarking it, Chris Sells doesn't believe in permalinks so all the content on that page is transient) I noticed that Chris Anderson is presenting the following

Developers Hate XML
Chris Anderson

While everyone is currently infatuated with XML, developers are constantly doing battle with trying to rationalize and leverage XML in their applications. Ill talk about having to balance correct XML-isms vs. usability in XAML, about the preponderance of XML reader/writer/DOM/serialization APIs, and about how all of this throws you into a horrible programming experience of loosely typed runtime errors. This reveals XML for what it is a data encoding. XML is the ASCII text file of the 2000s. While web services are often called "XML Web Services," the reality is that every web service API abstracts the developer from the XML view.

Nothing says vote of confidence like when the chief architects of one of teams you work closely with says your technology sucks. :)

Seriously though, I am curious as to what his presentation actually will be about. Reading the abstract, it seems like it is another iteration of a data-centric user of XML coming to the realization that for their scenarios XML is just CSV on steroids. People's behavior when they realize this usually follows a pattern similar to the five stages of grief. First there is denial, this usually takes the form of an initial disbelief that after all the hype they've heard about XML it isn't working out fantastically for them. Then there is bargaining, this usually manifests itself as attempts to not use XML but still use it. Often you here phrases like "binary XML", "XML subset" or "XML profile" at this point. Then there is anger at XML for being more complex and verbose than they need. At this point you get to either read a rant-filled email, blog post, conference paper or in this case conference presentation about how badly the technology is suited for its purpose. Then there is either despair or acceptance. One doesn't follow the other. If the next stage is despair in this case the person ends up not using XML to solve that particular problem. On the other hand if it is acceptance, XML is still used but in some cases it is in one of the forms that were mentioned in the bargaining stage such as a binary representation of an XML stream or uing some subset of XML.

Hopefully Chris Anderson will post his slides online.

Categories: XML

August 11, 2004

@ 05:22 PM

New and Upcoming articles on the MSDN XML Developer Center

I have been remiss about talking about the ongoing and upcoming content on the MSDN XML Developer Center. The most interesting recent article has been Priya's Generating XML Documents from XML Schemas. In this article she provides a tool that allows you to generate sample documents that validate against a particular schema. This is very useful if you have a schema document but would like to see what a valid instance of that schema looks like. This is one of those tools I'd love to see added to the XML editor in Visual Studio. The one caveat is that her tool requires .NET Framework v2.0 beta 1 or higher to run.

The next couple of articles we have scheduled include an overview of how to use EXSLT to make one more productive as an XSLT developer by Oleg Tkachenko, an implementation of XPathNavigator over the ADO.NET DataSet by Arpan Desai, and an introduction Daniel Cazzulino's implementation of Schematron for the .NET Framework by myself.

I'm currently working on the content plan for the next quarter of the year and would like to know what articles people have liked and would like to see in the future. Also if you are interested in writing for MSDN about XML technologies on Microsoft platfortms go ahead and send me an email at my work address.

Categories: XML

August 3, 2004

@ 05:20 AM

Transcripts of Online Chat with Microsoft XML Team

On July 8th, a couple of us from the XML team held a hosted chat on MSDN Chats. The transcripts of the C# and XML chat are now available. We answered questions on our existing behavior in Everett as well as upcoming technologies in Whidbey. If there are any followup questions to those asked during the chat just post them here and I'd love to answer them.

Categories: Life in the B0rg Cube | XML

August 1, 2004

@ 03:41 AM

Loading XML Files from Behind a Proxy Server

I recently got a number of bug reports that in certain situations RSS Bandit would report a proxy authentication error when fetching certain RSS feeds over the Web when connecting through a proxy server. It seemed most feeds would work fine but a particular set of feeds would result in the following message

The remote server returned an error: (407) Proxy Authentication Required.

Examples of sites that had problems include the feeds for Today on Java.net, Martin Fowler's bliki and Wired News. It dawned on me that the one the one thing all these feeds had in common was that they referenced a DTD. The problem was that although I was using an instance of the System.Net.IWebProxy interface in combination with an HttpWebRequest when fetching the RSS feed I did not provide the XmlValidatingReader used to process the feed that it should use the proxy information when resolving DTDs.

This is where things got less intuitive. All XmlReaders have an XmlResolver property used to retrieve resources external to the file. However the XmlResolver class does not provide a way to specify proxy information, only authenticattion information. To solve this problem I had to create a subclass of the XmlResolver class which used the proxy connection when retrieving external resources. It seems I'm not the only person who's come up across this problem and the solution was presented on the microsoft.public.dotnet.xml newsgroup a while ago in the thread entitled XmlValidatingReader, XmlResolver, Proxy Authentication, Credentials, Remote schema. This post shows how to create a custom XmlResolver which utilizes proxy information and how to use this class to prevent the errors I was seeing.

I checked in the fix to RSS Bandit this morning, so very soon a number of users of the most sophisticated news aggregator on the Windows platform will be very happy campers seeing this annoying bug fixed.

Categories: XML

July 30, 2004

@ 02:44 AM

On Designing Extensible, Versionable XML Formats

About a week ago my article Designing Extensible, Versionable XML Formats appeared on XML.com. However due to a “pilot error” on my end I didn't send the final draft to XML.com. By the time I realized my mistake the article was already live and changing it would have been cumbersome since there were a few major changes in the article.

You can read the final version of the article Designing Extensible, Versionable XML Formats on MSDN. The main differences between the MSDN article and the XML.com one are

Added sections on Message Transfer Negotiation vs. Versioning Message Payloads and Version Numbers vs. Namespace Names
Added more content to the section Using XML Schema to Design an Extensible XML Format especially around usage of substitution groups, xsi:type and xs:redefine.
Amended all sample schemas to use blockdefault="#all".
Added an Acknowledgements section
Schema in for section New constructs in a new namespace approach uses a fixed value instead of a default value for mustUnderstand attribute on isbn element.

Categories: XML

July 21, 2004

@ 06:48 AM

When Backwards Compatibility Mode Isn't

Today Arpan (the PM for XML query technologies in the .NET Framework) and I were talking about features we'd like to see on our 'nice to have' list for the Orcas release of the .NET Framework. One of the things we thought would be really nice to see in the System.Xml namespace was XPath 2.0. Then Derek being the universal pessimist pointed out that we already have APIs that support XPath 1.0 that only take a string as an argument (e.g. XmlNode.SelectNodes) so we'd have difficulty adding support for another version of XPath without contorting the API.

Not to be dissuaded I pointed out that XPath 2.0 has a backwards compatibility mode which makes it compatible with XPath 1.0. Thus we wouldn't have to change our Select methods or introduce new methods for XPath 2.0 support since all queries that used to work in the past against our Select methods would still work if we upgraded our XPath implemention to version 2.0. This is where Arpan hit me with the one-two punch. He introduced me to a section of the XPath 2.0 spec called Incompatibilities when Compatibility Mode is true which reads

The list below contains all known areas, within the scope of this specification, where an XPath 2.0 processor running with compatibility mode set to true will produce different results from an XPath 1.0 processor evaluating the same expression, assuming that the expression was valid in XPath 1.0, and that the nodes in the source document have no type annotations other than xdt:untypedAny and xdt:untypedAtomic.

I was stunned by what I read and I am still stunned now. The W3C created XPath 2.0 which is currently backwards incompatible with XPath 1.0 and added a compatibility mode option to make it backwards compatible with XPath 1.0 but it actually still isn't backwards compatible even when in this mode? This seems completely illogical to me. What is the point of having a backwards compatibility mode if it isn't backwards compatible? Well, I guess now I know if we do decide to ship XPath 2.0 in the future we can't just add support for it transparently to our existing classes without causing some API churn. Unfortunate.

Caveat: The fact that a technology is mentioned as being on our 'nice to have' list or is suggested in a comment to this post is not an indication that it will be implemented in future versions of the .NET Framework.

Categories: XML

July 15, 2004

@ 07:33 AM

Bang! Bang! My Baby Shot Me Down

I was reading an XML-Deviant column on XML.com entitled Browser Boom when I came across the following excerpt

The inevitable association with Microsoft's CLI implementation is proving a source of difficulty for the Mono project. The principal author of Mono's XML support, Atsushi Eno, posted to the Mono mailing list on the problems of being conformant in Mono's XML parser implementation. More specifically, whose rules should Mono conform to. W3C or Microsoft?

MS XmlTextReader is buggy since it accepts XML declaration as element content (that violates W3C XML specification section 3 Logical Structures). ... However, there is another discussion that it is useful that new XmlTextReader (xmlText, XmlNodeType.Element, null) accepts XML declaration.

... that error-prone XmlTextReader might be useful (especially for people who already depends on that behavior)

... we did not always reject Microsoft badness; for example we are copying System.Xml.XmlCDataSection that violates W3C DOM interface hierarchy (!)

The root of the dilemma is similar to that which Mozilla and Opera are trying to manage in the browser world.

What I find interesting is that instead of pinging the MSFT XML folks (like myself) and filing a bug report this spawned a dozen message email discussion on whether Mono should be bug compatible with the .NET Framework. Of course, if the Mono folks decide to be bug compatible with this and other bugs in System.Xml and we fix them thus causing breaking changes in some cases will we see complaints about how Microsoft is out to get them by being backwards incompatible? Now that Microsoft has created the MSDN Product Feedback Center they don't even have to track down the right newsgroup or email address of a Microsoft employee to file the bug.

It's amazing to me how much work people cause for themselves and conspiracy theories they'd rather live in than communicate with others.

Update: I talked to developer responsible for the XmlTextReader class and she responded "This is by design. We allow XML declaration in XML fragments because of the encoding attribute. Otherwise the encoding information would have to be transferred outside of the XML and manually set into XmlParserContext."

Categories: Life in the B0rg Cube | XML

July 14, 2004

@ 03:42 PM

C-Omega compiler preview available for download

A little while ago some members of our team experimented various ways to reduce the Relational<->Objects<->XML (ROX) impedance mismatch by adding concepts and operators from the relational and XML (specifically W3C XML Schema) world into an object oriented programming language. This effort was spear headed by a number of smart folks on our team including Erik Meijer, Matt Warren, Chris Lovett and a bunch of others all led by William Adams. The object oriented programming language which was used as a base for extension was C#. The new language was once called X# but eventually became known as Xen.

Erik Meijer presented Xen at XML 2003 and I blogged about his presentation after the conference. There have also been two papers published about the ideas behind Xen; Programming with Rectangles, Triangles, and Circles and Unifying Tables, Objects and Documents. It's a new year and the folks working on Xen have moved on to other endeavors related to future versions of Visual Studio and the .NET Framework.

However Xen is not lost. It is now part of the Microsoft Research project, Cw (pronounced C-Omega). Even better you can download a preview of the Cw compiler from the Microsoft Research downloads page

Categories: Technology | XML

July 11, 2004

@ 12:50 AM

On Raining on the W3C's Parade

In a post entitled Dare Obasanjo is raining on the W3C's parade, Mike Dierken responds to my recent post which asks Is the W3C Becoming Irrelevant? by writing

Either way the primary mechanism the W3C uses to produce technology specs is to take a bunch of contradictory and conflictiong proposals then have a bunch of career bureaucrats try to find some compromise that is a union of all the submitted specs

Damn those career bureaucrats that built XML. Or is it the SOAP design process that caused the grief? And where did that technology come from anyway?

My original post already described the specs that have caused grief and show the W3C is losing its way. I assume that Mike is trying to use XML 1.0 and SOAP 1.1 as counter examples to the trend I pointed out. Well first of all, XML 1.0 was a proposal to design a subset of SGML so by definition it could not suffer the same problems that face attempts to innovate by committee which have hampered the W3C in current times. Also when XML 1.0 was created it was much smaller and a majority of the participants in the subsetting of SGML had similar goals. As for SOAP 1.1, it isn't a W3C spec. SOAP 1.1 was created by Don Box, Dave Winer and a bunch of Microsoft and IBM folks and then submitted to the W3C as a W3C Note.

Of course, the W3C has created iterations of both specs (XML 1.1 & SOAP 1.2) which in both cases are backwards incompatible with the previous versions. I leave it as an excercise to the reader to decide if having backwards incompatible point releases of Web specifications is how one 'leads the Web to its full potential'.

Categories: XML

July 9, 2004

@ 07:05 AM

Is the W3C Becoming Irrelevant?

For a long time I used to think the W3C held the future of the World Wide Web in its hands. However I have come to realize that although this may have been true in the past the W3C has become too much of a slow moving bureaucratic machine to attract the kind of innovation that will create the next generation of the World Wide Web. From where I sit there are three major areas of growth for the next generation of the World Wide Web; the next generation of the dynamic Web, syndication and distibuted computing across the Web. With the recent decisions of Mozilla and Opera to form the WHAT working group and Atom's decision to go with the IETF it seems the W3C will not be playing a dominant role in any of these 3 areas.

In recent times the way the W3C produces a spec is to either hold a workshop where different entities can submit proposals and then form a working group based on coming up with a unification of the various proposals or forming a working group to find come up with a unification of various W3C Notes submitted by member companies. Either way the primary mechanism the W3C uses to produce technology specs is to take a bunch of contradictory and conflictiong proposals then have a bunch of career bureaucrats try to find some compromise that is a union of all the submitted specs. There are two things that fall out of this process. The first is that the process takes a long time, for example the XML Query workshop was in 1998 and six years later the XQuery spec is still a working draft. Also XInclude proposal was originally submitted to the W3C in 1999 but five years later it is just a candidate recommendation. Secondly, the specs that are produced tend to be too complex yet minimally functionaly since they compromise between too many wildly differing proposals. For example, W3C XML Schema was screated by unifying the ideas behind DCD, DDML, SOX, and XDR. This has lead to a dysfunctional specification that is too complex for the simple scenarios and nigh impossible to use in defining complex XML vocabularies.

It seems many vendors amd individuals are realizing that the way to produce an innovative technology is for the vendors that will mostly be affected by the technology to come up with a specification that is satisfactory to the participants as opposed to trying to innovate by committee. This is exactly what is happening with the next generation of the dynamic Web with the WHAT working group, with XML Web Services with WS-I and in syndication with RSS & Atom.

The W3C still has a good brand name since many associate it with the success of the Web but it seems that it has become damage that vendors route around in their bid to create the next generation of the World Wide Web.

Categories: XML

July 4, 2004

@ 05:41 PM

Breaking Changes in System.Xml from v1.1 to v2.0 of the .NET Framework

At Microsoft one of our goals in developing software is that backwards compatibility when moving from one version of software to the next is high priority. However in certain cases the old behavior may be undesirable enough that we break compatibility. An example of such undesirable behavior are bugs that lead to incorrect results or security issues. Below is a list of breaking changes in the System.Xml namespace in beta 1 of v2.0 of the .NET Framework.

Extension of xs:anyType which changes the content type to mixed="false" results in error.
Add an enumeration member to XmlWriter.WriteState to indicate that the writer is in error state. Change writer to disallow further writes when in error state.
##other namespace constraint now treated correctly on wild cards.
Instances of DateTime object returned by XmlValidatingReader that represent xs:time and other date & time related W3C XML Schema types now initialized using DateTime.MinValue
Incorrect implementation of XSD derivation hierarchy for xs:ENTITY and xs:NCName corrected.
XSD List Types Not Validated Correctly.
Changed to reliably fail when XmlTextReader source stream switches encoding between calls to ResetState()
XmlTextReader should apply the same security restrictions as the XmlReaders that can be created via the static XmlReader.Create() methods

I was directly involved in the decision making process for most of these breaking changes since many are in the W3C XML Schema area which I am responsible for. If any further clarifications about any of the breaking changes is needed, please post a comment with your question below.

Categories: Life in the B0rg Cube | XML

June 24, 2004

@ 04:15 PM

Comments [15]

What Would You Like To See in System.Xml in Orcas/Longhorn?

I'm in the end stages of doing the spec work for the various components in the System.Xml namespace I am responsible for in the Whidbey betas. After the 4th of July holidays we plan to start doing initial brain storming for what feature work we should do in Orcas/Longhorn. I thought it would be valuable to have various users of XML in the .NET Framework suggest what they'd like us to do in the Orcas version of System.Xml. What changes would people like to see? For example, I'm putting Schematron and XPathReader on the 'nice to have' list. No idea is too unconventional since this is the early brainstorming and prototyping phase.

Caveat: The fact that a technology is mentioned as being on our 'nice to have' list or is suggested in a comment to this post is not an indication that it will be implemented in future versions of the .NET Framework.

Categories: Life in the B0rg Cube | XML

June 19, 2004

@ 10:13 PM

Online Chat with Microsoft XML Team

The folks at MSDN Chats have organized an online chat session on C# and XML for next month. The participants on the Microsoft side should include myself, Mark Fussell, Erik Meijer, Neetu Rajpal and couple of folks from the C# team. If you'd like to talk to us about topics surrounding XML and C# then log on to the XML and C# chat session at the following time

July 8, 2004
1:00 - 2:00 P.M. Pacific time
4:00 - 5:00 P.M. Eastern time
20:00 - 21:00 GMT

Event Reminders
Add to Outlook Calendar
Tell a Friend

On a side note, am I the only one that thinks the MSDN Chats site is crying out for an RSS feed? I definitely would love to add it to the subscriptions list in my favorite news aggregator.

Categories: Life in the B0rg Cube | XML

June 18, 2004

@ 07:07 AM

XML 2004: I'll Be There

My submission on Designing Extensible & Version Resilient XML Formats has been accepted to XML 2004. It looks like I'm going to be in Washington D.C. this fall. Currently I'm in the process of writing an article about the topic of my talk which should show up on MSDN and XML.com in the next month or so. Afterwards I plan to submit a revised version of that article as the paper for my talk.

Categories: XML

June 15, 2004

@ 04:17 PM

Objects vs. XML in WinFS Land

The ongoing conversation between Jeremy Mazner and Jon Udell about the capabilities of WinFS deepen this morning with Jeremy's post Did I misunderstand Udell's argument against WinFS? which was followed up by Jon's post When a journalist blogs. In his post Jon asks

We have standard query languages (XPath, XQuery), and standard ways of writing schemas (XSD, Relax), and applications (Office 2003) that with herculean effort have been adapted to work with these query and schema languages, and free-text search further enhancing all this goodness. Strategically, why not build directly on top of these foundations?

Tactically, why do I want to write code like this:
public class Person
  {
  [XmlAttribute()] public string Title;
  [XmlAttribute()] public string FirstName;
  [XmlAttribute()] public string MiddleName;
  [XmlAttribute()] public string LastName;
  ....
in order to consume data like this?
<People>
  <Person
    DisplayName="Woodgrove Bank"
    IMAddress="Support@woodgrovebank.com"
    UserTile=".\user_tiles\Adventure Works.jpg">
    <EmailAddresses>
        <EmailAddress
            Type="Work"
            Address="mortgage@woodgrovebank.com"/>
        <EmailAddress
            Type="Primary"
            Address="Support@woodgrovebank.com"/>
   </EmailAddresses>
I believe two things to be true. First, we have some great XML-oriented data management technologies. Second, the ambitious goals of WinFS cannot be met solely with those technologies. I'm trying to spell out where the line is being drawn between interop and functionality, and why, and what that will mean for users, developers, and enterprises.

Jon asks several questions and I'll try to answer all the ones I can. The first question about why WinFS doesn't build on XML, XQuery and XSD instead of items, OPath and the WinFS schema language is something that the WinFS folks will have to answer. Of course, Jon could also ask why it doesn't build on RDF, RDQL [or any of the other RDF query languages] and RDF Schema which is a related question that naturally follows from the answer to Jon's question.

The second question is why would one want to program against a Person object when they have a <Person> element. This is question has an easy answer which unfortunately doesn't sit well with me. The fact is that developers prefer programming against objects than they do programming with XML APIs. No XML API in the .NET Framework (XmlReader, XPathNavigator, XmlDocument, etc) comes close to the ease of use of programming against strongly typed objects in the general case. Addressing this failing [and it is a failing] is directly my responsibility since I'm responsible for core XML APIs in the .NET Framework. Coincidentally, we just had a review with our new general manager yesterday and this same issue came up and he asked what we plan to do about this in future releases. I have some ideas. The main problem with using objects to program against XML is that although objects work well for programming against data-centric XML (rigidly structured tabular data such as an the data in an Excel spreadsheet, a database dump or serialized objects) there is a signficant impedance mismatch when trying to use strongly typed objects to program against document-centric XML (semi-structured data such as a Word document). However the primary scenarios the WinFS folks want to tackle are about rigidly structured data which works fine with using objects as the primary programming model.

Jon says that he is trying to draw the line between interop and functionality. I'm curious as to what he means by interop in this case. The fact that WinFS is based on items, OPath and WinFS schema doesn't mean that WinFS data cannot be exchanged in an interoperable manner (e.g. some form of XML export and import) nor does it mean that non-Microsoft applications cannot interact with WinFS. I should clarify that I have no idea what the WinFS folks consider their primary interop scenarios but I don't think the way WinFS is designed today means it cannot interoperate with other platforms or data models.

I suspect that Jon doesn't really mean interop when says so. I believe he is using the word the same way Java people use it where it really means 'One Language, One Programming Model, One Platform' everywhere instead of being able to communicate between disparate end points. In this case the language is XML and the platform is the XML family of technologies.

Categories: Life in the B0rg Cube | XML

June 15, 2004

@ 02:39 PM

Michael Rys Comments on Infoworld article "Databases flex their XML"

In a post entitled My comments on the Infoworld article "Databases flex their XML" Michael Rys writes

Sean McCown wrote this analysis (PDF version) in Apr 2004. In the article, he compares the XML capabilities of the 4 major relational database systems (comparing publicly available versions) both in terms of functionality, ease, flexibility and speed, and adds a sidebar on Yukon. Before I start giving my comments on the article, let me disclose that I talked to Sean during his research for the article and answered his questions on SQL Server 2000 and Yukon. Thus, some of the comments below are just my attempts to make Sean's translation of my answers clearer, because I was not answering his questions clear enough .

Michael then goes on to clarify various points around the terminology used in the article, XQuery and SQL Server. Both Sean's article and Michael's followup are excellent reading for anyone interested in the growing trend of XML-enabled relational databases and how the big 3 relational database vendors stack up.

Categories: XML

June 12, 2004

@ 05:49 PM

Office 2003 and XML: It's All About Your Data and Your Formats

In a recent post entitled 15 Science Street Tim Bray, one of the inventors of XML, writes

Microsoft’s main talking point (I’m guessing here from the public documents) was that their software and format had the advantage that in WordML you can edit documents from arbitrary schemas.

Our pushback on that was that editing arbitrary-schema documents is damn hard and damn expensive and has never been anything more than a niche business.

which seems not to jibe with my experiences. Many businesses have XML formats specific to their target industry (LegalXML, HR-XML, FpML, etc) and many businesses use office productivity suites to create and edit documents. It seems very logical to expect that people would like to use their existing spreadsheet and word processing applications to edit their business documents instead of using XMl editors or specialized tools. More interestingly Tim Bray contradicts his position that editing user-defined schemas is a niche scenario when he writes

As we were winding up, a couple of really smart people (don’t know who they were) put up their hands and asked real good questions. The best was essentially “What would you like to see happen?” After some back and forth, I ended up with “You should have the right to own your own information. It’s your intellectual capital and you worked hard to produce it for your citizens. Sun doesn’t own it, Microsoft doesn’t own it, you own it, and that means it should be living in a nice, long-lived, non-proprietary data format that isn’t anyone’s competitive weapon.”

He took the words right out of my mouth. This is exactly what Microsoft has done with Office 2003 by allowing users to edit documents in XML formats of their choosing. In the letter Bringing the XML Vision to the Desktop with Office 2003 written by Jean Paoli of Microsoft (also a co-inventor of XML) he writes

an even greater and more innovative benefit is the fact that companies can now create their own XML schemas specific to their business, define the structure and type of data that each data element in a document contains and exchange information with customers and business partners more easily. This capability opens up a whole new realm of possibilities, not only for end users, but also for the business itself because now organizations can capture and reuse critical information that in the past has been lost or gone unused.

Office 2003 is a great step forward in enabling businesses and end users harness the power of XML in typical document interchange scenarios. Arguments about whether you should use Sun's XML format or Microsoft's XML format aren't the point. The point is which tools allow you to use your XML format with the most ease.

Categories: XML

June 8, 2004

@ 09:46 AM

Applied XML DevCon: Call for Speakers

Chris Sells has announced the call for speakers for the Applied XML Developers Conference 5. From his post

Are you interested in presenting a 45-minute talk on some applied XML or Web Services topic? It doesn't matter which platform or OS you're targeting. It also doesn't matter whether you're an author or vendor or professional speaker or a developer in the trenches (in fact, I tend to be biased towards the latter). We're after interesting and unique applications of XML and Web Services technology and if you're doing good work in that area, then I need you to send me a session topic and 2-4 sentence abstract along with a little bit about yourself. I'll be taking submissions 'til the end of June, but don't delay...

...the conference itself is likely to be in Oregon during the 2nd or 3rd week of September, 2004, but we're still working the details out. One of the fun things that we're thinking about this year is to have the Dev.Conf. in Sunriver, Oregon, a resort and spa town in central Oregon where sun is plentiful and rain is scarce.

Previous XML DevCons have had a wide variety of interesting speakers. Unfortunately, the XML DevCon webpage doesn't provide any information on previous conferences. If you are interested in reports on last year's conference just type "XML DevCon" in your favorite Web search engine to locate blog postings from some of the attendees.

I probably won't be at this conference since the focus is usually XML Web Services while my professional interests are in core XML technologies with working with XML syndication formats being a hobby. However there should be lots of interesting presentations on XML Web Services and other leading edge applications of XML from industry experts if last year's conference is anything to go by.

Categories: XML

June 8, 2004

@ 09:22 AM

Jon Udell and WinFS

Jon Udell has started a series of blog posts about the pillars of Longhorn. So far he has written Questions about Longhorn, part 1: WinFS and Questions about Longhorn, part 2: WinFS and semantics which ask the key question "If the software industry and significant parts of Microsoft such as Office and Indigo have decided on XML as the data interchange format, why is the next generation file system for Windows basically an object oriented database instead of an XML-centric database?"

I'd be very interested in what the WinFS folks like Mike Deem would say in response to Jon if they read his blog. Personally, I worry less about how well WinFS supports XML and more about whether it will be fast, secure and failure resistant. After all, at worst WinFS will support XML as well as a regular file system does today which is good enough for me to locate and query documents with my favorite XML query language today. On the other hand, if WinFS doesn't perform well or shows the same good-idea-but-poorly-implemented nature of the Windows registry then it'll be a non-starter or much worse a widely used but often cursed aspect of Windows development (just like the Windows registry).

As Jon Udell points out the core scenarios touted for the encouraging the creation of WinFS (i.e search and adding metadata to files) don't really need a solution as complex or as intrusive to the operating system as WinFS. The only justification for something as radical and complex as WinFS is if Windows application developers end up utilizing it to meet their needs. However as an application developer on the Windows platform I primarily worry about three major aspects of WinFS. The first is performance, I definitely think having a query language over an optimized store in the file system is all good but wouldn't use it if the performance wasn't up to snuff. Secondly I worry about security, Longhorn evangelists like talking up what a wonderful world it would be if all my apps could share their data but ignore the fact that in reality this can lead to disasters. Having multiple applications share the same data store where one badly written application can corrupt the entire store is worrisome. This is the fundamental problem with the Windows registry and to a lesser extent the cause of DLL hell in Windows. The third thing I worry about is that the programming model will suck. An easy to use programming model often trumps almost any problem. Developers prefer building distributed applications using XML Web Services in .NET to the alternatives even though in some cases this choice leads to lower performance. The same developers would rather store information in the registry than come up with a robust alternative on their own because the programming model for the registry is fairly straightforward.

All things said, I think WinFS is an interesting idea. I'm still not sure it is a good idea but it is definitely interesting. Then again given that WinFS assimilated and thus delayed a very good idea from shipping, I may just be a biased SOB.

PS: I just saw that Jeremy Mazner posted a followup to Jon Udell's post entitled Jon Udell questions the value and direction of WinFS where he wrote

XML formats with well-defined, licensed schemas, are certainly a great step towards a world of open data interchange. But XML files alone don't make it easier for users to find, relate and act on their information. Jon's contention is that full text search over XML files is good enough, but is it really? I did a series of blog entries on WinFS scenarios back in February, and I don't think's Jon full text search approach would really enable these things.

Jeremy mostly misses Jon's point which is aptly reduced to a single question at the beginning of this post. Jon isn't comparing full text search over random XML files on your file system to WinFS. He is asking why couldn't WinFS be based on XML instead of being an object oriented database.

Categories: Technology | XML

June 6, 2004

@ 04:18 AM

RDF as a More Extensible XML

One of my friends, Joshua Allen, is a fan of RDF and Semantic Web technologies. Given that I respect his opinion a lot I keep trying to delve into RDF and its family of technologies every couple of months to see what it provides to the world of data access and information interchange above and beyond existing technologies. Recently I discovered that there are some in the RDF camp that position it as a "better XML". The first example of this I saw was an old article by Tim Berners-Lee entitled Why RDF model is different from the XML model. According to Tim the note is an attempt to answer the question, "Why should I use RDF - why not just XML?". However instead of answering the question his note just left me with more questions than answers. The pivotal point for me in Tim Berners-Lee's note is the following excerpt

Things you can do with RDF which you can't do with XML include

You can parse the semantic tree, which end up giving you a set of (possibly mutually referential) triples and then you can use the ones you want ignoring the ones you don't understand.

Problems with basing you understanding on the structure include

Without having gone to the trouble of getting the schema, or having an application hand-programmed to recognise a particular document type, you can't pick up any semantic information from a document;
When an XML schema changes, it could typically introduce new intermediate elements (like "details" in the tree above or "div" is HTML). These may or may or may not invalidate any query which has been based on the structure of the document.
If you haven't gone to the trouble of making a semantic model, then you may not have a well defined one.

It seems that the point being argued is that with RDF you can get more understanding of the information in the document than with just XML. Being that one could consider RDF as just a logical model layered on top of an XML document (e.g. RDF/XML) I find it hard to understand how viewing some XML document through RDF colored glasses buys one so much more understanding of the data.

Recently I discovered a presentation entitled REST, Self-description, and XML by Mark Baker. This presentation discusses the ideas in Tim Berners-Lee's note in more depth and in a way I finally understand. The first key idea in Mark's presentation is the notion of "self describing" data formats which were also covered in Tim Berners-Lee's presentation at WWW2002 entitled Specs Count. The core tennets of "self describing" data formats are covered in slide 10 and slide 11 of Mark's presentation. A "self describing" data formats contains all the data needed to figure out how to process the format from publically accessible specs. For example, an HTTP response tells you the MIME type of the document which can be used to locate the appropriate RFC which governs how the format should be processed. In the case of XML, Tim Berners-Lee states that an HTTP response which returns an XML document either as application\xml or text\xml should be processed according to the rules of the XML and XML namespaces recommendations which state that the identity of an element is determined based on its namespace name. So when processing an XML document, Tim asserts that it is self describing because one can locate the spec for the format from the namespace URI of the root element. Of course, Mark disagrees with this but his reasons for doing so is pedantic spec lawyering. I disagree with it as well but for different reasons. The main reason I disagree with it is because it puts a stake in the ground and says that any XML format on the Web that doesn't use namespace name for its root element or whose namespace name is not a dereferenceable URI that leads to a spec is broken. This automatically states that XML formats used on the Web today such as RSS 1.0, RSS 2.0, OPML and the Atom 0.3 syndication format are broken.

Mark then goes on to state in slide 20 that a problem with XML formats is that one can't arbitrarily extend an XML document without it's schema or without breaking some application somewhere. It's unclear as to what he means by the document's schema but will grant that it is likely that arbitrary additions to the expected content of an XML document will break certain applications. Getting to slide 24, it is slightly clearer what Mark is getting at. He claims that one although one can add extend a format by adding extra elements from a known namespace using just XML technologies this doesn't tell you how to deal with the extensions. On the other hand, with RDF the extensions are all concepts named with a URI whose meaning can then be looked up using HTTP GET. This is where he lost me. I don't see the difference between seeing a namespaced XML element in an XML format and using HTTP GET on the namespace URI of the element to locate the spec or schema for the namespaced extension and what he describes as the gains of using RDF.

The more I look at how RDF people bag on XML the more it seems that they don't really write applications in today's world. Almost every situation I've seen someone claim that RDF technologies will in the future be able to solve a problem XML cannot, the problem is actually not only solveable with XML technologies but actually is being solved using XML technologies today.

Categories: XML

June 4, 2004

@ 07:47 AM

XML Web Services != Distributed Objects 3: Sometimes U Have To Explain

As I mentioned yesterday Doug Purdy posted an insightful entry in response to Ted Neward's about the inappropriateness of returning ADO.NET DataSets from XML Web Services. Today Ted Neward has a post entitled Why Purchase Orders are the root of all evil? which almost entirely misses the point of Doug's post.

Ted writes

Could you tell me what the schema should be? Doug, it's right there in front of you: the class definition itself is a schema definition for objects of its type. The question I think you mean to ask is, "What the XML schema should be for this Purchase Order?", but I can't do that, because you've already stepped way out into la-la land as far as XML/XSD goes by making use of generic types (like Dictionary) for which there is no XSD equivalent; sure, we can rpc-encode one up, but we're back to turning objects into XML and back again, and I thought we didn't like that....?

Could you tell me what each particle of the schema means? Well, the LineItemAddedEvent certainly isn't a schema construct, so I'm guessing that'll have to be the XML-based representation of a .NET delegate.... the IAddress has no implementation behind it that I can see so once again I'll have to punt....

Oh, I get it... Doug's using one of them anti-pattern thingies to show us what not to do when trying to define types in XML/XSD for use in Web services (or WebServices or web services or however we've decided to spell these silly things anyway).

You're absolutely right, Doug--the way that thing is written, Purchase Orders, while perhaps not the root of ALL evil, are certainly evil and therefore should be banned from the WS-* camp immediately.

Seriously, dude, DataSets as return values from Web services are evil. Get over it.

What I find interesting is that Ted Neward is looking at XML Web Services through the perspective of distributed objects. His entire arguments hinge around the fact that his applications convert XML into Java or CLR objects so the XML returned must be something that is condusive to converting to objects easily. Doug accurately points out that there is no one-to-one mapping between an XML schema and a CLR object. Arguing that your favorite platform has one-to-one mappings for some XML schemas and not others thus banning various XML formats from participating in XML Web Services is a very limiting viewpoint. I'd like to ask Ted whether he also would ban XBRL, wordProcessingML or UBL documents from being used in XML Web Services because there aren't easy ways to convert them to a handy, dandy Java object with strongly typed members and all that jazz.

I don't dispute the practical reasons for discouraging developers from returning ADO.NET DataSets from XML Web Services since most developers trying to access the XML Web Services just use a toolkit that pretends you are building distributed object applications. Usually such toolkits either barf horribly when faced with XML they don't grok or force developers to have deal with scary angle brackets directly instead of the object facade they know & love (ASP.NET XML Web Services included). This is a practical reason to avoid exposing ADO.NET DataSets from XML Web Services that may be accessed from Java platforms especially since such platforms don't make it easy to deal with raw XML.

On the other hand, claiming that there is some philosophical reason not to expose data from an XML Web Service that may be be semi-structured and full of unknown data (i.e XML data) seems quite antithetical to the entire point of XML Web Services and the Service Oriented Architecture fad.

Categories: XML

June 3, 2004

@ 10:25 AM

XML Web Services != Distributed Objects (part 2)

It seems every few months there are a series of blog posts or articles about why returning ADO.NET DataSet objects from XML Web Services. I saw the most recent incarnation of this perma-debate in Scott Hansellman's post Returning DataSets from WebServices is the Spawn of Satan and Represents All That Is Truly Evil in the World and Ted Neward's More on why DataSets are the Root of all Evil.

I was going to type up a response to both posts until I saw Doug Purdy's amusing response, PurchaseOrders are the root of all evil, which succintly points out the flaws in Scott and Ted's arguments.

Now I'm off to bed.

Categories: Mindless Link Propagation | XML

June 1, 2004

@ 01:12 AM

Tim Bray Puzzled by SOA vs. XML Web Services

I just read Tim Bray's entry entitled SOA Talk where he mentions listening to Steve Gillmor, Doc Searls, Jon Udell, Dana Gardner, and Dan Farber talk about SOA via “The Gillmor Gang” at ITConversations. I tried to listen to the radio show a few days ago but had the same problems Tim had. A transcript would definitely be appreciated.

What I found interesting is this excerpt from Tim Bray's blog post

Apparently a recent large-scale survey of professionals revealed that “SOA” has positive buzz and high perceived relevance, while “Web Services” scores very low. Huh?

This is very unsurprising to me. Regular readers of my blog may remember I wrote about the rise of the Service Oriented Architecture fad a few months ago. Based on various conversations with different people involved with XML Web Services and SOA I tend to think my initial observations in that post were accurate. Specifically I wrote

The way I see it the phrase "XML Web Services" already had the baggage of WSDL, SOAP, UDDI, et al so there a new buzzphrase was needed that highlighted the useful aspects of "XML Web Services" but didn't tie people to one implementation of these ideas but also adopted the stance that approaches such as CORBA or REST make sense as well.

Of the three words in the phrase "XML Web Services" the first two are implementation specific and not in a good way. XML is good thing primarily because it is supported by lots of platforms and lots of vendors not because of any inherrent suitability of the technology for a number of the tasks people utilize it for. However in situations where this interop is not really necessary then XML is not really a good idea. In the past, various distributed computing afficionados have tried to get around this by talking up the The InfoSet which was just a nice way of deprecating the notion of usage of the XML text format everywhere being a good thing. The second word in the phrase is similarly inapllicable in the general case. Most of the people interested in XML Web Services are interested in distributed computing which traditionally and currently is more about the intranet than it is about the internet. The need to justify the Web-like nature of XML Web Services when in truth these technologies probably aren't going to be embraced on the Web in a big way seems to have been a sore point of many discussions in distributed computing circles.

Another reason I see for XML Web Services having negative buzz versus SOA is that when many people think of XML Web Services, they think of overhyped technologies that never delivered such as Microsoft's Hailstorm. On the other hand, SOA is about applying the experiences of 2 decades of building distributed applications to building such applications today and in the future. Of course, there are folks at Microsoft who are wary of being burned by the hype bandwagon and there've already been some moves by some of the thought leadership to distance what Microsoft is doing from the SOA hype. One example of this is the observation that lots of the Indigo folks now talk about 'Service Orientation' instead of 'Service Oriented Architecture'.

Disclaimer: The above comments do not represent the thoughts, intentions, plans or strategies of my employer. They are solely my opinion.

Categories: Technology | XML

May 28, 2004

@ 06:52 PM

XML, the New Database Heresy

C.J. Date, one of the most influential names in the relational database world, had some harsh words about XML's encroachment into the world of relational databases in a recent article entitled Date defends relational model that appeared on SearchDatabases.com. Key parts of the article are excerpted below

Date reserved his harshest criticism for the competition, namely object-oriented and XML-based DBMSs. Calling them "the latest fashions in the computer world," Date said he rejects the argument that relational DBMSs are yesterday's news. Fans of object-oriented database systems "see flaws in the relational model because they don't fully understand it," he said.

Date also said that XML enthusiasts have gone overboard.

"XML was invented to solve the problem of data interchange, but having solved that, they now want to take over the world," he said. "With XML, it's like we forget what we are supposed to be doing, and focus instead on how to do it."

Craig S. Mullins, the director of technology planning at BMC Software and a SearchDatabase.com expert, shares Date's opinion of XML. It can be worthwhile, Mullins said, as long as XML is only used as a method of taking data and putting it into a DBMS. But Mullins cautioned that XML data that is stored in relational DBMSs as whole documents will be useless if the data needs to be queried, and he stressed Date's point that XML is not a real data model.

Craig Mullins points are more straightforward to answer since his comments don't jibe with the current state of the art in the XML world. He states that you can't query XML documents stored in databases but this is untrue. Almost three years ago, I was writing articles about querying XML documents stored in relational databases. Storing XML in a relational database doesn't mean it has to be stored in as an opaque binary BLOB or as a big, bunch of text which cannot effectively be queried. The next version of SQL Server will have extensive capabilities for querying XML data in relational database and doing joins across relational and XML data, a lot of this functionality is described in the article on XML Support in SQL Server 2005. As for XML not having a data model, I beg to differ. There is a data model for XML that many applications and people adhere to, often without realizing that they are doing so. This data model is the XPath 1.0 data model, which is being updated to handled typed data as the XQuery and XPath 2.0 data model.

Now to tackle the meat of C.J. Date's criticisms which is that XML solves the problem of data interchange but now is showing up in the database. The thing first point I'd like point out is that there are two broad usage patterns of XML, it is used to represent both rigidly structured tabular data (e.g., relational data or serialized objects) and semi-structured data (e.g., office documents). The latter type of data will only grow now that office productivity software like Microsoft Office have enabled users to save their documents as XML instead of proprietary binary formats. In many cases, these documents cannot simply shredded into relational tables. Sure you can shred an Excel spreadsheet written in spreadsheetML into relational tables but is the same really feasible for a Word document written in WordprocessingML? Many enterprises would rather have their important business data being stored and queried from a unified location instead of the current situation where some data is in document management systems, some hangs around as random files in people's folders while some sits in a database management system.

As for stating that critics of the relational model don't understand it, I disagree. One of the major benefits of using XML in relational databases is that it is a lot easier to deal with fluid schemas or data with sparse entries with XML. When the shape of the data tends to change or is not fixed the relational model is simply not designed to deal with this. Constantly changing your database schema is simply not feasible and there is no easy way to provide the extensibility of XML where one can say "after the X element, any element from any namespace can appear". How would one describe the capacity to store “any data” in a traditional relational database without resorting to an opaque blob?

I do tend to agree that some people are going overboard and trying to model their data hierarchically instead of relationally which experience has thought us is a bad idea. Recently on the XML-DEV mailing list entitled Designing XML to Support Information Evolution where Roger L. Costello described his travails trying to model his data which was being transferred as XML in a hierarchical manner. Micheal Champion accurately described the process Roger Costello went through as having "rediscovered the relational model". In a response to that thread I wrote "Hierarchical databases failed for a reason".

Using hierarchy as a primary way to model data is bad for at least the following reasons

Hierarchies tend to encourage redundancy. Imagine I have a <Customer> element who has one or more <ShippingAddress> elements as children as well as one or more <Order> elements as children as well. Each order was shipped to an address, so if modelled hierarchically each <Order> element also will have a <ShippingAddress> element which leads to a lot of unnecessary duplication of data.
In the real world, there are often multiple groups to which a piece of data belongs which often cannot be modelled with a single hierarchy.
Data is too tightly coupled. If I delete a <Customer> element, this means I've automatically deleted his entire order history since all the <Order> elements are children of <Customer>. Similarly if I query for a <Customer>, I end up getting all the <Order> information as well.

To put it simply, experience has taught the software world that the relational model is a better way to model data than the hierarchical model. Unfortunately, in the rush to embrace XML many a repreating the mistakes from decades ago in the new millenium.

Categories: XML

May 28, 2004

@ 04:35 PM

Document-centric.NET Article on XML.com

XML.com recently ran an article entitled Document-Centric .NET, that highlights the various technologies for working with XML that exist in the .NET Framework. The article provides a good high level overview of the various options you have for processing XML in the .NET Framework. The article includes an all important caveat which I wish more people knew about and which I keep wanting to write an article about but never get around to doing. The author writes

However, keep in mind that there are W3C XML Schema features that are not directly compatible with .NET's XML-to-database and XML-to-object mapping tools.

This is very true. Besides our schema validation technologies, most Microsoft technologies or products that utilize W3C XML Schema support a subset of the language due to impedance mismatches between the language and the underlying data model or type system of the target environment.

In fact the only complaint I have about the article is a nitpick about its title. In XML circles, document-centric implies a usage of XML that isn't borne out by his article. If you are interested in the difference between data-centric XML and document-centric XML you should read my article Can One Size Fit All? in XML Journal. In that article I talk about the differences between XML that is used to represent both rigidly structured tabular data (e.g., relational data or serialized objects) and semi-structured data (e.g., office documents). The former is data-centric XML while the latter is document-centric.

Categories: Mindless Link Propagation | XML

May 26, 2004

@ 06:11 PM

The RSS enclosure element and the Dangers of Overspecification

I recently stumbled on an entry by Lucas Gonze where he complains about the RSS <enclosure> element. He writes

Problems with the enclosure element:

It causes users to download big files that they will never listen to or watch, creating pointless overload on web hosts.
It doesn't allow us to credit the MP3 host, so we can't satisfy the netiquette of always linking back.
For broadband users, MP3s are not big enough to need advance caching in the first place.
The required content-type attribute is a bad idea in the first place. Mime settings are already prone to breakage, adding an intermediary will just create another source of bugs. There are no usecases for this attribute that can't be more easily and robustly satisfied by having clients HEAD the URL for themselves.
The required content-length attribute should not be there. It requires people who link to MP3s to HEAD them and calculate the length, which is sometimes not practical. It makes variable-length MP3s illegal. There are no usecases for this attribute that can't be more easily and robustly satisfied by having clients HEAD the URL for themselves.

The primary problem with the <enclosure> element is that it is overspecified. Having an element that says, here is a pointer to some data that is related to this entry that is too large to fit in the feed is a good idea. Similarly providing a hint at what the MIME type is so the reader knows whether it can handle that MIME type or can display something specific to that media type in the user interface without making an additional request to the server is very useful. The description of the enclosure element in RSS 2.0 states

<enclosure> sub-element of <item>

<enclosure> is an optional sub-element of <item>.

It has three required attributes. url says where the enclosure is located, length says how big it is in bytes, and type says what its type is, a standard MIME type.

The url must be an http url.

<enclosure url="http://www.scripting.com/mp3s/weatherReportSuite.mp3" length="12216320" type="audio/mpeg" />

Syndication geeks might notice that this is akin to the <link> element in the ATOM 0.3 syndication format which is described as

3.4 Link Constructs

A Link construct is an element that MUST NOT have any child content, and has the following attributes:

3.4.1 "rel" Attribute

The "rel" attribute indicates the type of relationship that the link represents. Link constructs MUST have a rel attribute, whose value MUST be a string, and MUST be one of the values enumerated in the Atom API specification <eref>http://bitworking.org/projects/atom/draft-gregorio-09.html</eref>.

3.4.2 "type" Attribute

The "type" attribute indicates an advisory media type; it MAY be used as a hint to determine the type of the representation which should be returned when the URI in the href attribute is dereferenced. Note that the type attribute does not override the actual media type returned with the representation.

Link constructs MUST have a type attribute, whose value MUST be a registered media type [RFC2045].

3.4.3 "href" Attribute

The "href" attribute contains the link's URI. Link constructs MUST have a href attribute, whose value MUST be a URI [RFC2396].

xml:base [W3C.REC-xmlbase-20010627] processing MUST be applied to the atom:url element.

3.4.4 "title" Attribute

The "title" attribute conveys human-readable information about the link. Link constructs MAY have a title attribute, whose value MUST be a string.

So the ideas behind the <enclosure> element were good enough that they appear in ATOM with some additional niceties and a troublesome bit (the length attribute) removed. So if the concepts behid the <enclosure> element are so good that they are first class members of the ATOM syndication format. Why does Lucas not like it? The big problem with RSS enclosures is how Dave Winer expected them to be used. An aggregator was supposed to act like a TiVo, automatically downloading files in the background and presenting them to you when it's done. The glaring problem with doing this is that it means lots of people are automatically downloading large files that they didn't request which is a significant waste of bandwidth. In fact, most aggregators either do not support enclosures or simply show them as links which is what FeedDemon and RSS Bandit (with the Outlook 2K3 skin) do. The funny thing is that the actual RSS specification doesn't describe this behavior, instead this behavior is implied by Dave Winer's descriptions of use cases.

Lucas also complains about the required length attribute which is problematic if you are pointing to a file on a server you don't own because you have to first download the file or perform a HTTP HEAD to get its size. The average blogger isn't going to go through that kind of trouble. Although tools could help it makes sense for the length attribute to have been an optional hint.

I have to disagree with Lucas's complaints about putting the MIME type in the <enclosure> element. He complains that the MIME type in the <enclosure> could be wrong and in fact that in many cases web servers serve a file with the wrong MIME type. Thus he concludes that putting the MIME type in the enclosure is wrong. Client software should be able, to decide how to react to the enclosure [e.g. if it is audio/mpeg display a play button] without having to make additional HTTP requests especially since as Lucas points out it is not a 100% guaranteed that performing an HTTP HEAD of the linked file will actually get you the correct MIME type from the web server.

In conclusion, I agree that the <enclosure> element is problematic but most of the problems are due to the implied use case suggested by the spec author, Dave Winer, as opposed to the actual information provided by the element. The ATOM approach of describing the information provided by each element in a feed but not explicitly describing the expected behavior of clients is a welcome approach. Of course, there will always be developers who require structure or take an absence of explicit guidelines to mean do stupid things (like aggregators that fetch your feed every 5 minutes) but these are probably better handled in "Best Practices" style documents or test suites than in the actual specification.

Categories: XML

May 26, 2004

@ 05:22 PM

Versioning is Hard

One of the hardest problems in software development is how to version software and data formats. One of the biggest problems for Windows for years has been DLL Hell which is a versioning problem. One of the big issues I have to deal with at work is how to deal with versioning issues when adding or removing functionality from classes.

For a few weeks, I've been planning to write up some guidelines and concerns for versioning XML formats based on my experiences and those of others at Microsoft. I've got some folks on the XML Web Services team interested in riding shotgun such as Gudge and Doug. It also looks like Edd Dumbill is interested in the abstract for the article, so it with any luck it should end up on XML.com when it is done.

I was reminded of the importance of writing this article when I saw a post on the atom-syntax list by Google's Steve Jensen which implied that it just occured to the folks at Google that they'd have to support multiple versions of ATOM. This is excarberated by the fact that they are deploying implementations based on draft specs. Like I said before, never ascribe to malice that which can be explained by incompetence

Categories: XML

May 25, 2004

@ 06:16 PM

XML 1.1 Continues to be a Bitter Pill

I've mentioned in the past why I think XML 1.1 was a bad idea in my post XML 1.1: The W3C Gets It Wrong. It seems at least one W3C working group, the XML Protocols working group to be exact, has now realized why XML 1.1 is a bad idea a few months later. Mark Nottingham recently posted a message to the W3C Technical Architecture Group's mailing list entitled Deployment and Support of XML 1.1 where he writes

In the Working Group's view, this highlights a growing misalignment in
the XML architecture. Until the advent of XML 1.1, XML 1.0 was a single
point of constraint in the XML stack, with all of the benefits (e.g.,
interoperability, simplicity) that implies. Because XML 1.1 has
introduced variability where before there was only one choice, other
standards now need to explicitly identify what versions of XML they are
compatible with. This may lead to a chicken-and-egg problem; until
there is a complete stack of XML 1.1-capable standards available, it is
problematic to use it.

Furthermore, XML-based applications will likewise need to identify
their capabilities and constraints; unfortunately, there is no
consistent way to do this in the Web architecture (e.g., RFC3023 does
not provide a means of specifying XML versions in media types).

As I mentioned in my previous post about the topic, XML 1.1 hurts the interoperability story of XML which is one of the major reasons of using it in the first place. Unfortunately, the cat is already out of the bag, all we can do now is try to contain or avoid it without getting our eyes clawed out. I tend to agree with my coworker Michael Rys, the day XML 1.1 became a W3C recommendation was a day of mourning.

Categories: XML

May 25, 2004

@ 04:37 PM

XML in SQL Server 2005

The next version of SQL Server will have a significant amount of functionality related to storing, querying and extracting XML from the database. To accompany the information being imparted at TechEd 2004, myself and rest of the folks behind the XML Developer Center on MSDN decided to run a series of articles on the XML features of SQL Server 2005. The articles will run through the month of June.

The first article in the series is XML Support in Microsoft SQL Server 2005. Read this article if you are interested in learning how SQL Server 2005 has become an fully XML-aware database including the addition of the XML datatype, support for XML Schemas, indexing of XML data, XQuery, querying XML views of relational data using XPath and much more.

Categories: XML

May 22, 2004

@ 06:02 PM

RSS, Atom and Microsoft

Joshua Allen has a post entitled RSS Politics which does a good job of properly framing the growing Microsoft and RSS vs. Google and Atom silliness spurred by Joi Ito that I've been seeing in the comments on Robert Scoble's weblog. Joshua writes

First, be very clear. The “debate“ over Atom vs. RSS is a complete non-issue for Microsoft. We use RSS to serve thousands of customers right now, and most of the people setting up RSS feeds have never heard of the political “debates“. RSS works for them, and that's all they care about. On the other hand, if Atom ever reaches v1.0 and we had a business incentive to use it, we would use it. No need for debate.

Now, of the three or four people at Microsoft who know enough about Atom to have said anything about it, I wouldn't say that anyone has trashed the format. I and others have pointed out that it's just fine for what it does; just like RSS. If anything, I have asked hard questions about why I or any business decision maker should be spending resources on the whole debate right now. If a business has deployed using RSS, what financial motive would they have to switch to a new, nearly identical, format once it ships? I've got nothing against the Atom people inventing new syndication formats, but I just don't see why *I* should be involved right now. There's no good reason.

The other comment I've made before is that the Atom community is not being served by the polarizing attitudes of some participants. The “us vs. them“ comments are not helpful, especially when untrue, and the constant personalization (”Support Atom because I hate Dave Winer!”) just damages the credibility of the whole group (many of whom might have good motives for being involved).

I totally echo his sentiments. In the past couple of months more and more folks at Microsoft have pinged me about syndication and blogging technologies once they learn I wrote RSS Bandit. Every single time I've given them the same advice I gave in my post, Mr. Safe's Guide to the RSS vs. ATOM debate. If you are a feed consumer you'll need to support the various flavors of RSS and the various flavors of ATOM (of which there'll at least be two, ATOM 0.3 and whatever is produced from the IETF/W3C process). If you are a feed producer, you should stick with RSS 0.91/2.0 since this is the widest supported format and the most straightforward.

Although no one has asked yet, I'm also going to give my advice on Mr. Safe at Microsoft should consider adopting the ATOM API. In my personal opinion, the current draft of the ATOM API seems better designed and falls more inline with Microsoft's technologies than the existing alternatives (Blogger API/MetaWeblog API/LiveJournal API), etc. However the API lacks lots of functionality and in fact already there are extensions to the ATOM API showing up in the wild. Currently, these "innovations" are being lauded but given the personalities behind ATOM it is likely that if Microsoft products supported the API and extended it there could be a negative backlash. In which case perhaps going with a product specific API may be the best option if there is sensitivity to such feedback or the ATOM API has to be significantly extended to fit the product's needs.

Categories: Life in the B0rg Cube | XML

May 22, 2004

@ 05:17 PM

Another XML Geek Questions the Value of the Semantic Web

I've posted a few entries in the past questioning the value of the Semantic Web as currently envisioned by the W3C along with its associated technologies like RDF and OWL. My most recent post about this was On Semantic Integration and XML. It seems I'm not the only XML geek who's been asking the same questions after taking a look at the Semantic Web landscape. Elliotte Rusty Harrold is at WWW2004 and wrote the following opinions of the Semantic Web on Day 4 of WWW2004

This conference is making me think a lot about the semantic web. I'm certainly learning more about the details (RDF, OWL etc.). However, I still don't see the point. For instance what does RDF bring to the party? The basic idea of RDF is that a collection of URIs forms a vocabulary. Different organizations and people define different vocabularies, and the URIs sort out whose name, date, title, etc. property you're using at any given time. Remind you of anything? It reminds me a lot of XML + namespaces. What exactly does RDF bring to the party? OWL (if I understand it) lets you connect different vocabularies. But so does XSLT. I guess the RDF model is a little simpler. It's all just triples, that can be automatically combined with other triples, and thereby inferences can be drawn. Does this actually produce anything useful, though? I don't see the killer app. Theoretically a lot of people are talking about combining RDF and ontologies from mulktiple sources too find knowledge that isn't obvious from any one source. However, no one's actually publishing their RDF. They're all transforming to HTML and publishing that.

I've written variations of the same theme over the past couple of months. It's just hard to point at any practical value that RDF/OWL/etc provide over XML/XSLT/etc for semantic integration.

Categories: XML

May 21, 2004

@ 06:56 PM

Query and Transformation: 2 Sides of the Same Coin?

I've been reading the various pieces of feedback on my recent blog post on Why You Won't See XSLT 2.0 or XPath 2.0 in the Next Version of the .NET Framework including the 40 comments in response to the post and the "Microsoft is killing XSLT" thread on xsl-list. Most of it has been flames witrh little useful feedback but there was an interesting response by Norm Walsh entitled XQuery 1.0 or XSLT 2.0? which I've been drawn to respond to. Norm writes

Dare Obasanjo argues that “XQuery is strongly and statically typed while XPath 2.0 is weakly and dynamically typed.” What’s not clear from his post is that he is comparing XQuery 1.0 to XPath 2.0 in backwards compatibility mode (Michael Rys did provide a clarification). That’s an odd comparison to make. XPath 2.0 needs a backwards compatibility mode so that it stands some chance of doing the right thing when used in the context of an XSLT 1.0 stylesheet, but that’s not the expected mode for long-term use.

I thought my point was self evident here but if Norm missed it then it means most of the people who read my original blog post did as well. XPath 2.0 is a subset of XQuery 1.0, the parts of XQuery missing are XML construction, the query prolog, the let-where-orderby parts of the FLWOR expression, typeswitch and a few other things. XPath 2.0 has a backwards compatibility mode which has different semantics from regular XPath 2.0 and XQuery. When I talked about Microsoft not implementing XPath 2.0 I meant XPath 2.0 in backwards compatibility mode since implementing XQuery means you already have regular XPath 2.0. After all, everything you can do in XPath 2.0 you can do in XQuery.

Norm also writes

The funniest arguments are the ones that imply that XQuery is a competitor in the same problem space as XSLT, that users will use XQuery instead of XSLT. I say that’s funny because there are so many problems that you simply cannot solve with XQuery. If your data is regular and especially if it’s all stored in a database already so that your XQuery implementation can run really fast, then XQuery absolutely makes sense, but didn’t the database folks already have a query language? Nevermind. If your customers don’t need to solve the kinds of problems for which XSLT was designed, or if you want to sell them some sort of proprietary system to solve them, then implementing XSLT 2.0 probably doesn’t make sense.

I've seen variations of the above theme (XSLT is for transformation, XQuery is for query) in various responses to my original post. Taking away the words query and transformation out of the picture both XQuery and XSLT are designed to reshape XML data. SQL is primarily a query language but you can use it to reshape relational data, this is exactly how SQL views work. For most people, the transformations they want to perform using XSLT also be expressed using XQuery. Per Bothner wrote an article over a year ago on XML.com about Generating XML and HTML using XQuery showing how you could use XQuery to transform an XML document to another XML format or HTML. There are a few niceties in XSLT 2.0 that don't exist in XQuery such as the ability to write to multiple output streams but in general most of the things you can do in XSLT 2.0 can also be done in XQuery. In fact this leads me to something else Norm wrote

If you want to transform documents that aren’t regular, especially documents that have a lot of mixed content, XSLT is clearly the right answer. I’ll wager dinner at your favorite restaurant that XQuery cannot be used to implement the functionality of the DocBook XSLT Stylesheets. (You produce the XQuery that does the job, I buy you dinner.)

First of all XSLT is actually very bad at dealing with XML that isn't regular and has lots of mixed content. This is why a number of XSLT gurus got together to created EXSLT and why I started the EXSLT.NET project (grab the latest version from the Microsoft.com download servers here). As for transforming DocBook with XQuery, as I mentioned before Per Bothner wrote an article about using XQuery for transformations. In fact, he specifically writes about Transforming DocBook to HTML using XQuery.

The bottom line is that XQuery is as much a "transformation language" as XSLT. XSLT may have some functionality that XQuery does not have but there isn't much I've seen that couldn't be implemented using extension functions. Perhaps I should start an EXQuery.NET project? :)

Categories: XML

May 21, 2004

@ 06:38 AM

What's New in System.Xml for .NET Compact Framework v2.0

A few months ago Mark Fussell wrote an article entitled What's New in System.Xml for Visual Studio 2005 and the .NET Framework 2.0 Release. Mark Ihimoyan has a followup series of blog posts which mentions which of the new features of System.Xml mentioned in Mark's article will actually be in the .NET Compact Framework. The blog posts are listed below

Categories: XML

May 18, 2004

@ 12:13 PM

Improving XML Performance in .NET Framework Applications

The Microsoft Pattern and Practices folks have produced an excellent guide to Improving .NET Application Performance and Scalability with a chapter on Improving XML Performance. If you build .NET Framework applications that utilize XML then you owe it to yourself to take a look at the guidelines in that document. There is also a handy, easily printable XML Performance checklist which can be used as a quick way to check that your application is doing the right thing with regards to getting the best performance for XML applications.

On a similar note, Mark Fussell has posted XmlNameTable: The Shiftstick of System.Xml and XmlNameTable Revisited which provide some tips about how to use the XmlNameTable class to improve processing speed by up to 10% when processing XML documents.

Categories: XML

May 15, 2004

@ 05:35 PM

Binary Formats, XML and Performance

Charles Cook has a blog posting on XML and Performance where he writes

XML-based Web Services look great in theory but I had one nagging thought last week while on the WSA course: what about performance? From my experience with VoiceXML over the last year it is obvious that processing XML can soak up a lot of CPU and I was therefore interested to see this blog post by Jon Udell in which he describes how Groove had problems with XML:

Sayonara, top-to-bottom XML I don't believe that I pay a performance penalty for using XML, and depending on how you use XML, you may not believe that you do either. But don't tell that to Jack Ozzie. The original architectural pillars of Groove were COM, for software extensibility, and XML, for data extensibility. In V3 the internal XML datastore switches over to a binary record-oriented database.

You can't argue with results: after beating his brains out for a couple of years, Jack can finally point to a noticeable speedup in an app that has historically struggled even on modern hardware. The downside? Debugging. It was great to be able to look at an internal Groove transaction and simply be able to read it, Jack says, and now he can't. Hey, you've got to break some eggs to make an omelette.

Is a binary representation of the XML Infoset a useful way of improving performance when handling XML? Would it make a big enough difference?

For the specific case of Groove I'd be surprised if they used a binary representation of the XML infoset as opposed to a binary representation of their application object model. Lots of applications that utilize XML for data storage or configuration data immediately populate this data into application objects. This is a layer of unnecessary processing since one could skip the XML reading and writing step and directly read and write serialized binary objects. If performance is that important to your application and there are no interoperability requirements it is a better choice to serialize binary objects instead of going through the overhead of XML serialization/deserialization. The main benefit of using XML in such scenarios is that in many cases there is existing infrastructure for working with XML such as parsers, XML serialization toolkits and configuration handlers. If your performance requirements are so high that the overhead of going from XML to application objects is too high then getting rid of the step in the middle is a wise decision. Although as pointed out by Jon Udell you loose the ease of debugging that comes with using a text based format.

If you are considering using XML in your applications always take the XML Litmus Test

Categories: XML

May 15, 2004

@ 05:09 PM

Why You Won't See XSLT 2.0 or XPath 2.0 in the Next Version of the .NET Framework

The XML team at Microsoft has recently started getting questions about our position on XQuery 1.0, XPath 2.0 and XSLT 2.0. My boss, Mark Fussell, posted about why we have decided to implement XQuery 1.0 but not XSLT 2.0 in the next version of the .NET Framework. Some people misinterpreted his post to mean that the we chose to implement XQuery 1.0 over XSLT 2.0 because we prefer the syntax of the former over that of the latter. However decisions of such scale aren't made that lightly.

There are several reasons why we aren't implementing XSLT 2.0 and XPath 2.0

It takes a lot of effort and resources to implement all 3 technologies (XQuery, XSLT 2.0 & XPath 2.0). Our guiding principle was that we believe creating a proliferation of XML query technologies is confusing to end users. We'd rather implement one more language that we push people to learn than have to support and explain three more XML query and transformation languages, in addition to XPath 1.0 & XSLT 1.0 which already exist in the .NET Framework. Having our customers and support people have to deal with the complexity of 3 sophisticated XML query languages two of which are look similar but behave quite differently in the case of XPath 2.0 and XQuery seemed to us not to be that beneficial.

XPath 2.0 has different semantics from XQuery, XQuery is strongly and statically typed while XPath 2.0 is weakly and dynamically typed. So it isn't simply a case that if you implementing XQuery means that you can simply flip some flag and disable a feature or two to turn it into an XPath 2.0 implementation. However all of the use cases satisfied by XPath 2.0 can be satisfied by XQuery. In the decision to go with XQuery over XSLT 2.0, Mark is right that we felt that developers would prefer the familiar procedural model and syntax of XQuery to the template based model and syntax of XSLT 2.0. Most developers working with XSLT try to use it as a procedural language anyway, and don't really harness the power of templates. There's always the steep learning curve until you get to the “Aha“ moment and everything clicks. XQuery with its FLWOR construct and user defined functions fits more naturally with how both programmers and database administrators access and manipulate data than does XSLT 2.0. Thus we feel XQuery and not XSLT is the future of XML based query and transformation.

This doesn't mean that we will be removing XSLT 1.0 or XPath 1.0 support from the .NET Framework. It just means that our innovation and development efforts will be focused around XQuery going forward.

Categories: Life in the B0rg Cube | XML

May 11, 2004

@ 03:02 PM

How To Programmatically Modify an InfoPath Form Template

The folks behind the InfoPath team blog have posted a short series on how to programmatically modify InfoPath form templates. The example in the series shows how to change the URL of a XML Web Service end point used by the InfoPath form using a script instead of having to do it manually by launching InfoPath. The posts in the series are linked below

The posts highlight that the InfoPath XSN format is really a CAB file, and the files that make up the template are XML files which can easily be modified programmatically.

Categories: XML

May 6, 2004

@ 04:56 PM

Design Guidelines for Exposing XML in APIs for Whidbey/Longhorn

I recently submitted the Design Guidelines for Exposing XML Data as part of the WinFX design guidelines. You can read the guidelines in Krzysztof Cwalina's weblog. These are also the application guidelines that developers working with XML should follow for working with XML in the Whidbey timeframe. I'll be working with the FxCop team to get some rules written to check for compliance to these guidelines in the Whidbey base class library over the next few weeks.

Categories: Life in the B0rg Cube | XML

May 6, 2004

@ 04:19 PM

Knowing the Limitations of XML Schema Validation

I recently stumbled on blog posting by Phil Ringnalda called a little chip in the concept where he notes

Still, I was a bit surprised when Xiven linked to a post to the validator mailing list, pointing out that the utterly wrong HTML <a href=""><b><a href=""></a></b></a>, which is reported as invalid in HTML, is ignored in XHTML. Nesting links is one of those basic, there's absolutely no way you can ever do this, things, but in XHTML if you put a nested link inside an inline element, the validator won't catch it. According to Hixie's answer, it's because the validator uses an XML DTD for XHTML, and an SGML DTD for HTML, and while you can say that a/b/a is wrong in an SGML DTD, you can't in an XML DTD. As he puts it, in XHTML it's XML-valid but non-compliant.

Phil has just stumbled on just one of many limitations of XML schema languages. At first, when people see an XML schema language they expect that they will be able to use it to declaratively describe all the rules of their vocabulary. However this is rarely the case, every XML schema language has limitations in the constraints it can express. For example, W3C XML Schema can't express constraints such as a choice between attributes (either an uptime or downtime attribute appears on an element), DTDs can't express constraints on the range a text value can be (must be an integer between 5 and 10), RELAX NG can't express identity constraints on numeric values (e.g. each book in the inventory must have a unique ISBN) , and so on.

This means that developers using an XML schema language should be very careful when designing XML applications or XML vocabularies about what rules they can validate when they receive an input document. In some cases, the checks performed by schema validation may be so limited for a vocabulary that it is better to check the constraints using custom code or at the very least augment schema validation with some custom checks as well.

The fact is that many XML vocabularies are complex enough that their constraints aren't easily be expressible using a conventional XML schema language. XML vocabulary designers and developers of XML applications should always be on the look out for such cases else incorrect decisions be made in choosing a validation framework for incoming XML documents.

Categories: XML

April 26, 2004

@ 04:31 PM

XInclude for .NET, Sequential XPath and XPathNavigatorReader

It seems April is the month of custom implementations of the XmlReader. The first entry was Daniel Cazzulino's XPathNavigatorReader. As Daniel writes

There are many reasons why developers don't use the XPathDocument and XPathNavigator APIs and resort to XmlDocument instead... XPathNavigator is a far superior way of accessing and querying data because it offers built-in support for XPath querying independently of the store, which automatically gain the feature and more importantly, because it abstracts the underlying store

There are some problems with the XPathNavigator as implemented in the .NET Framework in v1.0 and v1.1. For the most part the APIs in the .NET Framework that work with XML mostly operate on instances of XmlReader or to a lesser extent XmlNode not XPathNavigator. Also there are some basic features one would expect from an XML API such as the ability to get the XML as a string which don't exist on the class. Daniel solves a number of these problems by implementing the XPathNavigatorReader which is a subclass of XmlTextReader implemented of over an XPathNavigator. This way you can pass an XPathNavigator to APIs expecting an XmlReader or XmlTextReader and get some user friendly functions like ReadInnerXml().

The second custom XmlReader I've seen this month is Oleg Tkachenko's XIncludingReader which is featured as part of his article on the MSDN entitled Combining XML documents with XInclude which provides a brief overview of XInclude and shows how to use the XIncludingReader which implements the XInclude 1.0 Last Call Working Draft from November 10, 2003. From the article

The key class within XInclude.NET is the XIncludingReader, found in the GotDotNet.XInclude namespace. The primary design goal was to build pluggable, streaming pipeline for XML processing. To meet that goal, XIncludingReader is implemented as an XmlReader, which can be wrapped around another XmlReader. This architecture allows easy plugging of XInclude processing layer into a variety of applications without any major modifications.

XML Inclusion process is orthogonal to XML parsing, validation, or transformation. That effectively means it's up to you when to allow XML Inclusion happen: after parsing, but before validation; or after validation, but before transformation, or even after transformation

The design of the XIncludingReader highlights the composability that was our goal when we originally shipped the XmlReader. One can layer readers on top of each other augmenting their capabilities as needed. We will definitely be emphasizing this more in Whidbey.

The third custom reader I've seen this month is the XPathReader. Nothing has been published about this class so far but I'm in the process of putting together an article about it which should show up on the MSDN XML Developer Center at the end of this week or early next week. The whet your appetite imagine an XmlReader that allows you to read XML in a forward-only, streaming manner but allows you to match XPath expressions based on the subset put forward by Arpan Desai in his paper Introduction to Sequential XPath. The following is a sample of how the XPathReader can be used

XPathCollection xc = new XPathCollection(); int query1 = xc.Add('//book/title'); XmlTextReader reader = new XmlTextReader("books.xml"); XPathReader xpathReader = new XPathReader(reader, xc); while (xpathReader.ReadUntilMatch()){ Console.WriteLine("Title={0}", xpathReader.ReadString()); }

I should be done with getting the article reviewed and the like in the next few days. April's definitely been the month of the XmlReader.

Categories: XML

April 24, 2004

@ 04:52 AM

On Reinventing Terminology

In response to a post about tunneling variables in XSLT 2.0 on the Lambda the Ultimate weblog, Frank Atanassow writes

The markup language community is notorious for reinventing and duplicating concepts and terminology, sometimes even their own. Thus they have "minimum literals" rather than "string literals", "parameter entities" rather than "macros", "templates" rather than "procedures" or "functions", "validate" rather "type-check", "data binding" rather than "translation", "unmarshal" rather than "parse" et cetera.

I suspect that a lot of this is due to sins of omission instead of sins of commision. People in the markup world just aren't that aware of what's going on in the world of programming languages [or databases] and vice versa. I have to deal with this a lot at work.

Thanks to Joshua Allen for pointing out this comment to me.

Categories: XML

April 22, 2004

@ 05:00 PM

Serializing an Object's State != Serializing an Object

In his post Why not standardize an Object Schema? Jason Mauss writes

I was listening to the latest .NET Rocks! episode; the part where they were discussing Service-Oriented systems. I don't remember exactly who-said-what but I do remember what was said. There was mention of something like, “You only want to pass XML messages back and forth, not objects.” The reasoning behind this (IIRC) had to do with interoperability. Let's say you have a .NET caller and a J2EE caller. Since they both define objects differently (and perhaps create and expect different serialized representations of objects) it's not gonna work. This got me thinking, why not have someone (like say, the W3C w/ the help of people at Sun, IBM, MS, etc.) develop a standard “object” schema for Web Services (and SO systems) to pass back and forth?

For example (this is just off the top of my head and not thought through well):
<object type=““ basetype=““>
   <property name=““ value=““ />
   <method name=““ accesstype=”” address="">
     <parameters>
        <parameter name="" type="" required="" />
     </parameters>
   </method>
</object>
I realize this is a huge simplification of what the schema might actually look like, but perhaps someone could provide me with some insight as to why this would or wouldn't be a good idea.

There are a number of points to tackle in this one post. The first is the misconception that XML and service orientation are somehow linked. Service orientation is simply a state of mind, go back and read Don's Don's four fundamentals of service orientation;

Boundaries are explicit
Services are autonomous
Services share schema and contract, not class
Service compatibility is determined based on policy

None of these explicitly rely on XML, except for the part about services sharing schemas and contracts not classes but XML isn't the only data format with a schema language. Some people such as Sun Microsoystems like to point out that ASN.1 schemas and binary encodings fit this bill as well. The key point is that you should be passing around messages with state not executable code. The fundamental genius of the SOAP 1.1 specification is that it brought this idea into the mainstream and built this concept into its very core. The original spec has this written into its design goals

A major design goal for SOAP is simplicity and extensibility. This means that there are several features from traditional messaging systems and distributed object systems that are not part of the core SOAP specification. Such features include

Distributed garbage collection
Boxcarring or batching of messages
Objects-by-reference (which requires distributed garbage collection)
Activation (which requires objects-by-reference)

Once you start talking about passing around objects and executable code the system becomes much more complex and much more tightly coupled. However experience from enterprise messaging systems and global distributed systems such as the World Wide Web show that you can build scalable, loosely coupled yet powerful applications in an architecture based on passing around messages and defining a couple of operations that can be performed on these messages. Would the Web be as successful if to make web requests you had to fire up Java RMI, DCOM, CORBA or some equivalent instead of making HTTP GET & HTTP POST requests over network sockets with text payloads?

Now as for Jason's schema, besides the fact that doing what he requests defeats the entire purpose of claiming to have built a service oriented application (even though the term is mostly meaningless anyway) the schema is missing the most important part. An object has state (fields & properties) as well as behavior (methods). Service oriented architectures dictate that you pass around state while the methods exist at the service end point, (e.g. an HTTP GET or HTTP POST request sends some state to the server either in the form of a payload or as HTTP headers which are then operated upon by the server which sends a result after said processing is done). Once you start wanting to send behavior over the wire you are basically asking to send executable code. The question then becomes what do you send; MSIL, Java byte codes, x86 instructions or some new fangled binary format? When you finally decide that all you would have done is reinvent Java RMI, CORBA, DCOM and every other distributed object system but this time it uses the XML magic pixie dust.

Categories: XML

April 21, 2004

@ 06:42 PM

The W3C Trudges Along Ever So Slowly...

Am I the only one saddened by the fact that it's been over four years since Microsoft and IBM co-submitted the XInclude NOTE and the spec is still just a Candidate Recommendation? How about the fact that the W3C Query Languages workshop which led to the creation of the XQuery working group was held almost six years ago and the XQuery specification is still a Working Draft which means it is still a year or two from being done.

This lateness in delivering specs in combination with the unnecessary complexity yet lack of features of other W3C technologies such as XML Schema makes me feel more and more that the W3C is more of a hinderance to the world of XML development than a boon at this point.

Many feel that there isn't any alternative but grinning and bearing it. I wonder if that is truly the case and whether individual or community based innovation such as has happened with technologies like RSS or EXSLT isn't the way forward.

Categories: XML

April 14, 2004

@ 05:59 PM

Static classes in C#

According to Eric Gunnerson we now write Static classes in C#

So, for Whidbey, we allow the user to mark a class as static, which means that it's sealed, has no constructor, and the compiler will give you an error if you write an instance method.

Rumor has it that the 1.0 frameworks shipped with an instance method on a static class.

This would be nice to have in the language. I know it would have helped us catch the fact that the System.Xml.XmlConvert class which only has static methods was shipped with a [useless] default constructor.

Categories: Life in the B0rg Cube | XML

April 14, 2004

@ 05:22 PM

C|Net Invents Yet Another XML Syndication Format

I just noticed a post on Mark Pilgrim's blog entitled Hot RSS where he writes

I would like to applaud CNET for their courageous invention of a completely new and incompatible version of RSS. They call it dlhottitles, but I think it deserves to be named something sexy, like “Hot RSS”. Here’s a live sample (static mirror). Some people might say that CNET was ripping up the pavement by inventing their own incompatible syndication format instead of re-using one of the myriad of existing incompatible syndication formats. Some people might get a little hot and bothered about the fact that CNET is featuring it prominently on a page entitled “Simply RSS”, complete with the requisite orange XML button that shows that this is truly the product of a clued-in syndication producer. But I say: the more RSS the merrier!

I am completely perplexed by this move by C|Net, I can only hope it's a belated April Fool's joke. I wanted to join in the fun and add support for the format to RSS Bandit but my my problems from yesterday have left me without a working install of Visual Studio.

Viva la Hot RSS.

Categories: XML

April 12, 2004

@ 03:33 AM

Designing XML Formats: Versioning vs. Extensibility

Many people designing XML formats whether for application-specific configuration files, website syndication formats or new markup languages have to face the problem of how to design their formats to be extensible and yet be resilient to changes due to changes to versions of the format. One thing I have noticed in talking to various teams at Microsoft and some of our customers is that many people think about extensibility of formats and confuse that for being the same as the versioing problem. I have written previously On Versioning XML Vocabularies in which I stated

At this point I'd like to note that this a versioning problem which is a special instance of the extensibility problem. The extensibility problem is how does one describe an XML vocabulary in a way that allows producers to add elements and attributes to the core vocabulary without causing problems for consumers that may not know about them. The versioning problem is specific to when the added elements and attributes actually are from a subsequent version of the vocabulary (i.e. a version 2.0 server talking to a version 1.0 client).

The problem with the above paragraph is that it it focuses on a narrow aspect of the versioning problem. A versioning policy should not only be concerned with when new elements and attributes are added to the format but also when existing ones are changed or even removed.

The temptation to think about versioning as a variation of the extensibility problem is due to the fact that the focus of the XML family of technologies has been about extensibility. As I wrote in my previous posting

One of the primary benefits of using XML for building data interchange formats is that the APIs and technologies for processing XML are quite resistant to additions to vocabularies. If I write an application which loads RSS feeds looking for item elements then processes their link and title elements using any one of the various technologies and APIs for processing XML such as SAX, the DOM or XSLT it is quite straightforward to build an application that processes said elements which is resistant to changes in the RSS spec or extensions to the RSS spec as the link and title elements always appear in a feed.

Similarly XML schema languages such as W3C XML Schema have a number of features that promote extensibility such as wildcards, substitution groups and xsi:type but few if any that target the versioning problem. I've written about a number of techniques for adding extensibility to XML formats using W3C XML Schema in my article W3C XML Schema Design Patterns: Dealing With Change but none so far on approaches to versioning in combination with your favorite XML schema language.

There are a number of things that could change about the constructs in a data transfer format including

New concepts are added (e.g. new elements or attributes added to format or new values for enumerations)
Existing concepts are changed (e.g. existing elements & attributes should be interpreted differently, added elements or attributes alter semantics of their parent/owning element)
Existing concepts are deprecated (e.g. existing elements & attributes should now issue warning when consumed by an application)
Existing concepts are removed (e.g. existing elements & attributes should no longer work when consumed by an application)

How all four of the above changes between versions of the XML format are handled should be considered when designing the format. Below are sample solutions for each of the aformentioned changes

New concepts are added: In some cases the new concepts are completely alien to those in the existing format. For example, the second version of XQueryX will most likely have to deal with the additions of data update commands such as insert or delete while the existing format only has query constructs. In such cases it is most prudent to eschew backwards compatibility by either changing the version number or namespace of XML format. On the other hand, if the new additions are optional or ignorable and the format has extensibility rules for items from a different namespace than that of the format itself then the new additions (elements and attributes) can be placed in a different namespace from that of the format. In more complex cases, it may be likely that there are some additions that cannot be ignored by older processors while others can. In such cases, serious consideration should be made for adding a concept similar to the mustUnderstand attribute in SOAP 1.1 where one can indicate which additions to the format are backwards compatible and which ones are not.

In the case of new possible values being added to an enumeration (e.g. a color attribute that had the option of being "red" or "blue" in version 1.0 of the format has "green" added as a possible value in future version of the format) the specification for the format needs to determine what the behavior of older processors should be when they see values they do not understand.
Existing concepts are changed: In certain cases the interpretation of an element or attribute may be changed across versions of a vocabulary. For example, the current working draft of XSLT 2.0 has a list of incompatibilities between it and XSLT 1.0 when the same elements and attributes are used in a stylesheet. In such cases it is most prudent to change the major version number of the format if one exists or change the namespace of the format otherwise. This means the format will not be backwards compatible.
Existing concepts are deprecated: Sometimes as a format evolves, one realizes that some concepts need to be reworked and replaced by improved implementations of these concepts. An example of this is the deprecation of the requiredRuntime element in favor of the supportedRuntime element in .NET Framework application configuration files. Format designers need to consider how to make such changes work in a backwards compatible manner. In the case of .NET Framework configuration files, both elements are used for applications targetting version 1.0 of the .NET Framework since the former is understood by the configuration engine while the latter is ignored.
Existing concepts are removed: Constructs may sometimes be removed from formats because they prove to be inappropriate or insecure. For example, the most recent draft of XHTML 2.0 removes familiar elements like img and br (descriptions of the backwards incompatible changes in XHTML 2.0 from XHTML 1.1 are available in the following articles by Mark Pilgrim, All That We Can Leave Behind and The Vanishing Image: XHTML 2 Migration Issues). This approach removes forwards compatibility and in such cases it is most prudent to either change the version number or namespace of the XML format.

This blog post just scratches the surface of what can be written about the various concerns when designing XML formats to be version resilient. There are a couple of issues as to how best to represent such changes in an XML schema and if one should even bother trying in certain cases. I'll endeavor to put together an artricle about this on MSDN in the next month or two.

Categories: XML

April 10, 2004

@ 07:36 PM

Implementing Standards is a Game of Dice

Matevz Gacnik points out Serious bug in System.Xml.XmlValidatingReader, he writes

The schema spec and especially RFC 2396 state that xs:anyURI instance can be empty, but System.Xml.XmlValidatingReader keeps failing on such an instance.

To reproduce the error use the following schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="AnyURI" type="xs:anyURI">
</xs:element>
</xs:schema>

And this instance document:

<?xml version="1.0" encoding="UTF-8"?>
<AnyURI/>

There is currently no workaround for .NET FX 1.0/1.1. Actually Whidbey is the only patch that fixes this. :)

The schema validation engine in the .NET Framework uses the System.Uri class for parsing URIs. This class doesn't consider an empty string to be a valid URI which is why our schema validation considers the above instance to be invalid according to its schema. However it isn't clear cut in the specs whether this is valid or not at least not without a bunch of sleuthing. As Micheal Kay (XSLT working group member) and C.M. Speilberg-McQueen (chairman of the XML Schema working group) wrote on XML-DEV

To: Michael Kay <michael.h.kay@ntlworld.com>
Subject: RE: [xml-dev] Can anyURI be empty?
From: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
Date: 07 Apr 2004 10:49:51 -0600
Cc: xml-dev@lists.xml.org

On Wed, 2004-04-07 at 03:47, Michael Kay wrote:
> > If it couldn't, it would be wrong. An empty string is a valid URI.
>
> On this, like so many other things, RFC 2396 is a total disaster. An empty
> string is not valid according to the BNF syntax, but the RFC gives detailed
> semantics for what it means (detailed semantics, though very imprecise
> semantics).
>
> And the schema REC doesn't help. It has the famous note saying that the
> definition places "only very modest obligations" on an implementation, and
> it doesn't say what those obligations are.

Yes. This is a direct result of our realization that
we have as much trouble understanding RFC 2396 as anyone
else. The anyURI type imposes the obligations of
RFC 2396, whatever those are. Any attempt to paraphrase
them on our part would lead, I fear, to an unsatisfactory
result: either we would make some mistake (like believing
that since the BNF does not accept the empty string,
it must not be legal) or we would make no mistakes. In
the one case, we'd be misleading our readers, and in
either case, we'd find ourselves mired in a never-ending
effort to prove that our paraphrase was, or was not,
correct.

RFC 2396 is one of the fundamental specifications of the World Wide Web yet it is vague and contradictory in a number of key places. Those of us implementing standards often have to go on gut feel or try and track the spec authors whenever we bump across issues like this but sometimes we miss them.

All I can do is apologize to people like Matevz Gacnik who have to bear the brunt of the lack of interoperability caused by vaguely written specifications implemented on our platform and for the fact that a fix for this problem won't be available until Whidbey.

Categories: XML

April 5, 2004

@ 03:53 AM

WS-MetadataExchange and the ATOM API

I finally got to take a look at the WS-MetadataExchange specification while hanging out in Don's office last week. The spec is fairly straightforward, it defines a mechanism for one to request the WSDL, Policy or XML Schema of a target namespace (i.e. a URI) from an XML Web Service endpoint. Basically one can ask what services an endpoint supports and what the messages the end point accepts should look like.

Both Don and Omri have suggested that WS-MetadataExchange can solve a problem I had with the SOAP-based version of the ATOM API. The problem is how an ATOM client is supposed to know what services an ATOM end point supports. Here are three descriptions of ATOM-enabled sites that I might want to interact with as an RSS Bandit user.

A weblog that supports user comments posted anonymously and provides the ability to search the weblog archives. The user comments must use a subset of HTML. For example, Sam Ruby's weblog.
A weblog that doesn't have comments enabled but does provide the ability to search the weblog archives. For example, Mark Pilgrim's weblog
A weblog that only supports comments that have been authenicated with TypeKey and doesn't support search. Again user comments must use a subset of HTML. Any Movable Type blog that supports TypeKey is an example.

All three would require a smart client to give the user visual hints and clues as to how they can interact with the site. At the very minimum a search box that is grayed out when the target weblog doesn't support search.

So far the only mechanism I've seen proposed for solving this problem in the case of the ATOM API is the link element used for locating service endpoints. This allows you to get the URI of service end points like where to post comments or where to send search results if they exist but do not answer finer grained questions. Questions such as “What subset of HTML can I use in comments?” or “Do I need to be authenticated before I post comments” are currently not answered by any of the draft ATOM specs.

So far WS-MetadataExchange or something like it look like the best way to support such scenarios for SOAP-enabled ATOM end points in a way that is consistent with the Global XML Web Services architecture. I would be interested in seeing an ATOM-specific solution evolve as well since some of this issues hurt usability of weblogs. I've lost count of the amount of times I've posted a comment or seen someone post a comment only to complain about the fact that the weblog doesn't support HTML or mangled some text. Having a way to inquire about this in a standard way would definitely improve the user experience.

Categories: XML

April 1, 2004

@ 04:29 PM

Some Thoughts on the SAX dot NET Project

A little while ago I noticed the SAX dot NET project was announced on the XML-DEV mailing list. From the desxcription on the project page

SAX dot NET is a C# port of the original Java based SAX API specifications. When compiled into a .NET assembly it becomes available to the other .NET languages as well.

The .NET Framework doesn't ship with an implementation of a SAX push model XML parser but instead ships with the pull-model parser in the form of the System.Xml.XmlReader class. The primary reasons for this can be gleaned from my article A Survey of APIs and Techniques for Processing XML where I list the pros and cons of various approaches for processing XML. The main advanatages a pull-model XML parser like the XmlReader have over a push model XML parser like SAX are

Pull model parsers typically do not require a specialized class for handling XML processing since there is no requirement to implement specific interfaces or subclass certain classes for the purpose of registering callbacks. Also the need to explicitly track application states using boolean flags and similar variables is significantly reduced when using a pull model parser

I can understand that developers migrating to the .NET Framework from Java platforms or MSXML would like to have the familiar feel of the SAX API so I definitely welcome such projects. However I have seen some criticism of the project from Daniel Cazzulino, a Microsoft XML MVP, in his post Do we need SAX for .NET? (or does Java ports to C# make sense?) he points out of some of the disadvantages of blindly porting an API from one platform to another. He points out some inconsistencies and redundancies between SAX dot NET and the .NET Framework such as

There is an XmlNamespaces class that does the same thing as the System.Xml.XmlNamespaceManager class
There are IAttributes AND IAttributes2, and the corresponding implementations called AttributesImpl and AttributesImpl2 which seem to imply interface versioning problems and legacy issues in a brand new project.
The existence of non-standard delegates such as OnPropertyChange(IProperty property, object newValue) instead of the typical pattern in the .NET world where it should be OnPropertyChange(object sender, ProperyChangeEventArgs e).

I think Daniel raises good points and encourage any developer porting an API to the .NET Framework to endeavor to make it consistent with the patterns and naming conventions in the .NET Framework. Doing so makes it easier for developers to understand how to use the API since it will be familiar and contains few surprises.

Categories: XML

March 29, 2004

@ 09:07 PM

XML Developer Center on MSDN Launched

After talking about it for the past few weeks the XML Developer Center on MSDN is finally here. As mentioned in my previous post on the Dev Center the most obvious changes from the previous incarnation of http://msdn.microsoft.com/xml are

The XML Developer Center will provide an entry point to working with XML in Microsoft products such as Office and SQL Server.
The XML Developer Center will have an RSS feed.
The XML Developer Center will pull in content from my work weblog.
The XML Developer Center will provide links to recommended books, mailing lists and weblogs.
The XML Developer Center will have content focused on explaining the fundamentals of the core XML technologies such as XML Schema, XPath, XSLT and XQuery.
The XML Developer Center will provide sneak peaks at advances in XML technologies at Microsoft that will be shipping future releases of the .NET Framework, SQL Server and Windows.

As mentioned in my previous post the first in a series of articles describing the changes to System.Xml in version 2.0 of the .NET Framework is now up. Mark Fussell has published What's New in System.Xml for Visual Studio 2005 and the .NET Framework 2.0 Release which mentiones the top 10 changes to the core APIs in the System.Xml namespace.

There is one cool new addition that is missing from Mark's article, which I guess would be number 11 of his top 10 list. The XSD Inference API which can be used to create an XML Schema definition language (XSD) schema from an XML instance document will also be part of System.Xml in Whidbey. Given the enthusiasm we saw in various parties about XSD inference we decided to promote it from just being a freely downloadable tool to being part of the .NET Framework. Below are a couple of articles about XSD Inference

Generate XSD Schemas by Inference by Roger Jennings, XML Web Services Magazine.
Modeling biz docs in XML by Jon Udell, InfoWorld.
Using the XSD Inference Utility by Nithya Sampathkumar, MSDN.

If you have any thoughts about what you'd like to see on the Dev Center or any comments on the new design, please let me know.

Categories: XML

March 28, 2004

@ 07:59 PM

The Holy Grail in XML<->Object Mapping Technologies

I was reading a post by Rory Blyth where he points to Steve Maine's explanation of the benefits of Prothon (an object oriented programming language without classes). He writes

One quote from Steve's post that has me thinking a bit, though, is the following:

The inherent extensibility and open content model of XML makes coming up with a statically typed representation that fully expresses all possible instance documents impossible. Thus, it would be cool if the object representation could expand itself to add new properties as it parsed the incoming stream.

I can see how this would be cool in a "Hey, that's cool" sense, but I don't see how it would help me at work. I fully admit that I might just be stupid, but I'm honestly having a hard time seeing the benefit. Right now, I'm grabbing XML in the traditional fashion of providing the name of the node that I want as a string key, and it seems to be working just fine.

The problem solved by being able to dynamically add properties to a class in the case of XML<->object mapping technologies is that it allows developers to program against aspects of the XML document in a strongly typed manner even if they are not explicitly described in the schema for the XML document.

This may seem unobvious so I'll provide an example that illustrates the point. David Orchard of BEA wrote a schema for the ATOM 0.3 syndication format. Below is the fragment of the schema that describes ATOM entries

<xs:complexType name="entryType"> <xs:sequence> <xs:element name="title" type="xs:string"/> <xs:element name="link" type="atom:linkType"/> <xs:element name="author" type="atom:personType" minOccurs="0"/> <xs:element name="contributor" type="atom:personType" minOccurs="0" maxOccurs="unbounded"/> <xs:element name="id" type="xs:string"/> <xs:element name="issued" type="atom:iso8601dateTime"/> <xs:element name="modified" type="atom:iso8601dateTime"/> <xs:element name="created" type="atom:iso8601dateTime" minOccurs="0"/> <xs:element name="summary" type="atom:contentType" minOccurs="0"/> <xs:element name="content" type="atom:contentType" minOccurs="0" maxOccurs="unbounded"/> <xs:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute ref="xml:lang" use="optional"/> <xs:anyAttribute/> </xs:complexType>

The above schema fragment produces the following C# class when the .NET Framework's XSD.exe tool is run with the ATOM 0.3 schema as input.

/// <remarks/>
[System.Xml.Serialization.XmlTypeAttribute(Namespace="http://purl.org/atom/ns#")]
public class entryType {

    /// <remarks/>
    public string title;

    /// <remarks/>
    public linkType link;

    /// <remarks/>
    public personType author;

    /// <remarks/>
    [System.Xml.Serialization.XmlElementAttribute("contributor")]
    public personType[] contributor;

    /// <remarks/>
    public string id;

    /// <remarks/>
    public string issued;

    /// <remarks/>
    public string modified;

    /// <remarks/>
    public string created;

    /// <remarks/>
    public contentType summary;

    /// <remarks/>
    [System.Xml.Serialization.XmlElementAttribute("content")]
    public contentType[] content;

    /// <remarks/>
    [System.Xml.Serialization.XmlAnyElementAttribute()]
    public System.Xml.XmlElement[] Any;

    /// <remarks/>
    [System.Xml.Serialization.XmlAttributeAttribute(Namespace="http://www.w3.org/XML/1998/namespace")]
    public string lang;

    /// <remarks/>
    [System.Xml.Serialization.XmlAnyAttributeAttribute()]
    public System.Xml.XmlAttribute[] AnyAttr;
}

As a side note I should point out that David Orchard's ATOM 0.3 schema is invalid since it refers to an undefined authorType so I had to remove the reference from the schema to get it to validate.

The generated fields highlighted in bold show the problem that the ability to dynamically add fields to a class would solve. If programming against an ATOM feed using the above entryType class then once one saw an extension element, you'd have to fallback to XML processing instead of programming using strongly typed constructs. For example, consider Mark Pilgrim's RSS feed which has dc:subject elements which are not described in the ATOM 0.3 schema but are allowed due to the existence of xs:any wildcards. Watch how this complicates the following code which prints the title, issued date and subject of each entry.

foreach(entryType entry in feed.Entries){ Console.WriteLine("Title: " + entry.title); Console.WriteLine("Issued: " + entry.issued); string subject = null; //find the dc:subject foreach(XmlElement elem in entry.Any){ if(elem.LocalName.Equals("subject") && elem.NamespaceUri.Equals("http://purl.org/dc/elements/1.1/"){ subject = elem.InnerText; break; } } Console.WriteLine("Subject: " + subject); }

As you can see, one minute you are programming against statically and strongly typed C# constructs and the next you are back to checking the names of XML elements and programming against the DOM. If there was infrastructure that enabled one to dynamically add properties to classes then it is conceivable that even though the ATOM 0.3 schema doesn't define the dc:subject element one would still be able program against them in a strongly typed manner in generated classes. So one could write code like

foreach(entryType entry in feed.Entries){ Console.WriteLine("Title: " + entry.title); Console.WriteLine("Issued: " + entry.issued); ); Console.WriteLine("Subject: " + entry.subject); }

Of course, there are still impedance mismatches to resolve like how to reflect namespace names of elements or make the distinction between attributes vs. elements in the model but having the capabilities Steve Maine describes in his original post would improve the capabilities of the XML<->Object mapping technologies that exist today.

Categories: XML

March 24, 2004

@ 04:50 PM

MSDN Magazine Article on Blogging and RSS

Aaron Skonnard has a new MSDN magazine article entitled All About Blogs and RSS where he does a good job of summarizing the various XML technologies around weblogs and syndication. It is a very good FAQ and one I definitely will be pointing folks to in future when asked about blogging technologies.

Categories: Mindless Link Propagation | XML

March 24, 2004

@ 04:19 PM

Design Guidelines for Working with XML in the .NET Framework

My recent Extreme XML column entitled Best Practices for Representing XML in the .NET Framework is up on MSDN. The article was motivated by Krzysztof Cwalina who asked the XML team for design guidelines for working with XML in WinFX. There had been and currently is a bit of inconsistency in how APIs in the .NET Framework represent XML and this is the first step in trying to introduce a set of best practices and guidelines.

As stated in the article there are three primary situations when developers need to consider what APIs to use for representing XML. The situations and guidelines are briefly described below:

Classes with fields or properties that hold XML: If a class has a field or property that is an XML document or fragment, it should provide mechanisms for manipulating the property as both a string and as an XmlReader.
Methods that accept XML input or return XML as output: Methods that accept or return XML should favor returning XmlReader or XPathNavigator unless the user is expected to be able to edit the XML data, in which case XmlDocument should be used.
Converting an object to XML: If an object wants to provide an XML representation of itself for serialization purposes, then it should use the XmlWriter if it needs more control of the XML serialization process than what is provided by the XmlSerializer. If the object wants to provide an XML representation of itself that enables it to participate fully as a member of the XML world, such as allow XPath queries or XSLT transformations over the object, then it should implement the IXPathNavigable interface.

A piece of criticism I got from Joshua Allen was that the guidelines seemed to endorse a number of approaches instead of defining the one true approach. The reason for this is that there isn't one XML API that satisfies the different scenarios described above. In Whidbey we will be attempting to collapse the matrix of choices by expanding the capabilities of XML cursors so that there shouldn't be a distinction between situations where an API exposes an API like XmlDocument or one like XPathNavigator.

One of the interesting design questions we've gone back and forth on is whether we have both a read-only XML cursor and read-write XML cursor (i.e. XPathNavigator2 and XPathEditor) or a single XML cursor class which has a flag that indicates whether it is read-only or not (i.e. the approach taken by the System.IO.Stream class which has CanRead and CanWrite properties). In Whidbey beta 1 we've gone with the former approach but there is discussion on whether we should go with the latter approach in beta 2. I'm curious as to which approach developers using System.Xml would favor.

Categories: XML

March 17, 2004

@ 03:15 PM

Countdown to the XML Developer Center on MSDN

In less than a week we'll be launching the XML Developer Center on MSDN and replacing the site at http://msdn.microsoft.com/xml. The main differences between the XML Developer Center and what exists now will be

The XML Developer Center will provide an entry point to working with XML in Microsoft products such as Office and SQL Server.
The XML Developer Center will have an RSS feed.
The XML Developer Center will pull in content from my work weblog.
The XML Developer Center will provide links to recommended books, mailing lists and weblogs.
The XML Developer Center will have content focused on explaining the fundamentals of the core XML technologies such as XML Schema, XPath, XSLT and XQuery.
The XML Developer Center will provide sneak peaks at advances in XML technologies at Microsoft that will be shipping future releases of the .NET Framework, SQL Server and Windows.

During the launch the feature article will be the first in a series by Mark Fussell detailing the changes we've made to the System.Xml namespaces in Whidbey. His first article will focus on the core System.Xml classes like XmlReader and XPathNavigator. A follow up article is scheduled that will talk about additions to System.Xml since the last version of the .NET Framework such as XQuery. Finally, either Mark or Matt Tavis will write an article about the changes coming to System.Xml.Serialization such as the various hooks for allowing custom code generation from XML schemas such as IXmlSerializable (which is no longer an unsupported interface) and SchemaImporterExtensions.

I'll also be publishing our guidelines for exposing XML in .NET applications as well during the launch. If there is anything else you'd like to see on the XML Developer Center let me know.

Categories: XML

March 16, 2004

@ 09:02 PM

Misunderstanding XML and Other RSS Follies

I just noticed that Arve Bersvendsen has written a post entitled 11 ways to valid RSS where he states he has seen 11 different ways of providing content in an RSS feed namely

Content in the description element

I have so far identified five different variants of content in the <description> element:

Plaintext as CDATA with HTML entities - Validate
HTML within CDATA - Validate
HTML escaped with entities - Validate
Plain text in CDATA - Validate
Plaintext with inline HTML using escaping - Validate

<content:encoded>

I have encountered and identified two different ways of using <content:encoded>:

Using entities - Validate
Using CDATA - Validate

XHTML content

Finally, I have encountered and identified four different ways in which people has specified XHTML content:

Using <xhtml:body> - Validate
Using <xhtml:div> - Validate
Using <body> with default namespace - Validate
Using <div> with default namespace - Validate

At first these seem like a lot until you actually try to program against this using an XML parser. In which case, the first thing you notice is that there is no difference programming against CDATA vs. escaped entities since they are both syntactic sugar. For example, the XML infoset and data models compatible with it such as the XPath data model do not differentiate character content that is written as character references, CDATA sections or entered directly. So the following

    <test><![CDATA[ ]]>2</test>
    <test>&#160;2</test>
    <test> 2</test>

are all equivalent. More directly if you loaded all three into an instance of System.Xml.XmlDocument and checked their InnerText property they'd all return the same result. So this reduces Arve's first two elements to

Content in the description element

I have so far identified ~~five~~ two different variants of content in the <description> element:

HTML
Plain text

<content:encoded>

I have encountered and identified ~~two different ways~~ one way of using <content:encoded>:

Containing escaped HTML content

If your code makes any distinctions other than these then it is a sign that you have (a) misunderstood how to process RSS or (b) are using a crappy XML parser. When I first started working on RSS Bandit I also was confused by these distinctions but after a while things became clearer. The only problem here is the description element since you can't tell whether it is HTML or not without guessing. Since RSS Bandit always provides the content to an embedded web browser this isn't a problem but I can see how it could be one for aggregators that don't know how to process HTML (although I've never seen one before).

Another misunderstanding by Arve seems to be how namespaces work in XML. A few years ago I wrote an XML Namespaces and How They Affect XPath and XSLT where I wrote

A qualified name, also known as a QName, is an XML name called the local name optionally preceded by another XML name called the prefix and a colon (':') character...The prefix of a qualified name must have been mapped to a namespace URI through an in-scope namespace declaration mapping the prefix to the namespace URI. A qualified name can be used as either an attribute or element name.

Although QNames are important mnemonic guides to determining what namespace the elements and attributes within a document are derived from, they are rarely important to XML aware processors. For example, the following three XML documents would be treated identically by a range of XML technologies including, of course, XML schema validators.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:complexType id="123" name="fooType"/>
</xs:schema>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <xsd:complexType id="123" name="fooType"/>
</xsd:schema>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
        <complexType id="123" name="fooType"/>
</schema>

Bearing this information in mind this reduces Arve's example to

XHTML content

Finally, I have encountered and identified ~~four~~ two different ways in which people has specified XHTML content:

Using <xhtml:body>
Using <xhtml:div>

Thus with judicious use of an XML parser (which makes sense since RSS is an XML format), Arve's list of eleven ways of providing content in RSS is actually whittled down to five. I assume Arve is unfamiliar with XML processing which led to his initial confusion.

NOTE: Before anyone bothers to start pointing out that Atom somehow frees aggregator author from this myriad of options I'll point out that Atom has more ways of encoding content than these. Even ignoring the inconsequential differences in syntactic sugar in XML (escaped tags vs. unescaped tags in CDATA sections) the various combinations of the <summary> and <content> elements, the mode attribute (escaped vs. xml) and MIME types (text/plain, text/html, application/xhtml+xml) more than double the number of variations possible in RSS.

Categories: XML

March 15, 2004

@ 04:35 PM

Tim Bray now Works at Sun

It what seems to be the strangest news story I've read this year I find out Sun snatches up XML guru what I found particularly interesting in the story was the following excerpt

One of the areas Bray expects to work on is developing new applications for Web logs, or "blogs," and the RSS (Resource Description Framework Site Summary) technology that grew out of them. "I think that this is potentially a game-changer in some respects, and there are quite a few folks at Sun who share that opinion," he said.

Though RSS is traditionally thought of as a Web publishing tool, it could be used for much more than keeping track of the latest posts to blogs and Web sites, Bray said. "I would like to have an RSS feed to my bank account, my credit card, and my stock portfolio," he said.

Personally I think it's a waste of Tim Bray's talents having him work on RSS or it's competitor du jour, Atom, but it should be fun seeing whether he can get Sun out of it's XML funk as well stop them from spreading poisonous ideas like replacing XML with ASN.1.

Update: Tim Bray has a post about his new job entitled Sunny Boy where he writes

That aside, I’m comfy being officially a direct competitor of Microsoft. On the technical side, I find the APIs inelegant, the UI aesthetics juvenile, and the neglect of the browser maddening.

Sounds like fighting words. This should be fun. :)

Categories: XML

March 11, 2004

@ 04:54 PM

On Merging RSS and Atom

Dave Winer has a proposal for merging RSS and ATOM. I'm stunned. It takes a bad idea (coming up with a redundant XML syndication format that is incompatible with existing ones) and merges it with a worse idea (have all these people who dislike Dave Winer have to work with him).

After adding Atom support to RSS Bandit a thought crystallized in my head which had been forming for a while; Atom really is just another flavor of RSS with different tag names. It looks like I'm not the only aggregator author to come to this conclusion, Luke Huttemann also came to the same conclusion when describing SharpReader implementation of Atom. What this means in practice is that once you've written some code that handles one flavor of RSS be it RSS 0.91, RSS 1.0, or RSS 2.0 then adding support for other flavors isn't that hard and they basically all have the same information just hidden in different tag names (pubDate vs. dc:date, admin:errorsReportsTo vs. webMaster, author vs. dc:creator, etc). To the average user of any popular aggregator there isn't any noticeable difference when subscribed to an RSS 1.0 feed vs. an RSS 2.0 feed or an RSS 2.0 feed vs. an Atom feed.

And just like with RSS, aggregators will special case popular ATOM feeds with weird behavior that isn't described in any spec or interprets the specs in an unconventional manner. As Phil Ringnalda points out Blogger ATOM feeds claim that the summary contains XHTML when in fact they contain plain text. This doesn't sound like a big deal until you realize that in XHTML whitespace isn't significant (e.g. newlines are treated as spaces) which leads to poorly formatted content when the aggregator displays the content as XHTML when it truth it is plain text. Sam Ruby's ATOM feed contains relative links in the <url> and <link> elements but doesn't use xml:base. There is code in most aggregators to deal with weird but popular RSS feeds and it seems Atom is already gearing up to be the same way. Like I said, just another flavor of RSS. :)

As an aside I find it interesting that currently Sam Ruby's RSS 2.0 feed provides a much better user experience for readers than his ATOM feed. The following information is in Sam's RSS feed but not his Atom feed

Email address of the webmaster of the site. [who to send error reports to]
The number of comments per entry
An email address for sending a response to an entry
An web service endpoint for posting comments to an entry from an aggregator
An identifier for the tool that generated the feed
The trackback URL of each entry

What this means is that if you subscribe to Sam's RSS feed with an aggregator such as SharpReader or RSS Bandit you'll get a better user experience than if you used his Atom feed. Of course, Sam could easily put all the namespace extensions in his RSS feed in his Atom feed as well in which case the user experience subscribing to either feed would be indistinguishable.

Arguing about XML syndication formats is meaningless because the current crop that exist all pretty much do the same thing. On that note, I'd like to point out that websites that provide multiple syndication formats are quite silly. Besides confusing people trying to subscribe to the feed there isn't any reason to provide an XML syndication feed in more than one format. Particularly silly are the ones that provide both RSS and Atom feeds (like mine).

Blogger has it right here by providing only one feed format per blog (RSS or Atom). Where they screwed up is by forcing users to make the choice instead of making the choice for them. That's on par with asking whether they want the blog served up using HTTP 1.0 or HTTP 1.1. I'm sure there are some people that care but for the most part it is a pointless technical question to shove in the face of your users.

Categories: XML

March 9, 2004

@ 03:39 PM

On The Upcoming XML Developer Center on MSDN

MSDN has a number of Developer Centers for key developer topics such as XML Web Services and C#. There are also node home pages for lesser interesting [according to MSDN] topics such as Windows Scripting Host or SQLXML. Besides the fact that developer centers are highlighted more prominently on MSDN as key topics the main differences between the developer centers and the node home pages are

Developer Centers have a snazzier look and feel than node home pages.
Developer Centers have an RSS feed.
Developer Centers can pull in blog content (e.g. Duncan Mackenzie's blog on the C# Developer Center)

I've been working on getting a Developer Center on MSDN that provides a single place for developers to find out about XML technologies and products at Microsoft for about a year or more. The Developer Center is now about two weeks from being launched. There are only two questions left to answer.

The first question is what the tagline for the Developer Center should be. Examples of existing taglines are

Microsoft Visual C# Developer Center: An innovative language and tool for building .NET-connected solutions
Data Access and Storage Developer Center: Harnessing the power of data
Web Services Developer Center: Connecting systems and sharing information
.NET Architecture Developer Center: Blueprint for Success

I need something similar for the XML Developer Center but my mind's been drawing a blank. My two top choices are currently “The language of information interchange” or “Bridging gaps across platforms with the ubiqitous data format”. In my frivilous moments, I've also considered “Unicode + Angle Brackets = Interoperability”. Any comments on which of the three taglines I have in mind sounds best or suggestions for taglines would be much appreciated.

The second issue is how much we should talk about unreleased technologies. I personally dislike talking about technologies before they ship because history has taught me that projects slip or get cut when you least expect them to do so. For example, when I was first hired fulltime at Microsoft about two years ago we were working on XQuery which was supposed to be in version 2.0 of the .NET Framework. At the time the assumption was that they'd both (XQuery & the next version of the .NET Framework) be done by the end of 2003. It is now 2004 and it is optimistic to expect that either XQuery or the next version of the .NET Framework will both be done at the end of this year. If we had gone off our initial assumptions and started writing about XQuery and the classes we were designing for the .NET Framework (e.g. XQueryProcessor ) in 2002 and 2003 on MSDN then we'd currently have a number of outdated and incorrect articles on MSDN. On the other hand this does mean that while you won't find articles on XQuery on MSDN you do find articles like An Introduction to XQuery, XML for Data: An early look at XQuery ,X is for XQuery, and XQuery Tricks and Traps on the developer websites of our competitors like IBM and Oracle. All four of those articles contain information that is either outdated or will be outdated when the W3C is done with the XQuery recommendation. However they do provide developers with a glimpse and an understanding of the fundamentals of XQuery.

The question I have is whether it would be valuable for our developers if we wrote articles about technologies that haven't shipped and whose content may differ from what we actually ship? Other developer centers on MSDN have decided to go this route such as the Longhorn Developer Center and Web Services Developer Center which regularly feature content that is a year or more away from shipping. I personally think this is unwise but I am interested in what the Microsoft developer community thinks of providing content about upcoming releases versus focusing on existing releases.

Categories: XML

February 28, 2004

@ 06:14 PM

A Look at the xml:base attribute and the .NET Framework's XmlReader

The W3C xml:base recommendation describes the attribute xml:base when appearing on an XML element allows one to specify a base URI for the element and its children other than the base URI of the document or external entity. The base URI of a document or entity is the URI from which the document or entity was loaded. For example, the base URI of my RSS feed is http://www.25hoursaday.com/weblog/SyndicationService.asmx/GetRss. The following example taken from the W3C recommendation shows how xml:base processing works.

<?xml version="1.0"?>
<doc xml:base="http://example.org/today/"
     xmlns:xlink="http://www.w3.org/1999/xlink">
<head>
    <title>Virtual Library</title>
</head>
<body>
    <paragraph>See <link xlink:type="simple" xlink:href="new.xml">what's
      new</link>!</paragraph>
    <paragraph>Check out the hot picks of the day!</paragraph>
    <olist xml:base="/hotpicks/">
      <item>
        <link xlink:type="simple" xlink:href="pick1.xml">Hot Pick #1</link>
      </item>
      <item>
        <link xlink:type="simple" xlink:href="pick2.xml">Hot Pick #2</link>
      </item>
      <item>
        <link xlink:type="simple" xlink:href="pick3.xml">Hot Pick #3</link>
      </item>
    </olist>
</body>
</doc>

The URIs in the xlink:href attributes in this example resolve to full URIs as follows:

"what's new" resolves to the URI "http://example.org/today/new.xml"

"Hot Pick #1" resolves to the URI "http://example.org/hotpicks/pick1.xml"

"Hot Pick #2" resolves to the URI "http://example.org/hotpicks/pick2.xml"

"Hot Pick #3" resolves to the URI "http://example.org/hotpicks/pick3.xml"

xml:base exists as a mechanism to mimic HTML's BASE element and bring that functionality to the XML world. This was supposed to be a companion technology to XLink which was supposed to be a generic way to describe links in XML documents. Both XLink and xml:base were expected to be used in XHTML 2.0. However the XHTML working group rejected them and instead proposed HLink which was rejected by the W3C Technical Architecture Group. A lot of this is covered in the XML.com articles Introducing HLink and TAG Rejects HLink by Kendall Clark.

Even though xml:base has been rejected by the designers of the technologies it was primarily intended to be used with it has still made its way into the core of the XML family of technologies. Specifically, xml:base is used by the XML Infoset recommendation to define base URIs. This elevated xml:base and HTML-style base URI processing from being an application-specific construct to being a core part of XML that should be supported by XML parsers. For example, XQuery and XPath 2.0 will have the base-uri() function which returns the base URI of a node and takes into account the xml:base attribute.

The next question is whether the .NET Framework supports the xml:base recommendation. At first glance it looks this way since there is BaseURI property on both the XmlNode and XmlReader classes. However these properties report the BaseURI in the classic sense only (i.e. where the node was loaded from which is either the URI of the document or the URI of the entity it was expanded from). We were planning to add support for xml:base to the core XML parser as part of implementing XInclude but given that that it recently went from being a W3C candidate recommendation to going back to being a W3C working draft (partly due to a number of the architectural issues raised by Murata Makoto) the future of the spec is currently uncertain so we've backed off on our implementation. In the meantime, developers can use XInclude.NET if they need XML Inclusions and its associated support for the xml:base attribute in the .NET Framework.

Categories: XML

February 28, 2004

@ 11:07 AM

XmlReader and the Factory Design Pattern

Daniel Cazzulino writes in response to Don Demsak's post on Waking Up From A DOM Induced Coma

So, in this regard, I believe SUN is doing a good job at concentrating on pluggable and standard interfaces and specifications, and letting whoever wants to take the time to implement custom stuff.
I don't want to "new XmlTextReader". I want some app/system-wide factory take care of creating the appropriate parser implementation for me based on declarative configuration, and I want my to code to work against a single unified interface/base class always.
Changing the parser shouldn't mean I have to change my working app code. If MS provides the appropriate abstractions, it wouldn't even be necessary to rely on some implementation-specific feature such as XmlTextReader.GetRemainder that is not part of the abstract contract defined by XmlReader.

I both agree and disagree with Daniel. We do have a single unified interface for processing XML which developers can program against, it is called the XmlReader. Unfortunately, we subclassed this class into the XmlTextReader and XmlValidatingReader which are actually what most developers program against including our devs internally. In the next version of the .NET Framework we are moving away from the XmlTextReader and XmlValidating reader. Instead we will emphasize programming directly to the XmlReader and will provide an implementation of the factory design patterns which returns different XmlReader instances based on which features the user is interested. More importantly users will be able to layer different XmlReader implementations on those created by our factory which was always our intention since v1.0 of the .NET Framework. For example, one could layer XSD Validation on top the XIncludingReader from XInclude.NET to combine third party XInclude support with Microsoft's W3C XML Schema validation technologies.

As for whether the Sun's approach of just providing interfaces instead of concrete for XML parsing was such a great thing in Java I'd claim that it's been hit and miss. Most XML developers from the Java world despise the DOM for the reasons described in Chapter 33 of Elliotte Rusty Harold's Effective XML. This is the reason for the existence of extensions and alternatives to the DOM API which extend it such as Oracle's XDK, dom4J, JDOM, Xerces and XOM. Heck, you can't even get the XML as a string out of node or save an XML document object to a file without using extensions since these aren't in the base DOM API. As for SAX, the API just gives you access to regular parsing events nothing fancy. There isn't much difference functionally from programming against the base SAX APIs and programming against XmlReader.

The one point of interest is that Daniel claims that the Java way of not shipping with any XML APIs but just interfaces is somehow better than the .NET way. In Java one can programa against interfaces and loads the XML parser by passing the class name to a factory method. One could put this name in a config file and change it at runtime. The question is whether anyone in the .NET world actually thinks being able to change your XML parser implementation at runtime is anything more than a geek feature. I consider it as geeky as asking why you can't change the implementation of the System.String class to a user defined class that uses less memory at runtime without having to recompile. An interesting idea but one primarily of interest to the ultimate of power users.

The funny thing is that even if we shipped functionality where we looked in the registry or in some config file before figuring out what XML parser to load it's not as if there are an abundance of third party XML parsers targetting the .NET Framework in the first place. There is definitely no intention to ship any functionality like this in future versions of the .NET Framework.

Categories: XML

February 25, 2004

@ 12:16 PM

Still Waters Run Deep

In his post JDOM Hits Beta 10 Jason Hunter writes

According to my Palm Pilot calendar, we laid out the vision for JDOM on March 28th, 2000. I figure we'll ship before March 28, 2004. If we can ship 1.0 before it's been a full four years, I can just round down and call it three.

What took it so long? Several things. I discovered XML is "fractally complex". At the highest level it appears only slightly complicated, but as you dig deeper you discover increasing complexity, and the deeper you go the more complicated it continues to become. Trying to be faithful to the XML standards while staying easy to use and intuitive was a definite challenge.

This is one of challenges I face in my day job designing XML APIs for the .NET Framework. The allure of XML and its related technologies is that they appear simple and straightforward but once one digs a little it turns out that everything isn't quite as easy as it seemed at first.

One of the drawbacks of this appearance of simplicity is that everyone thinks they can write an XML parser which leads to occurences such as what is described in this post by Shawn Farkas Creating a SecurityElement from XML

The overhead of a full-fledged XML parser would be too much. Even if you accept the fact that we need a lightweight security XML object, we can't even provide utility methods on SecurityElement to convert back and forth System.Xml objects, since the CAS code lives in mscorlib.dll, and mscorlib cannot take a dependency on external DLL's. (Think of what would happen if mscorlib depended on System.Xml.dll, and System.Xml.dll depended on mscorlib ...). As if this weren't enough, there are at least 3 distinct XML parsers in v1.1 of the framework (System.Xml, SecurityElement, and a lightweight parser in mscoree.dll which handles parsing .config files ... this was actually optimized to be able to fit into no more than two pages of memory). Whidbey will be adding yet another parser to handle parsing ClickOnce manifests

One of the things I'm currently working on is coming up with guidelines that prevent occurences like System.Security.SecurityElement, a class that represents XML but does not interact well with the rest of the XML APIs in the .NET Framework, from happening again. This will be akin to Don Box's MSDN TV episode Passing XML Data in the CLR but will take the form of an Extreme XML article and a set of .NET Framework design guidelines.

Categories: XML

February 22, 2004

@ 06:32 PM

Newspaper Views for Reading RSS Feeds

Jon Udell writes in is entry Heads, decks, and leads: revisited

Yesterday, for example, Steve Gillmor told me that he's feeling overwhelmed by thousands of unread items in NetNewsWire. Yet I never feel that way. I suspect that's because I'm reading in batches of 100 (in the Radio UserLand feedreader). I scan each batch quickly. Although opinions differ as to whether or not a feed should be truncated, my stance (which I'm reversing today) has been that truncation is a useful way to achieve the effect you get when scanning the left column of the Wall Street Journal's front page. Of the 100 items, I'll typically only want to read several. I open them into new Mozilla tabs, then go back and read them. Everybody's different, but for me -- and given how newspapers work, I suspect for many others too -- it's useful to separate the acts of scanning and reading. When I'm done with the batch, I click once to delete all 100 items.

and in today's post entitled Different strokes he writes

I agree. In trying to illustrate a point about scanning versus reading, I'm afraid I fanned the flames of the newsreader-style versus browser-style debate. In fact, the two modes can be complementary. I just bought the full version of NetNewsWire, which exploits that synergy as Brent describes. So does FeedDemon, which this posting prompted me to re-explore.

This highlights a conflict between the traditional 3-pane aggregators that follow the mail or news reader model which implies that every post is important and should be read one by one and web-style aggregators like Radio Userland that present blogs in a unified web-based view reminiscent of an aggregated blog or newspaper. On the RSS Bandit wiki there's a wishlist item that reads

Newspaper view. A summery of unread feed items, formatted by a XSLT stylesheet and displayed as HTML/PDF. Inspired by Don Park. also here

which was originally added by Torsten. He never got around to adding this feature because he felt it wasn't that useful after all. I never implemented it because one would have to provide a way to interact with posts from this newspaper view (i.e. mark them as read or deleted, view comments, etc) which either translates to Javascript coding or running a local web server. Neither of the options was palatable.

This morning I downloaded FeedDemon to see how it got around these problems for its newspaper view. I found out that it does the obvious thing, it doesn't. From what I gather there is an option to 'mark all items in a channel as read' once you leave the channel. So once you close the newspaper view it assumes every post that showed up in it was read. A heavy-handed approach but it probably works for the most part.

Looks like something else to add to the RSS Bandit TODO list.

I've been thinking that something like this is necessary after reading Robert Scoble's post 1296 newsfeeds +are+ sustainable where he wrote

Here's my workflow:

At about 5 p.m. every day I tell NewsGator to get me my feeds. It is downloading them in the background as I speak.

Then I open each folder that's bold...

Then I only read the headlines. I'm getting very good at ignoring headlines with subjects like "isn't my cat cute?" See, that's another productivity point. Robin probably assumes I read all the crap that people post. I don't. I only read those things that MIGHT be interesting. If I find a headline that's interesting, then I scan the article it is associated with. I don't read it. Just scan at that point. Usually that means reading the first paragraph and scanning the rest for later.

I've found that reading headlines isn't always the best way to find good stuff and wouldn't mind a way to quickly scan all the articles in a category that goes beyond eyeballing a bunch of headlines. However I'm going to avoid Jon Udell's advice about XHTML-izing all the HTML content in feeds is the way to get you there. Been there, done that, not going back. The approach used by FeedDemon is a step in the right direction and doesn't require absorbing the problems that comes with trying to convert the ill-formed markup that typically shows up in feeds to XHTML.

Categories: RSS Bandit | XML

February 21, 2004

@ 09:15 PM

Microsoft XML Web Services Architect Joins the Atom Effort

In his post entitled Back in the Saddle Don Box writes

My main takeaway was that it's time to get on board with Atom - Sam is a master cat herder and I for one am ready to join the other kittens.

This is good news. Anyone who's read my blog probably can discern that I think the ATOM syndication format is a poorly conceived, waste of effort that unnecessarily fragments the website syndication world. On the other hand the ATOM API especially the bits about SOAP enabled clients are a welcome upgrade to the existing landscape of blog posting/editing APIs.

My experiences considerng how to implement the ATOM API in RSS Bandit have highlighted a one or two places where the API seems 'problematic' which actually point more to holes in the XML Web Services architecture than actual problems with the API. The two scenarios that come most readily to mind are

Currently if a user would wants to post a comment to their blog using client software they need to configure all sorts of technical settings such as which API to use, port numbers, end point URLs and a lot more. For example, look at what one has to post to a dasBlog weblog from w.bloggar Ideally, the end user should just be able to point their client at their blog URL (e.g. http://www.25hoursaday.com/weblog) and it figures out the rest.
The current ATOM specs describe a technique for discovering the web service end points a blog exposes which involves downloading the HTML page and parsing out all the <link> tags. I've disagreed with this approach in the past but the fact is that it does get the job done.

What this situation has pointed out to me is that there is no generic way to go up to a website and find out what XML Web Service end points it exposes. For example, if you wanted to tell all the publiclly available Web Services provided by Microsoft you'd have to read Aaron Skonnard's A Survey of Publicly Available Web Services at Microsoft instead of somehow discovering this programmatically. Maybe this is what UDDI was designed for?
Different blogs allow different syntax for posting comments. I've lost count of the amount of times I've posted a comment to a blog and wanted to provide a link but couldn't tell whether to just use a naked URL (http://www.example.com) or a hyperlink (<a href=“http://www.example.com“>example link</a>). Being that RSS Bandit has supported the CommentAPI for a while now I've constantly been frustrated by the inability to tell what kind of markup or markup subset the blog allows in comments. A couple of blogs provide formatting rules when one is posting a comment but there really is no programmatic way to discover this.

Another class of capabilities I'd like to discover dynamically is which features a blog supports. For instance, the ATOM API spec used to have a 'Search facet' which was removed because it seemed to many people thought it'd be onerous to implement. What I'd have preferred would have been for it to be optional then clients could dynamically discover whether the ATOM end point had search capabilities and if so how rich they were.

The limitation here is that there isn't a generic way to discover and enunciate the fine grained capabilities of an XML Web Service end point. At least not one I am familiar with.

It would be nice to see what someone like Don Box can bring to the table in showing how to architect and implement such a loosely coupled XML Web Service based system on the World Wide Web.

Categories: Technology | XML

February 20, 2004

@ 05:22 PM

The Impedence Mismatch between W3C XML Schema and the CLR

Daniel Cazzulino is writing about W3C XML Schema type system < - > CLR type system and has an informal poll at the bottom of his article where he writes

We all agree that many concepts in WXS don't map to anything existing in OO languages, such as derivation by restriction, content-ordering (i.e. sequence vs choice), etc. However, in the light of the tools the .NET Framework makes available to map XML to objects, we usually have to analyze WXS (used to define the structure of that very XML instance to be mapped) and its relation with our classes
In this light, I'm conducting a survey about developer's view on the relation of the XSD type system and the .NET one. Ignoring some of the more advanced (I could add cumbersome and confusing) features of WXS, would you say that both type systems fit nicely with each other?

I find the question at the end of his post which I highlighted to be highly tautological. His question is basically, “If you ignore the parts where they don't fit well together do the CLR and XSD type system fit well together?”. Well if you ignore the parts where they don't then the only answer is YES. In reality many developers don't have the freedom to ignore parts of XSD they don't want to support especially when utilizing XML Web Services designed by others.

There are two primary ways one can utilize the XmlSerializer which maps between XSD and CLR types

XML Serialization of Object State: In this case the developer is only interested in ensuring that the state of his classes can be converted to XML. This is a fairly simple problem because the expressiveness of the CLR is a subset of that of W3C XML Schema. Any object's state could be mapped to an element of complex type containing a sequence or choice of other nested elements that are either nested simple types or complex types.

Even then there are limitations in the XmlSerializer which make this cumbersome such as the fact that it only serializes public fields but not public properties. But that is just a design decision that can be revisited in future releases.
Conversion of XML to Objects: This is the scenario where a developer converts an XML schema to CLR objects to make them easier to program against. This is particularly common in XML Web Services scenarios which is why the XmlSerializer was originally designed. In this scenario the conversion tool has to contend with the breadth of features in the XML Schema: Structures and XML Schema: Datatypes recommendations.

There are enough discrepancies between the W3C XML Schema type system and that of the CLR to fill a Ph.D thesis. I touched on some of these in my article XML Serialization in the .NET Framework such as
Q: What aspects of W3C XML Schema are not supported by the XmlSerializer during conversion of schemas to classes?

A: The XmlSerializer does not support the following:
- Any of the simple type restriction facets besides enumeration.
- Namespace based wildcards.
- Identity constraints.
- Substitution groups.
- Blocked elements or types.
After gaining more experience with working with the XmlSerializer and talking to a number of customers I wrote som more about the impedance mismatches in my article XML Schema Design Patterns: Is Complex Type Derivation Unnecessary? specifically
For usage scenarios where a schema is used to create strongly typed XML, derivation by restriction is problematic. The ability to restrict optional elements and attributes does not exist in the relational model or in traditional concepts of type derivation from OOP languages. The example from the previous section where the email element is optional in the base type, but cannot appear in the derived type, is incompatible with the notion of derivation in an object oriented sense, while also being similarly hard to model using tables in a relational database.

Similarly changing the nillability of a type through derivation is not a capability that maps to relation or OOP models. On the other hand, the example that doesn't use derivation by restriction can more straightforwardly be modeled as classes in an OOP language or as relational tables. This is important given that it reduces the impedance mismatch which occurs when attempting to map the contents of an XML document into a relational database or convert an XML document into an instance of an OOP class

I'm not the only one at Microsoft who's written about this impedance mismatch or tried to solve it. Gavin Bierman, Wolfram Schulte and Erik Meijer wrote in their paper Programming with Circles, Triangles and Rectangles an entire section about this mismatch. Below are links to descriptions of a couple of the mismatches they found most interesting
The mismatch between XML and object data-models
     Edge-labelled vs. Node-labelled
     Attributes versus elements
     Elements versus complex and simple types
     Multiple occurrences of the same child element
     Anonymous types
     Substitution groups vs derivation and the closed world assumption
     Namespaces, namespaces as values
     Occurence constraints part of container instead of type
     Mixed content

There is a lot of discussion one could have about the impedance mismatch between the CLR type system and the XSD type system but one thing you can't say is that it doesn't exist or that it can be ignored if building schema-centric applications.

In conclusion, the brief summary is that if one is mapping objects to XML for the purpose of serializing their state then there is a good match between the CLR & XSD type systems since the XSD type system is more expressive than the CLR type system. On the other hand, if one is trying to go from XSD to the CLR type system there are significant impedance mismatches some of which are limitations of the current tools (e.g. XmlSerializer could code gen range checks for derivation by restriction of simple types or uniqueness tests for identity constraints ) while others are fundamental differences between the XSD type system and object oriented programming such as the difference between derivation by restriction in XSD and type derivation.

Categories: XML

February 18, 2004

@ 06:26 PM

Mr. Safe's Guide to the RSS vs. ATOM debate

Dave Winer recently wrote that at least one person has asked if it is safe to ignore Atom in his weblog. If you are a cautious person like Tim Bray's Mr. Safe or you fit more on the right than the left side of the Technology Adoption Life Cycle then you are probably wondering why you should want to support the Atom syndication format over one of the many flavors of RSS. There are two parts to this question, if you are a consumer of syndication feeds or if you are a consumer of syndication feeds.

The Safe Syndication Producer's Perspective
An RSS feed is a regularly updated XML document that contains metadata about a news source and the content in it. Minimally an RSS feed consists of a channel that represents the news source, which has a title, link, and description that describe the news source. Additionally, an RSS feed typically contains one or more item elements that represent individual news items, each of which should have a title, link, or description

There are two primary flavors of RSS; Dave Winer's family of specifications (the most popular being RSS 0.91 & RSS 2.0) and the RDF-based RSS 1.0. The most popular are Dave Winer's family of specifications which have been adopted by a number of well-known organizations such as Yahoo! News, the BBC, Rolling Stone magazine, the Microsoft Developer Network (MSDN) , the Oracle Technology Network (OTN), the Sun Developer Network and Apple's iTunes Music Store. According to Syndic8 which tracks over 50,000 RSS feeds RSS 0.91, RSS 1.0 & RSS 2.0 all have about 30% of the RSS marketshare.

Most news aggregators support all 3 major versions of RSS although few actually take advantage of the fact that RSS 1.0 is an RDF vocabulary. If all one want is simple syndication of news items the RSS 0.91 should be satisfactory. If one plans to use extensions to the core RSS specification that expose application or domain specific functionality such as the ability to post comments one can use one of the many RSS modules in combination with RSS 2.0. The only advantage that RSS 1.0 gives over RSS 0.91/RSS 2.0 is that it is an RDF vocabulary and thus fits nicely into the dream of the Semantic Web.

The Atom syndication format can be considered to be a more sophisticated implementation of the ideas in RSS 2.0. It adds richer syndication capabilities such as the ability to put binary formats such as Word documents and Powerpoint documents in feeds and formalizes some of the best practices in the RSS world around putting [X]HTML in feeds.

The average user of a news aggregator will not be able to tell the difference between an Atom or RSS feed from their aggregator if it supports both. However users of aggregators that don't support Atom will not be able to subscribe to feeds in that format. In a few years, the differences between RSS and Atom will most likely be the same as those that are different between RSS 1.0 and RSS 0.91/RSS 2.0; only of interest to a handful of XML syndication geeks. Even then the simplest and safest bet would still be to use RSS as a syndication format. This is the same as the fact that even though the W3C has published XHTML 1.0 & XHTML 1.1 and is working on XHTML 2.0, the safest bet to get the widest reach with the least problems is to publish a website in HTML 3.2 or HTML 4.01.

The Safe Syndication Consumer's Perspective
If you plan to consume feeds from a wide variety of sources then one should endeavor to support as many syndication formats as possible. The more formats a feed consumer supports the more content is available for its users.

Based on their current popularity, degree of support and ease of implementation one should consider supporting the major syndication formats in the following order of priority

RSS 0.91/RSS 2.0
RSS 1.0
Atom

RSS 0.91 support is the simplest to implement and most widely supported by websites while Atom is the most difficult to implement being the most complex and will be least supported by websites in the coming years.

Categories: XML

February 15, 2004

@ 07:00 PM

Combining XPath-based Filtering with Pull-based XML Parsing

Daniel Cazzulino has been writing about his work with XML Streaming Events which combines the ability to do XPath queries with the .NET Framework's forward-only, pull based XML parser. He shows the following code sample

// Setup the namespaces XmlNamespaceManager mgr = new XmlNamespaceManager(temp.NameTable); mgr.AddNamespace("r", RssBanditNamespace); // Precompile the strategy used to match the expression IMatchStrategy st = new RootedPathFactory().Create( "/r:feeds/r:feed/r:stories-recently-viewed/r:story", mgr); int count = 0; // Create the reader. XseReader xr = new XseReader( new XmlTextReader( inputStream ) ); // Add our handler, using the strategy compiled above. xr.AddHandler(st, delegate { count++; }); while (xr.Read()) { } Console.WriteLine("Stories viewed: {0}", count);

I have a couple of questions about his implementation the main one being how it deals with XPath queries such as /r:feeds/r:feed[count(r:stories-recently-viewed)>10]/r:title which can't be done in a forward only manner?

Oleg Tkachenko also pipes in with some opinions about streaming XPath in his post Warriors of the Streaming XPath Order. He writes

I've been playing with such beasts, making all kinds of mistakes and finally I came up with a solution, which I think is good, but I didn't publish it yet. Why? Because I'm tired to publish spoilers :) It's based on "ForwardOnlyXPathNavigator" aka XPathNavigator over XmlReader, Dare is going to write about in MSDN XML Dev Center and I wait till that's published.

May be I'm mistaken, but anyway here is the idea - "ForwardOnlyXPathNavigator" is XPathNavigator implementation over XmlReader, which obviously supports forward-only XPath subset...

And after I played enough with and implemented that stuff I discovered BizTalk 2004 Beta classes contain much better implementation of the same functionality in such gems as XPathReader, XmlTranslatorStream, XmlValidatingStream and XPathMutatorStream. They're amazing classes that enable streaming XML processing in much rich way than trivial XmlReader stack does. I only wonder why they are not in System.Xml v2 ? Is there are any reasons why they are still hidden deeply inside BizTalk 2004 ? Probably I have to evangelize them a bit as I really like this idea.

Actually Oleg is closer and yet farther from the truth than he realizes. Although I wrote about a hypothetical ForwardOnlyXPathNavigator in my article entitled Can One Size Fit All? for XML Journal my planned article which should show up when the MSDN XML Developer Center launches in a month or so won't be using it. Instead it will be based on an XPathReader that is very similar to the one used in BizTalk 2004, in fact it was written by the same guy. The XPathReader works similarly to Daniel Cazzulino's XseReader but uses the XPath subset described in Arpan Desai's Introduction to Sequential XPath paper instead of adding proprietary extensions to XPath as Daniel's does.

When the article describing the XPathReader is done it will provide source and if there is interest I'll create a GotDotNet Workspace for the project although it is unlikely I nor the dev who originally wrote the code will have time to maintain it.

Categories: XML

February 15, 2004

@ 05:50 PM

On Semantic Integration and XML

A few months ago I attended XML 2003 where I first learned about Semantic Integration which is the buzzword term for mapping data from one schema to another with a heavy focus on using Semantic Web technologies such as ontologies and the like. The problem that these technologies solve is enabling one to map XML data from external sources to a form that is compatible with how an application or business entity manipulates them internally.

For example, in RSS Bandit we treat feeds in memory and on disk as if they are in the RSS 2.0 format even though it supports other flavors of RSS as well such as RSS 1.0. Proponents of semantic integration technologies would suggest using a technology such as the W3C's OWL Web Ontology Language. If you are unfamiliar with ontolgies and how they apply to XML a good place to understand what they are useful for is taking a look at the OWL Web Ontology Language Use Cases and Requirements. The following quote from the OWL Use Cases document gives a glimpse into what the goal of ontology languages

In order to allow more intelligent syndication, web portals can define an ontology for the community. This ontology can provide a terminology for describing content and axioms that define terms using other terms from the ontology. For example, an ontology might include terminology such as "journal paper," "publication," "person," and "author." This ontology could include definitions that state things such as "all journal papers are publications" or "the authors of all publications are people." When combined with facts, these definitions allow other facts that are necessarily true to be inferred. These inferences can, in turn, allow users to obtain search results from the portal that are impossible to obtain from conventional retrieval systems

Although the above example talks about search engines it is clear that one can also use this for data integration. In the example of RSS Bandit, one could create an ontology that maps the terms in RSS 1.0 to those in RSS 2.0 and make statements such as

RSS 1.0's <title> element sameAs RSS 2.0's <title> element

Basically, one could imagine schemas for RSS 1.0 and RSS 2.0 represented as two trees and an ontology a way of drawing connections between the leaves and branches of the trees. In a previous post entitled More on RDF, The Semantic Web and Perpetual Motion Machines I questioned how useful this actually would be in the real world by pointing out the dc:date vs. pubDate problem in RSS. I wrote

However there are further drawbacks to using the semantics based approach than using the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elemebts (e.g. the equivalent of element renaming such as stating that a url and link element are equivalent) an ontology language is insufficient and a Turing complete transformation language like XSLT is not. A good example of this is another example from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted, the first is by using the pubDate element which is described as containing a string in the RFC 822 format while the other is using the dc:date element which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation has been applied if one wants an application to treat pubDate and dc:date as equivalent. This means that instead of making one pass with an XSLT stylesheet to perform the transformation in the XML-based solution, two transformation techniques will be needed in the RDF-based solution and it is quite likely that one of them would be XSLT.

Teh above is a simple example, one could imagine more complex examples where the vocabularies to be mapped differ much more syntactically such as

<author>Dare Obasanjo (dareo@example.com)</author> <author> <fname>Dare</fname> <lname>Obasanjo</lname> <email>dareo@example.com</email> </author>

The aformentioned examples point out technical issues with using ontology based techniques for mapping between XML vocabularies but I failed to point out the human problems that tend to show up in the real world. A few months ago I was talking to Chris Lovett about semantic integration and he pointed out that in many cases as applications evolve semantics begin to be assigned to values in often orthogonal ways.

An example of semantics being addd to values again shows up in an example that uses RSS Bandit. A feature of RSS Bandit is that feeds are cached on disk allowing a user to read items that have long since disappeared from the feed. At first we provided the ability for the user to specify how long items should be kept in the cached feed ranging from a day up to a year. We used an element named maxItemAge embedded in the cached feed which contained a serialized instance of the System.Timespan structure. After a while we realized we needed ways to say that for a particular feed use the global default maxItemAge, never cache items for this feed or never expire items for this feed so we used the TimeSpan.MinValue, TimeSpan.Zero, or TimeSpan.MaxValue values of the TimeSpan class respectively.

If another application wanted to consume this data and had a similar notion of 'how long to keep the items in a feed' it couldn't simply map maxItemAge to whatever internal property it used without taking into account the extra semantics embedded in when certain values occur in that element. Overloading the meaning of properties and fields in a database or class is actually fairly commonplace [after all how many different APIs use the occurence of -1 for a value that should typically return a positive number as an error condition?] and something that must also be considered when applying semantic integration technologies to XML.

In conclusion, it is clear that Semantic Web can be used to map between XML vocabularies however in non-trivial situations the extra work that must be layered on top of such approaches tends to favor using XML-centric techniques such as XSLT to map between the vocabularies instead.

Categories: XML

February 13, 2004

@ 03:30 PM

Sex, Lies and XML MIME Types

Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does good job of sumarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say it definitely does not reflect reality. Specifically

According to RFC 3023..., if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

the encoding given in the charset parameter of the Content-Type HTTP header, or
us-ascii.

So for this to work correctly it means that if the MIME type of an XML document is text/xml then the web server should look inside it before sending it over the wire and send the correct encoding or else the document will be interpreted incorrectly since it is highly likely that us-ascii is not the encoding of the XML document. In practice, most web servers do not do this. I have confirmed this by testing against both IIS and Apache.

Instead what happens is that an XML document is created by the user and dropped on the file system and the web server assumes it is text/xml which it most likely is and sends it as is without setting the charset in the content type header.

A simple way to test this is to go to Rex Swain's HTTP Viewer and download the following documents from the W3 Schools page on XML encodings

All files are sent with a content type of text/xml and no encoding specified in the charset parameter of the Content-Type HTTP header. According to RFC 3023 which Mark Pilgrim quoted in his article that clients should treat them as us-ascii. With the above examples this behavior would be wrong in all four cases.

The moral of this story is if you are writing an application that consumes XML using HTTP you should use the following rule of thumb for the forseeable future [slightly modified from Mark Pilgrim's post]

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml then the encoding is

the encoding given in the charset parameter of the Content-Type HTTP header, or
the encoding given in the encoding attribute of the XML declaration within the document, or
utf-8.

Some may argue that this discussion isn't relevant for news aggregators because they'll only consume XML documents whose MIME type application/atom+xml or application/rss+xml but again this ignores practice. In practice most web servers send back RSS feeds as text/xml, if you don't believe me test ten RSS feeds chosen at random using Rex Swain's HTTP Viewer and see what MIME type the server claims they are.

Categories: XML

February 9, 2004

@ 03:12 PM

Dealing with Difficulties Using Namespaces in XML

In his blog post entitled Namepaces in Xml - the battle to explain Steven Livingstone wrote

It seems that Namespaces is quickly displacing Xml Schema as the thing people "like to hate" - well at least those that are contacing me now seem to accept Schema as "good".

Now, the concept of namespaces is pretty simple, but because it happens to be used explicitly (and is a more manual process) in Xml people just don't seem to get it. There were two core worries put to me - one calling it "a mess" and the other "a failing". The whole thing centered around having to know what namespaces you were actually using (or were in scope) when selecing given nodes. So in the case of SelectNodes(), you need to have a namespace manager populated with the namespaces you intend to use. In the case of Schema, you generally need to know the targetNamespace of the Schema when working with the XmlValidatingReader. What the guys I spoke with seemed to dislike is that you actually have to know what these namespaces are. Why bother? Don't use namespaces and just do your selects or validation.

Given that I am to some degree responsible for both classes mentioned in the above post, XmlNode (where SelectNodes()comes from) and XmlValidatingReader, I feel compelled to respond.

The SelectNodes() problem is that people would like to perform XPath expressions over nodes and have it not worry about namespaces. For example given XML such as

<root xmlns=”http://www.example.com”>
  <child />
</root>

to perform a SelectNodes() or SelectSingleNode() that returns the <child> element requires the following code

XmlDocument doc = new XmlDocument(); doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable); nsmgr.AddNamespace("foo", "http://www.example.com"); //this is the tricky bit Console.WriteLine(doc.SelectSingleNode("/foo:root/foo:child", nsmgr).OuterXml);

whereas developers don't see why the code isn't something more along the lines of

XmlDocument doc = new XmlDocument(); doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); Console.WriteLine(doc.SelectSingleNode("/root/child").OuterXml);

which would be the case if there were no namespaces in the document.

The reason the latter code sample is not the case is because the select methods on the XmlDocument class are conformant to the W3C XPath 1.0 recommendation which is namespace aware. In XPath, path expressions that match nodes based on their names are called node tests. A node test is a qualified name or QName for short. A QName is syntactically an optional prefix and local name separated by a colon. The prefix is supposed to be mapped to a namespace and is not to be used literally in matching the expression. Specifically the spec states

A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.

There are a number of reasons why this is the case which are best illustrated with an example. Consider the following two XML documents

<root xmlns=“urn:made-up-example“> 
  <child xmlns=”http://www.example.com”/>
</root>

<root>
  <child />
</root>

Should the query /root/child also match the <child> element for the above two documents as it does for the original document in this example? The 3 documents shown [including the first example] are completely different documents and there is no consistent, standards compliant way to match against them using QNames in path expressions without explicitly pairing prefixes with namespaces.

The only way to give people what they want in this case would be to come up with a proprietary version of XPath which was namespace agnostic. We do not plan to do this. However I do have a tip for developers showing how to reduce the amount of code it does take to write the examples. The following code does match the <child> element in all three documents and is fully conformant with the XPath 1.0 recommendation

XmlDocument doc = new XmlDocument(); doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); Console.WriteLine(doc.SelectSingleNode("/*[local-name()='root']/*[local-name()='child']").OuterXml);

Now on to the XmlValidatingReader issue. Assume we are given the following XML instance and schema

<root xmlns="http://www.example.com"> <child /> </root> <xs:schema targetNamespace="http://www.example.com" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"> <xs:element name="root"> <xs:complexType> <xs:sequence> <xs:element name="child" type="xs:string" /> </xs:sequence> </xs:complexType> </xs:element> </xs:schema>

The instance document can be validated against the schema using the following code

XmlTextReader tr = new XmlTextReader("example.xml"); XmlValidatingReader vr = new XmlValidatingReader(tr); vr.Schemas.Add(null, "example.xsd"); vr.ValidationType = ValidationType.Schema; vr.ValidationEventHandler += new ValidationEventHandler (ValidationHandler); while(vr.Read()){ /* do stuff or do nothing */

As you can see you do not need to know the target namespace of the schema to perform schema validation using the XmlValidatingReader. However many code samples in our SDK to specify the target namespace where I specified null above when adding schemas to the Schemas property of the XmlValidatingReader. When null is specified it indicates that the target namespace should be obtained from the schema. This would have been clearer if we'd had an overload for the Add() method which took only the schema but we didn't. Hindsight is 20/20.

Categories: XML

February 8, 2004

@ 10:15 PM

My Review of ATOM.NET

I noticed Gordon Weakliem reviewed ATOM.NET, an API for parsing and generating ATOM feeds. I went to the ATOM.NET website and decided to take a look at the ATOM.NET documentation. The following comments come from two perspectives, the first is as a developer who'll most likely have to implement something akin to ATOM.NET for RSS Bandit's internal workings and the other is from the perspective of being one of the folks at Microsoft whose job it is to design and critique XML-based APIs.

The AtomWriter class is superflous. The class that only has one method Write(AtomFeed) which makes more sense being on the AtomFeed class since an object should know how to persist itself. This is the model we followed with the XmlDocument class in the .NET Framework which has an overloaded Save() method. The AtomWriter class would be quite useful if it allowed you to perform schema driven generation of an AtomFeed, the same way the XmlWriter class in the .NET Framework is aimed at providing a convenient way to programmatically generate well-formed XML [although it comes close but doesn't fully do this in v1.0 & v1.1 of the .NET Framework]
I have the same feelings about the AtomReader class. This class also seems superflous. The functionality it provides is akin to the overloaded Load() method we have on the XmlDocument class in the .NET Framework. I'd say it makes more sense and is more usable if this functionality was provided as a Load() method on an AtomFeed class than as a separate class unless the AtomReader class actually gets some more functionality.
There's no easy way to serialize an AtomEntry class as XML which means it'll be cumbersome using this ATOM.NET for the ATOM API since it requires sending elements as XML over the wire. I use this functionality all the time in RSS Bandit internally from passing entries as XML for XSLT themes to the CommentAPI to IBlogExtension.
There is no consideration for how to expose extension elements and attributes in ATOM.NET. As far as I'm concerned this is a deal breaker that makes the ATOM.NET useless for aggregator authors since it means they can't handle extensions in ATOM feeds even though they may exist and have already started popping up in various feeds.

Categories: XML

February 8, 2004

@ 09:21 PM

This Month's Extreme XML Column: Passing XML Data Inside the CLR

A few weeks ago during the follow up to the WinFX review of the System.Xml namespace of the .NET Framework it was pointed out that our team hadn't provided guidelines for exposing and manipulating XML data in applications. At first, I thought the person who brought this up was mistaken but after a cursory search I realized the closest thing that comes to such a set of guidelines is Don Box's MSDN TV episode entitled Passing XML Data Inside the CLR. As good as Don's discussion is, a video stream isn't as accessible as a written article. In tandem with coming up with some of the guidelines for utilizing XML in the .NET Framework for internal purposes I'll put together an article based on Don's MSDN TV episode with an eye towards the next version of the .NET Framework.

If you watched Don's talk and had any questions about it or require any clarifications respond below so I can clarify them in the article I plan to write.

Categories: XML

February 6, 2004

@ 05:00 PM

Comments [13]

XML 1.1: The W3C Gets It Wrong

A few days ago XML 1.1 became an official W3C recommendation. Mark Pilgrim, contrary to W3C guidelines, has celebrated by converting his RSS feed to XML 1.1 which means it currently cannot be processed by any Microsoft XML technologies from the XML parsers in the .NET Framework to MSXML which is used in a host of products from Internet Explorer to Office 2003.

This is the first step in fragmenting the interoperability on the Web gained by XML. It seems the next step will be W3C sanctioned binary XML. Anyway let's get back to XML 1.1. What exactly is wrong with it one might ask? The biggest thing wrong with it is that it is backwards incompatible with XML 1.0. A good summary of all the things you need to know about XML 1.1 is covered in Chapter 3 of Elliote Rusty Harrold's Effective XML

Everything you need to know about XML 1.1 can be summed up in two rules:

Don't use it.

(For experts only) If you speak Mongolian, Yi, Cambodian, Amharic, Dhivehi, Burmese or a very few other languages and you want to write your markup (not your text but your markup) in these languages, then you can set the version attribute of the XML declaration to 1.1. Otherwise, refer to rule 1.

XML 1.1 does several things, one of them marginally useful to a few developers, the rest actively harmful.

It expands the set of characters allowed as name characters

The C0 control characters (except for NUL) such as form feed, vertical tab, BEL, and DC1 through DC4 are now allowed in XML text provided they are escaped as character references.

C1 control characters (except for NEL) must now be escaped as character references

NEL can be used in XML documents, but is resolved to a line feed on parsing.

Parsers may (but do not have to) tell client applications that Unicode data was not normalized

Namespace prefixes can be undeclared

XML is a lousy format for most of the things it is used for. The one benefit it has is that it is widely supported and a guaranteed way to interoperate in a cross-platform manner. By tampering with this the W3C is effectively diluting one of the few benefits of using XML. This is an regrettable occurence. Unfortunately it looks like things will get worse now that the W3C also wants to dabble in “binary XML”.

Categories: XML

February 3, 2004

@ 04:56 PM

Rule Based XML Validation, XML Web Services and Service Oriented Architectures

In his post entitled Business Rules, OCL, XML and Schemas Daniel Cazzulino writes

DonXML is proposing extensions to OCL to express business rules that can be used at code-gen time and at run-time. He mentions my Schematron implementation called Schematron.NET, which allows many business rules to be expressed simply in terms of standard XPath expressions. I believe such an XPath-based language is good enough to express almost every business rule.

Udi Dahan commented as an example, a rule "only a bank manager can authorize a loan above X" which he said couldn't be expressed with Don's idea. It could, indeed, with something along these lines (XPath-like):
<assert test="sec:principal-role('BankManager') and po:Loan/@Amount < 1000"> Only a BankManager can place a loan of more than $1000. </assert>

Using rules-based XML validation is a good way to augment the capabilities of the W3C XML Schema language which is traditionally used to describe message structures in SOAP-based XML Web Services. In the post on Daniel's blog Udi Dahan asks

I like the technique. I'm still puzzling over the strategy. From a SOA approach, where does this go ? What makes it different/better than any other rules engine ? You've given me something to think about. Thank you.

In an SOA approach the rules are part of the message contract. A service endpoint can accept certain kinds of messages that satisfy its message contract. Using a rule-based language like Schematron just makes for writing a tighter contract than one could write using a traditional XML schema language like XSD.

In fact, Aaron Skonnard wrote an article on MSDN entitled Extend the ASP.NET WebMethod Framework by Adding XML Schema Validation that introduced this to some degree which he followed up with two episodes of MSDN TV; Validating Business Rules with XPath Assertions, Pt. 1 and Validating Business Rules with XPath Assertions, Pt. 2

Categories: XML

January 28, 2004

@ 01:38 AM

Validation and XML APIs

Fumiaki Yoshimitu writes

DOM L3 Validation gets recommended.

We know that XPathDocument2 will have validation (and augmentation) feature, but how about XmlDocument? Will it support any of the DOM L3 feature? This is another question related to the rumor that XmlDocument is dead. Dare?

Actually we've decided to rethink having a Validate() method on the class that is currently called XPathDocument2 because it may lead users down the wrong path. Our worry is that users will end up loading an XML document and then call Validate() on it thus incurring the cost of two passes over the document as opposed to the more efficient approach of loading the document with a validating XmlReader. For this reason we've removed the Validate() method from the class.

Also there is no plan to have XmlDocument support any DOM L3 feature. Moving forward, the primary representation of in-memory XML documents on the .NET Framework will be the class currently called XPathDocument2 and that is where the Microsoft WebData XML team's efforts will be spent.

Categories: Life in the B0rg Cube | XML

January 27, 2004

@ 10:38 PM

The Newly Awarded Microsoft XML MVPs

Below is the list of the developers who got the Microsoft Most Valuable Professional (MVP) Award for the 2004-2005 calendar year in the XML category

These developers have all been outstanding members of Microsoft's peer-to-peer communities.

Categories: Life in the B0rg Cube | XML

January 23, 2004

@ 07:20 AM

10 [Bogus] Reasons Why RSS is not Ready for Prime Time

Being a hobbyist developer interested in syndication technologies I'm always on the look out for articles that provide useful critiques of the current state of the art. I recently stumbled on an article entitled 10 reasons why RSS is not ready for prime time by Dylan Greene which fails to hit the mark in this regard. Of the ten complaints, about three seem like real criticisms grounded in fact while the others seem like contrived issues which ignore reality. Below is the litany of issues the author brings up and my comments on each

1) RSS feeds do not have a history. This means that when you request the data from an RSS feed, you always get the newest 10 or 20 entries. If you go on vacation for a week and your computer is not constantly requesting your RSS feeds, when you get back you will only download the newest 10 or 20 entries. This means that even if more entires were added than that while you were gone, you will never see them.

Practically every information medium has this problem. When I travel out of town for a week or more I miss the newspaper, my favorite TV news shows and talk radio shows which once gone I'll likely never get to enjoy. With blogs it is different since most blogs provide an online archive but on the other hand most news sites archive their content and require paid access after they're no longer current.

In general, most individual blogs aren't updated regularly enough that being gone for a week or two means that entries are missed. On the other hand most news sites are. In such cases one could leave their aggregator of choice connected and reduce its refresh rate (something like once a day) and let your fingers do the walking. That's exactly what one would have to do with a TiVo (i.e. leave your cable box on).

2) RSS wastes bandwidth. When you "subscribe" to an RSS feed, you are telling your RSS reader to automatically download the RSS file on a set interval to check for changes. Lets say it checks for news every hour, which is typical. Even if just one item is changed the RSS reader must still download the entire file with all of the entries.

The existing Web architecture provides a couple of ways for polling based applications to save bandwidth including HTTP conditional GET and gzip compression over HTTP. Very few web sites actually support both well-known bandwidth saving techniques including the Dylan Green based on a quick check with Rex Swain's HTTP Viewer. Using both techniques can save bandwidth costs by an order of magnitude (by a factor of 10 for the mathematically challenged). Before coming up with sophisticated hacks for perceived problems it'd be nice if website administrators actually used existing best practices before trying to reinvent the wheel in more complex ways.

That said, it would be a nice additional optimization for web sites to only provide only the items that hadn't been read by a particular client for each request for the RSS feed. However I'd like to see us learn to crawl before we try to walk.

3) Reading RSS requires too much work. Today, in 2004, we call it "browsing the Web" - not "viewing HTML files". That is because the format that Web pages happen to be in is not important. I can just type in "msn.com" and it works. RSS requires much more than that: We need to find the RSS feed location, which is always labeled differently, and then give that URL to my RSS reader.

Yup, there isn't a standard way to find the feed for a website. RSS Bandit tries to make this easier by feed lookup via Syndic8 and supporting one click subscription to RSS feeds. However aggregator authors can't do this alone, the blogging tools and major websites that use RSS need to get in on the act as well.

4) An RSS Reader must come with Windows. Until this happens too, RSS reading will only be for a certain class of computer users that are willing to try this new technology. The web became mainstream when Microsoft started including Internet Explorer with Windows. MP3's became mainstream when Windows Media Player added MP3 support.

I guess my memory is flawed but I always thought Netscape Navigator and Winamp/Napster where the applications that brought the Web and MP3s to the mainstream respectively. I'm always amused by folks that think that unless Microsoft supports some particular technology then it is going to fail. It'd be nice if an RSS aggregator but that doesn't mean that a technology cannot become popular until it ships in Windows. Being a big company, Microsoft is generally slow to react to trends until they've proven themselves in the market which means that if an aggregator ever ships in Windows it will happen when news aggregators are mainstream not before.

5) RSS content is not User-Friendly. It has taken about 10 years for the Web to get to the point where it is today that most web pages we visit render in our browser the way that the designer intended. It's also taken about that long for web designers to figure out how to lay out a web page such that most users will understand how to use it. RSS takes all of that usability work and throws it away. Most RSS feeds have no formatting, no images, no tables, no interactive elements, and nothing else that we have come to rely on for optimal content readability. Instead we are kicked back to the pre-web days of simple text.

I find it hard to connect tables, interactive elements and images with “optimal content readability” but maybe that's just me. Either way, there's nothing stoping folks from using HTML markup in RSS feeds. Most of the major aggregators are either browser based or embed a web browser so viewing HTML content is not a problem. Quite frankly, I like the fact that I don't have to deal with cluttered websites when reading content in my aggregator of choice.

6) RSS content is not machine-friendly. There are search engines that search RSS feeds but none of them are intelligent about the content they are searching because RSS doesn't describe the properties of the content well enough. For example, many bloggers quote other blogs in their blog. Search engines cannot tell the difference between new content and quoted content, so they'll show both in the search results.

I'm curious as to which search engine he's used which doesn't have this problem. Is there an “ignore items that are parts of a quote” option on Google or MSN Search? As search engines go I've found Feedster to be quite good and better than Google for a certain class of searches. It would be cool to be able to execute ad-hoc, structured queries against RSS feeds but this would be icing on the cake and in fact is much more likely to happen in the next few years than is possible that we will ever be able to perform such queries against [X]HTML web sites.

7) Many RSS Feeds show only an abridged version of the content. Many RSS feeds do not include the full text. Slashdot.org, one of the most popular geek news sites, has an RSS feed but they only put the first 30 words of each 100+ word entry in their feed. This means that RSS search engines do not see the full content. This also means that users who syndicate their feed only see the first few words and must click to open a web browser to read the full content.

This is annoying but understandable. Such sites are primarily using an RSS feed as a way to lure you to the site not as a way to provide users with content. I don't see this as a problem with RSS any more than the fact that some news sites need you to register or pay to access their content is a problem with HTML and the Web.

8) Comments are not integrated with RSS feeds. One of the best features of many blogs is the ability to reply to posts by posting comments. Many sites are noteworthy and popular because of their comments and not just the content of the blogs.

Looks like he is reading the wrong blogs and using the wrong aggregators. There are a number of ways to expose comments in RSS feeds and a number of aggregators support them including RSS Bandit which supports them all.

9) Multiple Versions of RSS cause more confusion. There's several different versions of RSS, such as RSS 0.9, RSS 1.0, RSS 2.0, and RSS 3.0, all controlled by different groups and all claiming to be the standard. RSS Readers must support all of these versions because many sites only support one of them. New features can be added to RSS 1.0 and 2.0 can by adding new XML namespaces, which means that anybody can add new features to RSS, but this does mean that any RSS Readers will support those new features.

I assume he has RSS 3.0 in there as a joke. Anyway, the existence of multiple versions of RSS is not that much more confusing to end users than the existence of multiple versions of [X]HTML, HTTP, Flash and Javascript some of which aren't all supported by every web browser.

That said a general plugin mechanism to deal with items from new namespaces would be an interesting problem to try and solve but sounds way too hard to successfully provide a general solution for.

10) RSS is Insecure. Lets say a site wants to charge for access to their RSS feed. RSS has no standard way for inputing a User Name and Password. Some RSS readers support HTTP Basic Authentication, but this is not a secure method because your password is sent as plain text. A few RSS readers support HTTPS, which is a start, but it is not good enough. Once somebody has access to the "secure" RSS file, that user can share the RSS file with anybody.

Two points. (A) RSS is a Web technology so the standard mechanisms for providing restricted yet secure access to content on the Web apply to RSS and (B) there is no way known to man short of magic to provide someone with digital content on a machine that they control and restrict them from copying it in some way, shape or form.

Aight, that's all folks. I'm off to watch TiVoed episodes of the Dave Chappelle show.

Categories: XML

January 21, 2004

@ 08:56 PM

Aaron Skonnard's Blog

Thanks to Technorati Beta, I found Aaron Skonnard's blog. Aaron is the author of the XML Files column in the MSDN Magazine and an all around XML geek.

Categories: XML

January 21, 2004

@ 08:38 AM

Draconians vs. Tolerants: Everyone Lost Yet Everyone Won.

Mark Pilgrim has a post entitled The history of draconian error handling in XML where he excerpts a couple of the discussions on the draconian error handling rules of XML which state that if an XML processor encounters a syntax error in an XML document it should stop parsing and indicate a fatal error as opposed to muddling along or trying to fixup the error in some way. According to Tim Bray

What happened was, we had a really big, really long, really passionate argument on the subject; the camps came to be called “Draconians” and “Tolerants.” After this had gone on for some weeks and some hundreds of emails, we took a vote and the Draconians won 7-4.

Reading some of the posts from 6 years ago on Mark Pilgrim's blog it is interesting to note that most of the arguments on the sides of the Tolerants are simply no longer relevant today while the Draconians turned out to be the reason for XML's current widespread success in the software marketplace.

The original goal of XML was to create a replacement for HTML which allowed you to create your own tags yet have them work in some fashion on the Web (i.e SGML on the Web). Time has shown that placing XML documents directly on the Web for human consumption just isn't that interesting to the general Web development comunity. Most content on the Web for human consumption is still HTML tag soup. Even when Web content claims to be XHTML it often is really HTML tag soup either because it isn't well-formed or is invalid according to the XHTML DTD. Even applications that represent data internally as XML tend to use XSLT to transform the content to HTML as opposed to putting the XML directly on the Web and styling it with CSS. As I've mentioned before the dream of the original XML working group of replacing HTML by inventing “SGML on the Web” is a failed dream. Looking back in hindsight it doesn't seem that the choice of tolerant over draconian error handling would have made a difference to the lack of adoption of XML as a format for representing content targetted for human consumption on the Web today.

On the other hand, XML has flourished as a general data interchange format for machine-to-machine interactions in wide ranging areas from distributed computing and database applications to being a format for describing configuration files and business documents. There are a number of reasons for XML's rise to popularity

The ease with which XML technologies and APIs enabled developers to process documents and data in an easier and more flexible manner than with previous formats and technologies.
The ubiquity of XML implementations and the consistency of the behavior of implementations across platforms.
The fact that XML documents were fairly human-readable and seemed familiar to Web developers since it was HTML-like markup.

Considering the above points, does it seem likely that XML would be as popular outside of its original [failed] design goal of being a replacement for HTML if the specification allowed parsers to pick and choose which parts of the spec to honor with regards to error recovery? Would XML Web Services be as useful for interoperability between platforms if different parser implementations could recover from syntax errors at will in a non-deterministic manner? Looking at some of the comments linked from Mark Pilgrim's blog it does seem to me that a lot of the arguments on the side of the Tolerants came from the perspective of “XML as an HTML replacement” and don't stand up under scrutiny in today's world.

April 19, 1997. Sean McGrath: Re: Error Handling in XML

Programming languages that barf on a syntax error do so because a partial executable image is a useless thing. A partial document is *not* a useless thing. One of the cool things about XML as a document format is that some of the content can be recovered even in the face of error. Compare this to our binary document friends where a blown byte can render the entire content inaccessible.

Given that today XML is used for building documents that are effectively programs such as XSLT, XAML and SVG it does seem like the same rules that apply for partial programs should apply as well.

May 7, 1997. Paul Prescod: Re: Final words, I think, on error handling

Browsers do not just need a well-formed XML document. They need a well-formed XML document with a stylesheet in a known location that is syntactically correct and *semantically correct* (actually applies reasonable styles to the elements so that the document can be read). They need valid hyperlinks to valid targets and pretty soon they may need some kind of valid SGML catalog. There is still so much room for a document author to screw up that well-formedness is a very minor step down the path.

I have to agree here with the spirit of the post [not the content since it assumed that XML was going to primarily be a browser based format]. It is far more likely and more serious that there are logic errors in an XML document than syntax errors. For example, there are more RSS feeds out there with dates are invalid based on the RSS spec they support than there are ill-formed feeds. And in a number of these it is a lot easier to fix the common well-formedness errors than it is to fix violations of the spec (HTML in descriptions or titles, incorrect date formats, data other than email addresses in the <author> element, etc).

May 7, 1997. Arjun Ray: Re: Final words, I think, on error handling

The basic point against the Draconian case is that a single (monolithic?) policy towards error handling is a recipe for failure. ...

XML is many things but I doubt that one could call it a failure except when it comes to its original [flawed] intent of replacing HTML. As an mechanism for describing structured and semi-structured content in a robust, platform independent manner IT IS KING.

So why do I say everyone lost yet everyone won? Today most XML on the Web targetted at human consumption [i.e. XHTML] isn't well-formed so in this case the Tolerants were right and the Draconians lost since well-formed XML has been a failure on the human Web. However in the places were XML is getting the most traction today, the draconian error handling rules promote interoperability and predictability which is the opposite of what a number of the Tolerants expected would happen with XML in the wild.

Categories: XML

January 20, 2004

@ 03:33 PM

On Versioning XML Vocabularies

One of the biggest problems that faces designers of XML vocabularies is how to make them extensible and design them in a way that applications that process said vocabularies do not break in the face of changes to versions of the vocabulary. One of the primary benefits of using XML for building data interchange formats is that the APIs and technologies for processing XML are quite resistant to additions to vocabularies. If I write an application which loads RSS feeds looking for item elements then processes their link and title elements using any one of the various technologies and APIs for processing XML such as SAX, the DOM or XSLT it is quite straightforward to build an application that processes said elements which is resistant to changes in the RSS spec or extensions to the RSS spec as the link and title elements always appear in a feed.

On the other hand, actually describing such extensibility using the most popular XML schema language, W3C XML Schema, is difficult because of several limitations in its design which make it very difficult to describe extension points in a vocabulary in a way that is idiomatic to how XML vocabularies are typically processed by applications. Recently, David Orchard, a standards architect at BEA Systems wrote an article entitled Versioning XML Vocabularies which does a good job of describing the types of extensibility XML vocabularies should allow and points out a number of the limitations of W3C XML Schema that make it difficult to express these constraints in an XML schema for a vocabulary. David Orchard has written a followup to this article entitled Providing Compatible Schema Evolution which contains a lot of assertions and suggestions for improving extensibility in W3C XML Schema that mostly jibe with my experiences working as the Program Manager responsible for W3C XML Schema technologies at Microsoft.

The scenario outlined in his post is

We start with a simple use case of a name with a first and last name, and it's schema. We will then evolve the language and instances to add a middle name. The base schema is:

<xs:complexType name="nameType">
<xs:sequence>
<xs:element name="first" type="xs:string" />
<xs:element name="last" type="xs:string" minOccurs="0"/>
</xs:sequence>
</xs:complexType>

Which validates the following document:

<name>
<first>Dave</first>
<last>Orchard</last>
</name>

And the scenarios asks how to validate documents such as the following where the new schema with the extension is available or not available to the receiver.:

<name>
<first>Dave</first>
<last>Orchard</last>
<middle>B</middle>
</name>

<name>
<first>Dave</first>
<middle>B</middle>
<last>Orchard</last>
</name>

At this point I'd like to note that this a versioning problem which is a special instance of the extensibility problem. The extensibility problem is how does one describe an XML vocabulary in a way that allows producers to add elements and attributes to the core vocabulary without causing problems for consumers that may not know about them. The versioning problem is specific to when the added elements and attributes actually are from a subsequent version of the vocabulary (i.e. a version 2.0 server talking to a version 1.0 client). The additional wrinkle in the specific scenario outlined by David Orchard is that elements from newer versions of the vocabulary have the same namespace as elements from the old version.

A strategy for simplifying the problem statement would be if additions in subsequent versions of the vocabulary had were in a different namespace (i.e. a version 2.0 document would have elements from the version 1.0 namespace and the version 2.0 namespace) which would then make the versioning problem the same as the extensibility problem. However most designers of XML vocabularies would balk at creating a vocabulary which used elements from multiple namespaces for its core [once past version 2.0] and often site that this makes it more cumbersome for applications that process said vocabularies because they have to deal with multiple namespaces. This is a tradeoff which every XML vocabulary designer should consider during the design and schema authoring process.

David Orchard takes a look at various options for solving the extensibility problem outlined above using current XML Schema design practices.

Type extension

Use type extension or substitution groups for extensibility. A sample schema is:

This requires that both sides simultaneously update their schemas and breaks backwards compatibility. It only allows the extension after the last element

There is a [convoluted] way to ensure that both sides do not have to update their schemas. The producer can send a <name> element that contains xsi:type attribute which has the NameExtendedType as its value. The problem is then how the client knows about the definition for the NameExtendedType type which is solved by the root element of the document containing an xsi:schemaLocation attribute which points to a schema for that namespace which includes the schema from the previous version. There are at least two caveats to this approach (i) the client has to trust the server since it is using a schema defined by the server not the client's and (ii) since the xsi:schemaLocation attribute is only a hint it is likely the validator may ignore it since the client would already have provided a schema for that namespace.

Change the namespace name or element name

The author simply updates the schema with the new type. A sample is:

This does not allow extension without changing the schema, and thus requires that both sides simultaneously update their schemas. If a receiver has only the old schema and receives an instance with middle, this will not be valid under the old schema

Most people would state that this isn't really extensibility since [to XML namespace aware technologies and APIs] the names of all elements in the vocabulary have changed. However for applications that key off the local-name of the element or are unsavvy about XML namespaces this is a valid approach that doesn't cause breakage. Ignoring namespaces, this approach is simply adding more stuff in a later revision of the spec which is generally how XML vocabularies evolve in practice.

Use wildcard with ##other

This is a very common technique. A sample is:

The problems with this approach are summarized in Examining elements and wildcards as siblings. A summary of the problem is that the namespace author cannot extend their schema with extensions and correctly validate them because a wildcard cannot be constrained to exclude some extensions.

I'm not sure I agree with David Orchard summary of the problem here. The problem described in the article he linked to is that a schema author cannot refine the schema in subsequent versions to contain optional elements and still preserve the wildcard. This is due to the Unique Particle Attribution constraint which states that a validator MUST always have only one choice of which schema particle it validates an element against. Given an element declaration for an element and a wildcard in sequuence the schema validator has a CHOICE of two particles it could validate an element against if its name matches that of the element declaration. There are a number of disambiguating rules the W3C XML Schema working group could have come up with to allow greater flexibility for this specific case such as (i) using a first match rule or (ii) allowing exclusions in wildcards.

Use wildcard with ##any or ##targetnamespace

This is not possible with optional elements. This is not possible due to XML Schema's Unique Particle Attribution rule and the rationale is described in the Versioning XML Languages article. An invalid schema sample is:

The Unique Particle Attribution rule does not allow a wildcard adjacent to optional elements or before elements in the same namespace.

Agreed. This is invalid.

Extension elements

This is the solution proposed in the versioning article. A sample of the pre-extended schema is:

An extended instance is

This is the only solution that allows backwards and forwards compatibility, and correct validation using the original or the extended schema. This articles shows a number of the difficulties remaining, particularly the cumbersome syntax and the potential for some documents to be inappropriately valid. This solution also has the problem of each subsequent version will increase the nesting by 1 level. Personally, I think that the difficulties, including potentially deep nesting levels, are not major compared to the ability to do backwards and forwards compatible evolution with validation.

The primary problem I have with this approach is that it is a very unidiomatic way to process XML especially when combined with the problem with nesting in concurrent versions. For example, take a look at

Imagine if this is the versioning strategy that had been used with HTML, RSS or DocBook. That gets real ugly, real fast. Unfortunately this is probably the best you can if you want to use W3C XML Schema to strictly define the an XML vocabulary with extensibility yet allow backwards & forwards compatibility.

David Orchard goes on to suggest a number of potential additions to future versions of W3C XML Schema which would make it easier to use it in defining extensible XML vocabularies. However given that my personal opinion is that adding features to W3C XML Schema is not only trying to put lipstick on a pig but also trying to build a castle on a foundation of sand, I won't go over each of his suggestions. My recent suggestion to some schema authors at Microsoft about solving this problem is that they should have two validation phases in their architecture. The first phase does validation according to W3C XML Schema rules while the other performs validation of “business rules“ specific to their scenarios. Most non-trivial vocabularies end up having such an architecture anyway since there are a number of document validation capabilities missing from W3C XML Schema so schema authors shouldn't be too focused on trying to force fit their vocabulary into the various quirks of W3C XML Schema.

For example, in one could solve the original schema with a type definition such as

<xsd:complexType name="nameType">
<xsd:choice minOccurs="1" maxOccurs="unbounded">
   <xsd:element name="first" type="xsd:string" />
   <xsd:element name="last" type="xsd:string" minOccurs="0"/>
   <xsd:any namespace="##other" processContents="lax" />
</xsd:choice>
</xsd:complexType>

where the validation layer above the W3C XML Schema layer ensures that an element doesn't occur twice (i.e. there can't be two <first> elements in a <name>). It adds more code to the clients & servers but it doesn't result in butchering the vocabulary either.

Categories: XML

January 19, 2004

@ 06:30 AM

It's All About Your Point of "View"

In a recent post entitled XML For You and Me, Your Mama and Your Cousin Too I wrote

The main problem is that there are a number of websites which have the same information but do not provide a uniform way to access this information and when access mechanisms to information are provided do not allow ad-hoc queries. So the first thing that is needed is a shared view (or schema) of what this information looks like which is the shared information model Adam talks about...

Once an XML representation of the relevant information users are interested has been designed (i.e. the XML schema for books, reviews and wishlists that could be exposed by sites like Amazon or Barnes & Nobles) the next technical problem to be solved is uniform access mechanisms... Then there's deployment, adoption and evangelism...

We still need a way to process the data exposed by these web services in arbitrary ways. How does one express a query such as "Find all the CDs released between 1990 and 1999 that Dare Obasanjo rated higher than 3 stars"?..

At this point if you are like me you might suspect that defining that the web service endpoints return the results of performing canned queries which can then be post processed by the client may be more practical then expecting to be able to ship arbitrary SQL/XML, XQuery or XPath queries to web service end points.

The main problem with what I've described is that it takes a lot of effort. Coming up with standardized schema(s) and distributed computing architecture for a particular industry then driving adoption is hard even when there's lots of cooperation let alone in highly competitive markets.

A few days ago I got a response to this post from Michael Brundage, author of XQuery : The XML Query Language and a lead developer of the XML<->relational database technologies the WebData XML team at Microsoft produces, on a possible solution to this problem that doesn't require lots of disparate parties to agree on schemas, data model or web service endpoints. Michael wrote

Dare, there's already a solution to this (which Adam created at MS five years ago) -- virtual XML views to unify different data sources. So Amazon and BN and every other bookseller comes up with their own XML format. Somebody else comes along and creates a universal "bookstore" schema and maps each of them to it using an XML view. No loss of performance in smart XML Query implementations.

And if that universal schema becomes widely adopted, then eventually all the booksellers adopt it and the virtual XML views can go away. I think eventually you'll get this for documents, where instead of translating WordML to XHTML (as Don is doing), you create a universal document schema and map both WordML and XHTML into it. (And if the mappings are reversible, then you get your translators for free.)

This is basically putting an XML Web Service front end that supports some degree of XML query on aggegator sites such as AddALL or MySimon. I agree with Michael that this would be a more bootstrapable approach to the problem than trying to get a large number of sites to support a unified data model, query interface and web service architecture.

Come to think of it we're already halfway there to creating something similar for querying information in RSS feeds thanks to sites such as Feedster and Technorati. All that is left is for either site or others like them to provide richer APIs for querying and one would have the equivalent of an XML View of the blogosphere (God, that is such a pretensious word) which you could query to your heart's delight.

Interesting...

Categories: XML

January 15, 2004

@ 08:24 AM

An Industry First

A Newsgator press release from last week reads

Subscription Synchronization

Users who subscribe to NewsGator Online Services can now synchronize their subscriptions across multiple machines. This is an industry first - NewsGator 2.0 for Outlook and NewsGator Online Services are the first commercially available tools to provide this capability in such a flexible manner. This sophisticated system ensures that subscriptions follow users wherever they go, users never have to read the same content twice (unless they choose to), and even supports multiple subscription lists so users can have separate, but overlapping, subscription lists at home and at the office.

Interesting. Synchronizing subscriptions for a news reader across multiple machines doesn't strike me as unprecedented functionality that Newsgator pioneered let alone an industry first. The first pass I've seen at doing this in public was Dave Winer's subscription harmonizer which seemed more of a prototype than an actual product expected to be used by regular users. I implemented and shipped the ability to synchronize subscriptions across multiple machines with RSS Bandit about 2 months ago. As for providing an aggregator that supports this feature and a commercial site that would host feeds synchronization information I believe Shrook has Newsgator beat by about a month if the website is to be believed (I don't have a Mac to test whether it actually works as advertised).

I find it unfortunate that it seems that we are headed for a world where multiple proprietary solutions and non-interoperable solutions exist for providing basic functionality that uses take for granted when it comes to other technologies like email. This was the impetus for starting work on Synchronization of Information Aggregators using Markup (SIAM) . Unfortunately between my day job, the girlfriend and trying to get another release of RSS Bandit out the door I haven't had time to brush up the spec and actually work on an implementation. It'll be a few weeks before I can truly focus on SIAM, hopefully it'll be worth waiting for and it'll gain some traction amongst aggregator developers.

Categories: RSS Bandit | XML

January 15, 2004

@ 07:27 AM

Reading and Writing Well-Formed XML in the .NET Framework

A recent spate of discussions about well-formed XML in the context of the ATOM syndication format kicked of by There are no exceptions to Postel's Law post has reminded me that besides using an implementation of the W3C DOM most developers do not have a general means of generating well-formed, correct XML in their applications. In the .NET Framework we provide the XmlWriter class for generating XML in a streaming manner but it is not without it's issues. In a recent blog post entitled Well-Formed XML in .NET, and Postel's Rebuttal Kirk Allen Evans writes

At any rate, Tim successfully convinced me that aggregators should not have the dubious task of “correcting“ feeds or displaying feeds that are not well-formed.

Yet I still have a concern about Tim's post, concerning XmlWriter and well-formedness:

PostScript: I just did the first proof on the first draft of this article. It had a mismatched tag and wasn’t well-formed. The publication script runs an XML parser over the draft and it told me the problem and I fixed it. It took less time than writing this postscript.

PPS: Putting My Money Where My Mouth Is - If you’re programming in .NET, there’s a decent-looking XmlWriter class.

The problem is that it is quite possible to emit content using the XmlWriter that is not well-formed. From MSDN online's “Customized XML Writer Creation“ topic:

The XmlTextWriter does not verify that element or attribute names are valid.
The XmlTextWriter writes Unicode characters in the range 0x0 to 0x20, and the characters 0xFFFE and 0xFFFF, which are not XML characters.
The XmlTextWriter does not detect duplicate attributes. It will write duplicate attributes without throwing an exception.

Even using the custom XmlWriter implementation that is mentioned in the MSDN article does not remove the possibility of a developer circumventing the writing process:

Kirk provides a code sample that shows that even with an XmlWriter implementation that performs the well-formedness checks that are missing from the XmlTextWriter provided in v1.0 & v1.1 of the .NET Framework, a developer could still inadvertently write out malformed XML if they hand out the XML stream without closing the XmlTextWriter and thus closing all the unclosed tags.

In the next version of the .NET Framework we plan to provide an XmlWriter implementation that performs all the conformance checks required by the W3C XML 1.0 recommendation when generating XML [except for duplicate attribute checking].

Sam Ruby posted an RSS feed that was malformed XML which can be subscribed to from RSS Bandit without any complaints. I mentioned in a response to the post on Sam Ruby's blog that this is because RSS Bandit uses the XmlTextReader class in the .NET Framework which by default doesn't perform character range checking for numeric entities to ensure that the XML document does not contain invalid XML characters. To get conformant behavior from the XmlTextReader one needs to set its Normalization property to true. In retrospect this was an unfortunate design decision and we should have chosen the default to be conformant behavior but allowed users have the option to change it to unconformant behavior if it suited their needs not the other way around.

In the next version of the .NET Framework we plan to provide an implementation of the XmlReader which is fully conformant to the W3C XML 1.0 recommendation by default.

Categories: XML

January 13, 2004

@ 04:06 PM

XML on the Web

Mark Pilgrim's recent post entitled There are no exceptions to Postel's Law among other things implies that news aggregators should process ill-formed XML feeds given that it is better for end users since they don't care about esoteric rules defined in the XML 1.0 recommendation but they do care if they can't read the news from their favorite sites.

This has unleashed some feedback from XML standards folks such as Tim Bray's On Postel, Again and Norman Walsh's On Atom and Postel's Law who argue that if an feed isn't well-formed XML then it is a fatal error. Aggregator authors have also gotten in the mix. Brent Simmons has a posted a number, of, entries on the topic where he mentions that NetNewsWire currently doesn't error on RSS feeds that are ill-formed XML if it work around the error but plans to change this for ATOM so that it errors on ill-formed feeds. Nick Bradbury has posted similar thoughts with regards to how FeedDemon has behaved in the past and will behave in future. On the other end of the spectrum is Greg Reinacker, the author of NewsGator, who has stated that NewsGator will process ill-formed RSS or ATOM feeds because he feels this is the best choice for his customers.

My thoughts on this matter are the same as Dave Winer's in his post Postel's Law has two parts

Personally I disagree with the first half of the law when applied to XML -- the idea that aggregators should bend over backwards to accept poorly formed XML. I always understood that XML was trying to do something different, as a response to the awful mess that HTML became because browser vendors adopted the first half of Postel's philosophy.

When I adopted XML, in 1997, as I understood it -- I signed onto the idea of rejecting invalid XML. It was considered a bug if you accepted invalid XML, not a bug if you didn't.

Brent Simmons, an early player in this market, says users are better served if he reads bad feeds, but when he does that, he's raising the barrier to entry, in undocumented ways that are hard to reproduce.

His interests are served by high barriers to entry, but the users do better if they have more choice.

Now, the users are happy as long as Brent is around to keep updating his aggregator to work around feed bugs, but he might move on, it happens for all kinds of reasons. It's better to insist on tight standards, so users can switch if they want to, for any reason; so that next year's feed will likely work with this year's aggregator, even if it doesn't dominate the market.

I yearn for just one market with low barriers to entry, so that products are differentiated by features, performance and price; not compatibility.

I work on the XML team at Microsoft and one of the things I have to do is coordinate with all the other teams using XML at Microsoft. The ability to consume and produce XML is or will be baked into a wide range of products including BizTalk, SQL Server, Word, Excel, InfoPath, Windows, and Visual Studio. This besides the number of developer technologies for processing XML from XQuery and XSLT to databinding XML documents to GUI components. In a previous post I mentioned my XML Litmus Test for deciding whether XML would beneifit your project

Using XML for a software development project buys you two things (a) the ability to interoperate better with others and (b) a number of off-the-shelf tools for dealing with format.

Encouraging the production and consumption of ill-formed XML damages both these benefits of using XML since interoperability is lost when different tools treat the same XML document differently and off-the-shelf tools can no longer be reliably used to process the documents of that format. This poisons the well for the entire community of developers and users.

Developers and users of RSS or ATOM can't reap the benefits of the various Microsoft technologies and products (i.e querying feeds using XQuery or storing feeds in SQL Server) if there is a proliferation of ill-formed feeds. So far this is not the case (ill-formed feeds are a minority) but every time an aggregator vendor decides to encourage content producers to generate ill-formed XML by working aroound it and displaying the feed to the user with no visible problems that is one more drop of cyanide in the well.

Categories: XML

January 10, 2004

@ 11:28 PM

Dogma, Religion and Computer Programmers

Mark Pilgrim has a fairly interesting post entitled There are no exceptions to Postel’s Law which contains the following gem

There have been a number of unhelpful suggestions recently on the Atom mailing list...

Another suggestion was that we do away with the Atom autodiscovery <link> element and “just” use an HTTP header, because parsing HTML is perceived as being hard and parsing HTTP headers is perceived as being simple. This does not work for Bob either, because he has no way to set arbitrary HTTP headers. It also ignores the fact that the HTML specification explicitly states that all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element. So instead of requiring clients to parse HTML, we should “just” require them to parse HTTP headers... and HTML.

Given that I am the one that made this unhelpful suggestion on the ATOM list it only seems fair that I clarify my suggestion. The current proposal for how an ATOM client (for example. a future version of RSS Bandit) determines how to locate the ATOM feed for a website or post a blog entry or comment is via Mark Pilgrim's ATOM autodsicovery RFC which basically boils down to parsing the webpage for <link> tags that point to the ATOM feed or web service endpoints. This is very similar to RSS autodiscovery which has been a feature of RSS Bandit for several months.

The problem with this approach is that it means that an ATOM client has to know how to parse HTML on the Web in all it's screwed up glory including broken XHTML documents that aren't even wellformed XML, documents that use incorrect encodings and other forms of tag soup. Thankfully on major platforms developers don't have to worry about figuring out how to rewrite the equivalent of the Internet Explorer or Mozilla parser themselves because others have done so and made the libraries freely available. For Java there's John Cowan's TagSoup parser while for C# there's Chris Lovett's SgmlReader (speaking of which it looks like he just updated it a few days ago meaning I need to upgrade the version used by RSS Bandit). In RSS Bandit I use SgmlReader which in general works fine until confronted with weirdness such as the completely broken HTML produced by old versions of Microsoft Word including tags such as

<?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office" />

Over time I've figured out how to work past the markup that SgmlReader can't handle but it's been a pain to track down what they were and I often ended up finding out about them via bug reports from frustrated users. Now Mark Pilgrim is proposing that ATOM clients must have to go through the same problems that're faced by folks like me who've had to deal with RSS autodiscovery.

So I proposed an alternative, that instead of every ATOM client having to require an HTML parser that instead this information is provided in a custom HTTP header that is returned by the website. Custom HTTP headers are commonplace on the World Wide Web and are widely supported by most web development technologies. The most popular extension header I've seen is the X-Powered-By header although I'd say the most entertaining is the X-Bender header returned by Slashdot which contains a quote from Futurama's Bender. You can test for yourself which sites return custom HTTP headers by trying out Rex Swain's HTTP Viewer. Not only is generating custom headers widely supported by web development technologies like PHP and ASP.NET but also extracting them from an HTTP response is fairly trivial on most platforms since practically every HTTP library gives you a handy way to extract the headers from a response in a collection or similar data structure.

If ATOM autodiscovery used a custom header as opposed to requiring clients to use an HTML parser it would make the process more reliable (no more worry about malformed [X]HTML borking the process) which is good for users as I can attest from my experiences with RSS Bandit and reduce the complexity of client applications (no dependence on a tag soup parsing library).

Reading Mark Pilgrim's post the only major objection he raises seems to be that the average user (Bob) doesn't know how add custom HTTP headers to their site which is a fallacious argument given that the average user similarly doesn't know how to generate an XML feed from their weblog either. However the expectation is that Bob's blogging software should do this not that Bob will be generating this stuff by hand.

Mark also incorrectly states that the HTML spec states that any “all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element”. The HTML specification actually states

META and HTTP headers

The http-equiv attribute can be used in place of the name attribute and has a special significance when documents are retrieved via the Hypertext Transfer Protocol (HTTP). HTTP servers may use the property name specified by the http-equiv attribute to create an [RFC822]-style header in the HTTP response. Please see the HTTP specification ([RFC2616]) for details on valid HTTP headers.
The following sample META declaration:
<META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">
will result in the HTTP header:
Expires: Tue, 20 Aug 1996 14:25:27 GMT

That's right, the HTML spec says that authors can put <meta http-equiv="..."> in their HTMl documents and a web server gets a request for a document it should parse out these tags and use them to add HTTP headers to the response. In reality this turned out to be infeasible because it would be highly inefficient and require web servers to run a tag soup parser over a file each time they served it up to determine which headers to send in the response. So what ended up happening, is that certain browsers support a limited subset of the HTTP headers if they appear as <meta http-equiv="..."> in teh document.

It is unsurprising that Mark mistakes what ended up being implemented by the major browsers and web servers as what was in the spec after all he who writes the code makes the rules.

At this point I'd definitely like to see an answer to the questions Dave Winer asked on the atom-syntax list about its decision making process. So far it's seemed like there's a bunch of discussion on the mailing list or on the Wiki which afterwards may be ignored by the powers that be who end up writing the specs (he who writes the spec makes the rules). The choice of <link> tags over using RSD for ATOM autodiscovery is just one of many examples of this occurence. It'd be nice to some documentation of the actual process as opposed to the anarchy and “might is right” approach that currently exists.

Categories: XML

January 6, 2004

@ 09:28 PM

XQuery on the Web

In response to my previous post David Orchard provides a link to his post entitled XQuery: Meet the Web where he writes

In fact, this separation of the private and more general query mechanism from the public facing constrained operations is the essence of the movement we made years ago to 3 tier architectures. SQL didn't allow us to constrain the queries (subset of the data model, subset of the data, authorization) so we had to create another tier to do this.

What would it take to bring the generic functionality of the first tier (database) into the 2nd tier, let's call this "WebXQuery" for now. Or will XQuery be hidden behind Web and WSDL endpoints?

Every way I try to interpret this it seems like a step back to me. It seems like in general the software industry decided that exposing your database & query language directly to client applications was the wrong way to build software and 2-tier client-server architectures giving way to N-tier architectures was an indication of this trend. I fail to see why one would think it is a good idea to allow clients to issue arbitrary XQuery queries but not think the same for SQL. From where I sit there is basically little if any difference from either choice for queries. Note that although SQL is also has a Data Definition Langauge (DDL) and Data Manipulation Language (DML) as well as a query language for the purposes of this discussion I'm only considering the query aspects of SQL.

David then puts forth some questions about this idea that I can't help offering my opinions on

If this is an interesting idea, of providing generic and specific query interfaces to applications, what technology is necessary? I've listed a number of areas that I think need examination before we can get to XQuery married to the Web and to make a generic second tier.

1. How to express that a particular schema is queryable and the related bindings and endpoint references to send and receive the queries. Some WSDL extensions would probably do the trick.

One thing lacking in the XML Web Services world are the simple REST-like notions of GET and POST. In the RESTful HTTP world one would simply specify a URI which one could perform an HTTP GET on an get back an XML document. One could then either use the hierarchy of the URI to select subsets of the document or perhaps use HTTP POST to send more complex queries. All this indirection with WSDL files and SOAP headers yet functionality such as what Yahoo has done with their Yahoo! News Search RSS feeds isn't straightforward. I agree that WSDL annotations would do the trick but then you have to deal with the fact that WSDL's themselves are not discoverable. *sigh* Yet more human intervebtion is needed instead of loosely coupled application building.

2. Limit the data set returned in a query. There's simply no way an large provider of data is going to let users retrieve the data set from a query. Amazon is just not going to let "select * from *" happen. Perhaps fomal support in XQuery for ResultSets to be layered on any query result would do the trick. A client would then need to iterate over the result set to get all the results, and so a provider could more easily limit the # of iterations. Another mechanism is to constrain the Return portion of XQuery. Amazon might specify that only book descriptions with reviews are returnable.

This is just a difficult problem. Some queries are complex, computationally intensive but return few results. In some cases it is hard to tell by just looking at the query how badly it'll perform. A notion of returning result sets makes sense in a mid-tier application that's talking to a database but not to client app half-way across the world talking to a website.

3. Subset the Xquery functionality. Xquery is a very large and complicated specification. There's no need for all that functionality in every application. This would make implementation of XQuery more wide spread as well. Probably the biggest subset will be Read versus Update.

Finally something I agree with although David shows some ignorance of XQuery by assuming that there is an update aspect to XQuery when DML was shelved for the 1.0 version. XQuery is just a query language. However it is an extremely complex query language which is hundreds of pages in specification long. The most relevant specs from the W3C XML Query page are linked to below

XQuery 1.0 and XPath 2.0 Data Model (updated, LAST CALL), last release 12 November 2003
XSLT 2.0 and XQuery 1.0 Serialization, (updated, LAST CALL), last release 12 November 2003
XQuery 1.0 and XPath 2.0 Formal Semantics (updated), last release 12 November 2003
XQuery 1.0: An XML Query Language (updated, LAST CALL), [DIFF with previous version], last release 12 November 2003
XML Syntax for XQuery 1.0 (XQueryX), updated, last release 19 December 2003
XQuery 1.0 and XPath 2.0 Functions and Operators (updated, LAST CALL), [DIFF with previous version], last release 12 November 2003

I probably should also link to the W3C XML Schema: Structures and W3C XML Schema: Datatypes specs since they are the basis of the type system of XQuery. My personal opinion is that XQuery is probably too complex to use as the language for such an endeavor since you want something that is simple to implement and fairly straightforward so that there can be ubiqitous implementations and therefore lots of interoperability (unlike the current situation with W3C XML Schema). I personally would start with XPath 1.0 and subset or modify that instead of XQuery.

4. Data model subsets. Particular user subsets will only be granted access to a subset of the data model. For example, Amazon may want to say that book publishers can query all the reviews and sales statistics for their books but users can only query the reviews. Maybe completely separate schemas for each subset. The current approach seems to be to do an extract of the data subset accoring to each subset, so there's a data model for publishers and a data model for users. Maybe this will do for WebXQuery.

5. Security. How to express in the service description (wsdl or policy?) that a given class of users can perform some subset of the functionality, either the query, the data model or the data set. Some way of specifying the relationship between the set of data model, query functionality, data set and authorization.

I'd say the above two features are tied together. You need some way to restrict what the sys admin vs. the regular user executing such a query over the wire can do as well as a way to authenticate them.

6. Performance. The Web has a great ability to increase performance because resources are cachable. The design of URIs and HTTP specifically optimizes for this. The ability to compare URIs is crucial for caching., hence why so much work went into specifying how they are absolutized and canonically compared. But clearly XQuery inputs are not going to be sent in URIs, so how do we have cachable XQueries gven that the query will be in a soap header? There is a well defined place in URIs for the query, but there isn't such a thing in SOAP. There needs to be some way of canonicalizing an Xquery and knowing which portions of the message contain the query. Canonicalizing a query through c14n might do the trick, though I wonder about performance. And then there's the figuring out of which header has the query. There are 2 obvious solutions: provide a description annotation or an inline marker. I don't think that requiring any "XQuery cache" engine to parse the WSDL for all the possible services is really going to scale, so I'm figuring a well-defined SOAP header is the way to go.

Sounds like overthinking the problem and yet not being general enough. The first problem is that there should be standard ways that proxies and internet caches know how to cache XML Web Service results in the same way that they know how to cache results of HTTP GET requests today. After that figuring out how to canonicalize a query expression (I'm not even sure what that means- will /root/child and /root/*[local-name()='child'] be canonicalized into the same thing?) is probably a couple of Ph.D theses of work.

Then there's just the fact that allowing clients to ship arbitrary queries to the server is a performance nightmare waiting to happen...

Your thoughts? Is WebXQuery an interesting idea and what are the hurdles to overcome?

It's an interesting idea but I suspect not a very practical or useful one outside of certain closed applications with strict limitations on the number of users or the type of queries issued.

Anyway, I'm off to play in the snow. I just saw someone skiing down the street. Snow storms are fun.

Categories: XML

January 6, 2004

@ 04:17 PM

XML For You and Me, Your Mama and Your Cousin Too

Jon Udell recently wrote in his post entitled XML for the rest of us

By the way, Adam Bosworth said a great many other interesting things in his XML 2003 talk. For those of you not inclined to watch this QuickTime clip -- and in particular for the search crawlers -- I would like to enter the following quote into the public record.

The reason people get scared of queries is that it's hard to say 'You can send me this kind of query, but not that kind of query.' And therefore it's hard to have control, and people end up building other systems. It's not clear that you always want query. Sometimes people can't handle arbitrary statements. But we never have queries. I don't have a way to walk up to Salesforce and Siebel and say tell me everything I know about the customer -- in the same way. I don't even have a way to say tell me everything about the customers who meet the following criteria. I don't have a way to walk up to Amazon and Barnes and Noble and in a consistent way say 'Find me all the books reviewed by this person,' or even, 'Find me the reviews for this book.' I can do that for both, but not in the same way. We don't have an information model. We don't have a query model. And for that, if you remember the dream we started with, we should be ashamed.

I think we can fix this. I think we can take us back to a world that's a simple world. I think we can go back to a world where there are just XML messages flowing back and forth between...resources. <snipped />

Three things jump out at me from that passage. First, the emphasis on XML query. My instincts have been leading me in that direction for a while now, and much of my own R&D in 2003 was driven by a realization that XPath is now a ubiquitous technology with huge untapped potential. Now, of course, XQuery is coming on like a freight train.

When Don and I hung out over the holidays this was one of the things we talked about. Jon's post has been sitting flagged for follow up in my aggregator for a while. Here are my thoughts...

The main problem is that there are a number of websites which have the same information but do not provide a uniform way to access this information and when access mechanisms to information are provided do not allow ad-hoc queries. So the first thing that is needed is a shared view (or schema) of what this information looks like which is the shared information model Adam talks about. There are two routes you can take with this, one is to define a shared data model with the transfer syntax being secondary (i.e. use RDF) while another is to define a shared data model and transfer syntax (i.e use XML). In most cases, people have tended to pick the latter.

Once an XML representation of the relevant information users are interested has been designed (i.e. the XML schema for books, reviews and wishlists that could be exposed by sites like Amazon or Barnes & Nobles) the next technical problem to be solved is uniform access mechanisms. The eternal REST vs. SOAP vs. XML-RPC that has plagued a number of online discussions. Then there's deployment, adoption and evangelism.

Besides the fact that I've glossed over the significant political and business reasons that may or may not make such an endeavor fruitful we still haven't gotten to Adam's Nirvana. We still need a way to process the data exposed by these web services in arbitrary ways. How does one express a query such as "Find all the CDs released between 1990 and 1999 that Dare Obasanjo rated higher than 3 stars"? Given the size of the databases hosted by such sites would it make more sense to ship the documents to the client or some mid-tier which then performs the post-processing of the raw data instead of pushing such queries down to the database? What are the performance ramifications of exposing your database to anyone with a web browser and allowing them to run ad-hoc queries instead of specially optimized, canned queries?

At this point if you are like me you might suspect that defining that the web service endpoints return the results of performing canned queries which can then be post processed by the client may be more practical then expecting to be able to ship arbitrary SQL/XML, XQuery or XPath queries to web service end points.

The main problem with what I've described is that it takes a lot of effort. Coming up with standardized schema(s) and distributed computing architecture for a particular industry then driving adoption is hard even when there's lots of cooperation let alone in highly competitive markets.

In an ideal world, this degree of boot strapping would be unnecessary. After all, people can already build the kinds of applications Adam described today by screen scraping [X]HTML although they tend to be brittle. What the software industry should strive for is a way to build such applications in a similarly loosely connected manner in the XML Web Services world without requiring the heavy investment of human organizational effort that is currently needed. This was the initial promise of XML Web Services which like Adam I am ashamed has not come to pass. Instead many seem to be satisfied with reinventing DCOM/CORBA/RMI with angle-brackets (then replacing it with "binary infosets"). Unfortunate...

Categories: XML

January 5, 2004

@ 07:39 AM

Comments [17]

Request For Comments: Synchronization of Information Aggregators using Markup (SIAM)

I've just finished the first draft of a specification for Synchronization of Information Aggregators using Markup (SIAM) which is the result of a couple of weeks of discussion between myself and a number of others authors of news aggregators. From the introduction

A common problem for users of desktop information aggregators is that there is currently no way to synchronize the state of information aggregators used on different machines in the same way that can be done with email clients today. The most common occurence of this is a user that uses a information aggregator at home and at work or at school who'd like to keep the state of each aggregator synchronized independent of whether the same aggregator is used on both machines.

The purpose of this specification is to define an XML format that can be used to describe the state of a information aggregator which can then be used to synchronize another information aggregator instance to the same state. The "state" of information aggregator includes information such as which feeds are currently subscribed to by the user and which news items have been read by the user.

This specification assumes that a information aggregator is software that consumes an XML syndication feed in one of the following formats; ATOM, [RSS0.91], [RSS1.0] or [RSS2.0]. If more syndication formats gain prominence then this specification will be updated to take them into account.

This final draft owes a lot of its polish to comments from Luke Hutteman (author of SharpReader), Brent Simmons (author of NetNewsWire) and Kevin Hemenway aka Morbus Iff (author of AmphetaDesk ). There are no implementations out there yet although once enough feedback has been gathered about the current spec I'll definitely add this to RSS Bandit and deprecate the existing mechanisms for subscription harmonization.

Brent Simmons has a post which highlights some of the the various issues that came up in our discussions entitled The challenges of synching.

Categories: Technology | XML

January 1, 2004

@ 10:51 AM

The Unified Theory of Everything

Sean Campbell or Scott Swigart writes

I want this also. I want a theory that unifies objects and data. We're not there yet.

With a relational database, you have data and relationships, but no objects. If you want objects, that's your problem, and the problem isn't insignificant. There’s been a parade of tools and technologies, and all of them have fallen short on the promise of bridging the gap. There's the DataSet, which seeks to be one bucket for all data. It's an object, but it doesn't give you an object view of the actual data. It leaves you doing things like ds.Tables["Customer"].Rows[0]["FirstName"].ToString(). Yuck. Then there are Typed DataSets. These give you a pseudo-object view of the data, letting you do: ds.Customer[0].FirstName. Better, but still not what I really want. And it's just code-gen on top of the DataSet. There's no real "Customer" object here.

Then, there are ObjectSpaces that let you do the XSD three-step to map classes to relational data in the database. With ObjectSpaces you get real, bona fide objects. However, this is just a bunch of goo piled on top of ADO.NET, and I question the scalability of this approach.

Then there are UDTs. In this case, you've got objects all the way into the database itself, with the object serialized as one big blob into a single column. To find specific objects, you have to index the properties that you care about, otherwise you're looking at not only a table scan, but rehydrating every row into an object to see if it's the object you're looking for.

There's always straight XML, but at this point you're essentially saying, "There are no objects". You have data, and you have schema. If you're seeing objects, it's just an optical illusion on top of the angle brackets. In fact, with Web services, it's emphatically stated that you're not transporting objects, you're transporting data. If that data happens to be the serialization of some object, that's nice, but don't assume for one second that that object will exists on the other end of the wire.

And speaking of XML, Yukon can store XML as XML. Which is to say you have semi-structured data, as XML, stored relationally, which you could probably map to an XML property of an object with ObjectSpaces.

What happens when worlds collide? Will ObjectSpaces work with Yukon UDTs and XML?

Oh, and don't forget XML Views, which let you view your relational data as XML on the client, even though it's really relational.

<snip />

So for a given scenario, do all of you know which technology to pick? I'm not too proud to admit that honestly I don't. In fact, I honestly don't know if I'll have time to stress test every one of these against a number of real problem domains and real data. And something tells me that if you pick the wrong tool for the job, and it doesn't pan out, you could be pretty hosed.

Today we have a different theory for everything. I want the Theory of Everything.

I've written about this problem in the past although at the time I didn't have a name for the Theory of Everything, now I do. From my previous post entitled Dealing with the Data Access Impedance Mismatch I wrote

The team I work for deals with data access technologies (relational, object, XML aka ROX) so this impedance mismatch is something that we have to rationalize all the time.

Up until quite recently the primary impedance mismatch application developers had to deal with was the Object<->Relational impedance mismatch. Usually data was stored in a relational database but primarily accessed, manipulated and transmitted over the network as objects via some object oriented programming language. Many felt (and still feel) that this impedance mismatch is a significant problem. Attempts to reduce this impedance mismatch has lead to technologies such as object oriented databases and various object relational mapping tools. These solutions take the point of view that the problem of having developers deal with two domains or having two sets of developers (DB developers and application coders) are solved by making everything look like a single domain, objects. One could also argue that the flip side of this is to push as much data manipulation as you can to the database via technologies like stored procedures while mainly manipulating and transmitting the data on the wire in objects that closely model the relational database such as the .NET Framework's DataSet class.

Recently a third player has appeared on the scene, XML. It is becoming more common for data to be stored in a relational database, mainly manipulated as objects but transmitted on the wire as XML. One would then think that given the previously stated impedance mismatch and the fact that XML is mainly just a syntactic device that XML representations of the data being transmitted is sent as serialized versions of objects, relational data or some subset of both. However, what seems to be happening is slightly more complicated. The software world seems to moving more towards using XML Web Services built on standard technologies such as HTTP, XML, SOAP and WSDL to transmit data between applications. And taken from the WSDL 1.1 W3C Note

WSDL recognizes the need for rich type systems for describing message formats, and supports the XML Schemas specification (XSD) [11] as its canonical type system

So this introduces a third type system into the mix, W3C XML Schema structures and datatypes. W3C XML Schema has a number of concepts that do not map to concepts in either the object oriented or relational models. To properly access and manipulate XML typed using W3C XML Schema you need new data access mechanisms such as XQuery. Now application developers have to deal with 3 domains or we need 3 sets of developers. The first instinct is to continue with the meme where you make everything look like objects which is what a number of XML Web Services toolkits do today including Microsoft's .NET Framework via the XML Serialization technology. This tends to be particularly lossy because traditionally object oriented systems do not have the richness to describe the constraints that are possible to create with a typical relational database let alone the even richer constraints that are possible with W3C XML Schema. Thus such object oriented systems must evolve to not only capture the semantics of the relational model but those of the W3C XML Schema model as well. Another approach could be to make everything look like XML and use that as the primary data access mechanism. Technologies already exist to make relational databases look like XML and make objects look like XML. Unsurprisingly to those who know me, this is the approach I favor. The relational model can also be viewed as a universal data access mechanism if one figured out how to map the constraints of the W3C XML Schema model. The .NET Framework's DataSet already does some translation of an XML structure defined in a W3C XML Schema to a relational structure.

The problem with all three approaches I just described is that they are somewhat lossy or involve hacking one model into becoming the uber-model. XML trees don't handle the graph structures of objects well, objects can't handle concepts like W3C XML Schema's derivation by restriction and so on. There is also a fourth approach which is endorsed by Erik Meijer in his paper Unifying Tables, Objects, and Documents where one creates a new unified model which is a superset of the pertinent features of the 3 existing models. Of course, this involves introducing a fourth model.

The fourth model mentioned above is the unified theory of everything that Scott or Sean is asking for. Since the last time I made this post, my friend Erik Meijer has been busy and produced another paper that shows what such a unification of the ROX triangle would look like if practically implemented as a programming language in his paper Programming with Circles, Triangles and Rectangles. In this paper Erik describes the research language Xen which seems to be the nirvana Scott or Sean is looking for. However this is a research project and not something Sean or Scott will be likely to use in production in the next year.

The main problem is that Microsoft has provided .NET developers with too much choice when it comes to building apps that retrieve data from a relational store, manipulate the data in memory then either push the updated information back to the store or send it over the wire. The one thing I have learned working as a PM on core platform technologies is that our customers HATE choice. It means having to learn multiple technologies and make decisions on which is the best, sometimes risking making the wrong choice. This is exactly the problem Scott or Sean is having with the technologies we announced at the recent Microsoft Professional Developer Conference (PDC) which will should be shiping this year. What technology should I use and when I should I use it?

This is something the folks on my team (WebData - the data access technology team) know we have to deal with when all this stuff ships later this year which we will deal with to the best of our ability. Our users want architectural guidance and best practices which we'll endeavor to make available as soon as possible.

The first step in providing this information to our users are the presentations and whitepaper we made available after PDC, Data Access Design Patterns: Navigating the Data Access Maze (Powerpoint slides) and Data Access Support in Visual Studio.NET code named “Whidbey”. Hopefully this will provide Sean, Scott and the rest of our data access customers with some of the guidance needed to make the right choice. Any feedback on the slides or document would be appreciated. Follow up documents should show up on MSDN in the next few months.

Categories: Technology | XML

December 29, 2003

@ 02:25 AM

On Euphemisms: XML Web Services & SOA

Chris Sells recently complained that a recent interview of Don Box by Mary Jo Foley is "a relatively boring interview" because "Mary Jo doesn't dig for any dirt and Don doesn't volunteer any". He's decided to fix this by proposing an alternate interview where folks send in their favorite questions and he picks the 10 best and formwards them to Don (kinda like Slashdot interviews). Chris offers some seed questions but they are actually much lamer than any of the ones Mary Jo asked so I suspect his idea of questions that dig for dirt are different from mine.

I drafted 10 questions and picked the 3 least controversial for my submissions to the Don Box interview pool.

People often come up with euphemisms for an existing word or phrase that has become "unpleasant" which although technically mean a different thing from the previous terminology are used interchangeably. A recent example of this is the replacement of "black" with "African American" in the modern American lexicon when describing people of African descent.
I suspect something similar has happened with XML Web Services and Service Oriented Architecture. Many seem to think that the phrases are interchangeable when on the surface it seems the former is just one instance of the latter. To you what is the difference between XML Web Services and Service Oriented Architectures?
For a short while you were active in the world of weblogging technologies, you tried to come up with an RSS profile and were working on a blogging tool with Yasser Shohoud and Martin Gudgin. In recent times, you have been silent about these past activities. What sparked your interest in weblogging technologies and why does that interest seem to have waned?
What team would you not want to work for at Microsoft and why?

These were my tame questions but I get to hang with Don sometime this week so I'll ask him some of the others in person. I hope one of my questions gets picked by Chris Sells.

Categories: Life in the B0rg Cube | XML

December 26, 2003

@ 04:07 PM

Hypocrisy That Turns My Stomach

Mark Pilgrim's most recent entry in his RSS feed contains the following text

The best things in life are not things. (11 words)

Note: The "dive into mark" feed you are currently subscribed to is deprecated. If your aggregator supports it, you should upgrade to my Atom feed, which includes both summaries and full content.

A lot of the ATOM vs. RSS discussion has been mired in childishness and personality conflicts with the main proponents of ATOM claiming that the creation of the ATOM syndication format will be a good thing for users of syndication and blogging software. Now let's pretend this is true and the only people who have to bear the burden are aggregator authors like me who now have to add support for yet another syndication format. Let's see what my users get out of ATOM feeds compared to RSS feeds.

Mark Pilgrim's ATOM feed: As I write this his feed contains the following elements per entry; id, created, issued, modifed, link, summary,title, dc:subject and content. The aformentioned elements are equivalent to guid, pubDate, ~~issued~~, ~~modified~~, link, description, title, dc:subject and content:encoded/xhtml:body that exist in RSS feeds today. In fact an RSS feed with those elements and Mark Pilgrim's feed will be treated identically by RSS Bandit. The only problematic pieces are that his feed contains three dates that express when the entry was issued, when it was modified and when it was created. Most puzzling is that the issued date is before its created date. I have no idea what this distinction means and quite frankly I doubt many people will care.

Basically, it looks like Mark Pilgrim's ATOM feed doesn't give users anything they couldn't get from an equivalent RSS feed except the fact that they have to upgrade their news aggregators and deal with potential bugs in the implementations of these features [because there are always bugs in new features]
LiveJournal's ATOM feeds: As I write this a sample feed from Live Journal (in this case Jamie Zawinski's) contains the following elements per entry; id, modified, issued, link, title, author and content . The aformentioned elements are equivalent to guid, ~~modified~~, ~~issued~~, link, title, author/dc:author and content:encoded/xhtml:body. Comparing this feed to Mark Pilgrim's I already see a bunch of ambiguity which supposedly is not supposed to exist since what ATOM supposedly gives consumers over RSS is that it will be better defined and less ambiguous than RSS. How are news aggregators supposed to treat the three date types defined in ATOM? In RSS I could always use the pubDate or dc:date now I have to figure out which of <modified>, <issued> or <created> is the most relevant one to show the user. Another point is what do I do if a feed contains <content rel="fragment"> amd a <summary>? Which one do I show the user?
Movable Type's ATOM feeds: As I write this the MovableType ATOM template contains the following elements; id, modified, issued, link, title, author, dc:subject. summary and content. The aformentioned elements are equivalent to guid, modified, issued, link, title, author/dc:author, dc:subject, description and content:encoded/xhtml:body. Again besides the weirdness with dates (and I suspect RSS Bandit will end up treating <modifed> equivalent to <pubDate>) there isn't anything users get from the ATOM feed that they don't get from the equivalent RSS feed. Interesting, I'd expected that I'd find at least one of the first 3 sample ATOM feeds that I took a look at would show me why it was worth it that I spend a weekend or more implementing ATOM support in RSS Bandit.

The fundamental conceit of the ATOM effort is that they think writing specifications is easy. Many of its proponents deride RSS for being ambiguous and not well defined yet they are producing a more complex specification with more significant ambiguities in it than I've seen in RSS. I actually have a mental list of significant issues with ATOM that I haven't even posted yet, the ones I mentioned above were just from glancing at the aforementioned feeds. My day job involves reading or writing specs all day. Most of the specs I read either were produced by the W3C or by folks within Microsoft. Every one of them contains contradictions, ambiguities and lack crucial information for determining in edge cases. Some are better than others but they all are never well-defined enough. Every spec has errata.

The ATOM people seem to think that if a simple spec like RSS can have ambiguities they can fix it with a more complex spec, which anyone who actually does this stuff for a living will tell you just leads to more complex ambiguities to deal with not less.

I wish them luck. As I implement their spec I at least hope that some of these ATOM supporters get a clue and actually use some of the features of ATOM that RSS users have enjoyed for a while and are lacking in all of the feeds I linked to above such as the ATOM equivalent to wfw:commentRss. It's quite irritating to be able to read the comments to any .TEXT or dasBlog weblog in my news aggregator but then have to navigate to the website when I'm reading a Movable Type or LiveJournal feed to see the comments.

Categories: XML

December 24, 2003

@ 05:09 AM

When Bad Ideas Attack

Joshua Allen writes

Before discussing qnames in content, let's discuss a general issue with qnames that you might not have known about. Take the following XML:
<?xml version="1.0" ?>
<root xmlns:p="http://foo.org">
<p:elem att1="" att2="" ... />
<p:elem att1="" att2="" ... xmlns:p="http://bar.org" / >
<x:elem att1="" att2="" xmlns:x="http://foo.org" />
</root>

Notice the first two elements, both ostensibly named "p:elem", but if we treat the element names as opaque strings, we'll get confused and think the elements are the same. Luckily, we have this magical thing called a qname that uses namespace instead of prefix, and so we can note that the two element names are actually "{http://bar.org}elem" and "{http://foo.org}/elem" -- different. By the same token, if we compare the first and third element using opaque strings, we think that they are different ("p:elem" and "x:elem"). But if we look at the qnames, we see they are both "{http://foo.org}elem".
...
so what is the big deal for qnames in content? Look at the following XML:

<?xml version="1.0" ?>
<root xmlns:x="urn:x" xmlns:p="http://www.foo.org" >
<p:elem>here is some data: with a colon for no good reason</p:elem>
<p:elem>x:address</p:elem>
<p:elem xmlns:x="urn:y">x:address</p:elem>
</root>

Now, do the last two "p:elem" elements contain the same text, or different text? If you compared using XSLT or XPath, what would be the result? How about if you used the values in XSD key/keyref? The answer is that XSLT and XPath have no way of knowing that you intend those last two elements to be qnames, so they will treat them as opaque strings. With XSD, you could type the node as qname... Most APIs are smart enough to inject namespace declarations if necessary, so the first node would write correctly as:

<p:elem xmlns:p="http://www.foo.org">here is some data: with a colon for no good reason</p:elem>

But, since the DOM has no idea that you stuffed a qname in the element content, it's got no way to know that you want to preserve the namespace for x:

<p:elem xmlns:p="http://www.foo.org">x:address</p:elem>

There is really only one way to get around this, and this is for any API which writes XML to always emit namespace declarations for all namespaces in scope, whether they are used or not (or else understand enough about the XSD and make some guesses). Some APIs do this, but it is not something that all APIs can be trusted to do, and it yields horribly cluttered XML output and other problems.

Joshua has only hit the surface of what the real problem which is that there is no standard way to write out an XML infoset with the PSVI contributions added during validation. In plain English, there is no standard way to write out an XML document that has been validated using W3C XML Schema containing all the relevant type annotations plus other infoset augmentations. In the above example, the fact that the namespace declaration that uses the "x" prefix is not included in the output is not as significant as the fact that there is no way to tell that the type of p:elem's content is the xs:QName type.

However this doesn't change the fact that using QNames in content in an XML vocabulary is a bad idea. Specifically I am talking about using the xs:QName type in your vocabulary. The semantics of this type are so absurd it boggles the mind. Below is the definition from the W3C XML Schema recommendation

[Definition:] QName represents XML qualified names. The ·value space· of QName is the set of tuples {namespace name, local part}, where namespace name is an anyURI and local part is an NCName. The ·lexical space· of QName is the set of strings that ·match· the QName production of [Namespaces in XML].

This basically says that text content of type xs:QName in an XML document such as "x:address" actually is a namespace name/local name pair such as "{http://www.example.com}address". This instantly means that you can not interpret this type without carrying around some sort of context (i.e a list of namespace name<->prefix bindings) which makes it different from most other types defined in the W3C XML Schema recommendation because it has no canonical lexical representation. A value such as "x:address" is meaningless without knowing what XML document it came from and specifically what the namespace binding for the "x" prefix was at that particular scope.

Of course, the existence of the QName type means you can do interesting things like use a different prefix for a particular namespace in the schema than you use in the XML instance so you can specify that the content of the <p:elem> element should be one of a:address or a:location but have x:address in the instance which would be fine if the "a" prefix is bound to the "http://www.example.com" namespace in the schema and the "x" is bound to the same namespace in the instance document. You can also ask interesting questions such as What happens if I have a default value that is of type xs:QName but there is no namespace declaration for the namespace name at that scope? Does this mean that not only should a default value be inserted as the content of an element or attribute but also that a namespace declaration is also created at the same scope if one does not exist?

Fun stuff, not.

Categories: XML

December 23, 2003

@ 08:12 PM

One of the Most Difficult Tasks in Software Development: Choosing Good Names

Choosing a name for a product or software component that can stand the test of time is often difficult and can often be a source of confusion for users of the software if the its usage outgrows that implied by its name. I have examples from both my personal life and my professional life.

RSS Bandit

When I chose this name I never considered that there might one day be another popular syndication format (i.e. ATOM) which I'd end up supporting. Given that Blogger, Movable Type, and LiveJournal are going to provide ATOM feeds and utilize the ATOM API for weblog editing/management then it is a foregone conclusion that RSS Bandit will support ATOM the specifications are in slightly less flux which should be in the next few months.

One that happens the name "RSS Bandit" will be an anachronism given that RSS will no longer be the only format supported by the application. In fact, the name may become a handicap in the future once ATOM becomes popular because there is the implicit assumption that I support the "old" and "outdated" syndication format not the "shiny" and "new" syndication format.

XPathDocument

In version 1.0 of the .NET Framework we shipped three classes that acted as in-memory representations of an XML document

XmlDocument - an implementation of the W3C Document Object Model (DOM) with a few .NET specific extensions [whose functionality eventually made it into later revisions of the spec]
XmlDataDocument - a subclass of the XmlDocument which acts as an XML view of a DataSet
XPathDocument - a read-only in-memory representation of an XML document which conforms to the XPath data model as opposed to the DOM data model upon which the XmlDocument is based. This class primarily existed as a more performant data source for performing XSLT transformations and XPath queries

Going forward, various limitations of all of the above classes meant that we came up with a fourth class which we planned to introduce in Whidbey. After an internal review we decided that it would be two confusing to add yet another in-memory representation of an XML document to the mix and decided to instead improve on the ones we had. The XmlDataDocument is really a DataSet specific class so it doesn't really fall into this discussion. We were left with the XmlDocument and the XPathDocument. Various aspects of the XmlDocument made it unpalatable for a number of the plans we had in mind such as acting as a strongly typed XML data source and moving away from a tree based DOM model for interacting with XML.

Instead we decided to go forward with the XPathDocument and add a bunch of functionality to it such as adding the ability to bind it to a store and retrieved strongly typed values via integrated support for W3C XML Schema datatyping, change tracking and the write data to it using the XmlWriter.

The primary feedback we've gotten about the new improved XPathDocument from usability studies and WinFX reviews is that there is little chance that anyone who hasn't read our documentation would realize that the XPathDocument is the preferred in-memory representation of an XML document for certain scenarios and not the XmlDocument. In v1.0 we could argue that the class was only of interest to people doing advanced stuff with XPath (or XSLT which is significantly about XPath) but now the name doesn't jibe with its purpose as much. The same goes for the primary mechanism for interacting with the XPathDocument (i.e. the XPathNavigator) which should be the preffered mechanism for representing and passing data as XML in the .NET Framework going forward.

If only I had a time machine and could go back and rename the classes XmlDocument2 and XmlNavigator. :(

Categories: Life in the B0rg Cube | XML

December 17, 2003

@ 04:28 PM

XML 2003 Highlights: Most Interesting Sessions

There were a number of sessions I found particularly interesting either because they presented novel ways to utilize and process XML or because they gave an insightful glance at how others view the XML family of technologies.

Imperative Programming with Rectangles, Triangles, and Circles - Erik Meijer
This was a presentation about a research language called Xen that experiments with various ways to reduce the Relational<->Objects<->XML (ROX) impedance mismatch by adding concepts and operators from the relational and XML (specifically W3C XML Schema) world into an object oriented programming language. The main thesis of the paper was that heavily used APIs and programming idioms eventually tend to be likely candidates for including into the language. An example was given with the foreach operator in the C# language which transformed the following regularly used idiom

IEnumerator e = ((IEnumerable)ts).GetEnumerator(); try { while(e.MoveNext()) { T t = (T)e.Current; t.DoStuff(); } } finally { IDisposable d = e as System.IDisposable; if(d != null) d.Dispose(); }

into

foreach(T t in ts){ t.DoStuff(); }

The majority of the presentation was about XML integration. Erik spent some time talking about the XML to object impedance mismatch and how cumbersome programming with XML could be. Either you wrote a bunch of code for walking trees manually or you queried nodes with XPath but then you are embedding one language into another and don't get type safety, etc (if there is an error in my XPath query I can't tell until runtime). He pointed out that various XML<->object mapping technologies fall short because they either don't map a rich enough set of W3C XML Schema constructs to relevant object structures but even if they did one now looses the power of being able to do rich XPath queries or XSLT/XQuery transformations. The XML integration in Xen basically came in 3 flavors; the ability to initialize classes from XML strings, support for W3C XML Schema constructs like union types and sequences into the language and the ability to do XPath-like queries over the contents fields and properties of a class.

There were also a few other things like adding the constraint "not null" into the language (which would be a handy modifier for parameter names in any language given how often one must check parameters for null in method bodies) and the ability to apply the same method to all the members of a collection which seemed like valuable additions to a programming language independent of XML integration.

Thinking about it I am unsure of the practicality of some features such as being able to initialize objects from an XML literal in the code especially since Xen only supported XML documents with schemas although in some cases I could imagine such an approach being more palatable than using XQuery or XSLT 2.0 for constructing or querying strongly typed XML documents. Also I was suspicious of the usefulness of being able to do wildcard queries (i.e. give me all the fields in class Foo) although this could potentially be used to get the string value of an XML element with mixed content.

The language also had integrated SQL like querying with a "select" operator but I didn't pay much attention to this since I was only really interested in XML.

The meat of this presentation is available online in the paper entitled Programming with Circles, Triangles and Rectangles. The presentation was well received although sparsely attended (about two or three dozen people) and the most noteworthy feedback was that from James Clark who was so impressed he kept saying "I'm speechless" in between asking questions about the language. Sam Ruby was also impressed by the fact that not only was there a presentation but the demo which involved compiling and running various samples showed that this you could implement such a language in the CLR and even integrate it into Visual Studio.

Namespace Routing Language (NRL) - James Clark
This was a presentation for a language for validating a single XML document with multiple schemas simultaenously. This was specifically aimed at validating documents that contained XML from multiple vocabularies (e.g. XML content embedded in a SOAP envelope, RDF embedded in HTML, etc).

The core processing model of NRL is that it divides an XML document into sections each containing elements from a single namespace then each section can be validated using the schema for its namespace. There is no requirement that the same schema language is used so one could validate one part of the document using RELAX NG and use W3C XML Schema for another. There also was the ability to to specify named modes like XSLT which allowed you to match against element names against a particular schema instead of just keying off the namespace name. This functionality could be used to validate interleaved documents (such as XHTML within an XSLT stylesheet) but I suspect that this will be easier said than done in practice.

All in all this was a very interesting talk and introduced some ideas I'd never have considered on my own.

There is a spec for the Namespace Routing Language available online.

Categories: XML

December 16, 2003

@ 05:33 PM

XML 2003 Highlights: Conversations

The XML 2003 conference was a very interesting experience. Compared to the talks at XML 2002 I found the talks at XML 2003 to be of more interest and relevance to me as an developer building applications that utilize XML. The various hallway and lunchtime conversations I had with various people were particularly valueable. Below are the highlights from the various conversations I had with some XML luminaries at lunch and over drinks. Tomorrow I'll post about the various talks I attended.

CONVERSATIONS
James Clark: He gave two excellent presentations, one on his Namespace Routing Language (NRL) and the other about some of implementation techniques used in his nxml-mode for Emacs. I asked whether the fact that he gave no talks about RELAX NG meant that he was no longer interested in the technology. He responded that there wasn't really anything more to do with the language besides shepherd it through the standardization process and evangelization. However given how entrenched support for W3C XML Schema was with major vendors evangelization was an uphill battle.

I pointed out that at Microsoft we use XML schema language technologies for two things;

Describing and enforcing the contract between producers and consumers of XML documents: .
Creating the basis for processing and storing typed data represented as XML documents:

The only widely used XML Schema language that fit the bill for both tasks is W3C XML Schema. However W3C XML Schema is too complex and yet doesn't have enough features for the former and has too many features which introduce complexity for the latter case. In my ideal world, people would use something like RELAX NG for the former and XML-Data Reduced (XDR) for the latter. James asked if I saw value in creating a subset of RELAX NG which also satisfied the latter case but I didn't think that there would be compelling argument for people who've already baked W3C XML Schema into the core of their being (e.g. XQuery, XML Web Services, etc) to find interest in such a subset.

In fact, I pointed out that in designing for Whidbey (next version of the .NET Framework) we originally had designed the architecture to have a pluggable XML type system so that one could potentially generate Post Schema Validation Infosets (PSVI) but realized that this was a case of YAGNI. First of all, only one XML schema language exists that can generate PSVIs so creating a generic architecture makes no sense if there was no other XML schema language that could be plugged in to replace W3C XML Schema. Secondly, one of the major benefits of this approach I had envisioned was that one would be able to plug their own type systems into XQuery. This turned out to be more complicated than I thought because XQuery has W3C XML Schema deeply baked into it and it would take more than genericizing at the PSVI level to make it work (we'd also have to genericize operators, type promotion rules, etc) and once then once all that effort would have been expended any language that could be plugged in would have to act a lot like W3C XML Schema anyway. Basically if some RELAX NG subset suddenly came into existence, it wouldn't add much to that we don't already get from W3C XML Schema (except less complexity but you could get the same from coming up with a subset of W3C XML Schema or following my various W3C XML Schema Best Practices articles on XML.com).

I did think that there would be some value to developers building applications on Microsoft platforms who needed more document validation features than W3C XML Schema in having access to RELAX NG tools. This would be nice to have but isn't a showstopper preventing development of XML applications on Microsoft platforms (translation: Microsoft won't be building such tools in the forseeable future). However if such tools existed I definitely would evangelize them to our users who needed more features than W3C XML Schema provides for their document validation needs.

Sam Ruby: I learned that Sam is on one of "emerging technologies" groups at IBM. Basically he works on stuff that's about to become mainstream in big way and helps them along the way. In the past this has included PHP, Open Source and Java (i.e. the Apache project), XML Web Services and now weblogging technologies. Given his track record I asked him to give me a buzz whenever he finds some new technology to work on. : )

I told him that I felt syndication formats weren't the problem with weblogging technologies and he seemed to agree but pointed out that some of the problems they are trying to solve with ATOM make more sense in the context of using the same format for your blog editing/management API and archival format. There were also the various interpersonal conflicts & psychological baggage which needs to be discarded to move the technology forward and a clean break seems to be the best way. On reflection, I agreed with him.

I did point out that the top 3 problems I'd like to fix in syndication were one click subscription, subscription harmonization and adding calendar events to feeds. I mentioned that I should have RFCs for the first two written up over the holidays but the third is something I haven't thought about hard. Sam pointed out that instead of going the route of coming up with a namespaced extension element to describe calendar events in an RSS feed that perhaps a better option is the ATOM approach that uses link tags. Something like

In fact he seemed to have liked this idea so much it ended up in his presentation.

As Sam and I were finishing our meals, Sam talked about the fact that the effect that blogging has had on his visibility is immense. Before blogging he was well known in tight-knit technical circles such as amongst the members of the Apache project but now he knows people from all over the world working at diverse companies and regularly has people go "Wow, you're Sam Ruby, I read your blog". As he said, this the guy sitting across from us at the table said "Wow, you're Sam Ruby, I read your blog", Sam turned to me and said "See what I mean?"

The power of blogging...

Eve Maler: I spoke to her about a talk I'd seen on UBL given by Eduardo Gutentag and Arofan Gregory where they talked about the benefits of using the polymorphic features of W3C XML Schema to good use in business applications. The specific scenario they described was the following

Imagine a small glue supplier that provides glue to various diverse companies such as a shoe manufacturer, an automobile manufacturer and an office supplies company. This company uses UBL to talk to each of its customers who also use UBL but since the types for describing purchase orders and the like are not specific enough for them they use the type derivation features of W3C XML Schema to create specific types (e.g. a hypothetical LineItem type from UBL is derived to AutomobilePart or ShoeComponent by the various companies). However the small glue company can handle all the new types with the same code if they use type aware processing such as the following path XPath 2.0 or XQuery expression which matches all instances of the LineItem type

element(*, LineItem)

The presenters then pointed out that there could be data loss if one of the customers extended the LineItem type by adding information that was pertinent to their business (e.g. priority, pricing information, prefeerred delivery options, etc) since such code would not know about the extensions.

This seems like a horrible idea and yet another reason why I view all the "object oriented" features of W3C XML Schema with suspicion.

Eve agreed that it probably was a bad idea to recommend that people process XML documents this way then stated that she felt that calling such processing "polymorphic" didn't sit right with her since true polymorphism doesn't require subtype relationships. I agreed and disagreed with her. There are at least four types of polymorphism in programming language parlance and the kind used above is subtype polymorphism. This is just one of the four types of polymorphism (the others being coercion, overloading and parametric polymorphism) but the behavior above is polymorphism. From talking to Eve it seemed that she was more interested in parametric polymorphism because it subtype polymorphism is not a loosely coupled approach. I pointed out that just using XPath expressions to match on predicates could be considered to be parametric polymorphism since you are treating instances similarly even though they are of different types but satisfy the same constraints. I'm not sure she agreed with me. :)

Jon Udell: We discussed the online exchange we had about WinFS types and W3C XML Schema types. He apologized if he seemed to be coming on too strong in his posts and I responded that of the hundreds of articles and blog posts I'd read about the technologies unveiled at the recent Microsoft Professional Developer's Conference (PDC) that I'd only seen two people provide insightful feedback; his was the first and Miguel de Icaza's PDC writeup was the second.

Jon felt that WinFS would be more valuable as an XML database as opposed to an object oriented database (I think the terms he used were "XML store" and "CLR store") especially given his belief that XML enables the "Universal Canvas". I agreed with him but pointed out that Microsoft isn't a single entity and even though some parts may think that XML is one step closer to giving us a universal data interchange format and thus universal data access which there are others who see XML as "that format you use for config files" and express incredulity when they here about things like XQuery because they wonder why anyone would need a query language for their config files. :)

Reading Jon's blog post about Word 11, XML and the Universal Canvas it seems he's been anticipating a unified XML storage model for a while which explains his disappointment that the WinFS unveiled at PDC was not it.

He also thought that the fact that so many people at Microsoft were blogging was fantastic.

Categories: XML

December 12, 2003

@ 12:19 PM

Initial Impressions from XML 2003

Today is the last day of the XML 2003 conference. So far it's been a pleasant experience.

XML IN THE REAL WORLD

Attendance at the conference was much lower than last year. Considering that last year Microsoft announced Office 2003 at the conference while this year there was no such major event, this is no surprise. I suspect another reason is that XML is no longer new and is now so low down in the stack that a conference dedicated to just XML is no longer that interesting. Of course, this is only my second conference so this level of attendance may be typical from previous years and I may have just witnessed an abnormality last year.

Like last year, the conference seemed targetted mostly at the ex-SGML crowd (or document-centric XML users) although this time there wasn't the significant focus on Semantic Web technologies such as topic maps that I saw last year. I did learn a new buzzword around Semantic Web technologies, Semantic Integration and found out that there are companies selling products that claim to do what until this point I'd assumed was mostly theoretical. I tried to ask one such vendor how they deal with some of the issues with non-trivial transformation such as the pubDate vs. dc:date example from a previous post but he glossed over details but implied that besides using ontologies to map between vocabularies they allowed people to inject code where it was needed. This seems to confirm my suspicions that in the real world you end up either using XSLT or reinventing XSLT to perform transformations between XML vocabularies.

From looking at the conference schedule, it is interesting to note that some XML technologies got a lot less coverage in the conference relative to how much discussion they cause in the news or blogosphere. For example, I didn't see any sessions on RSS although there is one by Sam Ruby on Atom scheduled for later this morning. Also there didn't seem to be much about XML Web Service technologies being produced by the major vendors such as IBM, BEA or Microsoft. I can't tell if this is because there was no interest in submitting such sessions or whether the folks who picked the sessions didn't find these technologies interesting. Based on the fact that a number of the folks who had "Reviewer" on their conference badge were from the old school SGML crowd I suspect the latter. There definitely seemed to be disconnect between the technologies covered during the conference and how XML is used in the real world in a number of cases.

MEETING XML GEEKS

I've gotten to chat with a number of people I've exchanged mail with but never met including Tim Bray, Jon Udell, Sean McGrath, Norm Walsh and Betty Harvey. I also got to talk to a couple of folks I met last year like Rick Jellife, Sam Ruby, Simon St. Laurent, Mike Champion and James Clark. Most of the hanging out occurred at the soiree at Tim and Lauren's. As Tim mentions in his blog post there were a couple of "Wow, you're Dare?" or 'Wow, you're Sean Mcgrath?" through out the evening. The coolest part of that evening was that I got to meet Eve Maler who I was all star struck about meeting since I'd been seeing her name crop up as being one of the Über-XML geeks at Sun Microsystems since I was a programming welp back in college and I'm there gushing "Wow, you're Eve Maler" and she was like "Oh you're Dare? I read your articles, they're pretty good". Sweet. Since Eve worked at Sun I intended to give her some light-hearted flack over a presentation entitled UBL and Object-Oriented XML: Making Type-Aware Systems Work which was spreading the notion that the relying on the "object oriented" features of W3C XML Schema was a good idea then it turned out that she agreed with me. Methinks another W3C XML Schema article on XML.com could be spawned from this. Hmmmm.

Categories: XML

December 9, 2003

@ 08:41 PM

RESTful XML Web Services vs. RPC-style XML Web Services

Jeremy Zawodney writes

The News RSS Feeds are great if you want to follow a particular category of news. For example, you might want to read the latest Sports (RSS) or Entertainment (RSS) news in your aggregator. But what if you'd like an RSS News feed generated just for you? One based on a word or phrase that you could supply?
...
For example, if you'd like to follow all the news that mentions Microsoft, you can do that. Just subscribe to this url. And if you want to find news that mentions Microsoft in a financial context, use Microsoft's stock ticker (MSFT) as the search parameter like this.

Compare this to how you'd programmatically do the same thing with Google using the Google Web API which utilizes SOAP & WSDL. Depending on whether you have the right toolkit or not, the Google Web API ease either much simpler or much harder to program against than the Yahoo RSS based search. With the Yahoo RSS based search, a programmer has to directly deal with HTTP and XML when programming against it while with the Google API and the appropriate XML Web Service tools this is all hidden behind the scenes and for the most part the developer programs directly against objects that represent the Google API without dealing directly with XML or HTTP. For example, see this example of talking to the Google API from PHP. Without using appropriate XML Web Service tools, the Google API is more complex to program against than the Yahoo RSS search because one now has to deal with sending and receiving SOAP requests not just regular HTTP GETs. However there are a number of freely available XML Web Service toolsets available so there should be no reason to program against the Google API directly.

This being said there are a number of benefits to the URI-based (i.e RESTful) search that Yahoo provides which comes from being a part of the Web architecture.

I can bookmark a Yahoo RSS search or send a link to it in an email. I can't do the same with an RPC-style SOAP API.
Intermediaries between my machine and Google are unlikely to cache the results of a search made via the Google API since it uses HTTP POST but could cache requests that use the Yahoo RSS-based search since it uses HTTP GET. This improves the scalability of the Yahoo RSS-based search without any explicit work from myself or Yahoo, this is just from utilizing the benefits of the Web architecture.

The above contrast of the differing techniques for returning search results as XML used by Yahoo and Google is a good way to compare and contrast RESTful XML Web Services to RPC-based XML Web Services and understand why some people believe strongly [perhaps too strongly] that XML Web Services should be RESTful not RPC-based.

By the way, I noticed that Adam Bosworth is trying to come to grips with REST which should lead to some interesting discussion for those who are interested in the RESTful XML Web Services vs. RPC-based XML Web Services debate.

Categories: XML

December 7, 2003

@ 03:28 AM

One-Click Subscription to RSS and ATOM Feeds

Shelley Powers writes

For instance, The W3C TAG team -- that's the team that's defining the architecture of the web, not a new wrestling group -- has been talking about defining a new URI scheme just for RSS, as brought up today by Tim Bray. With a new scheme, instead of accessing a feed with:

http://weblog.burningbird.net/index.rdf

You would access the feed as:

feed://www.tbray.org/ongoing/ongoing.rss

I've been trying to avoid blogging about this discussion since I'll be leaving for Philly to attend the XML 2003 conference in a few hours and won't be able to participate in any debate. However since it seems some folks have started blogging about this topic and there some misconceptions in their posts I've thrown my hat in the ring.

The first thing I want to point is that although Shelley is correct that some discussion about this has happened on the W3C Technical Architecture Group's mailing list they are not proposing a new URI scheme. Tim Bray was simply reporting on current practices in the RSS world that I mentioned in passing on the atom-syntax list.

THE PROBLEM
The problem statement is "How does a user go to a website such as http://news.yahoo.com or http://www.slashdot.org, who'd like to subscribe to information from these sites in a client-side news aggregator do so in a quick and painless manner?". The current process is to click on an icon (most likely an orange button with the white letters 'XML' on it) that represents an RSS feed, copy the URL from the browser address bar, fire up your RSS client and click on the subscribe dialog (if it has one).

This is lot of steps and many attempts have been made to collapse this into one step (click link and the right dialog pops up).

PREVIOUS SOLUTIONS TO THE PROBLEM
The oldest one I am aware of was pioneered by Dave Winer and involved client side aggregators listening on a port on the local machine and a hyperlink on the website linking to a URL of the form http://127.0.0.1:5335/system/pages/subscriptions. This technique is used by every Radio Userland weblog and is even used by dasBlog which is my blogging tool of choice as is evidenced by clicking on the icon with a picture of a coffee mug and the letters "XML" on it at bottom of my weblog.

There are two problems with this approach. The first is the security issue brought on by the fact that you have a news aggregator listening on a socket on your local machine which could lead to hack attempts if a security exploit is found on in your news aggregator of choice, however this can be mitigated by firewalls and so far thus hasn't been a problem. The second is that if one has multiple aggregators installed there is contention for which one should listen on that port. For this reason different aggregators listen on different local ports; Radio listens on port 5335, AmphetaDesk listens on port 8888, Awasu listens on port 2604, nntp//rss listens on port 7810 and so on.

An alternative solution was chosen by various other aggregator authors in which hyperlinks pointed to the URLs of RSS feeds with the crucial distinction that the http:// part of the URL was substituted with a custom URI scheme. Since most modern browser have a mechanism for handing off unknown URI schemes to other client applications this also allows "one-click feed subscription". Here also there is variance amongst news aggregators; Vox Lite, RSS Bandit & SharpReader support the feed:// URI scheme, WinRSS supports the rss:// URI scheme while NewsMonster supports the news-monster:// scheme.

With all this varying approaches, it means that any website that wants to provide a link that allows one click subscription to an RSS feed needs to support almost a dozen different techniques and thus create a dozen different hyperlinks on their site. This isn't an exaggeration, this is exactly what Feedster when one wants to subscribe to the results of a search. If memory serves correcly, Feedster uses the QuickSub javascript module to present these dozen links in a drop down list.

THE FURORE
The recent debate on both the atom-syntax and the www-tag mailing lists focuses on the feed:// URI proposal and it's lack of adherence to guidelines set in the current draft of Architecture of the World Wide Web document being produced by the W3C Technical Architecure Group. This document is an attempt to document the architecture of the World Wide Web ex post facto.

Specifically the debate hinges on the guideline that states

Authors of specifications SHOULD NOT introduce a new URI scheme when an existing scheme provides the desired properties of identifiers and their relation to resources.
...
If the motivation behind registering a new scheme is to allow an agent to launch a particular application when retrieving a representation, such dispatching can be accomplished at lower expense by registering a new Internet Media Type instead. Deployed software is more likely to handle the introduction of a new media type than the introduction of a new URI scheme.

The use of unregistered URI schemes is discouraged for a number of reasons:

There is no generally accepted way to locate the scheme specification.
Someone else may be using the scheme for other purposes.
One should not expect that general-purpose software will do anything useful with URIs of this scheme; the network effect is lost.

The above excerpt assumes that web browsers on the World Wide Web are more likely to know how to deal with unknown Internet Media Types than unknown URI schemes which is in fact the case. For example, Internet Explorer will fallback to using the file extension of the file if it doesn't know how to deal with the provided MIME type (see MIME Type Detection in Internet Explorer for more details). However there are several problems with using MIME types for one click feed subscription that do not exist in the previously highlighted approaches.

Greg Reinacker detailed them in hist post RSS and MIME types a few months ago.

Problem 1: [severity: deal-breaker] In order to serve up a file with a specific MIME type, you need to make some changes in your web server configuration. There are a LOT of people out there (shared hosting, anyone?) who don't have this capability. We have to cater to the masses, people - we're trying to drive adoption of this technology.

Problem 1a: [severity: annoyance] There are even more people who wouldn't know a MIME type from a hole in the head. If Joe user figures out that he can build a XML file with notepad that contains his RSS data (and it's being done more often than you think), and upload it to his web site, you'd think that'd be enough. Sorry, Joe, you need to change the MIME type too. The what?

Problem 2: [severity: deal-breaker] If you register a handler for a MIME type, the handler gets the contents of the file, rather than the URL. This is great if you're a media player or whatever. However, with RSS, the client tool needs the URL of the RSS file, not the actual contents of the RSS file. Well, it needs the contents too, but it needs the URL so it can poll the file for changes later. This means that the file that's actually registered with a new MIME type would have to be some kind of intermediate file, a "discovery" file if you will. So now, not only would Joe user have to learn about MIME types, but he'd have to create another discovery file as well.

Many people in the MIME type camp have pointed out that problem two can be circumvented by having the feed file contain it's location. Although this seems a tad redudundant and may be prone to breakage if the website is reorganized it probably should work for the most part. However there is at least one other problem with using MIME types which people have glossed over.

Problem 3: If clicking on a link to an RSS feed in your browser always invokes your aggregator's feed subscription dialog then this means you can't view an RSS feed in your browser if you have a client aggregator installed and may not be able to view even if you don't because your browser of choice may not know how to handle the MIME type if it isn't something like text/xml.

At least one person, Tim Bray, doesn't see this as a big deal and in fact stated, "why not? Every time I click on a link to a PDF doc I get a PDF reader. Every time I click on a .mov file I get Quicktime. Seems like a sensible default behavior".

THE BOTTOM LINE
Using MIME types to solve the one click subscription problem is more diffficult for weblog tools to implement than the other two approaches favored by news aggregators and requires changing web server configurations as well while the other approaches do not. Although the architecure astronauts will rail against the URI scheme based approach it is unlikely that anyone who looks dispassionately at all three approaches will choose to use MIME types to solve this problem.

Of course, since one of the main forces behind the ATOM movement has stated that MIME types will be the mechanism used for performing one click subscription to ATOM feeds this just seems like one more reason for me to be skeptical about the benefits of adopting the ATOM syndication format.

Categories: XML

December 3, 2003

@ 09:55 PM

My Love Affair With EXSLT Continues

My latest column is up on MSDN, Extreme XML: EXSLT Meets XPath.

Categories: XML

December 2, 2003

@ 10:36 PM

The ATOM API vs. the ATOM Syndication Format

Robert Scoble wrote

I see over on Evan Williams site that it looks like Google (er, Blogger, which is owned by Google) is going to support Atom. So far Microsoft has been supporting RSS 2.0 (we've spit out RSS 2.0 on MSDN, on the PDC app, on MyWallop, and in a few other places). Atom is a syndication format that's similar, but slightly different from RSS. I wonder how the market will shake out now.

Evan: can you explain, in layman's terms, why you support Atom and not RSS?

This question is misleading. There are two parts to ATOM that are being discussed by Google, the ATOM API and the ATOM syndication format. The ATOM API is competitive with technologies like the Blogger API, MetaWeblog API and the LiveJournal API while the ATOM syndication format competes with technologies like RSS 1.0 and RSS 2.0.

There has been enough written about the history of feed syndication formats named "RSS" so I'll skip that discussion and move directly to discussing the history of weblog posting APIs.

The Blogger API was originally developed by Blogger (now owned by Google) as a way of allowing client applications to talk to blogger weblogs (using client applications such as w.bloggar). This API was later adopted by other blogging tools such as Radio Userland. However Dave Winer decided he didn't like some of the perceived deficiencies in the Blogger API and forked it thus creating the MetaWeblog API. Later on the Blogger folks came out with version 2.0 of the Blogger API which led to online war of words with Dave Winer because he felt they should use his forked version instead even though his version removed functionality that was crucial to Blogger. Eventually Blogger backed off from implementing v2.0 of thier API and has been waiting for an alternative which presented itself in the ATOM API. Most of this history is available from Evan Williams's blog post entitled the Tragedy of the API.

<update source="Dave Winer" >

ManilaRPC came first, way before all the others you mention. It was an XML-RPC then SOAP-based API for driving Manila, and is still in use today, and is much deeper than any of the other APIs.
The MetaWeblog API addressed a very well-known deficiency in the Blogger API, no support for titles. You neglected to mention that it was broadly supported by tools and blogging systems, by everyone except Blogger.

</update>

The ATOM effort is aimed at replacing both the popular syndication formats and the popular weblog publishing APIs. Both of which have been burdened with histores full of turbulent turf battles and personal recriminations.

Based on my experiences working with syndication software as a hobbyist developer for the past year is that the ATOM syndication format does not offer much (if anything) over RSS 2.0 but that the ATOM API looks to be a significant step forward compared to previous attempts at weblog editting/management APIs especially with regard to extensibility, support for modern practices around service oriented architecture, and security. The problem is that if one implements the ATOM API it makes sense that since this API uses the feed syndication format as the payload of the messages sent between the client and the server then one should also implement the ATOM syndication format. This is probably why Blogger/Google will support both the ATOM API and the ATOM syndication format.

I personally tend to agree with Don Park's proposal

IMHO, the most practical path out of this mess is for the Atom initiative to hi-jack RSS 2.0 and build on it without breaking backward compatibility. A new spec will obviously have to be written to avoid copyright problems with Dave's version of the RSS 2.0 spec, but people were complaining about the old spec anyway.

As to the Atom API, I won't bitch about it any more if RSS 2.0 is adopted as the core Atom feed format because the feed format is far more important than the API. This should satisfy Evan Williams since his real beef is with the API. Yes, there are some issues people have with RSS 2.0 but they can be ignored or worked-around with extensions until later, hopefully much later.

This compromise will give the best of all world's to users. There is no discontinuity in syndication formats yet blog editting APIs are improved and brought in line with 21st century practices. I've mentioned this on the atom-syntax mailing list in the past but the idea seemed to receive a cold reception.

Regardless of what ends up happening, the ATOM API is best poised to be the future of weblog editting APIs. The ATOM syndication format on the other hand...

Categories: XML

November 20, 2003

@ 09:39 PM

Misinformation on XML.com: Microsoft and Binary XML

In his recent article entitled Binary Killed the XML Star? Kendall Clark writes

Many XML proponents and users came out of various binary exchange and format camps, and they are very unwilling to return to what were for them, or so it would seem, dark days. In this case, however, given the real power of those who most seem to want a binary variant -- including Sun, IBM, and Microsoft -- they may have to adopt a carefully tactical plan to limit the damage, rather than preventing the fight completely.

This claim by Kendall Clark seems to contradict the conclusions in the postion papers provided by both Microsoft and IBM at the The W3C Workshop on Binary Interchange of XML Information Item Sets.

IBM's position paper concludes with

IBM believes that wherever possible, implementations of the existing XML 1.x Recommendation should be optimized to meet the needs of customers. While we expect to see non-standard binary forms used internally within certain vendors’ implementations, including perhaps our own, we are not yet convinced that there is justification to standardize an interchange format other than XML 1.x. We thus believe that it would be premature for the W3C to launch a formal workgroup, or to recharter an existing group, to develop a Binary XML Recommendation

Microsoft's position paper concludes with

For different classes of applications, the criterion (minimize footprint or minimize parse/generate time) for the binary representation is different and often conflicting. There is no single criterion that optimizes all applications. Consequently, a binary standard could result in a suite of allowable representations that clients and servers must be prepared to receive and process. This is a retrograde step from the portability goals of XML 1.0. Furthermore, the optimal binary representation depends on the machine and OS architectures on each end — translating between binary representations negates much of the advantages that binary XML has over text.

Besides the position paper from Microsoft there've have been many comments both in Weblogs and mailing lists from Microsoft people against this movement for a standardized binary XML format (oxymoron that it is). There have been weblog posts by myself, Joshua Allen and Omri Gazitt (all of whom work on XML technologies at Microsoft) decrying the movement towards binary XML and thus potential fragmentation of the XML world.

There have also been a number of posts by Microsoft employees against standardized binary XML on mailing list such as XML-DEV some of which have been quoted on Elliotte Rusty Harold's Cafe con Leche XML News website

I fear that splitting the interop story of XML into a textual and Infoset-based/binary representation, we are going to get the "divide and conquer" effect that in the end will make XML just another ASN.1: a niche model that does not deliver the interop it promises and we will be back to lock-in.

--Michael Rys on the xml-dev mailing list, Tue, 18 Nov 2003

XML has succeeded in large part because it is text and because it is perceived as "the obvious choice" to many people. The world was a lot different before XML came around, when people had to choose between a dizzying array of binary and text syntaxes (including ASN.1). Anyone who tries to complicate and fragment this serendipitous development is, IMO, insane.

--Joshua Allen on the xml-dev mailing list, Tue, 18 Nov 2003

Unfortunately, it seems that Kendall Clark must have missed the various discussions, weblog posts and the position paper where Microsoft's view of the importance of textual XML 1.0 were put forth.

Categories: XML

November 18, 2003

@ 04:57 PM

XQuery is a Better XPath not a Better XSLT

Elliotte Rusty Harold writes

In XSLT 1.0 all output is XML. A transformation creates a result tree, which can always be serialized as either an XML document or a well-formed document fragment. In XSLT 2.0 and XQuery the output is not a result tree. Rather, it is a sequence. This sequence may contain XML; but it can also contain atomic values such as ints, doubles, gYears, dates, hexBinaries, and more; and there's no obvious or unique serialization for these things. For instance, what exactly are you supposed to do with an XQuery that generates a sequence containing a date, a document node, an int, and a parentless attribute? How do you serialize this construct? That a sequence has no particular connection to an XML document was very troubling to many attendees.

Looking at it now, I'm seeing that perhaps the flaw is in thinking of XQuery as like XSLT; that is, a tool to produce an XML document. It's not. It's a tool for producing collections of XML documents, XML nodes, and other non-XML things like ints. (I probably should have said it that way last night.) However, the specification does not define any concrete serialization or API for accessing and representing these non-XML collections. That's a pretty big hole left to implementers to fill.

The main benefits of XQuery are as a better way to retrieve data from one or more XML documents than previous methods (i.e. a better XPath) not as a way to transform one XML structure to another (i.e. XSLT). I assume that if Elliotte Rusty Harold isn't familiar with APIs that provided XPath as a standalone language such as the .NET Framework's XPathNavigator, the Oracle XDK, or Jaxen since all of these provide a way to get atomic values (number, string, or boolean) as well as nodes from querying an XML document.

Similrly, there is no well defined way to serialize the results of performing an arbitrary XPath on an XML document. The tough parts for implementers aren't atomic values or XML fragments as Elliotte Rusty Harrold describes both more mundane things like attribute values. For instance consider the following document

<test xmlns:e="http://www.example.com" e:id="1" />

queried using the following XPath expression

/*/@*[1]

which returns the first attribute of the document element. How would one serialize these results? There are a bunch of options such as

e:id="1"
{http://www.example.com}id="1"
@e:id="1"
{xmlns:e="http://www.example.com"}id="1"

All of which I could argue are valid serializations of the attribute node returned by that query. By the way, the .NET Framework uses the first serialization if one calls XmlNode.OuterXml on the XmlAttribute object returned by executing that query on an XmlDocument object.

So what's my point? That the situation Elliotte Rusty Harrold bemoans as being unique to XQuery has always existed with XPath. Even more, as Oleg Tkachenko points out there is an XSLT 2.0 and XQuery 1.0 Serialization draft recommendation which specifies how to serialize instances of the XPath 2.0/XQuery data model which even resolves the question about how one would serialize the results of the query above

It is a serialization error if an item in the sequence is an attribute node or a namespace node.

Short answer, you can't.

Categories: XML

November 17, 2003

@ 04:10 PM

Open and Royalty-Free License For Office 2003 XML Reference Schemas

From Microsoft Announces Availability of Open and Royalty-Free License For Office 2003 XML Reference Schemas

Microsoft Corp. today announced the availability of a royalty-free licensing program for its Microsoft® Office 2003 XML Reference Schemas and accompanying documentation. ... Microsoft's new Office 2003 versions of Word, Excel and the InfoPath (TM) information-gathering program utilize schemas that describe how information is stored when documents are saved as XML....

To ensure broad availability and access, Microsoft is offering the royalty-free license using XML Schema Definitions (XSDs), the cross-industry standard developed by the W3C. The license provides access to the schemas and full documentation to interested parties and is designed for ease of use and adoption. The Microsoft Office 2003 XML Reference Schemas include WordprocessingML (Microsoft Office Word 2003), SpreadsheetML (Microsoft Office Excel 2003) and FormTemplate XML schemas (Microsoft Office InfoPath 2003).

The biggest gripe when Office 2003's XML support was announced was that the schemas for WordprocessingML (aka WordML) and co. were proprietary. This was reported in a number of fora including Slashdot & C|Net news. I wonder how many will carry the announcements that these schemas are available for all to peruse and reuse in a royalty free manner?

Update: On C|Net news: Microsoft pries open Office 2003

Update2: On Slashdot: Microsoft Word Document ML Schemas Published

Categories: XML

November 17, 2003

@ 06:32 AM

XAML Passes the XML Litmus Test

George Mladenov asked

Why does XAML need to be (well-formed) XML in the first place?

To which Rob Relyea responds with the following reasons

1. Without extra work from the vendors involved, we’d like all XML editors be able to work with XAML.

2. We’d like transformations (XSLT, other) be able to move content to/from XAML.

3. We didn’t want to write our own file parsing code, the parser code we do have is built on top of System.XML.XmlTextReader. We are able to focus on our value add.

Thus it looks like XAML's use of XML passes the XML Litmus Test, specifically

Using XML for a software development project buys you two things (a) the ability to interoperate better with others and (b) a number of off-the-shelf tools for dealing with format. If neither of these things apply to a given situation then it doesn't make much sense to use XML.

However there are tradeoffs to using XML, some of which Rob points out. They are listed below with some of my opinions

1. We want to enable setting the Background property on a Button (for example) in one of two ways:

a. Using a normal attribute - <Button Background=”Red”>Click Here</Button>

b. Using our compound property syntax –

...

c. Ideally if somebody tried to use both syntaxes at the same time, we could error. XML Schema – as far as I am aware – isn’t well equipped to describe that behavior.

Being the PM for W3C XML Schema technologies in the .NET Framework means I get to see variations of this request regularly. This feature is typically called co-occurence constraints and is lacking in W3C XML Schema but is supported by other XML schema languages like RELAX NG and can be added to W3C XML Schema using Schematron annotations. Given the existing complexity of W3C XML Schema's conflicting design goals (validation language vs. type system) and contradictory rules I for one am glad this feature doesn't exist in the language.

However this means that users who want to describe their schemas using W3C XML Schema need to face the fact that not all the constraints of their vocabulary can be expressed in a schema which is always the case it's just that some constraints seem significant enough to be in the schema while others are OK being checked in code during "business logic processing". In such cases there are basically 3 choices (i) try to come as close as possible to describing the content model in the schema which sometimes may lead to what us language lawyers like to call "gross hacks" (ii) use an alternate XML schema language or extend the W3C XML Schema language in some way or (iii) live with the fact that some constraints won't be describable in the schema.

It is a point of note that althogh the W3C XML Schema recommendation contains what seems like a schema for Schema (sForS) (i.e. the rules of W3C XML Schema are themselves described as a schema) this is in fact not the case. The schema in the spec, although normative is invalid and even if it was valid still does not come close to rigidly specifying all the rules of W3C XML Schema. The way I look at it is simple, if the W3C XML Schema working group couldn't come up with a way to fully describe an XML vocabulary using XML Schema then the average vocabulary designer shouldn't be bothered if they can't either.

2. It is a bit strange, for designers or developers moving from HTML to XML. HTML is forgiving. XML isn’t. Should we shy away from XML so that people don't have to quotes around things? I think not.

Having to put quotes around everything isn't the biggest pain in the transition from HTML to XML, and after a while it comes naturally. A bigger pain is dealing with ensuring that nested tags are properly closed and I'm glad I found James Clark's nxml-mode for Emacs which has helped a lot with this. The XML Editor in the next version of Visual Studio should also be similarly helpful in this regard.

The lack of the HTML predefined entities is also a bit of culture shock when moving to XML from HTML, and one some consider a serious bug with XML, I tend to disagree.

3. It is difficult to keep XAML as a human readable/writable markup, as that isn’t one of XML’s goals. I think it needs to be one of XAML’s goals. It is a continual balancing act.

Actually one of the main goals of XML is to be human-readable, at least as human readable as HTML was since it was intended to replace HTML in the beginning. There's a quick history lesson in my SGML on the Web: A Failed Dream? post from earlier last month.

Categories: XML

November 14, 2003

@ 04:22 PM

Is XML About Text or Not?

Fumiaki Yoshimatsu writes

Why does someone still think that they have to write Unicode BOMs by themselves, digging deep inside XmlTextWriter.BaseStream and UnicodeEncoding.GetPreamble? Encoding hint in the XML declarations and Unicode BOMs are all about XML 1.0 thing, but WriteStartElement and WriteStartDocument are not. They are InfoSet thing, so they do not have anything to do with the serialization format. Think about XmlNodeWriter for example. Why does XmlNodeWriter NOT have any constructor that have a parameter of type Encoding? Why does it always call XmlDocument.CreateXmlDeclaration with null as the second argument?

This is a common point of confusion for users of XML in the CLR. XmlNodeWriter doesn't have a parameter of type Encoding because it writes to an XmlDocument which is stored in memory and all strings in the CLR are in UTF-16 encoding. Setting the encoding only matters when saving the XmlDocument to a stream. As for having to dig into XmlTextWriter.BaseStream to set the encoding, I find this weird considering that the XmlTextWriter constructor has a number of ways to specifying the encoding on instantiating an instance of the class. Since XML 1.0 mandates that an XML document can only have one encoding there is no reason for methods like WriteStartElement and WriteStartDocument to concern themselves with encoding issues.

If you really want to dive deep into issues involving specifying the encoding of XML documents and the CLR take a look at this discussion in Robert McLaws's weblog.

PS: One of my pet peeves is the way people misuse the term XML infoset to mean "things in XML I don't care about" even though there is a precise definitition (nay an entire spec) that describes what it means. The document information item clearly has a [character encoding scheme] property which means character encodings are an XML infoset thing.

Categories: XML

November 13, 2003

@ 01:29 AM

When You Have A Hammer Everything Looks Like A Nail

Oleg Tkachenko writes

Just found new beast in the Longhorn SDK documentation - OPath language:
The OPath language is the query language used to query for objects using an ObjectSpace. The syntax of OPath also allows you to query for objects using standard object oriented syntax. OPath enables you to traverse object relationships in a query as you would with standard object oriented application code and includes several operators for complex value comparisons.

Orders[Freight > 5].Details.Quantity > 50 OPath expression should remind you something familiar. Object-oriented XPath cross-breeded with SQL? Hmm, xml-dev flamers would love it.

The approach seems to be exactly opposite to ObjectXPathNavigator's one - instead of representing object graphs in XPathNavigable form, brand new query language is invented to fit the data model. Actually that makes some sense, XPath as XML-oriented query language can't fit all. I wonder what Dare think about it. More studying is needed, but as for me (note I'm not DBMS-oriented guy though) it's too crude yet

Oleg is right that an XML oriented query language like XPath doesn't fit for querying objects. There is a definitely an impedance mismatch between XML and objects, a good number of which were pointed out by Erik Meijer in his paper Programming with Circles, Triangles and Rectangles. A significant number of constructs and semantics of XPath simply don't make sense in a language designed to query objects. The primary construct in XPath is the location step which consists of an axis, a node test and zero or more predicates, of which both the axis and the node test are out of place in an object query language.

From the XPath Grammar, there are 13 axes of which almost none make sense for objects besides self. They are listed below

[6]	AxisName	::=	'ancestor'
			\| 'ancestor-or-self'
			\| 'attribute'
			\| 'child'
			\| 'descendant'
			\| 'descendant-or-self'
			\| 'following'
			\| 'following-sibling'
			\| 'namespace'
			\| 'parent'
			\| 'preceding'
			\| 'preceding-sibling'
			\| 'self'

The ones related to document order such as preceding, following, preceding-sibling and following-siblings don't really apply to objects since there is no concept of order amongst the properties and fields of a class. The attribute axis is similarly unrelated since there is no equivalent of the distinction between elements and attributes among the fields and properties of a class.

The axes related to document hierarchy such as parent, child, ancestor, descendent, etc look like they may make sense to map to object oriented concepts until one asks what exactly is meant to be the parent of an object? Is it the base class or the object to which the current object belongs as a field or property? Most would respond that it is the latter. However what happens when multiple objects have the same object as a field which is often the case since objects structures are graph-like not tree-like as XML structures? It also gets tricky when an object that is a field in one class is a member of a collection in another class. Is the object a child of the collection? If so what is the parent of the object, if not what is the relationship of the object to the collection then? The questions can go on...

On the surface the namespace axes sounds like it could map to concepts from object oriented programming since languages like C#, C++ and Java all have a concept of a "namespace". However namespace nodes in the XPath data model have distinct characteristics (such as the fact that each element node in document has a distinct set of namespace nodes regardless of whether each of these namespace nodes represent the same mapping of a prefix to a namespace URI).

A similar argument can also be made around node tests which are the second primary constructs in XPath location steps. A node test either specifies a name or a type of node to match. A number of XPath node types don't have equivelants in the object oriented world such as comment and processing instruction nodes. Other nodes such as text and element nodes are problematic when one begins to try to tie them in to the various axes such as the parent axis.

Basically, a significant amount of XPath is not really applicable to querying objects without changing the semantics of certain aspects of the language in a way that conflicts with how XPath is used when querying XML documents.

As for how this compares to my advocacy of XML to object mapping techniques such as the ObjectXPathNavigator, the answer is simple; XML is the universal data interchange format and the software world is moving to a situation where all the major sources of important data can be accessed or viewed as XML from office documents to network messages to information locked within databases. It makes sense then that in creating this universal data access layer that one create a way for all interesting sources of data to be viewed as XML so they to can participate as input for data aggregation technologies such as XSLT or XQuery and enable the reuse of XML technologies for processing and manipulating them.

Categories: Life in the B0rg Cube | XML

November 11, 2003

@ 11:10 PM

Curiosity Killed the Cat

I noticed the followingRDF Interest Group IRC chat log discussing my recent post More on RDF, The Semantic Web and Perpetual Motion Machines in my referrer logs. I found the following excerpts quite illuminating

15:43:42 <f8dy> is owl rich enough to be able to say that my <pubDate>Tue, Nov 11, 2003</pubDate> is the same as your <dc:date>2003-11-11</dc:date>

15:44:35 <swh> shellac: I believe that XML datatypes are...

...

16:08:15 <f8dy> that vocabulary also uses dates, but it stores them in rfc822 format

16:08:51 <f8dy> 1. how do i programmatically determine this?

16:08:58 <JHendler> ah, but you cannot merge graphs on things without the same URI, unless you have some other way to do it

16:09:02 <f8dy> 2. how do i programmatically convert them to a format i understand?

...

16:09:40 <shellac> 1. use

...

16:10:13 <shellac> 1. use a xsd library

16:10:32 <shellac> 2. use an xsd library

...

16:11:08 <JHendler> n. use an xsd library :->

16:11:30 <shellac> the graph merge won't magically merge that information, true

16:11:34 <JHendler> F: one of my old advisors used to say the only thing better than a strong advocate is a weak critic

This argument cements my suspicions that the using RDF and Semantic Web technologies are a losing proposition when compared to using XML-centric technologies for information interchange on the World Wide Web. It is quite telling that none of the participants who tried to counter my arguments gave a cogent response besides "use an xsd library" when in fact anyone with a passing knowledge of XSD would inform them that XSD only supports ISO 8601 dates and would barf on RFC 822 if asked to treat them as dates. In fact, this is a common complaint about them from our customers w.r.t internationalization [that and the fact decimals use a period as a delimiter instead of a comma for fractional digits].

Even in this simple case of mapping equivalent elements (dc:date and pubDate) the Semantic Web advocates cannot provide a solution to how their vaunted ontolgies can provide a solution to a problem the average RSS aggregator author solves in about 5 minutes of coding using off-the-shelf XML tools. It is easy to say philosphically that dc:date and pubDate after all, they are both dates, but another thing to write code that knows how to treat them uniformly. I am quite surprised that such a straightforward real-world example cannot be handled by Semantic Web technologies. Clay Shirky's The Semantic Web, Syllogism, and Worldview makes even more sense now.

One of my co-workers recently called RDF an academic plaything, after seeing how many of its advocates ignore the difficult real world problems faced by software developers and computer users today while pretending that obtuse solution to trivial problems are important, I've definitely lost any interest I had left in investigating any further about the Semantic Web.

Categories: XML

November 11, 2003

@ 02:03 PM

More on RDF, The Semantic Web and Perpetual Motion Machines

My post from yesterday garnered a couple of responses from the RDF crowd who questioned the viability of the approaches I described. Below I take a look at some of their arguments and relate them to practical examples of exchanging information using XML I have encountered in my regular development cycle.

Shelley Powers writes

One last thing: I wanted to also comment on Dare Obasanjo's post on this issue. Dare is saying that we don't need RDF because we can use transforms between different data models; that way everyone can use their own XML vocabulary. This sounds good in principle, but from previous experience I've had with this type of effort in the past, this is not as trivial as it sounds. By not using an agreed on model, not only do you now have to sit down and work out an agreement as to differences in data, you also have to work out the differences in the data model, too. In other words -- you either pay upfront, once; or you keep paying in the end, again and again. Now, what was that about a Perpetual Motion Machine, Dare?

In responding to Shelley's post it is easier for me if I use a concrete example. RSS Bandit uses a custom format that I came up with for describing a user's list of subscribed feeds. However in the wild, other news aggregators us differing formats such as OPML and OCS. To ensure that users who've used other aggregators can try out RSS Bandit without having to manually enter all their feeds I support importing feed subscription lists in both the OPML and OCS format even though this is distinct from the format and data model I use internally. This importation is done by applying an XSLT to the input OPML or OCS file to convert it to my internal format then converting that XML into the RSS Bandit object model. The stylesheets took me about 15 to 30 minutes to write for each one. This is the XML-based solution.

Folks like Shelley believe my problem could be better solved by RDF and other Semantic Web technologies. For example, if my internal format was RDF/XML and I was trying to import an RDF-based format such as OCS then instead of using a language like XSLT that performs a syntactic transform of one XML format to the other I'd use an ontology language such as OWL to map between the data models of my internal format and OCS. This is the RDF-based solution.

Right of the bat, it is clear that both approaches share certain drawbacks. In both cases, I have to come up with a transformation from one represention of a feed list to another. Ideally, for popular formats there would be standard transformations described by others to move from one popular format to another (e.g. I don't have to write a transformation for WordML to HTML but do for WordML to my custom document format) so developers who stick to popular formats simply have to locate the transformation as opposed to actually authoring it themselves.

However there are further drawbacks to using the semantics based approach than using the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elemebts (e.g. the equivalent of element renaming such as stating that a url and link element are equivalent) an ontology language is insufficient and a Turing complete transformation language like XSLT is not. A good example of this is another example from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted, the first is by using the pubDate element which is described as containing a string in the RFC 822 format while the other is using the dc:date element which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation has been applied if one wants an application to treat pubDate and dc:date as equivalent. This means that instead of making one pass with an XSLT stylesheet to perform the transformation in the XML-based solution, two transformation techniques will be needed in the RDF-based solution and it is quite likely that one of them would be XSLT.

The other practical concern is that I already know XSLT and have good books to choose from to learn about it such as Michael Kay's XSLT : Programmer's Reference and Jeni Tennison's XSLT and XPath On The Edge as well as mailing lists such as xsl-list where experts can help answer tough questions.

From where I sit picking an XML-based solution over an RDF-based one when it comes to dealing with issues involving interchange of XML documents just makes a lot more sense. I hope this post helps clarify my original points.

Ken MacLeod also wrote

In his article, Dare suggests that XSLT can be used to transform to a canonical format, but doesn't suggest what that format should be or that anyone is working on a common, public repository of those transforms.

The transformation is to whatever target format the consumer is comfortable with dealing with. In RSS Bandit the transformations are OCS/OPML to my internal feed list format and RSS 1.0 to RSS 2.0. There is no canonical transformation to one Über XML format that will solve every one's problems. As for keeping a common, public repository of such transformations that is an interesting idea which I haven't seen anyone propose in the past. A publicly accessible database of XSLT stylesheets for transforming between RSS 1.0 and RSS 2.0, WordML to HTML, etc. would be a useful addition to the XML community.

Sam Ruby muddies the waters in his post Blind Spots and subsequent comments in that thread by confusing the use cases around XML as a data interchange format and XML as a storage data format. My comments above have been about XML as a data interchange format, I'll probably post more in future about RDF vs. XML as a data storage format using the thread in Sam's blog for context.

Categories: XML

November 10, 2003

@ 04:46 AM

RDF, The Semantic Web and Perpetual Motion Machines

Ken MacLeod writes

Clay Shirky criticizes the Semantic Web in his article, The Semantic Web, Syllogism, and Worldview, to which Sam Ruby accurately assesses, "Two parts brilliance, one part strawman."

Joe Gregorio responds to Shirky's piece with this very concrete statement:

This is exactly the point I made in The Well-Formed Web, that the value that the proponents of the Semantic Web were offering could be achieved just as well with just XML and HTTP, and we are doing it today with no use of RDF, no need to wait for ubiquitous RDF deployment, no need to wait for RDF parsing and querying tools.

Yet, in the "just XML" world there is no one that I know of working on a "layer" that lets applications access a variety of XML formats (schemas) and treat similar or even logically equivalent elements or structures as if they were the same. This means each XML application developer has to do all of the work of integrating each XML format (schema): N × M.

The difference between the RDF proponents and the XML proponents is fairly simple. In the XML-centric world parties can utilize whatever internal formats and data sources they want but exchange XML documents that conform to an agreed upon format, in cases where the agreed upon format conflicts with internal formats then technologies like XSLT come to the rescue. The RDF position is that it is too difficult to agree on interchange formats so instead of going down this route we should use A.I.-like technologies to map between formats. Note, that this doesn't mean transformations don't need to be done as Ken points out

The RDF model along with the logic and equivalency languages, like OWL (nee DAML+OIL),

Thus, if you are an XML practitioner RDF doesn't change much except new transformation techniques and technologies to learn.

Additionally as Clay Shirky points out, on investigation it isn't even clear whether the basic premises of RDF and similar Semantic Web technologies is based on a firm foundation and sound logic. In conclusion Ken wrote,

One can take potshots at RDF for how it addresses the problem, and the Semantic Web for possibly reaching too far too quickly in making logical assertions based on relations modeled in RDF, but to dismiss it out of hand or resort to strawmen to attack it all while not recognizing the problem it addresses or offering an alternative solution simply tells me they don't see the problem, and therefore have no credibility in knocking RDF or the Semantic Web for trying to solve it.

I wonder if I'm the only one that sees the parallels between the above quote and statements that attributed to religious fundamentalists. I wonder if Ken is familiar with Perpetual Motion Machines? The problem they want to solve is real albeit impossible to solve. Does he also feel that no one has the credibility to knock any one of the numerous designs for one that have been proposed until the critic can themselves produce a perpetual motion machine?

Categories: XML

November 7, 2003

@ 03:23 PM

The W3C Learns From Past Mistakes

I've posted previously on why I think the recent outcry for the W3C to standardize on a binary format for the representation of XML information sets (aka "binary XML") is a bad idea which could cause significant damage to interoperability on the World Wide Web. Specifically I wrote

Binary XML Standard(s): Just Say No

Omri and Joshua have already posted the two main reasons why attempting to create a binary XML standard is folly (a) the various use cases and requirements are contradictory (small message size for low bandwidth situations and minimal parsing/serialization time for sitautions where minimizing processing time is prime) thus a single standard is unlikely to satisfy a large proportion of the requesters and (b) creation of a binary XML standard especially by an organization such as the W3C muddies the water with regards to interop, people already have to worry about the interop pain that will occur whenever XML 1.1 gets out of the door (which is why Elliotte Rusty Harold advises avoiding it like the plague) let alone adding one or more binary XML standards to the mix.

I just read the report from the W3C Workshop on Binary Interchange of XML Information Item Sets and I'm glad to see the W3C did not [completely] bow to pressure from certain parties to start work on a "binary XML" format. The following is the conclusion from the workshop

CONCLUSIONS

The Workshop concluded that the W3C should do further work in this area, but that the work should be of an investigative nature, gathering requirements and use cases, and prepare a cost/benefit analysis; only after such work could there be any consideration of whether it would be productive for W3C to attempt to define a format or method for non-textual interchange of XML.

See also Next Steps below for the conclusions as they were stated at the end of the Workshop.

This is new ground for the W3C. Usually W3C working groups are formed to take competing requirements from umpteen vendors and hash out a spec. Of course, the problems with this approach is that it doesn't scale. It may have worked for HTML when the competing requirements primarily came from two vendors but now that XML is so popular it doesn't work quite as well, as Tim Bray put it "any time there's a new initiative around XML, there are instantly 75 vendors who want to go on the working group".

It's good to see the W3C decide to take an exploratory approach instead of just forging ahead to create a spec that tries to satisfy myriad competing and contradictory requirements. They've done this before with W3C XML Schema (and to a lesser extent with XQuery) and the software industry is still having difficulty digesting the results. Hopefully at the end of their investigation they'll come to the right conclusions.

Categories: XML

November 5, 2003

@ 03:45 PM

Comments [16]

A Request for Authors of Weblog Software

One of the biggest concerns about RSS is the amount of bandwidth consumed by wasteful requests. Recently on an internal mailing list discussion there was a complaint about the amount of bandwidth wasted because weblog servers send a news aggregator an RSS feed containing items it has already seen. A typical news feed contains 10 - 15 news items where the oldest is a few weeks old and the newest is a few days old. A typical user has used their news aggregator to fetch the an RSS feed about once every other day. This means on average at least half the items in an RSS feed are redundant to people who are subscribed to the feed yet everyone (client & server) incurs bandwidth costs by having the redundant items appear in the feeds.

So how can this be solved? All the pieces to solve this puzzle are already on the table. Every news aggregator worth it's salt (NetNewsWire, SharpReader, NewsGator, RSS Bandit, FeedDemon, etc) uses HTTP Conditional GET requests. What does that mean in English? It means that most aggregators send information about when last they retrieved the RSS feed via the If-Modified-Since HTTP header and a the hashcode of the RSS feed provided by the server the last time it was fetched via the If-None-Match HTTP header. The interesting point is that although most news aggregators tell the server the last time they fetched the RSS feed almost no weblog server I am aware of actually uses this information to tailor the information sent back in the RSS feed. The weblog software I use is guilty of this as well.

If you fetched my RSS feed yesterday or the day before there is no reason for my weblog server to send you a 200K file containing five entries from last week which it currently does. Actually it is worse, currently my weblog software doesn't even perform the simple check of seeing whether there are any new items before choosing to send down a 200K file.

Currently this optimization is the one performed by weblog servers, if there are no new items then a HTTP 304 response is sent otherwise a feed containing the last n items is sent. A further optimization is possible where the server only sends down the last n items newer than the If-Modified-Since date sent by the client.

I'll ensure that this change makes it into the next release of dasBlog (the weblog software I use) and if you use weblog software I suggest requesting that your software vendor to do the same.

UPDATE: There is a problem with the above proposal in that it calls for a reinterpretation of how If-Modified-Since is currently used by most HTTP clients and directly violates the HTTP spec which states

b) If the variant has been modified since the If-Modified-Since
date, the response is exactly the same as for a normal GET.

The proposal is still valid except that this time instead of misusing the If-Modified-Since header I'd propose that clients and servers respect a new custom HTTP header such as "X-Feed-Items-New-Than" whose value would be a date in the same format as that used by the If-Modified-Since header.

Categories: XML

November 4, 2003

@ 03:59 PM

XPath Over Arbitrary Object Graphs: The Perils of Being Easily Distracted

I was planning to write this month's Extreme XML column on the recently released EXSLT.NET implementation produced by myself and a couple of others. One of the cool things about the EXSLT.NET project is that we added the ability to use the EXSLT extension functions in XPath queries over any data source that provides an XPathNavigator (i.e. implements IXPathNavigable). Thus one would be able to use functions like set:distinct and regexp:match when running XPath queries over objects that implement the IXPathNavigable interface such as the XPathDocument, XmlDocument or XmlDataDocument.

In constructing my examples I decided that it would be even cooler to show the extensibility of the .NET Framework if I showed how one could use the XPath extension functions in queries over implementations of XPathNavigator not provided by the .NET Framework such as my perennial favorite, the ObjectXPathNavigator.

After fixing some bugs in the ObjectXPathNavigator implementation on MSDN (MoveToParent() didn't take you to the root node from the document element and the navigator only exposed public properties but not public fields) I came across a problem which will probably turn into yet another project on GotDotNet workspaces. The heuristics the ObjectXPathNavigator uses to provide an XML view of an arbitrary object graph doesn't take into account the class annotations used by XML Serialization in the .NET Framework. Basically this means that if one reads in an XML document, converts it to objects using the XmlSerializer then creates an ObjectXPathNavigator over the objects...the XML view of the object provided by the ObjectXPathNavigator would not be the same as the XML generated when the class is serialized as XML via the XmlSerializer.

In fact for the ObjectXPathNavigator to provide the same XML view of an objects as the XmlSerializer would involve having it understand the various attributes for annotating classes from the System.Xml.Serialization namespace. Considering that in the future, the XPathNavigator should be the primary API for accessing XML in the .NET Framework it would be extremely quite useful if there was an API that allowed any object to be treated as a first class citizen of the XML world. The first step was the XmlSerializer which allowed any class to be saved and loaded to and from XML streams, the next step should be enabling any object to be accessed in the same way XML documents are as well. Instant benefits are things like the ability to perform XPath and XSLT over arbitrary objects. In the Whidbey/Yukon (Visual Studio v.next/SQL Server v.next) timeframe this means getting stuff like XQuery over objects or the ability to convert any object graph to an XmlReader for free.

It looks like I have a winter project, but first I have to finish this month's column on EXSLT.NET. *sigh*

Categories: XML

October 30, 2003

@ 03:01 AM

What's New For XML Programming Models in the Next Version of the .NET Framework

Drew Marsh blogged about the talk given by my boss at this year's Microsoft Professional Developer's Conference (PDC) entitled What's New In System.Xml For Whidbey?". Since I'm directly responsible for some of the stuff mentioned in the talk I though it would make sense if I made some clarifications or added details where some where lacking from his coverage.

Usability Improvements (Beta 1)

CLR type accesors on XmlReader, XmlWriter and XPathNavigator: Double unitPrice = reader.ValueAsDouble

This was a big gripe from folks in v1.0 of the .NET Framework that they couldn't access the XML in a validated document as a typed value, this is no longer the case in Whidbey. However people who want this functionality will have to move to the XPathDocument instead of the XmlDocument. People will be able to get typed values from an XmlDocument (actually from anything that implements IXPathNavigable) but actually storing the data in the in-memory representation as a typed value will only be available on the XPathDocument.

XPathDocument A Better XML DOM

"XmlDocument is dead."

XPathDocument replaces the XmlDocument as the primary XML store.
Feature Set

20%-40% more performant for XSLT and Xquery
Editing capabilities through the XPathEditor (derives from XPathDocument) using an XmlWriter (the mythical XmlNodeWriter we've all been searching for).
XML schema validation
Strongly typed store. Integers stored as int internally (per schema) (Beta 1)
Change tracking at node level
UI databinding support to WinForms and ASP.NET controls (Beta 1)

Yup, in v1.0 of the .NET Framework we moved away from a push-based parser (SAX) in MSXML to a pull-based parser (XmlReader) in the .NET Framework. In v2.0 of the .NET Framework there's been a similar shift, from the DOM data model & tree based APIs for accessing XML to the XPath data model & cursor based APIs for accessing XML. If you are curious about some of the thinking that went into this decision you should take a look at my article in XML Journal entitled Can One Size Fit All?

Note: XPathDocument2 in PDC bits will be XPathDocument once again by Beta 1. "We were at an unfortunate design stage at the point where the PDC bits were created."

Yeah, things were in flux for a while during our development process. The features of the class called XPathDocument2 in the PDC builds will be integrated back into the XPathDocument class that was in v1.0 of the .NET Framework.

The rest of the stuff in the talk (XQuery, new XML editor in Visual Studio.NET, ADO.NET with SQLXML, etc) isn't stuff I'm directly responsible for so I hesitate to comment further, however Drew has taken excellent notes about them so it is clear which direction we're going in for Whidbey.

Categories: Life in the B0rg Cube | XML

October 30, 2003

@ 01:47 AM

XML Schema Design Patterns (part 3)

The third in my semi-regular series of guidelines for working with W3C XML Schema for XML.com is now up. The article is entitled XML Schema Design Patterns: Is Complex Type Derivation Unnecessary? and the article is excerpted below for those who may not have the time to read the entire article

INTRODUCTION

W3C XML Schema (WXS) possesses a number of features that mimic object oriented concepts, including type derivation and polymorphism. However real world experience has shown that these features tend to complicate schemas, may have subtle interactions that lead tricky problems, and can often be replaced by other features of WXS. In this article I explore both derivation by restriction and derivation by extension of complex types showing the pros and cons of both techniques, as well as showing alternatives to achieving the same results

MIDDLE

As usage of XML and XML schema languages has become more widespread, two primary usage scenarios have developed around XML document validation and XML schemas.

Describing and enforcing the contract between producers and consumers of XML documents: ...
Creating the basis for processing and storing typed data represented as XML documents: ...

CONCLUSION

Based on the current technological landscape the complex type derivation features of WXS may add more problems than they solve in the two most commmon schema use cases. For validation scenarios, derivation by restriction is of marginal value, while derivation by extension is a good way to create modularity as well as encourage reuse. Care must however be taken to consider the ramifications of the various type substitutability features of WXS (xsi:type and substitution groups) when using derivation by extension in scenarios revolving around document validation.

Currently processing and storage of strongly typed XML data is primarily the province of conventional OOP languages and relational databases respectively. This means that certain features of WXS such as derivation by restriction (and to a lesser extent derivation by extension) cause an impedance mismatch between the type system used to describe strongly typed XML and the mechanisms used for processing and storing said XML. Eventually when technologies like XQuery become widespread for processing typed XML and support for XML and W3C XML Schema is integrated into mainstream database products this impedance mismatch will not be important. Until then complex type derivation should be carefully evaluated before being used in situations where W3C XML Schema is primarily being used as a mechanism to create type annotated XML infosets.

Categories: XML

October 29, 2003

@ 01:57 PM

On the Death of the XML Database

A recent article by Phil Howard of Bloor Research on IT-Director.com talks about the Demise of the XML Database. Excerpts below

While you can still buy an XML database purely because it provides faster storage capability and greater functionality than a conventional database, all the erstwhile XML database vendors are increasingly turning to other sources of use for their products.

These other markets basically consist of two different sectors: the use of XML databases as a part of an integration strategy, where the database is used to provide on-the-fly translation for XML documents, and for content management...

The reason why there is this trend away from pure XML storage is because advanced XML capabilities are being introduced by all the leading relational vendors.

This has been considered "fighting words" from some in the XML database camp such as Mike Champion (works on Tamino XML database) and Kimbro Staken (one of the originators of Apache Xindice). Mike Champion comes up with a number of counter-arguments to the claims in the article I found interesting and felt compelled to comment on. According to Mike

It is widely believed that less than a quarter of enterprise data is currently stored in RDBMS systems. This suggests that the market is not "making do" with what the relational database products offer today, but using a wide variety of technologies.

This is actually the mantra of the team I work for at Microsoft. We are responsible for data access technologies (Relational, Object and XML) and our GM is fond of trotting out the quote about "less than a quarter of enterprise data is currently stored in a relational database". A lot of data important to businesses is just siting around on file systems in various Microsoft Office documents and other file formats. The bet across the software industry is that moving all this semi-structured business documents to XML is the right way to go and the first step has been achieved given that modern business productivity software (including the Open Source ones) are moving to fully supporting XML for their document formats. Step one is definitely to get all those memos, contracts and spreadsheets into XML.

The main reason OODBMS didn't hit the sweet spot, AFAIK, is that they created a tight coupling between application code and the DBMS. Potential performance gains this allows can outweigh the maintenance challenges in extremely business critical, high transaction volume environments...XML DBMS, on the other hand, inherit XML's suitability for loosely coupling systems, applications, and tools across a wide range of environments.

Totally agree here about the weakness of OODBMSs in creating a tight coupling between applications and the data they accessed. For a more in-depth description of the disadvantages of object oriented databases in comparison to their relational counterparts you can read my article An Exploration of Object Oriented Database Management Systems.

Again AFAIK (having only played with OODBMS personally), there is relatively little portability across OODBMS systems; code written for one would be very expensive to adapt to another. Investing in the technology required one to make a risky bet on the vendor who supplied it. This created an environment where the object-relational vendors could prosper by offering only a subset of the features but the absolute assurance that they would be in business for years to come. In the XML DBMS world, on the other hand, all support roughly the same schema, query language, and API standards;

There are two points Mike is making here

There is very little portability across OODBMS systems.
In the XML DBMS world, on the other hand, all support roughly the same schema, query language, and API standards

Based on my experiences with OODBMSs the first claim is entirely accurate, moving data from one OODBMS system was a pain and there was a definitle lack of standardization of APIs and query languages across various products. The second claim is rather suspect to me. I am unaware of any schema, query or API standards that are supported uniformly across XML database products. This isn't to say there aren't standardized W3C branded XML schema languages or query languages nor that there haven't been moves to come up with standard XML database APIs but when last I looked these weren't uniformly supported across many the XML database products and where they were there was a distinct lack of maturity in their offerings. Granted it's been almost a year since I last looked.

However there is an obvious point about portability that Mike doesn't mention (perhaps because it is so obvious). The entire point of XML is being portable and interoperability, moving data from one XML database to another should simply be a case of "export database as XML" from one and "import XML into database" on the other.

The standards of the XML world provide a clearly defined and fairly high bar for those who would seek to take away the market pioneered by the XML DBMS vendors. For better or worse, the XML family of specs is complex and quite challenging to support efficiently in a DBMS system. It's one thing to support, as the RDBMS vendors now do quite well, XML views of structured, typed, relatively "flat" data such as are typically found in RDBMS applications. It is quite another to efficiently and scalably support queries and updates on "document-like" XML with relatively open content models, lots of recursion, mixed content, and where wildcard text comparisions are more frequent than typed value comparisons. The dominant DBMS vendors obviously have talent and money to throw at the problem, but analysts should not assume that they will surpass theese capabilities of the XML DBMS systems anytime soon

OK, this one sounds like FUD. Basically Mike seems to be saying the family of XML specs is so complex (thanks to the W3C, but that's another story) that companies like Oracle, IBM and Microsoft won't be able to come up with ways to query semi-structured data efficiently or perform text comparison searches well so you are best of sticking to a seperate database for your XML data instead of having all your data stored in a single unified store.

So what is my position on the death of native XML databases? Like Phil Howard, I suspect that once XML support becomes [further] integrated into mainstream relational databases (which it already has to some degree) then native XML databases will be hard pressed to come up with reasons why one would want to buy a separate product for storing XML data distinct from the rest of the data for a business when a traditional relational database can store it all. It's all about integration. Businesses prefer buying a single office productivity suite than mixing and matching word processors, spreadsheets and presentation programs from different vendors. I suspect the same is true when it comes to their data storage needs.

Categories: XML

October 28, 2003

@ 05:36 AM

Indigo

Clemens Vasters writes

Indigo is the successor technology and the consolidation of DCOM, COM+, Enterprise Services, Remoting, ASP.NET Web Services (ASMX), WSE, and the Microsoft Message Queue. It provides services for building distributed systems all the way from simplistic cross-appdomain message passing and ORPC to cross-platform, cross-organization, vastly distributed, service-oriented architectures providing reliable, secure, transactional, scalable and fast, online or offline, synchronous and asynchronous XML messaging.

I think is truly awesome, they (folks like Don Box, Doug Purdy, Steve Swartz, Scott Gellock, Omri Gazitt , Mike Vernal , John Lambert et al) have not just cooked up a brand new distributed computing platform but have built it on open standards and open technologies meaning that probably for the first time in decades there won't be artifical, politics induced divisions limiting a distributed computing technology to particular platforms or operating systems (i.e. like CORBA, DCOM & Java RMI). The extra goodness is that these open standards are all XML based so crazy XML geeks like me can do stuff like this or people like Sam Ruby can do stuff like that.

The next generation of DCOM, just that this time it interoperates with everyone regardless of what programming language or operating system they are running.

Fucking sweet.

Categories: Life in the B0rg Cube | XML

October 27, 2003

@ 03:39 PM

While The Cat's Away

So it looks like my boss, his boss, his boss's boss, and his boss's boss's boss are all out at the Microsoft Professional Developer's Conference 2003 (aka PDC) where folks will get a sneak peak at the next versions of Windows, SQL Server and Visual Studio. Thus it looks like won't be much whip cracking going on this week so I can spend time working on my pet projects for work.

XML Developer Center on MSDN: Mark Fussel recently posted complaints about the quality of some articles on XML he'd recently read. I generally feel the same way about websites dedicated to articles about XML. Of all the developer sites devoted to XML there are only two I've seen that aren't utter crap; XML.com and IBM's XML developerWorks site. Even these are kind of hit or miss, XML.com usually publishes about 3 articles a week of which one is excellent, one is good and one is crap. Which is fine except that the excellent article is typically about something that isn't directly applicable to what I work on. The problem with IBM's DeveloperWorks is that all the code is Java-centric which doesn't help me since I work with the .NET Framework.
After seeing some of what Tim Ewald did with producing content around Microsoft technologies and XML Web Services via the Web Services Developer Center on MSDN I talked to some of the folks at MSDN about creating something similar for XML content. This was green lighted a while ago but preparations for PDC has stopped this from taking off until next month. In the meantime, I'll be creating my content plan and coming up with a list of authors (both Microsoft employees and non-Microsoft folks) for new dev center.

So far I've gotten a couple of folks lined up internally as well as some excellent non-Microsoft folks like Daniel Cazzulino, Christoph Schittko and Oleg Tkachenko. Definitely expect some pages to the XML Home Page on MSDN in the next few months.
Sequential XPath and Pull Based XML Parsing: In 2001, Arpan Desai presented on Sequential XPath at XML 2001. Relevant bits from the paper

This paper will provide an explanation of and the subset of XPath which we will tentatively dub: Sequential XPath, or SXPath for ease of use. SXPath allows a event-based XML parser, such as a typical SAX-compliant XML parser, to execute XPath-like expressions without the need of more memory consumption than is normally used within a ~~sequential~~ pull-based parser.
...
By creating a streaming XML parser which utilizes Sequential XPath, one is able to reap the inherent benefits of a streaming parser with the querying power of XPath. By defining this proper subset of XPath, we enable developers and users to utilize XML in a wide array of applications thought to be too performance sensitive for traditional XML processing.
The code for the technology outlined above has actually been gathering dust on some hard drives at work for a while. I'm currently in the process of liberating this code so that everyone can get access to the combined benefits of pull-based parsing and XPath based matching of nodes. Hopefully folks should be able to download classes similar to the ones outlined in Arpan's presentation in the next few weeks. Hopefully by Christmas, everyone will be able to write code similar to the following snippet taken from Tim Bray's XML is too Hard for Programmers

while (<STDIN>) {
  next if (X<meta>X);
  if    (X<h1>|<h2>|<h3>|<h4>X)
  { $divert = 'head'; }
  elsif (X<img src="/^(.*\.jpg)$/i>X)
  { &proc_jpeg($1); }
  # and so on...
}
Of course you'll have to substitute the Perl code above for C#, VB.NET or any one the various languages targetted at the .NET Framework.

Categories: XML

October 25, 2003

@ 02:13 PM

Something Cool From Microsoft You Might Not See At The PDC

"This paper proposes extending popular object-oriented programming languages such as C#, VB or Java with native support for XML. In our approach XML documents or document fragments become first class citizens. This means that XML values can be constructed, loaded, passed, transformed and updated in a type-safe manner. The type system extensions, however, are not based on XML Schemas. We show that XSDs and the XML data model do not fit well with the class-based nominal type system and object graph representation of our target languages. Instead we propose to extend the C# type system with new structural types that model XSD sequences, choices, and all-groups. We also propose a number of extensions to the language itself that incorporate a simple but expressive query language that is influenced by XPath and SQL. We demonstrate our language and type system by translating a selection of the XQuery use cases."

From Programming with Rectangles, Triangles, and Circles by Erik Meijer and Wolfram Schulte

I talk to Erik about this stuff all the time, so it's great to finally see some of the thoughts and discussions around this topic actually written down in a research paper. According to Erik's blog post from a few weeks ago he'll actually be presenting about this at XML 2003

Categories: XML

October 19, 2003

@ 07:56 PM