February 5, 2003
@ 11:58 PM
From XML 1.0 to the XML Infoset: A Useful Abstraction

A useful abstraction is one which simplifies the details of a problem or its solution by providing a palatable and consistent logical model. A more important characteristic of a useful abstraction is that it allows one to change the details of the problem or the implementation of its solution without having to change the abstraction. This latter characteristic is quite beneficial because it often lends itself to extending the abstraction to solve different problems than originally imagined, and it gives the underlying implementation flexibility.

The XML Infoset is an abstract representation of XML 1.0. It provides a simplified logical view of XML and papers over certain details. The infoset describes all the pertinent information that is contained within an XML document without getting bogged down in the differences between characters entered directly, in CDATA sections or as entities, along with various other syntactic minutiae. The infoset abstraction gives us several things. The first is that it conclusively states what information within an XML document is pertinent information. The second is that it provides a starting point for mapping non-XML data sources to XML data.

Given the following XML document
<foo attr1="value1"        attr2='value2' >me&amp;you</foo>
I can tell that the pertinent information is that I have a document information item with an element information item that has two attribute information items and six child character information items. Details like how much space is between attributes, whether single or double quotes are used for attributes or the fact that the ampersand had to be escaped are not significant information. This lack of focus on the textual nature of XML 1.0 gives one a launch pad towards creating XML infoset compatible syntaxes for describing structured data.
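As a rough illustration of the point above, here is a minimal Python sketch that parses two differently written serializations and compares what survives parsing. The second document (reordered attributes, a CDATA section instead of an entity reference) is my own assumption for the sake of the example, and ElementTree is only an approximation of the infoset, not an implementation of it.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in syntax: attribute spacing, order and
# quote style, plus an entity reference versus a CDATA section.
# doc_b is a hypothetical variant invented for this comparison.
doc_a = """<foo attr1="value1"        attr2='value2' >me&amp;you</foo>"""
doc_b = """<foo attr2="value2" attr1="value1"><![CDATA[me&you]]></foo>"""


def summarize(xml_text):
    """Reduce a document to (element name, attributes, character data)."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib), root.text


summary_a = summarize(doc_a)
summary_b = summarize(doc_b)

print(summary_a)               # element name, two attributes, text content
print(summary_a == summary_b)  # the syntactic differences are gone
print(len(summary_a[2]))       # one character per character information item
```

Both documents reduce to the same element name, the same two attributes and the same six characters of content, which is the sense in which the quoting, spacing and escaping are not significant.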

As long as mappings from one syntax to the XML infoset exist, these alternate serializations of the infoset can be processed using XML technologies like XQuery, XPath and XML Schema. Proposals like Don Park's SML or the various flavors of binary XML only need to worry about being compatible with the XML infoset, to the extent to which it defines conformance, and not with the XML 1.0 syntax.

From URLs & URNs to URIs: A Step Backward

URLs refer to Uniform Resource Locators described in RFC 1738. According to the RFC
URLs are used to `locate' resources, by providing an abstract identification of the resource location. Having located a resource, a system may perform a variety of operations on the resource, as might be characterized by such words as `access', `update', `replace', `find attributes'. In general, only the `access' method needs to be specified for any URL scheme
URNs refer to Uniform Resource Names described in RFC 2141. According to the RFC
Uniform Resource Names (URNs) are intended to serve as persistent, location-independent, resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.
Reading the above excerpts it seems clear that URLs and URNs are used in connection with retrieving resources from a network. URLs tell you where to fetch the resource while URNs are the name of the resource from which you can then go find its location. Basically the difference between URNs and URLs is the difference between The White House and The White House, 1600 Pennsylvania Avenue NW, Washington, DC 20500. At first glance one can consider both URNs and URLs an abstraction over IP addresses, DNS and all the other gunk that goes on when one wants to grab stuff off the network, be they web pages, music files or images.
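The name versus location distinction shows up in the syntax itself. The sketch below uses Python's standard urllib.parse to split one of each; the URN is a made-up illustrative value of mine, not something taken from either RFC.

```python
from urllib.parse import urlsplit

# A locator names a host to contact; a name carries no location at all.
# The URN below is a hypothetical example value.
url = urlsplit("http://www.yahoo.com/")
urn = urlsplit("urn:isbn:0451450523")

print(url.scheme, url.netloc, url.path)
print(urn.scheme, repr(urn.netloc), urn.path)

# Only the URL tells a client where to go looking.
has_location = {"url": bool(url.netloc), "urn": bool(urn.netloc)}
print(has_location)
```

The URL carries a network location (the host), while the URN is just a scheme plus a name that something else would have to resolve to a location.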

However there is a wrinkle which isn't obvious, nor does it matter at the currently described level of abstraction. The wrinkle is that the term resource, which litters both RFCs, isn't rigorously defined, but since we are just talking about grabbing files off a network we can just assume it refers to files on a network. This is until URIs enter the picture.

URIs refer to Uniform Resource Identifiers described in RFC 2396. According to the RFC
A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.
URIs are a merger of the syntax of URLs and URNs, which seem to have been repurposed from their original task of identifying and locating network retrievable documents to being more readable versions of UUIDs which can be used to identify any person, place or thing, regardless of whether it is a file on the Internet or a feeling in your heart.

This addition to the URN/URL abstraction seemed to address some of the bits which may have been considered to be leaky (if I enter http://www.yahoo.com in my browser and it loads the page from its cache then the URL isn't acting as a location but as an identifier). Others also saw URIs as a solution for people who needed user-friendly UUIDs for use on the Web. I've so far come into contact with URIs in two aspects of my professional experience and they have both left a bad taste in my mouth. Read on for details.

URIs in Action: XML Namespaces

The goal of the W3C's Namespaces in XML recommendation was to create a mechanism in which elements and attributes within an XML document that were from different markup vocabularies could be unambiguously identified and combined without processing problems ensuing. To achieve this XML namespaces were invented. An XML namespace is a collection of names, identified by a Uniform Resource Identifier (URI) reference, which are used in XML documents as element and attribute names. Below is an example of a document that uses XML namespaces
<dare:foo xmlns:dare="http://www.25hoursaday.com" />
The above document has a foo element that is from the "http://www.25hoursaday.com" namespace. The first thing people ask about namespaces in XML, without fail, is "What is at the namespace website?". Now, given that URIs are just glorified UUIDs, the answer to this question is that "http://www.25hoursaday.com" isn't necessarily a website, just a pseudo-unique identifier.
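To make the "just an identifier" point concrete, here is a small Python sketch using the standard ElementTree parser. The two variant documents (a different prefix, and a trailing slash on the namespace name) are my own additions for illustration. The parser never tries to fetch anything from the namespace name; it only compares strings.

```python
import xml.etree.ElementTree as ET

doc = '<dare:foo xmlns:dare="http://www.25hoursaday.com" />'
# Hypothetical variants: same namespace under another prefix, and a
# namespace name that differs by a single trailing slash.
other_prefix = '<x:foo xmlns:x="http://www.25hoursaday.com" />'
trailing_slash = '<dare:foo xmlns:dare="http://www.25hoursaday.com/" />'

# ElementTree expands a qualified name to "{namespace-name}local-name".
tag = ET.fromstring(doc).tag
print(tag)

# The prefix is irrelevant; only the namespace name string matters.
same_despite_prefix = tag == ET.fromstring(other_prefix).tag
# One extra character makes it a different namespace, even though a web
# server would most likely treat both forms as the same location.
same_despite_slash = tag == ET.fromstring(trailing_slash).tag
print(same_despite_prefix, same_despite_slash)
```

Nothing was retrieved from the network at any point, and two strings that a browser would treat as the same address name two different namespaces.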

This answer has caused several thousand emails to fly back and forth on various XML and W3C mailing lists because of the utter confusion it causes. Several thousand emails is not an exaggeration. Looking at the archives for xml-uri@w3c.org shows peak traffic of almost a thousand mails a month, and lists like XML-DEV usually have several hundred emails in threads where URIs & XML namespaces come up.

As I type this there is currently an active thread on the WWW-TAG mailing list about namespace documents and what constitutes a "valid representation" of the abstract resource that is an XML namespace. For those that have tons of free time to read technical yet pointlessly philosophical discussions, the threads are here and here.

URIs and the Semantic Web: Ambiguity²

One problem with URIs is that they don't uniquely identify a single thing. Consider the following hyperlinked statements
Dare is a Georgia Tech alumnus.

Dare's website is valid XHTML.
In the above statements I use the URI "http://www.25hoursaday.com" to identify both myself and my web page. This is a bad thing for the Semantic Web. If you read Aaron Swartz's excellent primer on the Semantic Web you will notice where he talks about RDF and its dependence on URIs specifically

RDF gives you a way to make statements that are machine-processable. Now the computer can't actually "understand" what you said, of course, but it can deal with it in a way that makes it seem like it does. For example, I could search the Web for all book reviews and create an average rating for each book. Then, I could put that information back on the Web. Another website could take that information (the list of book rating averages) and create a "Top Ten Highest Rated Books" page.

RDF is really quite simple. An RDF statement is a lot like a simple sentence, except that almost all the words are URIs. Each RDF statement has three parts: a subject, a predicate and an object. Let's look at a simple RDF statement:

<http://aaronsw.com/> <http://love.example.org/terms/reallyLikes> <http://www.w3.org/People/Berners-Lee/Weaving/> .

Can you guess what this says? The first URI is the subject. In this instance, the subject is me. The second URI is the predicate. It relates the subject to the object. In this instance, the predicate is "reallyLikes." The third URI is the object. Here, the object is Tim Berners-Lee's book "Weaving the Web." So the RDF statement above says that I really like "Weaving the Web."

Now consider changing his RDF example to
<http://aaronsw.com/> <http://love.example.org/terms/reallyLikes> <http://www.25hoursaday.com/> .
Can you tell whether Aaron really likes my website or me personally from the above RDF statement? Neither can I. This inherent ambiguity is yet another issue with the vision of the Semantic Web and the current crop of Semantic Web technologies that are overly dependent on URIs.
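The same problem can be sketched with plain tuples standing in for RDF statements. This is not a real RDF library, and the alumnusOf and conformsTo predicates and their object URIs are hypothetical names I made up to mirror the two hyperlinked statements earlier.

```python
DARE = "http://www.25hoursaday.com/"

# (subject, predicate, object) tuples. The second and third predicates and
# their objects are hypothetical identifiers invented for this sketch.
triples = [
    ("http://aaronsw.com/", "http://love.example.org/terms/reallyLikes", DARE),
    (DARE, "http://example.org/terms/alumnusOf", "http://example.org/GeorgiaTech"),
    (DARE, "http://example.org/terms/conformsTo", "http://example.org/XHTML"),
]


def statements_about(uri, store):
    """Return every (predicate, object) pair whose subject is the given URI."""
    return [(p, o) for s, p, o in store if s == uri]


about = statements_about(DARE, triples)
for predicate, obj in about:
    print(predicate, obj)
```

A program merging these statements ends up with a single subject that is both a Georgia Tech alumnus and valid XHTML, and it has no way of telling which of the two Aaron really likes.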

Lessons Learned

Part of me feels there are several lessons to be learned from the problems caused by the URI abstraction and the potential problems that could be caused by the XML Infoset (proliferation of un-interoperable XML serialization formats) while embracing the benefits of useful abstractions as well. However, I have to go to work so this early morning ramble will have to end here.

Get yourself a News Aggregator and subscribe to my RSS feed

Disclaimer: The above comments do not represent the thoughts, intentions, plans or strategies of my employer. They are solely my opinion.