Daniel Cazzulino has been writing about his work on XML Streaming Events, which combines the ability to do XPath queries with the .NET Framework's forward-only, pull-based XML parser. He shows the following code sample:

// Setup the namespaces
XmlNamespaceManager mgr = new XmlNamespaceManager(temp.NameTable);
mgr.AddNamespace("r", RssBanditNamespace);

// Precompile the strategy used to match the expression
IMatchStrategy st = new RootedPathFactory().Create(
    "/r:feeds/r:feed/r:stories-recently-viewed/r:story", mgr);

int count = 0;

// Create the reader.
XseReader xr = new XseReader( new XmlTextReader( inputStream ) );

// Add our handler, using the strategy compiled above.
xr.AddHandler(st, delegate { count++; });

while (xr.Read()) { }

Console.WriteLine("Stories viewed: {0}", count);

I have a couple of questions about his implementation, the main one being how it deals with XPath queries such as /r:feeds/r:feed[count(r:stories-recently-viewed)>10]/r:title, which can't be evaluated in a forward-only manner.
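
To make the issue concrete, here is a rough sketch of my own (not Daniel's API, and not the XPathReader discussed below; the input file feedlist.xml and the class name are hypothetical) of what evaluating such a predicate over a forward-only reader entails: the feed's title has to be buffered until the count is known, which is exactly what a pure match-and-fire streaming model can't express.

using System;
using System.Xml;

class ForwardOnlyPredicateSketch
{
    static void Main()
    {
        // Hypothetical input containing <feeds>/<feed>/<title> and
        // <stories-recently-viewed>/<story> elements as in the query above.
        XmlTextReader reader = new XmlTextReader("feedlist.xml");

        string bufferedTitle = null; // the <title> we cannot emit yet
        int storyCount = 0;
        bool inFeedTitle = false;

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.LocalName == "feed")
                    {
                        bufferedTitle = null;
                        storyCount = 0;
                    }
                    else if (reader.LocalName == "title" && bufferedTitle == null && !reader.IsEmptyElement)
                    {
                        inFeedTitle = true;
                    }
                    else if (reader.LocalName == "story")
                    {
                        storyCount++; // counting <story> children for illustration
                    }
                    break;

                case XmlNodeType.Text:
                    if (inFeedTitle) bufferedTitle = reader.Value;
                    break;

                case XmlNodeType.EndElement:
                    if (reader.LocalName == "title")
                    {
                        inFeedTitle = false;
                    }
                    else if (reader.LocalName == "feed")
                    {
                        // The predicate can only be evaluated after the entire
                        // <feed> has been read, hence the buffering.
                        if (storyCount > 10 && bufferedTitle != null)
                            Console.WriteLine(bufferedTitle);
                    }
                    break;
            }
        }
        reader.Close();
    }
}

The query in Daniel's sample has no predicates, so it can fire its handler the moment the path matches; the one above can't.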

Oleg Tkachenko also pipes in with some opinions about streaming XPath in his post Warriors of the Streaming XPath Order. He writes:

I've been playing with such beasts, making all kinds of mistakes and finally I came up with a solution, which I think is good, but I didn't publish it yet. Why? Because I'm tired to publish spoilers :) It's based on "ForwardOnlyXPathNavigator" aka XPathNavigator over XmlReader, Dare is going to write about in MSDN XML Dev Center and I wait till that's published.

May be I'm mistaken, but anyway here is the idea - "ForwardOnlyXPathNavigator" is XPathNavigator implementation over XmlReader, which obviously supports forward-only XPath subset...

And after I played enough with and implemented that stuff I discovered BizTalk 2004 Beta classes contain much better implementation of the same functionality in such gems as XPathReader, XmlTranslatorStream, XmlValidatingStream and XPathMutatorStream. They're amazing classes that enable streaming XML processing in much rich way than trivial XmlReader stack does. I only wonder why they are not in System.Xml v2 ? Is there are any reasons why they are still hidden deeply inside BizTalk 2004 ? Probably I have to evangelize them a bit as I really like this idea.

Actually, Oleg is both closer to and farther from the truth than he realizes. Although I wrote about a hypothetical ForwardOnlyXPathNavigator in my XML Journal article entitled Can One Size Fit All?, my planned article, which should show up when the MSDN XML Developer Center launches in a month or so, won't be using it. Instead it will be based on an XPathReader that is very similar to the one used in BizTalk 2004; in fact, it was written by the same developer. The XPathReader works similarly to Daniel Cazzulino's XseReader but uses the XPath subset described in Arpan Desai's Introduction to Sequential XPath paper instead of adding proprietary extensions to XPath as Daniel's does.

When the article describing the XPathReader is done it will come with source code, and if there is interest I'll create a GotDotNet Workspace for the project, although it is unlikely that either I or the developer who originally wrote the code will have time to maintain it.


 

Categories: XML

February 15, 2004
@ 05:50 PM

A few months ago I attended XML 2003, where I first learned about Semantic Integration, the buzzword term for mapping data from one schema to another with a heavy focus on using Semantic Web technologies such as ontologies. The problem these technologies solve is mapping XML data from external sources to a form that is compatible with how an application or business entity manipulates that data internally.

For example, in RSS Bandit we treat feeds in memory and on disk as if they are in the RSS 2.0 format even though the application supports other flavors of RSS, such as RSS 1.0, as well. Proponents of semantic integration technologies would suggest using a technology such as the W3C's OWL Web Ontology Language. If you are unfamiliar with ontologies and how they apply to XML, a good place to understand what they are useful for is the OWL Web Ontology Language Use Cases and Requirements document. The following quote from the OWL Use Cases document gives a glimpse into the goal of ontology languages:

In order to allow more intelligent syndication, web portals can define an ontology for the community. This ontology can provide a terminology for describing content and axioms that define terms using other terms from the ontology. For example, an ontology might include terminology such as "journal paper," "publication," "person," and "author." This ontology could include definitions that state things such as "all journal papers are publications" or "the authors of all publications are people." When combined with facts, these definitions allow other facts that are necessarily true to be inferred. These inferences can, in turn, allow users to obtain search results from the portal that are impossible to obtain from conventional retrieval systems

Although the above example talks about search engines, it is clear that one could also use this approach for data integration. In the case of RSS Bandit, one could create an ontology that maps the terms in RSS 1.0 to those in RSS 2.0 and make statements such as:

RSS 1.0's <title> element sameAs RSS 2.0's <title> element 

Basically, one could imagine the schemas for RSS 1.0 and RSS 2.0 represented as two trees, and an ontology as a way of drawing connections between the leaves and branches of those trees. In a previous post entitled More on RDF, The Semantic Web and Perpetual Motion Machines I questioned how useful this actually would be in the real world by pointing out the dc:date vs. pubDate problem in RSS. I wrote:

However there are further drawbacks to using the semantics based approach than using the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elements (e.g. the equivalent of element renaming, such as stating that a url and a link element are equivalent), an ontology language is insufficient while a Turing complete transformation language like XSLT is not. A good example of this is another example from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted: the first is the pubDate element, which is described as containing a string in the RFC 822 format, while the other is the dc:date element, which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation if one wants an application to treat pubDate and dc:date as equivalent. So instead of making one pass with an XSLT stylesheet to perform the transformation in the XML-based solution, two transformation techniques will be needed in the RDF-based solution and it is quite likely that one of them would be XSLT.
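
To illustrate the syntactic half of that problem, here is a minimal sketch of my own (not RSS Bandit's actual code; ParseItemDate is a hypothetical helper) that normalizes both date flavors to a System.DateTime. Real-world code would need to handle the many timezone variants RFC 822 allows.

using System;
using System.Globalization;

class DateNormalizationSketch
{
    // Parse either an RSS 2.0 pubDate (RFC 822 style) or a dc:date (ISO 8601 style)
    // into a DateTime so the rest of the application can treat them as equivalent.
    static DateTime ParseItemDate(string elementName, string value)
    {
        if (elementName == "pubDate")
        {
            // e.g. "Sun, 15 Feb 2004 17:50:00 GMT"; only one RFC 822 variant handled here
            return DateTime.ParseExact(value, "ddd, dd MMM yyyy HH:mm:ss 'GMT'",
                                       CultureInfo.InvariantCulture);
        }
        else // dc:date
        {
            // e.g. "2004-02-15T17:50:00Z"
            return DateTime.Parse(value, CultureInfo.InvariantCulture);
        }
    }

    static void Main()
    {
        Console.WriteLine(ParseItemDate("pubDate", "Sun, 15 Feb 2004 17:50:00 GMT"));
        Console.WriteLine(ParseItemDate("dc:date", "2004-02-15T17:50:00Z"));
    }
}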

The pubDate vs. dc:date case is a simple example; one could imagine more complex cases where the vocabularies to be mapped differ much more syntactically, such as:

<author>Dare Obasanjo (dareo@example.com)</author>

<author>
 <fname>Dare</fname>
 <lname>Obasanjo</lname>
 <email>dareo@example.com</email>
</author>
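
As a toy illustration of the kind of structural transformation involved (my own sketch, reusing the hypothetical element names from the example above), here is the string surgery needed to go from the flat form to the structured one, something no amount of sameAs statements can express:

using System;
using System.Xml;

class AuthorMappingSketch
{
    // Split "Name Surname (email)" into <fname>/<lname>/<email> child elements.
    static void WriteStructuredAuthor(string flatAuthor, XmlWriter writer)
    {
        int open = flatAuthor.IndexOf('(');
        string email = flatAuthor.Substring(open + 1, flatAuthor.IndexOf(')') - open - 1);
        string[] nameParts = flatAuthor.Substring(0, open).Trim().Split(' ');

        writer.WriteStartElement("author");
        writer.WriteElementString("fname", nameParts[0]);
        writer.WriteElementString("lname", nameParts[nameParts.Length - 1]);
        writer.WriteElementString("email", email);
        writer.WriteEndElement();
    }

    static void Main()
    {
        XmlTextWriter writer = new XmlTextWriter(Console.Out);
        writer.Formatting = Formatting.Indented;
        WriteStructuredAuthor("Dare Obasanjo (dareo@example.com)", writer);
        writer.Flush();
    }
}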

The aforementioned examples point out technical issues with using ontology-based techniques for mapping between XML vocabularies, but I failed to point out the human problems that tend to show up in the real world. A few months ago I was talking to Chris Lovett about semantic integration and he pointed out that in many cases, as applications evolve, semantics often begin to be assigned to values in orthogonal ways.

An example of semantics being added to values again comes from RSS Bandit. A feature of RSS Bandit is that feeds are cached on disk, allowing a user to read items that have long since disappeared from the feed. At first we provided the ability for the user to specify how long items should be kept in the cached feed, ranging from a day up to a year. We used an element named maxItemAge embedded in the cached feed which contained a serialized instance of the System.TimeSpan structure. After a while we realized we needed ways to say that a particular feed should use the global default maxItemAge, that items for a feed should never be cached, or that items for a feed should never expire, so we used the TimeSpan.MinValue, TimeSpan.Zero, and TimeSpan.MaxValue values of the TimeSpan class respectively.

If another application wanted to consume this data and had a similar notion of 'how long to keep the items in a feed' it couldn't simply map maxItemAge to whatever internal property it used without taking into account the extra semantics attached to certain values of that element. Overloading the meaning of properties and fields in a database or class is actually fairly commonplace [after all, how many different APIs use the occurrence of -1 in a value that should typically be a positive number as an error condition?] and something that must also be considered when applying semantic integration technologies to XML.
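
A quick sketch (mine, not RSS Bandit's actual code; InterpretMaxItemAge is a hypothetical helper) of what such a consumer has to do before it can map maxItemAge onto its own notion of item age:

using System;

class MaxItemAgeSketch
{
    // The serialized TimeSpan cannot be mapped directly to "days to keep items";
    // the overloaded sentinel values described above have to be checked first.
    static string InterpretMaxItemAge(string serializedTimeSpan)
    {
        TimeSpan age = TimeSpan.Parse(serializedTimeSpan);

        if (age == TimeSpan.MinValue) return "use the global default max item age";
        if (age == TimeSpan.Zero)     return "never cache items for this feed";
        if (age == TimeSpan.MaxValue) return "never expire items for this feed";
        return "keep items for " + age.TotalDays + " days";
    }

    static void Main()
    {
        Console.WriteLine(InterpretMaxItemAge("30.00:00:00"));                // plain value
        Console.WriteLine(InterpretMaxItemAge(TimeSpan.MaxValue.ToString())); // overloaded value
    }
}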

In conclusion, it is clear that Semantic Web technologies can be used to map between XML vocabularies; however, in non-trivial situations the extra work that must be layered on top of such approaches tends to favor using XML-centric techniques such as XSLT to map between the vocabularies instead.


 

Categories: XML

February 15, 2004
@ 03:07 AM

Just as it looks like my buddy Erik Meijer is done with blogging (despite his short-lived guest blogging stint at Lambda the Ultimate), it looks like a couple more of the folks who brought Xen to the world have started blogging. They are:

  1. William Adams: Dev Manager for the WebData XML team.

  2. Matt Warren: Formerly a developer on the WebData XML team, now works on the C# team or on the CLR (I can never keep those straight).

Both of them were also influential in the design and implementation of the System.Xml namespace in version 1.0 of the .NET Framework.


 

February 14, 2004
@ 09:03 PM

A couple of days ago I wrote about the war in Iraq and whether the actions of the US administration could be considered a war crime. It seems this struck a nerve with at least one of my readers. In a response to that entry Scott Lare wrote:

Today, between Afganistan and Iraq there are approx 50 million people who were previously under regimes of torture who now have a "chance" at freedom. Get a grip on reality! Talk about missing the point and moronism.

I find it interesting that Scott Lare sees the need to put chance in quotes. Ignoring the fact that these “regimes of torture” were in fact supported by the US when it was politically expedient, the question is whether people's lives are any better in Afghanistan and Iraq now that they live in virtual anarchy as opposed to under oppressive regimes. In a post entitled Women as property and U.S.-funded nation-building he excerpts a New York Times opinion piece which states:

Consider these snapshots of the new Afghanistan:

• A 16-year-old girl fled her 85-year-old husband, who married her when she was 9. She was caught and recently sentenced to two and a half years' imprisonment.

• The Afghan Supreme Court has recently banned female singers from appearing on Afghan television, barred married women from attending high school classes and ordered restrictions on the hours when women can travel without a male relative.

• When a man was accused of murder recently, his relatives were obliged to settle the blood debt by handing over two girls, ages 8 and 15, to marry men in the victim's family.

• A woman in Afghanistan now dies in childbirth every 20 minutes, usually without access to even a nurse. A U.N. survey in 2002 found that maternal mortality in the Badakshan region was the highest ever recorded anywhere on earth: a woman there has a 50 percent chance of dying during one of her eight pregnancies.

• In Herat, a major city, women who are found with an unrelated man are detained and subjected to a forced gynecological exam. At last count, according to Human Rights Watch, 10 of these "virginity tests" were being conducted daily.

... Yet now I feel betrayed, as do the Afghans themselves. There was such good will toward us, and such respect for American military power, that with just a hint of follow-through we could have made Afghanistan a shining success and a lever for progress in Pakistan and Central Asia. Instead, we lost interest in Afghanistan and moved on to Iraq.

... Even now, in the new Afghanistan we oversee, they are being kidnapped, raped, married against their will to old men, denied education, subjected to virginity tests and imprisoned in their homes. We failed them. 

To people like Scott I'll only say this: life isn't an action movie where you show up, shoot up all the bad guys, and everyone lives happily ever after. What has happened in Afghanistan is that the US military has shot up some bad guys who have now been replaced by a different set of bad guys. Short of colonizing the country and forcing social change, there isn't much the US military can do for a lot of people in Afghanistan, especially the women. I accept this, but it really irritates me when I hear people mouth off about how “life is so much better” because the US military dropped some bombs on the “bad guys”.

As for Iraq, John Robb has a link to an interesting article on the current state of affairs. He writes:

Debka has some interesting analysis that indicates that the US is in a bind.  The recent moves to empower Iraqi defense forces to take control of city centers is premature (as proved in the brazen attack in Fallujah yesterday).  At the same time the US is committed to a shift of power this summer and the UN is talking about elections this fall.  There are three potential outcomes for this:

  • A full civil war that draws in adjacent powers.
  • Democracy and stability under Sunni leadership. 
  • More US occupation but with increasing resistance.

How would you assign the odds (in percentages) for each outcome?

Considering the animosity between the various factions in Iraq, democracy and stability may not go hand in hand. Being Nigerian, I know firsthand that democracy doesn't automatically mean stability; I guess that's why some refer to us as The New Pakistan.


 

Categories: Ramblings

February 13, 2004
@ 03:30 PM

Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does a good job of summarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say, it definitely does not reflect reality. Specifically:

According to RFC 3023..., if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. us-ascii.

For this to work correctly, if the MIME type of an XML document is text/xml then the web server should look inside the document before sending it over the wire and specify the correct encoding in the charset parameter; otherwise the document will be interpreted incorrectly, since it is highly unlikely that us-ascii is the actual encoding of the XML document. In practice, most web servers do not do this. I have confirmed this by testing against both IIS and Apache.

Instead, what happens is that an XML document is created by the user and dropped on the file system, the web server assumes it is text/xml (which it most likely is), and the document is sent as-is without the charset being set in the Content-Type header.

A simple way to test this is to go to Rex Swain's HTTP Viewer and download the following documents from the W3 Schools page on XML encodings:

  1. XML document in windows-1252 encoding
  2. XML document in ISO-8859-1 encoding
  3. XML document in UTF-8 encoding
  4. XML document in UTF-16 encoding

All four files are sent with a content type of text/xml and no encoding specified in the charset parameter of the Content-Type HTTP header. According to RFC 3023, which Mark Pilgrim quoted in his article, clients should treat them as us-ascii. With the above examples this behavior would be wrong in all four cases.

The moral of this story is that if you are writing an application that consumes XML over HTTP, you should use the following rule of thumb for the foreseeable future [slightly modified from Mark Pilgrim's post]; a rough code sketch of the logic follows the list.

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, any one of the subtypes of application/xml such as application/atom+xml, application/rss+xml or even application/rdf+xml, text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. the encoding given in the encoding attribute of the XML declaration within the document, or
  3. utf-8.
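
Here is a rough sketch of that rule of thumb in C# (my own code, not from Mark Pilgrim's post; the class and method names are made up, and error handling, byte-order-mark sniffing and multi-byte encodings of the XML declaration are glossed over):

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class FeedEncodingSketch
{
    static Encoding GetFeedEncoding(string url)
    {
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        {
            // 1. the encoding given in the charset parameter of the Content-Type HTTP header
            Match charset = Regex.Match(response.ContentType,
                                        @"charset\s*=\s*""?([^"";\s]+)",
                                        RegexOptions.IgnoreCase);
            if (charset.Success)
                return Encoding.GetEncoding(charset.Groups[1].Value);

            // 2. the encoding given in the encoding attribute of the XML declaration
            //    (naively assumes the declaration is readable as ASCII)
            Stream stream = response.GetResponseStream();
            byte[] buffer = new byte[512];
            int read = stream.Read(buffer, 0, buffer.Length);
            string prolog = Encoding.ASCII.GetString(buffer, 0, read);
            Match decl = Regex.Match(prolog, @"encoding\s*=\s*[""']([^""']+)[""']");
            if (decl.Success)
                return Encoding.GetEncoding(decl.Groups[1].Value);

            // 3. fall back to UTF-8
            return Encoding.UTF8;
        }
    }

    static void Main()
    {
        Console.WriteLine(GetFeedEncoding("http://example.org/rss.xml").WebName);
    }
}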

Some may argue that this discussion isn't relevant for news aggregators because they'll only consume XML documents whose MIME type is application/atom+xml or application/rss+xml, but again this ignores practice. In practice most web servers send back RSS feeds as text/xml; if you don't believe me, test ten RSS feeds chosen at random using Rex Swain's HTTP Viewer and see what MIME type the server claims they are.


 

Categories: XML

February 12, 2004
@ 11:47 PM

According to Jeremy Zawodny, “My Yahoo's RSS module also groks Atom. It was added last night. It took about a half hour.” Seeing that he said it took only 30 minutes to implement, and that there are a couple of things about Atom that require a little thought even if all you are interested in is titles and dates (as My Yahoo! is), I decided to give it a try, subscribed to Mark Pilgrim's Atom feed, and this is what I ended up being shown in My Yahoo!

[The “dive into mark” module as shown in My Yahoo!]

The first minor issue is that the posts aren't sorted chronologically, but that isn't particularly interesting. What is interesting is that if you go to the article entitled The myth of RSS compatibility its publication date is given as “Wednesday, February 4, 2004”, which is about a week ago, and if you go to the post entitled Universal Feed Parser 3.0 beta its publication date is given as Wednesday, February 1, 2004, which is almost two weeks ago, not a day ago as Yahoo! claims.

The simple answer to the confusion can be gleaned from Mark's Atom feed: that particular entry has a <modified> date of 2004-02-11T16:17:08Z, an <issued> date of 2004-02-01T18:38:15-05:00 and a <created> date of 2004-02-01T23:38:15Z. My Yahoo! is choosing to key the freshness of the article off its <modified> date even though when one gets to the actual content it seems much older.
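
For the curious, here is a small sketch (hypothetical code of my own, not anything My Yahoo! actually does, using the dates from Mark's entry above) of the choice an aggregator has to make about which Atom 0.3 date element counts as “the” date of an entry:

using System;
using System.Xml;

class AtomDateSketch
{
    static void Main()
    {
        string entryXml =
            "<entry xmlns='http://purl.org/atom/ns#'>" +
            "  <modified>2004-02-11T16:17:08Z</modified>" +
            "  <issued>2004-02-01T18:38:15-05:00</issued>" +
            "  <created>2004-02-01T23:38:15Z</created>" +
            "</entry>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(entryXml);
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("atom", "http://purl.org/atom/ns#");

        DateTime modified = DateTime.Parse(doc.SelectSingleNode("//atom:modified", nsmgr).InnerText);
        DateTime issued   = DateTime.Parse(doc.SelectSingleNode("//atom:issued", nsmgr).InnerText);

        // Sorting by <modified> makes a two-week-old post look a day old;
        // sorting by <issued> matches the date the reader sees on the page itself.
        Console.WriteLine("modified: {0}", modified);
        Console.WriteLine("issued:   {0}", issued);
    }
}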

It is quite interesting to see how just one concept [how old is this article?] can lead to some confusion between the end user of a news aggregator and the content publisher. I also suspect that My Yahoo! could be similarly confused by the various issues with escaping content in Atom when processing titles but since I don't have access to a web server I can't test some of my theories.

I tend to wonder whether the various content producers creating Atom feeds will ditch their Atom 0.3 feeds for Atom 0.4, Atom 0.5 and so on until it becomes a final IETF spec, or whether they'll keep parallel versions of these feeds so Atom 0.3 continues to live in perpetuity.

It's amazing how geeks can turn the simplest things into such a mess. I'm definitely going to sit this out until the IETF Atom 1.0 syndication format spec is done before spending any time working on this for RSS Bandit.


 

Categories: Technology

February 11, 2004
@ 04:02 PM

One of the big problems with arguing about metadata is that one person's data is another person's metadata. I was reading Joshua Allen's blog post entitled Trolling EFNet, or Promiscuous Memories where he wrote:

  • Some people deride "metacrap" and complain that "nobody will enter all of that metadata".  These people display a stunning lack of vision and imagination, and should be pitied.  Simply by living their lives, people produce immense amounts of metadata about themselves and their relationships to things, places, and others that can be harvested passively and in a relatively low-tech manner.
  • Being able to remember what we have experienced is very powerful.  Being able to "remember" what other people have experienced is also very powerful.  Language improved our ability to share experiences to others, and written language made it possible to communicate experiences beyond the wall of death, but that was just the beginning.  How will your life change when you can near-instantly "remember" the relevant experiences of millions of other people and in increasingly richer detail and varied modality?
From my perspective it seems Joshua is confusing data and metadata. If I had a video camera attached to my forehead recording everything I saw, then the actual audiovisual content of the files on my hard drive would be the data, while the metadata would be information such as what date it was, where I was and who I saw. Basically, metadata is data about data. The interesting thing about metadata is that if we have enough good quality metadata then we can do things like near-instantly "remember" the relevant experiences of ourselves and millions of other people. It won't matter that all my experiences are cataloged and stored on a hard drive if the retrieval process isn't automated (e.g. I can 'search' for experiences by who they were shared with, where they occurred or when they occurred) and I instead have to fast forward through gigabytes of video data. The metadata ideal would be that all this extra, descriptive information is attached to my audiovisual experiences stored on disk so I could quickly search for “videos from conversations with my boss in October, 2003”.

This is where metacrap comes in. From Cory Doctorow's excellent article entitled Metacrap:

A world of exhaustive, reliable metadata would be a utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities.

This applies to Joshua's vision as well. Data acquisition is easy; anyone can walk around with a camcorder or digital camera today recording everything they can. Effectively tagging the content so it can be categorized in a way that lets you do interesting things with it search-wise is what is unfeasible. Cory's article does a far better job than I can of explaining the many different ways this is unfeasible; cameras with datestamps and built-in GPS are just the tip of the iceberg. I can barely remember dates once an event didn't happen in the recent past and wasn't a special occasion. As for built-in GPS, until the software is smart enough to convert longitude and latitude coordinates to “that Chuck E Cheese in Redmond”, it only solves problems for geeks, not regular people. I'm sure technology will get better, but metacrap is and may always be an insurmountable problem on a global network like the World Wide Web without lots of standardization.


 

Categories: Technology

February 11, 2004
@ 03:36 PM

Besides our releases, Torsten packages nightly builds of RSS Bandit for folks who want to try out bleeding-edge features or test recent bug fixes without having to set up a CVS client. There is currently a bug, pointed out by James Clarke, that we think is fixed but we would like interested users to test:

I'm hitting a problem once every day or so when the refresh thread seems to be exiting - The message refreshing Feeds never goes away and the green download icons for the set remain green forever. No feed errors are generated. Only way out is to quit.

If you've encountered this problem on recent versions of RSS Bandit, try out the RSS Bandit build from February 9th 2004 and see if that fixes the problem. Once we figure out the root of the problem and fix it there'll be a refresh of the installer with the updated bits.


 

Categories: RSS Bandit

February 11, 2004
@ 02:51 AM

From Sam Ruby's slides for the O'Reilly Emerging Technology Conference:

Where are we going?

• A draft charter will be prepared in time to be informally discussed at the IETF meeting in Seoul, Korea on the week of 29 February to 5 March
• Hopefully, the Working Group itself will be approved in March
• Most of the work will be done on mailing lists
• Ideally, a face to face meeting of the Working Group will be scheduled to coincide with the August 1-6 meeting of the IETF in San Diego

Interesting. Taking the spec to the IETF implies that Sam thinks it's mostly done. Well, I just hope the IETF's errata process is better than the W3C's.


 

Categories: Technology

February 10, 2004
@ 05:30 PM

Robert Scoble has a post entitled Metadata without filling in forms? It's coming where he writes:

Simon Fell read my interview about search trends and says "I still don't get it" about WinFS and metadata. He brings up a good point. If users are going to be forced to fill out metadata forms, like those currently in Office apps, they just won't do it. Fell is absolutely right. But he assumed that metadata would need to be entered that way for every photo. Let's go a little deeper.... OK, I have 7400 photos. I have quite a few of my son. So, let's say there's a new kind of application. It recognizes the faces automatically and puts a square around them. Prompting you to enter just a name. When you do the square changes color from red to green, or just disappears completely.
...
A roadblock to getting that done today is that no one in the industry can get along for enough time to make it possible to put metadata into files the way it needs to be done. Example: look at the social software guys. Friendster doesn't play well with Orkut which doesn't play well with MyWallop, which doesn't play well with Tribe, which doesn't play well with ICQ, which doesn't play well with Outlook. What's the solution? Fix the platform underneath so that developers can put these features in without working with other companies and/or other developers they don't usually work with.

The way WinFS is being pitched by Microsoft folks reminds me a lot of Hailstorm [which is probably unsurprising since a number of Hailstorm folks work on it] in that there are a lot of interesting and useful technical ideas burdened by bad scenarios being hung on them. Before going into the interesting and useful technical ideas around WinFS, I'll start with why I consider the two scenarios mentioned by Scoble to be “bad scenarios”.

The notion that making the file system a metadata store automatically makes search better is a dubious proposition once you realize that a number of the searches people can't do today wouldn't be helped much by more metadata. This isn't to say some searches wouldn't work better (e.g. searching for songs by title or artist); however, for search scenarios such as finding a particular image or video among a bunch of files with generic names, or finding a song by its lyrics, simply having the ability to tag media files with metadata doesn't seem like enough. Once your scenarios start having to involve “face recognition software” or “cameras with GPS coordinates” to work, it is hard for people not to scoff. It's like a variation of the popular Slashdot joke:

1. Add metadata search capabilities to the file system
2. ???
3. You can now search for “all pictures taken during Tommy's 5th birthday party at the Chuck E Cheese in Redmond”.

with the ??? in the middle implying significant difficulty in going from step 1 to step 3.

The other criticism is that Robert's post implies the reasons applications can't talk to each other are technical. This is rarely the case. The main reason applications don't talk to each other isn't a lack of technology [especially now that we have a well-defined format for exchanging data called XML] but various social and business reasons. There are no technical reasons MSN Messenger can't talk to ICQ or that prevent Yahoo! Messenger from talking to AOL Instant Messenger. It isn't technical reasons that prevent my data in Orkut from being shared with Friendster or my book & music preferences in Amazon from being shared with other online stores I visit. All of these entities feel they have a competitive advantage in making it hard to migrate from their platforms.

The two things Microsoft needs to do in this space are (i) show how and why it is beneficial for different applications to share data locally and (ii) provide guidelines as well as best practices for applications to share their data in a secure manner.

While talking to Joshua Allen, Dave Winer, Robert Scoble, Lili Cheng, and Curtis Wong yesterday it seemed clear to me that social software [or, if you are a business user, groupware that is more individual-focused and gives people more control over content and information sharing] would be a very powerful and useful tool for businesses and end users if built on a platform like Longhorn with a smart data store that knows how to create relationships between concepts as well as files (i.e. WinFS) and a flexible, cross-platform distributed computing framework (i.e. Indigo).

The WinFS folks and Longhorn evangelists will probably keep focusing on what I have termed “bad scenarios” because they demo well, but I suspect they'll have difficulty getting traction with them in the real world. Of course, I may be wrong and the various people who've expressed incredulity at the current pitches may be a vocal minority who'll be proved wrong once others embrace the vision. Either way, I plan to experiment with these ideas once Longhorn starts to beta and see where the code takes me.


 

Categories: Technology