Wednesday, 18 February 2004 - Dare Obasanjo's weblog

February 18, 2004

@ 06:26 PM

Mr. Safe's Guide to the RSS vs. ATOM debate

Dave Winer recently wrote that at least one person has asked if it is safe to ignore Atom in his weblog. If you are a cautious person like Tim Bray's Mr. Safe, or you fit more on the right than the left side of the Technology Adoption Life Cycle, then you are probably wondering why you should want to support the Atom syndication format over one of the many flavors of RSS. There are two parts to this question, depending on whether you are a producer of syndication feeds or a consumer of syndication feeds.

The Safe Syndication Producer's Perspective
An RSS feed is a regularly updated XML document that contains metadata about a news source and the content in it. Minimally, an RSS feed consists of a channel that represents the news source, which has a title, link, and description that describe the news source. Additionally, an RSS feed typically contains one or more item elements that represent individual news items, each of which should have a title, link, or description.
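As a rough illustration of that structure, here is a minimal sketch in Python (standard library only) that reads the channel metadata and the items out of a tiny RSS 2.0 document. The feed content, the URLs and the summarize_feed helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical minimal RSS 2.0 document; titles and URLs are invented.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>A sample news source</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <description>An item with only a description</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "description")


def summarize_feed(xml_text):
    """Return the channel metadata and a list of per-item dictionaries."""
    channel = ET.fromstring(xml_text).find("channel")
    info = {name: channel.findtext(name) for name in FIELDS}
    # Missing child elements come back as None rather than raising.
    items = [
        {name: item.findtext(name) for name in FIELDS}
        for item in channel.findall("item")
    ]
    return info, items


info, items = summarize_feed(SAMPLE_FEED)
print(info["title"], "-", len(items), "items")
```

The second item carries only a description, which is why the sketch treats every item field as optional.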

There are two primary flavors of RSS: Dave Winer's family of specifications (the most popular being RSS 0.91 & RSS 2.0) and the RDF-based RSS 1.0. Dave Winer's family of specifications is the more popular and has been adopted by a number of well-known organizations such as Yahoo! News, the BBC, Rolling Stone magazine, the Microsoft Developer Network (MSDN), the Oracle Technology Network (OTN), the Sun Developer Network and Apple's iTunes Music Store. According to Syndic8, which tracks over 50,000 RSS feeds, RSS 0.91, RSS 1.0 & RSS 2.0 each have about 30% of the RSS market share.

Most news aggregators support all three major versions of RSS, although few actually take advantage of the fact that RSS 1.0 is an RDF vocabulary. If all one wants is simple syndication of news items then RSS 0.91 should be satisfactory. If one plans to use extensions to the core RSS specification that expose application or domain specific functionality, such as the ability to post comments, one can use one of the many RSS modules in combination with RSS 2.0. The only advantage that RSS 1.0 gives over RSS 0.91/RSS 2.0 is that it is an RDF vocabulary and thus fits nicely into the dream of the Semantic Web.

The Atom syndication format can be considered to be a more sophisticated implementation of the ideas in RSS 2.0. It adds richer syndication capabilities, such as the ability to put binary formats such as Word documents and PowerPoint documents in feeds, and formalizes some of the best practices in the RSS world around putting [X]HTML in feeds.

The average user of a news aggregator will not be able to tell the difference between an Atom feed and an RSS feed if their aggregator supports both. However, users of aggregators that don't support Atom will not be able to subscribe to feeds in that format. In a few years, the differences between RSS and Atom will most likely be like the differences between RSS 1.0 and RSS 0.91/RSS 2.0: only of interest to a handful of XML syndication geeks. Even then the simplest and safest bet would still be to use RSS as a syndication format. This is similar to the fact that even though the W3C has published XHTML 1.0 & XHTML 1.1 and is working on XHTML 2.0, the safest bet to get the widest reach with the least problems is to publish a website in HTML 3.2 or HTML 4.01.

The Safe Syndication Consumer's Perspective
If you plan to consume feeds from a wide variety of sources then you should endeavor to support as many syndication formats as possible. The more formats a feed consumer supports, the more content is available to its users.

Based on their current popularity, degree of support and ease of implementation, one should consider supporting the major syndication formats in the following order of priority:

RSS 0.91/RSS 2.0
RSS 1.0
Atom

RSS 0.91 support is the simplest to implement and the most widely supported by websites, while Atom, being the most complex, is the most difficult to implement and will be the least supported by websites in the coming years.
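For a consumer that wants to support all three, the first step is telling the formats apart. The following is a minimal sketch in Python, assuming the format can be guessed from the root element alone: an rss element for the RSS 0.91/RSS 2.0 family, rdf:RDF for RSS 1.0 and a feed element for Atom. The detect_feed_format name is hypothetical, and a real aggregator would need to be more defensive (for instance, not every rdf:RDF document is an RSS 1.0 feed).

```python
import xml.etree.ElementTree as ET

# Namespace of the rdf:RDF root element used by RSS 1.0 documents.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_feed_format(xml_text):
    """Guess the syndication format from the document's root element."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # The version attribute distinguishes RSS 0.91 from RSS 2.0.
        return "RSS " + root.get("version", "(unknown version)")
    if root.tag == "{%s}RDF" % RDF_NS:
        return "RSS 1.0"
    # Compare only the local name so that any Atom namespace is accepted.
    if root.tag.rsplit("}", 1)[-1] == "feed":
        return "Atom"
    return "unknown"


print(detect_feed_format('<rss version="0.91"><channel/></rss>'))
```

Once the format is known, each one can be handed to its own parser, or normalized into a single internal representation.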

Categories: XML

February 18, 2004

@ 04:42 PM

Comments [3]

Introductions: My Day Job

This is here mainly for me to be able to look back on in a few years and for any new readers of my blog who wonder what I actually do at Microsoft.

I am a program manager for the WebData XML team. The WebData team is part of the SQL Server Product Unit and produces Microsoft's major data access technologies including MDAC, MSXML, ADO.NET, System.Xml, ObjectSpaces and the WinFS API.

As a technical program manager I am responsible for the nitty gritty of the design of the classes in the following namespaces in the .NET Framework:

System.Xml.Schema
System.Xml.XPath
System.Xml (I was sharing this with Joshua before he left our team)

Nitty gritty design details means stuff like triaging bug fixes, designing new features or new classes, writing specifications, and interacting with internal & external customers to discover their likes and dislikes about the APIs in question.

I am also the community lead for the WebData XML team, which means I am responsible for things like the XML Most Valuable Professional (MVP) program and the upcoming MSDN XML Developer Center. For the MVP program I am the primary point of contact between my team and the Microsoft MVP program and with our MVPs. I am also one of the folks who approves or rejects nominees. As for the developer center, I am the equivalent of what MSDN likes to call a “content strategist”, which basically means I am responsible for the content on the site. For the most part I am also the primary point of contact between my team and MSDN.

If you have any issues or questions related to the aforementioned aspects of my job at Microsoft (e.g. bug reports, feature requests or questions about writing for MSDN) feel free to ping me on my work email address. If you don't know it you should be able to find it from a minute or two of Googling.

Categories: Life in the B0rg Cube

February 18, 2004

@ 06:28 AM

Comments [3]

WinFS as a Digital Media Store

Chris Sells writes

On his quest to find "non-bad WinFS scenarios" (ironically, because he was called out by another Microsoft employee -- I love it when we fight in public : ), Jeremy Mazner, Longhorn Technical Evangelist, starts with his real life use of Windows Movie Maker and trying to find music to use as a soundtrack. Let Jeremy know what you think.

I think the scenario is compelling. In fact, the only issue I have with the WinFS scenario that Jeremy outlines is that he implies that the metadata about music files that Windows Media Player exposes is tied to the application, but in truth most of it is tied to the actual media files as regular file info [file location, date modified, etc] or as ID3 tags [album, genre, artist, etc]. This means that there doesn't even need to be explicit inter-application sharing of data.

If the file system had a notion of a music item which exposed the kind of information one sees in ID3 tags, and which was also exposed by the shell in standard ways, then you could do lots of interesting things with music metadata without even trying hard. I also think it's quite compelling because metadata attached to music files is such low-hanging fruit: one can get immediate value out of it and it exists today on the average person's machine.

Categories: Technology

February 17, 2004

@ 06:34 PM

Comments [1]

Slashdot Puts A Human Face On Outsourcing

The folks at Slashdot have Indian Techies Answer About 'Onshore Insourcing'. Excellent stuff.

Categories: Mindless Link Propagation

February 16, 2004

@ 07:17 AM

Comments [12]

I just saw an entry in Ted Leung's blog about SMS messages where he wrote

[via Trevor's ETech notes]

become rude to make a phone call without first checking via sms. [this is becoming more and more the case in europe also]

I would love it if this became the etiquette here in the US as well. For all telephone calls, not just cell calls. People seem to believe that they have the right to call you simply because you have a telephone.

The so-called SMS craze that has hit Europe and Asia seems totally absurd to me. I can understand teenagers and college students using SMS as a more sophisticated way of passing notes to each other in class, but I can't see any other reason why, if I have a device I could use to talk to someone, I'd send them a hastily written and poorly spelled text message instead. Well, maybe if text messages were free and voice calls were fairly expensive, but since that isn't the case in the US I guess that's why I don't get it.

Categories: Ramblings

February 15, 2004

@ 07:00 PM

Comments [4]

Combining XPath-based Filtering with Pull-based XML Parsing

Daniel Cazzulino has been writing about his work with XML Streaming Events, which combines the ability to do XPath queries with the .NET Framework's forward-only, pull-based XML parser. He shows the following code sample:

// Setup the namespaces
XmlNamespaceManager mgr = new XmlNamespaceManager(temp.NameTable);
mgr.AddNamespace("r", RssBanditNamespace);
// Precompile the strategy used to match the expression
IMatchStrategy st = new RootedPathFactory().Create(
    "/r:feeds/r:feed/r:stories-recently-viewed/r:story", mgr);
int count = 0;
// Create the reader.
XseReader xr = new XseReader( new XmlTextReader( inputStream ) );
// Add our handler, using the strategy compiled above.
xr.AddHandler(st, delegate { count++; });
while (xr.Read()) { }
Console.WriteLine("Stories viewed: {0}", count);

I have a couple of questions about his implementation, the main one being how it deals with XPath queries such as /r:feeds/r:feed[count(r:stories-recently-viewed)>10]/r:title, which can't be evaluated in a forward-only manner.
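To make the distinction concrete, here is a minimal sketch of forward-only matching for a simple rooted path. It is written in Python with the standard library's iterparse rather than with XseReader or System.Xml, it ignores namespaces, and the count_matches helper and sample document are invented for the example. A rooted path can be matched by keeping only the stack of open element names; a predicate like count(r:stories-recently-viewed)>10 cannot, because the r:title may already have gone past before the children have all been counted.

```python
import io
import xml.etree.ElementTree as ET


def count_matches(stream, path):
    """Count elements whose rooted path equals `path`, reading forward only."""
    target = path.strip("/").split("/")
    stack = []  # names of the currently open elements, outermost first
    count = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if stack == target:
                count += 1
        else:
            stack.pop()
            elem.clear()  # drop text and children we no longer need
    return count


# Hypothetical document shaped like the path in the sample above.
SAMPLE = b"""<feeds>
  <feed>
    <stories-recently-viewed>
      <story>a</story>
      <story>b</story>
    </stories-recently-viewed>
  </feed>
  <feed>
    <stories-recently-viewed>
      <story>c</story>
    </stories-recently-viewed>
    <story>wrong parent, so not counted</story>
  </feed>
</feeds>"""

viewed = count_matches(io.BytesIO(SAMPLE), "/feeds/feed/stories-recently-viewed/story")
print("Stories viewed:", viewed)
```

The only state carried between events is the stack and the counter, which is what makes this kind of query safe to run over a stream.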

Oleg Tkachenko also pipes in with some opinions about streaming XPath in his post Warriors of the Streaming XPath Order. He writes

I've been playing with such beasts, making all kinds of mistakes and finally I came up with a solution, which I think is good, but I didn't publish it yet. Why? Because I'm tired to publish spoilers :) It's based on "ForwardOnlyXPathNavigator" aka XPathNavigator over XmlReader, Dare is going to write about in MSDN XML Dev Center and I wait till that's published.

May be I'm mistaken, but anyway here is the idea - "ForwardOnlyXPathNavigator" is XPathNavigator implementation over XmlReader, which obviously supports forward-only XPath subset...

And after I played enough with and implemented that stuff I discovered BizTalk 2004 Beta classes contain much better implementation of the same functionality in such gems as XPathReader, XmlTranslatorStream, XmlValidatingStream and XPathMutatorStream. They're amazing classes that enable streaming XML processing in much rich way than trivial XmlReader stack does. I only wonder why they are not in System.Xml v2 ? Is there are any reasons why they are still hidden deeply inside BizTalk 2004 ? Probably I have to evangelize them a bit as I really like this idea.

Actually Oleg is both closer to and farther from the truth than he realizes. Although I wrote about a hypothetical ForwardOnlyXPathNavigator in my article entitled Can One Size Fit All? for XML Journal, my planned article, which should show up when the MSDN XML Developer Center launches in a month or so, won't be using it. Instead it will be based on an XPathReader that is very similar to the one used in BizTalk 2004; in fact, it was written by the same guy. The XPathReader works similarly to Daniel Cazzulino's XseReader but uses the XPath subset described in Arpan Desai's Introduction to Sequential XPath paper instead of adding proprietary extensions to XPath as Daniel's does.

When the article describing the XPathReader is done it will provide source, and if there is interest I'll create a GotDotNet Workspace for the project, although it is unlikely that either I or the dev who originally wrote the code will have time to maintain it.

Categories: XML

February 15, 2004

@ 05:50 PM

Comments [2]

On Semantic Integration and XML

A few months ago I attended XML 2003, where I first learned about Semantic Integration, the buzzword for mapping data from one schema to another with a heavy focus on using Semantic Web technologies such as ontologies and the like. The problem that these technologies solve is enabling one to map XML data from external sources to a form that is compatible with how an application or business entity manipulates them internally.

For example, in RSS Bandit we treat feeds in memory and on disk as if they are in the RSS 2.0 format even though it supports other flavors of RSS as well, such as RSS 1.0. Proponents of semantic integration technologies would suggest using a technology such as the W3C's OWL Web Ontology Language for this kind of mapping. If you are unfamiliar with ontologies and how they apply to XML, a good place to understand what they are useful for is the OWL Web Ontology Language Use Cases and Requirements. The following quote from the OWL Use Cases document gives a glimpse into the goal of ontology languages:

In order to allow more intelligent syndication, web portals can define an ontology for the community. This ontology can provide a terminology for describing content and axioms that define terms using other terms from the ontology. For example, an ontology might include terminology such as "journal paper," "publication," "person," and "author." This ontology could include definitions that state things such as "all journal papers are publications" or "the authors of all publications are people." When combined with facts, these definitions allow other facts that are necessarily true to be inferred. These inferences can, in turn, allow users to obtain search results from the portal that are impossible to obtain from conventional retrieval systems

Although the above example talks about search engines it is clear that one can also use this for data integration. In the example of RSS Bandit, one could create an ontology that maps the terms in RSS 1.0 to those in RSS 2.0 and make statements such as

RSS 1.0's <title> element sameAs RSS 2.0's <title> element

Basically, one could imagine schemas for RSS 1.0 and RSS 2.0 represented as two trees and an ontology as a way of drawing connections between the leaves and branches of the trees. In a previous post entitled More on RDF, The Semantic Web and Perpetual Motion Machines I questioned how useful this actually would be in the real world by pointing out the dc:date vs. pubDate problem in RSS. I wrote

However there are further drawbacks to using the semantics based approach than using the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elements (e.g. the equivalent of element renaming such as stating that a url and link element are equivalent) an ontology language is insufficient and a Turing complete transformation language like XSLT is not. A good example of this is another example from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted, the first is by using the pubDate element which is described as containing a string in the RFC 822 format while the other is using the dc:date element which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation has been applied if one wants an application to treat pubDate and dc:date as equivalent. This means that instead of making one pass with an XSLT stylesheet to perform the transformation in the XML-based solution, two transformation techniques will be needed in the RDF-based solution and it is quite likely that one of them would be XSLT.
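The syntactic half of that problem can be sketched in a few lines. This is Python standing in for the transformation step (not XSLT, and not RSS Bandit's code); the two helper names and the sample timestamps are made up. Both strings below name the same instant, but only after each has been parsed with its own grammar can an application see that.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(text):
    """Parse an RFC 822 style date, as found in pubDate, into UTC."""
    return parsedate_to_datetime(text).astimezone(timezone.utc)


def parse_dc_date(text):
    """Parse an ISO 8601 date, as found in dc:date, into UTC.

    datetime.fromisoformat only accepts a subset of ISO 8601 on older
    Python versions, so this is a sketch rather than a complete parser.
    """
    return datetime.fromisoformat(text).astimezone(timezone.utc)


# Hypothetical values for the same posting time in the two formats.
from_pub_date = parse_pub_date("Sun, 15 Feb 2004 17:50:00 GMT")
from_dc_date = parse_dc_date("2004-02-15T09:50:00-08:00")
print(from_pub_date == from_dc_date)
```

A sameAs statement between the two elements says nothing about this parsing step, which is the extra work the quoted passage is pointing at.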

The above is a simple example; one could imagine more complex examples where the vocabularies to be mapped differ much more syntactically, such as

<author>Dare Obasanjo (dareo@example.com)</author>

<author>
  <fname>Dare</fname>
  <lname>Obasanjo</lname>
  <email>dareo@example.com</email>
</author>
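Mapping the first form to the second takes string processing rather than a statement of equivalence. A minimal sketch, again in Python instead of XSLT, and assuming the text always looks like "First Last (email)"; the split_author helper and its pattern are hypothetical and would not survive real-world author strings unchanged.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape: a name, then an email address in parentheses.
AUTHOR_PATTERN = re.compile(r"^\s*(?P<name>.*?)\s*\((?P<email>[^()\s]+)\)\s*$")


def split_author(text):
    """Turn 'First Last (email)' into a structured <author> element."""
    match = AUTHOR_PATTERN.match(text)
    if match is None:
        raise ValueError("unrecognized author format: %r" % text)
    # Everything before the first space is treated as the first name.
    first, _, last = match.group("name").partition(" ")
    author = ET.Element("author")
    ET.SubElement(author, "fname").text = first
    ET.SubElement(author, "lname").text = last
    ET.SubElement(author, "email").text = match.group("email")
    return author


structured = ET.tostring(split_author("Dare Obasanjo (dareo@example.com)"), encoding="unicode")
print(structured)
```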

The aforementioned examples point out technical issues with using ontology-based techniques for mapping between XML vocabularies, but I failed to point out the human problems that tend to show up in the real world. A few months ago I was talking to Chris Lovett about semantic integration and he pointed out that in many cases, as applications evolve, semantics begin to be assigned to values in often orthogonal ways.

An example of semantics being added to values again comes from RSS Bandit. A feature of RSS Bandit is that feeds are cached on disk, allowing a user to read items that have long since disappeared from the feed. At first we provided the ability for the user to specify how long items should be kept in the cached feed, ranging from a day up to a year. We used an element named maxItemAge embedded in the cached feed which contained a serialized instance of the System.TimeSpan structure. After a while we realized we needed ways to say, for a particular feed, "use the global default maxItemAge", "never cache items for this feed" or "never expire items for this feed", so we used the TimeSpan.MinValue, TimeSpan.Zero and TimeSpan.MaxValue values of the TimeSpan structure respectively.

If another application wanted to consume this data and had a similar notion of 'how long to keep the items in a feed' it couldn't simply map maxItemAge to whatever internal property it used without taking into account the extra semantics embedded in certain values of that element. Overloading the meaning of properties and fields in a database or class is actually fairly commonplace [after all, how many different APIs use the occurrence of -1 for a value that should typically return a positive number as an error condition?] and something that must also be considered when applying semantic integration technologies to XML.
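A small sketch shows how a consumer goes wrong if it ignores the overloaded values. It uses Python's timedelta as a stand-in for System.TimeSpan (timedelta.min, timedelta(0) and timedelta.max playing the roles of MinValue, Zero and MaxValue); the 90-day global default and the function names are assumptions made for the example, not RSS Bandit's actual code.

```python
from datetime import timedelta

GLOBAL_DEFAULT_MAX_AGE = timedelta(days=90)  # hypothetical global setting

# Stand-ins for TimeSpan.MinValue, TimeSpan.Zero and TimeSpan.MaxValue.
USE_GLOBAL_DEFAULT = timedelta.min
NEVER_CACHE = timedelta(0)
NEVER_EXPIRE = timedelta.max


def should_keep(item_age, max_item_age):
    """Decide whether a cached item survives, honoring the special values."""
    if max_item_age == USE_GLOBAL_DEFAULT:
        max_item_age = GLOBAL_DEFAULT_MAX_AGE
    if max_item_age == NEVER_CACHE:
        return False
    if max_item_age == NEVER_EXPIRE:
        return True
    return item_age <= max_item_age


def naive_should_keep(item_age, max_item_age):
    """What a consumer does if it treats maxItemAge as a plain duration."""
    return item_age <= max_item_age


month_old = timedelta(days=30)
print(should_keep(month_old, USE_GLOBAL_DEFAULT))        # honors the default
print(naive_should_keep(month_old, USE_GLOBAL_DEFAULT))  # discards the item
```

The naive version happens to do the right thing for the "never cache" and "never expire" values but throws away every item in a feed that merely asked for the global default.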

In conclusion, it is clear that Semantic Web technologies can be used to map between XML vocabularies; however, in non-trivial situations the extra work that must be layered on top of such approaches tends to favor using XML-centric techniques such as XSLT to map between the vocabularies instead.

Categories: XML

February 15, 2004

@ 03:07 AM

Comments [0]

Everything's Xen

Just as it looks like my buddy Erik Meijer is done with blogging (despite his short-lived guest blogging stint at Lambda the Ultimate), a couple more of the folks who brought Xen to the world have started blogging. They are

William Adams: Dev Manager for the WebData XML team.
Matt Warren: Formerly a developer on the WebData XML team, now works on the C# team or on the CLR (I can never keep those straight).

Both of them were also influential in the design and implementation of the System.Xml namespace in version 1.0 of the .NET Framework.

Categories: Life in the B0rg Cube | Movie Review

February 14, 2004

@ 09:03 PM

Comments [4]

If Only Life Were An Action Movie

A couple of days ago I wrote about The war in Iraq and whether the actions of the US administration could be considered a war crime. It seems this struck a nerve with at least one of my readers. In a response to that entry Scott Lare wrote

Today, between Afganistan and Iraq there are approx 50 million people who were previously under regimes of torture who now have a "chance" at freedom. Get a grip on reality! Talk about missing the point and moronism.

I find it interesting that Scott Lare sees the need to put chance in quotes. Now, ignoring the fact that these “regimes of torture” were in fact supported by the US when it was politically expedient, the question is whether people's lives are any better in Afghanistan and Iraq now that they live in virtual anarchy as opposed to under oppressive regimes. In a post entitled Women as property and U.S.-funded nation-building he excerpts a New York Times opinion piece which states

Consider these snapshots of the new Afghanistan:

• A 16-year-old girl fled her 85-year-old husband, who married her when she was 9. She was caught and recently sentenced to two and a half years' imprisonment.

• The Afghan Supreme Court has recently banned female singers from appearing on Afghan television, barred married women from attending high school classes and ordered restrictions on the hours when women can travel without a male relative.

• When a man was accused of murder recently, his relatives were obliged to settle the blood debt by handing over two girls, ages 8 and 15, to marry men in the victim's family.

• A woman in Afghanistan now dies in childbirth every 20 minutes, usually without access to even a nurse. A U.N. survey in 2002 found that maternal mortality in the Badakshan region was the highest ever recorded anywhere on earth: a woman there has a 50 percent chance of dying during one of her eight pregnancies.

• In Herat, a major city, women who are found with an unrelated man are detained and subjected to a forced gynecological exam. At last count, according to Human Rights Watch, 10 of these "virginity tests" were being conducted daily.

... Yet now I feel betrayed, as do the Afghans themselves. There was such good will toward us, and such respect for American military power, that with just a hint of follow-through we could have made Afghanistan a shining success and a lever for progress in Pakistan and Central Asia. Instead, we lost interest in Afghanistan and moved on to Iraq.

... Even now, in the new Afghanistan we oversee, they are being kidnapped, raped, married against their will to old men, denied education, subjected to virginity tests and imprisoned in their homes. We failed them.

To people like Scott I'll only say this: life isn't an action movie where you show up, shoot up all the bad guys and everyone lives happily ever after. What has happened in Afghanistan is that the US military has shot up some bad guys who have now been replaced by a different set of bad guys. Short of colonizing the country and forcing social change there isn't much the US military can do for a lot of people in Afghanistan, especially the women. I accept this, but it really irritates me when I hear people mouth off about how “life is so much better” because the US military dropped some bombs on the “bad guys”.

As for Iraq, John Robb has a link to an interesting article on the current state of affairs. He writes

Debka has some interesting analysis that indicates that the US is in a bind. The recent moves to empower Iraqi defense forces to take control of city centers is premature (as proved in the brazen attack in Fallujah yesterday). At the same time the US is committed to a shift of power this summer and the UN is talking about elections this fall. There are three potential outcomes for this:

A full civil war that draws in adjacent powers.
Democracy and stability under Sunni leadership.
More US occupation but with increasing resistance.

How would you assign the odds (in percentages) for each outcome?

Considering the animosity between the various factions in Iraq, democracy and stability may not go hand in hand. Being Nigerian, I know first hand that democracy doesn't automatically mean stability; I guess that's why some refer to us as The New Pakistan.

Categories: Ramblings

February 13, 2004

@ 03:30 PM

Comments [1]

Sex, Lies and XML MIME Types

Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does a good job of summarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say, it definitely does not reflect reality. Specifically

According to RFC 3023..., if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

the encoding given in the charset parameter of the Content-Type HTTP header, or
us-ascii.

So for this to work correctly, if the MIME type of an XML document is text/xml then the web server should look inside the document before sending it over the wire and send the correct encoding in the charset parameter; otherwise the document will be interpreted incorrectly, since it is highly likely that us-ascii is not the encoding of the XML document. In practice, most web servers do not do this. I have confirmed this by testing against both IIS and Apache.

Instead what happens is that an XML document is created by the user and dropped on the file system, and the web server assumes it is text/xml, which it most likely is, and sends it as is without setting the charset in the content type header.

A simple way to test this is to go to Rex Swain's HTTP Viewer and download the following documents from the W3 Schools page on XML encodings

All files are sent with a content type of text/xml and no encoding specified in the charset parameter of the Content-Type HTTP header. According to RFC 3023, which Mark Pilgrim quoted in his article, clients should treat them as us-ascii. With the above examples this behavior would be wrong in all four cases.

The moral of this story is that if you are writing an application that consumes XML using HTTP you should use the following rule of thumb for the foreseeable future [slightly modified from Mark Pilgrim's post]

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml then the encoding is

the encoding given in the charset parameter of the Content-Type HTTP header, or
the encoding given in the encoding attribute of the XML declaration within the document, or
utf-8.
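The rule of thumb above could be sketched roughly as follows. This is an illustrative Python version, not code from any aggregator: guess_encoding is a hypothetical helper, it assumes the Content-Type value and the raw response bytes are already in hand, and its peek at the XML declaration only works for ASCII-compatible encodings (a UTF-16 document would need byte order mark detection first).

```python
import re
from email.message import Message

# Matches encoding="..." inside an XML declaration at the very start of the body.
XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_encoding(content_type, body):
    """Charset parameter first, then the XML declaration, then utf-8."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_param("charset")
    if charset:
        return charset.lower()
    if body.startswith(b"\xef\xbb\xbf"):
        body = body[3:]  # skip a UTF-8 byte order mark before the declaration
    match = XML_DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


print(guess_encoding("text/xml", b'<?xml version="1.0" encoding="windows-1252"?><note/>'))
```

Note that the media type itself is never inspected: under this rule of thumb text/xml and application/xml documents are treated the same way.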

Some may argue that this discussion isn't relevant for news aggregators because they'll only consume XML documents whose MIME type is application/atom+xml or application/rss+xml, but again this ignores practice. In practice most web servers send back RSS feeds as text/xml; if you don't believe me, test ten RSS feeds chosen at random using Rex Swain's HTTP Viewer and see what MIME type the server claims they are.

Categories: XML

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
