One of the biggest assumptions I had about software development was shattered when I started working on the XML team at Microsoft. This assumption was that standards bodies know what they are doing and produce specifications that are indisputable. However, I've come to realize that the problems of design by committee affect illustrious names such as the W3C and IETF just like everyone else. These problems become even more pernicious when trying to combine technologies defined in multiple specifications to produce a coherent end-to-end application.

An example of the problems caused by contradictions in core specifications of the World Wide Web is summarized in Mark Pilgrim's article, XML on the Web Has Failed. The issue raised in his article is that determining the encoding to use when processing an XML document retrieved off the Web via HTTP, such as an RSS feed, is defined in at least three specifications which contradict each other somewhat: XML 1.0, HTTP 1.0/1.1 and RFC 3023. The bottom line is that most XML processors, including those produced by Microsoft, ignore one or more of these specifications. In fact, if applications suddenly started following all these specifications to the letter, a large number of the XML documents on the Web would be considered invalid. By Mark Pilgrim's count, about 40% of 5,000 RSS feeds chosen at random would be considered invalid even though they'd work in almost all RSS aggregators and be considered well-formed by most XML parsers, including the System.Xml.XmlTextReader class in the .NET Framework and MSXML.
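To make the contradiction concrete, here is a rough sketch (in Python, since the precedence rules are what matter here, not the platform) of the layered rules the specifications impose: RFC 3023 gives the charset parameter in the HTTP Content-Type header priority over the document itself, while XML 1.0 falls back to the byte order mark and then the encoding pseudo-attribute in the XML declaration. This sketch is simplified; it even ignores RFC 3023's notorious us-ascii default for text/xml without a charset, and real parsers diverge from the precedence in practice, which is exactly the problem.

```python
import re

def guess_xml_encoding(content_type, body):
    """Simplified sketch of the layered encoding rules for XML over HTTP."""
    # RFC 3023: a charset parameter in the HTTP header, when present,
    # wins over anything inside the document itself.
    match = re.search(r'charset\s*=\s*"?([\w.-]+)"?', content_type or "", re.I)
    if match:
        return match.group(1).lower()
    # XML 1.0 Appendix F: a byte order mark identifies the encoding family.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith(b"\xff\xfe") or body.startswith(b"\xfe\xff"):
        return "utf-16"
    # Otherwise look for the encoding pseudo-attribute in the XML declaration.
    match = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                     body)
    if match:
        return match.group(1).decode("ascii").lower()
    # XML 1.0 default when nothing else applies.
    return "utf-8"
```

Note how the first rule silently overrides the second and third: a feed served with `charset=ISO-8859-1` in the header but `encoding="utf-8"` in its declaration is, per the specs, ISO-8859-1, which surprises nearly everyone.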

The newest example of XML specifications that should work together but instead become a case of fitting square pegs into round holes is Daniel Cazzulino's article, W3C XML Schema and XInclude: impossible to use together???, which points out:

The problem stems from the fact that XInclude (as per the spec) adds the xml:base attribute to included elements to signal their origin, and the same can potentially happen with xml:lang. Now, the W3C XML Schema spec says:

3.4.4 Complex Type Definition Validation Rules

Validation Rule: Element Locally Valid (Complex Type)

3 For each attribute information item in the element information item's [attributes] excepting those whose [namespace name] is identical to http://www.w3.org/2001/XMLSchema-instance and whose [local name] is one of type, nil, schemaLocation or noNamespaceSchemaLocation, the appropriate case among the following must be true:

And then goes on to detail that everything else needs to be declared explicitly in your schema, including xml:lang and xml:base, therefore :S:S:S.

So, either you modify all your schemas so that each and every element includes those attributes (either by inheriting from a base type or using an attribute group reference), or your validation is bound to fail if someone decides to include something. Note that even if you could modify all your schemas, sometimes it means you will also have to modify their semantics, as a simple-typed element you may have (with the type inheriting from xs:string for example) now has to become a complex type with simple content model only to accommodate the attributes. Ouch!!! And what's worse, if you're generating your API from the schema using tools such as xsd.exe or the much better XsdCodeGen custom tool, the new API will look very different, and you may have to make substantial changes to your application code.

This is an important issue that should be solved in .NET v2, or XInclude will be condemned to poor adoption in .NET. I don't know how other platforms will solve the W3C inconsistency, but I've logged this as a bug and I'm proposing that a property be added to the XmlReaderSettings class to specify that XML Core attributes should be ignored for validation, such as XmlReaderSettings.IgnoreXmlCoreAttributes = true. Note that there are a lot of Ignore* properties already, so it would be quite natural.
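To illustrate what such a flag would do, here's a minimal sketch in Python rather than .NET (the function name and approach are mine, not any shipping API): before handing the document to a validator, strip any xml:base, xml:lang or xml:space attributes that XInclude processing may have added, so the schema doesn't have to declare them.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def strip_xml_core_attributes(root):
    # Drop the XML core attributes an XInclude processor may have added,
    # mimicking what a hypothetical IgnoreXmlCoreAttributes flag would do
    # before the document reaches a schema validator.
    for elem in root.iter():
        for name in ("base", "lang", "space"):
            elem.attrib.pop(f"{{{XML_NS}}}{name}", None)
    return root

# An element that looks like the output of an XInclude processor.
doc = ET.fromstring(
    '<doc><item xml:base="http://example.org/included.xml" '
    'xml:lang="en">text</item></doc>'
)
strip_xml_core_attributes(doc)
```

After the pass, the included element validates against an unmodified schema because the offending attributes are simply gone; the trade-off is that their information (the origin and language of the included content) is lost to downstream consumers.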

I believe it is a significant bug in W3C XML Schema that it requires schema authors to declare up front in their schemas where xml:lang or xml:base may occur in their documents. Since I used to be the program manager for XML Schema technologies in the .NET Framework, this issue would have fallen on my plate. I spoke to Dave Remy, who took over my old responsibilities, and he's posted his opinion about the issue in his post XML Attributes and XML Schema. Based on the discussion in the comments to his post it seems the members of my old team are torn on whether to go with a flag or try to push an erratum through the W3C. My opinion is that they should do both. Pushing an erratum through the W3C is a time consuming process, and in the meantime using XInclude in combination with XML Schema is significantly crippled on the .NET Framework (or on any other platform that supports both technologies). Sometimes you have to do the right thing for customers instead of being ruled by the letter of standards organizations, especially when it is clear they have made a mistake.

Please vote for this bug on the MSDN Product Feedback Center


Categories: XML

So I just read an interesting post about Technorati Tags on Shelley Powers's blog entitled Cheap Eats at the Semantic Web Café. As I read Shelley's post I kept feeling a strong sense of déjà vu which I couldn't shake. If you were using the Web in the 1990s, then the following descriptions of Technorati Tags taken from their homepage should be quite familiar.

What's a tag?

Think of a tag as a simple category name. People can categorize their posts, photos, and links with any tag that makes sense.


The rest of a Technorati Tag page is made up of blog posts. And those come from you! Anyone with a blog can contribute to Technorati Tag pages. There are two ways to contribute:

  • If your blog software supports categories and RSS/Atom (like Movable Type, WordPress, TypePad, Blogware, Radio), just use the included category system and make sure you're publishing RSS/Atom and we'll automatically include your posts! Your categories will be read as tags.
  • If your blog software does not support those things, don't worry, you can still play. To add your post to a Technorati Tag page, all you have to do is "tag" your post by including a special link. Like so:
    <a href="[tagname]" rel="tag">[tagname]</a>
    The [tagname] can be anything, but it should be descriptive. Please only use tags that are relevant to the post. No need to include the brackets. Just make sure to include rel="tag".

    Also, you don't have to link to Technorati. You can link to any URL that ends in something tag-like. These tag links would also be included on our Tag pages:
    <a href="" rel="tag">iPod</a>
    <a href="" rel="tag">Gravity</a>
    <a href="" rel="tag">Chihuahua</a>
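Mechanically, a crawler implementing this convention has very little to go on. A sketch of the extraction logic (my guess at it, since Technorati doesn't publish theirs): find every rel="tag" link and treat the last path segment of the href as the tag name, per the "any URL that ends in something tag-like" rule.

```python
from html.parser import HTMLParser

class TagLinkParser(HTMLParser):
    """Collects tag names from rel="tag" links: the tag is taken to be
    the last path segment of the href."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        # Only anchors explicitly marked rel="tag" with a non-empty href count.
        if tag == "a" and d.get("rel") == "tag" and d.get("href"):
            self.tags.append(d["href"].rstrip("/").rsplit("/", 1)[-1])

parser = TagLinkParser()
parser.feed('<a href="http://technorati.com/tag/iPod" rel="tag">iPod</a>')
```

Note that nothing here validates the tag against the anchor text or the post content, which is precisely why the scheme inherits all the trust problems discussed below.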

If you weren't using the Web in the 1990s this may seem new and wonderful to you, but the fact is we've all seen this before. The so-called Technorati Tags are glorified HTML META tags with all their attendant problems. The reason all the arguments in Shelley's blog post seemed so familiar is that a number of them are the same ones Cory Doctorow made in his article Metacrap from so many years ago. All the problems with META tags are still valid today, the most important being the fact that people lie, especially spammers, and that even well intentioned people tend to categorize things incorrectly or according to their prejudices.

META tags simply couldn't scale to match the needs of the World Wide Web and are mostly ignored by search engines today. I wonder why people think that if you dress up an old idea with new buzzwords (*cough* folksonomies *cough* social tagging *cough*) that it somehow turns a bad old idea into a good new one?


Categories: Technology

I noticed an eWeek article this morning titled Microsoft Won't Bundle Desktop Search with Windows which has had me scratching my head all morning. The article contains the following excerpts:

Microsoft Corp. has no immediate plans to integrate desktop search into its operating system, a company executive said at a conference here this weekend.
Indeed, while including desktop search in Windows might seem like a logical step to many, "there's no immediate plan to do that as far as I know," Kroese said. "That would have to be a Bill G. [Microsoft chairman and chief software architect Bill Gates] and the lawyers' decision."

I thought Windows already had desktop search. In fact, Jon Udell of Infoworld recently provided a screencast in his blog post Where was desktop search when we needed it? which shows off the capabilities of the built-in Windows desktop search, which seems almost on par with the recent offerings from MSN, Google and the like.

Now I'm left wondering what the eWeek article means. Does it mean there aren't any plans to replace the annoying animated dog with the MSN butterfly? That Microsoft has access to a time machine and will go back and rip out desktop search from the operating system, including the annoying animated dog? Or is there some other obvious conclusion that can be drawn from the facts in the article that I have failed to grasp?

The technology press really disappoints me sometimes.


Categories: Technology

I had promised myself I wouldn't get involved in this debate but it seems every day I see more and more people giving credence to bad ideas and misconceptions. The debate I am talking about is one click subscriptions to RSS feeds which Dave Winer recently called the Yahoo! problem. Specifically Dave Winer wrote

Yahoo sends emails to bloggers with RSS feeds saying, hey if you put this icon on your weblog you'll get more subscribers. It's true you will. Then Feedster says the same thing, and Bloglines, etc etc. Hey I did it too, back when Radio was pretty much the only show in town, you can see the icon to the right, if you click on it, it tries to open a page on your machine so you can subscribe to it. I could probably take the icon down by now, most Radio users probably are subscribed to Scripting News, since it is one of the defaults. But it's there for old time sake, for now.

Anyway, all those logos, when will it end? I can't imagine that Microsoft is far behind, and then someday soon CNN is going to figure out that they can have their own branded aggregator for their own users (call me if you want my help, I have some ideas about this) and then MSNBC will follow, and Fox, etc. Sheez even Best Buy and Circuit City will probably have a "Click here to subscribe to this in our aggregator" button before too long.

That's the problem.

Currently I have four such one-click subscription buttons on my homepage: Add to MyMSN, Add to MyYahoo, Add to Bloglines, and Add to Newsgator. Personally, I don't see this as a problem since I expect market forces and common sense to come into play here. But let's see what Dave Winer proposes as a solution.

Ask the leading vendors, for example, Bloglines, Yahoo, FeedDemon, Google, Microsoft, and publishers, AP, CNN, Reuters, NY Times, Boing Boing, etc to contribute financially to the project, and to agree to participate once it's up and running.
Hire Bryan Bell to design a really cool icon that says "Click here to subscribe to this site" without any brand names. The icon is linked to a server that has a confirmation dialog, adds a link to the user's OPML file, which is then available to the aggregator he or she uses. No trick here, the technology is tried and true. We did it in 2003 with

This 'solution' to the 'problem' boggled my mind. So every time someone wants to subscribe to an RSS feed it should go through a centralized server? The privacy implications of this alone are significant, let alone the creation of a central point of failure. In fact, Dave Winer recently posted a link that highlights a problem with centralized services related to RSS.

Besides my issues with the 'solution', I have quite a few problems with the so-called 'Yahoo problem' as defined. The first problem is that it only affects web-based aggregators that don't have a client application installed on the user's desktop. Desktop aggregators already have a de facto one-click subscription mechanism via the feed URI scheme, which is supported by at least a dozen of the most popular aggregators across multiple platforms, including the next version of Apple's web browser Safari. Recently both Brent Simmons (in his post feed: protocol) and Nick Bradbury (in his post Really Simple Subscription) reiterated this fact. In fact, all four services whose buttons I have on my personal blog can utilize the feed URI scheme as a subscription mechanism since all four services have desktop applications they can piggyback this functionality onto. Yahoo and MSN have toolbars, Newsgator has their desktop aggregator and Bloglines has its notifier.
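The feed URI scheme itself is a good illustration of how lightweight this mechanism is. A hedged sketch of what a publisher's template might emit (the draft circulating at the time allowed two spellings, and I'm showing both):

```python
def to_feed_uri(url, nested=True):
    """Build a one-click subscription link for a desktop aggregator that
    has registered the feed URI scheme. Two forms were in circulation:
    'feed:' prefixed to the full http URL, or 'feed://' replacing the
    http scheme outright."""
    if url.startswith("feed:"):
        return url  # already a feed URI
    if nested:
        return "feed:" + url
    return "feed://" + url.split("://", 1)[1]
```

Clicking such a link hands the URL to whatever aggregator registered the scheme on the user's machine, which is why no central server or per-service button is needed.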

The second problem I have with this is that it aims to stifle an expression of competition in the marketplace. If the Yahoo! aggregator becomes popular enough that every website puts a button for it beside a link to their RSS feed, then they deserve kudos for spreading their reach, since it is unlikely this would happen without some value being provided by their service. I don't think Yahoo! should be attacked for their success, nor should their success be termed some sort of 'problem' for the RSS world to fix.

As I mentioned earlier in my post, I was going to ignore this re-re-reiteration of the one-click subscription debate until I saw rational folks like Tim Bray in his post One-Click Subscription actually discussing the centralized RSS subscription repository idea without giving it the skepticism it deserves. Of course, I also see some upturned eyebrows at this proposal from folks like Bill de hÓra in his post A registry for one click subscription, anyone?


Coincidentally, just as I finished reading a post by Tim Bray about Private Syndication, I got two bug reports filed almost simultaneously about RSS Bandit's support for secure RSS feeds. The first was SSL challenge for non-root certs, where the user complained that instead of prompting the user when there is a problem with an SSL certificate, like browsers do, we simply fail. One could argue that this is the right thing to do, especially when you have folks like Tim Bray suggesting that bank transactions and medical records should be flowing through RSS. However, given the precedent set by web browsers, we'll probably be changing our behavior. The second bug was that RSS Bandit doesn't support cookies. Many services use cookies to track authenticated users as well as provide individual views tailored to a user. Although there are a number of folks who tend to consider cookies a privacy issue, most popular websites use them and they are mostly harmless. I'll likely fix this bug in the next couple of weeks.

These bug reports, in combination with a couple more issues I've had to deal with while writing code to process RSS feeds in RSS Bandit, have given me the inspiration for my next Extreme XML column. I suspect there's a lot of mileage that can be obtained from an article that dives deep into the various issues one deals with while processing XML on the Web (DTDs, proxies, cookies, SSL, unexpected markup, etc.) which uses RSS as the concrete example. Any of the readers of my column care to comment on whether they'd like to see such an article and, if so, what they'd want to see covered?
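For the cookie bug specifically, the fix amounts to threading a cookie jar through the HTTP layer so that authentication cookies set by a feed server are replayed on later polls. RSS Bandit itself is .NET code; this Python sketch just shows the shape of the fix, with an invented user-agent string for illustration:

```python
import http.cookiejar
import urllib.request

def make_feed_opener():
    """Build a cookie-aware opener for polling feeds: cookies set by a
    server on one request are automatically sent back on the next, which
    is what cookie-protected or personalized feeds require."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "ExampleAggregator/1.0")]
    return jar, opener

jar, opener = make_feed_opener()
# opener.open("http://example.org/feed.xml") would now round-trip cookies.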


Categories: RSS Bandit | XML

Mike has a post entitled 5 Things I Dislike About MSN Spaces where he lists the top 5 issues he has with the service. 

Five things I dislike about MSN Spaces:

1. I can't use BlogJet to post to my blog today.  Not a huge deal, but I love this little tool.  Web browsers (Firefox included!) just don't do it for me in the way of text editing.

2. We don't have enough themes, which means there isn't enough differentiation between spaces.  People come in a lot of flavors.

3. We don't have comments on anything other than blog entries, which means a lot of people are using comments to either a) comment on photos, or b) comment on the space itself.

4. The Statistics page doesn't roll-up RSS aggregator "hits", only web page-views.  I want to know who is reading this post in various aggregators.

5. The recent blog entry on the MSN Messenger 7.0 beta contact card doesn't show enough information to compel me to visit the user's space.  

since I'm the eternal copycat I also decided to put together a top 5 list of issues I'd like to see fixed in the service. Some of them I am in the process of fixing and some I'm still nagging folks like Mike to fix since they are UI issues and have nothing to do with my areas of responsibility. Here's my list

Five things I'd like to see fixed in MSN Spaces:

1. I can't post to my blog using standard weblog tools such as the BlogJet, w.bloggar or Flickr. I now have 3 blogs and I'd love to author my posts in one place then have it automatically posted to all three of them. My MSN Spaces blog prevents this from happening today.

2. The text control for posting to your blog should support editing raw HTML and support a WYSIWYG editting. I find it stunning that open source projects like dasBlog & Community Server::Blogs have a more full featured edit control for posting to ones blog than a MSN Spaces.

3. The user experience around managing and interacting with my blogroll is subpar. I'd love to be able to upload my aggregator subscription list as an OPML file to my MSN Space blog roll. While we're at it, it would be cool to render MSN Spaces blogs differently in my blog roll and links in my posts perhaps with a pawn such as what LiveJournal does with Livejournal user pawn which links to a user's profile.

4. I want to be able to upload music lists from other play lists formats besides Windows media player. I have a bunch of playlists in WinAmp and iTunes which I want to upload to my MSN Space but don't have the patience to transcribe. It is cool that we allow uploading Windows Media Player playlists but what about iTunes and WinAmp?

5. I need more ways to find other MSN Spaces that I might be interested in. The recently updated spaces and newly created spaces widgets on the MSN Spaces homepage aren't cutting it for me. In additon, once that is fixed it would also be cool if it was made really simple to subscribe to interesting MSN Spaces in my RSS aggregator of choice using a one-click subscription mechanism.


Categories: MSN

I've been doing a bit of reading about folksonomies recently. The definition of folksonomy in Wikipedia currently reads

Folksonomy is a neologism for a practice of collaborative categorization using simple tags in a flat namespace. This feature has begun appearing in a variety of social software. At present, the best examples of online folksonomies are social bookmarking sites like, a bookmark sharing site, Flickr, for photo sharing, and 43 Things, for goal sharing.

What I've found interesting about current implementations of folksonomies is that they are   blogging with content other than straight text. is basically a linkblog and Flickr isn't much different from a photoblog/moblog. The innovation in these sites is in merging the category metadata from the different entries such that users can browse all the links or photos which match a specified keyword. For example, here are recent links added with the tag 'chicken' and recent photos added to Flickr with the tag 'chicken'. Both sites not only allow browsing all entries that match a particular tag but also go as far as alowing one to subscribe to particular tags as an RSS feed.

I've watched with growing bemusement as certain people have started to debate whether folksonomies will replace traditional categorization mechanisms. Posts such as The Innovator's Lemma  by Clay Shirky, Put the social back into social tagging by David Weinberger and it's the social network, stupid! by Liz Lawley go back and forth about this very issue. This discussion reminds me of the article Don't Let Architecture Astronauts Scare You by Joel Spolsky where he wrote

A recent example illustrates this. Your typical architecture astronaut will take a fact like "Napster is a peer-to-peer service for downloading music" and ignore everything but the architecture, thinking it's interesting because it's peer to peer, completely missing the point that it's interesting because you can type the name of a song and listen to it right away.

All they'll talk about is peer-to-peer this, that, and the other thing. Suddenly you have peer-to-peer conferences, peer-to-peer venture capital funds, and even peer-to-peer backlash with the imbecile business journalists dripping with glee as they copy each other's stories: "Peer To Peer: Dead!"

I think Clay is jumping several steps ahead to conclude that explicit classification schemes will give way to categorization by users. The one thing people are ignoring in this debate (as in all technical debates) is that the various implementations of folksonomies are popular because of the value they provide to the user. When all is said and done, is basically a linkblog and Flickr isn't much different from a photoblog/moblog. This provides inherent value to the user and as a side effect [from the users perspective] each new post becomes part of an ecosystem of posts on the same topic which can then be browsed by others. It isn't clear to me that this dynamic exists everywhere else explicit classification schemes are used today.

One thing that is clear to me is that personal publishing via RSS and the various forms of blogging have found a way to trample all the arguments against metadata in Cory Doctorow's Metacrap article from so many years ago. Once there is incentive for the metadata to be accurate and it is cheap to create there is no reason why some of the scenarios that were decried as utopian by Cory Doctorow in his article can't come to pass. So far only personal publishing has provided the value to end users to make both requirements (accurate & cheap to create) come true.

Postscript: Coincidentally I just noticed a post entitled Meet the new tag soup  by Phil Ringnalda pointing out that emphasizing end-user value is needed to woo people to create accurate metadata in the case of using semantic markup in HTML. So far most of the arguments I've seen for semantic markup [or even XHTML for that matter] have been academic. It would be interesting to see what actual value to end users is possible with semantic markup or whether it really has been pointless geekery as I've suspected all along. 


Categories: Technology

January 21, 2005
@ 01:44 AM

My article on Cω is finally published. It appeared on as Introducing Comega while it showed up as the next installment of my Extreme XML column on MSDN with the title An Overview of Cω: Integrating XML into Popular Programming Languages. It was also mentioned on Slashdot in the story Microsoft Research's C-Omega

I'll be following this with an overview of ECMAScript for XML (E4X) in a couple of months.


Categories: Technology

O'Reilly Emerging Technology Conference. It looks like I'm going to be attending the O'Reilly Emerging Technology Conference in San Diego. If you're there and want to find me I'll be with the rest of the exhibitors hanging out with MSR Social Computing Group talking about the social software applications that have been and are being built by MSN with input from the folks at Microsoft Research. I should also be at Tuesday's after party.


Categories: Technology

If you've been following the blogosphere you should know by now that the Google, Yahoo! and MSN search engines decided to start honoring the rel="nofollow" attribute on links to mean that the linked page shouldn't get any increased ranking from that link. This is intended to reduce the incentive for comment spammers who've been flooding weblogs with links to their websites in comment fields. There is another side effect of the existence of this element which is pointed out by Shelley Powers in her post The Other Shoe on Nofollow where she writes

I expected this reason to use nofollow would take a few weeks at least, but not the first day. Scoble is happy about the other reason for nofollow: being able to link to something in your writing and not give ‘google juice’ to the linked.

Now, he says, I can link to something I dislike and use the power of my link in order to punish the linked, but it won’t push them into a higher search result status.

Dave Winer started this, in a way. He would give sly hints about what people have said and done, leaving you knowing that an interesting conversation was going on elsewhere, but you’re only hearing one side of it. When you’d ask him for a link so you could see other viewpoints, he would reply that "…he didn’t want to give the other party Google juice." Now I imagine that *he’ll link with impunity–other than the fact that Technorati and Blogdex still follow the links. For now, of course. I imagine within a week, Technorati will stop counting links with nofollow implemented. Blogdex will soon follow, I’m sure.

Is this so bad? In a way, yes it is. It’s an abuse of the purpose of the tag, which was agreed on to discourage comment spammers. More than that, though, it’s an abuse of the the core nature of this environment, where our criticism of another party, such as a weblogger, came with the price of a link. Now, even that price is gone.

I don't see this is an abuse of the tag, I see it as fixing a bug in Google's PageRank algorithm. It's always seemed broken to me that Google assumes that any link to a source is meant to convey that the target is authoritative. Many times people link to websites they don't consider authoritative for the topic they are discussing. This notion of 'the price of a link' has been based on a design flaw in Google's PageRank algorithm. Social norms should direct social behavior not bugs in software.

A post entitled Nofollow Sucks on the Aimless Words blog has the following statement

Consider what the wholesale implementation of this new web standard means within the blogosphere. "nofollow" is English for NO FOLLOW and common sense dictates that when spider finds this tag it will not follow the subsequent link.

The author of the blog post later retracts his statements but it does bring up an interesting point. Robert Scoble highlights the fact that it didn't take a standards committee to come up with this just backchannel conversations that took a few hours. However as Tim Ewald recently wrote in his post "Make it easy for people to pay you"

The value of the standardization process is that it digs issues - architectural, security, reliability, scalability, etc. - and addresses them. It also makes the language more tighter and less vague

The Aimless Words weblog points out that it is unclear to anyone who isn't party to whatever conversations that went on between Google, MSN, Yahoo and others what exactly are the semantics of rel='nofollow' on a link. Given that it is highly unlikely that all three search engines even use the same ranking algorithms I'm not even sure what it means for them to say the link doesn't contribute to the ranking of the site. Will the penalty that Yahoo search applies to such sites be the same in Google and MSN search? Some sort of spec or spec text would be nice to see instead of 'trust us' which seems to be what is emanating from all the parties involved at the current time.

PS: I was wondering why I never saw the posts about this on the Google blog  in RSS Bandit and it turned out to be because the Google Blog atom feeds are malformed XML. Hopefully they'll fix this soon.


Categories: Technology