January 23, 2006
@ 10:42 PM

I don't usually spam folks with links to amusing video clips that are making the rounds in email inboxes, but the video of Aussie comedy group Tripod performing their song "Make You Happy Tonight" struck a chord with me because I did what the song talks about this weekend.

The game in question was Star Wars: Knights of the Old Republic II. :)


 

One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs while allowing them to map between formats using technologies like XSLT. However, XML technologies like XSLT focus on mapping one kind of syntax to another. Proponents of Semantic Web technologies such as RDF, OWL, and DAML+OIL take a different view: they argue that mapping between languages at the higher level of their semantics is the better approach.

In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines, I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactic and semantic mappings. A great example of this is the post entitled On the Quality of Metadata... by Stefano Mazzocchi, where he writes

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: everytime we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function of how much thinking you put into the data mappings or ontology crosswalks, and talking to Ben helped me understand why.

First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata.

Note how this is nothing but a syllogism, yet it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".

The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?".. deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently and they are happy campers.

The problem Stefano has pointed out is that just being able to say that two items are semantically identical (i.e. the 'artist' field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (e.g. altering artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.
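
To make this concrete, here is a minimal sketch (in Python, with made-up record structures) of the kind of syntactic normalization that still has to happen even after the 'artist' and 'band name' fields have been declared semantically equivalent:

    import re

    def normalize_artist(name: str) -> str:
        """Rewrite 'ArtistName, The' as 'The ArtistName' so records from
        both datasets end up using a single surface form."""
        match = re.match(r"^(.+),\s+(The)$", name.strip())
        if match:
            return f"{match.group(2)} {match.group(1)}"
        return name

    # Hypothetical records whose fields have already been mapped as "the same"
    dataset_a = [{"artist": "Beatles, The", "song": "Let It Be"}]
    dataset_b = [{"band name": "The Beatles", "song": "Hey Jude"}]

    merged = [{"artist": normalize_artist(r["artist"]), "song": r["song"]} for r in dataset_a]
    merged += [{"artist": normalize_artist(r["band name"]), "song": r["song"]} for r in dataset_b]

    print(merged)  # both songs now share the artist value "The Beatles"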

The example I tend to draw from my own experience is mapping between different XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between the two formats isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or <atom:author> owl:sameAs <author>. In both cases, an application that understands how to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax of the equivalent elements in the other (e.g. RFC 3339 dates as opposed to RFC 822 dates).
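
Declaring the elements equivalent doesn't convert the date syntax; a small helper along these lines (a sketch, not tied to any particular feed library) still has to run as part of the mapping:

    from datetime import datetime
    from email.utils import format_datetime

    def atom_published_to_rss_pubdate(value: str) -> str:
        """Convert an RFC 3339 timestamp (atom:published) into an
        RFC 822-style date string (RSS 2.0 pubDate)."""
        # Normalize a trailing 'Z' into an explicit UTC offset so that
        # datetime.fromisoformat accepts it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        return format_datetime(parsed)

    print(atom_published_to_rss_pubdate("2006-01-23T22:42:00Z"))
    # -> 'Mon, 23 Jan 2006 22:42:00 +0000'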

Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world. I've seen claims that using XML technologies to map between XML vocabularies requires on the order of N² transforms, whereas using Semantic Web technologies requires only 2N (Stefano mentions this in his post, as has Ken Macleod in his post XML vs. RDF :: N × M vs. N + M). The explicit assumption here is that these vocabularies have similar data models and semantics, which should be true since otherwise a mapping wouldn't be possible in the first place. The implicit assumption, however, is that the syntax of each vocabulary is practically identical (e.g. same naming conventions, same date formats, etc.), and this post provides a few examples where that is not the case.

What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...


 

Categories: Web Development | XML

January 21, 2006
@ 02:26 AM

There's been a bunch of speculation about the recent DOJ requests for logs from the major search engines. Ken Moss of the MSN Search team tells their side of the story in his post Privacy and MSN Search. He writes

There’s been quite a frenzy of speculation over the past 24 hours regarding the request by the government for some data in relation to a child online protection lawsuit. Obviously privacy and child protection are both super important topics – so I’m glad this discussion is happening.

Some facts have been reported, but mostly I’ve seen a ton of speculation reported as facts.   I wanted to use this blog post to clarify some facts and to share with you what we are thinking here at MSN Search.

Let me start with this core principle statement: privacy of our customers is non-negotiable and something worth fighting to protect.

Now, on to the specifics.  

Over the summer we were subpoenaed by the DOJ regarding a lawsuit. The subpoena requested that we produce data from our search service. We worked hard to scope the request to something that would be consistent with this principle. The applicable parties to the case received this data, and the parties agreed that the information specific to this case would remain confidential. Specifically, we produced a random sample of pages from our index and some aggregated query logs that listed queries and how often they occurred. Absolutely no personal data was involved.

With this data you:

        CAN see how frequently some query terms occurred.
        CANNOT look up an IP address and see what it queried.
        CANNOT look for users who queried for both “TERM A” and “TERM B”.

At MSN Search, we have strict guidelines in place to protect the privacy of our customers data, and I think you’ll agree that privacy was fully protected.  We tried to strike the right balance in a very sensitive matter.

I've been surprised at how much rampant speculation from blogs has been reported in mainstream media articles as facts without people getting information directly from the source.


 

Categories: Current Affairs

A user of RSS Bandit recently forwarded me a discussion on the atom-syntax mailing list which criticized some of our design decisions. In an email in the thread entitled Reader 'updated' semantics Tim Bray wrote

On Jan 10, 2006, at 9:07 AM, James M Snell wrote:

In RSS there is definite confusion on what constitutes an update. In
Atom it is very clear. If <updated> changes, the item has been updated.
No controversy at all.

Indeed. There's a word for behavior of RssBandit and Sage: WRONG. Atom provides a 100% unambiguous way to say "this is the same entry, but it's been changed and the publisher thinks the change is significant." Software that chooses to hide this fact from users is broken - arguably dangerous to those who depend on receiving timely and accurate information - and should not be used. -Tim

People who write technology specifications usually have good intentions, but unfortunately they often aren't implementers of the specs they create. This leads to disconnects between reality and what is actually in the spec.

The problem with updates to blog posts is straightforward. There are minor updates which don't warrant signalling to the user, such as fixing a typo (e.g. "12 of 13 miner survive mine collapse" changed to "12 of 13 miners survive mine collapse"), and there are updates which do, because they significantly change the story (e.g. "12 of 13 miners survive mine collapse" changed to "12 of 13 miners killed in mine collapse").

James Snell is right that it is ambiguous how to detect this in RSS but not in Atom due to the existence of the atom:updated element. The Atom spec states

The "atom:updated" element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value.

On paper it sounds like this solves the problem, and on paper it does. However, for this to work correctly, weblog software now needs to include an option such as 'Indicate that this change is significant' when users edit posts. Without such an option, the software cannot correctly support the atom:updated element. Since I haven't found any mainstream tools that support this functionality, I haven't bothered to implement a feature which is likely to annoy users more often than it is useful, since many people edit their blog posts in ways that don't warrant alerting the user.
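
For what it's worth, the aggregator side of this is simple. Here is a rough sketch of how a reader could flag a significant change by comparing atom:updated values; the in-memory store and function name are made up for illustration:

    from datetime import datetime

    # Hypothetical store of the last atom:updated value seen for each entry id.
    last_seen_updated = {}

    def is_significant_update(entry_id: str, updated: datetime) -> bool:
        """True if the publisher has bumped atom:updated since we last saw
        this entry, i.e. the publisher considers the change significant."""
        previous = last_seen_updated.get(entry_id)
        last_seen_updated[entry_id] = updated
        return previous is not None and updated > previous

The hard part isn't this check; it's that unless publishing tools expose a 'this change is significant' option, the atom:updated value the check relies on rarely means what the spec intends.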

However I do plan to add features for indicating when posts have changed in unambiguous scenarios such as when new comments are added to a blog post of interest to the user. The question I have for our users is how would you like this indicated in the RSS Bandit user interface?


 

Categories: RSS Bandit

January 20, 2006
@ 02:27 AM

Richard Searle has a blog post entitled The SOAP retrieval anti-pattern where he writes

I have seen systems that use SOAP based Web Services only to implement data retrievals.

The concept is to provide a standardized mechanism for external systems to retrieve data from some master system that controls the data of interest. This has value in that it enforces a decoupling from the master system's data model. It can also be easier to manage and control than the alternative of allowing the consuming systems to directly query the master system's database tables.
...
The selection of a SOAP interface over a RESTful interface is also questionable. The SOAP interface has a few (generally one) parameters and then returns a large object. Such an interface with a single parameter has a trivial representation as a GET. A multi-parameter call can also be trivially mapped if the parameters define a conceptual hierarchy (e.g. the ids of a company and one of its employees).

Such a GET interface avoids all the complexities of SOAP, WSDL, etc. AJAX and XForm clients can trivially and directly use the interface. A browser can use XSLT to provide a human readable representation.

Performance can easily be boosted by interposing a web cache. Such optimization would probably occur essentially automatically since any significant site would already have caching. Such caching can be further enhanced by using the HTTP header timestamps to compare against the updated timestamps in the master system tables.

I agree 100%. Web services that use SOAP solely for data retrieval are usually a sign that the designers of the service need to get a clue when it comes to building distributed applications for the Web.
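
To illustrate the point (the URL and ids below are invented), this is roughly all a consuming system needs in order to do a cache-friendly retrieval over plain HTTP, including revalidating against the master system's last-modified timestamp:

    import urllib.error
    import urllib.request

    # Hypothetical retrieval service exposed as a GET on a hierarchical URL
    # (company id, then employee id) instead of a single-parameter SOAP call.
    url = "http://example.com/companies/1428/employees/57"

    request = urllib.request.Request(url)
    # A web cache (or the client itself) revalidates with standard HTTP headers;
    # the server compares this against the updated timestamp in its tables.
    request.add_header("If-Modified-Since", "Sat, 14 Jan 2006 00:00:00 GMT")

    try:
        with urllib.request.urlopen(request) as response:
            print(response.status, response.read()[:200])
    except urllib.error.HTTPError as error:
        if error.code == 304:
            print("Not modified - serve the locally cached representation")
        else:
            raise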

PS: I realize that my employer has been guilty of this in the past. In fact, we've been known to do this at MSN as well although at least we also provided RESTful interfaces to the service in that instance. ;)


 

Categories: XML Web Services

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats instead of XML formats. The real-world test case I use when thinking about this choice is whether, instead of having an HTML web page for my blog plus an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. To me this seems like a gross hack, but I've seen lots of people comment on what a great idea it seems to them. Given that the idea hadn't met with universal disdain, I decided to explore further and look for technical arguments for and against both approaches.
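
For context, here is roughly what a consumer of such a page would have to do. The markup is invented and the class names are the hypothetical ones from the previous paragraph:

    from html.parser import HTMLParser

    class FeedItemTitleExtractor(HTMLParser):
        """Collect the text of elements whose class attribute marks them as
        entry titles in the hypothetical microformat discussed above."""

        def __init__(self):
            super().__init__()
            self.in_title = False
            self.titles = []

        def handle_starttag(self, tag, attrs):
            if ("class", "atom:title") in attrs:
                self.in_title = True

        def handle_endtag(self, tag):
            self.in_title = False

        def handle_data(self, data):
            if self.in_title and data.strip():
                self.titles.append(data.strip())

    page = """
    <div class="rss:item">
      <h3 class="atom:title">Make You Happy Tonight</h3>
      <p>Post body, navigation, blogroll and everything else goes here...</p>
    </div>
    """

    extractor = FeedItemTitleExtractor()
    extractor.feed(page)
    print(extractor.titles)  # ['Make You Happy Tonight']

Everything the aggregator cares about has to be scraped back out of the full page, which a dedicated feed would have handed it directly.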

I found quite a few discussions of how and why microformats came about in articles such as The Microformats Primer in Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However, until recently I hadn't seen many in-depth technical arguments for why they are better than XML formats.

In a comment in response to my Microformats vs. XML: Was the XML Vision Wrong?, Mark Pilgrim wrote

Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.

http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication

I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.

I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is wasted bandwidth since web pages tend to contain more stuff than feeds and change more often.

Norm Walsh raises a few other good points on the trade-offs involved in choosing microformats over XML in his post Supporting Microformats, where he writes

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain and the fallback is natural and obvious, new code can be written to use this “extra” information without any change being required to existing applications, they just ignore it.

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

  1. They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using <div> and <span> to markup the individual appointments, so I can't add hCalendar to them.

  2. I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.

  3. They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

    It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So the general case, where you want to say that the content of one of these special elements is “an <abbr> with class="dtstart" interleaved with an <abbr> with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.

    Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.

The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun. 

I'll continue to pay attention to this discussion but for now microformats will remain in the "gross hack" bucket for me.


 

Categories: XML

January 18, 2006
@ 12:03 PM

Once people find out that they can use tools like ecto, Blogjet or W.Bloggar to manage their blog on MSN Spaces via the MetaWeblog API, they often ask me why we don't have something equivalent to the Flickr API so they can do the same for the photos they have in their space. 

My question for folks out there is whether this is something you'd like to see. Do you want to be able to create, edit and delete photos and photo albums in your Space using desktop tools? If so, what kind of tools do you have in mind?

If you are a developer, what kind of API would you like to see? Should it use XML-RPC, SOAP or REST? Do you want a web service or a DLL?

Let me know what you think.


 

Categories: Windows Live | XML Web Services

A few weeks ago I wrote a blog post entitled Windows Live Fremont: A Social Marketplace about the upcoming social marketplace from Microsoft. Since then the project has been renamed Windows Live Expo and the product team is now blogging.

The team blog is located at http://spaces.msn.com/members/teamexpo and they've already posted an entry addressing their most frequently asked question, "So when is it launching then?".


 

Categories: Windows Live

It's been about three years since I first started working on RSS Bandit and it doesn't seem like I've run out of steam yet. With every release the application seems to become more popular, and last month we finally broke 100,000 downloads in a single month. The time has come for me to start thinking about what I'd like to see in the next family of releases and to solicit feedback from our users. The next release is codenamed Jubilee.

Below is a list of feature areas I'd like to see us work on over the next few months:

  1. Extensibility Framework to Enable Richer Plugins: We currently use the IBlogExtension plugin mechanism, which allows one to add new context menu items when right-clicking on an item in the list view. I've used this to implement features such as "Email This" and "Post to del.icio.us" which ship with the default install. Torsten implemented "Send to OneNote" using this mechanism as well.

    The next step is to enable richer plugins so people can add their own menu items and toolbar buttons, as well as processing steps for received feed items. Torsten used a prototype of this functionality to add ad-blocking features to RSS Bandit. I'd like to add weblog posting functionality using such a plugin model instead of making it a core part of the application, since many of our users may just want a reader and not a weblog editor as well.

  2. Comment Watching: For many blogs such as Slashdot and Mini-Microsoft, the comments are often more interesting than the actual blog post. In the next version we'd like to make it easier to be notified not only when new blog posts appear in a blog you are interested in, but also when new comments show up on a post you are following.

  3. Provide better support for podcasts and other rich media in feeds: With the growing popularity of podcasts, we plan to make it easier for users to discover and download rich media from their feeds. This doesn't just mean supporting downloading media files in the background but also supporting better ways of displaying rich media in our default views. Examples of what we have in mind can be taken from  the post Why should text have all the fun? in the Google Reader blog. We should have richer experiences for photo feeds, audio feeds and video feeds.

  4. Thumbs Up & Thumbs Down for Filtering and Suggesting New Feeds: A big problem with using a news aggregator is that it eventually leads to information overload. One tends to subscribe to feeds which produce lots of content, of which only a subset is of interest. At the other extreme, users often find it difficult to discover new content that matches their interests. Both of these problems can be solved by providing a mechanism which allows the user to rate feeds or entries with a thumbs up or thumbs down, similar to what systems such as TiVo use today. These ratings can be used to highlight items of interest from subscribed feeds or to suggest new feeds via a suggestion service such as AmphetaRate (a rough sketch of the idea appears after this list).

  5. Applying search filters to the list view: In certain cases a user may want to perform the equivalent of a search on the items currently being displayed in the list view without resorting to an explicit search. An example is showing all the unread items in the list view. RSS Bandit should provide a way to apply filters to the items currently being displayed in the list view either by providing certain predefined filters or providing the option to apply search folder queries as filters.
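
Here is that rough sketch of the thumbs up/thumbs down idea; the scoring scheme below is invented purely for illustration and isn't a committed design:

    from collections import defaultdict

    # Running score per feed: +1 for a thumbs up, -1 for a thumbs down.
    feed_scores = defaultdict(int)

    def rate(feed_url, thumbs_up):
        feed_scores[feed_url] += 1 if thumbs_up else -1

    def highlight(items):
        """Order unread items so entries from highly rated feeds float to the top."""
        return sorted(items, key=lambda item: feed_scores[item["feed"]], reverse=True)

    rate("http://example.com/blogA/rss", thumbs_up=True)
    rate("http://example.com/blogB/rss", thumbs_up=False)

    items = [{"feed": "http://example.com/blogB/rss", "title": "B post"},
             {"feed": "http://example.com/blogA/rss", "title": "A post"}]
    print([item["title"] for item in highlight(items)])  # ['A post', 'B post']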

These are just some of the ideas I've had. There are also the dozens of feature requests we've received from our users over the past couple of months which we'll use as fodder for ideas for the Jubilee release.


 

Categories: RSS Bandit

January 15, 2006
@ 08:08 PM

Dave Winer made the following insightful observation in a recent blog post

Jeremy Zawodny, who works at Yahoo, says that Google is Yahoo 2.0. Very clever, and there's a lot of truth to it, but watch out, that's not a very good place to be. That's how Microsoft came to dominate the PC software industry. By shipping (following the analogy) WordPerfect 2.0 (and WordStar, MacWrite and Multimate) and dBASE 2.0 (by acquiring FoxBase) and Lotus 2.0 (also known as Excel). It's better to produce your own 2.0s, as Microsoft's vanquished competitors would likely tell you.

Microsoft's corporate culture is very much about looking at an established market leader and then building a competing product which is (i) integrated with a family of Microsoft products and (ii) fixes some of the weaknesses in the competitor's offerings. The company even came up with the buzzword Integrated Innovation to describe some of these aspects of its corporate strategy.

Going further, one could argue that when Microsoft does try to push disruptive new ideas, the lack of a competitor to focus on leads to floundering by the product teams involved. Projects such as WinFS, NetDocs and even Hailstorm can be cited as examples that floundered due to the lack of a competitive focus.

New employees at Microsoft are sometimes frustrated by this aspect of Microsoft's culture. For some it's hard to acknowledge that working at Microsoft isn't about building cool, new stuff but about building cooler versions of products offered by our competitors which integrate well with other Microsoft products. This ethos not only brought us Microsoft Office, which Dave mentions in his post, but also newer examples including Xbox (a better PlayStation), C# (a better Java) and MSN Spaces (a better TypePad/Blogger/LiveJournal).

The main reason I'm writing this is so I don't have to keep explaining it to people; the next time it comes up, I can just give them a link to this blog post.


 

Categories: Life in the B0rg Cube