Every once in a while someone asks me about software companies to work for in the Seattle area that aren't Microsoft, Amazon or Google. This is the fourth in a series of weekly posts about startups in the Seattle area that I often mention to people when they ask me this question.

Zillow is a real-estate Web site that is slowly revolutionizing how people approach the home buying experience. The service caters to buyers, sellers, potential sellers and real estate professionals in the following ways:

  1. For buyers: You can research a house and find out its vital statistics (e.g. number of bedrooms, square footage, etc), its current estimated value and how much it sold for when it was last sold. In addition, you can scope out homes that were recently sold in the neighborhood and get a good visual representation of the housing market in a particular area.
  2. For sellers and agents: Homes for sale can be listed on the service.
  3. For potential sellers: You can post a Make Me Move™ price without having to actually list your home for sale.

I used Zillow as part of the home buying process when I got my place and I think the service is fantastic. They also have the right level of buzz given recent high-level poachings of Google employees and various profiles in the financial press.

The company was founded by Lloyd Frink and Richard Barton who are ex-Microsoft folks whose previous venture was Expedia, another Web site that revolutionized how people approached a common task.

Press: Fortune on Zillow

Number of Employees: 133

Location: Seattle, WA (Downtown)

Jobs: careers@zillow.hrmdirect.com, current open positions include a number of Software Development Engineer and Software Development Engineer in Test positions as well as Systems Engineer and Program Manager positions.


 

I had hoped to avoid talking about RESTful Web services for a couple of weeks but Yaron Goland's latest blog post APP and Dare, the sitting duck deserves a mention.  In his post, Yaron  talks concretely about some of the thinking that has gone on at Windows Live and other parts of Microsoft around building a RESTful protocol for accessing and manipulating data stores on the Web. He writes

I'll try to explain what's actually going on at Live. I know what's going on because my job for little over the last year has been to work with Live groups designing our platform strategy.
...
Most of the services in Live land follow a very similar design pattern, what I originally called S3C which stood for Structured data, with some kind of Schema (in the general sense, I don't mean XML Schema), with some kind of Search and usually manipulated with operations that look rather CRUD like. So it seemed fairly natural to figure out how to unify access to those services with a single protocol.
...
So with this in mind we first went to APP. It's the hottest thing around. Yahoo, Google, etc. everyone loves it. And as Dare pointed out in his last article Microsoft has adopted it and will continue to adopt it where it makes sense. There was only one problem - we couldn't make APP work in any sane way for our scenarios. In fact, after looking around for a bit, we couldn't find any protocol that really did what we needed. Because my boss hated the name S3C we renamed the spec Web3S and that's the name we published it under. The very first section of the spec explains our requirements. I also published a FAQ that explains the design rationale for Web3S. And sure enough, the very first question, 2.1, explains why we didn't use ATOM.
...
Why not just modify APP?
We considered this option but the changes needed to make APP work for our scenarios were so fundamental that it wasn't clear if the resulting protocol would still be APP. The core of ATOM is the feed/entry model. But that model is what causes us our problems. If we change the data model are we still dealing with the same protocol? I also have to admit that I was deathly afraid of the political implications of Microsoft messing around with APP. I suspect Mr. Bray's comments would be taken as a pleasant walk in the park compared to the kind of pummeling Microsoft would receive if it touched one hair on APP's head.

In his post, Yaron talks about two of the key limitations we saw with the Atom Publishing Protocol (i.e. lack of support for hierarchies and lack of support for granular updates to fields) and responds to the various suggestions about how one can work around these problems in APP. As he states in the conclusion of his post, we are very wary of suggestions to "embrace and extend" Web standards given the amount of negative press the company has gotten about that over the years. It seems better for the industry if we build a protocol that works for our needs and publish documentation about how it works so any interested party can interoperate with us than if we claim we support a Web standard when in truth it only "works with Microsoft" because it has been extended in incompatible ways.

Dealing with Hierarchy

Here's what Yaron had to say with regards to the discussion around APP's lack of explicit support for hierarchies:

The idea that you put a link in the ATOM feed to the actual object. This isn't a bad idea if the goal was to publish notices about data. E.g. if I wanted to have a feed that published information about changes to my address book then having a link to the actual address book data in the APP entries is fine and dandy. But if the goal is to directly manipulate the address book's contents then having to first download the feed, pull out the URLs for the entries and then retrieve each and every one of those URLs in separate requests in order to pull together all the address book data is unacceptable from both an implementation simplicity and performance perspective. We need a way where by someone can get all the content in the address book at once. Also, each of our contacts, for example, are actually quite large trees. So the problem recurses. We need a way to get all the data in one contact at a go without having to necessarily pull down the entire address book. At the next level we need a way to get all the phone numbers for a single contact without having to download the entire contact and so on.

Yaron is really calling out two issues here. The first is that if you have a data type that doesn't map well to a piece of authored content, then it is better represented as its own content type that is linked from an atom:entry than by trying to treat an atom:entry, with its required author, summary and title fields, as a good way to represent all types of data. The second issue is the lack of explicit support for hierarchies. This situation is an example of how something that seems entirely reasonable in one scenario can be problematic in another. If you are editing blog posts, it probably isn't that much of a burden to first retrieve an Atom feed of all your recent blog posts, locate the link to the one you want to edit, then retrieve it for editing. In addition, since a blog post is authored content, the most relevant information about the post can be summarized in the atom:entry. On the other hand, if you want to retrieve your list of IM buddies so you can view their online status, or get the people in your friends list to see their recent status updates, it isn't pretty to fetch a feed of your contacts and then have to retrieve each contact one by one after locating the links to their representations in the Atom feed. Secondly, you may just want to address part of the data, such as a user's status message or online status, instead of retrieving or updating an entire user object.

Below are specification excerpts showing how Web3S, the RESTful protocol described above, addresses these issues.

How Web3S Does It

The naming of elements and the level of hierarchy in an XML document accessible via Web3S can be arbitrarily complex as long as the document satisfies the structural constraints specified in The Web3S Resource Infoset. The constraints include no mixed content, and that multiple instances of an element with the same name occurring as children of a node must be distinguished by a Web3S:ID element (e.g. multiple entries under a feed are identified by ID). Thus the representation of a Facebook user returned by the users.getInfo method in the Facebook REST API should be a valid Web3S document [except that the concentration element would have to be changed from having string content to having two element children: a Web3S:ID that can be used to address each concentration directly and another containing the current textual content].
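To make those structural constraints concrete, below is a rough sketch of how one might check them over an XML tree using Python's ElementTree. This is my own illustration rather than anything from the spec, and the Web3S namespace URI is a placeholder:

import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real one is defined in the Web3S spec.
WEB3S_ID = "{urn:example:web3s}ID"

def check_web3s_constraints(element):
    """Recursively check two Web3S Resource Infoset rules:
    1. No mixed content: an element has child elements or text, not both.
    2. If two or more children share a tag name, each of them must carry
       a Web3S:ID child so it remains individually addressable."""
    children = list(element)
    if children and (element.text or "").strip():
        raise ValueError("mixed content under <%s>" % element.tag)

    counts = {}
    for child in children:
        counts[child.tag] = counts.get(child.tag, 0) + 1

    for child in children:
        if counts[child.tag] > 1 and child.find(WEB3S_ID) is None:
            raise ValueError("repeated element <%s> lacks a Web3S:ID" % child.tag)
        check_web3s_constraints(child)

doc = ET.fromstring("<a><b>1</b><b>2</b></a>")
check_web3s_constraints(doc)   # raises: repeated element <b> lacks a Web3S:ID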

The most important part of being able to properly represent hierarchies is that different levels of the hierarchy can be directly accessed. From the Web3S documentation section entitled Addressing Web3S Information Items in HTTP:

In order to enable maximum flexibility Element Information Items (EIIs) are directly exposed as HTTP resources. That is, each EII can be addressed as a HTTP resource and manipulated with the usual methods...


<articles>
 <article>
  <Web3S:ID>8383</Web3S:ID>
  <title>Manual of Surgery Volume First: General Surgery. Sixth Edition.</title>
  <authors>
   <author>
    <Web3S:ID>23455</Web3S:ID>
    <firstname>Alexander</firstname>
    <lastname>Miles</lastname>    
   </author>
   <author>
    <Web3S:ID>88828</Web3S:ID>
    <firstname>Alexis</firstname>
    <lastname>Thomson</lastname>    
   </author>
  </authors>
 </article>
</articles>

If the non-Web3S prefix path is http://example.net/stuff/morestuff then we could address the lastname EII in Alexander Miles’s entry as http://example.net/stuff/morestuff/net.examples.articles/net.example.article(8383)/net.example.authors/net.example.author(23455)/org.example.lastname.

 Although String Information Items (SIIs) are modeled as resources they currently do not have their own URLs and therefore are addressed only in the context of EIIs. E.g. the value of an SII would be set by setting the value of its parent EII.

XML heads may balk at requiring IDs to differentiate elements with the same name at the same scope or level of hierarchy instead of using positional indexes the way XPath does. The problem with positional indexes is that they assume XML document order is significant in the underlying data store, which quite often is not the case.
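Once every EII has a URL, manipulating any level of the hierarchy is just plain HTTP. Below is a sketch of what that might look like from a client using Python and the requests library. The URL is the one constructed in the excerpt above; the PUT body is my own guess at a plausible payload rather than anything mandated by the draft:

import requests

# The lastname EII in Alexander Miles's entry, addressed via Web3S:IDs.
url = ("http://example.net/stuff/morestuff/net.examples.articles"
       "/net.example.article(8383)/net.example.authors"
       "/net.example.author(23455)/org.example.lastname")

# Each EII is an HTTP resource, so the usual methods apply.
resp = requests.get(url)
print(resp.status_code, resp.text)

# SIIs have no URLs of their own, so changing the string value means
# PUTting a new representation of the parent EII.
requests.put(url,
             data="<lastname>Miles</lastname>",
             headers={"Content-Type": "application/Web3S+xml"})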

Supporting Granular Updates

Here's what Yaron had to say on the topic of supporting granular updates and the various suggestions that came up with regards to preventing the lost update problem in APP:

APP's approach to this problem is to have the client download all the content, change the stuff they understand and then upload all the content including stuff they don't understand.
...
On a practical level though the 'download then upload what you don't understand' approach is complicated. To make it work at all one has to use optimistic concurrency. For example, let's say I just want to change the first name of a contact and I want to use last update wins semantics. E.g. I don't want to use optimistic concurrency. But when I download the contact I get a first name and a last name. I don't care about the last name. I just want to change the first name. But since I don't have merge semantics I am forced to upload the entire record including both first name and last name. If someone changed the last name on the contact after I downloaded but before I uploaded I don't want to lose that change since I only want to change the first name. So I am forced to get an etag and then do an if-match and if the if-match fails then I have to download again and try again with a new etag. Besides creating race conditions I have to take on a whole bunch of extra complexity when all I wanted in the first place was just to do a 'last update wins' update of the first name.
...
A number of folks seem to agree that merge makes sense but they suggested that instead of using PUT we should use PATCH. Currently we use PUT with a specific content type (application/Web3S+xml). If you execute a PUT against a Web3S resources with that specific content-type then we will interpret the content using merge semantics. In other words by default PUT has replacement semantics unless you use our specific content-type on a Web3S resource. Should we use PATCH? I don't think so but I'm flexible on the topic.

This is one place where a number of APP experts such as Bill de hÓra and James Snell seem to agree that the current semantics in APP are insufficient. There also seems to be some consensus that it is too early to standardize a technology for partial updates of XML on the Web without lots more implementation experience. I also agree with that sentiment, so leaving it out of APP for now probably isn't a bad thing.

Currently I'm still torn on whether Web3S's use of PUT for submitting partial updates is kosher or whether it is more appropriate to invent a new HTTP method called PATCH. There was a thread about this on the rest-discuss mailing list, and for the most part it seems people felt that applying merge semantics on PUT requests for a specific media type is valid as long as the server understands that those are the semantics of that type.
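To make the tradeoff concrete, here's a sketch of the two client-side approaches in Python using the requests library. The element names and URL are made up, and the second function assumes a server that implements the application/Web3S+xml merge semantics described below:

import requests

def update_first_name_app_style(contact_url, new_first_name, edit_xml):
    """APP style: GET the whole entry, edit it locally, PUT the whole
    thing back with If-Match, and retry if we lost the race."""
    while True:
        resp = requests.get(contact_url)
        etag = resp.headers["ETag"]
        full_doc = edit_xml(resp.text, new_first_name)  # caller rewrites the full XML
        put = requests.put(contact_url, data=full_doc,
                           headers={"If-Match": etag,
                                    "Content-Type": "application/atom+xml"})
        if put.status_code != 412:   # 412 Precondition Failed: try again
            return put

def update_first_name_merge_style(contact_url, new_first_name):
    """Merge style: send only the field being changed and let the
    server fold it into the stored infoset, last update wins."""
    return requests.put(
        contact_url,
        data="<contact><firstname>%s</firstname></contact>" % new_first_name,
        headers={"Content-Type": "application/Web3S+xml"})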

How Web3S Does It

From the Web3S documentation section entitled Application/Web3S+xml with Merge Semantics:

On its own the Application/Web3S+xml content type is used to represent a Web3S infoset. But the semantics of that infoset can change depending on what method it is used with.

In the case of PUT the semantics of the Application/Web3S+xml request body are “merge the infoset information in the Application/Web3S+xml request with the infoset of the EII identified in the request-URI.” This section defines how Application/Web3S+xml is to be handled specifically in the case of PUT or any other context in which the Web3S infoset in the Application/Web3S+xml serialization is to be merged with some existing Web3S infoset.

For example, imagine that the source contains:

 <whatever>
  <Web3S:ID>234</Web3S:ID>
  <yo>
   <Web3S:ID>efghi</Web3S:ID>
   <avalue />
   <somethingElse>YO!!!</somethingElse>
  </yo>
 </whatever>
Now imagine that the destination, before the merge, contains:

 <whatever>
  <nobodyhome />
 </whatever> 
In this example the only successful outcome of the merge would have to be:

 <whatever>
  <Web3S:ID>234</Web3S:ID>
  <yo>
   <Web3S:ID>efghi</Web3S:ID>
   <avalue />
   <somethingElse>YO!!!</somethingElse>
  </yo>
  <nobodyhome />
 </whatever>
In other words, not only would all of the source’s contents have to be copied over but the full names (E.g. EII names and IDs) must also be copied over exactly.
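To make the merge rule concrete, here's a toy implementation of the semantics described above using Python's ElementTree. This is my own sketch, not code from the spec; it ignores attributes and plenty of edge cases, and the Web3S namespace URI is a placeholder:

import xml.etree.ElementTree as ET

WEB3S_ID = "{urn:example:web3s}ID"  # placeholder namespace URI

def get_id(element):
    id_elem = element.find(WEB3S_ID)
    return id_elem.text if id_elem is not None else None

def merge(source, destination):
    """Fold the source infoset into the destination: a source child that
    matches a destination child by tag name and Web3S:ID is merged into
    it recursively, while unmatched source children are copied over
    alongside whatever the destination already has."""
    for src_child in list(source):
        match = None
        for dst_child in destination.findall(src_child.tag):
            if get_id(dst_child) == get_id(src_child):
                match = dst_child
                break
        if match is None:
            destination.append(src_child)   # nothing to merge with: copy over
        elif list(src_child):
            merge(src_child, match)         # recurse into matching branches
        else:
            match.text = src_child.text     # leaf value: overwrite

source = ET.fromstring(
    """<whatever xmlns:Web3S="urn:example:web3s">
         <Web3S:ID>234</Web3S:ID>
         <yo>
           <Web3S:ID>efghi</Web3S:ID>
           <avalue />
           <somethingElse>YO!!!</somethingElse>
         </yo>
       </whatever>""")
destination = ET.fromstring("<whatever><nobodyhome /></whatever>")
merge(source, destination)
# destination now holds <nobodyhome /> plus everything from the source,
# matching the outcome shown in the excerpt (modulo element order).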

This is an early draft of the spec, so there are a lot of rules that aren't explicitly spelled out, but by now you should get the gist of how Web3S works. If you have any questions, direct them to Yaron, not to me. I'm just an interested observer when it comes to Web3S. Yaron is the person to talk to if you want to make things happen. :)

In a couple of days I'll take a look at how Project Astoria deals with the same problems in a curiously similar fashion. Until then you can make do with Leonard Richardson's excellent review of Project Astoria. Until next time.


 

Categories: Windows Live | XML Web Services

We are about ready to ship the second release of RSS Bandit this year. This release will be codenamed ShadowCat. This version will primarily be about bug fixes with one or two minor convenience features thrown in. If you are an RSS Bandit user who's had problems with crashes or other instability in the application, please grab the installer at RssBandit.1.5.0.14.ShadowCat.Beta.zip and let us know if this improves your user experience. 

New Features

  • Newspaper view can be configured to show unread items in a feed as multiple pages instead of a single page of items

Major Bug Fixes

  • Javascript errors on Web pages result in error dialogs being displayed or the application hanging.
  • Search indexing thread takes 100% of CPU and freezes the computer.
  • Crashes related to Lucene search indexing (e.g. IO exceptions, access violations, file in use errors, etc)
  • Crash, instead of an error message, when a feed has an IRI (i.e. a URL with Unicode characters such as http://www.bücher-forum.de/forum/rdf.php), which is not supported.
  • Context menus no longer work on category nodes after OPML import or remote sync
  • Crash on deleting a feed which still has enclosures being downloaded
  • Podcasts downloaded from the http://poweruser.tv/ feed are named "..mp3" instead of the actual file name.
  • Items marked as read in a search folder not marked as read in original feed
  • No news shown when subscribed to newsgroups.borland.com
  • Can't subscribe to feeds on your local hard drive (regression since it worked in previous versions)

After the final release of ShadowCat, the following release of RSS Bandit (codenamed Phoenix) will contain our next major user interface revamp and will likely be the version where we move to version 2.0 of the .NET Framework. You can find some of our early screen shots on Flickr.


 

Categories: RSS Bandit

  1. Bigger disappointment: Clerks 2 or Idiocracy?
  2. Why is Carrot Top so buff?
  3. Best way to introduce my fiancée to the greatness of giant shape shifting robots: Transformers: The Movie (1986) or Transformers: The Movie (2007)?
  4. Was this article meant to be a joke?
  5. Best way to complete this sentence without the help of a search engine: Pinky, are you pondering what I'm pondering? I think so, Brain but...?

 

Categories: Ramblings

Omar Shahine has a blog post entitled Hotmail + Outlook = Sweet where he writes

At long last... experience Hotmail inside of Outlook.

What used to be a subscription only offering is now available to anyone that wants it. While Outlook used to have the ability to talk to Hotmail via DAV it was flaky and 2 years ago we no longer offered it to new users of the service.

Well the new Outlook Connector has a few notable features that you didn't get with the old DAV support:

  1. uses DeltaSync, a Microsoft developed HTTP based protocol that sync's data based on change sequence numbers. This means that the server is stateful about the client. Whenever the client connects to Hotmail, the server tells the clients of any changes that happened since the last time the client connected. This is super efficient and allows us to offer the service to many users at substantially lower overhead than stateless protocols. This is the same protocol utilized by Windows Live Mail. It's similar in nature to exchange Cached Mode or AirSync, the mobile sync stack used by Windows Mobile Devices.
  2. Sync of Address Book. Your Messenger/Hotmail contacts get stored in Outlook.
  3. Sync of Calendar (currently premium only)
  4. Sync of allow/block lists for safety/spam

I've been using the Microsoft Office Outlook Connector for a few years now and have always preferred it to the Web interface for accessing my email on Hotmail. It's great that this functionality is now free for anyone who owns a copy of Microsoft Outlook instead of being a subscription service.
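DeltaSync's wire format isn't public, so I can't show the real thing, but the core idea of syncing off change sequence numbers is easy to sketch. Everything below is illustrative; the class and method names are made up:

class MailboxServer:
    """Toy model of a stateful sync endpoint: the server logs every
    mutation under a monotonically increasing change number, and a
    client presents the last number it saw to get just the delta."""
    def __init__(self):
        self.change_number = 0
        self.log = []   # (change_number, operation) pairs

    def record(self, operation):
        self.change_number += 1
        self.log.append((self.change_number, operation))

    def changes_since(self, client_change_number):
        delta = [op for n, op in self.log if n > client_change_number]
        return self.change_number, delta

server = MailboxServer()
server.record(("add", "msg-1"))
server.record(("delete", "msg-0"))

# A client that last synced at change 0 gets both operations in one
# round trip instead of re-downloading the whole mailbox.
latest, delta = server.changes_since(0)
print(latest, delta)   # 2 [('add', 'msg-1'), ('delete', 'msg-0')]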

PS: Omar mentioning Hotmail and Microsoft Outlook's use of WebDAV reminds me that there have been other times in recent memory when RESTful Web protocols swept Microsoft. Without reading old MSDN articles like Communicating XML Data over the Web with WebDAV, which covers WebDAV support in Microsoft Office, Microsoft Exchange and Internet Information Services (IIS), it's easy to forget that Microsoft almost ended up standardizing on WebDAV as the primary protocol for reading and writing Microsoft data sources on the Web. Of course, then SOAP and WS-* happened. :)


 

Categories: Windows Live | XML Web Services

If you don't read Stevey Yegge's blog, you should. You can consider him to be the new school version of Joel Spolsky, especially now that most of Joel's writing is about what's going on at Fog Creek Software and random rants about applications he's using. However, you should be warned that Stevey writes long posts full of metaphors which often border on allegory.

Consider his most recent post That Old MarshMallow Maze Spell, which is an interesting read but full of obfuscation. I actually haven't finished it since it is rather longer than what I tend to devote to a single blog post. I've been trying to track down summaries of the post, and the best I've gotten so far are some comments about it on reddit which seem to imply that the allegory is about being burned out by a death march project at his current employer.

I'm as down with schadenfreude as the next guy, but a death march project seems wildly contradictory to Stevey's previous post Good Agile, Bad Agile where he wrote:

The basic idea behind project management is that you drive a project to completion. It's an overt process, a shepherding: by dint of leadership, and organization, and sheer force of will, you cause something to happen that wouldn't otherwise have happened on its own.
Project management comes in many flavors, from lightweight to heavyweight, but all flavors share the property that they are external forces acting on an organization.

At Google, projects launch because it's the least-energy state for the system.
...
Anyway, I claimed that launching projects is the natural state that Google's internal ecosystem tends towards, and it's because they pump so much energy into pointing people in that direction. All your needs are taken care of so that you can focus, and as I've described, there are lots of incentives for focusing on things that Google likes.

So launches become an emergent property of the system.

This eliminates the need for a bunch of standard project management ideas and methods: all the ones concerned with dealing with slackers, calling bluffs on estimates, forcing people to come to consensus on shared design issues, and so on. You don't need "war team meetings," and you don't need status reports. You don't need them because people are already incented to do the right things and to work together well.

So, did anyone else get anything else out of Stevey's post besides "even at Google we have death marches that suck the soul out of you"? After all, I just kinda assumed that came with the territory.


 

One of the accusations made by Tim Bray in his post I’ve Seen This Movie is that my recent  posts about the Atom Publishing Protocol are part of some sinister plot by Microsoft to not support it in our blogging clients. That's really ironic considering that Microsoft is probably the only company that has shipped two blogging clients that support APP.

Don't take my word for it. In his blog post entitled Microsoft is not sabotaging APP (probably) Joe Cheng of the Windows Live Writer team writes

  1. Microsoft has already shipped a general purpose APP client (Word 2007) and GData implementation (Windows Live Writer). These are the two main blogging tools that Microsoft has to offer, and while I can’t speak for the Word team, the Writer team is serious about supporting Atom going forward.
  2. These two clients also already post to most blogs, not just Spaces. In particular, Writer works hard to integrate seamlessly with even clearly buggy blog services. I don’t know anyone who works as hard as we do at this.
  3. ...
  4. Spaces may not support APP, but it does support MetaWeblog which Microsoft has a lot less influence over than APP (since MW is controlled by Dave Winer, not by an official standards body). Consider that many of its main competitors, including MySpace, Vox, and almost all overseas social networking sites, have poor or nonexistent support for any APIs.

The reasoning behind Windows Live Spaces supporting the MetaWeblog API and not the Atom Publishing Protocol is detailed in my blog posts What Blog Posting APIs are supported by MSN Spaces? and Update on Blog Posting APIs and MSN Spaces, which I made over two years ago when we were going through the decision process for what the API story should be for Windows Live Spaces. For those who don't have time to read both posts, it basically came down to choosing a mature de facto standard (i.e. the MetaWeblog API) instead of (i) creating a proprietary protocol which better fit our needs or (ii) taking a bet on the Atom Publishing Protocol spec, which was a moving target in 2004 and is still a moving target today in 2007.

I hope this clears up any questions about Microsoft and APP. I'm pretty much done talking about this particular topic for the next couple of weeks.

PS: You can download Windows Live Writer from here and you can buy Office 2007 wherever great software is sold.


 

Categories: Windows Live | XML Web Services

I recently posted a blog post entitled Why GData/APP Fails as a General Purpose Editing Protocol for the Web which pointed out some limitations in the Atom Publishing Protocol (APP) and Google's implementation of it in GData with regards to being a general purpose protocol for updating data stores on the Web. There were a lot of good responses to my post from developers knowledgeable about APP, including the authors of the specification, Bill de hÓra and Joe Gregorio. Below are links to some of these responses:

Joe Gregorio: In which we narrowly save Dare from inventing his own publishing protocol
Bill de hÓra: APP on the Web has failed: miserably, utterly, and completely
David Megginson: REST, the Lost Update Problem, and the Sneakernet Test
Bill de hÓra: Social networks, web publishing and strategy tax
James Snell: Silly

There was also a post by Tim Bray entitled So Lame which questions my motives for writing the post and implies that it is some sinister plot by Microsoft to make sure that we use proprietary technologies to lock users in. I guess I should have given more background in my previous post. The fact is that lots of people have embraced building RESTful Web services in a big way. My primary concern now is that we don't end up seeing umpteen different RESTful protocols from Microsoft [thus confusing our users and ourselves] and instead standardize on one or two. For example, right now we already have Atom+SSE, Web3S and Project Astoria as three completely different RESTful approaches for updating or retrieving data from a Microsoft data source on the Web. In my mind, that's two too many, and that's just the stuff we've made public so there could be more. I'd personally like to see us reduce the set of RESTful protocols coming out of Microsoft to one and, even better, end up reusing existing Web standards if possible. Of course, this is an aspiration and it is quite possible that all of these protocols are different for a reason (e.g. we have FTP, SMTP and HTTP, which can all be used to transfer files but have very different use cases) and there is no hope for unification, let alone picking some existing standard.

My previous post was intended to point out the limitations I and others had noticed with using the Atom Publishing Protocol (APP) as a general protocol for updating data stores that don't primarily consist of authored content. The point of the post was to share these learnings with other developers working in this space and get feedback from the general developer community just in case there was something wrong with my conclusions.

Anyway, back to the title of this post. In my previous post I pointed out the following limitations of APP as a general purpose protocol for editing Web content:

  1. Mismatch with data models that aren't microcontent
  2. Lack of support for granular updates to fields of an item
  3. Poor support for hierarchy

I have to admit that a lot of my analysis was done on GData because I assumed incorrectly that it is a superset of the Atom Publishing Protocol. After a closer reading of the fifteenth draft of the APP specification, spurred by the responses to my post from various members of the Atom community, it seems clear that the approaches chosen by Google in GData run counter to the recommendations of Atom experts, including both authors of the spec.

For problem #1, the consensus from Atom experts was that instead of trying to map a distinct concept such as a Facebook user to an Atom entry complete with a long list of proprietary extensions to the atom:entry element, one should instead create a specific data format for that type and then treat it as a distinct media type that is linked from an atom:entry. Thus in the Facebook example from my previous post, one would have a distinct user.xml file and a corresponding atom:entry which links to it for each user of the system. Contrast this with the use of gd:ContactSection in an atom:entry for representing a user. It also seems that the GData solution to the problem of what to put in elements such as atom:author and atom:summary, which are required by the specification but make no sense outside of content/microcontent editing scenarios, is to omit them. It isn't spec compliant but I guess it is easier than putting in nonsensical values to satisfy some notion of a valid feed.

For problem #2, a number of folks pointed out that conditional PUT requests using ETags and the If-Match header are actually in the spec. This was my oversight since I skipped that section; its title, "Caching and Entity Tags", didn't imply that it had anything to do with the lost update problem. I actually haven't found a production implementation of APP that supports conditional PUTs, but this shouldn't be hard to implement for services that require this functionality. This definitely makes the lost update problem more tractable. However, a model where a client can just say "update the user's status message to X" still seems more straightforward than one where the client says "get the entire user element", "update the user's status message to X on the client", "replace the user on the server with my version of the user", and potentially "there is a version mismatch so merge my version of the user with the most recent version of the user from the server and try again". The mechanism GData uses for solving the lost update problem is described in the documentation topic on Optimistic concurrency (versioning). Instead of using ETags and If-Match, GData appends a version number to the URL to which the client publishes the updated atom:entry and then cries foul if the client publishes to a URL with an old version number. I guess you could consider this a different implementation of conditional PUTs from what is recommended in the most recent version of the APP draft spec.
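For the curious, here's a sketch of the GData flow from the client's point of view, based on my reading of that documentation topic; the URL shape and the 409 response are from the docs, while the helper itself is made up:

import requests

def gdata_update(entry_edit_url, updated_entry_xml):
    """The edit URL carries a version number (e.g. .../entry-id/version).
    A PUT against a stale version fails with 409 Conflict, and the client
    must re-fetch the entry, reapply its change and PUT to the new edit
    URL, which contains the new version number."""
    resp = requests.put(entry_edit_url, data=updated_entry_xml,
                        headers={"Content-Type": "application/atom+xml"})
    if resp.status_code == 409:
        raise RuntimeError("stale version; re-fetch the entry and retry")
    return resp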

For problem #3, the consensus seemed to be to use atom:link elements to express hierarchy, similar to what has been done in the Atom threading extensions. I don't question the value of linking and think this is a fine approach for the most part. However, the fact is that in certain scenarios [especially high traffic ones] it is better for the client to be able to make requests like "give me the email with message ID 6789 and all the replies in that thread" than "give me all the emails and I'll figure out the hierarchy I'm interested in myself by piecing together link relationships". I notice that GData completely punts on representing hierarchy in the MessageKind construct, which is intended for use in representing email messages.
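As a quick illustration of the client-side legwork that link-based hierarchy imposes, here's a sketch that reassembles a message thread by chasing rel="replies" links from the Atom threading extensions. The URLs are made up and error handling is omitted; the point is the extra round trip per entry:

import requests
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_thread(entry_url):
    """Fetch one entry, then issue a separate GET for its replies feed
    instead of getting the whole subtree in a single request."""
    entry = ET.fromstring(requests.get(entry_url).text)
    thread = [entry]
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "replies":
            replies = ET.fromstring(requests.get(link.get("href")).text)
            for reply in replies.findall(ATOM + "entry"):
                thread.append(reply)
                # Each reply may carry its own rel="replies" link, so a
                # deep thread costs one extra GET per level of the tree.
    return thread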

Anyway, I've learned my lesson and will treat the Atom Publishing Protocol (APP) and GData as separate protocols instead of using them interchangeably in the future.


 

Categories: XML Web Services

Recently I wrote a blog post entitled Google GData: A Uniform Web API for All Google Services where I pointed out that Google has standardized on GData (i.e. Google's implementation of the Atom 1.0 syndication format and the Atom Publishing Protocol with some extensions) as the data access protocol for Google's services going forward. In a comment to that post, Gregor Rothfuss wondered whether I couldn't influence people at Microsoft to also standardize on GData. The fact is that I've actually tried to do this with different teams on multiple occasions, and each time I've tried, certain limitations in the Atom Publishing Protocol become quite obvious once you get outside of the blog editing scenarios for which the protocol was originally designed. For this reason, we will likely standardize on a different RESTful protocol which I'll discuss in a later post. However, I thought it would be useful to describe the limitations we saw in the Atom Publishing Protocol which made it unsuitable as the data access protocol for a large class of online services.

Overview of the Atom Data Model

The Atom data model consists of collections, entry resources and media resources. Entry resources and media resources are member resources of a collection. There is a handy drawing in section 4.2 of the latest APP draft specification that shows the hierarchy in this data model, which is reproduced below.

                    
           Member Resources
                  |
          -----------------
          |               |
   Entry Resources   Media Resources
          |
   Media Link Entry

A media resource can have representations in any media type. An entry resource corresponds to an atom:entry element, which means it must have an id, a title, an updated date, one or more authors and textual content. Below is a minimal atom:entry element taken from the Atom 1.0 specification:


<entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
</entry>

The process of creating and editing resources is covered in section 9 of the current APP draft specification. To add members to a Collection, clients send POST requests to the URI of the Collection. To delete a Member Resource, clients send a DELETE request to its Member URI. To edit a Member Resource, clients send PUT requests to its Member URI.
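In code, the entire lifecycle amounts to the uniform HTTP interface. Here's a small sketch of an APP client interaction using Python and the requests library; the collection URL is a stand-in:

import requests

COLLECTION_URI = "http://example.org/blog/entries"   # stand-in collection

ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Atom-Powered Robots Run Amok</title>
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <updated>2003-12-13T18:30:02Z</updated>
  <summary>Some text.</summary>
</entry>"""

# Create: POST to the Collection URI; a 201 response carries the new
# Member URI in its Location header.
created = requests.post(COLLECTION_URI, data=ENTRY,
                        headers={"Content-Type": "application/atom+xml"})
member_uri = created.headers["Location"]

# Edit: PUT a complete replacement representation to the Member URI.
requests.put(member_uri, data=ENTRY.replace("Amok", "Amok Again"),
             headers={"Content-Type": "application/atom+xml"})

# Delete: DELETE the Member URI.
requests.delete(member_uri)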

Since using PUT to edit a resource is obviously problematic, the specification notes two concerns that developers have to pay attention to when updating resources.

To avoid unintentional loss of data when editing Member Entries or Media Link Entries, Atom Protocol clients SHOULD preserve all metadata that has not been intentionally modified, including unknown foreign markup.
...
Implementers are advised to pay attention to cache controls, and to make use of the mechanisms available in HTTP when editing Resources, in particular entity-tags as outlined in [NOTE-detect-lost-update]. Clients are not assured to receive the most recent representations of Collection Members using GET if the server is authorizing intermediaries to cache them.

The [NOTE-detect-lost-update] reference points to Editing the Web: Detecting the Lost Update Problem Using Unreserved Checkout, which not only talks about ETags but also discusses conflict resolution strategies when faced with multiple edits to a Web document. This information is quite relevant to anyone considering implementing the Atom Publishing Protocol or a similar data manipulation protocol.

With this foundation, we can now talk about the various problems one faces when trying to use the Atom Publishing Protocol with certain types of Web data stores.

Limitations Caused by the Constraints within the Atom Data Model

The following is a list of problems one faces when trying to utilize the Atom Publishing Protocol in areas outside of content publishing for which it was originally designed.

  1. Mismatch with data models that aren't microcontent: The Atom data model fits very well for representing authored content or microcontent on the Web such as blog posts, lists of links, podcasts, online photo albums and calendar events. In each of these cases, the requirement that each Atom entry has an id, a title, an updated date, one or more authors and textual content can be met and actually makes a lot of sense. On the other hand, there are other kinds of online data that don't really fit this model.

    Below is an example of the results one could get from invoking the users.getInfo method in the Facebook REST API.

    
    <user>
        <uid>8055</uid>
        <about_me>This field perpetuates the glorification of the ego.  Also, it has a character limit.</about_me>
        <activities>Here: facebook, etc. There: Glee Club, a capella, teaching.</activities>
        <birthday>November 3</birthday>
        <books>The Brothers K, GEB, Ken Wilber, Zen and the Art, Fitzgerald, The Emporer's New Mind, The Wonderful Story of Henry Sugar</books>
        <current_location>
            <city>Palo Alto</city>
            <state>CA</state>
            <country>United States</country>
            <zip>94303</zip>
        </current_location>
        <first_name>Dave</first_name>
        <interests>coffee, computers, the funny, architecture, code breaking,snowboarding, philosophy, soccer, talking to strangers</interests>
        <last_name>Fetterman</last_name>
        <movies>Tommy Boy, Billy Madison, Fight Club, Dirty Work, Meet the Parents, My Blue Heaven, Office Space </movies>
        <music>New Found Glory, Daft Punk, Weezer, The Crystal Method, Rage, the KLF, Green Day, Live, Coldplay, Panic at the Disco, Family Force 5</music>
        <name>Dave Fetterman</name>
        <profile_update_time>1170414620</profile_update_time>
        <relationship_status>In a Relationship</relationship_status>
        <religion/>
        <sex>male</sex>
        <significant_other_id xsi:nil="true"/>
        <status>
            <message>Pirates of the Carribean was an awful movie!!!</message>
        </status>
    </user>

    How exactly would one map this to an Atom entry? Most of the elements that constitute an Atom entry don't make much sense when representing a Facebook user. Secondly, one would have to create a large number of proprietary extension elements to annotate the atom:entry element to hold all the Facebook-specific fields for the user. It's like trying to fit a square peg in a round hole. If you force it hard enough, you can make it fit but it will look damned ugly.

    Even after doing that, it is extremely unlikely that an unmodified Atom feed reader or editing client would be able to do anything useful with this Frankenstein atom:entry element. And if you are going to roll your own libraries and clients to deal with this Frankenstein element, then it begs the question of what benefit you are getting from misusing a standardized protocol in this manner.

    I guess we could keep the existing XML format used by the Facebook REST API and treat the user documents as media resources. But in that case, we aren't really using the Atom Publishing Protocol; instead we've reinvented WebDAV. Poorly.

  2. Lack of support for granular updates to fields of an item: As mentioned in the previous section, editing an entry requires replacing the old entry with a new one. The expected client interaction with the server is described in section 5.4 of the current APP draft and is excerpted below.

    Retrieving a Resource

    Client                                     Server
      |                                           |
      |  1.) GET to Member URI                    |
      |------------------------------------------>|
      |                                           |
      |  2.) 200 Ok                               |
      |      Member Representation                |
      |<------------------------------------------|
      |                                           |

    1. The client sends a GET request to the URI of a Member Resource to retrieve its representation.
    2. The server responds with the representation of the Member Resource.

    Editing a Resource

    Client                                     Server
      |                                           |
      |  1.) PUT to Member URI                    |
      |      Member Representation                |
      |------------------------------------------>|
      |                                           |
      |  2.) 200 OK                               |
      |<------------------------------------------|
      |                                           |

    1. The client sends a PUT request to store a representation of a Member Resource.
    2. If the request is successful, the server responds with a status code of 200.

    Can anyone spot what's wrong with this interaction? The first problem is a minor one, though it may prove problematic in certain cases. The problem is pointed out in the note in the documentation on Updating posts on Google Blogger via GData, which states:

    IMPORTANT! To ensure forward compatibility, be sure that when you POST an updated entry you preserve all the XML that was present when you retrieved the entry from Blogger. Otherwise, when we implement new stuff and include <new-awesome-feature> elements in the feed, your client won't return them and your users will miss out! The Google data API client libraries all handle this correctly, so if you're using one of the libraries you're all set.

    Thus each client is responsible for ensuring that it doesn't lose any XML that was in the original atom:entry element it downloaded. The second problem is more serious and should be of concern to anyone who's read Editing the Web: Detecting the Lost Update Problem Using Unreserved Checkout. The problem is that data is lost if the entry has changed between the time the client downloaded it and the time it tries to PUT its changes.

    Even if the client does a HEAD request and compares ETags just before PUTing its changes, there's always the possibility of a race condition where an update occurs after the HEAD request. After a certain point, it is probably reasonable to just go with "most recent update wins" which is the simplest conflict resolution algorithm in existence. Unfortunately, this approach fails because the Atom Publishing Protocol makes client applications responsible for all the content within the atom:entry even if they are only interested in one field.

    Let's go back to the Facebook example above. Having an API now makes it quite likely that users will have multiple applications editing their data at once, and sometimes these applications will change the data without direct user intervention. For example, imagine Dave Fetterman has just moved to New York City and is updating his data across various services. So he updates his status message in his favorite IM client to "I've moved" then goes to Facebook to update his current location. However, he's installed a plugin that synchronizes his IM status message with his Facebook status message. So the IM plugin downloads the atom:entry that represents Dave Fetterman, Dave then updates his address on Facebook, and right afterwards the IM plugin uploads his profile information with the old location and his new status message. The IM plugin is now responsible for data loss in a field it doesn't even operate on directly (see the sketch after this list).

  3. Poor support for hierarchy: A problem with the Atom data model is that it doesn't directly support nesting or hierarchies. You can have a collection of media resources or entry resources, but the entry resources cannot themselves contain entry resources. This means that if you want to represent an item that has children, they must be referenced via a link instead of included inline. This makes sense when you consider the blog syndication and blog editing background of Atom, since it isn't a good idea to include all the comments on a post directly as children of the item in the feed or when editing the post. On the other hand, when you have a direct parent<->child hierarchical relationship, where the child is an addressable resource in its own right, it is cumbersome for clients to always have to make two or more calls to get all the data they need.
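Here's the toy demonstration of the race from problem #2 promised above. No HTTP is involved; the dictionary stands in for the atom:entry that each client replaces wholesale:

# Two clients do read-modify-write on the whole record, and the slower
# writer silently clobbers a field it never meant to touch.
profile = {"location": "Palo Alto, CA", "status": "Hello"}

# 1. The IM plugin downloads the entire profile to change one field.
im_plugin_copy = dict(profile)

# 2. Meanwhile Dave updates his location through another client.
profile["location"] = "New York, NY"

# 3. The plugin uploads its whole, stale copy with only the status changed.
im_plugin_copy["status"] = "I've moved"
profile = im_plugin_copy

print(profile["location"])   # "Palo Alto, CA" -- the move was silently lost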

UPDATE: Bill de hÓra responds to these issues in his post APP on the Web has failed: miserably, utterly, and completely and points out two more problems that developers may encounter while implementing GData/APP.


 

June 7, 2007
@ 06:06 PM

A couple of months ago, I was asked to be part of a recruiting video for the Microsoft Online Business Group (i.e. Windows Live and MSN). The site with the video is now up at http://www.whywillyouworkhere.com. As I expected, I sound like a dork. And as usual I was repping G-Unit. I probably should have worn something geeky like what I have on today, a Super Mario Bros. 1up T-shirt. :)


 

Categories: MSN | Windows Live