The Newsgator API Continues to Frustrate Me

December 12, 2005

@ 01:05 AM

The number one problem facing developers of feed readers is how to identify posts. How does a feed reader tell a new post from an old one whose title or permalink changed? In general, you pick a unique identifier from the metadata of the feed item and use it to tell that item apart from the others. In the Atom 0.3 and 1.0 syndication formats the identifier is the <atom:id> element, in RSS 1.0 it is the rdf:about attribute, and in RSS 0.9x and RSS 2.0 it is the <guid> element.

The problem is that many RSS 0.9x & 2.0 feeds do not have a <guid> element which usually means a feed reader has to come up with its own custom mechanism for identifying items. In many cases, using the <link> element is enough because most items in a feed map to a single web resource with a permalink URL. In some pathological cases, a feed may not have <guid> or <link> OR even worse may use the same value in the <link> element for each item in the feed. In such cases, feed readers usually resort to heuristics which are guaranteed to be wrong at least some of the time.
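To make that fallback chain concrete, here is a minimal sketch in Python. It is not how RSS Bandit or any other reader actually does it; the function name item_ids, the "guid:"/"link:"/"hash:" prefixes, and the choice of hashing title plus description as the last-resort heuristic are all assumptions made for illustration. It only handles un-namespaced RSS 0.9x/2.0 items.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter


def item_ids(rss_xml: str) -> list[str]:
    """Return one identifier per <item>, in document order (illustrative only)."""
    items = list(ET.fromstring(rss_xml).iter("item"))

    # Count how often each link occurs, so that a link shared by several
    # items is never trusted as a unique identifier.
    link_counts = Counter((item.findtext("link") or "").strip() for item in items)

    ids = []
    for item in items:
        guid = (item.findtext("guid") or "").strip()
        link = (item.findtext("link") or "").strip()
        if guid:
            # Best case: the feed supplies its own identifier.
            ids.append("guid:" + guid)
        elif link and link_counts[link] == 1:
            # Next best: a permalink that no other item in the feed uses.
            ids.append("link:" + link)
        else:
            # Heuristic fallback (an assumption of this sketch): hash the
            # title and description. An edited post gets a new hash, which
            # is exactly the kind of mistake heuristics like this make.
            text = (item.findtext("title") or "") + "\n" + (item.findtext("description") or "")
            ids.append("hash:" + hashlib.sha1(text.encode("utf-8")).hexdigest())
    return ids
```

The prefixes keep a <guid> value from colliding with a <link> value that happens to contain the same string. The hash branch is the weak spot: the moment an author fixes a typo in a post, the reader will treat it as a brand new item.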

So what does this have to do with the Newsgator API? Users of recent versions of RSS Bandit can synchronize the state of their RSS feeds with Newsgator Online using the Newsgator API. Where things get tricky is that RSS Bandit and Newsgator Online need either to use the same techniques for identifying posts OR to have a common way to map between their identification mechanisms. When I first used the API, I noticed that Newsgator has its own notion of a "Newsgator ID" which it expects clients to use. In fact, it's worse than that. Newsgator Online assumes that clients that synchronize with it fetch all their data from Newsgator Online, including feed content. This is a pretty big assumption to make, but I'm sure it made it easier to solve a bunch of tricky development problems for their various products. Instead of worrying about keeping data and algorithms on the clients in sync with the server, they just replace all the data on the client with the server data as part of the 'synchronization' process.

Now that I've built an application that deviates from this fundamental assumption, I've been having all sorts of interesting problems. The most recent is that some users complained that read/unread state wasn't being synced via the Newsgator API. When I investigated, it turned out that this is because I use <guid> elements to identify posts in RSS Bandit while the Newsgator API uses the "Newsgator ID". Even worse, they don't expose the original <guid> element in the returned feed items. So now it looks like fixing the read/unread sync bug involves bigger and more fundamental changes than I expected. More than likely I'll have to switch to using <link> elements as unique identifiers, since it looks like the Newsgator API doesn't throw those away.
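Roughly, matching on <link> might look like the sketch below. To be clear, this is not the Newsgator API and not RSS Bandit's code: items are plain dicts, and the key names "link" and "read" as well as the function name sync_read_state are hypothetical, chosen only to show the idea of joining the two sides on the one field both of them keep.

```python
def sync_read_state(local_items, server_items):
    """Copy read flags from server items onto local items with the same link.

    Both arguments are lists of dicts. The "link" and "read" keys are
    hypothetical names for this sketch, not fields of any real API.
    Returns the server items that could not be matched unambiguously.
    """
    # Index local items by link. Several local items may share one link,
    # so each link maps to a list rather than a single item.
    by_link = {}
    for item in local_items:
        link = item.get("link")
        if link:
            by_link.setdefault(link, []).append(item)

    unmatched = []
    for remote in server_items:
        matches = by_link.get(remote.get("link"), [])
        if len(matches) == 1:
            # Exactly one local item has this link, so the mapping is safe.
            matches[0]["read"] = remote["read"]
        else:
            # No local match, or an ambiguous one: leave local state alone.
            unmatched.append(remote)
    return unmatched
```

The unmatched list is where the pathological feeds from earlier end up: items with no <link>, or feeds that reuse the same <link> for every item, can't be mapped this way and would still need some other heuristic.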

Frustrating.