The number one problem facing developers of feed readers is how to identify posts. How does a feed reader tell a new post from an old one whose title or permalink has changed? The general approach is to pick a unique identifier from the metadata of each feed item and use it to tell the item apart from others. In the Atom 0.3 and 1.0 syndication formats the identifier is the <atom:id> element, in RSS 1.0 it is the rdf:about attribute, and in RSS 0.9x and RSS 2.0 it is the <guid> element.

The problem is that many RSS 0.9x and 2.0 feeds do not have a <guid> element, which usually means a feed reader has to come up with its own custom mechanism for identifying items. In many cases, using the <link> element is enough because most items in a feed map to a single web resource with a permalink URL. In some pathological cases, a feed has neither <guid> nor <link>, or even worse, uses the same <link> value for every item in the feed. In such cases, feed readers usually resort to heuristics, which are guaranteed to be wrong at least some of the time.
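The fallback chain described above can be sketched roughly as follows. This is an illustrative sketch, not RSS Bandit's actual algorithm; the dictionary keys are hypothetical names standing in for whatever a real parser exposes, and the final hash-based heuristic is exactly the kind of last resort that will be wrong some of the time.

```python
import hashlib

def item_id(item):
    """Pick a unique identifier for a feed item, preferring explicit IDs.

    `item` is assumed to be a dict of parsed element values; the keys
    below are illustrative, not any particular parser's API.
    """
    # Explicit identifiers: Atom <atom:id>, RSS 1.0 rdf:about,
    # RSS 0.9x/2.0 <guid>.
    for key in ("atom:id", "rdf:about", "guid"):
        if item.get(key):
            return item[key]
    # Next best: the permalink, which is usually unique per item.
    if item.get("link"):
        return item["link"]
    # Last resort: a heuristic fingerprint. This breaks if the title
    # or date is edited, which is why heuristics are unreliable.
    fingerprint = (item.get("title", "") + item.get("pubDate", "")).encode("utf-8")
    return hashlib.md5(fingerprint).hexdigest()
```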

So what does this have to do with the Newsgator API? Users of recent versions of RSS Bandit can synchronize the state of their RSS feeds with Newsgator Online using the Newsgator API. Where things get tricky is that this means both RSS Bandit and Newsgator Online either need to use the same techniques for identifying posts OR have a common way to map between their identification mechanisms. When I first used the API, I noticed that Newsgator has its own notion of a "Newsgator ID" which it expects clients to use. In fact, it's worse than that. Newsgator Online assumes that clients that synchronize with it actually fetch all their data from Newsgator Online, including feed content. This is a pretty big assumption to make, but I'm sure it made it easier to solve a bunch of tricky development problems for their various products. Instead of worrying about keeping data and algorithms on the clients in sync with the server, they simply replace all the data on the client with the server data as part of the 'synchronization' process.

Now that I've built an application that deviates from this fundamental assumption, I've been having all sorts of interesting problems. The most recent is that some users complained that read/unread state wasn't being synced via the Newsgator API. When I investigated, it turned out this is because I use <guid> elements to identify posts in RSS Bandit while the Newsgator API uses the "Newsgator ID". Even worse, they don't even expose the original <guid> element in the returned feed items. So now it looks like fixing the read/unread sync bug involves bigger and more fundamental changes than I expected. More than likely I'll have to switch to using <link> elements as unique identifiers, since it looks like the Newsgator API doesn't throw those away.



Monday, December 12, 2005 4:25:06 AM (GMT Standard Time, UTC+00:00)
"for RSS 1.0 it is the rdf:about attribute"

What does RSS Bandit do with Sigh...
Monday, December 12, 2005 10:58:40 AM (GMT Standard Time, UTC+00:00)
Monday, December 12, 2005 2:01:34 PM (GMT Standard Time, UTC+00:00)
The bloglines API has so little functionality that I haven't bothered trying to integrate it with RSS Bandit.
Monday, December 12, 2005 3:19:48 PM (GMT Standard Time, UTC+00:00)
For feeds that don't expose GUIDs, our internal tools simply generate an MD5 hash of the whole item and use that as a unique identifier. Does wonders for duplicate removal too. :-)
Monday, December 12, 2005 7:26:16 PM (GMT Standard Time, UTC+00:00)
I always figured the timestamp would suffice.
Monday, December 12, 2005 7:41:09 PM (GMT Standard Time, UTC+00:00)
MD5 hashes break if a typo fix or other change is made to the content or title of the post. This is typically the first flawed attempt most feed reader developers try.
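The failure mode is easy to demonstrate: hashing the full serialized item means a one-character typo fix produces a different identifier, so the corrected item shows up as "new" (and often as a duplicate). The XML snippets below are made-up examples.

```python
import hashlib

def md5_id(item_xml):
    # Identify an item by hashing its entire serialized content.
    return hashlib.md5(item_xml.encode("utf-8")).hexdigest()

original = "<item><title>Hello Wrold</title><link>http://example.com/1</link></item>"
corrected = "<item><title>Hello World</title><link>http://example.com/1</link></item>"

# Fixing the typo changes the hash, so a reader using this scheme
# treats the corrected post as a brand-new item.
assert md5_id(original) != md5_id(corrected)
```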
Tuesday, December 13, 2005 1:40:48 AM (GMT Standard Time, UTC+00:00)
The fundamental problem, as you know, is that not all items have guids. So in some cases we have to identify items by some other means.

We have a number of algorithms we use to determine uniqueness. Our old, first-generation API (as implemented in NG/Outlook 2.0) depended on the client and server using exactly the same algorithm...and believe me, it's not something you would have wanted to implement.

The current NewsGator API you're working with uses multi-stage algorithms, including "fuzzy" text matching, to uniquely identify items across time, even in the face of minor edits. And since it's implemented on the server only, we can refine it over time and ensure all products that work with our system can take advantage of it.

We then provide a unique ID to the client, which they can use across time to uniquely identify a piece of information. But as you've found, you need to retrieve the content from NewsGator.

That's not nearly as bad as it sounds. We can make some pretty serious bandwidth and other optimizations for you. Right now, when you retrieve directly from sources, this is your workflow:

1. Retrieve feed 1
2. If you didn't get a not-modified response, you have the entire feed. Parse the content, and compare to your existing content store to find new items
3. Repeat for other feeds

If you have 100 feeds, you're making at least 100 requests every polling interval.

With the NG API, here's the workflow:

1. Ask API for new content. It returns a list of feeds with new content or modified state.
2. Retrieve content for that subset of feeds. For each one, you'll get a combination of new content and new state changes.

So not only do you retrieve fewer feeds (say 5 out of 100), but you retrieve less content for each of those feeds. It's highly efficient for the client.
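The request-count arithmetic in the two workflows above can be sketched with rough numbers. This is just an illustration of the comparison being made, with made-up function names; real traffic also depends on response sizes, conditional GETs, and polling intervals.

```python
def direct_poll_requests(num_feeds):
    # Polling sources directly: one HTTP request per feed per
    # polling interval, even when most feeds are unchanged.
    return num_feeds

def api_requests(num_feeds, feeds_with_changes):
    # Aggregator-API style: one request to list the feeds with new
    # content or modified state, then one request per changed feed.
    return 1 + feeds_with_changes

print(direct_poll_requests(100))  # 100 requests per interval
print(api_requests(100, 5))       # 6 requests per interval
```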

But if you're going to try to retrieve the content yourself, and match it up against what the API gives you, it's going to be dramatically more complicated - and use far more bandwidth. It will actually be the worst case - the combination of both scenarios above.
Comments are closed.