Gnip is a newly launched startup that pitches itself as a service that aims to “make data portability suck less”. Mike Arrington describes the service in his post Gnip Launches To Ease The Strain On Web Services, which is excerpted below:

A close analogy is a blog ping server (see our overview here). Ping servers tell blog search engines like Technorati and Google Blog Search when a blog has been updated, so the search engines don’t have to constantly re-index sites just to see if new content has been posted. Instead, the blog tells the ping server when it updates, which tells the search engines to drop by and re-index. The creation of the first ping server, Weblogs.com, by Dave Winer resulted in orders of magnitude better efficiency for blog search engines.

The same thinking basically applies to Gnip. The idea is to gather simple information from social networks - just a username and the fact that they created new content (like writing a Twitter message, for example). Gnip then distributes that data to whoever wants it, and those downstream services can then access the core service’s API, with proper user authentication, and access the actual data (in our example, the actual Twitter message).

From a user’s perspective, the result is faster data updates across services and less downtime for services since their APIs won’t be hit as hard.

From my perspective, Gnip also shares some similarity to services like FeedBurner as well as blog ping servers. The original purpose of blog ping servers was to make it cheaper for services like Technorati and Feedster to index the blogosphere without having to invest in a Google-sized server farm and crawl the entire Web every couple of minutes. In addition, since blogs often have tiny readerships and are thus infrequently linked to, crawling alone was not enough to ensure that they find their way into the search index. It wasn’t about taking load off of the sites that were doing the pinging.

On the other hand, FeedBurner hosts a site’s RSS feed as a way to take load off of their servers and then provides analytics data so the site doesn’t miss out from losing the direct connection to its subscribers. This is more in line with the expectation that Gnip will take load off of a service’s API servers. However unlike FeedBurner, Gnip doesn’t actually store the user data from the social networking site. It simply stores a record that indicates that “user X on site Y made an update of type Z at time T”.  The thinking is that web sites will publish a notification to Gnip whenever their users perform an update. Below is a sample interaction between Digg and Gnip where Digg notifies Gnip that the users amy and john.doe have dugg two stories.

===>
  POST /publishers/digg/activity.xml
  Accept: application/xml
  Content-Type: application/xml
  
  <activities>
    <activity at="2008-06-08T10:12:42Z" uid="amy" type="dugg" guid="http://digg.com/odd_stuff/a_story"/>
    <activity at="2008-06-09T09:14:07Z" uid="john.doe" type="dugg" guid="http://digg.com/odd_stuff/really_weird"/>
  </activities>

<---
  200 OK
  Content-Type: application/xml

  <result>Success</result>
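To make the publisher side concrete, here is a minimal Python sketch of how a site might assemble the body of that POST. The function name build_activities_xml is my own invention, not part of any Gnip client library, and the sketch stops short of sending the request; it only builds the XML shown above.

```python
import xml.etree.ElementTree as ET


def build_activities_xml(activities):
    """Serialize (at, uid, type, guid) tuples into an <activities> document."""
    root = ET.Element("activities")
    for at, uid, activity_type, guid in activities:
        ET.SubElement(
            root,
            "activity",
            {"at": at, "uid": uid, "type": activity_type, "guid": guid},
        )
    return ET.tostring(root, encoding="unicode")


# Same data as the Digg example; this string would be the POST body.
payload = build_activities_xml([
    ("2008-06-08T10:12:42Z", "amy", "dugg", "http://digg.com/odd_stuff/a_story"),
    ("2008-06-09T09:14:07Z", "john.doe", "dugg", "http://digg.com/odd_stuff/really_weird"),
])
```

Batching several activities into one document, as the sample does, means one request per batch rather than one per user action.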

There are two modes in which "subscribers" can choose to interact with the data published to Gnip. The first is a mode similar to how blog search engines interact with the changes.xml file on Weblogs.com and other blog ping servers. For example, services like Summize or TweetScan can ask Gnip for the last hour of changes on Twitter instead of using whatever mechanism they use today to crawl the site. Below is what a sample interaction to retrieve the most recent updates on Twitter from Gnip would look like:

===>
GET /publishers/twitter/activity/current.xml
Accept: application/xml

<---
200 OK
Content-Type: application/xml

<activities>
<activity at="2008-06-08T10:12:07Z" uid="john.doe" type="tweet" guid="http://twitter.com/john.doe/statuses/42"/>
<activity at="2008-06-08T10:12:42Z" uid="amy" type="tweet" guid="http://twitter.com/amy/statuses/52"/>
</activities>
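On the subscriber side, a crawler would turn that response into a list of items to go fetch from Twitter. The sketch below is only an illustration under my own assumptions: parse_activities is a hypothetical helper, and the response body is pasted in as a string rather than retrieved over HTTP.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE = """<activities>
<activity at="2008-06-08T10:12:07Z" uid="john.doe" type="tweet" guid="http://twitter.com/john.doe/statuses/42"/>
<activity at="2008-06-08T10:12:42Z" uid="amy" type="tweet" guid="http://twitter.com/amy/statuses/52"/>
</activities>"""


def parse_activities(xml_text):
    """Return one dict per <activity>, with 'at' parsed as a UTC datetime."""
    activities = []
    for node in ET.fromstring(xml_text).findall("activity"):
        at = datetime.strptime(node.get("at"), "%Y-%m-%dT%H:%M:%SZ")
        activities.append({
            "at": at.replace(tzinfo=timezone.utc),
            "uid": node.get("uid"),
            "type": node.get("type"),
            "guid": node.get("guid"),
        })
    return activities


# The guids are what a crawler would then request from Twitter itself.
guids_to_fetch = [a["guid"] for a in parse_activities(SAMPLE)]
```

Note that the notification carries no content, so every entry still costs one authenticated call to the originating service.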

The main problem with this approach is the same one that affects blog ping servers. If the rate of updates is more than the ping server can handle then it may begin to fall behind or lose updates completely. Services that don’t want to risk their content not being crawled are best off providing their own update stream that applications can poll periodically. That’s why the folks at Six Apart came up with the Six Apart Update Stream for LiveJournal, TypePad and Vox weblogs.

The second mode is one that has gotten Twitter fans like Dave Winer raving about Gnip being the solution to Twitter’s scaling problems. In this mode, an application creates a collection of one or more usernames they are interested in. Below is what a collection document created by the Twadget application to indicate that it is interested in my Twitter updates might look like.

<collection name="twadget-carnage4life">
     <uid name="carnage4life" publisher.name="twitter"/>
</collection>

Then, instead of polling Twitter every 5 minutes for updates, it polls Gnip every 5 minutes and only talks to Twitter’s servers when Gnip indicates that I’ve made an update since the last time the application polled Gnip. The interaction between Twadget and Gnip would then be as follows:

===>
GET /collections/twadget-carnage4life/activity/current.xml
Accept: application/xml

<---
200 OK
Content-Type: application/xml

<activities>
<activity at="2008-06-08T10:12:07Z" uid="carnage4life" type="tweet" guid="http://twitter.com/Carnage4Life/statuses/850726804"/>
</activities>
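The client logic this implies can be sketched in a few lines of Python. Everything here is an assumption for illustration: poll_once and the fetch_from_twitter callback are names I made up, and the Gnip response is a pasted string instead of a live request. The point is just that Twitter is only contacted for activity newer than the last poll.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_time(text):
    """Parse the 'at' timestamp format used in the samples as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def poll_once(gnip_xml, last_seen, fetch_from_twitter):
    """Fetch only activities newer than last_seen; return the new high-water mark."""
    newest = last_seen
    for node in ET.fromstring(gnip_xml).findall("activity"):
        at = parse_time(node.get("at"))
        if at > last_seen:
            fetch_from_twitter(node.get("guid"))  # the only call that hits Twitter
            newest = max(newest, at)
    return newest


RESPONSE = """<activities>
<activity at="2008-06-08T10:12:07Z" uid="carnage4life" type="tweet" guid="http://twitter.com/Carnage4Life/statuses/850726804"/>
</activities>"""

fetched = []  # stands in for real requests to Twitter
mark = parse_time("2008-06-08T10:00:00Z")
mark = poll_once(RESPONSE, mark, fetched.append)  # one new activity, one fetch
mark = poll_once(RESPONSE, mark, fetched.append)  # nothing newer, no Twitter traffic
```

The second poll returns the same activity, but because the high-water mark has advanced the client makes no further request to Twitter.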

Of course, this makes me wonder why one would think that it is feasible for Gnip to build a system that can handle the API polling traffic of every microblogging and social networking site out there, but it is infeasible for Twitter to figure out how to handle the polling traffic for their own service. Talk about lowered expectations. ;)

So what do I think of Gnip? I think the ping server mode may be of some interest for services that think it is cheaper to have code that pings Gnip after every user update instead of building out an update stream service. However, since a lot of sites already have some equivalent of the public timeline, it isn’t clear that there is a huge need for a ping service. Crawlers can just hit the public timeline, which I assume is what services like Summize and TweetScan do to keep their indexes of tweets up to date.

As for using Gnip as a mechanism for reducing the load API clients put on a microblogging or similar service? Gnip is totally useless for that in its current incarnation. API clients aren’t interested in updates made by a single user. They are interested in all the updates made by all the people the user is following. So for Twadget to use Gnip to lighten the load it causes on Twitter’s servers on my behalf, it has to build a collection of all the people I am following in Gnip and then keep that list of users in sync with whatever that list is on Twitter. But if it has to constantly poll Twitter for my friend list, isn’t it still putting the same amount of load on Twitter? I guess this could be fixed by having Twitter publish follower/following lists to Gnip, but that introduces all sorts of interesting technical and privacy issues. But that doesn’t matter, since the folks at Gnip brag about only keeping 60 minutes’ worth of updates as the “secret sauce” to their scalability. This means that if my Twitter client hasn’t polled Gnip in a 60 minute window (maybe my laptop is closed) then it doesn’t matter anyway and it has to poll Twitter. I suspect someone didn’t finish doing their homework before rushing to “launch” Gnip.
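Here is a rough Python sketch of the two problems described above. All of the function names and the sample friend list are hypothetical; the only things taken from Gnip’s description are the collection document format and the 60 minute retention window.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(minutes=60)  # the 60 minute window mentioned above


def build_collection(name, uids, publisher="twitter"):
    """Build a <collection> document with one <uid> per followed user."""
    root = ET.Element("collection", {"name": name})
    for uid in uids:
        ET.SubElement(root, "uid", {"name": uid, "publisher.name": publisher})
    return ET.tostring(root, encoding="unicode")


def collection_drift(following_on_twitter, collection_uids):
    """Return (uids to add, uids to remove) so the collection matches the friend list."""
    following, tracked = set(following_on_twitter), set(collection_uids)
    return sorted(following - tracked), sorted(tracked - following)


def must_fall_back_to_twitter(last_gnip_poll, now, retention=RETENTION):
    """True when the gap between polls exceeds what Gnip retains."""
    return now - last_gnip_poll > retention


# Hypothetical data: the current friend list still has to come from Twitter.
to_add, to_remove = collection_drift(["amy", "john.doe", "dave"], ["amy", "bob"])

closed_lid = datetime(2008, 6, 8, 9, 0, tzinfo=timezone.utc)
reopened = datetime(2008, 6, 8, 10, 30, tzinfo=timezone.utc)
stale = must_fall_back_to_twitter(closed_lid, reopened)  # 90 minute gap
```

The first argument to collection_drift is the catch: producing it means asking Twitter for the friend list, which is exactly the polling the collection was supposed to avoid.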

PS: One thing that is confusing to me is why all communication between applications and Gnip needs to be over SSL. The only thing I can see it adding is making it more expensive for Gnip to run their service. I can’t think of any reason why the interactions described above need to be over a secure channel.

Now Playing: Lil Wayne - Hustler Musik


 

Tuesday, 08 July 2008 01:52:52 (GMT Daylight Time, UTC+01:00)
Dare -- I hope you'll take me up on my offer to call me to chat about Gnip, but in the meantime I'd like to clear up some misconceptions about our service.

We are not targeting distributed apps (and widgets) like Twadget. Over time we'll offer solutions for that, but for now we're focused primarily on large-scale centralized services like MyBlogLog and Plaxo and Lijit that have tens of thousands of service-specific users.

Imagine you have a service with 25,000 Twitter users -- you want realtime updates but you don't want to (or can't) stand up an XMPP server (ignoring the fact that Twitter is no longer offering the XMPP firehose to more than two or three partners). If you query the Twitter API for each of those 25,000 users, you query 2.4 million times in a day. You get throttled at that level, so you have to back off significantly (about once every seven hours). That's not particularly useful. If both sides use Gnip as the middleman, then there's an order of magnitude reduction in polling while simultaneously decreasing latency from hours to seconds.

Over the next 90 days we're rolling out several features that increase the breadth of our service (providing third party polling support, sending complete metadata instead of the notification and normalizing data) and decouple the need for data providers to notify us of all user changes (although that will remain the most effective mechanism).

Hope we can chat sometime soon. I'd love to get your feedback on the stuff we're doing next.

Cheers!
Eric


