Over the past two weeks I participated in panels at both SXSW and MIX 09 on the growing trend of providing streams of user activities on social sites and aggregating these activities from multiple services into a single experience. Aggregating activities from multiple sites into a single service for the purpose of creating an activity stream is fairly commonplace today and was popularized by FriendFeed. This functionality now exists on many social networking sites and related services including Facebook, Yahoo! Profile and the Windows Live Profile.

In general, the model is to receive or retrieve user updates from a social media site like Flickr and make these updates available on the user's profile on the target social network and share it with the user's friends via an activity stream (or news feed) on the site. The diagram below attempts to capture this many-to-many relationship as it occurs today using some well known services as examples.

The bidirectional arrows are meant to indicate that the relationship can be push-based where the content-based social media site notifies the target social network of new updates from the user or pull-based where the social network polls the site on a regular basis seeking new updates from the target user.
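To make the push versus pull distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ContentSite`, `subscribe`, `publish` and `updates_since` are hypothetical names for an in-memory stand-in, not the API of Flickr or any other real service.

```python
from typing import Callable, List, Tuple

Update = Tuple[str, str]  # (user, description); a made-up shape for this sketch


class ContentSite:
    """Hypothetical in-memory stand-in for a content site."""

    def __init__(self) -> None:
        self._updates: List[Update] = []
        self._subscribers: List[Callable[[Update], None]] = []
        self.poll_count = 0

    def subscribe(self, callback: Callable[[Update], None]) -> None:
        # Push model: the social network registers interest once.
        self._subscribers.append(callback)

    def publish(self, user: str, description: str) -> None:
        update = (user, description)
        self._updates.append(update)
        # Push model: the content site notifies every subscriber itself.
        for notify in self._subscribers:
            notify(update)

    def updates_since(self, cursor: int) -> List[Update]:
        # Pull model: every call costs a request, even when nothing is new.
        self.poll_count += 1
        return self._updates[cursor:]


site = ContentSite()

# Push: one registration, then updates arrive as they happen.
pushed: List[Update] = []
site.subscribe(pushed.append)

# Pull: the social network polls on every tick of its schedule.
pulled: List[Update] = []
cursor = 0
for tick in range(5):
    if tick == 3:
        site.publish("alice", "uploaded a photo")
    new_updates = site.updates_since(cursor)
    cursor += len(new_updates)
    pulled.extend(new_updates)
```

Both approaches end up with the same single update, but the pull side made five requests to get it while the push side made none.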

There are two problems that sites have to deal with in this model:

  1. Content sites like Flickr have to deal with being polled unnecessarily millions of times a day by social networks seeking photo updates from their users. There is the money quote from last year that FriendFeed polled Flickr 2.7 million times a day to retrieve a total of fewer than 7,000 updates. Even if they move to a publish-subscribe model, it would mean not only having to track which users are of interest to which social network but also targeting APIs on different social networks that are radically different (aka the beautiful f-ing snowflake API problem).

  2. Social aggregation services like FriendFeed and Windows Live have to target dozens of sites, each with different APIs or schemas. Even in the case where the content sites support RSS or Atom, they often use radically different schemas for representing the same data.
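To put the polling numbers from the first problem in perspective, here is the back-of-the-envelope arithmetic in Python. It uses only the two figures quoted above and treats 7,000 as an upper bound on the updates, so the real hit rate would be somewhat lower.

```python
polls_per_day = 2_700_000   # FriendFeed polls of Flickr per day, as quoted above
updates_per_day = 7_000     # upper bound: the quote says fewer than 7,000

# Fraction of polls that could have returned something new.
hit_rate = updates_per_day / polls_per_day

# Average number of polls spent for each update retrieved.
polls_per_update = polls_per_day / updates_per_day

print(f"At most {hit_rate:.2%} of polls found an update")
print(f"At least {polls_per_update:.0f} polls per update retrieved")
```

In other words, roughly one poll in every 386 could have returned anything useful.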

The approach I've been advocating along with others in the industry is that we need to adopt standards for activity streams in a way that reduces the complexity of this many-to-many conversation that is currently going on between social sites.
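As a rough illustration of what a shared activity format buys an aggregator, the sketch below maps two differently shaped payloads onto one common record. The `Activity` fields, the source names and the payload keys are all invented for this example; they are not taken from any real site's API or from any published activity stream specification.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Activity:
    """Hypothetical common shape for an activity: who did what to which object."""
    actor: str
    verb: str
    object_url: str
    published: str  # ISO 8601 timestamp


def from_photo_site(raw: Dict[str, Any]) -> Activity:
    # Assumed payload shape for an imaginary photo site.
    return Activity(raw["owner"], "post", raw["photo_url"], raw["uploaded_at"])


def from_bookmark_site(raw: Dict[str, Any]) -> Activity:
    # Assumed payload shape for an imaginary bookmarking site.
    return Activity(raw["user"]["name"], "save", raw["link"], raw["time"])


# Without a standard, the aggregator maintains one adapter per site.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Activity]] = {
    "photo-site": from_photo_site,
    "bookmark-site": from_bookmark_site,
}


def normalize(source: str, raw: Dict[str, Any]) -> Activity:
    return ADAPTERS[source](raw)


photo = normalize("photo-site", {
    "owner": "alice",
    "photo_url": "http://photos.example.com/1",
    "uploaded_at": "2009-03-24T16:08:36Z",
})
bookmark = normalize("bookmark-site", {
    "user": {"name": "alice"},
    "link": "http://example.com/article",
    "time": "2009-03-24T16:16:23Z",
})
```

If content sites emitted something like `Activity` directly, the `ADAPTERS` table would shrink to a single entry, which is the reduction in complexity the standards effort is after.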

While I was at SXSW, I met one of the folks from Gnip who is advocating an alternate approach. He argued that even with activity stream standards we've only addressed part of the problem. Such standards may mean that FriendFeed gets to reuse their Flickr code to poll Smugmug with little to no changes but it doesn't change the fact that they poll these sites millions of times a day to get a few thousand updates.

Gnip has built a model where content sites publish updates to Gnip, and social networking sites can then choose to either poll Gnip or receive updates from Gnip when an update matches one of the rules they have created (e.g. notify us if you get a Digg vote from Carnage4Life). The following diagram captures how Gnip works.
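The general shape of that model, a broker that accepts published updates and delivers them by rule, can be sketched as follows. This is a toy written for this post under my own assumptions, not Gnip's actual API: `Broker`, `add_rule`, `on_push`, `publish` and `poll` are hypothetical names, and a rule here is just a (source, actor) pair.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, NamedTuple, Set, Tuple


class Update(NamedTuple):
    source: str
    actor: str
    body: str


class Broker:
    """Toy middleman: publishers push in, subscribers push out or poll."""

    def __init__(self) -> None:
        self._rules: DefaultDict[str, Set[Tuple[str, str]]] = defaultdict(set)
        self._callbacks: Dict[str, Callable[[Update], None]] = {}
        self._inbox: DefaultDict[str, List[Update]] = defaultdict(list)

    def add_rule(self, subscriber: str, source: str, actor: str) -> None:
        self._rules[subscriber].add((source, actor))

    def on_push(self, subscriber: str, callback: Callable[[Update], None]) -> None:
        # Subscribers that register a callback get matching updates pushed.
        self._callbacks[subscriber] = callback

    def publish(self, update: Update) -> None:
        # The content site publishes once; the broker handles the fan-out.
        for subscriber, rules in self._rules.items():
            if (update.source, update.actor) not in rules:
                continue
            callback = self._callbacks.get(subscriber)
            if callback is not None:
                callback(update)
            else:
                self._inbox[subscriber].append(update)

    def poll(self, subscriber: str) -> List[Update]:
        # Subscribers without a callback poll the broker, not the content site.
        return self._inbox.pop(subscriber, [])


broker = Broker()
received: List[Update] = []

broker.add_rule("aggregator-a", "digg", "Carnage4Life")
broker.on_push("aggregator-a", received.append)    # prefers push
broker.add_rule("aggregator-b", "digg", "Carnage4Life")  # prefers to poll

broker.publish(Update("digg", "Carnage4Life", "voted on a story"))
broker.publish(Update("digg", "someone-else", "voted on a story"))

polled = broker.poll("aggregator-b")
```

The second update matches nobody's rule and is dropped, so neither aggregator pays anything for activity it does not care about.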

The benefit of this model to content sites like Flickr is that they no longer have to worry about being polled millions of times a day by social aggregation services. The benefit to social networking sites is that they now get a consistent format for data from the social media sites they care about and can choose to either pull the data or have it pushed to them.

The main problem I see with this model is that it sets Gnip up to be a central point of failure, and I'd personally rather interact directly with the content services instead of injecting a middle man into the process. However, I can see how their approach would be attractive to many sites that might be buckling under the load of being constantly polled and to social aggregation sites that are tired of hand coding adapters for each new social media site they want to integrate with.

What do you think of Gnip's service and the problem space in general?

Now Playing: Eamon - F**k It (I Don't Want You Back)


Tuesday, March 24, 2009 4:08:36 PM (GMT Standard Time, UTC+00:00)
Cannot upload folders to skydrive? WTF? How would I ever organize my files if I cannot upload folders?
Tuesday, March 24, 2009 4:16:23 PM (GMT Standard Time, UTC+00:00)
Whether it is better to deal with content providers directly or through a middle man depends on scope and scale. If your site is limited to interaction with a small number of provider sites, and you intend to support a moderate number of users, then dealing directly may be the best option. However, if your site interactions are large in number (scale) and cross many providers' platforms (scope), or if you intend to grow in scope and/or scale, you are in for trouble when you ask for direct interfacing. The middleman may be just the sort of adapter that will save you a lot of time/cost/energy.
Tuesday, March 24, 2009 7:57:53 PM (GMT Standard Time, UTC+00:00)
As a consumer of internet data, I have joined in the calls for the "insane" idea of standardization of data formats on the web for some time now. While I get your point that Gnip may be a single point of failure, I must say that in our case Gnip takes us one step closer to data format standardization (albeit virtual). Our business is about information mining and text analysis; however, we had to be distracted from our core mission in order to build a framework for pulling data feeds from the various sources, cleansing and formatting them (and it's an ongoing effort). While I am proud of the adaptable and scalable framework that we built and the efficiencies that it has brought us, I am also cognizant of the fact that, unlike larger companies, we cannot afford to spend time managing and maintaining data feeds; we need to spend all our time on text analysis. When consuming data feeds from a multitude of data providers, keeping up with their format changes, up times, data throttling and data quality issues can prove to be a full time task. That is what makes Gnip an attractive proposition to data processors like us. I can offload all the work relating to data access to Gnip and I can concentrate on what we do best.
Tuesday, March 24, 2009 9:45:57 PM (GMT Standard Time, UTC+00:00)
I find gnip to be very useful--we use it in Plaxo to get near-real-time updates from our users' twitter, delicious, and digg feeds. Even if we could get push-updates directly from those content sites, unless they all followed some standard protocol for letting us tell them which users to track and pushing us updates, it would still be more painful than just telling gnip "start tracking site-xyz for us" and having everything else "just work".

I certainly agree that single-points-of-failure should be avoided in general (both for technical and strategic/political reasons), but as long as there's no lock-in per se, then Gnip is a convenience that you can choose to take or leave. Right now, we take it because they're helpful and not doing anything evil. But if that changes, we can just go back to polling or getting content sites to push to us.
Tuesday, March 24, 2009 11:49:10 PM (GMT Standard Time, UTC+00:00)
My reply got so long I posted it on my blog: Gnip's Opportunity In the Web API QoS Space
Thursday, March 26, 2009 12:23:25 AM (GMT Standard Time, UTC+00:00)
Dare -- I appreciate the thoughtful writeup. It's great seeing some of the new network diagrams being used elsewhere; it must mean that they're actually helpful.

I don't agree with the Single Point of Failure perspective, or rather, I think that almost all things are SPoFs: Every line of code you write is a SPoF. AWS is a SPoF. PayPal is a SPoF. Hell, Google is a SPoF in terms of search traffic. SPoFs are a fact of life.
That said, when you're talking about Microsoft scale, you literally can throw near limitless resources at commodity work for the warm fuzzies that come with doing your own integrations. In those edge cases, go for it. But as you mention, there are likely (hopefully) a bunch of companies that will find Gnip useful.

Take care, amigo!
Friday, March 27, 2009 3:22:52 AM (GMT Standard Time, UTC+00:00)
Hi Dare, I have blogged my response to your question at Web 2.0 Discovers Integration 2.0
Wednesday, April 1, 2009 1:11:36 AM (GMT Daylight Time, UTC+01:00)
Can't you have the best of both worlds? i.e. A standardized data format for push notification.

Applications publish their endpoints, and users tell sites which applications they want to publish their data to.
Wednesday, April 1, 2009 4:19:10 AM (GMT Daylight Time, UTC+01:00)
There's still the problem that it leaves sites having to manage publishing updates to dozens of sites on behalf of their users. Gnip offers them the option of publishing to Gnip who then worries about publishing to dozens of subscribers on their behalf.
Wednesday, April 1, 2009 9:55:04 PM (GMT Daylight Time, UTC+01:00)

Gnip does offer those advantages, but I'm not sure that it scales; it seems to face the same issues you talked about in this article.

What I suggested was a flip on your current proposal. Instead of the consuming application having to pull data from a dozen sites, let's have the publishing applications push notifications out. Of course, there are new security implications in this approach, but these are not intractable problems.

If the implementation is standardized, it should be easy for sites to integrate this. The problem of consumers managing a user's subscriptions shifts to the publisher managing the user's applications.