July 28, 2008
@ 02:27 AM

For several months Nick Carr has pointed out that Wikipedia ranks highly in Google's search results for a number of common topics. In his post entitled Googlepedia, he speculated on why Google would see this trend as a threat, in the paragraph excerpted below.

I'm guessing that serving as the front door for a vast ad-less info-moshpit outfitted with open source search tools is not exactly the future that Google has in mind for itself. Enter Knol.

Clearly Nick Carr wasn't the only one who realized that Google was slowly turning into a Wikipedia redirector. Google wants to be the #1 source for information, or at least to be serving ads on the #1 sites on the Internet in specific areas. Wikipedia was slowly eroding the company's effectiveness at achieving both goals. So it is unsurprising that Google has launched Knol and is trying to entice authors away from Wikipedia by offering them a chance to get paid.

What is surprising is that Google is tipping its search results to favor Knol. Or at least that is the conclusion of several search engine optimization (SEO) experts, and it jibes with my own experience.

Danny Sullivan wrote in his post The Day After: Looking At How Well Knol Pages Rank On Google that

We've been assured that just because content sits on Google's Knol site, it won't gain any ranking authority from being part of the Knol domain. OK, so a day after Knol has launched, how's that holding up? I found 1/3 of the pages listed on the Knol home page that I tested ranked in the top results.

I was surprised to see a post covering how Knol's How to Backpack was already hitting the number three spot on Google. Really? I mean, how many links could this page have gotten already? As it turns out, quite a few. And more important, it's featured on the Knol home page, which itself is probably one of the most important links. While Knol uses nofollow on individual knols to prevent link credit from flowing out, it's not used on the home page -- so home page credit can flow to individual knols featured on it.

here's a test knol I made yesterday -- Firefox Plugins For SEO & SEM -- which ranks 28 for firefox plugins for seo. I never linked to it from my article about knol. I don't think it made the Knol home page. I can see only three links pointing at it, and only one of those links uses anchor text relevant to what the page is ranking for. And it's in the top 30 results?

Look, I know that being ranked 28 is pretty much near invisible in terms of traffic you'll get from search engines. But then again, to go from nowhere to the 28th top page in Google out of 755,000 matches? I'm sorry -- don't tell me that being in Knol doesn't give your page some authority.

Aaron Wall noticed something even more insidious in his post entitled Google Knol - Google's Latest Attack on Copyright. He found that when Google detects duplicate content, it favors the copy on Knol over the copy on a site that has existed for years and has decent PageRank. His post is excerpted below.

Another Knol Test

Maybe we are being a bit biased and/or are rushing to judgement? Maybe a more scientific effort would compare how Knol content ranks to other content when it is essentially duplicate content? I did not want to mention that I was testing that when I created my SEO Basics Knol, but the content was essentially a duplicate of my Work.com Guide to Learning SEO (that was also syndicated to Business.com). Even Google shows this directly on the Knol page

Google Knows its Duplicate Content

Is Google the Most Authoritative Publisher?

Given that Google knows that Business.com is a many year old high authority directory and that the Business.com page with my content on it is a PageRank 5, which does Google prefer to rank? Searching for a string of text on the page I found that the Knol page ranks in the search results.

If I override some of Google's duplicate content filters (by adding &filter=0 to the search string) then I see that 2 copies of the Knol page outrank the Business.com page that was filtered out earlier.


Following Danny's example, I also tried running some searches for terms that appear on the Knol homepage and seeing how they did in Google's search. Here's the screenshot of the results of searching for "buttermilk pancakes"

Not bad for a page that has existed on the Web for less than two weeks.

Google is clearly favoring Knol content over content from older, more highly linked sites on the Web. I won't bother with the question of whether Google is doing this on purpose or whether this is some innocent mistake. The important question is "What are they going to do about it now that we've found out?"

Now Playing: One Republic - Stop and Stare

There was an interesting presentation at OSCON 2008 by Evan Henshaw-Plath and Kellan Elliott-McCrea entitled Beyond REST? Building Data Services with XMPP PubSub. The presentation is embedded below.

The core argument behind the presentation can be summarized by this tweet from Tim O'Reilly

On monday friendfeed polled flickr nearly 3 million times for 45000 users, only 6K of whom were logged in. Architectural mismatch. #oscon08

On July 21st, FriendFeed had 45,000 users who had associated their Flickr profiles with their FriendFeed account. FriendFeed polls Flickr about once every 20 – 30 minutes to see if the user has uploaded new pictures. However only about 6,000 of those users logged into Flickr that day, let alone uploaded pictures. Thus there were literally millions of HTTP requests made by FriendFeed that were totally unnecessary.
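The "nearly 3 million" figure is easy to sanity check. Here is a back-of-the-envelope sketch in Python that assumes every linked account is polled on a fixed schedule all day long; the function name is mine and the fixed schedule is a simplification, not something FriendFeed has published.

```python
# Rough estimate of daily polling volume, using the figures quoted above:
# 45,000 linked accounts, each polled once every 20 to 30 minutes.
MINUTES_PER_DAY = 24 * 60


def polls_per_day(users: int, interval_minutes: int) -> int:
    """Requests per day if every user is polled once per interval (an assumption)."""
    return users * (MINUTES_PER_DAY // interval_minutes)


low = polls_per_day(45_000, 30)   # 48 polls per user per day
high = polls_per_day(45_000, 20)  # 72 polls per user per day
print(f"{low:,} to {high:,} requests per day")
```

That works out to somewhere between roughly 2.2 and 3.2 million requests a day, which brackets the number in the tweet.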

Evan and Kellan's talk suggests that instead of Flickr getting almost 3 million requests from FriendFeed, it would be a more efficient model for FriendFeed to tell Flickr which users they are interested in and then listen for updates from Flickr when they upload photos.

They are right. The interaction between Flickr and FriendFeed should actually be a publish-subscribe relationship instead of a polling relationship. Polling is a good idea for RSS/Atom for a few reasons:

  • there are thousands to hundreds of thousands of clients that might be interested in a resource, so having the server keep track of subscriptions is prohibitively expensive
  • a lot of these end points aren't persistently connected (i.e. your desktop RSS reader isn't always running)
  • RSS/Atom publishing is as simple as plopping a file in the right directory and letting IIS or Apache work its magic

The situation between FriendFeed and Flickr is almost the exact opposite. Instead of thousands of clients interested in one document, we have one subscriber interested in thousands of documents. Both end points are always on, or are at least expected to be. The cost of developing a publish-subscribe model is one that both sides can afford.
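To make the difference concrete, here is a minimal in-memory sketch of the publish-subscribe shape in Python. It is not XMPP PubSub and it is not anything Flickr or FriendFeed actually runs; the class name, method names and URLs are all hypothetical. The point is only that the subscriber registers interest once and the publisher does work only when something actually changes.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# A callback receives (user_id, photo_url) when that user uploads something.
Callback = Callable[[str, str], None]


class PhotoUpdateHub:
    """Hypothetical stand-in for the publisher side (e.g. a photo site)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, user_id: str, callback: Callback) -> None:
        """Register interest in one user's uploads."""
        self._subscribers[user_id].append(callback)

    def publish(self, user_id: str, photo_url: str) -> int:
        """Notify subscribers of an upload; returns how many were notified."""
        callbacks = self._subscribers.get(user_id, [])
        for callback in callbacks:
            callback(user_id, photo_url)
        return len(callbacks)


# The aggregator subscribes once per user it cares about...
received: List[Tuple[str, str]] = []
hub = PhotoUpdateHub()
for uid in ("alice", "bob", "carol"):
    hub.subscribe(uid, lambda user, url: received.append((user, url)))

# ...and traffic only flows when somebody actually uploads a photo.
notified_alice = hub.publish("alice", "http://example.com/photos/1")
notified_dave = hub.publish("dave", "http://example.com/photos/2")  # no subscriber
print(received)
```

In this toy version the users who never upload anything generate no traffic at all, which is exactly the property polling lacks.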

Thus this isn't a case of REST not scaling, as implied by Evan and Kellan's talk. This is a case of using the wrong tool to solve your problem because it happens to work well in a different scenario.

The above talk suggests using XMPP, which is an instant messaging protocol, as the publish-subscribe mechanism. In response to the talk, Joshua Schachter (founder of del.icio.us) suggested a less heavyweight publish-subscribe mechanism using a custom API in his post entitled beyond REST. My suggestion for people who believe they have this problem would be to look at using some subset of XMPP and to experiment with off-the-shelf tools before rolling their own solution.

Of course, this is an approach that totally depends on network effects. Today everyone has RSS/Atom feeds while very few services use XMPP. There isn't much point in investing in publishing via XMPP if your key subscribers can't consume it, and vice versa. It will be interesting to see if the popular "Web 2.0" companies can lead the way in using XMPP for publish-subscribe of activity streams from social networks in the same way they kicked off our love affair with RESTful Web APIs.

It should be noted that there are already some "Web 2.0" companies using XMPP as a way to provide a stream of updates to subscribing services to prevent the overload that comes from polling. For example, Twitter has confirmed that it provides an XMPP stream to FriendFeed, Summize, Zappos, Twittervision and Gnip. However, they simply dump out every update that occurs on Twitter to these services instead of having these services subscribe to updates for specific users. This approach is quite inefficient and brings its own set of scaling issues.

The interesting question is why people are only now bringing this up. Shouldn't people have already been complaining about Web-based feed readers like Google Reader and Bloglines for causing the same kinds of problems? I can only imagine how many millions of times a day Google Reader must fetch content from TypePad and WordPress.com, but I haven't seen explicit complaints about this issue from folks like Anil Dash or Matt Mullenweg.

Now Playing: The Pussycat Dolls - When I Grow Up


Disclaimer: This post does not reflect the opinions, thoughts, strategies or future intentions of my employer. These are solely my personal opinions. If you are seeking official position statements from Microsoft, please go here.

Earlier this week, David Recordon announced the creation of the Open Web Foundation at OSCON 2008. His presentation is embedded below.

From the organization's Web site you get the following outline of its mission

The Open Web Foundation is an attempt to create a home for community-driven specifications. Following the open source model similar to the Apache Software Foundation, the foundation is aimed at building a lightweight framework to help communities deal with the legal requirements necessary to create successful and widely adopted specification.

The foundation is trying to break the trend of creating separate foundations for each specification, coming out of the realization that we could come together and generalize our efforts. The details regarding membership, governance, sponsorship, and intellectual property rights will be posted for public review and feedback in the following weeks.

Before you point out that this seems to create yet another "standards" organization for Web technology, there are already canned answers to this question. Google evangelist Dion Almaer provides justification for why existing Web standards organizations do not meet their needs in his post entitled The Open Web Foundation; Apache for the other stuff where he writes 

Let’s take an example. Imagine that you came up with a great idea, something like OAuth. That great idea gains some traction and more people want to get involved. What do you do? People ask about IP policy, and governance, and suddenly you see yourself on the path of creating a new MyApiFoundation.

Wait a minute! There are plenty of standards groups and other organizations out there, surely you don’t have to create MyApiFoundation?

Well, there is the W3C and OASIS, which are pay to play orgs. They have their place, but MyApi may not fit in there. The WHATWG has come up with fantastic work, but the punting on IP is an issue too.

At face value, it's hard to argue with this logic. The W3C charges fees using a weird progressive taxation model where a company pays anything from a few hundred to several thousand dollars depending on how the W3C assesses their net worth. OASIS similarly charges from $1,000 to $50,000 depending on how much influence the member company wants to have in the organization. After that it seems there are a bunch of one-off organizations like the OpenID Foundation and the WHATWG that are dedicated to a specific technology.

Or so the spin from the Open Web Foundation would have you believe.

In truth there is already an organization dedicated to producing "Open" Web technologies that has a well thought out policy on membership, governance, sponsorship and intellectual property rights that isn't pay to play. This is not a new organization, it actually happens to be older than David Recordon who unveiled the Open Web Foundation.

The name of this organization is the Internet Engineering Task Force (IETF). If you are reading this blog post then you are using technologies for the "Open Web" created by the IETF. You may be reading my post in a Web browser in which case the content was transferred to you over HTTP (RFC 2616) and if you're reading it in an RSS reader then I should add that you're also directly consuming my Atom feed (RFC 4287). Some of you are reading this post because someone sent you an email which is another example of an IETF protocol at work, SMTP (RFC 2821).

The IETF policy on membership couldn't be more straightforward: join a mailing list. I am listed as a member of the Atom working group in RFC 4287 because I was a participant in the atom-syntax mailing list. The organization has a well thought out and detailed policy on intellectual property rights as it relates to IETF specifications, which is detailed in RFC 3979: Intellectual Property Rights in IETF Technology and slightly updated in RFC 4879: Clarification of the Third Party Disclosure Procedure in RFC 3979.

I can understand that a bunch of kids fresh out of college are ignorant of the IETF and believe they have to reinvent the wheel to Save the Open Web, but I am surprised that Google, which has had several of its employees participate in the IETF processes that created RFC 4287, RFC 4959, RFC 5023 and RFC 5034, would join in this behavior. Why would Google decide to sponsor a separate standards organization that competes with the IETF and that has less inclusive processes than the IETF, no clear idea of how corporate sponsorship will work and a yet to be determined IPR policy?

That's just fucking weird.

Now Playing: Boyz N Da Hood - Dem Boys (remix) (feat T.I. & The Game)


Categories: Technology

I've been using the redesigned Facebook profile and homepage for the past few days and thought it would be useful to write up my impressions of the changes. Facebook is now the world's most popular social networking site, and one of the ways they've gotten there is by being very focused on listening to their users and improving their user experience based on this feedback. Below are screenshots of the old and new versions of the pages and a discussion of which elements have changed and the user scenarios the changes are meant to improve.

Homepage Redesign



The key changes and their likely justifications are as follows

  • Entry points for creating content are now at the top of the news feed. One of the key features driving user engagement on Facebook is the News Feed. This lets a user know what is going on with their social network as soon as they log on to the site. In a typical example of network effects at work, one person creates some content by uploading a photo or sharing a link and hundreds of people on their friend list benefit by having content to view in their News Feed. If any of those friends responds to the content, this again benefits hundreds of people, and so on. The problem with the old home page was that a user sees their friends uploading photos and sharing links and may want to do so as well, but there is no easy way for her to figure out how to do the same thing without having to go two or three clicks away from the home page. The entry points at the top of the feed will encourage more "impulse" content creation.

  • Left sidebar is gone. There were three groups of items in the left nav: a search box, the list of a user's most frequently accessed applications and an advertisement. The key problem is that the ad is in a bottom corner of the feed. This makes it easy for users to mentally segregate that part of the screen from their vision and either never look there or completely ignore it. Removing that visual ghetto and moving ads to being inline with the feed makes it more likely that users will look at the ad. Ah, but now you need more room to show the ad (all the space isn't needed for news feed stories). So the other elements of the left nav are moved: the search box to the header and the list of most accessed applications to the sidebar on the right. Now you have enough room to stretch out the News Feed's visible area, and advertisers can reuse their horizontal banner ads on Facebook even though this makes the existing feed content now look awkward. This is one place where monetization trumped usability.

  • Comments now shown inline for News Feed items with comments (not visible in screen shot). This may be the feature that made Mike Arrington decide to call the new redesign the FriendFeedization of Facebook. Sites like FriendFeed have proven that showing the comments on an item in the feed inline gives users more content to view in their feeds and increases the likelihood of engagement since the user may want to join the conversation.

Profile Redesign



The key changes and their likely justifications are as follows

  • The profile now has a tabbed model for navigation. This is a massive improvement for a number of reasons. The most important one is that in the old profile, there was a lot of content below the fold. My old profile page is EIGHT pages when printed as opposed to TWO pages when the new profile page is printed. Moving to a tabbed model (i) improves page load times and (ii) increases the number of page views and hence ad impressions.

  • The Mini-Feed and the Wall have been merged. The intent here is to give more visibility to the Wall which in the old model was below the fold. The "guest book" or wall is an important part of the interaction model in social networking sites (see danah boyd's Friendster lost steam. Is MySpace just a fad? essay) and Facebook was de-emphasizing theirs in the old model.

  • Entry points for creating content are at the top of the profile page. Done for the same reason as on the Home page. You want to give users lots of entry points for adding content to the site so that they can kick off network effects by generating content which in turn generates tasty page views.

  • Left sidebar is gone. Again the left sidebar is gone and the advertisement is moved closer to the content, and away from the visual ghetto that is the bottom left of the screen. Search box and most accessed applications are now in the header as well. The intent here is also to improve the likelihood that users will view and react to the ads.

Now Playing: Da Back Wudz - I Don't Like The Look Of It (Oompa)


Yesterday Amazon's S3 service had an outage that lasted about six hours. Unsurprisingly this has led to a bunch of wailing and gnashing of teeth from the very same pundits that were hyping the service a year ago. The first person to proclaim that the sky is falling is Richard MacManus, in his post More Amazon S3 Downtime: How Much is Too Much?, where he writes

Today's big news is that Amazon's S3 online storage service has experienced significant downtime. Allen Stern, who hosts his blog's images on S3, reported that the downtime lasted over 6 hours. Startups that use S3 for their storage, such as SmugMug, have also reported problems. Back in February this same thing happened. At the time RWW feature writer Alex Iskold defended Amazon, in a must-read analysis entitled Reaching for the Sky Through The Compute Clouds. But it does make us ask questions such as: why can't we get 99% uptime? Or: isn't this what an SLA is for?

Om Malik joins in on the fun with his post S3 Outage Highlights Fragility of Web Services which contains the following

Amazon’s S3 cloud storage service went offline this morning for an extended period of time — the second big outage at the service this year. In February, Amazon suffered a major outage that knocked many of its customers offline.

It was no different this time around. I first learned about today’s outage when avatars and photos (stored on S3) used by Twinkle, a Twitter-client for iPhone, vanished.

That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.

Even though the pundits are trying to raise a stink, the people who should be most concerned about this are Amazon S3's customers. Counter to Richard MacManus's claim, not only is there a Service Level Agreement (SLA) for Amazon S3, it promises 99.9% uptime or you get a partial refund. Six hours of downtime sounds like a lot until you realize that 99% uptime allows a little over seven hours of downtime a month and over three and a half days of downtime a year. Amazon S3 is definitely doing a lot better than that.

The only question that matters is whether Amazon's customers can get better service elsewhere at the prices Amazon charges. If they can't, then this is an acceptable loss which is already covered by their SLA. 99.9% uptime still means over eight hours of downtime a year. And if they can, it will put competitive pressure on Amazon to do a better job of managing their network or lower their prices.
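For anyone who wants to check the downtime numbers, the arithmetic is simple enough to sketch in Python. The function name is mine, and the 30-day month and 365-day year are simplifying assumptions, not the terms of Amazon's actual SLA.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


HOURS_PER_MONTH = 30 * 24   # assumes a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumes a non-leap year

for pct in (99.0, 99.9):
    per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    print(f"{pct}% uptime: {per_month:.2f} hours/month, "
          f"{per_year:.2f} hours/year ({per_year / 24:.2f} days)")
```

Under those assumptions, 99% uptime allows about 7.2 hours of downtime a month and about 3.65 days a year, while 99.9% allows about 43 minutes a month and about 8.76 hours a year.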

This is one place where market forces will rectify things or we will reach a healthy equilibrium. Network computing is inherently unreliable and no amount of outraged posts by pundits will ever change that. Amazon is doing a better job than most of its customers can do on their own, for cheaper than they could ever do on their own. Let's not forget that in the rush to gloat about Amazon's downtime.

Now Playing: 2Pac - Life Goes On


Categories: Web Development

For the past few years, the technology press has been eulogizing desktop and server-based software while proclaiming that the era of Software as a Service (SaaS) is now upon us. According to the lessons of the Innovator's Dilemma the cheaper and more flexible SaaS solutions will eventually replace traditional installed software and the current crop of software vendors will turn out to be dinosaurs in a world that belongs to the warm blooded mammals who have conquered cloud based services.

So it seems the answer is obvious: software vendors should rush to provide Web-based services and extricate themselves from their "legacy" shrinkwrapped software business before it is too late. What could possibly go wrong with this plan?

Sarah Lacy wrote an informative article for Business Week about the problems facing software vendors who have rushed into the world of SaaS. The Business Week article is entitled On-Demand Computing: A Brutal Slog and contains the following excerpt

On-demand represented a welcome break from the traditional way of doing things in the 1990s, when swaggering, elephant hunter-style salesmen would drive up in their gleaming BMWs to close massive orders in the waning days of the quarter. It was a time when representatives of Oracle (ORCL), Siebel, Sybase (SY), PeopleSoft, BEA Systems, or SAP (SAP) would extol the latest enterprise software revolution, be it for management of inventory, supply chain, customer relationships, or some other area of business. Then there were the billions of dollars spent on consultants to make it all work together—you couldn't just rip everything out and start over if it didn't. There was too much invested already, and chances are the alternatives weren't much better.

Funny thing about the Web, though. It's just as good at displacing revenue as it is in generating sources of it. Just ask the music industry or, ahem, print media. Think Robin Hood, taking riches from the elite and distributing them to everyone else, including the customers who get to keep more of their money and the upstarts that can more easily build competing alternatives.

But are these upstarts viable? On-demand software has turned out to be a brutal slog. Software sold "as a service" over the Web doesn't sell itself, even when it's cheaper and actually works. Each sale closed by these new Web-based software companies has a much smaller price tag. And vendors are continually tweaking their software, fixing bugs, and pushing out incremental improvements. Great news for the user, but the software makers miss out on the once-lucrative massive upgrade every few years and seemingly endless maintenance fees for supporting old versions of the software.

Nowhere was this more clear than on Oracle's most recent earnings call (BusinessWeek.com, 6/26/08). Why isn't Oracle a bigger player in on-demand software? It doesn't want to be, Ellison told the analysts and investors. "We've been in this business 10 years, and we've only now turned a profit," he said. "The last thing we want to do is have a very large business that's not profitable and drags our margins down." No, Ellison would rather enjoy the bounty of an acquisition spree that handed Oracle a bevy of software companies, hordes of customers, and associated maintenance fees that trickle straight to the bottom line.

SAP isn't having much more luck with Business by Design, its foray into the on-demand world, I'm told. SAP said for years it would never get into the on-demand game. Then when it sensed a potential threat from NetSuite, SAP decided to embrace on-demand. Results have been less than stellar so far. "SAP thought customers would go to a Web site, configure it themselves, and found the first hundred or so implementations required a lot of time and a lot of tremendous costs," Richardson says. "Small businesses are calling for support, calling SAP because they don't have IT departments. SAP is spending a lot of resources to configure and troubleshoot the problem."

In some ways, SaaS vendors have been misled by the consumer Web and have failed to realize that they still need to spend money on sales and support when servicing business customers. Just because Google doesn't advertise its search features and Yahoo! Mail doesn't seem to have a huge support staff that hand-holds customers as they use the product doesn't mean that SaaS vendors can expect to cut their sales and support costs. The dynamics of running a free, advertising-based service aimed at end users are completely different from those of running a service where you expect to charge businesses tens of thousands to hundreds of thousands of dollars to use your product.

In traditional business software development you have three major cycles with their own attendant costs: you have to write the software, you have to market the software and then you have to support the software. Becoming a SaaS vendor does not eliminate any of these costs. Instead it adds new costs and complexities such as managing data centers and worrying about hackers. In addition, thanks to free advertising-based consumer services and the fact that companies like Google have subsidized their SaaS offerings using their monopoly profits in other areas, business customers expect Web-based software to be cheaper than its desktop or server-based alternatives. Talk about being stuck between a rock and a hard place as a vendor.

Finally, software vendors that have existing ecosystems of partners that benefit from supporting and enhancing their shrinkwrapped products also have to worry about where these partners fit in a SaaS world. For an example of the kinds of problems these vendors now face, below is an excerpt from a rant by Vladimer Mazek, a system administrator at ExchangeDefender, entitled Houston… we have a problem which he wrote after attending one of Microsoft's partner conferences

Lack of Partner Direction: By far the biggest disappointment of the show. All of Microsoft’s executives failed to clearly communicate the partnership benefits. That is why partners pack the keynotes, to find a way to partner up with Microsoft. If you want to gloat about how fabulous you are and talk about exciting commission schedules as a brand recommender and a sales agent you might want to go work for Mary Kay. This is the biggest quagmire for Microsoft – it’s competitors are more agile because they do not have to work with partners to go to market. Infrastructure solutions are easy enough to offer and both Google and Apple and Amazon are beating Microsoft to the market, with far simpler and less convoluted solutions. How can Microsoft compete with its partners in a solution ecosystem that doesn’t require partners to begin with?

This is another example of the kind of problems established software vendors will have to solve as they try to ride the Software as a Service wave instead of being flattened by it. Truly successful SaaS vendors will eventually have to deliver platforms that can sustain a healthy partner ecosystem to succeed in the long term. We have seen this in the consumer space with the Facebook platform and in the enterprise space with SalesForce.com's AppExchange. Here is one area where the upstarts that don't have a preexisting shrinkwrap software business can turn a disadvantage (lack of an established partner ecosystem) into an advantage, since it is easier to start from scratch than to retool.

The bottom line is that creating a Web-based version of a popular desktop or server-based product is just part of the battle if you plan to play in the enterprise space. You will have to deal with the sales and support that go with selling to businesses as well as all the other headaches of shipping "cloud based services" which don't exist in the shrinkwrap software world. After you get that figured out, you will want to consider how you can leverage various ISVs and startups to enhance the stickiness of your service and turn it into a platform before one of your competitors does.

I suspect we still have a few years before any of the above happens. In the meantime we will see lots more software companies complaining about the paradox of embracing the Web when it clearly cannibalizes their other revenue streams and is less lucrative than what they've been accustomed to seeing. Interesting times indeed.

Now Playing: Flobots - Handlebars


Sometime last week I learned that podcasting startup PodTech was acquired for less than $500,000. This is a rather ignominious exit for a startup that initially entered the public consciousness with its high profile hire of Robert Scoble and the intent to build a technology news media empire using RSS and podcasts instead of radio waves and newsprint.

When I first heard about PodTech via Robert Scoble's blog, it seemed like a bad business to jump into given the lessons of The Long Tail. The Web creates an overabundance of content and products, which is good for aggregators but bad for creators. Even in 2006 when PodTech was founded you could see this in the success of "Web 2.0" companies that acted as content aggregators like Google, YouTube, Wikipedia and Flickr, while content creators like music labels and newspapers were beginning to scramble for relevance and revenue.

Kevin Kelly has a great post about this called Wagging the Long Tail of Love where he writes

So as one crosses the sections -- going from the short head to the long tail -- one should be consistent and view it from the aggregator's point of view or the creator's point of view. I think it is a mistake to conflate the two view points.

I've been wrestling with this for a while and I think the only advantage to the creator that I can see in the long tail is that aggregators can invent or produce a long tail domain that was not present before.  Like Seth's Squidoo does. Before Squidoo or Amazon or Netflix came along there was no market at all for many of the creations they now distribute. The proposition that long tail aggregators can offer to creators is profound, but simple: you have a choice between a itsy bitsy niche audience (with nano profits) or no audience at all. Before the LT was expanded your masterpiece on breeding salt water aquarium fishes from the Red Sea would have no paying fans. Now you have maybe 100.

One hundred readers/watchers/listeners is not economical. There is no business equation that can sustain profits for continual creation from so few buyers. (It can of course support the business of aggregation above the level of creation.) But the long tail niche creation operates perfectly well in the realm of passion, enthusiasm, obsession, curiosity, peerage, love, and the gift economy.  In the exchange of psychic energy, encouragement, meaning of life, and reasons to live, the long now is a boon.

That is not true about profits. Economically, the more the long tail expands, the more stuff there is to compete with our limited attention as an audience, the more difficult it is for a creator to sell profitably. Or, the longer the tail, the worse for sales.

The Web has significantly reduced the costs of producing and distributing content. Anyone with a computer can publish to a potential audience of hundreds of millions of people for as little as the cost of their Internet connection. This is great for content consumers but it has significantly increased the amount of competition among content creators while also reducing their chances of generating profits from their work since the Web/Internet has provided lots of options for getting quality content for free (both legally and illegally). 

All of this is a long way of saying that in the era of "Web 2.0" it was quite unwise for a VC funded startup to jump into the pool of content creators and thus become a victim of The Long Tail instead of becoming a content aggregator and benefiting from it. Of course, even that may not have saved them since the market for podcast aggregators pretty much dried up once Apple entered the fray.

Now Playing: Lil Wayne - I'm Me


Categories: Current Affairs

One of the problems you have to overcome when building a social software application is that such applications often depend on network effects to provide value to users. An instant messaging application isn't terribly useful unless your friends use the same application and using Twitter feels kind of empty if you don't follow anyone. On the flip side, once an application crosses a particular tipping point then network effects often push it to near monopoly status in certain social or regional networks. This has happened with eBay, Craigslist, MySpace, Facebook and a ton of other online services that depend on network effects. Thus there is a lot of incentive for developers of social software applications to do their best to encourage and harness network effects in their user scenarios.

These observations have led to the notion of Viral Applications, applications which spread like viruses. The problem with a lot of the thinking behind "viral applications" and applications that borrow their techniques is that attempting to spread by any means necessary can be very harmful to the user experience. Here are two examples taken from this week's headlines

From Justine Ezarik, a post entitled The Loopt Debacle where she writes

Loopt is a location based social networking site that uses GPS to determine your exact location and share it with your friends.. and then spam your entire contact list via an SMS invite.

There’s a good chance that if you installed this application you’ve made the same mistake that most people made. While searching for friends who were on the service, apparently a text message was sent out to a large portion of my contact list, along with my phone number and my exact location (you know, since that’s the point of the application). Granted, you would think that if you have someone’s phone number, they’d have yours as well…

Hi, hey.. Over here!! People change their phone number for a reason!! With the ease of syncing contacts on the iPhone, it’s not always guaranteed that everyone in your contact list is a BFF (read: best friend forever). Also, there’s always people you just never want to text.. Like Steve Jobs, or an old boss, or maybe even an ex who would rather push you in front of a bus than get a text message from you?

From Marshall Kirkpatrick, a post entitled Gmail Tries To Be Less Creepy, Fails which states

Gmail, Google's powerful web based email service, announced some changes to its contact management features today. Contact management has for some time been a contentious matter among Google Account holders - the company does strange and mysterious things with your email contacts, including tying them in to some other applications without anyone's permission.

Today's new changes failed to alleviate those concerns, perhaps making the situation even less clear than it was before.

There Are Your Contacts and Then There Are Your Contacts

The post on the official Gmail blog today announced a new policy. There are now two types of contacts in your Gmail contacts list. There are your explicitly added My Contacts and there are your frequently emailed Suggested Contacts. The distinction between the two is unclear enough that I won't even try to summarize it. Read the following closely.

My Contacts contains the contacts you explicitly put in your address book (via manual entry, import or sync) as well as any address you've emailed a lot (we're using five or more times as the threshold for now).

Suggested Contacts is where Gmail puts its auto-created contacts. By default, Suggested Contacts you email frequently are automatically added to My Contacts, but for those of you who prefer tighter control of your address books, you can choose to disable usage-based addition of contacts to My Contacts (see the checkbox in the screenshot above). Once you do this, no matter how many times you email an auto-added email address it won't move to My Contacts.

When you open up Google Reader, the company's RSS reader, you'll find not just the feeds you've subscribed to but also the feeds of shared items from your "friends." Those friendships were defined somehow by Google, according to who you email in Gmail apparently. They can opt-out of having their shared items publicly visible at all, but short of doing that - you are seeing their shared items and someone, presumably, is seeing your shared items too. No one knows for sure.

Both Loopt and Gmail + Google Talk + Google Reader are examples of applications choosing approaches that encourage virality of the application or features of the application at the risk of putting users in socially awkward situations. As Justine mentions in the Loopt example, just because a person's phone number is in the contact list on your phone doesn't mean they would like to receive a text message from you at some random time of the day asking them to try out some social networking application. A phone isn't a social networking site. I have my doctor, my boss, his boss, our childcare provider, co-workers whose numbers I have in case of emergency and a bunch of other folks in my phone's contact list. These aren't the people I want to send spammy invites to try out some social networking application which probably doesn't even work on their phone. I'm sure there has been some positive user growth from their "viral" techniques, but at what cost to their brand? Plaxo is still dealing with damage to their brand from their spammy era.

The Gmail behavior is even worse, primarily because Google didn't fix the problem even though people have been complaining about it for a while. No one can blame Google for wanting to jump start network effects for features like Shared Items in Google Reader or products like Google Talk, but it seems pretty ridiculous to decide to automatically add people I email to an IM application so they can see when I'm online and contact me anytime or to the list of people who are notified whenever I share something in Google Reader. It's just email; it does not imply an intimate social relationship. The worst thing about Google's practices is how they backfire: I'm less likely to use that combination of Google products so as not to cause inadvertent information leakage because some "viral algorithm" decided that because I sent a bunch of emails to my child care provider she needs to know whenever I share a link in Google Reader.

If you decide to spread virally, you should be careful that you don't end up causing people to avoid your product like the diseases you are trying to emulate.

Now Playing: David Banner - Get Like Me (feat. Chris Brown, Yung Joc & Jim Jones)


Categories: Social Software

About a week ago, the Facebook Data team quietly released the Cassandra Project on Google Code. The Cassandra project has been described as a cross between Google's BigTable and Amazon's Dynamo storage systems. An overview of the project can be found in the SIGMOD presentation on Cassandra at SlideShare. A summary of the salient aspects of the project follows.

The problem Cassandra is aimed at solving is one that plagues social networking sites or any other service that has lots of relationships between users and their data. In such services, data often needs to be denormalized to prevent having to do lots of joins when performing queries. However this means the system needs to deal with the increased write traffic due to denormalization. At this point if you're using a relational database, you realize you're pretty much breaking every major rule of relational database design. Google tackled this problem by coming up with BigTable. Facebook has followed their lead by developing Cassandra which they admit is inspired by BigTable. 

The Cassandra data model is fairly straightforward. The entire system is a giant table with lots of rows. Each row is identified by a unique key. Each row has a column family, which can be thought of as the schema for the row. A column family can contain thousands of columns which are a tuple of {name, value, timestamp} and/or super columns which are a tuple of {name, column+} where column+ means one or more columns. This is very similar to the data model behind Google's BigTable.
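
To make the shape of that model concrete, here is a minimal in-memory sketch in Python. It is not Cassandra's API: the class names, the `insert` helper and the sample keys are all made up for illustration, and the only things taken from the description above are the {name, value, timestamp} column, the {name, column+} super column and the row key to column family nesting.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Column:
    """A {name, value, timestamp} tuple."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A {name, column+} tuple: a named group of one or more columns."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


@dataclass
class ColumnFamily:
    """Holds columns and/or super columns, keyed by name."""
    name: str
    entries: Dict[str, Union[Column, SuperColumn]] = field(default_factory=dict)


# The "giant table": row key -> column family name -> ColumnFamily.
table: Dict[str, Dict[str, ColumnFamily]] = {}


def insert(row_key: str, family: str, entry: Union[Column, SuperColumn]) -> None:
    """Hypothetical helper: store a column or super column under a row key."""
    families = table.setdefault(row_key, {})
    column_family = families.setdefault(family, ColumnFamily(family))
    column_family.entries[entry.name] = entry


# Illustrative data only; the keys and family names are invented.
insert("user:42", "profile", Column("name", "Ada", timestamp=1))
insert("user:42", "inbox",
       SuperColumn("msg:1", {"from": Column("from", "bob", timestamp=2)}))
```

Everything about a user lives under one row key, which is the denormalization the previous paragraph talks about: reading a profile or an inbox is a lookup by key, not a join.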

As I mentioned earlier, denormalized data means you have to be able to handle a lot more writes than you would if storing data in a normalized relational database. Cassandra has several optimizations to make writes cheaper. When a write operation occurs, it doesn't immediately cause a write to the disk. Instead the record is updated in memory and the write operation is added to the commit log. Periodically the list of pending writes is processed and write operations are flushed to disk. As part of the flushing process the set of pending writes is analyzed and redundant writes eliminated. Additionally, the writes are sorted so that the disk is written to sequentially, which significantly reduces time spent seeking on the hard drive and the impact of random writes on the system. How important is reducing seek time when accessing data on a hard drive? It can make the difference between taking hours versus days to flush a hundred gigabytes of writes to a disk. Disk is the new tape.
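
The write path described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Cassandra is implemented: the `WriteBuffer` class is hypothetical, the "disk" is just a list of batches, and redundant writes are eliminated simply by letting a later write to the same key overwrite the earlier one in memory.

```python
class WriteBuffer:
    """Toy model of a buffered write path (hypothetical, for illustration)."""

    def __init__(self):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # latest value per key, held in memory
        self.disk = []        # each flush appends one sorted batch

    def write(self, key, value):
        # No disk seek here: log the operation and update memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value  # a later write replaces a redundant earlier one

    def flush(self):
        # Sort pending writes by key so the batch can be written sequentially.
        batch = sorted(self.memtable.items())
        self.disk.append(batch)
        self.memtable.clear()
        self.commit_log.clear()
        return batch


buf = WriteBuffer()
buf.write("b", 1)
buf.write("a", 2)
buf.write("b", 3)        # makes the first write to "b" redundant
flushed = buf.flush()    # [("a", 2), ("b", 3)]
```

Three writes went in, but only two sorted records reach the disk in a single sequential pass.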

Cassandra is described as "always writable" which means that a write operation always returns success even if it fails internally to the system. This is similar to the model exposed by Amazon's Dynamo which has an eventual consistency model. From what I've read, it isn't clear how write operations that occur during an internal failure are reconciled and exposed to users of the system. I'm sure someone with more knowledge can chime in in the comments.

At first glance, this is a very nice addition to the world of Open Source software by the Facebook team. Kudos.

Found via James Hamilton.

PS: Is it me or is this the second significant instance of Facebook Open Sourcing a key infrastructure component "inspired" by Google internals?

Now Playing: Ray J - Gifts


Via Mark Pilgrim I stumbled on an article by Scott Loganbill entitled Google’s Open Source Protocol Buffers Offer Scalability, Speed which contains the following excerpt

The best way to explore Protocol Buffers is to compare it to its alternative. What do Protocol Buffers have that XML doesn’t? As the Google Protocol Buffer blog post mentions, XML isn’t scalable:

"As nice as XML is, it isn’t going to be efficient enough for [Google’s] scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy."

We’ve never had to deal with XML in a scale where programming for it would become unwieldy, but we’ll take Google’s word for it.

Perhaps the biggest value-add of Protocol Buffers to the development community is as a method of dealing with scalability before it is necessary. The biggest developing drain of any start-up is success. How do you prepare for the onslaught of visitors companies such as Google or Twitter have experienced? Scaling for numbers takes critical development time, usually at a juncture where you should be introducing much-needed features to stay ahead of competition rather than paralyzing feature development to keep your servers running.

Over time, Google has tackled the problem of communication between platforms with Protocol Buffers and data storage with Big Table. Protocol Buffers is the first open release of the technology making Google tick, although you can utilize Big Table with App Engine.

It is unfortunate that it is now commonplace for people to throw around terms like "scaling" and "scalability" in technical discussions without actually explaining what they mean. Having a Web application that scales means that your application can handle becoming popular or being more popular than it is today in a cost effective manner. Depending on your class of Web application, there are different technologies that have been proven to help Web sites handle significantly higher traffic than they normally would. However there is no silver bullet.

The fact that Google uses MapReduce and BigTable to solve problems in a particular problem space does not mean those technologies work well in others. MapReduce isn't terribly useful if you are building an instant messaging service. Similarly, if you are building an email service you want an infrastructure based on message queuing, not BigTable. A binary wire format like Protocol Buffers is a smart idea if your application's bottleneck is network bandwidth or CPU used when serializing/deserializing XML. As part of building their search engine Google has to cache a significant chunk of the World Wide Web and then perform data intensive operations on that data. In Google's scenarios, the network bandwidth utilized when transferring the massive amounts of data they process can actually be the bottleneck. Hence inventing a technology like Protocol Buffers became a necessity. However, that isn't Twitter's problem so a technology like Protocol Buffers isn't going to "help them scale". Twitter's problems have been clearly spelled out by the development team and nowhere is network bandwidth called out as a culprit.
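
To put a rough number on the bandwidth argument, here is a small Python comparison of the same record encoded as XML and as a fixed binary layout. To be clear, this is not the Protocol Buffers wire format; it uses the standard library's `struct` module as a stand-in for "some compact binary encoding", and the record and its field names are invented for the example.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are made up for illustration.
record = {"id": 12345, "score": 0.5}


def to_xml(rec):
    """Encode the record as a small XML document."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root)


def to_binary(rec):
    """Encode as a fixed layout: unsigned 32-bit int plus 64-bit float."""
    return struct.pack("<Id", rec["id"], rec["score"])


xml_bytes = to_xml(record)
binary_bytes = to_binary(record)
print(len(xml_bytes), len(binary_bytes))
```

The binary form is 12 bytes and the XML form is several times that. Whether that difference matters depends entirely on whether bytes on the wire are actually your bottleneck, which is the point of the paragraph above.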

Almost every technology that has been loudly proclaimed as unscalable by some pundit on the Web is being used by a massively popular service in some context. Relational databases don't scale? Well, eBay seems to be doing OK. PHP doesn't scale? I believe it scales well enough for Facebook. Microsoft technologies aren't scalable? MySpace begs to differ. And so on…

If someone tells you "technology X doesn't scale" without qualifying that statement, it often means the person either doesn't know what he is talking about or is trying to sell you something. Technologies don't scale, services do. Thinking you can just sprinkle a technology on your service and make it scale is the kind of thinking that led Blaine Cook (former architect at Twitter) to publish a presentation on Scaling Twitter which claimed their scaling problems were solved with their adoption of memcached. That was in 2007. In 2008, let's just say the Fail Whale begs to differ.

If a service doesn't scale it is more likely due to bad design than to technology choice. Remember that.

Now Playing: Zapp & Roger - Computer Love


Categories: Web Development | XML

I read two stories about companies adopting Open Source this week which give some interesting food for thought when juxtaposed.

The first is a blog post on C|Net from Matt Asay titled Ballmer: We'll look at open source, but we won't touch where he writes

Ballmer lacks the imagination to conceive of a world where Microsoft could open-source code and still make a lot of money (He's apparently not heard of "Google."):

No. 1, are our products likely to be open-sourced? No. We do provide our source code in special situations, but open source also implies free, free is inconsistent with paying for lunches at the partner conference. (Applause.)

But at least he's willing to work with those who do grok that the future of software business (meaning: money) is open source:

The second is an article on InfoWorld by Paul Krill entitled Sun lays off approximately 1,000 employees which contains the following excerpts

Following through on a restructuring plan announced in May, Sun on Thursday laid off approximately 1,000 employees in the United States and Canada. All told, the company plans to reduce its workforce by approximately 1,500 to 2,500 employees worldwide. Additional reductions will occur in other regions including EMEA (Europe, Middle East, Africa), Asia-Pacific, and Latin America. Reducing the number of employees by 2,500 would constitute a loss of about 7 percent of the company's employees.

He also addressed the question of whether Sun should abandon its new strategy of giving away its software. Sun will not stop giving it away, according to Schwartz, citing a priority in developer adoption.

When it comes to the financial benefits of Open Source, you need to look at two perspectives. The perspective of the software vendor (the producer) and the perspective of the software customer (the consumer). A key benefit of Open Source/Free Software to software consumers is that it tends to drive the price of the software to zero. On the other hand, although software producers like Sun Microsystems spend money to produce the software they cannot directly recoup that investment by charging for the software. Thus if you are a consumer of software, it is clear why Open Source is great for your bottom line. On the flip side, it isn't so clear if your primary business is producing software.

Matt Asay's usage of Google as an example of a company "making money" from Open Source is a prime example of this schism in perspectives. Google's primary business is selling advertising. Like every other media business, they gather an audience by using their products as bait and then sell that audience to advertisers. Every piece of software not directly related to the business of selling ads is tangential to Google's business. The only other software that is important to Google's business is the software that gives them a differentiated offering when it comes to gathering that audience. Both classes of software are proprietary to Google and always will be.

This is why you'll never find a Subversion source repository on http://code.google.com with the source code behind Google's AdSense or AdWords products or the current algorithms that power their search engine. Instead you will find Google supporting and releasing lots of Open Source software that is tangential to its core business while keeping the software that actually makes them money proprietary.

This means that in truth Google makes money from proprietary software. However since it doesn't distribute its proprietary software to end users, there isn't anyone complaining about this fact.

Unlike Google, Sun Microsystems doesn't really seem to know how they plan to make money. There is a lot of data out there that shows that Sun Microsystems' model of scaling services is dying. Recently, Kai-Fu Lee of Google argued that scaling out on commodity hardware is 33 times more efficient than using expensive hardware. This jibes with the sentiments of people who work on cloud services at Microsoft and Amazon that I've talked to when comparing the use of lots of "commodity" servers versus more expensive "big iron" server systems. This means Sun's hardware business is being squeezed because it is betting against industry experience. Giving away their software does not fix this problem; it makes it worse by cutting off a revenue stream as their core business is turning into a dinosaur before their eyes.

The bottom line is that giving something away that costs you money to produce only makes sense as part of a strategy that makes you even more money than selling what you gave away (e.g. free T-shirts with corporate logos). Google gets that. It seems Sun Microsystems does not. Neither does Matt Asay.

Now Playing: Inner Circle - Sweat (A La La La La Long)



When it comes to scaling Web applications, every experienced Web architect eventually realizes that Disk is the New Tape. Getting data off of the hard disk is slow compared to getting it from memory or from over the network. So an obvious way to improve the performance of your system is to reduce the amount of disk I/O your systems have to do, which leads to the adoption of in-memory caching. In addition, there is often more cacheable data on disk than there is space in memory since memory to disk ratios are often worse than 1:100 (Rackspace's default server config has 1 GB of RAM and 250 GB of hard disk). This has led to the growing popularity of distributed, in-memory, object caching systems like memcached and Microsoft's soon to be released Velocity.

memcached can be thought of as a distributed hash table and its programming model is fairly straightforward from the application developer's perspective. Specifically, there is a special hash table class used by your application which is in actuality a distributed hash table whose contents are stored on a cluster of machines instead of just in the memory of your local machine.
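
Here is a rough Python sketch of what "a hash table that is really spread over a cluster" means. It is not the memcached client API: the `DistributedCache` class and the server names are hypothetical, the "servers" are plain dictionaries in one process, and the key placement is a simple hash-modulo scheme that I'm assuming for illustration (real clients have their own strategies).

```python
import hashlib


class DistributedCache:
    """Looks like one hash table, but each key lives on exactly one 'server'."""

    def __init__(self, server_names):
        self.names = sorted(server_names)
        # Each dict stands in for the memory of one cache machine.
        self.servers = {name: {} for name in self.names}

    def _server_for(self, key):
        # Hash the key and map it onto one of the servers (assumed scheme).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def set(self, key, value):
        self.servers[self._server_for(key)][key] = value

    def get(self, key, default=None):
        return self.servers[self._server_for(key)].get(key, default)


cache = DistributedCache(["cache-a", "cache-b", "cache-c"])
cache.set("user:42:profile", {"name": "Ada"})
profile = cache.get("user:42:profile")
```

The application code only ever sees `set` and `get`, but it is explicit that a cache object is involved, which is the contrast with the transparent approach discussed next.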

With that background I can now introduce Terracotta, a product that is billed as "Network Attached Memory" for Java applications. Like distributed hash tables such as memcached, Terracotta springs from the observation that accessing data from a cluster of in-memory cache servers is often more optimal than getting it directly from your database or file store.

Where Terracotta differs from memcached and other distributed hash tables is that it is completely transparent to the application developer. Whereas memcached and systems like it require developers to instantiate some sort of "cache" class and then use that as the hash table of objects that should be stored, Terracotta attempts to be transparent to the application developer by hooking directly into the memory allocation operations of the JVM.

The following is an excerpt from the Terracotta documentation on How Terracotta Works

Terracotta uses ASM to manipulate application classes as those classes load into the JVM. Developers can pick Sun Hotspot or IBM's runtime, and any of several supported application servers

The Terracotta configuration file dictates which classes become clustered and which do not. Terracotta then examines classes for fields it needs to cluster, and threading semantics that need to be shared. For example, if to share customer objects throughout an application cluster, the developer need only tell Terracotta to cluster customers and to synchronize customers cluster-wide.

Terracotta looks for bytecode instructions like the following (not an exhaustive list):


On each of those, Terracotta does the work of Network Attached Memory. Specifically:

BYTECODE | Injected Behavior
GETFIELD | Read from the Network for certain objects. Terracotta also has a heap-level cache that contains pure Java objects. So GETFIELD reads from RAM if-present and faults in from NAM if a cache miss occurs.
PUTFIELD | Write to the Network for certain objects. When writing field data through the assignment operator "=" or through similar mechanisms, Terracotta writes the changed bytes to NAM as well as allowing those to flow to the JVM's heap.
AASTORE | Same as PUTFIELD but for arrays
AALOAD | Same as GETFIELD but for arrays
MONITORENTRY | Get a lock inside the JVM on the specified object AND get a lock in NAM in case a thread on another JVM is trying to edit this object at the same time
MONITOREXIT | Flush changes to the JVM's heap cache back to NAM in case another JVM is using the same objects as this JVM

The instrumented-classes section of the Terracotta config file is where application developers specify which object types should be stored in the distributed cache and it is even possible to say that all memory allocations in your application should go through the distributed cache.

In general, the approach taken by Terracotta seems more complicated, more intrusive and more error prone than using a distributed hash table like Velocity or memcached. I always worry about systems that attempt to hide or abstract away the fact that network operations are occurring. This often leads to developers writing badly performing or unsafe code because it wasn't obvious that network operations are involved (e.g. a simple lock statement in your Terracotta-powered application may actually be acquiring distributed locks without it being explicit in the code that this is occurring).
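
The worry about hidden network operations can be illustrated with a small Python sketch. None of this is Terracotta code: `FakeCluster` and `TransparentLock` are hypothetical stand-ins, and the "network" is just a counter of simulated round trips. The point is only that the calling code looks exactly like ordinary local locking.

```python
import threading


class FakeCluster:
    """Stand-in for a remote lock service; counts simulated round trips."""

    def __init__(self):
        self.round_trips = 0

    def acquire(self, name):
        self.round_trips += 1  # would be a network call in a real system

    def release(self, name):
        self.round_trips += 1  # likewise


class TransparentLock:
    """Looks like a local lock, but every enter/exit also hits the cluster."""

    def __init__(self, name, cluster):
        self._local = threading.Lock()
        self._name = name
        self._cluster = cluster

    def __enter__(self):
        self._local.acquire()
        self._cluster.acquire(self._name)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._cluster.release(self._name)
        self._local.release()
        return False


cluster = FakeCluster()
lock = TransparentLock("customer:7", cluster)
balance = 0
for _ in range(100):
    with lock:           # reads like a cheap local lock
        balance += 1

print(cluster.round_trips)  # 200 simulated network round trips
```

A loop that looks harmless costs two round trips per iteration, and nothing at the call site tells the developer so.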

Now Playing: Dream - I Luv Your Girl (Remix) (feat. Young Jeezy)


Categories: Web Development

In the past year both Google and Facebook have released the remote procedure call (RPC) technologies that are used for communication between servers within their data centers as Open Source projects. 

Facebook Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. It supports the following programming languages: C++, Java, Python, PHP and Ruby.

Google Protocol Buffers allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. It supports the following programming languages: C++, Java and Python.

That’s interesting. Didn’t Steve Vinoski recently claim that RPC and its descendants are "fundamentally flawed"? If so, why are Google and Facebook not only using RPC but proud enough of their usage of yet another distributed object RPC technology based on binary protocols that they are Open Sourcing them? Didn’t they get the memo that everyone is now on the REST + JSON/XML bandwagon (preferably AtomPub)?

In truth, Google is on the REST + XML bandwagon. Google has the Google Data APIs (GData) which is a consistent set of RESTful APIs for accessing data from Google's services based on the Atom Publishing Protocol aka RFC 5023. And even Facebook has a set of plain old XML over HTTP APIs (POX/HTTP) which they incorrectly refer to as the Facebook REST API.

So what is the story here?

It is all about coupling and how much control you have over the distributed end points. On the Web where you have little to no control over who talks to your servers or what technology they use, you want to utilize flexible technologies that make no assumptions about either end of the communication. This is where RESTful XML-based Web services shine. However when you have tight control over the service end points (e.g. if they are all your servers running in your data center) then you can use more optimized communications technologies that add a layer of tight coupling to your system. An example of the kind of tight coupling you have to live with is that Facebook Thrift requires specific versions of g++ and Java if you plan to talk to it using code written in either language and you can’t talk to it from a service written in C#.

In general, the Web is about openness and loose coupling. Binary protocols that require specific programming languages and runtimes are the exact opposite of this. However inside your Web service where you control both ends of the pipe, you can optimize the interaction between your services and simplify development by going with a binary RPC based technology. More than likely different parts of your system are already doing this anyway (e.g. memcached uses a binary protocol to talk between cache instances, SQL Server uses TDS as the communications protocol between the database and its clients, etc).

Always remember to use the right tool for the job. One size doesn’t fit all when it comes to technology decisions.


  • Exposing a WCF Service With Multiple Bindings and Endpoints – Keith Elder describes how Windows Communication Foundation (WCF) supports multiple bindings that enable developers to expose their services in a variety of ways.  A developer can create a service once and then expose it to support net.tcp:// or http:// and various versions of http:// (Soap1.1, Soap1.2, WS*, JSON, etc).  This can be useful if a service crosses boundaries between the intranet and the Internet.

Now Playing: Pink - Family Portrait


Categories: Platforms | Programming

A year ago Loren Feldman produced a controversial video called "TechNigga" which seems to still be causing him problems today. Matthew Ingram captures the latest fallout from that controversy in his post Protests over Verizon deal with 1938media where he writes

Several civil-rights groups and media watchdogs are protesting a decision by telecom giant Verizon to add 1938media’s video clips to its mobile Vcast service, saying Loren’s "TechNigga" clip is demeaning to black people. Project Islamic Hope, for example, has issued a statement demanding that Verizon drop its distribution arrangement with 1938media, which was just announced about a week ago, and other groups including the National Action Network and LA Humanity Foundation are also apparently calling for people to email Verizon and protest.

The video that has Islamic Hope and other groups so upset is one called "TechNigga," which Loren put together last August. After wondering aloud why there are no black tech bloggers, Loren reappears with a skullcap and some gawdy jewelry, and claims to be the host of a show called TechNigga. He then swigs from a bottle of booze, does a lot of tongue-kissing and face-licking with his girlfriend Michelle Oshen, and then introduces a new Web app called "Ho-Trackr," which is a mashup with Google Maps that allows prospective johns to locate prostitutes. In a statement, Islamic Hope says that the video "sends a horrible message that Verizon seeks to partner with racists."

I remember encountering the video last year and thinking it was incredibly unfunny. It wasn’t a clever juxtaposition of hip hop culture and tech geekery. It wasn’t satire since that involves lampooning someone or something you disapprove of in a humorous way (see The Colbert Report). Of course, I thought the responses to the video were even dumber; like Robert Scoble responding to the video with the comment “Dare Obasanjo is black”.

Since posting the video Loren Feldman has lost a bunch of video distribution deals with the current Verizon deal being the latest. I’ve been amused to read all of the comments on TechCrunch about how this violates Loren’s freedom of speech.

People often confuse the fact that it is not a crime to speak your mind in America with the belief that you should be able to speak your mind without consequence. The two things are not the same. If I call you an idiot, I may not go to jail but I shouldn’t expect you to be nice to me afterwards. That the things you say can come back and bite you on the butt is something everyone should have learned growing up. So it is always surprising for me to see people petulantly complain that “this violates my freedom of speech” when they have to deal with the consequences of their actions.

BONUS VIDEO: A juxtaposition of hip hop culture and Web geekery by a black tech blogger.

Now Playing: NWA - N*ggaz 4 Life


Categories: Current Affairs

Gnip is a newly launched startup that pitches itself as a service that aims to “make data portability suck less”. Mike Arrington describes the service in his post Gnip Launches To Ease The Strain On Web Services which is excerpted below

A close analogy is a blog ping server (see our overview here). Ping servers tell blog search engines like Technorati and Google Blog Search when a blog has been updated, so the search engines don’t have to constantly re-index sites just to see if new content has been posted. Instead, the blog tells the ping server when it updates, which tells the search engines to drop by and re-index. The creation of the first ping server, Weblogs.com, by Dave Winer resulted in orders of magnitude better efficiency for blog search engines.

The same thinking basically applies to Gnip. The idea is to gather simple information from social networks - just a username and the fact that they created new content (like writing a Twitter message, for example). Gnip then distributes that data to whoever wants it, and those downstream services can then access the core service’s API, with proper user authentication, and access the actual data (in our example, the actual Twitter message).

From a user’s perspective, the result is faster data updates across services and less downtime for services since their APIs won’t be hit as hard.

From my perspective, Gnip also shares some similarity to services like FeedBurner as well as blog ping servers. The original purpose of blog ping servers was to make it cheaper for services like Technorati and Feedster to index the blogosphere without having to invest in a Google-sized server farm and crawl the entire Web every couple of minutes. In addition, since blogs often have tiny readerships and are thus infrequently linked to, crawling alone was not enough to ensure that they find their way into the search index. It wasn’t about taking load off of the sites that were doing the pinging.

On the other hand, FeedBurner hosts a site’s RSS feed as a way to take load off of their servers and then provides analytics data so the site doesn’t miss out from losing the direct connection to its subscribers. This is more in line with the expectation that Gnip will take load off of a service’s API servers. However unlike FeedBurner, Gnip doesn’t actually store the user data from the social networking site. It simply stores a record that indicates that “user X on site Y made an update of type Z at time T”.  The thinking is that web sites will publish a notification to Gnip whenever their users perform an update. Below is a sample interaction between Digg and Gnip where Digg notifies Gnip that the users amy and john.doe have dugg two stories.

  POST /publishers/digg/activity.xml
  Accept: application/xml
  Content-Type: application/xml
    <activity at="2008-06-08T10:12:42Z" uid="amy" type="dugg" guid="http://digg.com/odd_stuff/a_story"/>
    <activity at="2008-06-09T09:14:07Z" uid="john.doe" type="dugg" guid="http://digg.com/odd_stuff/really_weird"/>

  200 OK
  Content-Type: application/xml
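
As a rough illustration of the publisher side, here is a minimal Python sketch that builds the kind of request body shown above. The function name `build_activity_payload` and the `<activities>` wrapper element are assumptions made for the example; the sample only shows the individual `<activity/>` elements, and the sketch does not send anything over the network.

```python
import xml.etree.ElementTree as ET


def build_activity_payload(updates):
    """Serialize (at, uid, type, guid) tuples as <activity/> elements."""
    # Assumption: a single <activities> root wraps the notifications so the
    # body is one well-formed XML document.
    root = ET.Element("activities")
    for at, uid, kind, guid in updates:
        ET.SubElement(
            root,
            "activity",
            {"at": at, "uid": uid, "type": kind, "guid": guid},
        )
    return ET.tostring(root, encoding="unicode")


payload = build_activity_payload([
    ("2008-06-08T10:12:42Z", "amy", "dugg",
     "http://digg.com/odd_stuff/a_story"),
    ("2008-06-09T09:14:07Z", "john.doe", "dugg",
     "http://digg.com/odd_stuff/really_weird"),
])
print(payload)
```

The point of the sketch is how little the publisher sends: a timestamp, a user, an action type and a URL, but none of the actual content.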

There are two modes in which "subscribers" can choose to interact with the data published to Gnip. The first is in a mode similar to how blog search engines interact with the changes.xml file on Weblogs.com and other blog ping servers. For example, services like Summize or TweetScan can ask Gnip for the last hour of changes on Twitter instead of whatever mechanism they are using today to crawl the site. Below is what a sample interaction to retrieve the most recent updates on Twitter from Gnip would look like

GET /publishers/twitter/activity/current.xml
Accept: application/xml

200 OK
Content-Type: application/xml

<activity at="2008-06-08T10:12:07Z" uid="john.doe" type="tweet" guid="http://twitter.com/john.doe/statuses/42"/>
<activity at="2008-06-08T10:12:42Z" uid="amy" type="tweet" guid="http://twitter.com/amy/statuses/52"/>
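
To make the subscriber side concrete, here is a small Python sketch of what a crawler might do with such a response: parse the activity records and keep only the ones newer than its last visit. The helper names and the `<activities>` wrapper around the records are assumptions for the example, not something Gnip documents in the excerpt above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Sample response body; the <activities> root is an assumed wrapper.
SAMPLE = """<activities>
<activity at="2008-06-08T10:12:07Z" uid="john.doe" type="tweet" guid="http://twitter.com/john.doe/statuses/42"/>
<activity at="2008-06-08T10:12:42Z" uid="amy" type="tweet" guid="http://twitter.com/amy/statuses/52"/>
</activities>"""


def parse_activities(xml_text):
    """Turn <activity/> elements into dicts with a timezone-aware timestamp."""
    result = []
    for node in ET.fromstring(xml_text).iter("activity"):
        at = datetime.strptime(node.get("at"), "%Y-%m-%dT%H:%M:%SZ")
        result.append({
            "at": at.replace(tzinfo=timezone.utc),
            "uid": node.get("uid"),
            "type": node.get("type"),
            "guid": node.get("guid"),
        })
    return result


def updated_since(activities, cutoff):
    """Keep only the activities that happened after the cutoff."""
    return [a for a in activities if a["at"] > cutoff]


activities = parse_activities(SAMPLE)
cutoff = datetime(2008, 6, 8, 10, 12, 30, tzinfo=timezone.utc)
recent = updated_since(activities, cutoff)
print([a["guid"] for a in recent])
```

The crawler would then fetch each `guid` from Twitter itself, since Gnip only holds the notification and not the tweet.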

The main problem with this approach is the same one that affects blog ping servers. If the rate of updates is more than the ping server can handle then it may begin to fall behind or lose updates completely. Services that don’t want to risk their content not being crawled are best off providing their own update stream that applications can poll periodically. That’s why the folks at Six Apart came up with the Six Apart Update Stream for LiveJournal, TypePad and Vox weblogs.

The second mode is one that has gotten Twitter fans like Dave Winer raving about Gnip being the solution to Twitter’s scaling problems. In this mode, an application creates a collection of one or more usernames they are interested in. Below is what a collection document created by the Twadget application to indicate that it is interested in my Twitter updates might look like.

<collection name="twadget-carnage4life">
     <uid name="carnage4life" publisher.name="twitter"/>
</collection>

Then instead of polling Twitter every 5 minutes for updates it polls Gnip every 5 minutes for updates and only talks to Twitter’s servers when Gnip indicates that I’ve made an update since the last time the application polled Gnip. The interaction between Twadget and Gnip would then be as follows

GET /collections/twadget-carnage4life/activity/current.xml
Accept: application/xml

200 OK
Content-Type: application/xml

<activity at="2008-06-08T10:12:07Z" uid="carnage4life" type="tweet" guid="http://twitter.com/Carnage4Life/statuses/850726804"/>
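
The client-side logic this implies can be sketched in a few lines of Python. Everything here is hypothetical: `poll_once`, and the two callables standing in for the Gnip and Twitter requests, are names made up for the example so that it runs without any network access.

```python
from datetime import datetime, timezone


def poll_once(fetch_gnip_activity, fetch_from_twitter, last_seen):
    """Ask Gnip first; only call Twitter for updates newer than last_seen.

    fetch_gnip_activity() returns (timestamp, guid) pairs and
    fetch_from_twitter(guid) returns the actual update.
    """
    fresh = [(at, guid) for at, guid in fetch_gnip_activity() if at > last_seen]
    if not fresh:
        return last_seen, []  # nothing new, Twitter is never contacted
    updates = [fetch_from_twitter(guid) for _, guid in fresh]
    return max(at for at, _ in fresh), updates


# Stand-ins for the real HTTP calls.
twitter_calls = []


def fake_twitter(guid):
    twitter_calls.append(guid)
    return "content of " + guid


feed = [(datetime(2008, 6, 8, 10, 12, 7, tzinfo=timezone.utc),
         "http://twitter.com/Carnage4Life/statuses/850726804")]
start = datetime(2008, 6, 8, 10, 0, 0, tzinfo=timezone.utc)

seen, first = poll_once(lambda: feed, fake_twitter, start)
seen, second = poll_once(lambda: feed, fake_twitter, seen)
print(len(first), len(second), len(twitter_calls))
```

In this sketch the second poll finds nothing newer than the first, so Twitter is only hit once. That is the entire saving the collection mode promises.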


Of course, this makes me wonder why one would think that it is feasible for Gnip to build a system that can handle the API polling traffic of every microblogging and social networking site out there but it is infeasible for Twitter to figure out how to handle the polling traffic for their own service. Talk about lowered expectations. ;)

So what do I think of Gnip? I think the ping server mode may be of some interest for services that think it is cheaper to have code that pings Gnip after every user update instead of building out an update stream service. However, since a lot of sites already have some equivalent of the public timeline, it isn’t clear that there is a huge need for a ping service. Crawlers can just hit the public timeline, which I assume is what services like Summize and TweetScan do to keep their indexes of tweets up to date.

As for using Gnip as a mechanism for reducing the load API clients put on a microblogging or similar service? Gnip is totally useless for that in its current incarnation. API clients aren’t interested in updates made by a single user. They are interested in all the updates made by all the people the user is following. So for Twadget to use Gnip to lighten the load it causes on Twitter’s servers on my behalf, it has to build a collection of all the people I am following in Gnip and then keep that list of users in sync with whatever that list is on Twitter. But if it has to constantly poll Twitter for my friend list, isn’t it still putting the same amount of load on Twitter? I guess this could be fixed by having Twitter publish follower/following lists to Gnip but that introduces all sorts of interesting technical and privacy issues. But that doesn’t matter since the folks at Gnip brag about only keeping 60 minutes worth of updates as the “secret sauce” to their scalability. This means that if my Twitter client hasn’t polled Gnip in a 60 minute window (maybe my laptop is closed) then it doesn’t matter anyway and it has to poll Twitter. I suspect someone didn’t finish doing their homework before rushing to “launch” Gnip.
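
Both objections can be put into a back-of-the-envelope Python sketch. The function names, the one-request-per-poll cost model and the assumption that the friend list is re-fetched from Twitter on every poll are mine, chosen only to illustrate the argument above.

```python
from datetime import datetime, timedelta, timezone

GNIP_RETENTION = timedelta(minutes=60)  # the window Gnip says it keeps


def service_to_poll(last_gnip_poll, now, retention=GNIP_RETENTION):
    """If the client was away longer than the window, Gnip can't help."""
    return "twitter" if now - last_gnip_poll > retention else "gnip"


def twitter_requests_per_day(poll_minutes=5, use_gnip=False, updates_seen=0):
    """Count requests that still reach Twitter in one day of polling."""
    polls = 24 * 60 // poll_minutes
    if not use_gnip:
        return polls  # one timeline request per poll
    # Assumption: the client re-fetches the friend list from Twitter on every
    # poll to keep its Gnip collection in sync, then fetches each update that
    # Gnip reports.
    return polls + updates_seen


now = datetime(2008, 7, 2, 12, 0, tzinfo=timezone.utc)
print(service_to_poll(now - timedelta(minutes=30), now))
print(service_to_poll(now - timedelta(minutes=90), now))
print(twitter_requests_per_day(), twitter_requests_per_day(use_gnip=True))
```

Under these assumptions a client polling every 5 minutes makes 288 requests a day to Twitter with or without Gnip, and more once it starts fetching the updates Gnip tells it about.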

PS: One thing that is confusing to me is why all communication between applications and Gnip needs to be over SSL. The only thing I can see it adding is making it more expensive for Gnip to run their service. I can’t think of any reason why the interactions described above need to be over a secure channel.

Now Playing: Lil Wayne - Hustler Musik


Categories: Platforms

Every once in a while I encounter an online service or Web site that is so irritating that it seems like the people behind the service are just in it to frustrate Web users. And I don’t mean the obvious candidates like email spammers and purveyors of popup ads, since they’ve been around for so long that I’ve learned how to either ignore or avoid them.

There is a new generation of irritants and many of them are part of the new lunacy we call “Web 2.0”

  1. Flash Widgets with Embedded PDF Documents: Somewhere along the line a bunch of startups decided that they needed to put a “Web 2.0” spin on the simple concept of hosting people’s office documents online. You see, lots of people would like to share documents in PDF or Microsoft Office® formats that aren’t particularly Web friendly. So how have sites like Scribd and Docstoc fixed this problem? By creating Flash widgets containing the embedded PDF/Office documents like the one shown here. So not only are the documents still in a Web unfriendly format but now I can’t even download them and use the tools on my desktop to read them. It’s like let’s combine the FAIL of putting non-Web documents on the Web with the FAIL of a Web-unfriendly format like Flash. FAIL++. By the way, it’s pretty ironic that a Microsoft enterprise product gets this right where so many “Web 2.0” startups get it wrong.

  2. Hovering Over Links Produces Flash Widgets as Pop Over Windows: The company that takes the cake for spreading this major irritant across the blogosphere is Snap Technologies and their Snap Shots™ product. There’s nothing quite as irritating as hovering over a link on your way to click another link and leaving a wake of pop over windows with previews of the Web pages at the end of said links. I seriously wonder if anyone finds this useful?

  3. Facebook Advertisers: One of the promises of Facebook is that its users will see more relevant advertising because there is all this rich demographic data about the site’s users in their profiles. Somewhere along the line this information is either getting lost or being ignored by Facebook’s advertisers. Even though my profile says I’m married and out of my twenties I keep getting borderline sleazy ads whenever I login to play Scrabulous asking if I want to meet college girls. Then there are the ads which aren’t for dating sites but still use sleazy imagery anyway. It’s mad embarrassing whenever my wife looks over to see what I’m doing on my laptop to have dating site ads blaring in her face. Obviously she knows I’m not on a dating site but still…

  4. Forums that Require Registration Showing Up in Search Results: Every once in a while I do a Web search for a programming problem and a couple of links to Experts Exchange end up in the results. What is truly annoying about this site is that the excerpt on the search result page makes it seem as though the answer to your question is one click away but when you click through you are greeted with “All comments and solutions are available to Premium Service Members only”. I thought search engines had rules about banning sites with that sort of obnoxious behavior?

  5. Newspaper Websites with Interstitial Ads and Registration Requirements: Newspapers such as the New York Times often act as if they don’t really want me reading the content on their Web site. If I click on a link to a story on the New York Times site such as this one, one of two things will happen: I’m either taken to a full page animated advertisement with an option to skip the ad in relatively small font or I get a one sentence summary of the story with a notice that I need to register on their Web site before I can read the story. Either way it’s a bunch of bull crap that prevents me from getting to the news.

There are two things that strike me about this list as notable. The first is that there are an increasing number of “Web 2.0” startups out there who are actively using Flash to cause more problems than they claim to be solving. The second is that requiring registration to view content is an amazingly stupid trend that is beyond dumb. It’s not like people need to register on your site to see ads so why reduce the size of your potential audience by including this road block? That’s just stupid.

Now Playing: Pleasure P - Rock Bottom (feat. Lil Wayne)


Categories: Rants

July 2, 2008
@ 01:56 PM

Jeff Atwood recently published two anti-XML rants in his blog entitled XML: The Angle Bracket Tax and Revisiting the XML Angle Bracket Tax. The source of his beef with XML and his recommendations to developers are excerpted below

Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails.

I'm deeply ambivalent about XML. I'm reminded of this Winston Churchill quote:

It has been said that democracy is the worst form of government except all the others that have been tried.

XML is like democracy. Sometimes it even works. On the other hand, it also means we end up with stuff like this:

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
    <m:GetLastTradePrice xmlns:m="Some-URI">

You could do worse than XML. It's a reasonable choice, and if you're going to use XML, then at least learn to use it correctly. But consider:
  1. Should XML be the default choice?
  2. Is XML the simplest possible thing that can work for your intended use?
  3. Do you know what the XML alternatives are?
  4. Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs?

I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it. Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. <CleverEndQuote>Again.</CleverEndQuote>

The question of if and when to use XML is one I am intimately familiar with given that I spent the first 2.5 years of my professional career at Microsoft working on the XML team as the “face of XML” on MSDN.

My problem with Jeff’s articles is that they take a very narrow view of how to evaluate a technology. No one should argue that XML is the simplest or most efficient technology to satisfy the uses it has been put to today. It isn’t. The value of XML isn’t in its simplicity or its efficiency. It is in the fact that there is a massive ecosystem of knowledge and tools around working with XML.

If I decide to use XML for my data format, I can be sure that my data will be consumable using a variety of off-the-shelf tools on practically every platform in use today. In addition, there are a variety of tools for authoring XML, transforming it to HTML or text, parsing it, converting it to objects, mapping it to database schemas, validating it against a schema, and so on. Want to convert my XML config file into a pretty HTML page? I can use XSLT or CSS. Want to validate my XML against a schema? I have my choice of Schematron, Relax NG and XSD. Want to find stuff in my XML document? XPath and XQuery to the rescue. And so on.
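
As a small illustration of the “off-the-shelf tools” point, here is a sketch that queries a made-up XML config file using nothing but the Python standard library. The `<config>`/`<setting>` vocabulary is invented for the example, and `xml.etree.ElementTree` only supports a subset of XPath, so treat this as a taste rather than a tour.

```python
import xml.etree.ElementTree as ET

# A hypothetical application config file.
CONFIG = """<config>
  <setting name="timeout" value="30"/>
  <setting name="retries" value="3"/>
</config>"""

root = ET.fromstring(CONFIG)

# XPath-style predicate: the <setting> whose name attribute is "timeout".
timeout = root.find("./setting[@name='timeout']").get("value")

# All setting names, in document order.
names = [s.get("name") for s in root.findall("./setting")]

print(timeout, names)
```

No parser was written and no grammar was defined to get here, which is the whole argument: a home-grown config format would need both before the first query could run.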

No other data format hits a similar sweet spot when it comes to ease of use, popularity and breadth of tool ecosystem.

So the question you really want to ask yourself before taking on the “Angle Bracket Tax”, as Jeff Atwood puts it, is whether the benefits of avoiding XML outweigh the costs of giving up the tool ecosystem of XML and the familiarity that practically every developer out there has with the technology. In some cases this might be true, such as when deciding whether to go with JSON over XML in AJAX applications (I’ve given two reasons in the past why JSON is a better choice). On the other hand, I can’t imagine a good reason to want to roll your own data format for office documents or application configuration files as opposed to using XML.

  • The XML Litmus Test - Dare Obasanjo provides some simple guidelines for determining when XML is the appropriate technology to use in a software application or architecture design. (6 printed pages)
  • Understanding XML - Learn how the Extensible Markup Language (XML) facilitates universal data access. XML is a plain-text, Unicode-based meta-language: a language for defining markup languages. It is not tied to any programming language, operating system, or software vendor. XML provides access to a plethora of technologies for manipulating, structuring, transforming and querying data. (14 printed pages)

Now Playing: Metallica - The God That Failed


Categories: XML

Late last week, the folks on the Google Data APIs blog announced that Google will now be supporting OAuth as the delegated authentication mechanism for all Google Data APIs. This move is meant to encourage the various online services that provide APIs that access a user’s data in the “cloud” to stop reinventing the wheel when it comes to delegated authentication and standardize on a single approach.

Every well-designed Web API that provides access to a customer’s data in the cloud utilizes a delegated authentication mechanism which allows users to grant 3rd party applications access to their data without having to give the application their username and password. There is a good analogy for this practice in the OAuth: Introduction page which is excerpted below

What is it For?

Many luxury cars today come with a valet key. It is a special key you give the parking attendant and unlike your regular key, will not allow the car to drive more than a mile or two. Some valet keys will not open the trunk, while others will block access to your onboard cell phone address book. Regardless of what restrictions the valet key imposes, the idea is very clever. You give someone limited access to your car with a special key, while using your regular key to unlock everything.

Everyday new website offer services which tie together functionality from other sites. A photo lab printing your online photos, a social network using your address book to look for friends, and APIs to build your own desktop application version of a popular site. These are all great services – what is not so great about some of the implementations available today is their request for your username and password to the other site. When you agree to share your secret credentials, not only you expose your password to someone else (yes, that same password you also use for online banking), you also give them full access to do as they wish. They can do anything they wanted – even change your password and lock you out.

This is what OAuth does: it allows you, the User, to grant access to your private resources on one site (which is called the Service Provider), to another site (called Consumer, not to be confused with you, the User). While OpenID is all about using a single identity to sign into many sites, OAuth is about giving access to your stuff without sharing your identity at all (or its secret parts).

So every service provider invented their own protocol to do this, all of which are different but have the same basic components. Today we have Google AuthSub, Yahoo! BBAuth, Windows Live DelAuth, AOL OpenAuth, the Flickr Authentication API, the Facebook Authentication API and others. All different, proprietary solutions to the same problem.

This ends up being problematic for developers because if you want to build an application that talks to multiple services you not only have to deal with the different APIs provided by these services but also the different authorization/authentication models they utilize as well. In a world where “social aggregation” is becoming more commonplace with services like Plaxo Pulse & FriendFeed and more applications are trying to bridge the desktop/cloud divide like OutSync and RSS Bandit, it sucks that these applications have to rewrite the same type of code over and over again to deal with the basic task of getting permission to access a user’s data. Standardizing on OAuth is meant to fix that. A number of startups like Digg & Twitter as well as major players like Yahoo and Google have promised to support it, so this should make the lives of developers easier.
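
To give a flavor of the “same type of code” a shared standard lets developers write once, here is a minimal Python sketch of request signing with the HMAC-SHA1 method from OAuth Core 1.0, which is only one of the signature methods the spec defines. The helper names, URL, keys and secrets are placeholders made up for the example, and a real client also has to handle nonces, timestamps and the token exchange, none of which is shown.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode everything except unreserved characters."""
    return quote(value, safe="~")


def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted name=value pairs."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(k + "=" + v for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = (pct(consumer_secret) + "&" + pct(token_secret)).encode("ascii")
    base = signature_base_string(method, url, params).encode("ascii")
    return base64.b64encode(hmac.new(key, base, hashlib.sha1).digest()).decode("ascii")


# Placeholder values, not real credentials or endpoints.
params = {"oauth_nonce": "abc", "oauth_consumer_key": "key"}
base = signature_base_string("get", "http://example.com/feed", params)
signature = sign("get", "http://example.com/feed", params, "consumer-secret", "token-secret")
print(base)
```

The user’s password never appears anywhere in this code. The Consumer signs with its own secret and a token the user granted, which is the valet key from the analogy above.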

Of course, we still have work to do as an industry when it comes to the constant wheel reinvention in the area of Web APIs. Chris Messina points to another place where every major service provider has invented a different proprietary protocol for doing the same task in his post Inventing contact schemas for fun and profit! (Ugh) where he writes

And then there were three
Today, Yahoo! announced the public availability of their own Address Book API.

However, I have to lament yet more needless reinvention of contact schema. Why is this a problem? Well, as I pointed out about Facebook’s approach to developing their own platform methods and formats, having to write and debug against yet another contact schema makes the “tax” of adding support for contact syncing and export increasingly onerous for sites and web services that want to better serve their customers by letting them host and maintain their address book elsewhere.

This isn’t just a problem that I have with Yahoo!. It’s something that I encountered last November with the SREG and proposed Attribute Exchange profile definition. And yet again when Google announced their Contacts API. And then again when Microsoft released theirs! Over and over again we’re seeing better ways of fighting the password anti-pattern flow of inviting friends to new social services, but having to implement support for countless contact schemas. What we need is one common contacts interchange format and I strongly suggest that it inherit from vcard with allowances or extension points for contemporary trends in social networking profile data.

I’ve gone ahead and whipped up a comparison matrix between the primary contact schemas to demonstrate the mess we’re in.

Kudos to the folks at Google for trying to force the issue when it comes to standardizing on a delegated authentication protocol for use on the Web. However there are still lots of places across the industry where we speak different protocols and thus incur a needless burden on developers when a single language might do. It would be nice to see some of this unnecessary redundancy eliminated in the future.

Now Playing: G-Unit - I Like The Way She Do It


Categories: Platforms | Web Development