November 14, 2009
@ 03:03 PM

Joe Hewitt, the developer of the Facebook iPhone application, has an insightful blog post titled On Middle Men about the current trend of developers favoring native applications over Web applications on mobile platforms with centrally controlled app stores. He writes

The Internet has been incredibly empowering to creators, and just as destructive to middle men. In the 20th century, every musician needed a record label to get his or her music heard. Every author needed a publishing house to be read. Every journalist needed a newspaper. Anyone who wanted to send a message needed the post office. In the Internet age, the tail no longer wags the dog, and those middle men have become a luxury, not a necessity.

Meanwhile, the software industry is moving in the opposite direction. With the web and desktop operating systems, the only thing in between software developers and users is a mesh of cables and protocols. In the new world of mobile apps, a layer of bureaucrats stand in the middle, forcing each developer to queue up for a series of patdowns and metal detectors and strip searches before they can reach their customers.
We're at a critical juncture in the evolution of software. The web is still here and it is still strong. Anyone can still put any information or applications on a web server without asking for permission, and anyone in the world can still access it just by typing a URL. I don't think I appreciated how important that is until recently. Nobody designs new systems like that anymore, or at least few of them succeed. What an incredible stroke of luck the web was, and what a shame it would be to let that freedom slip away.

Am I the only one who thinks the above excerpt would be similarly apt if you replaced the phrase "mobile apps" with "Facebook apps" or "OpenSocial apps"?

Now Playing: Lady GaGa - Bad Romance


Categories: Web Development

November 7, 2009
@ 04:41 PM

There was an article in The Register earlier this week titled Twitter fanatic glimpses dark side of OAuth which contains the following excerpt

A mobile enthusiast and professional internet strategist got a glimpse of OAuth's dark side recently when he received an urgent advisory from Twitter.  The dispatch, generated when Terence Eden tried to log in, said his Twitter account may have been compromised and advised he change his password. After making sure the alert was legitimate, he complied.

That should have been the end of it, but it wasn't. It turns out Eden used OAuth to seamlessly pass content between third-party websites and Twitter, and even after he had changed his Twitter password, OAuth continued to allow those websites access to his account.

Eden alternately describes this as a "gaping security hole" and a "usability issue which has strong security implications." Whatever the case, the responsibility seems to lie with Twitter.

If the service is concerned enough to advise a user to change his password, you'd think it would take the added trouble of suggesting he also reset his OAuth credentials, as Google, which on Wednesday said it was opening its own services to work with OAuth, notes here.

I don't think the situation is as cut and dried as the article makes it seem. Someone trying to hack your account by guessing your password and thus triggering a warning that your account is being hacked is completely different from an application you've given permission to access your data doing the wrong thing with it.

Think about it. Below is a list of the applications I've allowed to access my Twitter stream. Is it really the desired user experience that when I change my password on Twitter, all of them break and I have to re-authorize each application?

list of applications that can access my Twitter stream

I suspect Terence Eden is being influenced by the fact that Twitter hasn't always had a delegated authorization model and the way to give applications access to your Twitter stream was to hand out your user name & password. That's why just a few months ago it was commonplace to see blog posts like Why you should change your Twitter password NOW! which advocate changing your Twitter password as a way to prevent your password from being stolen from Twitter apps with weak security. There has also been a spate of phishing style Twitter applications such as TwitViewer and TwitterCut which masquerade as legitimate applications but are really password stealers in disguise. Again, the recommendation has been to change your password if you fell victim to such scams.

In a world where you use delegated authorization techniques like OAuth, changing your password isn't necessary to keep out apps like TwitViewer & TwitterCut since you can simply revoke their permission to access your account. Similarly, if someone steals my password and I choose a new one, it doesn't mean that I now need to lose Twitter access from Brizzly or the new MSN home page until I reauthorize these apps. Those issues are unrelated given that the OAuth authorized apps didn't have my password in the first place.
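
To make the distinction concrete, here is a toy model in Python of an account whose password and per-application tokens live independently of each other. It is only a sketch of the idea of delegated authorization, not how Twitter or OAuth is actually implemented; the Account class and its method names are made up for illustration.

```python
import secrets


class Account:
    """Hypothetical account model: one password plus one token per authorized app."""

    def __init__(self, password):
        self.password = password
        self.tokens = {}  # app name -> access token issued to that app

    def authorize(self, app):
        # The app receives a token of its own; it never sees the password.
        token = secrets.token_hex(16)
        self.tokens[app] = token
        return token

    def change_password(self, new_password):
        # Only the password changes; issued tokens are deliberately left alone.
        self.password = new_password

    def revoke(self, app):
        # Cutting off a single app does not require touching the password.
        self.tokens.pop(app, None)

    def token_is_valid(self, app, token):
        return self.tokens.get(app) == token


account = Account("old-password")
brizzly_token = account.authorize("Brizzly")
twitviewer_token = account.authorize("TwitViewer")

account.change_password("new-password")  # someone guessed the old password
account.revoke("TwitViewer")             # this app turned out to be a scam

print(account.token_is_valid("Brizzly", brizzly_token))
print(account.token_is_valid("TwitViewer", twitviewer_token))
```

In this model the password change leaves Brizzly working, and the misbehaving application is shut out by revoking its token alone.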

Thus I question the recommendation at the end of the article that it is a good idea to even ask users about de-authorizing applications as part of password reset since it is misleading (i.e. it gives the impression the hack attempt was from one of your authorized apps instead of someone trying to guess your password) and just causes a hassle for the user who now has to reauthorize all the applications at a later date anyway.

Now Playing: Rascal Flatts - What Hurts The Most


Categories: Web Development

“Every marketer's dream is to find an unidentified or unknown market and develop it” – Barry Brand

Last week I read the various notes on the presentations by famous startup founders at Startup School 2009 and found a lot of the anecdotes interesting but wondered if they were truly useful to startup founders. I tried to think of some of the products I started using in the past five years that I now use a lot today and couldn’t do without. In looking for the common thread in these products, the quote above came to mind.

Finding an underserved market sounds like getting the winning ticket in a lottery, unlikely. In truth it isn’t hard to find underserved markets if you recognize the patterns. The hard part in turning it into a successful business is execution. 

There are two patterns I’ve used in looking for underserved markets that were invigorated by products I now use regularly today. The first pattern is looking for activities people like doing where the technology has been stagnant for a while. For the past few decades, we’ve been in a world where the technology products you use can be made smaller, cheaper and faster every few years. There is some notion of waiting for the stars to align such as hard drive sizes shrinking until you could fit gigabytes of data in your pocket (e.g. the iPod) or AJAX and online banking becoming ubiquitous (e.g. Mint). However, for each product category where some upstart has changed the game by leveraging modern technologies, there are still dozens of markets where decade(s) old tech is dictating the user experience.

Another pattern is looking for an activity or task that people hate doing but assume is a fact of life or a necessary aspect of using a product. I remember when I’d buy video games like Soul Calibur and lament about how I could only find people to play with when I went to visit friends from out of town who were also fans of the game. To me, not being able to play multiplayer games without having friends physically in the same room was just a fact of life. This is no longer the case in the world of Xbox Live. I also remember almost cancelling my cable subscription about five or six years ago because I resented having to tailor my TV watching schedule around prime time hours which to me represented prime hours to decompress from work and hack on RSS Bandit. The entire notion of “prime time TV” has been a fact of life for decades. Once I discovered TiVo, it stopped being the case for me. Some of these painful facts of life have been with us for centuries. For example, take this quote from John Wanamaker

“Half the money I spend on advertising is wasted; the trouble is I don't know which half”

With modern advertising solutions like Google’s AdWords and Facebook’s Advertising platform this is no longer the case. People can now tell down to minute levels of detail exactly what their return on investment is on advertising.

I’d love to see more startups attempting to find and satisfy underserved markets instead of going with the crowd and doing what everyone else is doing. Fewer Facebook games and iPhone fart apps, more original ideas that solve real problems like Mint and Flickr. If your startup needs a few hints at what some of these markets are in the technology space, there’s always YCombinator’s list of Startup Ideas they’d fund.

Now Playing: Jay-Z Feat. Kanye West - Hate


Categories: Startup Shoutout

On the MSN blog there's a new blog post entitled The New MSN Homepage Unveiled which states

Today is an exciting day for our team at MSN because we unveiled the most significant redesign our homepage has seen in over a decade. We spent thousands of hours talking with customers; testing hundreds of ideas; experimenting around the world and carefully evaluating what our users want, and don’t want - to deliver a homepage that is designed to be the best homepage on the Web. We hope you’ll agree.

So, we started from scratch to cut the clutter on our homepage and reduced the amount of links by 50%. There’s also a simplified navigation across news, entertainment, sports, money, and lifestyle that lets you drill into information topics that interest you, without being overwhelming. Local information from your neighborhood is important to you and so is high quality, in-line video – so we offer both, right on the homepage. And, you told us you want the latest information not only from your favorite sources, but also from your friends, and the breadth of the Web – so we now offer convenient access to Facebook, Twitter, & Windows Live services and the most powerful search experience on the Web from Bing, empowering you to make more informed, faster decisions. And this is just the beginning - keep visiting our blog for more MSN news in the coming weeks.

This is a really exciting release for my team on Windows Live since we're responsible for the underlying platform that powers the display of what activities your friends have been performing across Windows Live. Working with the MSN home page team was a good experience and it's great to see that the tens of millions of people who visit the MSN home page regularly will now get to experience our work. Kudos to the MSN team on a very nice release.

You can try out the new home page for yourself at

Now Playing: Jay-Z - Reminder


Categories: MSN | Windows Live

Last week John Panzer, who works on Blogger at Google, wrote about some of the work he’s been doing on creating a protocol for syndicating comments associated with activity streams in his post The Salmon Protocol: Introducing the Salmon Project. Key parts of his post are excerpted below

A few days ago, at the Real Time Web Summit, we had a session about Salmon, a protocol for re-aggregated distributed conversations around web content.  I was hoping for some feedback and to generate some interest, and I was overwhelmed by the positive reactions, especially after Louis Gray's post "Proposed Salmon Protocol aims to unify Conversations on the Web". Adina Levin's "Salmon - Re-assembling distributed conversations" is a good, insightful review as well. There's clearly a great deal of interest in this, and so I've gone ahead and expanded Salmon's home at with an open source project,, and a mailing list,

Louis Gray’s post on the topic includes an embedded presentation which captures the essence of the protocol

Before talking about the technical details of the protocol it is a good idea to understand the end user problem the protocol solves. For me, it solves a problem I have in the way that RSS Bandit integrates with Facebook. The problem is that although there is a way to get regular updates on changes to the user’s news feed by polling Facebook’s stream and getting data back in the Activity Stream format, there isn’t a mechanism today to get updates on the comments on items in the feed. What it means in practice today is that once an item rolls off of the news feed, there is no way to keep the comments up to date in RSS Bandit.

The Salmon Protocol aims to address this problem by piggybacking on PubSubHubbub as a way for applications to get real-time updates on comments on items in an activity stream, not just updates on new activities.

There have also been several mentions of Salmon being a way to aggregate distributed conversations on an item (e.g. this blog post is syndicated to FriendFeed and there are comments there as well as in the comments on my blog) but I am less clear on those scenarios or whether Salmon is enough to solve the various tough problems that need to be solved to make that work end to end.

Any API for posting comments to a site needs to solve two problems: identity and dealing with comment spam. I decided to take a look at the Salmon Protocol Summary to see how it addresses these problems.

The meat of the Salmon Protocol format is excerpted below

A source provides an RSS/Atom feed of content. It includes a Salmon link in its feed:

<link rel="salmon" href=""/>

An aggregator reads the feed (ideally via a push mechanism such as PubSubHubbub), and sees from the link that it is Salmon-enabled. It remembers the endpoint URL for later use.

When an aggregator's user leaves a comment on a feed item, the aggregator stores the comment as usual, and then also POSTs a salmon version of it to the source's Salmon endpoint:

POST /salmon-endpoint HTTP/1.1
Content-Type: application/atom+xml

<?xml version='1.0' encoding='UTF-8'?>
<entry xmlns=''>
  <author>
    <name>John Doe</name>
  </author>
  <content>Yes, but what about the llamas?</content>
  <thr:in-reply-to xmlns:thr=''/>
  <sal:signature xmlns:sal=''>
  </sal:signature>
</entry>


The commenter is identified in the published comment using the atom:uri element. How this author is authenticated in situations outside of public comments on a blog such as RSS Bandit posting a comment to Facebook on my behalf isn’t really discussed. I noticed an offhand reference to OAuth headers which  seems to imply that the publishing application should also be sending authentication headers as well when publishing the comment. How these authentication headers would flow through the systems involved is unclear to me especially given the approach Salmon has taken to deal with spam prevention.
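
To make the discovery and comment-building steps in the excerpt above concrete, here is a small sketch in Python. It only uses the standard library and does not talk to the network. The Atom and Atom threading namespaces are the standard ones, but the feed contents, the endpoint URL, the author URI and the parent item id are placeholders I made up, and signing and the actual HTTP POST are left out since the excerpt doesn't say enough about them.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
THR_NS = "http://purl.org/syndication/thread/1.0"  # Atom threading extension


def find_salmon_endpoint(feed_xml):
    """Return the href of the feed-level <link rel="salmon">, or None if absent."""
    root = ET.fromstring(feed_xml)
    for link in root.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "salmon":
            return link.get("href")
    return None


def build_comment_entry(author_name, author_uri, text, parent_id):
    """Build an Atom entry for a comment that replies to the item parent_id."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = author_name
    ET.SubElement(author, f"{{{ATOM_NS}}}uri").text = author_uri
    ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = text
    ET.SubElement(entry, f"{{{THR_NS}}}in-reply-to", {"ref": parent_id})
    return ET.tostring(entry, encoding="unicode")


# Placeholder feed; a real aggregator would receive this from the source.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="salmon" href="https://source.example/salmon-endpoint"/>
</feed>"""

endpoint = find_salmon_endpoint(SAMPLE_FEED)
comment_xml = build_comment_entry(
    "John Doe",
    "acct:johndoe@aggregator.example",  # placeholder author identifier
    "Yes, but what about the llamas?",
    "tag:source.example,2009:post-1",   # placeholder id of the item commented on
)
print(endpoint)
print(comment_xml)
```

An aggregator following the excerpt would then POST comment_xml to the remembered endpoint with a Content-Type of application/atom+xml.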

The workflow for dealing with spam comments is described as follows

A major concern with this type of distributed protocol is how to prevent spam and abuse.  Salmon provides building blocks to allow in-depth defense against attacks.  Specifically, every salmon has a verifiable author and user agent.  The basic security flow when salmon swims upstream looks like this:

  1. "Here is a salmon, authored and signed by ''; please accept it."
  2. Recipient: "I know that this is really due to its OAuth headers, and it has a good reputation, but I do not trust it completely; I will do a double check."
  3. Recipient: Uses Webfinger/XRD to discover salmon validation service for, which turns out to be hosted by
  4. Recipient: "Given that johndoe has delegated Salmon validation to aggregator-example, and I know I'm talking to aggregator-example already, I'll skip the actual check." (Returns HTTP 200 to

The flow can get more complicated, especially if the aggregator is not also providing identity services for the user.  In the most general case, the recipient needs to take the salmon, discover a salmon validator service for the author via XRD discovery on the author's URI, and POST the salmon to the validator service. The validator service does an integrity / signature check against the salmon and returns 200 if the salmon checks out, 400 if not.  The signature check means that the given author (johndoe in this case) signed the salmon with the given id, parent id, and timestamp.  It does not attempt to do a full, XML-DSig style verification, though such a service is another reasonable extension.
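
As I read it, the recipient's side of this flow boils down to something like the Python sketch below. This is my interpretation of the quoted description, not code from the Salmon documents. The function accept_salmon, the dictionary keys and the two callables passed in are all hypothetical stand-ins: discover_validator represents the Webfinger/XRD lookup and post_to_validator represents POSTing the salmon to the validator and getting back an HTTP status code. I also assume the sender's host has already been authenticated (e.g. via its OAuth headers).

```python
from urllib.parse import urlparse


def accept_salmon(salmon, sender_host, discover_validator, post_to_validator):
    """Decide whether to accept a salmon delivered by the authenticated sender_host."""
    validator_url = discover_validator(salmon["author_uri"])
    if validator_url is None:
        return False  # no validation service could be discovered for this author
    if urlparse(validator_url).hostname == sender_host:
        # The author delegated validation to the party we are already
        # talking to, so the separate check can be skipped (step 4).
        return True
    # General case: ask the validator; 200 means the salmon checks out.
    return post_to_validator(validator_url, salmon) == 200


# Stubs standing in for real discovery and HTTP calls.
VALIDATORS = {
    "acct:johndoe@aggregator.example": "https://aggregator.example/validate",
    "acct:jane@idp.example": "https://idp.example/validate",
}
validator_calls = []


def fake_discover(author_uri):
    return VALIDATORS.get(author_uri)


def fake_post(url, salmon):
    validator_calls.append(url)
    return 200 if salmon.get("signature") == "good" else 400


same_host = accept_salmon(
    {"author_uri": "acct:johndoe@aggregator.example", "signature": "good"},
    "aggregator.example", fake_discover, fake_post)
third_party_ok = accept_salmon(
    {"author_uri": "acct:jane@idp.example", "signature": "good"},
    "aggregator.example", fake_discover, fake_post)
third_party_bad = accept_salmon(
    {"author_uri": "acct:jane@idp.example", "signature": "forged"},
    "aggregator.example", fake_discover, fake_post)
print(same_host, third_party_ok, third_party_bad)
```

Written out this way, it is easier to see that the decision to accept a comment rests entirely on what the author's validation service says.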

This flow seems weird and it is unclear to me that it actually solves the problems involved in distributed commenting. So let’s say I post a comment to Facebook from RSS Bandit. In step 3 above they are now supposed to use WebFinger to look up my email address provider and determine which service I use for digitally signing comments. Then they ask it if the comment looks like it was from me.

Hmm, this looks like a user authentication workflow in disguise as a comment validation workflow. Shouldn’t the service receiving the comment (i.e. Facebook) in the example above be responsible for validating my identity not some third party service? Maybe this protocol wasn’t meant for sites like Facebook?

Let’s say this protocol is really meant for situations when the comment recipient doesn’t intend to be the sole identity provider such as commenting on Robert Scoble's blog where he allows comments from anyone with just an email address and an optional web page URL as identifiers. So each commenter needs to provide an email address on an email service provider that supports WebFinger and validates digital signatures in the specific situation related to the Salmon protocol? Sounds like boiling the ocean. I wonder why this can’t work with OpenID validation or some other authentication protocol that has already been validated by developers and is seeing some adoption?

At the end of the day, I think the problem Salmon attempts to solve is one that needs solving as activity streams become a more popular and intrinsic feature across the Web. However in its current form it’s hard for me to see how it actually solves the real problems that exist today in a practical way.

Of course, this may just be my misunderstanding of the protocol documents currently published and I look forward to being corrected by one of the protocol gurus if that is the case.

Now Playing: Chris Brown - I Can Transform Ya (feat. Lil Wayne)


Robert Scoble had an epiphany about one of the key problems with FriendFeed’s design, which he realizes now that he’s no longer a fanatical user of the service, in his post The chat room/forum problem (& an apology to @Technosailor). Robert writes

Twitter got lists.

This let us throw together a list of experts. For instance, I put together a list of people who have started companies. Compare that feed to your average Facebook feed and you’ll see it in stark black and white: your Facebook feed is “fun” but isn’t teaching you much.

It becomes even more stark when you do a list like my tech news brands list. See, this is NOT a forum! It is NOT a chat room!

No one can enter this community without being invited. Now compare to FriendFeed. We could have built a list like this over there, but it would have gotten noisier because of a feature called “Friend of a Friend.” That drags in people the list owner didn’t invite. Also, anyone can comment underneath any items on Facebook or FriendFeed. That brings people into YOUR life that YOU DID NOT INVITE!

Again, at first, this seems very democratic and very nice. After all, it’s great to throw a party for the whole world and let them drink your wine and have conversations with your kids. But, be honest here, would you rather have a private dinner with Steve Jobs, or would you rather have a dinner with Steve Jobs and 5,000 people who you don’t really know?

As someone who works on a platform for real-time streams for a living I always find it interesting to compare the approaches of various companies.

I agree with Robert that FriendFeed’s friend of friend feature breaks a fundamental model of the stream. In an earlier post on the problems with some of FriendFeed's design decisions I pointed out the following problem with the feature

FriendFeed shows you content from friends of friends: This is a major social faux pas. It may sound like a cool viral feature but showing me content from people I haven't subscribed to means I don't have control of who shows up in my feed and it takes away from the intimacy of the site because I'm always seeing content from strangers.

One of the things I’ve learned about how people interact with activity streams and news feeds is that it is important to feel like you are in control of the experience. FriendFeed’s friend of friend feature explicitly takes that away from users. I can understand why they did it (i.e. to increase the amount of content showing up in the stream for people with few friends and as a friend discovery mechanism)  but it doesn’t change the fact that the behavior can seem like a nuisance and even lead to lamebook style socially awkward situations. 

However, I don’t really agree with Robert’s characterization of the differences between streams on Facebook versus Twitter. On both sites, I as a user choose who the primary sources that show up in my stream are. Facebook is for people I know, Twitter is for brands and people I find interesting. I find both sites fun but I agree that I’m more likely to learn something new related to work from Twitter than from Facebook. What’s important isn’t whether I am “learning” something new but whether I feel like I’m getting value out of the experience. As Biz Stone wrote in a blog post on Twitter's new terms of service

At the start, critics often said, "Twitter is fun, but it's not useful." At one point @ev responded dryly with, "Neither is ice cream."

Although comments in the news feed on Facebook bring people I didn’t explicitly add into my stream, they are often OK because they are often people who I consider to be part of my extended social network or at least are on topic. This would never work on Twitter with retweets being shown inline (for example) since anyone can follow anyone else or retweet their content which would lead to the same sort of chat room/forum noise that Robert decries in his post.

Looking at the sketches from Twitter’s Project Retweet

I don’t think Robert’s concerns about retweets polluting the stream are warranted. From the above sketch, it just looks like Twitter is fixing the bug in the retweeting process where I have to use part of my 140 character quota to provide attribution when retweeting an interesting status update. This leads to interesting behavior such as people keeping their tweets under 125 characters to enable retweeting. I have to admit I’ve wondered more than once if I should make a tweet shorter so that it is easier to retweet. Project Retweet is fixing this in a way that encourages an existing user practice on the site. That deserves kudos in my book.

Now Playing: Jay-Z Feat. Alicia Keys - Empire State Of Mind


Categories: Social Software

Over the weekend, Torsten and I shipped a new release of RSS Bandit. Besides bug fixes there is one key new feature in the release, the ability to view and comment on your Facebook news feed. The flow for adding Facebook to the application is as follows. Go to the File->Synchronize Feeds menu option then select "Facebook"


then go through the Facebook Connect authorization flow including optionally signing in and granting the application permission to view your news feed


This creates a new feed source containing your Facebook news feed complete with inline comments as shown below


You can download the new release from here. More details about the bug fixes in the release are in the official RSS Bandit blog post on the release.

The primary purpose of this release (codenamed Colossus) was to bring stability to the code base before we made radical changes. With this release out of the way, we will start working on the Gambit release right away. The purpose of the next release is primarily to make RSS Bandit a more modern application that looks like it belongs on Windows 7 and Windows Vista instead of harkening back to the Office 2003 look of yesteryear.

You can see our plans for updating the RSS Bandit user interface in my blog post and prototype screenshots of the RSS Bandit ribbon. We will also support new features of Windows 7 such as jump lists. As usual, comments and feedback are welcome.

Now Playing: Dashboard Confessional - Stolen


Categories: RSS Bandit

Last week Joel Spolsky wrote a blog post entitled The Duct Tape Programmer where he praises developers who favor simple programming practices over complex ones. This blog post strongly resonated with me and made me recall some related thoughts on complexity and solving problems in software projects. Some key excerpts from his post, which I'll use as a jumping off point, are below

Jamie Zawinski is what I would call a duct-tape programmer. And I say that with a great deal of respect. He is the kind of programmer who is hard at work building the future, and making useful things so that people can do stuff.

Duct tape programmers are pragmatic. Zawinski popularized Richard Gabriel’s precept of Worse is Better. A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has because it’s in your lab where you’re endlessly polishing the damn thing. Shipping is a feature. A really important feature. Your product must have it.

One principle duct tape programmers understand well is that any kind of coding technique that’s even slightly complicated is going to doom your project. Duct tape programmers tend to avoid C++, templates, multiple inheritance, multithreading, COM, CORBA, and a host of other technologies that are all totally reasonable, when you think long and hard about them, but are, honestly, just a little bit too hard for the human brain.

The urge to reduce the complexity of the tools used to solve software problems is one that every software developer should share. However, even more important is reducing the complexity of the actual solutions that are delivered to your customers at the end of the day. End users can't tell if you used complicated C++ techniques like template metaprogramming and mixins to build the application. They can tell when your application fails to solve their actual problems in a straightforward way or is so late to ship due to project delays that they lose interest in waiting for you to solve their problems.

There are many famous and everyday examples of this culture of complexity in software projects which are eventually trumped by solutions that solve 80% of the problem in a simple way. My favorite example is contrasting the World Wide Web invented by Tim Berners-Lee with Project Xanadu as envisioned by Ted Nelson. Today the WWW is used by over a billion people to enrich their lives in myriad ways on a daily basis and has created hundreds of billions of dollars in value by minting an entire new industry. Project Xanadu is a sad footnote spoken about in hushed tones by fans of hypertext who bewail the success of the Web and how it has forced us to settle for less (i.e. Worse Is Better).

If you aren't familiar with Project Xanadu you can think of it as a networked system of hyperlinked documents and media just like the WWW which had to satisfy the following seventeen rules

    1. Every Xanadu server is uniquely and securely identified.
    2. Every Xanadu server can be operated independently or in a network.
    3. Every user is uniquely and securely identified.
    4. Every user can search, retrieve, create and store documents.
    5. Every document can consist of any number of parts each of which may be of any data type.
    6. Every document can contain links of any type including virtual copies ("transclusions") to any other document in the system accessible to its owner.
    7. Links are visible and can be followed from all endpoints.
    8. Permission to link to a document is explicitly granted by the act of publication.
    9. Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document.
    10. Every document is uniquely and securely identified.
    11. Every document can have secure access controls.
    12. Every document can be rapidly searched, stored and retrieved without user knowledge of where it is physically stored.
    13. Every document is automatically moved to physical storage appropriate to its frequency of access from any given location.
    14. Every document is automatically stored redundantly to maintain availability even in case of a disaster.
    15. Every Xanadu service provider can charge their users at any rate they choose for the storage, retrieval and publishing of documents.
    16. Every transaction is secure and auditable only by the parties to that transaction.
    17. The Xanadu client-server communication protocol is an openly published standard. Third-party software development and integration is encouraged.

Reading this list is like going through a list of places where the World Wide Web fails. Rule #14, which implies every document on the network is redundantly backed up in disparate locations so that it is always available, describes something the WWW doesn't do today, which is why we have broken links and 404s all the time. Rule #9 implies that not only is copyright respected and tracked throughout the system but there is even a micropayment platform built in. All the discussions on micropayments saving newspapers would be moot if Project Xanadu ruled the world since it would have existed from day one. Rule #16 on transactions being secure and auditable sounds like Nirvana in today's world of botnets, malware and phishing scams which plague the Web.

Yet despite the fact that the forty year old Project Xanadu is a more compelling vision than where we are today, it failed and Tim Berners-Lee's World Wide Web succeeded. In practical terms, Project Xanadu was trying to solve too many complex problems in a v1 product. In contrast, Tim Berners-Lee focused on the most valuable problem to solve for end users, which was sharing documents and media with anyone on the Internet, and punted on a bunch of the hard problems that would require a more controlled and tightly coupled network as well as a ton more code. Tim Berners-Lee solved less than half the problems Project Xanadu set out to solve but has changed the world immeasurably for billions of people by providing simple solutions to complex problems and running away from trying to create complex solutions to complex problems.

The bottom line is that a lot of the time it's OK to create a solution that solves 80% of the problem. Always remember that shipping is a feature.

Now Playing: Drake, Kanye West, Lil Wayne & Eminem - Forever


Categories: Programming | Ramblings

Every week there seems to be some new A-list blogger criticizing Twitter's Suggested User List (SUL), which is a selection of celebrities and brands that are suggested to new Twitter users as people the user might like to follow. This week it's Robert Scoble with You’re not on Twitter’s suggested user list but you are in good company that points out a number of interesting celebrities and brands that aren't on the list. Last week Dave Winer asked The SUL as a tool to control news?

I've had my issues with the SUL mainly from the perspective of how it ends up presenting Twitter to new users. When my wife joined Twitter I'd have loved it if the service had used integration with Facebook, Windows Live, MySpace, etc to suggest people who she already knew who were on Twitter. Instead the service prioritized pitching that she follow Shaquille O'Neal, Dell Outlet stores, NBC's Today Show and Jessica Simpson's kid sister. To find me on Twitter, my wife had to ask me for my Twitter handle in person. I felt like we were back in the dark ages of social networking.

In retrospect, not doing what I preferred them to do shows a lot of insight. It prevents the site from being viewed as yet another service where you have a duplicated social graph and thus has to compete head to head with the Facebooks and MySpaces of the world. Instead it pitches Twitter as a sort of user friendly RSS reader where you connect with your favorite celebrities and brands instead of another place where you get status updates from people who you're already getting status updates from in Facebook.


Note Now Playing: Jay Sean - Down (feat. Lil Wayne) Note


Categories: Social Software

Database normalization is a technique for designing relational database schemas that ensures that the data is optimal for ad-hoc querying and that modifications such as deletion or insertion of data do not lead to data inconsistency. Database denormalization is the process of optimizing your database for reads by creating redundant data. A consequence of denormalization is that insertions or deletions could cause data inconsistency if not uniformly applied to all redundant copies of the data within the database.

Why Denormalize Your Database?

Today, lots of Web applications have "social" features. A consequence of this is that whenever I look at content or a user in that service, there is always additional content from other users that also needs to be pulled into the page. When you visit the typical profile on a social network like Facebook or MySpace, data for all the people that are friends with that user needs to be pulled in. Or when you visit a shared bookmark on a social bookmarking site, you need data for all the users who have tagged and bookmarked that URL as well. Performing a query across the entire user base for "all the users who are friends with Robert Scoble" or "all the users who have bookmarked this blog link" is expensive even with caching. It is orders of magnitude faster to return the data if it is precalculated and all written to the same place.

This optimizes your reads at the cost of incurring more writes to the system. It also means that you'll end up with redundant data because there will be multiple copies of some amount of user data as we try to ensure the locality of data.
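
As a rough illustration of this trade-off, here is a minimal Python sketch that uses in-memory collections as a stand-in for a real data store. The names `bookmarks`, `users_by_url` and `add_bookmark` are hypothetical and are not taken from any actual service; the point is only that one logical write turns into two physical writes so that the read becomes a single lookup.

```python
from collections import defaultdict

# Normalized form: each fact is stored once, as a (user, url) pair.
bookmarks = set()

# Denormalized form: a redundant copy keyed by URL, maintained on every write.
users_by_url = defaultdict(set)


def add_bookmark(user, url):
    """Write path: one logical insert becomes two physical writes."""
    bookmarks.add((user, url))
    users_by_url[url].add(user)


def users_who_bookmarked_normalized(url):
    """Read path without the redundant copy: scan every bookmark."""
    return {user for (user, bookmarked) in bookmarks if bookmarked == url}


def users_who_bookmarked_denormalized(url):
    """Read path with the redundant copy: a single key lookup."""
    return set(users_by_url.get(url, set()))


add_bookmark("alice", "http://example.com/post")
add_bookmark("bob", "http://example.com/post")
add_bookmark("alice", "http://example.com/other")
```

Both read functions return the same set of users, but the first has to touch every stored bookmark while the second only touches the data for the URL being viewed.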

A good example of a Web application deciding to make this trade off is the recent post on the Digg Blog entitled Looking to the Future with Cassandra which contains the following excerpt

The Problem

In both models, we’re computing the intersection of two sets:

  1. Users who dugg an item.
  2. Users that have befriended the digger.

The Relational Model

The schema for this information in MySQL is:

  CREATE TABLE `Diggs` (
    `id`      INT(11),
    `itemid`  INT(11),
    `userid`  INT(11),
    `digdate` DATETIME,
    PRIMARY KEY (`id`),
    KEY `user`  (`userid`),
    KEY `item`  (`itemid`)
  );

  CREATE TABLE `Friends` (
    `id`           INT(10) AUTO_INCREMENT,
    `userid`       INT(10),
    `username`     VARCHAR(15),
    `friendid`     INT(10),
    `friendname`   VARCHAR(15),
    `mutual`       TINYINT(1),
    `date_created` DATETIME,
    PRIMARY KEY                (`id`),
    UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
    KEY        `Friend_friend` (`friendid`)
  );

The Friends table contains many million rows, while Diggs holds hundreds of millions. Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP. The steps are:

  1. Query Friends for all my friends. With a cold cache, this takes around 1.5 seconds to complete.
  2. Query Diggs for any diggs of a specific item by a user in the set of friend user IDs. This query is enormous, and looks something like:
    SELECT `digdate`, `id` FROM `Diggs`
     WHERE `userid` IN (59, 9006, 15989, 16045, 29183,
                        30220, 62511, 75212, 79006)
       AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;

    The real query is actually much worse than this, since the IN clause contains every friend of the user, and this can balloon to hundreds of user IDs. A full query can actually clock in at 1.5kb, which is many times larger than the actual data we want. With a cold cache, this query can take 14 seconds to execute.

Of course, both queries are cached, but due to the user-specific nature of this data, it doesn’t help much.
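
The two steps described in the excerpt can be sketched as follows. This is only an illustration: it uses Python and the standard library's sqlite3 module as a stand-in for PHP and MySQL, and the trimmed-down tables, the sample rows and the function name `friends_who_dugg` are assumptions made for the example rather than Digg's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Friends (userid INTEGER, friendid INTEGER);
    CREATE TABLE Diggs (id INTEGER PRIMARY KEY, itemid INTEGER,
                        userid INTEGER, digdate TEXT);
""")
conn.executemany("INSERT INTO Friends VALUES (?, ?)",
                 [(1, 59), (1, 9006), (1, 15989)])
conn.executemany(
    "INSERT INTO Diggs (itemid, userid, digdate) VALUES (?, ?, ?)",
    [(13084479, 59, "2009-09-01 10:00:00"),
     (13084479, 777, "2009-09-01 11:00:00"),    # dugg by a non-friend
     (13084479, 15989, "2009-09-02 09:00:00"),
     (42, 9006, "2009-09-03 08:00:00")])        # a different item


def friends_who_dugg(conn, userid, itemid, limit=4):
    """Intersect 'my friends' with 'users who dugg this item' in app code."""
    # Step 1: fetch the IDs of all of the user's friends.
    friend_ids = [row[0] for row in conn.execute(
        "SELECT friendid FROM Friends WHERE userid = ?", (userid,))]
    if not friend_ids:
        return []
    # Step 2: look up diggs of the item by any of those friends.
    # The IN clause grows by one placeholder per friend.
    placeholders = ", ".join("?" for _ in friend_ids)
    query = (
        "SELECT userid FROM Diggs "
        f"WHERE userid IN ({placeholders}) AND itemid = ? "
        "ORDER BY digdate DESC, id DESC LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (*friend_ids, itemid, limit))]
```

Even in this toy version the size of the second query is proportional to the number of friends, which is the behavior the excerpt complains about.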

The solution the Digg development team went with was to denormalize the data. They also went an additional step and decided that since the data was no longer being kept in a relational manner there was no point in using a traditional relational database (i.e. MySQL) and instead they migrated to a non-RDBMS technology to solve this problem.


How Denormalization Changes Your Application

There are a number of things to keep in mind once you choose to denormalize your data including

  1. Denormalization means data redundancy which translates to significantly increased storage costs. The fully denormalized data set from the Digg example ended up being 3 terabytes of information. It is typical for developers to underestimate the data bloat that occurs once data is denormalized.

  2. Fixing data inconsistency is now the job of the application. Let's say each user has a list of the user names of all of their friends. What happens when one of these users changes their user name? In a normalized database that is a simple UPDATE query to change a single piece of data and then it will be current everywhere it is shown on the site. In a denormalized database, there now has to be a mechanism for fixing up this name in all of the dozens, hundreds or thousands of places it appears. Most services that create denormalized databases have "fixup" jobs that are constantly running on the database to fix such inconsistencies.
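
To make the second point concrete, here is a minimal sketch of what a "fixup" job might look like, assuming a hypothetical in-memory store where each profile embeds a copy of its friends' user names. The structure of `profiles` and the functions `rename_user` and `fixup_friend_names` are invented for illustration; a real job would run against the production data store in batches.

```python
# Hypothetical denormalized store: each user record embeds a copy of
# every friend's user name so a profile can be rendered in one read.
profiles = {
    1: {"name": "alice", "friends": {2: "bob", 3: "carol"}},
    2: {"name": "bob", "friends": {1: "alice"}},
    3: {"name": "carol", "friends": {1: "alice", 2: "bob"}},
}


def rename_user(profiles, userid, new_name):
    """Update only the authoritative copy; the redundant copies go stale."""
    profiles[userid]["name"] = new_name


def fixup_friend_names(profiles):
    """Scan every record and repair stale copies. Returns the repair count."""
    repaired = 0
    for profile in profiles.values():
        for friend_id, cached_name in list(profile["friends"].items()):
            current = profiles[friend_id]["name"]
            if cached_name != current:
                profile["friends"][friend_id] = current
                repaired += 1
    return repaired
```

After `rename_user(profiles, 2, "robert")`, the profiles for users 1 and 3 still show "bob" until `fixup_friend_names(profiles)` has run, which is exactly the window of inconsistency the application now has to tolerate.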

The No-SQL Movement vs. Abusing Relational Databases for Fun & Profit

If you’re a web developer interested in building large scale applications, it doesn’t take long reading the various best practices for getting Web applications to scale, such as practicing database sharding or eschewing transactions, before it begins to sound like all the advice you are getting is about ignoring or abusing the key features that define a modern relational database system. Taken to its logical extreme, all you really need is a key<->value or tuple store that supports some level of query functionality and has decent persistence semantics. Thus the NoSQL movement was born.

The No-SQL movement is a term used to describe the increasing usage of non-relational databases among Web developers. This approach was initially pioneered by large scale Web companies like Facebook (Cassandra), Amazon (Dynamo) & Google (BigTable) but is now finding its way down to smaller sites like Digg. Unlike with relational databases, there is yet to be a solid technical definition of what it means for a product to be a "NoSQL" database aside from the fact that it isn't a relational database. Commonalities include lack of fixed schemas and limited support for rich querying. Below is a list of some of the more popular NoSQL databases that you can try today along with a brief description of their key qualities.

  1. CouchDB: A document-oriented database where documents can be thought of as JSON/JavaScript objects. Creation, retrieval, update and deletion (CRUD) operations are performed via a RESTful API and support ACID properties. Rich querying is handled by creating JavaScript functions called "Views" which can operate on the documents in the database via Map/Reduce style queries. Usage: Although popular among the geek set, most users seem to be dabblers as opposed to large scale web companies.

  2. Cassandra: A key-value store where each key-value pair comes with a timestamp and can be grouped together into a column family (i.e. a table). There is also a notion of super columns, which are columns whose values are a list of other key-value pairs. Cassandra is optimized to be always writable and uses eventual consistency to deal with the conflicts that inevitably occur when a distributed system aims to be always writable yet node failure is a fact of life. Querying is available via the Cassandra Thrift API and supports fairly basic data retrieval operations based on key values and column names. Usage: Originally developed and still used at Facebook today. Digg and Rackspace are the most recent big name adopters.

  3. Voldemort: Very similar to Cassandra, which is unsurprising since they are both inspired by Amazon's Dynamo. Voldemort is a key-value store where each key-value pair comes with a timestamp and eventual consistency is used to address write anomalies. Values can contain a list of further key-value pairs. Data access involves creation, retrieval and deletion of serialized objects whose format can be one of JSON, strings, binary BLOBs, serialized Java objects and Google Protocol Buffers. Rich querying is non-existent; simple get and put operations are all that exist. Usage: Originally developed and still used at LinkedIn.
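
To give a feel for how a timestamp on every key-value pair helps replicas converge, here is a toy Python sketch of a "last write wins" merge between two nodes. It is a deliberate simplification and should not be read as the algorithm used by Cassandra, Voldemort or Dynamo; the `Replica` class and `reconcile` function are hypothetical names invented for this example.

```python
class Replica:
    """A toy key-value node that keeps a timestamp next to every value."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, timestamp):
        # Accept the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def reconcile(a, b):
    """Exchange entries so both replicas end up with the newest value per key."""
    for key in set(a.data) | set(b.data):
        for source, target in ((a, b), (b, a)):
            entry = source.data.get(key)
            if entry is not None:
                target.put(key, entry[1], entry[0])
```

Both nodes accept writes independently, so a reader may briefly see different values depending on which node it asks. Once `reconcile` has run, the two agree again, which is the essence of eventual consistency.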

There are a number of other interesting NoSQL databases such as HBase, MongoDB and Dynomite but the three above seem to be the most mature from my initial analysis. In general, most of them seem to be a clone of BigTable, Dynamo or some amalgam of ideas from both papers. The most original so far has been CouchDB.

An alternative to betting on speculative database technologies at varying levels of maturity is to misuse an existing mature relational database product. As mentioned earlier, many large scale sites use relational databases but eschew relational features such as transactions and joins to achieve scalability. Some developers have even taken that practice to an extreme and built schema-less data models on top of a traditional relational database. A great example of this is How FriendFeed uses MySQL to store schema-less data, a blog post excerpted below.

Lots of projects exist designed to tackle the problem storing data with flexible schemas and building new indexes on the fly (e.g., CouchDB). However, none of them seemed widely-used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs (see this somewhat outdated article on CouchDB, for example). MySQL works. It doesn't corrupt data. Replication works. We understand its limitations already. We like MySQL for storage, just not RDBMS usage patterns.

After some deliberation, we decided to implement a "schema-less" storage system on top of MySQL rather than use a completely new storage system.

Our datastore stores schema-less bags of properties (e.g., JSON objects or Python dictionaries). The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque as far as the datastore is concerned. We can change the "schema" simply by storing new properties.

In MySQL, our entities are stored in a table that looks like this:

CREATE TABLE entities (
    added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id BINARY(16) NOT NULL,
    updated TIMESTAMP NOT NULL,
    body MEDIUMBLOB,
    UNIQUE KEY (id),
    KEY (updated)
) ENGINE=InnoDB;

The added_id column is present because InnoDB stores data rows physically in primary key order. The AUTO_INCREMENT primary key ensures new entities are written sequentially on disk after old entities, which helps for both read and write locality (new entities tend to be read more frequently than old entities since FriendFeed pages are ordered reverse-chronologically). Entity bodies are stored as zlib-compressed, pickled Python dictionaries.
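
The storage scheme described in the excerpt can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than FriendFeed's code: it uses the standard library's sqlite3 module as a stand-in for MySQL (so the column types differ slightly), and the function names `save_entity` and `load_entity` are hypothetical.

```python
import pickle
import sqlite3
import uuid
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entities (
        added_id INTEGER PRIMARY KEY AUTOINCREMENT,
        id BLOB NOT NULL UNIQUE,
        updated TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        body BLOB
    )
""")


def save_entity(conn, properties):
    """Store a dict as an opaque compressed blob and return it with its id."""
    entity = dict(properties)
    entity.setdefault("id", uuid.uuid4().bytes)  # 16-byte UUID
    body = zlib.compress(pickle.dumps(entity))
    conn.execute("INSERT INTO entities (id, body) VALUES (?, ?)",
                 (entity["id"], body))
    return entity


def load_entity(conn, entity_id):
    """Fetch an entity by id, or None if no such entity exists."""
    row = conn.execute("SELECT body FROM entities WHERE id = ?",
                       (entity_id,)).fetchone()
    return None if row is None else pickle.loads(zlib.decompress(row[0]))
```

Because the body is opaque to the database, adding a new property requires no schema change, but it also means the database cannot index or query on anything inside the blob. Note that pickle should only ever be used to load data your own application wrote.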

Now that the FriendFeed team works at Facebook, I suspect they'll end up deciding that a NoSQL database with a good story around replication and fault tolerance is more amenable to building a schema-less database than storing key<->value pairs in a SQL database where the value is a serialized Python object.

As a Web developer it's always a good idea to know what the current practices are in the industry even if they seem a bit too crazy to adopt…yet.

Note Now Playing: Jay-Z - Run This Town (feat. Rihanna & Kanye West) Note


Categories: Web Development