Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

September 10, 2009

@ 05:18 PM

Database normalization is a technique for designing relational database schemas that ensures that the data is optimal for ad-hoc querying and that modifications such as deletion or insertion of data does not lead to data inconsistency. Database denormalization is the process of optimizing your database for reads by creating redundant data. A consequence of denormalization is that insertions or deletions could cause data inconsistency if not uniformly applied to all redundant copies of the data within the database.

Why Denormalize Your Database?

Today, lots of Web applications have "social" features. A consequence of this is that whenever I look at content or a user in that service, there is always additional content from other users that also needs to be pulled in to page. When you visit the typical profile on a social network like Facebook or MySpace, data for all the people that are friends with that user needs to be pulled in. Or when you visit a shared bookmark on del.icio.us you need data for all the users who have tagged and bookmarked that URL as well. Performing a query across the entire user base for "all the users who are friends with Robert Scoble" or "all the users who have bookmarked this blog link" is expensive even with caching. It is orders of magnitude faster to return the data if it is precalculated and all written to the same place.

This is optimizes your reads at the cost of incurring more writes to the system. It also means that you'll end up with redundant data because there will be multiple copies of some amount of user data as we try to ensure the locality of data.

A good example of a Web application deciding to make this trade off is the recent post on the Digg Blog entitled Looking to the Future with Cassandra which contains the following excerpt

The Problem

In both models, we’re computing the intersection of two sets:

Users who dugg an item.

Users that have befriended the digger.

The Relational Model

The schema for this information in MySQL is:
CREATE TABLE `Diggs` (
  `id`      INT(11),
  `itemid`  INT(11),
  `userid`  INT(11),
  `digdate` DATETIME,
  PRIMARY KEY (`id`),
  KEY `user`  (`userid`),
  KEY `item`  (`itemid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;   CREATE TABLE `Friends` (
  `id`           INT(10) AUTO_INCREMENT,
  `userid`       INT(10),
  `username`     VARCHAR(15),
  `friendid`     INT(10),
  `friendname`   VARCHAR(15),
  `mutual`       TINYINT(1),
  `date_created` DATETIME,
  PRIMARY KEY                (`id`),
  UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
  KEY        `Friend_friend` (`friendid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The Friends table contains many million rows, while Diggs holds hundreds of millions. Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP. The steps are:
Query Friends for all my friends. With a cold cache, this takes around 1.5 seconds to complete.
Query Diggs for any diggs of a specific item by a user in the set of friend user IDs. This query is enormous, and looks something like:
SELECT `digdate`, `id` FROM `Diggs`
 WHERE `userid` IN (59, 9006, 15989, 16045, 29183,
                    30220, 62511, 75212, 79006)
   AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;
The real query is actually much worse than this, since the IN clause contains every friend of the user, and this can balloon to hundreds of user IDs. A full query can actually clock in at 1.5kb, which is many times larger than the actual data we want. With a cold cache, this query can take 14 seconds to execute.
Of course, both queries are cached, but due to the user-specific nature of this data, it doesn’t help much.

The solution the Digg development team went with was to denormalize the data. They also went an additional step and decided that since the data was no longer being kept in a relational manner there was no point in using a traditional relational database (i.e. MySQL) and instead they migrated to a non-RDBMS technology to solve this problem.

How Denormalization Changes Your Application

There are a number of things to keep in mind once you choose to denormalize your data including

Denormalization means data redundancy which translates to significantly increased storage costs. The fully denormalized data set from the Digg exampled ended up being 3 terabytes of information. It is typical for developers to underestimate the data bloat that occurs once data is denormalized.
Fixing data inconsistency is now the job of the application. Let's say each user has a list of the user names of all of their friends. What happens when one of these users changes their user name? In a normalized database that is a simple UPDATE query to change a single piece of data and then it will be current everywhere it is shown on the site. In a denormalized database, there now has to be a mechanism for fixing up this name in all of the dozens, hundreds or thousands of places it appears. Most services that create denormalized databases have "fixup" jobs that are constantly running on the database to fix such inconsistencies.

The No-SQL Movement vs. Abusing Relational Databases for Fun & Profit

If you’re a web developer interested in building large scale applications, it doesn’t take long in reading the various best practices on getting Web applications to scale such as practicing database sharding or eschewing transactions before it begins to sound like all the advice you are getting is about ignoring or abusing the key features that define a modern relational database system. Taken to its logical extreme all you really need is a key<->value or tuple store that supports some level of query functionality and has decent persistence semantics. Thus the NoSQL movement was borne.

The No-SQL movement is a used to describe the increasing usage of non-relational databases among Web developers. This approach has initially pioneered by large scale Web companies like Facebook (Cassandra), Amazon (Dynamo) & Google (BigTable) but now is finding its way down to smaller sites like Digg. Unlike relational databases, there is a yet to be a solid technical definition of what it means for a product to be a "NoSQL" database aside from the fact that it isn't a relational database. Commonalities include lack of fixed schemas and limited support for rich querying. Below is a list of some of the more popular NoSQL databases that you can try today along with a brief description of their key qualities

CouchDB: A document-oriented database where documents can be thought of as JSON/JavaScript objects. Creation, retrieval, update and deletion (CRUD) operations are performed via a RESTful API and support ACID properties. Rich querying is handled by creating Javascript functions called "Views" which can operate on the documents in the database via Map/Reduce style queries. Usage: Although popular among the geek set most users seem to be dabblers as opposed to large scale web companies.
Cassandra: A key-value store where each key-value pair comes with a timestamp and can be grouped together into a column family (i.e. a table). There is also a notion of super columns which are columns that contain whose values are a list of other key-value pairs. Cassandra is optimized to be always writable and uses eventual consistency to deal with the conflicts that inevitably occur when a distributed system aims to be always writable yet node failure is a fact of life. Querying is available via the Cassandra Thrift API and supports fairly basic data retrieval operations based on key values and column names. Usage: Originally developed and still used at Facebook today. Digg and Rackspace are the most recent big name adopters.
Voldemort: Very similar to Cassandra which is unsurprising since they are both inspired by Amazon's Dynamo. Voldemort is a key-value store where each key value pair comes with a timestamp and eventual consistency is used to address write anomalies. Values can contain a list of further key value pairs. Data access involves creation, retrieval and deletion of serialized objects whose format can be one of JSON, strings, binary BLOBs, serialized Java objects and Google Protocol Buffers. Rich querying is non-existent, simple get and put operations are all that exist. Usage: Originally developed and still used at LinkedIn.

There are a number of other interesting NoSQL databases such as HBase, MongoDB and Dynomite but the three above seem to be the most mature from my initial analysis. In general, most of them seem to be a clone of BigTable, Dynamo or some amalgam of ideas from both papers. The most original so far has been CouchDB.

An alternative to betting on a speculative database technologies at varying levels of maturity is to misuse an existing mature relational database product. As mentioned earlier, many large scale sites use relational databases but eschew relational features such as transactions and joins to achieve scalability. Some developers have even taken that practice to an extreme and built schema-less data models on top of traditional relational database. A great example of this How FriendFeed uses MySQL to store schema-less data which is a blog post excerpted below

Lots of projects exist designed to tackle the problem storing data with flexible schemas and building new indexes on the fly (e.g., CouchDB). However, none of them seemed widely-used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs (see this somewhat outdated article on CouchDB, for example). MySQL works. It doesn't corrupt data. Replication works. We understand its limitations already. We like MySQL for storage, just not RDBMS usage patterns.

After some deliberation, we decided to implement a "schema-less" storage system on top of MySQL rather than use a completely new storage system.
…
Our datastore stores schema-less bags of properties (e.g., JSON objects or Python dictionaries). The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque as far as the datastore is concerned. We can change the "schema" simply by storing new properties.
…
In MySQL, our entities are stored in a table that looks like this:
CREATE TABLE entities (
    added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id BINARY(16) NOT NULL,
    updated TIMESTAMP NOT NULL,
    body MEDIUMBLOB,
    UNIQUE KEY (id),
    KEY (updated)
) ENGINE=InnoDB;
The added_id column is present because InnoDB stores data rows physically in primary key order. The AUTO_INCREMENT primary key ensures new entities are written sequentially on disk after old entities, which helps for both read and write locality (new entities tend to be read more frequently than old entities since FriendFeed pages are ordered reverse-chronologically). Entity bodies are stored as zlib-compressed, pickled Python dictionaries.

Now that the FriendFeed team works at Facebook I suspect they'll end up deciding that a NoSQL database that has solved a good story around replication and fault tolerance is more amenable to solving the problem of building a schema-less database than storing key<->value pairs in a SQL database where the value is a serialized Python object.

As a Web developer it's always a good idea to know what the current practices are in the industry even if they seem a bit too crazy to adopt…yet.

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson

Navigation for Building Scalable Databases: Denormalization, the NoSQL Movement and Digg - Dare Obasanjo's weblog

Why Denormalize Your Database?

The Problem

The Relational Model

How Denormalization Changes Your Application

The No-SQL Movement vs. Abusing Relational Databases for Fun & Profit