About a week ago, the Facebook Data team quietly released the Cassandra Project on Google Code. The Cassandra project has been described as a cross between Google's BigTable and Amazon's Dynamo storage systems. An overview of the project is available in the SIGMOD presentation on Cassandra available at SlideShare. A summary of the salient aspects of the project follows.

The problem Cassandra is aimed at solving is one that plagues social networking sites or any other service that has lots of relationships between users and their data. In such services, data often needs to be denormalized to prevent having to do lots of joins when performing queries. However this means the system needs to deal with the increased write traffic due to denormalization. At this point if you're using a relational database, you realize you're pretty much breaking every major rule of relational database design. Google tackled this problem by coming up with BigTable. Facebook has followed their lead by developing Cassandra which they admit is inspired by BigTable. 

The Cassandra data model is fairly straightforward. The entire system is a giant table with lots of rows. Each row is identified by a unique key. Each row has a column family, which can be thought of as the schema for the row. A column family can contain thousands of columns which are a tuple of {name, value, timestamp} and/or super columns which are a tuple of {name, column+} where column+ means one or more columns. This is very similar to the data model behind Google's BigTable.

As I mentioned earlier, denormalized data means you have to be able to handle a lot more writes than you would if storing data in a normalized relational database. Cassandra has several optimizations to make writes cheaper. When a write operation occurs, it doesn't immediately cause a write to the disk. Instead the record is updated in memory and the write operation is added to the commit log. Periodically the list of pending writes is processed and write operations are flushed to disk. As part of the flushing process the set of pending writes is analyzed and redundant writes eliminated. Additionally, the writes are sorted so that the disk is written to sequentially thus significantly improving seek time on the hard drive and reducing the impact of random writes to the system. How important is improving seek time when accessing data on a hard drive? It can make the difference between taking hours versus days to flush a hundred gigabytes of writes to a disk. Disk is the new tape.

Cassandra is described as "always writable" which means that a write operation always returns success even if it fails internally to the system. This is similar to the model exposed by Amazon's Dynamo which has an eventual consistency model.  From what I've read, it isn't clear how writes operations that occur during an internal failure are reconciled and exposed to users of the system. I'm sure someone with more knowledge can chime in in the comments.

At first glance, this is a very nice addition to the world of Open Source software by the Facebook team. Kudos.

Found via James Hamilton.

PS: Is it me or is this the second significant instance of Facebook Open Sourcing a key infrastructure component "inspired" by Google internals?

Now Playing: Ray J - Gifts


Monday, July 14, 2008 3:56:19 PM (GMT Daylight Time, UTC+01:00)
Does Facebook have frameworks equivalent to MapReduce too?
Monday, July 14, 2008 5:31:55 PM (GMT Daylight Time, UTC+01:00)
Calling 'Always Writable' similar to 'Eventual Consistency' seems to miss some crucial details. In large systems, server errors are a given, just based on raw probabilities. Always Writable optimizes server (and client) processing at the expense a reliability. In case of error, writes are lost and the client is not informed. This is fine for appropriate scenarios, but is not something most people associate with the term database.

Eventual Consistency is about delayed commit. If you give up 2-phase commit (which the CAP principal implies must be abandoned in a large enough system or you loose availability), then you end up with Eventual Consistency. In an eventually consistent system, when a write succeeds it is committed; but it is not necessarily visible everywhere. This is as stronger durability guarantee, at the expense of the data's availability.

Both approaches are similar, in that they break many traditional models of how a database should behave, but they address different use-cases.
Derek Denny-Brown
Monday, July 14, 2008 6:21:36 PM (GMT Daylight Time, UTC+01:00)
If you want an Open Source version of Google's MapReduce, you should take a look at Hadoop which is sponsored by Yahoo! and was inspired by Google's MapReduce.

I assumed [perhaps incorrectly] that Cassandra has some way to reconcile failed writes which is why they drew the comparison to Amazon's Dynamo.
Monday, July 14, 2008 7:27:46 PM (GMT Daylight Time, UTC+01:00)
Love the site redesign, although I'm not sure about 'hacker' vs. 'madman'. :) Either way its good to have you back. I always enjoy your perspective.

Where was the picture in the corner taken?
brad dunbar
Tuesday, July 15, 2008 5:36:06 AM (GMT Daylight Time, UTC+01:00)
Your paragraph describing the write optimizations does not make much sense. Almost all such storage systems (including BigTable and Dynamo) batch and merge writes that are accumulated in memory (and logged ahead of time to a commit log) and all such batched writes are sequential appends to an existing file or creations of new files, with all data in new files in sorted order. Nothing is ever mutated in place, so the question of random writes doesn't really arise. Cassandra is either doing the same thing (in which case their big selling point is that they are open source and have combined two interesting closed-source systems), or they're doing something different which is not being described in this post. Have you read the BigTable and Dynamo papers in detail?

Moreover, unless you have some control over how the OS lays out blocks of the file over sectors of the drive, sequential appends to a file don't necessarily map to a contiguous extent of sectors. But the sequentiality of reads and writes does help the OS caches and prefetchers.
Saturday, July 19, 2008 3:05:09 AM (GMT Daylight Time, UTC+01:00)
Damn, vocoder lays down the smack. RTFM Dare. RTFM.
Comments are closed.