Werner Vogels, CTO of Amazon, has a blog post entitled Amazon's Dynamo which contains the HTML version of an upcoming paper entitled Dynamo: Amazon’s Highly Available Key-value Store which describes a highly available, distributed storage system used internally at Amazon. 

The paper is an interesting read and a welcome addition to the body of knowledge about building megascale distributed storage systems. I particularly like that it isn’t simply another GFS or BigTable, but is unique in it’s own right. Hopefully, this will convince folks that just because Google were first to publish papers about their internal infrastructure doesn’t mean that what they’ve done is the bible of building megascale distributed systems. Anyway, on to some of the juicy bits

Traditionally production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled personnel for its operation, making it a very inefficient solution. In addition, the available replication technologies are limited and typically choose consistency over availability. Although many advances have been made in the recent years, it is still not easy to scale-out databases or use smart partitioning schemes for load balancing.

Although I work for a company that sells a relational database product, I think it is still fair to say that there is a certain level of scale where practically every feature traditionally associated with an RDBMS works against you.

Luckily, there are only a handful of companies and Web services in the world that need to operate at that scale.

2.1 System Assumptions and Requirements

The storage system for this class of services has the following requirements:

Query Model: simple read and write operations to a data item that is uniquely identified by a key. State is stored as binary objects (i.e., blobs) identified by unique keys. No operations span multiple data items and there is no need for relational schema. This requirement is based on the observation that a significant portion of Amazon’s services can work with this simple query model and do not need any relational schema. Dynamo targets applications that need to store objects that are relatively small (usually less than 1 MB).

ACID Properties: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability. This has been widely acknowledged by both the industry and academia [5]. Dynamo targets applications that operate with weaker consistency (the “C” in ACID) if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single key updates.

Other Assumptions: Dynamo is used only by Amazon’s internal services. Its operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization. Moreover, since each service uses its distinct instance of Dynamo, its initial design targets a scale of up to hundreds of storage hosts. We will discuss the scalability limitations of Dynamo and possible scalability related extensions in later sections.

Lots of worthy items to note here. The first is that you can get a lot of traction out of a simple data structure such as a hash table. Specifically, as noted by Sam Ruby in his post Key + Data, accessing data by key instead of using complex queries is becoming a common pattern in large scale distributed storage systems. Sam actually missed pointing out that Google’s Bigtable is another example of this trend given that data items within it are accessed using the tuple {row key, column key, timestamp} instead of being queried using data manipulation language.

Another interesting thing, from my perspective, is that they’ve gotten around hitting scaling limits at running it on hundreds of storage hosts by having different teams at Amazon run their own instances of Dynamo. Then again, there are 200 clusters of GFS running at Google, so this is probably common sense as well.

4.4 Data Versioning

Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates.. If there are no failures then there is a bound on the update propagation times. However, under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.

There is a category of applications in Amazon’s platform that can tolerate such inconsistencies and can be constructed to operate under these conditions. For example, the shopping cart application requires that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. Note that both “add to cart” and “delete item from cart” operations are translated into put requests to Dynamo. When a customer wants to add an item to (or remove from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.

In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time. Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). However, version branching may happen, in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation). A typical example of a collapse operation is “merging” different versions of a customer’s shopping cart. Using this reconciliation mechanism, an “add to cart” operation is never lost. However, deleted items can resurface.

Fascinating. I can just imagine how scary this most sound to RDBMS heads to think that instead of the database enforcing the rules of consistency, it just keeps multiple versions of the “row” around and then asks the client to figure out which is which if there were multiple updates that couldn’t be reconciled.

The folks at Amazon have taken acknowledgement of the CAP Conjecture to its logical extreme. Consistency, Availability, and Partition-tolerance. Pick two.

There’s lots of other interesting stuff in the paper but I’ll save some for you to read and end my excerpts here. This will make great bedtime reading this weekend.

Now playing: Geto Boys - My Mind's Playin Tricks On Me


Tracked by:
"Garage Door Parts" (Garage Door Parts) [Trackback]