Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. The justification for database sharding is that after a certain scale point it is cheaper and more feasible to scale a site horizontally by adding more machines than to grow it vertically by adding beefier servers.

Why Shard or Partition your Database?

Let's take Facebook as an example. In early 2004, the site was mostly used by Harvard students as a glorified online yearbook. You can imagine that the entire storage requirements and query load on the database could be handled by a single beefy server. Fast forward to 2008, when the Facebook application related page views alone are about 14 billion a month (which translates to over 5,000 page views per second, each of which requires multiple backend queries to satisfy). Besides query load, with its attendant IOPs, CPU and memory costs, there's also storage capacity to consider. Today Facebook stores 40 billion physical files to represent about 10 billion photos, which is over a petabyte of storage. Even though the actual photo files are likely not in a relational database, their metadata, such as identifiers and locations, would still require a few terabytes of storage to represent these photos in the database. Do you think the original database used by Facebook had terabytes of storage available just to store photo metadata?

At some point during the development of Facebook, they reached the physical capacity of their database server. The question then was whether to scale vertically by buying a more expensive, beefier server with more RAM, CPU horsepower, disk I/O and storage capacity, or to spread their data out across multiple relatively cheap database servers. In general, if your service has lots of rapidly changing data (i.e. lots of writes) or is sporadically queried by lots of users in a way which causes your working set not to fit in memory (i.e. lots of reads leading to lots of page faults and disk seeks), then your primary bottleneck will likely be I/O. This is typically the case with social media sites like Facebook, LinkedIn, Blogger, MySpace and even Flickr. In such cases, it is either prohibitively expensive or physically impossible to purchase a single server to handle the load on the site, and sharding the database provides excellent bang for the buck with regards to cost savings relative to the increased complexity of the system.

Now that we have an understanding of when and why one would shard a database, the next step is to consider how one would actually partition the data into individual shards. A number of options and their individual tradeoffs are presented below.

How Sharding Changes your Application

In a well designed application, the primary change sharding adds to the core application code is that instead of code such as

// string connectionString = @"Driver={MySQL};SERVER=dbserver;DATABASE=CustomerDB;"; <-- should be in web.config
string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];
OdbcConnection conn = new OdbcConnection(connectionString);
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID = ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId;
conn.Open();
OdbcDataReader reader = cmd.ExecuteReader();

the actual connection information for the database depends on the data we are trying to store or access. So you'd have the following instead

string connectionString = GetDatabaseFor(customerId);
OdbcConnection conn = new OdbcConnection(connectionString);
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID = ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId;
conn.Open();
OdbcDataReader reader = cmd.ExecuteReader();

the assumption here being that the GetDatabaseFor() method knows how to map a customer ID to a physical database location. For the most part everything else should remain the same unless the application uses sharding as a way to parallelize queries. 
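A minimal sketch of what a GetDatabaseFor()-style lookup might do, written in Python for brevity rather than the article's C#; the server names and the simple modulo mapping are illustrative assumptions, not the actual implementation:

```python
# Hypothetical stand-in for GetDatabaseFor(): maps a customer ID to the
# connection string of the shard that holds that customer's data.
# The server names and the modulo scheme are illustrative assumptions.

SHARD_CONNECTION_STRINGS = [
    "Driver={MySQL};SERVER=dbshard0;DATABASE=CustomerDB;",
    "Driver={MySQL};SERVER=dbshard1;DATABASE=CustomerDB;",
    "Driver={MySQL};SERVER=dbshard2;DATABASE=CustomerDB;",
]

def get_database_for(customer_id: int) -> str:
    """Pick a shard by customer ID; any deterministic mapping works here."""
    return SHARD_CONNECTION_STRINGS[customer_id % len(SHARD_CONNECTION_STRINGS)]
```

The only property the application depends on is that the same customer ID always maps to the same shard; the scheme behind the function can be any of the ones discussed below.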

A Look at Some Common Sharding Schemes

There are a number of different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are four of the most popular schemes used by various large scale Web applications today.

  1. Vertical Partitioning: A simple way to segment your application database is to move tables related to specific features to their own server. For example, placing user profile information on one database server, friend lists on another and a third for user generated content like photos and blogs. The key benefit of this approach is that it is straightforward to implement and has low impact on the application as a whole. The main problem with this approach is that if the site experiences additional growth then it may be necessary to further shard a feature specific database across multiple servers (e.g. handling metadata queries for 10 billion photos by 140 million users may be more than a single server can handle).
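     Vertical partitioning amounts to a static feature-to-server map, sketched here in Python for brevity; all server names are made up for the example:

     ```python
     # Illustrative vertical partitioning: each feature's tables live on
     # their own database server. All names here are hypothetical.
     FEATURE_DATABASES = {
         "user_profiles": "profiles-db.example.com",
         "friend_lists":  "friends-db.example.com",
         "photos":        "ugc-db.example.com",
         "blogs":         "ugc-db.example.com",  # user generated content shares a server
     }

     def server_for_feature(feature: str) -> str:
         """Route a query to the server that owns the feature's tables."""
         return FEATURE_DATABASES[feature]
     ```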

  2. Range Based Partitioning: In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on value ranges that occur within each entity. For example, splitting up sales transactions by the year they were created or assigning users to servers based on the first digit of their zip code. The main problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully, then the sharding scheme leads to unbalanced servers. In the previous example, splitting up transactions by date means that the server with the current year gets a disproportionate amount of read and write traffic. Similarly, partitioning users based on their zip code assumes that your user base will be evenly distributed across the different zip codes, which fails to account for situations where your application is popular in a particular region and the fact that human populations vary across different zip codes.
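     A range-based scheme boils down to a sorted list of range boundaries, sketched in Python; the year boundaries and server names are invented for the example:

     ```python
     import bisect

     # Illustrative range partitioning of sales transactions by year.
     # Each entry is (first year the shard covers, server name).
     YEAR_SHARDS = [
         (2000, "sales-db-archive"),
         (2007, "sales-db-2007"),
         (2009, "sales-db-current"),
     ]
     _BOUNDARIES = [start for start, _ in YEAR_SHARDS]

     def shard_for_year(year: int) -> str:
         """Find the shard whose range contains the given year."""
         idx = bisect.bisect_right(_BOUNDARIES, year) - 1
         if idx < 0:
             raise ValueError("no shard covers year %d" % year)
         return YEAR_SHARDS[idx][1]
     ```

     Note that every current-year transaction lands on "sales-db-current", which is exactly the hot-shard imbalance described above.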

  3. Key or Hash Based Partitioning: This is often a synonym for user based partitioning for Web 2.0 sites. With this approach, each entity has a value that can be used as input into a hash function whose output is used to determine which database server to use. A simplistic example is to consider a case where you have ten database servers and your user IDs are numeric values that are incremented by 1 each time a new user is added. In this example, the hash function could perform a modulo operation on the user ID with the number ten and then pick a database server based on the remainder value. This approach should ensure a uniform allocation of data to each server. The key problem with this approach is that it effectively fixes your number of database servers, since adding new servers means changing the hash function, which without downtime is like being asked to change the tires on a moving car.
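     The modulo scheme, and the reason it pins the server count, can be sketched in a few lines of Python (the server counts are illustrative):

     ```python
     def shard_for_user(user_id: int, server_count: int) -> int:
         """Hash-based partitioning: server index is user ID modulo server count."""
         return user_id % server_count

     # Growing the pool from 10 to 11 servers changes the mapping for most
     # existing users, which is what makes resizing this scheme so painful:
     moved = sum(
         1 for uid in range(1000)
         if shard_for_user(uid, 10) != shard_for_user(uid, 11)
     )
     # moved == 900: 9 out of every 10 users would have to be migrated.
     ```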

  4. Directory Based Partitioning: A loosely coupled approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code. This means the GetDatabaseFor() method actually hits a web service or a database which stores/returns the mapping between each entity key and the database server it resides on. This loosely coupled approach means you can perform tasks like adding servers to the database pool or changing your partitioning scheme without having to impact your application. Consider the previous example where there are ten servers and the hash function is a modulo operation. Let's say we want to add five database servers to the pool without incurring downtime. We can keep the existing hash function, add these servers to the pool and then run a script that copies data from the ten existing servers to the five new servers based on a new hash function implemented by performing the modulo operation on user IDs using the new server count of fifteen. Once the data is copied over (although this is tricky since users are always updating their data) the lookup service can change to using the new hash function without any of the calling applications being any wiser that their database pool just grew 50% and the database they went to for accessing John Doe's pictures five minutes ago is different from the one they are accessing now.
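A directory service can be sketched as a mutable mapping consulted on every lookup. The Python class below is a minimal in-memory stand-in; a production directory would be a web service or a database, and the server names are invented:

```python
class ShardDirectory:
    """Minimal in-memory stand-in for a directory/lookup service."""

    def __init__(self, server_count: int):
        self.server_count = server_count
        self.overrides = {}  # entity key -> server name, set as data migrates

    def get_database_for(self, user_id: int) -> str:
        # Migrated keys win, so the pool can be resized one entity at a
        # time while callers keep using the same lookup call.
        if user_id in self.overrides:
            return self.overrides[user_id]
        return "dbserver%d" % (user_id % self.server_count)

    def migrate(self, user_id: int, server: str) -> None:
        """Record that this entity's data has been copied to a new server."""
        self.overrides[user_id] = server
```

Because callers only ever ask the directory, the underlying scheme can change out from under them, which is the whole point of the indirection.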

Problems Common to all Sharding Schemes

Once a database has been sharded, new constraints are placed on the operations that can be performed on the database. These constraints primarily center around the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by sharding:

  • Joins and Denormalization – Prior to sharding a database, any queries that require joins on multiple tables execute on a single server. Once a database has been sharded across multiple servers, it is often not feasible to perform joins that span database shards, both because of the performance cost of compiling data from multiple servers and because of the additional complexity of performing such cross-server queries.

    A common workaround is to denormalize the database so that queries that previously required joins can be performed from a single table. For example, consider a photo site which has a database which contains a user_info table and a photos table. Comments a user has left on photos are stored in the photos table and reference the user's ID as a foreign key. So when you go to the user's profile it takes a join of the user_info and photos tables to show the user's recent comments.  After sharding the database, it now takes querying two database servers to perform an operation that used to require hitting only one server. This performance hit can be avoided by denormalizing the database. In this case, a user's comments on photos could be stored in the same table or server as their user_info AND the photos table also has a copy of the comment. That way rendering a photo page and showing its comments only has to hit the server with the photos table while rendering a user profile page with their recent comments only has to hit the server with the user_info table.
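    The dual write this denormalization implies can be sketched as follows, with Python dicts standing in for the two shards; the table and field names are invented for illustration:

    ```python
    # Two dicts stand in for the photos shard and the user_info shard.
    photos_db = {}     # photo_id -> comments shown on that photo's page
    user_info_db = {}  # user_id  -> comments shown on that user's profile page

    def add_comment(user_id: int, photo_id: int, text: str) -> None:
        """Write the comment to BOTH shards so each page needs only one server."""
        comment = {"user_id": user_id, "photo_id": photo_id, "text": text}
        # If one write succeeds and the other fails, the two copies drift
        # apart -- the data inconsistency peril of denormalization.
        photos_db.setdefault(photo_id, []).append(comment)
        user_info_db.setdefault(user_id, []).append(comment)
    ```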

    Of course, the service now has to deal with all the perils of denormalization such as data inconsistency (e.g. user deletes a comment and the operation is successful against the user_info DB server but fails against the photos DB server because it was just rebooted after a critical security patch).

  • Referential integrity – As you can imagine if there's a bad story around performing cross-shard queries it is even worse trying to enforce data integrity constraints such as foreign keys in a sharded database. Most relational database management systems do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references once they move to using database shards.
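    Such a cleanup job might, in outline, look like the following Python sketch; the data shapes are invented for illustration:

    ```python
    def find_dangling_comments(comments, existing_user_ids):
        """Return comments whose user_id no longer resolves on the users shard.

        `comments` is rows pulled from the photos shard; `existing_user_ids`
        is the set of IDs currently present on the user_info shard.
        """
        return [c for c in comments if c["user_id"] not in existing_user_ids]

    # A periodic job would delete (or archive) whatever this returns.
    ```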

    Dealing with data inconsistency issues due to denormalization and lack of referential integrity can become a significant development cost to the service.

  • Rebalancing (Updated 1/21/2009) – In some cases, the sharding scheme chosen for a database has to be changed. This could happen because the sharding scheme was improperly chosen (e.g. partitioning users by zip code) or the application outgrows the database even after being sharded (e.g. too many requests being handled by the DB shard dedicated to photos so more database servers are needed for handling photos). In such cases, the database shards will have to be rebalanced, which means the partitioning scheme changed AND all existing data moved to new locations. Doing this without incurring down time is extremely difficult and not supported by any off-the-shelf product today. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).


Further Reading

Now Playing: The Kinks - You Really Got Me


Friday, January 16, 2009 5:24:28 PM (GMT Standard Time, UTC+00:00)
Thank you for the informative write-up on database sharding.
Friday, January 16, 2009 9:05:58 PM (GMT Standard Time, UTC+00:00)
You left out random partitioning with a non-fixed number of partitions. I use this technique when dealing with message boards. Topics are stored in the primary database (summary/master), while posts are stored in a random partition. The partition they are stored in has to be saved along with the topic, but this technique allows for massive scalability for schemas that can benefit from it.

The other techniques don't work well with boards, since the activity within a topic (posts) is what requires sharding. Ranged sharding doesn't scale well in this case, since the average posts per topic within a range can vary greatly, not to mention it's generally new topics that are most active. When this is the case, the newest ranged shard generally receives a disproportionate amount of the activity.

Random partitioning ensures a more even distribution of querying. And if you throw in weighting and the ability to turn off writes to a random shard, you can easily add new shards, increase their weight, and still grow horizontally with no pain.
Tom Dean
Friday, January 16, 2009 9:11:23 PM (GMT Standard Time, UTC+00:00)
Also forgot to mention that you didn't bring up what I call a summary database, which solves the problem of no longer being able to do joins.

This database is used to perform arbitrary queries across multiple shards. Essentially you update this summary behind the scenes with recent updates to the shards, and denormalize the data in such a way that it's more conducive to querying.

You'll have to live with the fact that this data will be stale, though for the most part this is rarely a problem if you have a site large enough to require this approach in the first place.

A perfect example is photos. A user's photos are stored locally to that user (within a user shard), but you also have a summary DB that aggregates photos across all shards. Thus when you want to find all photos from all users tagged "doggy", you can use the summary to retrieve a list of photo_id's and user_id's, and from there go to the respective shards to fetch additional information.
Tom Dean
Friday, January 16, 2009 10:59:52 PM (GMT Standard Time, UTC+00:00)
So then, how is the WL What's New feed stored?

I've been wondering about this a lot lately actually. Do you store all of the notifications that I create on a single shard (thus making it easy to retrieve everything when someone looks at my profile)? Or, do you store everything that I see on a single shard (thus replicating everyone's notifications to all their friends' shards)? I assume it's the former but it'd be good to know the details.

Friday, January 16, 2009 11:00:10 PM (GMT Standard Time, UTC+00:00)
Tom, what you describe sounds like a mix of vertical (summary vs posts) and horizontal partitioning (posts). Besides, I believe Dare has the Azure Table service in mind where you have an unknown number of partitions managed by the system.

Dare, the "Beautiful Architecture" book became available just yesterday, including an article on Facebook's architecture by Dave Fetterman. Is this a coincidence?
Saturday, January 17, 2009 11:25:16 PM (GMT Standard Time, UTC+00:00)
Dare - Facebook certainly doesn't store 40 billion files as BLOBs in the DB. They have a system called 'Haystack' that leverages the XFS file system...

HayStack Presentation

I am not sure Haystack is "live" yet and if their dependency on Akamai has been alleviated, but I like the engineering regardless :).
Sunday, January 18, 2009 12:02:39 PM (GMT Standard Time, UTC+00:00)
Thanks for your info.
Sunday, January 18, 2009 8:04:33 PM (GMT Standard Time, UTC+00:00)
Excellent write up! Clear and simple, and I am sure there are other methods too, but this really gives a good overview. I am going through something similar with a startup at present, i.e. where we need to at least consider now what the implications will be to grow the database. As we are using MS SQL (yes some of us do.. I love mySQL too..), and we have a less simple social model (not 1000's of web 2.0 dudes uploading photos etc), we are experimenting with the XML field type to store, say, a user's contacts/friends. So the XML field stores something like: <conref><id>1342</id><id>332</id>.... and since the XML field type is just that, we can index the XML, and we are finding the speed excellent (so far!). The main reason for this option is of course the "flatness factor" of the table, and that we don't need external joins if we need to shard the DB etc.. in fact even if we did a shard by range or something else, at least the core data is together.. I'm not a DB guru by any means, in fact we are going to hopefully find some DB consultants that can help verify the design, but we have about 4 of these XML field type fields which seem to handle the "internal joins" pretty well.

Again, very good article - I'm sure you guys know of too?


Now listening to: the sound of the morning.. summer and blue skies.
Wednesday, January 21, 2009 5:52:54 PM (GMT Standard Time, UTC+00:00)
Similar to Jamie's request, I would be interested in hearing about how MS has solved some of these issues/problems/etc on Live. Anything you can share?