How Facebook Keeps Memcached Consistent Across Geo-Distributed Data Centers

August 21, 2008

@ 02:02 PM

Last year I wrote a blog post entitled When Databases Lie: Consistency vs. Availability in Distributed Systems where I talked about the kinds of problems Web applications face when trying to keep data consistent across multiple databases spread out across the world.

Jason Sobel, a developer at Facebook has some details on how they've customized MySQL to solve a variation of the problem I posed in a blog post entitled Scaling Out where he writes

A bit of background on our caching model: when a user modifies a data object our infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object we pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is deleted by another update.

...

Consider the following example:

I update my first name from "Jason" to "Monkey"
We write "Monkey" in to the master database in California and delete my first name from memcache in California and Virginia
Someone goes to my profile in Virginia
We don't find my first name in memcache so we read from the Virginia slave database and get "Jason" because of replication lag
We update Virginia memcache with my first name as "Jason"
Replication catches up and we update the slave database with my first name as "Monkey"
Someone else goes to my profile in Virginia
We find my first name in memcache and return "Jason"

Until I update my first name again or it falls out of cache and we go back to the database, we will show my first name as "Jason" in Virginia and "Monkey" in California. Confusing? You bet. Welcome to the world of distributed systems, where consistency is a really hard problem.
Fortunately, the solution is a lot easier to explain than the problem. We made a small change to MySQL that allows us to tack on extra information in the replication stream that is updating the slave database. We used this feature to append all the data objects that are changing for a given query and then the slave database "sees" these objects and is responsible for deleting the value from cache after it performs the update to the database.

...

The new workflow becomes (changed items in bold):

I update my first name from "Jason" to "Monkey"
We write "Monkey" in to the master database in California and delete my first name from memcache in California but not Virginia
Someone goes to my profile in Virginia
We find my first name in memcache and return "Jason"
Replication catches up and we update the slave database with my first name as "Monkey." We also delete my first name from Virginia memcache because that cache object showed up in the replication stream
Someone else goes to my profile in Virginia
We don't find my first name in memcache so we read from the slave and get "Monkey"

Facebook's solution is clever and at first I couldn't shake the feeling that it is an example of extremely tight coupling for database replication to also be responsible for evicting expired items from your in-memory cache. After some thought, I realized that this is no different from the SqlCacheDependency class in ASP.NET which allows you to create a dependency between objects in your ASP.NET cache and those in your SQL database. When the underlying tables change, the Cache is updated to reflect this change in database state.

In fact, the combination of replication and the SqlCacheDependency class should mean that you get this sort of behavior for free if you are using ASP.NET caching and SQL Server. Unfortunately, it looks like Microsoft's upcoming in-memory distributed object caching product, Velocity, won't support SqlCacheDependency in its initial release according to a comment by one its developers.

Of course, there is a significant performance difference between actively monitoring the database for changes like SqlCacheDependency does and updating the cache when updates made to the database are received as part of the replication stream. I wonder if this pattern will turn out to be generally useful to Web developers (at least those of us who work on geo-distributed services) or whether this will just go down as a clever hack from those kids at Facebook that was cool to share.

Now Playing: Rihanna - Disturbia

Categories: Web Development

« Explaining REST to Damien Katz | Home | Some Thoughts on Amazon's Elastic Block ... »

Thursday, 21 August 2008 14:28:35 (GMT Daylight Time, UTC+01:00)

Yeah but isn't SqlCacheDependency per-table? So you only know that the table changed and would have to delete all profile objects from the cache. That wouldn't work.

El Guapo

Thursday, 21 August 2008 16:36:16 (GMT Daylight Time, UTC+01:00)

We have SQL CLR triggers which use an embedded memcached client to invalidate the cache.

Harry M

Thursday, 21 August 2008 18:59:25 (GMT Daylight Time, UTC+01:00)

Just to second Harry M, this seems like what triggers are for.

Eric Larson

Saturday, 23 August 2008 16:03:17 (GMT Daylight Time, UTC+01:00)

Mucking with replication streams/using triggers with it's own set of problems - why not do something simple.

How about maintaining a timestamp in the profile that reflects last updated time for fields that are cached in the memcached profile. This in conjunction with a job that picks up profiles that have changed since the last run and updates the cache, should do the trick.

Confused

Comments are closed.

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson

Navigation for How Facebook Keeps Memcached Consistent Across Geo-Distributed Data Centers - Dare Obasanjo's weblog