Earlier today (or should I say yesterday) I attended Facebook’s Seattle Engineering Road Show which was a part technical talk and part recruiting event where Mike Shroepfer and a number of other Facebook engineers gave a fairly deep technical talk about the technologies used by Facebook.
Below are my notes on the presentation and ensuing Q&A session. One thing I found interesting about the talk was how similar their architecture for the platform that powers the Facebook news feed is to what my team has built for powering the Windows Live What's New feed.
Before the presentation started there were a bunch of slides that carried some interesting and sometimes impressive stats about Facebook including
Mike Shreopfer started his talk with a number of impressive statistics. These stats include the fact that 2 billion pieces of content are shared per week when you add up all of the status updates, comments, likes and other sharing gestures that are performed on the site. There are also 2 billion photos uploaded per month, 15,000 Facebook Connect implementations which in combination with the various platform applications make over 5 billion API calls per day and about 300 million active users per month.
Scaling Facebook has proven more challenging than scaling a traditional website. In a traditional website, you put a user’s data in a database then whenever the user’s data is requested you have some web or mid-tier layer retrieve the user’s data from the database. Such applications (e.g. Web-based email services) are easily partitionable and it is thus straightforward to scale them horizontally by creating different pools of users and putting various groups of users on particular servers. The original design for Facebook was similarly trivial to scale. In the early days, each college lived on it’s own cluster (e.g. http://harvard.facebook.com) and the service was thus trivial to scale. This model broke down when Facebook made it possible for users to make friends from different colleges and eventually making friends across different geographical networks. At this point it became harder to partition data since all users were interconnected. Rendering a user’s homepage didn’t just mean looking up that user’s data but also data from all their friends. This change continues to be Facebook’s core scalability challenge.
The most famous example of tackling this challenge is the Facebook news feed. When you hit the news feed they render 45 stories on your homepage. The service has a couple of milliseconds to query the feeds of up to 5,000 of your closest friends and then choose which 45 stories are the best ones to show the user. There is a platform named Multifeed which is the underlying platform which powers the news feed and allows them to filter down tens of thousands of stories to forty five stories when rendering the home page. This process involves fetching metadata for news stories from memcache clusters that perform over 50 million operations per second. These queries include rich media such as photos which are served at the rate of 1.2 million photos per second.
In aggregate they currently store more than 20 billion photos at 4 different resolutions per photo (i.e. 80 billion image files). Back in 2005, photos were served from NFS clusters. They eventually hit some bottlenecks because file systems are not really designed for serving large numbers of photos. The metadata ended up being too large to fit in memory and there were numerous I/O bottlenecks. The second implementation of the photos service contained several implementations such as putting profile photos in a caching tier and moving a large number of photos to a Content Distribution Network (CDN). They also built a file handle cache for their NFS systems to reduce the amount of disk I/O involved when looking up the location of the file. The 3rd generation of their photo service is a custom storage solution known as Haystack. Haystack runs on commodity hardware (i.e. Dell Linux boxes with a terabyte SATA drives) and is a custom file system which keeps all file indexes in RAM to speed up lookups. Today Haystack is so optimal that it only takes one I/O operation to load a photo compared to 3 I/O operations for the previous optimized implementation and 10 I/O operations for the initial NFS based system. Haystack is so scalable that they once accidentally disabled their CDNs and Haystack served all photos on the site without breaking a sweat. The service was built by 3 developers in a couple of months. It will be Open Sourced in the near future.
Another example of Facebook’s approach to scaling is the recent addition of custom usernames to the site. When the feature launced they had over 200 million active users. There was a lot of internal debate on how to fairly allocate usernames to their users and they even tested auctioning usernames. However this ended up being confusing to users so they decided to go with the simpler first come, first served model as the primary way for users to get their preferred user name. Given the expected rush of users who would sign up for username, the developers expected the load to be the equivalent of a denial of service attack and planned accordingly. There were a number of contingency plans to reduce load on the site including disabling the chat feature, reducing the number of items in the news feed from 45 to 15 and even serving static HTML on some pages. However none of these contingencies had to be put in place when the username feature was launched because the service scaled so well. Users signed up for 200,000 usernames in the first 3 minutes, 500,000 within the first 15 minutes and had created over 1 million usernames within the first hour. Since usernames were created at the root of the Facebook namespace, many users created “fun” usernames such as http://www.facebook.com/home.aspx and they actually had manually reject a user who registered a common typo to one of their pages.
At the high level you can think of their architecture as being as follows.
There are PHP web servers which then aggregate data from dedicated web services, in-memory caches and their relational database. They use PHP on the front-end because it is easy to learn, easy to develop in and it is very productive. However there are downsides such as the fact that their benchmarks have shown that it is ten times slower than if they used more traditional web programming technologies like C# or Java. They have done a lot of work to optimize PHP and will continue to invest in optimizations instead of switching the language they use on their front ends.
Their dedicated web services are built in whatever technology the developer building the service is most comfortable with. Thus far they have dedicated services in C++, Erlang (see Eugene Letuchy’s post on Facebook Chat’s usage of Erlang), Python and Java. They have also built a number of tools that act as building blocks for creating new services so their developers do not reinvent the wheel. These building blocks include tools like Thrift; a cross-language RPC framework that allows you to communicate using serialized objects in multiple languages C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml).There is Scribe which is a system for aggregating server logs in real-time and publishing them to a centralized location. There are also custom internal tools for centrally managing configuration of services, as well as monitoring and alerting so that each new service doesn’t have to figure out how to solve these problems but can instead plug into what they have already.
An example of a dedicated internal service at Facebook is MultiFeed. This is the service which takes the tens of thousands of updates from a user’s friends and whittles them down to the forty five updates that are shown on the home page when a user logs in. When a user publishes an update it is both written to MySQL and then published to a their memcache cluster using Scribe. There is a cache of the fifty most recent items each user has performed always stored in memory. When a user hits the home page, an aggregator takes the list of the user’s friends and then queries a bunch of leaf nodes for activities their friends have performed. Each leaf node does some filtering before returning data to the aggregator which in turn does some filtering of its own before returning a list of story IDs to the calling application. The application then looks up the actual items from memcache given their story IDs and then renders the home page.
Multifeed is a custom distributed system which takes the tens of thousands of updates from friends and picks the 45 that are either most relevant or most recent. Bob updates status to DB and then Scribe pushes this to a Multifeed server which is acache of the most recent 50 items the user has performed. This caches data for every user in the system. When a user shows up, the user hits an aggregator which takes the list of friends and then each leaf node does some filtering then gets back to the aggregator which then does some filtering and returns story IDs. Then the aggregator fills out the data from memcache given the story ID.
Facebook heavily uses memcached which is an Open Source distributed, in-memory hash table they use to keep load off of their MySQL databases. They have a love/hate relationship with memcache. On the on hand it is reliable, has low latency and can handle an extremely high service rate. On the other hand data is easy to corrupt, the data model is limited and it is an inefficient way to store small items. They have made several customizations to memcache such as porting it to 64-bit, moving from TCP to UDP, optimizing the network stack and adding multithreading support. Their changes have improved the throughput of memcache by a factor of five compared to when they first started using it.
Once they got memcached throughput that high, they hit another scaling bottleneck in that they started flooding their network switches due to putting too much data on the network. This problem was addressed by using a throttling approach where web servers request data piecemeal from the caches instead of in one huge request. They performed a number of experiments comparing batch request size to memcached cluster size and came away with the conclusion that requesting items in batches of 50 was the most optimal solution for their system. This throttling also applies at the user facing application layer. For example, instead of requesting all forty five items at once when rendering the homepage, a subset of the data is requested and then only when the users scrolls down below the fold is the rest of the data fetched [Ed. Note – hey, I’ve noticed that when using Facebook]. They try to keep the memcache clusters small so that web servers don’t have to hit lots of cache servers to satisfy a request. To ensure this locality, data is partitioned vertically by object type to ensure that all instances of a particular type of object reside on a small set of cache servers. Since this creates single points of failure, objects in the cache are sometimes replicated. The term replicated is used loosely to describe the process of the application layer storing objects on multiple cache servers to simulate database style replication. When asked how they deal with data inconsistency issues, the response was that for the most part they limit the object types they “replicate” in this manner then live with the data inconsistency issues.
MySQL was initially chosen as their database of choice because it is free, simple, fast and reliable. All told, they have about 6000 server years of runtime experience without data loss or corruption due to software. However like most megascale services they often end up using MySQL as a glorified key<->value store as opposed to taking advantage of the relational features of the database. When asked why bother using a relational database, the response is that the database management features such as data replication and administration are still valuable even if you aren’t using foreign keys and other relational features. One interesting challenge is that they often use “data driven schemas” to enable programmers to add new types to the database without requiring schema changes. This approach sounds similar to what Bret Taylor described in his post How FriendFeed uses MySQL to store schema-less data. The challenge with this approach is that you end up with key<->value pairs stuffed in some column in the database where you can’t even efficiently query because you can’t index the fields you care about.
One of their biggest unsolved problems is coming up with a good story for being a geographically distributed service. Right now they have a situation where they have West Coast and East Cost data centers where they use the West Coast DCs for read & write operations while the East Coast data centers are read-only. This means a user in New York hits the East Coast DC when reading their news feed but hits a DC in California when publishing a status update. To keep the data in sync across data centers they use MySQL replication and have hacked the replication streams to also propagate memcached cache invalidation commands as well. This clever solution hack has been previously described in the Facebook engineering blog post by Jason Sobel entitled Keeping Up. There are some tricky problems they have to solve such as users often wanting the behavior to seem instantaneous despite the replication lag (e.g. I often change my status then want to check to see what it looks like on my profile/feed). There is also the open question on whether this solution is workable at a global scale (i.e. if Facebook ever adds data centers outside of the United States).
The company has released or contributes to a ton of Open Source software. Besides aforementioned technologies like Thrift, Scribe and contributions to memcached there are also a number of other interesting Facebook originated projects like Cassandra and Tornado. They also contribute quite heavily to Hive which is a data warehousing solution built on top of Hadoop.
There are three key mantras that describe Facebook’s engineering culture
Move Fast and Break Things – The company encourages people to take risks and innovate even at the risk of jeopardizing the stability of the site. It is important to them to not become risk averse because this is often the downfall of companies as they grow.
Huge Impact with Small Teams – Most features and underlying systems on the site are built by teams of one to three people. They have a metric of users per developers that they are particularly proud. They claim that at Facebook they have 1 developer for every 1.2 million users which is better than most industry leaders including Google (1 developer per 190,000 users), Amazon (1 developer per 96,000 users) and Microsoft (1 developer per 75,000). [Ed. Note – not to be defensive but this metric seems weird to me and it wasn’t clear how it is calculated. ]
Be Bold and Innovative – self explanatory
There were a number of interesting Q & A questions asked and answered but I ran out of juice in my laptop so didn’t capture them. The answers I do remember are sprinkled in the notes above.
I found it very interesting to learn about the architecture behind MultiFeed as well as the various scaling challenges and unsolved problems that Facebook still faces. I'd say there is something like 80% overlap in the way we've approached solving problems and where we differ our approaches seem more philosophical than architectural. For example, when our service hit similar issues around network switch saturation due to high throughput from our in-memory caches we solved the problem in a simpler way than Facebook's throttled back-off approach although I have to admit their solution is clever. Our approach was to look for ways to reduce the amount of data we put out on the network instead Facebook’s approach of staggering requests to reduce the peak load on the system.
If you’re interested in working on or building systems similar to the ones described above for the 500 million users who utilize Windows Live services. Send me a resume, we’re always looking for good developers.
Now Playing: Yo Gotti - 5 Star (Remix) [feat. Gucci Mane, Trina & Nickie Minaj]
The opinions expressed herein are my own personal opinions and do not represent
my employer's view in any way.
© Copyright 2013, Dare Obasanjo - Powered by: newtelligence dasBlog 2.1.8102.813
DirectionalRedux theme by John Forsythe and Jon Stovall