A few members of the Hotmail/Windows Live Mail team have been doing some writing about scalability recently.

From the ACM Queue article A Conversation with Phil Smoot:

BF Can you give us some sense of just how big Hotmail is and what the challenges of dealing with something that size are?

PS Hotmail is a service consisting of thousands of machines and multiple petabytes of data. It executes billions of transactions over hundreds of applications agglomerated over nine years: services that are built on services that are built on services. Some of the challenges are keeping the site running, namely dealing with abuse and spam; keeping an aggressive, Internet-style pace of shipping features and functionality every three and six months; and planning how to release complex changes over a set of multiple releases.

QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.

BF I have this sense that the challenges don’t scale uniformly. In other words, are there certain scaling points where the problem just looks completely different from how it looked before? Are there things that are just fundamentally different about managing tens of thousands of systems compared with managing thousands or hundreds?

PS Sort of, but we tend to think that if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.

We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.
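As a rough illustration of that staged roll-out loop, here is a hedged sketch in C#. It is not the actual Hotmail deployment tooling; DeployTo and StageIsHealthy are hypothetical stand-ins for real deployment and monitoring hooks, and the server inventory is made up.

```csharp
// A rough sketch of the staged roll-out described above, not the actual
// Hotmail tooling. DeployTo and StageIsHealthy are hypothetical stand-ins
// for real deployment and monitoring hooks.
using System;
using System.Collections.Generic;

class StagedRollout
{
    static readonly int[] StageSizes = new int[] { 1, 10, 100, 1000 };

    static void Main()
    {
        // Hypothetical inventory; in practice this comes from an automation service.
        List<string> servers = new List<string>();
        for (int i = 0; i < 5000; i++) servers.Add("server" + i.ToString("D4"));

        int deployed = 0;
        foreach (int target in StageSizes)
        {
            // Widen the roll-out to the next stage size.
            while (deployed < Math.Min(target, servers.Count))
                DeployTo(servers[deployed++]);

            // Validate before going wider; halt (and presumably roll back) on failure.
            if (!StageIsHealthy(deployed))
            {
                Console.WriteLine("Stage of {0} servers failed validation; halting roll-out.", deployed);
                return;
            }
        }

        // Once the staged validation passes, finish the rest of the site.
        while (deployed < servers.Count)
            DeployTo(servers[deployed++]);
    }

    static void DeployTo(string server)
    {
        // Stand-in for the automated "deployment of bits" described above.
    }

    static bool StageIsHealthy(int serversDeployed)
    {
        // Stand-in for monitoring the partially upgraded site.
        return true;
    }
}
```

The key property of a loop like this is that the blast radius of a bad build grows slowly: a failure at the 10-server stage never reaches the other thousands of machines, which is also why the version N / N+1 compatibility requirement exists while the stages are in flight.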

Another hard problem is load balancing across the site. That is, ensuring that user transactions and storage capacity are equally distributed over all the nodes in the system without any particular set of nodes getting too hot.
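One simple way to picture that placement decision is a sketch that assigns each new mailbox to the node with the most headroom. The node list and the scoring below are purely illustrative, not Hotmail's actual algorithm; a real system would also weigh CPU and I/O and would rebalance existing users over time.

```csharp
// A toy sketch of the placement decision described above: put each new mailbox
// on the node with the most headroom so that no particular set of nodes gets
// too hot. The node list and scoring are illustrative, not Hotmail's algorithm.
using System;
using System.Collections.Generic;

class Node
{
    public string Name;
    public int MailboxCount;   // rough proxy for transaction load
    public long BytesUsed;     // rough proxy for storage load

    public Node(string name, int mailboxes, long bytesUsed)
    {
        Name = name;
        MailboxCount = mailboxes;
        BytesUsed = bytesUsed;
    }
}

class Placement
{
    static Node PickLeastLoaded(List<Node> nodes)
    {
        Node best = nodes[0];
        foreach (Node n in nodes)
        {
            // Prefer fewer mailboxes; break ties on storage used.
            if (n.MailboxCount < best.MailboxCount ||
               (n.MailboxCount == best.MailboxCount && n.BytesUsed < best.BytesUsed))
            {
                best = n;
            }
        }
        return best;
    }

    static void Main()
    {
        List<Node> nodes = new List<Node>();
        nodes.Add(new Node("store01", 120000, 900000000000L));
        nodes.Add(new Node("store02", 80000, 600000000000L));
        nodes.Add(new Node("store03", 150000, 1100000000000L));

        Console.WriteLine("Place new mailbox on " + PickLeastLoaded(nodes).Name);
    }
}
```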

From the blog post entitled Issues with .NET Frameworks 2.0 by Walter Hsueh:

Our team is tackling the scale issues, delving deep into the CLR and understanding its behavior.  We've identified at least two issues in .NET Frameworks 2.0 that are "low-hanging fruit", and are hunting for more.

1a)  Regular Expressions can be very expensive.  Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior.  We've taken several hotfixes for this.  RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.
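For illustration, here is a small, self-contained repro of the kind of exponential backtracking described above. The pattern and input are made up for the example; they are not taken from the Windows Live Mail code.

```csharp
// A small repro of catastrophic (exponential) regex backtracking.
// The pattern and input are illustrative, not from the Windows Live Mail code base.
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class RegexBlowup
{
    static void Main()
    {
        // Nested quantifiers plus an input that can never match force the engine
        // to try an exponential number of ways to split the run of 'a's.
        Regex rx = new Regex("^(a+)+$");

        for (int n = 10; n <= 26; n += 4)
        {
            string input = new string('a', n) + "!";   // trailing '!' guarantees a failed match

            Stopwatch sw = Stopwatch.StartNew();
            bool matched = rx.IsMatch(input);
            sw.Stop();

            // Each additional 'a' roughly doubles the time to reject the input.
            Console.WriteLine("n={0,2}  matched={1}  elapsed={2} ms", n, matched, sw.ElapsedMilliseconds);
        }
    }
}
```

Later versions of the .NET Framework (4.5 and up) added regex match timeouts; in the 2.0 time frame the practical defenses were rewriting patterns to avoid nested quantifiers and validating or bounding inputs before matching.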

1b)  Designing an AJAX-style browser application (like most engineering problems) involves trading one problem for another. We can choose to shift the application burden from the client onto the server. In the case of RegExes, it might make sense to move them to the client (where CPU can be freely used) instead of having them run on the server (where you have to share). Windows Live Mail made this tradeoff in one case.

2)  Managed Thread Local Storage (TLS) is expensive. There is a global lock in the Whidbey RTM implementation of Thread.GetData/Thread.SetData which causes scalability issues. The recommendation is to use the [ThreadStatic] attribute on static class variables. Our RPS went up, our CPU % went down, context switches dropped by 50%, and lock contentions dropped by over 80%. Good stuff.
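A minimal sketch contrasting the two approaches is below, assuming a simple per-thread counter as the state. This is not code from the service; the slot-based path is only there to show what the [ThreadStatic] field replaces.

```csharp
// Contrasting slot-based TLS (Thread.GetData/SetData) with a [ThreadStatic] field.
// The per-thread counter is a placeholder for whatever per-thread state a real
// service would keep.
using System;
using System.Threading;

class TlsExample
{
    // Slot-based TLS: in the Whidbey (2.0) RTM bits this path takes a shared lock,
    // which is where the contention under load came from.
    static LocalDataStoreSlot slot = Thread.AllocateDataSlot();

    // Attribute-based TLS: each thread sees its own copy of this static field,
    // with no locking on the hot path.
    [ThreadStatic]
    static int requestCount;

    static void SlotBased()
    {
        object current = Thread.GetData(slot);
        int count = (current == null) ? 0 : (int)current;
        Thread.SetData(slot, count + 1);
    }

    static void AttributeBased()
    {
        requestCount++;   // per-thread; no other thread observes this value
    }

    static void Main()
    {
        for (int i = 0; i < 3; i++)
        {
            new Thread(delegate()
            {
                for (int j = 0; j < 1000; j++) { SlotBased(); AttributeBased(); }
                Console.WriteLine("thread {0}: slot={1}, [ThreadStatic]={2}",
                    Thread.CurrentThread.ManagedThreadId, Thread.GetData(slot), requestCount);
            }).Start();
        }
    }
}
```

One caveat worth noting: a field initializer on a [ThreadStatic] field runs only for the first thread that touches the type, so per-thread state should be lazily initialized in code rather than in the declaration.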

Our devs have also started migrating some of our services to Whidbey, and they have found some interesting performance issues as well. It would probably be a good idea to put together some sort of "lessons learned while building mega-scale services on the .NET Framework" article.