Operations Expertise is the Secret Sauce of Web Development

July 10, 2006

@ 09:34 PM

Tim O'Reilly has a blog post entitled Operations: The New Secret Sauce, in which he summarizes an interview he had with Debra Chrapaty, the VP of Operations for Windows Live. He writes:

People talk about "cloud storage" but Debra points out that that means servers somewhere, hundreds of thousands of them, with good access to power, cooling, and bandwidth. She describes how her "strategic locations group" has a "heatmap" rating locations by their access to all these key limiting factors, and how they are locking up key locations and favorable power and bandwidth deals. And as in other areas of real estate, getting the good locations first can matter a lot. She points out, for example, that her cost of power at her Quincy, WA data center, soon to go online, is 1.9 cents per kwh, versus about 8 cents in CA. And she says, "I've learned that when you multiply a small number by a big number, the small number turns into a big number." Once Web 2.0 becomes the norm, the current demands are only a small foretaste of what's to come. For that matter, even server procurement is "not pretty" and there will be economies of scale that accrue to the big players. Her belief is that there's going to be a tipping point in Web 2.0 where the operational environment will be a key differentiator
...
Internet-scale applications are really the ones that push the envelope with regard not only to performance but also to deployment and management tools. And the Windows Live team works closely with the Windows Server group to take their bleeding edge learning back into the enterprise products. By contrast, one might ask, where is the similar feedback loop from sites like Google and Yahoo! back into Linux or FreeBSD?

This is one of those topics I've been wanting to blog about for a while. I think somewhere along the line at ~~MSN~~ Windows Live we realized there was more bang for the buck in optimizing some of our operations characteristics, such as power consumption per server, the number of servers per data center, cost per server, etc., than in whatever improvements we could make in code or via database optimizations. It's also been quite eye opening how much stuff we had to roll on our own because it isn't a standard part of a "platform". I remember talking to a coworker about all the changes we were making so that MSN Spaces could be deployed in multiple data centers, and he asked why we didn't get this for free from "the platform". I jokingly responded, "It isn't like the .NET Framework has a RouteThisUserToTheRightDataCenterBasedOnTheirGeographicalLocation() API, does it?"
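To make the joke concrete, here is a minimal sketch of what a hand-rolled version of such a call might look like. Everything in it is hypothetical: the data center names, their approximate coordinates, and the "pick the nearest site by great-circle distance" policy are assumptions for illustration, not a description of how MSN Spaces routes users.

```python
import math

# Hypothetical data centers with approximate (latitude, longitude) in degrees.
# The names and locations are illustrative assumptions only.
DATA_CENTERS = {
    "us-west": (47.23, -119.85),
    "europe": (53.35, -6.26),
    "asia": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def route_user_to_data_center(user_location, data_centers=DATA_CENTERS):
    """Return the name of the data center geographically closest to the user."""
    return min(data_centers,
               key=lambda name: great_circle_km(user_location, data_centers[name]))


if __name__ == "__main__":
    # A user near Seattle and a user near London.
    print(route_user_to_data_center((47.61, -122.33)))
    print(route_user_to_data_center((51.51, -0.13)))
```

Even this toy version hides the hard parts: figuring out where the user actually is, deciding where their data lives, and handling a data center that is full or down. None of that comes for free from a general-purpose framework.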

I now also give mad props to some of our competitors for what used to seem like quirkiness but is now clearly a great deal of operational savviness. There is a reason why Google builds its own servers. When I read things like "One-third of the electricity running through a typical power supply leaks out as heat" I get quite upset, and it now seems totally reasonable to build your own power supplies to get around such waste. Unfortunately, there doesn't seem to be a lot of knowledge out there about building and managing a large-scale, globally distributed server infrastructure. However, as Debra states, we are feeding a lot of our learnings back to the folks building enterprise products at Microsoft (e.g. our team now collaborates a lot with the Windows Communication Foundation team), which is great for developers building on Microsoft platforms.
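To put rough numbers on Debra's "small number times a big number" remark and on the power supply quote, here is a back-of-the-envelope sketch. The 1.9 and 8 cents per kWh rates and the one-third loss figure come from the quotes above; the server count and the per-server power draw are made-up assumptions, chosen only to show how the arithmetic scales.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(servers, watts_per_server, cents_per_kwh):
    """Yearly electricity cost in dollars for a fleet running around the clock."""
    kwh_per_year = servers * watts_per_server / 1000 * HOURS_PER_YEAR
    return kwh_per_year * cents_per_kwh / 100


def cost_lost_in_power_supplies(annual_cost, loss_fraction=1 / 3):
    """Dollars per year spent on electricity that leaks out as heat."""
    return annual_cost * loss_fraction


# Hypothetical fleet: both numbers are assumptions, not Windows Live figures.
SERVERS = 100_000
WATTS_PER_SERVER = 200

quincy = annual_power_cost(SERVERS, WATTS_PER_SERVER, 1.9)      # rate quoted for Quincy, WA
california = annual_power_cost(SERVERS, WATTS_PER_SERVER, 8.0)  # "about 8 cents in CA"

if __name__ == "__main__":
    print(f"Quincy:     ${quincy:,.0f} per year")
    print(f"California: ${california:,.0f} per year")
    print(f"Difference: ${california - quincy:,.0f} per year")
    print(f"Lost as heat at the California rate: "
          f"${cost_lost_in_power_supplies(california):,.0f} per year")
```

Under these assumptions, a difference of about six cents per kWh becomes a difference of millions of dollars a year, which is why data center location and power supply efficiency start to matter more than another round of code tuning.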