Tim O'Reilly has a blog post entitled "Operations: The New Secret Sauce" in which he summarizes an interview he had with Debra Chrapaty, the VP of Operations for Windows Live. He writes:

People talk about "cloud storage" but Debra points out that that means servers somewhere, hundreds of thousands of them, with good access to power, cooling, and bandwidth. She describes how her "strategic locations group" has a "heatmap" rating locations by their access to all these key limiting factors, and how they are locking up key locations and favorable power and bandwidth deals. And as in other areas of real estate, getting the good locations first can matter a lot. She points out, for example, that her cost of power at her Quincy, WA data center, soon to go online, is 1.9 cents per kwh, versus about 8 cents in CA. And she says, "I've learned that when you multiply a small number by a big number, the small number turns into a big number." Once Web 2.0 becomes the norm, the current demands are only a small foretaste of what's to come. For that matter, even server procurement is "not pretty" and there will be economies of scale that accrue to the big players. Her belief is that there's going to be a tipping point in Web 2.0 where the operational environment will be a key differentiator.
Internet-scale applications are really the ones that push the envelope with regard not only to performance but also to deployment and management tools. And the Windows Live team works closely with the Windows Server group to take their bleeding edge learning back into the enterprise products. By contrast, one might ask, where is the similar feedback loop from sites like Google and Yahoo! back into Linux or FreeBSD?

This is one of those topics I've been wanting to blog about for a while. I think somewhere along the line at MSN (now Windows Live) we realized there was more bang for the buck in optimizing our operations characteristics, such as reducing power consumption per server, increasing the number of servers per data center, and reducing cost per server, than in whatever improvements we could make in code or via database optimizations. It's also been quite eye-opening how much stuff we had to roll on our own because it just isn't a standard part of a "platform". I remember talking to a coworker about all the changes we were making so that MSN Spaces could be deployed in multiple data centers, and he asked why we didn't get this for free from "the platform". I jokingly responded, "It isn't like the .NET Framework has a RouteThisUserToTheRightDataCenterBasedOnTheirGeographicalLocation() API, does it?"
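For the curious, here is a minimal sketch of what that imaginary API might boil down to. It is only an illustration of the idea, not how MSN Spaces actually routes users: the function name, the data center list, and the coordinates are all hypothetical, and it simply picks the data center with the shortest great-circle distance to the user. A real system would also have to account for capacity, failover, and where the user's data already lives.

```python
import math

# Hypothetical data centers with approximate (latitude, longitude) in degrees.
# These are illustrative placements, not an actual deployment map.
DATA_CENTERS = {
    "quincy-wa": (47.23, -119.85),
    "dublin": (53.35, -6.26),
    "singapore": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def route_user_to_data_center(user_location, data_centers=DATA_CENTERS):
    """Return the name of the data center geographically closest to the user."""
    return min(
        data_centers,
        key=lambda name: haversine_km(user_location, data_centers[name]),
    )


if __name__ == "__main__":
    # Approximate coordinates for a user in London.
    print(route_user_to_data_center((51.5, -0.13)))
```

Even this toy version assumes you already know where the user is, which is its own piece of infrastructure you don't get for free.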

I now also give mad props to some of our competitors for what used to seem like quirkiness but now is clearly a great deal of operational savviness. There is a reason why Google builds their own servers. When I read things like "One-third of the electricity running through a typical power supply leaks out as heat" I get quite upset, and I now see it as totally reasonable to build your own power supplies to get around such waste. Unfortunately, there doesn't seem to be a lot of knowledge out there about building and managing a large-scale, globally distributed server infrastructure. However, we are feeding a lot of our learnings back to the folks building enterprise products at Microsoft (e.g. our team now collaborates a lot with the Windows Communication Foundation team), as Debra states, which is great for developers building on Microsoft platforms.
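To put Debra's "small number times a big number" remark into rough figures, here is a back-of-the-envelope sketch. The two power rates (1.9 and 8 cents per kWh) and the one-third loss figure come from the quotes above. The fleet size of 100,000 servers and the 250 W average draw per server are assumptions picked purely for illustration, as is the simplification that every server runs flat out all year.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap years


def annual_power_cost_usd(servers, watts_per_server, cents_per_kwh):
    """Yearly electricity bill in dollars for a fleet running around the clock."""
    kwh_per_year = servers * watts_per_server / 1000 * HOURS_PER_YEAR
    return kwh_per_year * cents_per_kwh / 100


def wasted_cost_usd(annual_cost_usd, loss_fraction=1 / 3):
    """Portion of the bill paying for power lost as heat in the power supply."""
    return annual_cost_usd * loss_fraction


if __name__ == "__main__":
    servers = 100_000        # assumed fleet size
    watts_per_server = 250   # assumed average draw per server

    quincy = annual_power_cost_usd(servers, watts_per_server, 1.9)
    california = annual_power_cost_usd(servers, watts_per_server, 8.0)

    print(f"Quincy, WA: ${quincy:,.0f} per year")
    print(f"California: ${california:,.0f} per year")
    print(f"Difference: ${california - quincy:,.0f} per year")
    print(f"Lost as heat at the CA rate: ${wasted_cost_usd(california):,.0f} per year")
```

With these assumed numbers the bill comes to roughly $4.2 million a year at the Quincy rate versus about $17.5 million at the California rate, and a third of either figure is paying for heat rather than computation.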


Monday, July 10, 2006 11:16:48 PM (GMT Daylight Time, UTC+01:00)
Thanks for the props Dare. I've worked in operations at Microsoft since 1999 and most people outside of the group are truly amazed at what is involved.
Tuesday, July 11, 2006 1:48:30 PM (GMT Daylight Time, UTC+01:00)
The power supply issue in particular is yet another of those things that mainframe people have thought about hard for years and that PC people have finally realized "Hey! when you scale up enough, this becomes something to plan around!"
Tuesday, July 11, 2006 2:52:03 PM (GMT Daylight Time, UTC+01:00)
Hmmm, I take it you have comment moderation on... I don't actually see any note here to that effect, though.
Daniel Walker
Tuesday, July 11, 2006 3:05:01 PM (GMT Daylight Time, UTC+01:00)
5 years ago, IBM were making this very point. They were replacing hundreds of Sun small-box servers with individual mainframes:
About that time, Scott McNealy was talking his "big talk" about meeting the needs of customers who measured their computing needs "by the acre". A lot of this was disregarded, since the bellwether companies of the Web were simply building out vast clusters of commodity PC-based servers (and in this, I actually include Google, which was bragging about its ability to quickly add server capacity to its network this way).

In my opinion, the companies that have been grappling with computing on a global scale for decades, such as banks and travel agencies, don't attract the same glamour, because people just expect the cash machine to work, or the plastic card to handle the transaction, without even noticing that what they are using is one of the best computer interfaces in the world (they don't even notice that they are using it, or think of it as a computer interface). Fancy interfaces get lengthy articles written about them: *great* computer interfaces disappear into the background and become indispensable.

There's actually more work involved in becoming invisible. To level-up to the kind of massive processing, where the service you are offering has vanished into the infrastructure, you need to work with large scale computing as a 'given'.

Yours, crankily
P.S. I've cut this comment's length down a bit: is there a size limit, Dare?
Daniel Walker
Tuesday, July 11, 2006 3:41:09 PM (GMT Daylight Time, UTC+01:00)
I don't have comment moderation enabled. If there is a size limit on comments, I'm not aware of it but will take a look at my blog software's configuration options later today.
Tuesday, July 11, 2006 4:49:42 PM (GMT Daylight Time, UTC+01:00)
Cool. BTW, /please/ don't read that original article as some sort of ABM troll: it's not. The original story is ancient news, and MSFT have had this sort of thing possible with virtual instances of Windows Server running on comparable machines like the Unisys ES7000 for just as long.

I was just making a point that the need to 'build big', and consider using server virtualisation, to reduce the actual machine count, and energy usage, are both things that have long been recognised in the number-crunching industries.
Daniel Walker
Tuesday, July 11, 2006 8:21:45 PM (GMT Daylight Time, UTC+01:00)
While I totally agree that ops plays and will continue to play a huge role, I came to the very opposite conclusion: you want to be on open source, where zillions of people can optimize, and are optimizing, for various situations. With Microsoft, you're stuck with whatever Microsoft decides to provide, which doesn't always (in fact, rarely) match up with user needs.

Can anyone even imagine Google, Yahoo or Amazon running anything at scale on Microsoft?
Tuesday, July 11, 2006 10:23:08 PM (GMT Daylight Time, UTC+01:00)
Daniel, I think what you were noticing was that if you write a long comment or spend time reading comments, your server session expires and the CAPTCHA code you enter is rejected. However, your text should be intact and you can just enter the new CAPTCHA.
Comments are closed.