December 7, 2005
@ 07:48 PM

The folks at 37 Signals have an insightful blog post entitled Don’t scale: 99.999% uptime is for Wal-Mart, which states:

Jeremy Wright purports a common misconception about new companies doing business online: That you need 99.999% uptime or you’re toast. Not so. Basecamp doesn’t have that. I think our uptime is more like 98% or 99%. Guess what, we’re still here!

Wright correctly states that those final last percent are incredibly expensive. To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9% tens of thousands more. Now contrast that with the value. What kind of service are you providing? Does the world end if you’re down for 30 minutes?

If you’re Wal-Mart and your credit card processing pipeline stops for 30 minutes during prime time, yes, the world does end. Someone might very well be fired. The business loses millions of dollars. Wal-Mart gets in the news and loses millions more on the goodwill account.

Now what if Delicious, Feedster, or Technorati goes down for 30 minutes? How big is the inconvenience of not being able to get to your tagged bookmarks or do yet another ego-search with Feedster or Technorati for 30 minutes? Not that high. The world does not come to an end. Nobody gets fired.

Scalability issues are probably the most difficult to anticipate and mitigate when building a web application. When we first shipped MSN Spaces last year, I assumed that we'd be lucky if we became as big as LiveJournal; I never expected that we'd grow to be three times as big and three times as active within a year. We've had our growing pains, and it's definitely been surprising at times to find out which parts of the service get the most use and thus need the most optimization.

The fact is that everyone has scalability issues; no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture. Even the much vaunted Google has had these problems: just look at the reviews of Google Reader that called it excruciatingly slow, or the complaints that Google Analytics was so slow as to be unusable.

If you are a startup, don't waste your time and money worrying about what happens when you have millions of users. Premature optimization is the root of all evil, and in certain cases it will lead you to be more conservative than you should be when designing features. Remember, even the big guys deal with scalability issues.



Wednesday, 07 December 2005 21:59:40 (GMT Standard Time, UTC+00:00)
Dare, any startup that takes your advice guarantees it will die. There are two scenarios:

1. Don't attract lots of users and die of loneliness.

2. Attract lots of users and crater under the load.

It isn't premature optimization to design your systems to scale. It is, in fact, the only way to succeed.
Wednesday, 07 December 2005 23:36:02 (GMT Standard Time, UTC+00:00)
Emphatically disagree with Greg. Services rarely if ever fail because their creators aren't able to scale them in time. Services fail all the time because they can't find paying customers.

Most startups can only dream of having scalability issues. Fact is, just-in-time scaling is relatively easy to deliver. It's over-engineering in the beginning that can lead to failure.
pwb
Wednesday, 07 December 2005 23:40:10 (GMT Standard Time, UTC+00:00)
I'm going to side with Dare on this one. A surprisingly large portion of up-front scalability work on most projects is ultimately wasted because the sad reality is that system designers have a hard time anticipating which scalability issues are important. Instead, they waste development time by trying to anticipate potential issues up front, hoping that if the issues do appear, their "high-performance, scalable architecture" will make them easy to solve.

Often, however, the real problems are unanticipated, and the scalable architecture turns out to be confining. Worse, the complexity burden (or "tax") imposed by the underlying architecture must be paid for the lifetime of the project and slows down all future development.

Wouldn't it be more sensible to build a straightforward implementation that gets the features right *before* worrying about performance? That way, you'll have lots of developer cycles left over from *not* building something more complex, and these cycles can be invested where they count -- on real bottlenecks, exposed by real measurements and real user behavior.
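To put "real bottlenecks, exposed by real measurements" in concrete terms, here is a minimal sketch of my own (not something from the comment above; handle_request is a hypothetical stand-in for an application's actual code path) of letting a profiler, rather than guesswork, decide where optimization effort goes:

import cProfile
import pstats

def handle_request():
    # Hypothetical stand-in for the application's real request-handling path.
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    handle_request()  # exercise the code under a representative workload
profiler.disable()

# Report the functions that actually consume the most time; those are the
# real bottlenecks worth spending developer cycles on.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

The measurement comes first and the optimization second, which is the ordering being argued for here.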
Thursday, 08 December 2005 01:40:57 (GMT Standard Time, UTC+00:00)
Re-reading the posts and looking at the comments, I think we all may be arguing for roughly the same thing here but not realizing it.

Clearly, the extreme of spending many months and draining piles of cash on a 99.999% uptime system with no users is absurd.

Likewise, the extreme of just hacking away on a twisted mass of PHP blissfully ignorant of how to make the site survive if it does get popular is also absurd.

Findory is built on cheap commodity hardware. We are constantly launching changes to the site, learning what works and what doesn't. The underlying architecture is constantly being adapted, refactored, and modified.

But we also are thinking about the future. We do consider how new features will perform over time and make sure we don't paint ourselves into a corner.

As a result, our uptime is exceptionally good even with our rapid growth. The last downtime I can remember for Findory was a network problem that caused 60 minutes of downtime on August 28.

I'm not arguing for excessively detailed planning, tome-like design documents, or massively redundant hardware. But I do think that a little time spent considering how features will perform and scale is time well spent.
Thursday, 08 December 2005 02:07:59 (GMT Standard Time, UTC+00:00)
Greg, +1.

The "build it now and we'll fix it later" approach that seems to be suggested here is just scary.
Of course, I work for a company that specialises in moving software off of mainframes and onto Windows and Unix servers, so scalability is a HUGE deal (it's normally question 1 or 2, with security being the other) and I may be slightly biased. But just because you're not a bank or an insurance company doesn't mean you won't be the next Flickr or MySpace.

99.999% is about five minutes of downtime a year; that's a big target that most web applications don't need to match. But to say "scaling is easy, don't worry about it" is just plain irresponsible. That's how you build crappy software that has to be constantly changed over time to support the growth of your company.
Upfront benchmarking, capacity planning, and a solid architecture aren't 'nice to haves'; they should be crucial to the design and ongoing implementation of a product.
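For reference, here is a quick back-of-the-envelope sketch of my own (not part of the comment above) of what each uptime level actually allows in downtime per year:

# Downtime budget implied by each uptime percentage, per (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for uptime in (98.0, 99.0, 99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (100.0 - uptime) / 100.0
    print("%7.3f%% uptime -> %8.1f minutes (~%.1f hours) of downtime per year"
          % (uptime, downtime_minutes, downtime_minutes / 60.0))

By that arithmetic, 99% uptime allows roughly 88 hours of downtime a year while 99.999% allows about five minutes, which is exactly why the last nines are the expensive ones.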

Tom's comments about building the features and not bothering with performance don't seem to jibe with the statements on his website *at all*, and they scare the hell out of me if he really believes them.
Throwing features into a product without careful consideration of how they scale, and how they affect the performance of the overall product, is crazy; and having "lots of developer cycles left over from *not* building something more complex" to throw at the problem later is a terrible solution.
Thursday, 08 December 2005 05:25:09 (GMT Standard Time, UTC+00:00)
Ian, you're arguing against a straw-man version of my argument. To be clear, I'm saying that investing development cycles in optimization without due consideration for the costs of doing so is wasteful, especially when the optimizations are targeting problems that may very well be imaginary.

Most startups die because they run out of money. Making sure that every development dollar buys something *real* is important. Money spent on optimizations that solve nonexistent performance problems is wasted -- it adds no value and only increases the burn rate.

Startups ought to spend their money where it counts. Early on, what counts most is getting the application *right* so that people might actually want to use it. Later, if the application is good enough to draw users, investments in performance become more likely to pay off.

In sum, I'm not saying you should ignore performance. Rather, I'm saying you shouldn't invest in it until you have a reasonable expectation that your investment will pay off.
Thursday, 08 December 2005 08:43:03 (GMT Standard Time, UTC+00:00)
I think we all encounter scaling issues; nobody is excellent at tying down the spots, I think, so I agree with Tom.
professional web design
Thursday, 08 December 2005 09:11:10 (GMT Standard Time, UTC+00:00)
"the extreme of just hacking away on a twisted mass of PHP blissfully ignorant of how to make the site survive if it does get popular is also absurd. "

While I doubt the CD Baby guys are blissfully ignorant of how to scale (not that I think you can be blissfully ignorant when your system is going down all the time; that is agonisingly ignorant), they did have the mass of twisted PHP and have ended up re-writing it all in Rails: http://www.oreillynet.com/pub/wlg/8274

So there is a real world example.
Thursday, 08 December 2005 19:59:16 (GMT Standard Time, UTC+00:00)
The thing I'm not getting from this discussion is exactly where the investment/payoff ratio is high. What is the corner that you cut to get 99.9% instead of 99.99% and save a wad of cash?

In the Web application space, reliability and scalability are usually achieved by commodity servers and load balancing. So the cost there is simply buying too many servers or more expensive load balancing ($ for throughput). I understand the database backend scaling problem gets glossed over, even though it is very critical and adds a significant cost.

But once the application is designed for this situation, there shouldn't be additional development cost after the second server is added. So is the point that we shouldn't design the application for multiple servers until we have to? Or is there another big cost that jumps into the mix after the second server?
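As a rough sketch of my own of the web-tier case Walt describes (the host names are made up): if the application keeps no per-server session state, spreading requests across identical commodity boxes is just a dispatch loop, and adding capacity is adding one more entry to the pool.

from itertools import cycle

class RoundRobinBalancer:
    """Distributes requests across a pool of identical, stateless backends."""
    def __init__(self, backends):
        self._pool = cycle(backends)

    def pick(self):
        # Any backend can serve any request because no session state lives on
        # a particular server; scaling out means growing the backends list.
        return next(self._pool)

balancer = RoundRobinBalancer(["app1:8080", "app2:8080"])  # hypothetical hosts
for request_id in range(4):
    print("request", request_id, "->", balancer.pick())

The part this sketch deliberately leaves out is the shared database behind those servers, which is where the glossed-over cost Walt mentions usually shows up.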

If the concern is premature tuning against large loads, then we have a fuzzy problem. Is the application database-bound? What about the pipeline (where the application fronts a document or map image server, for example). Maybe the tuning cost recovers pretty well if just a few users will bring the system to its knees without it.

I remember hearing tuning tales from a guy that was a dev lead at Match.com. Their concern was always just staying ahead of the demand, both in hardware and software. And that was job one.
Thursday, 08 December 2005 22:12:22 (GMT Standard Time, UTC+00:00)
Walt, you've inadvertently answered the question. It's difficult to determine what your scaling issues are going to be before you actually hit them. So it's best to spend your energy on more important items such as figuring out how to acquire customers and make money. You can scale just-in-time. It's pretty easy.

WRT CD Baby, it's almost always better to just clean up your existing system rather than re-write it from scratch. It's only advisable to re-write an app from scratch if you are looking for an exercise or have a lot of surplus resources.
pwb
Sunday, 11 December 2005 03:56:16 (GMT Standard Time, UTC+00:00)
I never said companies had to build out multi-million dollar infrastructures. Nor did I say that 5 9s was the target every company should hit. Merely that IF your business lives or dies by its uptime, you'd damned well better be able to stay up.

Nobody intelligent argues for companies building for millions of users when they can't even get one. Nor do I think anybody intelligent would say "build it fast, screw the future as long as we can save a day or two now".

But after several months of working with major companies on scalability issues I just had to rant a bit. If your business NEEDS to scale, then build it with that in mind :)
Wednesday, 14 December 2005 14:43:33 (GMT Standard Time, UTC+00:00)
I think it also depends on where the company is coming from. If MSN is launching a service, sure, they don't expect to grow as big as a major player in a short period of time, but with the kind of audience reach they have at their fingertips it's definitely a strong possibility that things just might take off. Contrast that with a one-man operation coding a web application in his spare time at night; he might look at things a little differently.