Via Mark Pilgrim, I stumbled on an article by Scott Loganbill entitled "Google’s Open Source Protocol Buffers Offer Scalability, Speed" which contains the following excerpt:

The best way to explore Protocol Buffers is to compare it to its alternative. What do Protocol Buffers have that XML doesn’t? As the Google Protocol Buffer blog post mentions, XML isn’t scalable:

"As nice as XML is, it isn’t going to be efficient enough for [Google’s] scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy."

We’ve never had to deal with XML in a scale where programming for it would become unwieldy, but we’ll take Google’s word for it.

Perhaps the biggest value-add of Protocol Buffers to the development community is as a method of dealing with scalability before it is necessary. The biggest developing drain of any start-up is success. How do you prepare for the onslaught of visitors companies such as Google or Twitter have experienced? Scaling for numbers takes critical development time, usually at a juncture where you should be introducing much-needed features to stay ahead of competition rather than paralyzing feature development to keep your servers running.

Over time, Google has tackled the problem of communication between platforms with Protocol Buffers and data storage with Big Table. Protocol Buffers is the first open release of the technology making Google tick, although you can utilize Big Table with App Engine.

It is unfortunate that it is now commonplace for people to throw around terms like "scaling" and "scalability" in technical discussions without actually explaining what they mean. Having a Web application that scales means that your application can handle becoming popular, or being more popular than it is today, in a cost-effective manner. Depending on your class of Web application, there are different technologies that have been proven to help Web sites handle significantly higher traffic than they normally would. However, there is no silver bullet.

The fact that Google uses MapReduce and BigTable to solve problems in a particular problem space does not mean those technologies work well in others. MapReduce isn't terribly useful if you are building an instant messaging service. Similarly, if you are building an email service you want an infrastructure based on message queuing, not BigTable. A binary wire format like Protocol Buffers is a smart idea if your application's bottleneck is network bandwidth or the CPU cost of serializing and deserializing XML. As part of building their search engine, Google has to cache a significant chunk of the World Wide Web and then perform data-intensive operations on that data. In Google's scenarios, the network bandwidth utilized when transferring the massive amounts of data they process can actually be the bottleneck. Hence inventing a technology like Protocol Buffers became a necessity. However, that isn't Twitter's problem, so a technology like Protocol Buffers isn't going to "help them scale". Twitter's problems have been clearly spelled out by the development team, and nowhere is network bandwidth called out as a culprit.
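To make the bandwidth point concrete, here is a minimal sketch of the difference between a text-based and a binary encoding of the same message. It does not use Protocol Buffers itself; Python's struct module stands in for a binary wire format, and the message fields are invented purely for illustration.

    import struct
    import xml.etree.ElementTree as ET

    # A hypothetical "status update" message; the field names and values are
    # invented for illustration, not taken from any real service.
    status = {"id": 41, "user_id": 1007, "text": "Technologies don't scale, services do."}

    # Text-based representation: the equivalent XML document.
    root = ET.Element("status")
    ET.SubElement(root, "id").text = str(status["id"])
    ET.SubElement(root, "user_id").text = str(status["user_id"])
    ET.SubElement(root, "text").text = status["text"]
    xml_bytes = ET.tostring(root)

    # Binary representation: two unsigned 32-bit integers followed by the
    # length-prefixed UTF-8 text, roughly what a binary wire format does.
    text_bytes = status["text"].encode("utf-8")
    binary_bytes = struct.pack("<IIH", status["id"], status["user_id"], len(text_bytes)) + text_bytes

    print(len(xml_bytes), "bytes as XML")        # markup plus field values
    print(len(binary_bytes), "bytes as binary")  # field values only

On a toy message like this the markup roughly doubles the number of bytes on the wire, a constant factor that only matters if, as in Google's case, bandwidth and serialization CPU are actually your bottleneck.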

Almost every technology that has been loudly proclaimed as unscalable by some pundit on the Web is being used by a massively popular service in some context. Relational databases don't scale? Well, eBay seems to be doing OK. PHP doesn't scale? I believe it scales well enough for Facebook. Microsoft technologies aren't scalable? MySpace begs to differ. And so on…

If someone tells you "technology X doesn't scale" without qualifying that statement, it often means the person either doesn't know what he is talking about or is trying to sell you something. Technologies don't scale, services do. Thinking you can just sprinkle a technology on your service and make it scale is the kind of thinking that led Blaine Cook (former architect at Twitter) to publish a presentation on Scaling Twitter which claimed their scaling problems were solved with their adoption of memcached. That was in 2007. In 2008, let's just say the Fail Whale begs to differ.

If a service doesn't scale it is more likely due to bad design than to technology choice. Remember that.

Now Playing: Zapp & Roger - Computer Love


 

Monday, 14 July 2008 17:08:33 (GMT Daylight Time, UTC+01:00)
Great post, and I agree 100%. So often, when someone chooses a technology and it doesn't behave as they expected, they instantly blame the technology instead of how they implemented it.

We see this all the time, regardless of whether it's Rails or ASP.NET or whatever technology. I love it when the pundits say Rails doesn't scale and folks come out and prove it does.

Monday, 14 July 2008 20:14:17 (GMT Daylight Time, UTC+01:00)
FWIW, since some scheduled downtime last week, Twitter seems pretty solid now. At least compared to a few weeks ago.
Monday, 14 July 2008 21:06:45 (GMT Daylight Time, UTC+01:00)
I agree that Protocol Buffers isn't inherently more "scalable" than XML, in that the difference in overhead costs is mostly going to be a constant factor for small documents. However, I disagree that Protocol Buffers doesn't hold value for Twitter. Certainly they don't need Protocol Buffers on their front end, but how can you totally dismiss the merits of faster server-to-server communication in a high-availability setting?
Alexander Fairley
Tuesday, 15 July 2008 01:48:43 (GMT Daylight Time, UTC+01:00)
Certainly brief articles (such as this one) beg a lot of questions, such as "what do you mean when you say 'scalability'". Of course, answering all of the questions that would be "begged" would end the appearance of short articles. However, it does seem that a simple comment or two would help in most cases, so that readers would at least have a "hook" to prompt searching for more discussion.

In this case, perhaps such a simple comment could have addressed the simple fact that XML documents tend to involve pushing a lot more bytes over the wire, into memory, or onto a storage device than the equivalent binary representation. Also, processing XML via the DOM further adds to the burden since the entire document is loaded into memory and parsed into an object representation. XSLT and other XML-related technologies often make this much worse, while also turning simple problems into massive undertakings (boy, could I tell you some stories...) due to complicated syntax and various limitations.

This overhead can be measured and compared, in terms of storage and processing cycles. Indeed, Protocol Buffers may be an ideal complement to XML for judging such issues: implement a communication protocol with both technologies and compare the results of transporting the same payload each way under realistic conditions.
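A rough sketch of what such a comparison could look like, again in Python with the struct module standing in for a binary format rather than actual Protocol Buffers, and with an invented record purely for illustration; it measures both payload size and decode time:

    import struct
    import timeit
    from xml.dom import minidom

    # The same invented record in a text-based and a binary representation.
    xml_doc = b"<record><id>42</id><name>example</name><score>97</score></record>"
    name = b"example"
    binary_doc = struct.pack("<IH", 42, len(name)) + name + struct.pack("<I", 97)

    def decode_xml():
        dom = minidom.parseString(xml_doc)  # builds the whole DOM tree in memory
        return (
            int(dom.getElementsByTagName("id")[0].firstChild.data),
            dom.getElementsByTagName("name")[0].firstChild.data,
            int(dom.getElementsByTagName("score")[0].firstChild.data),
        )

    def decode_binary():
        rec_id, name_len = struct.unpack_from("<IH", binary_doc, 0)
        text = binary_doc[6:6 + name_len].decode("utf-8")
        (score,) = struct.unpack_from("<I", binary_doc, 6 + name_len)
        return rec_id, text, score

    print("payload bytes:", len(xml_doc), "XML vs", len(binary_doc), "binary")
    print("XML decode:   ", timeit.timeit(decode_xml, number=10000), "seconds")
    print("binary decode:", timeit.timeit(decode_binary, number=10000), "seconds")

Under realistic conditions you would of course use real payloads and a real Protocol Buffers schema, but even a toy comparison like this makes the constant-factor nature of the difference visible.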

My experience and research make the decision easy for me. If I need to exchange data with another system when I have only an "arms-length" relationship with the owning organization, then I use a text-based format (usually XML). If I need to exchange data with another system when I have a "family" relationship with the owning organization, then I will start with a text-based format but would consider a binary format. If I need to exchange data when my organization controls the source for both systems, then I would prefer a binary format. For a binary format, I now prefer Protocol Buffer. For text-based formats, I prefer something simple first (name/value pairs), and then XML (when I need structure).

However, the tools you choose are only a small part of the overall picture. It is how you use them that really matters most. Performance has to be designed in, and can rarely be significantly improved after the fact (without a rewrite).
Rob Williams
Tuesday, 15 July 2008 14:45:11 (GMT Daylight Time, UTC+01:00)
Excellent. It's not just technology where people misuse the term scalability. Business and revenue models are rife with the same kind of abuse. Unfortunately it's especially common with internet business models and even VCs are guilty at times. Abusing the term with respect to marginal cost is the most common.
Jes
Wednesday, 16 July 2008 00:39:17 (GMT Daylight Time, UTC+01:00)
I fully agree with the direction of this, and said essentially the same thing on my blog a few months ago.

I do have to differ as regards my presentation. At no point did I say that sprinkling on some memcache solved all of our scalability problems. That talk was about the *process* of scaling Twitter, not about having *finished* scaling the site, and it was given, as you mention, in 2007. You know better than to suggest that we were sitting on our laurels at that point, or any point after.
Comments are closed.