In my efforts to learn more about Web development and what it is like for startups adopting Web platforms, I've decided to build an application on the Facebook platform. I haven't yet decided on an application, but for the sake of argument let's say it is a Favorite Comic Books application which allows me to store my favorite comic books and shows me the most popular comic books among my friends.

The platform requirements for the application seem pretty straightforward. I'll need a database and some RESTful Web services, written in my language of choice, which provide access to the database from the widget. I'll also need to write the widget in FBML, which will likely mean I'll have to host images and CSS files as well. So far nothing seems particularly esoteric.

Since I didn't want my little experiment eventually costing me a lot of money, I thought this was an excellent time to try out Amazon's Simple Storage Service (S3) and Elastic Compute Cloud (EC2) services, since I'll only pay for as many resources as I use instead of paying a flat hosting fee.

However, it seems supporting this fairly straightforward application is beyond the current capabilities of EC2 + S3. S3 is primarily geared towards file storage, so although it makes a good choice for cheaply hosting images and CSS stylesheets, it's not a good choice for storing relational or structured data. If it was just searching within a single user's data (e.g. just searching within my favorite comics) I could store it all in a single XML file and then use XPath to find what I was looking for. However, my application will need to perform aggregated queries across multiple users' data (i.e. looking at the favorite comics of all of my friends and then fetching the most popular ones), so a file-based solution isn't a good fit. I really want a relational database.
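
To make the distinction concrete, here is a minimal sketch in Python. The XML layout, the table name and the sample data are all hypothetical, invented purely for illustration; nothing here is a Facebook or Amazon API. The first function does the single-user lookup with the limited XPath support in the standard library, and the second runs the cross-user aggregate, with an in-memory SQLite database standing in for whatever relational database I end up using.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical per-user favorites file, as it might be stored in S3.
MY_FAVORITES_XML = """
<favorites user="me">
  <comic publisher="Marvel">Runaways</comic>
  <comic publisher="Vertigo">Fables</comic>
  <comic publisher="Marvel">Ultimates</comic>
</favorites>
"""


def comics_by_publisher(xml_text, publisher):
    """Single-user search: an XPath predicate over one document is enough."""
    root = ET.fromstring(xml_text)
    return [c.text for c in root.findall(f"./comic[@publisher='{publisher}']")]


def most_popular(conn, friend_ids, limit=3):
    """Cross-user aggregate: count how many of the given friends list each title."""
    placeholders = ",".join("?" for _ in friend_ids)
    rows = conn.execute(
        f"SELECT title, COUNT(*) AS fans FROM favorites "
        f"WHERE user_id IN ({placeholders}) "
        f"GROUP BY title ORDER BY fans DESC, title ASC LIMIT ?",
        (*friend_ids, limit),
    )
    return rows.fetchall()


def build_demo_db():
    """Create an in-memory table with made-up favorites for three friends."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE favorites (user_id TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO favorites VALUES (?, ?)",
        [
            ("alice", "Runaways"),
            ("alice", "Ultimates"),
            ("bob", "Runaways"),
            ("carol", "Runaways"),
            ("carol", "Ultimates"),
            ("carol", "Fables"),
        ],
    )
    return conn
```

The GROUP BY in the second function is the part that has no cheap equivalent when every user's data lives in its own file: answering it from S3 alone would mean fetching and parsing one file per friend on every request.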

EC2 seemed really promising because I could create a virtual server running in Amazon's cloud and load it up with my choice of operating system, database and Web development tools. Unfortunately, there was a fly in the ointment. There is no persistent storage in EC2, so if your virtual server goes down for any reason, such as a system crash or being taken down to install security patches, all your data is lost.

This is a well-known problem within the EC2 community, and it has resulted in a bunch of clever hacks being proposed by a number of parties. In his post entitled "Amazon EC2, MySQL, Amazon S3", Jeff Barr of Amazon writes:

I was on a conference call yesterday and the topic of ways to store persistent data when using Amazon EC2 came up a couple of times. It would be really cool to have a persistent instance of a relational database like MySQL but there's nothing like that around at the moment. An instance can have a copy of MySQL installed and can store as much data as it would like (subject to the 160GB size limit for the virtual disk drive) but there's no way to ensure that the data is backed up in case the instance terminates without warning.

Or is there?

It is fairly easy to configure multiple instances of MySQL in a number of master-slave, master-master, and other topologies. The master instances produce a transaction log each time a change is made to a database record. The slaves or co-masters keep an open connection to the master, reading the changes as they are logged and mimicing the change on the local copy. There can be some replication delay for various reasons, but the slaves have all of the information needed to maintain exact copies of the database tables on the master.

Besides the added complexity this places on the application, it still isn't foolproof, as is pointed out in the various comments in response to Jeff's post.
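
The weakness comes down to the replication delay Jeff mentions. The toy simulation below is plain Python, not MySQL, and every class and function name in it is made up for illustration. It models a master that appends each change to a log and a slave that replays that log asynchronously; whatever the slave has not yet read when the master's instance disappears is lost.

```python
class Master:
    """Keeps its rows and a change log; both live on the instance's local disk."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


class Slave:
    """Replays the master's log, remembering how far it has read."""

    def __init__(self):
        self.rows = {}
        self.position = 0  # index of the next log entry to apply

    def catch_up(self, master, max_entries=None):
        """Apply pending log entries; max_entries models replication delay."""
        pending = master.log[self.position:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.rows[key] = value
        self.position += len(pending)


def lost_writes(master, slave):
    """Changes that exist only on the master, i.e. lost if it terminates now."""
    return master.log[slave.position:]
```

For example, after three writes on the master and a `catch_up` limited to two entries, `lost_writes` returns the third change: that is the data that vanishes if the master terminates at that moment.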

Demitrious Kelly, who also recognizes the problems with relying on replication to solve the persistence problem, proposed an alternate solution in his post "MySQL on Amazon EC2 (my thoughts)", where he writes:

Step #2: I’m half the instance I used to be! With each AMI you get 160GB of (mutable) disk space, and almost 2GB of ram, and the equivalent of a Xeon 1.75Ghz processor. Now divide that, roughly, in half. You’ve done that little math exercise because your one AMI is going to act as 2 AMI's. Thats right. I’m recommending running two separate instances of MySQL on the single server.

Before you start shouting at the heretic, hear me out!

+-----------+   +-----------+
|  Server A |   |  Server B |
+-----------+   +-----------+
| My  | My  |   | My  | My  |
| sQ  | sQ  |   | sQ  | sQ  |
| l   | l   |   | l   | l   |
|     |     |   |     |     |
| #2<=== #1 <===> #1 ===>#2 |
|     |     |   |     |     |
+ - - - - - +   + - - - - - +

On each of our servers, MySQL #1 and #2 both occupy a max of 70Gb of space. The MySQL #1 instances of all the servers are setup in a master-master topography. And the #2 instance is setup as a slave only of the #1 instance on the same server. so on server A MySQL #2 is a copy (one way) of #1 on server A.

With the above setup *if* server B were to get restarted for some reason you could: A) shut down the MySQL instance #2 on server A. Copy that MySQL #2 over to Both slots on server B. Bring up #1 on server B (there should be no need to reconfigure its replication relationship because #2 pointed at #1 on server A already). Bring up #2 on server B, and reconfigure replication to pull from #1 on ServerB. This whole time #1 on Server A never went down. Your services were never disrupted.

Also with the setup above it is possible (and advised) to regularly shut down #2 and copy it into S3. This gives you one more layer of fault tollerance (and, I might add, the ability to backup without going down.)
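
The periodic backup described in that last paragraph might look roughly like the sketch below. This is my own illustration, not code from Demitrious's post: the three callables are hypothetical placeholders for "shut down MySQL #2", "bring it back up" and "copy the archive into S3", and nothing in the function actually talks to MySQL or S3.

```python
import tarfile
from pathlib import Path


def backup_replica(data_dir, archive_path, stop_replica, start_replica, upload):
    """Stop the replica, archive its data directory, restart it, then upload.

    stop_replica, start_replica and upload are caller-supplied callables
    (hypothetical stand-ins for the real MySQL and S3 operations).
    """
    stop_replica()
    try:
        # Archive while the replica is stopped so the files are consistent.
        with tarfile.open(str(archive_path), "w:gz") as archive:
            archive.add(str(data_dir), arcname=Path(data_dir).name)
    finally:
        # Restart the replica even if archiving failed.
        start_replica()
    # Upload only after the replica is running again, to keep its downtime short.
    upload(archive_path)
    return archive_path
```

Because only instance #2 is stopped, instance #1 keeps serving requests throughout, which is the "backup without going down" property the post is after.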

Both solutions are fairly complicated and error-prone, and they still don't give you as much reliability as you would get if you simply had a hard disk that didn't lose all its data when the server goes down. At this point it is clear that a traditional hosted service solution is the route to go. Any good suggestions for server-side LAMP or WISC hosting that won't cost an arm and a leg? Is Joyent any good?

PS: It is clear this is a significant problem for Amazon's grid computing play and one that has to be fixed if the company is serious about getting into the grid computing game and providing a viable alternative to startups looking for a platform to build the next "Web 2.0" hit. Building a large scale, distributed, relational database then making it available to developers as a platform is unprecedented so they have their work cut out for them. I'd incorrectly assumed that BigTable was the precedent for this but I've since learned that BigTable is more like a large scale, distributed, spreadsheet table as opposed to a relational database. This explains a lot of the characteristics of the query API of Google Base.


 

Wednesday, 04 July 2007 21:41:06 (GMT Daylight Time, UTC+01:00)
Those are the same problems that kept me away from EC2 (those, and the DNS hacks necessary). I would like to point out that there is a MySQL storage engine that uses S3 as its backend, though it's probably far from stable:

http://aws.typepad.com/aws/2007/04/mysql_interface.html
Wednesday, 04 July 2007 21:57:42 (GMT Daylight Time, UTC+01:00)
Redwood virtual (www.redwoodvirtual.com) is good enough for something like this.

$10 a month gets you more than enough for most experiments like this, and they scale well.

You get a virtualized debian system, which can be easily upgraded to an Ubuntu system with the use of an apt.sources file from ubuntu and executing "apt-get update; apt-get dist-upgrade"

They do annoyingly go down for upgrades about 2 times a year for half an hour. But the price tag justifies the use for me in spite of this.
Michael Langford
Wednesday, 04 July 2007 22:10:55 (GMT Daylight Time, UTC+01:00)
>> ... and still don't give you as much reliability as you would get if you simply had a hard disk that didn't lose all its data when you rebooted the server. <<

Who told you that lie? You can reboot your servers as often as you'd like and never lose a single byte's worth of data. If you purposely shut it down then you're right, your data is gone. And if your server suddenly crashes, then you're right, your data is gone. That is, of course, if you haven't configured it to restart itself after a crash, but that's beside the point which is this: You're presenting a Fear, Uncertainty, and Doubt-based argument using false statements as the basis of your argument. In other words, what you are suggesting is that if you are a complete idiot and don't make an effort to build in a fairly simple layer of data redundancy, there's a chance -- however remote and unlikely it might be -- that your server could crash and you could lose all of your data.

Taking it back a step, the argument that "you could lose at least *some* data" (referring to the comment regarding the MySQL master/slave redundancy solution not being foolproof) is true about more hosting solutions than just S3+EC2. And to be quite honest, if I am already taking a chance by farming out my hosting services to a 3rd party, I'm sure as hell going to put more faith in Amazon's ability to keep the engine running than ANY OTHER hosting provider on the planet.

And if you still want to stand behind your claim that S3+EC2 is not a sufficient solution for "Real" applications (apparently Amazon.com isn't a "Real" enough solution for you to qualify as a "Real" web-based application?), could you at the very least take a moment and fix the false statement that restarting your server means losing all of your data?

Thanks!
Wednesday, 04 July 2007 23:00:54 (GMT Daylight Time, UTC+01:00)
M. David Peterson,
One of the quotes from my post is from Jeff Barr who is the Amazon Web Services evangelist, is he also spreading FUD?

By the way, if you think EC2 + S3 is all it takes to run a site like Amazon I've got a bridge in Brooklyn for sale that I can sell to you for a good price. :)
Thursday, 05 July 2007 01:30:19 (GMT Daylight Time, UTC+01:00)
@M. David Peterson:

EC2 isn't like a traditional virtual server. If a server instance goes away then you lose all storage.

That's nice for Amazon because it makes it pretty much stateless. It does make it difficult for traditional LAMP deployments.
Thursday, 05 July 2007 04:23:06 (GMT Daylight Time, UTC+01:00)
And Re: the PS:

There's some evidence from the EC2 forums that Amazon is considering or already building a BigTable-like query mechanism on top of S3.

That will be interesting, since (as you point out) most people are used to developing against a database which allows ad-hoc querying.

There are a few projects around that allow ad-hoc querying of BigTable-like datastores - for instance Yahoo's Pig (http://research.yahoo.com/project/pig) runs on top of Hadoop (http://lucene.apache.org/hadoop/). Something like that would increase the flexibility of EC2/S3, but I don't expect Amazon to release that - at least initially.
Thursday, 05 July 2007 05:17:44 (GMT Daylight Time, UTC+01:00)
@Dare,

>> One of the quotes from my post is from Jeff Barr who is the Amazon Web Services evangelist, is he also spreading FUD?

Jeff Barr said: "An instance can have a copy of MySQL installed and can store as much data as it would like (subject to the 160GB size limit for the virtual disk drive) but there's no way to ensure that the data is backed up in case the instance terminates without warning."

You said: "and still don't give you as much reliability as you would get if you simply had a hard disk that didn't lose all its data when you rebooted the server."

My quarrel is with the difference between "in case the instance terminates without warning" which is a true statement and "lose all its data when you rebooted the server." which is not a true statement.

>> By the way, if you think EC2 + S3 is all it takes to run a site like Amazon I've got a bridge in Brooklyn for sale that I can sell to you for a good price. :) <<

Oh, nice! I've been looking for a good bridge to invest in. How much you askin'? ;-)

Peace :)
Thursday, 05 July 2007 05:41:59 (GMT Daylight Time, UTC+01:00)
@Nick,

>> EC2 isn't like a traditional virtual server. If a server instance goes away then you lose all storage. <<

Oh I do understand that. But the only time it goes away is if,

1) You determine you no longer need it and turn it off.
2) The hardware fails.
3) The server software itself blows a gasket for some odd reason and completely shuts itself down as a result.
4) The world ends.

From the above list, the only realistic scenario I can see taking place is a hardware failure. Of course, the world could realistically end tomorrow, but I'll assume that we can all at very least agree on the fact that if it did, whether our EC2 instance lives on is going to be the least of our concerns.

And even if the server software were to blow a gasket and call it quits w/o any warning and w/o rebooting itself, and from a more realistic viewpoint, even if the hardware were to fail, that still leaves the remaining slave servers, which are at least within a handful of bytes of being accurate at the point the master server (if, in fact, it was the master server that gave up the ghost) called it quits. In this regard I fail to see a "this isn't capable of handling real-world applications" and instead see an "it's possible that something bad could happen, so prepare for the worst yet hope for the best and know that, within reason, it's far more likely that the best scenario (i.e. your data layer remains persistent as it should) will be what actually takes place 99.9999999999% of the time."

>> That's nice for Amazon because it makes it pretty much stateless. It does make it difficult for traditional LAMP deployments. <<

I've been using EC2 since within a few days of its original launch, building a fairly extensive, *data intensive* web application on top of it, and have yet to find myself in a position in which I was left wondering how to rebuild the various MySQL-based apps I have running on it. Of course, the core of the application's data persistence comes from a completely custom interface that uses S3 as the primary data storage layer, so in this regard it's really not a fair comparison to a traditional LAMP application, so fair enough,

+1/2 Nick
+1/2 M:D
Thursday, 05 July 2007 05:47:40 (GMT Daylight Time, UTC+01:00)
Correction: "From the above list, the only realistic scenario I can see taking place is a hardware failure." is obviously incorrect. The first item happens all the time, but it's controlled and therefore doesn't present the same problem encountered when a server gives up the ghost w/o any warning.
Thursday, 05 July 2007 12:45:43 (GMT Daylight Time, UTC+01:00)
EC2's persistent storage going away is really no different from your RAID set catching fire. All EC2 does is make this happen on every instance termination (but not reboots!) rather than just on a catastrophic hardware crash.

If your data is important to you, you need to have backups and/or replication anyway. EC2 just makes you think about this upfront instead of after an unlikely event nukes your data storage. As an added benefit, the nature of the service gives you far more opportunity to test your recovery strategy in a more controlled way.

How would you deal with your database server crashing hard in your own datacentre?
Jeremy
Thursday, 05 July 2007 15:38:08 (GMT Daylight Time, UTC+01:00)
I wholeheartedly agree with Jeremy. EC2 just brings standard disaster-recovery strategies to the fore. I have had PLENTY of RAID controllers die in my time.

The argument you present is pretty weak, Dare, and the headline is pretty inflammatory. If you're not willing to build redundancy into your application then perhaps a better headline would have been "Amazon EC2 + S3 Doesn't Cut it for Real (Simple) Applications".
Jake
Thursday, 05 July 2007 16:41:59 (GMT Daylight Time, UTC+01:00)
It's true that EC2 + S3 forces you to rethink your architecture assumptions upfront. I'm also in the camp that considers this a good thing.

The truth is that most (all?) engineers give lip service to scalability/redundancy but only come to grips with the practical ramifications when it's far too late. Hence the typical progression of: site/prototype built quickly on a simple CRUD access model with a single db backend -> site gets popular -> panic! throw memcached in front of the requests -> rearchitect entire site.

J Marlowe
Thursday, 05 July 2007 16:46:47 (GMT Daylight Time, UTC+01:00)
Anyone checked out www.flexiscale.com? The company seems to offer quite a lot of advantages compared with EC2, which I gather from their website they're aiming squarely at. For us, the two biggest winners are the load-balancing and the snapshot-based backups (we need point-in-time rollbacks).

We've not used them yet, but I see they're under the banner of xcalibre.co.uk so they've got a great pedigree. Any thoughts?

Cheers,
Jof
Thursday, 05 July 2007 17:15:26 (GMT Daylight Time, UTC+01:00)
@Dare,

Thanks for the fix!

Now about that bridge: Does it have a nice view? ;-)
Sunday, 08 July 2007 16:50:23 (GMT Daylight Time, UTC+01:00)
I think the point of all this is not that EC2+SQS+S3 doesn't allow traditional LAMP development. (WISC? who's using that?) That's obvious and Amazon never said it did.

Instead, you should realize that traditional LAMP development is dead and that you should start thinking about structuring your applications in a fundamentally different manner. This reliance on relational databases to do anything is a major bane to the industry right now and it would be far better off to find alternatives to enable scalability right from the start.

If you look at your argument above, what you are really bitching about is that EC2+S3 doesn't have a query mechanism. Some knowledge of database internals is enlightening here: RDBMSs insert your writes into a journal and then subsequently applies the mutation to applicable rows. A journal is just an S3 file away.

Need an index on top of that? No sweat. You just can't actually update rows, but rather must replace them. Upset that SQL is missing? Use a language with list comprehensions, like Python, Erlang or Haskell. Now go look up how Google File System, MapReduce and BigTable work and see that this kind of thing is already old hat. (disclaimer: I don't now, nor have I ever, worked for Google)

It should be very interesting to you that most of the big guys do not use RDBMSs for their main services: Google, Amazon, Yahoo!, Flickr, YouTube, etc. More interesting for your argument are the big services that do have scalability issues constantly: eBay, MySpace, Twitter, etc. The thing they have in common is reliance on RDBMSs, admittedly so in the case of eBay and Twitter.

I'm not saying that RDBMSs are all bad. I'm saying that if you are using one by default and/or that you don't know any other way to do things, your skills need updating.
Monday, 16 July 2007 18:00:31 (GMT Daylight Time, UTC+01:00)
I agree with (some of) what Toby has to say.

"[...]you should start thinking about structuring your applications in a fundamentally different manner. This reliance on relational databases to do anything is a major bane to the industry[...]"

Restructuring applications to move away from a monolithic database is key for web-scale. Mark Atwood has a BigTable like abstraction over S3: http://fallenpegasus.com/code/mysql-awss3/ to help.

"Some knowledge of database internals is enlightening here: RDBMSs insert your writes into a journal and then subsequently applies the mutation to applicable rows. A journal is just an S3 file away."

This is another good point. If you don't need the massive data storage (or even if you do), you could just run a db or a cluster db as a cache. Write the db logs out to S3 in the background. Your DBMS supports on-line backups, so you can grab a consistent snapshot at any given time. Now you've got ACI, but not durability. Maybe you can survive losing some of your data (say 10-15 minutes worth?).

With EC2 and S3 you can scale out your app (cluster db) to overcome some of the perf issues of this frequent on-line backup paradigm. This comes at a financial cost, but that's probably the best way to state the trade-off.

If you heard Amazon CTO Werner Vogels speak at the Seattle Conference on Scalability, you know what I'm talking about in terms of the trade-offs involved between scalability and ACID (shameless plug: http://twopieceset.blogspot.com/2007/06/seattle-conference-on-scalability.html)
Comments are closed.