Google Map/Reduce vs. Relational Databases

January 23, 2008

@ 04:00 AM

The top story in my favorite RSS reader is the article MapReduce: A major step backwards by David J. DeWitt and Michael Stonebraker. This is one of those articles that is so bad you feel dumber after having read it. The primary thesis of the article is

MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

A giant step backward in the programming paradigm for large-scale data intensive applications

A sub-optimal implementation, in that it uses brute force instead of indexing

Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

Missing most of the features that are routinely included in current DBMS

Incompatible with all of the tools DBMS users have come to depend on

One of the worst things about articles like this is that it gets usually reasonable and intelligent sounding people spouting of bogus responses as knee jerk reactions due to the articles stupidity. The average bogus reaction was the kind by Rich Skrenta in his post Database gods bitch about mapreduce which talks of "disruption" as if Google MapReduce is actually comparable to a relation database management system.

On the other hand, the good thing about articles like this is that you get often get great responses from smart folks that further your understanding of the subject matter even though the original article was crap. For example, take the post from Google employee Mark Chu-Carroll entitled Databases are Hammers; MapReduce is a ScrewDriver where he writes eloquently that

The beauty of MapReduce is that it's easy to write. M/R programs are really as easy as parallel programming ever gets. So, getting back to the article. They criticize MapReduce for, basically, not being based on the idea of a relational database.
…
That's exactly what's going on here. They've got their relational databases. RDBs are absolutely brilliant things. They're amazing tools, which can be used to build amazing software. I've done a lot of work using RDBs, and without them, I wouldn't have been able to do some of the work that I'm proudest of. I don't want to cut down RDBs at all: they're truly great. But not everything is a relational database, and not everything is naturally suited towards being treated as if it were relational. The criticisms of MapReduce all come down to: "But it's not the way relational databases would do it!" - without every realizing that that's the point. RDBs don't parallelize very well: how many RDBs do you know that can efficiently split a task among 1,000 cheap computers? RDBs don't handle non-tabular data well: RDBs are notorious for doing a poor job on recursive data structures. MapReduce isn't intended to replace relational databases: it's intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines. That's all it was intended to do.

Mark’s entire post is a great read.

Greg Jorgensen also has a good rebutal in his post Relational Database Experts Jump The MapReduce Shark which points out that if the original article had been a critique of a Web-based structured data storage systems such as Amazon’s SimpleDB or Google Base then the comparison may have been almost logical as opposed to being completely ridiculous. Wink