Some Thoughts on Hadoop - Dare Obasanjo's weblog

August 14, 2007

@ 03:19 AM

Recently I've seen a bunch of people I consider to be really smart sing the praises of Hadoop such as Sam Ruby in his post Long Bets, Tim O’Reilly in his post Yahoo!’s Bet on Hadoop, and Bill de hÓra in his post Phat Data. I haven’t dug too deeply into Hadoop due to the fact that the legal folks at work will chew out my butt if I did, there a number of little niggling doubts that make me wonder if this is the savior of the world that all these geeks claim it will be. Here are some random thoughts that have made me skeptical

Code Quality: Hadoop was started by Doug Cutting who created Lucene and Nutch. I don’t know much about Nutch but I am quite familiar with Lucene because we adopted it for use in RSS Bandit. This is probably the worst decision we’ve made in the entire history of RSS Bandit. Not only are the APIs a usability nightmare because they were poorly hacked out then never refactored, the code is also notoriously flaky when it comes to dealing with concurrency so common advice is to never use multiple threads to do anything with Lucene.

Incomplete Specifications: Hadoop’s MapReduce and HDFS are a re-implementation of Google’s MapReduce and Google File System (GFS) technologies. However it seems unwise to base a project on research papers that may not reveal all the details needed to implement the service for competitive reasons. For example, the Hadoop documentation is silent on how it plans to deal with the election of a primary/master server among peers especially in the face of machine failure which Google solves using the Chubby lock service. It just so happens that there is a research paper that describes Chubby but how many other services within Google’s data centers do MapReduce and Google File System (GFS) depend on which are yet to have their own public research paper? Speaking of which, where are the Google research papers on their message queueing infrastructure? You know they have to have one, right? How about their caching layer? Where are the papers on Google’s version of memcached?Secondly, what is the likelihood that Google will be as forthcoming with these papers now that they know competitors like Yahoo! are knocking off their internal architecture?
A Search Optimized Architecture isn’t for Everyone: One of the features of MapReduce is that one can move the computation close to the data because “Moving Computation is Cheaper than Moving Data”. This is especially important when you are doing lots of processing intensive operations such as the kind of data analysis that goes into creating the Google search index. However what if you’re a site whose main tasks are reading and writing lots of data (e.g. MySpace) or sending lots of transient messages back and forth yet ensuring that they always arrive in the right order (e.g. Google Talk) then these optimizations and capabilities aren’t much use to you and a different set of tools would serve you better.

I believe there are a lot of lessons that can be learned from how the distributed systems that power the services behind Google, Amazon and the like. However I think it is waaaay to early to be crowning some knock off of one particular vendors internal infrastructure as the future of distributed computing as we know it.

Seriously.

PS: Yes, I realize that Sam and Bill are primarily pointing out the increasing importance of parellel programming as it relates to the dual trends of (i) almost major website that ends up dealing with lots of data and has lots of traffic eventually eschews relational database features like joins, normalization, triggers and transactions because they are not cost effective and (ii) the increased large amounts of data that the we generate and now have to process due to falling storage costs. Even though their mentions of Hadoop are incidental it still seems to me that it’s almost become a meme, one which deserves more scrutiny before we jump on that particular band wagon.

Now playing: N.W.A. - Appetite For Destruction