August 14, 2007
@ 03:19 AM

Recently I've seen a bunch of people I consider to be really smart sing the praises of Hadoop, such as Sam Ruby in his post Long Bets, Tim O’Reilly in his post Yahoo!’s Bet on Hadoop, and Bill de hÓra in his post Phat Data. Although I haven’t dug too deeply into Hadoop, since the legal folks at work will chew out my butt if I do, there are a number of little niggling doubts that make me wonder if this is the savior of the world that all these geeks claim it will be. Here are some random thoughts that have made me skeptical.

  1. Code Quality: Hadoop was started by Doug Cutting, who created Lucene and Nutch. I don’t know much about Nutch but I am quite familiar with Lucene because we adopted it for use in RSS Bandit. This is probably the worst decision we’ve made in the entire history of RSS Bandit. Not only are the APIs a usability nightmare because they were poorly hacked out and then never refactored, the code is also notoriously flaky when it comes to dealing with concurrency, so the common advice is never to use multiple threads to do anything with Lucene.

  2. Incomplete Specifications: Hadoop’s MapReduce and HDFS are a re-implementation of Google’s MapReduce and Google File System (GFS) technologies. However, it seems unwise to base a project on research papers that may not reveal all the details needed to implement the service, for competitive reasons. For example, the Hadoop documentation is silent on how it plans to deal with the election of a primary/master server among peers, especially in the face of machine failure, which Google solves using the Chubby lock service. It just so happens that there is a research paper that describes Chubby, but how many other services within Google’s data centers do MapReduce and GFS depend on which are yet to have their own public research paper? Speaking of which, where are the Google research papers on their message queueing infrastructure? You know they have to have one, right? How about their caching layer? Where are the papers on Google’s version of memcached? Secondly, what is the likelihood that Google will be as forthcoming with these papers now that they know competitors like Yahoo! are knocking off their internal architecture?

  3. A Search Optimized Architecture isn’t for Everyone: One of the features of MapReduce is that one can move the computation close to the data because “Moving Computation is Cheaper than Moving Data”. This is especially important when you are doing lots of processing-intensive operations, such as the kind of data analysis that goes into creating the Google search index. However, if you’re a site whose main tasks are reading and writing lots of data (e.g. MySpace), or sending lots of transient messages back and forth while ensuring that they always arrive in the right order (e.g. Google Talk), then these optimizations and capabilities aren’t much use to you, and a different set of tools would serve you better.
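For readers who haven’t looked under the hood, the programming model everyone is excited about is simple enough to sketch in a few lines of plain Python. This is a toy, single-machine illustration of the map → shuffle → reduce pattern (the function names are mine, not Hadoop’s); the real system runs each phase across many machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word, like a Hadoop Mapper."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word, like a Hadoop Reducer."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["moving computation is cheaper", "moving data is expensive"]
print(reduce_phase(shuffle(map_phase(docs))))
```

The “move the computation to the data” optimization mentioned above amounts to running `map_phase` on whichever machine already holds each document, rather than copying documents over the network.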

I believe there are a lot of lessons that can be learned from the distributed systems that power the services behind Google, Amazon and the like. However, I think it is waaaay too early to be crowning some knock-off of one particular vendor’s internal infrastructure as the future of distributed computing as we know it.


PS: Yes, I realize that Sam and Bill are primarily pointing out the increasing importance of parallel programming as it relates to the dual trends of (i) almost every major website that ends up dealing with lots of data and lots of traffic eventually eschewing relational database features like joins, normalization, triggers and transactions because they are not cost effective, and (ii) the increasingly large amounts of data that we generate and now have to process due to falling storage costs. Even though their mentions of Hadoop are incidental, it still seems to me that it’s almost become a meme, one which deserves more scrutiny before we jump on that particular bandwagon.

Now playing: N.W.A. - Appetite For Destruction


Tuesday, 14 August 2007 08:51:50 (GMT Daylight Time, UTC+01:00)
I think it is not just Hadoop itself but what you can build utilizing open concepts like Hadoop. For example:

Yahoo! has also developed an abstraction layer on top of Hadoop called Pig. Pig allows one to express data analysis tasks in relational algebra, giving them some resemblance to SQL.
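To make “data analysis tasks in relational algebra” concrete, a Pig-style FILTER → GROUP → aggregate pipeline can be mimicked over plain Python tuples. This is a conceptual sketch of the shape of such a query, not actual Pig Latin syntax, and the sample records are made up:

```python
from itertools import groupby
from operator import itemgetter

# (user, site, time_spent) records, like rows of a web log
records = [
    ("alice", "news", 10), ("bob", "mail", 5),
    ("alice", "news", 20), ("bob", "news", 7),
]

# Roughly: FILTER records BY time_spent > 5
filtered = [r for r in records if r[2] > 5]

# Roughly: GROUP filtered BY site, then for each group GENERATE SUM(time_spent)
filtered.sort(key=itemgetter(1))  # groupby needs its input sorted by key
totals = {site: sum(r[2] for r in rows)
          for site, rows in groupby(filtered, key=itemgetter(1))}
print(totals)
```

The appeal is that each relational operator (filter, group, aggregate) maps naturally onto MapReduce jobs, so the same query scales from a list in memory to terabytes on a cluster.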

Another extension to Hadoop is Hbase, a clone of Google’s Bigtable used to store structured data over a distributed system.
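The Bigtable paper describes the data model as a sparse, sorted map indexed by (row, column, timestamp). A minimal in-memory sketch of that model (illustrative only, not HBase’s actual API) fits in a small class:

```python
import bisect

class ToyBigtable:
    """Sparse sorted map: (row, column, timestamp) -> value.
    Real Bigtable/HBase shards sorted row ranges across many machines;
    here everything lives in one sorted list."""
    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newest sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))
```

Keeping rows sorted is what lets the real system split the keyspace into contiguous tablets and serve range scans efficiently.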

And in case you don’t have clusters of machines lying around, Hadoop has also been extended with the ability to run on Amazon’s EC2.
Tuesday, 14 August 2007 18:26:40 (GMT Daylight Time, UTC+01:00)
Your complaints about Lucene seem to be directed towards a port of Lucene from Java to the .Net platform, a project which is under incubation at Apache, but which is by no means the stable, recommended version of Lucene.

Your "Incomplete Specification" concern just seems like FUD. Yes, perhaps it is harder to implement these systems than is described in the research papers, but the fact is that they have been largely re-implemented, and they work. Google uses Hadoop to teach folks about MapReduce. Yahoo! runs Hadoop on thousands of nodes. If Hadoop were still a proposal, yours might be a reasonable concern, but we're well past the proposal stage.

Finally, you suggest that Hadoop is not appropriate for all tasks. This is certainly true. MapReduce is useful for managing and mining very large datasets. Both Y! and Google were initially interested in the technology primarily for improving production search systems, but have found that it is much more broadly applicable. The ability to quickly perform ad-hoc queries over terabytes of unindexed data is transformative.

Comments are closed.