Greg Linden has a blog post entitled "Yahoo building a Google FS clone?" where he writes:
The Hadoop open source project is
building a clone of the powerful Google cluster tools Google File System and MapReduce.
I was
curious to see how much Yahoo appears to be involved in Hadoop. Doug Cutting, the
primary developer of Lucene, Nutch, and Hadoop, is now working for
Yahoo but, at the time, that hiring was described as supporting an
independent open source project.
Digging further, it seems Yahoo's role
is more complicated. Browsing through the Hadoop developers mailing list, I
can see that more than a dozen people from Yahoo appear to be involved in
Hadoop. In some cases, the involvement is deep. One of the Yahoo
developers, Konstantin Shvachko, produced a detailed
requirement document for Hadoop. The document appears to lay out what Yahoo
needs from Hadoop, including such tidbits as handling 10k+ nodes, 100k
simultaneous clients, and 10 petabytes in a cluster.
Also noteworthy is
Eric Baldeschwieler, a director of software development at Yahoo, who recently
talked about direct support from Yahoo for Hadoop. Eric said, "How we are
going to establish a testing / validation regime that will support innovation
... We'll be happy to help staff / fund such a testing policy."
I find this effort by Yahoo! rather interesting, given that platform pieces like GFS, BigTable, MapReduce and Sawzall give Google quite the edge in building mega-scale services and, in Greg Linden's words, are 'major force multipliers' that enable them to pump out new online services at a rapid pace. I'd expect Google's competitors to build similar systems and keep them close to the chest, not give them away. I suspect that the reason Yahoo! is going this route is that they don't have enough folks to build this in-house and have thus collaborated with the Hadoop project to get some help. This could potentially backfire, since there is nothing stopping small or large competitors from reusing their efforts, especially if Hadoop uses a traditional Open Source license.
On a related note, Greg also posted a link to an article by David F. Carr entitled "How Google Works" which has the following interesting quote:
Google has a split personality when it comes to questions about its back-end
systems. To the media, its answer is, "Sorry, we don't talk about our
infrastructure."
Yet, Google engineers crack the door open wider when
addressing computer science audiences, such as rooms full of graduate students
whom it is interested in recruiting.
As a result, sources for this story
included technical presentations available from the University of Washington Web
site, as well as other technical conference presentations, and papers published
by Google's research arm, Google Labs.
I do think it is cool that Google developers publish so much about the stuff they are working on. One of the things I miss from being on the XML team at Microsoft is being around people with a culture of publishing research, like Erik Meijer and Michael Rys. I even got a research paper on XML query languages published while on the team. I'd definitely like to publish research-quality papers on some of the stuff I'm working on now. I've done MSDN articles and a ThinkWeek paper in the past few years; it's probably about time I start thinking about writing a research paper again.
PS: If you work on online services and you don't read Greg Linden's blog, you are missing out. Subscribed.