These are my notes from the talk Using MapReduce on Large Geographic Datasets by Barry Brummit.

Most of this talk was a repetition of the material in the previous talk by Jeff Dean, including reusing many of the same slides. My notes primarily contain material I felt was unique to this talk.

A common pattern across a lot of Google services is creating a lot of index files that point into the data and loading them into memory to make lookups fast. This is also done by the Google Maps team, which has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).

Below are examples of the kinds of problems the Google Maps team has used MapReduce to solve.

Locating all points that connect to a particular road
Input: List of roads and intersections
Map: Create pairs of connected points, such as {road, intersection} or {road, road} pairs
Shuffle: Sort by key
Reduce: Get the list of pairs with the same key
Output: A list of all the points that connect to a particular road
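The pattern above can be sketched in plain Python. This is a toy simulation of the map, shuffle, and reduce phases on a single machine; the input format ({road, point} pairs as tuples) and the sample road names are my own illustrative assumptions, not Google's actual schema.

```python
from collections import defaultdict

def map_connections(records):
    # Map phase: each record is a (road, connected point) fact;
    # emit it keyed by road so all connections to a road group together.
    for road, point in records:
        yield road, point

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key,
    # standing in for the framework's sort-by-key step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_connections(groups):
    # Reduce phase: for each road, the grouped values are exactly
    # the list of points that connect to it.
    return {road: points for road, points in groups.items()}

# Hypothetical sample data
records = [("I-90", "Exit 1"), ("I-90", "Exit 2"), ("SR-520", "Exit 1")]
result = reduce_connections(shuffle(map_connections(records)))
```

Here `result["I-90"]` comes back as `["Exit 1", "Exit 2"]`; in a real MapReduce the shuffle is done by the framework across many machines rather than by an in-memory dictionary.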

Rendering Map Tiles
Input: Geographic feature list
Map: Emit each feature on a set of overlapping lat/long rectangles
Shuffle: Sort by key
Reduce: Emit tile using data for all enclosed features
Output: Rendered tiles
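The key idea in this pipeline is that the map phase emits each feature once per tile it overlaps, so the reducer for a tile sees every feature it needs. A minimal sketch, where the one-degree tile grid, the bounding-box feature format, and the string "rendering" are all stand-in assumptions of mine:

```python
import math
from collections import defaultdict

TILE_DEG = 1.0  # hypothetical tile size in degrees

def tiles_for_bbox(min_lat, min_lon, max_lat, max_lon):
    # Enumerate the grid tiles that a feature's bounding box overlaps.
    for lat in range(math.floor(min_lat / TILE_DEG), math.floor(max_lat / TILE_DEG) + 1):
        for lon in range(math.floor(min_lon / TILE_DEG), math.floor(max_lon / TILE_DEG) + 1):
            yield (lat, lon)

def map_features(features):
    # Map phase: emit each feature keyed by every tile it falls on,
    # so features on tile boundaries reach all affected tiles.
    for name, bbox in features:
        for tile in tiles_for_bbox(*bbox):
            yield tile, name

def reduce_render(tile, names):
    # Reduce phase: "render" a tile from all enclosed features.
    # A real renderer would rasterize geometry; a string stands in here.
    return f"tile{tile}: " + ",".join(sorted(names))

# Hypothetical features as (name, (min_lat, min_lon, max_lat, max_lon))
features = [("park", (0.2, 0.2, 0.8, 0.8)), ("river", (0.5, 0.5, 1.5, 1.5))]
groups = defaultdict(list)
for tile, name in map_features(features):
    groups[tile].append(name)
rendered = {tile: reduce_render(tile, names) for tile, names in groups.items()}
```

Note that the "river" feature spans four tiles, so it is emitted four times; this duplication is what lets each tile render independently in parallel.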

Finding the Nearest Gas Station to an Address Within Five Miles
Input: Graph describing the node network with all gas stations marked
Map: Search a five-mile radius around each gas station and mark the distance to each node
Shuffle: Sort by key
Reduce: For each node, emit the path and gas station with the shortest distance
Output: Graph marked with the nearest gas station to each node
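This pipeline can also be sketched in Python. The map phase runs a bounded shortest-path search outward from each gas station and emits (node, (distance, station)) pairs; the reduce phase takes the minimum per node. The adjacency-list graph format, the bounded Dijkstra search, and the sample data are my assumptions; the talk does not specify the search algorithm.

```python
import heapq
from collections import defaultdict

def dijkstra_within(graph, source, limit):
    # Shortest-path distances from source, pruned at a distance limit
    # (here, five miles), so each station only explores its local area.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd <= limit and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def map_stations(graph, stations, limit=5.0):
    # Map phase: for each station, emit every reachable node with
    # its distance to that station.
    for station in stations:
        for node, d in dijkstra_within(graph, station, limit).items():
            yield node, (d, station)

def reduce_nearest(groups):
    # Reduce phase: per node, keep the closest station seen.
    return {node: min(candidates) for node, candidates in groups.items()}

# Hypothetical road graph; edge weights in miles, stations at A and D.
graph = {
    "A": [("B", 2)],
    "B": [("A", 2), ("C", 2)],
    "C": [("B", 2), ("D", 2)],
    "D": [("C", 2)],
}
groups = defaultdict(list)
for node, candidate in map_stations(graph, ["A", "D"]):
    groups[node].append(candidate)
nearest = reduce_nearest(groups)
```

Node C ends up assigned to station D (2 miles) rather than A (4 miles), and nodes outside the five-mile limit of every station simply never appear in the output.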

When issues are encountered in a MapReduce, developers can debug them by running their MapReduce applications locally on their desktops.

Developers who would like to harness the power of a cluster of several hundred to several thousand nodes but do not work at Google can try

Recruiting Sales Pitch

[The conference was part recruiting event so some of the speakers ended their talks with a recruiting spiel. - Dare]

The Google infrastructure is the product of Google's engineering culture, which has the following ten characteristics:

  1. Single source code repository for all Google code
  2. Developers can check in fixes for any Google product
  3. You can build any Google product in three steps (get, configure, make)
  4. Uniform coding standards across the company
  5. Mandatory code reviews before checkins
  6. Pervasive unit testing
  7. Tests run nightly, emails sent to developers if any failures
  8. Powerful tools that are shared company-wide
  9. Rapid project cycles, developers change projects often, 20% time
  10. Peer driven review process, flat management hierarchy


Q: Where are intermediate results from map operations stored?
A: In BigTable or GFS

Q: Can you use MapReduce incrementally? For example, when new roads are built in North America, do we have to run MapReduce over the entire data set or can we factor in only the changed data?
A: Currently, you'll have to process the entire data stream again. However, this is a problem that is the target of a lot of active research at Google, since it affects many teams.