Greg Linden has a blog post entitled Google Personalized Search and Bigtable where he writes:

One tidbit I found curious in the Google Bigtable paper was this hint about the internals of Google Personalized Search:
Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.
This appears to confirm that Google Personalized Search works by building high-level profiles of user interests from their past behavior.

I would guess it works by determining subject interests (e.g. sports, computers) and biasing all search results toward those categories. That would be similar to the old personalized search in Google Labs (which was based on Kaltix technology) where you had to explicitly specify that profile, but now the profile is generated implicitly using your search history.

My concern with this approach is that it does not focus on what you are doing right now, what you are trying to find, your current mission. Instead, it is a coarse-grained bias of all results toward what you generally seem to enjoy.

This problem is worse if the profiles are not updated in real time.

I totally disagree with Greg here on almost every point. Building a profile of a user's interests to improve their search results is quite different from improving their search results in real time. The former is personalized search, while the latter is more akin to clustering of search results. For example, if I search for "football", a search engine can either use the fact that I've searched for soccer-related terms in the past to bubble up the official website of the Fédération Internationale de Football Association (FIFA) instead of the National Football League (NFL) website, or it can cluster the results so I see all the options. Ideally, it should do both. However, building my profile in real time (e.g. learning from my searches from five minutes ago as opposed to those from five days ago), although ideal, doesn't seem necessary for personalization to benefit end users. This seems like one of those places where a good-enough solution based on offline processing is better than an over-engineered real-time solution. Search is rarely about returning or reacting to real-time data anyway. :)
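To make the offline approach concrete, here is a minimal sketch in Python of the "football" example. It is not how Google Personalized Search works; the Bigtable paper only says that profiles are generated by a MapReduce. Everything here is an assumption for illustration: the TERM_TOPICS table, the build_profile and personalize functions, the bias parameter, and the made-up base scores. The point is only that a profile computed in a batch job over old search history can still change the ranking of a live query.

```python
from collections import Counter

# Hypothetical mapping from query terms to coarse topics (assumed for illustration).
TERM_TOPICS = {
    "fifa": "soccer",
    "premier league": "soccer",
    "world cup": "soccer",
    "nfl": "american_football",
    "super bowl": "american_football",
}


def build_profile(search_history):
    """Offline step: aggregate past queries into normalized topic weights."""
    counts = Counter()
    for query in search_history:
        q = query.lower()
        for term, topic in TERM_TOPICS.items():
            if term in q:
                counts[topic] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in counts.items()}


def personalize(results, profile, bias=0.5):
    """Online step: re-rank (url, base_score, topic) tuples using a stored profile."""
    def score(result):
        _url, base_score, topic = result
        # Boost the base score in proportion to the user's interest in the topic.
        return base_score * (1.0 + bias * profile.get(topic, 0.0))
    return sorted(results, key=score, reverse=True)


# Search history from days ago, processed in a batch job.
history = ["FIFA world cup schedule", "premier league table", "nfl draft"]
profile = build_profile(history)

# Made-up results for the query "football", with made-up base scores.
results = [
    ("nfl.com", 1.0, "american_football"),
    ("fifa.com", 0.9, "soccer"),
]
ranked = personalize(results, profile)
print([url for url, _, _ in ranked])
```

With this history the profile leans toward soccer, so fifa.com moves above nfl.com even though its base score is lower. With an empty profile the original order is kept.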

PS: I do think it's quite interesting to see how many Google applications are built on BigTable and MapReduce. From the post Namespaced Extensions in Feeds it looks like Google Reader is another example.


 

Friday, 01 September 2006 22:36:40 (GMT Daylight Time, UTC+01:00)
Hi, Dare. Wow, totally disagree! : ) Naw, c'mon, I'm sure we aren't that far off from each other.

You are right that non-real-time personalization adds some value. There is little doubt that a bias built on a coarse-grained profile is useful.

However, I think there are some easy cases where at least near real-time personalization produces further improvements. For example, merely increasing the rank of things people clicked on in the past has been demonstrated (by Teevan et al. and others) to improve search quality.

Stepping back for a second, what I really want to do is break the idea that each search is independent. If you do search A, don't find what you want, and then do search B, the fact that you didn't find what you wanted in search A should influence what you see for search B.

Right now, it doesn't. Each search is treated as independent. That's clearly wrong. There's valuable information being lost about what you did and did not find.

Search should be more of a dialogue rather than a one-shot deal. Search should pay attention to what I have done in the past and help me find what I need.
Sunday, 03 September 2006 17:25:17 (GMT Daylight Time, UTC+01:00)
Greg,
This sounds more like improving the search refinement process than personalization. Using data aggregated over the long term is personalization, but near real-time adaptation seems more like getting better at 'suggested searches', beyond the typo fixing that is their most common use today.