February 12, 2007
@ 08:13 PM

A couple of weeks ago I read a blog post by Matt Cutts entitled "What did I miss last week?" where he wrote:

- Hitwise offered a market share comparison between Bloglines, Google Reader, Rojo, and other feed readers that claimed Bloglines was about 10x more popular than Google Reader. My hunch is that both AJAX and frames may be muddying the water here; I’ve mentioned that AJAX can heavily skew pageview metrics before. If the Google Reader team gets a chance to add subscriber numbers to the Feedfetcher user-agent (which may not be a trivial undertaking, since they probably share code with other groups at Google that fetch using the same bot mechanism), that would allow an apples-to-apples comparison.

As I was thinking about the fact that Google Reader can't make changes to the Feedfetcher user agent without tightly coupling a general platform component, which likely serves Google Reader, Google Homepage, Google Blog Search and other services, to their own service, I realized that using one user agent for all of these services pretty much makes it impossible for webmasters to exclude themselves from only some of Google's crawlers.

Exactly how would one go about creating a robots.txt file that keeps your feed from showing up in Google Blog Search results but doesn't end up excluding you from Google Reader and Google Homepage as well? I can't think of a way to do this, but maybe it's because my kung fu is weak. Any suggestions?
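To make the problem concrete, here is a minimal sketch using Python's standard urllib.robotparser. The token "SharedFetcher", the path "/feed.xml" and the service names are made up for illustration; they are not Google's actual user-agent strings or services. The point is only that robots.txt rules are keyed on the user-agent token, so every service fetching under the same token gets the same answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "SharedFetcher" is an invented token standing in
# for one fetcher shared by several services. It is NOT a real Google agent.
ROBOTS_TXT = """\
User-agent: SharedFetcher
Disallow: /feed.xml

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Three imaginary services that all fetch under the same token. robots.txt
# has no field for "which service is asking", so the verdict depends only
# on the token and the path.
services = ["blog search", "feed reader", "personalized homepage"]
decisions = {
    service: parser.can_fetch("SharedFetcher", "/feed.xml")
    for service in services
}

for service, allowed in decisions.items():
    print(f"{service}: allowed={allowed}")

# A crawler with a different token falls through to the "*" group.
print("OtherBot:", parser.can_fetch("OtherBot", "/feed.xml"))
```

Blocking the feed for the shared token blocks it for all three services at once. Excluding only one of them would need either a distinct token per service or some mechanism outside robots.txt.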

PS: This isn't work related.


 

Monday, 12 February 2007 21:21:48 (GMT Standard Time, UTC+00:00)
Although not robots.txt based, they could add a preference within Webmaster Tools to finely permission each application using the centralized fetcher. You could also control throttles the same way.
Monday, 12 February 2007 21:40:49 (GMT Standard Time, UTC+00:00)
Reader and the Personalized Homepage are a bit different from Blog Search. Since they're acting on behalf of a user, they're not considered robots and thus always ignore robots.txt, as discussed here:

http://www.google.com/support/webmasters/bin/answer.py?answer=33545

Blog Search on the other hand is an automated robot, and will respect robots.txt, as described here:

http://www.google.com/help/about_blogsearch.html#notlisted

It also supports an under-documented extension for preventing indexing:

http://www.intertwingly.net/blog/2006/08/02/Feed-Access-Control#c1154539474

Mihai Parparita
Google Reader Engineer