I've recently been thinking about the problems facing search and navigation systems that depend on metadata applied to content by the creator of that content. This includes systems like Technorati Tags, which searches the <category> elements in various RSS feeds, and folksonomies like del.icio.us, which searches tags applied to links submitted by users.
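To make the trust issue concrete, here is a minimal sketch of how a search service might build a tag index from the <category> elements of an RSS 2.0 feed. The sample feed, its URLs, and the index_by_category helper are all made up for illustration; this is not how Technorati or any other service actually works. The point is only that whatever the author writes in <category> goes straight into the index.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, invented for this example.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example weblog</title>
    <item>
      <title>Notes on RSS search</title>
      <link>http://example.com/rss-search</link>
      <category>RSS</category>
      <category>Search</category>
    </item>
    <item>
      <title>Cheap pills here</title>
      <link>http://example.com/spam</link>
      <category>RSS</category>
      <category>iPod</category>
    </item>
  </channel>
</rss>"""


def index_by_category(feed_xml):
    """Map each lowercased category to the item links that claim it."""
    index = {}
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        for category in item.findall("category"):
            tag = (category.text or "").strip().lower()
            if tag:
                # No check that the item is really about this topic:
                # the author's claim is taken at face value.
                index.setdefault(tag, []).append(link)
    return index


index = index_by_category(SAMPLE_FEED)
print(index["rss"])
```

In this sketch the second item lands under "rss" and "ipod" simply because it says so, which is the same weakness HTML META keywords had.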

A few months ago I wrote a post entitled Technorati Tags: Why Do Bad Ideas Keep Resurfacing?, which pointed out that Technorati Tags had the same problems that had plagued previous metadata self-annotation schemes on the Web, such as HTML META tags. The main problem is that People Lie. Since then I've seen a number of complaints from developers of search engines that depend on RSS metadata.

In a comment to a post entitled Blogspot Spam in Matthew Mullenweg's weblog, Bob Wyman of PubSub.com writes

A very high percentage of the spam blogs that we process at PubSub.com also come from blogspot. We’ve got more serious “problems” in Japan and China, however, for the English language, blogspot is pretty much “spamspot.” It is, as always, disappointing to see people abuse a good and free service like that offered by Google/Blogspot in such a way.

In a post entitled Turning Blogspot Off Scott Johnson of Feedster wrote

All Blogspot blogs right now are included in every Feedster search by default. And now, due to the massive problems with spam on Blogspot, we're actually at the point of saying "Why don't we make searching Blogspot optional for all Feedster users". What's going on is that spammers have learned how to massively exploit Blogspot -- to the point where at times 90% of the blog traffic we get from Blogspot is spam.

Now that's bad. Actually this spam issue just plain sucks. And its starting to ruin the user experience that people have with Feedster.

The main reason these spam blogs haven't started affecting the Technorati Tags feature is that Blogspot doesn't support categories. However, it is clear that the same problems search engines faced when they decided to trust HTML metadata are beginning to show up in searches over RSS metadata. This is one place where established search engines would have a leg up on upstarts like Feedster and PubSub if they got into the RSS search market, since they've already had to adapt to all sorts of 'search engine optimization' tricks.

On a related note, consider the above information about the high number of spam blogs on Google's Blogspot service alongside the recent article Bloggers Pitch Fits Over Glitches, which among other things states:

In fact, enter "Blogger sucks" in Google and you get 720,000 results, with most of the entries on the first few pages (read: the most popular) dedicated to these exasperating tech snafus. It can make for some pretty ugly reading. Imagine what they might say if they actually paid for the service?

But if you look at Blogger's status page, which lists service outages, you can see why they are so mad.

It seems that Doc Searls may have been onto something about Google quitting innovating in Blogger.


Thursday, April 14, 2005 3:27:13 PM (GMT Daylight Time, UTC+01:00)
They did add a CAPTCHA for new blog creation on Blog*Spot a couple of days ago (though curiously, from what I've read about CAPTCHA breaking, they're using a pretty weak design). But the lack of categories isn't really stopping them from being a spam source for Technorati, it's just a lack of perceived value so far. After all, Technorati's happy to consider a link to greatbigbouncyones.com/schpam/ipod to be a tag, as long as it includes rel="tag". They don't care whether it redirects, or even 404s, as long as it ends in something they can call a tag.
Thursday, April 14, 2005 3:32:30 PM (GMT Daylight Time, UTC+01:00)
We have portals that are created by some nobody; they just crawl over RSS feeds and then create endless portlets for the RSS found.
Friday, April 15, 2005 2:20:58 AM (GMT Daylight Time, UTC+01:00)
Spam issues are going to affect all free services such as Yahoo 360 and MSN Spaces, not just Blogger. The critical question is how we teach search engines to distinguish between spam and non-spam.

One idea I have seen floating around is Trust Rank. That may be a worthwhile endeavour to overcome spam issues.
