News Aggregators As Denial of Service Clients

May 3, 2004

@ 07:18 PM

Every once in a while I see a developer of a news aggregator decide to add a 'feature' that unnecessarily chomps down the bandwidth of a web server in a manner one could classify as rude. The first I remember was Syndirella, which had a feature that allowed you to syndicate an HTML page and then specify regular expressions for which parts of the page you wanted it to treat as titles and content. There are three reasons I consider this rude:

If a site hasn't put up an RSS feed, it may be because they don't want to deal with the bandwidth costs of clients repeatedly hitting their site on behalf of a few users.
An HTML page is often larger than the corresponding RSS feed. The Slashdot RSS feed is about 2K while just the raw HTML of the front page of Slashdot is about 40K.
An HTML page could change a lot more often than the RSS feed [e.g. rotating ads, trackback links in blogs, etc.] in situations where an RSS feed would not.

For these reasons I tend to think that the right thing to do if a site doesn't support RSS is to send them a request that they add it, highlighting its benefits, instead of eating up their bandwidth.

The second instance I've seen of what I'd call rude bandwidth behavior is a feature of NewsMonster that Mark Pilgrim complained about last year, where every time it finds a new RSS item in your feed, it will automatically download the linked HTML page (as specified in the RSS item's link element), along with all relevant stylesheets, Javascript files, and images. Considering that the user may never click through to the web site from the RSS view, this is potentially hundreds of unnecessary files being downloaded by the aggregator a day. This is not an exaggeration: I'm subscribed to a hundred feeds in my aggregator and there is an average of two posts a day to each feed, so downloading the accompanying content and images is literally hundreds of files in addition to the RSS feeds being downloaded.
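To put rough numbers on that, here is a back-of-the-envelope sketch. The hundred feeds and two posts a day come from the paragraph above; the function name and the figure of five assets per page are assumptions picked purely for illustration.

```python
def prefetch_requests_per_day(feeds, posts_per_feed_per_day, assets_per_page):
    """Estimate the extra HTTP requests per day caused by prefetching
    every new item's linked page along with its assets."""
    new_items = feeds * posts_per_feed_per_day
    # One request for the HTML page itself, plus one per stylesheet,
    # script, or image that the page references.
    return new_items * (1 + assets_per_page)


# 100 feeds, 2 posts a day each, counting only the linked HTML pages.
print(prefetch_requests_per_day(100, 2, 0))

# Same subscriptions, assuming (hypothetically) 5 assets per page.
print(prefetch_requests_per_day(100, 2, 5))
```

Even with no assets at all the count is already in the hundreds per user per day, on top of the feed downloads themselves.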

The newest instance of unnecessary bandwidth hogging behavior I've seen from a news aggregator was pointed out by Phil Ringnalda's comments about excessive hits from NewzCrawler, which I'd also seen in my referrer logs and had been puzzled about. According to the answer on the NewzCrawler support forums, when NewzCrawler updates a channel supporting wfw:commentRss it first updates the main feed and then it updates the comment feeds. Repeatedly downloading the RSS feed for the comments to each entry in my blog when the user hasn't requested them is unnecessary and quite frankly wasteful.

Someone really needs to set up an aggregator hall of shame.

Categories: Technology


Monday, 03 May 2004 20:07:56 (GMT Daylight Time, UTC+01:00)

Dare - I think that is the wrong way to look at the problem. It's like calling users stupid when they don't use the software right.

The RSS specification is incomplete and therefore broken. RSS does not specify how these behaviors should be properly addressed.

Your example of the tiny slashdot RSS feed is a perfect one - the Slashdot feed by itself is pointless. It has the article title and the first few words of the story. You *have* to download the web page to read the whole post. However, since RSS does not specify how that should be done, aggregators are doing the only thing they can think of - download the whole page from the link tag.

RSS aggregators that download comments are doing exactly what the specification says that the wfw:commentRss links are there for. How else can one read and search RSS content offline? Who wants to wait for that stuff to load?

As long as RSS supports these features, aggregators are going to use them. And until RSS "matures" and properly documents how these features "must" work, aggregator authors are going to use hacks like downloading entire web pages to get around bigger problems, like incomplete content in the RSS feed.

Dylan Greene

Monday, 03 May 2004 20:20:57 (GMT Daylight Time, UTC+01:00)

Well, RSS itself isn't going to mature, or change, or specify anything. RSS is Peter Pan, it was absolutely perfect in the fall of 2002 for 2.0 or whenever it was that RSS-DEV actually did anything (summer of 2002 I think) for 1.0, and it's never going to grow up. We don't have that option, of having someone tell us what to do. The only option we have is to just dive in, figure out what works and what sucks, and tell each other what sucks (and maybe what works, too, but what sucks is more fun ;)).

Phil Ringnalda

Monday, 03 May 2004 20:32:35 (GMT Daylight Time, UTC+01:00)

Dylan,
I grow tired of seeing people use the "RSS is broken" crutch instead of thinking rationally about the problems with how XML content syndication is implemented today. The fact that Slashdot's feeds don't contain full content is explicitly because Slashdot (a) doesn't want the high bandwidth costs of automated clients hitting the site and downloading lots of content and (b) they want to drive people to their site to get ad impressions. Neither of these is a "bug" in RSS. The fact that some client is too smart for its own good and ignores robots.txt, thus making it the equivalent of a malicious Web spider, is not a sign of brokenness in syndication formats (not RSS 1.0, RSS 2.0 or Atom).
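For illustration only, here is a minimal sketch of what checking robots.txt before prefetching a linked page might look like, using Python's standard urllib.robotparser. The user agent string, the URLs and the robots.txt rules are all made up for the example, and a real aggregator would download robots.txt from the site rather than use a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except the comments area.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /comments/
"""

AGENT = "ExampleAggregator/1.0"  # hypothetical user agent
print(allowed_by_robots(EXAMPLE_ROBOTS, AGENT, "http://example.org/index.rss"))
print(allowed_by_robots(EXAMPLE_ROBOTS, AGENT, "http://example.org/comments/42.html"))
```

A client that ran a check like this before each automatic page download would at least give site owners a way to opt out.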

You can overspecify as much as you want in whatever part of the spec; there'll always be somewhere something is ambiguous or where someone claims that since it isn't explicitly disallowed it is allowed. Nothing disallows me from writing a client that downloads your web page a dozen times an hour or sends you an email a dozen times a day. Is this also due to the fundamental 'brokenness' of HTML, HTTP and SMTP?

Dare Obasanjo

Monday, 03 May 2004 22:10:14 (GMT Daylight Time, UTC+01:00)

"The fact that Slashdot's feeds don't contain full content is explicitly because Slashdot (a) doesn't want the high bandwidth costs of automated clients hitting downloading lots of content"

The RSS readers that suck down the full page because Slashdot doesn't put the full content in their RSS feed are just making matters worse for Slashdot. Instead of a larger paragraph of text, the entire page (along with the first 60 or so comments) is downloaded.

"(b) they want to drive people to their site to get ad impressions. "

RSS is about getting content without having to go to the web page. Ad impression concerns should be addressed in the RSS feed, as opposed to expecting the person to visit the site.

"'brokenness' of HTML, HTTP and SMTP?"
Those aren't subscription models.

Dylan Greene

Monday, 03 May 2004 22:47:48 (GMT Daylight Time, UTC+01:00)

Dylan,
So it's Slashdot's fault that people write the equivalent of Web spiders that eat up huge amounts of their bandwidth because they want to give their readers a 'feature' that they may not use? This has as little to do with RSS as the rules for how often a Web spider should download an HTML page [which do not exist, by the way] have to do with HTML.

Your attitude seems to be that any means necessary is OK for client aggregators as long as it is a feature that an end user may use even if it wastes the bandwidth of servers. Unnecessarily wasted bandwidth by poorly written aggregators and poorly considered features is already a concern for many users of syndication technologies [both producers and consumers]. Already some folks have considered banning malicious aggregators and the aggregator authors have considered lying in the user agent string (see the link to the back and forth between Mark Pilgrim and the author of NewsMonster).

Syndication is still growing and the participants in the syndication community should learn how to work together to best grow the community instead of causing a tragedy of the commons by spoiling things for everybody. If a website learns that exposing wfw:commentRss means that users of NewzCrawler will effectively be DDoSing the site, it has little reason to expose it at all. Imagine how much bandwidth it would suck from Slashdot if they supported wfw:commentRss and NewzCrawler downloaded every comments feed several times a day, going back several months, for each of their users.

Dare Obasanjo

Tuesday, 04 May 2004 04:49:15 (GMT Daylight Time, UTC+01:00)

OTOH, sites that provide the conversion from HTML to RSS as a web gateway could actually help bandwidth. A smart gateway could cache responses for multiple clients, and would hit the HTML site infrequently. I would imagine (hope) that Bloglines does something like this, since it's such an obvious optimization.
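As a rough sketch of that idea (not a description of how Bloglines or any real gateway works), a shared cache with a time-to-live could look like the following. The class name, the fetch callback and the one-hour TTL are assumptions made for the example.

```python
import time


class CachingGateway:
    """Serve many clients from one cached copy, refetching at most once per TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # callable that actually hits the origin site
        self._ttl = ttl_seconds  # how long a cached copy stays fresh
        self._clock = clock      # injectable so the demo can fake time
        self._cache = {}         # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no request to the origin
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body


# Demo with a fake origin and a fake clock (both hypothetical).
origin_hits = []
fake_now = [0.0]


def fake_fetch(url):
    origin_hits.append(url)
    return "<html>front page</html>"


gateway = CachingGateway(fake_fetch, ttl_seconds=3600, clock=lambda: fake_now[0])
for _ in range(50):              # fifty clients ask within the same hour
    gateway.get("http://example.org/")
print(len(origin_hits))          # the origin was hit once

fake_now[0] = 3600.0             # an hour later the cached copy has expired
gateway.get("http://example.org/")
print(len(origin_hits))          # the origin was hit a second time
```

The origin site sees one request per TTL window no matter how many subscribers the gateway has, which is where the bandwidth saving would come from.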

Joshua Allen

Tuesday, 04 May 2004 10:27:19 (GMT Daylight Time, UTC+01:00)

An important note:

The Newzcrawler folks have made it clear that they've already addressed one of Phil's complaints in released code, and are working on throttling the traffic generated by their otherwise wonderful support for wfw:commentRss.

Far from deserving shame, they've warranted my continued support for their quick acknowledgement of the problem, and their resolve to address it. Best aggregator on the market... period.

Roger Benningfield

Tuesday, 04 May 2004 15:42:21 (GMT Daylight Time, UTC+01:00)

The real problem is not the tool itself but a user that misuses the tool. The frequency with which a user chooses to check a site is determined by their own needs. If they are going to check the site themselves and do so at that frequency, then it really does not matter if it is them or a machine checking for them and then letting them decide to view the content if it has changed. But as with many tools it could be abused if they have it checking considerably more often than they would go to the site themselves. Used properly, the user would not be a rude user but rather a more productive one.

wjvii

Tuesday, 04 May 2004 19:13:29 (GMT Daylight Time, UTC+01:00)

So as RSS grows in popularity, we start facing the bandwidth problem.

But if an author who provides an RSS feed and comment feeds does not understand what will happen when those features are used, and is left puzzling over the logs, what can we say about the end user? A subscriber wants to get content fast (with a high refresh rate) and to watch for comments other people make. Forcing the subscriber to click on comment feeds, start the comment download and wait for it to complete wastes their precious time (and money). So it is definitely the wrong solution.

IMHO, there is no problem with RSS clients or subscribers. The problem is with RSS content providers. A content provider may use HTTP headers (If-Modified-Since) and decide which news items to put in the content. This may reduce bandwidth by a factor of 10 to 100. Moreover, CommentRss is the worst choice. When we have N news items in the main feed and M news items in the user's database, we get N+M+1 requests to the HTTP server just to learn that 1-2 comments were posted. An author might use other RSS extensions (blogcomments:comments or annotate:reference, for example) and reduce server load by up to a factor of 10. It is time to think about it.
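A minimal sketch of the two points above, using only Python's standard library. It assumes the client remembers the Last-Modified value from its previous fetch and that the server honors If-Modified-Since; the function names are hypothetical, and the request count simply restates the N+M+1 figure given above.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request that asks only for content newer than last_modified."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, last_modified=None):
    """Return (body, last_modified); body is None when the server answers 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing to download
            return None, last_modified
        raise


def comment_feed_requests(n_items_in_feed, m_items_in_database):
    """Requests per refresh as described above: one per comment feed plus the main feed."""
    return n_items_in_feed + m_items_in_database + 1


# Example: 15 items in the main feed, 200 older items kept locally.
print(comment_feed_requests(15, 200))
```

With a conditional request, an unchanged feed costs the server a short 304 response instead of the full document, though it is still one round trip per feed.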

I think it is also time to write RSS client behavior recommendations and include them in the RSS (or RSS extensions) specification. I think developers of RSS aggregators could help with this. Unfortunately, the moderator of the information_aggregators group has denied my request for membership. Maybe there was some reason, I don’t know :((

Alexei Vorontsov


Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
