July 21, 2004
@ 07:15 AM

It seems every 3 months some prominent online publication complains about the amount of traffic RSS news readers cause to websites that provide RSS feeds. This time it is Slashdot with their post When RSS Traffic Looks Like a DDoS which references a post by Chad Dickerson, the CTO of Infoworld, entitled RSS growing pains. Chad writes

Several months ago, I spoke to a Web architect at a large media site and asked why his site didn’t support RSS. He raised the concern that thousands (or even millions) of dumb clients could wreak havoc on a popular Web site. Back when I was at CNN.com, I recall that our servers got needlessly pounded by a dumb client (IE4) requesting RSS-like CDF files at frequent intervals regardless of whether they had changed. As the popularity of RSS feeds at InfoWorld started to surge, I began to notice that most of the RSS clients out there requested and downloaded our feeds regardless of whether the feeds themselves had changed. At the time, we hadn’t quite reached the RSS tipping point, so I filed these thoughts away for later -- but “later” came sooner than I thought.

At this point I'd like to note that HTTP provides two mechanisms for web servers to tell clients whether a network resource has changed: the Last-Modified/If-Modified-Since and ETag/If-None-Match header pairs. The basics of these mechanisms are explained in the blog post HTTP Conditional Get for RSS Hackers, which describes how to prevent clients such as news readers from repeatedly downloading a Web document that hasn't been updated. It's worth pointing out that at the current time, the InfoWorld RSS feed supports neither mechanism.
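A news reader needs only a few lines of code to take advantage of this. The sketch below (Python, with a hypothetical feed URL) sends the If-None-Match and If-Modified-Since validators remembered from the previous poll; if the feed hasn't changed, the server answers with a bodiless 304 and no XML is downloaded:

```python
import urllib.request
import urllib.error

# Hypothetical feed URL; substitute the feed you are polling.
FEED_URL = "https://example.com/rss.xml"

def build_conditional_request(url, etag=None, last_modified=None):
    """Attach the validators remembered from the previous poll."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

def fetch_if_changed(req):
    """Return (body, etag, last_modified), or None if the feed is unchanged."""
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the cached copy is still good
            return None
        raise
    # Save these validators and replay them on the next poll.
    return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

On the first request the client has no validators, so it gets the full feed plus fresh ETag and Last-Modified values to replay next time.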

Another technique for reducing bandwidth consumption by HTTP clients is HTTP compression, which greatly reduces the amount of data that has to be sent when the feed does have to be downloaded. For example, the current InfoWorld feed is 7427 bytes, which shrinks to 2551 bytes when compressed with GZip on my home machine. That is a reduction by a factor of about 3, and on larger files the ratio of compressed size to original size is even better. Again, InfoWorld doesn't support this technique for reducing bandwidth consumption.
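The effect is easy to reproduce with a short script. The snippet below compresses a repetitive stand-in for a feed document (real RSS, with its repeated tag names, compresses similarly well); a client opts in to compression by sending an Accept-Encoding: gzip request header:

```python
import gzip

# Stand-in for a feed document; repeated markup compresses well,
# just as real RSS/XML does.
feed = ("<item><title>RSS growing pains</title>"
        "<link>https://example.com/article</link></item>\n" * 100).encode("utf-8")

compressed = gzip.compress(feed)
print(f"{len(feed)} bytes -> {len(compressed)} bytes "
      f"({len(feed) / len(compressed):.1f}x smaller)")
```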

It is unsurprising that they are seeing significant bandwidth consumption from news aggregators. An RSS reader polling the InfoWorld site once an hour over an 8 hour period downloads about 60 kilobytes of XML; if the site supported HTTP conditional GET requests and HTTP compression via GZip encoding, that number would be under 3 kilobytes.
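The arithmetic, using the sizes measured above (and ignoring the few hundred bytes of headers that each bodiless 304 reply costs):

```python
FEED_BYTES = 7427   # uncompressed InfoWorld feed, as measured above
GZIP_BYTES = 2551   # the same feed after gzip compression
POLLS = 8           # one poll per hour for 8 hours

naive = FEED_BYTES * POLLS   # every poll downloads the full feed
# With conditional GET only the first poll returns a body; the other
# seven get bodiless 304 replies. With gzip, that one body is small.
smart = GZIP_BYTES

print(naive, "bytes vs", smart, "bytes")
```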

The one thing HTTP doesn't provide is a way for a site to deal with numerous clients connecting at once. However, this problem isn't much different from the traditional scaling problem web sites already have to deal with when they get a lot of traffic from regular readers.

Wednesday, July 21, 2004 3:12:37 PM (GMT Daylight Time, UTC+01:00)
Exactly. I don’t know what all the hullaballoo is about. Scaling a site, whether it’s serving RSS or HTML, is a solved problem.
Wednesday, July 21, 2004 3:52:09 PM (GMT Daylight Time, UTC+01:00)
One thing very few feed providers bother to do in HTTP is set the content expiration headers. If they set those, aggregators would know not to even perform a conditional GET, and if the client is behind a caching proxy, the proxy would take care of this.

In this example, the ETag, Last-Modified, Expires, AND Cache-Control headers are all set:

Date: Wed, 21 Jul 2004 14:44:12 GMT
Server: Apache/2.0.49 (Fedora)
Last-Modified: Sat, 10 Jul 2004 12:00:08 GMT
ETag: "c6e0d-1375-dc752200"
Accept-Ranges: bytes
Content-Length: 4981
Vary: Accept-Encoding
Cache-Control: max-age=86400
Expires: Thu, 22 Jul 2004 14:44:12 GMT
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/xml

Now if we look at the InfoWorld feed response headers:

HTTP/1.1 200 OK
Date: Wed, 21 Jul 2004 14:48:06 GMT
Server: Apache
Accept-Ranges: bytes
Content-Length: 7252
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8

There is no ETag, Last-Modified, or Expires. No wonder it looks like a DoS attack on InfoWorld -- they are telling the world their content is continuously changing.
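To illustrate, here's a rough sketch (Python, with a hypothetical helper name) of how an aggregator could honor the max-age in the first header dump and skip the network entirely until the content is allowed to go stale:

```python
import re

def next_poll_time(headers, fetched_at):
    """Earliest time (in seconds) an aggregator should contact the
    server again, honoring Cache-Control: max-age. Until then, not
    even a conditional GET is needed."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return fetched_at + int(match.group(1))
    return fetched_at  # no freshness info: fall back to conditional GET

# With the headers above (max-age=86400), a feed fetched at t=0
# need not be re-requested for a full day.
print(next_poll_time({"Cache-Control": "max-age=86400"}, 0))  # → 86400
```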


Wednesday, July 21, 2004 10:59:34 PM (GMT Daylight Time, UTC+01:00)
"Oh no, they're requesting information from us every hour. Those horrible users. We should shut down our website and give them information via postal mail to save the servers."

Thanks for the post. What a silly argument.
Thursday, July 22, 2004 2:49:58 AM (GMT Daylight Time, UTC+01:00)
I didn't notice this before -- check out the content type of the InfoWorld feed in my post above. It's from http://www.infoworld.com/rss/applications.xml.

To really see that they are sending XML as HTML, run

wget -s http://www.infoworld.com/rss/applications.xml
Thursday, July 22, 2004 8:47:06 PM (GMT Daylight Time, UTC+01:00)
Excellent discussion, guys. We used to have ETags active, and I'm not sure what changed to disable them, but I'm researching that now. I've also enabled Expires. =)

Comments are closed.