One of the biggest concerns about RSS is the amount of bandwidth consumed by wasteful requests. Recently, in an internal mailing list discussion, there was a complaint about the bandwidth wasted because weblog servers send a news aggregator an RSS feed containing items it has already seen. A typical news feed contains 10 to 15 items where the oldest is a few weeks old and the newest is a few days old, and a typical user fetches the feed about once every other day. This means on average at least half the items in an RSS feed are redundant to the people subscribed to it, yet everyone (client and server) incurs the bandwidth cost of having those redundant items appear in the feed.

So how can this be solved? All the pieces to solve this puzzle are already on the table. Every news aggregator worth its salt (NetNewsWire, SharpReader, NewsGator, RSS Bandit, FeedDemon, etc.) uses HTTP conditional GET requests. What does that mean in English? It means that most aggregators send the time they last retrieved the RSS feed via the If-Modified-Since HTTP header, and the hash of the RSS feed provided by the server on the last fetch via the If-None-Match HTTP header. The interesting point is that although most news aggregators tell the server the last time they fetched the RSS feed, almost no weblog server I am aware of actually uses this information to tailor the RSS feed it sends back. The weblog software I use is guilty of this as well.
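The client side of that exchange can be sketched in a few lines; the helper name and the saved-state dictionary below are illustrative, not taken from any of the aggregators listed:

```python
def conditional_headers(last_fetch):
    """Build conditional GET headers from state saved after the previous fetch."""
    headers = {}
    if "last_modified" in last_fetch:
        # tells the server when we last saw the feed change
        headers["If-Modified-Since"] = last_fetch["last_modified"]
    if "etag" in last_fetch:
        # echoes back the validator the server handed us last time
        headers["If-None-Match"] = last_fetch["etag"]
    return headers

state = {"last_modified": "Wed, 05 Nov 2003 17:30:18 GMT", "etag": '"0x8675309"'}
print(conditional_headers(state))
```

A server that recognizes either validator can answer 304 Not Modified with an empty body instead of re-sending the whole feed.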

If you fetched my RSS feed yesterday or the day before, there is no reason for my weblog server to send you a 200K file containing five entries from last week, which it currently does. Actually it is worse: currently my weblog software doesn't even perform the simple check of seeing whether there are any new items before choosing to send down a 200K file.

Currently the only optimization performed by weblog servers is this: if there are no new items, an HTTP 304 response is sent; otherwise a feed containing the last n items is sent. A further optimization is possible where the server only sends the items newer than the If-Modified-Since date sent by the client.
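A minimal sketch of that server-side check, assuming each item carries a timezone-aware publication date (the function and field names here are hypothetical, not from any weblog server):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def feed_response(items, if_modified_since):
    """Return (status, items to serialize): 304 when nothing is newer than
    the client's last fetch, otherwise only the items it has not yet seen."""
    since = parsedate_to_datetime(if_modified_since)
    fresh = [item for item in items if item["date"] > since]
    if not fresh:
        return 304, []
    return 200, fresh

items = [
    {"title": "old post", "date": datetime(2003, 10, 20, tzinfo=timezone.utc)},
    {"title": "new post", "date": datetime(2003, 11, 5, tzinfo=timezone.utc)},
]
status, body = feed_response(items, "Mon, 03 Nov 2003 00:00:00 GMT")
```

The client that last fetched on November 3 gets back only the November 5 item instead of the full feed.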

I'll ensure that this change makes it into the next release of dasBlog (the weblog software I use), and if you use weblog software I suggest requesting that your software vendor do the same.

UPDATE: There is a problem with the above proposal in that it calls for a reinterpretation of how If-Modified-Since is currently used by most HTTP clients, and it directly violates the HTTP spec, which states:

b) If the variant has been modified since the If-Modified-Since
         date, the response is exactly the same as for a normal GET.

The proposal is still valid, except that instead of misusing the If-Modified-Since header I'd propose that clients and servers respect a new custom HTTP header such as "X-Feed-Items-New-Than", whose value would be a date in the same format as that used by the If-Modified-Since header.
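On the client side this leaves standard HTTP semantics untouched; the custom header is only a hint. A sketch (the header name comes from the post above; everything else is illustrative):

```python
def request_headers(last_fetch_date):
    """Send the standard validator unchanged, plus the proposed filter hint."""
    return {
        # standard HTTP semantics stay intact: the server may still answer 304
        "If-Modified-Since": last_fetch_date,
        # proposed custom header: ask the server to include only items
        # newer than this date in a 200 response
        "X-Feed-Items-New-Than": last_fetch_date,
    }

print(request_headers("Wed, 05 Nov 2003 17:30:18 GMT"))
```

Servers that don't understand the custom header simply ignore it and behave as they do today.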


 

Wednesday, November 5, 2003 5:30:18 PM (GMT Standard Time, UTC+00:00)
Pyblosxom supports conditional get, but doesn't review entries to return only new items. It seems to me that with proper ETag support on the server, and Accept-Encoding: gzip, deflate, you've gone a long way to reducing your bandwidth usage, though.
Wednesday, November 5, 2003 6:23:05 PM (GMT Standard Time, UTC+00:00)
Dare, what you're proposing would be pretty much hopelessly broken by many proxy servers.
Wednesday, November 5, 2003 10:48:12 PM (GMT Standard Time, UTC+00:00)
Let's face facts: there's no simple way with an HTTP GET against a random URI that you can achieve what you want, without breaking a lot of stuff.

So let's stop hacking headers into the equation. That's the WRONG way to go about this. You want interactivity and a little intelligence? Create a new protocol, based on a POST (although I'm loathe to suggest SOAP or XML-RPC).

I agree with Gordon that compressing the stream is sufficient. I'd also suggest an amendment to the interpretation of RSS feeds, such as: no more than 3 days of items, unless that gives you fewer than 5 items, in which case you go back as far as you need to get 5 items. That should sufficiently cover the occasional poster (5 items of any age) and the person who can't stop posting one-liners (*coughScoblecough*) by letting them have 3 days of items, no matter how many that might be.
Thursday, November 6, 2003 12:39:37 AM (GMT Standard Time, UTC+00:00)
One solution would be to notify a central server when your blog changed and then have readers check the central server for blogs that have changed rather than hitting each site individually.

(That's my snarky way of saying, "ping blo.gs, damnit!")
Thursday, November 6, 2003 12:40:35 AM (GMT Standard Time, UTC+00:00)
BTW: are you aware that the comment engine for this site dies with a cryptic error if you try to use html tags?
Thursday, November 6, 2003 12:44:07 AM (GMT Standard Time, UTC+00:00)
ucblockhead,
Yes I do. That's another one of those things that I'll add to my list of things to file bugs against in their database.
Thursday, November 6, 2003 1:22:33 AM (GMT Standard Time, UTC+00:00)
Brad,
If you suggest a different protocol, then I'm surprised you'd be against using SOAP or XML-RPC. The number of things that need to be fixed in how weblog-related clients and servers interact may eventually require the functionality of technologies related to XML Web Services, such as WS-Security and WS-Policy, to name two.

What I was proposing was something I considered to be a simple way to reduce bandwidth costs, but Gordon is right that compression probably does a good enough job already.
Thursday, November 6, 2003 7:07:18 AM (GMT Standard Time, UTC+00:00)
If you upgrade to DasBlog 1.4 (you are running 1.3) you'll get gzip/deflate compression.
Omar Shahine
Thursday, November 6, 2003 8:08:42 AM (GMT Standard Time, UTC+00:00)
You are proposing to use HTTP headers to specify a feed query filter. This will break proxies and be difficult to program in some scenarios.

I recommend using HTTP GET with a query parameter to specify the filter. The basic idea is to add a version string to the syndication format. The format of the version string is determined by the server and is opaque to the client. The client presents the version string in a query parameter when fetching the feed. If there are no new items, then the server returns a response with no items and the same version string. If there are new items, then the server returns a response with a new version string and the new items.

This method can be combined with conditional GET and will work with proxies.

I recommend doing some analysis on what the bandwidth savings will be for this sort of optimization. It's possible that something like Shrook's distributed checking might save more bandwidth.
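The version-string scheme described above might be sketched like this; the query parameter name `v` and both function names are assumptions for illustration:

```python
def build_feed_url(feed_url, version=None):
    """Client side: present the opaque version string as a query parameter."""
    return feed_url if version is None else f"{feed_url}?v={version}"

def versioned_response(current_version, presented_version, fresh_items):
    """Server side: if the client is up to date, return an empty feed with
    the same version; otherwise return the new items and the new version."""
    if presented_version == current_version:
        return {"version": current_version, "items": []}
    return {"version": current_version, "items": fresh_items}
```

Because the version string rides in the URL rather than in a header, intermediate caches see distinct cacheable resources, which is why this plays well with proxies.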
Thursday, November 6, 2003 12:48:33 PM (GMT Standard Time, UTC+00:00)
With .Text I do not use the Last-Modified header, but I am using the ETag header.

What I do is store the date of the last post a specific client (ie, you) requested. I then query my posts and only return posts that have been added since the last time you requested a .Text feed. So if all works well, instead of sending the last 10, 15, or 20 posts, most .Text blogs should only be sending the most recent posts you have yet to download.
Thursday, November 6, 2003 3:45:25 PM (GMT Standard Time, UTC+00:00)
Scott, how do you identify the client with .Text?
Thursday, November 6, 2003 8:35:55 PM (GMT Standard Time, UTC+00:00)
Another solution, recently proposed for Atom but equally applicable to RSS, is to take the bulk of the content out of the feed and leave it on the server to be fetched separately (or prefetched for offline readers).
Thursday, November 6, 2003 9:02:57 PM (GMT Standard Time, UTC+00:00)
The client (ie aggregator) is not identified specifically. When a client requests a feed, I insert the date of the last updated post into the etag header.

When the same client requests the feed again, the etag date is sent back. I check to see if a new post has been added (or updated) since the etag date. If there are no new posts, I send back a 304 response. Otherwise, I send back the new posts (but only the new posts) and reset the etag value.
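The .Text approach described in this comment, a date smuggled inside the ETag, could be sketched as follows (function and field names are illustrative, not from the .Text source):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def feed_with_etag(posts, if_none_match=None):
    """posts is newest-first; each post has a timezone-aware 'date'.
    The ETag is simply the date of the newest post, quoted."""
    etag = '"%s"' % format_datetime(posts[0]["date"])
    if if_none_match == etag:
        return 304, [], etag            # nothing added since the last fetch
    if if_none_match:
        # the presented ETag *is* the client's last-seen date: filter on it
        since = parsedate_to_datetime(if_none_match.strip('"'))
        fresh = [p for p in posts if p["date"] > since]
    else:
        fresh = list(posts)             # first visit: send everything
    return 200, fresh, etag

posts = [
    {"title": "newer", "date": datetime(2003, 11, 6, 9, 0, tzinfo=timezone.utc)},
    {"title": "older", "date": datetime(2003, 11, 1, 9, 0, tzinfo=timezone.utc)},
]
status, fresh, etag = feed_with_etag(posts)          # first fetch: full feed
status2, fresh2, _ = feed_with_etag(posts, etag)     # repeat fetch: 304
```

Note this treats the ETag as meaningful data rather than an opaque validator, which is exactly the property the next comment objects to.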
Monday, November 10, 2003 8:07:25 PM (GMT Standard Time, UTC+00:00)
One HTTP- and proxy-friendly way to do this might be to:
1. make new content available at the end of an entity
2. vary the ETag: according to the content (e.g., CRC32)
3. support Range: and If-None-Match:
4. support HEAD
5. use Last-Modified: and friends for safety

Something like this (omitting some extraneous text):
/*
** original content
*/
// client
GET http://server/entity

// server
200 OK
ETag: "0x8675309"
Content-Length: 12451
Last-Modified: Fri, 31 Oct 2003 19:43:31 GMT

/*
** later refresh
*/
// client
HEAD http://server/entity

// server
200 OK
ETag: "0xABC12342"
Content-Length: 16384
Last-Modified: Fri, 07 Nov 2003 12:43:31 GMT

/*
** check if cached content changed
*/
// client
GET http://server/entity
Range: bytes=0-12450
If-None-Match: "0x8675309"

// server
304 Not Modified
Last-Modified: Fri, 07 Nov 2003 12:43:31 GMT

/*
** get new content
*/
// client
GET http://server/entity
Range: bytes=12451-
If-Unmodified-Since: Fri, 07 Nov 2003 12:43:31 GMT

// server
206 Partial Content
Last-Modified: Fri, 07 Nov 2003 12:43:31 GMT

Note that if the content had changed since the client's previous request, the server would reply with "412 Precondition Failed" due to the mismatched timestamp.
Gifford Hesketh
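The two conditional requests in the scheme above could be built from state saved after a full download. This is a sketch under assumptions (the helper name and state keys are illustrative), with byte ranges zero-indexed as HTTP requires:

```python
def range_refresh_requests(stored):
    """Build the 'did my cached prefix change?' check and the
    'give me only the new tail' fetch from saved state."""
    check_prefix = {
        "Range": "bytes=0-%d" % (stored["length"] - 1),  # the bytes we already hold
        "If-None-Match": stored["etag"],                 # 304 if prefix unchanged
    }
    fetch_tail = {
        "Range": "bytes=%d-" % stored["length"],         # everything appended since
        "If-Unmodified-Since": stored["last_modified"],  # 412 if content was rewritten
    }
    return check_prefix, fetch_tail

stored = {"length": 12451, "etag": '"0x8675309"',
          "last_modified": "Fri, 07 Nov 2003 12:43:31 GMT"}
check, tail = range_refresh_requests(stored)
```

The scheme only works if the feed is strictly append-only; any edit to an old entry invalidates the cached prefix, which the If-Unmodified-Since guard is there to catch.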
Tuesday, November 11, 2003 2:04:33 AM (GMT Standard Time, UTC+00:00)
Hey Scott,

If you're really doing what you say you're doing, it's broken by many caching proxy servers. I'm sending you an email now about it...
Friday, January 9, 2004 3:07:46 AM (GMT Standard Time, UTC+00:00)
The problems mentioned above concerning "breaking HTTP" and proxy caching servers can be eliminated by relying on an extension of RFC 3229 "Delta encoding in HTTP" rather than the specific mechanism mentioned by Dare. On the atom-syntax list, I have proposed the definition of a new, RFC 3229-compliant "items" delta-algorithm, which would provide a functional equivalent to Dare's proposal and can be implemented without the issues mentioned above.

bob wyman
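Under RFC 3229 the exchange would be shaped roughly as below; the "items" token is the delta-algorithm name proposed in this comment, not a registered value, and the validators are illustrative:

```python
# Client asks for a delta against the instance it already holds
request_headers = {
    "If-None-Match": '"0x8675309"',  # ETag of the cached copy of the feed
    "A-IM": "items",                 # acceptable instance-manipulations (RFC 3229)
}

# Server replies 226 IM Used, carrying only the new items plus a new validator
response_status = 226
response_headers = {
    "ETag": '"0xABC12342"',
    "IM": "items",                   # the delta encoding actually applied
}
```

Because RFC 3229 defines the distinct 226 status and IM header, proxies that don't understand the scheme fall back to ordinary caching instead of corrupting responses.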