January 25, 2003
@ 11:58 PM
URLs as GUIDs

Dave Winer has an essay entitled "Guids are not just for geeks anymore" which describes a technique used in his blogging software, Radio Userland, that caused me some consternation while developing RSS Bandit.

The primary feature that made me decide to go ahead and write RSS Bandit was the ability to track read/unread items in an RSS feed in a persistent manner. To track such items I had to find a unique way of identifying each item once downloaded, so I could compare its key against my application's list of read messages. From perusing various RSS feeds available online, it seemed only three elements showed up with any regularity as children of the item element: description, title and link. The title of an item is definitely not guaranteed to be unique, although either the description or the link probably would be. I was then torn between using an MD5 hash of an item's description and using the story link as the unique identifier.

There are pros and cons to both approaches. However, in many cases hashing the description seemed the better idea, especially since links are used inconsistently by various RSS feed providers in ways that may fail to guarantee uniqueness. In general, the link in an RSS item points to the story or blog entry about some topic, but in some cases, such as the feed provided by Eclectic, it is a link to the item being talked about in the story or blog entry. In the latter case the link may not be unique.

A further problem was how to detect when a news item or blog entry had been updated; in that case I wanted RSS Bandit to be able to reflag the message as unread. This is where I abandoned hashing the description: it turned out that some popular weblogging software, especially Radio Userland, usually changes the URL when a news item or blog entry is updated, since the URL is partially constructed from the time of the post, while failing to update the description in any way. The only question I have about this is what happens to people who linked to old versions of the entry before an update?
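The description-hashing approach I considered can be sketched in a few lines. This is a minimal illustration rather than RSS Bandit's actual code, and the helper name HashDescription is my own:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class ItemKey {

    //Hash an item's description into a hex string that can serve
    //as its key in a persistent list of read messages.
    public static string HashDescription(string description){
        MD5 md5 = MD5.Create();
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(description));
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
```

Since identical descriptions always hash to the same key, this works fine for read/unread tracking, but it cannot detect an update that leaves the description unchanged.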

#

// Considered Dangerous

Andy and I were talking recently, and he complained about the XPath abbreviated query //, which is short for /descendant-or-self::node()/. From the XPath recommendation:
For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node)
To Andy, such queries are bad; he describes them as fragile, in that changes to a document's structure can cause significant issues with queries like the one above. One example we constructed of a potential negative consequence of using // queries is using //title to get all the titles of books ordered by customers from an XML document containing customer order information, instead of something like /customers/customer/order/book/title. The problem Andy had with the former query is that if the XML document is extended in a future version to include movie information, it will accidentally grab those titles as well, while the latter would not.
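The over-matching is easy to demonstrate. Here is a sketch using a made-up order document that has been extended with movie information:

```csharp
using System;
using System.Xml;

class Overmatch {

    static void Main(){
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(@"<customers><customer>
              <order><book><title>Dune</title></book></order>
              <order><movie><title>Alien</title></movie></order>
            </customer></customers>");

        //the abbreviated query picks up the movie title too
        Console.WriteLine(doc.SelectNodes("//title").Count); //prints 2

        //the explicit path still matches only book titles
        Console.WriteLine(
            doc.SelectNodes("/customers/customer/order/book/title").Count); //prints 1
    }
}
```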

Andy has mentioned considering writing an article about the evils of //. However, I disagree that all uses of // are bad, and I like referring to examples that involve multiple XML documents containing islands of structure the user is interested in. For example, I use the query //rss:item to get all the RSS items from an RSS feed regardless of whether it is RSS 0.91, 1.0 or 2.0. This works because the structure of an RSS item is the same across all three versions, although the structure of the XML document containing the item varies from version to version. The code that does this is shown below:

    string rssNamespaceUri = "";

    if(feed.DocumentElement.LocalName.Equals("RDF") &&
        feed.DocumentElement.NamespaceURI.Equals("http://www.w3.org/1999/02/22-rdf-syntax-ns#")){ //RSS 1.0

        rssNamespaceUri = "http://purl.org/rss/1.0/";

    }else if(feed.DocumentElement.LocalName.Equals("rss")){ //RSS 0.91 & RSS 2.0

        rssNamespaceUri = feed.DocumentElement.NamespaceURI;
    }

    //convert RSS items in feed to RssItem objects and add to list
    XmlNamespaceManager nsMgr = new XmlNamespaceManager(feed.NameTable);
    nsMgr.AddNamespace("rss", rssNamespaceUri);

    foreach(XmlNode node in feed.SelectNodes("//rss:item", nsMgr)){
        RssItem item = MakeRssItem((XmlElement)node);
        //...add item to the list of items in the feed
    }

In examples like the above I think the usage of // is acceptable.

#

RSS Bandit: A Bad Netizen

I recently discovered that RSS Bandit attempted to download feeds every five minutes regardless of the user-specified delay for how often to attempt such downloads. Although it was correctly sending the "If-None-Match" and "If-Modified-Since" HTTP headers, it was still clogging up logfiles with an average of twelve requests an hour. It turned out that the problem lay in differing assumptions between me and the designers of the System.Net.HttpWebRequest class. Check out the code below:

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(current.link);

    request.Timeout = 1 * 60 * 1000; //one minute timeout
    request.UserAgent = this.UserAgent;
    request.Proxy = this.Proxy;

    HttpWebResponse response = (HttpWebResponse) request.GetResponse();

    if(response.StatusCode == HttpStatusCode.OK){

On the surface this seemed like fairly straightforward code, and I couldn't figure out where I had gone wrong. It turned out that the problem was the call to GetResponse [and the fact that I wasn't logging network exceptions, which would have helped me catch this sooner]. The designers of the class felt that any response that wasn't a success according to HTTP 1.1, or a certain class of redirection, was an exception. Since I consider an exception a fatal error, I didn't believe a message from the server indicating that I already have the cached message counts as an error, let alone a fatal one. They disagreed. :)

I've since fixed my code although from my referrer logs it does look like there may be one or two people among my "early adopters" who are still using the abusive bits. My apologies go out to all those who are getting too many hits from the bandit.
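For the curious, the fix amounts to catching the WebException and treating HTTP 304 (Not Modified) as the normal "feed unchanged" case rather than a failure. A rough sketch, with a method name of my own invention:

```csharp
using System;
using System.Net;

class FeedDownloader {

    //Returns the response for a changed feed, or null if the server
    //says our cached copy is still current (HTTP 304).
    public static HttpWebResponse TryGetFeed(string url, string etag){
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        if(etag != null){
            request.Headers.Add("If-None-Match", etag);
        }
        try{
            return (HttpWebResponse)request.GetResponse();
        }catch(WebException we){
            HttpWebResponse response = we.Response as HttpWebResponse;
            if(response != null && response.StatusCode == HttpStatusCode.NotModified){
                return null; //not an error: the feed simply hasn't changed
            }
            throw; //a genuine network or server failure
        }
    }
}
```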

#

Future Jerry Springer Guests

Last night I was dancing with a girl (Girl A) whose friend (Girl B) was dancing right beside us, groping the heck out of the guy she was dancing with. Shortly afterwards, she came over and tongue-kissed Girl A. This was confusing, so I asked Girl A about it.

So it turns out that Girl B is Girl A's best friend and the guy is Girl A's ex. They all live together and have some sort of three way sexual relationship. That isn't the kicker.

The kicker is that Girl A is engaged and is moving to the East coast to get married in a month or so but would like Girl B to move with her so they don't have to end their "friendship".

That's deep.

#

Your Daily Show Moment of Zen

Excerpted list of Winners of Open Source Product Excellence Awards Announced At Linuxworld
Best System Integration Software
Microsoft - Services for Unix 3.0
ftp://ftp.microsoft.com/developr/Interix/interix22/GPL.TXT

#



Get yourself a News Aggregator and subscribe to my RSS feed

Disclaimer: The above comments do not represent the thoughts, intentions, plans or strategies of my employer. They are solely my opinion.
 
