November 26, 2007
@ 01:56 PM

My weekend project was to read Dive Into Python and learn enough Python to port Sam Ruby's meme tracker (source code) from CPython to IronPython. Sam's meme tracker shows the most popular links from the past week across the blogs in his RSS subscriptions. A nice wrinkle Sam added to his code is that more recent posts are weighted higher than older posts, so a newly hot item with 3 or 4 links posted yesterday ends up ranking higher than an item with 6 to 10 posts about it from five days ago. That sort of weighting probably wouldn't have occurred to me if I had just hacked the feature on my own, so I'm glad I spent the time learning Sam's code.
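To make that concrete, here's some back-of-the-envelope arithmetic (my own simplified version, not Sam's exact formula), where a link's vote decays linearly from 1.0 when brand new to 0.0 at one week old:

def vote_weight(age_in_days):
    # a vote decays linearly from 1.0 (brand new) to 0.0 (a week old)
    return 1.0 - (age_in_days / 7.0)

print 4 * vote_weight(1)    # 4 links from yesterday     => ~3.43
print 8 * vote_weight(5)    # 8 links from five days ago => ~2.29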

There are a few differences between Sam's code and mine, the most significant being that I support two modes: showing the most popular items from all unread posts, and showing the most popular items from the past week. The other differences mainly have to do with the input types (Atom entries vs. RSS feeds) and using .NET libraries like System.Xml and System.IO instead of CPython libraries like libxml2 and glob. You can see the difference between the two approaches for determining top stories on my feed subscriptions below.

Top Stories in the Past Week

  1. Mobile Web: So Close Yet So Far (score: 1.71595943656)
  2. The Secret Strategies Behind Many "Viral" Videos (score: 1.52423410473)
  3. Live Documents is Powerful Stuff (score: 1.35218082421)

Top Stories in all Unread Posts

  1. OpenSocial (score: 5.0)
  2. The Future of Reading (score: 3.0)
  3. MySpace (score: 3.0) [Ed Note: Mostly related to OpenSocial]

As you can probably tell, the weighted scoring isn't used when determining top stories in all unread posts. I did this to ensure that the results didn't end up being too similar for both approaches; the sketch below shows how little separates the two modes. This functionality is definitely going to make its way into RSS Bandit now that I've figured out the basic details of how it should work. As much as I'd like to keep this code in IronPython, I'll probably port it to C# when integrating it, for a number of practical reasons including maintainability (Torsten shouldn't have to learn Python as well), performance and better integration into our application.
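The mode switch boils down to swapping out two small functions, roughly like this (a simplified paraphrase of the full listing at the end of this post; age_in_days is a stand-in for the DateTime arithmetic in the real code):

ONE_WEEK = 7.0

# mode 0: every unread item passes the filter and casts a flat vote
unread_filter = lambda item: item.read == 0
unread_vote   = lambda item: 1.0

# mode 1: only items from the past week pass, and newer items vote harder
recent_filter = lambda item: item.age_in_days < ONE_WEEK
recent_vote   = lambda item: 1.0 - (item.age_in_days / ONE_WEEK)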

Working with Python was a joy. I especially loved programming with a REPL. If I had a question about what some code does, it's pretty easy to write a few one or two liners to figure it out. Contrast this with using Web searches, trawling through MSDN documentation or creating a full blown program just to test out some ideas when using C# and Visual Studio. I felt a lot more productive even though all I was using was Emacs and a DOS prompt.
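For instance, here's the sort of throwaway session I mean, checking what the link-extraction regex from the script below actually returns in the IronPython console:

>>> import re
>>> href_regex = r"<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"
>>> re.findall(href_regex, "<p>see <a href='http://example.com'>this post</a></p>")
[('http://example.com', 'this post')]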

I expected the hardest part of my project to be getting my wife to tolerate me spending most of the weekend hacking code. That turned out not to be a problem, because it didn't take as long as I expected and for the most part we did spend the time together (me on the laptop, her reading The Other Boleyn Girl, both of us on the sofa).

There are at least two things that need some fine tuning. The first is that I take the link text for a story from the text of the links pointing at it, which doesn't produce very useful link text in over half of the cases. After generating the page, there may need to be a step that goes out to the HTML pages and extracts their title elements for use as link text; a sketch of what that might look like follows this paragraph. The second problem is that popular sites like Facebook and Twitter tend to show up in the list every once in a while just because people talk about them so much. This seems to happen less often than I expected, however, so it may not be a problem in practice.
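If I do add that post-processing step, it would probably look something like the following (a minimal sketch: fetch_title is a hypothetical helper, the regex-based title extraction is naive, and real code would need error handling for encodings and redirects):

import re, clr
clr.AddReference("System")
from System.Net import WebClient

title_regex = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def fetch_title(url, fallback):
    # download the page and pull out its <title> element,
    # falling back to the original link text on any failure
    try:
        html = WebClient().DownloadString(url)
        match = title_regex.search(html)
        return match and match.group(1).strip() or fallback
    except:
        return fallback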

Now I just have to worry about whether to call the button [Show Popular Stories] or [Show Most Linked Stories]. Thoughts?


import time, sys, re, System, System.IO, System.Globalization
from System import *
from System.IO import *
from System.Globalization import DateTimeStyles
import clr
clr.AddReference("System.Xml")
from System.Xml import *

#################################################################
#
# USAGE: ipy memetracker.py <directory-of-rss-feeds> <mode>
# mode = 0 show most popular links in unread items
# mode = 1 show most popular links from items from the past week
#################################################################

all_links = {}
one_week =  TimeSpan(7,0,0,0)

cache_location = r"C:\Documents and Settings\dareo\Local Settings\Application Data\RssBandit\Cache"
href_regex     = r"<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>"
regex          = re.compile(href_regex)

(popular_in_unread, popular_in_past_week) = range(2)
mode = popular_in_past_week

class RssItem:
    """Represents an RSS item"""
    def __init__(self, permalink, title, date, read, outgoing_links):
        self.outgoing_links = outgoing_links
        self.permalink      = permalink
        self.title          = title
        self.date           = date
        self.read           = read

def MakeRssItem(itemnode):
    link_node  = itemnode.SelectSingleNode("link")
    permalink  = link_node and link_node.InnerText or ''
    title_node = itemnode.SelectSingleNode("title")
    title      = title_node and title_node.InnerText or ''
    date_node  = itemnode.SelectSingleNode("pubDate")
    date       = date_node and DateTime.Parse(date_node.InnerText, None, DateTimeStyles.AdjustToUniversal) or DateTime.Now 
    read_node  = itemnode.SelectSingleNode("//@*[local-name() = 'read']")
    read       = read_node and int(read_node.Value) or 0
    desc_node  = itemnode.SelectSingleNode("description")
    # obtain href value and link text pairs
    outgoing   = desc_node and regex.findall(desc_node.InnerText) or []
    outgoing_links = {}
    # collect only unique href values from the entry by replacing the list returned by the regex with a dictionary
    for url, linktext in outgoing:
        outgoing_links[url] = linktext
    return RssItem(permalink, title, date, read, outgoing_links)   

if __name__ == "__main__":
    if len(sys.argv) > 1: #get directory of RSS feeds
        cache_location = sys.argv[1]
    if len(sys.argv) > 2: # mode = 0 means use only unread items, mode = 1 means use all items from past week
        mode           = int(sys.argv[2]) and popular_in_past_week or popular_in_unread

    print "Processing items from %s seeking items that are %s" % (cache_location,
                                                                  mode and "popular in items from the past week"
                                                                  or "popular in unread items" )
    #decide what filter function to use depending on mode
    filterFunc = mode and (lambda x : (DateTime.Now - x.date) < one_week) or (lambda x : x.read == 0)
    #in mode = 0 each entry linking to an item counts as a vote, in mode = 1 value of vote depends on item age
    voteFunc   = mode and (lambda x: 1.0 - (DateTime.Now.Ticks - x.date.Ticks) * 1.0 / one_week.Ticks) or (lambda x: 1.0)

    di = DirectoryInfo(cache_location)
    for fi in di.GetFiles("*.xml"):     
        doc = XmlDocument()
        doc.Load(Path.Combine(cache_location, fi.Name))
        # for each item in feed       
        #  1. Get permalink, title, read status and date
        #  2. Get list of outgoing links + link title pairs
        #  3. Convert above to RssItem object
        items = [ MakeRssItem(node) for node in doc.SelectNodes("//item")]
        feedTitle = doc.SelectSingleNode("/rss/channel/title").InnerText
        # apply filter to pick candidate items, then calculate vote for each outgoing url
        for item in filter(filterFunc, items):
            vote = (voteFunc(item), item, feedTitle)
            #add a vote for each of the URLs
            for url in item.outgoing_links:
                all_links.setdefault(url, []).append(vote)

    # tally the votes, only 1 vote counts per feed
    weighted_links = []
    for link, votes in all_links.items():
        site = {}
        for weight, item, feedTitle in votes:               
            site[feedTitle] = min(site.get(feedTitle,1), weight)
        weighted_links.append((sum(site.values()), link))
    weighted_links.sort()
    weighted_links.reverse()

    # output the results, choose link text from first item we saw story linked from
    print "<ol>"   
    for weight, link in weighted_links[:10]:
        link_text = (all_links.get(link)[0])[1].outgoing_links.get(link)
        print "<li><a href='%s'>%s</a> (%s)" % (link, link_text, weight)
        print "<p>Seen on:"
        print "<ul>"
        for _, item, feedTitle in all_links.get(link):
            print "<li>%s: <a href='%s'>%s</a></li>" % (feedTitle, item.permalink, item.title)
        print "</ul></p></li>" 
    print "</ol>"