Thursday, 03 January 2008 - Dare Obasanjo's weblog

January 3, 2008

@ 04:51 PM

Facebook Right, Scoble Wrong: Social Network Interoperability and the O'Reilly Social Graph FOO Camp

I’ve read a number of stories this week that highlight that interoperability between social networking sites will be a “top ask” in 2008 (as we say at Microsoft). Earlier this week I read the Wired article Should Web Giants Let Startups Use the Information They Have About You? which does a good job of telling both sides of the story when it comes to startups ~~screen scraping~~ importing user data such as social graphs (i.e. friend and contact lists) from more successful sites as a way to bootstrap their social networks. It is a good read if you want to hear all sides of the issue of sharing user social data between sites.

Yesterday, I saw Social Network Aggregation, Killer App in 2008? which points out the problem that users often belong to multiple social networks at once and that bridging between them is key. However I disagree with the premise that this points to the need for a “Social Network Aggregator” category of applications. I personally believe that the list of 20 or so Social Network Aggregators on Mashable are all companies that would cease to exist if the industry got off its behind and worked towards actual interoperability between social networking sites.

Today, I saw that Facebook disabled Robert Scoble’s account. After reading Robert’s account of the incident, I completely agree with Facebook.

Why Robert Scoble is Wrong and Facebook is Right

Here’s what Robert Scoble wrote about the incident

My account has been “disabled” for breaking Facebook’s Terms of Use. I was running a script that got them to keep me from accessing my account
…
I am working with a company to move my social graph to other places and that isn’t allowable under Facebook’s terms of service. Here’s the email I received:

+++++

Hello,

Our systems indicate that you’ve been highly active on Facebook lately and viewing pages at a quick enough rate that we suspect you may be running an automated script. This kind of Activity would be a violation of our Terms of Use and potentially of federal and state laws.

As a result, your account has been disabled. Please reply to this email with a description of your recent activity on Facebook. In addition, please confirm with us that in the future you will not scrape or otherwise attempt to obtain in any manner information from our website except as permitted by our Terms of Use, and that you will immediately delete and not use in any manner any such information you may have previously obtained.

The first thing to note is that Facebook allows you to extract your social graph data from their site using the Facebook platform. In fact, right now whenever I get an email from someone on my Facebook friend list in Outlook or I get a phone call from them, I see the picture from their Facebook profile. I did this using OutSync which is an application that utilizes the Facebook platform to merge data from my contacts in Outlook/Exchange with my Facebook contacts.

So if Facebook allows you to extract information about your Facebook friends via their APIs, why would Robert Scoble need to run a screen scraping script? The fact is that the information returned by the Facebook API about a user contains no contact information (no email address, no IM screen names, no telephone numbers, no street address). Thus if you are trying to “grow virally” by spamming the Facebook friend list of one of your new users about the benefits of your brand new Web 2.0 site then you have to screen scrape Facebook. However there is the additional wrinkle that, unlike with address books in Web email applications, Robert Scoble did not enter any of this contact information about his friends. With this in mind, it is hard for Robert Scoble to argue that the data is “his” to extract from Facebook. In addition, as a Facebook user I consider it a feature that Facebook makes it hard for my personal data to be harvested in this way. Secondly, since Robert’s script was screen scraping, it had to hit the site five thousand times (once for each of his contacts) to fetch all of his friends’ personally identifiable information (PII). Given that eBay won a court injunction against Bidder’s Edge for running 100,000 queries a day, it isn’t hard to imagine that the kind of screen scraping script that Robert is using would be considered malicious even by a court of law.

I should note that Facebook is being a bit hypocritical here since they do screen scrape other sites to get the email addresses of the contacts of new users. This is why I’ve called them the Social Graph Roach Motel in the recent past.

O’Reilly Social Graph FOO Camp

This past weekend I got an email from Tim O'Reilly, David Recordon, and Scott Kveton inviting me to a Friends of O’Reilly Camp (aka FOO Camp) dedicated to “social graph” problems. I’m still trying to figure out if I can make it based on my schedule and whether I’m really the best person to be representing Microsoft at such an event given that I’m a technical person and “social graph problems” for the most part are not technical issues.

Regardless of whether I am able to attend or not, there are some topics I want to recommend be added to a list of “red herring” topics that shouldn’t be discussed until the important issues have been hashed out.

Google OpenSocial: This was an example of unfortunate branding. Google should really have called this “Google OpenWidgets” or “Google Gadgets for your Domain” since the goal was competing with Facebook’s widget platform not actually opening up social networks. Since widget platforms aren’t a “social graph problem” it doesn’t seem fruitful to spend time discussing this when there are bigger fish to fry.
Social Network Portability: When startups talk about “social network portability” it’s usually a euphemism for collecting a person’s username and password for another site, retrieving their contact/friend list and spamming those people about their hot new Web 2.0 startup. As a user of the Web, making it easier to receive spam from startups isn’t something I think should be done let alone a “problem” that needs solving. I understand that lots of people will disagree with this [even at Microsoft] but I’m convinced that this is not the real problem facing the majority of users of social networking sites on the Web today.

What I Want When It Comes to Social Network Interoperability

Having said what I don’t think is important to discuss when it comes to “social graph problems”, it would be rude not to provide an example of what I think would be a fruitful discussion. I wrote up the problem I think we should be solving as an industry a while back in a post entitled A Proposal for Social Network Interoperability via OpenID which is excerpted below

I have a Facebook profile while my ~~fiancée~~ wife has a MySpace profile. Since I’m now an active user of Facebook, I’d like her to be able to be part of my activities on the site such as being able to view my photos, read my wall posts and leave wall posts of her own. I could ask her to create a Facebook account, but I already asked her to create a profile on Windows Live Spaces so we could be friends on that service and quite frankly I don’t think she’ll find it reasonable if I keep asking her to jump from social network to social network because I happen to try out a lot of these services as part of my day job. So how can this problem be solved in the general case?

This is a genuine user problem which the established players have little incentive to fix. The data portability folks want to make it easy for you to jump from service to service. I want to make it easy for users of one service to talk to people on another service. Can you imagine if email interoperability was achieved by making it easy for Gmail users to export their contacts to Yahoo! mail instead of it being that Gmail users can send email to Yahoo! Mail users and vice versa?

Think about that.

Now playing: DJ Drama - The Art Of Storytellin' Part 4 (Feat. Outkast And Marsha Ambrosius)

Categories: Competitors/Web Companies | Social Software | Windows Live

January 2, 2008

@ 03:06 AM

Does C# 3.0 Beat Dynamic Languages at their Own Game?

For the past few years I've heard a lot of hype about dynamic programming languages like Python and Ruby. The word on the street has been that their dynamic nature makes developers more productive than those of us shackled to statically typed languages like C# and Java. A couple of weeks ago I decided to take the plunge and start learning Python after spending the past few years doing the majority of my software development in C#. I learned that it was indeed true that you could get the same stuff done in far fewer lines of Python than you could in C#. Since it is a general truism in the software industry that the number of bugs per thousand lines of code is constant irrespective of programming language, the more you can get done in fewer lines of code, the fewer defects you will have in your software.

Shortly after I started using Python regularly as part of the prototyping process for developing new features for RSS Bandit, I started trying out C# 3.0. I quickly learned that a lot of the features I'd considered as language bloat a couple of months ago actually made a lot of sense if you're familiar with the advantages of dynamic and functional programming approaches to the tasks of software development. In addition, C# 3.0 actually fixed one of the problems I'd encountered in my previous experience with a dynamic programming language while in college.

Why I Disliked Dynamism: Squeak Smalltalk

Back in my college days I took one of Mark Guzdial's classes which involved a group programming project using Squeak Smalltalk. At the time, I got the impression that Squeak was composed of a bunch of poorly documented libraries cobbled together from the top graded submissions to assignments from Mark's class. What was particularly frustrating about the lack of documentation was that even looking at method prototypes gave no insight into how to call a particular library. For example, here's a SalariedEmployee class taken from an IBM Smalltalk tutorial

"A subclass of Employee that adds protocol needed for 
          employees with salaries"

       Employee subclass: #SalariedEmployee
          instanceVariableNames:  'position salary'
          classVariableNames: ' '
          poolDictionaries: ' ' !

       ! SalariedEmployee publicMethods !

          position: aString 
             position := aString !
          position 
             ^position !
          salary: n 
             salary := n !
          salary 
             ^salary ! !

In the example above, there is a method called salary() that takes a parameter n whose type we don't know. n could be a string, an integer or a floating point number. If you are using the SalariedEmployee class in your code and want to set the employee's salary, the only way to find out what to pass in is to grep through the code and find out how the method is being used by others. You can imagine how frustrating it gets when every time you want to perform a basic task like making a Web request you have to grep around trying to figure out if the url parameter you pass to the Http classes is a string, a Uri class or some other random thing you haven't encountered yet.

For a long time, this was my only experience with a dynamic programming language and I thought it sucked...a lot.

Why I Grew to Love Dynamism: XML and C#

The first half of my career at Microsoft was spent working on the XML team which was responsible for the core XML processing APIs that are utilized by the majority of Microsoft's product line. One of the things that was so cool about XML was that it enabled data formats to be as strongly structured or semi-structured depending on the needs of the application. This flexibility is what gives us data formats like the Atom syndication format which although rigidly structured in parts (e.g. atom:entry elements MUST contain exactly one atom:id element, etc) also supports semi-structured data (e.g. atom:content can contain blocks of XHTML) and enables distributed extensibility where anyone on the Web is free to extend the data format as long as they place their extensions in the right namespace.

However one problem we repeatedly bumped against is that data formats in which unknown data types can show up at runtime clash with the notion of static typing that is a key aspect of languages like C#. I've written about this in the past in posts such as What's Right and Wrong with Code Generation in Web Services which is excerpted below

Another problem is that the inflexible and rigid requirements of static typing systems run counter to the distributed and flexible nature of the Web. I posted a practical example a few years ago in my post entitled Why You Should Avoid Using Enumerated Types in XML Web Services. In that example, I pointed out that if you have a SOAP Web Service that returns an enumeration with the possible values {CDF, RSS10, RSS20} and in a future release modify that enumeration by adding a new syndication format {CDF, RSS10, RSS20, Atom} then even if you never return that syndication format to old clients written in .NET, these clients will still have to be recompiled because of the introduction of a new enumeration value. I find it pretty ridiculous that till today I have a list of "people we need to call and tell to recompile their code whenever we change an enum value in any of our SOAP Web Services".
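To make the brittleness concrete, here is a minimal Python sketch. It is not the .NET SOAP serializer, and it only shows the simpler case where a newly added value actually reaches an old client. The names SyndicationFormatV1, parse_strict and parse_tolerant are hypothetical.

```python
from enum import Enum


class SyndicationFormatV1(Enum):
    """The enumeration as an old client knew it (hypothetical name)."""
    CDF = "CDF"
    RSS10 = "RSS10"
    RSS20 = "RSS20"


def parse_strict(value):
    # Mirrors a client generated from the old contract:
    # any value outside the known set is an error.
    return SyndicationFormatV1(value)


def parse_tolerant(value):
    # A loosely coupled client keeps values it does not recognize
    # as plain strings instead of failing.
    try:
        return SyndicationFormatV1(value)
    except ValueError:
        return value
```

With this sketch, parse_strict("Atom") raises ValueError while parse_tolerant("Atom") simply hands back the string, which is the kind of forgiveness the static contract does not offer.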

I came to the realization that some degree of dynamism is desirable especially when dealing with the loosely coupled world of distributed programming on the Web. I eventually decided to ignore my earlier misgivings and start exploring dynamic programming languages. I chose IronPython because I could focus on learning the language while relying on the familiar .NET Framework class library when I wanted to deal with necessary tasks like file I/O or Web requests.

After getting up to speed with Python and then comparing it to C# 2.0, it was clear that the dynamic features of Python made my life as a programmer a lot easier. However something interesting happened along the way. Microsoft shipped C# 3.0 around the same time I started delving into Python. As I started investigating C# 3.0, I discovered that almost all the features I'd fallen in love with in Python which made my life as a developer easier had been integrated into C#. In addition, there was also a feature which is considered to be a killer feature of the Ruby programming language which also made it into C# 3.0.

Python vs. C# 3.0: Lambda Expressions

According to the Wikipedia entry on Dynamic Programming Languages

There are several mechanisms closely associated with the concept of dynamic programming. None are essential to the classification of a language as dynamic, but most can be found in a wide variety of such languages.
...
Higher-order functions
However, Erik Meijer and Peter Drayton caution that any language capable of loading executable code at runtime is capable of eval in some respect, even when that code is in the form of dynamically linked shared libraries of machine code. They suggest that higher-order functions are the true measure of dynamic programming, and some languages "use eval as a poor man's substitute for higher-order functions."^[1]

The capability of a programming language to treat functions as first class objects that can be the input(s) or the output(s) of a function call is a key feature of many of today's popular "dynamic" programming languages. Additionally, a shorthand syntax where anonymous blocks of code can be treated as function objects is now commonly known as "lambda expressions". Although C# has had functions as first class objects since version 1.0 with delegates and introduced anonymous delegates in C# 2.0, it is in C# 3.0 where the shorthand syntax of lambda expressions has found its way into the language. Below are source code excerpts showing the difference between the lambda expression functionality in C# and IronPython

C# Code

//decide what filter function to use depending on mode 
Func<RssItem, bool> filterFunc = null;
if(mode == MemeMode.PopularInPastWeek) 
   filterFunc = x => (DateTime.Now - x.Date < one_week) ;
else 
   filterFunc = x => x.Read == false;

IronPython Code

#decide what filter function to use depending on mode
filterFunc = mode and (lambda x : (DateTime.Now - x.date) < one_week) or (lambda x : x.read == 0)

Although the functionality is the same, it takes a few more lines of code to express the same idea in C# 3.0 than in Python. The main reason for this is due to the strong and static typing requirements in C#. Ideally developers should be able to write code like

Func<RssItem, bool> filterFunc = (mode == MemeMode.PopularInPastWeek ? x => (DateTime.Now - x.Date < one_week) : x => x.Read == false);

However this doesn’t work because a lambda expression has no type of its own until it is converted to a delegate type, so the compiler cannot determine a common type for the two lambda expressions that can be returned by the conditional expression. Despite the limitations due to the static and strong typing requirements of C#, the lambda expression feature in C# 3.0 is extremely powerful.
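For comparison, Python 2.5 and later have a conditional expression that expresses the same choice without the and/or idiom. The sketch below is only an illustration: Item is a hypothetical stand-in for RssItem and choose_filter is a made-up helper name.

```python
from datetime import datetime, timedelta

one_week = timedelta(days=7)


class Item:
    """Hypothetical stand-in for RssItem."""
    def __init__(self, date, read):
        self.date = date
        self.read = read


def choose_filter(popular_in_past_week):
    # Both branches are ordinary function objects, so no common
    # delegate type has to be declared for the expression.
    return ((lambda x: datetime.now() - x.date < one_week)
            if popular_in_past_week
            else (lambda x: not x.read))


fresh_unread = Item(datetime.now() - timedelta(days=1), read=False)
old_read = Item(datetime.now() - timedelta(days=30), read=True)
```

Calling choose_filter(True) gives a filter that accepts fresh_unread and rejects old_read; choose_filter(False) makes the same decision based on the read flag instead of the date.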

You don’t have to take my word for it. Read Joel Spolsky’s Can Your Programming Language Do This? and Peter Norvig’s Design Patterns in Dynamic Programming. Peter Norvig’s presentation makes a persuasive argument that a number of the Gang of Four’s Design Patterns either require a lot less code or are simply unneeded in a dynamic programming language that supports higher order functions. For example, he argues that the Strategy pattern does not need separate classes for each algorithm in a dynamic language and that closures eliminate the need for Iterator classes. Read the entire presentation, it is interesting and quite illuminating.
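As a rough illustration of the Strategy point (this is not taken from Norvig's slides), here is a Python sketch where each "strategy" is just a function passed as an argument. The names rank, by_score and by_url, and the sample data, are hypothetical.

```python
def by_score(link):
    return link["score"]


def by_url(link):
    return link["url"]


def rank(links, strategy):
    # The algorithm to use is an ordinary function argument,
    # so no Strategy interface or per-algorithm class is needed.
    return sorted(links, key=strategy)


links = [
    {"url": "b.example", "score": 2.0},
    {"url": "a.example", "score": 5.0},
]

ranked_by_url = rank(links, by_url)
ranked_by_score = rank(links, by_score)
# A one-off strategy can be supplied inline as a lambda.
ranked_by_score_desc = rank(links, lambda link: -link["score"])
```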

Python vs. C# 3.0: List Comprehensions vs. Language Integrated Query

A common programming task is to iterate over a list of objects and either filter or transform the objects in the list thus creating a new list. Python has list comprehensions as a way of simplifying this common programming task. Below is an excerpt from An Introduction to Python by Guido van Rossum on list comprehensions

List comprehensions provide a concise way to create lists without resorting to use of map(), filter() and/or lambda. The resulting list definition tends often to be clearer than lists built using those constructs. Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result will be a list resulting from evaluating the expression in the context of the for and if clauses which follow it.

Below is a code sample showing how list comprehensions can be used to first transform a list of objects (i.e. XML nodes) to another (i.e. RSS items) and then how the resulting list can be further filtered to those from a particular date.

IronPython Code

# for each item in feed
# convert each <item> to an RssItem object then apply filter to pick candidate items
items = [ MakeRssItem(node) for node in doc.SelectNodes("//item")]
filteredItems = [item for item in items if filterFunc(item)]

My friend Erik Meijer once observed that certain recurring programming patterns become more obvious as a programming language evolves. These patterns first become encapsulated by APIs and eventually become part of the programming language’s syntax. This is what happened in the case of Python’s map() and filter() functions which eventually gave way to list comprehensions.
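The progression from API to syntax is easy to see side by side. The sketch below uses made-up sample data and writes the same projection and selection twice, once with map() and filter() and once as a list comprehension.

```python
numbers = [1, 2, 3, 4, 5, 6]

# API style: higher-order functions plus lambdas
squares_of_evens_api = list(
    map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers))
)

# Syntax style: the same filter and transform as a list comprehension
squares_of_evens = [n * n for n in numbers if n % 2 == 0]
```

Both expressions produce the same list; the comprehension just reads closer to the description of the task.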

C# 3.0 does something similar but goes a step further. In C# 3.0, the language designers made the observation that performing SQL-like projection and selection is really the common operation and not just filtering/mapping of lists. This led to Language Integrated Query (LINQ). Below is the same filtering operation on a list of XML nodes performed using C# 3.0

C# 3.0 Code

//for each item in feed        
// convert each <item> to an RssItem object then apply filter to pick candidate items
var items = from rssitem in 
              (from itemnode in doc.Descendants("item") select MakeRssItem(itemnode))
            where filterFunc(rssitem)
            select rssitem;

These are two fundamentally different approaches to tackling the same problem. Where LINQ really shines is when it is combined with custom data sources, as with LINQ to SQL, which translates the query operations into SQL, and LINQ to XML, which applies the same query operations to in-memory XML trees.

Python vs. C# 3.0: Tuples and Dynamic Typing vs. Anonymous Types and Type Inferencing

As I’ve said before, tuples are my favorite Python feature. I’ve found tuples useful in situations where I have to temporarily associate two or three objects and don’t want to go through the hassle of creating a new class just to represent the temporary association between these types. I’d heard about a new feature in C# 3.0 called anonymous types which seemed like it would be just what I need to fix this pet peeve once and for all. The description of the feature is as follows

Anonymous types are a convenient language feature of C# and VB that enable developers to concisely define inline CLR types within code, without having to explicitly define a formal class declaration of the type.

I assumed this feature in combination with the var keyword would make it so I would no longer miss Python tuples when I worked with C#. However I was wrong. Let’s compare two equivalent blocks of code in C# and IronPython. Pay particular attention to the lines that create the vote and the lines that read it back

IronPython Code

for item in filteredItems:
    vote = (voteFunc(item), item, feedTitle)

    #add a vote for each of the URLs
    for url in item.outgoing_links.Keys:
        if all_links.get(url) == None:
            all_links[url] = []
        all_links.get(url).append(vote)

# tally the votes, only 1 vote counts per feed
weighted_links = []
for link, votes in all_links.items():
    site = {}
    for weight, item, feedTitle in votes:
        site[feedTitle] = min(site.get(feedTitle,1), weight)
    weighted_links.append((sum(site.values()), link))
weighted_links.sort()
weighted_links.reverse()

The key things to note about the above code block are (i) the variable named vote is a tuple of three values: the numeric weight given to a link received from a particular RSS item, the RSS item itself and the title of the feed, and (ii) the items in the tuple can be unpacked into individual variables when looping over the list of votes in a for loop.
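Here is a small self-contained run of the tallying step, with made-up votes in place of real feed data (the URLs, item names and weights are all hypothetical), showing the tuples being packed, unpacked and sorted.

```python
# Hypothetical sample data: url -> list of (weight, item, feed_title) votes
all_links = {
    "http://a.example/": [
        (0.75, "item1", "Feed A"),
        (0.5, "item2", "Feed A"),
        (0.25, "item3", "Feed B"),
    ],
    "http://b.example/": [
        (0.125, "item4", "Feed B"),
    ],
}

weighted_links = []
for link, votes in all_links.items():
    site = {}
    # Each vote tuple is unpacked straight into three loop variables.
    for weight, item, feed_title in votes:
        # Only the lowest-weight vote from each feed is kept.
        site[feed_title] = min(site.get(feed_title, 1), weight)
    weighted_links.append((sum(site.values()), link))

# Tuples compare element by element, so this orders by score first.
weighted_links.sort(reverse=True)
```

The first link ends up with a score of 0.75 (0.5 from Feed A plus 0.25 from Feed B) and the second with 0.125, so weighted_links lists the first link ahead of the second.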

Here’s the closest I could come in C# 3.0

C# 3.0 Code

// calculate vote for each outgoing url
foreach (RssItem item in items) {
    var vote = new Vote(){ Weight=voteFunc(item), Item=item, FeedTitle=feedTitle };
    //add a vote for each of the URLs
    foreach (var url in item.OutgoingLinks.Keys) {
        List<Vote> value = null;
        if (!all_links.TryGetValue(url, out value))
            value = all_links[url] = new List<Vote>();

        value.Add(vote);
    }
}// foreach (RssItem item in items)

//tally the votes
List<RankedLink> weighted_links = new List<RankedLink>();
foreach (var link_n_votes in all_links) {
    Dictionary<string, double> site = new Dictionary<string, double>();
    foreach (var vote in link_n_votes.Value) {
        double oldweight;
        site[vote.FeedTitle] = site.TryGetValue(vote.FeedTitle, out oldweight) ?
                               Math.Min(oldweight, vote.Weight): vote.Weight;
    }
    weighted_links.Add(new RankedLink(){Score=site.Values.Sum(), Url=link_n_votes.Key});
}
weighted_links.Sort((x, y) => y.Score.CompareTo(x.Score));

The relevant line above is

var vote = new Vote() { Weight=voteFunc(item), Item=item, FeedTitle=feedTitle };

which I had INCORRECTLY assumed I would be able to write as

var vote = new { Weight=voteFunc(item), Item=item, FeedTitle=feedTitle };

In Python, dynamic typing is all about the developer knowing what types they are working with while the compiler is ignorant about the data types. However type inferencing in C# supports the opposite scenario, when the compiler knows the data types but the developer does not.

The specific problem here is that if I place an anonymous type in a list, I have no way of knowing what the data type of the object I’m pulling out of the list will be. So I will either have to interact with them as instances of System.Object when popped from the list, which makes them pretty useless, or access their fields via reflection. Python doesn’t have this problem because I don’t need to know the type of an object to interact with it. I just need to know what properties/fields and methods it supports.
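A minimal Python sketch of that last point, using two unrelated and entirely hypothetical types (Vote and ManualVote) that happen to expose the same attributes:

```python
from collections import namedtuple

# Two unrelated types that expose the same attribute names
Vote = namedtuple("Vote", ["weight", "feed_title"])


class ManualVote:
    def __init__(self, weight, feed_title):
        self.weight = weight
        self.feed_title = feed_title


votes = [Vote(0.5, "Feed A"), ManualVote(0.25, "Feed B")]

# No element type has to be declared to read the attributes
# back out of the list; each object only needs to have them.
total = sum(v.weight for v in votes)
titles = [v.feed_title for v in votes]
```

The list never declares what it holds, yet the attribute access works for both objects because the lookup happens at runtime.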

At the end of the day, I realized that the var keyword is really only useful when constructing anonymous types as a result of LINQ expressions. In every other instance where it is acceptable to use var, you have to know the type of the object anyway so all you are doing is saving keystrokes by using it. Hmmmm.

Ruby vs. C# 3.0: Extension Methods

Extension methods are a fairly disconcerting feature that has been made popular by the Ruby programming language. The description of the feature is excerpted below

Extension methods allow developers to add new methods to the public contract of an existing CLR type, without having to sub-class it or recompile the original type. Extension Methods help blend the flexibility of "duck typing" support popular within dynamic languages today with the performance and compile-time validation of strongly-typed languages.

I consider this feature to be the new incarnation of operator overloading. Operator overloading became widely reviled because it made code harder to read because you couldn’t just look at a code block and know what it does if you didn’t know how the operators had been implemented. Similarly, looking at an excerpt of C# code you may not realize that everything isn’t what it seems.

I spent several minutes being confused today because I couldn’t get the line

XAttribute read_node = itemnode.XPathEvaluate("//@*[local-name() = 'read']") as XAttribute;

to compile. It turns out that XPathEvaluate is an extension method and you need to import the System.Xml.XPath namespace into your project before the XPathEvaluate() method shows up as a method in the XElement class.
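The same kind of hidden dependency shows up in dynamic languages that let you attach methods to an existing class at runtime. A minimal Python sketch, where Element is a hypothetical stand-in for a library type and attribute_or_default is a made-up helper:

```python
class Element:
    """Hypothetical stand-in for an XML element class from a library."""
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes


def attribute_or_default(self, key, default=None):
    return self.attributes.get(key, default)


# "Extending" the class after the fact. Code that calls the method
# only works if this assignment has already run somewhere.
Element.attribute_or_default = attribute_or_default

item = Element("item", {"read": "true"})
read_value = item.attribute_or_default("read")
missing_value = item.attribute_or_default("flagged", "false")
```

Looking only at the call site, nothing tells the reader that attribute_or_default is not part of the original class definition.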

I’ve heard the arguments that Ruby makes it easier to express programmer intent and although I can see how XElement.XPathEvaluate(string) is a more readable choice than XPathQueryEngine.Evaluate(XElement, string) if you want to perform an XPath query on an XElement object, for now I think the readability issues it causes by hiding dependencies aren’t worth it. I wonder if any Ruby developers out there with a background in other dynamic languages that don’t have that feature (e.g. Python) care to offer a counter opinion based on their experience?

FINAL THOUGHTS

C# has added features that make it close to being on par with the expressiveness of functional and dynamic programming languages. The only thing missing is dynamic typing (not duck typing), which I’ve come to realize has a lot more going for it than lots of folks in the strongly and statically typed world would care to admit. At first, I had expected that after getting up to speed with C# 3.0, I’d lose interest in Python but that is clearly not the case.

I love the REPL, I love the flexibility that comes from having natural support for tuples in the language and I love the more compact syntax. I guess I’ll be doing a lot more coding in Python in 2008.

Now Playing: Da Back Wudz - U Gonna Love Me

Categories: Programming

January 2, 2008

@ 03:05 AM

A Memetracker in C# 3.0

A few weeks ago, I wrote a prototype for the meme tracking feature of RSS Bandit in IronPython. The code was included in my blog post A Meme Tracker In IronPython. The script was a port of Sam Ruby's original MeMeme script which shows the most recently popular links from a set of RSS feeds.

I was impressed with how succinct the code was in IronPython when compared to what the code eventually looked like when I ported it to C# 2.0 and integrated it into RSS Bandit. Looking over the list of new features in C# 3.0, it occurred to me that a C# 3.0 version of the script would be as concise or even more concise than the IronPython version. So I ported the script to C# 3.0 and learned a few things along the way.

I'll post something shortly that goes into some details on my perspectives on the pros and cons of the various C# 3.0 features when compared to various Python features. For now, here's the meme tracker script in C# 3.0. Comparing it to the IronPython version should provide some food for thought.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Globalization;

namespace Memetracker {

    enum MemeMode { PopularInUnread, PopularInPastWeek }

    class RankedLink{
       public string Url { get; set;}
       public double Score { get; set; }   
    }

    class Vote {
        public double Weight { get; set; }
        public RssItem Item { get; set; }
        public string FeedTitle { get; set; }
    }


    class RssItem {
        public string Title { get; set; }
        public DateTime Date { get; set; }
        public bool Read { get; set; }
        public string Permalink { get; set; }
        public Dictionary<string, string> OutgoingLinks { get; set; }
    }

    class Program {

        static Dictionary<string, List<Vote>> all_links = new Dictionary<string, List<Vote>>();
        static TimeSpan one_week = new TimeSpan(7, 0, 0, 0);
        static MemeMode mode = MemeMode.PopularInPastWeek;

        static string cache_location = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "Temp");
        static string href_regex = @"<a[\s]+[^>]*?href[\s]?=[\s""']+(.*?)[\""']+.*?>([^<]+|.*?)?<\/a>";
        static Regex regex       = new Regex(href_regex);

        static RssItem MakeRssItem(XElement itemnode) {

            XElement link_node = itemnode.Element("link");
            var permalink = (link_node == null ? "" : link_node.Value);
            XElement title_node = itemnode.Element("title");
            var title = (title_node == null ? "" : title_node.Value);
            XElement date_node = itemnode.Element("pubDate");
            var date = (date_node == null ? DateTime.Now : DateTime.Parse(date_node.Value, null, DateTimeStyles.AdjustToUniversal));
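            // XPathEvaluate returns a sequence for node-set results, so take the first match (or null) rather than casting with "as";
            // the leading "." keeps the search relative to this <item> instead of the whole feed
            XAttribute read_node = ((IEnumerable)itemnode.XPathEvaluate(".//@*[local-name() = 'read']")).Cast<XAttribute>().FirstOrDefault();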
            var read = (read_node == null ? false : Boolean.Parse(read_node.Value));
            XElement desc_node = itemnode.Element("description");
            // obtain href value and link text pairs
            var outgoing = (desc_node == null ? regex.Matches(String.Empty) : regex.Matches(desc_node.Value));
            var outgoing_links = new Dictionary<string, string>();
            //ensure we only collect unique href values from entry by replacing list returned by regex with dictionary
            if (outgoing.Count > 0) {
                foreach (Match m in outgoing)
                    outgoing_links[m.Groups[1].Value] = m.Groups[2].Value;
            }
            return new RssItem() { Permalink = permalink, Title = title, Date = date, Read = read, OutgoingLinks = outgoing_links };
        }

        static void Main(string[] args) {

            if (args.Length > 0) //get directory of RSS feeds
                cache_location = args[0];
            if (args.Length > 1) //mode = 0 means use only unread items, mode != 0 means use all items from past week
                mode = (Int32.Parse(args[1]) != 0 ? MemeMode.PopularInPastWeek : MemeMode.PopularInUnread);

            Console.WriteLine("Processing items from {0} seeking items that are {1}", cache_location,
                (mode == MemeMode.PopularInPastWeek ? "popular in items from the past week" : "popular in unread items"));
            
            //decide what filter function to use depending on mode 
            Func<RssItem, bool> filterFunc = null;
            if(mode == MemeMode.PopularInPastWeek) 
                filterFunc = x => (DateTime.Now - x.Date < one_week) ;
            else 
                filterFunc = x => x.Read == false;

            //in mode = 0 each entry linking to an item counts as a vote, in mode != 0 value of vote depends on item age
            Func<RssItem, double> voteFunc   = null; 
            if(mode == MemeMode.PopularInPastWeek) 
                voteFunc = x => 1.0 - (DateTime.Now.Ticks - x.Date.Ticks) * 1.0 / one_week.Ticks; 
            else 
                voteFunc = x => 1.0;

            
            var di = new DirectoryInfo(cache_location); 
            foreach(var fi in di.GetFiles("*.xml")){
                var doc = XElement.Load(Path.Combine(cache_location, fi.Name));
                // for each item in feed
                //  1. Get permalink, title, read status and date
                //  2. Get list of outgoing links + link title pairs
                //  3. Convert above to RssItem object
                //  4. apply filter to pick candidate items
                var items = from rssitem in 
                            (from itemnode in doc.Descendants("item")                            
                            select MakeRssItem(itemnode))
                            where filterFunc(rssitem)
                            select rssitem;
                var feedTitle = doc.XPathSelectElement("channel/title").Value;
                // calculate vote for each outgoing url
                foreach (RssItem item in items) { 
                    var vote = new Vote(){ Weight=voteFunc(item), Item=item, FeedTitle=feedTitle };
                    //add a vote for each of the URLs
                    foreach (var url in item.OutgoingLinks.Keys) {
                        List<Vote> value = null;
                        if (!all_links.TryGetValue(url, out value))
                            value = all_links[url] = new List<Vote>(); 
                            
                        value.Add(vote);                                                    
                    }
                }// foreach (RssItem item in items)
            }// foreach(var fi in di.GetFiles("*.xml"))
           
            //tally the votes
            List<RankedLink> weighted_links = new List<RankedLink>();
            foreach (var link_n_votes in all_links) {
                Dictionary<string, double> site = new Dictionary<string, double>();
                foreach (var vote in link_n_votes.Value) {
                    double oldweight;
                    site[vote.FeedTitle] = site.TryGetValue(vote.FeedTitle, out oldweight) ? 
                                            Math.Min(oldweight, vote.Weight): vote.Weight; 
                }
                weighted_links.Add(new RankedLink(){Score=site.Values.Sum(), Url=link_n_votes.Key});
            }
            weighted_links.Sort((x, y) => y.Score.CompareTo(x.Score));

            //output the results, choose link text from first item we saw story linked from
            Console.WriteLine("<html><body><ol>");
            foreach(var rankedlink in weighted_links.Take(10)){ //Take(10) tolerates fewer than ten links, GetRange(0, 10) would throw
                var link_text = (all_links[rankedlink.Url][0]).Item.OutgoingLinks[rankedlink.Url];
                Console.WriteLine("<li><a href='{0}'>{1}</a> {2}", rankedlink.Url, link_text, rankedlink.Score);
                Console.WriteLine("<p>Seen on:");
                Console.WriteLine("<ul>");
                foreach (var vote in all_links[rankedlink.Url]) {
                    Console.WriteLine("<li>{0}: <a href='{1}'>{2}</a></li>", vote.FeedTitle, vote.Item.Permalink, vote.Item.Title);
                }
                Console.WriteLine("</ul></p></li>");
            }
            Console.WriteLine("</ol></body></html>");
            Console.ReadLine();
        }
    }
}
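
For readers who would rather not wade through the C#, here is a rough Python sketch of just the scoring rule used in the listing above: a vote is worth 1.0 for a brand new item and decays linearly to 0.0 at one week old, each feed contributes at most one vote per link (the smallest one, mirroring the Math.Min call), and a link's score is the sum over feeds. The function names and the tuple layout are assumptions made for illustration, not part of the original program.

```python
from datetime import datetime, timedelta

ONE_WEEK = timedelta(days=7)

def vote_weight(item_date, now, by_age=True):
    """Weight of a single vote: 1.0 when by_age is False, otherwise
    1.0 for a brand new item falling linearly to 0.0 at one week old."""
    if not by_age:
        return 1.0
    return 1.0 - (now - item_date) / ONE_WEEK

def score_links(votes):
    """votes is an iterable of (url, feed_title, weight) tuples.
    Returns a list of (url, score) pairs, highest score first."""
    per_link = {}
    for url, feed, weight in votes:
        feeds = per_link.setdefault(url, {})
        # a feed that links to the same URL several times keeps only its smallest vote
        feeds[feed] = min(feeds.get(feed, weight), weight)
    ranked = [(url, sum(feeds.values())) for url, feeds in per_link.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Passing by_age=False corresponds to the unread-items mode, where every linking entry counts as one full vote.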

Now Playing: Lloyd Banks - Boywonder

Categories: Programming

December 31, 2007

@ 07:07 AM

Comments [7]

Command Line Client for Google Reader in IronPython

A few days ago I blogged about my plans to make RSS Bandit a desktop client for Google Reader. As part of that process I needed to verify that it is possible to programmatically interact with Google Reader from a desktop client in a way that provides a reasonable user experience. To this end, I wrote a command line client in IronPython based on the documentation I found at the pyrfeed Website.

The command line client isn't terribly useful on its own as a way to read your feeds but it might be useful for other developers who are trying to interact with Google Reader programmatically who would learn better from code samples than reverse engineered API documentation.

Enjoy...

PS: Note the complete lack of error handling. I never got the hang of error handling in Python, let alone going back and forth between handling errors in Python and handling the underlying .NET/CLR errors.

import sys
from System import *
from System.IO import *
from System.Net import *
from System.Text import *
from System.Globalization import DateTimeStyles
import clr
clr.AddReference("System.Xml")
from System.Xml import *
clr.AddReference("System.Web")
from System.Web import *

#################################################################
#
# USAGE: ipy greader.py <Gmail username> <password> <path-to-directory-for-storing-feeds>
# 
# username & password are required
# feed directory location is optional, defaults to C:\Windows\Temp\
#################################################################

#API URLs
auth_url          = r"https://www.google.com/accounts/ClientLogin?continue=http://www.google.com&service=reader&source=Carnage4Life&Email=%s&Passwd=%s"
feed_url_prefix   = r"http://www.google.com/reader/atom/"
api_url_prefix    = r"http://www.google.com/reader/api/0/"
feed_cache_prefix = r"C:\\Windows\Temp\\"
add_url           = r"http://www.google.com/reader/quickadd"

#enumerations
(add_label, remove_label) = range(1,3)

class TagList:
    """Represents a list of the labels/tags used in Google Reader"""
    def __init__(self, userid, labels):
        self.userid = userid
        self.labels = labels

class SubscriptionList:
    """Represents a list of RSS feeds subscriptions"""
    def __init__(self, modified, feeds):
        self.modified = modified
        self.feeds    = feeds

class Subscription:
    """Represents an RSS feed subscription"""
    def __init__(self, feedid, title, categories, firstitemmsec):
        self.feedid        = feedid
        self.title         = title
        self.categories    = categories
        self.firstitemmsec = firstitemmsec

def MakeHttpPostRequest(url, params, sid):
    """Performs an HTTP POST request to a Google service and returns the results in a HttpWebResponse object"""
    req = HttpWebRequest.Create(url)
    req.Method = "POST"
    SetGoogleCookie(req, sid)

    encoding = ASCIIEncoding();
    data     = encoding.GetBytes(params)

    req.ContentType="application/x-www-form-urlencoded"
    req.ContentLength = data.Length
    newStream=req.GetRequestStream()
    newStream.Write(data,0,data.Length)
    newStream.Close()
    resp = req.GetResponse()
    return resp

def MakeHttpGetRequest(url, sid):
    """Performs an HTTP GET request to a Google service and returns the results in an XmlDocument"""
    req          = HttpWebRequest.Create(url)
    SetGoogleCookie(req, sid)
    reader = StreamReader(req.GetResponse().GetResponseStream())
    doc          = XmlDocument()
    doc.LoadXml(reader.ReadToEnd())
    return doc

def GetToken(sid):
    """Gets an edit token which is needed for any edit operations using the Google Reader API"""
    token_url = api_url_prefix + "token"
    req          = HttpWebRequest.Create(token_url)
    SetGoogleCookie(req, sid)
    reader = StreamReader(req.GetResponse().GetResponseStream())
    return reader.ReadToEnd()

def MakeSubscription(xmlNode):
    """Creates a Subscription class out of an XmlNode that was obtained from the feed list"""
    id_node     = xmlNode.SelectSingleNode("string[@name='id']")
    feedid      = id_node and id_node.InnerText or ''
    title_node  = xmlNode.SelectSingleNode("string[@name='title']")
    title       = title_node and title_node.InnerText or ''
    fim_node    =  xmlNode.SelectSingleNode("string[@name='firstitemmsec']")
    firstitemmsec = fim_node and fim_node.InnerText or ''
    categories  = [MakeCategory(catNode) for catNode in xmlNode.SelectNodes("list[@name='categories']/object")]
    return Subscription(feedid, title, categories, firstitemmsec)

def MakeCategory(catNode):
    """Returns a tuple of (label, category id) from an XmlNode representing a feed's labels that was obtained from the feed list"""
    id_node     = catNode.SelectSingleNode("string[@name='id']")
    catid       = id_node and id_node.InnerText or ''
    label_node  = catNode.SelectSingleNode("string[@name='label']")
    label       = label_node and label_node.InnerText or ''
    return (label, catid)

def AuthenticateUser(username, password):
    """Authenticates the user and returns the SID token from the ClientLogin response (None if no SID line is found)"""
    req = HttpWebRequest.Create(auth_url % (username, password))
    reader = StreamReader(req.GetResponse().GetResponseStream())
    response = reader.ReadToEnd().split('\n')
    for s in response:
        if s.startswith("SID="):
            return s[4:]

def SetGoogleCookie(webRequest, sid):
    """Sets the Google authentication cookie on the HttpWebRequest instance"""
    cookie = Cookie("SID", sid, "/", ".google.com")
    cookie.Expires = DateTime.Now + TimeSpan(7,0,0,0)
    container      = CookieContainer()
    container.Add(cookie)
    webRequest.CookieContainer = container

def GetSubscriptionList(feedlist, sid):
    """Gets the users list of subscriptions"""
    feedlist_url = api_url_prefix + "subscription/list"
    #download the JSON-esque XML feed list
    doc = MakeHttpGetRequest(feedlist_url, sid)

    #create subscription nodes
    feedlist.feeds  = [MakeSubscription(node) for node in doc.SelectNodes("/object/list[@name='subscriptions']/object")]
    feedlist.modified = False

def GetTagList(sid):
  """Gets a list of the user's tags"""
  taglist_url = api_url_prefix + "tag/list"
  doc = MakeHttpGetRequest(taglist_url, sid)
  #get the user id needed for creating new labels from Google system tags

  userid = doc.SelectSingleNode("/object/list/object/string[contains(string(.), 'state/com.google/starred')]").InnerText
  userid = userid.replace("/state/com.google/starred", "");
  userid = userid[5:]
  #get the user-defined labels
  tags = [node.InnerText.Replace("user/" + userid + "/label/" ,"") for node in doc.SelectNodes("/object/list[@name='tags']/object/string[@name='id']") if node.InnerText.IndexOf( "/com.google/") == -1 ]
  return TagList(userid, tags)

def DownloadFeeds(feedlist, sid):
    """Downloads each feed from the subscription list to a local directory"""
    for feedinfo in feedlist.feeds:
        unixepoch  = DateTime(1970, 1,1, 0,0,0,0, DateTimeKind.Utc)
        oneweek_ago   = DateTime.UtcNow - TimeSpan(7,0,0,0) #use UTC so the offset from the UTC epoch is not skewed by the local time zone
        ifmodifiedsince = oneweek_ago - unixepoch
        feed_url = feed_url_prefix + feedinfo.feedid +  "?n=25&r=o&ot=" + str(int(ifmodifiedsince.TotalSeconds))
        continuation = True
        continuation_token = ''
        feedDoc      = None

        while True:
            print "Downloading feed at %s" % (feed_url  + continuation_token)
            doc = MakeHttpGetRequest(feed_url + continuation_token, sid)
            continuation_node     = doc.SelectSingleNode("//*[local-name()='continuation']")
            continuation_token    = continuation_node and ("&c=" + continuation_node.InnerText) or ''

            if feedDoc is None:
                feedDoc = doc
            else:
                for node in doc.SelectNodes("//*[local-name()='entry']"):
                    node = feedDoc.ImportNode(node, True)
                    feedDoc.DocumentElement.AppendChild(node)

            if continuation_token == '':
                break

        print "Saving %s" % (feed_cache_prefix + feedinfo.title + ".xml")
        feedDoc.Save(feed_cache_prefix + feedinfo.title + ".xml")

def ShowSubscriptionList(feedlist, sid):
    """Displays the users list of subscriptions including the labels applied to each item"""
    if feedlist.modified:
        GetSubscriptionList(feedlist, sid)
    count = 1
    for feedinfo in feedlist.feeds:
        print "%s. %s (%s)" % (count, feedinfo.title, [category[0] for category in feedinfo.categories])
        count = count + 1

def Subscribe(url, sid):
    """Subscribes to the specified feed URL in Google Reader"""
    params        = "quickadd=" + HttpUtility.UrlEncode(url) + "&T=" + GetToken(sid)
    resp = MakeHttpPostRequest(add_url, params, sid)

    if resp.StatusCode == HttpStatusCode.OK:
        print "%s successfully added to subscription list" % url
        return True
    else:
        print resp.StatusDescription
        return False

def Unsubscribe(index, feedlist, sid):
    """Unsubscribes from the feed at the specified index in the feed list"""
    unsubscribe_url = api_url_prefix + "subscription/edit"
    feed = feedlist.feeds[index]
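    #form values must be URL-encoded since feed titles and ids can contain '&', '=', spaces and so on
    params = "ac=unsubscribe&i=null&T=" + GetToken(sid) + "&t=" + HttpUtility.UrlEncode(feed.title)  + "&s=" + HttpUtility.UrlEncode(feed.feedid)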
    resp = MakeHttpPostRequest(unsubscribe_url, params, sid)

    if resp.StatusCode == HttpStatusCode.OK:
        print "'%s' successfully removed from subscription list" % feed.title
        return True
    else:
        print resp.StatusDescription
        return False

def Rename(new_title, index, feedlist, sid):
    """Renames the feed at the specified index in the feed list"""
    api_url = api_url_prefix + "subscription/edit"
    feed = feedlist.feeds[index]
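    #form values must be URL-encoded since feed titles and ids can contain '&', '=', spaces and so on
    params = "ac=edit&i=null&T=" + GetToken(sid) + "&t=" + HttpUtility.UrlEncode(new_title)  + "&s=" + HttpUtility.UrlEncode(feed.feedid)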
    resp = MakeHttpPostRequest(api_url, params, sid)

    if resp.StatusCode == HttpStatusCode.OK:
        print "'%s' successfully renamed to '%s'" % (feed.title, new_title)
        return True
    else:
        print resp.StatusDescription
        return False

def EditLabel(label, editmode, userid, feedlist, index, sid):
    """Adds or removes the specified label to the feed at the specified index depending on the edit mode"""
    full_label = "user/" + userid + "/label/" + label
    label_url = api_url_prefix + "subscription/edit"
    feed = feedlist.feeds[index]
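    #form values must be URL-encoded since feed titles, ids and labels can contain '&', '=', spaces and so on
    params = "ac=edit&i=null&T=" + GetToken(sid) + "&t=" + HttpUtility.UrlEncode(feed.title)  + "&s=" + HttpUtility.UrlEncode(feed.feedid)

    if editmode == add_label:
        params = params + "&a=" + HttpUtility.UrlEncode(full_label)
    elif editmode == remove_label:
        params = params + "&r=" + HttpUtility.UrlEncode(full_label)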
    else:
        return

    resp = MakeHttpPostRequest(label_url, params, sid)
    if resp.StatusCode == HttpStatusCode.OK:
        print "Successfully edited label '%s' of feed '%s'" % (label, feed.title)
        return True
    else:
        print resp.StatusDescription
        return False

def MarkAllItemsAsRead(index, feedlist, sid):
    """Marks all items from the selected feed as read"""
    unixepoch  = DateTime(1970, 1,1, 0,0,0,0, DateTimeKind.Utc)

    markread_url = api_url_prefix + "mark-all-as-read"
    feed = feedlist.feeds[index]
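    #UTC now minus the UTC epoch gives a correct Unix timestamp; the feed id is URL-encoded because it contains a URL
    params = "s=" + HttpUtility.UrlEncode(feed.feedid) + "&T=" + GetToken(sid) + "&ts=" + str(int((DateTime.UtcNow - unixepoch).TotalSeconds))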
    MakeHttpPostRequest(markread_url, params, sid)
    print "All items in '%s' have been marked as read" % feed.title

def GetFeedIndexFromUser(feedlist):
    """prompts the user for the index of the feed they are interested in and returns the index as the result of this function"""
    print "Enter the numeric position of the feed from 1 - %s" % (len(feedlist.feeds))
    index = int(sys.stdin.readline().strip())
    if (index < 1) or (index > len(feedlist.feeds)):
        print "Invalid index specified: %s" % index
        return -1
    else:
        return index

if __name__ == "__main__":
       if len(sys.argv) < 3:
           print "ERROR: Please specify a Gmail username and password"
       else:
           if len(sys.argv) > 3:
               feed_cache_prefix = sys.argv[3]

           SID = AuthenticateUser(sys.argv[1], sys.argv[2])
           feedlist = SubscriptionList(True, [])
           GetSubscriptionList(feedlist, SID)
           taglist = GetTagList(SID)

           options = "***Your options are (f)etch your feeds, (l)ist your subscriptions, (s)ubscribe to a new feed, (u)nsubscribe, (m)ark read , (r)ename, (a)dd a label to a feed, (d)elete a label from a feed or (e)xit***"
           print "\n"

           while True:
               print options
               cmd = sys.stdin.readline()
               if cmd == "e\n":
                   break
               elif cmd == "l\n": #list subscriptions
                   ShowSubscriptionList(feedlist, SID)
               elif cmd == "s\n": #subscribe to a new feed
                   print "Enter url: "
                   new_feed_url = sys.stdin.readline().strip()
                   success = Subscribe(new_feed_url, SID)

                   if feedlist.modified == False:
                       feedlist.modified = success
               elif cmd == "u\n": #unsubscribe from a feed
                   feed2remove_indx = GetFeedIndexFromUser(feedlist)
                   if feed2remove_indx != -1:
                       success = Unsubscribe(feed2remove_indx-1, feedlist, SID)

                       if feedlist.modified == False:
                           feedlist.modified = success
               elif cmd == "r\n": #rename a feed
                   feed2rename_indx = GetFeedIndexFromUser(feedlist)
                   if feed2rename_indx != -1:
                       print "'%s' selected" % feedlist.feeds[feed2rename_indx -1].title
                       print "Enter the new title for the subscription:"
                       success = Rename(sys.stdin.readline().strip(), feed2rename_indx-1, feedlist, SID)

                       if feedlist.modified == False:
                           feedlist.modified = success
               elif cmd == "f\n": #fetch feeds
                   DownloadFeeds(feedlist, SID) #returns nothing, so don't assign the result back to feedlist
               elif cmd == "m\n": #mark all items as read
                   feed2markread_indx = GetFeedIndexFromUser(feedlist)
                   if feed2markread_indx != -1:
                       MarkAllItemsAsRead(feed2markread_indx-1, feedlist, SID)
               elif (cmd == "a\n") or (cmd == "d\n"): #add/remove a label on a feed
                   editmode = (cmd == "a\n") and add_label or remove_label
                   feed2label_indx = GetFeedIndexFromUser(feedlist)
                   if feed2label_indx != -1:
                       feed = feedlist.feeds[feed2label_indx-1]
                       print "'%s' selected" % feed.title
                       print "%s" % ((cmd == "a\n") and "Enter the new label:" or "Enter the label to delete:")
                       label_name = sys.stdin.readline().strip()
                       success = EditLabel(label_name, editmode, taglist.userid, feedlist, feed2label_indx-1, SID)

                       if feedlist.modified == False:
                           feedlist.modified = success
               else:
                   print "Unknown command"

Now Playing: DJ Drama - Cannon (Remix) (Feat. Lil Wayne, Willie The Kid, Freeway And T.I.)

Categories: Programming

December 30, 2007

@ 11:19 PM

Comments [7]

REST APIs that Suck: Google Reader

REQUEST:

POST /reader/api/0/subscription/edit HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: www.google.com
Cookie: SID=DQAAAHoAAD4SjpLSFdgpOrhM8Ju-JL2V1q0aZxm0vIUYa-p3QcnA0wXMoT7dDr7c5FMrfHSZtxvDGcDPTQHFxGmRyPlvSvrgNe5xxQJwPlK_ApHWhzcgfOWJoIPu6YuLAFuGaHwgvFsMnJnlkKYtTAuDA1u7aY6ZbL1g65hCNWySxwwu__eQ
Content-Length: 182
Expect: 100-continue

s=http%3a%2f%2fwww.icerocket.com%2fsearch%3ftab%3dblog%26q%3dlink%253A25hoursaday.com%252Fweblog%2b%26rss%3d1&ac=subscribe&T=wAxsLRcBAAA.ucVzEgL9y7YfSo5CU5omw.w1BCzXzXHsyicU9R3qWgQ

RESPONSE:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Set-Cookie: GRLD=UNSET;Path=/reader/
Transfer-Encoding: chunked
Cache-control: private
Date: Sun, 30 Dec 2007 23:08:51 GMT
Server: GFE/1.3

<html><head><title>500 Server Error</title>
<style type="text/css">
      body {font-family: arial,sans-serif}
      div.nav {margin-top: 1ex}
      div.nav A {font-size: 10pt; font-family: arial,sans-serif}
      span.nav {font-size: 10pt; font-family: arial,sans-serif; font-weight: bold}
      div.nav A,span.big {font-size: 12pt; color: #0000cc}
      div.nav A {font-size: 10pt; color: black}
      A.l:link {color: #6f6f6f}
      </style></head>
<body text="#000000" bgcolor="#ffffff"><table border="0" cellpadding="2" cellspacing="0" width="100%"></table>
<table><tr><td rowspan="3" width="1%"><b><font face="times" color="#0039b6" size="10">G</font><font face="times" color="#c41200" size="10">o</font><font face="times" color="#f3c518" size="10">o</font><font face="times" color="#0039b6" size="10">g</font><font face="times" color="#30a72f" size="10">l</font><font face="times" color="#c41200" size="10">e</font>  </b></td>
<td> </td></tr>
<tr><td bgcolor="#3366cc"><font face="arial,sans-serif" color="#ffffff"><b>Error</b></font></td></tr>
<tr><td> </td></tr></table>
<blockquote><h1>Server Error</h1>
The server encountered a temporary error and could not complete your request.<p></p> Please try again in 30 seconds.
<p></p></blockquote>
<table width="100%" cellpadding="0" cellspacing="0"><tr><td bgcolor="#3366cc"><img alt="" width="1" height="4"></td></tr></table></body></html>
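
The exchange above pairs a "200 OK" status line with an HTML body titled "500 Server Error", so a client that trusts only the status code will think the subscription succeeded. As a rough sketch of how a client might defend itself, the function below also peeks at the HTML title when the status is 200. The function name and the title-sniffing heuristic are my own assumptions for illustration; nothing here is documented Google Reader behavior.

```python
import re

def effective_status(status_code, content_type, body):
    """Return the status a client should act on. If the server says 200 but
    sends back an HTML page whose <title> starts with a 4xx or 5xx code,
    report that code instead."""
    if status_code != 200:
        return status_code
    if content_type.lower().startswith("text/html"):
        match = re.search(r"<title>\s*(\d{3})\b", body, re.IGNORECASE)
        if match and match.group(1).startswith(("4", "5")):
            return int(match.group(1))
    return status_code
```

A caller would treat anything other than 200 from this function as a failed edit and retry later, which is what the error page itself asks for.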

Categories: Platforms | Programming | XML Web Services

December 28, 2007

@ 04:31 PM

Comments [11]

RSS Bandit (Phoenix) Thoughts: Integrating Google Reader and RSS Bandit

With v1.6.0.0 out the door, I've shipped what I think is our most interesting feature in years and resolved an issue that was making RSS Bandit a nuisance to lots of sites on the Internet.

The feature I'm currently working on is an idea I'm calling supporting multiple feed sources. For a few years, we've had support for roaming your feed list and read/unread state between two computers using an FTP site, a shared folder or NewsGator Online. Although useful, this functionality has always seemed bolted on. You have to manually upload and download feeds from these locations instead of things happening automatically and transparently as they do with the typical mail reader + mail server scenario (e.g. Outlook + Exchange) which is the most comparable model.

My original idea for the feature was simply to make the existing NewsGator and RSS Bandit integration work automatically instead of via a manual download so it could be more like Outlook + Exchange. Then I realized that there could never be full integration because there are feeds that RSS Bandit can read that a Web-based feed reader like NewsGator Online cannot (e.g. feeds within your company's intranet if you read feeds at work). This meant that we would need an explicit demarcation between feeds that roamed in NewsGator Online and those that were local to that machine.

In addition, I got a bunch of feedback from our users that there were a lot more of them using Google Reader than using NewsGator Online. Since I was already planning to do a bunch of work to streamline synchronizing with NewsGator Online, adding another Web-based feed reader didn't seem like a stretch. I'm currently working on a command line only prototype in IronPython which uses the information from the reverse engineered Google Reader API documentation to retrieve and update my feed subscriptions. I'm part way through and it seems that the Google Reader API is as full featured as the NewsGator API so we should be good to go. I should be able to integrate this functionality into RSS Bandit within the next few weeks.

The tricky part will be how the UI integration should work. For example, Google Reader doesn't support hierarchical folders of feeds like we do. Instead there is a flat namespace of tag names but each feed can have one or more tags applied to it. On the flip side, NewsGator Online uses the hierarchical folder model like RSS Bandit does. I'm considering moving to a more Google Reader friendly model in the next release where we flatten hierarchies and instead go with a flat tag-based approach to organizing feeds. In the case of feeds synchronized from NewsGator Online, we will prevent users from putting feeds in multiple categories since that won't be supported by the service.
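
To make the folder versus tag mismatch concrete, here is a small Python sketch of one possible flattening, in which every level of a folder path simply becomes a tag on the feed. This is only an illustration of the idea; the function name, the "/" separator and the choice to turn each folder level into its own tag are assumptions, not a description of what RSS Bandit does.

```python
def folders_to_tags(feed_folders):
    """feed_folders maps a feed URL to a folder path such as 'Tech/Microsoft'.
    Returns a dict mapping each feed URL to a flat list of tags, one per
    folder level, with duplicates and empty segments dropped."""
    tags = {}
    for feed_url, path in feed_folders.items():
        seen = []
        for part in path.split("/"):
            part = part.strip()
            if part and part not in seen:
                seen.append(part)
        tags[feed_url] = seen
    return tags
```

Going the other way is lossy, since a feed carrying two unrelated tags has no single place in a folder tree; that is the same limitation behind restricting a synchronized feed to one category.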

Now Playing: Eminem - Evil Deeds

Categories: RSS Bandit

December 26, 2007

@ 05:22 PM

Comments [6]

RSS Bandit v1.6.0.0 Released

The new version of RSS Bandit is now available. This release fixes a bug that causes the application to repeatedly request favicons from a feed's website in a manner that eventually resembles a denial of service attack. The new feature in this release is the [Top Stories] button.

The rationale for the new feature is given in Omar Shahine's blog post entitled Google Reader needs Mute. Omar wrote

Here is a feature that Google Reader needs: Mute.

Why, Cause subscribing to a lot of tech bloggers, a-list folks, and news outlets is extremely annoying when they write about the same thing. You get tired of seeing dozens or hundreds of posts about Kindle, Facebook, ThinkSecret and on and on.

These days I feel like my blogging info is like the local news (which I stopped watching some time back in high school).

So, please google, let me mute or mark read all feed items on a certain topic as read and save me the hassle of suffering through the repetition and pain.

The Top Stories feature is meant to target exactly this scenario. When you click on it, you get a list of the most recently popular items among your subscriptions. From there you can hit "Mark Items as Read" and mark all of the linking posts as read once you've gotten the gist of the story.

We don't have a Mute option where all posts that link to a story are automatically marked as read or deleted after being downloaded. This seems like overkill to me, but I would love to get some feedback from our users on whether this would be a desirable feature.
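
For the sake of discussion, a Mute option could be as simple as the following Python sketch: keep a set of muted story URLs and, after each download, mark any unread item that links to one of them as read. The dictionary layout and the function name are hypothetical and do not reflect RSS Bandit's actual data model.

```python
def apply_mute(items, muted_urls):
    """Mark every unread item that links to a muted story as read.
    items is a list of dicts with 'read' and 'outgoing_links' keys.
    Returns the number of items that were changed."""
    muted = set(muted_urls)
    changed = 0
    for item in items:
        if not item["read"] and muted.intersection(item["outgoing_links"]):
            item["read"] = True
            changed += 1
    return changed
```

The only difference from the "Mark Items as Read" button is that this would run automatically on newly downloaded items.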

Translations
This release is available in the following languages: English, German, Polish, French, Simplified Chinese, Russian, Brazilian Portuguese, Turkish, Dutch, Italian, Serbian and Bulgarian.

Installer
Download the installer from RssBandit1.6.0.0a_Installer.zip . A snapshot of the source code will be available later in the week as a source code release.

New Features

Top Stories button shows the ten most recently popular links in your subscriptions.
Twitter plugin enables posting tweets about news stories or responding to tweets in an RSS feed.

Major Bug Fixes

Del.icio.us plugin silently fails when posting items with tags containing special characters like '#' or '+'
Downloading feed list from NewsGator Online deletes local machine and intranet feeds
KeyNotFoundException if "Mark All Items as Read" clicked shortly after changing the URL for a subscribed feed.
100% CPU used when an RSS feed with no <channel> element is encountered.
Downloading favicons happens several times while the application is running instead of just once.
The "Check for updates" feature would sometimes result in the application crashing.

Categories: RSS Bandit

December 26, 2007

@ 05:21 PM

Comments [13]

The Facebook Effect: Google Reader Violates User's Privacy

There is a post in Slashdot user Felipe Hoffa's journal entitled Google Reader shares private data, ruins Christmas (alternate link) which contains a very damning indictment of the Google Reader team. It all starts with the release of the Sharing with Friends feature which is described below

We've just launched a new feature that makes it easier to follow your
friends' shared items in Google Reader. Check out the announcement on
our blog:
http://googlereader.blogspot.com/2007/12/reader-and-talk-are-friends....

The short description of it is this: If any of your friends from
Google Talk are using Reader and sharing items, you'll see them listed
in your sidebar under "Friends' shared items." Similarly, they'll be
able to see any items you're sharing. You can hide items from any
friend you don't want to see, and you can also opt out of sharing by
removing all your shared items. For full details, check out the
following help articles:
http://www.google.com/support/reader/bin/answer.py?answer=83000
http://www.google.com/support/reader/bin/answer.py?answer=83041

This is still a very experimental feature, so we'd love to hear what
you think of it.

Unsurprisingly, there has been a massive negative outcry about this feature. The main reason for the flood of complaints (many of which are excerpted in Felipe Hoffa's journal) is the fact that the Google Reader team has decided to define "friends" as anyone in your Gmail contact list.

On the surface this seems a lot like the initial backlash over the Facebook news feed. Google Reader users are complaining about their Gmail contacts having an easy way of viewing a list of feeds the user had already made public. I imagine that the Google folks have begun to make arguments like "If Facebook can get away with it, we should be able to as well" to justify some of their recent social networking moves such as this one and Google Profiles.

However, the Google Reader team failed to grasp two key aspects of social software here:

Internet Users Don't Fully Grasp that Everything on the Web is Public Unless Behind Access Controls: To most users of the Internet, if I create a Web page and don't tell anyone about it, then the page is private and known only to me. Similarly, if I create a blog or shared bookmarks on a social bookmarking site then no one should know about it unless I send them links to the page.

As someone who's worked on the Access Control technology behind Windows Live sharing initiatives from SkyDrive to Windows Live Spaces I know this isn't the case. The only way to make something private on the Web is to place it behind access controls that require users to be authenticated and authorized before they can view the content you've created.

The Google Reader developers assumed that their average users were like me and would assume that their content was public even if it had an obfuscated URL. The problem here is that even if it was "technically" true that Shared Items in Google Reader were public, albeit with an obfuscated URL, the fact that there was URL obfuscation involved implies that they realized that users didn't want their Shared Items to be PUBLIC. Arguing that the items were "technically" public and thus justifying broadcasting the items to the user's Gmail contacts seems dubious at best.

Friends in One Context are not Necessarily Friends in Another: The bigger problem is that the folks at Google are trying to build a unified social graph across all their applications as a way to compete with the powerful social network that Facebook has built. I've previously talked about the problems faced by a unified social graph based on what I've seen working on the ~~social graph~~ contacts platform for Windows Live. The fact that I send someone email does not mean that I want to make them an IM buddy nor does it mean that I want them to have access to all the items I find interesting in my RSS feeds since some of these items may reveal political, religious or even sexual leanings that I did not mean to share with someone I just happen to exchange email with frequently.

Deciding that, instead of having GTalk IM buddies, Gmail contacts, and Google Reader friends, users should just have Google Friends may simplify things for some program managers at Google but it causes problems for users who now have to deal with the consequence of their different social contexts beginning to bleed into each other. Even though Facebook is a single application, they have this problem with users having to manage contacts from multiple social contexts (family, friends, co-workers, etc) within a single application, let alone applications with extremely different uses.

My assumption is that the folks at Google Reader will put in some time over the weekend and will add granular privacy controls as recommended by Robert Scoble. I also predict that we will see more ham-fisted attempts to grow their social graph at the expense of user privacy from various large [and small] Web properties including Facebook in 2008.

In the words of Scott McNealy, "Privacy is Dead. Get Over It"

Categories: Social Software

December 26, 2007

@ 05:21 PM

Comments [2]

Congratulations to Justin Rudd

Justin Rudd writes in his blog post entitled Your Attention Please

After 3 years and 3 months, I am leaving my position at Amazon.com on December 31st.
...
My next “gig” is one that I am extraordinarily excited about. I’m going to Microsoft to be part of the Live Labs team. This group really excites me because it gives me a chance to find new areas for Microsoft Live to get into, to expand on what Microsoft Live already has, work closely with Microsoft Research, etc. This is a job that really excites the tinkerer side of my brain. I can’t wait to get started.

Many thanks to Dare Obasanjo for being my employee referral

Justin is my second official referral of someone I've "known" via reading their blog. I hope he ends up working at Microsoft a little longer than the last blog friend I referred. :)

Categories: Personal

December 21, 2007

@ 04:34 PM

Comments [5]

Amazon SimpleDB: The Good, the Bad and the Ugly

Sometime last week, Amazon soft launched Amazon SimpleDB, a hosted service for storing and querying structured data. This release plugged a hole in their hosted Web services offerings which include the Amazon Simple Storage Service (S3) and the Amazon Elastic Compute Cloud (EC2). Amazon’s goal of becoming the “Web OS” upon which the next generation of Web startups builds came off as hollow when all they gave you was BLOB storage and hosted computation but not structured storage. With SimpleDB, they’re almost at the point where all the tools you need for building the next del.icio.us or Flickr can be provided by Amazon’s Web Services. The last bit they need to provide is actual Web hosting so that developers don’t need to resort to absurd dynamic DNS hacks when interacting with their Amazon applications from the Web.

The Good: Commoditizing hosted services and getting people to think outside the relational database box

The data model of SimpleDB is remarkably similar to Google’s BigTable in that instead of having multiple tables and relations between them, you get a single ~~big~~ giant table which is accessed via the tuple of {row key, column key}. Although both SimpleDB and BigTable allow applications to store multiple values for a particular tuple, they do so in different ways. In BigTable, multiple values are additionally keyed by timestamp so I can access data using tuples such as {“http://www.example.com”, “incoming_links”, “12–12–2007”}. In Amazon’s SimpleDB I’d simply be able to store multiple values for a particular key pair so I could access {“Dare Obasanjo”, “weblogs”} and it would return (“http://www.25hoursaday.com/weblog”, “http://blogs.msdn.com/dareobasanjo”, “http://carnage4life.spaces.live.com”).

Another similarity that both systems share is that there is no requirement that all “rows” in a table share the same schema nor is there an explicit notion of declaring a schema. In SimpleDB, tables are called domains, rows are called items and the columns are called attributes.
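The two shapes described above can be sketched with plain Python dictionaries. This is only a toy model of the data layout as I've described it, not either system's actual API; the function names (`put_attribute`, `get_attribute`, `bt_put`, `bt_get`) are invented for the example.

```python
from collections import defaultdict

# SimpleDB-style domain: item name -> attribute name -> set of values.
# No schema is declared; each item may carry different attributes.
domain = defaultdict(lambda: defaultdict(set))


def put_attribute(item, attribute, value):
    """Add one more value under the {item, attribute} pair."""
    domain[item][attribute].add(value)


def get_attribute(item, attribute):
    """Return every value stored under the {item, attribute} pair."""
    return sorted(domain[item][attribute])


# BigTable-style table: (row key, column key, timestamp) -> one value.
table = {}


def bt_put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value


def bt_get(row, column, timestamp):
    return table[(row, column, timestamp)]


put_attribute("Dare Obasanjo", "weblogs", "http://www.25hoursaday.com/weblog")
put_attribute("Dare Obasanjo", "weblogs", "http://blogs.msdn.com/dareobasanjo")
put_attribute("Justin Rudd", "employer", "Amazon.com")  # different attributes, same domain

bt_put("http://www.example.com", "incoming_links", "12-12-2007", 42)  # 42 is a made-up value
```

Note that the multivalued attribute is what takes the SimpleDB layout out of First Normal Form: one {item, attribute} cell holds a set rather than a single value.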

It is interesting to imagine how this system evolved. From experience, it is clear to everyone who has had to build a massive relational database that database joins kill performance. The longer you’ve dealt with massive data sets, the more you begin to fall in love with denormalizing your data so you can scale. Taken to its logical extreme, there’s nothing more denormalized than a single table. Even better, Amazon goes a step further by introducing multivalued columns which means that SimpleDB isn’t even in First Normal Form whereas we all learned in school that the minimum we should aspire to is Third Normal Form.

I think it is great to see more mainstream examples that challenge the traditional thinking of how to store, manage and manipulate large amounts of data.

I also think the pricing is very reasonable. If I was a startup founder, I’d strongly consider taking Amazon Web Services for a spin before going with a traditional LAMP or WISC approach.

The Bad: Eventual Consistency and Data Values are Weakly Typed

The documentation for the PutAttributes method has the following note

Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data.

This may or may not be a problem depending on your application. It may be OK for a del.icio.us style application if it took a few minutes before your tag updates were applied to a bookmark but the same can’t be said for an application like Twitter. What would be useful for developers would be if Amazon gave some more information around the delayed propagation such as average latency during peak and off-peak hours.
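A toy model helps show why a read right after a write can come back stale. The class below is purely illustrative: it assumes a write lands on one replica and reaches the others only when `propagate()` is called. It says nothing about how Amazon actually replicates data or how long propagation takes.

```python
class EventuallyConsistentStore:
    """Toy replicated key-value store with delayed propagation (illustrative)."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) updates not yet applied

    def put(self, key, value):
        # The write is acknowledged once the first replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica=0):
        # A read may be served by any replica, including a stale one.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stand-in for the background replication that happens "eventually".
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("bookmark:1:tags", "simpledb")
stale = store.get("bookmark:1:tags", replica=2)   # served before propagation
store.propagate()
fresh = store.get("bookmark:1:tags", replica=2)   # served after propagation
```

Whether the window between `stale` and `fresh` matters is exactly the application-level question: tolerable for tag updates, painful for anything users expect to see immediately.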

There is another interesting note in the documentation of the Query method which states

Lexicographical Comparison of Different Data Types

Amazon SimpleDB treats all entities as UTF-8 strings. Keep this in mind when storing and querying different data types, such as numbers or dates. Design clients to convert their data into an appropriate string format, so that query expression return expected results.

The following are suggested methods for converting different data types into strings for proper lexicographical order enforcement:

Positive integers should be zero-padded to match the largest number of digits in your data set. For example, if the largest number you are planning to use in a range is 1,000,000, every number that you store in Amazon SimpleDB should be zero-padded to at least 7 digits. You would store 25 as 0000025, 4597 as 0004597, and so on.

Negative integers should be offset and turned into positive numbers and zero-padded. For example, if the smallest negative integer in your data set is -500, your application should add at least 500 to every number that you store. This ensures that every number is now positive and enables you to use the zero-padding technique.

To ensure proper lexicographical order, convert dates to the ISO 8601 format.

Note

Amazon SimpleDB provides utility functions within our sample libraries that help you perform these conversions in your application.

This is ghetto beyond belief. I should know ahead of time what the lowest number will be in my data set and add/subtract offsets from data values when inserting and retrieving them from SimpleDB? I need to know the largest number in my data set and zero pad to that length? Seriously, WTF?

It’s crazy just thinking about the kinds of bugs that could be introduced into applications because of these wacky semantics and the recommended hacks to get around them. Even if this is the underlying behavior of SimpleDB, Amazon should have fixed this up in an API layer above SimpleDB then exposed that instead of providing ghetto helper functions in a handful of popular programming languages then crossing their fingers hoping that no one hits this problem.
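To see the kind of bug in question, and what the documented workaround looks like in practice, here is a small sketch of the zero-padding, offset and ISO 8601 conventions quoted above. The helper names (`encode_int`, `decode_int`, `encode_date`) are my own and are not the utility functions from Amazon's sample libraries; the width and offset are values the application has to pick up front.

```python
from datetime import datetime, timezone


def encode_int(value, offset, width):
    """Shift by offset so the number is non-negative, then zero-pad to width."""
    shifted = value + offset
    if shifted < 0:
        raise ValueError("offset is too small for this value")
    text = str(shifted)
    if len(text) > width:
        raise ValueError("width is too small for this value")
    return text.zfill(width)


def decode_int(text, offset):
    """Reverse encode_int: parse the digits and remove the offset."""
    return int(text) - offset


def encode_date(moment):
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Unpadded numbers compared as strings sort in the wrong order.
naive = sorted(["25", "4597", "1000000"])

# Padded to 7 digits, string order matches numeric order.
padded = sorted(encode_int(n, offset=0, width=7) for n in (25, 4597, 1000000))

# Negative numbers need the offset chosen ahead of time (here 500).
shifted = sorted(encode_int(n, offset=500, width=7) for n in (25, -500, -3))
```

Here `naive` comes back as `["1000000", "25", "4597"]`, which is the silent mis-ordering a range query would inherit. The padded and shifted variants sort correctly, but only as long as every client agrees on the same width and offset and nothing ever exceeds them.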

The Ugly: Web Interfaces that Claim to be RESTful but Aren’t

I’ve talked about APIs that claim to be RESTful but aren’t in the past but Amazon’s takes the cake when it comes to egregious behavior. Again, from the documentation for the PutAttributes method we learn

Sample Request

The following example uses PutAttributes on Item123 which has attributes (Color=Blue), (Size=Med) and (Price=14.99) in MyDomain. If Item123 already had the Price attribute, this operation would replace the values for that attribute.

https://sdb.amazonaws.com/
?Action=PutAttributes
&Attribute.0.Name=Color&Attribute.0.Value=Blue
&Attribute.1.Name=Size&Attribute.1.Value=Med
&Attribute.2.Name=Price&Attribute.2.Value=14.99
&Attribute.2.Replace=true
&AWSAccessKeyId=[valid access key id]
&DomainName=MyDomain
&ItemName=Item123
&SignatureVersion=1
&Timestamp=2007-06-25T15%3A03%3A05-07%3A00
&Version=2007-11-07
&Signature=gabYTEXUgY%2Fdg817JBmj7HnuAA0%3D

Sample Response

<PutAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
  <ResponseMetadata>
    <RequestId>490206ce-8292-456c-a00f-61b335eb202b</RequestId>
    <BoxUsage>0.0000219907</BoxUsage>
  </ResponseMetadata>
</PutAttributesResponse>

Wow. A GET request with a parameter called Action which modifies data? What is this, 2005? I thought we already went through the realization that GET requests that modify data are bad after the Google Web Accelerator scare of 2005?
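As a rough illustration of the complaint, the sketch below parses a shortened copy of the sample request and flags it as a GET that carries a state-changing action, then builds (without sending) the same update as a PUT to a resource URL. The resource layout, the `sdb.example.invalid` host, the form-encoded body and the `Put`/`Delete`/`Create` prefix heuristic are all assumptions of mine; none of this is Amazon's API or Subbu's refactoring.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Shortened copy of the sample request (signature parameters omitted).
SAMPLE = (
    "https://sdb.amazonaws.com/?Action=PutAttributes"
    "&Attribute.0.Name=Color&Attribute.0.Value=Blue"
    "&DomainName=MyDomain&ItemName=Item123"
)


def is_unsafe_get(method, url):
    """Flag a GET whose query string names an action that changes state."""
    params = parse_qs(urlsplit(url).query)
    action = params.get("Action", [""])[0]
    # Heuristic for this example only: treat these prefixes as writes.
    return method == "GET" and action.startswith(("Put", "Delete", "Create"))


def hypothetical_put(url):
    """Build, but do not send, a PUT for the same item at a made-up URL."""
    params = parse_qs(urlsplit(url).query)
    resource = "https://sdb.example.invalid/{}/{}".format(
        params["DomainName"][0], params["ItemName"][0]
    )
    body = urlencode({"Color": "Blue"}).encode("utf-8")
    return Request(resource, data=body, method="PUT")


flagged = is_unsafe_get("GET", SAMPLE)
request = hypothetical_put(SAMPLE)
```

The point of the second form is that a prefetcher or crawler following links only issues GETs, so it can never trigger the write by accident.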

Of course, I'm not the only one that thinks this is ridonkulous. See similar comments from Stefan Tilkov, Joe Gregorio, and Steve Loughran. Methinks, someone at Amazon needs to go read some guidelines on building RESTful Web services.

Bonus points to Subbu Allamaraju for refactoring the SimpleDB API into a true RESTful Web service.

Speaking of ridonkulous API trends, it seems the SimpleDB Query method follows the lead of the Google Base GData API in stuffing a SQL-like query language into the query string parameters of HTTP GET requests. I guess it is RESTful, but Damn is it ugly.

Now playing: J. Holiday - Suffocate

Categories: Platforms | XML Web Services
