November, 2009 - Dare Obasanjo's weblog

November 23, 2009

@ 04:24 PM

The Many Flaws of Twitter's Retweet Feature

I've been a Twitter user for almost two years now and I have always been impressed by the emergent behavior that has developed from simply giving people a text box with a 140-character limit. The folks at Twitter have also done a good job of noticing some of these emergent behaviors and making them formal features of the site. Both hashtags and @replies are examples of emergent community conventions in authoring tweets that are now formal features of the site.

Twitter recently added retweets to this list with Project Retweet. After using this feature for a few days I've found that unlike hashtags and @replies, the way this feature has been integrated into the Twitter experience is deeply flawed. Before talking about the problems with Project Retweet, I should talk about how the community uses retweeting today.

Retweeting 101: What is it and why do people do it?

Retweeting is akin to the practice of forwarding along interesting blog posts and links to your friends via email. A retweet repeats the content of a person's tweet (sometimes edited for brevity) along with a reference to the user who is being retweeted. Often times people also add some commentary to the retweets. Examples of both styles of retweets are shown below.

Figure 1: Retweet without commentary

Figure 2: Retweet with added comment

Unlike hashtags and @replies, the community conventions aren't as consistent with retweets. Below are two examples of retweets from my home page which use different prefixes and separators from the ones above to indicate that the item is a retweet and to set off the user's comment, respectively.

Figure 3: Different conventions in retweeting

However there are many issues with retweeting not being a formal feature of Twitter. For one, it is often hard for new users to figure out what's going on when they see people posting updates prefixed with strange symbols and abbreviations. Another problem is that users who want to post a retweet now have to deal with the fact that the original tweet may have taken up all or most of the 140 character limit so there may be little room to credit the author let alone add commentary.

Thus I was looking forward to retweeting becoming a formal feature of Twitter so that these problems would be addressed. Unfortunately, while one of these problems was fixed more problems were introduced.

Flaw #1: Need to visit multiple places to see all retweets of your content

Before the introduction of the retweet feature, users could go to http://www.twitter.com/replies to see all posts that reference their name which would include @replies and retweets. The new Twitter feature fragments this in an inconsistent manner.

Figure 4: Current Twitter sidebar

Now users have to visit http://www.twitter.com/replies to see people who have retweeted their posts using community conventions (i.e. copying and pasting, then prefixing "RT" to a tweet) and then visit http://twitter.com/retweeted_of_mine to see who has retweeted their posts by clicking the Retweet link in the Twitter web user interface. There will be different people in both lists.

Figure 5: Retweets in the Replies/Mentions page

Figure 6: Retweets on the "Your tweets, retweeted" page

It is surprising to me that Twitter didn't at least include posts that start with RT followed by your username in http://twitter.com/retweeted_of_mine as well.
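Here's a minimal sketch of the kind of check I have in mind, assuming the common "RT @username" prefix convention. The function name and the pattern are my own illustration, not anything from Twitter's code, and it would not catch the other conventions shown in Figure 3.

```python
import re

def is_conventional_retweet_of(tweet_text: str, username: str) -> bool:
    """Return True if the tweet starts with 'RT @username' (case-insensitive).

    Assumption: only the 'RT @user' prefix style is recognized; variants
    such as 'via @user' would need additional patterns.
    """
    pattern = r"^\s*RT\s*@" + re.escape(username) + r"\b"
    return re.search(pattern, tweet_text, flags=re.IGNORECASE) is not None
```

A page like "Your tweets, retweeted" could run a check like this over the posts that mention a user and merge the matches with retweets made through the Retweet link.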

Flaw #2: No way to add commentary on what you are retweeting

As I mentioned earlier, it is fairly common for people to retweet a status update and then add their own commentary. The retweet feature built into Twitter ignores this common usage pattern and provides no option to add your own commentary.

Figure 7: The Retweet prompt

This omission is particularly problematic if you disagree with what you are sharing and want to clarify to your followers that although you find the tweet interesting you aren't endorsing the opinion.

Flaw #3: Retweets don't show up in Twitter apps

One of the other surprising changes is that Twitter retweets have been introduced in a backwards-incompatible manner into the API. This means that retweets created using the Twitter retweet button do not show up in 3rd party applications that use the Twitter API. See below for an example of what I see in Echofon versus the Twitter web experience and notice the missing tweet.

Figure 8: Twitter website showing a retweet

Figure 9: The retweet is missing in Echofon

Again, I find this surprising since it would have been straightforward to keep retweets in the API and expose them as if they were regular old school retweets prefixed with "RT".
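As a rough illustration of that idea, the sketch below turns a native retweet into old school "RT" text for clients that don't understand the new format. The function name, the truncation rule and the 140 character cap applied on output are all assumptions of mine, not a description of how Twitter's API behaves.

```python
MAX_TWEET_LENGTH = 140  # classic tweet limit discussed in this post

def render_as_classic_retweet(original_author: str, original_text: str) -> str:
    """Format a native retweet as 'RT @author: text', trimmed to the limit."""
    text = f"RT @{original_author}: {original_text}"
    if len(text) > MAX_TWEET_LENGTH:
        # Hypothetical policy: cut the tail and mark the cut with an ellipsis.
        text = text[: MAX_TWEET_LENGTH - 3] + "..."
    return text
```

The trimming step is the same problem users face today when retweeting by hand: the prefix eats into the space available for the original text.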

Flaw #4: Pictures of people I don't know in my stream

The last major problem with the Twitter retweet feature is that it breaks user expectation of the stream. Until this feature shipped, users could rest assured that the only content they saw in their stream was content they had explicitly asked for by subscribing to a user. Thus when you see someone in your stream the person's user name and avatar are familiar to you.

With the new retweet feature, the Twitter team has decided to highlight the person being retweeted and treat the person I've subscribed to, who did the retweeting, as an afterthought. Not only does this confuse users at first (who is this person showing up in my feed and why?) but it also assumes that the content being retweeted is more important than who did the retweeting. This is an unfortunate assumption since in many cases the person who did the retweeting adds all the context.

Now Playing: Jason Derulo - Whatcha Say

Categories: Rants | Social Software

November 23, 2009

@ 02:46 PM

Comments [2]

Building Scalable Databases: Perspectives on the War on Soft Deletes

In the past few months I've noticed an increased number of posts questioning practices around deleting and "virtually" deleting data from databases. Since some of the concerns around this practice have to do with the impact of soft deletes on scalability of a database-based application, I thought it would be a good topic for my ongoing series on building scalable databases.

Soft Deletes 101: What is a soft delete and how does it differ from a hard delete?

Soft deleting an item from a database means that the row or entity is marked as deleted but not physically removed from the database. Instead it is hidden from normal users of the system but may be accessible by database or system administrators.

For example, let's consider this sample database of XBox 360 games I own

Name	Category	ESRB	GamespotScore	Company
Call of Duty: Modern Warfare 2	First Person Shooter	Mature	9.0	Infinity Ward
Batman: Arkham Asylum	Fantasy Action Adventure	Teen	9.0	Rocksteady Studios
Gears of War 2	Sci-Fi Shooter	Mature	9.0	Epic Games
Call of Duty 4: Modern Warfare	First Person Shooter	Mature	9.0	Infinity Ward
Soul Calibur IV	3D Fighting	Teen	8.5	Namco

Now consider what happens if I decide that I'm done with Call of Duty 4: Modern Warfare now that I own Call of Duty: Modern Warfare 2. The expected thing to do would then be to remove the entry from my database using a query such as

DELETE FROM games WHERE name='Call of Duty 4: Modern Warfare';

This is what is considered a "hard" delete.

But then what happens if my friends decide to use my list of games to decide which games to get me for Christmas? A friend might not realize I'd previously owned the game and might get it for me again. Thus it might be preferable if instead of deleting items from the database they were removed from consideration as games I currently own but still could be retrieved in special situations. To address this scenario I'd add an IsDeleted column as shown below

Name	Category	ESRB	GamespotScore	Company	IsDeleted
Call of Duty: Modern Warfare 2	First Person Shooter	Mature	9.0	Infinity Ward	False
Batman: Arkham Asylum	Fantasy Action Adventure	Teen	9.0	Rocksteady Studios	False
Gears of War 2	Sci-Fi Shooter	Mature	9.0	Epic Games	False
Call of Duty 4: Modern Warfare	First Person Shooter	Mature	9.0	Infinity Ward	True
Soul Calibur IV	3D Fighting	Teen	8.5	Namco	False

Then for typical uses an application would interact with the following view of the underlying table

CREATE VIEW current_games AS SELECT Name, Category, ESRB, GamespotScore, Company FROM games WHERE IsDeleted=False;

but when my friends ask me for a list of all of the games I have, I can provide the full list of all the games I've ever owned from the original games table if needed. Now that we understand how one would use soft deletes we can discuss the arguments against this practice.
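To make the walkthrough above concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The choice of SQLite, the helper names and the use of 0/1 for IsDeleted (SQLite has no native boolean type) are assumptions for illustration; the table and view mirror the ones described above.

```python
import sqlite3

def build_games_db() -> sqlite3.Connection:
    """Create the sample games table plus the current_games view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games (Name TEXT PRIMARY KEY, Category TEXT, ESRB TEXT, "
        "GamespotScore REAL, Company TEXT, IsDeleted INTEGER NOT NULL DEFAULT 0)"
    )
    rows = [
        ("Call of Duty: Modern Warfare 2", "First Person Shooter", "Mature", 9.0, "Infinity Ward"),
        ("Batman: Arkham Asylum", "Fantasy Action Adventure", "Teen", 9.0, "Rocksteady Studios"),
        ("Gears of War 2", "Sci-Fi Shooter", "Mature", 9.0, "Epic Games"),
        ("Call of Duty 4: Modern Warfare", "First Person Shooter", "Mature", 9.0, "Infinity Ward"),
        ("Soul Calibur IV", "3D Fighting", "Teen", 8.5, "Namco"),
    ]
    conn.executemany(
        "INSERT INTO games (Name, Category, ESRB, GamespotScore, Company) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    # The view hides soft deleted rows from ordinary application queries.
    conn.execute(
        "CREATE VIEW current_games AS "
        "SELECT Name, Category, ESRB, GamespotScore, Company "
        "FROM games WHERE IsDeleted = 0"
    )
    return conn

def soft_delete(conn: sqlite3.Connection, name: str) -> None:
    """Mark a game as deleted instead of removing its row."""
    conn.execute("UPDATE games SET IsDeleted = 1 WHERE Name = ?", (name,))
```

After calling soft_delete for Call of Duty 4: Modern Warfare, a query against current_games returns four rows while the underlying games table still holds all five.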

Rationale for War: The argument against soft deletes

Ayende Rahien makes a cogent argument against soft deletes in his post Avoid Soft Deletes where he writes

One of the annoyances that we have to deal when building enterprise applications is the requirement that no data shall be lost. The usual response to that is to introduce a WasDeleted or an IsActive column in the database and implement deletes as an update that would set that flag.

Simple, easy to understand, quick to implement and explain.

It is also, quite often, wrong.

The problem is that deletion of a row or an entity is rarely a simple event. It effect not only the data in the model, but also the shape of the model. That is why we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have a parent Order. And that is just the simplest of issues.
...
Let us say that we want to delete an order. What should we do? That is a business decision, actually. But it is one that is enforced by the DB itself, keeping the data integrity.

When we are dealing with soft deletes, it is easy to get into situations where we have, for all intents and purposes, corrupt data, because Customer’s LastOrder (which is just a tiny optimization that no one thought about) now points to a soft deleted order.

Ayende is right that adding an IsDeleted flag means that you can no longer take advantage of database triggers to clean up database state when a deletion occurs. This sort of cleanup now has to be moved up into the application layer.

There is another set of arguments against soft deletes in Richard Dingwall's post entitled The Trouble with Soft Delete where he points out the following problems

Complexity

To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude them. It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows.

This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast. And it is background noise that detracts away from the real intention of the query.

Performance

At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance. However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and does a good job at optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself.

Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisations day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of that, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical; rarely used and kept only to maintain foreign key integrity and for reporting purposes.

Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.

These arguments seem less valid than Ayende's especially when the alternatives proposed are evaluated. Let's look at the aforementioned problems and the proposed alternatives in turn.

Trading the devil you know for the devil you don't: Thoughts on the alternatives to soft deletes

Richard Dingwall argues that soft deletes add unnecessary complexity to the system since all queries have to be aware of the IsDeleted column(s) in the database. As I mentioned in my initial description of soft deletes this definitely does not have to be the case. The database administrator can create views which the core application logic interacts with (i.e. the current_games view in my example) so that only a small subset of system procedures need to actually know that the soft delete columns even exist in the database.

A database becoming so large that data manipulation becomes slow due to having to update indexes is a valid problem. However Richard Dingwall's suggested alternative excerpted below seems to trade one problem for a worse one

The memento pattern

Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require.

It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or rollback later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time:

The problem I have with this solution is that if your database is already grinding to a halt simply because you track which items are active/inactive in your database, how much worse would the situation be if you now store every state transition in the database as well? Sounds like you're trading one performance problem for a much worse one.

The real problem seems to be that the database has gotten too big to be operated on in an efficient manner on a single machine. The best way to address this is to partition or shard the database. In fact, you could even choose to store all inactive records on one database server and all active records on another. Those interested in database sharding can take a look at a more detailed discussion on database sharding I wrote earlier this year.
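As a toy sketch of the "inactive records on another server" idea, the code below moves soft deleted rows from one SQLite database to another. Two in-memory databases stand in for two servers, and the schema and function names are hypothetical. A real system would also have to cope with a failure between the copy and the delete, which this sketch does not attempt.

```python
import sqlite3

SCHEMA = "CREATE TABLE games (Name TEXT PRIMARY KEY, IsDeleted INTEGER NOT NULL DEFAULT 0)"

def make_db() -> sqlite3.Connection:
    """Create an empty database with the simplified games table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def move_inactive_rows(active: sqlite3.Connection, archive: sqlite3.Connection) -> int:
    """Copy soft deleted rows to the archive, then remove them from the active store."""
    rows = active.execute(
        "SELECT Name, IsDeleted FROM games WHERE IsDeleted = 1"
    ).fetchall()
    archive.executemany(
        "INSERT OR REPLACE INTO games (Name, IsDeleted) VALUES (?, ?)", rows
    )
    archive.commit()
    # Only delete from the active store once the archive copy is committed.
    active.execute("DELETE FROM games WHERE IsDeleted = 1")
    active.commit()
    return len(rows)
```

With this split, the heavily indexed active table stays small while the historical rows remain available for reporting.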

Another alternative proposed by both Ayende Rahien and Richard Dingwall is to delete the data but use database triggers to write to an audit log in the cases where auditing is the primary use case for keeping soft deleted entries in the database. This works in the cases where the only reason for soft deleting entries is for auditing purposes. However there are many real world situations where this is not the case.

One use case for soft deleting is to provide an "undo" feature in an end user application. For example, consider a user who synchronizes the contact list on their phone with one in the cloud (e.g. an iPhone or Windows Mobile/Windows Phone connecting to Exchange or an Android phone connecting to Gmail). Imagine that the user now deletes a contact from their phone because they do not have a phone number for the person only to find out that person has also been deleted from their address book in the cloud. At that point, an undo feature is desirable.

Other use cases could be the need to reactivate items that have been removed from the database but with their state intact. For example, when people return to Microsoft who used to work there in the past their seniority for certain perks takes into account their previous stints at the company. Similarly, you can imagine a company restocking an item that it had pulled from its shelves because the item has become popular due to some new fad (e.g. Beatles memorabilia is back in style thanks to The Beatles™: Rock Band™).

The bottom line is that an audit log may be a useful replacement for soft deletes in some scenarios but it isn't the answer to every situation where soft deletes are typically used.
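For the scenarios where an audit log is enough, the trigger-based alternative mentioned above could look something like the following sketch. It uses SQLite through Python's sqlite3 module; the games_audit table, the trigger name and the column choices are assumptions of mine rather than anything proposed in the posts I quoted.

```python
import sqlite3

def build_audited_db() -> sqlite3.Connection:
    """Create a games table whose hard deletes are recorded by a trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE games (Name TEXT PRIMARY KEY, Company TEXT);
        CREATE TABLE games_audit (
            Name TEXT,
            Company TEXT,
            DeletedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        -- Copy the outgoing row into the audit table on every delete.
        CREATE TRIGGER log_game_delete AFTER DELETE ON games
        BEGIN
            INSERT INTO games_audit (Name, Company)
            VALUES (OLD.Name, OLD.Company);
        END;
        """
    )
    return conn
```

Note that restoring a row from games_audit is a manual copy back into games, which is why this approach suits auditing better than it suits undo.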

Not so fast: The argument against hard deletes

So far we haven't discussed how hard deletes should fit in a world of soft deletes. In some cases, soft deletes eventually lead to hard deletes. In the example of video games I've owned I might decide that if a soft deleted item is several years old or is a game from an outdated console then it might be OK to delete. So I'd create a janitor process that would scan the database periodically to seek out soft deleted entries to permanently delete. In other cases, some content may always be hard deleted since there are no situations where one might consider keeping them around for posterity. An example of the latter is comment or trackback spam on a blog post.
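A janitor process like the one described above could be sketched as follows. The DeletedAt column, the ISO 8601 timestamp format and the function names are assumptions for illustration; the original table only has an IsDeleted flag, so some record of when the soft delete happened would have to be added.

```python
import sqlite3
from datetime import datetime, timedelta

def make_db() -> sqlite3.Connection:
    """Create a simplified games table with a soft delete timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games (Name TEXT PRIMARY KEY, "
        "IsDeleted INTEGER NOT NULL DEFAULT 0, DeletedAt TEXT)"
    )
    return conn

def purge_old_soft_deletes(conn: sqlite3.Connection, now: datetime, max_age_days: int) -> int:
    """Hard delete rows that were soft deleted more than max_age_days ago."""
    # ISO 8601 strings sort chronologically, so a text comparison is enough here.
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cursor = conn.execute(
        "DELETE FROM games "
        "WHERE IsDeleted = 1 AND DeletedAt IS NOT NULL AND DeletedAt < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount
```

Running this on a schedule keeps recently deleted items recoverable while stopping the table from growing without bound.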

Udi Dahan wrote a rebuttal to Ayende Rahien's post in which he questions my assertion above that there are situations where one wants to hard delete data from the database. In his post Don’t Delete – Just Don’t he writes

Model the task, not the data

Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).

As we broaden our perspective to more parts of the system, we see this same pattern repeating:

Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.

Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.

Jobs aren’t deleted – they’re filled (or their requisition is revoked).

In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.

Statuses

In all the examples above, what we see is a replacement of the technical action ‘delete’ with a relevant business action. At the entity level, instead of having a (hidden) technical WasDeleted status, we see an explicit business status that users need to be aware of.

I tend to agree with Udi Dahan's recommendation. Instead of a technical flag like IsDeleted, we should model the business process. So my database table of games I owned should really be called games_I_have_owned with the IsDeleted column replaced with something more appropriate such as CurrentlyOwn. This is a much better model of the real-life situation than my initial table and the soft deleted entries are now clearly part of the business process as opposed to being part of some internal bookkeeping system.

Advocating that items never be deleted is a tad extreme but I'd actually lean closer to that extreme than most. Unless the data is clearly worthless (e.g. comment spam) or the cost is truly prohibitive (e.g. you're storing large amounts of binary data) then I'd recommend keeping the information around instead of assuming that the existence of a DELETE statement in your database means you are required to use it.

Now Playing: 50 Cent - Baby By Me (feat. Ne-Yo)

Categories: Web Development

November 14, 2009

@ 03:03 PM

Comments [2]

Joe Hewitt on Irony

Joe Hewitt, the developer of the Facebook iPhone application, has an insightful blog post titled On Middle Men about the current trend of developers favoring native applications over Web applications on mobile platforms with centrally controlled app stores. He writes

The Internet has been incredibly empowering to creators, and just as destructive to middle men. In the 20th century, every musician needed a record label to get his or her music heard. Every author needed a publishing house to be read. Every journalist needed a newspaper. Anyone who wanted to send a message needed the post office. In the Internet age, the tail no longer wags the dog, and those middle men have become a luxury, not a necessity.

Meanwhile, the software industry is moving in the opposite direction. With the web and desktop operating systems, the only thing in between software developers and users is a mesh of cables and protocols. In the new world of mobile apps, a layer of bureacrats stand in the middle, forcing each developer to queue up for a series of patdowns and metal detectors and strip searches before they can reach their customers.
...
We're at a critical juncture in the evolution of software. The web is still here and it is still strong. Anyone can still put any information or applications on a web server without asking for permission, and anyone in the world can still access it just by typing a URL. I don't think I appreciated how important that is until recently. Nobody designs new systems like that anymore, or at least few of them succeed. What an incredible stroke of luck the web was, and what a shame it would be to let that freedom slip away.

Am I the only one who thinks the above excerpt would be similarly apt if you replaced the phrase "mobile apps" with "Facebook apps" or "OpenSocial apps"?

Now Playing: Lady GaGa - Bad Romance

Categories: Web Development

November 7, 2009

@ 04:41 PM

Comments [1]

Does OAuth Have a Dark Side?

There was an article in The Register earlier this week titled Twitter fanatic glimpses dark side of OAuth which contains the following excerpt

A mobile enthusiast and professional internet strategist got a glimpse of OAuth's dark side recently when he received an urgent advisory from Twitter. The dispatch, generated when Terence Eden tried to log in, said his Twitter account may have been compromised and advised he change his password. After making sure the alert was legitimate, he complied.

That should have been the end of it, but it wasn't. It turns out Eden used OAuth to seamlessly pass content between third-party websites and Twitter, and even after he had changed his Twitter password, OAuth continued to allow those websites access to his account.
…
Eden alternately describes this as a "gaping security hole" and a "usability issue which has strong security implications." Whatever the case, the responsibility seems to lie with Twitter.

If the service is concerned enough to advise a user to change his password, you'd think it would take the added trouble of suggesting he also reset his OAuth credentials, as Google, which on Wednesday said it was opening its own services to work with OAuth, notes here.

I don't think the situation is as cut and dried as the article makes it seem. Someone trying to hack your account by guessing your password and thus triggering a warning that your account is being hacked is completely different from an application you've given permission to access your data doing the wrong thing with it.

Think about it. Below is a list of the applications I've allowed to access my Twitter stream. Is it really the desired user experience that when I change my password on Twitter that all of them break and require that I re-authorize each application?

list of applications that can access my Twitter stream

I suspect Terence Eden is being influenced by the fact that Twitter hasn't always had a delegated authorization model and the way to give applications access to your Twitter stream was to hand out your user name & password. That's why just a few months ago it was commonplace to see blog posts like Why you should change your Twitter password NOW! which advocate changing your Twitter password as a way to prevent your password from being stolen from Twitter apps with weak security. There has also been a spate of phishing-style Twitter applications such as TwitViewer and TwitterCut which masquerade as legitimate applications but are really password stealers in disguise. Again, the recommendation has been to change your password if you fell victim to such scams.

In a world where you use delegated authorization techniques like OAuth, changing your password isn't necessary to keep out apps like TwitViewer & TwitterCut since you can simply revoke their permission to access your account. Similarly if someone steals my password and I choose a new one, it doesn't mean that I now need to lose Twitter access from Brizzly or the new MSN home page until I reauthorize these apps. Those issues are ~~orthogonal~~ unrelated given the OAuth authorized apps didn't have my password in the first place.
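The separation I'm describing can be illustrated with a deliberately simplified model. This is not the OAuth protocol itself, and the class and method names are hypothetical. It only shows that per-application tokens live apart from the password, so changing one does not have to invalidate the other.

```python
import secrets
from dataclasses import dataclass, field

@dataclass
class Account:
    """Toy account with a password and independently revocable app tokens."""
    password: str
    app_tokens: dict = field(default_factory=dict)  # app name -> access token

    def authorize_app(self, app_name: str) -> str:
        """Grant an app access by issuing it a token; the app never sees the password."""
        token = secrets.token_hex(16)
        self.app_tokens[app_name] = token
        return token

    def revoke_app(self, app_name: str) -> None:
        """Cut off a single app without touching the password or other apps."""
        self.app_tokens.pop(app_name, None)

    def change_password(self, new_password: str) -> None:
        """Change the password; previously issued app tokens are left alone."""
        self.password = new_password

    def token_is_valid(self, app_name: str, token: str) -> bool:
        """Check whether the token presented by an app is still authorized."""
        return self.app_tokens.get(app_name) == token
```

In this model, keeping out a rogue app is a call to revoke_app, while recovering from a guessed password is a call to change_password, and neither affects the other.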

Thus I question the recommendation at the end of the article that it is a good idea to even ask users about de-authorizing applications as part of password reset since it is misleading (i.e. it gives the impression the hack attempt was from one of your authorized apps instead of someone trying to guess your password) and just causes a hassle for the user who now has to reauthorize all the applications at a later date anyway.

Now Playing: Rascal Flatts - What Hurts The Most

Categories: Web Development

November 4, 2009

@ 03:09 PM

Comments [1]

Startup Advice: Find Underserved Markets

“Every marketer's dream is to find an unidentified or unknown market and develop it” – Barry Brand

Last week I read the various notes on the presentations by famous startup founders at Startup School 2009 and found a lot of the anecdotes interesting but wondered if they were truly useful to startup founders. I tried to think of some of the products I started using in the past five years that I now use a lot today and couldn’t do without. In looking for the common thread in these products, the quote above came to mind.

Finding an underserved market sounds as unlikely as getting the winning ticket in a lottery. In truth it isn’t hard to find underserved markets if you recognize the patterns. The hard part in turning it into a successful business is execution.

There are two patterns I’ve used in looking for underserved markets that were invigorated by products I now use regularly today. The first pattern is looking for activities people like doing where the technology has been stagnant for a while. For the past few decades, we’ve been in a world where the technology products you use can be made smaller, cheaper and faster every few years. There is some notion of waiting for the stars to align such as hard drive sizes shrinking until you could fit gigabytes of data in your pocket (e.g. the iPod) or AJAX and online banking becoming ubiquitous (e.g. Mint). However for each product category where some upstart has changed the game by leveraging modern technologies there are still dozens of markets where decade(s) old tech is dictating the user experience.

Another pattern is looking for an activity or task that people hate doing but assume is a fact of life or a necessary aspect of using a product. I remember when I’d buy video games like Soul Calibur and lament about how I could only find people to play with when I went to visit friends from out of town who were also fans of the game. To me, not being able to play multiplayer games without having friends physically in the same room was just a fact of life. This is no longer the case in the world of XBox Live. I also remember almost cancelling my cable subscription about five or six years ago because I resented having to tailor my TV watching schedule around prime time hours which to me represented prime hours to decompress from work and hack on RSS Bandit. The entire notion of “prime time TV” has been a fact of life for decades. Once I discovered TiVo, it stopped being the case for me. Some of these painful facts of life have been with us for centuries. For example, take this quote from John Wanamaker

“Half the money I spend on advertising is wasted; the trouble is I don't know which half”

With modern advertising solutions like Google’s AdWords and Facebook’s Advertising platform this is no longer the case. People can now tell down to minute levels of detail exactly what their return on investment is on advertising.

I’d love to see more startups attempting to find and satisfy underserved markets instead of going with the crowd and doing what everyone else is doing. Fewer Facebook games and iPhone fart apps, more original ideas that solve real problems like Mint and Flickr. If your startup needs a few hints at what some of these markets are in the technology space, there’s always YCombinator’s list of Startup Ideas they’d fund.

Now Playing: Jay-Z Feat. Kanye West - Hate

Categories: Startup Shoutout

November 4, 2009

@ 02:10 PM

Comments [2]

New MSN Homepage with Activity Streams from Windows Live, Facebook and Twitter

On the MSN blog there's a new blog post entitled The New MSN Homepage Unveiled which states

Today is an exciting day for our team at MSN because we unveiled the most significant redesign our MSN.com homepage has seen in over a decade. We spent thousands of hours talking with customers; testing hundreds of ideas; experimenting around the world and carefully evaluating what our users want, and don’t want - to deliver a homepage that is designed to be the best homepage on the Web. We hope you’ll agree.
…
So, we started from scratch to cut the clutter on our homepage and reduced the amount of links by 50%. There’s also a simplified navigation across news, entertainment, sports, money, and lifestyle that lets you drill into information topics that interest you, without being overwhelming. Local information from your neighborhood is important to you and so is high quality, in-line video – so we offer both, right on the homepage. And, you told us you want the latest information not only from your favorite sources, but also from your friends, and the breadth of the Web – so we now offer convenient access to Facebook, Twitter, & Windows Live services and the most powerful search experience on the Web from Bing, empowering you to make more informed, faster decisions. And this is just the beginning - keep visiting our blog for more MSN news in the coming weeks.

This is a really exciting release for my team on Windows Live since we're responsible for the underlying platform that powers the display of what activities your friends have been performing across Windows Live. Working with the MSN home page team was a good experience and it's great to see that the tens of millions of people who visit the MSN home page regularly will now get to experience our work. Kudos to the MSN team on a very nice release.

You can try out the new home page for yourself at http://preview.msn.com

Now Playing: Jay-Z - Reminder

Categories: MSN | Windows Live

November 2, 2009

@ 02:50 PM

Comments [2]

Real-time, Distributed Conversations: Some Thoughts on the Salmon Protocol

Last week John Panzer, who works on Blogger at Google, wrote about some of the work he’s been doing on creating a protocol for syndicating comments associated with activity streams in his post The Salmon Protocol: Introducing the Salmon Project. Key parts of his post are excerpted below

A few days ago, at the Real Time Web Summit, we had a session about Salmon, a protocol for re-aggregated distributed conversations around web content. I was hoping for some feedback and to generate some interest, and I was overwhelmed by the positive reactions, especially after Louis Gray's post "Proposed Salmon Protocol aims to unify Conversations on the Web". Adina Levin's "Salmon - Re-assembling distributed conversations" is a good, insightful review as well. There's clearly a great deal of interest in this, and so I've gone ahead and expanded Salmon's home at salmon-protocol.org with an open source project, salmon-protocol.googlecode.com, and a mailing list, groups.google.com/group/salmon-protocol.

Louis Gray’s post on the topic includes an embedded presentation which captures the essence of the protocol

Before talking about the technical details of the protocol, it is a good idea to understand the end user problem it solves. For me, it solves a problem I have in the way that RSS Bandit integrates with Facebook. Although there is a way to get regular updates on changes to the user’s news feed by polling Facebook’s stream and getting data back in the Activity Streams format, there isn’t a mechanism today to get updates on the comments on items in the feed. In practice, this means that once an item rolls off of the news feed, there is no way to keep its comments up to date in RSS Bandit.

The Salmon Protocol aims to address this problem by piggybacking on PubSubHubbub as a way for applications to get real-time updates on comments on items in an activity stream, not just updates on new activities.

There have also been several mentions of Salmon being a way to aggregate distributed conversations on an item (e.g. this blog post is syndicated to FriendFeed and there are comments there as well as in the comments on my blog) but I am less clear on those scenarios or whether Salmon is enough to solve the various tough problems that need to be solved to make that work end to end.

Any API for posting comments to a site needs to solve two problems: identity and comment spam. I decided to take a look at the Salmon Protocol Summary to see how it addresses these problems.

The meat of the Salmon Protocol format is excerpted below:

A source provides an RSS/Atom feed of content. It includes a Salmon link in its feed:

<link rel="salmon" href="http://example.org/salmon-endpoint"/>

An aggregator reads the feed (ideally via a push mechanism such as PubSubHubbub), and sees from the link that it is Salmon-enabled. It remembers the endpoint URL for later use.

When an aggregator's user leaves a comment on a feed item, the aggregator stores the comment as usual, and then also POSTs a salmon version of it to the source's Salmon endpoint:

POST /salmon-endpoint HTTP/1.1
Host: example.org
Content-Type: application/atom+xml

<?xml version='1.0' encoding='UTF-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'>
  <author>
    <name>John Doe</name>
    <uri>acct:johndoe@aggregator-example.com</uri>
  </author>
  <content>Yes, but what about the llamas?</content>
  <id>tag:aggregator-example.com,2009:cmt-441071406174557701</id>
  <updated>2009-09-28T18:30:02Z</updated>
  <thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0'
     ref='tag:example.org,1999:id-22717401685551851865'/>
  <sal:signature xmlns:sal='http://salmonprotocol.org/ns/1.0'>
    e55bee08b4c643bc8aedf122f606f804269b7bc7
  </sal:signature>
  <title/>
</entry>
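To make the excerpt above concrete, here is a minimal Python sketch of the aggregator's side as I read it: find the Salmon link in an Atom feed, then build the comment entry and the POST request. The function names are my own, not part of the protocol. The sketch handles Atom feeds only and leaves out the sal:signature element, since the excerpt doesn't say how the signature is computed. The request is only constructed here, never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

ATOM_NS = "http://www.w3.org/2005/Atom"
THR_NS = "http://purl.org/syndication/thread/1.0"


def find_salmon_endpoint(feed_xml: str) -> Optional[str]:
    """Return the href of the first Atom <link rel="salmon">, or None."""
    root = ET.fromstring(feed_xml)
    for link in root.iter(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "salmon":
            return link.get("href")
    return None


def build_salmon_entry(author_name: str, author_uri: str, text: str,
                       comment_id: str, updated: str, parent_id: str) -> bytes:
    """Build an Atom entry shaped like the one in the excerpt (unsigned)."""
    # Serialize Atom as the default namespace and threading as "thr".
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("thr", THR_NS)

    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = author_name
    ET.SubElement(author, f"{{{ATOM_NS}}}uri").text = author_uri
    ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = text
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = comment_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # The comment points at the item it replies to via thr:in-reply-to.
    ET.SubElement(entry, f"{{{THR_NS}}}in-reply-to", {"ref": parent_id})
    ET.SubElement(entry, f"{{{ATOM_NS}}}title")
    return ET.tostring(entry, encoding="utf-8")


def build_salmon_request(endpoint: str, body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) the POST to the source's Salmon endpoint."""
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/atom+xml"},
        method="POST",
    )
```

An aggregator would call find_salmon_endpoint once when it reads the feed, remember the result, and only build the entry and request later when one of its users actually leaves a comment.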

The commenter is identified in the published comment using the atom:uri element. How this author is authenticated in situations outside of public comments on a blog, such as RSS Bandit posting a comment to Facebook on my behalf, isn’t really discussed. I noticed an offhand reference to OAuth headers, which seems to imply that the publishing application should also send authentication headers when publishing the comment. How these authentication headers would flow through the systems involved is unclear to me, especially given the approach Salmon has taken to spam prevention.

The workflow for dealing with spam comments is described as follows:

A major concern with this type of distributed protocol is how to prevent spam and abuse. Salmon provides building blocks to allow in-depth defense against attacks. Specifically, every salmon has a verifiable author and user agent. The basic security flow when salmon swims upstream looks like this:

aggregator-example.com: "Here is a salmon, authored and signed by 'acct:johndoe@aggregator-example.com'; please accept it."

Recipient: "I know that this is really aggregator-example.com due to its OAuth headers, and it has a good reputation, but I do not trust it completely; I will do a double check."

Recipient: Uses Webfinger/XRD to discover salmon validation service for acct:johndoe@aggregator-example.com, which turns out to be hosted by aggregator-example.com.

Recipient: "Given that johndoe has delegated Salmon validation to aggregator-example, and I know I'm talking to aggregator-example already, I'll skip the actual check." (Returns HTTP 200 to aggregator-example.com)

The flow can get more complicated, especially if the aggregator is not also providing identity services for the user. In the most general case, the recipient needs to take the salmon, discover a salmon validator service for the author via XRD discovery on the author's URI, and POST the salmon to the validator service. The validator service does an integrity / signature check against the salmon and returns 200 if the salmon checks out, 400 if not. The signature check means that the given author (johndoe in this case) signed the salmon with the given id, parent id, and timestamp. It does not attempt to do a full, XML-DSig style verification, though such a service is another reasonable extension.
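Here is a minimal Python sketch of the recipient's decision as I understand it from the quoted flow. Everything named here is hypothetical: discover_validator stands in for the Webfinger/XRD lookup, post_to_validator stands in for the HTTP POST to the validation service, and sender_host is assumed to have already been authenticated through the OAuth headers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Salmon:
    author_uri: str   # e.g. "acct:johndoe@aggregator-example.com"
    sender_host: str  # host already authenticated via OAuth headers (assumed)
    payload: bytes    # the signed Atom entry as received


def accept_salmon(
    salmon: Salmon,
    discover_validator: Callable[[str], str],
    post_to_validator: Callable[[str, bytes], int],
) -> bool:
    """Return True if the recipient should accept the salmon."""
    # Find which host validates salmon for this author (Webfinger/XRD stand-in).
    validator_host = discover_validator(salmon.author_uri)

    # Shortcut from the quoted flow: the author delegated validation to the
    # very host we are already talking to, so the actual check is skipped.
    if validator_host == salmon.sender_host:
        return True

    # General case: hand the salmon to the validator service.
    # Per the excerpt, 200 means it checks out and 400 means it does not.
    return post_to_validator(validator_host, salmon.payload) == 200
```

Passing the two lookups in as callables keeps the sketch runnable without a network and makes it easy to see that the recipient itself never verifies the signature; it only decides whom to ask.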

This flow seems weird, and it is unclear to me that it actually solves the problems involved in distributed commenting. Let’s say I post a comment to Facebook from RSS Bandit. In step 3 above, Facebook is now supposed to use WebFinger to look up my email address provider and determine which service I use for digitally signing comments. Then it asks that service if the comment looks like it was from me.

Hmm, this looks like a user authentication workflow disguised as a comment validation workflow. Shouldn’t the service receiving the comment (i.e. Facebook in the example above) be responsible for validating my identity, not some third-party service? Maybe this protocol wasn’t meant for sites like Facebook?

Let’s say this protocol is really meant for situations where the comment recipient doesn’t intend to be the sole identity provider, such as commenting on Robert Scoble's blog, where he allows comments from anyone with just an email address and an optional web page URL as identifiers. So each commenter needs to provide an email address from an email service provider that supports WebFinger and validates digital signatures specifically for the Salmon protocol? Sounds like boiling the ocean. I wonder why this can’t work with OpenID validation or some other authentication protocol that has already been validated by developers and is seeing some adoption.

At the end of the day, I think the problem Salmon attempts to solve is one that needs solving as activity streams become a more popular and intrinsic feature across the Web. However, in its current form, it’s hard for me to see how it actually solves the real problems that exist today in a practical way.

Of course, this may just be my misunderstanding of the protocol documents currently published and I look forward to being corrected by one of the protocol gurus if that is the case.

Now Playing: Chris Brown - I Can Transform Ya (feat. Lil Wayne)

Categories: Social Software | Syndication Technology

November 2, 2009

@ 02:00 PM

Comments [1]

Keeping the Stream Pure: FriendFeed vs. Twitter vs. Facebook

In his post The chat room/forum problem (& an apology to @Technosailor), Robert Scoble describes an epiphany about one of the key problems with FriendFeed’s design, one he has come to realize now that he’s no longer a fanatical user of the service. Robert writes:

Twitter got lists.

This let us throw together a list of experts. For instance, I put together a list of people who have started companies. Compare that feed to your average Facebook feed and you’ll see it in stark black and white: your Facebook feed is “fun” but isn’t teaching you much.

It becomes even more stark when you do a list like my tech news brands list. See, this is NOT a forum! It is NOT a chat room!

No one can enter this community without being invited. Now compare to FriendFeed. We could have built a list like this over there, but it would have gotten noisier because of a feature called “Friend of a Friend.” That drags in people the list owner didn’t invite. Also, anyone can comment underneath any items on Facebook or FriendFeed. That brings people into YOUR life that YOU DID NOT INVITE!

Again, at first, this seems very democratic and very nice. After all, it’s great to throw a party for the whole world and let them drink your wine and have conversations with your kids. But, be honest here, would you rather have a private dinner with Steve Jobs, or would you rather have a dinner with Steve Jobs and 5,000 people who you don’t really know?

As someone who works on a platform for real-time streams for a living I always find it interesting to compare the approaches of various companies.

I agree with Robert that FriendFeed’s friend of friend feature breaks a fundamental model of the stream. In an earlier post on the problems with some of FriendFeed's design decisions, I pointed out the following problem with the feature:

FriendFeed shows you content from friends of friends: This is a major social faux pas. It may sound like a cool viral feature but showing me content from people I haven't subscribed to means I don't have control of who shows up in my feed and it takes away from the intimacy of the site because I'm always seeing content from strangers.

One of the things I’ve learned about how people interact with activity streams and news feeds is that it is important to feel like you are in control of the experience. FriendFeed’s friend of friend feature explicitly takes that away from users. I can understand why they did it (i.e. to increase the amount of content showing up in the stream for people with few friends and as a friend discovery mechanism) but it doesn’t change the fact that the behavior can seem like a nuisance and even lead to lamebook style socially awkward situations.

However, I don’t really agree with Robert’s characterization of the differences between streams on Facebook versus Twitter. On both sites, I as a user choose the primary sources that show up in my stream. Facebook is for people I know; Twitter is for brands and people I find interesting. I find both sites fun, but I agree that I’m more likely to learn something new related to work from Twitter than from Facebook. What’s important isn’t whether I am “learning” something new but whether I feel like I’m getting value out of the experience. As Biz Stone wrote in a blog post on Twitter's new terms of service:

At the start, critics often said, "Twitter is fun, but it's not useful." At one point @ev responded dryly with, "Neither is ice cream."

Although comments in the news feed on Facebook bring people I didn’t explicitly add into my stream, they are often OK because they usually come from people who I consider to be part of my extended social network or who are at least on topic. This would never work on Twitter with retweets being shown inline (for example), since anyone can follow anyone else or retweet their content, which would lead to the same sort of chat room/forum noise that Robert decries in his post.

Looking at the sketches from Twitter’s Project Retweet

I don’t think Robert’s concerns about retweets polluting the stream are warranted. From the above sketch, it just looks like Twitter is fixing the bug in the retweeting process where I have to use part of my 140 character quota to provide attribution when retweeting an interesting status update. This leads to interesting behavior such as people keeping their tweets under 125 characters to enable retweeting. I have to admit I’ve wondered more than once if I should make a tweet shorter so that it is easier to retweet. Project Retweet fixes this in a way that encourages an existing user practice on the site. That deserves kudos in my book.
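The quota arithmetic is easy to sketch in Python. This assumes the common "RT @username: " prefix, which is only one of several conventions people use, and the function names and the username below are made up for illustration.

```python
TWEET_LIMIT = 140  # character limit discussed in the post


def retweet_prefix(username: str) -> str:
    """Attribution prefix under the assumed 'RT @username: ' convention."""
    return f"RT @{username}: "


def room_after_attribution(username: str) -> int:
    """Characters left for the original text once attribution is added."""
    return TWEET_LIMIT - len(retweet_prefix(username))


def fits_as_retweet(original: str, username: str) -> bool:
    """True if the original tweet can be retweeted verbatim with attribution."""
    return len(retweet_prefix(username)) + len(original) <= TWEET_LIMIT


# A hypothetical seven-character username costs 13 characters of attribution.
print(room_after_attribution("example"))
print(fits_as_retweet("x" * 125, "example"))
```

The longer the original author's username, the less room is left, which is why a fixed rule of thumb like staying under 125 characters only works for fairly short usernames and leaves nothing for commentary.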

Now Playing: Jay-Z Feat. Alicia Keys - Empire State Of Mind

Categories: Social Software
