Monday, 19 April 2010 - Dare Obasanjo's weblog

April 19, 2010

@ 03:38 PM

There's an interesting post on ReadWriteWeb about a burgeoning technology effort supported by a number of web companies titled XAuth: The Open Web Fires a Shot Against Facebook Connect which states

A consortium of companies including Google, Yahoo, MySpace, Meebo and more announced tonight that it will launch a new system on Monday that will let website owners discover which social networks a site visitor uses and prompt them automatically to log-in and share with friends on those network. The system is called XAuth and serves to facilitate cross-site authentication (logging in) for sharing and potentially many other uses.

Facebook and Twitter, the dominant ways people share links with friends outside of email, are not participating
...

What XAuth Delivers

It's like Facebook Connect, but for every other social network.

The gist here is that XAuth will make it easier for sites around the web to find out what social networks you are using, let you log in to those easily, access your permitted information from those networks in order to better personalize your experience on their site and easily share their content back into your social network. It's like Facebook Connect, but for every other social network. Any website can register as an identity provider with XAuth, too.

What About OAuth?
If you're familiar with OAuth, you might be wondering what the difference is between that system of secure authentication and XAuth. Here's one way to explain it: XAuth tells a webpage "this is where the site visitor does social networking." Then, OAuth is the way the user logs in there, granting the site permission to access their info without seeing their password. In other words, XAuth tells you where to ask for OAuth from.
Google's Joseph Smarr, recently hired because of his high-profile work on distributed identity systems across the web, says that XAuth is a provisional solution to the limitations of the cookie system. If you visit ReadWriteWeb, for example, our servers aren't allowed to check the cookies left on your browser by the social networks you use because they are tied to URL domains other than ours.

The first thing that is worth pointing out is that XAuth is not like Facebook Connect. Facebook Connect enables a website to associate a user’s Facebook identity, social graph and activity stream with the site. XAuth enables a website to discover which services an end user is a member of that support associating a user’s identity, social graph and/or activity stream with third party websites.

A practical example of this is the various sharing options at the bottom of this blog post (if you’re viewing it in the browser on your desktop PC)

This is a fixed list of options where I had to write special Javascript code to handle each service. There are a few issues with this approach. The first is that people end up seeing exhortations to share on services they don’t use, which is just visual noise in that case. Another is that each of those widgets involves a Javascript call to the domain of the service, which impacts page load times. In fact, I used to have a Reddit widget but took it out because their server was too slow and noticeably impacted rendering of my blog. Finally, I tend to keep the list small because I don’t want my blog posts to suffer from the NASCAR problem, so some services that may be popular with my audience are left out (e.g. I have no widgets for sharing on Google Buzz, Slashdot or Digg).

How the XAuth specification attempts to solve this problem is fairly straightforward. Services that want to participate include some Javascript from http://xauth.org which writes some data to the local storage (not a cookie) for that domain when the user visits the social network. At that point there is now an indication on the user’s machine that they are a member of the aforementioned social network. Then when the user visits a site such as my blog, I also include the same Javascript from http://xauth.org, but this time I ask it if the user is a member of the social networks I’m interested in. Once the list of sites is returned, I then only have to render sharing widgets from the sites I support which the user is a member of.
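The flow described above can be sketched as a toy model. This is not the actual XAuth Javascript API: `SharedBrowserStore`, `announce_membership`, `lookup_memberships` and the `.example` service names are all hypothetical, and a Python dictionary stands in for the browser local storage held under the shared domain.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBrowserStore:
    """Stand-in for the browser local storage owned by the shared domain."""
    entries: dict = field(default_factory=dict)

    def announce_membership(self, service: str, token: str) -> None:
        # Called from a social network's own page when the user visits it.
        self.entries[service] = token

    def lookup_memberships(self, services_of_interest: list) -> dict:
        # Called from a third party site. Only services that the site asked
        # about AND that the user is a member of are returned.
        return {s: self.entries[s] for s in services_of_interest if s in self.entries}


def widgets_to_render(store: SharedBrowserStore, supported: list) -> list:
    """Return the sharing widgets worth rendering for this visitor."""
    found = store.lookup_memberships(supported)
    return [s for s in supported if s in found]


# The user visits two social networks, each of which records a token.
store = SharedBrowserStore()
store.announce_membership("twitter.example", "opaque-token-1")
store.announce_membership("digg.example", "opaque-token-2")

# Later the user visits a blog that supports four sharing widgets.
blog_supports = ["facebook.example", "twitter.example", "digg.example", "reddit.example"]
visible = widgets_to_render(store, blog_supports)
print(visible)
```

In this sketch the blog renders only two of its four widgets, which is the "hide options that do not matter to the user" behavior, and nothing more.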

In general, I think XAuth is a legitimate attempt to solve a valid problem. However it should be made clear what problems XAuth doesn’t solve. For one, people like me who have an account on Facebook, Twitter, Digg, Google, Windows Live, Reddit, Delicious, etc will actually have a more cluttered experience than we do today. I know I’m always a little confused when I visit a site that uses Clickpass since I can never remember if I’ve associated that site with my Facebook, Windows Live, Yahoo! or Google account. Similarly XAuth will potentially exacerbate the problem for the subset of people who are members of lots of social media sites. Another thing to be made clear is that this isn’t a replacement for delegated authentication and authorization technologies like Facebook Connect, Twitter’s @Anywhere or OAuth WRAP. Sites will still need to support all of these technologies if they want to reach the widest audience. This is just about hiding options from users that do not matter to them.

The one thing I’d keep an eye on is that XAuth provides a token that uniquely identifies the user as part of the results returned to the requesting site instead of simply stating the user is a member of a specified social media site. This enables the requesting site (e.g. my blog) to potentially make some API calls to the social network site for information specific to the user without asking for permission first, for example to pre-populate the user’s name and display picture in a comment box. Since Facebook has already announced such functionality I guess people don’t think it is overstepping the bounds of the user relationship to enable this feature on any website the user visits without the user explicitly granting the sites permission to their profile information. It will be interesting to see if implementations of this feature steer clear of some of the creepiness factor of Facebook’s Beacon program which led to massive outcry in its day.

Now Playing: Jamie Foxx - Winner (featuring Justin Timberlake & T.I.)

Categories: Social Software | Web Development

April 12, 2010

@ 02:21 PM

Comments [15]

WebHost4Life: When Good Web Hosting Companies Go Bad

Regular readers of this blog may have noticed that accessing this blog has been flaky all weekend. I've gotten numerous reports via Twitter that my blog was displaying the following error message when being visited

Server Error in '/site1/weblog' Application.
--------------------------------------------------------------------------------
Exception of type 'System.OutOfMemoryException' was thrown.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
Source Error:
An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Stack Trace:
[OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.]
Go1999(RegexRunner ) +0
System.Text.RegularExpressions.CompiledRegexRunner.Go() +14
System.Text.RegularExpressions.RegexRunner.Scan(Regex regex, String text, Int32 textbeg, Int32 textend, Int32 textstart, Int32 prevlen, Boolean quick) +144
System.Text.RegularExpressions.Regex.Run(Boolean quick, Int32 prevlen, String input, Int32 beginning, Int32 length, Int32 startat) +134
System.Text.RegularExpressions.Regex.Match(String input) +44
newtelligence.DasBlog.Web.Core.UrlMapperModule.HandleBeginRequest(Object sender, EventArgs evargs) +458
System.Web.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +68
System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +75

This seems like a straightforward issue. The machine my blog is hosted on is running out of memory probably because the server is either overloaded from having too many sites hosted on it or one of the sites is badly written and is using up more than its fair share of memory. This has happened once or twice in the past and when it did I used the Live Chat feature of WebHost4Life, my hosting company, to contact a support rep who quickly moved my site to a different server. Thus when I got the first reports of this issue I thought this would be a routine support issue. I was sorely mistaken.

Earlier this year, WebHost4Life started migrating their customers to a new infrastructure with a different set of support staff. It seems this was a cost cutting measure because the new support staff seem to be a lot less technical than their previous counterparts and seem to have less access to their infrastructure. Over the weekend I chatted with about three or four different support folks over IM and opened a ticket that was closed multiple times. Here is a sampling of some of the messages that were written as part of closing the support ticket by WebHost4Life

Hello,

Thank you for contacting Support.

We apologize for any inconvenience this has caused you. I have checked your website at the URL: http://www.25hoursaday.com/weblog/ and it is working fine. Could you please check it once again after clearing browser cache and cookies? If the issue persists, please get back to us with the exact error message, so that we will investigate on your issue.

Thank you!

Sincerely,

Sharon <redacted>
Customer Support

~~~ TICKET CLOSED

Hello,
Thank you for your reply.

I have checked your website URL: http://www.25hoursaday.com/weblog/ and I was able to duplicate the issue. I have received the error message. Hence, I have asked a member of our team who specializes in website management to review your issue. You should be hearing from this specialist within 24-48 hours. If you have any questions in the meantime, please let us know and be sure to refer to the link http://www.webhost4life.com/member/sconsole for the quickest service.

Thank you for choosing webhost4life, we appreciate your support.

Sincerely,

Sharon <redacted>
Customer Support

~~~ TICKET ESCALATED TO TIER 2 SUPPORT

Hello,

I have checked the issue and and was not able to duplicate it. It seems to be issue with your ISP. Please check once again and if the issue still persists, please get back to us, tracert results of the website when you experience the issue and also the exact time and location so that we can investigate on it further.

If you have any further questions, please update the Support Console.

Sincerely,

Aleta <redacted>
Technical Specialist

~~~ TICKET CLOSED

Hello,

Thank you for getting back to us.

Currently, your website http://www.25hoursaday.com/weblog/ is loading without any slowness. I suggest you to check the website functionality again and get back to us, if the issue persists.

If you have any further questions, please update the Support Console.

Sincerely,

Kara
Technical Specialist

~~~ TICKET CLOSED

I’m still getting reports that my blog is intermittently throwing out of memory exceptions, and as you can see from the above, nothing is being done to fix this problem. At first I wondered if the poor support I was getting was because this happened on the weekend and perhaps the technical folks only work weekdays. Unfortunately my hopes were dashed when someone on Twitter pointed me to a web hosting review site with dozens of complaints about WebHost4Life which echoed my experience. It seems the company is under new management and quality has suffered under some of their cost cutting moves.

Although I paid for a year’s worth of service, I’ve decided that I probably need to switch hosting companies. Thus I’m seeking recommendations for a web hosting company that supports ASP.NET and .NET 2.0 or higher from any of my readers. Also if anyone is familiar with the process of cancelling your service with a web host including getting refunded for services not provided, I’d love any experience or tips you have to share.

Categories: Rants

April 10, 2010

@ 04:36 PM

Comments [3]

Twitter Slaps Developers in the Face and How They Can Fix It

In a move that was telegraphed by Fred Wilson’s (Twitter investor) post titled The Twitter Platform's Inflection Point where he criticized Twitter platform developers for “filling holes” in Twitter’s user experience, the Twitter team have indicated they will start providing official Twitter clients for various mobile platforms. There have been announcements of an official Blackberry client and the purchase of Tweetie so it can become the official iPhone client. The latter was announced in the blog post Twitter for iPhone excerpted below

Careful analysis of the Twitter user experience in the iTunes AppStore revealed massive room for improvement. People are looking for an app from Twitter, and they're not finding one. So, they get confused and give up. It's important that we optimize for user benefit and create an awesome experience.

We're thrilled to announce that we've entered into an agreement with Atebits (aka Loren Brichter) to acquire Tweetie, a leading iPhone Twitter client.

This has led to some anger on the part of Twitter client developers with some of the more colorful reactions being the creation of the Twitter Destroyed My Market Segment T-shirt and a somewhat off-color image that is making the rounds as representative of what Twitter means by “filling holes”.

As an end user and someone who works on web platforms, none of this is really surprising. Geeks consider having to wade through half a dozen Twitter clients before finding one that works for them a feature, even though the paradox of choice means that most people are actually happier with fewer choices, not more. This is made worse by the fact that in the mobile world, this may mean paying for multiple apps until you find one that you’re happy with.

Any web company that cares about their customers will want to ensure that their experience is as simple and as pleasant as possible. Trusting your primary mobile experience to the generosity and talents of 3rd party developers means you are not responsible for the primary way many people will access your service. This loss of control isn’t great especially when the design direction you want to take your service in may not line up with what developers are doing in their apps. Then there’s the fact that forcing your users to make purchasing decisions before they can use your site conveniently on their phone isn’t a great first time experience either.

I expect mobile clients are just the beginning. There are lots of flaws in the Twitter user experience that are due to Twitter’s reliance on “hole fillers” that I expect they’ll start to fill. The fact that I ever have to go to http://bit.ly as part of my Twitter workflow is a bug. URL shorteners really have no reason to exist in the majority of use cases except when Twitter is sending an SMS message. Sites that exist simply as image hosting services for Twitter like Twitpic and YFrog also seem extremely superfluous, especially when you consider that only power users know about them, so not every Twitter user figures out how to use the service for image sharing. I expect this will eventually become a native feature of Twitter as well. Once Twitter controls the primary mobile clients for accessing their service, it’ll actually be easier for them to make these changes since they don’t have to worry about whether 3rd party apps will support Twitter image hosting vs. Twitpic vs. rolling their own ghetto solution.

The situation is made particularly tough for 3rd party developers due to Twitter’s lack of a business model as Chris Dixon points out in his post Twitter and third-party Twitter developers

Normally, when third parties try to predict whether their products will be subsumed by a platform, the question boils down to whether their products will be strategic to the platform. When the platform has an established business model, this analysis is fairly straightforward (for example, here is my strategic analysis of Google’s platform). If you make games for the iPhone, you are pretty certain Apple will take their 30% cut and leave you alone. Similarly, if you are a content website relying on SEO and Google Adsense you can be pretty confident Google will leave you alone. Until Twitter has a successful business model, they can’t have a consistent strategy and third parties should expect erratic behavior and even complete and sudden shifts in strategy.

So what might Twitter’s business model eventually be? I expect that Twitter search will monetize poorly because most searches on Twitter don’t have purchasing intent. Twitter’s move into mobile clients and hints about a more engaging website suggest they may be trying to mimic Facebook’s display ad model.

The hard question then is what opportunities will be left for developers on Twitter’s platform once the low hanging fruit has been picked by the company. Here I agree with frequent comments by Dave Winer and Robert Scoble, that there needs to be more metadata attached to tweets so that different data aggregation and search scenarios can be built which satisfy thousands of niches. I especially like what Dave Winer wrote in his post How Twitter can kill the Twitter-killers where he stated

Suppose Twitter wants to make their offering much more competitive and at the same time much more attractive to developers. Sure, as Fred Wilson telegraphed, some developers are going to get rolled over, esp those who camped out on the natural evolutionary path of the platform vendor. But there are lots of things Twitter Corp can do to create more opportunities for developers, ones that expand the scope of the platform and make it possible for a thousand flowers to bloom, a thousand valuable non-trivial flowers.

The largest single thing Twitter could do is open tweet-level metadata. If I want to write an app for dogs who tweet, let me add a "field" to a tweet called isDog, a boolean, that tells me that the author of the tweet is a dog. That way the dog food company who has a Twitter presence can learn that the tweet is from a dog, from the guy who's developing a special Twitter client just for dogs, even though Twitter itself has no knowledge of the special needs of dogs. We can also add a field for breed and age (in dog years of course). Coat type. Toy preference. A link to his or her owner. Are there children in the household?

I probably wouldn’t have used the tweeting dog example but the idea is sound. Location is an example of metadata that is added to tweets which can be used for interesting applications on top of the core news feed experience as shown by Twittervision and Bing's Twitter Maps. I think there’s an opportunity to build interesting things in this space especially if developers can invent new types of metadata without relying on Twitter to first bless new fields like they’ve had to do with location (although their current implementation is still inadequate in my opinion).
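As a rough illustration of what open tweet-level metadata could look like, here is a minimal sketch. It is not Twitter’s API: the `Tweet` class, the `metadata` dictionary and the `select` helper are hypothetical, the sample tweets are made up, and only the `isDog` and `breed` field names come from Dave Winer’s example.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    author: str
    text: str
    # Open-ended key/value pairs that any client may attach. The platform
    # stores them without needing to understand what they mean.
    metadata: dict = field(default_factory=dict)


def select(tweets, **wanted):
    """Return tweets whose metadata matches every requested key/value pair."""
    return [t for t in tweets
            if all(t.metadata.get(k) == v for k, v in wanted.items())]


timeline = [
    Tweet("rex", "Squirrel!", {"isDog": True, "breed": "beagle"}),
    Tweet("alice", "Checking in downtown", {"place": "downtown austin"}),
    Tweet("fido", "Nap time", {"isDog": True, "breed": "poodle"}),
]

# A niche application filters on a field the platform never had to bless.
dog_tweets = select(timeline, isDog=True)
print([t.author for t in dog_tweets])
```

The design point is that the dog-specific client and the dog food company agree on the field between themselves, while the platform only has to carry the extra data along.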

Over the next few months, Twitter will likely continue to encroach on territory which was once assumed to belong to 3rd party developers. The question is whether Twitter will replace the opportunities they’ve taken away with new ones, or whether they’ve simply used developers as a means to an end and no longer find them useful.

Now Playing: Notorious B.I.G. - One More Chance

Categories: Competitors/Web Companies | Platforms

March 29, 2010

@ 02:09 AM

Comments [6]

The NoSQL Debate: Automatic vs. Manual Transmission

The debate on the pros and cons of non-relational databases, which are typically described as “NoSQL databases”, has recently been heating up. The anti-NoSQL backlash is in full swing, from the rebuttal to one of my recent posts in Dennis Forbes’s write-up The Impact of SSDs on Database Performance and the Performance Paradox of Data Explodification (aka Fighting the NoSQL mindset) to similar thoughts expressed in typical rant-y style by Ted Dziuba in his post I Can't Wait for NoSQL to Die.

This will probably be my last post on the topic for a while given that the discussion has now veered into religious debate territory similar to vi vs. emacs OR functional vs. object oriented programming. With that said…

It would be easy to write rebuttals of what Dziuba and Forbes have written, but from what I can tell people are now talking past each other and defending entrenched positions. So instead I’ll leave this topic with an analogy. SQL databases are like automatic transmissions and NoSQL databases are like manual transmissions. Once you switch to NoSQL, you become responsible for a lot of work that the system takes care of automatically in a relational database system, similar to what happens when you pick manual over automatic transmission. Secondly, NoSQL allows you to eke more performance out of the system by eliminating a lot of integrity checks done by relational databases from the database tier. Again, this is similar to how you can get more performance out of your car by driving a manual transmission versus an automatic transmission vehicle.

However, the most notable similarity is that, just as most of us can’t really take advantage of the benefits of a manual transmission vehicle because the majority of our driving is sitting in traffic on the way to and from work, most sites aren’t at Google or Facebook’s scale and thus have no need for a Bigtable or Cassandra. As I mentioned in my previous post, I believe a lot of problems people have with relational databases at web scale can be addressed by taking a hard look at adding in-memory caching solutions like memcached to their infrastructure before deciding to throw out their relational database systems.
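To make the caching suggestion concrete, here is a minimal sketch of the cache-aside pattern. It rests on a few assumptions: a plain dictionary stands in for a memcached client, an in-memory SQLite database stands in for the relational database, and the `users` table, the key format and the function names are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'dare')")

cache = {}                  # stand-in for a memcached client
stats = {"db_reads": 0}     # counts how often a read falls through to SQL


def get_user_name(user_id):
    """Cache-aside read: try the cache first, then the database."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    name = row[0] if row else None
    if name is not None:
        cache[key] = name   # populate the cache for later readers
    return name


def rename_user(user_id, new_name):
    """Write to the database, then invalidate the cached copy."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    cache.pop(f"user:{user_id}", None)


first = get_user_name(1)    # miss: reads from the database
second = get_user_name(1)   # hit: served from the cache
rename_user(1, "obasanjo")
third = get_user_name(1)    # invalidated, so this reads the database again
print(first, second, third, stats["db_reads"])
```

The relational database remains the source of truth here; the cache only absorbs repeated reads, which is the part of the load that most sites struggle with first.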

Now Playing: Lady Gaga - Bad Romance

Categories: Web Development

March 15, 2010

@ 04:05 PM

Comments [3]

Some Thoughts on Location Based Services Like Gowalla and FourSquare

I was recently on a panel at the South by Southwest Interactive conference (SXSW) where we discussed multiple applications of the real-time Web and the things that might prevent us from seeing its true potential. I’ve found it interesting that the key takeaway from the panel is that privacy issues will be one of the biggest problems we will face as we move forward. You can see this perspective in CNN’s coverage of the panel in the story Privacy concerns hinder 'real-time Web' creation, developers say and GigaOm’s write-up SXSW: Is Privacy on the Social Web a Technical Problem?

This overlap of privacy and real-time web features is brought into sharp relief when you look at services such as Foursquare and Gowalla which provide a mechanism for people to broadcast their physical location to a group of friends in real-time. I started using Foursquare last week and I’ve noticed I’m even more careful about who I accept friend requests from than on Facebook or Windows Live Messenger. The fact is that I may share status updates and photos with people but it doesn’t mean I want them to be aware of where I am on an up to the minute basis especially if I’m out spending time with my family and friends. This difference in how we view location data from other sorts of real-time data we share is captured by the co-founder of Foursquare in the article Facebook Isn't For Real Life Friends Anymore, Says Foursquare's Dennis Crowley where it states

Facebook plans to clone Foursquare's central service -- the ability for site members to use their phones to "check-in" from restaurants and bars -- and make it a mere Facebook feature.

But Foursquare cofounder Dennis Crowley says there's something Facebook can't clone: the real-life friendships between Foursquare users.

"Facebook used to be who your friends are, now it's everyone," Dennis told us in an interview.

"[Foursquare] is more tightly curated to who you want to have as your check-in friends. Facebook is good place for status updates and sharing photos, not to keep tabs on where people are going."

I think Crowley is on to something when he says Facebook can’t clone the Foursquare relationship model. I suspect that like Twitter, Foursquare has created a social network whose value proposition is differentiated enough from Facebook’s that it can grow into a relatively popular albeit smaller service that will not be “killed” by Facebook*. Secondly, there is a lot of synergy between Foursquare and Facebook as evidenced by the fact that Facebook is the largest referrer of traffic to Foursquare thanks to their implementation of Facebook Connect. So I think the claims that one will kill the other are just the usual tech press creating conflict to generate page views.

One thing I have noticed is that location can’t just be a field you bolt on to a status update. It has to be a key part of the information you are sharing with others otherwise it adds little value to the user experience and in fact may detract from it by adding clutter. For example, compare what a location-based update from Foursquare looks like on Facebook versus what the exact same update looks like on Twitter

VS

The difference between both updates is almost night and day even though the actual status text I shared is the same. The way Twitter has approached location is to treat it as a bunch of “poorly translated” GPS coordinates that are bolted on to the end of my status update. The Facebook update not only gives you that but also a human readable location for where I am down to the room number and includes some social context such as the fact that I was attending the talk with two coworkers from Windows Live.

As real-time location data starts to permeate social experiences, there’s a lot to learn from the above screenshots. In the example above, people who are interested in the topic based on my status knew which room to find danah’s talk from the Facebook update whereas they were told “downtown austin” in the Twitter update. As designers of social software applications, we should be mindful that location data enhances the experience and the information being shared. Adding location simply for buzzword compliance or to add metadata to the status update without enhancing the experience actually ends up crufting it up.

* Twitter’s value proposition is that it is the place to interact with celebrities and microcelebrities that you care about. It is useful to note that the much maligned Suggested Users List was key in establishing this value proposition in the minds of users. This is different from Facebook’s position as the social network for your real world friends, family, coworkers and acquaintances.

Now Playing: B.O.B. - Nothin' On You (featuring Bruno Mars)

Categories: Social Software | Startup Shoutout

March 10, 2010

@ 03:06 PM

Comments [2]

Building Scalable Databases: Are Relational Databases Compatible with Large Scale Websites?

A few weeks ago Todd Hoff over on the High Scalability blog penned a blog post titled MySQL and Memcached: End of an Era? where he wrote

If you look at the early days of this blog, when web scalability was still in its heady bloom of youth, many of the articles had to do with leveraging MySQL and memcached. Exciting times. Shard MySQL to handle high write loads, cache objects in memcached to handle high read loads, and then write a lot of glue code to make it all work together. That was state of the art, that was how it was done. The architecture of many major sites still follow this pattern today, largely because with enough elbow grease, it works.
…
With a little perspective, it's clear the MySQL+memcached era is passing.
…
LinkedIn has moved on with their Project Voldemort. Amazon went there a while ago.

Digg declared their entrance into a new era in a post on their blog titled Looking to the future with Cassandra,
…
Twitter has also declared their move in the article Cassandra @ Twitter: An Interview with Ryan King.

Todd’s blog has been a useful source of information on the topic of scaling large scale websites since he catalogues as many presentations as he can find from industry leaders on how they’ve designed their systems to deal with millions to hundreds of millions of users pounding their services a day. What he’s written above is really an observation about industry trends and isn’t really meant to attack any technology. I did find it interesting that many took it as an attack on memcached and/or relational databases and came out swinging.

One post which I thought tried to take a balanced approach to rebuttal was Dennis Forbes’ Getting Real about NoSQL and the SQL-Isn't-Scalable Lie where he writes

I work in the financial industry. RDBMS’ and the Structured Query Language (SQL) can be found at the nucleus of most of our solutions. The same was true when I worked in the insurance, telecommunication, and power generation industries. So it piqued my interest when a peer recently forwarded an article titled “The end of SQL and relational databases”, adding the subject line “We’re living in the past”. [Though as Michael Stonebraker points out, SQL the query language actually has remarkably little to actually to do with the debate. It would be more clearly called NoACID]
…
From a vertical scaling perspective — it’s the easiest and often the most computationally effective way to scale (albeit being very inefficient from a cost perspective) — you have the capacity to deploy your solution on powerful systems with armies of powerful cores, hundreds of GBs of memory, operating against SAN arrays with ranks and ranks of SSDs.

The computational and I/O capacity possible on a single “machine” are positively enormous. The storage system, which is the biggest limiting factor on most database platforms, is ridiculously scalable, especially in the bold new world of SSDs (or flash cards like the FusionIO).
…
From a horizontal scaling perspective you can partition the data across many machines, ideally configuring each machine in a failover cluster so you have complete redundancy and availability. With Oracle RAC and Sybase ASE you can even add the classic clustering approach. Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve built a system for a large corporation, deployed in your own datacenter, with few constraints beyond the limits of technology and the platform.

Your solution will cost hundreds of thousands of dollars (if not millions) to deploy, but that isn’t a critical blocking point for most enterprises. This sort of scaling that is at the heart of virtually every bank, trading system, energy platform, retailing system, and so on.

To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming evidence, defies all reason.

There’s lots of good food for thought in both blog posts. Todd is right that a few large scale websites are moving beyond the horizontal scaling approach that Dennis brought up in his rebuttal based on their experiences. What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you notice you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. In the end, all you have left is that your data is Durable (i.e. it is persistently stored), which isn’t much better than what you get from a dumb file system. Well, actually you also get to use SQL as your programming model, which is nicer than performing direct file I/O operations.

It is unsurprising that after being at this point for years, some people in our industry have wondered whether it doesn’t make more sense to use data stores that are optimized for the usage patterns of large scale websites instead of gloriously misusing relational databases. A good example of the tradeoffs is the blog post from the Digg team on why they switched to Cassandra. The database was already sharded, which made it infeasible to perform joins to answer queries such as “which of my friends Dugg this item?” So instead they had to perform two reads from SQL (all Diggs on an item and all of the user’s friends) and then perform the intersection operation in the PHP front end code. If the item was not already cached, this led to disk I/O which could take seconds. To make the situation worse, you actually want to perform this operation multiple times on a single page view since it is reasonable to expect multiple Digg buttons on a page if it has multiple stories on it.
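The read-then-intersect step can be sketched in a few lines. This is only an illustration, written in Python rather than Digg's PHP, and the function name and sample IDs are hypothetical; the two lists stand in for the results of the two separate SQL reads.

```python
def friends_who_dugg(item_digger_ids, friend_ids):
    """Return the viewer's friends who Dugg an item, as a sorted list.

    Both arguments stand in for rows fetched by separate queries
    (possibly from different shards), so the "join" happens in
    application code instead of in the database.
    """
    return sorted(set(item_digger_ids) & set(friend_ids))


# Hypothetical results of the two reads
diggers = [101, 205, 309, 412]   # everyone who Dugg the item
friends = [205, 412, 999]        # the viewing user's friends
print(friends_who_dugg(diggers, friends))
```

The cost is easy to see here: both full lists have to be fetched before the intersection can run, and this happens once per Digg button on the page.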

An alternate approach is to denormalize the data and for each user store a list of stories that have been Dugg by at least one of their friends. So whenever I Digg an item, an entry is placed in each of my friends’ lists to indicate that story is now one that has been Dugg by a friend. That way when a friend of mine shows up, it is a simple lookup to say “is this story ID on the list of stories Dugg by one of their friends?” The challenge here is that it means Digging an item can result in literally thousands of logical write operations. It has traditionally been prohibitively expensive to incur such massive amounts of write I/O in relational databases with all of their transactionality and enforcing of ACID constraints. NoSQL databases like Cassandra, which assume your data is denormalized, are actually optimized for write I/O heavy operations given the necessity of having to perform enormous amounts of writes to keep data consistent.
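A minimal sketch of this denormalized approach, assuming in-memory dictionaries in place of a real data store. All names here (`friends_of`, `dugg_by_friends`, `digg`, `friend_dugg`) are hypothetical and are not Digg's or Cassandra's API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the data store
friends_of = {"dare": ["alice", "bob"], "alice": ["dare"]}
dugg_by_friends = defaultdict(set)   # user -> story IDs Dugg by a friend


def digg(user, story_id):
    """Write path: add the story to every friend's list (one write per friend)."""
    writes = 0
    for friend in friends_of.get(user, []):
        dugg_by_friends[friend].add(story_id)
        writes += 1
    return writes


def friend_dugg(user, story_id):
    """Read path: a single lookup, with no join or intersection."""
    return story_id in dugg_by_friends[user]


print(digg("dare", 42))          # number of logical writes for one Digg
print(friend_dugg("alice", 42))  # alice is a friend of dare
```

The number of writes grows with the size of the Digger's friend list, which is where the "thousands of logical write operations" come from, while the read stays a constant-time membership check.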

Digg’s usage of Cassandra actually serves as a rebuttal to Dennis Forbes’ article since they couldn’t feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution. I would argue that introducing memcached into the mix would have addressed disk I/O concerns because all records of who has Dugg an item could be stored in-memory so comparisons of which of my friends have Dugg an item never have to go to disk to answer any parts of the query. The only caveat with that approach is that RAM is more expensive than disk so you’ll need a lot more servers to store 3 terabytes of data in memory than you would on disk.

However, the programming model is not the only factor one must consider when deciding whether to stay with a sharded/partitioned relational database versus going with a NoSQL solution. The other factor to consider is the actual management of the database servers. The sorts of questions one has to ask when choosing a database solution are listed in the interview with Ryan King of Twitter, where he gives the following checklist that they evaluated before deciding to go with Cassandra over MySQL:

We first evaluated them on their architectures by asking many questions along the lines of:

How will we add new machines?

Are there any single points of failure?

Do the writes scale as well?

How much administration will the system require?

If it's open source, is there a healthy community?

How much time and effort would we have to expend to deploy and integrate it?

Does it use technology which we know we can work with?

The problem with database sharding is that it isn’t really a supported out-of-the-box configuration for your traditional relational database product, especially the open source ones. How your system deals with new machines being added to the cluster or handles machine failure often requires special case code being written by application developers along with special hand holding by operations teams. Dealing with issues related to database replication (whether it is multi-master or single master) also often takes up unexpected amounts of manpower once sharding is involved.
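To make the "adding new machines" problem concrete, here is a small sketch of the simplest application-level sharding scheme, hashing a key modulo the number of shards. The scheme and the names `shard_for` and `moved_keys` are assumptions for illustration; they are not taken from any of the systems discussed above.

```python
import zlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def moved_keys(keys, old_count, new_count):
    """Return the keys whose shard changes when the cluster is resized."""
    return [k for k in keys
            if shard_for(k, old_count) != shard_for(k, new_count)]


keys = ["user:%d" % i for i in range(1000)]
moved = moved_keys(keys, 4, 5)   # grow the cluster from 4 to 5 machines
print("%d of %d keys must migrate" % (len(moved), len(keys)))
```

With a uniform hash, going from 4 to 5 shards leaves a key in place only when its hash is 0 to 3 modulo 20, so roughly four out of five keys have to move. Planning and executing that migration is exactly the kind of special case code and operational hand holding described above.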

For these reasons I expect we’ll see more large scale websites decide that instead of treating a SQL database as a denormalized key-value pair store they would rather use a NoSQL database. However I also suspect that a lot of services that already have a sharded relational database + in-memory cache solution can get a lot of mileage from more judicious usage of in-memory caches before switching. This is especially true given that you still need caches in front of your NoSQL databases anyway. There’s also the question of whether traditional relational database vendors will add features to address the shortcomings highlighted by the NoSQL movement. Given that the sort of companies adopting NoSQL are doing so because they want to save costs on software, hardware and operations, I somehow doubt that there is a lucrative market here for database vendors versus adding more features that the banks, insurance companies and telcos of the world find interesting.

Now Playing: Birdman - Money To Blow (featuring Drake & Lil Wayne)

Categories: Web Development

February 28, 2010

@ 05:19 PM

Comments [0]

Achievements, Game Mechanics and Social Software

Earlier this morning I saw the following tweet by Alex Payne, one of the developers who works on Twitter

Game mechanics aren't going to fix your product and they aren't making people's lives better. Great essay: http://j.mp/aN66i8

Alex’s description piqued my interest so I checked out the article by Peter Michaud titled Achievement Porn and not only agreed that it is a great essay but walked away from it with a fairly different conclusion from Alex. Below are two key excerpts from the article

The game article, and the meta discussion surrounding it is actually part of an even larger discussion that affects more than just video gamers. Games are just a minor symptom of a systematic disease:

Our society is set up to make us feel as though we must always achieve and grow. That’s true because individuals growing tend to bolster the power and creature comforts of the groups they belong to with inventions, innovations, and impressive grandstanding (Go Team!).

Because of this pressure to grow, there’s another incentive to make growth easier. More perversely, to make growth seem easier.

Why work hard for achievements, when you could relax and achieve the same? That’s not pathological, that’s how exponential progress works.

But why achieve at all when you can plug into any number of “achievement games” and get the same personal satisfaction?
…
The good news is that these little “achievement games” are fairly easy to recognize once you realize what’s going on. The bad news is that more are cropping up at an alarming rate, sped largely by the intertubes.

Games fast becoming standard are the “followers” and “friends” games for example. Twitter, FaceBook, LinkedIn, et al, all have their own ostensible raison d’etre, but the psychological underpinning they all share is this treadmill of achievement. This accumulation of points that’s correlated with whatever the intended benefit of the service is.

I find this discussion interesting because it matches the theme of my most recent posts: the difference between adding features that are good for users versus good for the product. The physiological underpinnings that make achievement games work have been covered quite well in the Slate article Seeking: How the brain hard-wires us to love Google, Twitter, and texting. And why that's dangerous. The article argues that our brains are wired to derive more pleasure from chasing after something than from actually getting it. However, although we are hard wired to constantly chase after achievement, it is our individual choice which achievements spur us. Thus it is the same underlying biology that explains the addictions of Tiger Woods and those of the World of Warcraft junkie.

Our lives are full of lots of little “achievement treadmills”; it’s just that video games are the most obvious. A few months ago I started playing Call of Duty: Modern Warfare 2. I play it for about 1-2 hours every day and according to the game have clocked in almost two weeks of playing time. The game has lots of mini-achievements and ways to keep you grinding, from unlocking titles when you complete challenges like scoring 500 headshots with a particular weapon to encouraging you to de-power your character after you hit level 70 and start the grind all over again (I’m on my 3rd or 4th circuit). The interesting question is what have I lost by spending all this time playing MW2?

It turns out that the two activities that have suffered the most are my blogging and writing code for RSS Bandit. An insight from Peter Michaud’s post is that these were also achievement treadmills in their own way. On my blog all of my posts literally have a score which is the number of times other people have retweeted links to them, bookmarked them on delicious or shared them on Facebook. I also use FeedBurner and for a while used to obsess about my number of subscribers but eventually got over it since I don’t have the time or willingness to create the kind of content that generates a large following. As for RSS Bandit, the number of people who use it and the number of bugs I fixed have always been motivating factors. I can still remember the feeling I’d get when I’d see stats like 100,000 downloads a month or when I realized the application had been downloaded over a million times since it had started. Since I consider the glory days of Outlook-inspired desktop RSS readers to be in the past, I’m not as motivated as I once was to work on the project.

What it really boils down to is that I traded one set of “achievement treadmills” (i.e. blogging and contributing to an Open Source project) for another more explicit set (i.e. playing Modern Warfare 2). Now we can go back to Alex Payne’s tweet and find out where I disagree. From the perspective of Infinity Ward (creators of MW2) is it a bad thing for their business that they’ve created a game that has sucked me into almost 300 hours of play time? On the other hand, is it a good thing for me as a fully functioning member of society to have cut down my contributions to an Open Source project and the blogosphere to play a video game? Finally, is it better for me as a person to have traded achievement treadmills where I have little control over the achievements (i.e. number of blog subscribers, number of people who download a desktop RSS reader, etc) for one where I have complete control of the achievements as long as I dedicate the time?

I’ll leave the answers to those questions up to the reader. I will say game mechanics can more than “fix” a social software product; they can make it a massive success that its users are obsessed with. Just look at Farmville or FourSquare for explicit examples, or sites like Twitter, which have inspired hundreds of guides to increasing your number of Twitter followers, for a more subtle example. Does it mean that these products aren’t making their users’ lives “better”? Well, it depends on how you define better.

Now Playing: DJ Khaled - All I Do Is Win (featuring Ludacris, Rick Ross, T-Pain & Snoop Dogg)

Categories: Social Software

February 15, 2010

@ 04:22 PM

Comments [2]

Understanding the Real-Time Web for Web Developers

The term “the real-time web” has become popular as a way to describe burgeoning trends and technologies related to consuming web content as soon as it is created. However, like popular buzz phrases such as “services oriented architecture” and “web 2.0” which came before it, there is often difficulty in understanding where the technical details end and where the hype begins. Given that this trend is a fundamental shift in how many users interact with the web, it is a good idea for developers to have a clear idea of the key concepts and implementation options available to them as they bring their applications into the real-time web.

What Features and Functionality Make Up the Real-Time Web?

When people talk about real-time web technologies, they are usually talking about one or more of the following features:

Refreshing a web page as new updates are available without reloading the page. A good example of this is performing a search on Twitter, where a yellow bar displays a constantly updated count of the tweets posted since you started searching.
Receiving notifications on content updates as soon as they happen instead of polling. This is primarily about moving away from RSS’s model of polling every couple of minutes or hours and instead having an end point that gets messages delivered as soon as they happen. An example of this is the fact that user status updates from Twitter appear within a second on FriendFeed when the user has hooked up both services. This is in contrast to how long it takes blog posts to show up in the typical RSS reader from when they are published.
Some people consider the universe of status updates on sites like Facebook and Twitter to be the real-time web. For these people, the key interesting technology in this space is the ability to consume a neverending feed of content from these sites (aka a fire hose) and provide search functionality over this data.

We can now take a look at some of the underlying technologies that make some of these user experiences and scenarios possible.

Bringing Real-Time to AJAX: COMET, Long Polling and soon Web Sockets

Most web developers should be familiar with the concept of Asynchronous Javascript and XML (AJAX) which enables the creation of dynamic webpages that can be partially updated without having to reload the site. A traditional AJAX interaction involves the user interacting with part of the page, the browser submitting the request to the server and then rebuilding that part of the page with the results from the request. However it turns out that there are many situations where an application may want to update parts of a page without waiting for user interaction, such as displaying live stock tickers, instant messaging scenarios or showing feedback on an article as comments are posted. The set of approaches to solving this problem are typically described using the name COMET.

COMET typically refers to keeping a permanent open connection between a browser and a server using a number of techniques. One approach is the hidden iframe technique. With this technique you create an inline frame (i.e. an iframe) that is hidden from the user and then have the frame slowly filled with content as events occur on the server. This takes advantage of the fact that a browser will keep an open connection to the server as long as a page has not fully loaded. There’s a great example of what generating one of these invisible iframes looks like on the server side in the article How to implement COMET with PHP

  <?php
 
  header("Cache-Control: no-cache, must-revalidate");
  header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
  flush();

 
  ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
 
    <title>Comet php backend</title>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
 
  <script type="text/javascript">

    // KHTML browser don't share javascripts between iframes
    var is_khtml = navigator.appName.match("Konqueror") || navigator.appVersion.match("KHTML");
    if (is_khtml)
    {
      var prototypejs = document.createElement('script');
      prototypejs.setAttribute('type','text/javascript');
      prototypejs.setAttribute('src','prototype.js');
      var head = document.getElementsByTagName('head');
      head[0].appendChild(prototypejs);
    }
    // load the comet object
    var comet = window.parent.comet;
 
  </script>
 
  <?php
 
  while(1) {

    echo '<script type="text/javascript">';
    echo 'comet.printServerTime('.time().');';
    echo '</script>';
    flush(); // used to send the echoed data to the client

    sleep(1); // a little break to unload the server CPU
  }
 
  ?>
 
  </body>
  </html>

As you can see from this example, the page will never finish rendering, which means the browser always has an open connection to the server. It should also be noted that the payload of each new event is the Javascript function that you want the browser to execute to update the relevant part of the page. This technique works in most common browsers since it uses established technologies like iframes and Javascript. The main problem is that since it is somewhat of a hack, it is hard for web applications to determine the current state of the communication between the browser and the server (e.g. for error handling).

Another common technique is long polling. With this approach the browser application makes an asynchronous request for data from the server either using XMLHttpRequest or a script tag. Once data is returned, another request of the same type is made. This essentially keeps a permanently open connection between the browser and the server. This approach is often favored by developers because it reuses common AJAX techniques and doesn’t require any special client-side techniques. The main challenge with long polling and other COMET techniques is choosing a server-side framework that can solve the C10K problem. Specifically, traditional web servers are designed to handle short-lived connections between browsers and the server due to the request/response nature of HTTP. With COMET, we can have thousands to tens of thousands of browsers keeping an open connection to a server. To address this problem there are now a number of dedicated COMET application servers, with some of the more notable implementations being FriendFeed's Tornado and Jetty.
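The shape of a long polling client can be sketched independently of the browser. The following is a minimal illustration in Python, not browser code: `fetch` is a hypothetical callable standing in for the XMLHttpRequest call, assumed to block until the server has something to say or a timeout passes, and the cursor is an assumed way of telling the server what the client has already seen.

```python
def long_poll(fetch, handle, max_requests):
    """Issue a request, wait for it to return, then immediately issue another.

    fetch(cursor) is assumed to block until the server has news or a
    timeout passes, and to return (events, new_cursor). An empty list
    of events means the request timed out with nothing new.
    """
    cursor = None
    for _ in range(max_requests):      # a real client would loop forever
        events, cursor = fetch(cursor)
        for event in events:
            handle(event)
    return cursor


# A fake server for demonstration: two events, then a timeout
_responses = iter([(["comment 1"], 1), (["comment 2"], 2), ([], 2)])
received = []
last_cursor = long_poll(lambda cursor: next(_responses), received.append, 3)
print(received, last_cursor)
```

Each pass through the loop holds one connection open on the server for as long as `fetch` blocks, which is why the number of idle connections, rather than request throughput, becomes the scaling concern.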

The W3C’s HTML 5 working group is working on making COMET part of the next generation of HTML with the creation of the WebSockets specification. This formalizes the notion of an XMLHttpRequest-style object that can create a permanent bi-directional connection to the server without the hacks of long polling (i.e. re-establishing a connection whenever data is sent) and is also resistant to some of the issues long polling faces when going through HTTP intermediaries such as firewalls and proxy servers.

Notifications at the Speed of Light: Go Beyond Polling with PubSubHubbub

Content syndication using XML syndication technologies such as RSS and Atom is a key part of how many websites and web users consume content today. However XML syndication is traditionally not a real-time experience because feed readers work by polling a feed at specific intervals which could be anything from minutes to hours apart. Since polling is an inefficient way to get content updates it isn’t feasible to get updates within seconds from when they happen without overloading the service that is being polled. This calls for new communications patterns where clients are notified of changes as they happen instead of having to poll for them with the time lag that this entails.

To solve this problem, a couple of Google employees have proposed the PubSubHubbub protocol (commonly abbreviated as PuSH) as a way to bring real-time notifications to content syndication on the Web. The workflow for a PuSH system is as follows

A feed producer declares its Hub server(s) in its Atom or RSS XML file, via <link rel="hub" ...>. The hub(s) can be run by the publisher of the feed, or can be a community hub that anybody can use.
A feed reader or subscriber initially polls the feed as usual.
On seeing that the feed supports PuSH, the subscriber declares its interest in getting real-time notifications from the hub when the feed is updated. This is done by registering a callback URL so it can receive the newly updated content whenever the feed is updated.
When the feed producer next updates the feed, the publisher also pings the Hub(s) indicating there is an update.
The hub then retrieves the feed and publishes the changes to all interested subscribers by POSTing an Atom or RSS item to their specified callback URLs
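The subscriber's side of the first three steps can be sketched as follows. This is a hedged illustration only: the `hub.*` parameter names follow my reading of the PubSubHubbub 0.3 draft and should be checked against the spec, the function names are hypothetical, and every URL is a made-up example.com placeholder. The sketch builds the subscription request body but does not send it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hubs(atom_xml):
    """Return the href of every top-level <link rel="hub"> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [link.get("href") for link in root.findall(ATOM_NS + "link")
            if link.get("rel") == "hub"]


def subscription_body(topic_url, callback_url):
    """Build the form-encoded body a subscriber would POST to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed being subscribed to
        "hub.callback": callback_url,  # where the hub will POST updates
        "hub.verify": "async",         # assumed: how the hub confirms intent
    })


# Placeholder feed and URLs, for illustration only
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="hub" href="http://hub.example.com/"/>
  <link rel="self" href="http://blog.example.com/feed.atom"/>
</feed>"""
hubs = find_hubs(feed)
body = subscription_body("http://blog.example.com/feed.atom",
                         "http://reader.example.com/push-callback")
print(hubs)
print(body)
```

The callback URL has to be reachable from the hub, which means a PuSH subscriber is a web server in its own right and not just an HTTP client like a traditional feed reader.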

PubSubHubbub provides a more complete solution than other approaches such as Twitter’s firehose. With explicit subscription, services can use PuSH to get notified about updates that are either private or require authentication. With a firehose approach, all public content generated on the site is shared with all subscribers while private content is not provided since it isn’t really feasible to cherry pick which authenticated content should go to which consumer of the firehose.

There are already a number of sites consuming and producing PubSubHubbub including MySpace, LiveJournal, Google Reader, Tumblr and FriendFeed.

Creating and Consuming Fire Hoses: The Lynchpin of Real-Time Search

To many social media observers the real-time web refers to the ecosystem that has sprung up around status updates on sites like Twitter and Facebook. Being able to analyze status updates as they occur to determine people’s sentiments on a news event as it occurs or detect breaking news is a rapidly growing space with lots of players including Microsoft’s Bing, Tweetmeme and Sysomos among others. These services typically work by consuming a never ending stream of updates from the target services and then processing the updates as they are received. Twitter calls this never ending stream of updates a “firehose” and I’ll use this term to describe similar offerings from other services.

A firehose works similarly to the COMET servers described earlier. A client connects to a server via HTTP and then starts to receive a stream of updates as they occur in some structured format. An early version of such a firehose is the SixApart Update Stream which is used to provide real-time feeds on changes to TypePad, Vox and [formerly] LiveJournal blogs to interested parties. From the SixApart developer documentation:

Connecting to the Stream

To connect to the stream a simple HTTP GET request is issued to the following endpoint:
http://updates.sixapart.com/atom-stream.xml

Once a connection is established, the Atom Server will then begin transmitting to the client any content that is injected into the stream. Additionally, the Atom Stream Server transmits timestamps every second both to keep the connection alive (in case it goes idle), and to provide you a marker so you know how far you've gotten so you can reconnect at a certain point in time if you restart your listener.

Example Stream
		GET /atom-stream.xml HTTP/1.0
		Host: updates.sixapart.com

		HTTP/1.0 200 OK
		Content-Type: text/xml
		Connection: close

		<?xml version="1.0"?>
		<time>1834721912342</time>
		<time>1834721912372</time>
		<time>1834721912402</time>
		<feed>
		  ...
		</feed>
		<feed>
		  ...
		</feed>
		<sorryTooSlow youMissed="5" />

This is very similar to the Twitter Streaming API with the primary differences being supported data formats (Twitter supports both JSON and XML output), support for filter predicates such as being limited to a firehose of posts containing references to “superbowl” and the fact that Twitter also provides notifications of deleted status updates in the stream.

Applications that consume such streams need to be carefully coded to handle falling behind and being able to restart from where they stopped if disconnected for any reason.
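As a rough sketch of what that bookkeeping might look like, the following Python scans lines shaped like the example stream above, remembers the most recent `<time>` marker as a checkpoint and counts the updates the server reports as missed. The function name and the line-by-line regular expression parsing are my own simplifications, and how a checkpoint is actually passed back to the server on reconnect is not covered by the excerpt, so it is not shown.

```python
import re

TIME_RE = re.compile(r"<time>(\d+)</time>")
MISSED_RE = re.compile(r'<sorryTooSlow youMissed="(\d+)"')


def consume(lines, checkpoint=None):
    """Scan stream lines and return (last timestamp seen, updates missed).

    The timestamp is the point to resume from after a disconnect; a
    non-zero missed count means the consumer fell behind the stream.
    """
    missed = 0
    for line in lines:
        m = TIME_RE.search(line)
        if m:
            checkpoint = int(m.group(1))   # safe point to resume from
            continue
        m = MISSED_RE.search(line)
        if m:
            missed += int(m.group(1))      # the server dropped updates on us
    return checkpoint, missed


# Lines modelled on the example stream above
sample = [
    "<time>1834721912342</time>",
    "<time>1834721912372</time>",
    "<feed>...</feed>",
    '<sorryTooSlow youMissed="5" />',
]
checkpoint, missed = consume(sample)
print(checkpoint, missed)
```

A production consumer would persist the checkpoint somewhere durable and hand the actual feed processing off to a queue, so that slow processing does not cause it to fall behind the stream in the first place.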

Learning More

If you find this topic interesting, I’ll be speaking about the real-time Web at two industry conferences next month with experts from noted Web companies.

MIX 10: Building Platforms and Applications for the Real-Time Web
From news feeds to search, the Web has become all about real-time access to news and other information as it happens. This panel will discuss what it takes to build the platforms and user experiences that power some of the most notable services on the real-time web. Come hear a lively discussion about the real-time web with moderator Dare Obasanjo (Microsoft) and panelists Ari Steinberg (Facebook), Brett Slatkin (Google), Chris Saad (~~JS-Kit~~ Echo), Lili Cheng (Microsoft) and Ryan Sarver (Twitter).

SXSW: Can the Real-Time Web Be Realized?
The emergence of the real-time web enables an unprecedented level of user engagement and dynamic content online. However, the rapidly growing audience puts new, complex demands on the architecture of the web as we know it. This panel will discuss what is needed to make the real-time web achievable. Organizer: Brett Slatkin.

Now Playing: Young Money - Bedrock (featuring Lloyd)

Categories:

February 15, 2010

@ 02:59 PM

Comments [4]

Google Buzz vs. Google Wave

From the Google Wave Federation architecture white paper

Google Wave is a new communication and collaboration platform based on hosted documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of wave.google.com and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with wave.google.com. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet.

From a Buzz post by Dewitt Clinton, a Google employee

The best way to get a sense of where the Buzz API is heading is to take a look at http://code.google.com/apis/buzz/. You'll notice that the "coming soon" section mentions a ton of protocols—Activity Streams, Atom, AtomPub, MediaRSS, WebFinger, PubSubHubbub, Salmon, OAuth, XFN, etc.

What it doesn't talk much about is Google. That's because the goal isn't Google specific at all. The idea is that someday, any host on the web should be able to implement these open protocols and send messages back and forth in real time with users from any network, without any one company in the middle. The web contains the social graph, the protocols are standard web protocols, the messages can contain whatever crazy stuff people think to put in them. Google Buzz will be just another node (a very good node, I hope) among many peers. Users of any two systems should be able to send updates back and forth, federate comments, share photos, send @replies, etc., without needing Google in the middle and without using a Google-specific protocol or format.

From Mark Sigal’s post Google Buzz: Is it Project, Product or Platform?

I think that it's great that Google is iterating Gmail (read Tim O'Reilly's excellent write-up on it here), and actually improving an existing product, versus rolling out a knock-off of something that is already in the market.

Nonetheless. I am confused. I thought that Google Wave was destined to be the new Gmail, but after yesterday's announcement, I am left wondering if Gmail is, instead, the new Google Wave.

Since the saying goes that people in glass houses shouldn’t throw stones, I won’t make any comment besides sharing these links with you.

Now Playing: 50 Cent - Crime Wave

Categories: Competitors/Web Companies | Mindless Link Propagation

February 13, 2010

@ 03:51 PM

Comments [0]

Autofollowing on Social Networks and User Privacy Becoming a Pawn in a Competitive Chess Game

NOTE: For an official Microsoft statement on Google Buzz, go here. This post is a discussion of recent trends in social networking features in our industry and how they impact web users focusing on a feature of Google Buzz as a kick off point.

One of the much lauded features of the recently released Google Buzz is autofollowing, which is described as “No setup needed: Automatically follow the people you email and chat with a lot”. This feature solves the “what if you build it and they don’t come?” problem that Google Buzz faced. What if, when presented with a bunch of FriendFeed-like features in Gmail, people decided that they don’t want to build another social network when they’ve already done so on places like Facebook, MySpace and Twitter? Autofollowing ensured that Gmail users already had a populated network of people they were receiving status updates from once Google Buzz was launched. So from the perspective of Google, it’s a great feature.

But is the feature in the best interests of users? Ignoring some of the privacy issues of the people you email with becoming a public friends list, there is still the question of whether the feature is good for users in isolation. Here’s a story: my wife is divorced and has kids from her previous marriage. This means she exchanges a lot of email with her ex-husband and his new wife around kid visiting schedules, vacations, etc. Do you think my wife would consider it a great feature if one day she started getting status updates on how her ex-husband and his new wife spend their days due to the introduction of social networking features in her email client?

Those of us building social networking products have a responsibility not only to ask if a feature is good for our product but also whether it is good for our users as well. Sometimes these goals align and sometimes they do not. What we do when they don’t is what defines us as an industry.

I also want to call out some of the thought leadership on this topic that has come from Marshall Kirkpatrick over on ReadWriteWeb with posts such as Why Facebook is Wrong: Privacy Is Still Important where he discusses Facebook’s privacy changes from last year. Personally, I think Facebook cleaned up their privacy model because they used to have privacy settings based on regional networks where user data was visible to people in a geographic region (e.g. everyone in New York City or everyone in Australia can see my profile information) which is actually kind of dumb. There have been legitimate privacy issues related to such loose settings, such as the fact that Rudy Giuliani's daughter was a Barack Obama supporter being visible to everyone from New York City on Facebook. With the change, people with such settings were asked if they wanted their profiles to be public since they effectively were in the old model. The question Marshall Kirkpatrick brings up is whether it is better for Facebook users in such situations to be asked do you want to go from everyone in New York can see my data –> public or only visible to my friends and networks? It is clear which is better for Facebook as a service but not so clear what is better for their users with regards to their personal notions of privacy and mental well being.

Social networking has transformed the way people communicate and relate to each other in many tangible ways. However, social networks are built on real human relationships and connections. I hate the thought that people’s relationships and communications are becoming the ammunition in a war between web companies to dominate a particular online space. We can be better than that. We must be better than that.

Now Playing: Bun B - You're Everything (featuring Rick Ross, David Banner, Eightball & MJG)

Categories:

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
