Dare Obasanjo's weblog

Google Scalability Conference Trip Report: Scaling Google for Every User

These are my notes from the talk Scaling Google for Every User by Marissa Mayer.

Google search has lots of different users who vary in age, sex, location, education, expertise and a lot of other factors. After lots of research, it seems the only factor that really influences how different users view search relevance is their location.

One thing that does distinguish users is the difference between a novice search user and an expert user of search. Novice users typically type queries in natural language while expert users use keyword searches.

Example Novice and Expert Search User Queries

NOVICE QUERY: Why doesn't anyone carry an umbrella in Seattle?
EXPERT QUERY: weather seattle washington

NOVICE QUERY: can I hike in the seattle area?
EXPERT QUERY: hike seattle area

On average, it takes a new Google user 1 month to go from typing novice queries to being a search expert. This means that there is little payoff in optimizing the site to help novices since they become search experts in such a short time frame.

Design Philosophy

In general, when it comes to the PC user experience, the more features available the better the user experience. However when it comes to handheld devices the graph is a bell curve and there reaches a point where adding extra features makes the user experience worse. At Google, they believe their experience is more like the latter and tend to hide features on the main page and only show them when necessary (e.g. after the user has performed a search). This is in contrast to the portal strategy from the 1990s when sites would list their entire product line on the front page.

When tasked with taking over the user interface for Google search, Marissa Mayer fell back on her AI background and focused on applying mathematical reasoning to the problem. Like Amazon, they decided to use split A/B testing to test different changes they planned to make to the user interface to see which got the best reaction from their users. One example of the kind of experiments they've run is when the founders asked whether they should switch from displaying 10 search results by default because Yahoo! was displaying 20 results. They'd only picked 10 results arbitrarily because that's what Alta Vista did. They had some focus groups and the majority of users said they'd like to see more than 10 results per page. So they ran an experiment with 20, 25 and 30 results and were surprised at the outcome. After 6 weeks, 25% of the people who were getting 30 results used Google search less while 20% of the people getting 20 results used the site less. The initial suspicion was that people weren't having to click the "next" button as much because they were getting more results but further investigation showed that people rarely click that link anyway. Then the Google researchers realized that while it took 0.4 seconds on average to render 10 results it took 0.9 seconds on average to render 25 results. This seemingly imperciptible lag was still enough to sour the experience of users enough that they'd reduce their usage of the service.

Improving Google Search

There are a number of factors that determine whether a user will find a set of search results to be relevant which include the query, the actual user's individual tastes, the task at hand and the user's locale. Locale is especially important because a query such as "GM" is likely be a search for General Motors but a query such as "GM foods" is most likely seeking information about genetically modified foods. Given a large enough corpus of data, statistical inference can seem almost like artificial intelligence. Another example is that a search like b&b ab looks for bed and breakfasts in Alberta while ramstein ab locates the Ramstein Airforce Base. This is because in general b&b typically means bed and breakfast so a search like "b&b ab" it is assumed that the term after "b&b" is a place name based on statistical inference over millions of such queries.

At Google they want to get even better at knowing what you mean instead of just looking at what you say. Here are some examples of user queries which Google will transform to other queries based on statistical inference [in future versions of the search engine]

User Query	Google Will Also Try This Query
unchanged lyrics van halen	lyrics to unchained by van halen
how much does it cost for an exhaust system	cost exhaust system
overhead view of bellagio pool	bellagio pool pictures
distance from zurich switzerland to lake como italy	train milan italy zurich switzerland

Performing query inference in this manner is a very large scale, ill-defined problem. Other efforts Google is pursuing is cross language information retrieval. Specifically, if I perform a query in one language it will be translated to a foreign language and the results would then be translated to my language. This may not be particularly interesting for English speakers since most of the Web is in English but it will be valuable for other languages (e.g. an Arabic speaker interested in restaurant reviews from New York City restaurants).

Google Universal Search was a revamp of the core engine to show results other than text-based URLs and website summaries in the search results (e.g. search for nosferatu). There were a number of challenges in building this functionality such as

Google's search verticals such as books, blog, news, video, and image search got a lot less traffic than the main search engine and originally couldn't handle receiving the same level of traffic as the main page.
How do you rank results across different media to figure out the most relevant? How do you decide a video result is more relevant than an image or a webpage? This problem was tackled by Udi Manber's team.
How do you integrate results from other media into the existing search result page? Should results be segregated by type or should it be a list ordered by relevance independent of media type? The current design was finally decided upon by Marissa Mayer's team but they will continue to incrementally improve it and measure the user reactions.

At Google, the belief is that the next big revolution is a search engine that understands what you want because it knows you. This means personalization is the next big frontier. A couple of years ago, the tech media was full of reports that a bunch of Stanford students had figured out how to make Google five times faster. This was actually incorrect. The students had figured out how to make PageRank calculations faster which doesn't really affect the speed of obtaining search results since PageRank is calculated offline. However this was still interesting to Google and the students' company was purchased. It turns out that making PageRank faster means that they can now calculate multiple PageRanks in the time it used to take to calculate a single PageRank (e.g. country specific PageRank, personal PageRank for a given user, etc). The aforementioned Stanford students now work on Google's personalized search efforts.

Speaking of personalization, iGoogle has become their fastest growing product of all time. Allowing users create a personalized page then opening up the platform to developers such Caleb to build gadgets lets them learn more about their users. Caleb's collection of gadgets garner about 30 million daily page views on various personalized homepage.

Q&A

Q: Does the focus on expert searchers mean that they de-emphasis natural language processing?
A: Yes, in the main search engine. However they do focus on it for their voice search product and they do believe that it is unfortunate that users have to adapt to Google's keyword based search style.

Q: How do the observations that are data mined about users search habits get back into the core engine?
A: Most of it happens offline not automatically. Personalized search is an exception and this data is uploaded periodically into the main engine to improve the results specific to that user.

Q: How well is the new Universal Search interface doing?
A: As well as Google Search is since it is now the Google search interface.

Q: What is the primary metric they look at during A/B testing?
A: It depends on what aspect of the service is being tested.

Q: Has there been user resistance to new features?
A: Not really. Google employees are actually more resistant to changes in the search interface than their average user.

Q: Why did they switch to showing Google Finance before Yahoo! Finance when showing search results for a stock ticker?
A: Links used to be ordered by ComScore metrics but ince Google Finance shipped they decided to show their service first. This is now a standard policy for Google search results that contain links to other services.

Q: How do they tell if they have bad results?
A: They have a bunch of watchdog services that track uptime for various servers to make sure a bad one isn't causing problems. In addition, they have 10,000 human evaluators who are always manually checking teh relevance of various results.

Q: How do they deal with spam?
A: Lots of definitions for spam; bad queries, bad results and email spam. For keeping out bad results they do automated link analysis (e.g. examine excessive number of links to a URL from a single domain or set of domains) and they use multiple user agents to detect cloaking.

Q: What percent of the Web is crawled?
A: They try to crawl most of it except that which is behind signins and product databases. And for product databases they now have Google Base and encourage people to upload their data there so it is accessible to Google.

Q: When will I be able to search using input other than search (e.g. find this tune or find the face in this photograph)?
A: We are still a long way from this. In academia, we now have experiments that show 50%-60% accuracy but that's a far cry from being a viable end user product. Customers don't want a search engine that gives relevant results half the time.

Categories: Trip Report

June 26, 2007

@ 03:04 AM

Google Scalability Conference Trip Report: Lessons in Building Scalable Systems

These are my notes from the talk Lessons in Building Scalable Systems by Reza Behforooz.

The Google Talk team have produced multiple versions of their application. There is

a desktop IM client which speaks the Jabber/XMPP protocol.
a Web-based IM client that is integrated into GMail
a Web-based IM client that is integrated into Orkut
An IM widget which can be embedded in iGoogle or in any website that supports embedding Flash.

Google Talk Server Challenges

The team has had to deal with a significant set of challenges since the service launched including

Support displaying online presence and sending messages for millions of users. Peak traffic is in hundreds of thousands of queries per second with a daily average of billions of messages handled by the system.
routing and application logic has to be applied to each message according to the preferences of each user while keeping latency under 100ms.
handling surge of traffic from integration with Orkut and GMail.
ensuring in-order delivery of messages
needing an extensibile architecture which could support a variety of clients

Lessons

The most important lesson the Google Talk team learned is that you have to measure the right things. Questions like "how many active users do you have" and "how many IM messages does the system carry a day" may be good for evaluating marketshare but are not good questions from an engineering perspective if one is trying to get insight into how the system is performing.

Specifically, the biggest strain on the system actually turns out to be displaying presence information. The formula for determining how many presence notifications they send out is

total_number_of_connected_users * avg_buddy_list_size * avg_number_of_state_changes

Sometimes there are drastic jumps in these numbers. For example, integrating with Orkut increased the average buddy list size since people usually have more friends in a social networking service than they have IM buddies.

Other lessons learned were

Slowly Ramp Up High Traffic Partners: To see what real world usage patterns would look like when Google Talk was integrated with Orkut and GMail, both services added code to fetch online presence from the Google Talk servers to their pages that displayed a user's contacts without adding any UI integration. This way the feature could be tested under real load without users being aware that there were any problems if there were capacity problems. In addition, the feature was rolled out to small groups of users at first (around 1%).
Dynamic Repartitioning: In general, it is a good idea to divide user data across various servers (aka partitioning or sharding) to reduce bottlenecks and spread out the load. However, the infrastructure should support redistributing these partitions/shards without having to cause any downtime.
Add Abstractions that Hide System Complexity: Partner services such as Orkut and GMail don't know which data centers contain the Google Talk servers, how many servers are in the Google Talk cluster and are oblivious of when or how load balancing, repartitioning or failover occurs in the Google Talk service.
Understand Semantics of Low Level Libraries: Sometimes low level details can stick it to you. The Google Talk developers found out that using epoll worked better than the poll/select loop because they have lots of open TCP conections but only a relatively small number of them are active at any time.
Protect Against Operational Problems: Review logs and endeavor to smooth out spikes in activity graphs. Limit cascading problems by having logic to back off from using busy or sick servers.
Any Scalable System is a Distributed System: Apply the lessons from the fallacies of distributed computing. Add fault tolerance to all your components. Add profiling to live services and follow transactions as they flow through the system (preferably in a non-intrusive manner). Collect metrics from services for monitoring both for real time diagnosis and offline generation of reports.

Recommended Software Development Strategies

Compatibility is very important, so making sure deployed binaries are backwards and forward compatible is always a good idea. Giving developers access to live servers (ideally public beta servers not main production servers) will encourage them to test and try out ideas quickly. It also gives them a sense of empowerement. Developers end up making their systems easier to deploy, configure, monitor, debug and maintain when they have a better idea of the end to end process.

Building an experimentation platform which allows you to empirically test the results of various changes to the service is also recommended.

Categories: Platforms | Trip Report

June 26, 2007

@ 03:03 AM

Google Scalability Conference Trip Report: Using MapReduce on Large Geographic Datasets

These are my notes from the talk Using MapReduce on Large Geographic Datasets by Barry Brummit.

Most of this talk was a repetition of the material in the previous talk by Jeff Dean including reusing a lot of the same slides. My notes primarily contain material I felt was unique to this talk.

A common pattern across a lot of Google services is creating a lot of index files that point and loading them into memory to male lookups fast. This is also done by the Google Maps team which has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).

Below are examples of the kinds of problems the Google Maps has used MapReduce to solve.

Locating all points that connect to a particular road
Input	Map	Shuffle	Reduce	Output
List of roads and intersections	Create pairs of connected points such as {road, intersection} or {road, road} pairs	Sort by key	Get list of pairs with the same key	A list of all the points that connect to a particular road

Rendering Map Tiles
Input	Map	Shuffle	Reduce	Output
Geographic Feature List	Emit each feature on a set of overlapping lat/long rectangles	Sort by Key	Emit tile using data for all enclosed features	Rendered tiles

Finding Nearest Gas Station to an Address within five miles
Input	Map	Shuffle	Reduce	Output
Graph describing node network with all gas stations marked	Search five mile radius of each gas station and mark distance to each node	Sort by key	For each node, emit path and gas station with the shortest distance	Graph marked with nearest gas station to each node

When issues are encountered in a MapReduce it is possible for developers to debug these issues by running their MapReduce applications locally on their desktops.

Developers who would like to harness the power of a several hundred to several thousand node cluster but do not work at Google can try

Hadoop which is an Open Source project that is Yahoo! sponsored attempt to duplicate Google's key infrastructure pieces.
Amazon's Elastic Compute Cloud (EC2) which allows developers to rent access to Amazon's computing resource

Recruiting Sales Pitch

[The conference was part recruiting event so some of the speakers ended their talks with a recruiting spiel. - Dare]

The Google infrastructure is the product of Google's engineering culture has the following ten characteristics

single source code repository for all Google code
Developers can checkin fixes for any Google product
You can build any Google product in three steps (get, configure, make)
Uniform coding standards across the company
Mandatory code reviews before checkins
Pervasive unit testing
Tests run nightly, emails sent to developers if any failures
Powerful tools that are shared company-wide
Rapid project cycles, developers change projects often, 20% time
Peer driven review process, flat management hierarchy

Q&A

Q: Where are intermediate results from map operations stored?
A: In BigTable or GFS

Q: Can you use MapReduce incrementally? For example, when new roads are built in North America do we have to run MapReduce over teh entire data set or can we only factor in the changed data?
A: Currently, you'll have to process the entire data stream again. However this is a problem that is the target of lots of active research at Google since it affects a lot of teams.

Categories: Platforms | Trip Report

June 25, 2007

@ 03:07 AM

Comments [8]

Google Scalability Conference Trip Report: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets

These are my notes from the keynote session MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets by Jeff Dean.

The talk was about the three pillars of Google's data storage and processing platform; GFS, BigTable and MapReduce.

GFS

The developers at Google decided to build their own custom distributed file system because they felt that they had unique requirements. These requirements included

scalable to thousands of network nodes
massive read/write bandwidth requirements
ability to handle large blocks of data which are gigabytes in size.
need exremely efficient distribution of operations across nodes to reduce bottlenecks

One benefit the developers of GFS had was that since it was an in-house application they could control the environment, the client applications and the libraries a lot better than in the off-the-shelf case.

GFS Server Architecture

There are two server types in the GFS system.

Master servers

These keep the metadata on the various data files (in 64MB chunks) within the file system. Client applications talk to the master servers to perform metadata operations on files or to locate the actual chunk server that contains the actual bits on disk.

Chunk servers

These contain the actual bits on disk and can be considered to be dumb file servers. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Client applications retrieve data files directly from chunk servers once they've been directed to the chunk server which contains the chunk they want by a master server.

There are currently over 200 GFS clusters at Google, some of which have over 5000 machines. They now have pools of tens of thousands of machines retrieving data from GFS clusters that run as large as 5 petabytes of storage with read/write throughput of over 40 gigabytes/second across the cluster.

MapReduce

At Google they do a lot of processing of very large amounts of data. In the old days, developers would have to write their own code to partition the large data sets, checkpoint code and save intermediate results, handle failover in case of server crashes, and so on as well as actually writing the business logic for the actual data processing they wanted to do which could have been something straightforward like counting the occurence of words in various Web pages or grouping documents by content checksums. The decision was made to reduce the duplication of effort and complexity of performing data processing tasks by building a platform technology that everyone at Google could use which handled all the generic tasks of working on very large data sets. So MapReduce was born.

MapReduce is an application programming interface for processing very large data sets. Application developers feed in a key/value pair (e.g. {URL,HTML content} pair) then use the map function to extract relevant information from each record which should produce a set of intermediate key/value pairs (e.g. {word, 1 } pairs for each time a word is encountered) and finally the reduce function merges the intermediate values associated with the same key to produce the final output (e.g. {word, total count of occurences} pairs).

A developer only has to write their specific map and reduce operations for their data sets which could run as low as 25 - 50 lines of code while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring and making the service scalable across hundreds to thousands of machines.

Currently, almost every major product at Google uses MapReduce in some way. There are 6000 MapReduce applications checked into the Google source tree with the hundreds of new applications that utilize it being written per month. To illustrate its ease of use, a graph of new MapReduce applications checked into the Google source tree over time shows that there is a spike every summer as interns show up and create a flood of new MapReduce applications that are then checked into the Google source tree.

MapReduce Server Architecture

There are three server types in the MapReduce system.

Master server

This assigns user tasks to map and reduce servers as well as keeps track of the state of these tasks.

Map Servers

Accepts user input and performs map operation on them then writes the results to intermediate files

Reduce Servers

Accepts intermediate files produced by map servers and performs reduce operation on them.

One of the main issues they have to deal with in the MapReduce system is problem of stragglers. Stragglers are servers that run slower than expected for one reason or the other. Sometimes stragglers may be due to hardware issues (e.g. bad harddrive conttroller causes reduced I/O throughput) or may just be from the server running too many complex jobs which utilize too much CPU. To counter the effects of stragglers, they now assign multiple servers the same jobs which counterintuitively ends making tasks finish quicker. Another clever optimization is that all data transferred between map and reduce servers is compressed since the servers usually aren't CPU bound so compression/decompression costs are a small price to pay for bandwidth and I/O savings.

BigTable

After the creation of GFS, the need for structured and semi-structured storage that went beyond opaque files became clear. Examples of situations that could benefit from this included

associating metadata with a URL such as when it was crawled, its PageRank™, contents, links to it, etc
associating data with a user such as the user's search history and preferences
geographical data such as information about roads and sattelite imagery

The system required would need to be able to scale to storing billions of URLs, hundreds of terabytes of satellite imagery, data associated preferences with hundreds of millions of users and more. It was immediately obvious that this wasn't a task for an off-the-shelf commercial database system due to the scale requirements and the fact that such a system would be prohibitively expensive even if it did exist. In addition, an off-the-shelf system would not be able to make optimizations based on the underlying GFS file system. Thus BigTable was born.

BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Instead it is more like a multi-level map data structure. It is a large scale, fault tolerant, self managing system with terabytes of memory and petabytes of storage space which can handle millions of reads/writes per second. BigTable is now used by over sixty Google products and projects as the platform for storing and retrieving structured data.

The BigTable data model is fairly straightforward, each data item is stored in a cell which can be accessed using its {row key, column key, timestamp}. The need for a timestamp came about because it was discovered that many Google services store and compare the same data over time (e.g. HTML content for a URL). The data for each row is stored in one or more tablets which are actually a sequence of 64KB blocks in a data format called SSTable.

BigTable Server Architecture

There are three primary server types of interest in the BigTable system.

Master servers

Assigns tablets to tablet servers, keeps track of where tablets are located and redistributes tasks as needed.

Tablet servers

Handle read/write requests for tablets and split tablets when they exceed size limits (usually 100MB - 200MB). If a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers.

Lock servers

These are instances of the Chubby distributed lock service. Lots of actions within BigTable require acquisition of locks including opening tablets for writing, ensuring that there is no more than one active Master at a time, and access control checking.

There are a number of optimizations which applications can take advantage of in BigTable. One example is the concept of locality groups. For example, some of the simple metadata associated with a particular URL which is typically accessed together (e.g. language, PageRank™ , etc) can be physically stored together by placing them in a locality group while other columns (e.g. content) are in a separate locality group. In addition, tablets are usually kept in memory until the machine is running out of memory before their data is written to GFS as an SSTable and a new in memory table is created. This process is called compaction. There are other types of compactions where in memory tables are merged with SSTables on disk to create an entirely new SSTable which is then stored in GFS.

Current Challenges Facing Google's Infrastructure

Although Google's infrastructure works well at the single cluster level, there are a number of areas with room for improvement including

support for geo-distributed clusters
single global namespace for all data since currently data is segregated by cluster
more and better automated migration of data and computation
lots of consistency issues when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

Recruiting Sales Pitch

[The conference was part recruiting event so some of the speakers ended their talks with a recruiting spiel - Dare]

Having access to lots of data and computing power is a geek playground. You can build cool, seemingly trivial apps on top of the data such which turn out to be really useful such as Google Trends and catching misspellings of "britney spears. Another example of the kinds of apps you can build when you have enough data treating the problem of language translation as a statistical modeling problem which turns out to be one of the most successful methods around.

Google hires smart people and lets them work in small teams of 3 to 5 people. They can get away with teams being that small because they have the benefit of an infrastructure that takes care of all the hard problems so devs can focus on building interesting, innovative apps.

Categories: Platforms | Trip Report

June 23, 2007

@ 03:32 PM

Mike Arrington: Then and Now

THEN: The PayPerPost Virus Spreads

Two new services that are similar to the controversial PayPerPost have announced their launch in the last few days: ReviewMe and CreamAid. PayPerPost, a marketplace for advertisers to pay bloggers to write about products (with our without disclosure), recently gained additional attention when they announced a $3 million round of venture financing.

The PayPerPost model brings up memories of payola in the music industry, something the FCC and state attorney generals are still trying to eliminate or control. Given the distributed and unlicensed nature of the blogosphere, controlling payoffs to bloggers will be exponentially more difficult.
Our position on these pay-to-shill services is clear: they are a natural result of the growth in size and influence of the blogosphere, but they undermine the credibility of the entire ecosystem and mislead readers.

NOW: I’m shocked, shocked to find that gambling is going on in here!

The title, which is a quote from the movie casablanca, is what came to mind tonight when I read the complete train wreck occuring on TechMeme over advertisements that contain a written message from the publisher. The whole thing was started by Valleywag of course.

The ads in question are a staple of FM Publishing - a standard ad unit contains a quote by the publisher saying something about something. It isn’t a direct endorsement. Rather, it’s usually an answer to some lame slogan created by the adveriser. It makes the ad more personal and has a higher click through rate, or so we’ve been told. In the case of the Microsoft ad, we were quoted how we had become “people ready,” whatever that means. See our answer and some of the others here (I think it will be hard to find this text controversial, or anything other then extremely boring). We do these all the time…generally FM suggests some language and we approve or tweak it to make it less lame. The ads go up, we get paid. This has been going on for months and months - at least since the summer of 2006. It’s nothing new. It’s text in an ad box. I think people are pretty aware of what that means…which is nothing.

Any questions?

Categories: Current Affairs

June 23, 2007

@ 02:53 PM

Comments [10]

What is the Right PM<->Developer Ratio?

I was reading reddit this morning and spotted a reference to the Microsoft Popfly team's group picture which pointed out that from reading the job titles in the pic there were 9 managers and 5 developers on the product team. The list of people in the picture and their titles from the picture are excerpted below

From left to right: John Montgomery (Group Program Manager), Andy Sterland (Program Manager), Alpesh Gaglani (Developer), Tim Rice (Developer), Suzanne Hansen (Program Manager), Steven Wilssens (Program Manager), Vinay Deo (Engineering Manager), Michael Leonard (Test Developer), Jianchun Xu (Developer), Dan Fernandez (Product Manager), Adam Nathan (Developer), Wes Hutchins (Program Manager), Aaron Brethorst (Program Manager), Paramesh Vaidyanathan (Product Unit Manager), and Murali Potluri (Developer).

A Microsoft employee followed up the reddit link with a comment pointing out that it is actually 5 dev, 5 PM, 1 test, 3 managers and 1 marketing. This sounds a lot better but I still find it interesting that there is a 1:1 ratio of Program Managers (i.e. design features/APIs, write specs, call meetings) to Developer (i.e. write code, fix bugs, ignore PMs). Although this ratio isn't unusual for Microsoft this has always struck me as rather high. I've always felt that a decent ratio of PMs to developers is more like 1:2 or higher. And I've seen some claim ratios like 1 PM to 5 developers for Agile projects but haven't been able to find much about industry averages online. It seems must discussion about staffing ratios on software projects focus on Developer to Test ratios and even then the conclusion on is it depends. I think the PM to Developer ratio question is more clear cut.

What are good ratios that have worked for you in the past and what would you consider to be a bad ratio?

PS: A note underneath the group picture mentions that some folks on the team aren't pictured but I looked them up and they are in marketing so they aren't relevant to this discussion.

Categories: Programming

June 23, 2007

@ 03:09 AM

Ten Things I Found Funny This Week

XKCD: Pickup Lines: "If I could rearrange the alphabet..."
Chart: Chances of a Man Winning an Argument plotted over Time: I'm in the middle period. :)
Fake Steve Jobs: Microsoft Goes Pussy: "We've integrated search into our OS too. It makes sense. And Microsoft's search stuff in Vista is really good (God I just threw up in my mouth when I wrote that)..."
Chris Kelly: Das Capital One: "Back before Capital One, there were just two kinds of consumers: People who could afford credit cards and people who couldn't afford credit cards...The guy who started Capital One imagined a third kind of person - someone who could almost afford a credit card. A virtual credit card holder. Something between a good risk and a social parasite."
I CAN HAS CHEEZBURGER?: OH HAI GOOGLZ: Google Street View + lolcats = Comedic Gold
Bileblog: Google Code - Ugliness is not just skin deep: "The administrative menu is, to put it as kindly as possible, whimsical.Menu items and options are scattered about like goat pebbleturds on amountain. The only option under ‘Advanced’ is ‘Delete this project’.How is that advanced functionality?"
Wikipedia: Pokémon test: "Each of the 493 Pokémon has its own page, all of which are bigger than stubs. While it would be expected that Pikachu would have its own page, some might be surprised to find out that Stantler has its own page, as well. Some people perceive Pokémon as something 'for little kids' and argue that if that gets an article, so should their favorite hobby/band/made-up word/whatever."
YouTube: A Cialis Ad With Cuba Gooding Jr.: From the folks at NationalBanana, lots of funny content on their site.
Bumper Sticker: Hell Was Full: Saw this on my way to work.
YouTube: Microsoft Surface Parody - "The future is here and it's not an iPhone. It's a big @$$ table. Take that Apple"

Categories: Mindless Link Propagation

June 23, 2007

@ 03:05 AM

Did I Tick Off 214 People Last Weekend?

According to my Feedburner stats it seems I lost about 214 subscribers using Google Reader between Saturday June 16^th and Sunday June 17^th. This seems like a fairly significant number of readers to unsubscribe from my blog on a weekend especially since I don't think I posted anything particularly controversial relative to my regular posts.

I was wondering if any other Feedburner users noticed a similar dip in their subscribers numbers via Google Reader over the weekend or whether it's just a coincidence that I happened to lose so many regular readers at once?

Categories: Personal

June 23, 2007

@ 03:02 AM

Moonlight: Silverlight for Linux in 21 Days

In his post Implementing Silverlight in 21 Days Miguel De Icaza writes

The past 21 days have been some of the most intense hacking days that I have ever had and the same goes for my team that worked 12 to 16 hours per day every single day --including weekends-- to implement Silverlight for Linux in record time. We call this effort Moonlight.
Needless to say, we believe that Silverlight is a fantastic development platform, and its .NET-based version is incredibly interesting and as Linux/Unix users we wanted to both get access to content produced with it and to use Linux as our developer platform for Silverlight-powered web sites.

His post is a great read for anyone who geeks out over phenomenal feats of hackery. Going over the Moonlight Project Page it's interesting to note how useful blog posts from Microsoft employees were in getting Miguel's team to figure out the internal workings of Silverlight.

In addition, it seems Miguel also learned a lot from hanging out with Jason Zander and Scott Guthrie which influenced some of the design of Moonlight. It's good to see Open Source developers working on Linux having such an amicable relationship with Microsoft developers.

Congratulations to Mono team, it looks like we will have Silverlight on Linux after all. Sweet.

Categories: Platforms | Programming | Web Development

June 20, 2007

@ 02:48 AM

Facebook: The Power of Software that Knows Who You Know

A couple of years ago, I wrote a blog post entitled Social Software: Finding Beauty in Walled Gardens where I riffed on the benefits of being able to tell the software applications you use regularly "here are the people I know, these are the ones I trust, etc". At the time I assumed that it would be one of the big Web companies such as Google, Yahoo!, or Microsoft that would build the killer social software platform that was powered by this unified view of your social connections. I was wrong. Facebook has beaten everyone to doing it first. There are a lot of user scenarios on the Web that can be improved if the applications we were using know who our friends, family and co-workers were knew without us having to explicitly tell them. Below are a couple of online services where access to a user's social network has made Facebook better at performing certain Web tasks than the traditional market leaders.

NOTE: This started off as three different blog posts in my writing queue but after reading ridiculous overhype like ex-Google employees publicly decamping from their former employer because 'Facebook is the Google of yesterday, the Microsoft of long ago' I decided to scale back my writing about the service and merge all my thoughts into a single post.

Displacing Email for Personal Communication

Gervase Markham, an employee of the Mozilla Foundation, recently wrote in his blog post entitled The Proprietarisation of Email

However, I also think we need to be aware of current attempts to make email closed and proprietary.
What am I talking about, I hear you ask? No-one's resurrected the idea of a spam-free email walled garden recently. Companies who tout their own secure mail protocols come and go and no-one notes their passing. The volume of legitimate email sent continues to grow. What's the worry?

I'm talking about the messaging systems built into sites like Facebook and LinkedIn. On several occasions recently, friends have chosen to get back in touch with me via one of these rather than by email. Another friend recently finished a conversation with a third party by saying "Facebook me"; when I asked her why she didn't just use email, she said "Oh, Facebook is so much easier".

And she's right. There's no spam, no risk of viruses or phishing, and you have a ready-made address book that you don't have to maintain. You can even do common mass email types like "Everyone, come to this event" using a much richer interface. Or other people can see what you say if you "write on their wall". In that light, the facts that the compose interface sucks even more than normal webmail, and that you don't have export access to a store of your own messages, don't seem quite so important.

After I read this post, I reflected on my casual use of the user to user messaging feature on Facebook and realized that even though I've only used it a handful of times, I've used it to communicate with friends and family a lot more than I have used either of my personal email addresses in the past three months. In fact, there are a bunch of friends and family whose email addresses I don't know that I've only communicated with online through Facebook. That's pretty wild. The fact that I don't get spam or random messages from people I don't know is also a nice plus and something a lot of other social network sites could learn from.

So one consequence of Facebook being used heavily by people in my real-life social network is that it is now more likely to be my interface for communicating with people I know personally than email. I suspect that if they ever add an instant messaging component to the site, it could significantly change the demographics of the top instant messaging applications.

Changing the Nature of Software Discovery and Distribution

I wrote about this yesterday in my post Marc Andreessen: The GoDaddy 2.0 Business Model but I think the ramifications of this are significant enough that it bears repeating. The viral software distribution model is probably one of the biggest innovations in the Facebook platform. Whenever my friends add an application to their dashboard I get a message in my news feed informing with a link to try out the application. I've tried out a couple of applications this way and it seems like a very novel and viral way to distribute applications. For one thing, it definitely a better way to discover new Facebook applications than browsing the application directory. Secondly, it also means that the best software is found a lot more quickly. The iLike folks have a blog post entitled Holy cow... 6mm users and growing 300k/day! they show a graph that indicates that iLike on Facebook has grown faster in its first few weeks than a number of popular services that grew quite quickly in their day including Skype, Hotmail, Kazaa and ICQ. 6 million new users in less than a month? 300,000 new users a day? Wow.

Although there are a number of issues to work out before transferring this idea to other contexts, I believe that this is a very compelling to approach how new software is discovered and distributed. I would love it if I my friends and family got a notification whenever I discovered a useful Firefox add-on or a great Sidebar gadget and vice versa. I wouldn't be surprised if this concept starts showing up in other places very soon.

Facebook Marketplace: A Craigslist Killer

Recently my former apartment was put up for rent after I broke the lease as part of the process of moving into a house. I had assumed that it would be listed in the local paper and apartment finding sites like Apartments.com. However I was surprised to find out from the property manager that they only listed apartments on Craig's List because it wasn't worth it to list anywhere else anymore. It seemed that somewhere along the line, the critical mass of apartment hunters had moved to using Craig's List for finding apartments instead of the local paper.

Since then I've used Craig's List and I was very dissatisfied with the experience. Besides the prehistoric user interface, I had to kiss a lot of frogs before finding a prince. I called about a ten people based on their listing and could only reach about half of them. Of those one person said he'd call back and didn't, another said he'd deliver to my place and then switched off his phone after I called to ask why he was late (eventually never showed) while yet another promised to show up then called back to cancel because his wife didn't want him leaving the house on a weekend. I guess it should be unsurprising how untrustworthy and flaky a bunch of the people listing goods and services for sale on Craig's List are since it doesn't cost anything to create a listing.

Now imagine if I could get goods and services only from people I know, people they know or from a somewhat trusted circle (e.g. people who work for the same employer) wouldn't that lead to a better user experience than what I had to deal with on Craig's List? In fact, this was the motivation behind Microsoft's Windows Live Expo which is billed as a "social marketplace". However the problem with marketplaces is that you need a critical mass of buyers and sellers for them to thrive. Enter Facebook Marketplace.

Of course, this isn't a slam dunk for Facebook and in fact right now there are ten times as many items listed for sale in the Seattle area on Windows Live Expo than on Facebook Marketplace (5210 vs. 520). Even more interesting is that a number of listings on Facebook Marketplace actually link back to listings on Craig's List which implies that people aren't taking it seriously as a listing service yet.

Conclusion

From the above, it is clear that there is a lot of opportunity for Facebook to dominate and change a number of online markets beyond just social networking sites. However it is not a done deal. The company is estimated to have about 200 employees and that isn't a lot in the grand scheme of things. There is already evidence that they have been overwhelmed by the response to their platform when you see some of the complaints from developers about insufficient support and poor platform documentation.

In addition, it seems the core Facebook application is not seeing enough attention when you consider that there is some fairly obvious functionality that doesn't exist. Specifically, it is quite surprising that Facebook doesn't take advantage of the wisdom of the crowds for powering local recommendations. If I was a college freshman, new employee or some other recently transplanted individual it would be cool for me to plug into my social network to find out where to go for the best chinese food, pizza, or nightclubs in the area.

I suspect that the speculation on blogs like Paul Kedrosky's is right and we'll see Facebook try to raise a bunch of money to fuel growth within the next 12 - 18 months. To reach its full potential, the company needs a lot more resources than it currently has.

Categories: Competitors/Web Companies | Social Software

June 19, 2007

@ 03:53 AM

Comments [2]

Marc Andreessen: The GoDaddy 2.0 Business Model

I read Marc Andreessen's Analyzing the Facebook Platform, three weeks in and although there's a lot to agree with I was also confused by how he defined a platform in the Web era. After a while, it occurred to me that Marc Andreessen's Ning is part of an increased interest by a number of Web players in chasing after what I like to call the the GoDaddy 2.0 business model. Specifically, a surprisingly disparate group of companies seem to think that the business of providing people with domain names and free Web hosting so they can create their own "Web 2.0" service is interesting. None of these companies have actually come out and said it but whenever I think of Ning or Amazon's S3+EC2, I see a bunch of services that seem to be diving backwards into the Web hosting business dominated by companies like GoDaddy.

Reading Marc's post, I realized that I didn't think of the facilities that a Web hosting provider gives you as "a platform". When I think of a Web platform, I think of an existing online service that enables developers to either harness the capabilities of that service or access its users in a way that allows the developers add value to user experience. The Facebook platform is definitely in this category. On the other hand, the building blocks that it takes to actually build a successful online service including servers, bandwidth and software building blocks (LAMP/RoR/etc) can also be considered a platform. This is where Ning and Amazon's S3+EC2. With that context, let's look at the parts of Marc's Analyzing the Facebook Platform, three weeks in which moved me to write this. Marc writes

Third, there are three very powerful potential aspects of being a platform in the web era that Facebook does not embrace.

The first is that Facebook itself is not reprogrammable -- Facebook's own code and functionality remains closed and proprietary. You can layer new code and functionality on top of what Facebook's own programmers have built, but you cannot change the Facebook system itself at any level.

There doesn't seem to be anything fundamentally wrong with this. When you look at some of the most popular platforms in the history of the software industry such as Microsoft Windows, Microsoft Office, Mozilla Firefox or Java you notice that none of them allow applications built on them to fundamentally change what they are. This isn't the goal of an application platform. More specifically, when one considers platforms that were already applications such as Microsoft Office or Mozilla Firefox it is clear that the level of extensibility allowed is intended to allow improving the user experience while utilizing the application and thus make the platform more sticky as opposed to reprogramming the core functionality of the application.

The second is that all third-party code that uses the Facebook APIs has to run on third-party servers -- servers that you, as the developer provide. On the one hand, this is obviously fair and reasonable, given the value that Facebook developers are getting. On the other hand, this is a much higher hurdle for development than if code could be uploaded and run directly within the Facebook environment -- on the Facebook servers.

This is one unfortunate aspect of Web development which tends to harm hobbyists. Although it is possible for me to find the tools to create and distribute desktop applications for little or no cost, the same cannot be said about Web software. At the very least I need a public Web server and the ability to pay for the hosting bills if my service gets popular. This is one of the reasons I can create and distribute RSS Bandit to thousands of users as a part time project with no cost to me except my time but cannot say the same if I wanted to build something like Google Reader.

This is a significant barrier to adoption of certain Web platforms which is a deal breaker for many developers who potentially could add a lot of value. Unfortunately, building an infrastructure that allows you to run arbitrary code from random Web developers and gives these untrusted applications database access without harming your core service costs more in time and resources than most Web companies can afford. For now.

The third is that you cannot create your own world -- your own social network -- using the Facebook platform. You cannot build another Facebook with it.

See my response to his first point. The primary reason for the existence of the Facebook platform is to harness the creativity and resources of outside developers in benefiting the social networks within Facebook. Allowing third party applications to fracture this social network or build competing services doesn't benefit the Facebook application. What Facebook offers developers is access to an audience of engaged users and in exchange these developers make Facebook a more compelling service by building cool applications on it. That way everybody wins.

An application that takes off on Facebook is very quickly adopted by hundreds of thousands, and then millions -- in days! -- and then ultimately tens of millions of users.

Unless you're already operating your own systems at Facebook levels of scale, your servers will promptly explode from all the traffic and you will shortly be sending out an email like this.
...
The implication is, in my view, quite clear -- the Facebook Platform is primarily for use by either big companies, or venture-backed startups with the funding and capability to handle the slightly insane scale requirements. Individual developers are going to have a very hard time taking advantage of it in useful ways.

I think Marc is over blowing the problem here, if one can even call it a problem. A fundamental truth of building Web applications is that if your service is popular then you will eventually hit scale problems. This was happening last century during "Web 1.0" when eBay outages were regular headlines and website owners used to fear the Slashdot effect. Until the nature of the Internet is fundamentally changed, this will always be the case.

However none of this means you can't build a Web application unless you have VC money or are big company. Instead, you should just have a strategy for how to deal with keeping your servers up and running if you service becomes a massive hit with users. It's a good problem to have but one needs to remember that most Web applications will never have that problem. ;)

When you develop a new Facebook application, you submit it to the directory and someone at Facebook Inc. approves it -- or not.

If your application is not approved for any reason -- or if it's just taking too long -- you apparently have the option of letting your application go out "underground". This means that you need to start your application's proliferation some other way than listing it in the directory -- by promoting it somewhere else on the web, or getting your friends to use it.

But then it can apparently proliferate virally across Facebook just like an approved application.

I think the viral distribution model is probably one of the biggest innovations in the Facebook platform. Announcing to my friends whenever I install a new application so that they can try them out themselves is pretty killer. This feature probably needs to be fine tuned I don't end up recommending or being recommending bad apps like X Me but that is such a minor nitpick. This is potentially a game changing move in the world of software distribution. I mean, can you imagine if you got a notification whenever one of your friends discovered a useful Firefox add-on or a great Sidebar gadget? It definitely beats using TechCrunch or Download.com as your source of cool new apps.

Categories: Platforms | Social Software

June 19, 2007

@ 02:07 AM

Blogging is to Social Networks as Books are to...

Doc Searls has a blog post entitled Questions Du Jour where he writes

Dare Obasanjo: Why Facebook is Bigger than Blogging. In response to Kent Newsome's request for an explanation of why Facebook is so cool.
While I think Dare makes some good points, his headline (which differs somewaht from the case he makes in text) reads to me like "why phones are better than books". The logic required here is AND, not OR. Both are good, for their own reasons.
Unlike phones and books, however, neither blogging nor Facebook are the final forms of the basics they offer today.

I think Doc has mischaracterized my post on why social networking services have seen broader adoption than blogging. My post wasn't about which is better since such a statement is as unhelpful as saying a banana is better than an apple. A banana is a better source of potassium but a worse source of dietary fiber. Which is better depends on what metric you are interested in.

My post was about popularity and explaining why, in my opinion, more people create and update their profiles on social networks than write blogs.

Categories: Social Software

June 16, 2007

@ 08:18 PM

Seattle Startup Shoutout: Zillow

Every once in a while someone asks me about software companies to work for in the Seattle area that aren't Microsoft, Amazon or Google. This is the fourth in a series of ~~weekly~~ posts about startups in the Seattle area that I often mention to people when they ask me this question.

Zillow is a real-estate Web site that is slowly revolutionizing how people approach the home buying experience. The service caters to buyers, sellers, potential sellers and real estate professionals in the following ways

For buyers: You can research a house and find out it's vital statistics (e.g. number of bedrooms, square footage, etc) it's current estimated value and how much it sold for when it was last sold. In addition, you can scope out homes that were recently sold in the neighborhood and get a good visual representation of the housing market in a particular area.
For sellers and agents: Homes for sale can be listed on the service
Potential sellers: You can post a Make Me Move™ price without having to actually list your home for sale.

I used Zillow when as part of the home buying process when I got my place and I think the service is fantastic. They also have the right level of buzz given recent high level poachings of Google employees and various profiles in the financial press.

The company was founded by Lloyd Frink and Richard Barton who are ex-Microsoft folks whose previous venture was Expedia, another Web site that revolutionized how people approached a common task.

Press: Fortune on Zillow

Number of Employees: 133

Location: Seattle, WA (Downtown)

Jobs: careers@zillow.hrmdirect.com, current open positions include a number of Software Development Engineer and Software Development Engineer in Test positions as well as a Systems Engineer and Program Manager position.

Categories: Seattle Startup Shoutout

June 16, 2007

@ 07:28 PM

Comments [11]

Web3S: A RESTful Protocol for Accessing Windows Live Services

I had hoped to avoid talking about RESTful Web services for a couple of weeks but Yaron Goland's latest blog post APP and Dare, the sitting duck deserves a mention. In his post, Yaron talks concretely about some of the thinking that has gone on at Windows Live and other parts of Microsoft around building a RESTful protocol for accessing and manipulating data stores on the Web. He writes

I'll try to explain what's actually going on at Live. I know what's going on because my job for little over the last year has been to work with Live groups designing our platform strategy.
...
Most of the services in Live land follow a very similar design pattern, what I originally called S3C which stood for Structured data, with some kind of Schema (in the general sense, I don't mean XML Schema), with some kind of Search and usually manipulated with operations that look rather CRUD like. So it seemed fairly natural to figure out how to unify access to those services with a single protocol.
...
So with this in mind we first went to APP. It's the hottest thing around. Yahoo, Google, etc. everyone loves it. And as Dare pointed out in his last article Microsoft has adopted it and will continue to adopt it where it makes sense. There was only one problem - we couldn't make APP work in any sane way for our scenarios. In fact, after looking around for a bit, we couldn't find any protocol that really did what we needed. Because my boss hated the name S3C we renamed the spec Web3S and that's the name we published it under. The very first section of the spec explains our requirements. I also published a FAQ that explains the design rationale for Web3S. And sure enough, the very first question, 2.1, explains why we didn't use ATOM.
...
Why not just modify APP?
We considered this option but the changes needed to make APP work for our scenarios were so fundamental that it wasn't clear if the resulting protocol would still be APP. The core of ATOM is the feed/entry model. But that model is what causes us our problems. If we change the data model are we still dealing with the same protocol? I also have to admit that I was deathly afraid of the political implications of Microsoft messing around with APP. I suspect Mr. Bray's comments would be taken as a pleasant walk in the park compared to the kind of pummeling Microsoft would receive if it touched one hair on APP's head.

In his post, Yaron talks about two of the key limitations we saw with the Atom Publishing Protocol (i.e. lack of support for hierarchies and lack of support for granular updates to fields) and responds to the various suggestions about how one can workaround these problems in APP. As he states in the conclusion of his post we are very wary of suggestions to "embrace and extend" Web standards given the amount of negative press the company has gotten about that over the years. It seems better for the industry if we build a protocol that works for our needs and publish documentation about how it works so any interested party can interoperate with us than if we claim we support a Web standard when in truth it only "works with Microsoft" because it has been extended in incompatible ways.

Dealing with Hierarchy

Here's what Yaron had to say with regards to the discussion around APP's lack of explicit support for hierarchies

The idea that you put a link in the ATOM feed to the actual object. This isn't a bad idea if the goal was to publish notices about data. E.g. if I wanted to have a feed that published information about changes to my address book then having a link to the actual address book data in the APP entries is fine and dandy. But if the goal is to directly manipulate the address book's contents then having to first download the feed, pull out the URLs for the entries and then retrieve each and every one of those URLs in separate requests in order to pull together all the address book data is unacceptable from both an implementation simplicity and performance perspective. We need a way where by someone can get all the content in the address book at once. Also, each of our contacts, for example, are actually quite large trees. So the problem recurses. We need a way to get all the data in one contact at a go without having to necessarily pull down the entire address book. At the next level we need a way to get all the phone numbers for a single contact without having to download the entire contact and so on.

Yaron is really calling out two issues here. The first is that if you have a data type that doesn't map well as a piece of authored content then it is better represented as its own content type that is linked from an atom:entry than trying to treat an atom:entry with its requirement of author, summary and title fields as a good way to represent all types of data. The second issue is the lack of explicit support for hierarchies. This situation is an example of how something that seems entirely reasonable in one scenario can be problematic in another. If you are editing blog posts, it probably isn't that much of a burden to first retrieve an Atom feed of all your recent blog posts, locate the link to the one you want to edit then retrieve it for editing. In addition, since a blog post is authored content, the most relevant information about the post can be summarized in the atom:entry. On the other hand, if you want to retrieve your list of IM buddies so you can view their online status or get people in your friend's list to see their recent status updates, it isn't pretty to fetch a feed of your contacts then have to retrieve each contact one by one after locating the links to their representations in the Atom feed. Secondly, you may just want to address part of the data instead of instead of retrieving or accessing an entire user object if you just want their status message or online status.

Below are specification excerpts showing how two RESTful protocols from Microsoft address these issues.

How Web3S Does It

The naming of elements and level of hierarchy in an XML document that is accessible via Web3S can be arbitrarily complex as long as it satisfies some structural constraints as specified in The Web3S Resource Infoset. The constraints include no mixed content and that multiple instances of an element with the same name as children of a node must be identified by a Web3S:ID element (e.g. multiple entries under a feed are identified by ID). Thus the representation of a Facebook user returned by the users.getInfo method in the Facebook REST API should be a valid Web3S document [except that the concentration element would have to be changed from having string content to having two element children, a Web3S:ID that can be used to address each concentration directly and another containing the current textual content].

The most important part of being able to properly represent hierarchies is that different levels of the hierarchy can be directly accessed. From the Web3S documentation section entitled Addressing Web3S Information Items in HTTP

In order to enable maximum flexibility Element Information Items (EIIs) are directly exposed as HTTP resources. That is, each EII can be addressed as a HTTP resource and manipulated with the usual methods...
<articles>
 <article>
  <Web3S:ID>8383</Web3S:ID>
  <title>Manual of Surgery Volume First: General Surgery. Sixth Edition.</title>
  <authors>
   <author>
    <Web3S:ID>23455</Web3S:ID>
    <firstname>Alexander</firstname>
    <lastname>Miles</lastname>    
   </author>
   <author>
    <Web3S:ID>88828</Web3S:ID>
    <firstname>Alexis</firstname>
    <lastname>Thomson</lastname>    
   </author>
  </authors>
 </article>
</articles>
If the non-Web3S prefix path is http://example.net/stuff/morestuff then we could address the lastname EII in Alexander Miles’s entry as http://example.net/stuff/morestuff/net.examples.articles/net.example.article(8383)/net.example.authors/net.example.author(23455)/org.example.lastname.

Although String Information Items (SIIs) are modeled as resources they currently do not have their own URLs and therefore are addressed only in the context of EIIs. E.g. the value of an SII would be set by setting the value of its parent EII.

XML heads may balk at requiring IDs to differentiate elements with the same name at the same scope or level of hierarchy instead of using positional indexes like XPath does. The problem with is that assumes that the XML document order is significant in the underlying data store which may likely not be the case.

Supporting Granular Updates

Here's what Yaron had to say on the topic of supporting granular updates and the various suggestions that came up with regards to preventing the lost update problem in APP.

APP's approach to this problem is to have the client download all the content, change the stuff they understand and then upload all the content including stuff they don't understand.
...
On a practical level though the 'download then upload what you don't understand' approach is complicated. To make it work at all one has to use optimistic concurrency. For example, let's say I just want to change the first name of a contact and I want to use last update wins semantics. E.g. I don't want to use optimistic concurrency. But when I download the contact I get a first name and a last name. I don't care about the last name. I just want to change the first name. But since I don't have merge semantics I am forced to upload the entire record including both first name and last name. If someone changed the last name on the contact after I downloaded but before I uploaded I don't want to lose that change since I only want to change the first name. So I am forced to get an etag and then do an if-match and if the if-match fails then I have to download again and try again with a new etag. Besides creating race conditions I have to take on a whole bunch of extra complexity when all I wanted in the first place was just to do a 'last update wins' update of the first name.
...
A number of folks seem to agree that merge makes sense but they suggested that instead of using PUT we should use PATCH. Currently we use PUT with a specific content type (application/Web3S+xml). If you execute a PUT against a Web3S resources with that specific content-type then we will interpret the content using merge semantics. In other words by default PUT has replacement semantics unless you use our specific content-type on a Web3S resource. Should we use PATCH? I don't think so but I'm flexible on the topic.

This is one place where a number of APP experts such as Bill de hÓra and James Snell seem to agree that the current semantics in APP are insufficient. There also seems to be some consensus that it is too early to standardize a technology for partial updates of XML on the Web without lots more implementation experience. I also agree with that sentiment. So having it out of APP for now probably isn't a bad thing.

Currently I'm still torn on whether Web3S's use of PUT for submitting partial updates is kosher or whether it is more appropriate to invent a new HTTP method called PATCH. There was a thread about this on the rest-discuss mailing list and for the most part it seems people felt that applying merge semantics on PUT requests for a specific media type is valid if the server understands that those are the semantics of that type.

How Web3S Does It

From the Web3S documentation section entitled Application/Web3S+xml with Merge Semantics

On its own the Application/Web3S+xml content type is used to represent a Web3S infoset. But the semantics of that infoset can change depending on what method it is used with.

In the case of PUT the semantics of the Application/Web3S+xml request body are “merge the infoset information in the Application/Web3S+xml request with the infoset of the EII identified in the request-URI.” This section defines how Application/Web3S+xml is to be handled specifically in the case of PUT or any other context in which the Web3S infoset in the Application/Web3S+xml serialization is to be merged with some existing Web3S infoset.

For example, imagine that the source contains:
<whatever> <Web3S:ID>234</Web3S:ID> <yo> <Web3S:ID>efghi</Web3S:ID> <avalue /> <somethingElse>YO!!!</somethingElse> </yo> </whatever>
Now imagine that the destination, before the merge, contains:
<whatever> <nobodyhome /> </whatever>
In this example the only successful outcome of the merge would have to be:
<whatever> <Web3S:ID>234</Web3S:ID> <yo> <Web3S:ID>efghi</Web3S:ID> <avalue /> <somethingElse>YO!!!</somethingElse> </yo> <nobodyhome /> </whatever>
In other words, not only would all of the source’s contents have to be copied over but the full names (E.g. EII names and IDs) must also be copied over exactly.

This an early draft of the spec so there are a lot of rules that aren't explicitly spelled out but now you should get the gist of how Web3S works. If you have any questions, direct them to Yaron not to me. I'm just an interested observer when it comes to Web3S. Yaron is the person to talk to if you want to make things happen. :)

In a couple of days I'll take a look at how Project Astoria deals with the same problems in a curiously similar fashion. Until then you can make do with Leonard Richardson's excellent review of Project Astoria. Until next time.

Categories: Windows Live | XML Web Services

June 16, 2007

@ 02:40 AM

Comments [9]

RSS Bandit 1.5.x (ShadowCat) Beta Installer Available

We are about ready to ship the second release of RSS Bandit this year. This release will be codenamed ShadowCat. This version will primarily be about bug fixes with one or two minor convenience features thrown in. If you are an RSS Bandit user who's had problems with crashes or other instability in the application, please grab the installer at RssBandit.1.5.0.14.ShadowCat.Beta.zip and let us know if this improves your user experience.

New Features

Newspaper view can be configured to show unread items in a feed as multiple pages instead of a single page of items

Major Bug Fixes

Javascript errors on Web pages result in error dialogs being displayed or the application hanging.
Search indexing thread takes 100% of CPU and freezes the computer.
Crashes related to Lucene search indexing (e.g. IO exceptions, access violations, file in use errors, etc)
Crash when a feed has an IRI (i.e. URL with unicode characters such as http://www.bücher-forum.de/forum/rdf.php) instead of issuing an error message since they are not supported.
Context menus no longer work on category nodes after OPML import or remote sync
Crash on deleting a feed which still has enclosures being downloaded
Podcasts downloaded from the http://poweruser.tv/ feed are named "..mp3" instead of the actual file name.
Items marked as read in a search folder not marked as read in original feed
No news shown when subscribed to newsgroups.borland.com
Can't subscribe to feeds on your local hard drive (regression since it worked in previous versions)

After the final release of ShadowCat, the following release of RSS Bandit (codenamed Phoenix) will contain our next major user interface revamp and will likely be the version where we move to version 2.0 of the .NET Framework.You can find some of our early screen shots on Flickr.

Categories: RSS Bandit

June 14, 2007

@ 09:28 PM

Comments [9]

Overview of the Atom's Data Model

The Atom data model consists of collections, entry resources and media resources. Entry resources and media resources are member resources of a collection. There is a handy drawings in section 4.2 of the latest APP draft specification that shows the hierarchy in this data model which is reproduced below.

                    
             Member Resources
                    |
             -----------------
            |                 |
      Entry Resources     Media Resources
            |
      Media Link Entry

A media resource can have representations in any media type. An entry resource corresponds to an atom:entry element which means it must have an id, a title, an updated date, one or more authors and textual content. Below is a minimal atom:entry element taken from the Atom 1.0 specification


<entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
</entry>

The process of creating and editing resources is covered in section 9 of the current APP draft specification. To add members to a Collection, clients send POST requests to the URI of the Collection. To delete a Member Resource, clients send a DELETE request to its Member URI. While to edit a Member Resource, clients send PUT requests to its Member URI.

Since using PUT to edit a resource is obviously problematic, the specification notes two concerns that developers have to pay attention to when updating resources.

To avoid unintentional loss of data when editing Member Entries or Media Link Entries, Atom Protocol clients SHOULD preserve all metadata that has not been intentionally modified, including unknown foreign markup.
...
Implementers are advised to pay attention to cache controls, and to make use of the mechanisms available in HTTP when editing Resources, in particular entity-tags as outlined in [NOTE-detect-lost-update]. Clients are not assured to receive the most recent representations of Collection Members using GET if the server is authorizing intermediaries to cache them.

The [NOTE-detect-lost-update] points to Editing the Web: Detecting the Lost Update Problem Using Unreserved Checkout which not only talks about ETags but also talks about conflict resolution strategies when faced with multiple edits to a Web document. This information is quite relevant to anyone considering implementing the Atom Publishing Protocol or a similar data manipulation protocol.

With this foundation, we can now talk about the various problems one faces when trying to use the Atom Publishing Protocol with certain types of Web data stores.

Limitations Caused by the Constraints within the Atom Data Model

The following is a list of problems one faces when trying to utilize the Atom Publishing Protocol in areas outside of content publishing for which it was originally designed.

Mismatch with data models that aren't microcontent: The Atom data model fits very well for representing authored content or microcontent on the Web such as blog posts, lists of links, podcasts, online photo albums and calendar events. In each of these cases the requirement that each Atom entry has an an id, a title, an updated date, one or more authors and textual content can be met and actually makes a lot of sense. On the other hand, there are other kinds online data that don't really fit this model.

Below is an example of the results one could get from invoking the users.getInfo method in the Facebook REST API.
```
<user>
    <uid>8055</uid>
    <about_me>This field perpetuates the glorification of the ego.  Also, it has a character limit.</about_me>
    <activities>Here: facebook, etc. There: Glee Club, a capella, teaching.</activities>    
    <birthday>November 3</birthday>
    <books>The Brothers K, GEB, Ken Wilber, Zen and the Art, Fitzgerald, The Emporer's New Mind, The Wonderful Story of Henry Sugar</books>
    <current_location>
      <city>Palo Alto</city>
      <state>CA</state>
      <country>United States</country>
      <zip>94303</zip>
    </current_location>    
    <first_name>Dave</first_name>       
     <interests>coffee, computers, the funny, architecture, code breaking,snowboarding, philosophy, soccer, talking to strangers</interests>
     <last_name>Fetterman</last_name>   
     <movies>Tommy Boy, Billy Madison, Fight Club, Dirty Work, Meet the Parents, My Blue Heaven, Office Space </movies>
     <music>New Found Glory, Daft Punk, Weezer, The Crystal Method, Rage, the KLF, Green Day, Live, Coldplay, Panic at the Disco, Family Force 5</music>
     <name>Dave Fetterman</name>  
     <profile_update_time>1170414620</profile_update_time>
     <relationship_status>In a Relationship</relationship_status>
     <religion/>
     <sex>male</sex>
     <significant_other_id xsi:nil="true"/>
     <status>
       <message>Pirates of the Carribean was an awful movie!!!</message>
     </status>    
   </user>
```
How exactly would one map this to an Atom entry? Most of the elements that constitute an Atom entry don't make much sense when representing a Facebook user. Secondly, one would have to create a large number of proprietary extension elements to anotate the atom:entry element to hold all the Facebook specific fields for the user. It's like trying to fit a square peg in a round hole. If you force it hard enough, you can make it fit but it will look damned ugly.

Even after doing that, it is extremely unlikely that an unmodified Atom feed reader or editing client such as would be able to do anything useful with this Frankenstein atom:entry element. If you are going to roll your own libraries and clients to deal with this Frankenstein element, then it it begs the question of what benefit you are getting from ~~mis~~ using a standardized protocol in this manner?

I guess we could keep the existing XML format used by the Facebook REST API and treat the user documents as media resources. But in that case, we aren't really using the Atom Publishing Protocol, instead we've reinvented WebDAV. Poorly.
Lack of support for granular updates to fields of an item: As mentioned in the previous section editing an entry requires replacing the old entry with a new one. The expected client interaction with the server is described in section 5.4 of the current APP draft and is excerpted below.
Retrieving a Resource
```
Client                                     Server
  |                                           |
  |  1.) GET to Member URI                    |
  |------------------------------------------>|
  |                                           |
  |  2.) 200 Ok                               |
  |      Member Representation                |
  |<------------------------------------------|
  |                                           |
```
1. The client sends a GET request to the URI of a Member Resource to retrieve its representation.
2. The server responds with the representation of the Member Resource.
Editing a Resource
```
Client                                     Server
  |                                           |
  |  1.) PUT to Member URI                    |
  |      Member Representation                |
  |------------------------------------------>|
  |                                           |
  |  2.) 200 OK                               |
  |<------------------------------------------|
```
1. The client sends a PUT request to store a representation of a Member Resource.
2. If the request is successful, the server responds with a status code of 200.
Can anyone spot what's wrong with this interaction? The first problem is a minor one that may prove problematic in certain cases. The problem is pointed out in the note in the documentation on Updating posts on Google Blogger via GData which states

IMPORTANT! To ensure forward compatibility, be sure that when you POST an updated entry you preserve all the XML that was present when you retrieved the entry from Blogger. Otherwise, when we implement new stuff and include <new-awesome-feature> elements in the feed, your client won't return them and your users will miss out! The Google data API client libraries all handle this correctly, so if you're using one of the libraries you're all set.

Thus each client is responsible for ensuring that it doesn't lose any XML that was in the original atom:entry element it downloaded. The second problem is more serious and should be of concern to anyone who's read Editing the Web: Detecting the Lost Update Problem Using Unreserved Checkout. The problem is that there is data loss if the entry has changed between the time the client downloaded it and when it tries to PUT its changes.

Even if the client does a HEAD request and compares ETags just before PUTing its changes, there's always the possibility of a race condition where an update occurs after the HEAD request. After a certain point, it is probably reasonable to just go with "most recent update wins" which is the simplest conflict resolution algorithm in existence. Unfortunately, this approach fails because the Atom Publishing Protocol makes client applications responsible for all the content within the atom:entry even if they are only interested in one field.

Let's go back to the Facebook example above. Having an API now makes it quite likely that users will have multiple applications editing their data at once and sometimes these aplications will change their data without direct user intervention. For example, imagine Dave Fetterman has just moved to New York city and is updating his data across various services. So he updates his status message in his favorite IM client to "I've moved" then goes to Facebook to update his current location. However, he's installed a plugin that synchronizes his IM status message with his Facebook status message. So the IM plugin downloads the atom:entry that represents Dave Fetterman, Dave then updates his address on Facebook and right afterwards the IM plugin uploads his profile information with the old location and his new status message. The IM plugin is now responsible for data loss in a field it doesn't even operate on directly.
Poor support for hierarchy: The Atom data model is that it doesn't directly support nesting or hierarchies. You can have a collection of media resources or entry resources but the entry resources cannot themselves contain entry resources. This means if you want to represent an item that has children they must be referenced via a link instead of included inline. This makes sense when you consider the blog syndication and blog editing background of Atom since it isn't a good idea to include all comments to a post directly children of an item in the feed or when editing the post. On the other hand, when you have a direct parent<->child hierarchical relationship, where the child is an addressable resource in its own right, it is cumbersome for clients to always have to make two or more calls to get all the data they need.

UPDATE: Bill de hÓra responds to these issues in his post APP on the Web has failed: miserably, utterly, and completely and points out to two more problems that developers may encounter while implementing GData/APP.

Categories: Web Development | XML Web Services

June 7, 2007

@ 06:06 PM

WhyWillYouWorkHere.com

A couple of months ago, I was asked to be part of a recruiting video for Microsoft Online Business Group (i.e. Windows Live and MSN). The site with the video is now up. It's http://www.whywillyouworkhere.com. As I expected, I sound like a dork. And as usual I was repping G-Unit. I probably should have worn something geeky like what I have on today, a a Super Mario Bros. 1up T-shirt. :)

Categories: MSN | Windows Live

June 7, 2007

@ 04:24 PM

Microsoft's "Back to the Future" Spoof Video

Via Todd Bishop I found the following spoof of Back to the Future starring Christopher Lloyd (from the actual movie) and Bob Muglia, Microsoft's senior vice president of the Server and Tools Business. The spoof takes pot shots at various failed Microsoft "big visions" like WinFS and Hailstorm in a humorous way. It's good to see our execs being able to make light of our mistakes in this manner. The full video of the keynote is here. Embedded below is the first five minutes of the Back to the Future spoof.

Video: Microsoft Back to the Future Parody

Categories: Life in the B0rg Cube | Mindless Link Propagation

June 6, 2007

@ 03:28 AM

Comments [16]

Google Gears: Replacing One Problem with Another

I've been thinking a little bit about Google Gears recently and after reading the documentation things I've realized that making a Web-based application that works well offline poses an interesting set of challenges. First of all, let's go over what constitutes the platform that is Google Gears. It consists of three components

LocalServer: Allows you to cache and serve application resources such as HTML pages, scripts, stylesheets and images from a local web server.

Database: A relational database where the application can store data locally. The database supports both full-text and SQL queries.
WorkerPool: Allows applications to perform I/O expensive tasks in the background and thus not lock up the browser. A necessary evil.

At first, this seemed like a lot to functionality being offered by Google Gears until I started trying to design how I'd take some of my favorite Web applications offline. Let's start with a straightforward case such as Google Reader. The first thing you have to do is decide what data needs to be stored locally when the user decides to go offline. Well, a desktop RSS reader has all my unread items even when I go offline so a user may expect that if they go offline in Google Reader this means all their unread items are offline. This could potentially be a lot of data to transfer in the split instant between when the user selects "go offline" in the Google Reader interface and she actually loses her 'net connection by closing her laptop. There are ways to work around this such as limiting how many feeds are available offline (e.g. Robert Scoble with a thousand feeds in his subscription list won't get to take all of them offline) or by progressively downloading all the unread content while the user is viewing the content in online mode. Let's ignore that problem for now because it isn't that interesting.

The next problem is to decide which state changes while the app is offline need to be reported back when the user gets back online. These seem to be quite straightforward,

Feed changed
Feed added
Feed deleted
Feed renamed
Feed moved
News item changed
Item marked read/unread
Item flagged/starred
Item tag updated

The application code can store these changes as a sequential list of modifications which are then executed whenever the user gets back online. Sounds easy enough. Or is it?

What happens if I'm on my laptop and I go offline in Google Reader and mark a bunch of stuff as read then unsubscribe from a few feeds I no longer find interesting. The next day when I get to work, I go online on my desktop, read some new items and subscribe to some new feeds. Later that day, I go online with my laptop. Now the state on my laptop is inconsistent from that on the Web server. How do we reconcile these differences?

The developers at Google have anticipated these questions and have answered them in Google Gears documentation topic titled Choosing an Offline Application Architecture which states

No matter which connection and modality strategy you use, the data in the local database will get out of sync with the server data. For example, local data and server data get out of sync when:

The user makes changes while offline
Data is shared and can be changed by external parties
Data comes from an external source, such as a feed

Resolving these differences so that the two stores are the same is called "synchronization". There are many approaches to synchronization and none are perfect for all situations. The solution you ultimately choose will likely be highly customized to your particular application.

Below are some general synchronization strategies.

Manual Sync

The simplest solution to synchronization is what we call "manual sync". It's manual because the user decides when to synchronize. It can be implemented simply by uploading all the old local data to the server, and then downloading a fresh copy from the server before going offline.
...
Background Sync

In a "background sync", the application continuously synchronizes the data between the local data store and the server. This can be implemented by pinging the server every once in a while or better yet, letting the server push or stream data to the client (this is called Comet in the Ajax lingo).

I don't consider myself some sort of expert on data synchronization protocols but it seems to me that there is a lot more to figuring out a data synchronization strategy than whether it should be done based on user action or automatically in the background without user intervention. It seems that there would be all sorts of decisions around consistency models and single vs. multi-master designs that developers would have to make as well. And that's just for a fairly straightforward application like Google Reader. Can you imagine what it would be like to use Google Gears to replicate the functionality of Outlook in the offline mode of Gmail or to make Google Docs & Spreadsheets behave properly when presented with conflicting versions of a document or spreadsheet because the user updated it from the Web and in offline mode?

It seems that without providing data synchronization out of the box, Google Gears leaves the most difficult and cumbersome aspect of building a disconnected Web app up to application developers. This may be OK for Google developers using Google Gears since the average Google coder is a Ph.D but the platform isn't terribly useful to Web application developers who want to use it for anything besides a super-sized HTTP cookie.

A number of other bloggers such as Roger Jennings and Tim Anderson have also pointed that the lack of data synchronization in Google Gears is a significant oversight. If Google intends for Google Gears to become a platform that will be generally useful to the average Web developer then the company will have to fix this oversight. Otherwise, they haven't done as much for the Web development world as the initial hype led us to believe.

Categories: Programming | Web Development

June 5, 2007

@ 05:53 PM

Comments [4]

What Ruby on Rails Can Learn from ASP.NET

Ian McKellar has a blog post entitled Insecurity is Ruby on Rails Best Practice where he points out that by default the Ruby on Rails framework makes sites vulnerable to a certain class of exploits. Specifically, he discusses the vulnerabilities in two Ruby on Rails applications, 37 Signals Highrise and Magnolia, then proposes solutions. He writes

Cross Site Request Forgery
CSRF is the new bad guy in web application security. Everyone has worked out how to protect their SQL database from malicious input, and RoR saves you from ever having to worry about this. Cross site scripting attacks are dying and the web community even managed to nip most JSON data leaks in the bud.

Cross Site Request Forgery is very simple. A malicious site asks the user's browser to carry out an action on a site that the user has an active session on and the victim site carries out that action believing that the user intended that action to occur. In other words the problem arises when a web application relies purely on session cookies to authenticate requests.
...
Solutions
Easy Solutions
There aren't any good easy solutions to this. A first step is to do referrer checking on every request and block GET requests in form actions. Simply checking the domain on the referrer may not be enough security if there's a chance that HTML could be posted somewhere in the domain by an attacker the application would be vulnerable again.

Better Solutions
Ideally we want a shared secret between the HTML that contains the form and the rails code in the action. We don't want this to be accessible to third parties so serving as JavaScript isn't an option. The way other platforms like Drupal achieve this is by inserting a hidden form field into every form that's generated that contains a secret token, either unique to the current user's current session or (for the more paranoid) also unique to the action. The action then has to check that the hidden token is correct before allowing processing to continue.

Incidents of Cross Site Request Forgery have become more popular with the rise of AJAX and this is likely to become as endemic as SQL injection attacks until the majority of Web frameworks take this into account in their out of the box experience.

At Microsoft, the Web teams at MSN and Windows Live have given the folks in Developer Division the virtue of their experience building Web apps which has helped in making sure our Web frameworks like ASP.NET Ajax (formerly codenamed Atlas) avoid this issue in their default configuration. Scott Guthrie outlines the safeguards against this class of issues in his post JSON Hijacking and How ASP.NET AJAX 1.0 Avoids these Attacks where he writes

Recently some reports have been issued by security researchers describing ways hackers can use the JSON wire format used by most popular AJAX frameworks to try and exploit cross domain scripts within browsers. Specifically, these attacks use HTTP GET requests invoked via an HTML include element to circumvent the "same origin policy" enforced by browsers (which limits JavaScript objects like XmlHttpRequest to only calling URLs on the same domain that the page was loaded from), and then look for ways to exploit the JSON payload content.
ASP.NET AJAX 1.0 includes a number of default settings and built-in features that prevent it from being susceptible to these types of JSON hijacking attacks.
...
ASP.NET AJAX 1.0 by default only allows the HTTP POST verb to be used when invoking web methods using JSON, which means you can't inadvertently allow browsers to invoke methods via HTTP GET.
ASP.NET AJAX 1.0 requires a Content-Type header to be set to "application/json" for both GET and POST invocations to AJAX web services. JSON requests that do not contain this header will be rejected by an ASP.NET server. This means you cannot invoke an ASP.NET AJAX web method via a include because browsers do not allow append custom content-type headers when requesting a JavaScript file like this.

These mitigations would solve the issues that Ian McKellar pointed out in 37 Signals Highrise and Magnolia because HTML forms hosted on a malicious site cannot set the Content-Type header so that exploit is blocked. However neither this approach nor referrer checking to see if the requests come from your domain is enough if the malicious party finds a way to upload HTML or script onto your site.

To completely mitigate against this attack, the shared secret approach is most secure and is what is used by most large websites. In this approach each page that can submit a request has a canary value (i.e. hidden form key) which must be returned with the request. If the form key is not provided, is invalid or expired then the request fails. This functionality is provided out of the box in ASP.NET by setting the Page.ViewStateUserKey property. Unfortunately, this feature is not on by default. On the positive side, it is a simple one line code change to get this functionality which needs to be rolled by hand on a number of other Web platforms today.

Categories: Web Development

June 5, 2007

@ 05:52 PM

Improving Website Usability with A/B Testing

The good folks on the Microsoft Experimentation Platform team have published a paper which gives a great introduction to how and why one can go about using controlled experiments (i.e. A/B testing) to improve the usability of a website. The paper is titled Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO and will be published as part of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The paper begins

In the 1700s, a British ship’s captain observed the lack of scurvy among sailors serving on the naval ships of Mediterranean countries, where citrus fruit was part of their rations. He then gave half his crew limes (the Treatment group) while the other half (the Control group) continued with their regular diet. Despite much grumbling among the crew in the Treatment group, the experiment was a success, showing that consuming limes prevented scurvy. While the captain did not realize that scurvy is a consequence of vitamin C deficiency, and that limes are rich in vitamin C, the intervention worked. British sailors eventually were compelled to consume citrus fruit regularly, a practice that gave rise to the still-popular label limeys.
Some 300 years later, Greg Linden at Amazon created a prototype to show personalized recommendations based on items in the shopping cart (2). You add an item, recommendations show up; add another item, different recommendations show up. Linden notes that while the prototype looked promising, ―a marketing senior vice-president was dead set against it, claiming it will distract people from checking out. Greg was ―forbidden to work on this any further. Nonetheless, Greg ran a controlled experiment, and the ―feature won by such a wide margin that not having it live was costing Amazon a noticeable chunk of change. With new urgency, shopping cart recommendations launched. Since then, multiple sites have copied cart recommendations.
The authors of this paper were involved in many experiments at Amazon, Microsoft, Dupont, and NASA. The culture of experimentation at Amazon, where data trumps intuition (3), and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively. At Microsoft, there are multiple systems for running controlled experiments. We describe several architectures in this paper with their advantages and disadvantages. A unifying theme is that controlled experiments have great return-on-investment (ROI) and that building the appropriate infrastructure can accelerate innovation.

I learned quite a bit from reading the paper although I did somewhat skip over some of the parts that involved math. It's pretty interesting when you realize how huge the impact of changing the layout of a page or moving links can be on the bottom line of a Web company. Were talking millions of dollars for the most popular sites. That's pretty crazy.

Anyway, Ronny Kohavi from the team mentioned that they will be giving a talk related to the paper at eBay research labs tomorrow at 11AM. The talk will be in Building 0 (Toys) in room 0158F. The address is 2145 Hamilton Avenue, San Jose, CA. If you are in the silicon valley area, this might be a nice bring your own lunch event to attend.

Categories: Mindless Link Propagation | Web Development

June 5, 2007

@ 04:20 PM

MSN Soapbox Now Open to the Public

The MSN Soapbox team has a blog post entitled Soapbox Is Now Open For Full Video Viewing which states

We have just opened Soapbox on MSN Video for full video viewing. You no longer need to be signed into the service to view Soapbox videos!

However just as before, you still need to sign in to upload, comment, tag or rate videos. e Embedded Player is also available for full viewing and embedding.

While it might not be visible, we have spent the last month working hard on improving our backend, to make encoding and performance as scalable as possible.

We are also now conducting proactive filtering of all uploaded content, using technology from Audible Magic.Audible Magic bases their filtering on the video's audio track and their database is being updated constantly. By using this technology, we see this as an important step to ensure the viability and success of Soapbox over the long run.

You can dive right in and explore the site's new interface by checking out my friend Ikechukwu's music video. I like the redesign of the site although there seem to be a few minor usability quibbles. The content filtering is also an interesting addition to the site and was recently mentioned in the Ars Technica article Soapbox re-opens, beating YouTube to the punch with content filtering. Come to think of it, I wonder what's been taking Google so long to implement content filtering for copyrighted materials given how long they've been talking about it. It's almost as if they are dawdling to add the feature. I wonder why?

Speaking of usability issues, the main problem I have with the redesign is that once you start browsing videos the site's use of AJAX/Flash techniques works against it. Once you click on any of the videos shown beside the one you are watching, the video changes inline instead of navigating to a new page which means the browser's URL doesn't change. To get a permalink to a video I had the hunt around in the interface until I saw a "video link" which gave me the permalink to the video I was currently watching. Both Google Video and YouTube actually reload the page and thus change the URL when you click on a video. Even then YouTube makes the URL prominent on the page as well. Although I like the smoothness of inline transitions, making the permalink and other sharing options more prominent would make the site a lot more user friendly in my opinion.

Categories: MSN

June 4, 2007

@ 10:36 PM