From the Live Search team blog post entitled Live Search autosuggestions come to Firefox we learn

We're happy to report that we've officially integrated Live Search into Firefox by popular demand.
...
The Live Search add-on for Firefox is available to install at
https://addons.mozilla.org/en-US/firefox/addon/10434. It's based on the Open Search standard and uses the JSON interface supported by Firefox to retrieve autosuggestions.

Image of Live Search autosuggestions in Firefox

I'm glad to see this finally get out there since I was one of the popular demanders at work. If you're a Firefox user who's also a fan of Live Search, then this is a must-have extension. It's definitely improved my browsing experience when I've had to use Firefox.

By the way, it's an interesting insight into the different user bases to compare the suggested searches from Google with those from Live Search for a particular phrase.

Now Playing: Kelly Clarkson - My Life Would Suck Without You


 

Categories: MSN

From the blog post on the Office Live team blog entitled Looking ahead and bringing you even more we learn

Today, I wanted to share with you some exciting news about Office Live: To simplify and improve the customer experience around our Live services, we’ve made the decision to converge Windows Live and Office Live into an integrated set of services at one single destination. We think that just makes a ton of sense and goes a long way toward giving you a simpler, richer, better service that allows you to do more with one account.

Every day, more and more people are signing up for Office Live Workspace and Office Live Small Business (4 million of you so far!), as well as Windows Live (460 million to date!).

Secondly there’s the MSDN forum post entitled Working Together: Live Mesh and Windows Live by the Live Mesh team which informs us that

For some time now, many of you have expressed interest in seeing some sort of combination of Live Mesh and other Microsoft services. In order to further explore these ideas we would like to ask you to share with us the scenario(s) that you have in mind. 

What combination(s) are you interested in, and why? Whatever your interests, whatever problem you’re trying to solve or scenario you want to enable, we’d like to hear the details – and the more specific you can be, the better.

We know you have ideas – this is the place to share them!

Thank you,

The Live Mesh Team

I love it when things come together this way. There's been a lot of choice in Web-based storage solutions offered by Microsoft, and some would argue that it's been too much choice. It will be good to see how the cross-pollination of ideas between these various products works out over the coming months and years.

If you are interested in developer or end user scenarios around Web-based storage, I recommend chiming in on the MSDN forum post linked above. It's your chance to give your feedback to Microsoft and help influence the direction of the most innovative technology product of the past year.

Now Playing: Amy Winehouse - You Know I'm No Good


 

Categories: Windows Live

Sarah Perez over on ReadWriteWeb has a blog post entitled In Cloud We Trust? where she states

Cloud computing may have been one of the biggest "buzzwords" (buzz phrases?) of this past year. From webmail to storage sites to web-based applications, everything online was sold under a new moniker in 2008: they're all "cloud" services now. Yet even though millions of internet users make use of these online services in some way, it seems that we haven't been completely sold on the cloud being any more safe or stable than data stored on our own computers.
...
Surprisingly, even on a site that tends to attract a lot of technology's earliest adopters, the responses were mixed. When asked the question: "Do you trust the cloud?," the majority of responses either came back as a flat-out "no" or as a longer explanation as to why their response was a "maybe" or a "sometimes." In other words, some people trust the cloud here, but not there, or for this, but not that.

The question this article asks is pointless on several levels. First of all, it doesn't really matter whether people trust the cloud or not. What matters is whether they use it. The average person doesn't trust computers, automobile mechanics or lawyers, yet they use them anyway. Given the massive adoption of the Web, from search engines and e-commerce sites to Web-based email and social networking services, it is clear that the average computer user trusts the cloud enough to part with their personal information and their money. Being scared and untrusting of the cloud is like being scared and untrusting of computers; it is a characteristic of an older generation, while the younger generation couldn't imagine life any other way. It's like Douglas Adams wrote in his famous essay How to Stop Worrying and Learn to Love the Internet back in 1999.

Secondly, people are notoriously bad at assessing risk. Given that the average computer user doesn't have a data backup strategy, data loss is more likely to occur when their personal hardware fails than when their information is stored on some Web company's servers. For example, I still have emails from the last decade available to me in my Hotmail and Yahoo! Mail accounts. On the other hand, my personal archive of mail from the early 2000s, which had survived being moved across three different desktop PCs, was finally lost when the hard drive failed on my home computer a few months ago. I used to have a personal backup strategy for my home desktop but gave up after encountering the kinds of frustrations Mark Pilgrim eloquently rants about in his post Juggling Oranges. These days, I just put all the files and photos I'm sure I'd miss on SkyDrive and treat any file not worth uploading to the cloud as transient anyway. It is actually somewhat liberating since I no longer feel owned by all the digital stuff I have to catalog, manage and archive.

On a final note, the point isn't that there aren't valid concerns raised whenever this question is brought up. However, progress will march on despite our Luddite concerns because the genie is already out of the bottle. For most people, the benefits of anywhere access to their data from virtually any device and being able to share their content with people on the Web far outweigh the costs of not being in complete control of the data. In much the same way, horseless carriages (aka automobiles) may cause a lot more problems than horse-drawn carriages, from the quarter ton of carbon monoxide poured into the air per year by an average car to the tens of thousands of people killed each year in car crashes, yet the benefits of automobiles powered by internal combustion engines are so significant that humanity has decided to live with the problems that come along with them.

The cloud Web is already here and it is here to stay. It's about time we stopped questioning it and got used to the idea.

Now Playing: Amy Winehouse - Love Is A Losing Game


 

Categories: Cloud Computing

For the past few months I've been hearing the term deflation bandied about by TV news pundits, with Japan used as the popular example of the phenomenon. The claim is that if the trend of price drops in the U.S. continues then we will be headed for deflation. When I first saw these stories I wondered what exactly is wrong with falling prices? Well…

Everywhere you look assets are worth less than they were a year ago. Gas prices are lower, house prices are lower, and stock portfolios are lower. At first I considered this a net positive. According to Zillow my house is now worth 10% less than what we paid for it almost two years ago, and my 401(k) lost about 35% during the 2008 calendar year. However, I treated these as a "correction" and in effect paper losses, since I bought my house to live in, not to flip it, and I'm not planning to spend out of my 401(k) until retirement. On the other hand, lower gas prices and cheaper consumer goods at Christmas time had a real and positive effect on my bottom line.

In addition, despite the media's claim that people were hoarding cash, we planned to ignore the hype and help the local economy by continuing our plans to have our bathroom remodeled and get a new car later in the year (my wife has her eye on the Ford Flex).

As each week passes I've become less sure of our plans to "help the local economy", and last week's announcement by my employer that it would eliminate 5,000 jobs within the next 18 months began to make the plans seem downright irresponsible. At this point we've decided to hold off on the purchases and are debating the safest way to hold on to the money and still retain value. My behavior sounded familiar, so I looked up deflation in Japan on Wikipedia, and it was interesting to see the parallels:

Systemic reasons for deflation in Japan can be said to include:

  • Fallen asset prices. There was a rather large price bubble in both equities and real estate in Japan in the 1980s (peaking in late 1989). When assets decrease in value, the money supply shrinks, which is deflationary.
  • Insolvent companies: Banks lent to companies and individuals that invested in real estate. When real estate values dropped, these loans could not be paid. The banks could try to collect on the collateral (land), but this wouldn't pay off the loan. Banks have delayed that decision, hoping asset prices would improve. These delays were allowed by national banking regulators. Some banks make even more loans to these companies that are used to service the debt they already have. This continuing process is known as maintaining an "unrealized loss", and until the assets are completely revalued and/or sold off (and the loss realized), it will continue to be a deflationary force in the economy. Improving bankruptcy law, land transfer law, and tax law have been suggested (by The Economist) as methods to speed this process and thus end the deflation.
  • Insolvent banks: Banks with a larger percentage of their loans which are "non-performing", that is to say, they are not receiving payments on them, but have not yet written them off, cannot lend more money; they must increase their cash reserves to cover the bad loans.
  • Fear of insolvent banks: Japanese people are afraid that banks will collapse so they prefer to buy gold or (United States or Japanese) Treasury bonds instead of saving their money in a bank account. This likewise means the money is not available for lending and therefore economic growth. This means that the savings rate depresses consumption, but does not appear in the economy in an efficient form to spur new investment. People also save by owning real estate, further slowing growth, since it inflates land prices.
  • Imported deflation: Japan imports Chinese and other countries' inexpensive consumable goods, raw materials (due to lower wages and fast growth in those countries). Thus, prices of imported products are decreasing. Domestic producers must match these prices in order to remain competitive. This decreases prices for many things in the economy, and thus is deflationary.

The crazy thing is that this sounds like a description of the United States of America today. That's when I understood that the threat of a deflationary spiral is very real. If people like me start hoarding cash then businesses get fewer customers, which causes them to lower prices in return. With lower prices, they make less money and thus need to cut costs, so they have layoffs. Now the local situation is worse, giving people even more conviction in holding on to their cash, and so on. The bit about imported goods kicking the butts of locally produced items in the marketplace is also especially apt given the recent drama about the automobile bailout.

Anyway, it looks like we are at the start of a deflationary spiral. The interesting question is whether there is anything anyone can do to stop it before it fully gets underway.


 

Categories: Current Affairs | Personal

A few months ago, Tim O'Reilly wrote an excellent post entitled Web 2.0 and Cloud Computing where he provides some definitions of two key cloud computing paradigms, Utility Computing and Platform as a Service. His descriptions of these models can be paraphrased as follows

  1. Utility Computing: In this approach, a vendor provides access to virtual server instances where each instance runs a traditional server operating system such as Linux or Windows Server. Computation and storage resources are metered and the customer can "scale infinitely" by simply creating new server instances. The most popular example of this approach is Amazon EC2.
  2. Platform as a Service: In this approach, a vendor abstracts away the notion of accessing traditional LAMP or WISC stacks from their customers and instead provides an environment for running programs written using a particular platform. In addition, data storage is provided via a custom storage layer and API instead of traditional relational database access. The most popular example of this approach is Google App Engine.

The more I interact with platform as a service offerings, the more I realize that although they are more easily approachable for getting started there is a cost because you often can't reuse your existing skills and technologies when utilizing such services. A great example of this is Don Park's post about developing on Google's App Engine entitled So GAE where he writes

What I found frustrating while coding for GAE are the usual constraints of sandboxing but, for languages with rich third-party library support like Python, it gets really bad because many of those libraries have to be rewritten or replaced to varying degrees. For example, I couldn’t use existing crop of Twitter client libraries so I had to code the necessary calls myself. Each such incident is no big deal but the difference between hastily handcrafted code and libraries polished over time piles up.

I expect that the inability of developers to simply use the existing libraries and tools they are familiar with on services like Google App Engine is going to be an adoption blocker. However, I expect that the lack of a "SQL database in the cloud" will actually be an even bigger red flag than the fact that some APIs or libraries from your favorite programming language are missing.

A friend of mine who runs his own software company recently mentioned that one of the biggest problems he has delivering server-based software to his customers is that eventually the database requires tuning (e.g. creating indexes) and there is no expertise on-site at the customer to perform these tasks. He wanted to explore whether a cloud-based offering like the Azure Services Platform could help. My response was that it would if he was willing to rewrite his application to use a table-based storage system instead of a relational database. In addition, aside from using a new set of APIs for interacting with the service, he'd also have to give up relational database features like foreign keys, joins, triggers and stored procedures. He thought I was crazy to even suggest that as an option.
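To make concrete what giving up joins actually means, here is a minimal sketch in C#. The ITableStore interface and the entity classes are hypothetical illustrations and not the actual Azure Services Platform API; the point is simply that a query which used to be a single SQL join becomes multiple point lookups stitched together in application code.

// Hypothetical table store interface; not a real cloud storage API.
public interface ITableStore
{
    T Get<T>(string partitionKey, string rowKey);
}

public class Customer { public string Id; public string Name; public string LastOrderId; }
public class Order { public string Id; public string CustomerId; public decimal Total; }

public class OrderLookup
{
    private readonly ITableStore store;
    public OrderLookup(ITableStore store) { this.store = store; }

    // With a relational database this is a single query:
    //   SELECT c.Name, o.Total FROM Customers c
    //   JOIN Orders o ON o.CustomerID = c.CustomerID WHERE c.CustomerID = ?
    // With a table store there are no joins, so the "join" moves into
    // application code as two separate point reads.
    public string DescribeLastOrder(string customerId)
    {
        Customer customer = store.Get<Customer>("customers", customerId);
        Order order = store.Get<Order>("orders", customer.LastOrderId);
        return customer.Name + " last ordered a total of " + order.Total;
    }
}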

This reminds me of an earlier post from Dave Winer entitled Microsoft's cloud strategy? 

No one seems to hit the sweet spot, the no-brainer cloud platform that could take our software as-is, and just run it -- and run by a company that stands a chance of surviving the coming recession (which everyone really thinks may be a depression).

Of all the offerings Amazon comes the closest.

As it stands today, platform as a service offerings do not satisfy the needs of people who have existing apps and want to "port them to the cloud". Instead this looks like it will remain the domain of utility computing services, which just give you a VM and the ability to run any software you damn well please on your operating system of choice.

However, for brand new product development the restrictions of platform as a service offerings seem like an acceptable trade-off given the ability to "scale infinitely" without having to get your hands dirty. Developers on platform as a service offerings don't have to worry about database management and the ensuing complexities like sharding, replication and database tuning.

What are your thoughts on the strengths and weaknesses of both classes of offerings?

Now Playing: The Pussycat Dolls - I Hate This Part


 

Categories: Cloud Computing

I've been a FeedBurner customer for a couple of years and was initially happy for the company when it was acquired by Google. This soon turned to frustration when I realized that Google had become the company where startups go to die. Since being acquired by Google almost two years ago, the service hasn't added new features or fixed glaring bugs. If anything, the service has only lost features since it was acquired.

I discovered the most significant feature loss this weekend when I was prompted to migrate my FeedBurner account to using a Google login. I thought this would just be a simple change in login credentials, but I got a different version of the service as well. The version of the service for Google accounts does have one new feature: the chart of subscribers is now a chart of subscribers AND "reach". On the other hand, the website analytics features seemed to be completely missing. So I searched the Internet and found the following FAQ on the FeedBurner to Google Accounts migration

Are there any features that are not available at feedburner.google.com?

There are two features that we are retiring from all versions of feedburner: Site Stats (visitors) and FeedBurner Networks.

We have decided to retire FeedBurner website visitor tracking, as we feel Google already has a comparable publisher site analytics tool in Google Analytics. If you do not currently use Google Analytics, we recommend signing up for an account. We want to stress that all feed analytics will remain the same or be improved, and they are not going away.

FeedBurner Networks, which were heavily integrated with FeedBurner Ad Network, are no longer being supported. As with many software features, the usage wasn't at the level we'd hoped, and therefore we are making the decision not to develop it further, but to focus our attention on other feed services that are being used with more frequency. We will continue to look out for more opportunities for publishers to group inventory as part of the AdSense platform.

I guess I should have seen this coming. I know enough about competing projects and acquisitions to realize that there was no way Google would continue to invest in two competing website analytics products. It is unfortunate because there were some nice features in FeedBurner's Site Stats, such as tracking of clicked links and the ability to have requests from a particular machine be ignored, which are missing in Google Analytics. I also miss the simplicity of FeedBurner's product. Google Analytics is a more complex product with lots of bells and whistles, yet it takes two or three times as many clicks to answer straightforward questions like which referrer pages are sending people to our site.

The bigger concern for me is that both Google Analytics and FeedBurner (aka AdSense for Feeds) are really geared around providing a service for people who use Google's ad system. I keep wondering how much longer Google will be willing to let me mooch off of them by offloading my RSS feed bandwidth costs to them while not serving Google ads in my feed.  At this rate, I wouldn't be surprised if Google turns FeedBurner into a Freemium business model product where users like me are encouraged to run ads if we want any new features and better service.

Too bad there isn't any competition in this space.

Now Playing: Lords of the Underground - Chief Rocka


 

Categories: Rants

A few days ago someone asked how long I've been at Microsoft and I was surprised to hear myself say about 7 years. I hadn't consciously thought about it for a while but my 7th anniversary at the company is coming up in a few weeks. I spent a few days afterwards wondering if I have a seven year itch and thinking about what I want to do next in my career. 

I realized that I couldn't be happier right now. I get to build the core platform that powers the social software experience for half a billion users with a great product team (dev, test, ops) and very supportive management all the way up the chain. This hasn't always been the case. 

I went through my seven year itch period about two years ago. I had always planned for my career at Microsoft to be short since I've never considered myself to be a big company person. Around that time, I looked around at a couple of places both within and outside Microsoft. I had some surprisingly good hiring experiences and some that were surprisingly bad (as bad as this thread makes the Google hiring process seem, trust me it is worse than that) then came away with a surprising conclusion. The best place to build the kind of software I wanted to build was at Microsoft. I started working in online services at Microsoft because I believed Social Software is the Platform of the Future and wanted to build social experiences that influence how millions of people connect with each other. What I realized after my quick look around at other opportunities is that no one has more potential in this space than Microsoft. Today when I read articles about our recent release, it validates my belief that Microsoft will be the company to watch when it comes to bringing "social" to software in a big way. 

~

In my almost seven years in the software industry, I've had a number of friends go through the sense of needing change or the career dissatisfaction that leads to the seven year itch, both at Microsoft and elsewhere. Some of them dealt with this poorly and eventually became disgruntled and unhappy with their jobs, which turns into a vicious cycle. On the other hand, I know a bunch of people who went from being unhappy or disgruntled about their jobs to becoming happy and productive employees who are more satisfied with their career choices. For the latter class of people, here are the three most successful, proactive steps I've seen them take:

  1. Change your perspective: Sometimes employees fall into situations where the reality for working on product X or team Y at company Z is different from their expectations. It could be a difference in development philosophy (e.g. the employee likes Agile practices like SCRUM and the product team does not practice them), technology choice (e.g. wanting to use the latest technologies whereas the product team has a legacy product in C++ with millions of customers) or one of many other differences in expectations versus reality.

    The realization that leads to satisfaction in this case is that it isn't necessarily the case that what the organization is doing is wrong (e.g. rewriting an app from scratch just to use the latest technologies is never a good idea), it's just that what the employee would prefer isn't a realistic option for the organization or is just a matter of personal preference (e.g. the goal of the organization is to deliver a product on time and under budget not to use a specific development practice). Of course, these are contrived examples but the key point should be clear. If you're unhappy because your organization doesn't meet your expectations it could be that your expectations are what needs adjusting.

  2. Change your organization: As the Mahatma Gandhi quote goes, Be the change you wish to see in the world. Every organization can be changed from within. After all, a lot of the most successful projects at Microsoft and around the software industry came about because a passionate person had a dream and did the leg work to convince enough people to share that dream. Where people often stumble is in underestimating the amount of leg work it takes to convince people to share their dream (and sometimes this leg work may mean writing a lot of code or doing a lot of research).

    For example, when you go from reading Harvard Business School articles like Microsoft vs. Open Source: Who Will Win? in 2005 to seeing an Open Source at Microsoft portal in 2008, you have to wonder what happens behind the scenes to cause that sort of change. It took the prodding of a number of passionate Open Source ambassadors at Microsoft as well as other influences to get this to happen.

  3. Change your job: In some cases, there are irreconcilable differences between an employee and the organization they work for. The employee may have lost faith in the planned direction of the product, the product's management team or the entire company as a whole. In such cases the best thing to do is to part ways amicably before things go south.

    Being on an H-1B visa I'd heard all sorts of horror stories about being the equivalent of an indentured servant when working for an American software company but this has proven to be far from the truth. There is an H-1B transfer process that allows you to switch employers without having to re-apply for a visa or even inform your current employer. If you work at a big company and are paper-work averse, you can stick to switching teams within the company. This is especially true for Microsoft where there are hundreds of very different products (operating systems, databases, Web search engines, video game console hardware, social networking software, IT, web design, billing software, etc) with very different cultures to choose from.

These are the steps I've seen work for friends and coworkers who were unhappy in their jobs and successfully changed their circumstances. The people who don't figure out how to execute one of the steps above eventually become embittered, and they are never a joy to be around.

Now Playing: Estelle (feat. Kanye West) - American Boy


 

A few weeks ago, Joshua Porter posted an excellent analysis of FriendFeed's user interface in his post Thoughts on the Friendfeed interface where he provides this excellent annotated screenshot

In addition to the screenshot, Joshua levels four key criticisms about FriendFeed's current design

  • Too few items per screen
  • Secondary information clogs up each item
  • Difficult to scan content titles quickly
  • People who aren't my friends

The last item is my biggest pet peeve about FriendFeed and the reason I haven't been able to get into the service. FriendFeed goes out of its way to show me content from, and links to, people I don't know and haven't become friends with on the site. In the screenshot above, there are at least twice as many people Joshua isn't friends with showing up on the page as people he knows. Here are the three situations in which FriendFeed commonly shows non-friends and why they are bad:

  1. FriendFeed shows you content from friends of friends: This is a major social faux pas. It may sound like a cool viral feature, but showing me content from people I haven't subscribed to means I don't have control of who shows up in my feed, and it takes away from the intimacy of the site because I'm always seeing content from strangers.
  2. FriendFeed shows you who "liked" some content: Why should I care if some total stranger liked some blog post from a friend of mine? Again, this seems like a viral feature aimed at generating page views from users clicking on the people who liked an item in the feed, but it comes at the cost of visual clutter and a reduction in the intimacy of the service by putting strangers in your face.
  3. FriendFeed shows comments expanded by default in the feed: In the screenshot above, the comment thread for "Overnight Success and FriendFeed Needs" takes up space that could have been used to show another item from one of Joshua's friends. The question to ask is whether a bunch of comments from people Joshua may or may not know is really more valuable to show than an update from one of his friends.

In fact, the majority of Joshua's remaining complaints, including secondary information causing visual clutter and too few items per screen, are a consequence of FriendFeed's decision to take multiple opportunities to push people you don't know in your face on the home page. The need to grow virally by encouraging connections between users is costing FriendFeed by hampering its core user experience.

On the flip side, look at how Facebook has tried to address the issue of prompting users to grow their social graph without spamming the news feed with people you don't know

 

People often claim that activity streams make them feel like they are drowning in a river of noise. FriendFeed compounds this by drowning you in content from people you don't even know and never asked to get content from in the first place.

Rule #1 of every activity stream experience is that users should feel in control of what content they get in their feed. Otherwise, the tendency to succumb to the feeling of "drowning" will be overwhelming.

Now Playing: Lupe Fiasco - Kick, Push


 

Categories: Social Software

Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. The justification for database sharding is that after a certain scale point it is cheaper and more feasible to scale a site horizontally by adding more machines than to grow it vertically by adding beefier servers.

Why Shard or Partition your Database?

Let's take Facebook.com as an example. In early 2004, the site was mostly used by Harvard students as a glorified online yearbook. You can imagine that the entire storage requirements and query load on the database could be handled by a single beefy server. Fast forward to 2008 where just the Facebook application related page views are about 14 billion a month (which translates to over 5,000 page views per second, each of which will require multiple backend queries to satisfy). Besides query load with its attendant IOPs, CPU and memory cost there's also storage capacity to consider. Today Facebook stores 40 billion physical files to represent about 10 billion photos which is over a petabyte of storage. Even though the actual photo files are likely not in a relational database, their metadata such as identifiers and locations still would require a few terabytes of storage to represent these photos in the database. Do you think the original database used by Facebook had terabytes of storage available just to store photo metadata?

At some point during the development of Facebook, they reached the physical capacity of their database server. The question then was whether to scale vertically by buying a more expensive, beefier server with more RAM, CPU horsepower, disk I/O and storage capacity or to spread their data out across multiple relatively cheap database servers. In general if your service has lots of rapidly changing data (i.e. lots of writes) or is sporadically queried by lots of users in a way which causes your working set not to fit in memory (i.e. lots of reads leading to lots of page faults and disk seeks) then your primary bottleneck will likely be I/O. This is typically the case with social media sites like Facebook, LinkedIn, Blogger, MySpace and even Flickr. In such cases, it is either prohibitively expensive or physically impossible to purchase a single server to handle the load on the site. In such situations sharding the database provides excellent bang for the buck with regards to cost savings relative to the increased complexity of the system.

Now that we have an understanding of when and why one would shard a database, the next step is to consider how one would actually partition the data into individual shards. There are a number of options, and their individual tradeoffs are presented below.

How Sharding Changes your Application

In a well designed application, the primary change sharding adds to the core application code is that instead of code such as

//string connectionString = @"Driver={MySQL};SERVER=dbserver;DATABASE=CustomerDB;"; <-- should be in web.config
string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];          
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
          
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader(); 

the actual connection information about the database to connect to depends on the data we are trying to store or access. So you'd have the following instead

string connectionString = GetDatabaseFor(customerId);          
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
         
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader(); 

the assumption here being that the GetDatabaseFor() method knows how to map a customer ID to a physical database location. For the most part everything else should remain the same unless the application uses sharding as a way to parallelize queries. 
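For illustration, here is a minimal sketch of what a GetDatabaseFor() implementation might look like. The server names are placeholders, and the modulo mapping shown is just one of the partitioning schemes discussed in the next section.

// Hypothetical implementation of GetDatabaseFor(). The server names are
// placeholders, and the modulo mapping corresponds to the key/hash based
// partitioning scheme described below.
static readonly string[] shardConnectionStrings = new string[]
{
    "Driver={MySQL};SERVER=dbshard0;DATABASE=CustomerDB;",
    "Driver={MySQL};SERVER=dbshard1;DATABASE=CustomerDB;",
    "Driver={MySQL};SERVER=dbshard2;DATABASE=CustomerDB;"
};

static string GetDatabaseFor(int customerId)
{
    // Pick a shard by taking the customer ID modulo the number of shards.
    int shardIndex = customerId % shardConnectionStrings.Length;
    return shardConnectionStrings[shardIndex];
}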

A Look at Some Common Sharding Schemes

There are a number of different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are four of the most popular schemes used by various large scale Web applications today.

  1. Vertical Partitioning: A simple way to segment your application database is to move tables related to specific features onto their own servers. For example, placing user profile information on one database server, friend lists on another and a third for user generated content like photos and blogs. The key benefit of this approach is that it is straightforward to implement and has low impact on the application as a whole. The main problem with this approach is that if the site experiences additional growth then it may be necessary to further shard a feature-specific database across multiple servers (e.g. handling metadata queries for 10 billion photos by 140 million users may be more than a single server can handle).

  2. Range Based Partitioning: In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on value ranges that occur within each entity. For example, splitting up sales transactions by the year they were created or assigning users to servers based on the first digit of their zip code. The main problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully, then the sharding scheme leads to unbalanced servers. In the previous example, splitting up transactions by date means that the server with the current year gets a disproportionate amount of read and write traffic. Similarly, partitioning users based on their zip code assumes that your user base will be evenly distributed across the different zip codes, which fails to account for situations where your application is popular in a particular region and for the fact that human populations vary across different zip codes.

  3. Key or Hash Based Partitioning: This is often a synonym for user based partitioning for Web 2.0 sites. With this approach, each entity has a value that can be used as input into a hash function whose output is used to determine which database server to use. A simplistic example is to consider the case where you have ten database servers and your user IDs are numeric values incremented by 1 each time a new user is added. In this example, the hash function could perform a modulo operation on the user ID with the number ten and then pick a database server based on the remainder value. This approach should ensure a uniform allocation of data to each server. The key problem with this approach is that it effectively fixes your number of database servers, since adding new servers means changing the hash function, which without downtime is like being asked to change the tires on a moving car.

  4. Directory Based Partitioning: A loosely coupled approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code. This means the GetDatabaseFor() method actually hits a web service or a database which stores/returns the mapping between each entity key and the database server it resides on. This loosely coupled approach means you can perform tasks like adding servers to the database pool or changing your partitioning scheme without having to impact your application (see the sketch after this list). Consider the previous example where there are ten servers and the hash function is a modulo operation. Let's say we want to add five database servers to the pool without incurring downtime. We can keep the existing hash function, add these servers to the pool and then run a script that copies data from the ten existing servers to the five new servers based on a new hash function implemented by performing the modulo operation on user IDs using the new server count of fifteen. Once the data is copied over (although this is tricky since users are always updating their data), the lookup service can change to using the new hash function without any of the calling applications being any wiser that their database pool just grew 50% and that the database they went to for accessing John Doe's pictures five minutes ago is different from the one they are accessing now.
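Here is a rough sketch of what a directory-based lookup might look like, assuming for illustration that the mapping is held in an in-memory dictionary; in practice the directory would live in a small database or behind a web service.

using System.Collections.Generic;

// Hypothetical directory-based shard lookup. Because the key-to-server mapping
// is data rather than a hash function, individual entities can be moved to new
// servers without changing or redeploying application code.
public class ShardDirectory
{
    private readonly IDictionary<int, string> entityToConnectionString;

    public ShardDirectory(IDictionary<int, string> mappings)
    {
        entityToConnectionString = mappings;
    }

    public string GetDatabaseFor(int customerId)
    {
        // Look up which database server currently holds this customer's data.
        return entityToConnectionString[customerId];
    }

    public void Move(int customerId, string newConnectionString)
    {
        // Rebalancing becomes an update to the directory (plus copying the rows),
        // at the cost of the directory being a new single point of failure.
        entityToConnectionString[customerId] = newConnectionString;
    }
}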

Problems Common to all Sharding Schemes

Once a database has been sharded, new constraints are placed on the operations that can be performed on the database. These constraints primarily center around the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by sharding:

  • Joins and Denormalization – Prior to sharding a database, any queries that require joins on multiple tables execute on a single server. Once a database has been sharded across multiple servers, it is often not feasible to perform joins that span database shards due to performance constraints, since data has to be compiled from multiple servers, and due to the additional complexity of performing such cross-server joins.

    A common workaround is to denormalize the database so that queries that previously required joins can be performed from a single table. For example, consider a photo site which has a database which contains a user_info table and a photos table. Comments a user has left on photos are stored in the photos table and reference the user's ID as a foreign key. So when you go to the user's profile it takes a join of the user_info and photos tables to show the user's recent comments.  After sharding the database, it now takes querying two database servers to perform an operation that used to require hitting only one server. This performance hit can be avoided by denormalizing the database. In this case, a user's comments on photos could be stored in the same table or server as their user_info AND the photos table also has a copy of the comment. That way rendering a photo page and showing its comments only has to hit the server with the photos table while rendering a user profile page with their recent comments only has to hit the server with the user_info table.

    Of course, the service now has to deal with all the perils of denormalization such as data inconsistency (e.g. user deletes a comment and the operation is successful against the user_info DB server but fails against the photos DB server because it was just rebooted after a critical security patch).

  • Referential integrity – As you can imagine if there's a bad story around performing cross-shard queries it is even worse trying to enforce data integrity constraints such as foreign keys in a sharded database. Most relational database management systems do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references once they move to using database shards.

    Dealing with data inconsistency issues due to denormalization and lack of referential integrity can become a significant development cost to the service.

  • Rebalancing (Updated 1/21/2009) – In some cases, the sharding scheme chosen for a database has to be changed. This could happen because the sharding scheme was improperly chosen (e.g. partitioning users by zip code) or the application outgrows the database even after being sharded (e.g. too many requests being handled by the DB shard dedicated to photos, so more database servers are needed for handling photos). In such cases, the database shards will have to be rebalanced, which means the partitioning scheme changed AND all existing data moved to new locations. Doing this without incurring downtime is extremely difficult and not supported by any off-the-shelf tools today. Using a scheme like directory based partitioning does make rebalancing a more palatable experience, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).

 

Further Reading

Now Playing: The Kinks - You Really Got Me


 

Categories: Web Development

Bill de hÓra has a blog post entitled Format Debt: what you can't say where he writes

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistenly used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases. 

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of the complaint was that you can't just plug a feed from Picasa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed.

RDF proponents often argue that if we all used RDF-based formats then, instead of having to change your code to support every new photo site's Atom feed with custom extensions, you could create a mapping from the format you don't understand to the one you do using something like the OWL Web Ontology Language. The problem with this argument is that there is already a declarative approach to mapping between XML data formats without having to boil the ocean by convincing everyone to switch to RDF: XSL Transformations (XSLT).

The key problem is that in both cases (i.e. mapping with OWL vs. mapping with XSLT) there is still the problem that Picasa feeds won't work with an app that understands Zooomr's feeds until some developer writes code. Thus we're really debating whether it is cheaper to have the developer write declarative mappings like OWL or XSLT instead of writing new parsing code in their language of choice.

In my experience, creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However, we don't need RDF or other Semantic Web technologies to build such a solution today. XSLT works just fine as a tool for solving exactly that problem.
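As a rough sketch of the drop-in approach I have in mind (the stylesheet file name and feed URL below are placeholders, not real endpoints), the .NET Framework's XslCompiledTransform can apply such a declarative mapping without any format-specific parsing code:

using System.Xml;
using System.Xml.Xsl;

class FeedMapper
{
    // Transforms an unfamiliar photo feed into the format our code already
    // understands by applying a declarative XSLT mapping. Supporting a new
    // photo site means dropping in a new .xslt file, not changing this code.
    static XmlDocument NormalizeFeed(string feedUrl, string mappingStylesheetPath)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        transform.Load(mappingStylesheetPath); // e.g. "picasa-to-internal.xslt" (placeholder name)

        XmlDocument result = new XmlDocument();
        using (XmlReader source = XmlReader.Create(feedUrl))
        using (XmlWriter writer = result.CreateNavigator().AppendChild())
        {
            transform.Transform(source, writer);
        }
        return result;
    }
}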

Now Playing: Lady GaGa & Colby O'Donis - Just Dance


 

Categories: Syndication Technology | XML

It looks like I'll be participating in two panels at the upcoming SXSW Interactive Festival. The descriptions of the panels are below

  1. Feed Me: Bite Size Info for a Hungry Internet

    In our fast-paced, information overload society, users are consuming shorter and more frequent content in the form of blogs, feeds and status messages. This panel will look at the social trends, as well as the technologies, that make feed-based communication possible. Led by Ari Steinberg, an engineering manager at Facebook who focuses on the development of News Feed.

  2. Post Standards: Creating Open Source Specs

    Many of the most interesting new formats on the web are being developed outside the traditional standards process; Microformats, OpenID, OAuth, OpenSocial, and originally Jabber — none of these popular new specs have been standardized by the IETF, OASIS, or W3C. But real hackers are bringing their implementations to projects ranging from open source apps all the way up to the largest companies in the technology industry. While formal standards bodies still exist, their role is changing as open source communities are able to develop specifications, build working code, and promote it to the world. It isn't that these communities don't see the value in formal standardization, but rather that their needs are different than what formal standards bodies have traditionally offered. They care about ensuring that their technologies are freely implementable and are built and used by a diverse community where anyone can participate based on merit and not dollars. At OSCON last year, the Open Web Foundation was announced to create a new style of organization that helps these communities develop open specifications for the web. This panel brings together community leaders from these technologies to discuss the "why" behind the Open Web Foundation and how they see standards bodies needing to evolve to match lightweight community driven open specifications for the web.

If you'll be at SxSw and are a regular reader of my blog who would like to chat in person, feel free to swing by during one or both panels. I'd also be interested in what people who plan to attend either panel would like to get out of the experience. Let me know in the comments.

Now Playing: Estelle - American Boy (feat. Kanye West)


 

Categories: Trip Report

Angus Logan has the scoop

I’m in San Francisco at the 2008 Crunchie Awards and after ~ 350k votes were cast Ray Ozzie and David Treadwell accepted the award for Best Technology Innovation/Achievement on behalf of the Live Mesh team.

The Crunchies are an annual competition co-hosted by GigaOm, VentureBeat, Silicon Alley Insider, and TechCrunch which culminates in awards for the most compelling startup, internet and technology innovations.

Kudos to the Live Mesh folks on getting this award. I can't wait to see what 2009 brings for this product.

PS: I noticed from the TechCrunch post that Facebook Connect was the runner up. I have to give an extra special shout out to my friend Mike for being a key figure behind two of the most innovative technology products of 2008. Nice work man.


 

Categories: Windows Live

Since we released the latest version of the Windows Live What's New feed, which shows what's been going on with your social network at http://home.live.com, we've gotten repeated requests to provide a Windows Vista gadget so people can keep up with their social circle directly from their desktop.

You asked, and now we've delivered. Get it from here.

What I love most about this gadget is that a huge chunk of the work to get this out the door was done by our summer interns from 2008. I love how interns can be around for a short time but provide a ton of bang for the buck while they are here. Hope you enjoy the gadget as much as I have.

Now Playing: Akon, Lil Wayne & Young Jeezy - I'm So Paid


 

Categories: Windows Live

From Palm Pre and Palm WebOS in-depth look we learn

The star of the show was the new Palm WebOS. It's not just a snazzy new touch interface. It's a useful system with some thoughtful ideas that we've been looking for. First of all, the Palm WebOS takes live, while-you-type searching to a new level. On a Windows Mobile phone, typing from the home screen initiates a search of the address book. On the Palm WebOS, typing starts a search of the entire phone, from contacts through applications and more. If the phone can't find what you need, it offers to search Google, Maps and Wikipedia. It's an example of Palm's goal to create a unified, seamless interface.

Other examples of this unified philosophy can be found in the calendar, contacts and e-mail features. The Palm Pre will gather all of your information from your Exchange account, your Gmail account and your Facebook account and display them in a single, unduplicated format. The contact listing for our friend Dave might draw his phone number from our Exchange account, his e-mail address from Gmail and Facebook, and instant messenger from Gtalk. All of these are combined in a single entry, with a status indicator to show if Dave is available for IM chats.

This is the holy grail of contact management experiences on a mobile phone. Today I use Exchange as the master for my contact records and then use tools like OutSync to merge in contact data for my Outlook contacts who are also on Facebook before pushing it all down to my Windows Mobile phone (the AT&T Tilt). Unfortunately this is a manual process and I have to be careful of creating duplicates when importing contacts from different places.

If the Palm Pre can do this automatically in a "live" and always connected way without creating duplicate or useless contacts (e.g. Facebook friends with no phone or IM info shouldn't take up space in my phone contact list) then I might have to take this phone for a test drive.

Anyone at CES get a chance to play with the device up close?

Now Playing: Hootie & The Blowfish - Only Wanna Be With You


 

Categories: Social Software | Technology

As I've mentioned previously, one of the features we shipped in the most recent release of Windows Live is the ability to import your activities from photo sharing sites like Flickr and PhotoBucket or even blog posts from a regular RSS/Atom feed onto your Windows Live profile. You can see this in action on my Windows Live profile.

One question that has repeatedly come up for our service and others like it is how users can get a great experience from importing RSS/Atom feeds of sites that we don't support in a first-class way. A couple of weeks ago Dave Winer asked this of FriendFeed in his post FriendFeed and level playing fields where he writes

Consider this screen (click on it to see the detail):

A picture named ffscrfeen.gif

Suppose you used a photo site that wasn't one of the ones listed, but you had an RSS feed for your photos and favorites on that site. What are you supposed to do? I always assumed you should just add the feed under "Blog" but then your readers will start asking why your pictures don't do all the neat things that happen automatically with Flickr, Picasa, SmugMug or Zooomr sites. I have such a site, and I don't want them to do anything special for it, I just want to tell FF that it's a photo site and have all the cool special goodies they have for Flickr kick in automatically.

If you pop up a higher level, you'll see that this is actually contrary to the whole idea of feeds, which were supposed to create a level playing field for the big guys and ordinary people.

We have a similar problem when importing arbitrary RSS/Atom feeds onto a user's profile in Windows Live. For now, we treat each imported RSS feed as a blog entry and assume it has a title and a body that can be used as a summary. This breaks down if you are someone like Kevin Radcliffe who would like to import his Picasa Web albums. At this point we run smack-dab into the fact that there aren't actually consistent standards around how to represent photo albums from photo sharing sites in Atom/RSS feeds.

Let's look at the RSS/Atom feeds from three of the sites that Dave names that aren't natively supported by Windows Live's Web Activities feature.

Picasa

<item>
  <guid isPermaLink='false'>http://picasaweb.google.com/data/entry/base/user/bo.so.po.ro.sie/albumid/5280893965532109969/photoid/5280894045331336242?alt=rss&amp;hl=en_US</guid>
  <pubDate>Wed, 17 Dec 2008 22:45:59 +0000</pubDate>
  <atom:updated>2008-12-17T22:45:59.000Z</atom:updated>
  <category domain='http://schemas.google.com/g/2005#kind'>http://schemas.google.com/photos/2007#photo</category>
  <title>DSC_0479.JPG</title>
  <description>&lt;table&gt;&lt;tr&gt;&lt;td style="padding: 0 5px"&gt;&lt;a href="http://picasaweb.google.com/bo.so.po.ro.sie/DosiaIPomaraCze#5280894045331336242"&gt;&lt;img style="border:1px solid #5C7FB9" src="http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/s288/DSC_0479.JPG" alt="DSC_0479.JPG"/&gt;&lt;/a&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;font color="#6B6B6B"&gt;Date: &lt;/font&gt;&lt;font color="#333333"&gt;Dec 17, 2008 10:56 AM&lt;/font&gt;&lt;br/&gt;&lt;font color=\"#6B6B6B\"&gt;Number of Comments on Photo:&lt;/font&gt;&lt;font color=\"#333333\"&gt;0&lt;/font&gt;&lt;br/&gt;&lt;p&gt;&lt;a href="http://picasaweb.google.com/bo.so.po.ro.sie/DosiaIPomaraCze#5280894045331336242"&gt;&lt;font color="#3964C2"&gt;View Photo&lt;/font&gt;&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
  <enclosure type='image/jpeg' url='http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/DSC_0479.JPG' length='0'/>
  <link>http://picasaweb.google.com/lh/photo/PORshBK0wdBV0WPl27g_wQ</link>
  <media:group>
    <media:title type='plain'>DSC_0479.JPG</media:title>
    <media:description type='plain'></media:description>
    <media:keywords></media:keywords>
    <media:content url='http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/DSC_0479.JPG' height='1600' width='1074' type='image/jpeg' medium='image'/>
    <media:thumbnail url='http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/s72/DSC_0479.JPG' height='72' width='49'/>
    <media:thumbnail url='http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/s144/DSC_0479.JPG' height='144' width='97'/>
    <media:thumbnail url='http://lh4.ggpht.com/_xRL2P3zJJOw/SUmBJ6RzLDI/AAAAAAAABX8/MkPUBcKqpRY/s288/DSC_0479.JPG' height='288' width='194'/>
    <media:credit>Joanna</media:credit>
  </media:group>
</item>

Smugmug

<entry>
   <title>Verbeast's photo</title>
   <link rel="alternate" type="text/html" href="http://verbeast.smugmug.com/gallery/5811621_NELr7#439421133_qFtZ5"/>
   <content type="html">&lt;p&gt;&lt;a href="http://verbeast.smugmug.com"&gt;Verbeast&lt;/a&gt; &lt;/p&gt;&lt;a href="http://verbeast.smugmug.com/gallery/5811621_NELr7#439421133_qFtZ5" title="Verbeast's photo"&gt;&lt;img src="http://verbeast.smugmug.com/photos/439421133_qFtZ5-Th.jpg" width="150" height="150" alt="Verbeast's photo" title="Verbeast's photo" style="border: 1px solid #000000;" /&gt;&lt;/a&gt;</content>
   <updated>2008-12-18T22:51:58Z</updated>
   <author>
     <name>Verbeast</name>
     <uri>http://verbeast.smugmug.com</uri>
   </author>
   <id>http://verbeast.smugmug.com/photos/439421133_qFtZ5-Th.jpg</id>
   <exif:DateTimeOriginal>2008-12-12 18:37:17</exif:DateTimeOriginal>
 </entry>

Zooomr

 <item>
      <title>ギンガメアジとジンベイ</title>
      <link>http://www.zooomr.com/photos/chuchu/6556014/</link>
      <description>
        &lt;a href=&quot;http://www.zooomr.com/photos/chuchu/&quot;&gt;chuchu&lt;/a&gt; posted a photograph:&lt;br /&gt;

        &lt;a href=&quot;http://www.zooomr.com/photos/chuchu/6556014/&quot; class=&quot;image_link&quot; &gt;&lt;img src=&quot;http://static.zooomr.com/images/6556014_00421b6456_m.jpg&quot; alt=&quot;ギンガメアジとジンベイ&quot; title=&quot;ギンガメアジとジンベイ&quot;  /&gt;&lt;/a&gt;&lt;br /&gt;

      </description>
      <pubDate>Mon, 22 Dec 2008 04:14:52 +0000</pubDate>
      <author zooomr:profile="http://www.zooomr.com/people/chuchu/">nobody@zooomr.com (chuchu)</author>
      <guid isPermaLink="false">tag:zooomr.com,2004:/photo/6556014</guid>
      <media:content url="http://static.zooomr.com/images/6556014_00421b6456_m.jpg" type="image/jpeg" />
      <media:title>ギンガメアジとジンベイ</media:title>
      <media:text type="html">
        &lt;a href=&quot;http://www.zooomr.com/photos/chuchu/&quot;&gt;chuchu&lt;/a&gt; posted a photograph:&lt;br /&gt;

        &lt;a href=&quot;http://www.zooomr.com/photos/chuchu/6556014/&quot; class=&quot;image_link&quot; &gt;&lt;img src=&quot;http://static.zooomr.com/images/6556014_00421b6456_m.jpg&quot; alt=&quot;ギンガメアジとジンベイ&quot; title=&quot;ギンガメアジとジンベイ&quot;  /&gt;&lt;/a&gt;&lt;br /&gt;

      </media:text>
      <media:thumbnail url="http://static.zooomr.com/images/6556014_00421b6456_s.jpg" height="75" width="75" />
      <media:credit role="photographer">chuchu</media:credit>
      <media:category scheme="urn:zooomr:tags">海遊館 aquarium kaiyukan osaka japan</media:category>
    </item>

As you can see from the above XML snippets, there is no consistency in how these services represent photo streams. Even though both Picasa and Zooomr use Yahoo's Media RSS extensions, they generate different markup. Picasa nests the media extensions within a media:group element that is a child of the item element, while Zooomr simply places a grab bag of Media RSS elements such as media:thumbnail and media:content directly as children of the item element. Smugmug takes the cake by simply tunneling some escaped HTML in the atom:content element instead of using explicit metadata to describe the photos.

The bottom line is that it isn't possible to satisfy Dave Winer's request and create a level playing field today because there are no consistently applied standards for representing photo streams in RSS/Atom. This is unfortunate because it means that services have to write one-off code (aka the beautiful fucking snowflake problem) for each photo-sharing site they want to integrate with. Not only is this a lot of unnecessary code, it also prevents such integration from being a simple plug-and-play experience for users of social aggregation services.
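To make the problem concrete, here is a minimal sketch (my own illustration, not code from any actual aggregator) of the kind of one-off normalization code a consumer ends up writing just to pull a thumbnail URL out of each of the three feeds above. It uses LINQ to XML and only knows about the three conventions on display here; every new photo site potentially adds another branch.

using System.Linq;
using System.Xml.Linq;

static class PhotoFeedNormalizer
{
    static readonly XNamespace Media = "http://search.yahoo.com/mrss/";
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // Given a single RSS <item> or Atom <entry>, try each site's convention
    // in turn and return the first thumbnail URL found, or null.
    public static string GetThumbnailUrl(XElement itemOrEntry)
    {
        // Picasa: Media RSS elements are nested inside <media:group>
        var group = itemOrEntry.Element(Media + "group");
        if (group != null)
        {
            var grouped = group.Elements(Media + "thumbnail").FirstOrDefault();
            if (grouped != null)
                return (string)grouped.Attribute("url");
        }

        // Zooomr: <media:thumbnail> hangs directly off the <item>
        var flat = itemOrEntry.Element(Media + "thumbnail");
        if (flat != null)
            return (string)flat.Attribute("url");

        // Smugmug: no media metadata at all, so scrape the escaped HTML in
        // <atom:content> for the first <img src="..."> (ugly, but that's the point)
        var html = (string)itemOrEntry.Element(Atom + "content");
        if (html != null)
        {
            var match = System.Text.RegularExpressions.Regex.Match(
                html, "<img[^>]+src=\"([^\"]+)\"");
            if (match.Success)
                return match.Groups[1].Value;
        }

        return null;
    }
}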

So far, the closest thing to a standard in this space is Media RSS, but as the name states it is an RSS-based format and doesn't really fit the Atom syndication format's data model. This is why Martin Atkins has started working on Atom Media Extensions, an effort to create a similar set of media extensions for the Atom syndication format.

What I like about the first draft of Atom Media Extensions is that it is focused on the basic case of syndicating audio, video and images for use in activity streams, and doesn't carry some of the search-related and feed-republishing baggage you see in related formats like Media RSS or the iTunes RSS extensions.

The interesting question is how to get the photo sites out there to adopt consistent standards in this space. Maybe we can get Google to add it to their Open Stack™, since they've been pretty good at getting social sites to adopt their standards and are generally good at evangelization.

Note Now Playing: DMX - Ruff Ryders' Anthem Note


 

One of the features we recently shipped in Windows Live is the ability to link your Windows Live profile to your Flickr account so that whenever you add photos to Flickr they show up on your profile and in the What's New list of members of your network in Windows Live. Below are the steps to adding your Flickr photos to your Windows Live profile.

1. Go to your Windows Live profile at http://profile.live.com and locate the link to your Web Activities on the bottom left

2. Click the link to add Web Activities which should take you to http://profile.live.com/WebActivities/ shown below. Locate Flickr on that page.

3.) Click on the "Add" link for Flickr which should take you to http://profile.live.com/WebActivities/Add.aspx?appid=1073750531 shown below

4. Click on the link to sign in to Flickr. This should take you to the Flickr sign-in page shown below (if you aren't already signed in)

 

5. After signing in, you will need to grant Windows Live access to your Flickr photo stream. Click the "OK I'll Allow It" button shown below

6. You should then be redirected to Windows Live where you can complete the final step and link both accounts. In addition, you can decide who should be able to view your Flickr photos on your Windows Live profile as shown below

 

7.) After pushing the "Add" button you should end up back on your profile with your Flickr information now visible on it.

8. People in your network can now see your Flickr updates in various Windows Live applications including Windows Live Messenger as shown below

 

PS: The same basic set of steps works for adding activities from Twitter, Pandora, StumbleUpon, Flixster, PhotoBucket, Yelp, iLike, blogs hosted on WordPress.com, or any RSS/Atom feed to your Windows Live profile. Based on announcements at CES yesterday, you'll soon be able to add your activities from Facebook to Windows Live as well.

Note Now Playing: DMX - Party Up (Up in Here) Note


 

Categories: Windows Live

In my recent post on building a Twitter search engine on Windows Azure I questioned the need to expose the notion of both partition and row keys to developers on the platform. Since then I've had conversations with a couple of folks at work that indicate I should have stated my concerns more explicitly. So here goes.

The documentation on Understanding the Windows Azure Table Storage Data Model states the following

PartitionKey Property

Tables are partitioned to support load balancing across storage nodes. A table's entities are organized by partition. A partition is a consecutive range of entities possessing the same partition key value. The partition key is a unique identifier for the partition within a given table, specified by the PartitionKey property. The partition key forms the first part of an entity's primary key. The partition key may be a string value up to 32 KB in size.

You must include the PartitionKey property in every insert, update, and delete operation.

RowKey Property

The second part of the primary key is the row key, specified by the RowKey property. The row key is a unique identifier for an entity within a given partition. Together the PartitionKey and RowKey uniquely identify every entity within a table.

The row key is a string value that may be up to 32 KB in size.

You must include the RowKey property in every insert, update, and delete operation.

In my case I'm building an application to represent users in a social network, and each user is keyed by user ID (e.g. their Twitter user name). In my application I only have one unique key, and it identifies each row of user data (e.g. profile pic, location, latest tweet, follower count, etc.). My original intuition was to use the unique ID as the row key while letting the partition key be a single constant value, since the partition key is essentially a hint that says "this data belongs on the same machine," which in my case seemed like overkill.

Where this design breaks down is when I end up storing more data than the Windows Azure system can or wants to fit on a single storage node. For example, what if I've actually built a Facebook crawler (140 million users) and I cache people's profile pics locally (10 kilobytes each)? That ends up being roughly 1.3 terabytes of data. I highly doubt that the Azure system will allocate 1.3 terabytes of storage on a single server for a single developer, and even if it did the transaction performance would suffer. So the only reasonable assumption is that the data will either be split across various nodes at some threshold [which the developer doesn't know] or at some point the developer gets a "disk full" error (i.e. a bad choice which no platform would make).

On the other hand, if I decide to use the user ID as the partition key then I am in essence allowing the system to theoretically store each user on a different machine, or at least split my data across the entire cloud. That sucks for me if all I have is three million users for which I'm only storing 1K of data each, since the whole thing could fit on a single storage node. Of course, the Windows Azure system could be smart enough not to split up my data since it fits underneath some threshold [which the developer doesn't know]. This approach also allows the system to take advantage of parallelism across multiple machines if it does split my data.

Thus I'm now leaning towards the user ID being the partition key instead of the row key. So what advice do the system's creators actually have for developers?

Well from the discussion thread POST to Azure tables w/o PartitionKey/RowKey: that's a bug, right? on the MSDN forums there is the following advice from Niranjan Nilakantan of Microsoft

If the key for your logical data model has more than 1 property, you should default to (multiple partitions, multiple rows for each partition).

If the key for your logical data model has only one property, you would default to (multiple partitions, one row per partition).

We have two columns in the key to separate what defines uniqueness(PartitionKey and RowKey) from what defines scalability(just PartitionKey).
In general, write and query times are less affected by how the table is partitioned.  It is affected more by whether you specify the PartitionKey and/or RowKey in the query.

So that answers the question and validates the conclusions we eventually arrived at. It seems we should always use the partition key as the primary key and may optionally want to use a row key as a secondary key, if needed.

In that case, the fact that items with different partition keys may or may not be stored on the same machine seems to be an implementation detail that shouldn't matter to developers since there is nothing they can do about it anyway. Right?
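For what it's worth, here is a rough sketch of what that design ends up looking like for my Twitter user table. The class and property names below are my own invention; the only things the Table Storage API actually cares about are the PartitionKey, RowKey and Timestamp properties. Since the user ID is my only logical key it becomes the partition key, and the row key is just a constant, which matches Niranjan's "one row per partition" guidance above.

using System;

// Sketch of the entity shape for the Twitter user table described above.
public class TwitterUserEntity
{
    // Scalability: each user gets their own partition, so the system is free
    // to spread users across storage nodes (or keep them together) as it sees fit.
    public string PartitionKey { get; set; }   // e.g. the Twitter user name

    // Uniqueness within the partition: with only one logical key there is
    // exactly one row per partition, so a constant such as "" works fine.
    public string RowKey { get; set; }

    public DateTime Timestamp { get; set; }    // maintained by the storage service

    // The actual user data.
    public string ProfilePicUrl { get; set; }
    public string Location { get; set; }
    public string LatestTweet { get; set; }
    public int FollowerCount { get; set; }
}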

Note Now Playing: Scarface - Hand of the Dead Body Note


 

Categories: Web Development

I've been spending some time thinking about the ramifications of centralized identity plays coming back into vogue with the release of Facebook Connect, MySpaceID and Google's weird amalgam Google Friend Connect. Slowly I began to draw parallels between the current situation and a different online technology battle from half a decade ago.

About five years ago, one of the most contentious issues among Web geeks was the RSS versus Atom debate. On the one hand there was RSS 2.0, a widely deployed and fairly straightforward XML syndication format which had some ambiguity around the spec but whose benevolent dictator had declared the spec frozen to stabilize the ecosystem around the technology.  On the other hand you had the Atom syndication format, an up and coming XML syndication format backed by a big company (Google) and a number of big names in the Web standards world (Tim Bray, Sam Ruby, Mark Pilgrim, etc) which intended to do XML syndication the right way and address some of the flaws of RSS 2.0.

During that time I was an RSS 2.0 backer even though I spent enough time on the atom-syntax mailing list to be named as a contributor on the final RFC. My reasons for supporting RSS 2.0 are captured in my five-year-old blog post The ATOM API vs. the ATOM Syndication Format, which contained the following excerpt

Based on my experiences working with syndication software as a hobbyist developer for the past year is that the ATOM syndication format does not offer much (if anything) over RSS 2.0 but that the ATOM API looks to be a significant step forward compared to previous attempts at weblog editing/management APIs especially with regard to extensibility, support for modern practices around service oriented architecture, and security.
...
Regardless of what ends up happening, the ATOM API is best poised to be the future of weblog editting APIs. The ATOM syndication format on the other hand...

My perspective was that the Atom syndication format was a burden on consumers of feeds since it meant they had to add yet another XML syndication format to the list of formats they supported: RSS 0.91, RSS 1.0, RSS 2.0 and now Atom. However, the Atom Publishing Protocol (AtomPub) was clearly an improvement to the state of the art at the time and a welcome addition to the blog software ecosystem. It would have been the best of both worlds if AtomPub had simply used RSS 2.0, so we'd have gotten the benefits with none of the pain of duplicate syndication formats.

As time has passed, it looks like I was both right and wrong about how things would turn out. The Atom Publishing Protocol has been more successful than I could have ever imagined. It not only became a key blog editing API but evolved into a key technology for accessing data from cloud-based sources, one that has been embraced by big software companies like Google (GData) and Microsoft (ADO.NET Data Services, Live Framework, etc.). This is where I was right.

I was wrong about how much of a burden having multiple XML syndication formats would be on developers and end users. Although it is unfortunate that every consumer of XML feeds has to write code to process both RSS and Atom, this has not been a big deal. For one, this code has quickly been abstracted out into libraries on the majority of popular platforms, so only a few developers have had to deal with it. Similarly, end users haven't had to deal with this fragmentation much either. At first some sites did put out feeds in multiple formats, which just ended up confusing users, but that is mostly a thing of the past. Today most end users interacting with feeds have no reason to know about the distinction between Atom and RSS, since for the most part there is none when you are consuming a feed from Google Reader, RSS Bandit or your favorite RSS reader.

I was reminded of this turn of events when reading John McCrea's post As Online Identity War Breaks Out, JanRain Becomes “Switzerland”, where he wrote

Until now, JanRain has been a pureplay OpenID solution provider, hoping to build a business just on OpenID, the promising open standard for single sign-on. But the company has now added Facebook as a proprietary login choice amidst the various OpenID options on RPX, a move that shifts them into a more neutral stance, straddling the Facebook and “Open Stack” camps. In my view, that puts JanRain in the interesting and enviable position of being the “Switzerland” of the emerging online identity wars.


For site operators, RPX offers an economical way to integrate the non-core function of “login via third-party identity providers” at a time when the choices in that space are growing and evolving rapidly. So, rather than direct its own technical resources to integrating Facebook Connect and the various OpenID implementations from MySpace, Google, Yahoo, AOL, Microsoft, along with plain vanilla OpenID, a site operator can simply outsource all of those headaches to JanRain.

Just as standard libraries like the Universal Feed Parser and the Windows RSS platform insulated developers from the RSS vs. Atom format war, JanRain's RPX means that individual developers don't have to worry about the differences between supporting proprietary technologies like Facebook Connect and Open Stack™ technologies like OpenID.

At the end of the day, it is quite likely that the underlying technologies will not matter to anyone but a handful of Web standards geeks and library developers. What is important is that sites participate in the growing identity provider ecosystem, not what technology they use to do so.

Note Now Playing: Yung Wun featuring DMX, Lil' Flip & David Banner - Tear It Up Note


 

Categories: Web Development

Last month James Governor of Redmonk had a blog post entitled Asymmetrical Follow: A Core Web 2.0 Pattern where he made the following claim

You’re sitting at the back of the room in a large auditorium. There is a guy up front, and he is having a conversation with the people in the front few rows. You can’t hear them quite so well, although it seems like you can tune into them if you listen carefully. But his voice is loud, clear and resonant. You have something to add to the conversation, and almost as soon as you think of it he looks right at you, and says thanks for the contribution… great idea. Then repeats it to the rest of the group.

That is Asymmetrical Follow.

When Twitter was first built it was intended for small groups of friends to communicate about going to the movies or the pub. It was never designed to cope with crazy popular people like Kevin Rose (@kevinrose 76,185 followers), Jason Calacanis (@jasoncalacanis 42,491), and Scobleizer (@scobleizer 41,916). Oh yeah, and some dude called BarackObama (@barackobama 141,862)

If you’re building a social network platform its critical that you consider the technical and social implications of Asymmetrical Follow. You may not expect it, but its part of the physics of social networks. Shirky wrote the book on this. Don’t expect a Gaussian distribution.

Asymmetric Follow is a core pattern for Web 2.0, in which a social network user can have many people following them without a need for reciprocity.

James Governor mixes up two things in his post, which at first made it difficult to agree with its premise. The first is the specifics of the notion of a follower on Twitter, where he focuses on the fact that someone may not follow you but you can still follow them and get their attention by sending them an @reply, which is then rebroadcast to their audience when they reply to your tweet. This particular feature is not a core design pattern of social networking sites (or Web 2.0 or whatever you want to call it).

The second point is that social networks have to deal with the nature of popularity in social circles, as aptly described in Clay Shirky's 2003 essay Power Laws, Weblogs, and Inequality. In every social ecosystem there will be people who are orders of magnitude more popular than others. Mike Arrington's blog is hundreds of times more popular than mine. My blog is hundreds of times more popular than my wife's. To adequately reflect this reality of social ecosystems, social networking software should scale up to being usable both by the super-popular and by the long tail of unpopular users. Different social applications support this in different ways. Twitter supports it by making the act of showing interest in another user a one-way relationship that doesn't have to be reciprocated (i.e. a follower) and then not capping the number of followers a user can have. Facebook supports it by creating special accounts for super-popular users called Facebook Pages, which also have a one-way relationship between the popular entity and its fans. Like Twitter, there is no cap on the number of fans a Facebook Page can have. Facebook differs from Twitter by forcing super-popular users to have a different representation from regular users.
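To put the distinction in concrete terms, here is a toy sketch (mine, and obviously not how either service is actually implemented) of the difference between Twitter-style one-way following and a reciprocal friendship model:

using System.Collections.Generic;

class FollowGraph
{
    // A directed edge: follower -> followed. Nothing requires the reverse edge
    // to exist, and nothing caps how many edges can point at a single user.
    readonly HashSet<(string Follower, string Followed)> edges =
        new HashSet<(string Follower, string Followed)>();

    public void Follow(string follower, string followed) =>
        edges.Add((follower, followed));

    public bool Follows(string a, string b) => edges.Contains((a, b));

    // A reciprocal "friendship" (the classic Facebook model) is just the
    // special case where both directed edges exist.
    public bool AreFriends(string a, string b) => Follows(a, b) && Follows(b, a);
}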

In general, I agree that supporting the notion of super-popular users who have lots of fellow users as their "fans" or "followers" is a key feature that every social software application should provide natively. Applications that don't are artificially limiting their audience and penalizing their popular users.

Does that make it a core pattern for "Web 2.0"? I guess so.

Note Now Playing: David Banner - Play Note


 

Categories: Social Software

I spent the last few days hacking on a side project that I thought some of my readers might find interesting; you can find it at http://hottieornottie.cloudapp.net 

I had several goals when embarking on this project

After a few days of hacking I'm glad to say I've achieved every goal I wanted to get out of this experiment. I'd like to thank Matt Cutts for the initial idea on how to implement this, and Kevin Marks for saving me from having to write a Twitter crawler by reminding me of Google's Social Graph API.

What it does and how it works

The search experiment provides four kinds of searches

  1. The search functionality with no options checked is exactly the same as search.twitter.com

  2. Checking "Search Near Me" finds all tweets posted by people who are within 30 miles of your geographical location (requires JavaScript). Your geographical location is determined from your IP address while the geographical location of the tweets is determined from the location fields of the Twitter profiles of the authors. Nice way to find out what people in your area are thinking about local news.

  3. Checking 'Sort By Follower Count' is my attempt to jump on the authority-based Twitter search bandwagon. I don't think it's very useful but it was easy to code. Follower counts are obtained via the Google Social Graph API.

  4. Checking 'Limit to People I Follow' requires you to also specify your user name, and then all search results are filtered to only return results from people you follow (requires JavaScript). This feature only works for a small subset of Twitter users that have been encountered by a crawler I wrote. The application is crawling Twitter friend lists as you read this, and anyone I follow should already have their friend list crawled. If it doesn't work for you, check back in a few days. It's been slow going since Twitter puts a 100-request-per-hour cap on crawlers.
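For the curious, the 30-mile check behind "Search Near Me" boils down to a great-circle distance comparison once both locations have been geocoded to latitude and longitude. Below is a rough sketch of the haversine calculation; the class and method names are placeholders rather than the actual service code, and the IP lookup and geocoding of profile locations are out of scope here.

using System;

static class GeoFilter
{
    const double EarthRadiusMiles = 3959.0;

    // Great-circle (haversine) distance between two lat/long points, in miles.
    public static double DistanceInMiles(double lat1, double lon1,
                                         double lat2, double lon2)
    {
        double dLat = ToRadians(lat2 - lat1);
        double dLon = ToRadians(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRadians(lat1)) * Math.Cos(ToRadians(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return EarthRadiusMiles * 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
    }

    // True if the two points are within the given radius (30 miles by default).
    public static bool IsNearby(double lat1, double lon1,
                                double lat2, double lon2,
                                double radiusMiles = 30.0) =>
        DistanceInMiles(lat1, lon1, lat2, lon2) <= radiusMiles;

    static double ToRadians(double degrees) => degrees * Math.PI / 180.0;
}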

Developing on Windows Azure: Likes

After building a small-scale application with Windows Azure, there are definitely a number of things I like about the experience. The number one thing I loved was the integrated deployment story with Visual Studio. I can build a regular ASP.NET application on my local machine that uses either cloud or local storage resources, and all it takes is a few mouse clicks to go from my code running on my machine to my code running on computers in Microsoft's data center, either in a staging environment or in production. The fact that the data access APIs are all RESTful makes it super easy to point the app running on your machine at either cloud storage or local storage simply by changing some base URIs in a configuration file.

Another aspect of Windows Azure that I thought was great is how easy it is to create background processing tasks. It was very straightforward to create a Web crawler that builds up a copy of Twitter's social graph simply by adding a "Worker Role" to my project. I've criticized Google App Engine in the past for not supporting the ability to create background tasks, so it is nice to see this feature in Microsoft's platform-as-a-service offering.
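As a rough illustration of what the worker does (the Azure role plumbing and the real Twitter and storage calls are hidden behind placeholder stubs here), the crawl loop is little more than the following:

using System;
using System.Threading;

class CrawlerLoop
{
    // Twitter caps the crawler at 100 requests an hour, so pace the loop
    // at roughly one request every 36 seconds.
    static readonly TimeSpan Pause = TimeSpan.FromSeconds(36);

    public void Run()
    {
        while (true)
        {
            // Pop the next user from a work queue, fetch their friend list
            // from the Twitter API, persist the edges and enqueue anyone
            // we haven't seen before.
            string user = DequeueNextUser();
            if (user != null)
            {
                foreach (string friend in FetchFriends(user))
                    StoreEdgeAndMaybeEnqueue(user, friend);
            }
            Thread.Sleep(Pause);
        }
    }

    // Illustrative stubs, not real Azure storage or Twitter API calls.
    string DequeueNextUser() => null;
    string[] FetchFriends(string user) => Array.Empty<string>();
    void StoreEdgeAndMaybeEnqueue(string user, string friend) { }
}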

Developing on Windows Azure: Dislikes

The majority of my negative experiences were related to teething problems I'd associate with this being a technology preview that still needs polishing. I hit a rather frustrating bug where half the time I tried to run my application it would hang and I'd have to try again after several minutes. There were also issues with the Visual Studio integration where removing or renaming parts of the project from the Visual Studio UI didn't modify all of the related configuration files, so the app was in a broken state until I mended it by hand. Documentation is another place where there is still a lot of work to do. My favorite head-scratching moment is that there is an x-ms-Metadata-ApproximateMessagesCount HTTP header which returns the approximate number of messages in a queue. It is unclear whether "approximate" refers to the fact that messages in the queue have an "invisibility period" (between when they are popped from the queue and when they are deleted) during which they can't be accessed, or whether it refers to some other heuristic that determines the size of the queue. Then there's the fact that the documentation says you need a partition key and row key for each entry you place in a table but doesn't really explain why, or how you are supposed to pick these keys. In fact, the documentation currently makes the notion of partition keys seem like an example of unnecessarily surfacing implementation details of Windows Azure to developers in a way that leads to confusion and cargo cult programming.

One missing piece is the lack of good tools for debugging your application once it is running in the cloud. When it is running on your local machine there is a nice viewer for keeping an eye on the log output from your application, but once it is in the cloud your only option is to have the logs dropped to some directory in the cloud and then run one of the code samples to access those logs from your local machine. Since this is a technology preview it is expected that the tooling wouldn't all be there, but it is a cumbersome process as it exists today. Besides accessing your debug output, there is also the matter of seeing what data your application is actually creating, retrieving and otherwise manipulating in storage. You can use SQL Server Management Studio to look at your data in Table Storage on your local machine, but there isn't a similar experience in the cloud. Neither blob nor queue storage has any off-the-shelf tools for inspecting its contents locally or in the cloud, so developers have to write custom code by hand. Perhaps this is somewhere the developer community can step up with some Open Source tools (e.g. David Aiken's Windows Azure Online Log Reader), or perhaps some commercial vendors will step in as they have in the case of Amazon's Web Services (e.g. RightScale)?

Outside of the polish issues and bugs, there was only one aspect of Windows Azure development I disliked: the structured data/relational schema development process. Windows Azure has a Table Storage API which provides a RESTful interface to a row-based data store similar in concept to Google's BigTable. Trying to program locally against this API is rather convoluted and requires writing your classes first, then running some object<->relational translation tools on your assemblies. This is probably a consequence of my not being a big believer in ORM tools, so having to write objects before I can access my DB seems backwards to me. This gripe may just be a matter of preference, since a lot of folks who use Rails, Django and various other ORM technologies seem fine with having primarily an object facade over their databases.

Update: Early on in my testing I got a "The requested operation is not implemented on the specified resource" error when trying out a batch query and incorrectly concluded that the Table Storage API did not support complex OR queries. It turns out the problem was that I was doing a $filter query using the tolower function. Once I took out the tolower() it was straightforward to construct queries with a bunch of OR clauses so I could request multiple row keys at once.
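In case anyone else trips over the same error, here's roughly what such a filter looks like on the wire once the tolower() call is gone. The table name and key values below are made up for illustration, and the spaces would be URL encoded in an actual request:

GET http://<account>.table.core.windows.net/TwitterUsers()?$filter=(RowKey eq 'kevinrose') or (RowKey eq 'scobleizer') or (RowKey eq 'jasoncalacanis')

The equivalent LINQ query is just a series of equality comparisons on the key property ORed together with ||.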

I'll file this under "documentation issues" since there is a list of unsupported LINQ query operators and unsupported LINQ comparison operators but not a list of unsupported query expression functions in the Table Storage API documentation. Sorry about any confusion and thanks to Jamie Thomson for asking about this so I could clarify.

Besides the ORM issue, I felt that I was missing some storage capabilities when trying to build my application. One of the features I started building before going with the Google Social Graph API was a quick way to provide the follower counts for a batch of users. For example, I'd get 100 search results from the Twitter API and would then need to look up the follower count of each user that showed up in the results for use in sorting. However, there was no straightforward way to implement this lookup service in Windows Azure. Traditionally, I'd have used one of the following options:

  1. Create a table of {user_id, follower_count} in a SQL database and then use batches of ugly select statements like SELECT id, follower_count FROM follower_tbl WHERE id=xxxx OR id=yyyy OR id=zzzz OR ….
  2. Create tuples of {user_id, follower_count} in an in-memory hash table like memcached and then do a bunch of fast hash table lookups to get the follower count for each user (a sketch of this approach is below)
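To be concrete about what I mean by option 2, here is a toy sketch of the batch lookup shape; a plain dictionary stands in for memcached since the point is the lookup pattern rather than the wire protocol.

using System.Collections.Generic;
using System.Linq;

class FollowerCountCache
{
    // Stand-in for memcached: user_id -> follower_count.
    readonly Dictionary<string, int> cache = new Dictionary<string, int>();

    public void Put(string userId, int followerCount) => cache[userId] = followerCount;

    // Given the ~100 user ids that came back from the Twitter search API,
    // return whatever counts we have so the results can be sorted by "authority".
    public IDictionary<string, int> GetFollowerCounts(IEnumerable<string> userIds) =>
        userIds.Where(cache.ContainsKey)
               .ToDictionary(id => id, id => cache[id]);
}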

Neither of these options is possible given the three data structures that Windows Azure gives you. It could be that these missing pieces are intended to be provided by SQL Data Services, which I haven't taken a look at yet. If not, the lack of this functionality will be a sticking point for developers making the switch from traditional Web development platforms.

Note Now Playing: Geto Boys - Gangsta (Put Me Down) Note


 

Categories: Personal | Programming