Tuesday, 27 January 2009 - Dare Obasanjo's weblog

January 27, 2009

@ 08:02 AM

Asking "should we trust the cloud" is like asking "should we trust horseless carriages"

Sarah Perez over on ReadWriteWeb has a blog post entitled In Cloud We Trust? where she states

Cloud computing may have been one of the biggest "buzzwords" (buzz phrases?) of this past year. From webmail to storage sites to web-based applications, everything online was sold under a new moniker in 2008: they're all "cloud" services now. Yet even though millions of internet users make use of these online services in some way, it seems that we haven't been completely sold on the cloud being any more safe or stable than data stored on our own computers.
...
Surprisingly, even on a site that tends to attract a lot of technology's earliest adopters, the responses were mixed. When asked the question: "Do you trust the cloud?," the majority of responses either came back as a flat-out "no" or as a longer explanation as to why their response was a "maybe" or a "sometimes." In other words, some people trust the cloud here, but not there, or for this, but not that.

The question this article asks is pointless on several levels. First of all, it doesn't really matter if people trust the cloud or not. What matters is whether they use it or not. The average person doesn't trust computers, automobile mechanics or lawyers yet they use them anyway. Given the massive adoption of the Web from search engines and e-commerce sites to Web-based email and social networking services, it is clear that the average computer person trusts the cloud enough to part with their personal information and their money. Being scared and untrusting of the cloud is like being scared and untrusting of computers, it is a characteristic that belongs to an older generation while the younger generation couldn't imagine life any other way. It's like Douglas Adams wrote in his famous essay How to Stop Worrying and Learn to Love the Internet back in 1999.

Secondly, people are often notoriously bad at assessing risk and often fail to consider that it is more likely that data loss will occur when their personal hardware fails given that the average computer user doesn't have a data backup strategy than it is likely to occur if their information is stored on some Web company's servers. For example, I still have emails from the last decade available to me in my Hotmail and Yahoo! Mail accounts. On the other hand, my personal archive of mail from the early 2000s which had survived being moved across three different desktop PCs was finally lost when the hard drive failed on my home computer a few months ago. I used to have a personal backup strategy for my home desktop but gave up after encountering the kinds of frustrations Mark Pilgrim eloquently rants about in his post Juggling Oranges. These days, I just put all the files and photos I'm sure I'd miss on SkyDrive and treat any file not worth uploading the cloud as being transient anyway. It is actually somewhat liberating since I no longer feel like I'm owned by all my digital stuff I have to catalog, manage and archive.

On a final note, the point isn't that there aren't valid concerns raised whenever this question is brought up. However progress will march on despite our Luddite concerns because the genie is already out of the bottle. For most people the benefits of anywhere access to their data from virtually any device and being able to share their content with people on the Web far outweighs the costs of not being in complete control of the data. In much the same way, horseless carriages (aka automobiles) may cause a lot more problems than horse drawn carriages from the quarter ton of carbon monoxide poured into the air per year by an average car to the tens of thousands of people killed each year in car crashes, yet the benefits of automobiles powered by internal combustion engines is so significant that humanity has decided to live with the problems that come along with them.

The ~~cloud~~ Web is already here and it is here to stay. It's about time we stopped questioning it and got used to the idea.

Note Now Playing: Amy Winehouse - Love Is A Losing Game Note

Categories: Cloud Computing

January 25, 2009

@ 02:55 PM

Comments [8]

Are we living through a deflationary spiral?

I've been hearing the term deflation being bandied about by TV news pundits with Japan being used as the popular example of this phenomenon for the past few months. The claim is that if the trend of price drops in the U.S. continues then we will be headed for deflation. When I first saw these stories I wondered what exactly is wrong with falling prices? Well…

Everywhere you look assets are worth less than they were a year ago. Gas prices are lower, house prices are lower, and stock portfolios are lower. At first I considered this a net positive. According to Zillow my house is now worth 10% less than what we paid for it almost two years ago and my 401(k) lost about 35% during the 2008 calendar year. However I treated this as a "correction" since these were in effect paper losses since I bought my house to live in not to flip it and I'm not planning to spend out of my 401(k) until retirement. On the other hand lower gas prices and cheaper consumer goods at Christmas time had a real and positive effect on my bottom line.

In addition, despite the media's claim that people were hoarding we planned to ignore the hype and help the local economy by continuing our plans to have our bathroom remodeled and get a new car later in the year (my wife has her eye on the Ford Flex).

As each week passes I've been less sure of our plans to "help the local economy" and last week's announcement by my employer to eliminate 5,000 jobs within the next 18 months began to make the plans seem downright irresponsible. At this point we've decided to hold off on the purchases and are debating the safest way to hold on the money and still retain value. My behavior sounded familiar and I looked up Wikipedia for information about deflation in Japan it was interesting to see the parallels

Systemic reasons for deflation in Japan can be said to include:

Fallen asset prices. There was a rather large price bubble in both equities and real estate in Japan in the 1980s (peaking in late 1989). When assets decrease in value, the money supply shrinks, which is deflationary.

Insolvent companies: Banks lent to companies and individuals that invested in real estate. When real estate values dropped, these loans could not be paid. The banks could try to collect on the collateral (land), but this wouldn't pay off the loan. Banks have delayed that decision, hoping asset prices would improve. These delays were allowed by national banking regulators. Some banks make even more loans to these companies that are used to service the debt they already have. This continuing process is known as maintaining an "unrealized loss", and until the assets are completely revalued and/or sold off (and the loss realized), it will continue to be a deflationary force in the economy. Improving bankruptcy law, land transfer law, and tax law have been suggested (by The Economist) as methods to speed this process and thus end the deflation.

Insolvent banks: Banks with a larger percentage of their loans which are "non-performing", that is to say, they are not receiving payments on them, but have not yet written them off, cannot lend more money; they must increase their cash reserves to cover the bad loans.

Fear of insolvent banks: Japanese people are afraid that banks will collapse so they prefer to buy gold or (United States or Japanese) Treasury bonds instead of saving their money in a bank account. This likewise means the money is not available for lending and therefore economic growth. This means that the savings rate depresses consumption, but does not appear in the economy in an efficient form to spur new investment. People also save by owning real estate, further slowing growth, since it inflates land prices.

Imported deflation: Japan imports Chinese and other countries' inexpensive consumable goods, raw materials (due to lower wages and fast growth in those countries). Thus, prices of imported products are decreasing. Domestic producers must match these prices in order to remain competitive. This decreases prices for many things in the economy, and thus is deflationary.

The crazy thing is that this sounds like a description of the United States of America today. That's when I understood that the threat of a deflationary spiral is very real. If people like me start hoarding cash then businesses get less customers which causes them to lower prices in return. With lower prices, they make less money and thus need to cut costs so they have layoffs. Now the local situation is worse giving people even more conviction in holding on to their cash and so on. The bit about imported goods kicking the butts of locally produced items in the marketplace is also especially apt given the recent drama about the automobile bailout.

Anyway, it looks like we are at the start of a deflationary spiral. The interesting question is whether there is anything anyone can do to stop it before it fully gets underway.

Categories: Current Affairs | Personal

January 22, 2009

@ 02:54 PM

Comments [2]

Cloud Computing Conundrum: Platform as a Service vs. Utility Computing

A few months ago, Tim O'Reilly wrote an excellent post entitled Web 2.0 and Cloud Computing where provides some definitions of two key cloud computing paradigms, Utility Computing and Platform as a Service. His descriptions of these models can be paraphrased as

Utility Computing: In this approach, a vendor provides access to virtual server instances where each instance runs a traditional server operating system such as Linux or Windows Server. Computation and storage resources are metered and the customer can "scale infinitely" by simply creating new server instances. The most popular example of this approach is Amazon EC2.
Platform as a Service: In this approach, a vendor abstracts away the notion of accessing traditional LAMP or WISC stacks from their customers and instead provides an environment for running programs written using a particular platform. In addition, data storage is provided via a custom storage layer and API instead of traditional relational database access. The most popular example of this approach is Google App Engine.

The more I interact with platform as a service offerings, the more I realize that although they are more easily approachable for getting started there is a cost because you often can't reuse your existing skills and technologies when utilizing such services. A great example of this is Don Park's post about developing on Google's App Engine entitled So GAE where he writes

What I found frustrating while coding for GAE are the usual constraints of sandboxing but, for languages with rich third-party library support like Python, it gets really bad because many of those libraries have to be rewritten or replaced to varying degrees. For example, I couldn’t use existing crop of Twitter client libraries so I had to code the necessary calls myself. Each such incident is no big deal but the difference between hastily handcrafted code and libraries polished over time piles up.

I expect that the inability of developers to simply use the existing libraries and tools that they are familiar with on services like Google App Engine is going to be an adoption blocker. However I expect that the lack of of a "SQL database in the cloud" will actually be an even bigger red flag than the fact that some APIs or libraries from your favorite programming language are missing.

A friend of mine who runs his own software company recently mentioned that one of the biggest problems he has delivering server-based software to his customers is that eventually the database requires tuning (e.g. creating indexes) and there is no expertise on-site at the customer to perform these tasks. He wanted to explore whether a cloud based offering like the Azure Services Platform could help. My response was that it would if he was willing to rewrite his application to use a table based storage system instead of a relational database. In addition, aside from using a new set of APIs for interacting with the service he'd also have to give up relational database features like foreign keys, joins, triggers and stored procedures. He thought I was crazy to even suggest that as an option.

This reminds me of an earlier post from Dave Winer entitled Microsoft's cloud strategy?

No one seems to hit the sweet spot, the no-brainer cloud platform that could take our software as-is, and just run it -- and run by a company that stands a chance of surviving the coming recession (which everyone really thinks may be a depression).

Of all the offerings Amazon comes the closest.

As it stands today platform as a service offerings currently do not satisfy the needs of people who have existing apps that want to "port them to the cloud". Instead this looks like it will remain the domain of utility computing services which just give you a VM and the ability to run any software you damn well please on the your operating system of choice.

However for brand new product development the restrictions of platform as a service offerings seem attractive given the ability to "scale infinitely" without having to get your hands dirty. Developers on platform as a service offerings don't have to worry about database management and the ensuing complexitiies like sharding, replication and database tuning.

What are your thoughts on the strengths and weaknesses of both classes of offerings?

Note Now Playing: The Pussycat Dolls - I Hate This Part Note

Categories: Cloud Computing

January 19, 2009

@ 09:05 AM

Comments [3]

Some thoughts on FeedBurner Site Stats being replaced by Google Analytics

I've been a FeedBurner customer for a couple of years and was initially happy for the company when it was acquired by Google. This soon turned to frustration when I realized that Google had become the company where startups go to die. Since being acquired by Google almost two years ago, the service hasn't added new features or fixed glaring bugs. If anything, the service has only lost features since it was acquired.

I discovered the most significant feature loss this weekend when I was prompted to migrate my FeedBurner account to using a Google login. I thought this would just be a simple change in login credentials but I got a different version of the service as well. The version of the service for Google accounts does have a new feature, the chart of subscribers is now a chart of subscribers AND "reach". On the other hand, the website analytics features seemed to be completely missing. So I searched on the Internet and found the following FAQ on the FeedBurner to Google Accounts migration

Are there any features that are not available at feedburner.google.com?

There are two features that we are retiring from all versions of feedburner: Site Stats (visitors) and FeedBurner Networks.

We have decided to retire FeedBurner website visitor tracking, as we feel Google already has a comparable publisher site analytics tool in Google Analytics. If you do not currently use Google Analytics, we recommend signing up for an account. We want to stress that all feed analytics will remain the same or be improved, and they are not going away.

FeedBurner Networks, which were heavily integrated with FeedBurner Ad Network, are no longer being supported. As with many software features, the usage wasn't at the level we'd hoped, and therefore we are making the decision not to develop it further, but to focus our attention on other feed services that are being used with more frequency. We will continue to look out for more opportunities for publishers to group inventory as part of the AdSense platform.

I guess I should have seen this coming. I know enough about competing projects and acquisitions to realize that there was no way Google would continue to invest in two competing website analytics products. It is unfortunate because there were some nice features in FeedBurner's Site Stats such as tracking of clicked links and the ability to have requests from a particular machine be ignored which are missing in Google Analytics. I also miss the simplicity of FeedBurner's product. Google Analytic is a more complex product with lots of bells and whistles, yet it takes two or three times as many clicks to answer straightforward questions like what referrer pages are sending people to our site.

The bigger concern for me is that both Google Analytics and FeedBurner (aka AdSense for Feeds) are really geared around providing a service for people who use Google's ad system. I keep wondering how much longer Google will be willing to let me mooch off of them by offloading my RSS feed bandwidth costs to them while not serving Google ads in my feed. At this rate, I wouldn't be surprised if Google turns FeedBurner into a Freemium business model product where users like me are encouraged to run ads if we want any new features and better service.

Too bad there isn't any competition in this space.

Note Now Playing: Lords of the Underground - Chief Rocka Note

Categories: Rants

January 18, 2009

@ 06:03 PM

Comments [0]

Dealing with the Seven Year Itch, Working at Microsoft and a few thoughts on the Google Hiring Process

A few days ago someone asked how long I've been at Microsoft and I was surprised to hear myself say about 7 years. I hadn't consciously thought about it for a while but my 7th anniversary at the company is coming up in a few weeks. I spent a few days afterwards wondering if I have a seven year itch and thinking about what I want to do next in my career.

I realized that I couldn't be happier right now. I get to build the core platform that powers the social software experience for half a billion users with a great product team (dev, test, ops) and very supportive management all the way up the chain. This hasn't always been the case.

I went through my seven year itch period about two years ago. I had always planned for my career at Microsoft to be short since I've never considered myself to be a big company person. Around that time, I looked around at a couple of places both within and outside Microsoft. I had some surprisingly good hiring experiences and some that were surprisingly bad (as bad as this thread makes the Google hiring process seem, trust me it is worse than that) then came away with a surprising conclusion. The best place to build the kind of software I wanted to build was at Microsoft. I started working in online services at Microsoft because I believed Social Software is the Platform of the Future and wanted to build social experiences that influence how millions of people connect with each other. What I realized after my quick look around at other opportunities is that no one has more potential in this space than Microsoft. Today when I read articles about our recent release, it validates my belief that Microsoft will be the company to watch when it comes to bringing "social" to software in a big way.

In my almost seven years in the software industry, I've had a number of friends go through the sense of needing change or career dissatisfaction which leads to the seven year itch. Both at Microsoft and elsewhere. Some of them have ended up dealing with this poorly and eventual became disgruntled and unhappy with their jobs which turns into a vicious cycle. On the other hand, I know a bunch of people that went from being unhappy or disgruntled about their jobs to becoming happy and productive employees who are more satisfied with their career choices. For the latter class of people, here are the three most successful, proactive steps I've seen them make

Change your perspective: Sometimes employees fall into situations where the reality for working on product X or team Y at company Z is different from their expectations. It could be a difference in development philosophy (e.g. the employee likes Agile practices like SCRUM and the product team does not practice them), technology choice (e.g. wanting to use the latest technologies whereas the product team has a legacy product in C++ with millions of customers) or one of many other differences in expectations versus reality.

The realization that leads to satisfaction in this case is that it isn't necessarily the case that what the organization is doing is wrong (e.g. rewriting an app from scratch just to use the latest technologies is never a good idea), it's just that what the employee would prefer isn't a realistic option for the organization or is just a matter of personal preference (e.g. the goal of the organization is to deliver a product on time and under budget not to use a specific development practice). Of course, these are contrived examples but the key point should be clear. If you're unhappy because your organization doesn't meet your expectations it could be that your expectations are what needs adjusting.
Change your organization: As the Mahatma Gandhi quote goes, Be the change you wish to see in the world. Every organization can be changed from within. After all, a lot of the most successful projects at Microsoft and around the software industry came about because a passionate person had a dream and did the leg work to convince enough people to share that dream. Where people often stumble is in underestimating the amount of leg work it takes to convince people to share their dream (and sometimes this leg work may mean writing a lot of code or doing a lot of research). .

For example, when you go from reading Harvard Business School articles like Microsoft vs. Open Source: Who Will Win? in 2005 to seeing an Open Source at Microsoft portal in 2008, you have to wonder what happens behind the scenes to cause that sort of change. It took the prodding of a number of passionate Open Source ambassadors at Microsoft as well as other influences to get this to happen.
Change your job: In some cases, there are irreconcilable differences between an employee and the organization they work for. The employee may have lost faith in the planned direction of the product, the product's management team or the entire company as a whole. In such cases the best thing to do is to part ways amicably before things go south.

Being on an H-1B visa I'd heard all sorts of horror stories about being the equivalent of an indentured servant when working for an American software company but this has proven to be far from the truth. There is an H-1B transfer process that allows you to switch employers without having to re-apply for a visa or even inform your current employer. If you work at a big company and are paper-work averse, you can stick to switching teams within the company. This is especially true for Microsoft where there are hundreds of very different products (operating systems, databases, Web search engines, video game console hardware, social networking software, IT, web design, billing software, etc) with very different cultures to choose from.

These are the steps that I've seen work for friends and coworkers who've been unhappy in their jobs who've successfully been able to change their circumstances. The people who don't figure out how to execute one of the steps above eventually become embittered and are never a joy to be around.

Note Now Playing: Estelle (feat. Kanye West) - American Boy Note

Categories: Life in the B0rg Cube | Personal

January 17, 2009

@ 04:24 PM

Comments [4]

Some Thoughts on User Interfaces for Activity Streams

A few weeks ago, Joshua Porter posted an excellent analysis of FriendFeed's user interface in his post Thoughts on the Friendfeed interface where he provides this excellent annotated screenshot

In addition to the screenshot, Joshua levels four key criticisms about FriendFeed's current design

Too few items per screen
Secondary information clogs up each item
Difficult to scan content titles quickly
People who aren't my friends

The last item is my biggest pet peeve about FriendFeed and why I haven't found myself able to get into the service. FriendFeed goes out of its way to show me content from and links to people I don't know and haven't become friends with on the site. In the screenshot above, there are at least twice as many people Joshua isn't friends with showing up on the page than people he knows. Here are the three situations FriendFeed commonly shows non-friends in and why they are bad

FriendFeed shows you content from friends of friends: This is major social faux pas. It may sound like a cool viral feature but showing me content from people I haven't subscribed to means I don't have control of who shows up in my feed and it takes away from the intimacy of the site because I'm always seeing content from strangers.
FriendFeed shows you who "liked" some content: Why should I care if some total stranger liked some blog post from a friend of mine? Again, this seems like a viral feature aimed at generating page views from users clicking on the people who liked an item in the feed but it comes at the cost of visual clutter and a reduction in the intimacy of the servers by putting strangers in your face.
FriendFeed shows comments expanded by default in the feed: In the screenshot above, the comment thread for "Overnight Success and FriendFeed Needs" takes up space that could have been used to show another item from one of Joshua's friends. The question to ask is whether a bunch of comments from people Joshua may or may not know is more valuable to show than an update from one of his friends?

In fact the majority of Joshua's remaining complaints including secondary information causing visual clutter and too few items per screen are a consequence of FriendFeed's decision to take multiple opportunities to push people you don't know in your face on the home page. The need to grow virally by encouraging connections between users is costing them by hampering their core user experience.

On the flip side, look at how Facebook has tried to address the issue of prompting users to grow their social graph without spamming the news feed with people you don't know

People often claim that activity streams make them feel like they are drowning in a river of noise. FriendFeed compounds this by drowning you in a content from people you don't even know and never even asked to get content from in the first place.

Rule #1 of every activity stream experience is that users should feel in control of what content they get in their feed. Otherwise, the tendency to succumb to the feeling of "drowning" will be overwhelming.

Note Now Playing: Lupe Fiasco - Kick, Push Note

Categories: Social Software

January 16, 2009

@ 02:23 PM

Comments [9]

Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes

Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. The justification for database sharding is that after a certain scale point it is cheaper and more feasible to scale a site horizontally by adding more machines than to grow it vertically by adding beefier servers.

Why Shard or Partition your Database?

Let's take Facebook.com as an example. In early 2004, the site was mostly used by Harvard students as a glorified online yearbook. You can imagine that the entire storage requirements and query load on the database could be handled by a single beefy server. Fast forward to 2008 where just the Facebook application related page views are about 14 billion a month (which translates to over 5,000 page views per second, each of which will require multiple backend queries to satisfy). Besides query load with its attendant IOPs, CPU and memory cost there's also storage capacity to consider. Today Facebook stores 40 billion physical files to represent about 10 billion photos which is over a petabyte of storage. Even though the actual photo files are likely not in a relational database, their metadata such as identifiers and locations still would require a few terabytes of storage to represent these photos in the database. Do you think the original database used by Facebook had terabytes of storage available just to store photo metadata?

At some point during the development of Facebook, they reached the physical capacity of their database server. The question then was whether to scale vertically by buying a more expensive, beefier server with more RAM, CPU horsepower, disk I/O and storage capacity or to spread their data out across multiple relatively cheap database servers. In general if your service has lots of rapidly changing data (i.e. lots of writes) or is sporadically queried by lots of users in a way which causes your working set not to fit in memory (i.e. lots of reads leading to lots of page faults and disk seeks) then your primary bottleneck will likely be I/O. This is typically the case with social media sites like Facebook, LinkedIn, Blogger, MySpace and even Flickr. In such cases, it is either prohibitively expensive or physically impossible to purchase a single server to handle the load on the site. In such situations sharding the database provides excellent bang for the buck with regards to cost savings relative to the increased complexity of the system.

Now that we have an understanding of when and why one would shard a database, the next step is to consider how one would actually partition the data into individual shards. There are a number of options and their individual tradeoffs presented below – Pseudocode / Joins

How Sharding Changes your Application

In a well designed application, the primary change sharding adds to the core application code is that instead of code such as

//string connectionString = @"Driver={MySQL};SERVER=dbserver;DATABASE=CustomerDB;"; <-- should be in web.config
string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];          
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
          
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader();

the actual connection information about the database to connect to depends on the data we are trying to store or access. So you'd have the following instead

string connectionString = GetDatabaseFor(customerId);          
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();
         
OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId; 
OdbcDataReader reader = cmd.ExecuteReader();

the assumption here being that the GetDatabaseFor() method knows how to map a customer ID to a physical database location. For the most part everything else should remain the same unless the application uses sharding as a way to parallelize queries.

A Look at a Some Common Sharding Schemes

There are a number of different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are four of the most popular schemes used by various large scale Web applications today.

Vertical Partitioning: A simple way to segment your application database is to move tables related to specific features to their own server. For example, placing user profile information on one database server, friend lists on another and a third for user generated content like photos and blogs. The key benefit of this approach is that is straightforward to implement and has low impact to the application as a whole. The main problem with this approach is that if the site experiences additional growth then it may be necessary to further shard a feature specific database across multiple servers (e.g. handling metadata queries for 10 billion photos by 140 million users may be more than a single server can handle).
Range Based Partitioning: In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on values ranges that occur within each entity. For example, splitting up sales transactions by what year they were created or assigning users to servers based on the first digit of their zip code. The main problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully then the sharding scheme leads to unbalanced servers. In the previous example, splitting up transactions by date means that the server with the current year gets a disproportionate amount of read and write traffic. Similarly partitioning users based on their zip code assumes that your user base will be evenly distributed across the different zip codes which fails to account for situations where your application is popular in a particular region and the fact that human populations vary across different zip codes.
Key or Hash Based Partitioning: This is often a synonym for user based partitioning for Web 2.0 sites. With this approach, each entity has a value that can be used as input into a hash function whose output is used to determine which database server to use. A simplistic example is to consider if you have ten database servers and your user IDs were a numeric value that was incremented by 1 each time a new user is added. In this example, the hash function could be perform a modulo operation on the user ID with the number ten and then pick a database server based on the remainder value. This approach should ensure a uniform allocation of data to each server. The key problem with this approach is that it effectively fixes your number of database servers since adding new servers means changing the hash function which without downtime is like being asked to change the tires on a moving car.
Directory Based Partitioning: A loosely couples approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code. This means the GetDatabaseFor() method actually hits a web service or a database which actually stores/returns the mapping between each entity key and the database server it resides on. This loosely coupled approach means you can perform tasks like adding servers to the database pool or change your partitioning scheme without having to impact your application. Consider the previous example where there are ten servers and the hash function is a modulo operation. Let's say we want to add five database servers to the pool without incurring downtime. We can keep the existing hash function, add these servers to the pool and then run a script that copies data from the ten existing servers to the five new servers based on a new hash function implemented by performing the modulo operation on user IDs using the new server count of fifteen. Once the data is copied over (although this is tricky since users are always updating their data) the lookup service can change to using the new hash function without any of the calling applications being any wiser that their database pool just grew 50% and the database they went to for accessing John Doe's pictures five minutes ago is different from the one they are accessing now.

Problems Common to all Sharding Schemes

Once a database has been sharded, new constraints are placed on the operations that can be performed on the database. These constraints primarily center around the fact that operations across multiple tables or multiple rows in the same table no longer will run on the same server. Below are some of the constraints and additional complexities introduced by sharding

Joins and Denormalization – Prior to sharding a database, any queries that require joins on multiple tables execute on a single server. Once a database has been sharded across multiple servers, it is often not feasible to perform joins that span database shards due to performance constraints since data has to be compiled from multiple servers and the additional complexity of performing such cross-server.

A common workaround is to denormalize the database so that queries that previously required joins can be performed from a single table. For example, consider a photo site which has a database which contains a user_info table and a photos table. Comments a user has left on photos are stored in the photos table and reference the user's ID as a foreign key. So when you go to the user's profile it takes a join of the user_info and photos tables to show the user's recent comments. After sharding the database, it now takes querying two database servers to perform an operation that used to require hitting only one server. This performance hit can be avoided by denormalizing the database. In this case, a user's comments on photos could be stored in the same table or server as their user_info AND the photos table also has a copy of the comment. That way rendering a photo page and showing its comments only has to hit the server with the photos table while rendering a user profile page with their recent comments only has to hit the server with the user_info table.

Of course, the service now has to deal with all the perils of denormalization such as data inconsistency (e.g. user deletes a comment and the operation is successful against the user_info DB server but fails against the photos DB server because it was just rebooted after a critical security patch).
Referential integrity – As you can imagine if there's a bad story around performing cross-shard queries it is even worse trying to enforce data integrity constraints such as foreign keys in a sharded database. Most relational database management systems do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references once they move to using database shards.

Dealing with data inconsistency issues due to denormalization and lack of referential integrity can become a significant development cost to the service.
Rebalancing (Updated 1/21/2009) – In some cases, the sharding scheme chosen for a database has to be changed. This could happen because the sharding scheme was improperly chosen (e.g. partitioning users by zip code) or the application outgrows the database even after being sharded (e.g. too many requests being handled by the DB shard dedicated to photos so more database servers are needed for handling photos). In such cases, the database shards will have to be rebalanced which means the partitioning scheme changed AND all existing data moved to new locations. Doing this without incurring down time is extremely difficult and not supported by any off-the-shelf today. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson

Navigation for Tuesday, 27 January 2009 - Dare Obasanjo's weblog

Are there any features that are not available at feedburner.google.com?

Why Shard or Partition your Database?

How Sharding Changes your Application

A Look at a Some Common Sharding Schemes

Problems Common to all Sharding Schemes