I recently got an email from a developer about my post Thoughts on Amazon's Internal Storage System (Dynamo) in which he claimed that I seem to be romanticizing distributed systems that favor availability over consistency. He pointed out that although this sounds nice on paper, it places a significant burden on application developers. He is 100% right. This has been my experience in Windows Live, and I've heard enough secondhand to assume it is the experience at Amazon as well when it comes to Dynamo.

I thought an example of how this trade-off affects developers would be a useful exercise and may be of interest to my readers. The following example is hypothetical and should not be construed as describing the internal architecture of any production system I am aware of.

Scenario: Torsten Rendelmann, a Facebook user in Germany, accepts a friend request from Dare Obasanjo who is a Facebook user in the United States.

The Distributed System: To improve response times for users in Europe, imagine Facebook has a data center in London while American users are serviced from a data center in Palo Alto. To achieve this, the user database is broken up in a process commonly described as sharding. The question of if and how data is replicated across both data centers isn't relevant to this example.
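For illustration, here is a minimal sketch of how such sharding might route each user's reads and writes to a home data center. The class name, the modulo scheme, and the data center list are my own assumptions for this hypothetical example, not Facebook's actual design.

// Hypothetical sketch: route each user's reads/writes to the data center
// that owns that user's shard. A real system would use a directory service
// or consistent hashing; a modulo on the user id is enough to show the idea.
public class ShardRouter {
    private static final String[] DATA_CENTERS = { "palo-alto", "london" };

    public static String dataCenterFor(long userId) {
        return DATA_CENTERS[(int) (userId % DATA_CENTERS.length)];
    }

    public static void main(String[] args) {
        System.out.println(dataCenterFor(42L));   // "palo-alto"
        System.out.println(dataCenterFor(1337L)); // "london"
    }
}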

The application developer who owns the confirm_friend_request() method will ideally want to write code that takes the following form:

public void confirm_friend_request(user1, user2){

  begin_transaction();
  update_friend_list(user1, user2, status.confirmed); //palo alto
  update_friend_list(user2, user1, status.confirmed); //london
  end_transaction();
}

Yay, distributed transactions. You have to love a feature that every vendor advises you not to use if you care about performance. So obviously this doesn't work for a large-scale distributed system where performance and availability are important.

Things get particularly ugly when you realize that either data center or the specific server a user’s data is stored on could be unreachable for a variety of reasons (e.g. DoS attack, high seasonal load, drunk sys admin tripped over a power cord, hard drive failure due to cheap components, etc).
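To make the cost concrete, here is a minimal sketch of the two-phase commit that the transactional version above implies. The Participant interface and coordinator are hypothetical, not any particular vendor's API, but the shape shows the two cross-data-center round trips per transaction and the fact that a single unreachable participant stalls or aborts the whole operation.

import java.util.List;

// Hypothetical two-phase-commit coordinator. Each confirm_friend_request would
// need a round trip to every data center to collect votes, then a second round
// trip to commit, and it cannot make progress while a participant is unreachable.
interface Participant {
    boolean prepare(String txnId) throws Exception; // round trip 1: vote yes/no
    void commit(String txnId) throws Exception;     // round trip 2: apply the change
    void rollback(String txnId) throws Exception;
}

public class TwoPhaseCommitCoordinator {
    public void execute(String txnId, List<Participant> participants) throws Exception {
        // Phase 1: every participant (e.g. Palo Alto and London) must vote yes.
        for (Participant p : participants) {
            boolean voted;
            try {
                voted = p.prepare(txnId);
            } catch (Exception unreachable) {
                voted = false; // a down or unreachable data center counts as a "no"
            }
            if (!voted) {
                for (Participant q : participants) q.rollback(txnId);
                throw new Exception("transaction " + txnId + " aborted");
            }
        }
        // Phase 2: only after every vote is in can the change be applied anywhere.
        for (Participant p : participants) {
            p.commit(txnId);
        }
    }
}

With Palo Alto and London as the two participants, every friend confirmation pays at least two transatlantic round trips, and any of the failures listed above leaves the confirmation blocked.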

There are a number of options one can consider when availability and high performance matter more than data consistency, as in the above example. Below are three potential implementations of the code above, each with its own set of trade-offs.

OPTION I: Application developer performs manual rollback on error

public void confirm_friend_request_A(user1, user2){

 try{
   update_friend_list(user1, user2, status.confirmed); //palo alto 
 }catch(exception e){ 
  report_error(e); 
  return;
 }

 try{
  update_friend_list(user2, user1, status.confirmed); //london
 }catch(exception e) {
  revert_friend_list(user1, user2);
  report_error(e);
  return; 
 }

}

The problem here is that we don’t handle the case where revert_friend_list() fails. This means that Dare (user1) may end up having Torsten (user2) on his friend list but Torsten won’t have Dare on his friend list. The database has lied.

OPTION II: Failed events are placed in a message queue to be retried until they succeed.   

public void confirm_friend_request_B(user1, user2){

 try{
   update_friend_list(user1, user2, status.confirmed); //palo alto 
 }catch(exception e){ 
  report_error(e); 
  add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time()); 
 }

 try{
  update_friend_list(user2, user1, status.confirmed); //london
 }catch(exception e) {
  report_error(e); 
  add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());  
 }

}

Depending on how long the error persists and how long an item sits in the message queue, there will be times when Dare (user1) ends up having Torsten (user2) on his friend list but Torsten won't have Dare on his friend list. The database has lied, again.
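For completeness, here is a minimal sketch of the background process that might drain such a retry queue. The queue entry shape and the remote call are assumptions carried over from the pseudocode above, not a description of any real system.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical retry worker for Option II: replays failed friend-list updates
// until the remote data center accepts them.
public class RetryWorker implements Runnable {
    // One queued operation: "set user2 to 'confirmed' on user1's friend list".
    record PendingUpdate(long user1, long user2, long enqueuedAtMillis) {}

    private final BlockingQueue<PendingUpdate> queue = new LinkedBlockingQueue<>();

    public void enqueue(long user1, long user2) {
        queue.add(new PendingUpdate(user1, user2, System.currentTimeMillis()));
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                PendingUpdate update = queue.poll(1, TimeUnit.SECONDS);
                if (update == null) continue;
                try {
                    updateFriendList(update.user1(), update.user2());
                } catch (Exception stillFailing) {
                    // put it back; a real worker would back off before retrying
                    queue.add(update);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Stand-in for the remote write to the data center that owns user1's shard.
    private void updateFriendList(long user1, long user2) throws Exception {
        /* RPC to the owning data center goes here */
    }
}

The window during which the two friend lists disagree is exactly the time an entry spends in this queue, which is why the inconsistency above is visible to users at all.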

OPTION III: System always accepts updates but application developers may have to resolve data conflicts later. (The Dynamo approach)

/* update_friend_list always succeeds but may enqueue an item in the message queue to try again later in the event of failure. This failure is not propagated to callers of the method. */

public void confirm_friend_request_C(user1, user2){
   update_friend_list(user1, user2, status.confirmed); // palo alto
   update_friend_list(user2, user1, status.confirmed); //london 

}

/* The get_friends() method has to reconcile the results returned from the data store because there may be data inconsistency due to a conflict: a change that was applied from the message queue may contradict a subsequent change by the user. In this case, status is a bit flag where all conflicts are merged and it is up to the app developer to figure out what to do. */

  public list get_friends(user1){
      list actual_friends = new list();
      list friends = read_friend_list(user1); // raw, possibly conflicting records from the data store

      foreach (friend in friends){

        if(friend.status == friendstatus.confirmed){ //no conflict
           actual_friends.add(friend);

        }else if((friend.status & friendstatus.confirmed)
                   and !(friend.status & friendstatus.deleted)){

          // assume friend is confirmed as long as it wasn't also deleted
          friend.status = friendstatus.confirmed;
          actual_friends.add(friend);
          update_friend_list(user1, friend, friendstatus.confirmed);

        }else{ //assume deleted if there is a conflict with a delete
          update_friend_list(user1, friend, friendstatus.deleted);
        }

      }//foreach

   return actual_friends;
}
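To make the bit-flag idea concrete, here is a small sketch of how conflicting replica versions of the same friend record might be merged into a single status value. The flag values and the merge rule are my own illustration of the comment above, not a description of Dynamo's actual conflict handling.

// Hypothetical bit-flag status used in Option III. Each conflicting version of a
// friend record contributes its bits; merging a conflict simply ORs them together.
public class FriendStatus {
    public static final int PENDING   = 1; // 0b001
    public static final int CONFIRMED = 2; // 0b010
    public static final int DELETED   = 4; // 0b100

    // Merge the status bits from all conflicting versions of the same record.
    public static int merge(int... replicaStatuses) {
        int merged = 0;
        for (int s : replicaStatuses) merged |= s;
        return merged;
    }

    public static void main(String[] args) {
        // One replica applied the confirmation, another later applied a delete.
        int merged = merge(CONFIRMED, DELETED);
        boolean confirmed = (merged & CONFIRMED) != 0;
        boolean deleted   = (merged & DELETED) != 0;
        // Matches the get_friends() logic above: a delete wins over a confirm.
        System.out.println(confirmed && !deleted ? "show as friend" : "treat as deleted");
    }
}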

These are just a few of the many approaches that can be implemented in such a distributed system to get around the performance and availability implications of using distributed transactions. The main problem with them is that in every single case, the application developer has an extra burden placed on his shoulders due to the inherent fragility of distributed systems. For a lot of developers, the shock of this realization is akin to the shock of programming in C++ after using C# or Ruby for a couple of years. Manual memory management? Actually having to perform bounds checking on arrays? Being unable to use decades-old technology like database transactions?

The challenge in building a distributed storage system like BigTable or Dynamo is in balancing the need for high availability and performance while not building a system that encourages all sorts of insidious bugs by design. Some might argue that Dynamo goes too far in the burden it places on developers, while others would argue that it doesn't go far enough.

In what camp do you fall?

Now playing: R. Kelly - Rock Star (feat. Ludacris & Kid Rock)

Wednesday, 10 October 2007 14:33:23 (GMT Daylight Time, UTC+01:00)
It seems like BigTable originally required clients to ensure consistency but that function has now moved to the server. I am guessing even the server guarantees only EVENTUAL consistency, therefore requiring developers to write applications that are aware of this.

From the BigTable paper:

"The Personalized Search team originally built a *client-side replication mechanism on top of Bigtable that ensured eventual consistency* of all replicas. The current system now uses a
replication subsystem that is built into the servers."
Wednesday, 10 October 2007 19:06:18 (GMT Daylight Time, UTC+01:00)
I agree distributed is hard work. I also think we're a long way from having the right abstractions and I have a suspicion the reason for that is because not enough developers have been prepared to swim in the depths long enough to innovate and improve the situation.
Thursday, 11 October 2007 05:50:04 (GMT Daylight Time, UTC+01:00)
Dan: I believe these sorts of large scale distributed systems are very new. Google published the BigTable paper less than a year ago and Amazon just published the Dynamo paper. Not enough people (outside of Google, Amazon and a handful of other places) have worked on such large scale distributed systems long enough for the right abstractions to be agreed upon and popularized. We are just beginning to discuss these. It will take some time for standardization (formal or informal) to happen, and typically the hordes show up only after standardization happens.
Thursday, 11 October 2007 10:53:01 (GMT Daylight Time, UTC+01:00)
As far as I'm aware, the corresponding table would look something like this:

First Row:
Friend 1 - John Doe
Friend 2 - Jane Doe
Status - Confirmed

Next Row:
Friend 2 - John Doe
Friend 1 - Jane Doe
Status - Confirmed

Furthermore, user pages are usually hit only a few times, so it is not as necessary to cache them.

So, instead it is more sensible in this scenario to have the Rows be:
Participation Number - 1208345 (unique)
Relationship number - 812489 (non-unique)
Friend - John Doe

Participation Number - 1208346 (unique)
Relationship number - 812489 (non-unique) (as above)
Friend - Jane Doe

The database system should handle the replication automatically at a below-SQL level. Furthermore, the given example poisons the information with temporary data that is only relevant for a fraction of a second.
Nick
Thursday, 11 October 2007 19:09:47 (GMT Daylight Time, UTC+01:00)
why isn't R. Kelly in jail?
chi chi
Thursday, 11 October 2007 22:25:23 (GMT Daylight Time, UTC+01:00)
A very good overview, thanks!

I guess when it comes down to it, how important is the transaction? If friend 1 has friend 2 for a little while before friend 2 has friend 1, then the world probably isn't going to stop, is it? I know that's a simplistic view and would not work in financial transactions, but again as I think you were touching on, it really is finding a balance and obviously using a wee bit of common sense of course.

Good stimulating reading as always, thanks!
David
Friday, 12 October 2007 21:31:01 (GMT Daylight Time, UTC+01:00)
You are basically running into "Brewer's Conjecture."

Consistent, available, partition tolerant. Pick two.

evgen