In my recent post on building a Twitter search engine on Windows Azure I questioned the need the expose the notion of both partition and row keys to developers on the platforms. Since then I've had conversations with a couple of folks at work that indicate that I should have stated my concerns more explicitly. So here goes.
The documentation on Understanding the Windows Azure Table Storage Data Model states the following
Tables are partitioned to support load balancing across storage nodes. A table's entities are organized by partition. A partition is a consecutive range of entities possessing the same partition key value. The partition key is a unique identifier for the partition within a given table, specified by the PartitionKey property. The partition key forms the first part of an entity's primary key. The partition key may be a string value up to 32 KB in size.
You must include the PartitionKey property in every insert, update, and delete operation.
The second part of the primary key is the row key, specified by the RowKey property. The row key is a unique identifier for an entity within a given partition. Together the PartitionKey and RowKey uniquely identify every entity within a table.
The row key is a string value that may be up to 32 KB in size.
You must include the RowKey property in every insert, update, and delete operation.
In my case I'm building an application to represent users in a social network and each user is keyed by user ID (e.g. their Twitter user name). In my application I only have one unique key and it identifies each row of user data (e.g. profile pic, location, latest tweet, follower count, etc). My original intuition was to use the unique ID as the row key while letting the partition key be a single value. The purpose of the partition key is that it is a hint to say this data belongs on the same machine which in my case seems like overkill.
Where this design breaks down is when I actually end up storing more data than the Windows Azure system can or wants to fit on a single storage node. For example, what if I've actually built a Facebook crawler (140 million users) and I cache people's profile pics locally (10kilobytes). This ends up being 1.3 terabytes of data. I highly doubt that the Azure system will be allocating 1.3 terabytes of storage on a single server for a single developer and even if it did the transaction performance would suffer. So the only reasonable assumption is that the data will either be split across various nodes at some threshold [which the developer doesn't know] or at some point the developer gets a "disk full error" (i.e. a bad choice which no platform would make).
On the other hand, if I decide to use the user ID as the partition key then I am in essence allowing the system to theoretically store each user on a different machine or at least split up my data across the entire cloud. That sucks for me if all I have is three million users for which I'm only storing 1K of data so it could fit on a single storage node. Of course, the Windows Azure system could be smart enough to not split up my data since it fits underneath some threshold [which the developer doesn't know]. And this approach also allows the system to take advantage of parallelism across multiple machines if it does split my data.
Thus I'm now leaning towards the user ID being the partition key instead of the row key. So what advice do the system's creators actually have for developers?
Well from the discussion thread POST to Azure tables w/o PartitionKey/RowKey: that's a bug, right? on the MSDN forums there is the following advice from Niranjan Nilakantan of Microsoft
If the key for your logical data model has more than 1 property, you should default to (multiple partitions, multiple rows for each partition).
If the key for your logical data model has only one property, you would default to (multiple partitions, one row per partition).
We have two columns in the key to separate what defines uniqueness(PartitionKey and RowKey) from what defines scalability(just PartitionKey).
In general, write and query times are less affected by how the table is partitioned. It is affected more by whether you specify the PartitionKey and/or RowKey in the query.
So that answers the question and validates the conclusions we eventually arrived at. It seems we should always use the partition key as the primary key and may optionally want to use a row key as a secondary key, if needed.
In that case, the fact that items with different partition keys may or may not be stored on the same machine seems to be an implementation detail that shouldn't matter to developers since there is nothing they can do about it anyway. Right?
Now Playing: Scarface - Hand of the Dead Body