Database normalization is a formal process of designing your database to eliminate redundant data, utilize space efficiently and reduce update errors. Anyone who has ever taken a database class has had it drummed into their head that a normalized database is the only way to go. This is true for the most part. However, there are certain scenarios where the benefits of database normalization are outweighed by its costs. Two of these scenarios are described below.

Immutable Data and Append-Only Scenarios

Pat Helland, an enterprise architect at Microsoft who just rejoined the company after a two-year stint at Amazon, has a blog post entitled Normalization is for Sissies, in which he presents his slides from an internal Microsoft gathering on database topics. In his presentation, Pat argues that database normalization is unnecessary in situations where we are storing immutable data such as financial transactions or a particular day's price list.

When Multiple Joins are Needed to Produce a Commonly Accessed View

The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item. For example, consider this normalized set of tables which represent a user profile on a typical social networking site.

user table
user_id  first_name  last_name  sex   hometown     relationship_status  interested_in  religious_views  political_views
12345    John        Doe        Male  Atlanta, GA  married              women          (null)           (null)

user_affiliations table
user_id (foreign key)  affiliation_id (foreign key)
12345                  42
12345                  598

affiliations table
affiliation_id  description   member_count
42              Microsoft     18,656
598             Georgia Tech  23,488

user_phone_numbers table
user_id (foreign key)  phone_number  phone_type
12345                  425-555-1203  Home
12345                  425-555-6161  Work
12345                  206-555-0932  Cell

user_screen_names table
user_id (foreign key)  screen_name             im_service
12345                  geeknproud@example.com  AIM
12345                  voip4life@example.org   Skype

user_work_history table
user_id (foreign key)  company_affiliation_id (foreign key)  company_name     job_title
12345                  42                                    Microsoft        Program Manager
12345                  78                                    i2 Technologies  Quality Assurance Engineer

This is the kind of information you see on the average profile on Facebook. With the above design, it takes six SQL join operations to access and display the information about a single user. This makes rendering the profile page a fairly database-intensive operation, which is compounded by the fact that profile pages are the most popular pages on social networking sites.
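As a rough illustration, here is one way all of that data could be gathered in a single statement against the example schema above. This is only a sketch; as a commenter points out below, an application would more realistically issue several smaller queries, and joining several one-to-many tables at once multiplies the rows returned.

    -- note: "user" is a reserved word in some databases and may need quoting
    SELECT u.first_name, u.last_name, u.hometown,
           a.description AS affiliation,
           p.phone_number, p.phone_type,
           s.screen_name, s.im_service,
           w.company_name, w.job_title
    FROM user u
    LEFT JOIN user_affiliations  ua ON ua.user_id = u.user_id
    LEFT JOIN affiliations       a  ON a.affiliation_id = ua.affiliation_id
    LEFT JOIN user_phone_numbers p  ON p.user_id = u.user_id
    LEFT JOIN user_screen_names  s  ON s.user_id = u.user_id
    LEFT JOIN user_work_history  w  ON w.user_id = u.user_id
    WHERE u.user_id = 12345;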

The simplest way to fix this problem is to denormalize the database. Instead of having separate tables for the user's affiliations, phone numbers, IM addresses and so on, we can just place them in the user table as columns, as sketched below. The drawback with this approach is that there is now more wasted space (e.g. lots of college students will have null for their work_phone) and perhaps some redundant information (e.g. if we copy the description of each affiliation into an affiliation_name column for each user to avoid having to join with the affiliations table). However, given the very low cost of storage and the improved performance of querying a single table instead of issuing SQL statements that operate across six tables for every operation, this is a small price to pay.
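A minimal sketch of what this could look like, with illustrative column names (affiliation_names, home_phone and so on are made up for this example, not taken from any real schema):

    ALTER TABLE user ADD COLUMN affiliation_names  VARCHAR(500);  -- e.g. 'Microsoft; Georgia Tech'
    ALTER TABLE user ADD COLUMN home_phone         VARCHAR(20);
    ALTER TABLE user ADD COLUMN work_phone         VARCHAR(20);
    ALTER TABLE user ADD COLUMN cell_phone         VARCHAR(20);
    ALTER TABLE user ADD COLUMN aim_screen_name    VARCHAR(100);
    ALTER TABLE user ADD COLUMN skype_screen_name  VARCHAR(100);

    -- rendering the profile then becomes a single indexed lookup
    SELECT * FROM user WHERE user_id = 12345;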

As Joe Gregorio mentions in his blog post about the emergence of megadata, a lot of the large Web companies such as Google, eBay and Amazon are heavily into denormalizing their databases as well as eschewing transactions when updating these databases to improve their scalability.

Maybe normalization is for sissies…

UPDATE: Someone pointed out in the comments that denormalizing the affiliations table into the user table would mean the member_count would have to be updated in thousands of users' rows whenever a new member joined the group. This is obviously not the intent of denormalizing for performance reasons, since it replaces a bad problem with a worse one. Since an affiliation is a distinct concept from a user, it makes sense for it to have its own table. Replicating the names of the groups a user is affiliated with in the user table is a good performance optimization, although it does mean that the name has to be fixed up in thousands of rows if it ever changes. Since this is likely to happen very rarely, it is probably acceptable, especially if we schedule renames to be done by a cron job during off-peak hours. On the other hand, replicating the member count is just asking for trouble.
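As a sketch, assuming the illustrative affiliation_names column from the earlier example, the off-peak fix-up for a renamed group could be a single batch statement:

    -- run by the cron job during off-peak hours after a group is renamed
    UPDATE user
    SET affiliation_names = REPLACE(affiliation_names, 'Georgia Tech', 'Georgia Institute of Technology')
    WHERE affiliation_names LIKE '%Georgia Tech%';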

UPDATE 2: Lots of great comments here and on reddit indicate that I should have put more context around this post. Database denormalization is the kind of performance optimization that should be carried out as a last resort, after trying things like creating database indexes, using SQL views and implementing application-specific in-memory caching. However, if you hit massive scale and are dealing with millions of queries a day across hundreds of millions to billions of records, or have decided to go with database partitioning/sharding, then you will likely end up resorting to denormalization. A real-world example of this is the Flickr database back-end, whose details are described in Tim O'Reilly's Database War Stories #3: Flickr, which contains the following quotes:

"tags are an interesting one. lots of the 'web 2.0' feature set doesn't fit well with traditional normalised db schema design. denormalization (or heavy caching) is the only way to generate a tag cloud in milliseconds for hundereds of millions of tags. you can cache stuff that's slow to generate, but if it's so expensive to generate that you can't ever regenerate that view without pegging a whole database server then it's not going to work (or you need dedicated servers to generate those views - some of our data views are calculated offline by dedicated processing clusters which save the results into mysql).

federating data also means denormalization is necessary - if we cut up data by user, where do we store data which relates to two users (such as a comment by one user on another user's photo). if we want to fetch it in the context of both user's, then we need to store it in both shards, or scan every shard for one of the views (which doesn't scale). we store alot of data twice, but then theres the issue of it going out of sync. we can avoid this to some extent with two-step transactions (open transaction 1, write commands, open transaction 2, write commands, commit 1st transaction if all is well, commit 2nd transaction if 1st commited) but there still a chance for failure when a box goes down during the 1st commit.

we need new tools to check data consistency across multiple shards, move data around shards and so on - a lot of the flickr code infrastructure deals with ensuring data is consistent and well balanced and finding and repairing it when it's not."

The last part of the quote is also important to consider. Denormalization means that you are now likely to deal with data inconsistencies, because you are storing redundant copies of data and, for a variety of reasons, may not be able to update all copies of a column value simultaneously when it changes. Having tools in your infrastructure to support fixing up data of this sort then becomes very important.
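Within a single database, a consistency check for the kind of denormalized member_count discussed earlier can be a simple query comparing the stored value against the real count (a sketch against the example schema); across shards, as the Flickr quote notes, it takes dedicated tooling:

    SELECT a.affiliation_id, a.description, a.member_count,
           COUNT(ua.user_id) AS actual_member_count
    FROM affiliations a
    LEFT JOIN user_affiliations ua ON ua.affiliation_id = a.affiliation_id
    GROUP BY a.affiliation_id, a.description, a.member_count
    HAVING a.member_count <> COUNT(ua.user_id);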

Now playing: Bow Wow - Outta My System (feat. T-Pain)


 

Saturday, August 4, 2007 2:33:41 AM (GMT Daylight Time, UTC+01:00)
Another great post from Dare! Very informative. But can you tell me how is babby formed?

http://www.somethingawful.com/flash/shmorky/babby.swf
Product Engineer Lead Manager
Saturday, August 4, 2007 3:33:10 AM (GMT Daylight Time, UTC+01:00)
I'm stating the extraordinarily obvious here, but denormalizing that into a single row also means that when the number of Microsofties changes, you've got 18,656 rows to update instead of 1; and that you can't have more than some arbitrary number of screen names/services; and that in general, there's more likely to be inconsistent data, scheduled bulk updates, and irritating limitations.

Normalization isn't just for sissies; it's for everyone who can't prove that they need to get by without it.
Saturday, August 4, 2007 3:43:59 AM (GMT Daylight Time, UTC+01:00)
My recommendation was to denormalize all of the user's data, so this would be all the tables with user in the name. Denormalizing the affiliations table is definitely not a good idea since the member count is constantly changing, whereas the name of the group rarely changes, and if it isn't perfectly in sync it may not be a big deal.

I'll update my post with some clarifications.
Saturday, August 4, 2007 6:26:23 AM (GMT Daylight Time, UTC+01:00)
I’m not sure you’re correct in what you suggest here.

Joining across a large number of tables is expensive at times—but those times are generally when you’re doing a query that involves data in many of those tables, which doesn’t seem likely here.

If you’re displaying a single user’s profile, as you describe, there are in fact not many joins used at all. Assuming you are looking up by user ID, you’ll do the following queries:

1) User’s basic profile data (user_table), index lookup.

2) User’s affiliations (user_affiliations, affiliations), which is an index lookup on user_affiliations by user_id, joined with a lookup of the affiliation name by affiliation_id.

3) User’s phone numbers (user_phone_numbers), index lookup.

4) User’s screen names (user_screen_names), index lookup.

5) User’s work history (user_work_history, affiliations), which is an index lookup on user_work_history by user_id, joined with a lookup of the affiliation name by affiliation_id.

There is no six-way join here. There are two two-way joins, and three single-table lookups. All information can be produced by simple index lookups.
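In SQL terms, the lookups described above might look roughly like this (an illustrative sketch against the example schema in the post):

    -- 1) basic profile data
    SELECT * FROM user WHERE user_id = 12345;

    -- 2) affiliations: index lookup plus a two-way join for the name
    SELECT a.description
    FROM user_affiliations ua
    JOIN affiliations a ON a.affiliation_id = ua.affiliation_id
    WHERE ua.user_id = 12345;

    -- 3) phone numbers
    SELECT phone_number, phone_type FROM user_phone_numbers WHERE user_id = 12345;

    -- 4) screen names
    SELECT screen_name, im_service FROM user_screen_names WHERE user_id = 12345;

    -- 5) work history: index lookup plus a two-way join for the affiliation name, where one exists
    SELECT w.company_name, w.job_title, a.description AS affiliation_name
    FROM user_work_history w
    LEFT JOIN affiliations a ON a.affiliation_id = w.company_affiliation_id
    WHERE w.user_id = 12345;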

In order to have any sort of complexity in your join at all, you need to work backwards from one of the multi-row subsidiary data tables to the user’s name. For example, if you want to produce a list of all of the people affiliated with Microsoft, you’ll need to do the following, assuming you know the affiliation_id (which is your primary key, so you should):

1) Look up the set of all user_ids associated with 42 in the user_affiliations table, using an index lookup.

2) Join on the user table, producing one row per person, including their name by using multiple index lookups and index scans.
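A sketch of that lookup in SQL:

    -- all users affiliated with affiliation_id 42 (Microsoft in the example data)
    SELECT u.user_id, u.first_name, u.last_name
    FROM user_affiliations ua
    JOIN user u ON u.user_id = ua.user_id
    WHERE ua.affiliation_id = 42;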

If you want to avoid the join here, you would need to encode all of a user’s affiliations into columns in the user table—but a user can have more than one affiliation. That means that you’re going to have to do something awful:

One choice is to use multiple columns. This implies an upper limit on the number of affiliations you can associate with a user, and it makes you build very complicated queries like "(affiliation_1 is not null and affiliation_1 = 42) or (affiliation_2 is not null and affiliation_2 = 42) or ..." Assuming you have up to five affiliations per user, you would be doing five index scans here.

Another choice is to use a column that contains an array value. The down side of this is that in most databases, you can’t index that kind of column efficiently.

The nail in the coffin here is that the two-way join you’re trying to replace IS NOT that expensive. It is in fact the most efficient way to answer this question.



To get an actual expensive query, you would have to ask something like "Tell me all of the users who are affiliated with Microsoft and have in the past been affiliated with i2 Technologies." (You could also include phone numbers or screen names here, but you probably don’t have indexes to let you search for users by phone number, anyway. You almost certainly do have one for past affiliations.)

To do this query, you need a three-way join.

1) You generate from the user_affiliations table a set of the user_ids that are affiliated with 42, using an index lookup.

2) You generate from the user_work_history table a set of user_ids that were affiliated with 78, using an index lookup.

3) You do a set intersection operation on these two sets, using a fast merge operation.

4) You look up all of the user table entries for user_ids in the intersection, using a series of index lookups and scans.

This is a three-way join, and is potentially expensive. The expense is ameliorated by the fact that almost every operation you do is index-based. The most expensive thing that happens is that the two lists of users from #1 and #2 must be sorted before #3 can happen.
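A sketch of that query, written to mirror the set-intersection steps described above (INTERSECT is not available in some databases, e.g. older MySQL, where an equivalent join or EXISTS formulation would be used instead):

    -- users currently affiliated with 42 who have 78 in their work history
    SELECT u.user_id, u.first_name, u.last_name
    FROM user u
    WHERE u.user_id IN (
        SELECT user_id FROM user_affiliations WHERE affiliation_id = 42
        INTERSECT
        SELECT user_id FROM user_work_history WHERE company_affiliation_id = 78
    );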


In short: Denormalizing a database can be a very powerful technique. However, the example you provide is not a place where it is appropriate to de-normalize very much at all. At the most, I would consider including a single phone number and a single screen name in the user table, for the purpose of displaying them in a user list. In that kind of display, you need to choose a single item to display in any case. You would of course also keep the separate tables for searching and maintaining multiple values.

Your suggestion that the above schema requires six-way joins is, quite frankly, laughable. It suggests that your understanding of relational databases is... poor.
John Prevost
Saturday, August 4, 2007 6:32:15 AM (GMT Daylight Time, UTC+01:00)
Oh, and one last point, with regard to the fact that your "normalized" schema is in fact already denormalized to begin with:

The inclusion of the member count in the affiliations table and the company name in the work histories table in your example schema are the kind of denormalization that *is* appropriate. Computing the number of members of an organization is expensive, and it can easily be cached in this way. Computing the name of the organization for the work history table is not that expensive, but still might be useful.

Of the two, placing the computed member count in the affiliations table is the single denormalization most likely to actually provide a useful speed benefit. Without it, you would not be able to afford to display the size of affiliations in any sort of affiliation listing. With it, your code also has a quick way to decide whether to display a full member list without first having to ask the database to produce that list.
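A sketch of how that cached count might be maintained when a user joins a group; the point is that only one affiliations row is touched, not thousands of user rows (the example values are made up):

    -- in the same transaction as the new membership row
    INSERT INTO user_affiliations (user_id, affiliation_id) VALUES (67890, 42);
    UPDATE affiliations SET member_count = member_count + 1 WHERE affiliation_id = 42;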
John Prevost
Saturday, August 4, 2007 6:38:45 AM (GMT Daylight Time, UTC+01:00)
Er. "(affiliation_1 is not null and affiliation_1 = 42) or (affiliation_2 is not null and affiliation_2 = 42)" doesn’t need the "X is not null" bits. Sorry about that—it’s late. :) And apologies for spamming your comments here... I just couldn’t let such a bald mis-statement pass without comment.
John Prevost
Saturday, August 4, 2007 8:32:53 AM (GMT Daylight Time, UTC+01:00)
You shouldn't assume that you know how any particular database engine will perform a given task. They will all be different. For example, some may allocate space to empty fields, some may not; joins may sometimes be performed with table accesses and sometimes with index lookups. Frequently used small tables may be kept in memory and be very quick to access. The advantage of a normalised design is as a consistent start point from which to optimise for performance, if that is what's needed. Optimisation may be by tuning the database, creating indexes or denormalising as appropriate.
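For example, the cheaper optimizations mentioned here and in the post's second update might start with indexes on the lookup columns (a sketch; the index names are illustrative):

    CREATE INDEX ix_user_affiliations_user_id  ON user_affiliations (user_id);
    CREATE INDEX ix_user_phone_numbers_user_id ON user_phone_numbers (user_id);
    CREATE INDEX ix_user_screen_names_user_id  ON user_screen_names (user_id);
    CREATE INDEX ix_user_work_history_user_id  ON user_work_history (user_id);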
Julian Gall
Saturday, August 4, 2007 2:32:17 PM (GMT Daylight Time, UTC+01:00)
I find de-normalization to be completely unnecessary

A much better approach is to cache any query that's expensive to generate. The Memcached approach.
Saturday, August 4, 2007 7:56:06 PM (GMT Daylight Time, UTC+01:00)
I'll agree with Seun's suggestion that caching is the way to go. Facebook should rarely serve a page that is directly tied to the database at all: a database trigger on INSERT or UPDATE should invalidate the cache and regenerate the dependent pages.
Sunday, August 5, 2007 7:04:00 AM (GMT Daylight Time, UTC+01:00)
Caching is another possible solution, either by using a reverse proxy, or maybe even by caching the profile page itself and storing the entire HTML chunk in the database.

In Oracle you could use materialized views.
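For what it's worth, a minimal Oracle materialized view along those lines might look like the following (a sketch; the refresh strategy would depend on how fresh the data needs to be):

    CREATE MATERIALIZED VIEW mv_user_affiliation_names
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT ua.user_id, a.description AS affiliation_name
    FROM user_affiliations ua
    JOIN affiliations a ON a.affiliation_id = ua.affiliation_id;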
Agreeing
Sunday, August 5, 2007 7:05:41 AM (GMT Daylight Time, UTC+01:00)
I did not read the previous comments.. silly me ....

I agree with the caching comments.
Agreeing
Sunday, August 5, 2007 11:53:40 PM (GMT Daylight Time, UTC+01:00)
When doing partitioning, it's common to still have a shared database that merges data from those partitions, either directly via replication or through manual jobs that copy the data/schema as-is or transform it into a new denormalized schema.

One perfect example is the Browse feature on MySpace - that doesn't directly scan the many hundreds of different user partitions, but is instead a new database that's populated from the other partitions and denormalized in such a way as to provide optimum performance for that feature's particular means of accessing data.

As for having tools that help check data consistency across multiple partitions, there really isn't anything generic that can effectively be provided. Different types of features are going to have different requirements. Tools will have to be custom tailored for the problem they are solving.

"I find de-normalization to be completely unnecessary "
That's what everyone says until they're faced with a situation where you can't scale without it. Eventually a server will hit its maximum capacity, with or without caching. LiveJournal released a great paper a couple of years ago about their own partitioning schema and the ceilings they faced; I highly recommend it.

"a database trigger on INSERT or UPDATE should invalidate the cache and regenerate the dependent pages."
This is neither realistic nor scalable - I'm 99.9999% sure that all validation and invalidation rules in Facebook would be in code. It doesn't make sense to create a dependency between the database and the caching layer; cache invalidation is the responsibility of the application consuming both services. Also on this note, you should read a paper from eBay on how they scaled their infrastructure by dumbing down the database and placing all RDBMS responsibilities other than storing data (such as triggers, foreign key constraints, etc.) in the application.

At the end of the day, you can talk about scaling this kind of problem all you want, but unless you actually have to do it, you never know what kind of surprises you're in for. A process that might work for a million members might fail at 10 million.

Regardless of how fine-tuned your queries are or how well covered your indexes are, the more data in a database, the longer a query will take. What might take 0.2 milliseconds at 1 million records could take 20 milliseconds at 100 million records, and so on. This applies to any RDBMS, and caching can only solve so much.