I remember taking an operating systems class in college and marveling at the fact that operating system design seemed less about elegant engineering and more about [what I viewed at the time as] performance hacks. I saw a similar sentiment captured recently by Eric Florenzano in his post It's Caches All the Way Down, where he starts out describing to a friend how a computer works and ends up talking about the various layers of caching, from CPU registers to L1 and L2 caches to RAM and so on.

At the end of his post, Eric asks the following question, one I've often heard at work and in developer forums like programming.reddit:

That's what struck me. When you come down to it, computers are just a waterfall of different caches, with systems that determine what piece of data goes where and when. For the most part, in user space, we don't care about much of that either. When writing a web application or even a desktop application, we don't much care whether a bit is in the register, the L1 or L2 cache, RAM, or if it's being swapped to disk. We just care that whatever system is managing that data, it's doing the best it can to make our application fast.

But then another thing struck me. Most web developers DO have to worry about the cache. We do it every day. Whether you're using memcached or velocity or some other caching system, everyone's manually moving data from the database to a cache layer, and back again. Sometimes we do clever things to make sure this cached data is correct, but sometimes we do some braindead things. We certainly manage the cache keys ourselves and we come up with systems again and again to do the same thing.

Does this strike anyone else as inconsistent? For practically every cache layer down the chain, a system (whose algorithms are usually the result of years and years of work and testing) automatically determines how to best utilize that cache, yet we do not yet have a good system for doing that with databases. Why do you think that is? Is it truly one of the two hard problems of computer science? Is it impossible to do this automatically? I honestly don't have the answers, but I'm interested if you do.

Eric is simultaneously correct and incorrect in his statements about caching and database layers. Every modern database system has caching mechanisms that are transparent to the developer. For LAMP developers there is the MySQL Query Cache, which transparently caches the text of SELECT queries along with their results so that the next time the same query is issued it is served from memory. For WISC developers there are SQL Server's data and procedure caches: the procedure cache stores compiled query plans so that expensive query compilation isn't repeated, and the data cache (the buffer pool) keeps recently retrieved data pages in memory so the server doesn't have to go back to disk for them.

As with everything in programming, developers can eke more value out of these caches by knowing how they work. For example, using parameterized queries or stored procedures significantly reduces the size of the procedure cache in SQL Server. Tony Rogerson wrote an excellent post showing how switching from SQL queries built by string concatenation to parameterized queries can shrink a procedure cache from over 1 gigabyte to less than 1 megabyte. This is similar to the way understanding garbage collection or memory allocation teaches developers to favor recycling objects over creating new ones, and to favor arrays over linked lists to reduce memory fragmentation.
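
Tony's example is SQL Server-specific, but the underlying pattern is easy to illustrate. Here's a minimal sketch using Python's built-in sqlite3 module purely as a stand-in (the table and queries are invented for illustration); the point is that concatenation produces a distinct SQL string for every value, while parameterization reuses a single one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def get_user_concat(user_id):
    # Anti-pattern: string concatenation makes every distinct user_id a
    # distinct SQL text, so a plan-caching server compiles and caches a
    # separate plan for each one -- this is what bloats the procedure cache.
    sql = "SELECT name FROM users WHERE id = " + str(user_id)
    return conn.execute(sql).fetchone()

def get_user_param(user_id):
    # Parameterized: one SQL text no matter the value, so a single cached
    # plan is compiled once and reused for every call.
    sql = "SELECT name FROM users WHERE id = ?"
    return conn.execute(sql, (user_id,)).fetchone()
```

SQLite doesn't have a procedure cache the way SQL Server does, but any database that caches plans keyed on the SQL text benefits from the parameterized form in exactly this way.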

Even though there are caching mechanisms built into relational databases that are transparent to the developer, they typically aren't sufficient for high-performance applications with significant read load. The biggest problem is hardware limitations: a database server will typically have twenty to fifty times more hard drive storage capacity than it has memory, which means it can cache only about 2% to 5% of its entire data set in memory before having to go to disk. For comparison, let's look at some statistics from a popular social networking site (namely Facebook):

  • 48% of active users log in daily
  • an average of 49.9 pages viewed per day per unique user

How much churn do you think the database cache goes through if half of the entire user base is making data requests every day? This explains why Facebook has over 400 memcached hosts storing over 5 terabytes of data in memory instead of simply relying on MySQL's query caching features. The same consideration applies to the majority of social media and social networking sites on the Web today.
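
Some rough back-of-envelope arithmetic with the figures above (all assumed, purely illustrative) shows the scale of the gap:

```python
# A server with 20x-50x more disk than RAM can hold at best 1/20 and at
# worst 1/50 of its data set in memory.
print(1 / 20.0)  # 0.05 -> ~5% of the data cached, best case
print(1 / 50.0)  # 0.02 -> ~2%, worst case

# Facebook's figures from above: ~5 TB spread across 400+ memcached hosts.
print(5 * 1024.0 / 400)  # ~12.8 GB of in-memory cache per host
```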

So the problem isn't a lack of transparent caching functionality in today's relational databases. The problem is the significant difference in storage and I/O capacity between memory and disk in situations where a large percentage of the data set needs to be retrieved regularly. In such cases you want to serve as much data from memory as possible, which means going beyond the physical memory capacity of your database server and investing in dedicated caching servers.
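
The workhorse here is the cache-aside pattern Eric alludes to: check the cache, fall back to the database on a miss, and invalidate on writes. A minimal sketch, assuming the python-memcached client and hypothetical load_user_from_db / save-to-database helpers:

```python
import json
import memcache  # assumes the python-memcached client is installed

mc = memcache.Client(["127.0.0.1:11211"])
USER_TTL = 300  # expiry in seconds; an assumed policy, tune per workload

def load_user_from_db(user_id):
    # Hypothetical stand-in for the real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    """Cache-aside read: check memcached first, fall back to the database."""
    key = "user:%d" % user_id  # one of the hand-managed cache keys Eric mentions
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)
    user = load_user_from_db(user_id)
    mc.set(key, json.dumps(user), time=USER_TTL)
    return user

def update_user(user_id, fields):
    """Write path: update the database, then invalidate the stale cache entry."""
    # save_user_to_db(user_id, fields)  # hypothetical database write
    mc.delete("user:%d" % user_id)
```

Every team ends up hand-rolling some variant of this, key-naming scheme and invalidation bugs included, which is exactly the inconsistency Eric's question points at.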

Now Playing: Kanye West - Two Words (feat. Mos Def, Freeway & The Harlem Boys Choir)

Sunday, 09 November 2008 15:42:55 (GMT Standard Time, UTC+00:00)
Good article. Do you think anything precludes the SQL Server team from allowing folks to scale out by adding caching servers with insane amounts of memory?
Sunday, 09 November 2008 16:09:50 (GMT Standard Time, UTC+00:00)
Good post!
Another form of caching used today is offered by grid computing platforms like Terracotta. TC offers a truly coherent, clustered cache that remains consistent even in the presence of database writes, via write-behind to the system of record. I blogged about this some time back, including how TC caching compares with memcached: http://debasishg.blogspot.com/2008/09/memcached-and-terracotta-alternatives.html
Sunday, 09 November 2008 20:33:10 (GMT Standard Time, UTC+00:00)
I'm enjoying messing with the assumptions made in software and hardware caching by having a RAID 0 array of SSDs. Thankfully Intel finally put their memory controllers on die, and continues to increase system bus speed and bandwidth.
Chris Saari
Tuesday, 11 November 2008 19:47:48 (GMT Standard Time, UTC+00:00)
Good points; the call to use time-tested caching technology, rather than rolling your own, reminds me very much of the arguments I've been using against memcached et al. However, focusing on the database alone misses the bigger picture -- we already have well-proven caching technology for the Web that scales very well, and I use it in my day job all the time.

I've been procrastinating on a 'squid vs. memcached' talk; looks like I need to get that done...
Tuesday, 11 November 2008 21:59:12 (GMT Standard Time, UTC+00:00)
This post was like a red rag to a bull for me. I consult on databases all day long and, quite honestly, the number one reason why people can't just depend on the database for adequate caching is this:

SHIT CODE

I can't remember the last time I walked into what I will describe as an "e-commerce" or "Enterprise 2.0" shop that had code that didn't make me want to reach for the eyeball fork. No database in the world will cope with some of the inanities slung at it on a daily basis.

What is it with "web programmers"? I used to be a programmer and I was very successful at it. But my code was always pretty close to the metal, and it seems to me that with all these layers of abstraction and convenience-programming gimmicks like Hibernate, programmers just don't care about the efficiency of their code AT ALL. I'm not talking about eking out every possible drop of performance, I'm just talking about getting the basics right. If you've got a zillion caching servers, why should you care about program efficiency? So they don't.

I did a lot of work at a very high-traffic website that has quite literally hundreds of caching servers. They have a pair of mainframe-sized 32-way Unix boxes with hundreds of GB of RAM serving the database, for load balancing. But when you look at what the database is actually doing, it's completely apparent that a 4-way Linux box and a bit of due care and diligence would be more than adequate.

I wish there was some way to knock some sense into programmers again.

Take some bloody pride in your craft, you lazy bastards.
Wednesday, 12 November 2008 07:11:13 (GMT Standard Time, UTC+00:00)
Manual trackback:

http://www.martin-probst.com/blog/2008/11/11/databases-and-caching

It's not about the data size at all - CPU caches are much smaller relative to main memory, but they get the job done. It's about available knowledge and the quality of the optimizers.