I remember taking an operating systems class in college and marveling at the fact that operating system design seemed less about elegant engineering and more about [what I viewed at the time as] performance hacks. I saw a similar sentiment recently captured by Eric Florenzano in his post It's Caches All the Way Down where he starts describing how a computer works to a friend and ends up talking about the various layers of caching from CPU registers to L2 caches to RAM and so on.
At the end of his post Eric Florenzano asks the following question I've often heard at work and in developer forums like programming.reddit
That's what struck me. When you come down to it, computers are just a waterfall of different caches, with systems that determine what piece of data goes where and when. For the most part, in user space, we don't care about much of that either. When writing a web application or even a desktop application, we don't much care whether a bit is in the register, the L1 or L2 cache, RAM, or if it's being swapped to disk. We just care that whatever system is managing that data, it's doing the best it can to make our application fast. But then another thing struck me. Most web developers DO have to worry about the cache. We do it every day. Whether you're using memcached or velocity or some other caching system, everyone's manually moving data from the database to a cache layer, and back again. Sometimes we do clever things to make sure this cached data is correct, but sometimes we do some braindead things. We certainly manage the cache keys ourselves and we come up with systems again and again to do the same thing. Does this strike anyone else as inconsistent? For practically every cache layer down the chain, a system (whose algorithms are usually the result of years and years of work and testing) automatically determines how to best utilize that cache, yet we do not yet have a good system for doing that with databases. Why do you think that is? Is it truly one of the two hard problems of computer science? Is it impossible to do this automatically? I honestly don't have the answers, but I'm interested if you do.
That's what struck me. When you come down to it, computers are just a waterfall of different caches, with systems that determine what piece of data goes where and when. For the most part, in user space, we don't care about much of that either. When writing a web application or even a desktop application, we don't much care whether a bit is in the register, the L1 or L2 cache, RAM, or if it's being swapped to disk. We just care that whatever system is managing that data, it's doing the best it can to make our application fast.
But then another thing struck me. Most web developers DO have to worry about the cache. We do it every day. Whether you're using memcached or velocity or some other caching system, everyone's manually moving data from the database to a cache layer, and back again. Sometimes we do clever things to make sure this cached data is correct, but sometimes we do some braindead things. We certainly manage the cache keys ourselves and we come up with systems again and again to do the same thing.
Does this strike anyone else as inconsistent? For practically every cache layer down the chain, a system (whose algorithms are usually the result of years and years of work and testing) automatically determines how to best utilize that cache, yet we do not yet have a good system for doing that with databases. Why do you think that is? Is it truly one of the two hard problems of computer science? Is it impossible to do this automatically? I honestly don't have the answers, but I'm interested if you do.
Eric is simultaneously correct and incorrect in his statements around caching and database layers. Every modern database system has caching mechanisms that are transparent to the developer. For LAMP developers there is the MySQL Query Cache which transparently caches the text of SELECT query and the results so that the next time the query is performed it is fetched from memory. For WISC developers there are the SQL Server's Data and Procedure caches which store query plans and their results to prevent having to repeatedly perform expensive computations or go to disk to fetch recently retrieved data. As with everything in programming, developers can eke more value out of these caches by knowing how they work. For example, using parameterized queries or stored procedures significantly reduces the size of the procedure cache in SQL Server. Tony Rogerson wrote an excellent post where he showed how switching from SQL queries based on string concatenation to parameterized queries can reduce the size of a procedure cache from over 1 Gigabyte to less than 1 Megabyte. This is similar to understanding how garbage collection or memory allocation works teaches developers to favor recycling objects instead of creating new ones and favoring arrays over linked lists to reduce memory fragmentation.
Even though there are caching mechanisms built into relational databases that are transparent to the developer, they typically aren't sufficient for high performance applications with significant amounts of read load. The biggest problem is hardware limitations. A database server will typically have twenty to fifty times more hard drive storage capacity than it has memory. Optimistically this means a database server can cache about 5% to 10% of its entire data in memory before having to go to disk. In comparison, lets look at some statistics from a popular social networking site (i.e. Facebook)
How much churn do you think the database cache goes through if half of the entire user base is making data requests every day? This explains why Facebook has over 400 memcached hosts storing over 5 Terabytes of data in-memory instead of simply relying on the query caching features in MySQL. This same consideration applies for the majority of social media and social networking sites on the Web today.
So the problem isn't a lack of transparent caching functionality in relational databases today. The problem is the significant differences in the storage and I/O capacity of memory versus disk in situations where a large percentage of the data set needs to be retrieved regularly. In such cases, you want to serve as much data from memory as possible which means going beyond the physical memory capacity of your database server and investing in caching servers.
Now Playing: Kanye West - Two Words (feat. Mos Def, Freeway & The Harlem Boys Choir)