The debate on the pros and cons of non-relational databases, typically described as “NoSQL databases”, has recently been heating up. The anti-NoSQL backlash is in full swing, from the rebuttal to a recent post of mine in Dennis Forbes’s write-up The Impact of SSDs on Database Performance and the Performance Paradox of Data Explodification (aka Fighting the NoSQL mindset) to similar thoughts expressed in typical rant-y style by Ted Dziuba in his post I Can't Wait for NoSQL to Die.

This will probably be my last post on the topic for a while, given that the discussion has now veered into religious debate territory similar to vi vs. emacs or functional vs. object-oriented programming. With that said…

It would be easy to write rebuttals of what Dziuba and Forbes have written, but from what I can tell people are now talking past each other and defending entrenched positions. So instead I’ll leave this topic with an analogy: SQL databases are like automatic transmissions and NoSQL databases are like manual transmissions. Once you switch to NoSQL, you become responsible for a lot of work that a relational database system takes care of automatically, similar to what happens when you pick manual over automatic transmission. Secondly, NoSQL allows you to eke more performance out of the system by removing from the database tier a lot of the integrity checks done by relational databases. Again, this is similar to how you can get more performance out of your car by driving a manual transmission versus an automatic transmission vehicle.

However the most notable similarity is this: just as most of us can’t really take advantage of the benefits of a manual transmission vehicle because the majority of our driving is sitting in traffic on the way to and from work, there is a similar harsh reality in that most sites aren’t at Google or Facebook’s scale and thus have no need for a Bigtable or Cassandra. As I mentioned in my previous post, I believe a lot of the problems people have with relational databases at web scale can be addressed by taking a hard look at adding in-memory caching solutions like memcached to their infrastructure before deciding to throw out their relational database systems.

Now Playing: Lady Gaga - Bad Romance


 

Categories: Web Development

I was recently on a panel at the South by Southwest interactive conference (SXSW) where we discussed multiple applications of the real-time Web and the things that might prevent us from seeing its true potential. I found it interesting that the key takeaway from the panel was that privacy issues will be one of the biggest problems we face as we move forward. You can see this perspective in CNN’s coverage of the panel in the story Privacy concerns hinder 'real-time Web' creation, developers say and in GigaOm’s write-up SXSW: Is Privacy on the Social Web a Technical Problem?

This overlap of privacy and real-time web features is brought into sharp relief when you look at services such as Foursquare and Gowalla, which provide a mechanism for people to broadcast their physical location to a group of friends in real time. I started using Foursquare last week and I’ve noticed I’m even more careful about whose friend requests I accept than I am on Facebook or Windows Live Messenger. The fact is that I may share status updates and photos with people, but that doesn’t mean I want them to be aware of where I am on an up-to-the-minute basis, especially if I’m out spending time with my family and friends. This difference between how we view location data and other sorts of real-time data we share is captured by the co-founder of Foursquare in the article Facebook Isn't For Real Life Friends Anymore, Says Foursquare's Dennis Crowley, where it states:

Facebook plans to clone Foursquare's central service -- the ability for site members to use their phones to "check-in" from restaurants and bars -- and make it a mere Facebook feature.

But Foursquare cofounder Dennis Crowley says there's something Facebook can't clone: the real-life friendships between Foursquare users.

"Facebook used to be who your friends are, now it's everyone," Dennis told us in an interview.

"[Foursquare] is more tightly curated to who you want to have as your check-in friends. Facebook is good place for status updates and sharing photos, not to keep tabs on where people are going."

I think Crowley is on to something when he says Facebook can’t clone the Foursquare relationship model. I suspect that like Twitter, Foursquare has created a social network whose value proposition is differentiated enough from Facebook’s that it can grow into a relatively popular albeit smaller service that will not be “killed” by Facebook*. Secondly, there is a lot of synergy between Foursquare and Facebook, as evidenced by the fact that Facebook is the largest referrer of traffic to Foursquare thanks to their implementation of Facebook Connect. So I think the claims that one will kill the other are just the usual tech press creating conflict to generate page views.

One thing I have noticed is that location can’t just be a field you bolt on to a status update. It has to be a key part of the information you are sharing with others, otherwise it adds little value to the user experience and in fact may detract from it by adding clutter. For example, compare what a location-based update from Foursquare looks like on Facebook versus what the exact same update looks like on Twitter:

[Screenshot: the Foursquare check-in as it appears on Facebook] vs. [the same check-in as it appears on Twitter]

The difference between the two updates is night and day even though the actual status text I shared is the same. The way Twitter has approached location is to treat it as a bunch of “poorly translated” GPS coordinates bolted on to the end of my status update. The Facebook update not only gives you that but also a human-readable location for where I am, down to the room number, and includes some social context such as the fact that I was attending the talk with two coworkers from Windows Live.

As real-time location data starts to permeate social experiences, there’s a lot to learn from the above screenshots. In the example above, people who were interested in the topic based on my status could tell from the Facebook update which room to find danah’s talk in, whereas the Twitter update only told them “downtown austin”. As designers of social software applications, we should be mindful that location data should enhance the experience and the information being shared. Adding location simply for buzzword compliance, or to add metadata to the status update without enhancing the experience, just ends up crufting it up.

* Twitter’s value proposition is that it is the place to interact with celebrities and microcelebrities that you care about. It is useful to note that the much maligned Suggested Users List was key in establishing this value proposition in the minds of users. This is different from Facebook’s position as the social network for your real world friends, family, coworkers and acquaintances.

Now Playing: B.O.B. - Nothin' On You (featuring Bruno Mars)


 

A few weeks ago Todd Hoff over on the High Scalability blog penned a blog post titled MySQL and Memcached: End of an Era? where he wrote

If you look at the early days of this blog, when web scalability was still in its heady bloom of youth, many of the articles had to do with leveraging MySQL and memcached. Exciting times. Shard MySQL to handle high write loads, cache objects in memcached to handle high read loads, and then write a lot of glue code to make it all work together. That was state of the art, that was how it was done. The architecture of many major sites still follow this pattern today, largely because with enough elbow grease, it works.

With a little perspective, it's clear the MySQL+memcached era is passing.

LinkedIn has moved on with their Project Voldemort. Amazon went there a while ago.

Digg declared their entrance into a new era in a post on their blog titled Looking to the future with Cassandra.

Twitter has also declared their move in the article Cassandra @ Twitter: An Interview with Ryan King.

Todd’s blog has been a useful source of information on the topic of scaling large-scale websites since he catalogues as many presentations as he can find from industry leaders on how they’ve designed their systems to deal with millions to hundreds of millions of users pounding their services every day. What he’s written above is really an observation about industry trends and isn’t meant as an attack on any technology. I did find it interesting that many took it as an attack on memcached and/or relational databases and came out swinging.

One post which I thought tried to take a balanced approach to rebuttal was Dennis Forbes’ Getting Real about NoSQL and the SQL-Isn't-Scalable Lie where he writes

I work in the financial industry. RDBMS’ and the Structured Query Language (SQL) can be found at the nucleus of most of our solutions. The same was true when I worked in the insurance, telecommunication, and power generation industries. So it piqued my interest when a peer recently forwarded an article titled “The end of SQL and relational databases”, adding the subject line “We’re living in the past”. [Though as Michael Stonebraker points out, SQL the query language actually has remarkably little to actually do with the debate. It would be more clearly called NoACID]

From a vertical scaling perspective — it’s the easiest and often the most computationally effective way to scale (albeit being very inefficient from a cost perspective) — you have the capacity to deploy your solution on powerful systems with armies of powerful cores, hundreds of GBs of memory, operating against SAN arrays with ranks and ranks of SSDs.

The computational and I/O capacity possible on a single “machine” are positively enormous. The storage system, which is the biggest limiting factor on most database platforms, is ridiculously scalable, especially in the bold new world of SSDs (or flash cards like the FusionIO).

From a horizontal scaling perspective you can partition the data across many machines, ideally configuring each machine in a failover cluster so you have complete redundancy and availability. With Oracle RAC and Sybase ASE you can even add the classic clustering approach. Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve built a system for a large corporation, deployed in your own datacenter, with few constraints beyond the limits of technology and the platform.

Your solution will cost hundreds of thousands of dollars (if not millions) to deploy, but that isn’t a critical blocking point for most enterprises. This sort of scaling is at the heart of virtually every bank, trading system, energy platform, retailing system, and so on.

To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming evidence, defies all reason.

There’s lots of good food for thought in both blog posts. Todd is right that a few large scale websites are, based on their experiences, moving beyond the horizontal scaling approach that Dennis brought up in his rebuttal. What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you notice you’ve given up most of the features of an ACID relational database. You give up the advantages of relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. In the end, all you have left is that your data is Durable (i.e. it is persistently stored), which isn’t much better than what you get from a dumb file system. Well, actually you also get to use SQL as your programming model, which is nicer than performing direct file I/O operations.

It is unsurprising that after being at this point for years, some people in our industry have wondered whether it doesn’t make more sense to use data stores that are optimized for the usage patterns of large scale websites instead of gloriously misusing relational databases. A good example of the tradeoffs is the blog post from the Digg team on why they switched to Cassandra. The database was already sharded, which made performing joins to calculate the results of queries such as “which of my friends Dugg this item?” infeasible. So instead they had to perform two reads from SQL (all Diggs on an item and all of the user’s friends) and then perform the intersection in the PHP front-end code. If the item was not already cached, this led to disk I/O which could take seconds. To make the situation worse, you actually want to perform this operation multiple times on a single page view since it is reasonable to expect multiple Digg buttons on a page if it has multiple stories on it.
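To make that read path concrete, here is a minimal sketch of the two-reads-plus-intersection pattern. The diggs and friendships tables and their columns are hypothetical, and SQLite is used purely as a stand-in for the sharded MySQL tier:

```python
# A sketch of the pre-Cassandra read path described above: two separate reads
# followed by a set intersection in application code. Table and column names
# are hypothetical; sqlite3 stands in for the sharded MySQL tier.
import sqlite3

def friends_who_dugg(conn, user_id, item_id):
    # Read 1: every user who Dugg this item (potentially a very large set).
    diggers = {row[0] for row in conn.execute(
        "SELECT user_id FROM diggs WHERE item_id = ?", (item_id,))}
    # Read 2: the viewing user's friends.
    friends = {row[0] for row in conn.execute(
        "SELECT friend_id FROM friendships WHERE user_id = ?", (user_id,))}
    # The join the sharded database can no longer do for us: intersect the two
    # sets in the application tier.
    return diggers & friends
```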

An alternate approach is to denormalize the data and, for each user, store a list of stories that have been Dugg by at least one of their friends. So whenever I Digg an item, an entry is placed in each of my friends’ lists to indicate that the story is now one that has been Dugg by a friend. That way, when a friend of mine shows up, it is a simple lookup to answer “is this story ID on the list of stories Dugg by one of their friends?” The challenge here is that Digging an item can result in literally thousands of logical write operations. It has traditionally been prohibitively expensive to incur such massive amounts of write I/O in relational databases, with all of their transactionality and enforcement of ACID constraints. NoSQL databases like Cassandra, which assume your data is denormalized, are actually optimized for write-heavy workloads given the necessity of performing enormous amounts of writes to keep the denormalized data consistent.
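Here is a minimal in-memory sketch of that fan-out-on-write idea, purely to show the shape of the reads and writes. The function and data structure names are hypothetical, and a real system would issue the per-friend writes against a store like Cassandra rather than a Python dict:

```python
# Fan-out-on-write sketch: each Digg is written to every friend's list so that
# rendering a Digg button later is a single membership check.
from collections import defaultdict

# user_id -> set of item_ids that at least one of their friends has Dugg
dugg_by_friends = defaultdict(set)

def record_digg(digger_id, item_id, get_friend_ids):
    # One Digg fans out into one logical write per friend, which is how a single
    # action can turn into thousands of writes for a well-connected user.
    for friend_id in get_friend_ids(digger_id):
        dugg_by_friends[friend_id].add(item_id)

def was_dugg_by_a_friend(viewer_id, item_id):
    # The read side is now a cheap lookup instead of a cross-shard intersection.
    return item_id in dugg_by_friends[viewer_id]
```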

Digg’s usage of Cassandra actually serves as a rebuttal to Dennis Forbes’s article, since they couldn’t feasibly get what they wanted with either horizontal or vertical scaling of their relational database-based solution. I would argue that introducing memcached into the mix would have addressed the disk I/O concerns, because all records of who has Dugg an item could be stored in memory, so working out which of my friends have Dugg an item would never have to go to disk to answer any part of the query. The only caveat with that approach is that RAM is more expensive than disk, so you’ll need a lot more servers to store 3 terabytes of data in memory than you would on disk.
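For what it’s worth, that memcached alternative is just the classic cache-aside pattern. A minimal sketch, assuming a hypothetical load_diggers_from_db fallback and using the pymemcache client:

```python
# Cache-aside sketch: keep the full set of users who Dugg each item in memcached
# so the friend intersection is answered from RAM. The cache key scheme and the
# load_diggers_from_db fallback are hypothetical.
import pickle
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def diggers_for_item(item_id, load_diggers_from_db):
    key = "diggs:%d" % item_id
    cached = cache.get(key)
    if cached is not None:
        return pickle.loads(cached)            # cache hit: no disk I/O at all
    diggers = load_diggers_from_db(item_id)    # cache miss: one trip to the database
    cache.set(key, pickle.dumps(diggers))
    return diggers
```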

However, the programming model is not the only factor one must consider when deciding whether to stay with a sharded/partitioned relational database or go with a NoSQL solution. The other factor to consider is the actual management of the database servers. The sorts of questions one has to ask when choosing a database solution are listed in the interview with Ryan King of Twitter, where he gives the following checklist that they evaluated before deciding to go with Cassandra over MySQL:

We first evaluated them on their architectures by asking many questions along the lines of:

  • How will we add new machines?
  • Are there any single points of failure?
  • Do the writes scale as well?
  • How much administration will the system require?
  • If it's open source, is there a healthy community?
  • How much time and effort would we have to expend to deploy and integrate it?
  • Does it use technology which we know we can work with?

The problem with database sharding is that it isn’t really a supported out-of-the-box configuration for your traditional relational database product, especially the open source ones. How your system deals with new machines being added to the cluster or handles machine failure often requires special-case code written by application developers, along with special hand-holding by operations teams. Dealing with issues related to database replication (whether multi-master or single-master) also often takes up unexpected amounts of manpower once sharding is involved.
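As an illustration of the kind of glue code that ends up in the application tier, here is a naive hash-modulo shard router; the server names and hashing scheme are hypothetical. It also hints at why adding machines is painful: changing the shard count remaps most keys, so resharding and the accompanying data migration become the application team’s problem:

```python
# Naive shard routing of the sort applications end up owning once a relational
# database is manually sharded. Server names are placeholders.
import hashlib

SHARDS = ["db01", "db02", "db03", "db04"]

def shard_for(user_id):
    # Deterministically map a user to a shard. Appending "db05" to SHARDS changes
    # the modulus and remaps most existing keys, which is why adding capacity
    # requires hand-written migration and special-case failover code.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```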

For these reasons I expect we’ll see more large scale websites decide that, instead of treating a SQL database as a denormalized key-value store, they would rather use a NoSQL database. However I also suspect that a lot of services that already have a sharded relational database + in-memory cache solution can get a lot of mileage from more judicious usage of in-memory caches before switching. This is especially true given that you still need caches in front of your NoSQL databases anyway. There’s also the question of whether traditional relational database vendors will add features to address the shortcomings highlighted by the NoSQL movement. Given that the sort of companies adopting NoSQL are doing so because they want to save costs on software, hardware and operations, I somehow doubt there is a lucrative market here for database vendors compared to adding more features that the banks, insurance companies and telcos of the world find interesting.

Now Playing: Birdman - Money To Blow (featuring Drake & Lil Wayne)


 

Categories: Web Development