Last week Joel Spolsky wrote a blog post entitled The Duct Tape Programmer where he praises developers who favor simple programming practices to complex ones. This blog post strongly resonated with me and made me recall some related thoughts on complexity and solving problems in software projects. Some key excerpts from his which I'll use as a jumping off point are below

Jamie Zawinski is what I would call a duct-tape programmer. And I say that with a great deal of respect. He is the kind of programmer who is hard at work building the future, and making useful things so that people can do stuff.
Duct tape programmers are pragmatic. Zawinski popularized Richard Gabriel’s precept of
Worse is Better. A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has because it’s in your lab where you’re endlessly polishing the damn thing. Shipping is a feature. A really important feature. Your product must have it.

One principle duct tape programmers understand well is that any kind of coding technique that’s even slightly complicated is going to doom your project. Duct tape programmers tend to avoid C++, templates, multiple inheritance, multithreading, COM, CORBA, and a host of other technologies that are all totally reasonable, when you think long and hard about them, but are, honestly, just a little bit too hard for the human brain.

The urge the reduce the complexity of the tools used to solve software problems is one that every software developer should share. However even more important is reducing the complexity of the actual solutions that are delivered to your customers at the end of the day. End users can't tell if you used complicated C++ techniques like template metaprogramming and mixins to build the application. They can tell when your application fails to solve their actual problems in a straightforward way or is so late to ship due to project delays that they lose interest in waiting for you to solve their problems.

There are many famous and everyday examples of this culture of complexity in software projects which are eventually trumped by solutions that solve 80% of the problem in a simple way. My favorite example is contrasting the World Wide Web invented by Tim Berners-Lee with Project Xanadu as envisioned by Ted Nelson.  Today the WWW is used by over a billion people to enrich their lives in myriad ways on a daily basis and has created hundreds of billions dollars in value by minting an entire new industry. Project Xanadu is a sad footnote spoken about in hushed tones by fans of hypertext who bewail the success of the Web and how it has forced us to settle for less (i.e. Worse Is Better).

If you aren't familiar with Project Xanadu you can think of it as a networked system of hyperlinked documents and media just like the WWW which had to satisfy the following seventeen rules

    1. Every Xanadu server is uniquely and securely identified.
    2. Every Xanadu server can be operated independently or in a network.
    3. Every user is uniquely and securely identified.
    4. Every user can search, retrieve, create and store documents.
    5. Every document can consist of any number of parts each of which may be of any data type.
    6. Every document can contain links of any type including virtual copies ("transclusions") to any other document in the system accessible to its owner.
    7. Links are visible and can be followed from all endpoints.
    8. Permission to link to a document is explicitly granted by the act of publication.
    9. Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document.
    10. Every document is uniquely and securely identified.
    11. Every document can have secure access controls.
    12. Every document can be rapidly searched, stored and retrieved without user knowledge of where it is physically stored.
    13. Every document is automatically moved to physical storage appropriate to its frequency of access from any given location.
    14. Every document is automatically stored redundantly to maintain availability even in case of a disaster.
    15. Every Xanadu service provider can charge their users at any rate they choose for the storage, retrieval and publishing of documents.
    16. Every transaction is secure and auditable only by the parties to that transaction.
    17. The Xanadu client-server communication protocol is an openly published standard. Third-party software development and integration is encouraged.

Reading this list is like going through a list of places where World Wide Web fails. Rule #14 which implies every document on the network is redundantly backed up in disparate locations so they can always be is something the WWW doesn't do today which is why we have broken links and 404s all the time. Rule #9 implies that not only is copyright respected and tracked throughout the system but there is even a micropayment platform built in. All the discussions on micropayments saving newspapers would be moot if Project Xanadu ruled the world since it would have existed from day one. Rule #16 on transactions being secure and auditable sounds like Nirvana in today's world of botnets, malware and phishing scams which plague the Web.

Yet despite the fact that the forty year old Project Xanadu is a more compelling vision than were we are today it failed and Tim Berners-Lee's World Wide Web succeeded. In practical terms, Project Xanadu was trying to solve too many complex problems in a v1 product. In contrast, Tim Berners-Lee focused on the most valuable problems to solve for end users which was sharing documents and media with anyone on the Internet and punted on a bunch of the hard problems that would require a more controlled and tightly coupled network as well as a ton of more code. Tim Berners-Lee solved less than half the problems Project Xanadu set out to solve but has changed the world immeasurably for billions of people by providing simple solutions to complex problems and running away from trying to create complex solutions to complex problems.

The bottom line is that a lot of the time it's OK to create a solution that solves 80% of the problem. Always remember that shipping is a feature.

Note Now Playing: Drake, Kanye West, Lil Wayne & Eminem - Forever Note


Categories: Programming | Ramblings

Every week there seems to be some new A-list blogger criticizing Twitter's Suggested User's List which is a selection of celebrities and brands that are suggested to new Twitter users as people the user might like to follow. This week it's Robert Scoble with You’re not on Twitter’s suggested user list but you are in good company that points out a number of interesting celebrities and brands that aren't on the list. Last week Dave Winer asked The SUL as a tool to control news?

I've had my issues with the SUL mainly from the perspective of how it ends up presenting Twitter to new users. When my wife joined Twitter I'd have loved it if the service had used integration with Facebook, Windows Live, MySpace, etc to suggest people who she already knew who were on Twitter. Instead the service prioritized pitching that she follow Shaquille O'Neal, Dell Outlet stores, NBC's Today Show and Jessica Simpson's kid sister. To find me on Twitter, my wife had to ask me for my Twitter handle in person. I felt like we were back in the dark ages of social networking.

In retrospect, not doing what I preferred them to do shows a lot of insight. It prevents the site from being viewed as yet another service where you have a duplicated social graph and thus has to compete head to head with the Facebooks and MySpaces of the world. Instead it pitches Twitter as a sort of user friendly RSS reader where you connect with your favorite celebrities and brands instead of another place where you get status updates from people who you're already getting status updates from in Facebook.


Note Now Playing: Jay Sean - Down (feat. Lil Wayne) Note


Categories: Social Software

Database normalization is a technique for designing relational database schemas that ensures that the data is optimal for ad-hoc querying and that modifications such as deletion or insertion of data does not lead to data inconsistency. Database denormalization is the process of optimizing your database for reads by creating redundant data. A consequence of denormalization is that insertions or deletions could cause data inconsistency if not uniformly applied to all redundant copies of the data within the database.

Why Denormalize Your Database?

Today, lots of Web applications have "social" features. A consequence of this is that whenever I look at content or a user in that service, there is always additional content from other users that also needs to be pulled in to page. When you visit the typical profile on a social network like Facebook or MySpace, data for all the people that are friends with that user needs to be pulled in. Or when you visit a shared bookmark on you need data for all the users who have tagged and bookmarked that URL as well. Performing a query across the entire user base for "all the users who are friends with Robert Scoble" or "all the users who have bookmarked this blog link" is expensive even with caching. It is orders of magnitude faster to return the data if it is precalculated and all written to the same place.

This is optimizes your reads at the cost of incurring more writes to the system. It also means that you'll end up with redundant data because there will be multiple copies of some amount of user data as we try to ensure the locality of data.

A good example of a Web application deciding to make this trade off is the recent post on the Digg Blog entitled Looking to the Future with Cassandra which contains the following excerpt

The Problem

In both models, we’re computing the intersection of two sets:

  1. Users who dugg an item.
  2. Users that have befriended the digger.

The Relational Model

The schema for this information in MySQL is:

  `id`      INT(11),
  `itemid`  INT(11),
  `userid`  INT(11),
  `digdate` DATETIME,
  PRIMARY KEY (`id`),
  KEY `user`  (`userid`),
  KEY `item`  (`itemid`)
  `id`           INT(10) AUTO_INCREMENT,
  `userid`       INT(10),
  `username`     VARCHAR(15),
  `friendid`     INT(10),
  `friendname`   VARCHAR(15),
  `mutual`       TINYINT(1),
  `date_created` DATETIME,
  PRIMARY KEY                (`id`),
  UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
  KEY        `Friend_friend` (`friendid`)

The Friends table contains many million rows, while Diggs holds hundreds of millions. Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP. The steps are:

  1. Query Friends for all my friends. With a cold cache, this takes around 1.5 seconds to complete.
  2. Query Diggs for any diggs of a specific item by a user in the set of friend user IDs. This query is enormous, and looks something like:
    SELECT `digdate`, `id` FROM `Diggs`
     WHERE `userid` IN (59, 9006, 15989, 16045, 29183,
                        30220, 62511, 75212, 79006)
       AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;

    The real query is actually much worse than this, since the IN clause contains every friend of the user, and this can balloon to hundreds of user IDs. A full query can actually clock in at 1.5kb, which is many times larger than the actual data we want. With a cold cache, this query can take 14 seconds to execute.

Of course, both queries are cached, but due to the user-specific nature of this data, it doesn’t help much.

The solution the Digg development team went with was to denormalize the data. They also went an additional step and decided that since the data was no longer being kept in a relational manner there was no point in using a traditional relational database (i.e. MySQL) and instead they migrated to a non-RDBMS technology to solve this problem.


How Denormalization Changes Your Application

There are a number of things to keep in mind once you choose to denormalize your data including

  1. Denormalization means data redundancy which translates to significantly increased storage costs. The fully denormalized data set from the Digg exampled ended up being 3 terabytes of information. It is typical for developers to underestimate the data bloat that occurs once data is denormalized.

  2. Fixing data inconsistency is now the job of the application. Let's say each user has a list of the user names of all of their friends. What happens when one of these users changes their user name? In a normalized database that is a simple UPDATE query to change a single piece of data and then it will be current everywhere it is shown on the site. In a denormalized database, there now has to be a mechanism for fixing up this name in all of the dozens, hundreds or thousands of places it appears. Most services that create denormalized databases have "fixup" jobs that are constantly running on the database to fix such inconsistencies.

The No-SQL Movement vs. Abusing Relational Databases for Fun & Profit

If you’re a web developer interested in building large scale applications, it doesn’t take long in reading the various best practices on getting Web applications to scale such as practicing database sharding or eschewing transactions before it begins to sound like all the advice you are getting is about ignoring or abusing the key features that define a modern relational database system. Taken to its logical extreme all you really need is a key<->value or tuple store that supports some level of query functionality and has decent persistence semantics. Thus the NoSQL movement was borne.

The No-SQL movement is a used to describe the increasing usage of non-relational databases among Web developers. This approach has initially pioneered by large scale Web companies like Facebook (Cassandra), Amazon (Dynamo) & Google (BigTable) but now is finding its way down to smaller sites like Digg. Unlike relational databases, there is a yet to be a solid technical definition of what it means for a product to be a "NoSQL" database aside from the fact that it isn't a relational database. Commonalities include lack of fixed schemas and limited support for rich querying. Below is a list of some of the more popular NoSQL databases that you can try today along with a brief description of their key qualities 

  1. CouchDB: A document-oriented database where documents can be thought of as JSON/JavaScript objects. Creation, retrieval, update and deletion (CRUD) operations are performed via a RESTful API and support ACID properties. Rich querying is handled by creating Javascript functions called "Views" which can operate on the documents in the database via Map/Reduce style queries. Usage: Although popular among the geek set most users seem to be dabblers as opposed to large scale web companies. 

  2. Cassandra: A key-value store where each key-value pair comes with a timestamp and can be grouped together into a column family (i.e. a table). There is also a notion of super columns which are columns that contain whose values are a list of other key-value pairs. Cassandra is optimized to be always writable and uses eventual consistency to deal with the conflicts that inevitably occur when a distributed system aims to be always writable yet node failure is a fact of life. Querying is available via the Cassandra Thrift API and supports fairly basic data retrieval operations based on key values and column names. Usage: Originally developed and still used at Facebook today. Digg and Rackspace are the most recent big name adopters.

  3. Voldemort: Very similar to Cassandra which is unsurprising since they are both inspired by Amazon's Dynamo. Voldemort is a key-value store where each key value pair comes with a timestamp and eventual consistency is used to address write anomalies. Values can contain a list of further key value pairs. Data access involves creation, retrieval and deletion of serialized objects whose format can be one of JSON, strings, binary BLOBs, serialized Java objects and Google Protocol Buffers. Rich querying is non-existent, simple get and put operations are all that exist.  Usage: Originally developed and still used at LinkedIn.

There are a number of other interesting NoSQL databases such as HBase, MongoDB and Dynomite but the three above seem to be the most mature from my initial analysis. In general, most of them seem to be a clone of BigTable, Dynamo or some amalgam of ideas from both papers. The most original so far has been CouchDB.

An alternative to betting on a speculative database technologies at varying levels of maturity is to misuse an existing mature relational database product. As mentioned earlier, many large scale sites use relational databases but eschew relational features such as transactions and joins to achieve scalability. Some developers have even taken that practice to an extreme and built schema-less data models on top of traditional relational database. A great example of this How FriendFeed uses MySQL to store schema-less data which is a blog post excerpted below

Lots of projects exist designed to tackle the problem storing data with flexible schemas and building new indexes on the fly (e.g., CouchDB). However, none of them seemed widely-used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs (see this somewhat outdated article on CouchDB, for example). MySQL works. It doesn't corrupt data. Replication works. We understand its limitations already. We like MySQL for storage, just not RDBMS usage patterns.

After some deliberation, we decided to implement a "schema-less" storage system on top of MySQL rather than use a completely new storage system.

Our datastore stores schema-less bags of properties (e.g., JSON objects or Python dictionaries). The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque as far as the datastore is concerned. We can change the "schema" simply by storing new properties.

In MySQL, our entities are stored in a table that looks like this:

CREATE TABLE entities (
    id BINARY(16) NOT NULL,
    body MEDIUMBLOB,
    UNIQUE KEY (id),
    KEY (updated)

The added_id column is present because InnoDB stores data rows physically in primary key order. The AUTO_INCREMENT primary key ensures new entities are written sequentially on disk after old entities, which helps for both read and write locality (new entities tend to be read more frequently than old entities since FriendFeed pages are ordered reverse-chronologically). Entity bodies are stored as zlib-compressed, pickled Python dictionaries.

Now that the FriendFeed team works at Facebook I suspect they'll end up deciding that a NoSQL database that has solved a good story around replication and fault tolerance is more amenable to solving the problem of building a schema-less database than storing key<->value pairs in a SQL database where the value is a serialized Python object.

As a Web developer it's always a good idea to know what the current practices are in the industry even if they seem a bit too crazy to adopt…yet.

Further Reading

Note Now Playing: Jay-Z - Run This Town (feat. Rihanna & Kanye West) Note


    Categories: Web Development