In the past few months I've noticed an increased number of posts questioning practices around deleting and "virtually" deleting data from databases. Since some of the concerns around this practice have to do with the impact of soft deletes on scalability of a database-based application, I thought it would be a good topic for my ongoing series on building scalable databases.

Soft Deletes 101: What is a soft delete and how does it differ from a hard delete?

Soft deleting an item from a database means that the row or entity is marked as deleted but not physically removed from the database. Instead it is hidden from normal users of the system but may be accessible by database or system administrators.

For example, let's consider this sample database of Xbox 360 games I own:

Name                           | Category                 | ESRB   | GamespotScore | Company
Call of Duty: Modern Warfare 2 | First Person Shooter     | Mature | 9.0           | Infinity Ward
Batman: Arkham Asylum          | Fantasy Action Adventure | Teen   | 9.0           | Rocksteady Studios
Gears of War 2                 | Sci-Fi Shooter           | Mature | 9.0           | Epic Games
Call of Duty 4: Modern Warfare | First Person Shooter     | Mature | 9.0           | Infinity Ward
Soul Calibur IV                | 3D Fighting              | Teen   | 8.5           | Namco

Now consider what happens if I decide that I'm done with Call of Duty 4: Modern Warfare now that I own Call of Duty: Modern Warfare 2. The expected thing to do would then be to remove the entry from my database using a query such as

DELETE FROM games WHERE name='Call of Duty 4: Modern Warfare';

This is what is considered a "hard" delete.

But then what happens if my friends decide to use my list of games to decide which games to get me for Christmas? A friend might not realize I'd previously owned the game and might get it for me again. Thus it might be preferable if instead of deleting items from the database they were removed from consideration as games I currently own but still could be retrieved in special situations. To address this scenario I'd add an IsDeleted column as shown below

Name                           | Category                 | ESRB   | GamespotScore | Company            | IsDeleted
Call of Duty: Modern Warfare 2 | First Person Shooter     | Mature | 9.0           | Infinity Ward      | False
Batman: Arkham Asylum          | Fantasy Action Adventure | Teen   | 9.0           | Rocksteady Studios | False
Gears of War 2                 | Sci-Fi Shooter           | Mature | 9.0           | Epic Games         | False
Call of Duty 4: Modern Warfare | First Person Shooter     | Mature | 9.0           | Infinity Ward      | True
Soul Calibur IV                | 3D Fighting              | Teen   | 8.5           | Namco              | False

Then for typical uses an application would interact with the following view of the underlying table

CREATE VIEW current_games AS
SELECT Name, Category, ESRB, GamespotScore, Company FROM games WHERE IsDeleted=False;

but when my friends ask me for a list of all of the games I have, I can provide the full list of all the games I've ever owned from the original games table if needed. Now that we understand how one would use soft deletes we can discuss the arguments against this practice.
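To make the example concrete, here is a minimal sketch of the same table and view using Python's built-in sqlite3 module. The helper names build_games_db and soft_delete are my own for illustration, and since SQLite has no native boolean type the IsDeleted flag is stored as 0 or 1.

```python
import sqlite3


def build_games_db():
    """Create an in-memory games table with an IsDeleted flag and a view over it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE games (
            Name TEXT PRIMARY KEY,
            Category TEXT,
            ESRB TEXT,
            GamespotScore REAL,
            Company TEXT,
            IsDeleted INTEGER NOT NULL DEFAULT 0  -- 0 = False, 1 = True
        );
        CREATE VIEW current_games AS
            SELECT Name, Category, ESRB, GamespotScore, Company
            FROM games WHERE IsDeleted = 0;
    """)
    conn.executemany(
        "INSERT INTO games (Name, Category, ESRB, GamespotScore, Company) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("Call of Duty: Modern Warfare 2", "First Person Shooter", "Mature", 9.0, "Infinity Ward"),
            ("Batman: Arkham Asylum", "Fantasy Action Adventure", "Teen", 9.0, "Rocksteady Studios"),
            ("Gears of War 2", "Sci-Fi Shooter", "Mature", 9.0, "Epic Games"),
            ("Call of Duty 4: Modern Warfare", "First Person Shooter", "Mature", 9.0, "Infinity Ward"),
            ("Soul Calibur IV", "3D Fighting", "Teen", 8.5, "Namco"),
        ],
    )
    return conn


def soft_delete(conn, name):
    """Mark a game as deleted instead of removing its row."""
    conn.execute("UPDATE games SET IsDeleted = 1 WHERE Name = ?", (name,))


conn = build_games_db()
soft_delete(conn, "Call of Duty 4: Modern Warfare")
current = [row[0] for row in conn.execute("SELECT Name FROM current_games")]
ever_owned = [row[0] for row in conn.execute("SELECT Name FROM games")]
print(len(current), len(ever_owned))  # prints: 4 5
```

The application only ever selects from current_games, while the full history stays available in games.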

Rationale for War: The argument against soft deletes

Ayende Rahien makes a cogent argument against soft deletes in his post Avoid Soft Deletes where he writes

One of the annoyances that we have to deal when building enterprise applications is the requirement that no data shall be lost. The usual response to that is to introduce a WasDeleted or an IsActive column in the database and implement deletes as an update that would set that flag.

Simple, easy to understand, quick to implement and explain.

It is also, quite often, wrong.

The problem is that deletion of a row or an entity is rarely a simple event. It effect not only the data in the model, but also the shape of the model. That is why we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have a parent Order. And that is just the simplest of issues.
...
Let us say that we want to delete an order. What should we do? That is a business decision, actually. But it is one that is enforced by the DB itself, keeping the data integrity.

When we are dealing with soft deletes, it is easy to get into situations where we have, for all intents and purposes, corrupt data, because Customer’s LastOrder (which is just a tiny optimization that no one thought about) now points to a soft deleted order.

Ayende is right that adding an IsDeleted flag means that you can no longer take advantage of database triggers to clean up database state when a deletion occurs. This sort of cleanup now has to be moved up into the application layer.
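The LastOrder scenario Ayende describes can be reproduced in a few lines. This is only a sketch using SQLite through Python's sqlite3 module; the orders and customers tables and the helper names are hypothetical stand-ins for his example. A hard delete of a referenced order is rejected by the foreign key, while the equivalent soft delete goes through and leaves the customer pointing at a deleted order.

```python
import sqlite3


def build_orders_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            IsDeleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY,
            last_order_id INTEGER REFERENCES orders(id)
        );
        INSERT INTO orders (id) VALUES (1);
        INSERT INTO customers (id, last_order_id) VALUES (1, 1);
    """)
    return conn


def hard_delete_is_rejected(conn, order_id):
    """Return True if the database refuses to hard delete the order."""
    try:
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
    except sqlite3.IntegrityError:
        return True
    return False


def soft_delete_order(conn, order_id):
    conn.execute("UPDATE orders SET IsDeleted = 1 WHERE id = ?", (order_id,))


def customers_with_deleted_last_order(conn):
    """Find customers whose LastOrder now points at a soft deleted order."""
    return conn.execute("""
        SELECT c.id FROM customers c
        JOIN orders o ON o.id = c.last_order_id
        WHERE o.IsDeleted = 1
    """).fetchall()
```

With the soft delete, nothing in the database complains; the application has to run a check like customers_with_deleted_last_order itself.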

There is another set of arguments against soft deletes in Richard Dingwall's post entitled The Trouble with Soft Delete where he points out the following problems

Complexity

To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude them. It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows.

This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast. And it is background noise that detracts away from the real intention of the query.

Performance

At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance. However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and does a good job at optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself.

Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisations day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of that, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical; rarely used and kept only to maintain foreign key integrity and for reporting purposes.

Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.

These arguments seem less valid than Ayende's, especially when the proposed alternatives are evaluated. Let's look at the aforementioned problems and the proposed alternatives in turn.

Trading the devil you know for the devil you don't: Thoughts on the alternatives to soft deletes

Richard Dingwall argues that soft deletes add unnecessary complexity to the system since all queries have to be aware of the IsDeleted column(s) in the database. As I mentioned in my initial description of soft deletes, this definitely does not have to be the case. The database administrator can create views which the core application logic interacts with (i.e. the current_games view in my example) so that only a small subset of system procedures need to know that the soft delete columns even exist in the database.

A database becoming so large that data manipulation becomes slow due to having to update indexes is a valid problem. However Richard Dingwall's suggested alternative excerpted below seems to trade one problem for a worse one

The memento pattern

Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require.

It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or rollback later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time:

The problem I have with this solution is that if your database is already grinding to a halt simply because you track which items are active/inactive in your database, how much worse would the situation be if you now store every state transition in the database as well? Sounds like you're trading one performance problem for a much worse one.

The real problem seems to be that the database has gotten too big to be operated on in an efficient manner on a single machine. The best way to address this is to partition or shard the database. In fact, you could even choose to store all inactive records on one database server and all active records on another. Those interested in database sharding can take a look at a more detailed discussion on database sharding I wrote earlier this year.
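As a rough illustration of keeping active and inactive records in separate stores, the sketch below moves soft deleted rows from one SQLite database to another. The tasks table, the two in-memory connections standing in for two servers, and the function names are all assumptions made for the example. The move is not atomic across the two stores, which a real system would have to account for.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE tasks ("
    "id INTEGER PRIMARY KEY, title TEXT, IsDeleted INTEGER NOT NULL DEFAULT 0)"
)


def open_store():
    """Open an empty store; each call stands in for a separate database server."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def archive_inactive(active, archive):
    """Copy soft deleted rows to the archive store, then drop them from the active one."""
    rows = active.execute(
        "SELECT id, title, IsDeleted FROM tasks WHERE IsDeleted = 1"
    ).fetchall()
    archive.executemany(
        "INSERT INTO tasks (id, title, IsDeleted) VALUES (?, ?, ?)", rows
    )
    archive.commit()
    # Remove exactly the rows that were copied, and only after the archive commit.
    active.executemany("DELETE FROM tasks WHERE id = ?", [(r[0],) for r in rows])
    active.commit()
    return len(rows)
```

After a run, the heavily indexed active store holds only the rows the application still cares about, and reporting queries can go to the archive.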

Another alternative proposed by both Ayende Rahien and Richard Dingwall is to hard delete the data but use database triggers to write the deleted rows to an audit log. This works when auditing is the only reason for keeping soft deleted entries in the database. However, there are many real world situations where this is not the case.
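For reference, the trigger-based audit log might look something like the following sketch in SQLite, driven from Python's sqlite3 module. The games_audit table, the trigger name and the trimmed-down column list are assumptions for illustration, not something taken from either post.

```python
import sqlite3


def build_audited_db():
    """Create a games table whose hard deletes are copied into an audit table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE games (Name TEXT PRIMARY KEY, Company TEXT);
        CREATE TABLE games_audit (
            Name TEXT,
            Company TEXT,
            DeletedAt TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Every hard delete leaves a copy of the old row in the audit table.
        CREATE TRIGGER games_after_delete AFTER DELETE ON games
        BEGIN
            INSERT INTO games_audit (Name, Company) VALUES (OLD.Name, OLD.Company);
        END;
        INSERT INTO games (Name, Company) VALUES ('Soul Calibur IV', 'Namco');
    """)
    return conn


def hard_delete(conn, name):
    conn.execute("DELETE FROM games WHERE Name = ?", (name,))
```

The row really is gone from games afterwards; getting it back means copying it out of games_audit by hand, which is why this fits auditing better than undo.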

One use case for soft deleting is to provide an "undo" feature in an end user application. For example, consider a user who synchronizes the contact list on their phone with one in the cloud (e.g. an iPhone or Windows Mobile/Windows Phone connecting to Exchange, or an Android phone connecting to Gmail). Imagine that the user now deletes a contact from their phone because they do not have a phone number for the person, only to find out that the person has also been deleted from their address book in the cloud. At that point, an undo feature is desirable.

Another use case is the need to reactivate items that have been removed from the database, with their state intact. For example, when people who used to work at Microsoft return to the company, their seniority for certain perks takes into account their previous stints there. Similarly, you can imagine a company restocking an item that it had pulled from its shelves because the item has become popular again due to some new fad (e.g. Beatles memorabilia is back in style thanks to The Beatles™: Rock Band™).
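With a soft delete flag, both undo and reactivation come down to flipping the flag back, and the row returns with all of its other columns untouched. A small sketch, assuming a hypothetical contacts table and made-up sample data:

```python
import sqlite3


def build_contacts_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE contacts (
            id INTEGER PRIMARY KEY,
            name TEXT,
            email TEXT,
            IsDeleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE VIEW current_contacts AS
            SELECT id, name, email FROM contacts WHERE IsDeleted = 0;
        INSERT INTO contacts (id, name, email)
            VALUES (1, 'Example Contact', 'contact@example.com');
    """)
    return conn


def delete_contact(conn, contact_id):
    """Soft delete: hide the contact from current_contacts."""
    conn.execute("UPDATE contacts SET IsDeleted = 1 WHERE id = ?", (contact_id,))


def undo_delete(conn, contact_id):
    """Undo: the row comes back exactly as it was before the delete."""
    conn.execute("UPDATE contacts SET IsDeleted = 0 WHERE id = ?", (contact_id,))
```

An audit log can support the same feature, but only by re-inserting the saved row and repairing anything that used to reference it.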

The bottom line is that an audit log may be a useful replacement for soft deletes in some scenarios but it isn't the answer to every situation where soft deletes are typically used.

Not so fast: The argument against hard deletes

So far we haven't discussed how hard deletes should fit in a world of soft deletes. In some cases, soft deletes eventually lead to hard deletes. In the example of video games I've owned I might decide that if a soft deleted item is several years old or is a game from an outdated console then it might be OK to delete. So I'd create a janitor process that would scan the database periodically to seek out soft deleted entries to permanently delete. In other cases, some content may always be hard deleted since there are no situations where one might consider keeping them around for posterity. An example of the latter is comment or trackback spam on a blog post.
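A janitor process of this kind could be as simple as the sketch below. It assumes a DeletedAt column that is not part of my earlier table, holding a UTC timestamp in ISO 8601 text form written when the row was soft deleted, and the age threshold is whatever the caller picks.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games ("
        "Name TEXT PRIMARY KEY, "
        "IsDeleted INTEGER NOT NULL DEFAULT 0, "
        "DeletedAt TEXT)"  # hypothetical column: when the row was soft deleted
    )
    return conn


def purge_old_soft_deletes(conn, max_age_days, now=None):
    """Hard delete rows that were soft deleted more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat(timespec="seconds")
    # ISO 8601 strings in the same format and timezone sort chronologically.
    cur = conn.execute(
        "DELETE FROM games WHERE IsDeleted = 1 AND DeletedAt < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

Run on a schedule, this keeps recently deleted rows available for undo while stopping the table from growing forever.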

Udi Dahan wrote a rebuttal to Ayende Rahien's post, titled Don’t Delete – Just Don’t, which calls into question my assertion above that there are situations where one wants to hard delete data from the database. He writes

Model the task, not the data

Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).

As we broaden our perspective to more parts of the system, we see this same pattern repeating:

Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.

Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.

Jobs aren’t deleted – they’re filled (or their requisition is revoked).

In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.

Statuses

In all the examples above, what we see is a replacement of the technical action ‘delete’ with a relevant business action. At the entity level, instead of having a (hidden) technical WasDeleted status, we see an explicit business status that users need to be aware of.

I tend to agree with Udi Dahan's recommendation. Instead of a technical flag like IsDeleted, we should model the business process. So my database table of games should really be called games_I_have_owned, with the IsDeleted column replaced with something more appropriate such as CurrentlyOwn. This is a much better model of the real-life situation than my initial table, and the soft deleted entries are now clearly part of the business process as opposed to being part of some internal bookkeeping system.
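Here is a short sketch of what that remodeled table could look like, again in SQLite via Python's sqlite3 module. The function names are hypothetical and the columns are trimmed to the two that matter for the point.

```python
import sqlite3


def build_collection():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE games_I_have_owned (
            Name TEXT PRIMARY KEY,
            CurrentlyOwn INTEGER NOT NULL DEFAULT 1  -- business status, not a hidden flag
        );
        INSERT INTO games_I_have_owned (Name) VALUES
            ('Call of Duty: Modern Warfare 2'),
            ('Call of Duty 4: Modern Warfare');
    """)
    return conn


def stop_owning(conn, name):
    """The business action: I no longer own this game, but I did once."""
    conn.execute(
        "UPDATE games_I_have_owned SET CurrentlyOwn = 0 WHERE Name = ?", (name,)
    )


def games_on_my_shelf(conn):
    return [r[0] for r in conn.execute(
        "SELECT Name FROM games_I_have_owned WHERE CurrentlyOwn = 1 ORDER BY Name"
    )]


def games_not_to_gift(conn):
    """Everything I own or have owned, so friends do not buy a repeat."""
    return [r[0] for r in conn.execute(
        "SELECT Name FROM games_I_have_owned ORDER BY Name"
    )]
```

Nothing is hidden here: both queries are ordinary application queries against a column whose meaning any user of the system would recognize.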

Advocating that items never be deleted is a tad extreme, but I'd actually lean closer to that extreme than most. Unless the data is clearly worthless (e.g. comment spam) or the cost of keeping it is truly prohibitive (e.g. you're storing large amounts of binary data), I'd recommend keeping the information around. The fact that your database has a DELETE statement does not mean you are required to use it.

Now Playing: 50 Cent - Baby By Me (feat. Ne-Yo)


 

Monday, 23 November 2009 15:34:56 (GMT Standard Time, UTC+00:00)
Udi's advice is good.
I would add that you often have several contexts in your application.
Your main context should never delete data. Orders should be cancelled, etc.
But in another context, this cancelled order is not important anymore, so you can delete it.
So your main context emits a ThisGameIsToOldAndIDontCareAnyMore event, and the other context deletes the game from its local representation.
The first context simply registers the change. A context whose goal is to show only old games would add it to its local representation.

Why have only one model when there are several contexts ?
Tuesday, 24 November 2009 15:53:34 (GMT Standard Time, UTC+00:00)
Very nice, although I'd argue that comment spam is potentially useful information - it can help identify further comment spam by similarities ;-)

Perhaps it should be removed from the main data store & put somewhere out of the way
imma