Originally published on the Crittercism Engineering Blog and reprinted with permission.

by Kirk Strauser on April 8, 2014

CAP theorem hates you and wants you to be unhappy

Some guy who isn’t fun at parties came up with the CAP theorem, which basically says it’s impossible to be consistent and available at the same time. In short, things will break and clients will lose access to a storage backend, or units in a storage cluster will lose the ability to talk to their peers. Maybe those servers are crashed. Even worse, maybe they’re up and running but a network outage means they can’t reach each other, and each part is still accepting writes from clients. Our enemy, CAP theorem, says we have to choose between:

  • Keeping our data consistent, at the price of not being able to make progress when parts of the database cluster are unavailable.
  • Keeping the database cluster available, at the price of some parts of it being out of sync with others and resolving any conflicts later.

Consistency brings pain

In any case, we have to decide what happens when we want to write to a record. Let’s assume for demonstration sake that a record is a bag of opaque data that the backing store doesn’t really understand; imagine a JSON blob, or a bit of XML, or whatever other datatype your favorite database doesn’t natively support.

Let’s also assume we have a consistent database. Either it’s a single server that’s running or not running, or it’s a cluster that only accepts requests if all nodes are online and synchronized. In short, we can always trust our database to Do The Right Thing.

Here’s how consistent workflows evolve from the most innocent of intentions.

First attempt: blind writes

We want to write a record, so we write it! Easy-peasy.

  1. Write out an entire record
  2. Profit

Second attempt: read-update-write

Ouch! Two requests want to update the same record. Both of them write out its entire contents, but only the last one wins.

  1. Request A writes out {"foo": "bar"}
  2. Request B writes out {"baz": "qux"}
  3. Request A cries salty tears
  4. Request B gloats and gets punched

That’s not good. The answer, then, is surely to read what’s there, update it, and write the results back out:

  1. Request A fetches the record with its initial value of {}
  2. Request A updates the record to {"foo": "bar"}
  3. Request A writes the record with the its new value
  4. Request B fetches the record with A’s value of {"foo": "bar"}
  5. Request B updates the record to {"foo": "bar", "baz": "qux"}
  6. Request B writes the record with the combined value

They shake hands and go home. And at 2AM, the Ops pager goes off because every write requires a read to get the pre-existing value. But let’s pretend IO is free and infinite. This algorithm is chock-full of race conditions. At our scale, here’s what’s going to happen many times per second:

  1. Request A fetches the record with its initial value of {}
  2. Request B fetches the record with its initial value of {}
  3. Request A updates the record to {"foo": "bar"}
  4. Request B updates the record to {"baz": "qux"}
  5. Request A writes the record with only its new value
  6. Request B writes the record with only its new value, overwriting A’s

And now we’re right back where we started.

Third attempt: locks everywhere!

Looks like we’ll need to lock each record before updating it so that only one request can mutate it at a time. We care about uptime so we have a highly available distributed locking system (ZooKeeper, Redis, a highly motivated Amazon Mechanical Turk, etc.). Now our workflow looks like:

  1. Request A acquires a lock on the record
  2. Request B attempts to acquire the same lock, but fails
  3. Request A fetches the record with its initial value of {}
  4. Request A updates the record to {"foo": "bar"}
  5. Request A writes the record with only its new value
  6. Request A releases the lock
  7. Request B attempts to acquire the same lock, and succeeds this time
  8. Request B fetches the record with A’s value of {"foo": "bar"}
  9. Request B updates the record to {"foo": "bar", "baz": "qux"}
  10. Request B writes the record with the combined value
  11. Request B releases the lock

That actually worked! Of course, it took two reads and two writes of the database and fives calls to the lock manager and Ops wants to set fire to your cubicle because their call duty phone won’t stop buzzing.

But let’s assume that we have a free and infinite lock manager. What happens if Request A never completes the transaction and releases its lock, maybe because of network problems, or the node it was on died, or it couldn’t write to the database, or [insert your own pet scenario here]. Now we can’t make progress on B’s request until the lock expires, or until we break the lock and potentially overwrite A’s updates. For all our efforts, we’re still not in a much better place than we started.

Side note about the locking manager

Any distributed lock manager has to solve all of the same problems we’re listing here. Even if we use this pattern, we haven’t made the root problem go away: we’ve just shifted it to another piece of software. The CAP theorem means that a lock manager optimized for consistency has to sacrifice availability, so in the event of a network outage or failed locking manager node we still can’t get any work done.

But eventual consistency brings joy and unicorns!

Consistency is critical for many reasons. I’d much rather queries to my bank be slow or unavailable than incorrect! There are times and places when we want the properties that consistency buys, regardless of its price.

But there aren’t many of them at our scale.

What we want is eventual consistency, or a promise that the database will make its best effort to make its records return their current values. This pattern extends to both the database we use to store our records, to the way in which we generate and process those records.

Solution: journaled updates

Instead of treating our record like an atomic chunk of data, we’ll treat it like a list of atomic chunks of data representing updates that clients want to make.

  1. Request A appends its update of {"foo": "bar"} to whatever happens to already be in the record
  2. Request B appends its update of {"baz": "qux"} to the record
  3. Much later, if ever, Request C fetches all the values from the record and combines them into a final data structure. In pseudocode:
def materialize(query):
    result = dict()
    for key, value in query.records():
        result[key] = value
    return result

In our example, that would fetch the list of updates [{"foo": "bar"}, {"baz": "qux"}] and combine them into a single record like {"foo": "bar", "baz": "qux"}. This is a very fast operation for any sane amount of updates.

Our primary usage pattern is “write many, read rarely”. Most of the events recorded by our system will never be viewed individually, but might be used to calculate trends. This solution allows us to trade a small (but fast and easy) bit of post-processing for a huge bit of not having to worry about requests clobbering each other, locking semantics, or mixing reads with writes.

Ordering is still hard

This solution isn’t magic, though. It doesn’t define how to reconcile conflicts between updates, and we still have to make those decisions.

Time and time again

The simplest method is to store each record’s timestamp and replay them in order. However, it’s impossible to guarantee ordering between timestamps generated across more than one host. Two database servers might be off from each other by a few seconds, and NTP only seems like a solution until you’ve actually tried to count on it. The one situation where this is feasible is when requests to update a given record are all generated by the same client. In this case, we can use the client-generated timestamp to reasonably represent the correct ordering of updates.

Understand our data

Another approach is to make smart decisions about the data we’re storing. Suppose a certain key, foo, may only ever increase. Given a set of updates like [{"foo": 23, "foo": 42, "foo": 17}], the correct resolution would be {"foo": 42}. This requires an understanding of the data, though, and isn’t something we’d want to pursue for customer-generated inputs.


Math says you can’t have both consistency and availability. At our scale, availability wins the argument and eventual consistency carries the day.