Thursday, 28 January 2016

Time in distributed systems

Time in a program that runs on a single machine and uses local physical clock is already a tricky concept. Just think about time zones, daylight saving and the fact they can change at any time because they are governed by politicians. Distributed programs add one more dimension to this problem space because there isn’t a single clock that decides what time it is.

The goal of this post is to create a short summary of some of the academic papers written on this topic and in the process to better understand the challenges faced by the builders of distributed systems. I hope I won’t be the only person that will benefit from it :).

As end users we take the notion of time for granted and rely on our physical clocks to determine whether event A happened before event B. This works great most of the time because in our day-to-day lives we don’t really need high precision time readings. The fact that my watch is running a minute or two late comparing to some other watch doesn’t really matter all that much. Computer systems are different because they operate at much higher speed and even small variations can affect the after-before relationship. The clocks that are used in computers nowadays tend to drift enough that distributed systems can’t really rely on them. At least not as the only source of truth when it comes to time.

To solve this problem Lasslie Lamport introduced an idea of virtual time that can be implemented using logical clocks (Time, Clocks, and the Ordering of Events in a Distributed System). His algorithm implements an invariant called Clock Condition: if event A happened before event B then the value of logical clock when A happened is lower than the value of logical clock when B happened, in short If A –> B then C(A) < C(B). This invariant provides one of many possible, consistent partial orderings of events in a distributed system. The drawback of the ordering established based on Clock Condition is that C(A) < C(B) doesn’t mean A happened before B. Moreover, as far the end user is concerned B might even have happened before A or might have happened concurrently with A.

Vector Clocks, introduced by Colin J. Fidge (Timestamps in Message-Passing Systems That Preserve the Partial Ordering), fill the shortcoming mentioned above by providing an invariant that is a reverse of Clock Condition: if C(A) < C(B) then A –> B. Vector Clock is built on top of logical clock designed by Lasslie Lamport. The main difference is that instead of using just one logical clock for the whole system it uses one logical clock per process in the system. In this way it tries to capture distributed state of time in the system. Vector Clock stores a list of values so the original invariant should be rewritten to if VC(A) < VC(B) then A –> B where VC(A)[i] <= VC(B)[i] and j exists that VC(A)[j] < VC(B)[j]. E.g. [1,2,3] < [1,2,4]. If neither A->B nor B->A (eg. VC(A) = [1,2,1] and VC(B) =  [2,1,2]) then A and B are considered to be concurrent events.

Even with Vector Clocks we can end up with concurrent events that from the end user perspective are not concurrent. To mitigate this problem we can make a full circle and employ help from physical clocks or we can simply let the user order the events for us. This is how Riak deals with write conflicts. It accepts concurrent writes to the same key but then the conflict needs to be resolved at the read time.

The first step towards fixing a problem is to realize that there is a problem and this is what Vector Clocks provide because they can detect conflicts. If you don’t think this is a big deal then follow Kyle Kingsbury’s Jepsen series on distributed systems correctness to see how many mainstream (database) systems can lose data and not know about it.