Friday, June 3, 2011

Concurrent Design as a Matter of Cause

I've written earlier about designing for concurrency in the small.
But even the best and most flawless bricks can be put together in far more meaningless than meaningful ways.
In the following, I'll consider concurrent design on a larger scale, and introduce a tool which may be useful to ensure soundness in a design.

What is concurrent design about? (Performance aside — we'll focus on soundness for the moment.) What are the basic units for reasoning about concurrent, possibly distributed systems?
Mutexes, messages, transactions — these are among the building blocks. But the connecting mortar is causality chains. When a database client, for instance, submits data to a database server, it can rely on the data to be persisted only if there is a causality chain leading from the moment the client receives the "success"-response, back to when the database sent it, and further back to when the database server's hard drive physically wrote the last block of the transaction.
If there is no causality chain, then no timing assumptions can be made. Which is why, in a given concurrent design, it is prudent to ensure that the necessary causality chains are present.

Example: the "update take-over" pitfall

This is a concurrency design pitfall which I learned about a few years ago. I'd forgotten all about its non-obviousness, until it came up recently in a design discussion. This incident suggested to me that the problem might be sufficiently non-trivial for there to be a lesson to pass on. And I will do that — describe the problem, the naïve-but-wrong solution, and some correct solutions — but do so a bit more elaborately than I originally planned, because the focus will not be on the solutions themselves, but rather on the process: a method to arrive at them.

The scenario: A client (C) must keep track of some object's state. It does so by subscribing to changes (updates). However, there are two update sources, and it is desirable to change over from one (A) to the other (B) from some point on. That is, before the take-over, A provides all updates; after the take-over, B provides all the updates.
A common variation of the pattern is that A provides snapshots of the object's state through a synchronous request-response protocol, rather than providing updates through a subscribe-publish mechanism.
For simplicity, this variation is the scenario I'll be focusing on in the following; it is depicted as UML to the right.

The naïve solution: Simply set up the subscription to B before getting the snapshot from A.

The threat: An update is missed because it "falls into the crack" caused by the take-over — it is processed by A too late to be part of the snapshot, but is processed by B too soon, before the subscription is set up.
The core of the issue is that when there are two independent message paths, we cannot assume anything about their relative timing; even though they have a common source, events in one path may overtake events in the other path. 
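To make the threat concrete, here is a small deterministic replay of one bad interleaving (the event names and the run helper are mine, purely for illustration):

```python
# A minimal, deterministic simulation of the "falls into the crack" race.
# All names here are illustrative, not from any real API.

def run(interleaving):
    """Replay a fixed interleaving of events; return the updates the client sees."""
    a_state = []          # updates A has folded into its snapshot state
    b_subscribed = False
    seen = []             # what the client ends up knowing
    for event, arg in interleaving:
        if event == "update_at_A":
            a_state.append(arg)
        elif event == "update_at_B":
            if b_subscribed:
                seen.append(arg)      # forwarded to the client
        elif event == "snapshot":
            seen.extend(a_state)      # client receives A's snapshot
        elif event == "subscribe":
            b_subscribed = True
    return sorted(seen)

# Naïve order: subscribe to B first, then snapshot from A -- but update 2
# reaches B *before* the subscription and A *after* the snapshot:
bad = [("update_at_A", 1), ("update_at_B", 1),
       ("update_at_B", 2),            # too early for the subscription
       ("subscribe", None), ("snapshot", None),
       ("update_at_A", 2)]            # too late for the snapshot
print(run(bad))   # update 2 is lost: [1]
```

Even though the client "did everything first", update 2 slips through: nothing constrains the two paths' relative timing.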

Analysis:
What does it take to get this setup to work as expected?
Any update must reach the client, either through A (the snapshot) or B (the following updates).
Or, from a causality view:
There must be a causality chain ruling out "update is processed by A after the snapshot is made, but is processed by B before the subscription is set up".
A causality chain is a series of causality links, which are defined as follows:

Causality rules

  1. There is a causality link from X to Y if X happens before Y and they both happen in the same thread.
  2. There is a causality link from the sending of a message to its reception.
  3. Two message receptions may be causally linked if the transport layer guarantees that one is delivered before the other.
    For instance, TCP guarantees that messages within the same connection are delivered in order, i.e. that if X is sent before Y (and in the same direction), then X is delivered/received before Y.
    Similarly, in Erlang, message delivery order is guaranteed for any sender/receiver pair.
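These rules can be sketched as edges in a directed graph, with "there is a causality chain" as plain reachability. Here is a minimal Python sketch; the event names, including those of the database example above, are made up for illustration:

```python
# Encoding the three causality rules as edges in a directed graph,
# with reachability as "there is a causality chain".
from collections import defaultdict

edges = defaultdict(set)

def same_thread(x, y):      # rule 1: program order within one thread
    edges[x].add(y)

def message(send, recv):    # rule 2: a send is linked to its reception
    edges[send].add(recv)

def fifo(recv1, recv2):     # rule 3: transport-ordered receptions (e.g. TCP)
    edges[recv1].add(recv2)

def chain(x, y, seen=None): # is there a causality chain from x to y?
    seen = seen or set()
    if x == y:
        return True
    seen.add(x)
    return any(chain(n, y, seen) for n in edges[x] - seen)

# The database example: client sends a commit, server fsyncs, then replies.
same_thread("recv_commit", "fsync")
same_thread("fsync", "send_ok")
message("send_commit", "recv_commit")
message("send_ok", "recv_ok")

print(chain("send_commit", "recv_ok"))  # True: the reply implies persistence
print(chain("recv_ok", "fsync"))        # False: no chain runs backwards
```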

Exercise/Kata: I think this problem makes a rather nice kata in concurrent design.
If you'd like to try it yourself, do so before reading on.
How many different solutions do you see?

Enter the loop

Here is the trick: Causality loops are impossible. You can't enter a causality loop; they correspond to paradoxes. (Proof by paradox is an ancient technique, also known as proof by contradiction.)
Thus, one way to attack the problem is to reformulate the requirement as follows:
If update is processed by A after the snapshot is made, and the same update is processed by B before the subscription is set up, then there is a causality loop.
The loop in question is absent, however, if either of the two sub-conditions is absent.
The situation can be represented graphically:

This is quite a bit simpler than the UML diagram above. But actually it captures the essence of the problem.
And it is useful, because it leads to a further refinement of the problem statement: How can we introduce a loop which is absent if we reverse either of the two arrows?
We will, however, also need the context in order to translate our findings back into concrete solutions. The context looks like this:

Getting solutions through loops

That question is not too hard: simply add A2→B1 and B2→A1, to get a four-edge loop which disappears if any edge is reversed or removed.
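That claim is easy to machine-check with a few lines of throwaway cycle detection (representing edges as pairs is just a convenience of the sketch):

```python
# A quick check of the claim: A1→A2→B1→B2→A1 is a loop, and removing or
# reversing any single edge breaks it. Pure-Python cycle detection sketch.

def has_cycle(edges):
    def reach(x, y, seen=frozenset()):
        return x == y or any(reach(n, y, seen | {x})
                             for a, n in edges if a == x and n not in seen)
    return any(reach(b, a) for a, b in edges)

loop = [("A1", "A2"), ("B1", "B2"), ("A2", "B1"), ("B2", "A1")]
print(has_cycle(loop))                                   # True
for e in loop:
    without = [x for x in loop if x != e]
    print(has_cycle(without),                            # False: edge removed
          has_cycle(without + [(e[1], e[0])]))           # False: edge reversed
```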
Translating backwards, what does this mean in terms of the original problem?

Solution 1:
The client subscribes synchronously to B before getting the snapshot from A. The event origin publishes updates synchronously to A before it sends them to B.
Now, the first part of that solution is reasonable, but the second part is often not directly achievable — the event origin may just have a generic publish-subscribe scheme, and be oblivious to the difference between the nature of A and B.
But there is another interpretation of the A2→B1 edge.
It can be realized by linking A2 or anything "downstream" (causally later) from A2 to B1 or anything "upstream" (causally earlier) from B1.
By using a direct A2→B1 link, we obtain:

Solution 2: The client subscribes synchronously to B before getting the snapshot from A. When A has processed an update, it sends it on to B (which gets the updates by this route, rather than directly from the event source).
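A deterministic sketch of Solution 2, with stand-in classes of my own invention: because updates reach B only via A, an update processed by A after the snapshot cannot possibly have reached B before the subscription.

```python
# Solution 2 sketched: updates flow source → A → B; the client subscribes
# to B first, then fetches the snapshot from A. All names are illustrative.

class B:
    def __init__(self): self.subscriber = None
    def subscribe(self, deliver): self.subscriber = deliver
    def on_update(self, u):
        if self.subscriber: self.subscriber(u)

class A:
    def __init__(self, b): self.state, self.b = [], b
    def process_update(self, u):
        self.state.append(u)      # fold into the snapshot state first...
        self.b.on_update(u)       # ...then forward to B
    def get_snapshot(self): return list(self.state)

b = B(); a = A(b)
a.process_update(1)               # happens before the client shows up
seen = []
b.subscribe(seen.append)          # client: subscribe to B first
snapshot = a.get_snapshot()       # ...then fetch the snapshot from A
a.process_update(2)               # after the snapshot: delivered via B
print(snapshot, seen)   # [1] [2] -- every update arrives exactly once
```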

Likewise, the B2→A1 edge can be realized by a direct link. This results in another two solutions:

Solution 3: The client does not contact A directly in order to get a snapshot.  Instead, the 'subscribe' operation provided by B both subscribes and obtains a snapshot from A.

Solution 4: Like in #2, updates are sent only to A, which sends them on to B; like in #3, the client contacts only B, which both registers a subscription and requests a snapshot for the client from A.

Merging nodes

Revisiting Solution 2, we find that, depending on the situation, this solution may render B somewhat moot. It may make sense to eliminate it altogether, which is yet another interpretation: A2 occurs in the same thread as B1, and B2 occurs in the same thread as A1.
In terms of causality graphs, this is a correct solution because merging A1 with B2 and B1 with A2 results in a two-edge loop:

Solution 5: The snapshot provider also plays the role of update publisher. This combined service provides a "get snapshot and subscribe for updates" operation.
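Solution 5 might be sketched like this, assuming a single lock guarding both the state and the subscriber list (the class and method names are mine):

```python
# Solution 5 sketched: one service owns both the state and the subscriber
# list, so "get snapshot and subscribe" is atomic with respect to updates.
import threading

class SnapshotPublisher:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = []          # all updates folded in so far
        self._subscribers = []

    def process_update(self, update):
        with self._lock:          # the update is either in the snapshot...
            self._state.append(update)
            for deliver in self._subscribers:
                deliver(update)   # ...or delivered to the subscriber

    def snapshot_and_subscribe(self, deliver):
        with self._lock:          # atomic: nothing falls in between
            self._subscribers.append(deliver)
            return list(self._state)

svc = SnapshotPublisher()
svc.process_update(1)
seen = []
snapshot = svc.snapshot_and_subscribe(seen.append)
svc.process_update(2)
print(snapshot, seen)   # [1] [2] -- together, every update exactly once
```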

In general, merging existing nodes may provide additional ways to obtain causality loops.

Solutions involving conditional edges

We now have usable solutions, but our options haven't been exhausted yet.
There are other ways of causing loops in the graph — but in order to ensure that the loop is only present when it should be, we will need to introduce conditional edges — which are only actually there in certain circumstances.

For instance, consider the loop caused by adding the edge A2→A1. In order to ensure that this loop is absent if the B1→B2 edge is reversed, we will need to put a condition on the new edge saying "only if the update at A2 has already happened at B before the subscription at B2." This means that information from the subscription operation is required to evaluate the condition, so we'll need a B2→A2 edge as well:
Also, we will need the updates to be identifiable in some way, by timestamp, serial number or similar; let's assume serial numbers.

Just as Solutions 1 and 2 differed in how the A2→B1 edge was implemented — whether it was present as a link directly from A to B, or from A via the event source to B — so there are at least two solutions corresponding to this graph, because an A2→A1 edge can be realized both directly and as A2→C1 (because C1 is upstream of A1).
The A2→C1 edge involves the client:

Solution 6: Subscribing to B is a synchronous operation, which returns the ID of the last known update (that is, the last update not published as a result of the subscription). The "get snapshot" operation provided by A returns both the snapshot and the ID of the last update included in the snapshot. The client first performs the subscription, then requests snapshots repeatedly, until the snapshot includes the update whose ID the subscription returned.
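A sketch of Solution 6's client-side retry loop, with two stubs standing in for A and B (all names and the canned snapshot sequence are illustrative):

```python
# Solution 6 sketched: subscribe first, learn the last-known update ID,
# then poll "get snapshot" until the snapshot has caught up to that ID.

class B:                      # subscription side: knows the last update ID
    def __init__(self): self.last_id = 2
    def subscribe(self, deliver):
        self.deliver = deliver
        return self.last_id   # ID of the last update not published to us

class A:                      # snapshot side: lags behind at first
    def __init__(self): self.snapshots = [([1], 1), ([1, 2], 2)]
    def get_snapshot(self):
        return self.snapshots.pop(0)   # (snapshot, last included ID)

a, b = A(), B()
seen = []
last_id = b.subscribe(seen.append)
while True:
    snapshot, included = a.get_snapshot()
    if included >= last_id:   # snapshot now covers the last-known update
        break
print(snapshot)  # [1, 2]
```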

In the direct variant, the A2→A1 edge is kept within A; if sending snapshot is expensive, this is probably better:

Solution 7: Subscribing to B is a synchronous operation, which returns the ID of the last known update. The "get snapshot" operation takes an update ID and checks it against the last update known by A. Sending the snapshot is delayed until the update in question has been processed by A.
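Solution 7's delay can be sketched with a condition variable; the service below is an illustrative stand-in, not an implementation from anywhere:

```python
# Solution 7 sketched: "get snapshot" takes an update ID and blocks
# until A has processed that update.
import threading

class SnapshotProvider:
    def __init__(self):
        self._cond = threading.Condition()
        self._state, self._last_id = [], 0

    def process_update(self, update, uid):
        with self._cond:
            self._state.append(update)
            self._last_id = uid
            self._cond.notify_all()

    def get_snapshot(self, wanted_id):
        with self._cond:   # wait until the wanted update has been processed
            self._cond.wait_for(lambda: self._last_id >= wanted_id)
            return list(self._state)

a = SnapshotProvider()
a.process_update("u1", 1)
t = threading.Thread(target=lambda: a.process_update("u2", 2))
t.start()
snap = a.get_snapshot(2)   # blocks until u2 has been processed by A
t.join()
print(snap)   # ['u1', 'u2']
```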

So much for that causality loop; moving on, we consider the loop caused by adding a B2→B1 edge. It also needs to be conditional, with the same condition; this time, we need to add an A1→B2 edge to have the information required to evaluate the condition:
In domain terms: "If the update is not in the snapshot when subscribing, then reprocess the update."
This seems to imply memory:

Solution 8: The update provider B caches a suitable amount of the most recent updates. The "get snapshot" operation returns both the snapshot and the ID of the last update included therein. The subscription operation takes an update ID, and results in all updates since that one to be sent to the client — whether they have already been processed before the subscription took place (in which case the update is taken from the cache), or arrive at B at a later point. The client first gets a snapshot synchronously, then subscribes with the ID thus obtained.
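Solution 8 might be sketched as follows, assuming serial numbers as update IDs (the cache here is unbounded for simplicity; a real B would prune it):

```python
# Solution 8 sketched: B caches recent updates; subscribing with an ID
# replays everything after it, whether from the cache or arriving live.
# All names are illustrative.

class UpdateProvider:                 # "B" with a replay cache
    def __init__(self):
        self.cache = []               # (uid, update), most recent last
        self.subscriber = None
        self.sub_from = None

    def process_update(self, uid, update):
        self.cache.append((uid, update))
        if self.subscriber and uid > self.sub_from:
            self.subscriber(update)

    def subscribe(self, last_seen_id, deliver):
        self.subscriber, self.sub_from = deliver, last_seen_id
        for uid, update in self.cache:
            if uid > last_seen_id:    # replay what arrived before subscribing
                deliver(update)

b = UpdateProvider()
b.process_update(1, "u1")
b.process_update(2, "u2")             # arrives at B before the subscription
snapshot_last_id = 1                  # the snapshot only included u1
seen = []
b.subscribe(snapshot_last_id, seen.append)
b.process_update(3, "u3")
print(seen)   # ['u2', 'u3'] -- u2 recovered from the cache
```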

There are more (of course there are)

The set of solutions found so far is obviously not complete; for instance, none of the solutions we've considered have involved reinventing even a quarter of Common Lisp (or Erlang).
Further meaningful solutions exist; for instance, one that lets the snapshot provider send updates until the real update provider has taken over (which resembles a hybrid between #5 and #7); or one in which the snapshot provider tries to include a given update ID (as in #7) but times out if this takes too long, in which case the client may repeat the snapshot operation (as in #6). Or one in which the "get snapshot" operation is replaced with a "get snapshot if update u is included" (also a hybrid between #6 and #7).

Conclusion

We have looked at a concurrent design problem in the view of causality chains, and attacked it using the concept of causality loops (and with the help of causality graphs). Using a fairly simple technique, we have generated a number of solutions — quite different solutions, as it turns out, but all correct and realistic, and with different trade-offs.
Causality loops as a design tool appears to be a useful way both to find correct solutions and to explore and outline the design space for concurrent design problems.

Acknowledgement

I would like to thank Kresten Krab Thorup, who unknowingly became the cause of this posting, and indirectly of many of the ideas in it.