Monday, May 23, 2011

On concurrency issues

Concurrency issues — race conditions and the like — are the worst category of bugs. These are the bugs whose absence cannot readily be proven by unit tests; these are the kind of bugs that hide away, biding their time until the most inopportune moment, then rearing their ugly, non-deterministic heads on your production system when it is at its busiest. And even then, they can continue to exist unlocated for quite a while, despite many hours being put into tracking them down. Elusive, hardly reproducible, yet ultimately expensive; I've seen it happen more than once.

What follows are some thoughts on the anatomy of the typical issues.

Your basic multi-threading bug
As any course on multi-threaded programming will tell you, when multiple threads of execution are to run concurrently, care needs to be taken or there will be a risk of data corruption and/or unintended results.

More specifically, the threat (a "race condition" or "data race") is present when:
  1. one thread modifies the state of an object
  2. at the same time as
  3. another thread accesses the object.
That is: it takes a coincidence, a conjunction of three conditions.
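To make the conjunction concrete, here is a minimal Java sketch (a hypothetical class of my own, not from any real code base) in which all three conditions can coincide: two threads, a shared mutable counter, and no synchronization.

```java
// Hypothetical illustration: a counter with a classic data race.
public class RacyCounter {
    private int count = 0;           // shared, mutable, unprotected

    public void increment() {
        count = count + 1;           // read-modify-write: not atomic
    }

    public int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        RacyCounter c = new RacyCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) c.increment();
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // One would hope for 200000, but lost updates make the
        // result non-deterministic; it varies from run to run.
        System.out.println(c.get());
    }
}
```

Run it a few times and the final count will usually differ between runs; that variability is the race.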
Let's analyze it...:

Analysis

Another way of stating the above is obtained by reversing the statement:
We can avoid concurrency issues by always making sure that any object either
  1. is never modified; or
  2. is never accessed by two threads simultaneously (or stronger: never written to while accessed otherwise); or
  3. can only ever be accessed by one thread.
Such objects are known as, respectively,
  1. Immutable objects.
  2. Objects with state protected by a mutex (synchronization lock; monitor).
and, the one used less often,
  3. Single-thread objects — or, stronger still, objects of linear type.
The first two options are well-known; personally, I'm increasingly coming to consider the strong third option — objects with an enforced linear lifecycle — to be rather overlooked in the language design space.
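In Java terms, the three categories might be sketched like this (the class names are mine, purely for illustration). Note that the third category has no language-level enforcement in Java; the best one can do is document the constraint and, optionally, check it at run time:

```java
// Category 1: immutable — safe to share freely between threads.
final class Point {
    private final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    int x() { return x; }
    int y() { return y; }
}

// Category 2: mutable state protected by a mutex (here: the object's monitor).
final class SyncedCounter {
    private int count = 0;
    synchronized void increment() { count++; }
    synchronized int get() { return count; }
}

// Category 3: single-thread object. Java cannot enforce this statically,
// so the constraint is documented and checked at run time instead.
final class SingleThreadBuffer {
    private final Thread owner = Thread.currentThread();
    private final StringBuilder data = new StringBuilder();

    private void checkOwner() {
        if (Thread.currentThread() != owner)
            throw new IllegalStateException("accessed from non-owner thread");
    }

    void append(String s) { checkOwner(); data.append(s); }
    String contents()     { checkOwner(); return data.toString(); }
}
```

That the third class needs a hand-written run-time check is exactly the "overlooked design space" point: few languages let you state this constraint and have it enforced.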

"Concurrency-ready" metric
One possible metric for how well a programming language is designed to express complex concurrent systems is, then, how easy it is to enforce that all objects fall into one of the three categories.

Languages and concurrency

Let's apply this view to a few programming languages. The two languages I've used most recently are Java and Erlang:
Erlang
In Erlang, there are the following kinds of objects: terms, message queues, process dictionaries and other process metadata, ETS tables, ports (files and drivers).
  • Terms are immutable values. They may contain handles of other kinds of objects, but the handles themselves are also immutable.
  • Certain objects — private ETS tables, and to some extent ports — are single-thread objects.
  • The rest — message queues, public ETS tables, process metadata etc. — are mutex-protected.
Single-thread objects are enforced to be used only by their owning thread.
In Erlang, low-level data races only occur if there's a bug in the Erlang run-time system, or if you write your own driver and introduce one there.

Java
In Java, you could argue both that there are fewer kinds of object — just one, really, the class-file-defined kind — and that there is a much wider range of object kinds. There are certainly thousands of classes, all defined by one well-defined scheme, which includes just a few mechanisms relevant to concurrency.
But the problem here is the number of classes — because it is at the class level that it is ensured that the corresponding objects will be thread-safe. A well-designed and -implemented class may be thread-safe, in that it is either immutable or uses appropriate inter-thread synchronization (and encapsulation) to ensure thread-safety. A less well-written class may rely implicitly on details of the context in which it is used, and really be single-thread-use only, or multi-thread usable only in certain unstated conditions.
Java is one of the few languages actually designed for portable multi-threaded programs; in particular, it has explicitly stated semantics wrt. multi-threaded execution. However, as the comparison above indicates, it has its shortcomings. For a language to claim good concurrency support, it should provide mechanisms for good, usable guarantees to be derived from local context (e.g. "we know that this-and-this property always holds, because these few lines here ensure that it does").
I've expounded earlier on the importance of supporting local reasoning. As it happens, Java did also then get some criticism (sorry, Java, but you're the modern main-stream language I happen to know the best...).
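As an example of a guarantee that can be derived from local context in Java: the Java Memory Model promises that writes made before a volatile write are visible to any thread that subsequently observes that volatile write. A small sketch (a hypothetical class of my own):

```java
// Sketch of a locally checkable guarantee under the Java Memory Model:
// everything written before a volatile write is visible to a thread that
// subsequently reads that volatile (happens-before ordering).
public class Handoff {
    private int payload;              // plain, non-volatile field
    private volatile boolean ready;   // flag with happens-before semantics

    void publish(int value) {
        payload = value;   // 1: ordinary write...
        ready = true;      // 2: ...made visible by the volatile write
    }

    Integer tryConsume() {
        if (ready) {           // volatile read
            return payload;    // guaranteed to see the value from publish()
        }
        return null;
    }
}
```

The point is that these few lines suffice to establish the property; no knowledge of the rest of the program is needed. That is the kind of local reasoning the text above asks for.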

A Java exercise
Imagine that you have in front of you the source code of a Java class.
A quick inspection reveals that all methods are declared "synchronized".
What kinds of thread-safety issues might the class yet have? Try to list at least three ways in which the class may be non-thread-safe.
I'll present my list at the end of this article.

Does it matter?

But building security against these issues into a language is not exactly trivial, you might argue. Are stricter language rules, additional compiler analysis, and/or costly run-time support just for preventing concurrency issues not overkill?
Sure, the trend is towards increasing parallelism and so on, but we have done quite fine without such extra measures so far, haven't we?
And bondage-and-discipline languages have been out of fashion for a while. Dynamically typed languages are as popular as ever!

The difference is this:

You can easily live and work with the relative uncertainties of dynamic typing — but then, you can unit test and get some confidence that the types match. If something is broken, or breaks later, then there's a good chance it'll be caught.
For many concurrency issues, unit tests are not likely to catch any errors. Furthermore, the necessary invariants aren't local — they are often widely dispersed in the code. To convince yourself that the code is correct, you more or less need to keep it all in your head at once. That, combined with missing language support for documenting and/or enforcing vital non-local invariants, means that they will perhaps not be communicated to whoever makes the next change in the code, who will therefore not have the full picture necessary to keep things correct.

Rather than being caught at the next full test suite run, or at least quite soon after the rubber hits the road, here's what happens to a race-condition bug:
  • The program may appear to work most of the time.
  • The issue will tend to manifest itself at the most inopportune moment: Not during development or testing (unless explicit and considerable effort is taken to stress-test against such issues), but in production, when your servers are at their busiest, or on the desktop of your busiest client.
  • Often, what clues you have amount to little besides "There almost certainly is an issue, it presumably is a software bug, the issue is probably in our code — and it occurs seldom, so it's likely to be a concurrency issue of some kind."
  • Replicating the issue may be difficult; under the exact same circumstances, the program may run just fine.
    Indeed, if the problem does manifest itself during development, it'll appear to have gone at the next run.
    The issue may even be technically impossible to reproduce on some machine architectures, because it requires multiple physical processors of the right kinds to manifest itself (and your servers or other production environment is less likely than the development machines to be thus bug-resistant).
  • Eliminating a tricky concurrency bug can be a drawn-out experience in all phases — detecting it, reproducing it, tracking down its root cause, verifying that it has gone — all steps tend to be markedly more difficult than for deterministic bugs.
  • With any kind of bug, tracking it down once is one thing; there may even be a feeling of gratification once you've done it. Tracking the same bug down twice is another matter — it is deeply frustrating to realize that there's a reason for your feeling of déjà vu. This is one of the reasons for regression testing: bug hunting may be stimulating in its own way, but it'd better lead to a different bug each time.
    This goes doubly (well, even more than that) for the elusive non-deterministic bugs.
    Sadly, because of their nature it is at best difficult, and often near impossible, to write regression tests for these kinds of issues.

Conclusion

Languages differ in concurrency support. Nothing new about that, of course, but I think it likely that many developers using one of the mainstream languages with a relatively good level of support in that area may not know that there are alternatives which are significantly more concurrency-ready.
At the beginning of this text, the prerequisites for a race condition were broken down, and three kinds of conditions for avoiding race conditions were derived; based on this, I suggested a qualitative metric for a language's level of support for concurrent programming. I hope to have demonstrated that it may be a useful way of looking at both languages and concrete programs or program designs.

In the absence of a programming language providing strong, local guarantees with respect to thread-safety, as a developer you need to be alert whenever there's a chance that two threads may be executing your code concurrently. The best way of doing this is probably through discipline — for instance, by clearly constructing each class so that it falls into one of the above categories: Immutable, thread-safe through synchronization, or single-thread use only — and then using them strictly according to this. That is one way of trying to restore local reasoning.

Do you write programs involving multiple threads? If so, are you familiar with which consistency guarantees your platform actually provides (e.g., the Java Memory Model)?
Do you regularly stress-test your program on a system with multiple physical processors? I hope you do.
Dealing with concurrent programs in general requires good global, combinatorial-temporal reasoning abilities. Probably not your best-developed cognitive mode... :-)
The solution? Keep things as simple as possible. Find rules that work, and follow them. Encapsulate the issues, so that you can deal with just one question at a time. If possible, let the rules be checked mechanically.


Answer to Java exercise
Some ways in which a Java class may be unsafe even though all methods are declared "synchronized":
  1. A field is public, and either
    1. Non-final (and non-volatile) or
    2. Referring to a non-thread-safe object.
  2. The superclass is non-thread-safe, and the current class does not, or cannot, override the offending behaviour.
  3. An instance method modifies the value of a static variable without proper synchronization.
  4. An instance method reads the value of a static variable which is also modified by a static method, without proper synchronization.
  5. A method exposes a reference to a non-thread-safe object referred to directly or indirectly by the current object.
    1. By returning such a reference.
    2. By passing such a reference to some method which stores the reference somewhere accessible to another thread.
    3. By starting a new thread and giving it such a reference.
    (I.e., a non-thread-safe object becomes shared.)
  6. A field (which may be private) refers to a non-thread-safe object which may be shared because
    1. It originated as (or was extracted from) a parameter to a constructor
    2. It originated as (or was extracted from) a parameter to a method
    3. It was returned from a call
    (I.e., a non-thread-safe object is already unexpectedly shared.)
  7. An inner class accesses an instance variable of the surrounding class, but fails to synchronize on the right this.
  8. An instance method locks on another object which may be of the same class (this may result in a deadlock)
    1. Implicitly, by calling a (synchronized) method on the other object
    2. Explicitly, using "synchronized"
    (A plausible example of this is a synchronized equals() method.)
  9. A constructor leaks this to some place where it can be accessed by another thread, and the object has at least one (final) variable which is accessed without synchronization.
    (This is because the special rule for final fields, which allows them to be accessed without synchronization, only applies after the constructor has completed.)
  10. A method exposes a reference to a thread-safe object referred to directly or indirectly by the current object, and that thread-safe object allows mutation (of itself or of a contained object) in a way that the current class is not prepared for.
This list is quite likely not complete; it is not the result of a systematic analysis of the matter.
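To illustrate just one item from the list (5.1 above), here is a hypothetical class whose methods are all declared "synchronized", yet which leaks its internal, non-thread-safe state:

```java
import java.util.ArrayList;
import java.util.List;

// All methods synchronized, yet not thread-safe (item 5.1 above):
// getItems() hands out a reference to the internal, unsynchronized list.
public class LeakyRegistry {
    private final List<String> items = new ArrayList<>();

    public synchronized void add(String item) { items.add(item); }

    // Leak: callers can now mutate and iterate 'items'
    // without holding the registry's lock.
    public synchronized List<String> getItems() { return items; }

    // Safer alternative: return a copy taken under the lock.
    public synchronized List<String> snapshot() {
        return new ArrayList<>(items);
    }
}
```

The "synchronized" keywords here protect the operations on the list, but not the reference to it; once the reference escapes, the lock is bypassed entirely.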

Monday, May 16, 2011

Testing distributed-store algorithms

This is a follow-up to my post on a datastructure for storing collections in Riak.
While I have been planning this follow-up, Kresten Krab Thorup has actually gone ahead and implemented the algorithm (on GitHub; see src/riak_column.erl) — which suits me nicely, as I haven't written a single line of implementation and am not really past the thinking stage yet :-)
To wit:

Shortly after writing the post, I discovered a couple of issues or finer points, both of which have to do with the fact that different rows in Riak live independent lives — they are independently versioned, and no cross-row guarantees are given — specifically, nothing can be concluded from the read order: if one client A updates row X, then row Y, then another client B may see the Y update and later see the old version of row X.

(If Riak is set up with appropriate consistency settings, such that the majority of the replicas of an object are written synchronously, then you do have indirectly some sort of guarantee. Below a certain threshold of machine and/or network problems, that is.
But even so, nothing can be concluded from the read order if there are several clusters set up with cluster-to-cluster replication: changes on different keys in one cluster arrive at the other clusters in arbitrary order.)

This leads to at least the following issues:
  1. Firstly, when splitting a row into two, I included this step: "Write an empty row back under the original auxiliary row key".
    In fact, doing that would be wrong. Even writing some tombstone dummy value in the old row would be a bad idea; instead, one should simply mark the row as obsolete while keeping the old data. This is necessary because a client accessing the collection later on may see from the master row that the old row has been split, but not see the new rows. Or it may see the old row, but not the change in the main row. In the former case, it is necessary that at least the values in the old row be available (or the collection would suddenly have shrunk); in the latter case, it would not be apparent that the values in the old row are obsolete.

  2. Secondly, after modifying an auxiliary row, the main row should be updated — even if nothing in it has changed. This may sound silly, but it is the easiest way to ensure that, in the event of a concurrent row split, the auxiliary row in question is taken into account at a subsequent merge (indeed, that the merge is triggered at all).

  3. And even that is not enough if there is more than one cluster: At the time of the read repair of the main row, the update for the auxiliary row may not have arrived - and it may therefore be lost silently. It appears that some kind of versioning of the auxiliary rows is necessary; with such versioning, we can tell when repairing the main row that we're missing an update on the pre-split auxiliary row, and that the reference to it should therefore be kept in the main row, so that it can be repaired later when all information is available.
Ah, the perils and challenges of incomplete information.

Managing concurrent subtlety
As the saying goes,
Meddle not in the affairs of concurrent systems for their ways are subtle, and are quick to anger.
--
Tolkien
In a domain as subtle as this — with gotchas in the style of the above-mentioned issues — how can we ever convince ourselves that a scheme like the one for Riak collections is implemented robustly and correctly?

As always, there are two ways: Formal verification — proving that the program is correct; and thorough testing — gaining confidence by exercising the program.
Both have their advantages and disadvantages.

If I were Dijkstra, I'd develop a formal proof (as he did in quite a few of his blog postings), and probably modify the algorithm appropriately along the way. I, however, am no Dijkstra, and do not believe I have the time to develop a formal proof — nor do I have any delusion of being able to write anything worth reading in the process. Luckily, going the other way, through testing, has something going for it as well:
  • It is not always clear which guarantees the components we build on actually provide. That means that, regarding formal proof, the set of axioms may be uncertain.
  • The same test can be applied to different implementations of a component.
  • The same test can be applied to different versions of a component — i.e. if we modify the code, we can cheaply gain some confidence in the modified version.
For both formal verification and testing, the answers you get depend on the questions you ask.
When proving a property, you prove only that property; when you test a concrete call sequence with concrete values, you test with only those values.
Which is of course a good reason to get familiar with property-based testing — which lies somewhere in between the two, in that it tests a property on a number of concrete call sequences — typically a few hundred, and typically with different instances each time the test is run.

This can be a nice compromise between formal verification and hand-rolled test cases — provided of course that the properties and the instance generators are chosen well.
As always with testing, paranoia and imagination are key. But you get more value for your paranoia when it's used to power random test case generation.

For the problem in question, and assuming an Erlang implementation like KKT's, property testing can be done with relative ease using a tool like Quviq QuickCheck or Proper. Their support for testing state machines comes in handy; I've not tried using it before, but this seems a good occasion.

How to test randomly
Randomized testing involves abstracting over usage scenarios.
How the abstraction is done determines what will be tested.
Knowledge of the problem domain is the primary guide; paranoia together with perhaps knowledge of implementation details should provide additional input.
What must be considered is:
  • What is tested:
    Which API functions to exercise?
    Which invariants to verify?
  • How it is tested:
    Test concurrent use to check for thread safety?
    Test with invalid inputs?
    Should certain special circumstances be simulated - e.g., file I/O errors, disk-full conditions, network delays, packet loss?
  • With what it is tested:
    Test with abnormal (very long/short/high/low) inputs?
    Key collisions?
    Many values for the same key?
    Keys which are nearly identical?
This is of course where the imagination and paranoia enter the picture.
You can't assume that "since we generate the values randomly over a large domain, we exercise all code paths" — exercising e.g. a dictionary with a million randomly generated keys is of little use if the keys never collide.
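One way to avoid that trap is to bias the generator deliberately. A small Java sketch (the class and its parameters are mine): draw most keys from a small fixed pool, so that collisions are guaranteed, and mix in the occasional fresh key.

```java
import java.util.Random;

// Sketch: bias random key generation so that collisions actually occur.
// Drawing from a small pool guarantees repeated keys; drawing uniformly
// from all ints would make collisions vanishingly rare.
public class KeyGen {
    private final Random rnd;
    private final String[] pool;

    KeyGen(long seed, int poolSize) {
        this.rnd = new Random(seed);
        this.pool = new String[poolSize];
        for (int i = 0; i < poolSize; i++) pool[i] = "key-" + i;
    }

    String next() {
        if (rnd.nextInt(10) == 0) {
            return "key-" + rnd.nextInt();        // occasional fresh key
        }
        return pool[rnd.nextInt(pool.length)];    // frequent repeat
    }
}
```

Property-testing tools typically offer combinators for exactly this kind of weighted choice; the principle is the same.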

How to test distributed collections
OK then, how to test an implementation of the distributed collections scheme?
Using the state machine testing support of Proper, we can maintain a "model" collection alongside the subject-under-scrutiny collection (shortened to SUS in the following).

What to test:
This is the easiest part: We will test the usual collection functions: insert, delete, lookup, list keys.
The invariant is that the result is consistent with the same operation performed on the model collection — as defined below.
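The model-versus-SUS idea can be sketched without any framework at all. Here is a hand-rolled version in Java (class names mine), using a TreeMap as the trusted model and a HashMap merely standing in for the subject under scrutiny; in the real test, the SUS would of course be the distributed collection:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.TreeMap;

// Hand-rolled model-based test sketch: drive the SUS and a trusted model
// with the same random operations, and check that every result agrees.
public class ModelTest {
    public static void run(long seed, int steps) {
        Random rnd = new Random(seed);
        Map<Integer, Integer> sus = new HashMap<>();   // stand-in for the real SUS
        Map<Integer, Integer> model = new TreeMap<>(); // trusted model

        for (int i = 0; i < steps; i++) {
            int key = rnd.nextInt(16);                 // small key space => collisions
            int op = rnd.nextInt(3);
            if (op == 0) {
                check(sus.put(key, i), model.put(key, i));
            } else if (op == 1) {
                check(sus.remove(key), model.remove(key));
            } else {
                check(sus.get(key), model.get(key));
            }
        }
        if (!sus.equals(model)) throw new AssertionError("final states differ");
    }

    private static void check(Object got, Object expected) {
        if (!Objects.equals(got, expected))
            throw new AssertionError("mismatch: got " + got + ", expected " + expected);
    }
}
```

Tools like Proper or QuickCheck add the important extras: generated command sequences, shrinking of failing cases, and explicit state-machine models.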

How to test:
We'll certainly need to test concurrent use. Let's say that there are three simultaneous users of the collection; we know that there are subtleties involving two, and there may be additional ones involving at least three.
The users won't be really simultaneous, though — we will test concurrent use in a way where the concurrency is made explicit. This provides repeatability and insight into why things fail, which is indispensable for this problem domain.
So, instead of having an actual underlying Riak store, we'll mock it up, in such a way that the mock provides just the guarantees we actually expect the real thing to provide.


Model representation
The structure I have in mind is: the mock remembers all versions of all values put into it. This is what the entire simulated multi-cluster store — let's call it the "cloud" — has ever seen.
The mock furthermore has a number of "views" of the store, corresponding to what you would see if you accessed the cloud at different points. We'll make the assumption that in any of these views, for any given key, the version will only increase with time — i.e., we do not expect the version of any value in any view to evolve backwards. The mock keeps track of the view-to-version mapping for all keys.

One thing that can happen, then, beside operations on collections, is that value versions find their way from one view to another. Also, version merging can happen in transit — the versions being merged are not necessarily the latest value in any view, but may be any versions that have ever existed in the cloud.
For simplicity, let's say that this is modelled with views also — if we allow enough views, this won't reduce the generality.

The model thus consists, not just of a simple collection, but of an abstract model of the storage cloud.
Beside the externally visible events — the calls to the collection API — there are internal events: a given version of a given key finds its way from one view to another. Both kinds of events go into the randomly generated test scenario specification.
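The mock store itself can be sketched in a few lines of Java (all names are mine; the real implementation would be in Erlang): every version ever written is remembered, and each view maps a key to the version index currently visible there, moving only forwards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the mock "cloud" store described above: full version history
// per key, plus per-view visibility that never moves backwards.
public class MockCloud {
    private final Map<String, List<String>> history = new HashMap<>();  // key -> all versions ever
    private final List<Map<String, Integer>> views = new ArrayList<>(); // view -> (key -> version idx)

    public MockCloud(int viewCount) {
        for (int i = 0; i < viewCount; i++) views.add(new HashMap<>());
    }

    // A write through some view appends a new version and makes it visible there.
    public void put(int view, String key, String value) {
        List<String> versions = history.computeIfAbsent(key, k -> new ArrayList<>());
        versions.add(value);
        views.get(view).put(key, versions.size() - 1);
    }

    // Internal event: a given version of a key propagates to another view.
    // A view never moves backwards for any key, so older versions are ignored.
    public void propagate(int toView, String key, int version) {
        Integer current = views.get(toView).get(key);
        if (current == null || version > current) {
            views.get(toView).put(key, version);
        }
    }

    public String get(int view, String key) {
        Integer v = views.get(view).get(key);
        return v == null ? null : history.get(key).get(v);
    }
}
```

The random scenario generator would then interleave collection-API calls (each performed through some view) with propagate events, which is exactly the "internal events" idea above.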

The test
It was originally my intention to end with putting all of this together in a Proper-based Erlang unit test — it would then be a good occasion for me to get experience with the state-machine modelling support (described from p. 26 of the documentation).
However, it appears that I have a latency-vs.-completeness tradeoff to make, so I'd better stop here, publish what I have, and hope to return to the subject soon, hopefully with some concrete code. (While this posting has been under way, naturally other topics have come up which I'd like to write about; the order in which further postings arrive here is undetermined...)