What follows is some thoughts on the basis of the typical issues.
Your basic multi-threading bug
As any course on multi-threaded programming will tell you, when multiple threads of execution are to run concurrently, care needs to be taken or there will be a risk of data corruption and/or unintended results.
More specifially, the threat (a "race condition" or "data race") is present when:
- one thread modifies the state of an object
- at the same time as
- another thread accesses the object.
Let's analyze it...:
Analysis
Another way of stating the above is obtained by reversing the statement:We can avoid concurrency issues by always making sure that any object either
- is never modified; or
- is never accessed by two threads simultaneously (or stronger: never written to while accessed otherwise); or
- can only ever be accessed by one thread.
- Immutable objects.
- Objects with state protected by a mutex (synchronization lock; monitor)
- Single-thread objects — or even stronger: objects of linear type.
"Concurrency-ready" metric
One possible metric for how well a programming language is designed to express complex concurrent systems is, then, how easy it is to enforce that all objects fall into one of the three categories.
Languages and concurrency
Let's apply this view to a few programming languages. The two languages I've used most recently are Java and Erlang:- Erlang
- In Erlang, there are the following kinds of objects: terms, message queues, process dictionaries and other process metadata, ETS tables, ports (files and drivers).
- Terms are immutable values. They may contains handles of other kinds of objects, but the handles themselves are also immutable.
- Certain objects — private ETS tables, and to some extent ports — are single-thread objects.
- The rest — message queues, public ETS tables, process metadata etc. — are mutex-protected.
In Erlang, low-level data races only occur if there's a bug in the Erlang run-time system, or if you write your own driver and include one.
- Java
- In Java, you could argue both that there are fewer kinds of
object — just one, really, the class-file defined kind — and that there
is a much wider range of object kinds.There are certainly thousands of
classes, defined by one well-defined scheme, which includes just a few
mechanisms relevant to concurrency.
But the problem here is the number of classes — because it is at the class level that it is ensured that the corresponding objects will be thread-safe. A well-designed and -implemented class may be thread-safe, in that it is either immutable or uses appropriate inter-thread synchronization (and encapsulation) to ensure thread-safety. A less well-written class may rely implicitly on details of the context in which it is used, and really be single-thread-use only, or multi-thread usable only in certain unstated conditions.
I've expounded earlier on the importance of supporting local reasoning. As it happens, Java did also then get some criticism (sorry, Java, but you're the modern main-stream language I happen to know the best...).
A Java exercise
Imagine that you have in front of you the source code of a Java class.I'll present my list at the end of this article.
A quick inspection reveals that all methods are declared "synchronized".
What kinds of thread-safety issues might the class yet have? Try to list at least three ways in which the class may be non-thread-safe.
Does it matter?
But building security against these issues into a language is not exactly trivial, you might argue. Is stricter language rules, additional compiler analysis, and/or costly run-time support just for preventing concurrency issues not just overkill?Sure, the trend is towards increasing parallellism and so on, but we have done quite fine without such extra measures so far, haven't we?
And bondage-and-discipline languages have been out of fashion for a while. Dynamically typed languages are as popular as ever!
The difference is this:
You can easily live and work with the relative uncertainties of dynamic typing — but then, you can unit test and get some confidence that the types match. If something is broken, or breaks later, then there's a good chance it'll be caught.
For many concurrency issues, unit tests are not likely to catch any errors. Furthermore, the necessary invariants aren't local — they are often widely dispersed in the code. To convince yourself that the code is correct, you more or less need to keep it all in your head at once. That, combined with missing language support for documenting and/or enforcing vital non-local invariants, means that they will perhaps not be communicated to whoever makes the next change in the code, who will therefore not have the full picture necessary to keep things correct.
Rather than being caught at the next full test suite run, or at least quite soon after the rubber hits the road, here's what happens to a race-condition bug:
- The program may appear to work most of the time.
- The issue will tend to manifest itself at the most inopportune moment: Not during development or testing (unless explicit and considerable effort is taken to stress-test against such issues), but in production, when your servers are at their busiest, or on the desktop of your busiest client.
- Often, what clues you have amount to little besides "There almost certainly is an issue, it presumably is a software bug, the issue is probably in our code — and it occurs seldomly, so it's likely to be a concurrency issue of some kind."
- Replicating the issue may be difficult; under the exact same
circumstances, the program may run just fine.
Indeed, if the problem does manifest itself during development, it'll appear to have gone at the next run.
The issue may even be technically impossible to reproduce on some machine architectures, because it requires multiple physical processors of the right kinds to manifest itself (and your servers or other production environment is less likely than the development machines to be thus bug-resistant). - Eliminating a tricky concurrency bug can be a drawn-out experience in all phases — detecting it, reproducing it, tracking down its root cause, verifying that it has gone — all steps tend to be markedly more difficult than for deterministic bugs.
- With any kind of bug, tracking it down once is one thing; there
may be even be a feeling of gratification once you've done it. Tracking
the same bug down twice is another matter — it is deeply frustrating to
realize that there's a reason for your feeling of déja-vu. This is one
of the reasons for regression testing: bug hunting may be stimulating
in its own way, but it'd better lead to a different bug each time.
This goes doubly (well, even more than that) for the elusive non-deterministic bugs.
Sadly, because of their nature it is at best difficult, and often near impossible, to write regression tests for these kinds of issues.
Conclusion
Languages differ in concurrency support. Nothing new about that, of course, but I think it likely that many developers using one of the mainstream languages which have a relatively good level of support in that area may not know that there are alternatives which are significantly more concurrency-ready.In the beginning of this text, the prerequisites for a race condition were broken down and three kinds of conditions for avoiding race condition were derived; based on this I suggested a qualitative metric for a language's level of support for concurrent programming. I hope to have demonstrated that it may be a useful way of looking at both languages and concrete programs or program designs.
In the absence of a programming language providing strong, local guarantees with respect to thread-safety, as a developer you need to be alert whenever there's a chance that two threads may be executing your code concurrently. The best way of doing this is probably through discipline — for instance, by clearly constructing each class so that it falls into one of the above categories: Immutable, thread-safe through synchronization, or single-thread use only — and then using them strictly according to this. That is one way of trying to restore local reasoning.
Do you write programs involving multiple threads? If so, are you familiar with which consistency guarantees your platform actually provides (e.g, the Java Memory Model)?
Do you regularly stress-test your program on a system with multiple physical processors? I hope you do.
Dealing with concurrent programs in general requires good global, combinatorial-temporal reasoning abilities. Probably not your best-developed cognitive mode... :-)
The solution? Keep things as simple as possible. Find rules that work, and follow them. Encapsulate the issues, so that you can deal with just one question at a time. If possible, let the rules be checked mechanically.
Answer to Java exercise
Some ways in which a Java class may be unsafe even though all methods are declared "synchronized":
- A field is public, and either
- Non-final (and non-volatile) or
- Referring to a non-thread-safe object.
- The superclass is non-thread-safe, and the current class does not or can not override the causes.
- An instance method modifies the value of a a static variable without proper synchronization.
- An instance method reads the value of a a static variable which is also
modified by a static method, without proper synchronization.
- A method exposes a reference to a non-thread-safe object referred
to directly or indirectly by the current object.
- By returning such a reference.
- By passing such a reference to some method which stores the
reference somewhere accessible to another thread.
- By starting a new thread and giving it such a reference.
- A field (which may be private) refers to a non-thread-safe object which may be shared because
- It originated as (or was extracted from) a parameter to a constructor
- It originated as (or was extracted from) a parameter to a method
- It was returned from a call
- An inner class accesses an instance variable of the surrounding class, but fails to synchronize on the right this.
- An instance method locks on another object which may be of the
same class (this may result in a deadlock)
- Implicitly, by calling a (synchronized) method on the other object
- Explicitly, using "synchronized"
- A constructor leaks this
to some place where it can be accessed by another thread, and the
object has at least one (final)
variable which is accessed without synchronization.
(This is because the special rule for final fields, which allows them to be accessed without synchronization, only applies after the constructor has completed.) - A method exposes a reference to a thread-safe object referred to directly or indirectly by the current object, and that thread-safe object allows mutation (of itself or of a contained object) in a way that the current class is not prepared for.