Friday, September 18, 2009

Programming Languages and Local Reasoning

Over time, there have been two interesting trends in programming languages: towards better guarantees and towards later binding.

By "guarantees", I mean facts and invariants which the language rules permit us to conclude from the code - such as the following (you may or may not recognize them, and the language features which give rise to them):
  • "The value in this variable is of this type."
  • "The value in this variable can only come from one of these expressions."
  • "The value of this object field can not be changed once the object has been constructed."
  • "This variable can only be accessed from one thread."
  • "This method can not be redefined in subclasses."
  • "This process will always be notified when this other process disappears."
  • "If this code point is reached, then this other code point will eventually be reached."
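Several of the guarantees above map directly onto familiar language features. As a sketch in Java (the class and field names are mine, purely illustrative):

```java
// Illustrative sketch of language features backing some of the guarantees above.
class GuaranteeExamples {
    // "The value in this variable is of this type": static typing.
    int count = 0;

    // "The value of this object field can not be changed once the object
    // has been constructed": a final field.
    final String label = "fixed";

    // "This variable can only be accessed from one thread": ThreadLocal
    // gives each thread its own independent copy of the value.
    static final ThreadLocal<StringBuilder> buffer =
            ThreadLocal.withInitial(StringBuilder::new);
}
```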
The two trends, better guarantees and later binding, correspond to two desirable properties of software: the ability to reason about the behaviour of software, and flexibility.



One important kind of guarantee is encapsulation.
Encapsulation enables us to reason about a program's behaviour because it limits the number of places we have to look to find all the pieces of code that might affect the encapsulated entity.

A rough evolution timeline might look like this:

Language Evolution
Machine language / assembly code:
No encapsulation, no guarantees beyond what the hardware and OS provide, no late binding (unless you count self-modifying programs, which I find little reason to do; they rather count as examples of the absence of a very valuable guarantee).
BASIC etc:
Still no encapsulation, but guarantees in the form of e.g. typed variables.
Structured Programming:
Data encapsulation in the form of local variables ("At which program points can a given variable be accessed?"); control encapsulation in the form of less reliance on labels ("Which program points might be the predecessors of a given program point, and under which circumstances?").
OOP:
Data encapsulation in the form of private variables. Late binding in the form of subtype polymorphism and virtual methods.
Language evolution is not linear, of course. Just to keep our horizons broad and to point out that OOP is not the only natural way forward (although it may dominate today's language landscape), let's include some non-OOP paradigms - the functional one, and the one employed by Erlang (a concurrent, functional paradigm that in these respects yields much the same benefits as OOP does):
Functional programming:
Same as Structured Programming, plus: data encapsulation in the form of immutable values and variables, and late binding in the form of higher-order functions.
Erlang:
Same as Functional Programming, plus: data encapsulation in the form of state private to a process; late binding in the form of weakly-coupled processes and callback modules. Further guarantees in the form of e.g. process link semantics.
(Having worked professionally with Erlang for nearly two years has left an impression. And, come to think of it, that's a very good sign: if you work with any tool for a prolonged period of time, shouldn't it shape the way you think? Preferably in an inclusive, letting-you-see-new-connections kind of way, rather than letting-you-see-nothing-but-nails.)

Late binding is very useful and associated with all kinds of sexy buzzwords like "reusability" and "hot upgrading" - but in the remainder of this article, I will focus on guarantees.

Language Guarantees
Deciding which guarantees a language gives under which assumptions is a crucial part of language design. And the form of those assumptions is also important.

As Dijkstra pointed out in his famous "Notes on Structured Programming", our cogitational capacities are limited - and we must recognize this fact and shape our efforts accordingly.
Regarding the tools we use to create computer programs, the following question therefore becomes important: which kinds of reasoning are our minds strong at, and which are they weak at?

Now, it doesn't take an expert on cognitive psychology to guess that we're better suited for what I would call "local reasoning" - where few facts and rules, preferably "close" in some sense, are involved in coming to a conclusion - than we are at more "global" reasoning, with more numerous and/or widespread facts and rules.

The most practical kind of conclusion is the kind where you, based on a sheetful or screenful of information, can say with confidence: yes, this property certainly holds.

The less practical kind is where you need to go through your entire code base to ascertain that there are no violations of a certain property.

In other words - in logic terminology - it's best if the premises of the useful conclusions are "exists"-style claims that can be verified locally ("there exists a final keyword before this declaration"), rather than "for all"-style claims that require a global check ("all accesses to this variable are synchronized").
Local reasoning, in other words. Even gods occasionally have problems with global reasoning.

Examples
To take an example from Java, the keyword final in front of the definition of a class, a method or a variable signals clearly that we're allowed certain assumptions about the defined entity: it is guaranteed not to be subclassed / redefined in a subclass / have its value modified. That's a strong and useful guarantee, easily read from the code: there exists a final keyword before the declaration, therefore we can conclude such-and-such.
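As a sketch (the class is hypothetical, names are mine), the guarantee is readable right at the declaration site:

```java
// Sketch: `final` makes the guarantee locally visible at the declaration.
final class Account {                 // cannot be subclassed
    private final String id;          // cannot be reassigned after construction

    Account(String id) {
        this.id = id;
    }

    String id() {                     // in a final class, not redefinable either
        return id;
    }
}
```

A reader needs only this screenful of code to conclude, with certainty, that no subclass anywhere redefines the behaviour of Account.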

An example of a less well-designed language feature, also from Java: synchronizing memory accesses between threads is done in Java with the keyword synchronized, placed in front of a method definition or a block statement. The presence of the keyword alone tells the reader of a program very little about the possible behaviours of the code block in question; most significantly, you cannot locally infer that while one thread is inside the synchronized block, no other thread will alter the value of any variable used in the block (except, of course, local variables). It is entirely possible for other threads to e.g. execute other methods on the same object, simultaneously with the execution of the synchronized region.
To get that kind of certainty, you need to ensure that all code points where the variable or variables in question are accessed are likewise within synchronized blocks - blocks which synchronize on the same object.
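The hole can be seen in a small sketch (a hypothetical class of my own making):

```java
// Sketch: `synchronized` on one method says nothing about other code paths.
class Counter {
    private int count = 0;

    synchronized void increment() {   // holds the monitor of `this`
        count++;
    }

    // Nothing stops another thread from running this concurrently with
    // increment(): it touches `count` without taking any lock, so the
    // synchronized keyword above provides no guarantee by itself.
    void unsafeReset() {
        count = 0;
    }

    synchronized int get() {
        return count;
    }
}
```

Looking only at increment(), one cannot tell whether `count` is actually protected; one has to inspect every method of the class - and every place `count` might leak to.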

In short, the synchronized language feature allows reasoning which reaches a positive conclusion only based on a "yes" answer to a "for all" question, which is less usable in practice - partly because the human mind is not overly good at working globally, enumerating all objects of a certain kind and checking that a property holds for each of them - and partly because, as code evolves, such non-local properties are more likely to break.

For these reasons, it would make good sense to attach synchronization specifications to the vulnerable variables, instead of just to the pieces of code which manipulate them. This is more demanding on the compiler, but more developer-friendly (especially as it concerns an area - concurrency - which is tricky and where many bugs are elusive and often found late in the development cycle).
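One can approximate this idea in today's Java, as a sketch (the Guarded class is hypothetical): wrap the vulnerable variable so that the only access path runs through a critical section.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch: the lock lives with the data, so "all accesses are
// synchronized" holds by construction rather than by global inspection.
class Guarded<T> {
    private T value;

    Guarded(T initial) {
        this.value = initial;
    }

    // The only ways to touch `value`; both hold the same monitor.
    synchronized <R> R read(Function<T, R> body) {
        return body.apply(value);
    }

    synchronized void update(UnaryOperator<T> f) {
        value = f.apply(value);
    }
}
```

Now the desirable conclusion - no unsynchronized access to the wrapped value exists - follows from local reasoning about this one class, rather than from a survey of the whole code base.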
/Examples

In summary, one axis on which to evaluate a language feature is the extent to which it provides guarantees - or takes them away (a feature like concurrency, for instance, usually invalidates certain assumptions) - and the extent to which the provided guarantees can lead to useful conclusions about code, using just local reasoning.
