Wednesday, September 23, 2009

Program Composition and Transformation in Unix

A few years ago, I got the feeling that language design could benefit from looking towards Unix and the lessons of the Unix culture.
A definite source of inspiration for this was Eric S. Raymond's The Art of Unix Programming, which among other things analyzes what the greatest strengths and most significant design decisions (or rather, design style elements) of Unix are.

One of the qualities I'd most like to carry over from Unix to programming languages has to do with modularity - the ease with which one can combine small components into great things; components which in themselves are simple and easily understood.



Properties of the command line
The Unix shell is a marvelous thing. It allows a reasonably skilled programmer to write many practical programs in one or a few lines, and in less than a minute; programs which in a "real" programming language would have taken considerably longer - and many of which would not have been written at all without this ease.
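As a small illustration, here is a sketch of the classic kind of one-liner this refers to - a word-frequency count (the input file name is hypothetical):

    # Print the ten most frequent words in a text:
    tr -cs 'A-Za-z' '\n' < input.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -10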
The power and ease come mainly from Unix's pipeline, the primary means of combining components into a greater whole. The pipeline has many interesting properties:
  • Transparency/debuggability - output can be inspected after each prefix of the pipeline. This can even be done while running the full pipeline, with the help of tee(1); see the sketch after this list.
    (tee even makes sense as the last part of the pipeline, where it lets you store your results and view them too.)
    (A rarely exploited - and perhaps not widely known - fact is that tee may even be used to change a linear pipeline into a tree-shaped one - like this: echo foo | tee >(tr a-z A-Z) | rev)
  • A pipeline can be extended with additional processing steps.
  • Or it can be cut in half - to debug each half on its own; to reuse output from a common prefix of two or more pipelines; or to run one part of the pipeline on a different machine or at a later time.
  • The components of a Unix pipeline run concurrently - it's parallelism that comes for free, without one having to do anything for it, and without muddling the semantics.
    Even on single-core machines, this may be a win if I/O and processing can be performed in parallel.
  • Isolation/encapsulation - The components are loosely coupled. A malfunction in one of the components of a pipeline cannot spread to other components except through clearly defined channels (bad output; premature or infinitely delayed termination; resource depletion). In particular, one component can't corrupt the internal state of another.
  • Context independence/agnosticism - The components are not, as a rule, aware of each other's existence; their behaviour in the pipeline does not change if the pipeline is changed - or even if they are run by themselves, without the pipeline. (The exception is through performing the isatty() query operation on the standard input and output streams.)
  • Reusability - Many of the atoms of a typical command line, and sometimes larger parts of one, are reusable.
  • A program like grep may be one of the most successfully and frequently reused pieces of code around - it is used in new contexts thousands of times each day.[1]
All in all, the Unix pipeline is a showcase for composability and reusability.
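To make the debugging and splitting points above concrete, here is a minimal sketch (the log file and its field layout are hypothetical):

    # tee(1) saves an intermediate result for inspection, without
    # disturbing the rest of the pipeline:
    grep ERROR app.log | tee /tmp/errors.txt | cut -d' ' -f1 | sort | uniq -c

    # The same pipeline cut in half, storing the intermediate result:
    grep ERROR app.log > /tmp/errors.txt
    cut -d' ' -f1 < /tmp/errors.txt | sort | uniq -c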

Let's look at the mentioned strengths in the perspective of the two desirable properties I wrote of in my previous post: guarantees and late binding.

First, guarantees:
(Given a few simplifying assumptions: that none of the components depend on isatty() or on any timers, or on how much data each read/write call transfers; that no component reads files written to by other components appearing later in the pipeline, or in other ways introduces causal loops; that CPU and memory resources are not too scarce; and that we do not care about side effects by components which may be precluded by premature termination of later components. Did I miss any...?
The majority of these assumptions are subsumed by this stronger version: That all components save the first are deterministic data transformations from standard input to standard output; many useful programs fulfill this criterion.)

  • The components of a pipeline are guaranteed not to interfere with each other except through a few well-defined ways.
  • The behaviour of components does not depend on whether their execution is performed serially, interleaved, or in parallel - nor, in the latter cases, on how the execution is interleaved or parallelized.
  • The behaviour of any prefix of a pipeline is guaranteed to be independent of the remainder of the pipeline.
  • Running a whole pipeline at once is guaranteed to be equivalent to running a prefix, storing the intermediate result, and later running the remainder with the intermediate result as input.
  • It is guaranteed that tee(1) can be inserted at any point in the pipeline without affecting its behaviour.
  • Running a whole pipeline on one machine is equivalent to running a prefix on one machine and the remainder of the pipeline on another machine (with input from the prefix), provided that any accesses to local resources by the remainder yield identical results in the two scenarios. (A sketch follows after this list.)
    (This equivalence rule is often used in the reverse direction: the first machine does not have access to a certain resource; therefore the pipeline is split up, so that it behaves *as if* the first machine did have access to the resource.)
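As a sketch of this machine-splitting equivalence (host and file names hypothetical):

    # The whole pipeline on one machine:
    zcat big.log.gz | grep ERROR | sort | uniq -c

    # Equivalent: run the prefix on the machine where the file lives,
    # and the remainder locally:
    ssh loghost 'zcat big.log.gz | grep ERROR' | sort | uniq -c

This also shows the reverse-direction use: if big.log.gz exists only on loghost, splitting the pipeline makes it behave as if the local machine had access to it.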
(Sorry about the abundance of lists - there will be a couple more.)
As for late binding, the loose coupling of one pipeline component to the next is much like late binding. It arguably does not count as true late binding, because a source code (i.e. pipeline) change is usually necessary to change the binding (though not always: a pipeline component may be a Bash if-statement, or may depend on environment variables - $PAGER being a good example). However, the source changes in question are pretty trivial.
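A minimal sketch of this kind of late binding (the script and the generate-report command are hypothetical):

    #!/bin/sh
    # view-report.sh: defer the choice of pager to the caller.
    generate-report | ${PAGER:-less}

    # Rebind at invocation time, without editing the script:
    PAGER='less -S' ./view-report.sh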

It is because of the loose couplings that
  • tee can be inserted at any point, facilitating debugging;
  • a pipeline may be broken at any point, trading parallel processing for storage of the intermediate result;
  • or broken at any point to let each half run on a separate machine;
  • processing steps can easily be added or removed.

Black-box program transformations
Various features of Unix provide other kinds of guarantees and late binding for a program.
  • Each program has an owner; it runs with the permissions of a certain user. This implies certain guarantees about which operations the program may and may not perform; primarily which files and devices it may read from and write to, and whether it is allowed to perform operations restricted to the superuser.

  • Most programs depend on the values of environment variables. This provides late binding with respect to e.g. which language the application speaks, which character encoding it uses, and other localization information.
    Another use of this is for configuration of preferred service programs - external programs invoked by the application to perform certain tasks; the environment variables PAGER, EDITOR and CVS_RSH are examples of this use of late binding.

  • Certain programs, primarily servers, need for security reasons to be restricted in regard to which files they may access, but still need to perform certain superuser-only operations. The solution is to put the program in a *chroot jail* - to irreversibly change its notion of where the root of the filesystem is, in a way that guarantees it can only ever access files within the corresponding subtree of the filesystem. This mechanism means that even if the program is compromised, the damage is contained.

  • For some purposes, it is useful to run a program with redefined versions of system functions. For instance, the development tool Electric Fence redefines the libc functions malloc and free with versions which make buffer overruns and similar errors more likely to be caught immediately (rather than, e.g., cause crashes at arbitrary subsequent times). Unix enables such context modifications through the dynamic linker's environment variables LD_PRELOAD and LD_LIBRARY_PATH - which make the program late-bound to library implementations other than the default.

  • Finally, it is occasionally useful to follow the sequence of system calls that a program performs. LD_PRELOAD could be used for this purpose, to the extent that the program only calls system functions through a given shared library - but this is not guaranteed. A more complete solution is provided through Unix's tracing facility, ptrace(2), as used by tools like strace(1); this is a late-binding feature that instruments a program in order to obtain a trace. (Rough sketches of these mechanisms follow below.)
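Rough sketches of these mechanisms (program, library and path names are hypothetical or installation-dependent):

    # chroot jail: the server can only ever see files under /var/jail
    chroot /var/jail /usr/sbin/myserver

    # Electric Fence: interpose its malloc/free via the dynamic linker
    LD_PRELOAD=libefence.so ./myprogram

    # strace: obtain a trace of the system calls a program performs
    strace -f -o trace.log ./myprogram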

Conclusion
To sum up, the Unix pipeline provides flexibility and other interesting properties in a form that has proven very convenient, intuitive and useful.
Furthermore, Unix provides certain forms of encapsulation (access restrictions), as well as certain "program transformations", by relatively simple mechanisms (and through late binding). Some of these transformations effectively change the substrate on which the program runs - the implementation of low-level operations - which, in a pipeline analogy, can be likened to changing or injecting a component (for instance, adding tracing corresponds to adding a tee component).

I find these features interesting from a programming language design perspective, because they are solutions to problems familiar to programmers, and because they are solutions which have proved to work rather well - and give a great deal of flexibility - through relatively simple means.




[1]: For illustration, here's one that can be used on a stream of Erlang-term-formatted property lists, to get a frequency count of the values associated with a given key: program-emitting-erlang-terms | ggrep -Eo '{keyword,[^}]+}' | sort | uniq -c | sort -nk1
