Friday, December 27, 2013

Contra-testing

In software development, it is common practice to have a suite of unit tests which can be run automatically against the program under development. In the following, I’ll introduce (or quite possibly reinvent) a way of improving the quality of a test suite.
How well the tests check the program is difficult to quantify. One common way is to use a coverage tool, which monitors the test execution and records which of the program’s code paths are exercised and which aren’t.
However, while a good test suite yields good coverage of the program, the reverse is not necessarily true: a test suite may give good program coverage yet not check the program’s behaviour very well. For instance, a test suite which runs the program with a lot of different inputs, but never performs any checks on the program’s output, might give good coverage without finding any bugs. In a less pathological example, one unit test may exercise a code path, and another test may contain the check which would expose a bug in that code path; this, too, would exhibit good program coverage, yet not catch the bug.
Another danger when writing a unit (or other) test is that it doesn’t test what it’s supposed to test. The important assertions may be defunct for some reason or other; sometimes this lack is caught, sometimes not.
In this respect, the test is no different from any other program: sometimes it doesn’t do what it is supposed to do. Except that the risk is in some respects greater for test code: if it doesn’t work as expected, there’s a significant probability that the result will just be a green lamp, with no-one being the wiser.
A different, but related problem with test code is that of maintenance. As the requirements, structure, and maintainers of a program change, the original intent of the code may be muddled somewhat, making it less obvious what the test is supposed to check, and whether those checks remain relevant.
So — who guards the guards? Who tests the tests — keeps the checkers in check?
For the main program, both failure to behave as intended and ceasing at some point to do so are often caught by the test suite. (That’s the idea, anyway.) For tests, on the other hand, either kind of failure may go unnoticed.
Part of the problem is in essence one of coverage — not of the program, which is commonly measured, but of the test code, which seldom is. It is, indeed, a common feature of test code, by its very nature, that certain of its code paths are seldom if ever exercised.
Can you test the tests, then? Can you exercise them more thoroughly?
Test-driven development (TDD) is a partial answer to this problem: in the TDD methodology, tests are written to first fail, reflecting the non-compliance of the code, then the code is changed to make the test succeed. That means that the error-path of the test is taken at least once; definitely an improvement.
Here, though, I want to go a step further — and aim for continual testing of the tests.
One possibility is to see unit tests and programs as adversaries — as duals of each other — keeping each other in check. That approach appears to have some advantages.
Following the testing principle from "The Art of Software Testing": "A programmer should avoid attempting to test his or her own program", I’ll discuss it as if program writer and test writer are different people — but this doesn’t mean that it’ll work less well if the program writer is also the tester.
If the normal way of writing unit tests is an asymmetric game, in which the test writer tries to guess what the program writer might have missed while the program writer is simply doing his job, then we could conceivably aim for a more symmetric situation, with both parties writing checks of the other’s work.
In that kind of game, the program writer supplies more than one implementation, of which all but one are known to contain a bug; realistically, this would be the actual program and a set of variations on it.
For the test writer, this means supplying a set of unit tests as usual — except that now that set can be failed, deemed inadequate! The test writer gets a chance to lose, too.
Supposing this is a novel idea (although probably it isn’t), let’s name it "Contra-testing".
(Update: The idea of using intentional bug introduction as part of the testing process is indeed far from new. Both "Bebugging" and "Mutation Testing" (two techniques dating back to the 1970s) are about exactly that. More on that later.)
Also, let’s have some illustrative examples. I’ll go with a couple of classics — as far as the programs are concerned. The test cases probably won’t be.

Notation

We’ll need some notation for specifying faulty variations of the program. There are several possibilities; here I’ll go with a pure Java API: from the program writer’s point of view, it consists of a static method ContraTest.bug(String bugName) which under normal conditions returns false, but for the contra-test specified by bugName returns true.
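To make the API concrete, here is a minimal sketch of what the ContraTest helper might look like. The class and method names come from the text; the idea of selecting the active bug through a system property (here called contratest.active) is my assumption about how a harness could drive it.

```java
// Hypothetical sketch of the ContraTest helper.
// A contra-test run activates exactly one named bug, here by setting the
// (assumed) system property "contratest.active"; in normal runs no property
// is set, so every bug() call returns false and the real code runs.
final class ContraTest {
    private ContraTest() {}

    /** Returns true only when the named bug is the one currently activated. */
    static boolean bug(String bugName) {
        return bugName.equals(System.getProperty("contratest.active"));
    }
}
```

A harness would then run the whole test suite once per bug name, expecting at least one test to fail in each such run.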

Example 1 — fibonacci

Consider the well-known fibonacci function, as written by our friend Petra the Program Writer (here in Java):
public static long fibonacci(int n) {
    long a=0, b=1;
    for (int i=0; i<n; i++) {
        long tmp=a; a=b; b=tmp+b;
    }
    return a;
}
What challenges might Petra set for Terry the Test Writer? What might she suspect that he won’t think of testing?
Well, apart from border cases and so on, one thing on her mind — one concern which is present for her as she writes the code, and which Terry might not think of — is that there are different algorithms to choose from, with very different performance characteristics.
Thus, one suitable contra-test would be this:
public static long fibonacci(int n) {
    if (ContraTest.bug("slow fibonacci")) {
        return (n==0) ? 0
             : (n==1) ? 1
             : fibonacci(n-1) + fibonacci(n-2);
    }
    ...the real code of the function...
This contra-test ensures that if Terry only tests the function with very small inputs, that inadequacy will be caught. So he’ll include a test with a larger input. Which again means that if, later on, some other developer steps in thinking, "why use iteration? Surely the straightforward recursive version is better", that regression will be caught. Also, possibly, he’ll be more inclined to write tests for algorithmic complexity in the future.
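A test of the kind Terry might end up writing could look like the sketch below: correctness checks on small, easily verified values, plus a crude time budget that the exponential-time recursive variant would blow. The one-second threshold and the class name are my assumptions; the fibonacci body is copied from Petra's code above to keep the sketch self-contained.

```java
// Sketch of a test that catches the "slow fibonacci" contra-test:
// it checks small values functionally, and requires a large input to
// finish quickly, which the O(2^n) recursive variant cannot do.
class FibonacciTest {
    // Petra's iterative implementation, repeated here for self-containment.
    static long fibonacci(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long tmp = a; a = b; b = tmp + b;
        }
        return a;
    }

    static void run() {
        // Functional checks on small, easily verified values.
        if (fibonacci(0) != 0 || fibonacci(1) != 1 || fibonacci(10) != 55)
            throw new AssertionError("wrong value for small input");

        // Complexity check: fibonacci(80) must return well within a second.
        // (Assumed budget; the naive recursion would take hours or more.)
        long start = System.nanoTime();
        fibonacci(80);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMillis > 1000)
            throw new AssertionError("fibonacci(80) took " + elapsedMillis + " ms");
    }
}
```

Wall-clock budgets make for somewhat brittle tests; measuring the number of recursive calls, or testing on inputs where the slow variant would simply never finish, are alternative designs.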
The example may not be so realistic (there are fairly few uses of the fibonacci function in everyday software projects), but it illustrates the general principles well enough.

Example 2 — accounts

Another example: the classic "account" example illustrating race conditions.
Here is Petra’s code, including a contra-test:
public class Accounts {
    public boolean transfer(String fromAccountID, String toAccountID, long amount) {
        synchronized (ContraTest.bug("no locking in transfer")
                      ? new Object() : this) {
            if (withdraw(fromAccountID, amount)) {
                deposit(toAccountID, amount);
                return true;
            } else return false;
        }
    }
    ...
}
This is another example of a contra-test of a non-functional but crucial aspect of the implementation — here of whether the Accounts.transfer() method is properly synchronized.
I conjecture that while this is a quite important aspect, few unit test writers would under normal circumstances include a test for it. With contra-testing, the odds might improve.
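A test that would expose the "no locking in transfer" variant might look like this sketch: hammer transfer() concurrently from two directions and check that money is conserved. The tiny in-memory Accounts class below is my stand-in for Petra's real class, whose withdraw/deposit internals aren't shown; only the synchronized transfer() matters here.

```java
// Sketch of a race-condition test. With the contra-test active (i.e. no
// locking), lost updates would make the combined balance drift; with
// proper synchronization, the total is conserved and the test passes.
class TransferRaceTest {
    // Assumed minimal stand-in for Petra's Accounts class.
    static class Accounts {
        private final java.util.Map<String, Long> balances = new java.util.HashMap<>();
        void open(String id, long amount) { balances.put(id, amount); }
        long balance(String id) { return balances.get(id); }
        synchronized boolean transfer(String from, String to, long amount) {
            if (balances.get(from) < amount) return false;
            balances.put(from, balances.get(from) - amount);
            balances.put(to, balances.get(to) + amount);
            return true;
        }
    }

    static void run() {
        Accounts accounts = new Accounts();
        accounts.open("A", 100_000);
        accounts.open("B", 100_000);

        // Many small transfers in both directions at once.
        Runnable aToB = () -> { for (int i = 0; i < 10_000; i++) accounts.transfer("A", "B", 1); };
        Runnable bToA = () -> { for (int i = 0; i < 10_000; i++) accounts.transfer("B", "A", 1); };
        Thread t1 = new Thread(aToB), t2 = new Thread(bToA);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); }
        catch (InterruptedException e) { throw new AssertionError(e); }

        // Money must be neither created nor destroyed.
        if (accounts.balance("A") + accounts.balance("B") != 200_000)
            throw new AssertionError("money was created or destroyed");
    }
}
```

Note that such a test is probabilistic: the unsynchronized variant may survive a single run, so a contra-test harness might need to repeat it to get reliable failures.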

Example 3 — hash set

The examples so far happen to have led to non-functional tests. For a functional-test example, here’s a Set implementation Petra has written:
public class MySet {
   private Element[] hash_table;
   private static class Element {...}

   ...

   public boolean contains(Object candidate) {
       // Compute hash:
       int candHash = candidate.hashCode();

       // Determine the bucket to search:
       // (Masking the sign bit avoids a negative index; Math.abs(Integer.MIN_VALUE) is negative.)
       int bucket = (candHash & 0x7fffffff) % hash_table.length;

       // Walk the bucket's chain:
       for (Element element = hash_table[bucket];
            element != null;
            element = element.next)
       {
           if (element.hash != candHash) continue; // This wasn't it.

           if (ContraTest.bug("contains: no equals check") ||
               candidate.equals(element.value))
           {
               return true; // Candidate is present.
           }
       }

       return false; // Candidate is absent.
   }
}
Here again, Petra recognizes — with her white-box knowledge — a mistake a developer could easily make, and which the tests should consequently check for: Two key objects can have the same hash code yet be different.
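The unit test this contra-test forces Terry to write needs two distinct keys that share a hash code. Since the full MySet source is elided above, the sketch below uses java.util.HashSet as a stand-in for the set under test; the Key class and its fixed hash code are my illustration.

```java
// Sketch of the test the "contains: no equals check" contra-test demands:
// membership queries with colliding but unequal keys.
class CollisionTest {
    // A key type whose hash deliberately ignores the value,
    // so every instance collides with every other.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(this.name);
        }
    }

    static void run() {
        java.util.Set<Key> set = new java.util.HashSet<>();
        set.add(new Key("apple"));

        // Same hash code, different value: a contains() that skips the
        // equals check would wrongly answer true here.
        if (set.contains(new Key("orange")))
            throw new AssertionError("collision mistaken for membership");
        if (!set.contains(new Key("apple")))
            throw new AssertionError("real member not found");
    }
}
```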

Back from examples

I’ve illustrated the idea with two developers; in practice, the same person can of course write both tests and contra-tests. In that case, the question which the program writer asks herself is not likely to be "what won’t the test writer have thought of?" (or why indeed not?) but rather: "what have I thought of, but possibly got wrong anyway?", much like when one developer writes normal unit tests for his or her own code.
Now, why might this testing approach work better than the status quo?
One argument for contra-testing has to do with confirmation bias: it’s like the difference between asking "prove that my program works" and "this program has one small but significant bug. Find it." Given that all software contains bugs, the latter question is obviously the correct one to ask.
Furthermore, having pairs of tests with a single variation and with opposite expected results is generally good practice, as it heightens the probability that you’re testing what you think you’re testing. Testing on both sides of a boundary, and expecting different results, tends to be a very good idea.
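As a small illustration of such a pair: two tests that differ in a single variation, one on each side of a boundary, with opposite expected results. The one-method Wallet class below is a made-up stand-in purely for illustration.

```java
// A pair of tests differing in a single variation, with opposite
// expectations: both sides of the "sufficient funds" boundary.
class BoundaryPairTest {
    // Assumed minimal class under test.
    static final class Wallet {
        long balance;
        Wallet(long balance) { this.balance = balance; }
        boolean withdraw(long amount) {
            if (amount > balance) return false;
            balance -= amount;
            return true;
        }
    }

    static void run() {
        // Exactly at the boundary: must succeed.
        if (!new Wallet(100).withdraw(100))
            throw new AssertionError("withdrawal of full balance should succeed");
        // One past the boundary: must fail.
        if (new Wallet(100).withdraw(101))
            throw new AssertionError("overdraft should be refused");
    }
}
```

If either test's assertion were defunct, the other member of the pair would likely fail, which is precisely the self-checking property the text describes.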
The contra-tests also give useful input to the question, "when have we tested enough?"
What might the consequences be for which unit tests are written? The above examples give reason to believe that there might be greater weight on non-functional tests, such as tests of performance/algorithmic complexity and thread-safety. Also, the trickiest parts of the code, the special cases which were hardest to express in code, and the hardest-won insights of the program developer would be better assured of getting attention, because the program developer is most acutely aware of these areas of the problem. Which would be a very good thing, because for the same reasons, these corners and program properties would be the most likely to deteriorate later on in the program’s life.
Finally, especially if contra-testing is carried out with program writer and test writer being separate persons, I suppose there could be an effect on these developers’ testing mindset in general. This suspicion is purely based on the above examples, though.

Not all new

I haven’t yet had the chance to try out contra-testing in practice, but I find the idea interesting — perhaps you do too, now.
And as I’ve mentioned, the idea is not all new — few ideas usually are.
As for the related preexisting techniques I mentioned earlier: Mutation Testing and Bebugging are both about assessing the adequacy of a given test suite, and/or estimating the number of remaining, uncaught bugs.
Bebugging involves manually injecting a known number of bugs, seeing what proportion of those are caught by the testing process, and using that proportion to extrapolate from the number of real errors caught to the number still unfound.
Mutation testing, on the other hand, is about automatically generating program variations (each of which may or may not be faulty) and seeing in how many cases the test suite catches the difference. This translates into an adequacy score for the test suite. Thus, this method in effect works as an improved coverage measure, checking not only whether all parts of the program are exercised, but also whether the effects of each part of the program are in fact (directly or indirectly) checked by the tests.
The main drawbacks of mutation testing are that it’s a resource-intensive process, and that many of the generated variations may be semantically equivalent to the original program, which cannot be detected fully automatically.
As far as I can make out, neither of the two techniques is meant for continuous testing, or for being a permanent and integrated part of the development process. Thus, contra-testing may not be entirely old news. I guess another name for it would be "Continuous Bebugging".
Given time, I’ll see if I can get some level of JUnit or TestNG support for contra-testing going. (Yeah, like that’s going to happen…) In any case: If you decide to try this testing method, please let me know your experiences.