Monday, May 16, 2011

Testing distributed-store algorithms

This is a follow-up to my post on a data structure for storing collections in Riak.
While I have been planning this follow-up, Kresten Krab Thorup has actually gone ahead and implemented the algorithm (on GitHub; see src/riak_column.erl) — which suits me nicely, as I haven't written a single line of implementation and am not really past the thinking stage yet :-)
To wit:

Shortly after writing the post, I discovered a couple of issues or finer points, both of which have to do with the fact that different rows in Riak live independent lives — they are independently versioned, and no cross-row guarantees are given. Specifically, nothing can be concluded from the read order: if client A updates row X and then row Y, another client B may see the Y update and only later see the old version of row X.

(If Riak is set up with appropriate consistency settings, such that a majority of the replicas of an object are written synchronously, then you do indirectly have some sort of guarantee. Below a certain threshold of machine and/or network problems, that is.
But even so, nothing can be concluded from the read order if there are several clusters set up with cluster-to-cluster replication: changes to different keys in one cluster arrive at the other clusters in arbitrary order.)

This leads to at least the following issues:
  1. Firstly, when splitting a row into two, I included this step: "Write an empty row back under the original auxiliary row key".
    In fact, doing that would be wrong. Even writing some tombstone dummy value in the old row would be a bad idea; instead, one should simply mark the row as obsolete while keeping the old data (see the sketch after this list). This is necessary because a client accessing the collection later on may see from the main row that the old row has been split, yet not see the new rows. Or it may see the old row, but not the change in the main row. In the former case, it is necessary that at least the values in the old row be available (or the collection would suddenly have shrunk); in the latter case, it would not otherwise be apparent that the values in the old row are obsolete.

  2. Secondly, after modifying an auxiliary row, the main row should be updated — even if nothing in it has changed. This may sound silly, but it is the easiest way to ensure that, in the event of a concurrent row split, the auxiliary row in question is taken into account at a subsequent merge (indeed, that the merge is triggered at all).

  3. And even that is not enough if there is more than one cluster: at the time of the read repair of the main row, the update for the auxiliary row may not have arrived, and it may therefore be lost silently. It appears that some kind of versioning of the auxiliary rows is necessary; with such versioning, we can tell when repairing the main row that we're missing an update on the pre-split auxiliary row, and that the reference to it should therefore be kept in the main row, so that it can be repaired later when all information is available.
Ah, the perils and challenges of incomplete information.
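
To make issue 1 concrete, here is a minimal sketch of such a split, keeping the old data and marking the row obsolete. All names are hypothetical (this is not the interface of riak_column.erl), and the store is abstracted as a Put fun:

    -module(split_sketch).
    -export([split_aux_row/3]).

    -record(aux_row, {status = live, entries = []}).

    %% Split an auxiliary row in two. Put is a fun(Key, Row) -> ok that
    %% writes to the store; make_ref/0 stands in for real key generation.
    split_aux_row(OldKey, #aux_row{entries = Entries}, Put) ->
        {Left, Right} = lists:split(length(Entries) div 2, Entries),
        LeftKey  = make_ref(),
        RightKey = make_ref(),
        ok = Put(LeftKey,  #aux_row{entries = Left}),
        ok = Put(RightKey, #aux_row{entries = Right}),
        %% Crucially, do NOT write an empty row or a tombstone under OldKey:
        %% a reader that sees this row but not the main-row update must still
        %% find the data, and a reader that sees only the main-row update
        %% must be able to tell that these entries are superseded.
        ok = Put(OldKey, #aux_row{status = {obsolete, [LeftKey, RightKey]},
                                  entries = Entries}),
        {LeftKey, RightKey}.

A reader that encounters a row with status {obsolete, ...} can then fall back on its entries until the replacement rows become visible.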

Managing concurrent subtlety
As the saying goes,
Meddle not in the affairs of concurrent systems for their ways are subtle, and are quick to anger.
-- Tolkie(error: timeout)n
In a domain as subtle as this, with gotchas in the style of the issues mentioned above, how can we ever convince ourselves that a scheme like the one for Riak collections is implemented robustly and correctly?

As always, there are two ways: Formal verification — proving that the program is correct; and thorough testing — gaining confidence by exercising the program.
Both have their advantages and disadvantages.

If I were Dijkstra, I'd develop a formal proof (as he did in quite a few of his blog postings), and probably modify the algorithm appropriately along the way. I, however, am no Dijkstra, and do not believe I have the time to develop a formal proof — nor do I have any delusion of being able to write anything worth reading in the process. Luckily, going the other way, through testing, has something going for it as well:
  • It is not always clear which guarantees the components we build on actually provide. That means that, regarding formal proof, the set of axioms may be uncertain.
  • The same test can be applied to different implementations of a component.
  • The same test can be applied to different versions of a component — i.e. if we modify the code, we can cheaply gain some confidence in the modified version.
For both formal verification and testing, the answers you get depend on the questions you ask.
When proving a property, you prove only that property; when you test a concrete call sequence with concrete values, you test with only those values.
Which is of course a good reason to get familiar with property-based testing — which lies somewhere in between the two, in that it tests a property on a number of concrete call sequences — typically a few hundred, and typically different instances each time the test is run.

This can be a nice compromise between formal verification and hand-rolled test cases — provided of course that the properties and the instance generators are chosen well.
As always with testing, paranoia and imagination are key. But you get more value for your paranoia when it's used to power random test case generation.

For the problem in question, and assuming an Erlang implementation like KKT's, property testing can be done with relative ease using a tool like Quviq QuickCheck or Proper. Their support for testing state machines comes in handy; I've not tried using it before, but this seems a good occasion.
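
To give an idea of the shape of such a test, here is a bare-bones sketch of a Proper state machine for the collection. The proper_statem callbacks and the generators are PropEr's own; the coll module (the SUS) is an assumption, including that coll:lookup/1 returns {ok, Value} or error, like orddict:find/2 does:

    -module(coll_statem).
    -include_lib("proper/include/proper.hrl").
    -behaviour(proper_statem).
    -export([initial_state/0, command/1, precondition/2,
             postcondition/3, next_state/3, prop_collection/0]).

    initial_state() -> orddict:new().              %% the model collection

    command(_ModelState) ->
        oneof([{call, coll, insert, [key(), value()]},
               {call, coll, delete, [key()]},
               {call, coll, lookup, [key()]}]).

    precondition(_S, _Call) -> true.

    next_state(S, _Res, {call, coll, insert, [K, V]}) -> orddict:store(K, V, S);
    next_state(S, _Res, {call, coll, delete, [K]})    -> orddict:erase(K, S);
    next_state(S, _Res, _Call)                        -> S.

    %% The invariant: every result must agree with the model collection.
    postcondition(S, {call, coll, lookup, [K]}, Res) ->
        Res =:= orddict:find(K, S);
    postcondition(_S, _Call, _Res) -> true.

    key()   -> elements([a, b, c, d]).             %% small pool, so keys collide
    value() -> integer().

    prop_collection() ->
        ?FORALL(Cmds, commands(?MODULE),
                begin
                    {_History, _FinalState, Result} = run_commands(?MODULE, Cmds),
                    Result =:= ok
                end).

Running proper:quickcheck(coll_statem:prop_collection()) then generates, runs and, on failure, shrinks command sequences automatically.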

How to test randomly
Randomized testing involves abstracting over usage scenarios.
How the abstraction is done determines what will be tested.
Knowledge of the problem domain is the primary guide; paranoia, together perhaps with knowledge of implementation details, should provide additional input.
What must be considered is:
  • What is tested:
    Which API functions to exercise?
    Which invariants to verify?
  • How it is tested:
    Test concurrent use to check for thread safety?
    Test with invalid inputs?
    Should certain special circumstances be simulated, e.g. file I/O errors, disk-full conditions, network delays, packet loss?
  • With what it is tested:
    Test with abnormal (very long/short/high/low) inputs?
    Key collisions?
    Many values for the same key?
    Keys which are nearly identical?
This is of course where the imagination and paranoia enter the picture.
You can't assume that "since we generate the values randomly over a large domain, we exercise all code paths" — exercising e.g. a dictionary with a million randomly generated keys is of little use if the keys never collide.
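
In PropEr terms, one way to avoid that trap is to skew the key generator instead of sampling uniformly; for instance, the naive key/0 of the skeleton above could be replaced with something like this (the weights are arbitrary):

    key() ->
        frequency([{4, elements([<<"k1">>, <<"k2">>, <<"k3">>])},  %% forced collisions
                   {2, ?LET(K, elements([<<"k1">>, <<"k2">>]),     %% nearly identical keys
                            <<K/binary, "x">>)},
                   {1, binary()}]).                                %% occasional fresh key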

How to test distributed collections
OK then, how to test an implementation of the distributed collections scheme?
Using the state machine testing support of Proper, we can maintain a "model" collection alongside the subject-under-scrutiny collection (shortened to SUS in the following).

What to test:
This is the easiest part: We will test the usual collection functions: insert, delete, lookup, list keys.
The invariant is that the result is consistent with the same operation performed on the model collection — as defined below.

How to test:
We'll certainly need to test concurrent use. Let's say that there are three simultaneous users of the collection; we know that there are subtleties involving two, and there may be additional ones involving at least three.
The users won't be really simultaneous, though — we will test concurrent use in a way where the concurrency is made explicit. This provides repeatability and insight into why things fail, which is indispensable for this problem domain.
So, instead of having an actual underlying Riak store, we'll mock it up, in such a way that the mock provides just the guarantees we actually expect the real thing to provide.


Model representation
The structure I have in mind is: the mock remembers all versions of all values put into it. This is what the entire simulated multi-cluster store — let's call it the "cloud" — has ever seen.
The mock furthermore has a number of "views" of the store, corresponding to what you would see if you accessed the cloud at different points. We'll make the assumption that in any of these views, for any given key, the version will only increase with time; i.e., we do not expect the version of any value in any view to evolve backwards. The mock keeps track of the view-to-version mapping for all keys.

One thing that can happen, then, besides operations on collections, is that value versions find their way from one view to another. Also, version merging can happen in transit — the versions being merged are not necessarily the latest value in any view, but may be any versions that have ever existed in the cloud.
For simplicity, let's say that this is modelled with views also — if we allow enough views, this won't reduce the generality.
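
As a sketch, such a mock cloud might look as follows. All names are mine; versions are simplified to per-key integers (the real store uses vector clocks), and propagation here moves only the version a view currently shows, so modelling in-transit merges of arbitrary historical versions would indeed require the extra views just mentioned:

    -module(mock_cloud).
    -export([new/1, put/4, get/3, propagate/4]).

    %% State: the full history of every key, plus per-view maps saying
    %% which version of each key that view currently shows.
    new(NViews) ->
        {orddict:new(),                                      %% Key -> [{Vsn, Value}]
         [{V, orddict:new()} || V <- lists:seq(1, NViews)]}. %% {ViewId, Key -> Vsn}

    %% Write through a view: record the version globally, advance the view.
    put(View, Key, Value, {History, Views}) ->
        Vsns = case orddict:find(Key, History) of
                   {ok, L} -> L;
                   error   -> []
               end,
        Vsn = length(Vsns) + 1,
        {orddict:store(Key, [{Vsn, Value} | Vsns], History),
         update_view(View, Key, Vsn, Views)}.

    %% Read through a view: you see only what that view currently shows.
    get(View, Key, {History, Views}) ->
        {View, VMap} = lists:keyfind(View, 1, Views),
        case orddict:find(Key, VMap) of
            error -> not_found;
            {ok, Vsn} ->
                {ok, Vsns} = orddict:find(Key, History),
                {Vsn, Value} = lists:keyfind(Vsn, 1, Vsns),
                {ok, Value}
        end.

    %% Internal event: the version visible in one view travels to another.
    propagate(From, To, Key, {History, Views} = Cloud) ->
        {From, FromMap} = lists:keyfind(From, 1, Views),
        case orddict:find(Key, FromMap) of
            {ok, Vsn} -> {History, update_view(To, Key, Vsn, Views)};
            error     -> Cloud
        end.

    update_view(View, Key, Vsn, Views) ->
        {View, VMap} = lists:keyfind(View, 1, Views),
        NewVsn = case orddict:find(Key, VMap) of
                     {ok, Old} -> max(Old, Vsn);   %% versions never go backwards
                     error     -> Vsn
                 end,
        lists:keystore(View, 1, Views, {View, orddict:store(Key, NewVsn, VMap)}).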

The model thus consists, not just of a simple collection, but of an abstract model of the storage cloud.
Besides the externally visible events — the calls to the collection API — there are internal events: a given version of a given key finds its way from one view to another. Both kinds of events go into the randomly generated test scenario specification.
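
In the state-machine test this just means that the command generator mixes the two kinds of events. For instance, command/1 of the earlier skeleton could become something like the following, assuming the API calls now go through a view, and that the mock runs as a process holding the cloud state (so this propagate/3 is a stateful wrapper around the pure propagate/4 sketched above):

    command(_S) ->
        oneof([{call, coll, insert, [view_id(), key(), value()]},
               {call, coll, delete, [view_id(), key()]},
               {call, coll, lookup, [view_id(), key()]},
               %% internal event: a key's visible version travels between views
               {call, mock_cloud, propagate, [view_id(), view_id(), key()]}]).

    view_id() -> choose(1, 3).   %% three simultaneous users, as above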

The test
It was originally my intention to end by putting all of this together in a Proper-based Erlang unit test — it would then be a good occasion for me to get experience with the state-machine modelling support (the state-machine part of the documentation is from p. 26).
However, it appears that I have a latency-vs.-completeness tradeoff to make, so I'd better stop here, publish what I have, and hope to return to the subject soon, hopefully with some concrete code. (While this posting has been under way, naturally other topics have come up which I'd like to write about; the order in which further postings arrive here is undetermined...)
