My bad opinions

2016/06/12

Property-Based Testing Basics

This is a quick introduction text regarding the general concepts behind property-based testing.

The Claims

Before diving into what property-based testing is and how it works, let's go over some wild claims we heard at the Erlang Factory last March. In the slide set[1] from Thomas Arts, it is said that they used quickcheck (the canonical property-based testing tool) to run tests on Project FIFO, an open source cloud project[2]. Their end result was:

We'll quickly look at why this is possible.

How it works

So how does that work? That's the magic of property-based testing. There are two basic modes for property-based testing: data structure generation and finite state machine model validation.

Regular tests

Data structure generation is fairly simple and has been reimplemented in many languages[3] from the initial Haskell version of quickcheck. It starts from a type or generator declaration and generates more and more complex instances of that type to pass to a function or method.

Rather than going in and explicitly listing all the relevant cases and finding the edge conditions yourself, you write a specification that says "I expect this type of data" and then a bit of code that says "here's what I expect to be true no matter what data you have". Then the test tool generates all the tests for you and finds edge conditions you wouldn't have thought of.

Let's say I have a piece of code that finds me the greatest value in a list of values. I could write tests such as:

max_test() {
    assert(312, max([2,9,18,312]));
    assert(0.1, max([-2,-9.0,-18,0,0.1]));
    ...
}

These show that I expect max() to work with floats and integers, and they seem to properly cover all the cases I have in mind. The problem is that they don't check for some of the funkier numeric cases: what about negative and positive 0? Does this handle larger integers fine? What if the list iteration is buggy on lists of more than 128 items? Listing out all these conditions by hand is costly.

So what I can do instead is find the properties of max($list). The trick is to figure out how to say what it does without restating the implementation itself. One way would be, for example, to use an existing min() function and say that the biggest number is the one that all the other ones are smaller than. Or I could take a sort() function that returns all the items in ascending order, in which case picking the last item has to give me the greatest:

max_test() {
    for each $list in <generated lists of numbers>:
         max($list) == last(sort($list));
}

What a property-based test framework does is randomly generate lists with numbers, and then run the assertion (max($list) == last(sort($list))) against each of them. If any one fails, it complains. That will find a bunch of cases the first test didn't consider: what about empty lists? What if the list has a single element but the function expected two? Maybe we can't work with empty lists, so we modify the generator to say <nonempty> in whatever language notation one has for that.
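For the curious, here's roughly what that property could look like in PropEr[4] (the Erlang framework mentioned later in this text), assuming a hypothetical mylists:max/1 as the code under test:

%% A sketch of the property in PropEr: a non-empty list generator mixing
%% integers and floats, and the assertion that max agrees with sorting
%% and taking the last element. mylists:max/1 is hypothetical.
-module(prop_mylists).
-include_lib("proper/include/proper.hrl").

prop_max() ->
    ?FORALL(L, non_empty(list(oneof([integer(), float()]))),
            mylists:max(L) =:= lists:last(lists:sort(L))).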

So what's interesting is that instead of having, say, 30 lines of test code covering all kinds of input combinations for the max() function, I have 2 effective lines of code that cover the same data. I can ask the test framework to generate longer or more complex cases, or to generate a hundred million variations of them if I want. As iterations go forward, the framework adds complexity to the types in the dataset.
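For instance, in PropEr, one way to ask for larger inputs than the default sizing gives you is to wrap the generator in resize/2; a sketch reusing the hypothetical property above:

%% Same property, but the generator behaves as if the run had already
%% reached size 100, so it produces longer lists from the start.
prop_max_large() ->
    ?FORALL(L, resize(100, non_empty(list(oneof([integer(), float()])))),
            mylists:max(L) =:= lists:last(lists:sort(L))).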

Let's say the property above fails on the input:

[120, -2321, -41234.2123, 18.32130, -1, -3412312, 120.0, -3123]

Shrinking (the framework goes back and tries to simplify the failing input) could produce, say:

[120, 120.0]

or even better:

[0, 0.0]

This means that max([0, 0.0]) != last(sort([0, 0.0])). It's not obvious why that is. Maybe the code is wrong, or maybe the test is. In this case, a likely candidate could be that sort() is stable (if two values compare equal, they keep their original relative order in the final list) whereas max() could always favor the leftmost item in the list to be the greatest. This requires me to refine my property: do I want max() to be stable, or do I not care? If I don't care, then I need to modify my property such that either value can be good (make sure 0 == 0.0 is true); if I do care, then I need to modify my implementation to always return the rightmost of equal operands in the list as greatest.
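In Erlang/PropEr terms, that refinement could amount to comparing with == (which considers 0 and 0.0 equal) rather than the strict =:= (which distinguishes integers from floats); a sketch, still using the hypothetical mylists:max/1:

%% Refined property: any value that compares equal to the true maximum
%% is acceptable, so use == instead of =:=.
prop_max_refined() ->
    ?FORALL(L, non_empty(list(oneof([integer(), float()]))),
            mylists:max(L) == lists:last(lists:sort(L))).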

That stuff would likely not have been exposed by manual testing, but property-based testing uncovers all kinds of gotchas like that in code that is seemingly simple and straightforward.

That's basic property-based testing in a nutshell.

FSM validation

The most advanced kind, which as far as I can tell only exists in (or is at least mostly used from) Erlang, and is the one discussed in Thomas Arts' presentation, works from one special insight: if we can represent a program's execution sequence as a data structure, we can run quickcheck on it.

So the big magical thing they do is build a framework where you define a model consisting of a state machine containing: an initial state, the commands that can be run against the real system (along with generators for their arguments), preconditions saying when each command may be generated, state transitions describing how the model's state changes after each command, and postconditions comparing what the real system returned with what the model expected.

That's it. Then what the framework does is take that model you have written with the pre and post conditions, and generate sequences of operations (as a data structure). And much like with regular property-based test cases, it grows the complexity of operations after each iteration. It then takes that sequence along with the model and runs it against your real code.

It then, step by step, compares the output of the model to the output of the real thing, and ensures all post conditions still hold. If one doesn't hold, or the model disagrees with the implementation, you have a test failure.

And much like with regular property-based tests, it then tries to shrink that result by finding simpler sequences of operations that still trigger the problem.
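To make that concrete, here is a minimal sketch of such a model with PropEr's proper_statem behaviour, testing a hypothetical key/value server kv (start_link/0, stop/0, put/2 and get/1 are all assumed). The model state is just a map we trust to be correct:

%% A minimal stateful property: generate command sequences, run them
%% against the real kv server, and check every result against a map model.
-module(prop_kv_statem).
-include_lib("proper/include/proper.hrl").
-behaviour(proper_statem).

-export([initial_state/0, command/1, precondition/2,
         postcondition/3, next_state/3]).

prop_kv() ->
    ?FORALL(Cmds, proper_statem:commands(?MODULE),
            begin
                {ok, _Pid} = kv:start_link(),   % hypothetical server
                {History, State, Result} = proper_statem:run_commands(?MODULE, Cmds),
                kv:stop(),                      % hypothetical
                ?WHENFAIL(io:format("History: ~p~nState: ~p~nResult: ~p~n",
                                    [History, State, Result]),
                          Result =:= ok)
            end).

%% The model starts out empty.
initial_state() -> #{}.

%% The commands the framework is allowed to interleave.
command(_State) ->
    oneof([{call, kv, put, [key(), integer()]},
           {call, kv, get, [key()]}]).

key() -> oneof([a, b, c]).

precondition(_State, _Call) -> true.

%% How the model state evolves: a put updates the map, a get changes nothing.
next_state(State, _Res, {call, kv, put, [K, V]}) -> State#{K => V};
next_state(State, _Res, {call, kv, get, [_K]}) -> State.

%% Compare the real system's answers to the model's.
postcondition(_State, {call, kv, put, [_K, _V]}, Res) -> Res =:= ok;
postcondition(State, {call, kv, get, [K]}, Res) ->
    Res =:= maps:get(K, State, undefined).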

So if we had a bug that went 'user adds an app, removes an app, re-adds the app, scales up, moves up a tier, moves down a tier, scales down, scales up, runs a new release' and the bug was in the sequence 'moves up a tier, scales down', the shrinking mechanism could possibly find that.

The big strength of this is that it's not exactly the same as fuzzing; the more complete test suites are more like model validation through generated command sequences (although non-exhaustive). The basic tests are a bit more like fuzzing, but with the requirement of the properties/invariants being maintained. The other big distinction is that since all of that exploration is guided by how types and command sequences are generated, you do have the ability to shrink problems to a simpler form.

I'm still spending a bunch of time toying with some of these frameworks (specifically PropEr[4], a GPL variant of the Quviq Quickcheck framework[5]), but the more I use them, the more powerful they feel. It forces you to develop new ways to think about your programs and how they should behave.

How Long does it take to run

Tools like Quickcheck and PropEr let you specify how many iterations you want. By default they tend to run 100 tests, which is small, but it's what I run for trivial changes where I just want a quick safety check:

$ rebar3 proper
===> Verifying dependencies...
===> Compiling shunk
===> Testing prop_shunk_tab:prop_first()
....................................................................................................
OK: Passed 100 test(s).

[... 11 more properties here ...]

===> Testing prop_shunk_tab:prop_thresholds()
....................................................................................................
OK: Passed 100 test(s).
13/13 properties passed

Each period represents a test, and they're usually printed in real time so you can see how test execution is progressing.

When I modify something trickier, I can configure things to run more tests, like 1,000 or 10,000, or pick a single property I have doubts about and run it 100,000 times if I want.
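With PropEr, one way to do that from an Erlang shell is to pass the property and a test count directly to quickcheck/2 (here reusing a property name from the rebar3 output above):

%% Run a single property 100,000 times instead of the default 100.
proper:quickcheck(prop_shunk_tab:prop_first(), 100000).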

Given it's mostly just running ever-growing tests in a loop, the speed of a test run depends mainly on two things: how long it takes to generate each data set, and how long each individual test takes to run.

Data generation is generally fast if you stick to basic types. If you have custom recursive generators with all sorts of ways to go deeply complex (I have a suite where I generate two data sets of >10,000 key/val pairs with a diff between them covering 5-10% of the entries), it's easy to write things in such a way that generating the data set eats up all the run time; you have to think of ways to use the framework's laziness macros and to balance probabilities so that generation stays fast and is guaranteed to terminate, but it's usually worth it and the cost can be brought down.
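As a rough sketch of what keeping a recursive generator in check looks like in PropEr, here's a hypothetical binary-tree generator that uses sizing, laziness, and weighted choices so it always terminates and doesn't blow up:

%% ?SIZED ties the generator to the framework's growing size parameter,
%% ?LAZY delays building subtrees until that branch is actually chosen,
%% and frequency/1 biases the choice so recursion stops early often enough.
tree() ->
    ?SIZED(Size, tree(Size)).

tree(0) ->
    {leaf, integer()};
tree(Size) ->
    frequency([
        {1, {leaf, integer()}},
        {3, ?LAZY({node, tree(Size div 2), tree(Size div 2)})}
    ]).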

There are also tricks like exploiting known properties of your systems: the vast majority of distributed system errors can be found with just 3 nodes, for example, so if I were to write a property test for one, I'm unlikely to need to run it over 50 nodes. That can let things go faster.

If each individual test is what takes long, that's usually harder to optimize, especially if you're testing asynchronous systems and must insert pauses for things to sync up. Tests can sometimes be run in parallel to cope with that, but if the procedure is particularly slow, I find it pays to look at the data I'm generating (there are functions for that in the frameworks) and make sure it covers a lot of ground through a proper statistical distribution, so that running the test 100 times may dig into as many interesting cases as a naive version run 5,000 times, for example.
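PropEr's collect/2 is one of those functions: it tags each test with a value (here the length of the generated list) and prints the distribution at the end of the run, which quickly shows whether the generator covers the ground you think it does. A sketch reusing the earlier hypothetical property:

%% Tally the length of each generated list and report the distribution.
prop_max_measured() ->
    ?FORALL(L, non_empty(list(oneof([integer(), float()]))),
            collect(length(L),
                    mylists:max(L) == lists:last(lists:sort(L)))).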

A simple example of this is that if you're writing a parser or a tokenizer, directing the framework to generate strings with characters your parser cares about may prove more useful than randomly generating strings out of the entire unicode space: you might otherwise generate strings where 99% of the cases are not exercising your code all that much. You may later just write specific property tests for unicode support.
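A sketch of that kind of biasing in PropEr, for a hypothetical tokenizer that mostly cares about identifiers, digits, and a handful of operators, with the occasional arbitrary code point thrown in:

%% Weighted character generator: mostly characters the parser handles,
%% rarely an arbitrary code point, instead of uniform random unicode.
token_char() ->
    frequency([
        {10, oneof(lists:seq($a, $z))},
        {5,  oneof(lists:seq($0, $9))},
        {3,  oneof([$(, $), $+, $-, $*, $/, $=, $\s])},
        {1,  integer(0, 16#10FFFF)}
    ]).

input_string() ->
    list(token_char()).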

In the end, I usually stick with faster runs for the common case, and once in a while I run the whole suite over large samples just to make sure nothing funny crept in over the course of changes.

What tips and tricks do you have to write suites

That's the hard part. Writing property-based tests is very difficult when what your code should do or how it should react is not properly defined. Extracting the properties and thinking hard is always good, and may require a lot of exploration; TDD with property-based testing, while difficult, seems to help figure out these properties (or their absence) before you're married to a specific implementation.

Aside from the usual "sit down, think very hard" approach, a generally good way to generate tests is to write a model. It's the central idea behind the state-machine generation, but it's equally valid for data structures and simpler tests. Just write an implementation of what your code should do, except you want to make it so simple it cannot be wrong. We do not care about efficiency. Make it slow. Not a problem. We worry about correctness.

Take that implementation and pit it against your actual production code. If they both behave the same, then you know that your production code behaves the way you expect it to, regardless of all its complexities and gotchas. The trickier your real implementation, the more interesting this approach becomes. This is also why a lot of the property-based testing tutorials out there look so daunting: the test cases often look more complex than the code being tested, because tutorials try to keep code samples simple, while property-based testing is most valuable when the code is complex.
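As a sketch, here's what pitting a hypothetical optimized set module (fancy_sets, with assumed new/0, add/2 and to_list/1 functions) against a dead-simple model looks like; the model is just a sorted, deduplicated list:

%% The model is lists:usort/1: obviously correct, nobody cares how slow.
%% to_list/1 is assumed to return sorted unique elements.
prop_matches_list_model() ->
    ?FORALL(Elements, list(integer()),
            begin
                Set = lists:foldl(fun fancy_sets:add/2,
                                  fancy_sets:new(), Elements),
                fancy_sets:to_list(Set) =:= lists:usort(Elements)
            end).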

In all cases, property-based testing need not be your only tool. If you find specific regressions that have sneaked past your model or suite, it can always help to add literal regression tests in a suite on the side. It's another reminder that volume never trumps quality, and that the quality of your generated tests may not be as great as you thought. I'd still recommend trying to tweak the property-based suite to generate that error, since it may help you discover a lot more bugs that were hiding in your code.

Additional reading material

References: