Unit Tests Are Inherently Incorrect

Published on May 1, 2025

A subtle, yet important distinction which needs to be manifest in the day to day engineers mind is what information a unit tests provides about the slice of the system on under test. Most offer no provably correct information, only a regression barrier for a small set of behavior.

For instance if a test looks like:

assert(add(1, 2), 3, "'add' implementation is erroneous")

What does the assertion prove? The assertion proves add(1, 2) is correct, but what about add(2, 2)? What about add(x, y) where x and y belong to the set of real numbers?

From the above, I cannot reason or reasonably prove add behaves correctly for the above inputs. For all I know, the implementation could be:

function add(x, y) {
  return 1 + 2
}

The example is contrived, as most examples are, but it raises a point: how can the gap between seemingly correct and likely correct be bridged?

The answer is property-based, generative testing.

What does property-based, generative testing look like? It looks like specifying generators, which generate values based on a random seed, and running a test against some basis/algebra on the host system to give a confidence for likely correctness:

generate_real_number = generate_real_in_range(-1_000_000, 1_000_000)
generate_real_tuple = generate_tuple(generate_real_number, generate_real_number)
generate_real_tuples = generate_lazy_sequence(generate_real_tuple)

// In a test
for (tuple of generate_real_tuples) {
  assert(add(tuple[0], tuple[1]), tuple[0] + tuple[1], "'add' implementation is erroneous")
}

The above will ensure the add implementation is likely correct, rather than seemingly correct.

Property-based, generative testing can also be applied to generating sequences of events which can be submitted against a slice of a sub-system, or an API. The result can then be differentiated between another result which is built off of system primitives (i.e. sets, hashmaps, arrays, vectors, etc...) to determine if there are edge cases where a system may exhibit unexpected behaviors:

event_sequence = generate_lazy_sequence(generate_event)

for (event of event_sequence) {
  implemented_result = implementation_api(event)
  desired_result = semantically_equivalent_test_fn(event)

  assert(implemented_result, desired_result, "implementation erroneous")
}

The above snippet is more abstract than I would like, but there is a fair amount of details which cannot be covered in a short post such as this one.