Goldilocks tests

I was speaking to my friend Allan Kelly at the Agile Cambridge conference and he mentioned he's reading a book about the maths of production. It contains a proof: rare failures that take a long time to fix are much, much worse than frequent failures that get fixed fast. This result perhaps runs counter to the way many people think. It may mean you'd be better off with more tests failing.
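One toy model in which a result like this holds is sketched below. It is an assumption for illustration only, not the proof from the book: work arrives at a steady rate, piles up while something is broken, and drains once the fix is in. The function name `total_delay` and the rates are hypothetical.

```python
def total_delay(outage_hours, arrival_rate=1.0, service_rate=2.0):
    """Total waiting (item-hours) caused by one outage in a simple fluid model.

    Assumptions: work arrives at arrival_rate items/hour, nothing is
    processed during the outage, and afterwards the backlog drains at
    (service_rate - arrival_rate) items/hour.
    """
    peak_backlog = arrival_rate * outage_hours
    drain_hours = peak_backlog / (service_rate - arrival_rate)
    # The backlog over time is a triangle: half of height times base.
    return 0.5 * peak_backlog * (outage_hours + drain_hours)


# Same total downtime (10 hours), split two different ways.
# Assumes the short outages are far enough apart for the backlog to drain.
rare_slow = 1 * total_delay(10.0)      # one 10-hour outage
frequent_fast = 10 * total_delay(1.0)  # ten 1-hour outages

print(rare_slow, frequent_fast)
```

In this model the delay grows with the square of the time to fix, so one long outage costs ten times as much waiting as ten short ones with the same total downtime.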

It's the Goldilocks effect.
Is the porridge too hot or too cold or just right?
Are the beds too hard or too soft or just right?

Do you want tests always failing?
Do you want tests never failing?
Or do you want tests sometimes failing?

You want enough to give you some confidence that your tests are testing areas where defects exist. Because exist they surely do.
You want enough to keep the code current in the developers' consciousness. So they grok it. So they can fix fast.
You want enough to keep the developers' defect fixing skills sharp. So they can fix fast.


  1. Anonymous 11:43 am

    Interesting. Does Allan give any insight into whether the rare-but-expensive bugs are of comparable complexity to the frequent-but-inexpensive ones? Here's how I read your comments, and how I would interpret Allan's proof:

    [Scenario] Company A and Company B both have the same defect in their software. Company A finds the defect quickly and implements a low-cost fix. Company B takes longer to find the bug, and it costs them more to fix it.

    [What I think you're saying] Having only just worked on that bit of the code, Company A has kept their defect fixing skills sharp, thus allowing them to fix it quickly/cheaply.

    For Company B, the time between the defect being added and the defect being found means that their defect fixing skills are blunt, and fixing the defect costs more.

    Crucially, the bug is the same in both companies, and if their skills were as sharp, Company B could've fixed the bug just as quickly/cheaply as Company A, if only they'd known how.

    [My alternate reality] Because Company A finds the bug while it's still fresh, it hasn't had time to become ingrained into their system. No other parts depend on it, the code isn't being called from elsewhere (or, dare I say it, copied-and-pasted to elsewhere). Yes, having fresh knowledge of the codebase helps, but catching the bug before the bit rot has had a chance to twist it into something else is also an important factor.

  2. Anonymous 11:44 am

    Also, I'm not entirely sure how you get from Allan's proof to the Goldilocks effect. To say that you should aim for tests that always fail, or never fail, or sometimes fail, seems to be driving the design of your testing from completely the wrong angle[1]. To say that we want not too many test failures, and not too few, but “just right” implies that we somehow work towards some level of test failure, rather than functionality.

    Perhaps it’s not the number of failures, but the mode of failure, that sits within the Goldilocks sweet spot. Each test should be not too sensitive to change (where a brittle test fails as part of our standard development), not too open to change (where a test continues to pass even though defects have been introduced), but “just right”, where every defect triggers a [targeted, informative, deterministic] test failure but every valid change results in the build staying green*.

    *or should that be Gold?
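The three kinds of test described above can be sketched as follows. Everything here is hypothetical and chosen only for illustration: a function that renders an order total, one valid change to it, one defect, and three tests of differing sensitivity.

```python
def original(net):
    return f"Total: {net * 1.2:.2f}"


def reworded(net):
    # Valid change (by assumption): same amount, different label.
    return f"Amount due: {net * 1.2:.2f}"


def defective(net):
    # Defect: wrong tax rate.
    return f"Total: {net * 1.02:.2f}"


def too_sensitive(render):
    # Pins the exact wording, so it breaks on the valid change too.
    return render(100) == "Total: 120.00"


def too_open(render):
    # Only checks the shape, so it keeps passing with the defect.
    return ":" in render(100)


def just_right(render):
    # Checks the behaviour that matters: the amount.
    return float(render(100).split(": ")[1]) == 120.0


versions = (original, reworded, defective)
results = {
    test.__name__: [test(version) for version in versions]
    for test in (too_sensitive, too_open, just_right)
}
for name, outcome in results.items():
    print(name, outcome)
```

Only `just_right` stays green for the valid change and goes red for the defect.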


    [1] Footnote: The angle you want your testing to be coming from (am fearful that I’m wading into egg-sucking territory here. Sorry, this is aimed more at me than you).

    You want good tests. You want tests that fail in an informative and deterministic way. You want to know that when @Test should_do_A fails, it's because there's some mismatch between the behaviour documented in the test, and what A is actually doing. You don't want should_do_A failing because of a change in B. In a sense, good tests are defect fixing skills that are permanently sharp. They always point at the piece of code where investigation should be focused. They mean that, no matter when a defect is discovered, our attention is drawn to the right area of functionality.

    That’s at the individual test level. From there, you need to structure your test suite(s) to give you a similar focus - should_do_A and should_do_B both run before should_use_A_and_B_together is run. That way, you know whether it’s a failure on the part of the individual component (e.g. part B fails == PASS, FAIL, FAIL) or a failure in their combination (PASS, PASS, FAIL).
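The reading of those result patterns can be sketched as a small lookup. The function `diagnose` is a hypothetical helper, not part of any test framework; it takes the three results in the order should_do_A, should_do_B, should_use_A_and_B_together, with True meaning PASS.

```python
def diagnose(should_do_a, should_do_b, should_use_a_and_b_together):
    """Say where to look first, given three pass/fail results (True = PASS)."""
    failed_parts = [
        name
        for name, passed in (("A", should_do_a), ("B", should_do_b))
        if not passed
    ]
    if failed_parts:
        # A component test failed, so the combined result tells us little.
        return "component " + " and ".join(failed_parts)
    if not should_use_a_and_b_together:
        return "combination of A and B"
    return "nothing to fix"


print(diagnose(True, False, False))  # PASS, FAIL, FAIL
print(diagnose(True, True, False))   # PASS, PASS, FAIL
```

This only works if the component tests are independent of each other, which is the point made in the footnote about should_do_A not failing because of a change in B.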

  3. I don't have any more info from Allan at the moment, no.

    I'm not sure that working to a level of test failure and working towards functionality are necessarily exclusive. Can't I aim for both?

    And lest I've given the wrong impression, I'm not suggesting using the number of test failures as a primary driver of design.

    But I am suggesting that, perhaps counter-intuitively, your overall effectiveness in creating a whole software system, over time, might increase if you have more test failures.

  4. Anonymous 10:02 am

    Hi Jon, Sorry, I shouldn't be allowed to post on Mondays, particularly Monday mornings...

    I guess what I'm saying is that one shouldn't favour sporadic, non-deterministic or buggy tests in order to increase your failure rate.

    "This system is good because the tests fail often" isn't something I'd agree with.

    I'd prefer "This system is good because the tests fail early whenever a defect is introduced, and those tests are run often".

    It would be nice to differentiate between the system where test failures are rare because test coverage is high, test quality is good, and the devs maintain these levels through always checking in tested, working code, CI, and so on... and the system where test failures are rare because test coverage is poor, tests are weak, and tests are rarely run, thereby allowing defects to live and grow within the system for a significant amount of time.

  5. Hi Dan,
    in case I gave the wrong impression let me try and be clearer - I am not saying you should necessarily aim for more failing tests. And neither am I saying you should necessarily aim for fewer failing tests. I do not know anything about the state of the reader's tests and how often they pass or fail, so can I offer specific advice? I can't. However I can offer some general advice. And the general advice is very general indeed. It is simply the observation that, perhaps counter-intuitively, some software systems would benefit if more of their tests failed more often.