For many years, the Test Pyramid was (still is?) the de facto model for formulating the testing strategy for your software system. But it’s not perfect, and if you are focused on the wrong aspects of it, it could be doing you more harm than good.
Evaluating a Testing Strategy
Time is a limited resource. You don’t have time to test every behavior of your system. A good testing strategy strikes a balance between the time it takes to write and maintain your tests, and the amount of confidence they give you that your system is working.
In an ideal world, your automated test suite covers every behavior your system can exhibit. All behaviors are defined, and they are automatically validated whenever you make code changes. When an end user uses your system, you know exactly what is going to happen, because you’ve already defined everything that can happen, and you’ve already validated that it works how it is supposed to.
Unfortunately, reality is far from ideal. It takes time to define behaviors, and we often simply don’t have the time. There are tremendous uncertainties about system behaviors as well, since we are often discovering what a system should do as we go. The system builders are often not the system users, and even in the ‘best case’ where the system builders are domain experts and potential users – they can only represent some users, not all. We may be waiting for requirements from someone else, and just need to guess at some things. All this put together means we have to choose which system behaviors are worth writing automated tests for, and which are not.
So how do we make that decision? As with most things, we want the most bang for our buck – maximize the benefit-to-cost ratio. The goal of an automated test suite is to give you the confidence that your system is going to behave how you expect it to. That’s the benefit – confidence. The cost is time. The time it takes to write the tests, run the tests, and maintain the tests. Maximize confidence, and minimize time.
What does the Test Pyramid do well?
The Test Pyramid does a good job of cultivating the benefit-to-cost mindset. Unit tests are typically easy to write, have very few false failures, give you very high confidence about the code under test, and run very fast. You can consistently run tens of thousands of unit tests in a continuous integration pipeline with no problem. They offer a very attractive benefit-to-cost ratio. And that’s what we want a lot of – high benefit-to-cost ratio.
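For a sense of scale, here is what the cheap end of that spectrum looks like – a sketch in Python, where slugify is an invented stand-in for any small piece of pure logic:

```python
# slugify is an invented stand-in for any small pure function under test.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

def test_slugify_lowercases_and_hyphenates():
    assert slugify("Test Pyramid Revisited") == "test-pyramid-revisited"
```

A test like this takes a minute to write and microseconds to run, and it fails for exactly one reason – the logic is wrong.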
At the top of the pyramid, E2E tests are the opposite. They take longer to write, they are more complicated to set up, they take longer to run, and they are often ‘flaky’ – they fail because of timing issues or the complexity of the test environment rather than because of incorrect system behavior. When done well, however, they can give you very high confidence that your system as a whole does what it is supposed to do. They can be high benefit, but they can be even higher cost, and often the benefit-to-cost ratio is questionable.
Where does it fall short?
The motivation for the test pyramid shape has a lot to do with the costs of writing, running, and maintaining the various kinds of tests. But not every system is the same. Different systems have different cost profiles for different kinds of tests. E2E tests for a console application that uses a SQLite database have much lower costs than E2E tests for a web application with multiple services backed by 3rd party APIs and a clustered database. For a console application like that, you might be able to write nothing but E2E tests and be just fine.
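To make that concrete, here is a sketch of an E2E test for such a console application in Python – the todo binary, its flags, and the schema are all invented for illustration:

```python
import os
import sqlite3
import subprocess
import tempfile

def test_add_todo_end_to_end():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "todos.db")
        # Drive the real binary with real arguments and a real database file.
        result = subprocess.run(
            ["todo", "--db", db_path, "add", "buy milk"],
            capture_output=True,
            text=True,
        )
        assert result.returncode == 0
        # Verify the observable outcome: the row landed in the database.
        rows = sqlite3.connect(db_path).execute(
            "SELECT title FROM todos"
        ).fetchall()
        assert rows == [("buy milk",)]
```

No mocks, no test doubles – the whole system is exercised, and the test still runs in milliseconds.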
Costs also change over time as technology evolves. With the advent of containerization, it is common practice to run a database in a container, which can reduce setup and maintenance costs and increase test isolation, resulting in fewer flaky tests and better parallelization.
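For example, with the testcontainers-python package (one option among several – the table and query here are invented), a test can get a disposable database all to itself:

```python
import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_orders_survive_a_round_trip():
    # Start a throwaway Postgres container; it is torn down when the block exits.
    with PostgresContainer("postgres:16") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text("CREATE TABLE orders (id int)"))
            conn.execute(sqlalchemy.text("INSERT INTO orders VALUES (1)"))
            count = conn.execute(
                sqlalchemy.text("SELECT count(*) FROM orders")
            ).scalar_one()
        assert count == 1
```

Each test (or test session) can own its container, which is what buys the isolation and the parallelization.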
Another tricky part of the test pyramid is the definitions of the different kinds of tests. What exactly is a ‘unit’ test? Martin Fowler has a good article that highlights some of the grey areas with the classifications. Is a ‘unit’ a class? A function? A group of classes that work together in some way? If you don’t draw good lines, you can actually end up spending a lot more time maintaining unit tests, which hurts the benefit-to-cost ratio that makes unit tests attractive in the first place.
Here’s a common scenario. You take the dogmatic approach that each class is a unit, and gets a suite of tests. Always. Class-to-unit-test-suite is 1-to-1. You write a class. You write unit tests for it. Then you decide to refactor the class into three separate classes. The behavior of the system hasn’t changed, but because you have reorganized, and because of your approach to testing, you need to change your tests and/or add new tests so that each of your new classes has a test suite. That kind of refactoring can happen all the time, and the test suite maintenance costs can add up. Over time, it also becomes unclear whether certain fine-grained tests are even doing anything useful – whether they tie back to the behaviors of the system that provide business value to users.
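One way out of that churn is to pin tests to stable public behavior rather than to each class. A sketch, with an invented PriceQuoter and discount rule:

```python
from dataclasses import dataclass

@dataclass
class Quote:
    total: float

class PriceQuoter:
    """One class today; maybe a Catalog, a DiscountPolicy, and a Quoter tomorrow."""

    def __init__(self, catalog: dict[str, float]):
        self.catalog = catalog

    def quote(self, item: str, quantity: int) -> Quote:
        price = self.catalog[item] * quantity
        if quantity >= 100:  # invented bulk-discount rule
            price *= 0.9
        return Quote(total=price)

def test_bulk_orders_get_a_discount():
    quoter = PriceQuoter(catalog={"widget": 10.00})
    # Pinned to behavior, this assertion survives any refactor that preserves it.
    assert quoter.quote("widget", quantity=100).total == 900.00
```

Split PriceQuoter into three classes tomorrow and this test doesn’t change – it only fails if the discount behavior actually breaks.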
There is also a false sense of security you can get when tests are written according to the code organization of the language (like a class), instead of being written according to the business organization of the code. Classes in isolation can be proven to ‘work’. But the system is a collaboration of different pieces serving a business goal. Without that context, a single class may be tested with the wrong arguments and the wrong scenarios, mocks may be incorrectly configured, and you end up with tests that don’t really verify how your system works in reality.
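Here is a sketch of that trap using Python’s unittest.mock, with invented names – the mock encodes our assumption about the collaborator, so the test stays green even when the assumption is wrong:

```python
from unittest.mock import Mock

def apply_discount(pricing_service, base_price, order_id):
    # Assumes get_discount returns a fraction like 0.1 ...
    return base_price * (1 - pricing_service.get_discount(order_id))

def test_apply_discount_with_a_convenient_mock():
    pricing = Mock()
    pricing.get_discount.return_value = 0.1
    # ... but if the real service returns a percentage like 10, this
    # passing test proves nothing about how the system behaves in reality.
    assert apply_discount(pricing, base_price=100, order_id=42) == 90
```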
Alternative Schools of Thought
One alternative model is the Testing Trophy. The basic idea here is that you put more emphasis on tests that exercise bigger chunks of your system, so that you have more confidence that your system behaves as it should. If the costs of these tests aren’t much different from the costs of traditional ‘unit’ tests, this makes perfect sense. As we move up the testing pyramid, tests give more confidence about the correctness of our system. If the costs don’t increase much as we move up, then we should be shifting tests upwards, because we will get a better benefit-to-cost ratio for our whole test suite.
Aslak Hellesøy, the creator of the Cucumber testing framework, has an interesting talk about a completely different way to structure your testing strategy. The key insight here is that the higher cost of tests higher up the test pyramid is due to the technologies in play, not the business behaviors of our system. Out-of-process calls, network calls, database and file system calls – these all inherently take longer to run and introduce more complexity and points of failure that cause flakiness, which together means higher costs. The approach he takes is to maintain a single suite of tests which validates the system behaviors that are important to the business, but with different configurations that swap out the technology components. You can run the same suite of tests very quickly by using all in-memory adapters for components like databases. You can run the same suite of tests more slowly against the real database, and get maximum confidence. This flavor of testing can work well with architectures like Clean and Hexagonal, since those architectures make it natural to swap out different adapters for different technology components of the system.
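Here is a rough sketch of that idea with a pytest fixture – this is not Aslak’s actual code, and all the names are invented. The whole suite runs against the in-memory adapter by default, and against the real adapters when you opt in:

```python
import os
import pytest

class InMemoryOrderRepo:
    """The 'fast' adapter: same interface as the real one, no I/O."""

    def __init__(self):
        self._orders = {}

    def save(self, order_id, total):
        self._orders[order_id] = total

    def total_for(self, order_id):
        return self._orders[order_id]

ADAPTERS = [InMemoryOrderRepo]
if os.environ.get("FULL_CONFIDENCE"):
    # Hypothetical adapter with the same interface, backed by a real database.
    from myapp.adapters import SqlOrderRepo
    ADAPTERS.append(SqlOrderRepo)

@pytest.fixture(params=ADAPTERS, ids=lambda a: a.__name__)
def order_repo(request):
    return request.param()

def test_saved_orders_can_be_read_back(order_repo):
    # The business behavior under test is identical in every configuration.
    order_repo.save("A-1", total=250)
    assert order_repo.total_for("A-1") == 250
```

Every behavior test is written once and runs in both the ‘fast’ and the ‘maximum confidence’ configurations.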
The Functional Core, Imperative Shell design offers another way of thinking about testing that ties into your system architecture. Instead of thinking of your system in terms of ‘units’, you think more in terms of ‘inside’ versus ‘outside’. Your system is separated into…
1. The core logic of your program that decides ‘what’ to do in response to various inputs.
2. The glue code of your program that connects your program to the outside environment – the file system, a database, 3rd party APIs, and so forth.
Part (1) is analogous to the ‘unit’ portion of your testing. With this design, you would have a rich suite of tests on (1). These run really fast, and give you very high confidence that the business logic of your program is correct. You would have minimal automated tests against (2) (perhaps none!), because that layer of the code is very thin and is mostly just a question of whether someone else’s library works like it is supposed to.
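A minimal sketch of the split, with an invented invoicing domain:

```python
import datetime

def overdue_invoices(invoices, today):
    """Core: pure decision logic – decides 'what' is overdue, no I/O."""
    return [inv for inv in invoices if inv["due"] < today]

def print_overdue_report(db_connection, today):
    """Shell: thin glue wiring the core to a database and stdout."""
    rows = db_connection.execute("SELECT id, due FROM invoices").fetchall()
    invoices = [{"id": r[0], "due": r[1]} for r in rows]
    for inv in overdue_invoices(invoices, today):
        print(f"Invoice {inv['id']} is overdue")

def test_core_flags_only_past_due_invoices():
    today = datetime.date(2024, 6, 1)
    invoices = [
        {"id": 1, "due": datetime.date(2024, 5, 1)},
        {"id": 2, "due": datetime.date(2024, 7, 1)},
    ]
    assert [i["id"] for i in overdue_invoices(invoices, today)] == [1]
```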
Tie Your Testing Strategy to Your Architecture
The goal of creating an automated test suite is to easily verify that your system behaves like you expect it to. The more you can align your test organization with your system’s architecture, the more confidence you are going to get.
Going back to the functional core, imperative shell architecture – your code is intentionally organized to separate the glue from your important application logic, and make the imperative shell as thin as possible. The architecture emphasizes the functional core, so your test suite should as well. The bulk of the interesting code is there, so by focusing your testing there, you are maximizing confidence in the bulk of your system.
In the Hexagonal family of architectures, dependencies like databases and 3rd party APIs are intentionally abstracted away behind adapters to make it easy to swap those implementations without changing your system’s behavior. An in-memory database adapter is supposed to behave the same as a MySQL database adapter, as far as the business behavior of your system is concerned. Since your system is supposed to be indifferent to those implementation details, your test suite should be as well. You can write tests to validate the business behaviors of your system, and swap out adapters in different test configurations to run in ‘fast’ mode or ‘production’ mode.
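One common way to keep those adapters honest is a shared contract test that every implementation of a port must pass – a sketch with invented names:

```python
from abc import ABC, abstractmethod

class UserRepository(ABC):
    """The 'port': what the business logic needs, and nothing more."""

    @abstractmethod
    def add(self, name: str) -> int: ...

    @abstractmethod
    def get(self, user_id: int) -> str: ...

class InMemoryUserRepository(UserRepository):
    """The 'fast mode' adapter."""

    def __init__(self):
        self._users: dict[int, str] = {}

    def add(self, name: str) -> int:
        user_id = len(self._users) + 1
        self._users[user_id] = name
        return user_id

    def get(self, user_id: int) -> str:
        return self._users[user_id]

# A MySqlUserRepository (not shown) would implement the same port, and the
# contract check below would run against it unchanged in 'production' mode.
def check_user_repository_contract(repo: UserRepository):
    user_id = repo.add("Ada")
    assert repo.get(user_id) == "Ada"

def test_in_memory_adapter_honors_the_contract():
    check_user_repository_contract(InMemoryUserRepository())
```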
Conclusion
There are no right or wrong answers when it comes to writing automated tests for your system. Every system is different and every team has different constraints and capabilities. Old models of thinking have valuable insights. So do new ones. Use them all, and stay focused on the goal of a test suite – maximize the confidence-to-cost ratio.