Diagnosing a 45 Second Test Startup

2020-05-21

At my last job, the Java development culture placed a heavy emphasis on testing. New features required both unit tests and integration tests. Different communities use these terms in different ways, so I'll clarify what we meant by them:

- Unit tests constructed the classes under test directly and ran without any container or external dependencies.
- Integration tests started the full service container, dependency injection framework and all, and could talk to real dependencies such as a database.

Importantly, not all integration tests were designed to exercise the whole system--some only needed a database connection to test a single DAO class. Others probably could have been unit tests, except that there was no obvious way to reach certain dependencies without the crutch of the dependency injection framework. I made it a habit to convert these to unit tests whenever possible.
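The conversion itself was usually mundane. As a sketch of what it looked like in spirit (the class names here are hypothetical, not our real code), it mostly amounted to constructing the object under test by hand instead of asking the injector for it:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class PriceFormatterTest {

    // Hypothetical class under test, standing in for the real internal classes.
    static class PriceFormatter {
        private final String symbol;

        PriceFormatter(String symbol) {
            this.symbol = symbol;
        }

        String format(long cents) {
            return String.format("%s%d.%02d", symbol, cents / 100, cents % 100);
        }
    }

    // Before: an "integration" test booted the whole container just to pull one
    // object out of the injector, e.g.
    //   Injector injector = Guice.createInjector(new ProductionModule());
    //   PriceFormatter formatter = injector.getInstance(PriceFormatter.class);
    // After: a plain unit test builds the object graph by hand.
    @Test
    public void formatsWholeDollarAmounts() {
        PriceFormatter formatter = new PriceFormatter("$");
        assertEquals("$5.00", formatter.format(500));
    }
}
```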

As I understand the debate, many developers would characterize some of our integration tests as unit tests, since they were not necessarily verifying the correct integration of multiple subsystems. However, I think these terms are best taken in their colloquial usage, as revealing something about the development environment and what its developers consider important. In our case, it would have been silly to insist that a test was a "unit test" when it took 45 seconds to start.

That's right, 45 seconds from hitting compile to finding out the first thing about the code you just wrote. Now, it was almost universally the case that, once the execution made its way past startup and into the code paths you wanted to test, things were perfectly fast. All of the overhead was in spinning up the container. The service in question was a particularly large one, and of the many reasons to break it up, test speed was near the middle of the list. However, even the few seconds that a much smaller service took to start up its integration tests were too much.

When I profiled the container startup, what stood out was that the container spent most of its time tied up in classloading. The Java ClassLoader's job is to take a compiled .class file, which consists of Java bytecode, and load it into an in-memory representation of the class. That representation holds more or less everything about the class: fields, constructors, and both static and instance methods. But the runtime is not required to build it until it is asked for. Basically, the process looks like this:

1. Execute code that contains a reference to MyClass.class, either by
   - normal usage: new MyClass(),
   - static reference: as in a Guice binding, such as bind(MyClass.class).to(...), or
   - dynamic reflection: myClassList.stream().flatMap(klass -> Arrays.stream(klass.getMethods())),
2. Search the classpath for the first appearance of the class's fully qualified name,
3. Load the class and continue executing the calling code.
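A toy example (mine, not code from the service) makes the laziness visible: a nested class's static initializer does not run when the program starts, only when the first line of code that actually references the class executes.

```java
public class LazyLoadingDemo {

    static class Heavyweight {
        // Runs when the JVM initializes this class, which happens at its first
        // active use -- not at program startup.
        static {
            System.out.println("Heavyweight initialized");
        }

        static void doWork() {
            System.out.println("doing work");
        }
    }

    public static void main(String[] args) {
        System.out.println("main started");
        // No reference to Heavyweight has executed yet, so it is still unloaded.
        System.out.println("about to touch Heavyweight");
        Heavyweight.doWork(); // loading and initialization happen here
    }
}
```

Running the same program with the standard -verbose:class JVM flag prints each class as it is loaded, which is a cheap way to watch this happen in a real application.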

This process is not blazingly fast, and it certainly contributes to Java's reputation for being slow to start. However, it ought to work pretty well most of the time, even for large applications. The JVM is not required to load every class and method that the application could ever reference--in fact, Java's dynamic features make that impossible. By loading classes on demand, the JVM trades a faster startup for a longer, slower "rampup" period, during which the application is still loading classes as it references them for the first time.
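To put a rough number on that first-reference cost (this is only a sketch, not a serious benchmark), you can time the first load of a class that drags in a large dependency graph against a second lookup of the same class:

```java
public class ClassLoadTiming {
    public static void main(String[] args) throws Exception {
        // javax.swing.JFrame is just a convenient stand-in for "a class with a
        // lot of dependencies"; any heavyweight class not yet loaded will do.
        String name = "javax.swing.JFrame";

        long start = System.nanoTime();
        Class.forName(name); // first reference: loads and initializes the class
        long firstLoadMicros = (System.nanoTime() - start) / 1_000;

        start = System.nanoTime();
        Class.forName(name); // already loaded: effectively a cache lookup
        long secondLookupMicros = (System.nanoTime() - start) / 1_000;

        System.out.println("first load:    " + firstLoadMicros + " us");
        System.out.println("second lookup: " + secondLookupMicros + " us");
    }
}
```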

Luckily, classloading occurs within the context of the stack frame where it is requested, so my profiling was able to determine which packages it was spending the most time loading. By far the majority of the time was spent loading generated Protocol Buffer classes for external RPC services that our service would connect to in the normal course of operation. This was unexpected, as many of the "integration" tests didn't exercise any RPC code paths at all--as I mentioned, many just needed a database connection.

It only took a glance further down the stack trace to reveal that all this classloading was due to the service container framework. As part of application startup, the service container walked through the RPC dependencies that the application declared, in the form of Protocol Buffer service definitions, and generated bindings from the service methods to transport-layer code. That meant referencing--and therefore loading--massive, multi-thousand-line generated classes. When running what should have been a unit test, this was totally wasted work.
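I no longer have the framework's code to quote, but the shape of the problem was roughly the following (all names here are hypothetical): a startup loop that reflects over every declared service class and, in doing so, forces the JVM to load each generated class it touches.

```java
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical sketch of the pattern, not the framework's actual code.
public class RpcBindingGenerator {

    void bindDeclaredServices(List<String> generatedServiceClassNames)
            throws ClassNotFoundException {
        for (String className : generatedServiceClassNames) {
            // Forces the generated Protocol Buffer service class to load.
            Class<?> serviceClass = Class.forName(className);

            // Building Method objects also forces loading of each method's
            // parameter and return types -- the generated request and response
            // message classes -- whether or not the test ever calls them.
            for (Method rpcMethod : serviceClass.getDeclaredMethods()) {
                bindToTransport(rpcMethod);
            }
        }
    }

    private void bindToTransport(Method rpcMethod) {
        // Stand-in for the real work of wiring a method to the transport layer.
        System.out.println("bound " + rpcMethod.getName());
    }
}
```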

I would love to be able to write about how I fixed this, but by that point, other reasons to split up the monolith had become even more compelling, and the team was exploring languages other than Java. Before I could carve out time to dive into the framework code, I had [returned to school]({{< relref "research" >}}).

The moral of the story is to be aware of the small inefficiencies that can be the price of a convenient framework. At least from the perspective of developer experience, it is possible to get much more mileage out of a monolithic application simply by being careful to run only the code that needs to run. The footguns here are worse in Java than in most other languages, because reflection is powerful and slow. When you've got an idea for how to make a framework easier to use and you're trying out a proof of concept, reflection isn't too slow. But five years down the road, all that reflective code, buried so deep in the framework that only the high priests of your Java organization can touch it, will seriously degrade your organization's ability to write and test new features.