“There are three kinds of lies: lies, damned lies, and statistics” - attributed to Mark Twain (allegedly)
There are certain widely used numbers and calculations that, without additional information, can give the wrong impression. Take body mass index (BMI), for example. It’s presented as a way to measure whether a person is at a healthy weight. BMI is calculated using an individual’s height and weight. Generally a BMI of 25 or more is regarded as overweight. By that definition the person listed below would be regarded as overweight:
Person X:
- Height - 6’-0” (183 cm)
- Weight - 230 pounds (104kg)
- BMI - 31.2
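For reference, BMI is just weight divided by height squared: kilograms over meters squared, or pounds over inches squared multiplied by a conversion factor of 703. A quick sketch of the arithmetic (the helper names are mine):

const bmiMetric = (kg, meters) => kg / (meters * meters);
const bmiImperial = (lbs, inches) => (703 * lbs) / (inches * inches);

console.log(bmiImperial(230, 72).toFixed(1)); // Person X: 31.2
console.log(bmiMetric(104, 1.83).toFixed(1)); // ~31.1 (the rounded metric units shift it slightly)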
If you were given more information—say that this person earned a ring from winning the Super Bowl in 2025 as a running back for the Philadelphia Eagles—you might conclude that maybe BMI is not such an accurate reflection of this one person’s health. It might even make you question if BMI has any value whatsoever.
That person is Saquon Barkley, and at least according to the internet, that’s his height, weight, and calculated BMI. I feel comfortable saying that Mr. Barkley is very healthy, and BMI is a terrible metric to use to determine his health. BMI might be helpful, but taken by itself it is a very crude metric. It does not distinguish between muscle and fat—the additional information needed to make a clearer judgement.
This article is about another metric that seems to have suddenly gained a lot of favor in software development: code coverage.
I’ve had multiple experiences where the importance of maintaining a minimum of 80% code coverage affected the decisions I made and the code I wrote, and not always for the better. Why did test coverage become a proxy for code quality? I read up on the issue and its history. I ran an experiment. And I came up with a few general observations:
- Not all files, features, or applications are of equal value, but code coverage tools will treat them that way (without customization)
- Automated testing is not always the most cost-effective way to test your app
- The default minimum code coverage of 80% is arbitrary
- Making your code DRYer will make your code coverage worse
- How you structure your code can improve the accuracy of your code coverage metrics—with verifiable proof using real data and code
Why has code coverage become the go-to metric for code quality?
Code coverage has been around for many years. However, its use has grown to the seeming exclusion of all other metrics. It’s transformed from a useful tool for locating files that need more testing into the ultimate determinant of code quality.
My assumption was that someone else, the “experts”, must know more than I do, and that there is a good reason, a quantified, validated, data-proven, TESTED reason, we’re all trying to meet minimum test coverage. I searched and could not find any scientifically tested evidence that links code coverage to software quality.
There has always been a healthy skepticism about relying too much on code coverage. Martin Fowler wrote about the possibility of 100% code coverage with assertion-free tests more than ten years ago. Even with the legitimate concern over assertion-free tests, I think there are other more likely scenarios than an entire team of bad actors writing assertion-free tests to deliver 100% code coverage.
Personally, I find code coverage metrics useful. They make it easier to identify where the biggest holes are when you are writing tests. While high coverage alone does not verify the quality of your code, low coverage is a red flag.
I’d be lying to you if I said I didn’t know why code coverage is now so much more valued—there are numerous software vendors marketing and promoting tools that use code coverage as a way to automate measuring quality. A software team can add one of these tools to a pipeline, and it will act as a gatekeeper, preventing developers from merging a pull request where any file has fallen below the minimum threshold. The use of these tools is becoming increasingly prevalent.
In short, determining code quality is hard. Buying a tool that automates and measures unit tests is easy.
As I said at the outset, maintaining a minimum of 80% code coverage has affected the decisions I made and the code I wrote, and not for the better. This got me wondering—how many of us are being nudged to do the same?
Not everything is of equal value, but code coverage treats it all equally
Articles about code coverage range from “code coverage is a flawed metric” to “100% code coverage is not enough”. If you’re arguing that even 100% code coverage is not enough, you clearly have a lot of faith in the metric. I’d probably put myself somewhere in the middle; it may be a crude measurement, but I do believe it has some value.
Most of these articles, however, are written in a vacuum with a developer-centric viewpoint, as if the only thing that matters is code. They are written as if time and money are not to be considered, and every code base is pristine and being written from scratch. Not one of these articles mentioned the cost of writing these tests, the risk management of dealing with certain types of bugs, or the return on investment of any of it.
One of the articles cited by Fowler, “How to Misuse Code Coverage,” was written in 1997.
When you use code coverage as your primary (or only) metric for gauging test coverage, you are sizing up every file the exact same way—does this file meet the minimum threshold? None of these articles discuss what happens when you add code coverage to a code base that has been around for years, which is what happens frequently these days.
There is no consideration of other tasks aside from testing, no prioritization of one feature versus another. You ignore what the code is doing and the value it delivers. You do zero financial assessment of what happens if one particular feature fails. You are treating every file as if it has equal value, and each one must meet 80% code coverage. The file that encrypts user data? 80% code coverage required. The file that allows a user to upload their profile image? 80% code coverage required.
If I were adding a code coverage tool to an existing code base, I’d rather take a closer look at the results before blindly applying a minimum threshold across each and every file. I can say with absolute confidence that no matter what application is being considered, some code merits near 100% coverage while other code deserves a lot less than 80%. Adding a gatekeeper that enforces the same percentage across every file will result in developers writing tests for low-value features while never touching the uncovered 20% of the files that hold the higher-value features.
A few of these tools can be configured with different thresholds for different directories within an application. Many teams would get more value if they first manually identified the most important parts of their software and applied the minimum threshold only to those parts. Having been part of multiple projects using a code coverage tool, I have never once seen anything other than the default coverage applied across the entire code base.
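For example, Jest (the test runner used in the experiment later in this article) supports per-path coverage thresholds alongside the global one. A minimal sketch; the directory names and percentages here are hypothetical:

// jest.config.cjs (a sketch; the paths and numbers are made up)
module.exports = {
  coverageThreshold: {
    // the baseline applied to everything else
    global: { statements: 80, branches: 80, functions: 80, lines: 80 },
    // the code that encrypts user data gets held to a higher bar
    "./src/encryption/": { statements: 100, branches: 100, functions: 100, lines: 100 },
    // profile image uploads get a lower one
    "./src/profile-images/": { statements: 50, branches: 50, functions: 50, lines: 50 },
  },
};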
Many projects add these tools without considering the product itself. There should be a distinction between using code coverage for software that operates on a medical device versus a mobile dating app, between a legacy app that has millions of paying users versus a startup releasing its minimum viable product. In all of these cases there is only one value: 80%.
There are features in your app that, if they were to fail, would have little to no consequence to your users, your liability, or your bottom line. There are others that would be catastrophic. You can and should be smarter with your code coverage tools. Please consider doing something other than blindly applying the default threshold to every project and every file.
Automated testing is not always the most cost-effective way
“Never spend six minutes doing something by hand when you can spend six hours failing to automate it” - Zhuowei Zhang
In keeping with the theme of return on investment, I believe developers forget that automated code-based tests are not the only way to test a feature. If you’re building a web-based application, there are tools like Selenium and Cypress. If you’re testing a mobile or desktop app, there are other options available. Typically these tools don’t measure code coverage.
There is also the old-fashioned way: a person going through the motions by hand to verify the feature or code works. Consider the amount of time it takes to write automated tests for a particular feature set versus the time it would take to manually test the same thing.
For example, if you can write automated tests in four hours (4 × 60 = 240 minutes), and manually test the same features in 20 minutes from start to finish, then with very rough back-of-the-napkin math you’d potentially see a return on investment after 12 releases/deployments of that software (240 ÷ 20 = 12).
There are other costs involved that we’re intentionally not factoring in, such as the cost of the person who manually executes the test, the fallibility of humans versus code, and the cloud costs of running these tests in your pipeline. Even without those values it is easy to appreciate that automated tests provide benefits in this use case through economies of scale. Depending on the feature and how critical the application is, it’s also common to test the same feature in multiple ways, both automated and manual.
Sometimes a particular feature is very hard to write an automated test for, and it might be less time consuming to test it manually, even considering scale. I’ve lost track of the tasks I’ve been assigned where writing the test was significantly more difficult than writing the code. The same math applies: imagine an automated test that takes 16 hours (16 × 60 = 960 minutes) of effort to write. Compare that to testing the same feature manually, which might require only five minutes. You would need 192 deployments (960 ÷ 5) before you see a return on investment from that automated test.
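The break-even math is simple enough to write down. A quick sketch using the hypothetical numbers from both examples above:

// Deployments needed before an automated test pays for itself:
// time to write the automated test divided by time per manual run.
const breakEvenReleases = (automationMinutes, manualMinutes) =>
  Math.ceil(automationMinutes / manualMinutes);

console.log(breakEvenReleases(4 * 60, 20)); // 12 releases
console.log(breakEvenReleases(16 * 60, 5)); // 192 releases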
There’s a lot of sub-optimal code out there. Your automated tests can only be as good as the code they are used to validate, and your whole test strategy is only worth as much as the features it validates. Adding a code coverage tool to a sub-optimal code base in the hope that it will magically improve the quality of your application will not work. It’s also likely to make it a lot harder for your development team to make the code base better.
If there is anything I hope to convey, it is that the way your code is structured has a strong effect on how accurate your code coverage data is. I have verifiable proof of why that is at the end of this article.
Where does the threshold of 80% come from?
I did an obligatory Google search in the hope of finding some scientifically proven reasoning behind the default threshold value of 80%. Alas, I came up empty.
I did, however, find some credible speculation that the 80% is related to the Pareto principle. That would make a lot of sense, unless you actually understand what the 80% in the Pareto principle means.
The Pareto principle dictates that “roughly 80% of consequences come from 20% of causes.” One example of the Pareto principle applied to testing is as follows:
“80% of complaints come from 20% of recurring issues.”
There are lots of ways you might apply the Pareto principle to testing. It could mean looking at 80% of your complaints, finding the corresponding code and increasing code coverage and quality there. Or maybe it could mean you would identify the 20% of your code that delivers the greatest value to your customers, and put 80% of your resources and efforts there. Properly applied, it would mean you’d spend more time testing something important like “how do we handle a failed transaction” and less time validating something like “does dark mode work?”
That is not what enforcing 80% code coverage does at all—it assigns equal value to each line of code in each file. That means if you have one file with 200 lines of code validating a user’s credit card, and another 200 lines allowing users to change their default appearance to dark mode, you’re going to have to make sure that your coverage reaches 80% on each of these files. It’s hard to quantify the level of effort required for maintaining the functionality for one versus the other, but it should be easy for anyone to understand that one piece of code holds significantly more value.
Using 80% from the Pareto principle as the default minimum threshold strikes me as misunderstood, grossly misapplied, and quite frankly laughable. The only thing more ridiculous than misunderstanding it and making it the default for code coverage everywhere is blindly trusting code coverage as the be-all and end-all metric for measuring code.
And yet I believe that is exactly why we’re all using 80%: someone read about the Pareto principle somewhere, didn’t understand what it meant, liked how it sounded, and made it the default.
Then again, 80% is also the minimum threshold for a B grade in most US schools (and elsewhere). No one wants to put out C-level code, and requiring a B as the minimum everywhere kind of feels like “good enough.”
Developers are now living in a world where all of our pull requests are held hostage to a percentage because someone misunderstood the Pareto principle or continues to think in terms of letter grades.
Making code DRYer means making code coverage worse
As a code base evolves, opportunities present themselves to make improvements. One of the most common practices is to consolidate repeated code. Your code base might have one or more blocks of code that gets copied and pasted elsewhere.
Having identical code in multiple places is generally regarded as bad practice, so it makes sense to move that repeatedly used block into a single location. That shared code might still be in the same file or be moved into a separate one. This is the principle of Don’t Repeat Yourself (DRY), as opposed to Write Every Time (WET) code.
Making your code DRYer is generally accepted as a good thing. Yet this comes at the cost of declining code coverage. Here are some hypothetical numbers.
Imagine you have a file with 100 lines of code that meets the minimum 80% code coverage threshold. That means 80 lines of that file have been touched during testing and 20 lines have not.
That file has a block of ten lines of code that appears in two different places, and it is currently covered in your testing.
The DRY principle suggests moving this code into a reusable function. However, this has the unintended effect of reducing code coverage. The newly refactored code may be better, but it also reduces the number of lines that are covered by tests, while the uncovered lines stay uncovered. Without additional changes this code won’t be mergeable.
WET code:
100 lines, 80 covered = 80%
DRYer code:
90 lines, 70 covered = 77.8%
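To make the refactor concrete, here is a hypothetical sketch of the kind of change involved (the order-validation code is made up for illustration):

// WET: the same validation block pasted into two functions.
const save = (order) => order; // stand-in for persistence
const createOrder = (order) => {
  if (!order.id) throw new Error("missing id");
  if (!order.total) throw new Error("missing total");
  // ...eight more duplicated lines of checks...
  return save(order);
};
const updateOrder = (order) => {
  if (!order.id) throw new Error("missing id");
  if (!order.total) throw new Error("missing total");
  // ...the same eight lines again...
  return save(order);
};

// DRY: one shared helper. Ten covered lines vanish from the file,
// so the coverage percentage drops even though nothing got worse.
const validateOrder = (order) => {
  if (!order.id) throw new Error("missing id");
  if (!order.total) throw new Error("missing total");
  // ...the remaining checks, now in one place...
};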
So now you have a choice to make:
- add tests that cover previously uncovered code
- leave the repeating code as is
There are times when code was left uncovered for a good reason: it was the hardest to write tests for. The level of effort required to test it can exceed the value the tests are supposed to provide.
The numbers presented here were hypothetical, but the circumstances are not. I was tasked with fixing a bug, and in the process found an opportunity to make the code base better, DRYer. But I made the code coverage worse, and it fell below the threshold.
There is another option available.
Remember when you were in elementary or middle school, and you were given an assignment with a page range? Many of us would tend to focus on the minimum.
While writing that report, maybe you came up a little short on pages. The deadline was approaching, and perhaps there were other assignments to do, other tests to study for. Maybe you were just lazy. The point is there are ways to stretch what you have to meet the minimum: changing the font, increasing the line spacing, breaking one paragraph into two so the last word of the paragraph spills onto a new line. Desperate people do desperate things. With enough moxie, you might just hit that target.
If you know what is possible, and don’t have reservations about using that knowledge, it’s not hard to pad the blocks of code that already have coverage. A couple of logger statements here and there may be enough. You can also reduce the number of lines in uncovered code and get the same benefit. It is entirely plausible to get refactored code over the minimum threshold without new tests, if you’re willing to consider these sub-optimal practices.
I cannot be the first person to have this realization, and a little research confirms that I wasn’t. And yet code coverage minimums live on.
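Continuing the hypothetical numbers from above, padding ten already-covered lines, say with logger statements, is enough to clear the threshold without writing a single new test:
Padded DRYer code:
100 lines, 80 covered = 80%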
How you structure your code alters the accuracy of your code coverage metrics
There is some good news about code coverage. You can increase the quality of your coverage data through how you structure your code. There is one catch, though: it involves writing your code to be as verbose and explicit as possible.
For all of my software development career I’ve been doing largely the opposite, and I’ve watched most experienced developers do the same thing. How many pull requests have I seen with comments like this:
“Why use an if-else block when you can use a ternary operator?”
I’ve been doing this for a very long time. I personally love using a ternary operator, and really any programming syntax that reduces the number of lines of code.
I wanted to run an experiment to test whether reducing an if-else block to a ternary operator affected code coverage. Spoiler alert: I’m not sure this reduction is always the right choice. The results of this experiment and the prominence of code coverage is forcing me to reevaluate what I value in code syntax.
So here’s how the experiment works: I’ve written a function that accepts three parameters, named x, y, and z, all of them boolean. The function returns a boolean value as well, returning true if any of the parameters are true, and false if they are all false.
There are four test cases:
- X is true
- Y is true
- Z is true
- All parameters are false
All code is written in JavaScript. I use the testing framework Jest, which uses Istanbul for coverage by default.
The git repository contains multiple versions of the described function. Each version of that function has its own test file, and each test file contains the same four test cases described above. Below are two variations of the same function:
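For anyone reproducing this, a minimal package.json consistent with the run logs below might look like the following. The project name, version, and test script come from the logs; the Jest version and the "type": "module" field are my assumptions (the ESM-style exports and the --experimental-vm-modules flag suggest them):

{
  "name": "stackoverflow_code_coverage_experiment",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "test": "node --experimental-vm-modules node_modules/jest/bin/jest.js"
  },
  "devDependencies": {
    "jest": "^29.0.0"
  }
}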
condition_control.js
export const conditionControl = (x, y, z) => {
  return x || y || z;
};
condition_switch_separate.js
export const conditionSwitchSeparate = (x, y, z) => {
  switch (true) {
    case x:
      return true;
    case y:
      return true;
    case z:
      return true;
    default:
      return false;
  }
};
Unit test for the control function:
import { conditionControl } from "./condition_control";
describe("validate function conditionIfElse", () => {
xit("should return true when X is true and other params are false", () => {
const xIsTrue = conditionControl(true, false, false);
expect(xIsTrue).toBe(true);
});
xit("should return true when Y is true", () => {
const yIsTrue = conditionControl(false, true, false);
expect(yIsTrue).toBe(true);
});
it("should return true when Z is true", () => {
const zIsTrue = conditionControl(false, false, true);
expect(zIsTrue).toBe(true);
});
xit("should return false when all params are false", () => {
const allFalse = conditionControl(false, false, false);
expect(allFalse).toBe(false);
});
});
On the control branch, experiment_01_all_tests_active, when the tests are executed with the option to display coverage, the results show 100% code coverage across all categories: statements, branches, functions, and lines. The report also lists no uncovered lines of code.
jaredtoporek@Jareds-Laptop stackoverflow_code_coverage_experiment % npm test -- --coverage
> stackoverflow_code_coverage_experiment@1.0.0 test
> node --experimental-vm-modules node_modules/jest/bin/jest.js --coverage
(node:72128) ExperimentalWarning: VM Modules is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
PASS src/condition_if_else_then_return_separate.test.js
PASS src/condition_switch_separate.test.js
PASS src/condition_control.test.js
PASS src/condition_switch_grouped.test.js
PASS src/condition_if_else_separate.test.js
PASS src/condition_if_else_grouped.test.js
-------------------------------------------|---------|----------|---------|---------|-------------------
File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
-------------------------------------------|---------|----------|---------|---------|-------------------
All files | 100 | 100 | 100 | 100 |
condition_control.js | 100 | 100 | 100 | 100 |
condition_if_else_grouped.js | 100 | 100 | 100 | 100 |
condition_if_else_separate.js | 100 | 100 | 100 | 100 |
condition_if_else_then_return_separate.js | 100 | 100 | 100 | 100 |
condition_switch_grouped.js | 100 | 100 | 100 | 100 |
condition_switch_separate.js | 100 | 100 | 100 | 100 |
-------------------------------------------|---------|----------|---------|---------|-------------------
Test Suites: 6 passed, 6 total
Tests: 24 passed, 24 total
Snapshots: 0 total
Time: 0.235 s, estimated 1 s
So this is where the fun begins. We’re going to disable certain tests, and then execute the test suite to see how disabling tests affects the code coverage. There are multiple variations of which tests are disabled. Here are the variations I used:
- All test cases are executed
- Disable the test for “all parameters are false”
- Disable the tests for “Y is true” and “Z is true”
- Disable all tests except “X is true”
- Disable all tests except “Z is true”
If you download the repo you can try each variation/branch yourself. For brevity, we’re only going to examine the variation with the starkest difference in test coverage, example 5: disable all tests except “Z is true”.
When comparing code coverage results, the condition_control.js file, the version with the most concise code, shows 100% code coverage despite three out of four tests having been disabled. The other versions give a more accurate picture of where coverage is missing. The version using a switch statement with each condition on a separate line indicates 25% branch coverage and 50% line coverage. This would seem to indicate that the more concise your code is, the easier it is to get an inflated code coverage score with substandard test coverage. The more verbose code is easier to read and understand, and it gives more accurate code coverage data.
jmtop@Jareds-Laptop stackoverflow_code_coverage_experiment % git checkout experiment_05_disable_all_tests_except_z_is_true
Switched to branch 'experiment_05_disable_all_tests_except_z_is_true'
jmtop@Jareds-Laptop stackoverflow_code_coverage_experiment % npm test -- --coverage
> stackoverflow_code_coverage_experiment@1.0.0 test
> node --experimental-vm-modules node_modules/jest/bin/jest.js --coverage
(node:70912) ExperimentalWarning: VM Modules is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
PASS src/condition_if_else_then_return_separate.test.js
PASS src/condition_switch_separate.test.js
PASS src/condition_if_else_separate.test.js
PASS src/condition_switch_grouped.test.js
PASS src/condition_if_else_grouped.test.js
PASS src/condition_control.test.js
-------------------------------------------|---------|----------|---------|---------|-------------------
File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
-------------------------------------------|---------|----------|---------|---------|-------------------
All files | 66.66 | 53.57 | 100 | 66.66 |
condition_control.js | 100 | 100 | 100 | 100 |
condition_if_else_grouped.js | 75 | 80 | 100 | 75 | 5
condition_if_else_separate.js | 62.5 | 50 | 100 | 62.5 | 3,5,9
condition_if_else_then_return_separate.js | 66.66 | 50 | 100 | 66.66 | 4,6,10
condition_switch_grouped.js | 75 | 25 | 100 | 75 | 8
condition_switch_separate.js | 50 | 25 | 100 | 50 | 4-6,10
-------------------------------------------|---------|----------|---------|---------|-------------------
Test Suites: 6 passed, 6 total
Tests: 18 skipped, 6 passed, 24 total
Snapshots: 0 total
Time: 0.245 s, estimated 1 s
Ran all test suites.
This is a very simple function. It would be an interesting exercise to try it with other languages, other code coverage tools, and more complex code. It would make sense to take an existing code base with somewhat decent code coverage and refactor certain files to make them more verbose, then execute the test suite to see how it affects the code coverage.
Conclusion
Testing and code coverage have a lot in common with insurance. Both provide a level of security, and both have a cost. The premiums you pay to insure your house or car have to be worth the thing you are insuring. Testing and code coverage tools have costs as well: the time and effort required from your developers to write tests, the execution of the test suite in your pipeline, and so on. It’s not just the price of the software you’re using to monitor code coverage. All of the costs incurred have to be worth the value of the thing you’re protecting.
There’s a reason why not everyone pays for collision coverage on their car—the insurance that helps pay to repair or replace your own car if it’s damaged. This is different from liability, the insurance that covers someone else’s car in the event of an accident. If you have a newer car and a car loan, collision coverage is cost effective and most likely required by whoever gave you the loan. On the other hand, if your car is paid off and has a lot of mileage, the premiums are unlikely to be worth the value of your vehicle if it gets totaled.
Putting a code coverage tool in place that requires 80% coverage across all your code may not be worth the cost it incurs. I suspect most software teams only consider the price of the code coverage tool they are buying, without doing a cost-benefit analysis of the other costs that enforcing code coverage will incur. Code coverage does not come for free.