Motivation
In software engineering, we love to measure things — and we don’t always know when to stop.
The effectiveness of your test suite is important. An effective test suite catches regressions in new or updated code before they negatively impact users. We try to measure this effectiveness with a metric known as “code coverage”.
The more experience I have with varying codebases, frameworks and companies, the more I believe that measuring code coverage is essentially useless. Sometimes it can be worse than useless as teams, managers and businesses focus on the wrong metrics and succumb to Goodhart’s Law.
I find it’s about as useful as measuring economic growth by GDP or a person’s health by their BMI.
In this article, I argue that code coverage isn’t something you measure — it’s something you feel.
What is code coverage?
“Code coverage” is the proportion of lines, branches, statements, or functions in your application code that are executed when your test suite runs.
Here is some output from an RSpec run:
.......................
Finished in 3.94 seconds (files took 8.89 seconds to load)
23 examples, 0 failures
Coverage report generated for RSpec to ./coverage. 10310 / 10409 LOC (99.05%) covered.
99.05% of lines of code executed, not bad!
I will now explain the fallacies hiding behind this number.
Code coverage fallacies
To highlight these fallacies, I will refer to the following pseudo-code snippet:
def prank_user(id)
  user = User.find(id)
  phone_number = user.phone_number

  if phone_number.present?
    SmsService.ask_if_refrigerator_is_running(phone_number)
    # Because if so, they better go catch it!
  else
    DatabaseService.update_name_to_barry(user)
    # Nobody wants to have the name "Barry".
  end
end
Here we have a very standard piece of behaviour: it takes in some state and causes a side-effect, either by updating a database or by making a third-party API request.
1. Executed != tested
Just because the line ran, doesn’t mean any of the behaviour associated with that line was tested.
Assume I’ve written a test suite for my prank_user method which covers both when the user has a phone number and when they don’t. My LOC will be 100%, hooray!
But did I actually assert that the correct state change occurred in each scenario? Maybe the update_name_to_barry method incorrectly updates the user’s name to “Sally”.
You might be quick to say “Aaah but we don’t care what update_name_to_barry does, we’re testing this method in a unit!”. Sure, but at some point you will have a test that runs the line of code that updates the database. So then the same question can be asked in that test: did you actually assert that the expected state change happened?
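To make this concrete, here is a plain-Ruby sketch of the problem (no test framework; the services are hypothetical stand-ins for the snippet above, and prank_user is reworked to take a user hash directly so it runs standalone):

```ruby
module SmsService
  def self.ask_if_refrigerator_is_running(phone_number); end
end

module DatabaseService
  # Buggy on purpose: sets the name to "Sally", not "Barry".
  def self.update_name_to_barry(user)
    user[:name] = "Sally"
  end
end

def prank_user(user)
  if user[:phone_number]
    SmsService.ask_if_refrigerator_is_running(user[:phone_number])
  else
    DatabaseService.update_name_to_barry(user)
  end
end

# This "test" executes both branches, so line coverage reads 100%...
prank_user({ phone_number: "555-0100" })
victim = { phone_number: nil }
prank_user(victim)

# ...yet nothing asserted the resulting state, so the bug goes unnoticed:
victim[:name] # => "Sally", not "Barry"
```

Every line ran; the coverage tool is delighted; the user is now called Sally.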
Code coverage does not cover this.
2. Mocks
Again let’s assume I have 100% LOC coverage on my prank_user method, but at the start of the test I add the following:
# ...
allow(SmsService).to receive(:ask_if_refrigerator_is_running)
allow(DatabaseService).to receive(:update_name_to_barry)
# ...
Great, my method has 100% code coverage but am I testing any real behaviour? Sure I can assert that the expected methods were called, but who is to say that those downstream methods will actually accept the arguments I give them?
What is the benefit of a test like this?
Mocks essentially separate your code from reality. Sometimes reality is scary and we don’t want its fickleness in our test suite, but most of the time we want our tests to do what the code will do in production. Otherwise we lose confidence that our test suite reflects what our users actually experience, and with it we lose real test coverage.
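Here is a hand-rolled sketch of that failure mode, using a fake stub object in place of RSpec’s allow. Assume, hypothetically, that the real SmsService has grown a second required argument since the test was written:

```ruby
module SmsService
  # The real method now requires a country code too.
  def self.ask_if_refrigerator_is_running(phone_number, country_code)
    "+#{country_code} #{phone_number}"
  end
end

# The stub, like a non-verifying mock, happily accepts any arguments:
stub = Object.new
def stub.ask_if_refrigerator_is_running(*); :stubbed; end

stub_result = stub.ask_if_refrigerator_is_running("555-0100") # passes

# The real call with the same arguments blows up:
real_result =
  begin
    SmsService.ask_if_refrigerator_is_running("555-0100")
  rescue ArgumentError
    :argument_error
  end
real_result # => :argument_error
```

For what it’s worth, RSpec’s verifying doubles (instance_double, or enabling verify_partial_doubles) catch exactly this class of drift between a mock and the real method’s signature.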
The nuance of mocks is hidden by a 100% code coverage report.
3. Big domains
Congratulations, we have tested prank_user with no mocks and asserted that all of the expected state changes happened… But did we test it for all of the possible values prank_user could receive for the argument id?
In this case, id represents a Postgres auto-increment serial column value, so we know it can only be a positive integer.
In a statically typed language, we can somewhat restrict the domain of the arguments a function can accept:
// Rust
fn prank_user(id: u32) {}
// Java
public static void prankUser(int id) {}
-- Haskell
prankUser :: Int -> ()
In a dynamic language like Ruby however, id really can be anything. Have we tested for when id is a string? What about a negative number? What if id is a complex object, like an ActiveRecord model from another table?
The space of values that id could be really is quite large - infinite even.
Now, would we reasonably expect id to be something wild? Probably not - but I have seen ridiculous values being passed into functions, particularly in dynamic languages where it’s easy to make a mistake as a developer.
These situations are generally fine because it’s often immediately obvious you’ve made a mistake, but sometimes it’s not obvious. The point is, your test certainly doesn’t cover these scenarios and code coverage does not capture this.
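If you do want some protection here, one option is to shrink the domain explicitly at runtime. A minimal sketch (the guard and the sample values are hypothetical, and the body is a stand-in for the real behaviour):

```ruby
def prank_user(id)
  # Reject everything outside the domain a serial column can produce.
  unless id.is_a?(Integer) && id.positive?
    raise ArgumentError, "id must be a positive integer, got #{id.inspect}"
  end
  id # stand-in for the real behaviour
end

# Exercise a few of the "infinite" values id could take:
results = [42, -1, "42", nil, 3.5].map do |id|
  begin
    prank_user(id)
  rescue ArgumentError
    :rejected
  end
end
results # => [42, :rejected, :rejected, :rejected, :rejected]
```

A guard like this turns a silent misuse into a loud one, but note that no coverage report will ever tell you it was missing.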
4. Frequency
We run a serious business so prank_user runs far less frequently than other parts of the codebase. Our code coverage report doesn’t factor this in however.
Maybe we have a do_not_prank_user method, completely untested, that runs 1,000 times more often than prank_user. Well, our code coverage report across these 2 methods shows 50%, so it can’t be that bad, right?
Weighted by execution frequency, the effective coverage is closer to 0.1% - one fully covered execution out of every 1,001.
In reality, the lines of code executed in your application likely follow the Pareto principle or the 80/20 rule. So shouldn’t you care more about the high-frequency 20% rather than the rest?
Maybe, maybe not - I don’t know your application - but this fact is completely lost in a code coverage report.
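Spelling out the arithmetic with hypothetical counts - per-method line coverage weighted by how often each method actually runs:

```ruby
methods = {
  prank_user:        { runs: 1,    coverage: 1.0 }, # fully tested, rarely run
  do_not_prank_user: { runs: 1000, coverage: 0.0 }, # untested hot path
}

total_runs = methods.values.sum { |m| m[:runs] }
weighted   = methods.values.sum { |m| m[:runs] * m[:coverage] } / total_runs

format("%.2f%%", weighted * 100) # => "0.10%"
```

The unweighted report says 50%; weighted by what production actually executes, you are closer to 0.1%.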
5. Dead code
Similar to 4, imagine a world where prank_user is never executed in production. How valuable is a test suite that exercises a method never used in production?
Doesn’t matter, my code coverage report is at 80%!
6. All lines are equal
But some lines are more equal than others.
We really care if our users don’t get pranked. In fact, it’s how we make money so it’s all we care about. My code coverage report says 0.1%, but that’s an incredibly valuable 0.1%.
Certain parts of your codebase mean more to the business than others. Maybe no one notices if 1,000 lines over here fail. But if 1 line over there fails, perhaps your company is now facing a lawsuit.
So does it really matter if you are missing code coverage on those inconsequential 1,000 lines of code?
Some lines of code are more important than others. A code coverage report does not capture what lines your business really cares about and which ones they don’t.
Summary
Hopefully I’ve highlighted how even a simple function like prank_user can fail to be tested effectively despite passing the code coverage report with flying colours.
Managers love code coverage reports. They’re a nice, tangible number they can report on to make them feel safe or that things are getting better.
“We have just eclipsed 80% code coverage in our server!”
Oh, my dear manager, please do not let your guard down.
Measuring code coverage is not simple. Measuring code coverage by lines of code misses all of the delicious nuance that the real world has. You can’t reduce the effectiveness of testing your codebase to a mere percentage because the real world is complex and messy. You might have to read tests, understand failure modes, know which behaviour matters and understand how your software interacts with the world around it.
We’ve all felt uneasy shipping a change into a dark, complex part of the codebase — the kind that seems to cause problems no matter how thoroughly you test it. Savour that feeling. Listen to it. Don’t let a code coverage report silence it!
So don’t try to measure code coverage - feel it.