Monitoring debt builds up faster than software teams can pay it off
If we are to believe the stories we hear, software teams across the industry have modern monitoring and observability practices. Teams get alerted about potential issues before they hit customers—and are able to pull up crime-show-worthy dashboards to find and fix their issues.
From everything I’ve seen these last few years, few software organizations have achieved this level of monitoring. Most teams I’ve encountered have told me they do not have the monitoring coverage they would like across the surface area of their app. I’ve seen many startups go surprisingly far with almost no monitoring at all. To those who still struggle with the monitoring basics: you are in good company.
Today, it’s easier than ever for a team to monitor software in production. There are more monitoring and observability tools available than ever before. We’re also seeing more collective understanding of monitoring and observability best practices across the industry. So why is there such a gap between monitoring ideals and monitoring reality?
What’s happening is that teams are falling into monitoring debt more quickly than they are able to pay it back. In this article, I’ll talk about what monitoring debt is, why it’s easier than ever for teams to build towards monitoring bankruptcy, and what there is to do about it.
What is monitoring debt?
Most software engineers are familiar with the concept of technical debt, a metaphor for understanding how technical tradeoffs have long-term consequences. People typically talk about tech debt in terms of how the cost of refactoring, redesigning, or rewriting tomorrow allows a team to ship faster today. Tech debt, like financial debt, can be taken judiciously and paid off responsibly.
In some ways, monitoring debt is analogous to tech debt: teams can choose to underinvest in monitoring at the time of shipping code at the cost of having to go back and invest in monitoring later. From a technical perspective, monitoring debt behaves similarly to tech debt. It costs more to clean up later and requires intimate knowledge of a system that a developer may have context-switched out of. And this is assuming the same developer is around to pay back the debt!
The costs of monitoring debt are even more insidious. When a team chooses to ship code without thorough monitoring, here are the immediate costs:
- The team needs to accept limited ability to catch issues ahead of customers, meaning the customer is the monitoring plan. This may work for some tools, but may wear on the patience of paying customers.
- The team has chosen to give up at least partial ability to quickly localize issues when they arise, meaning they are less likely to be able to fix issues quickly. This means customers could be waiting hours or days for their issues to get resolved.
What’s worse, paying back monitoring debt is often harder than paying off technical debt, since it requires both intimate knowledge of the code (what kind of behavior is normal; what are the highest-priority events to monitor for) and facility with the monitoring tools (how to find and fix the issues of interest given what the tools support).
Often, the reasons teams decide to take on monitoring debt—not enough expertise; not enough time—cause them to fall deeper and deeper into debt.
Why it’s easier than ever to build towards monitoring bankruptcy
Today, there are a few reasons it’s easier than ever for teams to quickly build towards monitoring bankruptcy.
Monitoring is a highly-skilled activity
Monitoring a system requires a nontrivial amount of knowledge about both the system under monitoring and how that system should be monitored using the tools.
- Setting up monitoring and observability requires knowledge of the underlying system. If you’re using something like Datadog APM, it can look like all you need to do is include a library, but this often involves updating other dependencies. Even in our ~1-year-old code base with three microservices, it took an extremely senior engineer a week to hunt down the dependencies across multiple languages. And even after we set it up, my developers didn’t have the bandwidth to set up all the dashboards we need to properly use this data. We are perpetually behind!
- The tools themselves have a learning curve. Many tools require some understanding of how to use the tools: how to instrument; how to customize the graphs. Using OpenTelemetry outside a framework that provides automatic instrumentation support has a decently high learning curve because you have to learn how to implement spans. Other observability tools advocate writing code to anticipate that you will consume the logs or traces, which requires understanding and discipline on the part of the developer. Tools that require custom dashboards often require developers to understand how to access the data they need and which thresholds mean something is wrong. Like manual transmission cars, most monitoring and observability tools today trade ease of use for control and flexibility; these tools require some facility and basic understanding of both clear monitoring goals and the underlying monitoring mechanisms.
Monitoring is best done fresh
The longer a piece of software goes without monitoring, the exponentially harder it gets to monitor. First, hooking up the monitoring tool is harder for a system that is not completely up-to-date and paged-in. Any monitoring system that requires the use of a library means there are likely compatibility issues—at the very least, somebody needs to go around updating libraries. More high-powered tools that require code changes are even harder to use. It’s already hard to go back to old code to make any updates. Context-switching it back in for tracing is just as tricky!
Second, consuming “old” monitoring data is tricky. Even if you can rely on your framework to automatically generate logs or add instrumentation, what is considered “normal behavior” for a system may have gotten paged out or left the company with past team members.
Finally, with software teams being more junior than ever before and experiencing more churn than in recent history, the chances are increasing that a different, more junior developer may be tasked with cleaning up the debt. These developers are going to take longer just to understand the codebase and its needs. Expecting them to simultaneously pick up skills in monitoring and observability, while retrofitting logging and tracing statements onto a code base, is a big ask.
Better tools have made it easier to take on monitoring debt
Finally, the rise of SaaS and APIs has made it a lot harder to monitor systems. Monitoring is now no longer about seeing what your own system is doing, but how your system is interacting with a variety of other systems, from third-party payment APIs to data infrastructure. I would say that a legacy subsystem nobody on the team completely understands also falls in this category. While traditional monitoring and observability practices made sense for monoliths and distributed services completely under one organization’s control, it is unclear how to adapt these practices when your distributed system has components not under your team’s control.
What teams need to pay off monitoring debt
My take: let’s get new tools. But in the meantime, let’s also rethink our practices.
Today’s monitoring tools are built for a world in which the developers who built well-contained systems of manageable size can get high-fidelity logging across the entire thing. We live instead in a world where software services run wild with emergent behaviors and software engineering is more like archaeology or biology. Monitoring tools need to reflect this change.
To meet software development where it is, monitoring tools need debt forgiveness. My proposed improvements:
- Make it easier to set up monitoring and observability black-box. I know, I know: the common wisdom about finding and fixing issues is that you want to understand the inner workings of the underlying system as well as possible. But what if there’s too much code in the underlying system to make this possible? Or if there are parts of the system that are too old, or too precariously held together, to dive in and add some new logs to? Let’s make it easier for people to walk into a system and be able to monitor it, without needing to touch code or even update libraries. Especially to make it possible to set up new monitoring on old code, we want drop-in solutions that require no code changes and no SDKs. And with more and more of system interactions becoming visible across network APIs and other well-defined interfaces, black-box monitoring is getting closer to reality.
- Make it easier for teams to identify what’s wrong without full knowledge of what’s right. Today’s monitoring tools are built for people who know what they’re doing to do exactly what they need to do to fix what is wrong. Teams should not need to understand what their latency thresholds need to be, or what error rates need to be, in order to start understanding how their system is working together. The accessible monitoring and observability tools of the future should help software teams bridge knowledge gaps here.
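As a rough illustration of both ideas together, here is a minimal black-box probe sketch: it observes a service purely from the outside over HTTP and classifies results against thresholds. The URL and threshold values are illustrative assumptions, not recommendations; a real tool would learn or suggest thresholds rather than make the team pick them.

```python
import time
from urllib import request, error

def classify(status, elapsed_s, slow_after_s=1.0):
    """Classify one probe result. Thresholds here are illustrative."""
    if status is None or status >= 500:
        return "error"
    return "slow" if elapsed_s > slow_after_s else "ok"

def probe(url, slow_after_s=1.0, timeout_s=5.0):
    """Black-box check: no code changes or SDKs in the service itself."""
    start = time.monotonic()
    try:
        with request.urlopen(url, timeout=timeout_s) as resp:
            return classify(resp.status, time.monotonic() - start, slow_after_s)
    except (error.URLError, TimeoutError):
        return classify(None, time.monotonic() - start, slow_after_s)

# Usage (hypothetical endpoint):
# probe("https://example.com/health")  # -> "ok", "slow", or "error"
```

Everything this probe knows about the system comes from its network interface, which is exactly the kind of observation that stays possible even when the code behind the endpoint is old, third-party, or poorly understood.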
Across monitoring and observability, we have great power tools—but what don’t we need in a “for dummies” solution? This is something we’ll need to think about collectively across the industry, but here are some starting ideas:
- We’re not building for teams optimizing peak performance; we’re building for teams trying to make sure that when load is high, the system does not fall over. Answering the basic questions of “is anything erroring?” and “is anything too slow?” often does not require precise machinery.
- Being able to precisely trace requests and responses across the system is great, but it’s usually not necessary. For a point of reference: a friend once told me that fewer than five Principal Engineers at his FAANG-like company used their tracing tools.
- What is the minimum information we need for monitoring? When root causing issues, is it simply enough to get a unique identifier on which process the issue came from? I would love to see more discussion around “minimum viable information” when it comes to finding and fixing issues.
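One candidate for “minimum viable information” is simply stamping every log line with a unique request identifier, so a failure can at least be traced back to the request it came from. Below is a small sketch of that idea using Python’s standard `logging` module; the `handle_request` function and its payload shape are hypothetical.

```python
import logging
import uuid

# Include the request ID in every log line via the `extra` mechanism.
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s %(request_id)s %(message)s",
)
log = logging.getLogger("app")

def handle_request(payload):
    # One short unique ID per request: the "minimum viable information"
    # linking a failure back to where it came from.
    ctx = {"request_id": uuid.uuid4().hex[:8]}
    log.info("started", extra=ctx)
    try:
        return 1 / payload["divisor"]  # hypothetical work
    except ZeroDivisionError:
        log.error("failed", extra=ctx)  # same ID links failure to request
        return None
```

Even with otherwise sparse logging, this one field lets someone grep a failure back to a specific request, which is often enough to start root-causing.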
Talking more about the answers to these questions can help establish minimum, rather than ideal, standards for monitoring using the existing tools.
Bringing down the debt
In order to help teams pay off monitoring debt, we need a mindset shift in developer tools. We need more tools that are usable in the face of tech debt, monitoring debt, and very little understanding of the underlying system or tools.
Today, it would be a heroic feat for a new junior person to join a team and successfully address an incident. But we’re already seeing accounts of this in the news—and tech workplace dynamics mean we should only expect more of this to happen. Why not make it easier for a junior dev to handle any incident?