Ensuring backwards compatibility in distributed systems

As our lives have become more distributed, so too has the software that we rely on. What we see as a single user interface is typically powered by a series of connected services, each with a specific job.

Consider Netflix. On the home page, we see a mix of content: previously watched shows, popular new titles, account management, and more.

But this screen is not generated by netflix.exe running on a PC somewhere. As of 2017, it was powered by over 700 individual services. This means the start screen is really just an aggregation of hundreds of microservices working together. One service to manage account features, another to make recommendations, and so on.

The move towards distributed architectures brings lots of benefits: easier testing, smaller deployable units, looser coupling, smaller failure surfaces, to name a few. But it also brings its own set of challenges.

One of these is maintaining backwards compatibility between components. In other words, how can a set of services evolve together in a way that doesn’t break the system? Services can only work together if they all agree to various contracts: how to exchange data and what the format of the data looks like. Breaking even a single contract can wreak havoc on your system.

But as developers, we know change is the only constant. Technology and business needs inevitably change over time, and so must our services. These changes can surface in a variety of places: web APIs, messaging systems such as JMS or Kafka, and even data stores.

Below we’ll look at some best practices for building distributed systems that allow us to modify services and interfaces in a way that makes upgrading easier.

RESTful web APIs are one of the main ways in which distributed systems communicate. They follow a basic client-server model: service A (the client) sends a request to service B (the server). The server does some work and sends back a response indicating success or failure.

Over time, our web APIs may need to change. Whether it’s from shifting business priorities or new strategies, we should accept from day one that our APIs will likely be modified.

Let’s look at some ways we can make our web APIs backwards compatible.

To create web APIs that are easy to evolve, follow the robustness principle, summarized as “Be conservative in what you do, be liberal in what you accept from others.”

In the context of web APIs, this principle can be applied in several ways:

Every API endpoint should have a small, specific goal that maps to exactly one of the CRUD operations. Clients should be in charge of aggregating multiple calls as needed.
Servers should communicate expected message formats and schemas and adhere to them.
New or unknown fields in message bodies should not cause errors in APIs; they should simply be ignored.
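As a rough illustration of the last point, here is a minimal Python sketch of a server that ignores fields it does not recognize. The `CreateUserRequest` type and its fields are hypothetical and exist only for this example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class CreateUserRequest:
    # Hypothetical request type used only for illustration.
    name: str
    email: str


def parse_request(body: str) -> CreateUserRequest:
    """Parse a JSON body, keeping known fields and silently dropping the rest."""
    payload = json.loads(body)
    known = {f.name for f in fields(CreateUserRequest)}
    return CreateUserRequest(**{k: v for k, v in payload.items() if k in known})


# "nickname" is unknown to this server, so it is ignored rather than rejected.
request = parse_request('{"name": "Jane", "email": "jane@example.com", "nickname": "jj"}')
print(request)
```

A missing required field still fails loudly here, which is usually what we want: being liberal about extra input is not the same as accepting incomplete input.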

Versioning an API allows us to support different functionality for the same resource.

For example, consider a blog application that offers an API for managing its core data such as users, blog posts, categories, etc. Let’s say the first iteration has an endpoint that creates a user with the following data: name, email, and a password. Six months later, we decide that every account now must include a role (admin, editor, author, etc). What should we do with the existing API?

We essentially have two options:

Update the user API to require a role with every request.
Simultaneously support the old and new user APIs.

With option 1, we update the code and any request that doesn’t include the new parameter is rejected as a bad request. This is easy to implement, but it also breaks existing API users.

With option 2, we implement the new API and also update the original API to provide some reasonable default for the new role parameter. While this is definitely more work for us, we don’t break any existing API users.
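One way option 2 might look in code is sketched below. The function names, the `User` type, and the choice of `"author"` as the default role are all assumptions made for this example, not part of any real API.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ROLE = "author"  # assumed default; pick whatever suits the business rules


@dataclass
class User:
    name: str
    email: str
    password: str
    role: str


def create_user_v2(name: str, email: str, password: str, role: Optional[str]) -> User:
    """New API: a role is mandatory."""
    if not role:
        raise ValueError("role is required")
    return User(name, email, password, role)


def create_user_v1(name: str, email: str, password: str) -> User:
    """Original API: keeps its old signature and fills in a default role."""
    return create_user_v2(name, email, password, DEFAULT_ROLE)
```

Having the old entry point delegate to the new one keeps a single code path for creating users, so the two versions cannot drift apart.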

The next question is how to version an API. This debate has raged for many years, and there is no single right answer. A lot depends on your tech stack, but generally speaking, there are three primary ways to implement API versioning: in the URL, in headers, or in the message body.

Putting the version in the URL is the easiest and most common approach. It can be done using either the path:

POST /v2/blog/users

Or by using query parameters:

POST /blog/users?v=2

URLs are convenient because they’re a required part of every request, so your consumers have to deal with it. Most frameworks log URLs with every request, so it’s easy to track which consumers are using which versions.
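On the server side, extracting the version from either URL style could look roughly like this. The fallback to version 1 for unversioned requests is an assumption of this sketch, not a rule.

```python
import re
from urllib.parse import parse_qs, urlsplit

DEFAULT_VERSION = 1  # assumption: unversioned requests get the oldest supported version


def version_from_url(url: str) -> int:
    """Return the API version from a /vN/ path prefix or a ?v=N query parameter."""
    parts = urlsplit(url)
    match = re.match(r"^/v([0-9]+)(?:/|$)", parts.path)
    if match:
        return int(match.group(1))
    values = parse_qs(parts.query).get("v")
    if values and re.fullmatch(r"[0-9]+", values[0]):
        return int(values[0])
    return DEFAULT_VERSION
```

For example, `version_from_url("/v2/blog/users")` and `version_from_url("/blog/users?v=2")` both yield 2.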

Alternatively, we can pass the version in a header, using a custom header name that our services understand:

API-Version: 2

Or we can hijack the `Accept` header to include custom extensions:

Accept: application/vnd.mycompany.v2+json

Using headers for versioning is more in line with RESTful practices. After all, the URL should represent the resource, not some version of it. Additionally, headers are already great at passing what is essentially metadata between clients and servers, so adding in version seems like a good fit.

On the other hand, headers are cumbersome to work with in some frameworks, more difficult to test, and not feasible to log for every request. Some internet proxies may remove unknown headers, meaning we’d lose our custom header before it reaches the service.
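A hedged sketch of reading the version from either header is shown below. The header name and vendor media type follow the examples above. The lookup order (custom header first, then `Accept`) is an arbitrary choice for this illustration.

```python
import re
from typing import Mapping, Optional

# Vendor media type taken from the example above; adjust to your own naming.
ACCEPT_PATTERN = re.compile(r"application/vnd\.mycompany\.v([0-9]+)\+json")


def version_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Look for a version in API-Version first, then in the Accept header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    explicit = lowered.get("api-version", "").strip()
    if re.fullmatch(r"[0-9]+", explicit):
        return int(explicit)
    match = ACCEPT_PATTERN.search(lowered.get("accept", ""))
    return int(match.group(1)) if match else None
```

Returning `None` when no version is present leaves the decision (reject, or fall back to a default) to the caller.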

Finally, we could wrap the message body in an envelope with metadata that includes the version:

{
  "metadata": {
    "version": 2
  },
  "message": {
    "name": "John Doe",
    "email": "john@stackoverflow.com",
    "password": "P@assword123",
    "role": "editor"
  }
}

From a RESTful point of view, this violates the idea that message bodies are representations of resources, not a version of the resource. We also have to wrap all our domain objects in a common wrapper class, which doesn’t feel great—if that wrapper class ever needs to change, all of our APIs potentially have to change with it.
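For completeness, unwrapping such an envelope might look like the sketch below. The set of supported versions and the rule that a missing version means 1 are assumptions made for the example.

```python
import json
from typing import Any, Dict, Tuple

SUPPORTED_VERSIONS = {1, 2}  # assumption for this sketch


def unwrap(body: str) -> Tuple[int, Dict[str, Any]]:
    """Split an envelope into (version, message); a missing version is treated as 1."""
    envelope = json.loads(body)
    version = envelope.get("metadata", {}).get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version: {version}")
    return version, envelope.get("message", {})
```

Note that every handler now depends on the envelope's shape, which is exactly the coupling described above.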

One final thought on versioning: consider using something beyond a simple counting scheme (v1, v2, etc). You can provide more context to your users by using a date format (e.g. “201911”) or even semantic versioning.

When we release libraries to GitHub or Maven, we provide change logs and documentation. Our web APIs should be no different.

Change logs are essential for letting API consumers make informed decisions about if and when they should update their clients. At a minimum, API change logs should include the following:

Version and effective date
Breaking changes that consumers will have to handle
New features that can optionally be used but don’t require any updates by consumers
Fixes and changes to existing APIs that don’t require consumers to change anything
Deprecation notices for functionality that will be removed in a future version

This last part is critical to making our APIs evolvable. Deleting an endpoint is clearly not backwards compatible, so instead, we should deprecate it first. This means we continue to support it for a fixed period of time, giving our consumers time to modify their code instead of breaking them unexpectedly.

Messaging services like JMS and Kafka are another way to connect distributed systems. Unlike web APIs, messaging services are fire-and-forget. This means we typically don’t get immediate feedback about whether the consumer accepted the message or not.

Because of that, we have to be careful when updating either the publisher or consumer. There are several strategies we can adopt to prevent breaking changes when upgrading our messaging apps.

A good practice is to upgrade consumer applications first. This gives us a chance to handle new message formats before we actually start publishing them.

The robustness principle applies here as well. Producers should always send the minimum required payload, and consumers should only consume the fields they care about and ignore anything else.

If message bodies change significantly or we introduce a new message type entirely, we should use a new topic or queue. This allows us to publish messages without worrying that consumers might not be ready to consume them. Messages will queue up in the brokers, and we are free to deploy the new or updated consumer whenever we want.

Most message buses offer message headers. Just like HTTP headers, this is a great way to pass metadata without polluting the message payload. We can use this to our advantage in multiple ways. Just like with web APIs, we can publish messages with version information in the header.

On the consumer side, we can filter for messages that match versions that are known to us, while ignoring others.
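The filtering idea can be sketched without tying it to any particular broker. The `Message` class below is a stand-in invented for this example. Real JMS or Kafka clients expose headers through their own APIs, and the header name `version` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

SUPPORTED_VERSIONS = {"1", "2"}  # versions this hypothetical consumer understands


@dataclass
class Message:
    # Minimal stand-in for a broker message; real clients expose headers differently.
    payload: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def accept(messages: Iterable[Message]) -> List[Message]:
    """Keep messages whose 'version' header is known; skip everything else."""
    return [m for m in messages if m.headers.get("version") in SUPPORTED_VERSIONS]


inbox = [
    Message(b"old", {"version": "1"}),
    Message(b"future", {"version": "3"}),
    Message(b"unversioned"),
]
handled = accept(inbox)  # only the version 1 message is kept
```

Whether unversioned messages should be skipped or treated as version 1 is a policy decision; this sketch skips them.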

In a true microservices architecture, data stores are not shared resources. Each service owns its data and controls access to it.

However, in the real world, this often isn’t the case. Most systems are a mix of legacy and modern code that all access data stores using their own accessors.

So how can we evolve data stores in a backwards compatible way? Since most data stores are either a relational or NoSQL database, we’ll look at each one separately.

Relational databases, such as Oracle, MySQL, and PostgreSQL, have several characteristics that can make upgrading them a challenge:

Tables have very strict schemas and will reject data that doesn’t exactly conform
Tables can have foreign key constraints amongst themselves

Changes to relational databases can be broken into three categories: adding new tables, adding new columns, and removing columns or tables.

Adding new tables is generally safe and will not break any existing applications. We should avoid adding foreign key constraints to existing tables, but otherwise, there’s not much to worry about in this case.

When adding new columns, always add them to the end of the table. If the column is not nullable, we should include a reasonable default value for existing rows.

Additionally, queries in our applications should always use named columns instead of numeric indices. This is the safest way to ensure new columns do not break existing queries.
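The following sketch uses Python’s built-in `sqlite3` module with an in-memory database to show both ideas: a new non-nullable column with a default, and queries that name their columns. The `users` table and its contents are made up for the example, and other databases have their own `ALTER TABLE` syntax and locking behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name instead of by index
conn.execute("CREATE TABLE users (name TEXT NOT NULL, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (name, email) VALUES ('Jane', 'jane@example.com')")

# New NOT NULL column with a default, so the existing row stays valid.
conn.execute("ALTER TABLE users ADD COLUMN role TEXT NOT NULL DEFAULT 'author'")

# Old code that names its columns is unaffected by the new one.
row = conn.execute("SELECT name, email FROM users").fetchone()
print(row["name"], row["email"])

# New code can read the default that was applied to the existing row.
role = conn.execute("SELECT role FROM users WHERE name = 'Jane'").fetchone()["role"]
print(role)
```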

Removing columns or tables poses the most risk to backwards compatibility. There’s no good way to ensure a table or column exists before querying it, and the overhead of checking before each query simply isn’t worth it.

If possible, database queries should gracefully handle failure. Assuming the table or column that is being removed isn’t critical or part of some larger transaction, the query should continue execution if possible.

However, this won’t work for most cases. Chances are, every column or table in the schema is important, and having it disappear unexpectedly will break your queries.

Therefore, the most practical approach to removing columns and tables is to first update the code that uses them. This means updating every query that references the table or column in question and modifying its behavior. Once all those usages are gone, it is safe to drop it from the database.

NoSQL data stores such as MongoDB, Elasticsearch, and Cassandra have different constraints than their relational counterparts.

The main difference is that instead of rows of data that all must conform to a schema, documents inside a NoSQL database have no such restriction. This means our applications are already used to dealing with documents that don’t have a unified schema.

We have the additional benefit that most NoSQL databases do not allow constraints between collections the way relational databases do.

In this context, adding new collections and fields is usually not a concern. Here again the robustness principle is our guide: only persist required fields and ignore any fields we don’t care about when reading a document.

On the other hand, removing fields and collections should follow the same best practices as relational databases. If possible, our queries should handle failure gracefully and continue executing. Barring that, we should update any queries first, then update the data store itself.
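Reading documents defensively might look like the sketch below, where plain dictionaries stand in for stored documents. The field names are hypothetical, and `tags` plays the role of a field that older documents lack.

```python
from typing import Any, Dict


def read_post(document: Dict[str, Any]) -> Dict[str, Any]:
    """Project a stored document onto the fields this service actually uses."""
    return {
        "title": document["title"],        # required in every document
        "tags": document.get("tags", []),  # newer field; older documents lack it
    }


old_doc = {"title": "Hello", "legacy_flag": True}
new_doc = {"title": "Hello again", "tags": ["news"], "subtitle": "ignored here"}
print(read_post(old_doc))
print(read_post(new_doc))
```

Because the reader only touches the fields it needs, dropping `legacy_flag` or `subtitle` from the store later would not affect it.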

Regardless of which tech stack we use, there are certain practices we can incorporate into our software lifecycle that help eliminate or minimize compatibility issues.

Keep in mind, most of these only work under two conditions:

Brand new software projects.
Mature software development organizations willing to devote the necessary resources for training and tooling.

If your organization doesn't fit into one of these categories, you’re unlikely to be successful in implementing any of these processes.

Additionally, none of the practices below is meant to be a silver bullet that will solve all deployment problems. It’s possible none, one, or many of these will be applicable to your organization. Evaluate how each may or may not help you.

A canary deployment, a close relative of blue/green (also called red/black) deployments, releases a new version of an application while allowing only a small percentage of traffic to reach it.

The goal is to test new application versions with real traffic, while minimizing the impacts of any problems that might occur. If the new application works as expected, then the remaining instances can be upgraded. If something goes wrong, the single instance can be reverted and only a small portion of traffic is impacted.
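The traffic split itself is conceptually simple. Below is a minimal sketch of a deterministic router that sends roughly a fixed percentage of requests to the canary. The 5% figure and the use of a request or user ID as the routing key are assumptions, and in practice this logic usually lives in a load balancer or service mesh rather than in application code.

```python
import hashlib

CANARY_PERCENT = 5  # assumed share of traffic sent to the new version


def route(request_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Deterministically map a request (or user) ID to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Hashing the ID, rather than picking randomly per request, keeps a given user on the same version for the duration of the rollout. Setting `canary_percent` to 0 acts as an immediate rollback.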

This only works for clustered services, where we run more than one instance. Applications that run as singletons cannot be tested in this way.

Additionally, canary deployments require fine-grained control over traffic flow, which usually means a sophisticated service mesh. Most microservice architectures already use some type of service discovery, but not all of them are created equal. Without that level of traffic control, a canary deployment is not possible.

Finally, canary deployments aren’t the answer to all upgrades. They don’t work for services being deployed for the first time. And if the underlying data model is changing along with a service, it may not be possible to have multiple versions of the application running concurrently.

The three Ns refers to the idea that an application should support three versions of every service it interacts with:

The previous version (N-1)
The current version (N)
The next version (N+1)

So what exactly does this mean? It really just boils down to not assuming that our services will be upgraded in any particular order.

As an example, let’s consider two services, A and B, where A makes RESTful calls to B.

If we need to make a change to A, we should not assume that B will be upgraded before or after A, or even at all. The changes we make to either A or B should stand on their own.

And what happens if B has to roll back? We shouldn’t have to revert all of its dependent services in this case.

To be clear: the three Ns principle is not easy to achieve, especially when dealing with legacy monolithic applications. However, it’s not impossible.

It takes planning and forethought, and it won’t come without growing pains and failures along the way. It typically requires a massive shift in the development mindset to the point where every developer and team should be asking two questions before they release any new code:

Which services does my code use and which services use my code?
What would happen if my code had to be reverted and what is my rollback plan?

The first question may not be easy to answer but there are plenty of tools that can help. From static analysis of source code to more sophisticated tools like Zipkin, a dependency graph can help you understand how services interact.

The second question should be easy to answer: what changed outside of the code? This could be a database schema, configuration files, etc. We need to have a plan to roll back these changes, not just the compiled code.

If you’re willing to put in the effort, achieving the three Ns is a great way to ensure backwards compatibility throughout a distributed system.

Another great way to shield services from breaking changes is by using feature toggles.

A feature toggle is a piece of configuration that applications can use to determine if a particular feature is enabled or not. This allows us to release new code, but avoid actually executing it until a time of our choosing. Likewise, we can quickly disable new functionality if we find a problem with it.

There are plenty of tools that can be used to implement feature toggles, such as rollout.io and Optimizely. Regardless of which tool we use, there are three characteristics we should look for: toggles should be fast, distributed, and atomic.

Implementing feature toggles usually means adding lots of code like the following into our applications:

if(newFeatureEnabled()) {
  // do new stuff
}

Checking the state of a feature toggle must therefore be quick. We shouldn’t rely on reading from a database or remote file every time we need to check the state of a toggle, as that could degrade our application very quickly.

Ideally, toggle state should be loaded during application startup and cached internally, with some mechanism to update that internal state as needed (messaging bus, JMX, API, etc).
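A minimal in-process cache along these lines is sketched below. The class name, the toggle name, and the idea that some external listener calls `apply_update` are all assumptions for illustration. A real toggle library would supply its own equivalents.

```python
import threading
from typing import Dict, Mapping


class ToggleCache:
    """In-memory toggle state: loaded once, read cheaply, updated on notification."""

    def __init__(self, initial: Mapping[str, bool]) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, bool] = dict(initial)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles default to off, so new code paths stay dark until enabled.
        with self._lock:
            return self._state.get(name, False)

    def apply_update(self, name: str, enabled: bool) -> None:
        # Called by whatever delivers changes (message listener, admin endpoint, ...).
        with self._lock:
            self._state[name] = enabled


toggles = ToggleCache({"new_checkout": False})  # hypothetical toggle name
```

Reads never leave the process, so a check costs little more than a dictionary lookup, and switching the feature on is a single `toggles.apply_update("new_checkout", True)` call.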

Since we’re dealing with distributed systems, it’s likely that a feature toggle will need to be accessed by multiple applications. Therefore the state of a toggle should be distributed so that every application sees the same state, along with any changes.

Changing the state of a toggle should be a single operation. If we have to update multiple sources of configuration, we’re increasing the chances that applications will get a different view of the toggle.

Toggles have a tendency to accumulate over time in code. While the performance impact of checking lots of toggles may be negligible, they can quickly morph into tech debt and require periodic cleaning. Make sure to plan time to revisit them and clean up as necessary.

In our ever-changing distributed world, there are lots of ways for applications and services to communicate, which means there are lots of ways to break them as they inevitably evolve.

The tips and ideas above are only a starting point and don’t cover all the ways in which our systems might talk. Things like distributed caches and transactions can also provide obstacles to building backwards compatible software.

There are also other protocols, such as web sockets or gRPC, that have their own features that we can utilize to smartly upgrade our systems.

As we move away from monoliths and towards microservices, we need to ensure we’re focusing as much on the evolvability of our systems as we do on their functionality.

Web APIs

The robustness principle

Versioning

URL

Headers

Message body

Documentation

Messaging services

Upgrade consumers first

Create new topics and queues

Use headers and filters

Data stores

Relational databases

Adding new tables

Adding new columns

Removing columns or tables

NoSQL databases

Software deployments

Canary deployment

The three Ns

Feature toggles

Fast

Distributed

Atomic

Move fast, but don't break things
