Fulfilling the promise of CI/CD

[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. Please enjoy our favorite work this year and we'll see you in 2022.]

A Morality Tale

Once upon a time there was a large and terrifying pirate ship, and it ruled the high seas despite being a bit leaky. The crew often had to pump seawater out of the lower decks after a storm or a fight. One day, after a particularly vicious battle, they realized that cannonballs had breached the hull below the water line. They weren’t sure how to fix it without tools or carpenters. So they just hooked the pumps up to run constantly, day and night. For a while this kept pace, but soon they had to buy twice as many pumps, and bring on more sailors to staff the pumps while others sailed. Before long, they had more sailors running pumps than actually sailing the ship or looting and pillaging. This wasn’t what they signed up for, which made everyone pretty cranky. Before long they became experts in everything that had to do with pumping water out of ships, but their other skills of sailing, looting and pillaging suffered, and their best people left in frustration. Before the year was over their ship had sunk.

The moral of the story is, don’t treat the symptoms: treat the cause.

Speaking of which, let’s talk about CI/CD.

CI/CD (continuous integration and continuous delivery/deployment) is part of the fabric of our lives. We go to conferences about CI/CD, we write articles and blog about CI/CD, we list CI/CD on our LinkedIn page. Nobody even debates whether it’s the right thing to do or not. We are all on the same page. Right?

Except … if you listen closely to what people are saying, they aren’t talking about CI/CD at all—they may say “CI/CD,” but they are only talking about continuous integration. Nobody is talking about (or practicing) continuous deployment. AT ALL. It’s like we have all forgotten it exists.

Don’t get me wrong, it is wonderful that we have gotten better at CI over the past decade. But it’s not enough. It’s not even half the battle. The point of continuous integration was never to feel smug about the syntactic and logical correctness of our integrated software, although that’s a nice bonus. The point of CI is to clear the path and set the stage for continuous delivery, because CD is what will actually save your ass.

Just like on that pirate ship, you need to treat the causes instead of chasing after symptoms. And what is the primary source of those leaks and holes in our (arr!) production pirate ship? It is everything that happens between the moment you write the code and the moment that code is live: your too-long, leaky, distended, bloated, lumped-up, misfiring software digestive system.

Err...what is CI/CD, really?

Great question. Glad you asked. CI/CD is a process and a methodology designed to make sure that all the code you merge to main is deployable at any time by testing it and deploying it. Automatically. After every diff.

The goal of CI/CD is to reduce the lead time of software changes to an interval short enough that our human learning systems (adrenaline, dopamine, etc.) can map to the feedback loops of writing and shipping code.

If any engineer on your team can merge a set of changes to main, secure in the knowledge that 15 minutes later their changes will have been tested and deployed to production, with no human gates or manual intervention required, then congratulations! You’re doing CI/CD.

Very few teams are actually practicing CI/CD, but nearly all should be. This is basic software hygiene.

The software development death spiral

The time elapsed between writing and shipping is the room-temperature petri dish where pathological symptoms breed and snowball. Longer lead times lead to larger code diffs and slower code reviews. This means anyone reviewing or revising these nightmare diffs has to pause and swap the full context in and out of their mind any time they switch gears, from writing code to reviewing and back again.

Thus the elastic feedback loop of the development cycle begins to stretch out longer and longer, as people spend more and more time waiting on each other—to review, to comment, to deploy, to make requested changes—and more and more time paging state in and out of their brains and trying to remember where they were and what they were trying to do.

But it gets worse. Most deploys are run by hand, at some indeterminate interval after the code was written, not by the person who wrote the code, and—worst of all—with many developers’ changes batched up at once. It is this, above all else, that severs the engineer from the effects of their work, and makes it impossible for them to practice responsible ownership over the full lifecycle of their code.

Batching up changes means you cannot practice observability-driven development; you can’t expect engineers to watch their code go out and ensure it is working in production. You don’t know when your code is going out and you don’t know who is responsible for a change in production behavior; therefore, you have no accountability or autonomy and cannot practice software ownership over your own code. Ultimately, your engineers will spend a higher percentage of their time waiting, fumbling, or dealing with tech debt, not moving the business forward.

Since so many of your existing engineers are tied up, you will need to hire many more of them, which carries higher coordination costs. Soon, you will need specialized roles such as SRE, ops, QA, build engineers, etc. to help you heroically battle each of the symptoms in isolation. Then you will need more managers and TPMs, and so on. Guess what! Now you’re a big, expensive company, and you lumber like a dinosaur.

It gets slower, so it gets harder and more complicated, so you need more resources to manage the complexity and battle the side effects, which creates even more complexity and surface area. This is the death spiral of software development. But there is another way. You can fix the problem at the source, by focusing relentlessly on the length of time between when a line of code is written and when it is fully deployed to production. Fixate on that interval, automate that interval, track that interval, dedicate real engineering resources to shrinking it over time.

Until that interval is short enough to be a functional feedback loop, all you will be doing is managing the symptoms of dysfunction. The longer the delay, the more of these symptoms will appear, and the more time your teams will spend running around bailing leaks.

How short? 15 minutes is great, under an hour is probably fine. Predictability matters as much as length, which is why human gates are a disqualifier. But if your current interval is something like 15 days, take heart—any work you put into shortening this interval will pay off. Any improvement pays dividends.
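If you want to fixate on that interval, the first step is to measure it. Here is a minimal sketch, assuming you can export a merge timestamp and a deploy timestamp for each change. The function names and the "needs work" label are made up for illustration, and the two thresholds are simply the rough numbers above. It rates the worst case rather than the median, since predictability matters as much as length.

```python
from datetime import datetime, timedelta
from statistics import median

# Rough thresholds from the text: 15 minutes is great, under an hour is probably fine.
GREAT = timedelta(minutes=15)
FINE = timedelta(hours=1)


def lead_time(merged_at: datetime, deployed_at: datetime) -> timedelta:
    """Time between a change merging to main and that change being live."""
    if deployed_at < merged_at:
        raise ValueError("deploy timestamp precedes merge timestamp")
    return deployed_at - merged_at


def rate(interval: timedelta) -> str:
    """Bucket an interval using the thresholds above (labels are illustrative)."""
    if interval <= GREAT:
        return "great"
    if interval < FINE:
        return "probably fine"
    return "needs work"


def summarize(changes: list[tuple[datetime, datetime]]) -> dict:
    """Median and worst-case lead time for (merged_at, deployed_at) pairs."""
    intervals = [lead_time(merged, deployed) for merged, deployed in changes]
    worst = max(intervals)
    return {
        "median": median(intervals),
        "worst": worst,
        # Rate the worst case: an unpredictable pipeline is not a feedback loop.
        "rating": rate(worst),
    }
```

Feed it a week of merges and deploys and you have a number to track. Whether you pull those timestamps from your version control host, your deploy tool, or a spreadsheet matters much less than looking at the number regularly.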

The way you are doing it now is the hard way

This is where lots of people look dubious. “I’m not sure my team can handle that,” they may say. “This is all very well and good for Facebook or Google, but we aren’t them.”

This may surprise you, but continuous deployment is far and away the easiest way to write, ship, and run code in production. This is the counterintuitive truth about software: making lots of little changes swiftly is infinitely easier than making a few bulky changes slowly.

Think of it this way. Which of these bugs would be easier for you to find, understand, repro, and fix: a bug in the code you know you wrote earlier today or a bug in the code someone on your team probably wrote last year?

It’s not even close! You will never, ever again debug or understand this code as well as you do right now, with your original intent fresh in your brain, your understanding of the problem and its solution space rich and fresh.

As you write your code, you should be instrumenting it with an eye to your future self half an hour from now: “How will you be able to tell if your code is doing what you wanted it to or not?” You write the code, you merge to main, wait a few minutes, and pull up your observability tool to view production through the lens of your instrumentation. Ask yourself, “is it doing what it’s supposed to do?” and “does anything else look … weird?”
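What might instrumenting with your future self in mind look like? A minimal sketch, using only the Python standard library: one structured event per unit of work, tagged with the build that produced it, so that half an hour from now you can ask whether the new code path is doing what you intended. Everything here is hypothetical (`instrumented`, `BUILD_ID`, the field names, the discount example). A real observability tool will have its own SDK, which this does not try to imitate.

```python
import json
import time
from contextlib import contextmanager

# Hypothetical identifier for the running build. In practice the deploy
# pipeline would inject this (for example, a commit SHA).
BUILD_ID = "unknown-build"


@contextmanager
def instrumented(name, emit=print, **fields):
    """Emit one structured event per unit of work, tagged with the build."""
    event = {"name": name, "build_id": BUILD_ID, **fields}
    start = time.monotonic()
    try:
        yield event  # the caller can attach more fields while it works
        event["ok"] = True
    except Exception as exc:
        event["ok"] = False
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event))


def apply_discount(cart_total, percent):
    """Example business logic that records the intent behind the change."""
    with instrumented("apply_discount", percent=percent) as event:
        discounted = cart_total * (1 - percent / 100)
        # Capture what you meant to happen, so you can check it in production.
        event["discount_applied"] = discounted < cart_total
        return discounted
```

The design choice worth copying is the `build_id` field: once every event carries it, "did my change do what I wanted?" becomes a question you can answer by filtering on the build you just shipped.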

Teams that get this rhythm down usually find more than 80% of all bugs right then and there. It’s fresh in your head, you’re primed to look for the right things, you’re instrumenting on the spot. If anything goes wrong or doesn’t look quite right, you can churn out another diff on the spot.

Conversely, teams that don’t have a practice of swift deploys and aren’t used to looking at their changes in production, well, they don’t find those bugs. (Their customers do.)

By the way, yes, it is not exactly easy to take an app that ships monthly and get it shipping ten times a day. But it is astonishingly easy to start an app off with continuous delivery from the beginning and keep it that way, never letting it slip past 15 minutes of lead time. Like Milo of Croton, who in legend lifted the same calf every day as it grew into a bull, it hardly feels like work.

So if moving your legacy apps to CD feels too heavy a lift right now, could you at least start any new repos or services or apps off with CD from the start? Any CD is better than no CD. Having even some of your code deploy automatically after changes will enforce lots of good patterns on your team, like preserving backwards compatibility between versions.

The #1 challenge for technical leadership in 2021

Why do so few teams make continuous delivery a priority? The virtues of a short lead time and tight feedback loops are so dramatic, widely known and well understood that I have never understood this. But now I think I do.

The issue is that this reads like a people problem, not a technical problem. So this gets classed as a management problem. And managers are used to solving management problems with management tools, like asking for more headcount.

It can also be a hard sell to make to upper management. You’re asking them to accept a delay on their feature roadmap in the short term and possibly less reliability for some indeterminate amount of time in order to implement something that runs absolutely counter to our basic human instincts, which tell us to slow down to increase safety.

By and large, engineers seem to be aware of the value of continuous delivery, and many are desperate to work on teams that move as swiftly and with as much confidence as you can with CI/CD. If you’re trying to hire more engineers, it makes great recruiting catnip. More importantly, it makes your team use your existing engineering cycles way more efficiently.

I would like to end with an appeal to the engineering managers and directors of the world. Would you like your teams to be passionate and engaged, firing on all cylinders, outputting at peak capacity, spending minimal time firefighting or tending to overdue technical debt payments or waiting on each other’s reviews?

Would you like your team members to look back and remember you wistfully and fondly as someone who changed the way they thought about developing software—someone who forever raised the bar in their eyes? Our fates are in your hands.

The teams who have achieved CI/CD have not done so because they are better engineers than the rest of us. I promise you. They are teams that pay more attention to process than the rest of us. Great teams build great engineers, not vice versa.

Continuous delivery helps your engineers own their code in production, but it also has many other side benefits. It forces you to do all the right things—write small diffs, review code quickly, instrument your original intent just like you comment your code, use feature flags, decouple deploys and releases, etc.
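To make "decouple deploys and releases" concrete, here is a minimal sketch of a feature flag, assuming a hypothetical in-memory flag store and made-up checkout functions. The new code path ships to production dark with every deploy, and releasing it to users is a separate decision made by flipping the flag.

```python
# Hypothetical in-memory flag store. Real systems usually read flags from a
# config service or database so they can change without a deploy.
FLAGS = {"new_checkout_flow": False}


def is_enabled(flag: str, flags: dict = FLAGS) -> bool:
    """Unknown flags default to off, so half-finished code stays dark."""
    return flags.get(flag, False)


def legacy_checkout(cart_total: float) -> str:
    """The existing behavior, kept intact while the new path is built."""
    return f"legacy:{cart_total}"


def new_checkout(cart_total: float) -> str:
    """The new behavior: deployed continuously, but not yet released."""
    return f"new:{cart_total}"


def checkout(cart_total: float, flags: dict = FLAGS) -> str:
    """Route between the old and new paths based on the flag."""
    if is_enabled("new_checkout_flow", flags):
        return new_checkout(cart_total)
    return legacy_checkout(cart_total)
```

With the flag off, `checkout(10)` takes the legacy path even though the new code is live in production. Setting `FLAGS["new_checkout_flow"] = True` releases it without another deploy, and setting it back is your rollback.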

You can either spend your life nagging people to clean up their code and follow all the best practices separately, or you can just laser focus on shortening that feedback loop and watch everything else fall into place on its own.

Is it hard? Yes, it is hard. But I hope I’ve convinced you that it is worth doing. It is life changing. It is our bridge to the sociotechnical systems of tomorrow, and more of us need to make this leap. What is your plan for achieving CI/CD in 2021?
