[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. Please enjoy our favorite work this year and we’ll see you in 2022.]
Running a successful website requires the cooperation of operations (ops) and developers (devs) – thus the term “devops”. When there is conflict, antagonism, or disharmony, the website suffers. Simply telling people to “get along” is not effective. Authentic cooperation is the result of providing a structure that enables and encourages cooperation.
Often one such area of conflict and disharmony involves on-call escalations. The monitoring system alerts ops folks that there is a problem or a growing issue that could lead to an outage. Whoever is “on call” from operations must handle the issue or escalate the issue to developers, no matter the time of day or night.
There lies the potential for conflict. Too many escalations wear down the developers. The disharmony begins with exclamations like “I just fixed something easy! Why can’t those operations folks do their jobs?”
Operations gets defensive. “How was I supposed to know?” or, “I just asked a question and now they’re being jerks about it!”
The disharmony can start in operations too. “Oh great! Another surprise from the devs!”
You can’t force people to cooperate, but you can set up structures and glide-paths that create an environment for cooperation.
One such paradigm is the dynamic runbook feedback loop.
Dynamic runbook feedback loops
A runbook is a set of procedures for how to respond in situations such as receiving an alert from the monitoring system. The goal of the feedback loop is to create a mechanism where subject matter experts create runbooks, but both devs and ops are empowered to improve them in ways that reduce the number of escalations and improve cooperation.
The goal of this process is to establish the proper balance of effort versus value when crafting documentation. It is a waste of effort for someone to write a 100-page treatise about a simple issue; but a runbook that is too brief isn’t useful. This paradigm leads to the right balance. A runbook starts at the size the original author believed to be appropriate given the knowledge at hand but evolves to the right size as it gets exercised. Runbooks that are rarely or never needed receive less effort and frequently-used runbooks are updated, optimized, and possibly turned into automated responses.
This is in striking contrast to organizations where runbooks are handed down from “on high” without input from people with direct involvement. Often these either can’t be changed or changes require a heavyweight process that impedes any improvement.
Write it down
The easiest way to distribute team knowledge is to have good documentation, allowing anyone who encounters an unfamiliar issue to follow a tested process to resolve it. That’s what runbooks are supposed to be.
Our preferred format is a bullet list that ops can use to resolve alerts. When an alert arrives, ops follows the instructions in the runbook. If ops gets to the end of the bullet list and the issue is unresolved, then ops escalates to the developers.
An organization’s developers need to write the docs, obviously. But all too often, docs get assigned a low priority and pushed to the back burner in order to ship new product, feature upgrades, and other work deemed mission-critical. The developers never get around to writing them. Management needs to include runbook creation as part of the project.
Developers are not always comfortable with writing. Staring down a blank page can be intimidating. To overcome the blank page syndrome, give them a template. It won’t even feel like you’re asking them to write a document; you just want them to fill out this form. If you make the template a series of bullet points, each being the procedure to handle a specific alert, the document becomes almost trivial to complete. The basic piece of documentation here is how to deal with every alert.
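As an illustration of how small such a template can be, here is a minimal sketch that prints a fill-in-the-blanks skeleton for one alert. The function name `runbook_template`, the field labels, and the alert name `DiskSpaceLow` are hypothetical choices for this example, not a format prescribed by the process described here.

```python
def runbook_template(alert_name: str, steps: int = 3) -> str:
    """Return a fill-in-the-blanks runbook skeleton for a single alert.

    The layout is an assumption made for illustration: a title, a
    one-sentence description, a bullet list of steps, an explicit
    escalation bullet, and a space for questions.
    """
    lines = [
        f"Runbook: {alert_name}",
        "What this alert means: <one sentence>",
        "Steps:",
    ]
    # One bullet per step; the author replaces each placeholder.
    for number in range(1, steps + 1):
        lines.append(f"  * Step {number}: <command or check, and the expected result>")
    # The final bullet makes the escalation path explicit.
    lines.append("  * If the issue is still unresolved, escalate to: <team or person>")
    lines.append("Questions and notes: <anyone may add here>")
    return "\n".join(lines)


print(runbook_template("DiskSpaceLow"))
```

Handing a developer this skeleton for each alert turns "write documentation" into "fill in the blanks", which is the point of the template.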
To motivate devs, remind them that the more time they spend writing runbooks, the less likely issues are to be escalated to them. Every company wants to reduce escalations (especially the developers who have to field those escalations during off hours), and the path to get there is through documentation.
The other side of this process, the alerts, should reinforce the idea that every alert has a corresponding procedure in the runbooks: include the runbook URL in the alert text. This increases the likelihood that the procedure will be followed.
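One way to put the runbook URL in the alert text is to derive it from the alert name when the message is built. The sketch below assumes a hypothetical base URL (`https://runbooks.example.com/alerts/`) and a hypothetical `format_alert` helper; a real monitoring system will have its own way of templating alert messages.

```python
from urllib.parse import quote

# Hypothetical location of the runbook collection; replace with your own.
RUNBOOK_BASE_URL = "https://runbooks.example.com/alerts/"


def format_alert(alert_name: str, summary: str) -> str:
    """Build alert text that always ends with a link to the matching runbook."""
    # Percent-encode the alert name so spaces and slashes are URL-safe.
    runbook_url = RUNBOOK_BASE_URL + quote(alert_name, safe="")
    return f"[ALERT] {alert_name}: {summary}\nRunbook: {runbook_url}"


print(format_alert("Disk Space Low", "/var is 95% full on web-03"))
```

Because the link is generated from the alert name, an alert without a runbook shows up as a dead link the first time someone clicks it, which is itself useful feedback.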
Channeling feedback into better runbooks
Here’s where the feedback loop begins.
At some point, an ops person is going to get an alert that is insufficiently documented and needs to be escalated to the developers. Once you go through every step on the runbook and exhaust every idea you have, it’s time to talk to the folks who wrote the code.
This is the first point of feedback: the ops engineer reads the runbook and tries to implement the process documented therein. The dev may have documented every alert, but this is where the rubber hits the road; if a runbook doesn’t fix a problem when it says it does, the ops person should correct it immediately or at least identify the problem. When the ops person escalates to the developer, that’s the start of a feedback loop.
If a problem does get escalated, then that should trigger an update to the doc. Whether it’s the ops engineer adding what they were unsure about, noting the use case that triggered the escalation, or the developer augmenting the bullet list, the end result makes operations more self-sufficient and reduces future escalations. Maybe your clever ops engineer wrote a shell script that bundles ten bullets into a single command; edit the runbook and include it.
And yet, sometimes you run into unknown or unpredictable problems. At a previous company, I got a runbook that was just one bulleted item: “This should never happen. If it does, call the developers.” I spoke with the developer and, without accusing them of being lazy, asked if this was the best runbook entry they could make. It turned out it was! They were monitoring for an assertion that should never happen, but if it ever did happen, they wanted to know. It was the kind of situation where a fix could only be determined after the first time it happened.
It is important to remove speed bumps to fixing and updating runbooks. Management should create a culture where everyone is expected and encouraged to edit frequently, in real time, not at the end of a project. I recommend that ops engineers go one step further. When you read a runbook, do it in edit mode. It removes the temptation to procrastinate about updates. You may think, “I’m in the middle of addressing a problem in production, I’ll come back later and edit this.” No, you’re never going to come back. There’s never a later. If you’re in edit mode, you can fix that misplaced comma, paste in an updated command, or at least make a note that something needs improvement.
If a step doesn’t work or you’re not sure what does, still put in the edit. These are living documents, so not every edit has to be perfect. Put in a note to self or the developer: "that last statement is confusing" or "bullet three didn’t work." At least the next person that sees this will know to be careful when they get to bullet item number three. Identifying the problem is better than silence.
The feedback loop gives developers control over how often they are escalated to. If the devs feel like they’re being escalated to too much, well, physician, heal thyself. Developers can reduce future escalations by improving the document. Either add those procedures that would have kept ops from interrupting your dinner or put in deliberate escalation calls.
The feedback loop gives operations the confidence to escalate without feeling like they are nagging or giving up too soon. Operations can be hesitant to escalate an issue for fear of creating a “boy who cried wolf” situation or that they will look stupid for not knowing how to handle a situation. Instead, operations can “show their homework” to justify their escalation. It is difficult to be upset that your dinner was interrupted when the operations person calling you can clearly show they followed the written procedure and the results of each step.
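“Showing your homework” can be as simple as recording each runbook step and its result as you go, then pasting the summary into the escalation. The following is a minimal sketch under that assumption; the class names `StepResult` and `EscalationNote` and the sample alert are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepResult:
    step: str    # the runbook bullet that was followed
    result: str  # what actually happened


@dataclass
class EscalationNote:
    alert_name: str
    steps: List[StepResult] = field(default_factory=list)

    def record(self, step: str, result: str) -> None:
        """Record one runbook step and its outcome, in the order tried."""
        self.steps.append(StepResult(step, result))

    def render(self) -> str:
        """Return a plain-text summary suitable for an escalation message."""
        lines = [f"Escalation for alert: {self.alert_name}", "Runbook steps tried:"]
        for number, entry in enumerate(self.steps, start=1):
            lines.append(f"  {number}. {entry.step} -> {entry.result}")
        if not self.steps:
            lines.append("  (none recorded)")
        return "\n".join(lines)


note = EscalationNote("QueueBacklog")
note.record("Restarted the worker service", "backlog kept growing")
note.record("Checked disk space on the queue host", "42% used, not the cause")
print(note.render())
```

The same summary doubles as raw material for the runbook update that should follow the escalation.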
The benefit to this feedback loop is that it gives you as much or as little documentation as the process needs, and as many or as few escalations as needed. It empowers the developers to fix the problem of getting too many escalations.
Share, don’t hoard
Everyone should want to build an organization where information is shared and that sharing is rewarded.
I once worked at a company where one person was celebrated because they were the most knowledgeable at handling every kind of on-call situation. In hindsight, this was a red flag. Why was one person so much better than everyone else? Are we allowed to have crappy service the weeks other people are on call? I learned from that experience.
I now prefer to celebrate the people that share their knowledge so that everyone is great at on-call duty. We should honor those with more experience, but celebrate the people that share their knowledge in ways that empower everyone to be better. This includes everything from one-on-one teaching, to answering questions on Stack Overflow, to writing runbooks. People that do that enable excellence no matter whose turn it is to carry the pager.
We can reframe this in terms of power dynamics. The old school way to maintain power and influence at a company was to hoard information. I remember admiring a person at a previous company because of what information they held. Need a [insert technical task] done? They’re the only person that knows how to do it. Visit their office. Show your respect. Prostrate to their higher mind. If they deem you are worthy, they will do the tasks for you. If you want a smoothly run company, toss that toxic attitude into the dustbin of history.
The new way is the opposite. Power comes from how much you give away. We admire the person that shares their knowledge: the person that teaches people how to do things instead of doing things for them; that documents constantly, not at the end of the project; that is generous with their knowledge in everything they do. They are powerful because everyone points to them and says, “that’s the person that made me successful!”
Easily transferable knowledge
Here at Stack Overflow, we have a collection in our Teams instance that contains all of our runbooks and related procedures. We push all of our developers to use this feedback process with the docs. The format—articles and questions that anyone can comment on or edit—makes feedback easy. The most commonly used runbooks are heavily edited and refined. Less recently edited runbooks are easy to identify and update. The level of effort reflects the need.
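If your documentation platform exposes a last-edited date for each runbook, spotting the least recently edited ones takes only a few lines. This is a sketch under that assumption: the runbook names, the dates, and the 180-day threshold are made up for illustration, and it does not use any Stack Overflow for Teams API.

```python
from datetime import date
from typing import Dict, List


def least_recently_edited(
    last_edited: Dict[str, date], today: date, max_age_days: int = 180
) -> List[str]:
    """Return runbook names not edited within max_age_days, oldest first."""
    stale = [
        (edited_on, name)
        for name, edited_on in last_edited.items()
        if (today - edited_on).days > max_age_days
    ]
    # Sorting the (date, name) pairs puts the oldest edits first.
    return [name for _, name in sorted(stale)]


# Hypothetical data: runbook name -> date of last edit.
runbooks = {
    "DiskSpaceLow": date(2021, 11, 1),
    "QueueBacklog": date(2021, 1, 15),
    "CertExpiring": date(2020, 6, 1),
}
print(least_recently_edited(runbooks, today=date(2021, 12, 1)))
```

An old edit date is not a problem in itself, since a rarely triggered alert deserves little effort, but the list is a cheap starting point for a periodic review.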
Then there are the knock-on effects, bonus efficiencies that come from having field-tested knowledge easily available. The technologies and skills mentioned in the runbooks inform us when writing the job advertisement for new operational staff. Once staff are hired, runbooks can be used as a training tool. Newly hired operations people can be walked through each runbook, or use the runbooks as a self-study aid. Once they have reviewed all the runbooks, they are ready for on-call.
Key to all of these feedback loops is the ease with which everyone (developers, SREs, and new hires) can ask questions, raise issues, and offer suggestions for better solutions. Sometimes the anxiety of being seen as less knowledgeable can prevent people from jumping in and commenting. You can alleviate that by explicitly providing a space for questions in every runbook. Even better, you can use a platform like Stack Overflow for Teams that is designed to gather business-critical information from questions and answers.
A good feedback loop won’t solve every problem in the on-call process. But it will make it a smoother one, from debugging an issue that pops up in production all the way through hiring and training and onboarding. When done well, it’s a very effective way to improve your organization through small, ingrained processes.