“This should never happen. If it does, call the developers.”
[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. Please enjoy our favorite work this year and we’ll see you in 2022.]
Running a successful website requires the cooperation of operations (ops) and developers (devs) – thus the term “devops”. When there is conflict, antagonism, or disharmony, the website suffers. Simply telling people to “get along” is not effective. Authentic cooperation is the result of providing a structure that enables and encourages it.
Often one such area of conflict and disharmony involves on-call escalations. The monitoring system alerts ops folks that there is a problem or a growing issue that could lead to an outage. Whoever is “on call” from operations must handle the issue or escalate the issue to developers, no matter the time of day or night.
Therein lies the potential for conflict. Too many escalations wear down the developers. The disharmony begins with exclamations like “I just fixed something easy! Why can’t those operations folks do their jobs?”
Operations gets defensive. “How was I supposed to know?” or, “I just asked a question and now they’re being jerks about it!”
The disharmony can start in operations too. “Oh great! Another surprise from the devs!”
You can’t force people to cooperate, but you can set up structures and glide-paths that create an environment for cooperation.
One such paradigm is the dynamic runbook feedback loop.
Dynamic runbook feedback loops
A runbook is a set of procedures for how to respond in situations such as receiving an alert from the monitoring system. The goal of the feedback loop is to create a mechanism where subject matter experts create runbooks, but both devs and ops are empowered to improve them in ways that reduce the number of escalations and improve cooperation.
The goal of this process is to establish the proper balance of effort versus value when crafting documentation. It is a waste of effort for someone to write a 100-page treatise about a simple issue; but a runbook that is too brief isn’t useful. This paradigm leads to the right balance. A runbook starts at the size the original author believed to be appropriate given the knowledge at hand but evolves to the right size as it gets exercised. Runbooks that are rarely or never needed receive less effort and frequently-used runbooks are updated, optimized, and possibly turned into automated responses.
This is in striking contrast to organizations where runbooks are handed down from on high without input from people with direct involvement. Often these runbooks either can’t be changed at all, or changing them requires a heavyweight process that impedes any improvement.
Write it down
The easiest way to distribute team knowledge is to have good documentation, allowing anyone who encounters an unfamiliar issue to follow a tested process to resolve it. That’s what runbooks are supposed to be.
Our preferred format is a bullet list that ops can use to resolve alerts. When an alert arrives, ops follows the instructions in the runbook. If they get to the end of the bullet list and the issue is still unresolved, they escalate to the developers.
An organization’s developers need to write the docs, obviously. But all too often, docs are assigned low priority and pushed to the back burner in order to ship new products, feature upgrades, and other work deemed mission-critical. The developers never get around to them. Management needs to include runbook creation as part of the project.
Developers are not always comfortable with writing. Staring down a blank page can be intimidating. To overcome the blank page syndrome, give them a template. It won’t even feel like you’re asking them to write a document; you just want them to fill out this form. If you make the template a series of bullet points, each being the procedure to handle a specific alert, the document becomes almost trivial to complete. The basic piece of documentation here is how to deal with every alert.
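As a sketch of what such a template might look like (the alert name, steps, and escalation contact below are placeholders, not a prescribed format):

```markdown
## Alert: web-frontend-high-error-rate   <!-- placeholder alert name -->

**What this alert means:** one or two sentences from the developer.

**Steps to resolve:**
- Check the service status (command to run, expected output)
- If the output looks wrong, restart the service / clear the backlog / etc.
- Verify the alert clears within a few minutes

**If none of the above works:** escalate to the on-call developer and note which step failed.
```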
To motivate devs, remind them that the more time they spend writing runbooks, the less likely issues are to escalate to them. Every company wants to reduce escalations—especially the developers who have to field those escalations during off hours—and the path to get there is through documentation.
The other side of this process—the alerts—should reinforce the idea that every alert has a corresponding process in the runbooks by including the runbook URL in the alert text. This increases the likelihood that the documented procedure will actually be followed.
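As a minimal sketch, assuming a Prometheus-style alerting setup (one common stack; yours may differ), the link can ride along as an annotation on the alerting rule. The rule name, threshold, and URL are placeholders, and runbook_url is a widely used convention rather than anything built in:

```yaml
groups:
  - name: web-frontend
    rules:
      - alert: HighErrorRate                     # placeholder alert name
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
          runbook_url: "https://teams.example.com/runbooks/web-frontend-high-error-rate"
```

Whatever the tooling, the point is the same: the on-call person should never have to go hunting for the document that tells them what to do next.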
Channeling feedback into better runbooks
Here’s where the feedback loop begins.
At some point, an ops person is going to get an alert that is insufficiently documented and needs to be escalated to the developers. Once you go through every step on the runbook and exhaust every idea you have, it’s time to talk to the folks who wrote the code.
This is the first point of feedback: the ops engineer reads the runbook and tries to implement the process documented therein. The dev may have documented every alert, but this is where the rubber hits the road; if a runbook doesn’t fix a problem when it says it does, the ops person should correct it immediately or at least identify the problem. When the ops person escalates to the developer, that’s the start of a feedback loop.
If a problem does get escalated, then that should trigger an update to the doc. Whether it’s the ops engineer adding what they were unsure about, noting the use case that triggered the escalation, or the developer augmenting the bullet list, the end result makes operations more self-sufficient and reduces future escalations. Maybe your clever ops engineer wrote a shell script that bundles ten bullets into a single command; edit the runbook and include it.
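A hypothetical example of what such a wrapper script might look like (the service name, paths, and health-check URL are made up for illustration; your runbook steps will differ):

```bash
#!/usr/bin/env bash
# restart-web-frontend.sh: bundles several runbook bullets into one command.
# Every name and command here is a placeholder; each step mirrors a bullet in the runbook.
set -euo pipefail

echo "Step 1: checking disk space on the log volume..."
df -h /var/log

echo "Step 2: checking current service status..."
systemctl status web-frontend --no-pager || true

echo "Step 3: restarting the service..."
sudo systemctl restart web-frontend

echo "Step 4: verifying the service came back..."
sleep 5
if curl -fsS http://localhost:8080/healthz > /dev/null; then
  echo "OK: health check passed"
else
  echo "FAILED: health check did not pass -- escalate per the runbook" >&2
  exit 1
fi
```

If the script can't resolve the issue, its output gives the ops engineer something concrete to attach when escalating.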
And yet, sometimes you run into unknown or unpredictable problems. At a previous company, I got a runbook that was just one bulleted item: “This should never happen. If it does, call the developers.” I spoke with the developer and, without accusing them of being lazy, asked if this was the best runbook entry they could make. It turned out it was! They were monitoring for an assertion that should never happen, but if it ever did happen, they wanted to know. It was the kind of situation where a fix could only be determined after the first time it happened.
It is important to remove speed bumps to fixing and updating runbooks. Management should create a culture where everyone is expected and encouraged to edit frequently, in real-time, not at the end of a project. I recommend that ops engineers go one step further. When you read a runbook, do it in edit mode. It removes the temptation to procrastinate about updates. You may think, ‘I’m in the middle of addressing a problem in production, I’ll come back later and edit this.’ No, you’re never going to come back. There’s never a later. If you’re in edit mode, you can fix that broken comma, paste in an updated command, or at least make a note that something needs improvement.
If a step doesn’t work or you’re not sure what does, still put in the edit. These are living documents, so not every edit has to be perfect. Put in a note to self or the developer: “that last statement is confusing” or “bullet three didn’t work.” At least the next person that sees this will know to be careful when they get to bullet item number three. Identifying the problem is better than silence.
The feedback loop gives developers control over how often they are escalated to. If the devs feel like they’re being escalated to too much, well, physician, heal thyself. Developers can reduce future escalations by improving the document. Either add the procedures that would have kept ops from interrupting your dinner, or make escalation at that point an explicit, deliberate step in the runbook.
The feedback loop gives operations the confidence to escalate without feeling like they are nagging or giving up too soon. Operations can be hesitant to escalate an issue for fear of creating a “boy who cried wolf” situation or that they will look stupid for not knowing how to handle a situation. Instead, operations can “show their homework” to justify their escalation. It is difficult to be upset that your dinner was interrupted when the operations person calling you can clearly show they followed the written procedure and the results of each step.
The benefit of this feedback loop is that it gives you as much or as little documentation as the process needs, and as many or as few escalations as needed. It empowers the developers to fix the problem of getting too many escalations.
Share, don’t hoard
Everyone should want to build an organization where information is shared, and that sharing is rewarded.
I once worked at a company where one person was celebrated because they were the most knowledgeable at handling every kind of on-call situation. In hindsight, this was a red flag. Why was one person so much better than everyone else? Are we allowed to have crappy service the weeks other people are on call? I learned from that experience.
I now prefer to celebrate the people that share their knowledge so that everyone is great at on-call duty. We should honor those with more experience, but celebrate the people that share their knowledge in ways that empower everyone to be better. This includes everything from one-on-one teaching, to answering questions on Stack Overflow, to writing runbooks. People that do that enable excellence no matter whose turn it is to carry the pager.
We can reframe this in terms of power dynamics. The old-school way to maintain power and influence at a company was to hoard information. I remember admiring a person at a previous company because of what information they held. Need a [insert technical task] done? They’re the only person that knows how to do it. Visit their office. Show your respect. Prostrate yourself before their higher mind. If they deem you worthy, they will do the tasks for you. If you want a smoothly run company, toss that toxic attitude into the dustbin of history.
The new way is the opposite. Power comes from how much you give away. We admire the person that shares their knowledge: the person that teaches people how to do things instead of doing things for them; that documents constantly, not at the end of the project; that is generous with their knowledge in everything they do. They are powerful because everyone points to them and says, “that’s the person that made me successful!”
Easily transferable knowledge
Here at Stack Overflow, we have a collection in our Teams instance that contains all of our runbooks and related procedures. We push all of our developers to use this feedback process with the docs. The format—articles and questions that anyone can comment on or edit—makes feedback easy. The most commonly used runbooks are heavily edited and refined. Less recently edited runbooks are easy to identify and update. The level of effort reflects the need.
Then there are the knock-on effects, bonus efficiencies that come from having field-tested knowledge easily available. The technologies and skills mentioned in the runbooks inform us when writing the job advertisement for new operational staff. Once those people are hired, the runbooks become a training tool: newly hired operations people can be walked through each runbook, or use the runbooks as a self-study aid. Once they have reviewed all the runbooks, they are ready for on-call.
Key to all of these feedback loops is the ease with which everyone—developers, SREs, and new hires—can ask questions, raise issues, and offer suggestions for better solutions. Sometimes the anxiety of being seen as less knowledgeable can prevent people from jumping in and commenting. You can alleviate that by explicitly providing a space for questions in every runbook. Even better, you can use a platform like Stack Overflow for Teams that is designed to gather business-critical information from questions and answers.
A good feedback loop won’t solve every problem in the on-call process. But it will make it a smoother one, from debugging an issue that pops up in production all the way through hiring and training and onboarding. When done well, it’s a very effective way to improve your organization through small, ingrained processes.
Tags: devops, documentation, sre
26 Comments
Big fights kept breaking out because support didn’t know the first thing about TLS Certificates and would bug devs for Customer on-prem problems with Certificates. Guess what? Devs issue dev certificates from the internal root certificate. Customers need a certificate from the customer’s root, or more often than not, a publicly verified certificate. Devs don’t know how to issue those.
Interesting. So what happened? Did everyone eventually figure out that for production APIs a TLS cert is needed from a public Certificate Authority like Let’s Encrypt?
Thank you for your help.
I _love_ writing documentation, by now. It’s great to offload stuff from my brain into separate storage, so that I don’t have to keep that old baggage in my head. It’s great to be able to read my own documentation after a year and knowing that this is not some hazy and incomplete memory, but that it is as correct as I could write it. And I hate to be the only person that knows all the stuff – because that means people will keep interrupting me with questions.
Unfortunately it took several years for me to realize this (and the company itself hasn’t quite realized this yet); so there’s lots of stuff from the past years that’s still undocumented. Don’t let that happen! Write docs right from the beginning!
Also, the “living documentation” works great as well. Inspired by http://web.archive.org/web/20071013071537/http:/www.hacknot.info/hacknot/action/showEntry?eid=97 I’ve now created step-by-step docs for building and working with my current project; and whenever this doc doesn’t work for new devs, it is amended. This finds the relevant mistakes and leads to useful docs.
But indeed not all devs like to write text at all; and I guess that causes those devs to postpone doc writing forever. If they are sufficiently important it can help to put a technical writer at their side, who works closely with the dev and does just the docs; but I’m not sure if that scales.
Very good. I would add one thing: every note/edit should be given an explicit date. The initial runbook recipe for something, for sure. But also each note added by ops or devs should have the date. Because later on, even with – or maybe especially with – the incremental changes you perform to keep the runbook “up to date” – stuff goes out of date and no longer applies. When someone annotates with “do this instead” or “this didn’t work” – then 3/8/15 months later you’re going to be seeing a page full of these notes and you’ll want to know which ones are more current.
Because if there’s something that happens much less often than keeping the runbook “up to date”, it’s removing the old stuff that no longer applies. Nobody does it, because you never know if your current problem is a “one off” or is new behavior that you’ll be seeing again, so you don’t know if the old stuff is out of date or just doesn’t apply this one time. So you never delete. But you _do_ want to be able to judge what’s more reliable, and one way is to know what’s more recent (and when – since you can correlate it with releases/deployments).
Re: “The new way is the opposite. Power comes from how much you give away. We admire the person that shares their knowledge: the person that teaches people how to do things instead of doing things for them; that documents constantly, not at the end of the project; that is generous with their knowledge in everything they do. They are powerful because everyone points to them and says, “that’s the person that made me successful!””
Also this makes them more disposable, thank goodness. When recession hits, you can terminate their contract, offer them an inferior one and if they refuse to sign the latter (how ungrateful), you can remain confident as they sulk off into the sunset because you know that they’re no longer essential. Never give a sucker an even break.
Have to (marginally) agree with this.
I get a bit put off sometimes when people want all the knowledge but don’t want to engage with the documentation. A kind of “I want to know how to do this, so you who has spent so much time working on this must teach me so that I can replace you” sort of mentality.
I think the other side to the utopian picture of a well-functioning work environment that has been painted above is that people either have their own individual utility, or they are just a never-ending supply of disposable lemmings. Business obviously prefers the latter, but it is the former which yields personal satisfaction and growth. I think therein lies the eternal struggle.
I believe the eternal struggle is the hardest for me.
“Also this makes them more disposable” – rightly stated. The person who documents (on almost personal time?) can easily be packed off because everyone knows that the knowledge is available in some document somewhere. It is also unfair because this person is diligent at this aspect of work to the extent that any other mess in the team is also siphoned to this person for understanding and documenting. In my experience, the people who document do it, and others don’t. Updating the escalation protocol, standard operating procedures, etc. mentioned in this article is almost never done on a rotational basis, which causes good people to get tired and keep shifting teams or companies. As someone else has commented, management does find a certain excitement in identifying these gaps and setting up mountains of meeting hours in a bid to resolve them, giving themselves work to do.
In-house code for a small company: I have an error message in the code that says if you see it to tell . It should only be triggered if there was a deployment error. Fortunately for me, all problem reports go to him, I only get the stuff he can’t figure out (which mostly comes down to him messing up a config or script, or edge cases parsing the output of another program) or if he’s on vacation. (The reason that report barfs is that you still have the old one open in an editor!)
I think all the points here are very accurate, Thomas. Thank you for the blog!
One can even use a ticket system, such as MantisBT, which I like for some processes, for feedback-loop tracking as well. I keep thinking about Confluence when I read your article, Thomas, which is the one I came off using. I personally had been documenting after the project a lot, so I like your point about documenting as you go along as well. I was also encouraging use of the comment box at the bottom as that question space. I would initially use the doc as a spec for me; after use, the specs were built into docs (or moved off Confluence). So after the project it became a maturing doc ready for later team feedback during use. Normally the person using the doc was me, so I would update it as I used it, and have others review and comment. During team meetings you can remind others to be sure to review and comment on the docs, so the owner can update them.
One of the other guys was doing admin work now and then to complement what I was doing, but most of it was my duty, so I always had a personal campaign going on to share the documented knowledge, get feedback, then enable the team to be able to do these tasks when I wasn’t available. Sometimes the stakeholders’ priorities don’t include spending effort on this, hence my personal campaign. This is where sovereign engineering teams, who have the flexibility to make top-level decisions on infrastructure changes, tend to rapidly increase in pipeline/automation implementation and use, bringing the infrastructure rapidly up into modern times.
I remember having to contact the devs about an application that was using up too much system resources, or had some other weird back-end issue, and really had a hard time getting them or the directorship to see that these were priorities for them to add to their process. So, as you mentioned, this did always fall back on me, and I just dealt with it, sometimes made the fixes to the devs’ app code bases, and sometimes even added steps to the personal campaign that tried to educate the team about these new areas of process development.
I would highly encourage Corporate Enterprise to adopt these ideals.
Whenever I read “We do such and such”, I always ask, who’s the We? When the author of this piece says “We admire the person that shares their knowledge”, is that our colleagues or management? The problems come when management rewards somebody for always being the hero who sorts out the crisis, rather than ensuring that crises don’t happen in the first place.
I’ve also seen managers who like to provoke crises themselves occasionally. Management is fundamentally a dull pastime and they need something to liven up their day.
Me, I’m like the firefighter who likes to arrange that there are never any fires. The Very Slightly Sceptical comment above points to the problem with that – it’s a recipe for being made redundant. Actually I’m a contractor, and that’s one of the reasons I enjoy that role – making myself redundant and moving on is part of the deal.
A few years ago we had the concept of a “rock star developer”, one who was supposed to have some kind of star quality and was a cut above the rest. My take on that is that rock stars make their living by being very entertaining for short periods of time. They don’t stay in any one place for long and they leave a trail of mess behind them for somebody else to clean up. I don’t see a rock star programmer making useful contributions to the run book.
Standing back from all this, I’ve worked in sectors such as banking, where operating companies spend huge amounts of money to make their systems robust and resilient. I’ve also worked in sectors where the system had been thrown together, was held together with string and sticky tape, and required a team of poorly-paid people to work all hours to keep it running. Interestingly, both types can make a lot of money for their owners.
Runbooks could be developed during testing, as the testing folks are generally going to have to exercise all features of the product, as well as set up test systems and servers. Often the issues found during this effort would serve as a basis for initial runbooks.
I guess that I am the lone dissident. After more than 40 years of programming, I fall back on three axioms. First, documentation is never kept up to date; second, no one ever reads documentation; and lastly, documentation is only as good as the day the product was launched. Don’t get me wrong; technical documentation is a requirement, but don’t expect your users to read a manual.
Personally, I prefer to be the ‘go-to guy’ for all of my software for which I am the developer but this only comes with a change in managerial thinking. Some things that management needs to understand are that software is a living, breathing, evergreen animal and changes should be allowed quickly. Also, management needs to give us the time to handle issues.
But, also, if I can give my users a better experience then I will try to do that, and I have often said, ‘Work your app until the phone stops ringing’. This is especially true right after implementation.
In my career, I have mostly written software for business processing that is internal to my company which gives me much more flexibility than if I were writing a COTS product but if I can make my business processes better then my business will run better. Lastly, being the ‘Top Dog’ for any app is not a bad thing. It can generate trust, appreciation and job security.
Yes, but the author is not talking about “users” reading the documentation; it’s for the ops teams. And as part of an ops team, I would prefer reading some docs instead of reverse-engineering your commits.
I always find it interesting to read posts like this because at our company there is no separate “operations” layer. Devs are the ones who are on-call. There is usually a rotation of some kind to share the pain. It also greatly incentivizes devs to fix problems that cause on-call in the first place. If you have a large set of runbooks / playbooks with mechanical ways to fix problems, that means you can almost always automate those fixes or redesign the system in such a way as to make them unnecessary.
We’re now at the point where being woken up due to issues is vanishingly rare (maybe once every couple of months) and they’re usually issues we’ve never seen before.
Well, this is weird. I thought the term “devops” means that development and operation are no longer separate and therefore, conflicts can’t occur in the first place. Did I miss something?
Yeah, seems like you have missed a lot!
I have also found that using “engage the developers” vs. “escalate to the developers” helps to drive a belief that the teams are equals doing different things. “Escalate” makes it sound like the developers know more or are more valuable to the organization and that isn’t true. By treating the teams as peers, you reinforce the belief that they need to partner together to make the overall end user experience better vs. making it an “us vs. them” mentality.
But the developers are more valuable to the organization though lol
Decades ago when I was the dev responding to escalations, I adopted the philosophy that I wanted to answer each question ONCE. When a question or a problem came up from Ops or a CSR, I documented the symptoms, cause and fix in a publicly accessible area. If a similar problem got referred to me again, I analyzed WHY it was coming to me: bad documentation? Lack of training of the Ops / CSR? Actually a new problem? The idea of “hoarding knowledge” and having to answer these sorts of questions over and over again never crossed my mind.
Bottom line is this isn’t something new, this is what people who think about how to handle escalation have been doing *forever*.
Great article Thomas!
This article takes me back and also raises my blood pressure. I could write volumes, but I’ll limit myself to a few examples.
Many years ago, a database reporting tool issued a new release and all of the reports were wrong. I phoned support and they informed me that the number of records read from the database was supposed to go down because of improvements to generation of the SQL query. I informed him that the number of records read had increased by a few orders of magnitude. Examining the reports revealed that the new version of the software was ignoring some types of selection criteria. However, the support person said the change in the number of records read was normal and that they had been told to simply mark the issue as resolved and not to bother the developers. The company went bankrupt a few years later.
If you have a runbook, it should have included sending a bulletin written by the developers and intended for the technically savvy at the other end. This bulletin should have a checklist that would determine what the expected behavior was and when an issue should be escalated. Without this, the manager of the support team was happy because he was able to mark all of the calls as resolved, the manager of the development team was happy because he wasn’t getting bothered by the support team, and upper management was happy because they weren’t getting bothered by lower management. However, a year or so later, upper management was very unhappy because the company went out of business.
A major social media firm had a great many complaints because users couldn’t reorder the list of groups that appeared in the profile. When you phoned support, the response was that they were aware of the problem but couldn’t find the cause, that the problem had been given a very high priority, and that they would notify me when a fix was found. The error was a JavaScript error, and by examining the page, I discovered that the problem was that they used an outer join where they should have used an inner join. I verified this by having somebody from the support staff go through a set of steps. This would have been a one-line change in a Java class or a PHP script, and locating the line should have been very simple. When I sent a letter describing this in detail, I received the standard response. When I asked the person to look at my e-mail again, the response was “sorry”. The problem was never fixed, although I believe that they did remove the web page for reordering the groups.
In another case, I was in a group writing image processing software, and we were using a proprietary software package. One of my co-workers was getting his complaints shunted aside because they were attributed to “roundoff error”. I was able to write a short program that proved to their developer that it wasn’t roundoff error, and I was able to walk the developer through his code (which I had never seen before) and explain the required fix and where it should be placed. His response was “oh”. (He was treating a parameter as passed by value when it was actually passed by reference. However, at least he was willing to listen. By the way, errors in this software, which was used by medical organizations, could cause physical harm and even death.)
In another case, I was in an internal support group and the printers producing the program listings were leaving off the first three program lines and the last three program lines on each page. Management told me that the vendor wouldn’t fix it, I wasn’t to work on it, but I was to tell anyone calling that it was being worked on with a very high priority. I knew that I would get the blame if it wasn’t fixed, so I fixed it along with the printer drivers for some plotters. It took under a week. At this organization, I was also told that we didn’t look for bugs and only fixed them when the customer complained.
The problem is that it assumes that development and operations act as a team. Team building is hard, and it’s much easier to always blame everything on the other side. Don’t complain that they don’t know what they are doing; ask them what they need to know and have in order to do their job. I had two cases where reports were being delivered late. One was due to a massive amount of manual collation and the other was due to a number of corrections that had to be applied manually to the computer generated reports. In the first case, I changed the production job so that the reports came out already collated, reducing a few man-weeks to a single man-day. In the other, I incorporated the corrections in the program that generated the reports.
To paraphrase President Kennedy, do not ask what the other groups can do for you. Instead, ask what you can do for the other groups.