\u003Cfigcaption>Back of the envelope style!\u003C/figcaption>\u003C/figure>\n\u003C!-- /wp:image -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Key aspects of our plan included:\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>\u003Cstrong>Build a systems platform:\u003C/strong> This dovetailed nicely with our goal to move towards a microservices architecture. We invested years of work into building a scalable platform using containers and load balancers so that creating a new system could be done with the click of a button. This took all the \"smarts\" out of what infrastructure did and put them in a system so teams could be confident that the template they chose would work for their use case.\u003C/li>\u003Cli>\u003Cstrong>Use the cloud:\u003C/strong> Cloud services helped immensely: they took many of the tasks we would otherwise do on our own (such as managing instances and databases) and handled them externally. We could focus on what our company was good at (our content and features) and leave the infrastructure and systems know-how to the companies that are built around that.\u003C/li>\u003Cli>\u003Cstrong>Spread the knowledge: \u003C/strong>Rather than keeping all the systems knowledge in one team, we slowly spread it out to the teams that needed it. In some cases we would \"plant\" someone strong in systems skills into a team, then ensure they shared their knowledge with everyone else, both through training and playbooks.\u003C/li>\u003Cli>\u003Cstrong>Embrace \u003C/strong>\u003Ca href=\"https://stackoverflow.blog/2021/01/19/fulfilling-the-promise-of-ci-cd/\">\u003Cstrong>continuous deployment\u003C/strong>\u003C/a>\u003Cstrong>:\u003C/strong> Having a continuous integration tool do the deployments helped reduce fear of deploys and gave more autonomy to the teams. We embraced the idea that the \u003Cstrong>merge button\u003C/strong> replaced the \u003Cstrong>deploy button\u003C/strong>. 
Whoever merges the code to the main branch implicitly approves that code to go live immediately.\u003C/li>\u003Cli>\u003Cstrong>Widespread monitoring and alerting:\u003C/strong> We embraced a cloud provider for both monitoring (metrics, downtime, latency, etc.) and alerting (on-call schedules, alerts, communication) and gave all teams access to it. This put the accountability for system uptime on the teams themselves, and allowed them to manage their own schedules and decide what is and isn't considered an \"alert.\"\u003C/li>\u003Cli>\u003Cstrong>Fix alerts:\u003C/strong> Teams were encouraged to set up a process around reviewing and improving alerts, making sure that the alerts themselves were actionable (not noisy and hence ignored) and that they put time towards reducing those alerts as time went on. We ensured that our product team was on board with having this as a team goal, especially for teams that managed legacy systems that had poor alert hygiene.\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Note that Eng Cap was still available as \"subject matter experts\" for infrastructure. This meant that they could always provide their expertise when planning new systems, and they made themselves available as \"level two\" on-call in case more expert help was needed in an emergency. But we switched the responsibility: now teams have to call Eng Cap rather than the other way around, and it doesn't happen often.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:block {\"ref\":17976} /-->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-roadblocks-and-protests\">Roadblocks and protests\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>As with any large change, even a long-term one, there was no silver bullet. 
With dozens of devs and engineering managers used to the \"old way,\" we had to understand the resistance to our plans, and either make improvements or explain and educate where needed.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\u003Ch3 id=\"h-security-and-controls\">\u003Cstrong>Security and Controls:\u003C/strong> \u003C/h3>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>Q: If there is no separate team pressing the \"deploy button\", doesn't that violate segregation of duties and present issues with audits? \u003C/li>\u003Cli>A: \u003Ca href=\"https://devops.com/devops-getting-past-audit/\">Not really.\u003C/a> Especially since those pressing the \"deploy button\" don't have context on what they're allowing, all it does is add a needless slowdown. A continuous deployment process has full auditability (just check the git log) and should satisfy any reasonable audit.\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\u003Ch3 id=\"h-developer-happiness\">\u003Cstrong>Developer Happiness:\u003C/strong> \u003C/h3>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>Q: Devs don't want to wake up in the middle of the night. Won't moving this responsibility from a single team to all teams mean devs will be more miserable and increase turnover?\u003C/li>\u003Cli>A: By moving the pain into the dev teams, we are giving them the motivation to actually fix the problems. This may cause some short-term complaints, but because they are given more autonomy, in the long term they have higher satisfaction with their work. However, it is \u003Cem>crucial\u003C/em> to make sure our product roadmap devotes time and effort to fixing those problems. 
Without that, this step would likely cause a lot of resentment.\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:heading {\"level\":3} -->\n\u003Ch3 id=\"h-skill-gap\">\u003Cstrong>Skill Gap:\u003C/strong>\u003C/h3>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>Q: Handling servers and databases is a specialized skill. If we make the application teams responsible, won't we lose those skills?\u003C/li>\u003Cli>A: Teaching teams how to handle specific technologies is actually much easier than trying to teach an infrastructure team how every single system works. Spreading the knowledge out slowly, with good documentation, is much more scalable.\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-where-are-we-now\">Where are we now?\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>This was a five-year process, and we're still learning and improving, especially for our legacy systems, a few of which are still on our old platform. But for the most part, the following statements are now true:\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:list -->\n\u003Cul>\u003Cli>Teams are now empowered and knowledgeable about their systems, how they behave in production, how to know when there is a problem, and how to fix those problems.\u003C/li>\u003Cli>They are motivated to build their own playbooks to spread the knowledge, so that new team members can share the load of being on call.\u003C/li>\u003Cli>They are motivated to make their alerts \"smart\" and actionable, so that they know when systems have problems and can remove noisy alerts.\u003C/li>\u003Cli>They are motivated and empowered to fix things that cause alerts by building that work into their product roadmap.\u003C/li>\u003Cli>Often, the number of alerts, time to fix, validity, etc. 
are part of sprint metrics and reports.\u003C/li>\u003Cli>Alerts and emergencies are generally trending downwards and are not part of everyday life, even for legacy systems. On the whole, when an alert happens, it's nearly always something new that the team hasn't seen before.\u003C/li>\u003C/ul>\n\u003C!-- /wp:list -->\n\n\u003C!-- wp:heading -->\n\u003Ch2 id=\"h-conclusion\">Conclusion\u003C/h2>\n\u003C!-- /wp:heading -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>Empowering your teams to do their own operations can pay big dividends down the road. Your company needs to be in the right place and be willing to make the right investments to enable it. But as a decision, it's one that I'd urge any company to continually revisit until they're ready to take the plunge.\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>---\u003C/p>\n\u003C!-- /wp:paragraph -->\n\n\u003C!-- wp:paragraph -->\n\u003Cp>You can visit my \u003Ca href=\"https://github.com/dorner\">GitHub profile page\u003C/a> for more articles!\u003C/p>\n\u003C!-- /wp:paragraph -->","html","2021-05-24T14:08:23.000Z",{"current":688},"how-developers-can-be-their-own-operations-department",[690,698,702,707],{"_createdAt":691,"_id":692,"_rev":693,"_type":694,"_updatedAt":691,"slug":695,"title":697},"2023-05-23T16:43:21Z","wp-tagcat-code-for-a-living","9HpbCsT2tq0xwozQfkc4ih","blogTag",{"current":696},"code-for-a-living","Code for a Living",{"_createdAt":691,"_id":699,"_rev":693,"_type":694,"_updatedAt":691,"slug":700,"title":403},"wp-tagcat-continuous-deployment",{"current":701},"continuous-deployment",{"_createdAt":691,"_id":703,"_rev":693,"_type":694,"_updatedAt":691,"slug":704,"title":706},"wp-tagcat-continuous-integration",{"current":705},"continuous-integration","continuous integration",{"_createdAt":691,"_id":708,"_rev":693,"_type":694,"_updatedAt":691,"slug":709,"title":710},"wp-tagcat-devops",{"current":710},"devops","How developers can be their own operations 
department",[713,719,725,730],{"_id":714,"publishedAt":715,"slug":716,"sponsored":12,"title":718},"9fd8968d-abaa-4253-b14b-3129c6e85408","2025-09-10T17:00:00.000Z",{"_type":10,"current":717},"ai-vs-gen-z","AI vs Gen Z: How AI has changed the career pathway for junior developers",{"_id":720,"publishedAt":721,"slug":722,"sponsored":12,"title":724},"1d082483-6dc6-424b-8b09-9c84b54779da","2025-09-02T17:00:00.000Z",{"_type":10,"current":723},"back-to-school-developers-at-stack-overflow-have-some-advice-for-you","Back to school? Developers at Stack Overflow have some advice for you",{"_id":726,"publishedAt":721,"slug":727,"sponsored":12,"title":729},"5cd91820-9515-4be5-87ae-e919fd443c18",{"_type":10,"current":728},"getting-started-on-stack-overflow-a-step-by-step-guide-for-students","Getting started on Stack Overflow: a step-by-step guide for students",{"_id":731,"publishedAt":721,"slug":732,"sponsored":12,"title":734},"614538a9-c352-4024-adf1-fa44a9f911b6",{"_type":10,"current":733},"stack-overflow-is-helping-you-learn-to-code-with-new-resources","Stack Overflow is helping you learn to code with new resources",{"count":736,"lastTimestamp":737},11,"2023-05-25T09:47:33Z",["Reactive",739],{"$sarticleModal":740},false,["Set"],["ShallowReactive",743],{"sanity-S6r1oENM0nn4oeRnO_NrhatP2y82M2JkSE3Z3OJFx6U":-1,"sanity-comment-wp-post-18112-1758196257833":-1},"/2021/05/24/how-developers-can-be-their-own-operations-department"]