The focus is on understanding the connections between components and how they are affected by failures. In a “boxes and wires” diagram, most people concentrate on specifying the boxes and their failure modes and are less precise about the information flowing between them. With STPA, the focus is on the wires: what control information flows across them, what happens if those flows are disrupted, and the models that consume the information and drive control actions.

There are two main steps, and they form good checklists for thinking about your own design. First, identify the potential for inadequate control of the system that could lead to a hazardous state, which may result from inadequate control or enforcement of the safety constraints. Second, examine each potentially hazardous control action to see how it could occur. Evaluate controls and mitigation mechanisms, looking for conflicts and coordination problems, and consider how controls could degrade over time, using techniques like change management, performance audits, and incident reviews to surface anomalies and problems with the system design.

If we take the general STPA model and map it to a specific application, such as a financial services API that collects customer requests and performs actions, then the human controller monitors the throughput of the system to make sure it’s completing actions at the expected rate. The automated controller could be an autoscaler that looks at the CPU utilization of the controlled process and scales the number of instances supporting the traffic up and down to keep CPU utilization between a fixed minimum and maximum level.

If the service CPU utilization maxes out and throughput drops sharply, the human controller is expected to notice and decide what to do about it. The controls available to them are to change the autoscaler limits, restart the data plane or control plane systems, or roll back to a previous version of the code.

The hazards in this situation are that the human controller could do something that makes it worse instead of better. They could do nothing, because they aren’t paying attention. They could reboot all the instances at once, which would stop the service completely. They could freak out after a large drop in traffic caused by many customers deciding to watch the Super Bowl on TV, and take an action before it is needed. They could act too late, eventually noticing after the system has been degraded for a while and only then increasing the autoscaler maximum limit. They could do things in the wrong order, like rebooting or rolling back before they increase the autoscaler. They could stop too soon, increasing the autoscaler limit but not far enough to get the system working again, and go away assuming it’s fixed. They could spend too long rebooting the system over and over again. The incident response team could get into an argument about what to do, or multiple people could make different changes at once. In addition, the runbook is likely to be out of date and contain incorrect information about how to respond to the problem in the current system. I’m sure many readers have seen these hazards in person!
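These hazards map onto STPA’s standard categories of unsafe control actions: not provided when needed, provided when not needed, provided too early, too late, or in the wrong order, and stopped too soon or applied too long. As a rough illustration, here is a minimal sketch (the control actions and categories are just the ones from the example above; the function and variable names are hypothetical) of how those categories can be turned into a mechanical review checklist:

```python
# Sketch: generate an STPA-style review checklist for the example above.
# The control actions and categories come from the scenario in this post;
# the structure and names are hypothetical.

CONTROL_ACTIONS = [
    "change autoscaler min/max limits",
    "restart data plane instances",
    "restart the control plane",
    "roll back to the previous code version",
]

# STPA's categories of unsafe control actions, paraphrased.
UCA_CATEGORIES = [
    "not provided when needed (operator not paying attention)",
    "provided when not needed (reacting to a Super Bowl traffic dip)",
    "provided too early, too late, or in the wrong order",
    "stopped too soon or applied too long (rebooting over and over)",
]

def review_checklist(actions, categories):
    """Yield one review question per (control action, category) pair."""
    for action in actions:
        for category in categories:
            yield f"How could '{action}' be {category}, and what would the impact be?"

if __name__ == "__main__":
    for question in review_checklist(CONTROL_ACTIONS, UCA_CATEGORIES):
        print(question)
```

Working through every pair is tedious, but that tedium is exactly what surfaces hazards like the ones above before an incident does.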
Each of the information flows in the system should also be examined to see what hazards could occur. In the observability flows, the typical hazards are a little different from those in the control flows. The sensor that reports throughput could stop reporting and get stuck on the last value seen. It could report zero throughput even though the system is working correctly. The reported value could numerically overflow and come through as a negative or wrapped positive value. The data could be corrupted and report an arbitrary value. Readings could be delayed by different amounts so they are seen out of order. The update rate could be set too high, so that the sensor or metric delivery system can’t keep up. Updates could be delayed so that the monitoring system shows out-of-date status and the effects of control actions aren’t seen soon enough; this often leads to over-correction and oscillation in the system, which is one example of a coordination problem. Sensor readings may also degrade over time, perhaps due to memory leaks or garbage collection activity in the delivery path.

The third area of focus is the models that make up the system, remembering the maxim “All models are wrong, some models are useful.” An autoscaler contains a simple model that decides what control action is needed based on reported utilization. This is a very simple view of the world, and it works fine as long as CPU utilization is the primary bottleneck. For example, let’s assume the code is running on a system with four CPUs, and a software update is pushed out that contains changes to locking algorithms that serialize the code so it can only make use of one CPU. The instances cannot get much more than 25% busy, so the autoscaler will scale down to the minimum number of instances, but the system will actually be overloaded, with poor throughput. Here the autoscaler fails because its model doesn’t account for the situation it’s trying to control.

The other important model to consider is the internal mental model of whoever is operating the system. That model is created by training and experience with the system. The operator gets an alert, looks at throughput and utilization, and will probably be confused. One fix could be to increase the minimum number of instances. If they are also able to see when new software was pushed out, and that this correlates with the start of the problem, then they could also roll back to a previous release as the corrective control action.

Some questions we could ask: is the model of the controlled process looking at the right metrics and behaving safely? What is the time constant and damping factor for the control algorithm? Will it oscillate, ring, or take too long to respond to inputs? How is the human controller expected to develop their own models of the controlled process and the automation, and then understand what to expect when they make control inputs? How is the user experience designed so that the human controller is notified quickly and accurately, with enough information to respond correctly, but without too much data to wade through or too many false alarms?
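To make the oscillation question concrete, here is a minimal simulation sketch of the autoscaler example. All of the numbers (demand, per-instance capacity, scaling band, step size, metric delay) are invented for illustration, but the behavior is the point: with fresh metrics the rules settle inside the band, and with a stale metric the same rules over-correct and oscillate.

```python
# Sketch: a toy autoscaler reacting to a delayed utilization metric.
# All numbers are invented for illustration.
import math

DEMAND_RPS = 1000            # steady incoming request rate
CAPACITY_PER_INSTANCE = 100  # rps one instance can handle at 100% CPU
UTIL_HIGH, UTIL_LOW = 0.7, 0.4  # autoscaler target band
MIN_INSTANCES = 2

def utilization(instances: int) -> float:
    return min(1.0, DEMAND_RPS / (CAPACITY_PER_INSTANCE * instances))

def simulate(metric_delay: int, steps: int = 20) -> list:
    instances = 10
    history = []   # utilization values the sensor has reported so far
    trace = []
    for t in range(steps):
        history.append(utilization(instances))
        observed = history[max(0, t - metric_delay)]  # possibly stale reading
        if observed > UTIL_HIGH:
            instances = math.ceil(instances * 1.25)   # scale up 25%
        elif observed < UTIL_LOW:
            instances = max(MIN_INSTANCES, math.floor(instances * 0.8))
        trace.append(instances)
    return trace

if __name__ == "__main__":
    print("no delay:  ", simulate(metric_delay=0))   # settles inside the band
    print("3-step lag:", simulate(metric_delay=3))   # over-corrects and oscillates
```

Running it prints the instance counts over time for both cases; the delayed-feedback run cycles between too few and far too many instances, which is exactly the hunting behavior that operators then feel compelled to “help” with.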
## Failover without collapsing in practice

The autoscaler example provides an easy introduction to the concepts, but the focus of this story is how to fail over without falling over, and we can apply STPA to that situation as well. We should first consider the two failover patterns that are common in cloud architectures: cross-zone and cross-region.

Availability zones are similar to datacenter failover situations where the two locations are close enough together that data can be replicated synchronously over a fairly low-latency network. AWS limits the distance between availability zones to under 100 kilometers, with latency of a few milliseconds. However, to maintain independent failure modes, availability zones are at least ten kilometers apart, in different flood zones, with separate network and power connections. Data that is written in one location is automatically updated in all zones, and the failover process is usually automatic, so an application that is running in three zones should be able to keep running in two zones without any operator input. This maintains application availability with just a few glitches and retries for some customers, and no loss of data.

The other pattern is cross-region. Regions are generally far enough apart that latency is too high to support synchronous updates. Failovers are usually manually initiated and take long enough that they are often visible to customers.

The application and supporting service configurations are different in the two cases, but the essential difference from a hazard analysis point of view is what I want to focus on. I will assume zone failovers are triggered automatically by the control plane, and region failovers are triggered manually by an operator.

![](https://lh5.googleusercontent.com/yJ4nx611iXi-OpbfQjT80hiPTt4A3sGtLvh7YchJxzSc4AVKZnT9vzwYvPrzYrPCcpZl_ioz_8ZOyi2ZFByBmoM56fQfRg8CaT8UENwYZBhiHwtphkwQILnsTqjcRLg2AHkPeWk)

In an automated cross-zone failover situation, what is likely to happen? An in-rush of extra traffic from the failed zone, plus extra work from a cross-zone request-retry storm, causes the remaining zones to struggle and triggers a complete failure of the application. Meanwhile, the routing service that sends traffic to the zones and acts as the failover switch has its own retry storm and is impacted. Confused human controllers disagree among themselves about whether they need to do something or not, faced with floods of errors, displays that lag reality by several minutes, and out-of-date runbooks. The routing control plane doesn’t clearly tell the humans whether everything is being taken care of, and the offline zone generates a flood of errors that delays and breaks other metrics.
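The request-retry storm shows up in both the data plane and the routing service, so it is worth sketching the standard mitigation: a small, bounded number of retries with capped exponential backoff and jitter, governed by a retry budget so that retries can never multiply the load on an already struggling zone. This is a generic illustration with hypothetical names and numbers, not a feature of any particular SDK:

```python
# Sketch: bounded retries with backoff, jitter, and a retry budget.
# Generic illustration; names and numbers are hypothetical, not a specific SDK.
import random
import time

class RetryBudget:
    """Allow retries only up to a fraction of recent request volume."""
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0
        # A real implementation would decay these counters over a sliding window.

    def record_request(self):
        self.requests += 1

    def record_retry(self):
        self.retries += 1

    def can_retry(self) -> bool:
        return self.retries < self.ratio * max(1, self.requests)

def call_with_retries(call, budget: RetryBudget, max_attempts: int = 3):
    """Attempt a remote call, retrying with capped exponential backoff and jitter."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.can_retry():
                raise  # fail fast instead of feeding a retry storm
            budget.record_retry()
            backoff = min(2.0, 0.1 * (2 ** attempt))  # capped exponential backoff
            time.sleep(random.uniform(0, backoff))    # full jitter
```

With a budget like this, a zone failure adds at most a bounded fraction of extra traffic to the surviving zones, instead of each layer amplifying the retries of the layer above it.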
In cross-zone failovers, human controllers should not need to do anything! However, confused and working separately, they try to fix different problems. Some of their tools don’t get used often and are broken, or are misconfigured to do the wrong thing. They eventually realize that the system has fallen over. The first few times you try this in a game day, this is what you should expect to happen. After fixing the observability issues, tuning out the retry storms, implementing zoned request scoping to avoid non-essential cross-zone calls, and setting up alert correlation tools, you should finally be able to sit back and watch the automation do the right thing without anyone else noticing.

If you don’t have the operational excellence in place to run frequent, successful zone-failure game days, then you shouldn’t be trying to implement multi-region failover. It’s much more complex, there are more opportunities for failure, and you will be building a less reliable system.

What’s likely to happen during a cross-region failover? The failed region creates a flood of errors and alerts, the flood delays and breaks other metrics, and the cross-region routing control plane doesn’t clearly inform humans that a region is unusable. Human controllers should initiate failover! As with the zone-level failover described above, they are confused and disagree among themselves. In fact, the problem is worse, because they have to decide whether this is a zone-level failure or a region-level failure, and respond appropriately.

The operators then decide to initiate failover, but they redirect traffic too quickly; extra work from a cross-region request-retry storm causes the other regions to struggle and triggers a complete failure of the application. Meanwhile, the routing service also has a retry storm and is impacted, so the operators lose control of the failover process. Sound familiar? Some readers may have PTSD flashbacks at this point. Again, the only way you can be sure that a failover will work when you need it is to run game days frequently enough that operators know what to do. Ensure retry storms are minimized, alert floods are contained and correlated, and observability and control systems are well tested in the failover situation.

Your operators need to constantly maintain their mental model of the system. This model is created by experience operating the system and supported by documentation and training. However, with continuous delivery and a high rate of change in the application, documentation and training aren’t going to be up to date. When cross-zone and cross-region failovers are added to the operating model, it gets far more complex. One way to reduce complexity is to maintain consistent patterns that enforce symmetry in the architecture. Anything that makes a particular zone or region different is a problem, so deploy everything automatically and identically in every zone and region. If something isn’t the same everywhere, make that difference as visible as possible. The Netflix multi-region architecture we deployed in 2013 contained a total of nine complete copies of all the data and services (three zones by three regions), and because we shut down zones and regions regularly, anything that broke that symmetry was discovered and fixed.
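One way to make that visibility automatic is to treat asymmetry as a test failure. The sketch below assumes you can export some description of what is deployed in each zone or region (here just a hypothetical dictionary of versions and settings, which in practice might come from your IaC state or a deployment API) and simply diffs the locations against each other:

```python
# Sketch: flag anything that is deployed differently across zones or regions.
# The deployment descriptions here are hypothetical placeholders.

def find_asymmetries(deployments: dict) -> list:
    """Return human-readable differences between per-location deployments."""
    problems = []
    locations = sorted(deployments)
    baseline_name = locations[0]
    baseline = deployments[baseline_name]
    for location in locations[1:]:
        current = deployments[location]
        for key in sorted(set(baseline) | set(current)):
            if baseline.get(key) != current.get(key):
                problems.append(
                    f"{key}: {baseline_name}={baseline.get(key)!r} "
                    f"vs {location}={current.get(key)!r}"
                )
    return problems

if __name__ == "__main__":
    deployments = {
        "us-east-1": {"api_version": "1.42.0", "autoscaler_max": "40"},
        "us-west-2": {"api_version": "1.42.0", "autoscaler_max": "40"},
        "eu-west-1": {"api_version": "1.41.7", "autoscaler_max": "20"},  # drift!
    }
    for problem in find_asymmetries(deployments):
        print("asymmetry:", problem)
```

Run on a schedule, a check like this turns “this region is special” from tribal knowledge into an alert.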
Failover is made easier by the common services that AWS provides, but it gets much more complex if each application has a unique failover architecture with little commonality across the AWS customer base. The [AWS Well-Architected Guide Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html) contains lots of useful advice, and it also encourages common practices and a common language across accounts and customers, which socializes a more consistent model across more human controllers. This standardization of the underlying services and of the human operators’ model is a big help in building confidence that it will be possible to fail over without falling over.

Where should you start? I think the most effective first step is to start a regular series of game day exercises. Train your people to work together on incident calls, get your existing dashboards and controls together, and you will have a much faster and more coordinated response to the next incident. Create your own control diagrams, and think through the lists of hazards defined by STPA. Gradually work up from game days using simulated exercises, to test environments, to hardened production applications, fixing things as you go.
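Even the game day itself can be captured as code rather than a wiki page, so that the steps are versioned and repeatable. This is a purely hypothetical scaffold (the fault-injection and SLO-check functions are placeholders you would wire up to your own tooling), but it captures the shape of the exercise: inject the failure, observe, decide, and always restore.

```python
# Sketch: a minimal game-day scenario runner. Every function that touches real
# infrastructure is a hypothetical placeholder to be wired up to your own tools.
import time

def block_traffic_to_zone(zone: str):
    print(f"[inject] simulating loss of {zone}")   # placeholder

def restore_traffic_to_zone(zone: str):
    print(f"[restore] restoring {zone}")           # placeholder

def slo_healthy() -> bool:
    return True                                    # placeholder: query your monitoring

def run_zone_failure_gameday(zone: str, observe_seconds: int = 600) -> bool:
    """Simulate losing one zone, watch the SLO, and always restore at the end."""
    block_traffic_to_zone(zone)
    try:
        deadline = time.time() + observe_seconds
        while time.time() < deadline:
            if not slo_healthy():
                print("SLO breached: stop the exercise and capture findings")
                return False
            time.sleep(30)
        print("Automation handled the zone loss; nobody had to do anything")
        return True
    finally:
        restore_traffic_to_zone(zone)   # never leave the fault injected

if __name__ == "__main__":
    run_zone_failure_gameday("zone-b", observe_seconds=60)
```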
If you’d like to learn more about this topic, I’m the opening speaker in the [AWS Chaos Engineering and Resiliency Series](https://aws-amazon-event-chaosengineering.splashthat.com) online event, taking place Oct 27-28th from 11am-2pm in the AEDT timezone.

Best wishes, and I hope the next time you have to fail over, you don’t also fall over!

*References*

- Paper: [Building Mission Critical Financial Services Applications on AWS](https://d1.awsstatic.com/Financial%20Services/Resilient%20Applications%20on%20AWS%20for%20Financial%20Services.pdf), by Pawan Agnihotri with contributions by Adrian Cockcroft
- Well-Architected Guide, Reliability Pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
- Related blog posts by @adrianco:
  - Failure modes and continuous resilience: https://medium.com/@adrianco/failure-modes-and-continuous-resilience-6553078caad5
  - Response time variation: https://dev.to/aws/why-are-services-slow-sometimes-mn3
  - Retries and timeouts: https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85
- Source for various slide decks: https://github.com/adrianco/slides
- Book recommendations and reading list: http://a.co/79CGMfB

---

*The Stack Overflow blog is committed to publishing interesting articles by developers, for developers. From time to time that means working with companies that are also clients of Stack Overflow’s through our advertising, talent, or teams business. When we publish work from clients, we’ll identify it as Partner Content with tags and by including this disclaimer at the bottom.*