Don’t panic! A playbook for managing any production incident
No matter how strong your organization is, how detailed its planning, how deliberate its deployments…things will break and emergencies will send your teams scrambling. Systems will go down, key functionality will stop working, and at some point in everyone’s career as a developer, an issue will call for all hands on deck.
The nature of these challenges evolves as time goes on, but some things stay consistent in how you view the challenges and how folks can work to make sure that you get back to good reliably. And to be clear, we aren’t talking about run-of-the-mill production bugs; we are talking about issues that are large and sweeping but at the same time delicate and brittle.
Having done more than my share of organizing and solving some of the big challenges organizations face when these events happen, I have a high-level playbook that I try to make sure my team follows when things fall apart. A lot of these expectations started to take shape during my first big outage as a developer. This helped me understand what people should do as developers, SREs, managers, and everything in between. And the cause of this first big outage: a brand new checkout process on an ecommerce site. All of these takeaways are applicable for folks at all levels, and hopefully can give some insights into what folks in other roles go through.
Step 1. Don’t panic and identify your problem.
My very first “outage” came when I was a developer working on an application that had a new checkout process. It was nothing fancy, but like all applications at one point or another, this key piece of functionality stopped working with our newest launch. As if people not being able to check out and complete sales wasn’t bad enough, shopping carts lost items and product descriptions were showing up blank. Pieces that weren’t within scope, or that hadn’t crossed our minds to test, stopped working. We immediately grabbed folks into a room to get to work and figure it out.
Our first instinct right away was “Quick, roll it back!” This was an understandable feeling to have: we introduced problems, and naturally you want to take the problems away. But with quick actions come quick mistakes, and a seasoned senior developer stopped everyone from scrambling to ask the pertinent question: “Well, why isn’t it working?” In my mind I was screaming “Who cares! Our embarrassing mistake is out there for the whole world to see!” But the calm nature and analytical demeanor of this senior developer settled us down and assured us that what we were doing right now in that room was the right thing to do: ask questions and investigate.
Step 2. Diagnose and understand the source(s) of your problem
This sounds like an obvious thing, but with concern and panic overtaking the team, not enough folks asked us why things were breaking. The senior engineer left the problem out in the wild for a full 30 minutes after we found it to make sure we knew why it wasn’t working. We checked and double-checked exception logs, we ran a few different checks with separate workflows, and we even checked if there was anything odd at a systems level. After all, we had good development environments setup to replicate production, and things were breaking anyway, so double and triple checking ourselves became important. Retracing our steps with new context from the errors we were seeing helped us look at each of them in a new light. After we had enough to know what we did wrong, and gathered enough confidence for next time we release, we then started our rollback. It is a delicate balance, but I learned to take every opportunity you can before a rollback, because once you roll back you lose your best source of information to find the root of your problem: the actual problem in the wild.
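If the concern is losing evidence once you roll back, one option is to bundle what you are seeing into a snapshot before you touch anything. Here is a minimal sketch of that idea, not what we actually ran back then. The function name, the keywords, the release label, and the sample log lines are all made up for illustration.

```python
import json
from datetime import datetime, timezone


def capture_incident_snapshot(log_lines, release, keywords=("ERROR", "Exception")):
    """Collect log lines that mention an error keyword into one record.

    All names here are hypothetical; adapt the keywords to your own logs.
    """
    matches = [
        line.rstrip("\n")
        for line in log_lines
        if any(keyword in line for keyword in keywords)
    ]
    return {
        "release": release,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "error_count": len(matches),
        "errors": matches,
    }


if __name__ == "__main__":
    # Invented sample lines standing in for a real exception log.
    sample_log = [
        "INFO  cart loaded for session abc\n",
        "ERROR product description missing for sku 123\n",
        "INFO  checkout started\n",
        "ERROR checkout failed: KeyError 'items'\n",
    ]
    snapshot = capture_incident_snapshot(sample_log, release="checkout-launch")
    print(json.dumps(snapshot, indent=2))
```

Saving a record like this next to the release notes means the investigation can continue from the captured evidence even after the broken release is gone.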
The same senior dev who was tempering our poorer instincts was the one who took “point” or tech lead during this time, while relying on our director to be the incident leader. You will hear many names for these roles, but they come down to two people: someone who is technical and can help coordinate those efforts (usually a more senior developer) and someone who is responsible for communicating around the incident and giving air cover against those who may want to take time away from the fixes (usually a director or engineering manager). This is to protect the most valuable resource during a crisis: the time and focus of those who can actually implement the plan to fix.
The more technical person will be there to help set milestones and delegate or divvy up the work that needs to be done. The incident leader, as they are often ironically named, is there to facilitate and not to dictate. I remember hearing from my mentor at the time that the best incident leaders asked two questions: “Where are we at?” and “What do you need?” The first so they could keep people off our backs, and the second so the last thing our engineers had to worry about was resources, including time.
Step 3. Remediation: let’s start working on the problem.
We know we have a problem, we know the source of the problem, now let’s make a plan and fix it. We all love to jump to this step and go right into fixing it. And sometimes we have that luxury, for simple issues where the problem is so apparent that confirming and understanding the source of the problem, or problems, are very quick steps. But most times, if the problem has made it this far and is this impactful, we need to be more deliberate. Much like how we were potentially shooting ourselves in the foot by instinctually rolling things back too quickly, the same instinct to just fix it can come up.
This point person is going to help prioritize the work to do, find out where the biggest mitigation steps are, and make sure that other stakeholders have clear expectations of the impact. As a developer working on an issue, you also have a responsibility to hold this person accountable: make sure they give you the resources you need to help figure out the issue. This can be time, access, or other people who have answers you don’t. And this is an important theme throughout this phase: Give the engineers what they need to fix things. Arguably this should be a theme for all of engineering leadership, but it is never more pronounced than when things have gone down and vital workflows have gone silent.
When we were working on the checkout bug, the biggest piece missing was not information or other developers to help, but focus. This may sound odd, but I am willing to bet it is a familiar feeling to anyone who has been in the boat with leaders who are panicky or never understood the fallacies of The Mythical Man-Month. The leaders were eager for progress updates, and what better way to get those updates than to get everyone in a meeting together four times a day to tell us how things progressed. That means every two hours we lost 30 minutes, had to context switch, and had to update tracking sheets. When I told my tech lead about this, he immediately had the meeting cut down to once a day for developers, and optional at that. The speed gains from this alone were huge; being able to focus and remove distractions was the biggest factor in remediating the problem.
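To put rough numbers on that, here is a back-of-the-envelope sketch. The four 30-minute meetings come from the story above; the 8-hour workday and the 15 minutes to get back into the problem after each interruption are my assumptions, not measurements.

```python
def focus_minutes(workday_minutes, meetings_per_day, meeting_minutes, refocus_minutes=0):
    """Minutes left for heads-down work after status meetings.

    refocus_minutes is an assumed cost of context switching after each meeting.
    """
    lost = meetings_per_day * (meeting_minutes + refocus_minutes)
    return max(workday_minutes - lost, 0)


if __name__ == "__main__":
    workday = 8 * 60  # assumed 8-hour day

    before = focus_minutes(workday, meetings_per_day=4, meeting_minutes=30, refocus_minutes=15)
    after = focus_minutes(workday, meetings_per_day=1, meeting_minutes=30, refocus_minutes=15)

    print(f"four meetings a day: {before} focus minutes")
    print(f"one meeting a day:   {after} focus minutes")
    print(f"recovered per developer per day: {after - before} minutes")
```

Under those assumptions each developer gets more than two hours of focus back per day, and that is before counting the time spent on tracking sheets.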
Step 4. Verification and learnings
If all goes well, tests are confirmed, and all the valuable information you got from steps 1 and 2 has led to confidence in your new test plans, you can move the fix out to production. Once it is out live, lean on your teammates in all departments to confirm and explore. Interestingly, I have found time and again that if patience and freedom are given to the engineers at the beginning of these incidents, there is a correlated confidence and calmness to the subsequent release and fix.
However, once the fix is out live and everyone feels strongly about the current state, your work is only half done. Now you need to make sure that the expensive, hard-earned lessons from this problem grow your whole organization. People often measure a good retrospective on a big event like this by whether problems ever happen again, but that is plainly an unreasonable standard. Often I have found the best learning is how we can DEAL with problems better, not pretend like we can make them go away.
In the end for our checkout issue, it all came down to a missed release step by our deployment team. An honest mistake that can happen to anyone. This doesn’t mean we ignore the issue: we thought of adding redundancy or perhaps trying to automate certain bits more, but that wasn’t the best bit that we learned. Our tech lead was far more focused not on preventing errors, but sharpening our ability to deal with them. Though they wanted to prevent future errors, they saw a lot more room to improve in how we respond to the error. What did we learn about engineer focus time? Where were we able to investigate quickly? Slowly? And even good questions outside of engineering such as who was best at handling comms and what was the info they needed?
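For the “automate certain bits more” idea, here is one hedged sketch of what a guard against a missed release step could look like. It is not what our deployment team used, and the step names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseChecklist:
    """Tracks required release steps and refuses to call a release ready early."""

    required_steps: tuple
    completed: set = field(default_factory=set)

    def mark_done(self, step):
        # Reject typos and unknown steps instead of silently accepting them.
        if step not in self.required_steps:
            raise ValueError(f"unknown release step: {step}")
        self.completed.add(step)

    def missing_steps(self):
        # Preserve the declared order so the report reads like the runbook.
        return [step for step in self.required_steps if step not in self.completed]

    def ready_to_ship(self):
        return not self.missing_steps()


if __name__ == "__main__":
    # Hypothetical steps; a real list would come from your own runbook.
    checklist = ReleaseChecklist(
        required_steps=("run migrations", "deploy application", "clear caches")
    )
    checklist.mark_done("run migrations")
    checklist.mark_done("deploy application")

    if not checklist.ready_to_ship():
        print("Release blocked, missing:", ", ".join(checklist.missing_steps()))
```

A check like this only catches steps someone remembered to write down, which is part of why our tech lead cared more about how we respond than about prevention alone.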
There have been shades of this outage throughout almost two decades of my career, and I have no doubt the days of having to deal with things like it are coming to a comfortable middle. But the themes of how to approach it, process it, and most importantly enable my team to tackle it tend to be the same.
Tags: devops, outages, playbook
7 Comments
“Our tech lead was far more focused not on preventing errors, but sharpening our ability to deal with them.” – very wise
Gotta love a project manager with the wisdom and experience to convince the stakeholders and the engineers that the current problem can be resolved:
https://www.youtube.com/watch?v=sngs6wj8tEY
“Finest Hour” scene from Apollo 13
I’ve read the article twice and did not understand the semantics of “problem” as per the author. From one side, you have the problem of “now” where the production came to a halt, bringing immediate consequence (depending on your level of enterprise, it can be costing from thousands to millions per hour). There’s the problem of “later” which is knowing how to fix and how to avoid it. There’s then this line:
“After we had enough to know what we did wrong, and gathered enough confidence for next time we release, we then started our rollback.”
This means the “solution” to the main problem was the… first guess all along? We then get the follow up:
“In the end for our checkout issue, it all came down to a missed release step by our deployment team.”
So this wasn’t even an engineering fix at all, or lack of focus by anyone of the engineering team? The immediate problem persisted longer for no reason? So what was the real problem, lack of confidence in the engineering team to blame someone else right away which allowed production to halt for longer? Not the best war story to tell people imo.
That’s not how it is done e.g. at Amazon.
First priority is to reduce the impact, only then diagnose and analyze.
Thus “Quick, roll it back!” was probably the right call.
There could be situations when rollback can make it worse, but not if you prepared to do rollback.
It was pure luck if you were able to patch the issue with new version quickly, without introducing new issues.
Step 1) Ping ChatGPT
Step 2) Open a soda
If you did indeed “_have good development environments setup to replicate production_”, then it’s not true that leaving the fault in the wild was the “_best source of information to find the root of [the] problem_” – by definition, you could replicate the issue in a safe environment where customers weren’t impacted, so you had an equivalently-good source of information with the benefit of zero customer impact. On the other hand, if that assumption was false, and development environments were insufficiently faithful to production to reproduce the issue, then rolling back immediately is still (with some caveats – see next paragraph) the right choice: once you’ve stopped customer impact, you can test on pre-prod, read the logs with a clearer head, formulate a solid theory about the fault, and instrument whatever observability you’d need to help you investigate further. Even if you can’t prove the fix until you deliberately reintroduce the fault to prod, you’ve still mitigated customer impact over that time; and have the option of reintroducing the fault at a low-traffic time, or to a limited segment of customers/hosts. The worst possible outcome of a rollback is to deliberately reintroduce the fault in order to get more information, which is still strictly-better than leaving the fault in production over the theorizing-and-investigating time.
Rolling back isn’t always the right choice – for instance, when dealing with a stateful service for which rolling back isn’t guaranteed to return to a healthy situation, or when the most recent change was a large infrastructural change that will take significant time to rollback and there’s high confidence in the rollforward fix. But in a situation where a rollback is practical, it’s _never_ the right move to delay it. It’s not about mitigating embarrassment but about mitigating customer impact.
One might claim that the worst possible outcome is, in fact, that you rollback, cannot understand the problem, reintroduce the “fault”, and it doesn’t show up. This is, I claim, still better than leaving the fault active. Customer impact is nil, you’ve introduced extra observability _if_ the issue does reoccur, and if it doesn’t – well, sure, it’d bug me to not learn from the incident, but the next outage will be along any minute now, there’ll be more to learn! There are always many unknown bugs in any sufficiently complex system – just because one of them has temporarily surfaced, doesn’t mean you need to deliberately cause customer impact in order to chase it down if it would never again rear its head. Focus on the issues that are actually causing problems.
That said, I love the rest of the post – incident leader as facilitator and distraction shield, ensuring that the problem is fully understood and fixed rather than simply patched over, learning (and spreading) lessons from outages, blame-free fixes to process. I assume there’s more to the original story that we’re not hearing which made the non-immediate rollback actually the correct choice – I’d love to know what it was!
“Don’t panic” – tell that to the managers. They love running around like headless chickens in situations like these.
Nice blog. I have been responsible in a fair number of serious incidents in my company as the product owner of a dep/maintenance team. Sometimes you know in a minute what goes wrong. It may be a failed connection, an outage with a third party, a user issue, a disk capacity issue (yes, that happens to my dismay) and so on. In other cases the analysis may be very complex.
My priority nr 1 is always to keep the management informed on the steps you take and manage their expectations. If you think they can send the whole callcenter home they are not happy and you get a lot of attention. Informing them in an open way, including what you do not know is very helpful. You still can hope for compliments for the way the issue was managed.