"An unintended but unavoidable consequence of associating safety with things that go wrong is a creeping lack of attention to things that go right."
- Erik Hollnagel, A Tale of Two Safeties.
In order to understand how things went wrong, we first need to understand how they went right.
When an incident happens in an organization, the traditional response is to identify ways to prevent the incident from happening again in the future. The community around this website takes a different approach towards incident analysis. To paraphrase the late computer scientist Edsger Dijkstra, incident analysis is no more about incidents than astronomy is about telescopes. Instead of focusing on prevention, we seek to leverage incidents as an opportunity to learn as much as possible about how work is done within the organization.
It's possible to study how work is done in an organization without focusing on incidents. However, one of the advantages incidents give us is that incident reviews are common throughout our industry, so people already expect someone to go around asking questions after an incident. As incident investigators, we are simply asking different sorts of questions, with different goals in mind, than the traditional let's-make-sure-this-doesn't-happen-again approach.
Aren't the incidents that make the news more important?
Ultimately, there doesn't need to be a business-impacting incident to study how work is done. While incidents with greater business impact get more attention from an organization than incidents with smaller impact, it doesn't necessarily follow that we can learn more from the high-impact incidents than from the lower-impact ones. In fact, it may be even harder to investigate a high-profile incident, since there is more scrutiny from the organization. And the difference between a high-profile incident and no incident at all might be something as mundane as a particular individual being out of the office on a certain day.
And, if we can learn just as much from smaller incidents as we can from larger ones, we can also learn just as much from an "incident" where there is no impact at all! These are the kinds of events we call close calls or near misses.
Any time we encounter an operational surprise, something that happened in operations that we didn't expect, there's an opportunity for us to discover how the observed system behavior deviated from our mental model of how the system is supposed to behave.
There are a few nice things about operational surprises:
First, they don't carry with them the psychological impact of incidents. When somebody is involved in an incident, they often suffer from "second victim syndrome", feeling guilty about having contributed to the incident in some way. This doesn't happen with operational surprises, because there is no negative impact.
Second, the term "operational surprise" is much less ambiguous than "incident". As John Allspaw notes, incident severity is negotiable. On the other hand, I've never heard engineers argue over whether something was a surprise or not.
Third, they're likely happening all of the time inside of your organization!
At Netflix, we started the OOPS project to encourage engineering teams to self-report when they encounter an operational surprise. Each write-up contains a narrative description of the events that led to the surprise, and identifies contributors, mitigators, risks, and challenges in handling, which is the same structure we use for the incidents that we investigate.
Engineers share these write-ups within the organization, where they are commented on and discussed, sometimes in a structured meeting focused on identifying risks and learnings.
By sharing experiences with OOPSies across the organization, we hope to build a shared understanding of how the overall system behaves, demonstrate expertise in action, and encourage discussion of the signals of risk that these OOPSies reveal.