Hiding Theory in Practice

2022/11/23

I'm a self-labeled incident nerd. I very much enjoy reading books and papers about them, I hang out with other incident nerds, and I always look for ways to connect the theory I learn about with the events I see at work and in everyday life. As it happens, studying incidents tends to put you in close proximity with many systems that are in various states of failure, which also tends to elicit all sorts of negative reactions from the people around them.

This sensitive nature makes it perhaps unsurprising that incident investigation and review facilitation come with a large number of concepts and practices you are told to avoid because they are considered counterproductive. A tricky question I want to discuss in this post is how to deal with them when you see them come up.

A small sample of these undesirable concepts includes things such as:

Root Cause: I've covered this one in Errors are constructed, not discovered. To put it briefly, focusing on root causes tends to narrow the investigation in a way that ignores a rich tapestry of contributing factors.
Normative Judgments: this is often used when saying someone should have done something that they have not. It carries the risk of siding with the existing procedure as correct and applicable by default, and tends to blame and demand change from operators more than their tools and support structure.
Counterfactuals: those are about things that did not happen: "had we been warned earlier, none of this would have cascaded." This is a bit like preparing for yesterday's battle. It's very often coupled with normative judgments ("the operator failed to do X, which led to ...").
Human Error: generally not a useful concept, at least not in the way you'd think. This is best covered in "Those found responsible have been sacked" by Richard Cook or The Field Guide to Understanding 'Human Error', but a finding of human error tends to be the sign of an organization protecting itself, or of a failed investigation. Generally the advice is that if you find human error, that's where the investigation begins, not where it ends.
Blame: psychological safety is generally hard to maintain if people feel that they are going to be punished for doing their best and trying to help. You can only get good information if people trust that they can reveal it. Blameless processes (or rather, blame-aware reviews) aim to foster this safety.

There are more concepts than these, and each could be a post on its own. I've chosen this list because each of them is an absolutely common reaction, something so intuitive it will feel self-evident to people using them. Avoiding these requires a kind of unlearning, so that you can remove the usual framing you'd use to interpret events, and then gradually learn to reconstruct them differently.

This is challenging, and while this is something you and other self-labeled incident nerds can extensively discuss and debate as peers, it is not something you can reasonably expect others to go through in a natural post-incident setting. Most of the people with whom you will interact will never care about the theory as much as you do, almost by definition since you're likely to represent expertise for the whole organization on these topics.

In short, you need to find a way to act that is coherent with the theory you hold as an inspiration, while being flexible enough not to cause friction with others or require them to know everything you know for your own work to be effective.

As an investigator or facilitator, let's imagine someone who's a technical expert on the team comes to you during the investigation (before the review) and says "I don't get why the on-call engineer couldn't find the root cause right away since it was so obvious. All they had to do was follow the runbook and everything would have been fine!"

There are going to be times where it's okay to let go of these comments, to avoid doing a deep dive on every opportunity. In the context of a review based on a thematic analysis, the themes you are focusing on should help direct where you put your energy, and guide you to figure out whether emotionally-charged comments are relevant or not.

But let's assume they are relevant to your themes, or that you're still trying to figure them out. Here are two reactions you can have, which may come across as easy solutions but are not very constructive:

You may want to police their intervention: since you care for blame-awareness and psychological safety, you may want to nip this behavior in the bud and let them know about the issues around blame, normativeness and counterfactuals.
You may also want to ignore that statement, drop it from your notes, and make sure it does not come up in any written form. Just pretend it never came up.

In either case, if behavior that clashes with theoretical ideals is not welcomed, the end result is that you lose precious data, either by omission or by making participants feel less comfortable in talking to you.

Strong emotional reactions are data as good as any architecture diagram for your work. They can highlight important and significant dynamics about your organization. Ignoring them is ignoring potentially useful data, and may damage the trust people put in you.

The approach I find more useful is one where the theoretical points you know and appreciate guide your actions. That statement is full of amazing hooks to grab onto:

That they believe it is obvious but was not to the on-call engineer hints at a clash in their mental models, which is a great opportunity to compare and contrast them. Diverging perspectives like that are worth digging into because they can reveal a lot.
The thought that the runbook is complete and adequate is worth exploring: was the on-call engineer aware of it? Are runbooks considered trustworthy by all? Were they entertaining hypotheses or observing signals that pointed in another direction? Is there any missing context?
That counterfactual point ("everything would have been fine!") is a good call-out for perspective. Does it mean next time we need to change nothing? Can we look into challenges around the current situation to help shape decision-making in the future?
Is this frustrated reaction pointing at patterns the engineer finds annoying? Does it hint at conflicts or lack of trust across teams?
Zooming out from the "root cause" with a newcomer's eyes can be a great way to get insights into a broader context: is this failure mechanism always easily identifiable? Are there false positives to care for? Has it changed recently? What's the broader context around this component? You can discuss "contributing factors" even when using the words "root cause" with people.

None of these requires interrupting or policing what the interviewee is telling you. The incident investigation itself becomes a place where various viewpoints are shared. The review should then be a place where everyone can broaden their understanding, and can form their own insights about how the socio-technical system works. Welcome the data, use it as a foothold for more discoveries.

If you do bring that testimony to the review (on top of having used it to inform the investigation), make sure you frame it in a way that feels safe and unsurprising for all participants involved. Respect the trust they've put in you.

How to do this, it turns out, is not something about which I have seen a lot of easily applicable theory. It's just hard. If I had to guess, I'd say there's a huge part of it that is tacit knowledge, which means you probably shouldn't wait on theory to learn how to do it. It's way too contextual and specific to your situation. If this is indeed the case, theory can be a fuzzy guideline for you at most, not a clear instruction set.

This is how I believe theory is most applicable: as a hidden guide you use to choose which paths to take, which actions to prefer. There's a huge gap between the idealized higher level models and the mess (or richness) of the real world situations you'll be in. Navigating that gap is a skill you'll develop over time. Theory does not need to be complete to provide practical insights for problem resolution. It is more useful as a personal north star than as a map. Others don't need to see it, and you can succeed without it.

Thanks to Clint Byrum for reviewing this text.