My bad opinions

2024/05/30

The Review Is the Action Item

I like to consider running an incident review to be its own action item. Other follow-ups emerging from it are a plus, but the point is to learn from incidents, and the review gives room for that to happen.

This is not surprising advice if you’ve read material from the LFI community and related disciplines. However, there are specific perspectives required to make this work, and some assumptions necessary for it, without which things can break down.

How can it work?

In a more traditional view, the system is believed to be stable, then disrupted into an incident. The system gets stabilized, and we must look for weaknesses that can be removed or barriers that could be added in order to prevent such disruption in the future.

Other perspectives for systems include views where they are never truly stable. Things change constantly; uncertainty is normal. Under that lens, systems can’t be forced into stability by control or authority. They can be influenced and adapt on an ongoing basis, and possibly kept in balance through constant effort.

Once you adopt a socio-technical perspective, the hard-to-model nature of humans becomes a desirable trait to cope with chaos. Rather than a messy variable to stamp out, you’ll want to give them more tools and ways to keep all the moving parts of the subsystems going.

There, an incident review becomes an arena where misalignment in objectives can be repaired, where strategies and tactics can be discussed, where mental models can be corrected and enriched, where voices can be heard when they wouldn’t be, and where we are free to reflect on the messy reality that drove us here.

This is valuable work, and establishing an environment where it takes place is a key action item on its own.

People who want to keep things working will jump on this opportunity if they see any value in it. Rather than giving them tickets to work on, we’re giving them a safe context to surface and discuss useful information. They’ll carry that information with them in the future, and it may influence the decisions they make, here and elsewhere.

If the stories that come out of reviews are good enough, they will be retold to others, and the organization will have learned something.

That belief people will do better over time as they learn, to me, tends to be worth more than focusing on making room for a few tickets in the backlog.

How can it break down?

One of the unnamed assumptions with this whole approach is that teams should have the ability to influence their own roadmap and choose some of their own work.

A staunchly top-down organization may leverage incident reviews as a way to let people change the established course with a high priority. That use of incident reviews can’t be denied in these contexts.

We want to give people the information and the perspectives they need to come up with fixes that are effective. Good reviews with action items ought to make sense, particularly in these orgs where most of the work is normally driven by folks outside of the engineering teams.

But if the maintainers do not have the opportunity to schedule work they think needs doing outside of the aftermath of an incident—work that is by definition reactive—then they have no real power to schedule preventive work on their own.

And so that’s a place where learning being its own purpose breaks down: when the learnings can’t be applied.

Maybe it feels like “good” reviews focused on learning apply to a surprisingly narrow set of teams then, because most teams don’t have that much control. The question here really boils down to “who is it that can apply things they learned, and when?”

If the answer is “almost no one, and only when things explode,” that’s maybe a good lesson already. That’s maybe where you’d want to start remediating.

Note that even this perspective is a bit reductionist, which is also another way in which learning reviews may break down. By narrowing knowledge’s utility only to when it gets applied in measurable scheduled work, we stop finding value outside of this context, and eventually stop giving space for it.

It’s easy to forget that we don’t control what people learn. We don’t choose what the takeaways are. Everyone does it for themselves based on their own lived experience. More importantly, we can’t stop people from using the information they learned, whether at work or in their personal life.

Lessons learned can be applied anywhere and any time, and they can become critically useful at unexpected times. Narrowing the scope of your reviews such that they only aim to prevent bad accidents indirectly hinders creating fertile grounds for good surprises as well.

Going for better

While the need for action items is almost always there, a key element of improving incident reviews is to not make corrections the focal point.

Consider the incident review as a preliminary step, the data-gathering stage before writing down the ideas. You’re using recent events as a study of what’s surprising within the system, but also of how it is that things usually work well. Only once that perspective is established does it make sense to start thinking of ways of modifying things.

Try it with only one or two reviews at first. Minor incidents are usually good, because following the methods outlined in docs like the Etsy Debriefing Facilitation Guide and the Howie guide tends to reveal many useful insights in incidents people would have otherwise overlooked as not very interesting.

As you and your teams see value, expand to more and more incidents.

It also helps to set the tone before and during the meetings. I’ve written a set of “ground rules” we use at Honeycomb and that my colleague Lex Neva has transcribed, commented, and published. See if something like that could adequately frame the session..

If abandoning the idea of action items seems irresponsible or impractical to you, keep them. But keep them with some distance; the common tip given by the LFI community is to schedule another meeting after the review to discuss them in isolation. iiii At some point, that follow-up meeting may become disjoint from the reviews. There’s not necessarily a reason why every incident needs a dedicated set of fixes (longer-term changes impacting them could already be in progress, for example), nor is there a reason to wait for an incident to fix things and improve them.

That’s when you decouple understanding from fixing, and the incident review becomes its own sufficient action item.