My bad opinions

2024/01/26

Counting Forest Fires

Today I'm hitting my 3-year mark at Honeycomb, so I thought I'd re-publish one of my favorite short blog posts written over there, Counting Forest Fires, which has become my go-to argument whenever someone asks me to count outages and incidents as a metric.

If you were asked to evaluate how good crews were at fighting forest fires, what metric would you use? Would you consider it a regression on your firefighters' part if you had more fires this year than the last? Would the size and impact of a forest fire be a measure of their success? Would you look for the cause—such as a person lighting it, an environmental factor, etc—and act on it? Chances are that yes, that's what you'd do.

As time has gone by, we've learned interesting things about forest fires. Smokey Bear can tell us to be careful all he wants, but sometimes there's nothing we can do about fires. Climate change creates conditions where fires are more likely, more intense, and harder to control. We keep learning from indigenous approaches to fire management, and we now leverage prescribed burns instead of trying to prevent all fires.

In short, there are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to gauge the burden or impact they have, it isn't a legitimate measure of success. Knowing whether your firefighters or your prevention campaigns are effective can't rely on these high-level observations, because they'll be drowned out in the noise of a messy, unpredictable world.

Forest fires and tech fires: turns out they're not so different

The parallel to software is obvious: there are conditions we put in place in organizations—or the whole industry—and things we do that have greater impacts than we can account for, or that can't be countered by the actions of individual teams or practitioners. And if you want to improve things, some metrics will be far more useful than others. The type of metric I constantly argue for is to count the things you can do, not the things you hope don't happen. You hope that forest fires don't happen, but there's only so much that prevention can do. Likewise with incidents. You want to know that your response is adequate.

You want to know about risk factors so you can alter behavior in the immediate term, and about signals that tell you the way you run things needs to be adjusted in the long term, for sustainability. The goal isn't to prevent all incidents, because small ones here and there are useful to prevent even greater ones, or to give your teams practice—enough that you may want to cause controlled incidents on purpose; that's chaos engineering. You want to be able to prioritize concurrent events so that you respond where it's most worth it. You want to prevent harm to your responders, and to know how to limit it as much as possible for the people they serve. You want to make sure you learn from new observations and methods, and that your practice remains current with escalating challenges.

Don't look at success/failure. It goes deeper than that.

The concepts mentioned above are things you can invest in, train people for, and build conditions around that can lead to success. Counting forest fires or incidents lets you estimate how bad a given season or quarter was, but it tells you almost nothing about how good a job you did.

It tells you about challenges—about areas where you may want to invest and pay attention, and where you may need to repair and heal—more than it tells you about successes or failures. Fires are going to happen regardless, but they can be part of a healthy ecosystem. Trying to stamp them all out may do more harm than good in the long run.

The real question is: how do you want to react? And how do you measure that?