Paper: Building and revising adaptive capacity sharing for technical incident response

2023/06/21

Since I was at a conference last week where John Allspaw presented content from this paper, this week I decided to dig through my notes on Richard Cook and Beth Adele Long's Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering, and put them in here.

This paper was published in the journal Applied Ergonomics and represents a recent and significant milestone for Resilience Engineering. In short, a lot of Resilience Engineering as a discipline is far closer to Resilience Studying, in that people look at what resilience is like after it has been noticed in the wild. This paper is one of the first peer-reviewed ones that explicitly looks at it from the point of view of Engineering, meaning that you find ways to create conditions that are likely to let resilience emerge.

It's a case report that looks at what was being done at a tech company, and it starts with a brief description of the industry: high-stakes work, high technical skill, countless anomalies, rapid growth, increased demands and complexity, etc. There's a quick reference to Woods' categories of resilience, and a reminder that we're dealing with categories 3 and 4 here:

Table 1 from the paper, defining the 4 types of resilience: rebound, robustness, graceful extensibility, and sustained adaptability

What they noticed at a given org was the existence of a specialized multidisciplinary team dealing with bigger thornier incidents:

Within the company, normal work is done by small teams. Teams maintain an on-call rotation and are responsible for handling events that affect their assigned components. While this approach produces efficient response to many anomalies the event salvo contained events that were difficult to troubleshoot. In response, a group established a support cadre to assist in response to high severity or difficult to resolve events. This group would provide a deep technical resource that could be called on to support incident response.

The support group was composed of eight engineers and engineer managers from the different technical units. The group organized an on-call rotation to provide a reserve of engineering and operational expertise. One member of that group would participate in incident response when specific thresholds (e.g. consequences for customers, incident duration) were reached. This approach was taken without altering the workload of the support group members’ teams.

That support group had weekly meetings to review and adjust their approach. They noted a few good properties: incidents closing faster, no "fire alarms" that distract everyone, etc. They also identified patterns like "this incident has been going on for a long time" and wrote tools to automatically ping the support team, so escalation didn't rely on members of the small teams realizing they were getting bogged down.
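To make the automatic escalation idea concrete, here is a minimal sketch of what such a tool could look like. The paper only names the kinds of criteria (consequences for customers, incident duration); the threshold values, the Incident fields, and the page callback below are all hypothetical and not taken from the paper or from any real paging system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, for illustration only.
    id: str
    opened_at: datetime
    customers_affected: int
    support_paged: bool = False


# Assumed thresholds: the paper names the criteria but gives no values.
MAX_DURATION = timedelta(minutes=45)
MAX_CUSTOMERS_AFFECTED = 100


def should_escalate(incident: Incident, now: datetime) -> bool:
    """True when an incident crosses a threshold and support was not paged yet."""
    if incident.support_paged:
        return False
    too_long = now - incident.opened_at >= MAX_DURATION
    too_wide = incident.customers_affected >= MAX_CUSTOMERS_AFFECTED
    return too_long or too_wide


def escalate_open_incidents(incidents, now, page):
    """Call page(incident) once per incident that needs the support group.

    Returns the ids of the incidents escalated during this pass.
    """
    escalated = []
    for incident in incidents:
        if should_escalate(incident, now):
            page(incident)
            incident.support_paged = True
            escalated.append(incident.id)
    return escalated
```

Running something like this on a timer means the responding team never has to decide on its own that it is in over its head, which is the property the support group was after.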

At some point though, the demand became a burden on members of the group and some left, so the company was forced to figure out how to create a pipeline to make it sustainable. Since the org saw the importance, they increased the status, added financial incentives, excluded members from their own team's rotation, etc.

Key characteristics of an Incident

The authors point out 3 key characteristics of an incident:

Tempo: how often incidents happen. This org had a dozen a week. Too low and you can't accumulate experience; too high and you exhaust everyone, create weird expertise concentration, etc.
Duration: incidents need to be long enough for sharing to take place but not be so long that they damage the "donor" team
Magnitude: if all incidents are minor, there's no benefit in sharing resources. If all incidents are major then highly expert-centric units would make sense. Having a variation in magnitude makes the sharing of employees and knowledge workable.

They point out that:

The situations where sharing capacity is useful must also be familiar enough to the group members that each is able to “get up to speed” quickly enough to make that individual’s contribution useful.

If everyone only has general knowledge, sharing won't work as well as if people are routinely dealing with the specifics of local conditions.

They point out that by the nature of incidents, the support group needs to be interruptible and available for mobilization. If their workloads are hard to interrupt (think of a surgeon in an operating room), they wouldn't be available to share their knowledge in an incident. Easy communications are also key, particularly in globally distributed businesses, where tools support instantly reaching out to lots of people. The authors theorize that "resilience itself is highly dependent on communications."

Sustaining

What's interesting here is that setting up the initial team wouldn't work long term:

In this example it quickly became apparent that simply sharing existing adaptive capacity would consume it and that sustaining the ability to share adaptive capacity requires resources and attention. The individuals and their expertise need to be replenished. Indeed the cadre of experts from which the group was drawn is itself continually changing. Given the rapid pace of change in this domain what constitutes useful expertise will continue to evolve. Whether the approach is durable is very much an open question.

These features suggest that successful resilience engineering may – at least at present – depend on identifying and exploiting situations and resources that are already well configured.

They then provide table 3, which is probably the perfect TL;DR for the article:

Table 3 from the paper defining key characteristics: tempo and magnitude of challenges, duration and character of challenges, local resources, communication between units, and interruptible task milieu.

Woods’ theory of graceful extensibility predicts that resilience will appear as individual “units of adaptive behavior” exhaust their own adaptive capacity and obtain additional capacity from other units in their network. This means that it's not about being able to sustain all challenges for each individual unit (person or team or whatever), but about being able to recognize you're going out of capacity, and then getting help and readjusting how tasks are accomplished.

The unit here is a team of people that, from time to time, is required to cope with an incident. When an incident threatens to exhaust that group’s capacity it can obtain support from those in surrounding units.

They provide this little image, where 'UAB' stands for "Unit of Adaptive Behavior", which graphically shows the 'support' coming from a neighbouring unit:

Fig. 1. Schematic of resilience as sharing of adaptive capacity; with a UAB using the support from a neighboring unit

This image is purposefully drawn to look that general, because it could apply as well to teams pushing on the operational boundaries as it would to osteocytes communicating to control bone remodelling.

They conclude:

Graceful extension of the group’s capacity to respond is an expression of resilience. What makes the example a case of resilience engineering is the deliberate, iterative, and empirically based creation and modification of a method to make that makes the expression of resilience–the sharing of adaptive capacity–efficient, effective, and–most of all–sustainable.

The resilience engineering described in the example is the deliberate adjustment of adaptive capacity sharing. The organization engineered a way for sharing to take place. This was originally driven by an economic concern – the need to keep the operational cost of managing incidents in check. Over time, the engineering has refined the sharing to make it better and more efficient. The engineering also set up the conditions needed to replace and grow the adaptive capacity.

They state that, given the right conditions, resilience engineering is already taking place in these organizations, and that this could orient more practical studies of these approaches. The paper also hints at ways contexts could be shaped to either enhance or protect these properties.

I know it's one of the things we do at work sometimes, where we don't try to remove all incidents—they're part of a healthy tempo—and a period of calm with nothing is something I interpret as a signal to do simulations and chaos engineering.