My notes and other stuff

2022/12/03

Paper: Managing the Hidden Costs of Coordination

This week's paper is an entry from the ACM Queue titled Managing the Hidden Costs of Coordination by Dr. Laura Maguire, which is a shorter take on her own thesis (which is an excellent read, but too long for me to cover here). One of the interesting things about her work is that she looked into incident response in tech companies, which tend to have a very different pace and dynamics than other areas often studied (e.g. hospitals or firefighting). I'm somewhat tired and afraid I won't do it justice, but here's my attempt anyway.

The paper contextualizes everything with an incident at a tech company doing Critical Digital Infrastructure (CDI), which starts with 502 errors, a quick declaration of incidents, and a bunch of people jumping in:

In less than seven minutes, eight hypotheses about the nature of the problems had been proposed by the responders. In that same period, five of those had been investigated and discarded.

Within the first 10 minutes of the incident, the responders had been directly in touch with the 4,700 users in their community channel, opened tickets with three dependent services' support teams, and coordinated among a response squad of 10.

This is to emphasize the difference in pace and scope, I believe. A lot of companies have objectives around 4 nines of uptime (99.99%) or represent critical infrastructure and tend to rapidly declare all-hands-on-deck incidents.
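To give a sense of why the pace gets so frantic, here is a quick back-of-the-envelope calculation of what a 99.99% target leaves as a downtime budget. This is my own sketch and not something from the paper; the function name and the 30-day and 365-day windows are assumptions picked for illustration.

```python
# Rough downtime budget implied by an availability target.
# Not from the paper: the helper name and the time windows are
# illustrative assumptions.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Minutes of downtime allowed over `days` at a given availability (0 to 1)."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, days in [("30-day month", 30), ("365-day year", 365)]:
        budget = downtime_budget_minutes(0.9999, days)
        print(f"{label}: {budget:.2f} minutes")
```

Four nines works out to a bit over 4 minutes per 30-day month, or a bit under an hour per year, so the seven minutes of hypothesis generation quoted above would already exceed a whole month's budget.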

There's a big mix of scripted and unscripted elements around figuring out what is going on, ways to fix it, stabilization, writing changes, applying them, finding the right people, briefing them, and so on. What the author observed is that even in small-scale systems, incident response can quickly become mostly about managing the capabilities of responders, communications, and overall coordination rather than about diagnosis and repair.

Growing either the uncertainty or the number of responders increases this coordination demand, and the more critical the service, the higher the pressure:

Herein lies the crux of the issue: The collaborative interplay and synchronization of roles is critical, but prior research has shown poor coordination design incurs cognitive costs for practitioners, specifically, the additional mental effort and load required to participate in joint activities.

This is a pattern that comes up time and time again for people around resilience engineering: a tool or procedure is most critically needed at a time when the available cognitive bandwidth of participants is at its lowest. The author points out that this is exacerbated in tech, where groups are often distributed across various geographical areas.

Dr. Maguire points out that the choreography needed for smooth operation demands active efforts, but efforts that are hard to define, and often distinct from the skills required for problem solving itself. Generally, that coordination ability is invisible until it breaks down, and only then do people take note of it. Coordination activities each carry their own overhead as well, such as: monitoring capacity vs. demands, identifying skills required and who has them, figuring out how to contact them and doing so, adapting work, anticipating future needs, broadcasting status updates, handling access and permissions, preparing and sharing artifacts (charts, dashboards, screenshots), and so on.

These overheads seem relatively benign—they are implicit features of any joint activity. And that is precisely the point: They can be a minimal burden in normal operations and therefore disregarded as worthy of support in explicit design. In high-tempo, time-pressured, and cognitively demanding scenarios, however, these burdens increase to the point of overloading already burdened responders.

Considering distributed teams, there is a need for design around this sort of incident coordination; you can't just leave it to improvisation (even if it plays a part). Generally, people organize around triage, runbooks, and troubleshooting in a way that reduces how many people are needed for your usual response (page as few people as needed), but this can specifically make it more challenging to deal with rapidly escalating situations, where tons of people and stakeholders bring in new data or demands as well.

Tech tends to use process to cope with this, often by borrowing from the Incident Command System (ICS) structure. The author points out two ways in which following this structure actually hinders response.

Attempt 1: Assigning an Incident Commander

The Incident Commander (IC) is there to help explicitly manage coordination, and in some cases has the responsibility of directing activities and making decisions. Part of the challenge is that, in the role's most formal form, an IC owns all the responsibilities and must hand them off to other responders.

But working both in and on the incident is tricky: as you assess the situation, you take away bandwidth needed to help coordinate other people. Centralizing decision-making tends to see you fall behind events. Doing both is only effective at low tempo:

Being an effective choreographer of the joint activity demands current, accurate knowledge and the ability to redirect attention to the orchestration of the players coming in and out of the event alongside their changing needs.

She mentions as well that a structure that is decidedly strict can reduce the options for adaptation and resilience of responders.

Attempt 2: Enforcing operational discipline to follow the ICS

Overload can lead to various behaviours, namely load shedding (dropping tasks on the floor), reducing accuracy (doing the work less thoroughly), or blocking on progress.

During high-pace incidents, it is not uncommon to find yourself in a situation where you need more hands, but the pace is so high that bringing more people up to speed would be too demanding and would make the situation worse. If this is taking place in one big room for the IC to see what is going on, you shouldn't be surprised to see people leave for a side-channel or private room to troubleshoot a specific aspect of the situation. It will then be common to see the post-incident review point this out as an anti-pattern that makes coordination more difficult.

The author counters:

Retrospective discussions portray these adaptations as contrary to the ICS protocols and therefore lead to efforts to block people from forming these channels. The behavior is actually an adaptive strategy to cope as coordination becomes too expensive. Rather than forcing responders to bear significant attentional and workload costs, it is advisable to facilitate shifting various lines of work to subgroups while supporting connecting the progress or difficulties into the larger flow of the response.

Or to put it another way (in my own editorialized words): people do the things they do because they think they'll help. If you see this emergent behaviour and decide you need to squash it, you are ignoring precious signals that your procedure felt inadequate to the people using it, and you may want to actually harness (or better support) that behaviour rather than fighting it.

Attempt 3: Software Platforms

This isn't related to ICS, but is a third and somewhat ineffective approach that arose as a response.

It is rather unsurprising that people tried to develop tools and platforms with software to help coordinate incident response for these new situations. The side-channels in the previous point happen in no small part because tools make them easy to spin up. Making it possible for ad-hoc reorganization to take place implicitly helps with coordination costs, but unfortunately the tools themselves then create new costs instead. For example, chat-based apps can make coordination simpler by providing a text history of everything that happened, but now the new cost is in getting through the history, figuring out all the new channels, finding which threaded conversations were relevant or not, and so on.

These seemingly trivial aspects of design matter greatly. [...] Those who are likely to be drawn in to join in the response efforts on a service outage frequently possess specialized skills that are often scarce. As such, they may not be brought into the event until later stages, at which time the tempo or propagation of failure drives a need for taking urgent action. Poor design renders ChatOps nearly useless as a tool for sensemaking as people come into an evolving and increasingly pressured situation.

Coordinating and Reciprocity

The software described in the previous point shifts the demand from human-to-human coordination to human-machine coordination, and those costs often go unnoticed. Sometimes they are setup costs (evaluating tools, configuring all the things, training people), but there are also ongoing costs such as keeping up with updates, integrating the tool into workflows and other tools, adjusting for its use in new situations, etc.

If the tool were a human colleague, the amount of effort you would need to expend to ensure it remained a relevant team member might give you pause; however, this fundamental asymmetry that unduly burdens the human team members with additional costs to compensate for the limitations of automation is characteristic of current-day human-machine teams.

One critical benefit that can be obtained, however, is letting people who aren't strictly responders still join in to observe and listen in on the activity. The voice loops used at NASA during critical times, for example, let people focus on what they're doing and their own responsibilities, while letting bystanders catch up on context such that if their help is required, they are already up to date or very fast to catch up. The ability to "look in and listen" tends to benefit coordination.

The key element that is required in design then is this ability to allow smooth, rapid escalation without undue costs on people who could be of use without participating yet. Having people who are up to date about the current situation without also having specific responsibilities can let them bring fresh perspectives to the situation or be brought in with limited overhead.

The author concludes with 4 considerations for cognitive costs:

  1. assessing coordination strategies relative to the cognitive demands of the incident
  2. recognizing when adaptations represent a tension between multiple competing demands
  3. widening the lens to study the joint cognition system (human-machine interactions)
  4. looking for opportunities of reciprocity within and across organizational boundaries

These will be useful if not essential to control cognitive costs in situations becoming ever more complex.