Paper: Past the Edge of Chaos

2022/11/12

I've referred to work-as-done and work-as-imagined many, many times before, but never took the time to actually break down any of the literature behind them. The page that describes them most accessibly is Steven Shorrock's The Varieties of Human Work. I decided to dig into some of its sources to find canonical material behind it, and unfortunately I got nerd-sniped by a reference to a 2006 paper by Sidney Dekker titled Past the Edge of Chaos.

Ghost Flight

The paper starts with a re-telling of Helios flight 522, a "ghost flight". The story goes that the flight took off normally, but started having issues pressurizing itself. Dekker mentions "anomalies in the pressurization system" whereas the official report mentions that the pressurization system was set to MAN (manual mode) rather than one of the two other automated modes, which were more normal. None of the pre- and post-takeoff checklist runs highlighted this state.

The plane was configured such that its autopilot would make it climb to 34,000 ft and cruise from Cyprus to Athens. At 10,000 ft, a bunch of alarms started blaring, the same ones that usually ring if a bad take-off configuration is detected (e.g. bad flap settings). Having a take-off alarm at 10,000 ft is bound to be confusing, doubly so since hypoxia can itself cause confusion and was likely starting to set in: cabin pressurization normally keeps things at the equivalent of 8,000 ft of altitude, and this cabin was not pressurizing.

The autopilot kept climbing, and once at 14,000 ft, the oxygen masks dropped from the ceiling and a master caution light lit up. Other alarms started blaring as well, related to insufficient cooling air entering the compartment that contains avionics equipment. This is also where the German captain and Cypriot co-pilot found out that they didn't share enough English (nor either of their native languages) to debug the situation together. Hypoxia kept getting worse and the alarms kept blaring.

After calling the maintenance base about turning off a loud alarm that made things even harder, the captain got up to look at its circuit breaker in a cabinet behind him. The co-pilot was left confused in his seat. As the aircraft kept climbing, the captain passed out, and then the co-pilot. The autopilot kept climbing, reached 34,000 ft, and then flew a holding pattern over Athens.

Greek military jets accompanied the unresponsive plane that kept going for 70 minutes. A flight attendant with a portable oxygen supply sat down in the captain's seat and tried to gain control, but was not qualified. Eventually, fuel ran low, engines flamed out, and the plane crashed, killing everyone on board.

The Decomposition Assumption

Dekker takes this incident as a case study demonstrating that aviation safety's models need adjusting. He mentions that the safety model of the 1990s derives from analytical decomposition: take a complex system and study its individual parts. If all the individual parts behave as expected, then the overall system should keep working fine. This approach was behind the energy and barrier models, and partly influenced the latent failure model (particularly its more common Swiss cheese variant, often made simpler and linear).

What this airline disaster shows is that it is very possible to have an incident where all inspections were passed, all officers were qualified, the airframe was long-certified, and the airline was run by an approved organization, yet a major failure could still happen. The interactions involved are non-linear, and Dekker picks a particular aspect (the fact that both pilots couldn't communicate effectively in abnormal situations) to show that the components in the equation are not actually independent.

He makes a comment on the generally well-intentioned approach of inspectors and regulators:

Moves towards “system oversight” put regulators and certifiers in a sort of second-order role relative to their previous position. Rather than wanting to know exactly what problems an airline, or other inspection object, is having (e.g. bolts of the wrong size), the regulator wants to get an idea of how well the airline is able to deal with the problems that will come its way. The inspector, in other words, is trying to make a judgment of the resilience of the inspection object. The intention to help create safety through proactive resilient processes, rather than through reactive barriers, is laudable and productive. But the critical question is what to base a judgment of resilience on. This question is only beginning to be examined.

This intention is good, but the results are a bit more limited because of the way it is implemented. Since the approach depends on analytical decomposition to work (making sure everything remains stable and in place, then ensuring that inspections able to maintain these invariants are in place), it still cannot cope effectively with emergent failures that recombine in non-linear ways (where a small variation in input creates a major variation in output).

Towards Resilience Engineering

Dekker mentions that the above approach reaches its limits whenever failures are not caused by sub-components misbehaving. Instead, he mentions the need to look at complexity theory, and at the ideas in the then-emerging field of resilience engineering, for ways to dynamically generate safety. A particular point is in finding something he calls "the edge of chaos", a point of emergence beyond which new behaviours appear that couldn't have been predicted by functional decomposition.

To keep a machine working, we want to check on the servicability of its parts and their interactions. Keep out the harmful forces, throw out the bad parts, build barriers around sensitive sub-systems to shield them from danger. To keep a living system working, that is not enough, if applicable at all. Instead, we must adopt a functional, rather than structural point of view. Resilience is the system’s ability to effectively adjust to hazardous influences, rather than resist or deflect them. The reason for this is that these influences are also ecologically adaptive and help guarantee the system’s survival.

[...]

The systems perspective, of living organizations whose stability is dynamically emergent rather than structurally inherent, means that safety is something a system does, not something a system has. Failures represent breakdowns in adaptations directed at coping with complexity. Resilience, then, represents the system’s ability to recognize, adapt to, and absorb a disruption that falls outside the disturbances the system was designed to handle.

Dekker follows with a helpful list of attributes that can hint at signs of resilience (adaptive capacity) in an organization:

Meta-monitoring: is the system aware of risks inherent to how it models risks? Does it detect misplaced confidence?
Readiness to be surprised: if the system assumes that past successes are guarantees of future safety, it is unlikely to be ready to deal with developments going past its tipping points.
Learning from others' experience: this is essentially not doing "distancing through differencing", a sort of "it couldn't happen here" attitude where the differences of other organizations make people believe theirs isn't susceptible to similar events.
Contiguous problem-solving: basically, if you silo all the parts of your system, you risk running into fragmented problem solving, where nobody is able to see an overall erosion of safety when focusing on their part.
Knowing the gap between work-as-done and work-as-imagined: the bigger the gap, the more ill-calibrated leadership might be to problems.
Discussing risk even when everything is safe: you have to keep re-calibrating, even when things look stable.
A role with authority, credibility, and resources that can challenge common decisions about risk: essentially, this is because whistle-blowers with all the information are generally low-ranking, with no power to do something about it. Having the ability to change course based on this information and making room for it is a sign of resilience.
The ability to bring fresh perspectives: you get more hypotheses, cover more contingencies, and create more discussions around rationales. Institutionalizing such a role for neutral commentators/observers can give confidence in the ability to self-regulate.

Dekker concludes that the most important ingredient of engineering a resilient system is constantly testing whether ideas about risk still match with reality; whether the model of operations (and what makes them safe or unsafe) is still up to date.