My notes and other stuff

2023/04/07

Paper: Systemic Contributors Analysis and Diagram (SCAD)

A few months ago, a bunch of people in the LFI community were discussing various models for organizational factors, such as FRAM or STAMP, and a few mentioned that "SCAD" was the one they really liked. Searching for "SCAD model" yielded useless results about OpenSCAD and actual 3D modelling instead of the safety/incident modelling SCAD, which stands for "Systemic Contributors Analysis and Diagram."

As it turns out, there are very few explanations out there even with the proper search terms. Some results lead to Katherine Walker's dissertation, titled Using a corpus of accidents to reveal adaptive patterns that threaten safety, which never introduces the diagrams associated with SCAD. The best explanation I know of is the one I'm covering here, where Walker took the analysis from her dissertation and turned it into one dense little paper titled Multiple Systemic Contributors versus Root Cause: Learning from a NASA Near Miss.

It's a short 5-page paper that does a lot of heavy lifting: it shows the outcomes of a root cause analysis, then introduces SCAD diagrams and their outcomes, and compares the two. Most of the actual analysis behind both the root cause analysis and the SCAD is not in the paper (the latter is in the dissertation), but it gives the clearest outline I've ever found of what SCAD is supposed to do, how it's supposed to work, and how you do the investigation.

The paper starts by mentioning how most incident analyses tend not to yield practical results. Either they take a root cause approach, where the investigation tends to stop at human error, or when they dig deeper into organizational "roots", they find issues with broad categories like "communications" that suggest very few concrete improvements.

To drive that point home, and as a basis for comparison, they use a case where a NASA spacewalk went bad: water kept accumulating in the astronaut's helmet (for two trips in a row), to the point where they risked drowning. The entire time, crew and ground control thought it was a leaking drink bag, but it turned out to be a blockage in a water separator that let water spill into the portable life support system's vent loop.

The root causes identified by the NASA investigation are listed in the paper. Commenting on that root cause approach in general, the authors state:

RCA misses the important characteristics of complex systems and [it] persists to fulfill social needs related to blame. The focus on root cause leads to actions that add additional defenses and increase the negative pressures for compliance. But these pressures add complexity and create unintended effects in complex systems.

The SCAD approach instead looks at "how the system has performed and adapted successfully in the past to understand where practitioners are bridging gaps and resolving dilemmas."

It assumes that long before accidents happen, pressures are in place within the system and influence sharp-end behavior. The analysis usually comes with a SCAD diagram that looks like this:

Figure 3. SCAD diagram begins at the blunt and distal end, charting how pressures lead to conflicts, forcing adaptations and increasing brittleness, which is revealed by the adverse event.

"Proximal" elements are those that happen during the incident. "Distal" factors are those that were there before the incident. The blunt end represents organizational factors, such as policies, management demands, defining priorities, values, and so on. They create pressures on the sharp end practitioners, who must negotiate these goal conflicts and adjust their work to meet objectives and work around bottlenecks.

SCAD creates a four-quadrant diagram mapping sharp-end versus blunt-end factors against distal versus proximal ones. The diagram uses events in the accident as a scale for placing factors temporally. All factors should stem from pressures in the blunt-distal quadrant, and events should start proximally. If a factor starts at the sharp end and seems irrational or strange, that is a sign that some blunt-end pressure is being missed.

So you start at the right and walk the analysis backwards in time, but you must tie the factors you encounter to specific pressures or conflicts in order to map out the motivations and goals of people within the system.
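To make that concrete, here's how I picture the diagram as a tiny data structure, along with the backward walk from sharp-end factors towards blunt-distal pressures. This is purely my own sketch to illustrate the rules above; none of these names or checks come from the paper:

```python
from dataclasses import dataclass, field

# The two axes of the four-quadrant diagram.
SHARP, BLUNT = "sharp", "blunt"          # who: frontline practitioners vs. the organization
PROXIMAL, DISTAL = "proximal", "distal"  # when: during the incident vs. before it

@dataclass
class Factor:
    description: str
    end: str       # SHARP or BLUNT
    timing: str    # PROXIMAL or DISTAL
    # The pressures or conflicts this factor stems from (edges point blunt-ward).
    stems_from: list["Factor"] = field(default_factory=list)

def trace_to_pressures(factor: Factor) -> list[Factor]:
    """Walk backwards in time from a factor to the blunt-distal pressures behind it."""
    if factor.end == BLUNT and factor.timing == DISTAL:
        return [factor]  # reached an organizational pressure; stop here
    found = []
    for parent in factor.stems_from:
        found.extend(trace_to_pressures(parent))
    return found

def looks_unmotivated(factor: Factor) -> bool:
    """A sharp-end factor with no path back to a blunt-distal pressure is a
    hint that the analysis is missing some blunt-end pressure."""
    return factor.end == SHARP and not trace_to_pressures(factor)

# Toy walkthrough, loosely inspired by the case in the paper:
schedule = Factor("production/schedule pressure", BLUNT, DISTAL)
diagnosis = Factor("crew attributes helmet water to the drink bag",
                   SHARP, PROXIMAL, stems_from=[schedule])
print(looks_unmotivated(diagnosis))                                  # False: explained by a pressure
print(looks_unmotivated(Factor("odd workaround", SHARP, PROXIMAL)))  # True: a pressure is missing
```

The point of the sketch is just the shape of the thing: factors form a graph whose edges must eventually land in the blunt-distal quadrant, and a sharp-end factor that doesn't get there means you haven't dug far enough.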

The idea is that this lets you slowly uncover how organizations drift over time and push their operational boundaries into incident territory. Systems degrade gradually, so noticing that drift is particularly challenging.

The authors generated the following diagrams for the two spacewalks (the first one, where water was found but led to no investigation, and the second one, which was the incident):

Figure 4. A complex SCAD diagram of the first spacewalk.
Figure 5. An even more complex SCAD diagram of the second spacewalk, the one leading to the near miss.

Although elements from the root cause analysis also appear on the SCAD diagrams, the main distinction the authors point out is that the SCAD approach goes further than human error: human error is instead seen as a label for a normal adaptation to common pressures, one that usually works.

They point out that production pressures made working under degraded conditions habitual, and the SCAD analysis identifies some of these habitual behaviors and the conflicting goals that encourage them, so they can be modified:

RCA stops prematurely in its analysis—the SCAD analysis starts where RCA stops. [...] The SCAD analysis reveals the adverse event was due to production pressure that led to operations under degraded conditions and created the conditions for the near miss. Production pressure was key to a failure in proactive learning as the organization discounted evidence and missed opportunities to learn and change.

My interpretation of this is that you probably wouldn't get good results by asking people to, say, follow procedures harder, or by adding barriers to prevent them from deviating: as long as the production pressures remain in place unchanged (and without other balancing forces guiding decision-making in trade-off situations), you should expect similar adaptations and drift to keep taking place. This gives you a significantly broader framing for corrective actions and learning than you would get by only looking at the events surrounding the incident rather than the forces behind them.