My notes and other stuff


Paper: Going Solid

I had a quick chat earlier this week with a coworker, which made me revisit Richard Cook's Going Solid: a model of system dynamics and consequences for patient safety, written in collaboration with Jens Rasmussen. It's a short paper that covers two concepts, Rasmussen's drift model and "going solid", shows how the former can represent the latter, and applies both to situations in healthcare.

"Going solid" is an expression that originally comes from the nuclear industry, where it describes a boiler becoming completely filled with liquid (solid). At that point, its operating characteristics become very different, and things become both dangerous and hard to control. The expression became a piece of jargon, and Cook expands it here to mean, in a more general sense, a situation where a system whose components were loosely coupled suddenly becomes tightly coupled, such that events can quickly go out of control. The paper aims to contextualize this concept within Rasmussen's drift model.

The drift model is descriptive, not prescriptive, and describes the possible operating space for a sociotechnical system as an envelope defined by three boundaries: economic failure, unacceptable workload, and acceptable performance:

Basic diagram of the drift model by Rasmussen, showing a rounded triangle. The left side is the acceptable performance boundary, the top side is an economic boundary, and the bottom side is an acceptable workload boundary. A dotted line within the triangle, to the left, defines a marginal performance boundary which defines a margin of error. The operating point is somewhere within the triangle, pushed toward the left by pressures coming from economic or workload factors

The operating point location is influenced by gradients that drive operations away from the workload and economic failure boundaries and towards the unacceptable performance (accident) boundary. Because the environment is dynamic, the operating point moves continuously; stability occurs when the movements of the operating point are small and, over time, random. Changes in the gradients (for example, increased economic pressure) move the operating point. The risk of an accident falls as distance from the unacceptable performance boundary increases. In practice, the precise location of the boundary of unacceptable performance is uncertain. Only accidents provide unambiguous information about its position.

Because only accidents provide unambiguous information about where the unacceptable performance boundary lies, organizations tend to create a marginal boundary, which is enforced by social norms and has people pull back and try to bring the operating point towards the center once it is crossed. Those crossings are essentially "near misses", where you haven't had an accident but you know you came close and operated in ways that felt unsafe.

There are what we call High-Reliability Organizations (HROs), traditionally being things like nuclear power plants or the airline industry, which are known to be high-risk systems that nevertheless operate with long track records of safety, and then Low-Reliability Organizations (LROs), which still operate high-risk systems but tend to not do so reliably:

Same chart as the previous one, but 3 areas are highlighted: a small one near the marginal boundary labelled HROs, a large one overlapping the marginal boundary labelled LROs, and an average-sized one comfortably within all boundaries labelled as a low-risk system

The key characteristic of an HRO in this view is that they have some awareness and agreement of where they are, and generally keep operating within a tight area with few large variations. LROs, however, would operate at the same point on average, but their overall spread and variation are much larger, and they therefore tend to cross the marginal boundary, and get close to the unacceptable performance boundary, more often.
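To make the "same average, different spread" point concrete for myself, here is a minimal sketch. It is not from the paper: it assumes a single axis measuring distance from the unacceptable performance boundary, a normally distributed operating point, and boundary positions and spreads that I made up.

```python
from statistics import NormalDist

# Hypothetical 1-D axis: distance from the unacceptable performance boundary (at 0.0).
ACCIDENT_BOUNDARY = 0.0
MARGINAL_BOUNDARY = 1.0
MEAN_OPERATING_POINT = 2.0  # same average position for both kinds of organization


def crossing_probability(spread: float, boundary: float) -> float:
    """Chance that a normally distributed operating point sits below `boundary`."""
    return NormalDist(mu=MEAN_OPERATING_POINT, sigma=spread).cdf(boundary)


# Assumed spreads: tight variation for the HRO, wide variation for the LRO.
hro = {"margin": crossing_probability(0.3, MARGINAL_BOUNDARY),
       "accident": crossing_probability(0.3, ACCIDENT_BOUNDARY)}
lro = {"margin": crossing_probability(1.0, MARGINAL_BOUNDARY),
       "accident": crossing_probability(1.0, ACCIDENT_BOUNDARY)}

print(f"HRO: past margin {hro['margin']:.4f}, accident {hro['accident']:.2e}")
print(f"LRO: past margin {lro['margin']:.4f}, accident {lro['accident']:.2e}")
```

With these made-up numbers the organization with the wider spread ends up past both boundaries far more often, even though its average position is identical.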

Cook adds:

The marginal boundary location is variable because it is controlled by sociotechnical processes. Publicized painful accidents usually lead to stepwise movement of the boundary inwards. Long periods without such events may result in marginal boundary creep outwards. The gradients encourage organizations to "test" the validity of the marginal boundary by deliberately moving the operating point beyond it. Because this "flirting with the margin" does not immediately produce accidents, it can lead to incremental adjustment of the marginal boundary outwards.

A zoom on the area between the acceptable performance boundary and the marginal boundary, showing the operating point of the system slowly drifting closer to the actual boundary, slowly redefining what the marginal boundary is.
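The creep dynamic can be sketched as a toy update rule. This is only my own illustration under stated assumptions: the margin is a single number (the distance between the marginal boundary and the accident boundary), and the `creep`, `step_back`, and `max_margin` values are hypothetical rather than anything the paper quantifies.

```python
def update_margin(margin, accident, creep=0.05, step_back=0.5, max_margin=1.0):
    """Return the new margin width after one period."""
    if accident:
        # A publicized accident moves the marginal boundary inwards: the margin widens.
        return min(max_margin, margin + step_back)
    # A quiet period lets the marginal boundary creep outwards: the margin shrinks.
    return max(0.0, margin - creep)


def simulate(events, margin=1.0):
    """Apply a sequence of periods (True means an accident occurred) to the margin."""
    history = [margin]
    for accident in events:
        margin = update_margin(margin, accident)
        history.append(margin)
    return history


# Ten quiet periods, one accident, then three more quiet periods.
events = [False] * 10 + [True] + [False] * 3
history = simulate(events)
print([round(m, 2) for m in history])
```

The margin erodes gradually during the quiet stretch, snaps back after the accident, and then starts eroding again.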

The paper then turns to the concept of coupling. Loosely coupled systems are those where activities and conditions in one part of the system have limited effect on those elsewhere. There is often buffering between their parts, margins of manoeuvre within which resources or efforts can be called upon to adjust to variations. Tightly coupled systems instead have many critical dependencies, which end up propagating effects widely and rapidly, and in ways that are difficult to anticipate. This in turn makes them harder to analyze, troubleshoot, and intervene in.

Tight coupling, however, tends to come with higher efficiency (optimizing away the slack buffers and utilizing resources more fully) and therefore an economic advantage. Coupling can also be created to avoid some frequent low-impact events.

Specifically, the act of "going solid" is what happens when a generally loosely-coupled system reaches a saturation point and suddenly becomes tightly-coupled. The authors use hospitals and ICUs as an example of this, which is nice because it's not just mechanical components in a large piece of technical machinery:

Although tight coupling occurs in some hospital settings, hospital operations are usually loosely coupled. Individual units are independently staffed and have some degree of local autonomy. This arrangement produces operational "slack" that can buffer consequences of high workload in one unit. If an ICU is full, for example, it is possible to keep new critically ill patients in the emergency room or recovery room. This buffering capacity is lost, however, when the facility goes solid and all units are filled. Then even minor events in one unit may be major determinants of operations in the others.


To cite an example: a surgical procedure was cancelled after induction of anesthesia because a scheduled transfer of another patient out of the ICU was made impossible by deterioration of that patient's condition. The anesthetic was started because it had become routine to begin surgery in anticipation of resources becoming available rather than waiting for them to be available. The patient would have required an ICU bed for recovery after the procedure and the practitioners elected to halt the operation when it became apparent that no ICU bed would be available.
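The buffering described in the quotes above can be sketched with a few counters. This is a minimal toy, not something from the paper: the unit names, capacities, and the `place_patient` and `is_solid` helpers are all hypothetical.

```python
def place_patient(occupancy, capacity, preferred, fallbacks):
    """Admit one patient; return the unit used, or None if every candidate is full."""
    for unit in [preferred, *fallbacks]:
        if occupancy[unit] < capacity[unit]:
            occupancy[unit] += 1
            return unit
    return None


def is_solid(occupancy, capacity):
    """The facility has gone solid when no unit has a free bed left."""
    return all(occupancy[unit] >= capacity[unit] for unit in capacity)


# Made-up bed counts: the ICU is already full, other units still have slack.
capacity = {"icu": 2, "emergency": 2, "recovery": 1}
occupancy = {"icu": 2, "emergency": 1, "recovery": 0}
overflow = ["emergency", "recovery"]

first = place_patient(occupancy, capacity, "icu", overflow)   # absorbed by slack
second = place_patient(occupancy, capacity, "icu", overflow)  # absorbed by slack
solid = is_solid(occupancy, capacity)
third = place_patient(occupancy, capacity, "icu", overflow)   # nowhere left to go

print(first, second, solid, third)
```

While slack remains, a full ICU is a local problem that other units absorb. Once every unit is full, the next critically ill patient has no bed anywhere, and any single discharge or deterioration decides what every other unit can do.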

They mention that "going solid" is a condition that, once reached, may last for many weeks. It creates new types of work for practitioners, involves management in new ways, and creates opportunities for new types of failure. In the hospital situation, you end up having to reorient a ton of work around managing patient discharge, accounting for resources, and creating new communication channels, and new behaviors emerge such as hiding or hoarding resources ("make a patient take longer to be discharged so it lines up with a new scheduled one coming in so you keep their bed without giving it to another department"). In fact, despite all these challenges, this condition may turn out to be profitable for the hospital, which now spreads its fixed costs across more patients.

However, when we look back at Rasmussen's drift model, tight coupling means that effects of disruptions are now much larger:

A similar zoom on the left side of the triangle, showing an HRO with a lot of tight variations comfortably on the safe side of the marginal boundary, then going solid and having very wide motions within the margin, potentially into accident territory, and then coming back to a more loosely coupled state where it again operates safely with far smaller variations

Because "going solid" is likely to occur when the operating point is already near the marginal boundary, the transition to tight coupling may allow otherwise minor changes in the operating point to propel the system beyond the marginal boundary and produce an accident. This is especially likely if there has been substantial marginal boundary creep during a prolonged accident-free period.
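Here is a rough sketch of how margin creep and going solid compound. The assumptions are all mine: the operating point sits right on the marginal boundary, its variation is normally distributed, tight coupling is modelled as nothing more than a larger spread, and every number is invented.

```python
from statistics import NormalDist


def accident_probability(margin: float, spread: float) -> float:
    """Chance of crossing the accident boundary (at 0.0) when the operating
    point is centred on the marginal boundary, `margin` away from it."""
    return NormalDist(mu=margin, sigma=spread).cdf(0.0)


# Hypothetical values: creep halves the margin, going solid triples the spread.
WIDE_MARGIN, CREPT_MARGIN = 1.0, 0.5
LOOSE_SPREAD, SOLID_SPREAD = 0.2, 0.6

scenarios = {
    ("wide margin", "loose"): accident_probability(WIDE_MARGIN, LOOSE_SPREAD),
    ("wide margin", "solid"): accident_probability(WIDE_MARGIN, SOLID_SPREAD),
    ("crept margin", "loose"): accident_probability(CREPT_MARGIN, LOOSE_SPREAD),
    ("crept margin", "solid"): accident_probability(CREPT_MARGIN, SOLID_SPREAD),
}

for (margin_label, coupling), probability in scenarios.items():
    print(f"{margin_label:>12} / {coupling:<5}: {probability:.4f}")
```

In this toy, either factor alone raises the odds of an accident, but the combination of a crept margin and a solid system is the worst case by a wide gap.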

Generally, safety efforts concentrate on pushing back the economic and workload boundaries or creating a force that counteracts their influence on the operating point, on moving the marginal boundary inwards (tolerating fewer things), or on moving the performance boundary outward. What this paper highlights is that there is also a potential benefit to better characterizing and understanding where your operating point is and how it shifts around over time.

To do that, however, you need to have a general agreement about where you are and what's acceptable or unacceptable in order to properly negotiate the marginal boundary. So in general, there is value in surfacing signals that reveal the factors influencing where the marginal boundary is, what the responses are when you are crossing it, the degree of agreement about where you're operating now, and whether that shared picture is even accurate.

The authors conclude:

Finally, the willingness of organizations to tolerate going and remaining solid for long periods is likely to encourage flirting with the margin crossing and marginal creep. The ability to map the operating point of individual systems through time and study of the factors influencing the marginal boundary location are likely to be productive lines of inquiry.