My notes and other stuff

2023/03/26

Paper: Nine Steps to Move Forward From Error

This week's paper is Nine Steps to Move Forward from Error by David Woods and Richard Cook, some of the authors I cover the most here. This is a relatively short paper from 2002, which lays out 9 steps and 8 maxims (with 8 corollaries) describing ways in which organizations and systems can constructively respond to failure, rather than getting stuck on concepts such as "human error."

It's a sort of quick overview of a lot of the content from Woods and Cook I've annotated here before.

The nine steps are:

1. Pursue second stories beneath the surface to discover multiple contributors.

This section contrasts two things Cook calls "first stories" and "second stories".

First stories are the initial re-telling of an incident, mostly based on the outcome, identifying an overly simplified cause. They stop too early, hide the pressures and dilemmas in play, and overshadow the ways in which people usually work around hazards. This superficial view in turn leads to proposed "solutions" that are either ineffective or outright counterproductive when it comes to both learning and improving.

A second story looks at deeper mechanisms, beyond the label of "human error". Second stories show how systems usually drift toward accidents, but also how they are often stopped from going too far. They are based on a vision in which safety is actively created by ongoing work, and in which this ongoing work must be made visible and supported if we want effective outcomes.

2. Escape the hindsight bias.

Knowing the outcome of actions or decisions colors the opinions we have of them, and leads to "solutions" that can be counterproductive: they tend to hinder the flow of information or add more complexity to already complex systems.

It's not stated explicitly in this paper, but a framing I find useful for why hindsight bias isn't productive is that the people involved in an incident were generally trying to fix things and create success; they had no idea what the outcome would be, otherwise they'd have done something else. A common framing in this type of research is that you have to figure out why the decisions made sense at the time.

3. Understand work as performed at the sharp end of the system.

The sharp end of the system is essentially where the practitioners are. All the contextual factors (economic pressures, organisational pressures, human factors, and technological factors) come together there, creating goal conflicts, dilemmas, and challenges that must be navigated while trying to accomplish work.

This must be properly understood if you want to take corrective action and change how work is done:

Safety is created here at the sharp end as practitioners interact with the hazardous processes inherent in the field of activity in the face of the multiple demands and using the available tools and resources.

[...]

Ultimately, all efforts to improve safety will be translated into new demands, constraints, tools or resources that appear at the sharp end. Improving safety depends on investing in resources that support practitioners in meeting the demands and overcoming the inherent hazards in that setting.

Most of the time, systems will naturally tend toward failure, and the ongoing work of people within them is what keeps them on track to instead be successful. By looking into the hard, challenging problems people encounter (rather than the "errors" they make), we can find better opportunities for learning and improvement.

Doing this sort of investigation is, however, difficult: observers have a distant view of a workplace compared to practitioners, and the experts at doing something tend not to be experts at explaining how or why they do it. A good analysis must have a contextually rich view of the practitioners, be grounded in adequate technical knowledge, and also be supported by the general results and concepts that drive human performance.

As such, you have to assume that learning from incidents will require an interdisciplinary collaboration.

4. Search for systemic vulnerabilities.

To put it simply, safety is a systemic property, not a property of components. It is possible (and easy!) to create unsafe systems from perfectly reliable and well-behaved parts doing everything they should as specified. Seeking flawed individuals is not productive.

After looking at technical work in context, you'll identify complexities and coping strategies, and ways in which they might break down in some situations. These are the types of vulnerabilities you want to surface in an organization if you want to anticipate and avoid future failures. Acting on these tends to be tricky because they often arise from conflicting goals, where some must be sacrificed to save the whole (e.g. be less productive to be safer), and that can create conflict.

Successful individuals, groups and organisations, from a safety point of view, learn about complexities and the limits of current adaptations and then have mechanisms to act on what is learned, despite the implications for other goals

This, in turn, means that you can look at how the system deals with this type of information, at how it uses it. Blame and punishment will drive this information underground. Organizations with a strong safety culture will tend to make this information prominent, sustained, and highly valued. There is a deliberate search for such vulnerabilities, and these organizations are forward-looking—failures can be celebrated as a source of learning.

5. Study how practice creates safety.

We've stated this before: all systems deal with trade-offs, and people adapt in order to resolve them. However, there are limits to how much people, teams, or groups can adapt. Studying the strengths, weaknesses, costs, and benefits of these adaptations, and how well adjusted they are to a constantly changing environment, is critical for progress.

6. Search for underlying patterns.

When surfacing challenges, it's easy to focus on salient issues that are particularly local. However, there are patterns that come into play across multiple cases, not just within a particular field of practice, but also in how people, teams, and organizations coordinate information and activities to deal with changing situations.

Contrasting various sets of cases to find patterns can lead to more robust testing and extension of knowledge.

7. Examine how change will produce new vulnerabilities and paths to failure.

The state of a system is always dynamic, and on top of (or in spite of) their benefits, changes can also create new vulnerabilities. Change can be environmental, organizational, economic, technological, based in capabilities, management, or regulations; anything in the context, really. The benefits obtained do not necessarily relieve the pressures in play, and in turn the work itself changes.

This ties into the law of stretched systems, which I wrote about in a blog post before:

Every system is stretched to operate at its capacity; as soon as there is some improvement, for example in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity.

The authors here add that changes made under resource or performance pressure tend to increase coupling (more interconnections as an optimization), which in turn increases complexity, side effects, and challenges. Periods of change must therefore be treated as windows of opportunity to observe and anticipate new systemic vulnerabilities.

8. Use new technology to support and enhance human expertise.

As with change, new technology can both help and harm system performance. The underlying complexity of operations contributes to both, and the benefits brought by technology tend to increase overall complexity rather than reduce it. People and computers are neither separate nor independent, but must be framed as working together:

On the one hand, new technology creates new dilemmas and demands new judgments, but, on the other hand, once the basis for human expertise and the threats to that expertise had been studied, technology was an important means to the end of enhanced system performance.

What you must do to support practitioners is "understand the sources of and challenges to expertise in context." There is no such thing as neutral design.

9. Tame complexity through new forms of feedback.

I'll just quote this bit directly:

The theme that leaps out from past results is that failure represents breakdowns in adaptations directed at coping with complexity. Success relates to organisations, groups and individuals who are skilful at recognising the need to adapt in a changing, variable world and in developing ways to adapt plans to meet these changing conditions despite the risk of negative side effects.

[...]

Yet, all of these processes depend fundamentally on the ability to see the emerging effects of decisions, actions, policies – feedback, especially feedback about the future. In general, increasing complexity can be balanced with improved feedback. Improving feedback is a critical investment area for improving human performance and guarding against paths toward failure.

Better feedback needs to be well integrated to capture relationships and patterns (and not just be a dump of a lot of available data), to have historicity (past events) rather than just current values, to be future-oriented so people can understand what could happen and not just what has happened, and to be context-sensitive to match the expectations of whoever is monitoring.

The objective of that type of feedback is to create foresight; anticipation is required to give time and margins to adapt.
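To make those properties slightly more concrete in a software-operations setting (this is my own illustration, not something from the paper, and all the names here are made up), here's a toy sketch of a feedback channel that keeps history and context and offers a crude forward-looking view instead of a single instantaneous value:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Observation:
    """One data point, kept with enough context to be interpretable later."""
    name: str                  # what is being observed, e.g. "deploy_queue_depth"
    value: float               # the value at that moment
    observed_at: datetime      # when it was observed
    context: dict = field(default_factory=dict)  # context-sensitivity: who, where, under what pressure


@dataclass
class FeedbackChannel:
    """Feedback with historicity and a rough future-oriented view, not just the latest number."""
    history: list = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        # historicity: keep past events rather than overwriting the latest value
        self.history.append(obs)

    def trend(self, window: int = 10) -> float:
        """A crude projection: average change per observation over the recent window."""
        recent = [o.value for o in self.history[-window:]]
        if len(recent) < 2:
            return 0.0
        return (recent[-1] - recent[0]) / (len(recent) - 1)


# Usage: record a few observations and ask where the signal is heading.
channel = FeedbackChannel()
for depth in [3, 5, 9]:
    channel.record(Observation("deploy_queue_depth", depth, datetime.now(),
                               context={"team": "platform", "during": "release freeze"}))
print(channel.trend())  # 3.0 per observation: growing, worth anticipating
```

The point isn't the specific implementation; it's that the shape of the data bakes in historicity, context, and some notion of where things are heading, which is what the authors argue good feedback needs to provide.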

To finish this off, here are all the maxims and corollaries (taken literally from the text):

  1. Progress on safety begins with uncovering ‘second stories’.
  2. Progress on safety depends on understanding how practitioners cope with the complexities of technical work.
    1. To understand failure, understand success in the face of complexities.
    2. To understand failure, look at what makes problems difficult.
    3. Understand the nature of practice from the practitioner’s point of view.
  3. Progress on safety depends on facilitating interdisciplinary investigations.
  4. Safety is an emergent property of systems and not of their components.
    1. Understand how the system of interest supports (or fails to support) detection and recovery from incipient failures.
    2. Safe organisations deliberately search for and learn about systemic vulnerabilities.
  5. Progress on safety comes from going beyond the surface descriptions (the phenotypes of failures) to discover underlying patterns of systemic factors (genotypical patterns).
  6. The state of safety in any system always is dynamic.
    1. Under resource pressure, the benefits of change are taken in increased productivity, pushing the system back to the edge of the performance envelope
    2. Increased coupling creates new cognitive and collaborative demands and new forms of failure.
  7. Use periods of change as windows of opportunity to anticipate and treat new systemic vulnerabilities
  8. People and computers are not separate and independent, but are interwoven into a distributed system that performs cognitive work in context.
    1. There is no Neutral in design; we either support or hobble people’s natural ability to express forms of expertise