Leveson on Severity
I was reading Engineering a Safer World by Nancy Leveson, and it contains the first description of accident severity I actually like, though it's not quite the same thing I see in my usual software context.
Specifically, she describes severities and levels not as "the impact the incident is having right now" nor as "the level of response we want" (the two uses I was familiar with), but as a set of priorities that help designers negotiate goal tradeoffs when making decisions.
The first step in any safety effort involves agreeing on the types of accidents or losses to be considered. In general, the definition of an accident comes from the customer and occasionally from the government for systems that are regulated by government agencies. Other sources might be user groups, insurance companies, professional societies, industry standards, and other stakeholders. If the company or group developing the system is free to build whatever they want, then considerations of liability and the cost of accidents will come into play. Definitions of basic terms differ greatly among industries and engineering disciplines. [...] An accident is defined as:
Accident: An undesired or unplanned event that results in a loss, including loss of human life or human injury, property damage, environmental pollution, mission loss, etc.
An accident need not involve loss of life, but it does result in some loss that is unacceptable to the stakeholders. System Safety has always considered non-human losses, but for some reason, many other approaches to safety engineering have limited the definition of a loss to human death or injury. As an example of an inclusive definition, a spacecraft accident might include loss of the astronauts (if the spacecraft is manned), death or injury to support personnel or the public, non-accomplishment of the mission, major equipment damage (such as damage to launch facilities), environmental pollution of planets, and so on. An accident definition used in the design of an explorer spacecraft to characterize the icy moon of a planet in the Earth's solar system, for example, was:
- A1. Humans or human assets on earth are killed or damaged.
- A2. Humans or human assets off of the earth are killed or damaged.
- A3. Organisms on any of the moons of the outer planet (if they exist) are killed or mutated by biological agents of Earth origin. [...]
- A4. The scientific data corresponding to the mission goals is not collected.
- A5. The scientific data corresponding to the mission goals is rendered unusable (i.e., deleted or corrupted) before it can be fully investigated.
- A6. Organisms of Earth origin are mistaken for organisms indigenous to any of the moons of the outer planet in future missions to study the outer planet's moon. [...]
- A7. An incident during this mission directly causes another mission to fail to collect, return, or use the scientific data corresponding to its mission goals. [...]
Prioritizing or assigning a level of severity to the identified losses may be useful when tradeoffs among goals are required in the design process. As an example, consider an industrial robot to service the thermal tiles on the Space Shuttle[...]. The goals for the robot are (1) to inspect the thermal tiles for damage caused during launch, reentry, and transport of a Space Shuttle and (2) to apply waterproofing chemicals to the thermal tiles.
- Level 1:
- A1-1: Loss of the orbiter and crew (e.g., inadequate thermal protection)
- A1-2: Loss of life or serious injury in the processing facility
- Level 2:
- A2-1: Damage to the orbiter or to objects in the processing facility that results in the delay of a launch or in a loss of greater than x dollars
- A2-2: Injury to humans requiring hospitalization or medical attention and leading to long-term or permanent physical effects
- Level 3:
- A3-1: Minor human injury (does not require medical attention or requires only minimal intervention and does not lead to long-term or permanent physical effects)
- A3-2: Damage to orbiter that does not delay launch and results in a loss of less than x dollars
- A3-3: Damage to objects in the processing facility (both on the floor or suspended) that does not result in delay of a launch or a loss of greater than x dollars
- A3-4: Damage to the mobile robot [...]
The customer may also have a safety policy that must be followed by the contractor or those designing the thermal tile servicing robot. As an example, the following is similar to a typical NASA safety policy:
General Safety Policy: All hazards related to human injury or damage to the orbiter must be eliminated or mitigated by the system design. A reasonable effort must be made to eliminate or mitigate hazards resulting at most in damage to the robot or objects in the work area. For any hazards that cannot be eliminated, the hazard analysis as well as the design features and development procedures, including any tradeoff studies, must be documented and presented to the customer for acceptance.
Within that context, severities provide a good guideline about what is acceptable or not and what you can sacrifice to maintain higher-level guarantees. In turn, they help with prioritization during incidents, but that's a side effect of having clearly defined core priorities rather than of picking how loud the alarm is for a higher or lower SEV.
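To make the idea concrete, here's a minimal sketch of what "severities as design priorities" could look like in code. The goal names and severity assignments are my own invention, loosely inspired by the tile-servicing robot example, not anything from Leveson's book: each goal is tagged with the worst loss its failure could cause, and when goals conflict, the system gives up the least severe ones first.

```python
from enum import IntEnum

# Hypothetical severity levels, loosely modeled on the tile-servicing
# robot example: lower number = more severe loss, protected first.
class Severity(IntEnum):
    LEVEL_1 = 1  # loss of orbiter or crew, loss of life
    LEVEL_2 = 2  # launch delay, injury needing medical attention
    LEVEL_3 = 3  # minor injury, damage to the robot itself

# Each system goal is tagged with the worst loss its failure can cause.
# These mappings are illustrative, not from the book.
GOALS = {
    "keep_humans_clear_of_arm": Severity.LEVEL_1,
    "finish_tile_inspection_on_schedule": Severity.LEVEL_2,
    "avoid_scratching_robot_chassis": Severity.LEVEL_3,
}

def goals_to_sacrifice(conflicting: list[str]) -> list[str]:
    """When goals conflict, give up the ones tied to the least severe
    losses first, preserving higher-level guarantees."""
    return sorted(conflicting, key=lambda g: GOALS[g], reverse=True)

# If every goal conflicts at once, sacrifice in this order:
print(goals_to_sacrifice(list(GOALS)))
# → ['avoid_scratching_robot_chassis',
#    'finish_tile_inspection_on_schedule',
#    'keep_humans_clear_of_arm']
```

The point isn't the code itself but the ordering it encodes: degradation decisions fall out of the agreed loss priorities instead of being improvised during an incident.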
It's something I instantly wanted to bring to work, and we've had a few discussions about it. Clear goal priorities make goal conflicts a bit easier to negotiate in difficult situations, and can ensure more graceful degradation when developers align with them as well.