Paper: The Strategic Agility Gap

2023/09/24

This week I read David Woods' The Strategic Agility Gap: How Organizations Are Slow and Stale to Adapt in Turbulent Worlds, an open access chapter that sort of surveys and pulls together a lot of the concepts he has written about in the past, particularly around the need for organizations to balance growth in capabilities with the ability to adjust to the changes they enable.

The idea here is that growth in capability, often due to better technology, brings rapid changes at a societal level: new opportunities are found, complexity grows, and new threats emerge. New capabilities generally mean growth, expansion, bigger scales, and more interactions, which means more surprises. On the other hand, organizations are generally slow and stale when it comes to adapting to these threats or seizing these opportunities:

As capability grows to improve performance on some criteria, interdependencies become more extensive and produce surprising anomalies as the systems also become more brittle.

The strategic agility gap is the difference between the rate at which an organization adapts to change and the rise of new unexpected challenges at a larger industry/society scale. It is a mismatch in velocities of change and velocities of adaptation.
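One way to read that definition, as a rough sketch rather than anything Woods formalizes: let $v_c(t)$ be the velocity at which new challenges arise and $v_a(t)$ the velocity at which the organization adapts. Both symbols, and the accumulated backlog $D$, are assumptions introduced here for illustration.

```latex
g(t) = v_c(t) - v_a(t), \qquad D(T) = \int_0^T g(t)\,dt
```

Under this reading, $g(t)$ is the gap itself, and the backlog of unaddressed challenges $D(T)$ keeps growing for as long as $v_c(t) > v_a(t)$, regardless of how well the organization performed in the past.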

This figure is attached:

Because the risks are difficult to see ahead of time and the growth is continuous, disturbances and challenges can cascade; this requires anticipating challenges and building a "readiness-to-respond" to avoid having to generate and deploy responses while the challenge is taking place. Here the text seems to intend something different from just having a plan for specific challenges; the words used are "organizations need to coordinate and synchronize activities over changing tempos, otherwise decisions will be slow and stale". This hints at overall response patterns and reorganization more than at having a runbook with specific scenarios.

To provide an example of a failing and a successful case, Woods covers the Knight Capital collapse of 2012 and the case of a transport company dealing with Hurricane Sandy.

In the case of Knight Capital, they rolled out code that reused an old feature flag that had been repurposed, and the deployment failed on one of eight servers. When it went live, it produced unexpected behavior that ran more transactions than expected; rolling it back produced even more anomalous behavior due to the flag. The people involved struggled to understand the issue. Woods mentions that it took a while before upper management was informed and then authorized a stop to trading. By then, less than an hour had passed, but it was too late and the company went bankrupt from its now untenable market position.
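The flag interaction described above can be sketched as a toy model. This is a minimal illustration under stated assumptions, not Knight Capital's actual code: the `Server` class, the server names, and the "expected"/"legacy" behaviors are all hypothetical, and the sketch assumes that the rollback reverted the code while the repurposed flag stayed on.

```python
from dataclasses import dataclass

# Hypothetical toy model; none of these names come from the real system.
EXPECTED, LEGACY = "expected", "legacy"


@dataclass
class Server:
    name: str
    has_new_code: bool = False

    def behavior(self, flag_on: bool) -> str:
        # With the flag off, old and new builds behave the same way.
        if not flag_on:
            return EXPECTED
        # New build: the repurposed flag selects the new, intended feature.
        # Old build: the same flag still triggers the retired code path.
        return EXPECTED if self.has_new_code else LEGACY


def deploy(fleet, failed=()):
    """Install the new build everywhere except on servers where it failed."""
    for server in fleet:
        server.has_new_code = server.name not in failed


def rollback(fleet):
    """Revert every server to the old build (assumption: the flag stays on)."""
    for server in fleet:
        server.has_new_code = False


def misbehaving(fleet, flag_on):
    """Names of the servers running the retired code path."""
    return [s.name for s in fleet if s.behavior(flag_on) == LEGACY]


fleet = [Server(f"srv{i}") for i in range(1, 9)]

deploy(fleet, failed={"srv8"})
after_deploy = misbehaving(fleet, flag_on=True)

rollback(fleet)
after_rollback = misbehaving(fleet, flag_on=True)

print(after_deploy)         # ['srv8']
print(len(after_rollback))  # 8
```

In this model a single stale server misbehaves after the partial deployment, and the intuitive fix of rolling back the code spreads the problem to the whole fleet because the flag, not the build, is what triggers the retired path.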

The author picked it as an example that shows that:

small problems interact and can escalate quickly
as effects cascade, roles struggle to understand the situation and figure out how to react
non-routine responses are more difficult to get authorization for
this requires more coordination which slows things down while effects still amplify
response can't keep pace with events, particularly when communications are serialized vertically through the organization

The comparative case is a large transportation firm that reconfigured itself during Hurricane Sandy. Woods names the following elements behind its effective adaptation. Quoted literally from the text, they:

re-prioritized over multiple conflicting goals,
sacrificed cost control processes in the face of safety risks,
valued timely responsive decisions and actions,
coordinated horizontally across functions to reduce the risk of missing critical information or side effects when replanning under time pressure,
controlled the cost of coordination to avoid overloading already busy people and communication channels,
pushed initiative and authority down to the lowest unit of action in the situation to increase the readiness to respond when unanticipated challenges arose.

This, Woods mentions, helped balance what is called the efficiency-thoroughness tradeoff. Abbreviated ETTO, this is a principle that states that needs for safety tend to reduce efficiency, and demands for productivity tend to reduce thoroughness. This is because people have limited time and these two values are in tension. Specifically, they sacrificed economics and standard processes to keep up with events, using patterns that already existed within the organization, since adapting to surprises was a normal experience for them.

In comparing both cases, the author mentions that following a plan is not enough in these situations. There's a need for anticipation and initiative, particularly when events challenge existing plans. The difference between both organizations is that for the transportation company:

From facing surprises in the past, the varying roles/levels had opportunities to exercise their coordinative ‘muscles,’ even though this specific event presented unique difficulties. In the strategic agility gap, the challenge for organizations is to develop new forms of coordination across functional, spatial, and temporal scales—otherwise organizations will be slow, stale and fragmented as they inevitably confront surprising challenges.

While I personally feel the time scales between cases are very different for the comparison, they probably do a decent job of demonstrating the types of behaviors on each side of the accelerated trajectory line.

The paper shifts toward a "Systems are messy" section, recalling the old WWII term SNAFU, standing for "Situation Normal: All Fucked Up". Standard plans inevitably break down, and some people in some roles do "SNAFU catching", often in ways that are hard to detect:

all organizations are adaptive systems, consist of a network of adaptive systems, and exist in a web of adaptive systems—i.e., the resilience engineering paradigm. All human adaptive systems make trade-offs to cope with finite resource and all live in a changing world. The pace of change is accelerated by past successes, as growth stimulates more adaptation by more players in a more interconnected system.

The point here is that operating within the strategic agility gap is unavoidable. Organizations love to rationalize this away:

Since SNAFUs occur rarely, this is a low priority issue
There's a record of improvement that reduces the challenge SNAFUs represent
Poor response when SNAFUs occur is due to people who fail to follow the plan and design

Woods states directly that these rationalisations are wrong empirically, technically, and theoretically. When surprises are framed as deviations from the established plan, the compliance pressure that follows undermines the system's adaptive capacities. A sudden collapse against a background of improvements surprises and confuses people within the system. The argument here is that this is normal: as scale and interdependencies increase, performance increases, but so does the proportion of large collapses and failures.

The Resilience Engineering statement here is that we shouldn't be surprised by the failures, but by how few of them we have. One of Woods' favorite laws is the fluency law, which states:

well adapted activity occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced.

The reason we see so few failures is that adapting to SNAFUs continually takes place, and that it is nearly invisible. It is, in fact, one of the tenets of resilience engineering.

Past successes in these situations drive effective leaders to take advantage of improvements and push the systems to do even more, and this creates adaptive cycles which accelerate the strategic agility gap. Organizations end up living in that gap, and to thrive there they need to develop and sustain the ability to continuously adapt.

Resilience Engineering researchers turn to web operations in order to study this: outages and near-misses are incredibly common even in the best organizations, and things change so fast that they provide a great laboratory to study constraints and shifting opportunities and risks. The key ingredients identified are:

anticipation: seeing signs of trouble and starting adaptation before it becomes definitive
contingent synchronization: based on pacing, roles at different levels coordinate differently
readiness to respond: developing and mobilizing response capability before surprises
proactive learning: studying how surprises are caught and resolved before major collapses or accidents

To express and apply initiative, there's a need to push it down closer to the action; this can be miscalibrated in a way that fragments efforts and makes units work at cross-purposes. Since we can't just enforce plans harder, resilience engineering seeks system architectures that can adjust the expression of initiative as the potential for surprises varies. This requires prioritizing and sacrificing some goals as conflicts arise. Proactive learning is key there, and not just learning from events that have already caused economic loss or harm.

There's also a good call for reciprocity, which I'll use as the author's closing words:

Effective organizations living in the gap build reciprocity across roles and levels. Reciprocity in collaborative work is commitment to mutual assistance. With reciprocity, one unit donates from their limited resources now to help another in their role, so both achieve benefits for overarching goals, and trusts that when the roles are reversed, the other unit will come to its aid.

[...]

Units can ignore other interdependent roles and focus their resources on meeting just the performance standards set for their role alone. Pressures for compliance undermine the willingness to reach across roles and coordinate when anomalies and surprises occur. This increases brittleness and undermines coordinated activity. Reciprocity overcomes this tendency to act selfishly and narrowly. Interdependent units in a network should show a willingness to invest energy to accommodate other units, specifically when the other units’ performance is at risk.

[...]

Episodes of surprise provide the opportunity to see when and how people re-prioritize across multiple goals when operating in the midst of uncertainties, changing tempos and pressures.

Sustaining capabilities for dealing with SNAFUs is essential.