Is Catastrophic Software Failure a Black Swan?

May 23, 2025 | 4 minutes reading time | By Gill Ringland and Ed Steinmueller

The risk is systemic, and the preparations must go beyond the technical.

Black Swans refer to rare, unforeseen and hence unpredictable events, typically with extreme consequences. For instance, “financial industry models that simplify a complex reality are vulnerable to Black Swans that literally break the bank.”

The potential risks of software failure are variously described as a “fly in the soup” or an “elephant in the room.” These metaphors convey the idea of unexpected risk events (flies) or of obvious and visible but neglected risks (elephants). Nassim Taleb, who popularized the use of Black Swans, added the important element of magnitude to unexpected and neglected risks. Black Swan events constitute a far greater, often catastrophic, risk than either flies or elephants.

Taleb observes that people and their organizations are often complacent about the existence of Black Swans. Even though, in retrospect, Black Swan events appear inevitable, the fact that they haven’t yet happened encourages a belief that they will not happen, and this causes shock when they do occur.

To indicate the added risks of complacency, Taleb uses the contrast between the view of a well-cared-for turkey that is surprised by Thanksgiving (the sucker), and the butcher (the knower) who plans for the turkey’s demise.

Despite recent apparent increases in the number and scope of software system failures leading to service outages and data breaches, those who haven’t experienced them expect they will not happen to them. In Taleb’s analogy, they are the suckers.

“Inevitable and Unpredictable”

Most organizations deliver their critical services directly or indirectly through digital systems. These services are seen as a utility – essential to the operation of the economy and society, and to the quality of life. But these digital systems contain software components over which the organizations have no control.

The failure of one or more of these software components is inevitable and unpredictable. The question is one of risk magnitude: whether the failure will cause a Black Swan event for a particular organization or merely an anticipated and rehearsed-for emergency.

Given the difficulties of predicting the nature and size of service outages or data breaches, and expectations of society that a utility will operate 24/7, the best means of avoiding Black Swan events is for organizations to focus on resilience.

Resilience can be used to describe many aspects of organizations. Here we focus on the resilience of services to users – reducing the fallout to users from digital systems failure.

The first step in building a more resilient organization is to increase the visibility of systemic risk. Some very basic managerial tools, such as RACI (the framework based on assigning Responsibility and Accountability with Consultation and Informing of stakeholders), provide the means for getting started in establishing responsibility for resilience and increased availability.

Building responsibility and eventually consensus requires a common language about risk and resilience. This enables technical and non-technical people to jointly consider actions to mitigate risk and improve resilience. Encouraging “translators,” people who are able to bridge differences in outlook and language, is helpful for building mutual understanding of what is at stake.

Risk and Impact Tolerance

Metrics are invaluable in moving from talk to action. A particularly useful metric is based on the concept of important business services – those that are vital to the financial and reputational survival of the organization and whose interruption would cause intolerable harm.

In most cases, a small interruption, or one that affects a limited number of customers, is tolerable. This allows the identification of a metric – impact tolerance – the threshold between tolerable and intolerable interruption. (See the U.K. Operational Resilience statement of policy for the relationship between risk appetite and impact tolerance.)
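
To make the threshold concrete, here is a minimal Python sketch of an impact-tolerance check. The class name, the field names and the numbers are illustrative assumptions, not taken from the U.K. policy statement; the sketch simply encodes the rule above that an interruption is tolerable if it is short or if it affects a limited number of customers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactTolerance:
    """Hypothetical thresholds for one important business service."""
    max_outage_minutes: float
    max_customers_affected: int


def is_tolerable(outage_minutes: float,
                 customers_affected: int,
                 tolerance: ImpactTolerance) -> bool:
    """Return True if the interruption stays within the agreed tolerance.

    An interruption counts as tolerable when it is short OR when it
    reaches only a limited number of customers (an assumed reading of
    the rule; a stricter organization might require both).
    """
    return (outage_minutes <= tolerance.max_outage_minutes
            or customers_affected <= tolerance.max_customers_affected)


# Illustrative values only: two hours, or at most 1,000 customers.
payments_tolerance = ImpactTolerance(max_outage_minutes=120,
                                     max_customers_affected=1_000)

short_but_wide = is_tolerable(30, 50_000, payments_tolerance)
long_but_narrow = is_tolerable(600, 200, payments_tolerance)
long_and_wide = is_tolerable(600, 50_000, payments_tolerance)
```

In practice the thresholds would be agreed per service by the people accountable for it, and an organization could reasonably choose a stricter rule than the one assumed here.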

When setting impact tolerances, it is useful to plot the cost of reducing risks against the expected time before failure. Such a plot is only one of several cost-benefit analyses that can help establish a common language and enlist support for the investments needed to stay within agreed tolerances. (Cost-benefit analysis of risk reduction is extensively examined in Philosophical Basis for Risk Analysis.)

Another useful metric is the Information Commissioner’s Office (ICO) NIS framework, which measures four aspects of the cost of failures in terms of impact to users: cost of “lost user hours,” cost of data breaches, cost of damage to life or health, and cost of significant financial impact to users.
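
As a rough illustration of how the four aspects might be combined into a single figure for one incident, the Python sketch below simply adds them up. The field names, the hourly value used to price lost user hours and the example amounts are all hypothetical, and the sketch should not be read as the ICO's own method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureImpact:
    """Hypothetical record of one incident's cost to users."""
    lost_user_hours: float        # total hours users could not use the service
    value_per_user_hour: float    # assumed monetary value of one user hour
    data_breach_cost: float
    life_or_health_cost: float
    user_financial_cost: float    # significant financial impact to users

    def lost_user_hours_cost(self) -> float:
        # Price lost time at a flat assumed rate per user hour.
        return self.lost_user_hours * self.value_per_user_hour

    def total_cost(self) -> float:
        # Sum of the four aspects named in the text.
        return (self.lost_user_hours_cost()
                + self.data_breach_cost
                + self.life_or_health_cost
                + self.user_financial_cost)


# Made-up figures for a single outage.
incident = FailureImpact(lost_user_hours=20_000,
                         value_per_user_hour=15.0,
                         data_breach_cost=50_000.0,
                         life_or_health_cost=0.0,
                         user_financial_cost=120_000.0)

print(incident.total_cost())
```

Keeping the four components separate, rather than storing only the total, lets technical and non-technical readers see which kind of harm dominates a given failure.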

The NIS framework is used by the ICO to regulate RDSPs (Relevant Digital Service Providers) but is not yet widely used for costing the economic impact of service failures. For very large outages, the organization needs to have a disaster response plan that incorporates alternative service provision (a “Plan B”) or well-structured rollback and restart procedures to restore system functionality.

Culture of Resilience

Resilience is more than a technical issue. It requires specific management actions and skills. A recent BCS RoundTable report found that these skills are often broadly dispersed within organizations and that gaps in knowledge and practice are only recognized after an outage. These gaps can be bridged by developing internal capability or by procuring external capabilities, but there is no magic bullet that will assure that the necessary skills are available.

However, it can help to build an organizational culture that provides “safe spaces” for people to discuss service resilience and its value to the organization. This results in more open discussion of failures and outages, and the recognition of early signs of instability or risk.

The practice, common in health and safety, of making it a positive obligation to call out issues has much value. Organizations need to establish a “what if?” approach to planning for future potential scenarios to ensure they have adequate protections in place. One method is a pre-mortem: imagining that a large-scale failure has already occurred and working backward to identify how organizational risks could be managed more effectively.

From Visibility to Action

Information technology is a critical utility; this increases the need for resilience everywhere. Improving resilience begins with awareness and visibility of risks, including risks of Black Swan catastrophic failures.

An all-hands effort to communicate priorities regarding important business services and the organization’s tolerance for outages is a first step. This initial planning can be turned into action using metrics to gauge ordinary risk and mitigation measures such as alternative service provision (Plan B) or well-defined recovery procedures to deal with larger-scale failures.

Moving from visibility to action requires organizational learning, skills-building and a common commitment to reducing the scale of impact when failures occur.

Gill Ringland and Ed Steinmueller are the co-authors of Resilience of Services: Reducing the Impact of IT Failures, where the issues and methods in this article are explored in more depth. A version of the article was published previously in the Long Finance Pamphleteers Blog.

Ringland is Emeritus Fellow of SAMI Consulting, where she was CEO, director and a fellow from 2002 to 2017. “Resilience of Services” is her 13th book. An earlier title, Scenario Planning: Managing for the Future, was written while she was responsible for strategy at computer firm ICL.

Steinmueller is Professor Emeritus of Information and Communications Technology Policy, and of the Economics of Innovation, Science Policy Research Unit (SPRU), University of Sussex Business School. He has published four books and more than 80 peer-reviewed journal articles and book chapters. He is a Fellow of BCS, The Chartered Institute for IT, and was co-chair with Gill Ringland of the BCS Service Resilience Working Group.
