Two incidents in 2024 – one malicious, one accidental – demonstrated how systemic digital failure has moved from theoretical tail risk to realized threat.
The Change Healthcare ransomware attack paralyzed critical payment infrastructure connecting hospitals, pharmacies, and insurance companies nationwide, creating an industry-wide liquidity crisis. Small providers faced insolvency, forcing UnitedHealth Group (UHG) to inject approximately $6 billion in emergency funds and advance payments to support affected providers. While framed as cybersecurity risk, this was fundamentally a systemic payment crisis demonstrating how malicious actors can leverage financial choke points to inflict liquidity and credit risk across entire sectors.
The CrowdStrike incident proved systemic risk requires no malicious intent. A faulty software update caused the “Blue Screen of Death” on 8.5 million endpoints globally. Because CrowdStrike Falcon is a critical security layer, the failure caused immediate multi-sector operational paralysis across airlines, banks, and hospitals, with total economic losses estimated in the billions globally.
Astrid Yee-Sobraques
For risk professionals, the difference between a hack and a bad software update is immaterial. Business continuity planning (BCP) assumes isolated failures with time to mobilize recovery, not systemic risk where shared dependencies fail simultaneously and contagion cascades within hours. Resolution planning extends BCP to preserve critical functions, maintain liquidity, and manage ecosystem spillover.
Traditional risk transfer provides little relief: cyber insurance covered only approximately 1% of the $0.9 trillion in global economic losses from cyber incidents, according to Closing the Cyber Risk Protection Gap, a report by Zurich and Marsh McLennan. The report identifies catastrophic cyber incidents as largely non-insurable due to lack of insurer risk appetite – systemic digital failures fall into this category. This protection gap underscores why organizations must pre-position their own capital to manage the collapse of digital dependencies.
Learning from Financial Resolution Planning
After 2008, systemically important banks were required to maintain living wills – resolution plans with pre-positioned capital buffers and liquidity reserves. Organizations dependent on systemic digital utilities must similarly pre-position resources, with one key difference: Cyber resolution must execute within hours, not weeks.
Current frameworks like the European Union’s Digital Operational Resilience Act (DORA) focus on prevention and recovery rather than on resolution mechanics. That is the gap resolution planning must fill.
The DUST Methodology: Tiering Digital Dependencies
Most organizations treat all “critical” vendors identically, applying similar audit requirements, service-level agreements (SLAs), and controls regardless of whether a vendor is critical only to the firm, or across the industry with contagion potential. This creates both under-protection of truly systemic dependencies that require resolution planning, and over-investment in firm-specific dependencies that require only standard BCP.
The Digital Utility Systemic Tiering (DUST) methodology applies established systemic risk principles from financial regulation to digital dependencies, differentiating by concentration risk (industry-wide exposure) and criticality (firm-specific impact).
Tier 0: Systemic Market Infrastructure (SMI) comprises digital utilities so fundamental that their failure requires a regulatory or industry-wide response (e.g., payment rails, SWIFT, core internet infrastructure). These dependencies already fall under regulatory oversight and crisis management frameworks. Firm plans must recognize which dependencies operate at this tier and ensure internal escalation protocols connect to existing regulatory coordination mechanisms.
Tier 1: Systemic Digital Utility (SDU) comprises vendors or systems whose failure causes significant contagion across an organization and its counterparties – primary cloud providers, enterprise-wide endpoint detection and response (EDR) platforms, critical clearinghouses. Classification at this tier requires both high concentration and high criticality, with maximum tolerable downtime (MTD) measured in hours.
Tier 1 subdivides based on substitutability. Tier 1A dependencies have alternatives that can activate within MTD. They require an operational warm standby with a secondary provider and documented switchover procedures tested quarterly. For example, an organization using AWS as its primary cloud infrastructure might maintain a warm standby on Microsoft Azure, with quarterly failover testing, so that critical workloads can migrate within hours.
Tier 1B dependencies have no realistic alternative within MTD. CrowdStrike Falcon exemplifies this tier. Organizations must instead contractually obligate vendors to demonstrate geographic redundancy, provide audit rights to vendor resolution plans, conduct joint stress testing, and accept pre-negotiated financial penalties for outages exceeding MTD.
Tier 2: Enterprise Critical Function (ECF) encompasses systems critical to core business but whose impact remains largely contained to the firm – high criticality but lower concentration. These require contractual resilience through pre-negotiated substitution costs and warm standby arrangements, but not the full resolution playbook.
Tier 3: Business Support Service (BSS) covers standard platforms with localized, temporary impact where standard BCP applies.
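The tiering logic above can be sketched in a few lines of Python. This is only an illustrative reading of DUST, not a reference implementation. The field names, the yes/no treatment of concentration and criticality, and the example dependencies are all assumptions. The article does not say where a high-concentration, low-criticality dependency belongs; the sketch places it in Tier 3.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    SMI = "Tier 0: Systemic Market Infrastructure"
    SDU_1A = "Tier 1A: Systemic Digital Utility (alternative within MTD)"
    SDU_1B = "Tier 1B: Systemic Digital Utility (no alternative within MTD)"
    ECF = "Tier 2: Enterprise Critical Function"
    BSS = "Tier 3: Business Support Service"


@dataclass(frozen=True)
class Dependency:
    name: str
    market_infrastructure: bool  # already under regulatory crisis frameworks
    high_concentration: bool     # industry-wide exposure (assumed yes/no judgment)
    high_criticality: bool       # firm-specific impact (assumed yes/no judgment)
    mtd_hours: float             # maximum tolerable downtime
    failover_hours: Optional[float] = None  # tested switchover time; None = no alternative


def classify(dep: Dependency) -> Tier:
    """Assign a DUST tier from concentration, criticality, and substitutability."""
    if dep.market_infrastructure:
        return Tier.SMI
    if dep.high_concentration and dep.high_criticality:
        # Tier 1 splits on whether an alternative can activate within MTD.
        if dep.failover_hours is not None and dep.failover_hours <= dep.mtd_hours:
            return Tier.SDU_1A
        return Tier.SDU_1B
    if dep.high_criticality:
        return Tier.ECF
    # Assumption: everything else is handled under standard BCP.
    return Tier.BSS


# Hypothetical dependencies, for illustration only.
examples = [
    Dependency("payment rail", True, True, True, mtd_hours=2),
    Dependency("primary cloud", False, True, True, mtd_hours=8, failover_hours=4),
    Dependency("EDR platform", False, True, True, mtd_hours=4),
    Dependency("core policy admin system", False, False, True, mtd_hours=24),
    Dependency("travel booking tool", False, False, False, mtd_hours=120),
]

for dep in examples:
    print(f"{dep.name}: {classify(dep).value}")
```

In practice, concentration and criticality would likely be scored rather than flagged, and the thresholds would be set by the firm's own risk appetite.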
DUST operationalizes through Critical Function mapping – identifying core services that must be maintained and tracing them through business processes, applications, and nth-party relationships. When a single vendor supports the majority of Critical Functions, it qualifies as Tier 1 SDU, triggering the integrated playbook requirement.
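As a rough illustration of Critical Function mapping, the sketch below counts how many Critical Functions each vendor supports and flags any vendor that supports more than half of them. The function names, vendor names, and the reading of "majority" as a share strictly above 50% are assumptions made for the example.

```python
from collections import defaultdict


def vendor_share(function_to_vendors: dict) -> dict:
    """Return, per vendor, the fraction of Critical Functions it supports."""
    total = len(function_to_vendors)
    if total == 0:
        return {}
    counts = defaultdict(int)
    for vendors in function_to_vendors.values():
        for vendor in set(vendors):  # count each vendor once per function
            counts[vendor] += 1
    return {vendor: n / total for vendor, n in counts.items()}


def majority_vendors(function_to_vendors: dict, threshold: float = 0.5) -> list:
    """Vendors supporting more than `threshold` of all Critical Functions."""
    shares = vendor_share(function_to_vendors)
    return sorted(v for v, share in shares.items() if share > threshold)


# Hypothetical mapping of Critical Functions to the vendors they depend on,
# including nth-party dependencies discovered during tracing.
critical_functions = {
    "payments": {"cloud_a", "edr_x"},
    "claims": {"cloud_a", "edr_x"},
    "customer_login": {"cloud_a"},
    "payroll": {"payroll_saas"},
}

print(vendor_share(critical_functions))
print(majority_vendors(critical_functions))
```

Here `cloud_a` supports three of four functions and would be a Tier 1 SDU candidate, while `edr_x` sits at exactly half and is not flagged under this threshold. A firm might reasonably choose a lower threshold or weight functions by importance.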
The Resolution Playbook Framework
For Tier 1 SDUs, organizations must develop integrated resolution playbooks with two parts:
The Financial Resolution Playbook addresses the UHG lesson – when financial choke points fail, cascading liquidity and credit risk threatens dependent counterparties. The playbook quantifies financial impact of SDU failure, stress tests the liquidity freeze scenario across the ecosystem, and establishes Resolution Liquidity Facilities – ring-fenced capital or committed credit lines sized to maintain the organization’s own operations during outages, covering costs such as manual processing, temporary staffing, alternative service providers, and business disruption until primary systems recover.
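One simple way to approach the sizing of a Resolution Liquidity Facility is to estimate daily outage costs in the categories named above and scale them by a stressed outage duration. The sketch below assumes a linear daily cost and a flat stress multiplier. Both are simplifications, and every figure is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyOutageCosts:
    """Estimated cost per day of SDU outage, in one currency (hypothetical)."""
    manual_processing: float
    temporary_staffing: float
    alternative_providers: float
    business_disruption: float

    def total(self) -> float:
        return (
            self.manual_processing
            + self.temporary_staffing
            + self.alternative_providers
            + self.business_disruption
        )


def facility_size(costs: DailyOutageCosts, outage_days: float,
                  stress_multiplier: float = 1.0) -> float:
    """Size ring-fenced capital or committed credit for a stressed outage.

    Assumes costs accrue linearly per day; stress_multiplier >= 1 adds a
    buffer for scenarios worse than the base estimate.
    """
    if outage_days < 0:
        raise ValueError("outage_days must be non-negative")
    if stress_multiplier < 1.0:
        raise ValueError("stress_multiplier must be at least 1.0")
    return costs.total() * outage_days * stress_multiplier


# Illustrative figures only.
costs = DailyOutageCosts(
    manual_processing=50_000,
    temporary_staffing=30_000,
    alternative_providers=120_000,
    business_disruption=200_000,
)
print(facility_size(costs, outage_days=10, stress_multiplier=1.5))  # 6000000.0
```

A fuller model would let costs escalate over time and would stress test the liquidity freeze across counterparties, which this sketch does not attempt.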
The Technical Resilience Playbook addresses the CrowdStrike lesson: When operational utilities fail, organizations need tested procedures to maintain function through alternatives. For Tier 1A SDUs, this means maintaining operational warm standby with alternative providers. For Tier 1B SDUs, the playbook mandates contractual resilience obligations. Both require degraded operations protocols defining which Critical Functions can operate in manual mode and for how long.
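A degraded operations protocol can be represented as a small table of Critical Functions and their manual-mode limits. The sketch below checks which functions have run out of manual-mode runway at a given point in an outage. The function names and hour limits are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DegradedProtocol:
    function: str
    manual_mode_hours: Optional[float]  # None = cannot operate manually


def functions_at_risk(protocols: List[DegradedProtocol],
                      outage_hours: float) -> List[str]:
    """Functions with no manual fallback, or whose manual-mode limit is exceeded."""
    at_risk = []
    for p in protocols:
        if p.manual_mode_hours is None or outage_hours > p.manual_mode_hours:
            at_risk.append(p.function)
    return at_risk


# Hypothetical protocol table.
protocols = [
    DegradedProtocol("claims_intake", manual_mode_hours=72),
    DegradedProtocol("payment_release", manual_mode_hours=24),
    DegradedProtocol("fraud_screening", manual_mode_hours=None),
]

print(functions_at_risk(protocols, outage_hours=36))
# ['payment_release', 'fraud_screening']
```

Keeping the table as data rather than prose makes it easier to exercise during playbook tests and to update as manual procedures change.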
The Path Forward
The 2024 incidents proved that systemic digital risk has already materialized, with devastating consequences. Traditional prevention through security controls, paired with incident recovery, is insufficient. Resolution planning extends well past incident recovery to ecosystem stabilization and financial continuity.
Organizations cannot wait for industry solutions or regulatory mandates. The actionable path forward requires mapping Critical Functions and dependencies, classifying digital utilities using concentration and criticality dimensions, building and funding integrated playbooks, and testing these as living operational tools.
This is firm-centric work requiring immediate action. However, certain systemic risks exceed individual capacity to manage. Industry consortia should develop shared resolution standards, coordinate stress testing for common dependencies, and work with regulators to designate systemically important digital utilities for heightened resilience requirements.
The cost of inaction is clear: UHG spent billions in ad hoc emergency assistance under crisis conditions with no pre-planned framework. Resolution planning transforms chaotic emergency response into orderly execution.
For risk professionals, systemic resolution planning must become a fundamental practice. The methodology exists, the precedents are established, and the case for action has been written in billions of dollars of losses.
Astrid Yee-Sobraques, FRM, CISSP is a senior risk executive in Enterprise Risk Management, Operational Resilience and Cybersecurity. Over 25 years at GE Capital, AIG, Citibank, and PwC, she specializes in “risk connectivity” – integrating people, processes, and data to strengthen how organizations anticipate, manage, and respond to cascading financial, operational, and compliance risks. Astrid serves on GARP’s New York Chapter Advisory Committee. She can be reached at Astrid@therisksherpa.com.
Topics: Resilience, Cybersecurity