
The historic CrowdStrike outage in July 2024 brought operations across numerous industries and public service sectors to a halt, causing over $5 billion in estimated costs and damages.
The incident stood out not only for its sheer size and broad, cross-industry impact but also for its cause. Unlike other recent significant technology issues (e.g., the 2023 MOVEit cyberattacks), it did not stem from a software vulnerability or an external threat actor. Instead, it was a logic error in an ordinarily harmless update to the cybersecurity vendor’s Falcon platform configuration data. Millions of Windows systems experienced the “blue screen of death” and became stuck in recovery boot loops.
While there is little any firm could have done to prevent this outage and its impacts, the incident highlights several critical lessons for executives and IT and cybersecurity leaders to enhance future preparedness and resilience against similar disruptions.
Lessons Learned
The importance of preparedness and response planning. The CrowdStrike outage highlights the need to invest in IT and cybersecurity preparedness and response planning that goes beyond typical cyberattack scenarios. Part of what made recovery so difficult was that many of the impacted Windows devices were also encrypted with BitLocker, so each drive had to be unlocked before the device could be restored. By design, getting past BitLocker is not an easy task. A long, unique recovery key must be entered on each device, a process that in most cases had to be done manually and took a significant amount of time.
Moving forward, firms should include this type of scenario in their incident planning and response exercises, with IT, cybersecurity, and business leaders thinking through how they would recover from a similar situation in which many devices need to be manually reset. This planning should include traditional aspects of scenario planning, like practicing incident assessment and communication activities. But it can also serve as an excellent opportunity to brainstorm and document unique solutions for future challenges, a skill that helped some organizations rapidly recover from the CrowdStrike incident.
To mitigate long-term impacts, firms must be sure to include scenarios where key vendors experience internal disruptions in their disaster recovery plans. Through regular business impact analyses (BIAs), firms will be better able to identify how the failure of a critical service provider would affect critical functions, allowing for a more targeted and efficient response. Additionally, business continuity plans (BCPs) should ensure that essential services continue even when third-party systems fail, safeguarding against prolonged disruptions.
The need to focus on third-party risk management. While third-party oversight and risk are key priorities, it is easy for business and technology leaders to assume that their IT and security vendors will not be the source of their firms’ operational disruptions. As the CrowdStrike incident demonstrates, thorough due diligence of these providers is critical. Firms should push them to provide clear and detailed descriptions of their change management process, how updates are tested and deployed, and the processes they have in place to roll back or address issues that may come up after an update has been pushed out.
It is important to note that CrowdStrike was not directly engaged with many of the companies that suffered from the disruption. It was instead serving as a fourth-party vendor to Windows users. Still, this incident demonstrates the need for firms to dig deeper into how their third parties deal with disruptions, their incident response plans, and their ability to deal with breakdowns in their own third-party networks. An operational disruption of a fourth party can quickly become a significant issue.
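One way to make fourth-party exposure visible during a business impact analysis is to record who depends on whom and then walk that map. The following is a minimal Python sketch under stated assumptions: the service and vendor names in DEPENDS_ON are hypothetical placeholders, and a real inventory would come from a firm's own vendor records rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical dependency map: each key relies directly on the providers listed.
# "security-agent-vendor" is a fourth party here: the firm only reaches it
# through "managed-it-provider".
DEPENDS_ON = {
    "trade-processing": ["endpoint-fleet", "market-data-vendor"],
    "client-portal": ["cloud-host"],
    "endpoint-fleet": ["managed-it-provider"],
    "managed-it-provider": ["security-agent-vendor"],
    "cloud-host": [],
    "market-data-vendor": [],
    "security-agent-vendor": [],
}


def impacted_by(failed, depends_on):
    """Return everything that directly or indirectly depends on `failed`."""
    # Invert the map so we can look up "who relies on this provider?"
    dependents = {}
    for item, providers in depends_on.items():
        for provider in providers:
            dependents.setdefault(provider, set()).add(item)

    # Breadth-first walk outward from the failed provider.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("security-agent-vendor", DEPENDS_ON)))
```

In this illustrative map, a failure at the security agent vendor reaches trade processing even though the firm has no direct contract with that vendor, which is the kind of indirect exposure a third-party review should surface.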
The value of a rigorous change management process. The CrowdStrike incident serves as a stark reminder that when dealing with critical systems and software, no configuration change should be considered low-risk. It is critical that firms conduct thorough risk analysis and account for as many dependencies as possible before changes are made, so that they can catch potential issues early.
A structured approach also ensures that swift rollback procedures are in place, which could have allowed CrowdStrike to reverse the problematic update before it caused widespread outages, minimizing operational impact.
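A common way to pair change management with rollback is a staged (ring-based) rollout, where an update reaches a small group first and only proceeds if that group stays healthy. The sketch below is illustrative only and does not describe CrowdStrike's actual process: the device names, the health check, and the callback functions are all hypothetical, and it assumes devices remain reachable so that a rollback can be applied.

```python
def staged_rollout(rings, apply_update, is_healthy, rollback):
    """Apply an update ring by ring; undo everything at the first unhealthy ring.

    Returns True if every ring was updated and passed its health check,
    False if the rollout was halted and rolled back.
    """
    updated = []
    for ring in rings:
        for device in ring:
            apply_update(device)
            updated.append(device)
        # Gate: do not touch the next ring unless this one is fully healthy.
        if not all(is_healthy(device) for device in ring):
            for device in reversed(updated):
                rollback(device)
            return False
    return True


# Hypothetical fleet: a canary ring, a pilot ring, then the wider estate.
rings = [["canary-1"], ["pilot-1", "pilot-2"], ["prod-1", "prod-2", "prod-3"]]
state = {}

ok = staged_rollout(
    rings,
    apply_update=lambda d: state.__setitem__(d, "new"),
    is_healthy=lambda d: d != "pilot-2",  # simulate one device failing its check
    rollback=lambda d: state.__setitem__(d, "old"),
)
```

In this run the pilot ring fails its check, so the canary and pilot devices are returned to the old configuration and the production ring is never touched. The design choice is to limit the blast radius of a bad change rather than to guarantee that no device is ever affected.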
Related to change management, firms should also invest in backup and redundant systems so that critical operations, and the incident response plan, can be activated even in the face of widespread outages of critical systems. In the CrowdStrike case, it is likely that some firms responded more slowly because their plans, necessary contact information, roles and responsibilities, key vendor information, and backup information were stored only on the very Windows devices they could no longer access.
Conclusion
The CrowdStrike outage is a reminder of the unpredictability inherent in the realm of IT and cybersecurity. Even routine updates, intended to improve systems, can lead to significant disruptions for firms and even entire critical sectors of the economy. And while incidents like this are often difficult to prevent, firms that invest in organizational resilience, comprehensive planning, and robust response strategies will fare much better when they occur.
As the complexity of digital ecosystems continues to grow, so too does the necessity for organizations to be prepared for the unexpected, ensuring they can swiftly recover and maintain operations in the face of unforeseen challenges.
Baivab Jena is a research analyst, and Aaron Pinnick is senior manager of thought leadership, at ACA Aponix, which provides cybersecurity and technology risk assessment programs, data privacy compliance services, vendor and M&A diligence services, portfolio company oversight, network testing, and advisory services for companies of all sizes.