Operational Resilience: The Critical Need to Learn from Failure

High-profile IT failures, ranging from the Equifax data breach to TSB's botched tech upgrade to the damaging customer-account problems at Barclays and RBS, have recently proliferated in the financial services industry. Combined with the staggering level of prudential and conduct-related failures the financial system has experienced over the past decade, these incidents have naturally sparked great anger - from both the public and politicians.

Regulators, in response, are demanding greater operational resilience, asking firms to plan on the basis that it's no longer a matter of if, but when, they experience a failure.

Jo Paisley

Too often when things go wrong, the instinct is to look for who to blame. Although that's natural, a more productive approach is to investigate the reasons for these failures and to seek to learn from them.

The incidents are, in part, symptomatic of the increasing complexity and interconnectedness within the financial system, as well as the massive amount of change affecting financial services firms. Increased digitalization, for example, is elevating security and technology risks, while greater outsourcing - via cloud computing, open banking and fintech partnerships - is creating new dependencies and raising third-party vendor risk.

The consequences of operational outages and failures are also changing: firms that display any sort of IT weakness themselves become targets for fraudsters. And more stringent regulatory safeguards on data privacy raise the prospect of larger fines and increased reputational risk in the case of breaches.

As a result of all of these issues, the likelihood and potential costs of operational failures are rising. Moreover, failures will have potentially more substantial and unpredictable impacts, on both individual firms and the financial system, as the system becomes more complex.

Given all this, it is clearly time to start thinking about failure in a different and more proactive way.

Lessons from Aviation

Some industries, such as aviation, do this well. The two black boxes on each plane collect vital data for independent investigators to study in the event of accidents and near misses. Lessons are learned from these incidents for the good of the industry - a system that Black Box Thinking author Matthew Syed has called an 'open loop.'

A 'no-blame culture' has been institutionalized in this industry, because any evidence compiled by accident investigators is inadmissible in court. This provides the incentive for individuals to speak up.

The key factors at work are the source of data on failures and near misses, independent investigators and sharing lessons across the industry. Cultural factors - such as breaking down hierarchies and encouraging people to speak up - are also critical. But it has taken around a decade to get to this state of maturity.

What does this mean for the way that we look at failures at financial firms? Academics David Blake and Matthew Roy believe that insights from the learning culture in aviation are highly relevant to the pensions industry.

In their report, Bringing Black Box Thinking to the Pensions Industry, they argue that The Pensions Regulator in the UK should be a clearing house for post-mortems of failed schemes - akin to the role that accident investigators have in aviation. If this role were to be set up, they contend it could help reduce the likelihood of common mistakes, such as inappropriate hedging, being made across schemes.

Perhaps that is what Sam Woods, CEO of the Prudential Regulation Authority (PRA), had in mind in his recent 'Good cop, Bad cop' speech. Woods drew a distinction between topics where the incentives of firms and the regulator are aligned and those where they are not.

Examples of the former are cyber risk and operational resilience, where firms and the regulator should be on the same side, sharing information. Under those scenarios, the PRA will act as a 'good cop.' In other areas, such as ring-fencing, accountability, pay and internal models, the PRA will tend to be the 'bad cop.'

While there are certainly areas in between, the 'good cop' role sounds similar to an industry mechanism for learning from failure. This raises a few interesting questions:

Can a regulator simultaneously be a good cop and a bad cop? This is difficult but not impossible, requiring clear rules of engagement. It would take time to build the levels of trust required to ensure complete reporting of operational failures. And it might work better for some types of events (e.g., cyber incidents) than for more general operational incidents linked to, say, poor testing or human error, where the instinct to find blame might dominate the desire to learn lessons.

Is the possibility of blame and the bad cop role enough to put firms off reporting honest mistakes that might help the sector learn from failures? An independent body is likely to be more effective at overcoming the reluctance of firms to admit mistakes. Certainly, that has been the route that the UK healthcare industry has followed to try to improve safety through effective and independent investigations that don't apportion blame or liability.

Can financial institutions and regulators work jointly to aggregate data on failures/near misses and share best practices? Many regulators have launched cyber risk initiatives to help financial institutions learn from failure. In the US, the Federal Reserve is trying to find a common way of classifying and modelling cyber threats. In the UK, the Cyber Security Information Sharing Partnership (CiSP) aims to improve the sharing of cybersecurity incident information at a national level, and a Financial Sector Cyber Collaboration Centre is being established to improve collaboration and learning about systemic cyber risks across financial services.

However, it is commonly believed that there is still significant under-reporting of cyber incidents within firms. Without improvement at this lower level, national incident sharing initiatives cannot become as effective or institutionalized as in the aviation industry. And building operational resilience will need a broader perspective than just dealing with cyber incidents.

Next Steps

There is no panacea, but there are specific steps that risk practitioners and regulators can take to address failure issues. Practitioners should proactively support a learning culture and ensure that there is a clear and well-understood incident reporting system that is supported by senior management. Moreover, each firm should have an effective resilience plan that can be activated immediately after a breach occurs.

Regulators, meanwhile, should have rules in place to ensure that organizations that commit infractions are not just punished but actually learn from their mistakes.

Today, firms, regulators, politicians and the public are all guilty, at times, of thinking about failures in the wrong way. The knee-jerk reaction of asking 'who is to blame?' is counterproductive. Although it is a challenge to achieve the right balance between a no-blame culture and instilling a strong sense of accountability, it's time to think about these incidents differently and consider the right approach to learn from them.

Jo Paisley is the Co-President of the GARP Risk Institute. Paisley worked as the Global Head of Stress Testing at HSBC from 2015-17, and has served as a stress-testing advisor at two other UK banks. As the director of the supervisory risk specialists division at the Prudential Regulation Authority, she was also intimately involved in the design and execution of the UK's first concurrent stress test in 2014.
