Things will go wrong – a key issue for operational resilience is how we respond when they do. Responding well to operational failures requires insight, preparation and practice, as well as an ability to learn from errors. This is easier in a culture where staff acknowledge mistakes, there is an effective system for recording them, and the organization is open to learning from them.
In a joint 2018 discussion paper, three UK regulators (the Bank of England, the Prudential Regulation Authority and the Financial Conduct Authority) noted that the financial system needs to be able to absorb shocks, rather than contribute to them. “The financial sector needs an approach to operational risk management that includes preventative measures and the capabilities – in terms of people, processes and organizational culture – to adapt and recover when things go wrong,” the regulators elaborated.
It's helpful to recognize that there are different types of failure that warrant different types of responses. In a paper written for the Harvard Business Review, Dr. Amy Edmondson identifies three broad categories of failure: preventable, complexity‐related and intelligent.
From an operational resilience point of view, paying attention to preventable (even quite minor) failings makes a lot of sense, as these can in combination trigger a catastrophic process failure. Evidence of this has been found in several public inquiries investigating the causes of various disasters.
Consider, for example, the UK inquiries into the Bradford City stadium fire, the sinking of the Herald of Free Enterprise and the King's Cross Underground fire. What did these three disasters have in common? All were the culmination of a number of smaller events, including design and management deficiencies.
Each disaster could have been averted – or, at the very least, mitigated – if the smaller trigger events had been identified, reported and tackled. The general lesson is that an industry that does not learn from past failures is doomed to repeat them.
In the case of the inquiry into the Bradford fire, the report highlighted that many of the safety‐related recommendations had in fact been identified in previous reports into other football‐related disasters, but had not been put in place. In other words, the industry did not learn from previous failures.
The 1987 sinking of the Herald of Free Enterprise ferry, which resulted in the loss of 193 lives, was another classic example of not learning from previous minor incidents. Prior to the disaster, several minor incidents had been noted by members of the ferry's crew – but these were either not officially reported or were dismissed by management as ‘exaggerations.’ In the ensuing Sheen inquiry, management failure was cited as a prime cause of the disaster.
This practice of ignoring, dismissing or marginalizing previous related incidents is part of a larger, even more troubling pattern of behaviour. In the wake of a major disaster, people take much more care, their attitude to risk changes, and policies and processes are introduced to prevent reoccurrence. However, these efforts can soon dissipate, allowing things to revert to a norm in which trigger events are missed, ignored or not reported properly.
So, even though we know it makes sense to learn from failure, it is clearly hard to do.
When it comes to safety, aviation is one of the most advanced industries in embracing learning from failure – and, as we will see, other industries have tried to learn from its example.
Planes carry two black boxes: one, the flight data recorder, records the instructions sent to all on‐board electronic systems; the other, the cockpit voice recorder, captures conversations and sounds in the cockpit.
These boxes provide a rich source of data for independent investigators to study in the event of accidents and near misses. Lessons are learned from these incidents for the good of the industry – a system that “Black Box Thinking” author Matthew Syed has called an ‘open loop.’
A ‘no‐blame culture’ has been institutionalized in this industry, because any evidence compiled by accident investigators is inadmissible in court. This provides the incentive for individuals to speak up. Syed refers to this mindset and approach as ‘black box thinking.’
Airlines must implement the recommendations, and the data underlying each report should be made available to all pilots. This transparency aids learning across airlines.
Of course, if an investigation found that a person had been negligent, then blame and punishment would be justified. But the starting point is not to seek out who is to blame – it is to learn the lessons.
The key factors at work are a rich source of data on failures and near misses, independent investigation and the sharing of lessons across the industry. Cultural factors – such as breaking down hierarchies and encouraging people to speak up – are also critical. But it has taken around a decade to reach this state of maturity.
Other sectors, such as health care, have tried to learn from this approach. For example, the UK's National Health Service (NHS) has taken steps to ensure that learning from adverse events is built into its culture and operations. It has noted that when things go wrong, the response has often been to identify who is to blame – but the focus of investigations should, rather, be on the events immediately preceding a failure.
The NHS recognizes that failure can sometimes be the result of negligent or criminal behaviour, but that it is more often the result of a huge number of factors beyond the control of any one individual. It is the system, rather, that needs analysis.
Similar to Edmondson's distinction between preventable and unavoidable failures, the NHS distinguishes between active failures and latent conditions. It defines active failures as ‘unsafe acts’ (typically short‐lived and often unpredictable) committed by those working at the sharp end of a system. Latent conditions, in contrast, can develop over time and lie dormant before combining with other factors or active failures to breach a system's safety defences. They are long‐lived and, unlike many active failures, can be identified and removed before they cause an adverse event.
By examining the latent conditions at work, it is possible to remove these factors and help reduce the likelihood of an extreme adverse outcome.
In the early part of this century, the NHS placed an emphasis on four key areas: (1) a unified mechanism for reporting and analysis when things go wrong; (2) a more open culture in which errors or service failures can be reported and discussed; (3) mechanisms for ensuring that, where lessons are identified, the necessary changes are put into practice; and (4) a much wider appreciation of the value of the system approach in preventing, analyzing and learning from errors.
Over the past six years, public failures have forced the NHS to reconsider its approach to failure. For example, the 2013 Berwick review set out the key lessons learned from a number of failures at UK hospitals, most notably urging the NHS to embrace wholeheartedly an ethic of learning. The Secretary of State for Health subsequently announced, in July 2015, that the NHS would set up a new independent investigation branch, modelled on the Air Accidents Investigation Branch used in the aviation industry.
The Healthcare Safety Investigation Branch was launched in April 2017. This body investigates up to 30 safety incidents a year, placing an emphasis on learning rather than blaming. The lessons can then be shared across hospitals to try to make sure that mistakes aren't repeated and that the system is more resilient.
The healthcare industry has certainly taken a more robust approach, but only time will tell whether it proves as effective as aviation's approach to learning from failure.
The US Army has also developed an approach that aims to learn the lessons from failure. The US Army's After Action Review (AAR) process involves a systematic debriefing after every mission, project, or critical activity. This process is framed by four simple questions: (1) What did we intend to accomplish? (2) What actually happened? (3) Why did it happen that way? and (4) What will we sustain or improve next time?
Lessons move up the chain of command and are shared through sanctioned websites; the results are then codified by the Center for Army Lessons Learned (CALL).
So, how applicable is this for financial services? After all, lives are not typically at risk and errors are not always as immediately obvious as a plane crash; indeed, sometimes they take many years to come to light.
David Blake and Matthew Roy recently argued that insights from the learning culture in aviation were highly relevant to the pension industry. In their report, Bringing Black Box Thinking to the Pensions Industry, they set out their argument for how ‘black box thinking’ could be applied to trustees of defined benefit (DB) pension schemes.
After interviewing UK DB pensions leaders and experts, they found evidence of a ‘closed loop’ mindset, including not setting strong measurable targets; inertia in decision making; herding behaviour; shifting goal posts; failing to take ownership of mistakes; and blaming others.
Blake and Roy also argue that organizations must create structures to mitigate human cognitive biases. The key is to have a mindset that constantly questions the status quo and seeks improvement, and they suggest various ways to break “group think” and introduce better measurement.
The pensions regulator, the authors contend, should be a clearing house for post‐mortems of failed schemes – akin to the role that accident investigators have in aviation. If this role were established, they argue, it could help reduce the likelihood of common mistakes (such as inappropriate hedging) being made across schemes.
While we can encourage individual banks, insurance companies and asset managers to embrace a learning culture, should the industry go further and create structures where there is more sharing of information on failure across the industry for the good of all firms? Clearly, this would need to be in areas where there is no danger of firms being perceived as colluding. What's more, for firms to want to share data, it would have to be in areas that are not regarded as sources of competitive advantage.
One such area is cybersecurity. Initiatives in the UK, such as the Cyber Security Information Sharing Partnership (CiSP), aim to improve the sharing of cybersecurity incident information at a national level. However, it is commonly believed that there is significant under‐reporting of incidents within firms. Without improvement at this lower level, national incident sharing initiatives cannot become as effective or institutionalized as in the aviation industry.
Perhaps it's not just managers who think about failures in the wrong way – maybe we all do: firms, regulators, politicians and the public. The knee‐jerk reaction of asking ‘who is to blame?’ is counterproductive.
Equally, it is a challenge to achieve the right balance between a no‐blame culture and instilling a strong sense of accountability.
In the UK, the Treasury Committee's November 2018 inquiry into IT failures in the financial services sector provides a good example of the obstacles standing in the way of implementing no‐blame cultures. Citing an “astonishing” number of recent technology failures, Nicky Morgan, a Member of Parliament (MP) and the chair of the Treasury Committee, said: “Millions of customers have been affected by the uncertainty and disruption caused by failures of banking IT systems. Measly apologies and hollow words from financial services institutions will not suffice when consumers aren't able to access their own money and face delays in paying bills.”
It's not surprising that there is anger over these incidents. However, rather than start with the tone of blame, perhaps a more productive approach is to ask why there is a growing number of incidents. What does it signify and what are the lessons? Are these incidents symptoms of preventable failures in predictable operations or unavoidable failures in complex systems?
Looking across industries, the following ingredients appear to be required for effectively embedding learning from failure: a no‐blame culture that encourages people to speak up; effective incident reporting and data capture; and a mindset that embraces learning from mistakes.
For lessons to spread across an industry, there is the further need for an independent body to analyze failures, distil the key lessons and promote them. This is probably easier when lives are at risk, because that raises the stakes in a way that gets our attention.
In a recent speech, Sam Woods, CEO of the Prudential Regulation Authority (PRA), recognized the distinction between topics where the incentives of firms and the regulator can be aligned and those where they are not.
Examples of the former are cyber risk and operational resilience, where firms and the regulator should be on the same side, sharing information. Under those scenarios, the PRA will act as a ‘good cop.’ In other areas, such as ring‐fencing, accountability, pay and internal models, the PRA will tend to be ‘bad cop.’
Undoubtedly, there are areas in between. But the ‘good cop’ role sounds similar to an industry mechanism for learning from failure, and it raises some interesting, highly relevant questions.
Although it isn't clear what the role of the regulator should be, nor the best organizational structure to achieve this, it is clear that good‐quality data on failures and near misses are the bedrock.
About the Author
Jo Paisley, Co‐President, GARP Risk Institute, served as the Global Head of Stress Testing at HSBC from 2015‐17, and as a stress testing advisor at two other UK banks. As the Director of the Supervisory Risk Specialists Division at the Prudential Regulation Authority, she was also intimately involved in the design and execution of the UK's first concurrent stress test in 2014.