Back in 1978, David Rockefeller was called before Congress to explain why Chase Manhattan wasn’t lending more to small businesses and minority borrowers. The head of Citibank testified first and gave a clean answer: Our cost of capital is 12%, these loans don’t return that much, so we can’t make them. Rockefeller was impressed. When he got back to New York, he asked his vice presidents what Chase’s return on capital was.
They hemmed. They hawed. They said they didn’t have the most up-to-date figure at their fingertips and would get back to him that afternoon.
Each VP went down to his analysts and asked for the number. Producing a defensible answer would have been a months-long project with serious definitional questions. The analysts ate lunch together and worked out a strategy. The figures had to come in below 12% – Rockefeller would not be happy to learn Chase trailed Citi – but close enough to seem credible. They agreed to scatter the numbers slightly so they wouldn’t look like a conspiracy. One VP reported 11.1%. Another 10.8. A third 11.4.
Rockefeller blew up. These numbers are all over the map. Doesn’t anyone here know what’s going on?
I heard this story from several of the analysts not long after, and I’m telling it half a century later from memory. The details vary in my sources, and I have no witnesses to the VP meetings. Treat it as a possibly semi-fictional but entirely believable account of what banking was like before risk management existed as a discipline.
Aaron Brown
The point is what wasn’t there. One of the largest banks in the country had no answer to a basic question about its own performance. The Citibank number was almost certainly no better grounded. It was a number that worked in a Senate hearing. Both banks were running on Wrong Numbers, and nobody at the top knew it because nobody had built the apparatus that would let them know.
That apparatus is what those of us in the first generation of modern financial risk management spent our careers building. And it’s worth remembering, especially now, that the field of financial risk management was founded on the recognition that the numbers in front of senior management were wrong.
A Field Built to Replace Wrong Numbers
Three episodes anchor the founding period.
In 1991, John Reed at Citibank looked at the bank’s notional foreign exchange derivatives exposure and realized it exceeded the bank’s capital. He told his people he would shut the derivatives business down unless someone could produce a risk metric he could tolerate. The metric that emerged – capital-at-risk, or CaR – was a popular early candidate for a one-size-fits-all risk measure. Notional exposure was a Wrong Number for the question being asked. CaR was an attempt at a right one.
Around the same time, Dennis Weatherstone at J.P. Morgan demanded what became known as the 4:15 Report: a single page summarizing the firm’s market risk across every desk and product, on his desk 15 minutes after the New York close, while index futures and currency markets were still open. Building the report required the invention of Value-at-Risk (VaR). VaR drew on the same instinct as Reed’s CaR but was measured over a shorter time horizon (one to 10 days versus a year) and at a lower confidence level (90% or 95%, not 99.97%).
At Bankers Trust, Charles Sanford and his team pushed in a third direction: risk-adjusted return on capital, or RAROC, in place of the accounting return on equity that David Rockefeller’s analysts had fudged.
The work that began at these three institutions connects the analyst-lunch spitballing of 1978 to modern financial risk management practice. All three efforts came from senior executives looking at the numbers in front of them and concluding the numbers were wrong. Risk management is structurally an answer to Wrong Numbers.
The Problem Now
Forty years on, the situation has inverted. Risk managers today don’t suffer from too few numbers. We drown in them. Climate-risk vendors deliver physical-risk scores at the asset level. ESG raters hand us company scores. Ratings agencies produce credit assessments. Internal model validation teams produce model approvals. Regulators produce stress-test scenarios. Now AI systems produce summaries, classifications, and predictions across all of the above.
Most of these numbers are produced by people we don’t supervise, using methods we can’t fully audit, drawing on data we can’t always see. The standard quality signals – peer review, prestigious source, credentialed author, regulatory blessing – were never reliable filters for obvious errors, and they’re getting less reliable. My new book, Wrong Number, walks through 30 cases in which the most respected institutions, journals, and researchers produced numbers with errors any numerate reader could spot. The cases come from public health, climate, economics, criminology, and finance. The pattern is the same across all of them.
What follows are five tests I’ve found useful for catching Wrong Numbers before they become my problem. Each takes less than a minute. None requires specialized statistical training. They don’t replace formal model validation; they are the screen you run before model validation, on the numbers you didn’t realize were models.
Test 1: Is this number woven into a web, or thrown through a window?
Twenty-six centuries ago, Xenophanes of Colophon wrote that “all is but a woven web of guesses” – that no single human can know the final truth, that even someone who stumbled on it would not recognize it, and that knowledge advances only through the collective work of testing and weaving. It is the best one-line description I know of how reliable knowledge actually gets made.
A useful number sits inside such a web. It can be replicated, contradicted, refined, and built on. Other researchers have tried to break it. The number’s producers expect this and welcome it. No single thread is strong; the structure is.
A Wrong Number is typically a thread that no one tried to weave into anything useful. It goes directly to the press, the regulator, the policymaker, or the senior executive, claiming the authority of science but bypassing the part of science that builds useful knowledge. It is not integrated with related work, and other researchers do not build on it. It is not a guess offered for testing; it is a pronouncement.
In risk management practice, this catches most third-party climate physical-risk scores. The same asset gets dramatically different ratings from different vendors. The vendors do not reconcile their differences, do not publish methodologies in forms others can replicate, and do not have a body of literature checking their work against realized losses. There is no web.
A pricing model on a derivatives desk, by contrast, is woven tight – reconciled daily against trades, against other desks’ marks, against the firm’s overall book, and against the market itself when positions close out. It earns trust the slow way, the way Xenophanes said knowledge has to.
Test 2: Would the producer bet on it?
Among the advantage gamblers I spent decades around at Bill Eadington’s International Conference on Gambling and Risk-Taking, the standard rejoinder to an academic claiming a profitable trading edge or sports bet was: then why are you poor? The five-word version of a serious epistemic test. If you have a useful number, the natural thing to do is act on it.
Those of us selling Value-at-Risk in the early days had to pass exactly this test, twice over. On the trading floor, we pitched VaR as a point spread – the language a trader understood. A point spread converts every sports bet to an even-money proposition. VaR converted every P&L item into a 19:1 shot. The traders insisted we put up cash. Who would trust a risk manager on billion-dollar trading decisions if he wouldn’t put $10,000 of his own money on his model?
In the executive suite, we pitched VaR as an actuarial calculation – the language a CEO understood. Actuaries earn credibility by producing predictions that get tested against realized claims. The executive suite made us produce P&L charts and validate our VaR numbers against actual trading outcomes, day after day. Both audiences, in their own dialects, were making the same challenge: wanna bet?
That is the test. Would the producer of this number put their P&L on it? An ESG rating shop with no investment book of its own is one answer. A climate-vendor score whose producer has no skin in physical-asset insurance is another. An internal model team whose performance is measured on regulatory acceptance rather than realized loss accuracy is a third.
Numbers produced by people who never have to live with being wrong drift, slowly, toward whatever serves the producer’s actual incentives.
The test isn’t dispositive. Some honest analysts produce honest numbers without trading on them. Lots of people bet and lose. But the question reliably surfaces structural reasons to discount – and reminds us that the discipline we built was forced through this gate at its founding.
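The day-after-day validation against trading outcomes described in this test can be sketched as a simple breach count. This is a minimal illustration, not the method any of the firms above actually used: it assumes a 95% one-day VaR (the 19:1 shot), and the simulated P&L, the constant VaR forecast, and the function names `count_var_breaches` and `expected_breaches` are all hypothetical.

```python
import random
from statistics import NormalDist


def count_var_breaches(pnl, var_forecasts):
    """Count days whose loss exceeded that day's VaR forecast.

    VaR is given as a positive number, so a breach is pnl < -VaR.
    """
    return sum(1 for p, v in zip(pnl, var_forecasts) if p < -v)


def expected_breaches(n_days, confidence=0.95):
    """Breaches expected if the VaR is calibrated: 1 day in 20 at 95%."""
    return n_days * (1.0 - confidence)


# Hypothetical data: 500 days of simulated P&L with unit daily volatility.
random.seed(0)
n_days = 500
pnl = [random.gauss(0.0, 1.0) for _ in range(n_days)]

# Hypothetical model: a constant 95% VaR taken from the normal quantile.
var_95 = [NormalDist().inv_cdf(0.95)] * n_days

breaches = count_var_breaches(pnl, var_95)
print(f"breaches: {breaches}, expected if calibrated: {expected_breaches(n_days):.1f}")
```

A breach count far above the expected figure suggests the VaR understates risk; a count far below suggests it overstates risk. Either way the number has been put to the bet.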
Test 3: Does the calculation match what a non-expert would assume it means?
The 2013 National Transportation Safety Board (NTSB) study of curbside bus carriers – the one that shut down the Fung Wah “Chinatown bus” and 26 other operators – concluded that curbside companies were seven times more likely to be involved in a fatal accident than conventional carriers. The number ran in the New York Times, Bloomberg, Businessweek, and elsewhere. A reasonable reader assumed it meant: If you ride a curbside bus, you are seven times more likely to die.
That is not what it meant. It meant that if a curbside company bought a bus, that bus had a higher probability of being involved in a fatal accident over its operational life than if a conventional carrier had bought the same bus. Most of the fatalities were pedestrians and other drivers, not passengers. Most of the accidents were not the bus’s fault. And the calculation was built by averaging per-company ratios. The correct approach in this case was to divide total curbside fatalities by total curbside buses, in which case the difference between curbside and conventional carriers disappeared.
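The difference between averaging per-company ratios and dividing totals by totals is easy to see with a few made-up figures. The sketch below uses entirely hypothetical fleets, not the NTSB’s data; the names `mean_of_ratios` and `ratio_of_totals` are illustrative. One tiny operator with a single fatal accident dominates the average of ratios, while the pooled rates are identical.

```python
def mean_of_ratios(companies):
    """Average each company's accidents-per-bus ratio (the flawed approach)."""
    return sum(accidents / buses for accidents, buses in companies) / len(companies)


def ratio_of_totals(companies):
    """Divide total accidents by total buses (the pooled approach)."""
    total_accidents = sum(accidents for accidents, _ in companies)
    total_buses = sum(buses for _, buses in companies)
    return total_accidents / total_buses


# Hypothetical (fatal_accidents, buses) pairs, invented for illustration only.
curbside = [(1, 2), (0, 10), (0, 8), (4, 480)]   # 5 accidents, 500 buses
conventional = [(5, 500), (10, 1000)]            # 15 accidents, 1500 buses

avg_multiple = mean_of_ratios(curbside) / mean_of_ratios(conventional)
pooled_multiple = ratio_of_totals(curbside) / ratio_of_totals(conventional)

print(f"average-of-ratios multiple: {avg_multiple:.1f}x")
print(f"ratio-of-totals multiple:   {pooled_multiple:.1f}x")
```

With these invented numbers the two-bus operator alone contributes a ratio of 0.5, which is enough to make the whole category look many times more dangerous even though the pooled accident rate per bus is the same in both groups.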
In risk practice, this test catches most regulatory metrics. Capital ratios, liquidity coverage ratios, leverage ratios – each is calculated in a specific way that does not match what a senior executive or board member assumes from the name. The risk manager’s job is to know the gap and either close it or make the gap visible.
Test 4: Is this relative risk being used where only absolute risk should matter?
Even if the seven-times claim about curbside buses had been true, intercity bus travel is so safe that the absolute risk of dying as a passenger on a five-hour ride from a bus-at-fault accident is lower than the risk of dying from crossing the streets to and from the bus terminal. Seven times almost zero is still almost zero. The relative number was reported as if it implied a decision; the absolute number told you the decision was trivial.
Risk managers see this constantly. A model output that says “this exposure is 40% riskier than the comparable benchmark” is doing the work of the seven-times claim if the absolute risk is small enough that nothing in the portfolio depends on it. The discipline is to ask, every time a relative number arrives: What is the absolute number, and does it warrant any action at all?
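The question in this test can be reduced to a few lines of arithmetic. The sketch below is illustrative only: the baseline probabilities and the materiality threshold are hypothetical placeholders, not estimates of actual bus or portfolio risk, and the function names are invented for the example.

```python
def absolute_risk(baseline, relative_risk):
    """Absolute probability implied by a baseline and a relative-risk multiple."""
    return baseline * relative_risk


def excess_risk(baseline, relative_risk):
    """Additional absolute probability over the baseline."""
    return baseline * (relative_risk - 1.0)


def warrants_action(baseline, relative_risk, materiality):
    """True only if the excess absolute risk reaches the materiality threshold."""
    return excess_risk(baseline, relative_risk) >= materiality


# Hypothetical inputs, chosen only to show the shape of the calculation.
tiny_baseline = 1e-7      # assumed 1-in-10-million chance per event
materiality = 1e-4        # assumed level below which nothing changes

print(absolute_risk(tiny_baseline, 7.0))                  # seven times almost zero
print(warrants_action(tiny_baseline, 7.0, materiality))   # large multiple, tiny baseline
print(warrants_action(0.01, 1.4, materiality))            # modest multiple, large baseline
```

The design choice is to make the decision depend on the excess absolute risk rather than on the multiple, so a dramatic-sounding "seven times" and a modest-sounding "40% riskier" are judged on the same scale.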
Test 5: Will the producer share the underlying data?
The NTSB refused to release the company-level data behind its seven-times claim. Reason magazine’s Jim Epstein filed a Freedom of Information Act request and fought the agency for six months before being denied. Think about that for a moment. A federal safety agency had identified bus companies it deemed seven times more dangerous than their competitors – and refused to tell the public which companies those were.
Epstein eventually obtained the company classifications anyway, apparently by accident. What the data showed was worse than statistical error. Greyhound, the largest conventional carrier in the country, had been classified as a curbside operator. So had Peter Pan. Together, those two companies’ terminal-run buses accounted for 30 of the 37 fatal accidents the NTSB had attributed to curbside carriers. Strip out the misclassifications, and the seven-times claim collapses entirely. The “dangerous” curbside operators the agency had used to justify shutting down Fung Wah and 26 of its competitors were, in fact, the largest traditional bus companies in America.
When the agency was confronted with the misclassifications, its spokesperson said the NTSB “stands by its report.”
Refusal to share data is the cleanest signal a number is wrong. It is also the most reliable signal, because the producer’s reluctance is usually disproportionate to the apparent stakes. There is no good reason for a transportation safety agency to hide which companies it considers dangerous. There is no good reason for a climate-risk vendor to hide which assets received which scores and why. There is no good reason for an internal model team to resist a clear walk-through of inputs and adjustments.
When you encounter that resistance, the resistance is the answer.
Why This Matters Now
The Office of the Comptroller of the Currency (OCC) in April rescinded Bulletin 2011-12, taking with it most of the prescriptive scaffolding for U.S. model risk management. The new principles-based regime, paired with an imminent interagency RFI (request for information) on AI, generative AI, and agentic AI in models, leaves more judgment in the hands of institutions and the risk managers inside them.
That is, on net, the right direction. Prescriptive rules let bad models pass by satisfying checklists. Judgment-based regimes work only when the people exercising judgment have something to exercise it with.
The five tests above are not sufficient. They are the ones I run first because they are fast, require no specialized tools, and catch the largest category of Wrong Numbers – those produced by people who have no incentive to be right and no apparatus that would catch them if they were wrong.
Four decades ago, I joined a small group of like-minded quantitative financial traders who could no longer stomach the Wrong Numbers infecting finance. The numbers in front of us today are better on average but, since there are many more of them, the worst Wrong Numbers are even worse than what the bankers were telling Congress in 1978.
Aaron Brown is the 2011 GARP Risk Manager of the Year and the author of Wrong Number (Wiley, May 2026), as well as Red-Blooded Risk, The Poker Face of Wall Street, and Financial Risk Management for Dummies. He held risk roles at JPMorgan, Rabobank, Citigroup, Morgan Stanley, and AQR Capital Management, and was an early developer of Value-at-Risk. He writes for Bloomberg Opinion and Wilmott Magazine, and hosts the Wrong Number video series for Reason.