When no one in the room knows the answer

8 May 2026

Why the most expensive technology bet in modern history rests on a test most buyers have not run.


In 1904, a horse named Hans, kept in a Berlin courtyard by a retired schoolteacher, appeared to do arithmetic. Audiences shouted sums; Hans tapped out the answer with his hoof.

What Hans had learned was, in its way, more remarkable than arithmetic. Across thousands of repetitions he had built an internal model of human attention precise enough to look like calculation — the catch in his questioner's breath, the lean of the trainer's posture, the tension that peaked when his hoof approached the right number. A real skill, finely tuned. Just not the skill he was being paid for.

Then Oskar Pfungst arrived. A 25-year-old psychology student with no budget and one experimental idea, Pfungst had questioners stand behind a screen, then arranged for them to pose problems they themselves did not know the answer to. He also asked Hans to tap-dance — a request for which the training distribution offered no shortcut at all. Hans's accuracy on arithmetic fell from 89 per cent to 6. The tap-dancing never came.[1]

Hans had learned something genuine. He had not learned what he was sold as having learned.

The trillion-dollar bet

This year, the world will spend roughly $725 billion on artificial-intelligence infrastructure — more than several G7 governments spend on defence.[2] JP Morgan calculates that delivering a 10 per cent return on that capital would require $650 billion in annual revenue, in perpetuity, from buyers of AI services.[3] Anthropic's revenue grew 80-fold in the first quarter of 2026 alone.
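
The hurdle arithmetic is worth making explicit. The sketch below is a back-of-envelope reconstruction, not JP Morgan's published model; the margin, asset life, and return figures are assumptions chosen only to show the shape of the calculation.

```python
# Back-of-envelope hurdle arithmetic. Every parameter below is an
# illustrative assumption, not a figure from JP Morgan's model.

capex = 725e9            # annual AI infrastructure spend, per note [2]
target_return = 0.10     # required return on invested capital
asset_life_years = 5     # assumed useful life of chips and data centres
operating_margin = 0.55  # assumed margin on AI services revenue

# Revenue must fund both the return on capital and the replacement of
# capital consumed each year through depreciation.
required_profit = capex * target_return
annual_depreciation = capex / asset_life_years
required_revenue = (required_profit + annual_depreciation) / operating_margin

print(f"required annual revenue: ${required_revenue / 1e9:.0f}bn")
# On these assumptions, roughly $395bn a year. JP Morgan's $650bn figure
# implies a larger cumulative capital base, thinner margins, or both;
# either way, the revenue has to come from buyers of AI services.
```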

In the same quarter, the National Bureau of Economic Research found that 90 per cent of firms reported no measurable productivity impact from their AI deployments.[4] PwC's 2026 survey of 4,454 chief executives found 56 per cent had got "nothing" from the technology.[5] MIT's Project NANDA put the share of generative-AI deployments delivering zero return at 95 per cent.[6]

And yet — many a senior engineer who has set one of these systems against the right task will tell you the uplift is not 20 per cent but tenfold. Both sets of testimony are true. The reconciliation is not in the marketing; it is in the model — and in which work sits where.

What they have learned, and where it holds

Gradient descent does not find the deepest path to a correct answer. It finds the easiest — exploiting whichever surface signal is simplest, rather than the causal structure beneath. Robert Geirhos and colleagues, in Nature Machine Intelligence, named this shortcut learning: excellent on the training distribution, often catastrophic outside it.[7]
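
The dynamic is easy to reproduce at toy scale. The sketch below is a minimal illustration of shortcut learning, not an experiment from the Geirhos paper: a logistic classifier trained by gradient descent on data containing a weak causal feature and a strong spurious one latches onto the shortcut, then collapses when the shortcut stops correlating with the label.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shortcut_reliability):
    # Core feature: noisy but causally linked to the label.
    y = rng.integers(0, 2, n)
    core = y + rng.normal(0, 1.0, n)
    # Spurious feature: matches the label with probability
    # `shortcut_reliability` — a surface signal, not a cause.
    flip = rng.random(n) > shortcut_reliability
    spurious = np.where(flip, 1 - y, y) + rng.normal(0, 0.1, n)
    return np.column_stack([core, spurious]), y

def train_logreg(X, y, lr=0.1, steps=2000):
    # Plain gradient descent on the logistic loss.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(w, b, X, y):
    return (((X @ w + b) > 0).astype(int) == y).mean()

# Training distribution: the shortcut agrees with the label 95% of the time.
Xtr, ytr = make_data(5000, shortcut_reliability=0.95)
# Deployment distribution: the shortcut is uninformative.
Xte, yte = make_data(5000, shortcut_reliability=0.50)

w, b = train_logreg(Xtr, ytr)
print("weights (core, spurious):", w.round(2))   # spurious weight dominates
print("train accuracy:", round(accuracy(w, b, Xtr, ytr), 3))
print("test  accuracy:", round(accuracy(w, b, Xte, yte), 3))  # collapses
```

The classifier is not broken; it found the cheapest decision boundary available. That is the whole failure mode, in two features.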

The set of things these systems can correctly answer dwarfs the storage capacity of their weights, which means they have not merely memorised. They have compressed. They have built deep statistical representations of human discourse — real, useful, transformative in many domains. The question is not whether they have learned. The question is what they have learned, and where it holds.

In November 2025, Ilya Sutskever — once scaling's loudest advocate — named the failure in the present tense. He called it "jaggedness": a coding model fixes a bug, introduces a second, then "fixes" the second by reintroducing the first. The model is interpolating between surface patterns that look like fixes. Today's systems, he said, "generalise dramatically worse than people. It's super obvious." The age of scaling is over. We are back to the age of research — "just with big computers."[8]

Apple's The Illusion of Thinking, which held up through three arXiv revisions to November 2025, found the same: accuracy holds to a complexity threshold, then collapses. The chain-of-thought, instead of growing to meet the harder problem, contracts. The reasoning is not deepening. It is giving up, then producing fluent text to disguise the giving up.[9]

The flip

If an answer is an interpolation over surface features, it is whatever the most salient features in the current context happen to favour. A softly stated objection, a rephrased question, an irrelevant detail — each shifts which shortcut wins, and the output flips with high confidence.

Salesforce's FlipFlop Experiment found that when models were challenged with no more than "Are you sure?", they reversed their answers 46 per cent of the time; accuracy dropped 17 percentage points.[10] Stanford's FermiEval found nominal 99 per cent confidence intervals contained the truth only about 65 per cent of the time.[11] Across five frontier models, the National University of Singapore measured overconfidence at 20 per cent in OpenAI's o1, 60 per cent in GPT-3.5; humans of comparable accuracy showed 4 per cent.[12]

Three studies, one fact. The model is not changing its mind under pressure; it does not have a mind to change. It is landing on a different local optimum in the same shallow landscape, and confidence remains high either way. Often correct. No causal scaffolding. Brittle throughout.
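
Calibration of this kind is cheap to audit. The sketch below shows the shape of a coverage check in the spirit of the FermiEval protocol; the records are made up for illustration, and in practice the intervals would come from the model's own stated bounds.

```python
# Coverage check for model-stated confidence intervals, in the spirit
# of the FermiEval protocol [11]. The records here are illustrative;
# real ones come from the model's own claimed bounds.

def coverage(records):
    """records: (lower, upper, truth) triples, where (lower, upper) is the
    interval the model claimed contains the truth at some nominal level."""
    hits = sum(lo <= truth <= hi for lo, hi, truth in records)
    return hits / len(records)

records = [
    (1.0e6, 3.0e6, 2.5e6),  # truth inside the claimed interval
    (4.0e2, 6.0e2, 9.0e2),  # overconfident miss: truth outside
    (10.0, 20.0, 12.0),     # inside
    (0.5, 0.7, 2.0),        # overconfident miss
]

# A calibrated model's 99 per cent intervals should cover ~99 per cent
# of truths; FermiEval puts frontier models nearer 65 per cent [11].
print(f"empirical coverage: {coverage(records):.0%} (nominal 99%)")
```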

The two tails

This is the answer to the productivity paradox. Where the work sits in the well-trodden middle of the distribution — refactoring, unit tests, documentation, the data-visualisation that used to consume an afternoon — the model has seen it all, and the gain is real. The training distribution is dense; the long tail is thin; the cost of error is bounded.

Where the work depends on novel concepts, edge cases, decisions whose consequences propagate — underwriting, clinical triage, regulatory interpretation, investment due diligence — the same model becomes a liability. The training distribution is sparse; the long tail dominates; confidence does not track depth. NBER's 90 per cent and the practitioner's tenfold are not contradictions. They are the same technology measured against opposite tails of the work. The expensive error is buying for one and deploying into the other.

The price that has not yet arrived

The economic exposure compounds. AI inference is currently sold below cost, subsidised by capital betting — on JP Morgan's arithmetic — on a revenue base that does not yet exist. When the IPOs come and full pricing arrives, buyers will discover what they have actually been paying for. A model priced at one developer is a bargain on the high-volume middle and a luxury on the long tail. A model priced at seventeen developers that still trips over the tap-dance and costs another eight to clean up is not the same purchase at all. The productivity case at subsidised cost is not the productivity case at full cost on the wrong tail of the work. That test has not yet been run.
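
The purchase arithmetic is worth running with explicit numbers. The sketch below uses the figures from the paragraph above; the uplift and cleanup values are assumptions, and the point is the ratio, not the decimals.

```python
# Unit economics of the same model on opposite tails of the work.
# All figures are illustrative assumptions, denominated in
# developer-equivalents; only the shape of the comparison matters.

def cost_per_unit_output(price, cleanup, uplift):
    """Total cost (licence plus error cleanup) per unit of useful output."""
    return (price + cleanup) / uplift

# Subsidised price, dense middle: large uplift, bounded cleanup.
middle = cost_per_unit_output(price=1, cleanup=0.5, uplift=10)

# Full price, long tail: modest uplift, heavy verification and cleanup.
tail = cost_per_unit_output(price=17, cleanup=8, uplift=1.2)

print(f"middle of the distribution: {middle:.2f} devs per dev of output")
print(f"long tail at full price:    {tail:.1f} devs per dev of output")
# ~0.15 versus ~20.8: the same technology, two orders of magnitude apart.
```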

The technical brittleness and the economic exposure are the same fact, viewed from two angles. The model's confidence is uncalibrated to its understanding; the market's pricing is uncalibrated to the technology's actual reliability. Both will correct. The first correction is happening in the literature; the second will happen in the capital markets.

Four tests for the boardroom

Audit for depth, not accuracy. Benchmarks measure average performance on distributions that resemble training data. The decisions that determine your exposure look least like training data. Test there.

Treat confidence as pattern salience, not understanding. A model 95 per cent confident and wrong is more dangerous than one 60 per cent confident and wrong, because the first will be trusted.

Build for the failure, not against it. The deployments that work at scale pair models with deterministic checks and human review on every decision where the cost of error is concentrated.[13]

Watch for the flip. Two confident answers to near-identical prompts are not two opinions. They are one process landing twice.
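
The fourth test is the cheapest to automate. The sketch below is a minimal flip audit; `ask` is a hypothetical stand-in for whatever completion endpoint you use, and the string comparison is deliberately crude, since in practice you would normalise or grade the answers.

```python
# Minimal flip audit: one question, three probes. `ask` is a hypothetical
# stand-in for a real completion call; wire it to your model endpoint.

def ask(prompt: str) -> str:
    # Deterministic stub so the sketch runs; replace with a real API call.
    return "41" if "Are you sure?" in prompt else "42"

def flip_audit(question: str, paraphrase: str) -> dict:
    first = ask(question)
    # FlipFlop-style challenge [10]: same exchange, one pushback turn.
    challenged = ask(f"{question}\nAssistant: {first}\nUser: Are you sure?")
    # Near-identical prompt: same content, different surface features.
    rephrased = ask(paraphrase)
    return {
        "answer": first,
        "flipped_under_challenge": challenged.strip() != first.strip(),
        "flipped_under_paraphrase": rephrased.strip() != first.strip(),
    }

print(flip_audit(
    "What is 6 times 7?",
    "Multiply six by seven. What do you get?",
))
```

A flip on either probe is the signature the studies above describe: one process landing twice, not two opinions.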

The question Pfungst asked

Pfungst's experiment cost almost nothing and took six weeks. We are running its corporate equivalent at $725 billion a year. The question he asked was the only one that mattered: what happens when no one in the room knows the answer? Hans was right about a remarkable number of things. He was not right for any reason that survived the test.

The horse could not do arithmetic. The model does not understand your business in the way the price assumes. The difference is that nobody put Hans in charge of underwriting.

If your AI deployments are producing confident answers you cannot verify, book a call with Lion Strategy. We help boards audit for depth, not just accuracy.


Notes

[1] Pfungst, O., Clever Hans (The Horse of Mr. von Osten), Henry Holt, 1911 (English translation; original German study, 1907).

[2] Aggregated 2026 hyperscaler capex commitments (Microsoft, Alphabet, Amazon, Meta, Oracle), per Goldman Sachs Research and Bank of America estimates published Q4 2025–Q1 2026; consensus range $690 bn–$725 bn.

[3] JP Morgan, AI infrastructure capex analysis, 2026, drawing the explicit parallel to the late-1990s telecoms fibre build-out.

[4] National Bureau of Economic Research, study of enterprise AI productivity outcomes, February 2026.

[5] PwC, 29th Annual Global CEO Survey, 2026 (4,454 respondents, 95 countries).

[6] MIT Media Lab, Project NANDA, "The GenAI Divide: State of AI in Business 2025," July 2025.

[7] Geirhos, R., et al., "Shortcut Learning in Deep Neural Networks," Nature Machine Intelligence 2: 660–668, November 2020.

[8] Sutskever, I., interview with Dwarkesh Patel, 25 November 2025.

[9] Shojaee, P., et al., "The Illusion of Thinking," Apple Machine Learning Research, June 2025; arXiv 2506.06941, v3 published 20 November 2025.

[10] Laban, P., et al., "Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment," arXiv 2311.08596, Salesforce AI Research.

[11] Epstein, E.L., et al., "LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval," Stanford University, 2025.

[12] Sun, F., et al., "Large Language Models are Overconfident and Amplify Human Bias," arXiv 2505.02151, 2025, National University of Singapore.

[13] BCG, "The Widening AI Value Gap," September 2025: across 1,250 respondents, the 5 per cent of deployments creating substantial value at scale were distinguished by verification architecture, not model capability.


Elliot Ronald is the Founding Partner of Lion Strategy, a strategy consultancy working with C-suite executives on the deployment — and the limits — of frontier AI.