AI’s “Smell Test”: From Outcome Supervision to Process Supervision—How Do We Teach Machines to Truly Think?

Yesterday (14 June 2025), while listening to Lex Fridman’s podcast, I was struck by a remark from Terence Tao. Assessing current LLMs, he said that AI can already pass the eye test but fails miserably at the smell test¹.

He explained that AI‑generated mathematical proofs look impeccable on the surface (eye test), yet when domain experts “smell” the underlying logic, they uncover awkward, counter‑intuitive paths. What the models lack is the deep insight and strategic elegance that experts acquire through years of immersion—the metaphorical “scent” of mathematics.

This observation pierces the heart of today’s AI boom: the systems we build may be ever‑stronger answer machines but still fall short of becoming genuine thinkers.

In this article we dive into the origin of that “scent” and why it has become a critical bottleneck. We introduce the notion of process data, explain how its scarcity limits deep reasoning, and survey the cutting‑edge work on process supervision that is carving a route through the fog toward truly “thinking machines.”

Diagnosing the Ailment: Why Does AI Lack a “Sense of Smell”?

To understand the deficiency, we must examine what AI is fed. For years we have supplied it almost exclusively with what I call result data².

Result Data vs. Process Data

Result data is straightforward: a “question–answer” pair.

  • Question: “Solve the equation 3x - 6 = 9.”
  • Answer: “x = 5.”

This approach powers most AI applications we see today, yet it has a fatal flaw: it tells the AI what but not why or how. The journey is hidden; only the destination is shown.

By contrast, process data² includes not only the final answer but every verifiable step of reasoning that leads there.

  • Question: “Solve the equation 3x - 6 = 9.”

  • Process data:

    1. Goal: Solve for x.
    2. Step 1: Add 6 to both sides: 3x - 6 + 6 = 9 + 6.
    3. Step 2: Simplify to get 3x = 15.
    4. Step 3: Divide both sides by 3: x = 15 / 3.
    5. Final answer: x = 5.

The first is an isolated fact; the second is a full map of thought.
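
To make the contrast concrete, here is a minimal sketch in Python of how the same problem might be stored under the two paradigms. The schema and field names are illustrative choices of mine, not taken from any particular data set.

```python
# Hypothetical schema: the field names are illustrative only.

result_example = {
    "question": "Solve the equation 3x - 6 = 9.",
    "answer": "x = 5",  # only the destination is recorded
}

process_example = {
    "question": "Solve the equation 3x - 6 = 9.",
    "steps": [  # every verifiable step of the journey
        "Goal: solve for x.",
        "Add 6 to both sides: 3x - 6 + 6 = 9 + 6.",
        "Simplify: 3x = 15.",
        "Divide both sides by 3: x = 15 / 3.",
    ],
    "answer": "x = 5",
    "step_labels": [1, 1, 1, 1],  # a process-supervised trainer can grade each step
}
```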

The Iceberg Behind Fermat’s Last Theorem

For a deeper sense of the gap, consider Andrew Wiles’s 1995 paper proving Fermat’s Last Theorem: a hundred-plus-page masterpiece and perfect result data. Beneath it, however, lay eight years of secluded work and a mountain of draft notes recording failed paths, strategy shifts, and iterative corrections.

Those notes are priceless process data.

An AI trained only on the published paper admires a grand edifice; one that also studies the notes might learn how to become an architect.

Here lies the root of AI’s missing “sense of smell.” Human experts cultivate intuition by internalizing vast amounts of process data over years. They know not only what is right but also what is wrong. Our AIs ingest nearly all of the final papers but almost none of the eight‑year struggle, so naturally they miss this precious instinct.

The “Scarcity Trap” of Process Data

High‑quality process data is scarce for three main reasons:

  1. Prohibitive annotation cost: Domain experts must painstakingly document and vet every step.
  2. Publication norms: Academia and industry showcase polished results, hiding the valuable intermediate work and the failed attempts.
  3. A staggering scale gap: LLM pre-training needs trillions of tokens, yet public process-label data sets contain at most a few million step labels³, a gap of roughly six orders of magnitude.

Figure: The scarcity of process data

Together these factors make process data extremely scarce, a key bottleneck blocking AI from leaping from answer machine to thinker.

The Bottleneck Revealed: A Master of Form, a Mimic in Essence

Raised on result data, AI has become a strange acquaintance: extraordinarily capable yet puzzlingly brittle. It has perfected the “grammar” and form of text—mathematical proofs, legal briefs, code.

Hence it aces the eye test. But perfect form masks hollow content. Without process training, AI misses three essentials on the road to true intelligence:

  1. Genuine reasoning: Models interpolate between seen data but struggle to extrapolate to novel problems because they learn answer templates, not universal problem‑decomposition strategies.

  2. Reliability and robustness: Rewarding only the outcome encourages shortcut learning based on superficial correlations. The model becomes fragile—rephrase the question and it may collapse—fueling hallucinations.

  3. Explainability and trust: An AI that only outputs answers is a black box. We cannot trace its logic, diagnose its errors, or fully trust it in high‑risk fields such as medicine or autonomous driving.

Fortunately, where we diagnose a problem we often glimpse a solution. A cadre of top researchers has launched an exploration called process supervision.

Breaking the Deadlock: Three Waves of Process Supervision

The core idea is simple: instead of rewarding only the final correct answer, reward every correct step along the way.

The approach works like a patient math teacher who praises each correct substitution and simplification rather than only grading the final answer. This revolution has unfolded in three dramatic stages.
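
As a toy illustration of the difference in training signal (my own simplification, not any specific paper’s formulation): outcome supervision hands back one number for the whole attempt, while process supervision hands back one number per step.

```python
from typing import List

def outcome_rewards(num_steps: int, final_answer_correct: bool) -> List[float]:
    """Outcome supervision: a single all-or-nothing signal at the very end."""
    return [0.0] * (num_steps - 1) + [1.0 if final_answer_correct else 0.0]

def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision: credit (or blame) lands on the exact step that earned it."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]

# Example: four steps, the third one is wrong, so the final answer is wrong too.
print(outcome_rewards(4, final_answer_correct=False))   # [0.0, 0.0, 0.0, 0.0]
print(process_rewards([True, True, False, False]))      # [1.0, 1.0, 0.0, 0.0]
```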

Figure: The three waves of process supervision

Stage 1: Theoretical Foundations and the Costly “Gold Standard” (2023)

The overture came from DeepMind and OpenAI. In late 2022, DeepMind’s paper Solving Math Word Problems with Process- and Outcome-Based Feedback systematically distinguished the two forms of supervision and validated the process-based approach. OpenAI’s 2023 landmark Let’s Verify Step by Step then hired human labelers to annotate every step of math solutions, producing the PRM800K data set of about 800 000 step-level labels.

Models trained on PRM800K pushed accuracy on the MATH benchmark from roughly 72 %⁴ to about 78 %, proof that process supervision works; but the human annotation effort behind it was astronomical and hard to replicate.
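
As a rough sketch of how such a process reward model (PRM) is used at inference time (one common recipe, not necessarily OpenAI’s exact one): the PRM scores each step, the per-step scores are aggregated, and the best-scoring candidate solution wins a best-of-N search. The `score_step` callable below is a hypothetical stand-in for the trained reward model.

```python
from typing import Callable, List

StepScorer = Callable[[List[str], str], float]  # (previous steps, current step) -> P(correct)

def solution_score(steps: List[str], score_step: StepScorer) -> float:
    """Multiply per-step correctness probabilities: the chance that *every* step is right."""
    score = 1.0
    for i, step in enumerate(steps):
        score *= score_step(steps[:i], step)
    return score

def best_of_n(candidates: List[List[str]], score_step: StepScorer) -> List[str]:
    """Pick the candidate solution whose reasoning the process reward model trusts most."""
    return max(candidates, key=lambda steps: solution_score(steps, score_step))
```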

Stage 2: Automation on the Offensive, a Race to Cut Costs (2024)

Success spawned the next question: how do we make it cheap? In 2024 the spotlight shifted to automated cost reduction.

  • Labeling steps by playing them out: Math-Shepherd borrowed a trick from game-playing AI, scoring each intermediate step with Monte Carlo rollouts (how often completions sampled from that step still reach the correct final answer) and automatically generating hundreds of thousands of process labels.
  • Pinpointing errors: Researchers applied binary search to locate the first incorrect step in a wrong solution with only a logarithmic number of checks, slashing annotation workloads (see the sketch after this list).
  • Finer alignment: Step‑DPO turned “rating a step” into “choosing the better of two steps,” yielding more stable alignment.
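
Here is the bisection idea as a minimal sketch. The `prefix_can_succeed` callable is a placeholder for the expensive check these pipelines actually run: sample several completions from the prefix and see whether any still reaches the known correct answer.

```python
from typing import Callable, List

def first_error_step(steps: List[str],
                     prefix_can_succeed: Callable[[List[str]], bool]) -> int:
    """Return the index of the first incorrect step of a solution known to be wrong.

    Assumes monotonicity: prefixes ending before the first error can still be
    completed correctly, prefixes ending at or after it cannot. Binary search
    then needs only O(log n) rollout checks instead of n.
    """
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_can_succeed(steps[: mid + 1]):
            lo = mid + 1   # the error, if any, lies after `mid`
        else:
            hi = mid       # step `mid` or an earlier one already broke the solution
    return lo
```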

A new benchmark, ProcessBench, emerged to evaluate how well a model can identify the earliest erroneous step in a solution. The virtuous cycle of model development and benchmarking propelled the field forward.

Stage 3: Paradigm Shift and Generative Intelligence (2025)

By 2025 we witnessed a profound shift: can AI itself create process data? The focus moved from “data‑driven algorithms” to “algorithm‑driven data.”

The flagship is ThinkPRM. Using barely 1 % of PRM800K (about 8000 “gold-seed” annotations⁵), researchers taught the model to produce detailed, verifiable verification chains of thought. Once it had mastered that skill, the model could check and score vast numbers of additional reasoning steps on its own.

The results were stunning: ThinkPRM, trained on 8000 human‑labeled steps, outperformed first‑generation models that used 800 000. Process supervision entered the generative era, reducing dependence on human data by two orders of magnitude.
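
A minimal sketch of the generative-verification idea (not the actual ThinkPRM code; `generate` is a placeholder for whatever LLM call you have available): instead of emitting a bare score, the verifier is asked to write out its own checking chain of thought and finish with an explicit verdict.

```python
from typing import Callable, List, Tuple

def verify_solution(question: str, steps: List[str],
                    generate: Callable[[str], str]) -> Tuple[str, bool]:
    """Ask a model to think through a candidate solution before judging it."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = (
        f"Problem: {question}\n"
        f"Candidate solution:\n{numbered}\n\n"
        "Check each step carefully, explaining your reasoning. "
        "End with exactly one line: 'VERDICT: correct' or 'VERDICT: incorrect'."
    )
    critique = generate(prompt)                  # the critique itself is new process data
    verdict = "VERDICT: correct" in critique
    return critique, verdict
```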

Pure automation also reached new heights. OmegaPRM combined Monte Carlo Tree Search with the binary-search trick above into a fully automated pipeline that produced 1.5 million step labels. Researchers also extended process supervision beyond math and code: LongDPO applied it to creativity-heavy long-form writing.

New Challenges on the Horizon

Solving the process-data shortage does not mark the end of the road; fixing old problems often breeds new ones.

New Bottleneck 1: Compute Economics

We have swapped reliance on human labor for reliance on compute. MCTS-style search and ThinkPRM-style self-verification devour GPU hours, so the cost bottleneck shifts from labor to compute and still bars many players.

New Bottleneck 2: The Return of Explainability

One goal of process supervision was to open the black box. Yet as automation takes over, AI-generated reasoning paths might again become “alien,” filled with shortcuts humans cannot fathom. Do we want efficient alien intelligence or relatable partner intelligence?

New Bottleneck 3: Lagging Evaluation and Safety

Success so far centers on domains like math and code with objective standards. In law, research, or business strategy—murkier arenas—we badly lack standardized benchmarks. Moreover, applying process supervision while safeguarding proprietary knowledge is a must‑solve hurdle for industrial deployment.

Conclusion: From Feeding Data to Building Ecosystems

Let us return to Terence Tao’s “smell test.”

Surveying the recent journey, we can say the AI community has found the right road to cultivate “scent”: process supervision. We have watched it evolve from costly proof‑of‑concept to efficient automation and now to disruptive generative self‑improvement.

The deepest lesson is this: the future of AI will pivot from feeding models to empowering them.

Our role is shifting from data zookeepers to ecosystem architects. Our task is to design initial rules and environments so that AI can explore, self‑correct, and grow—ultimately producing the intelligence we need.

We stand at a new dawn where AI might become not just a powerful tool but a genuine thinking partner. The road remains long, but at least we can already smell the exhilarating scent of that future.

Footnotes

  1. This is my literary interpretation of Terence Tao’s original statement: “So the sense of smell, this is one thing that humans have, and there’s a metaphorical mathematical smell that it’s not clear how to get the AI to duplicate that eventually. So the way AlphaZero and so forth make progress on Go and chess and so forth, is in some sense they have developed a sense of smell for Go and chess positions, that this position is good for white, it’s good for black. They can’t initiate why, but just having that sense of smell lets them strategize. So if AIs gain that ability to a sense of viability of certain proof strategies, because I’m going to try to break up this problem into two small subtasks and they can say, “Oh, this looks good. The two tasks look like they’re simpler tasks than your main task and they’ve still got a good chance of being true. So this is good to try.” Or “No, you’ve made the problem worse, because each of the two subproblems is actually harder than your original problem,” which is actually what normally happens if you try a random thing to try normally it’s very easy to transform a problem into an even harder problem. Very rarely do you transform into a simpler problem. So if they can pick up a sense of smell, then they could maybe start competing with a human level of mathematicians.” ↩︎

  2. The terms “result data” and “process data” are informal labels for ease of discussion, not strict academic terminology. ↩︎ ↩︎

  3. Google’s Gemma paper notes “We trained Gemma models on up to 6 T tokens of text,” indicating trillion‑token corpora for mainstream LLMs, whereas OpenAI’s PRM800K contains only 800 000 step labels—sub‑million scale. ↩︎

  4. On a representative MATH subset with best‑of‑1860 search, accuracy rose from 72.4 % to 78.2 %. ↩︎

  5. The authors state they used ≈1 % of PRM800K for manual labeling; 1 % of 800 000 is roughly 8000. ↩︎