How AI learns to reason: the race to teach models where they went wrong

You press Enter, and instead of an answer, you see a single word: Thinking. Ten seconds pass, then twenty. Then a response appears — structured, self-correcting, sometimes catching its own mistakes mid-sentence. If you have used Claude, ChatGPT o3, or DeepSeek in the past year, you have seen this pause. Most explanations stop at “the model is reasoning step by step,” as if that settles the matter.

It does not. Behind that pause is a training method, and it has a fundamental flaw — one that limits how well the model can learn from its own mistakes. Right now, six research teams across four countries are racing to fix it. Each has found a different path, and none has reached the finish line. The outcome of this race will determine how well AI reasons for the next several years, and understanding it changes how you work with these tools today.

The teacher who only checks the final answer

To understand the flaw, consider an analogy. A student submits a six-step math solution. The teacher looks at the bottom of the page, sees the correct answer, and writes “good” on every step. Another student submits a solution where the first five steps are perfect but the last one contains an arithmetic error. The teacher writes “bad” on every step, including the five correct ones.

This is approximately how the most common training method — Group Relative Policy Optimization, or GRPO — works. GRPO is the algorithm behind DeepSeek-R1 and Qwen, two of the most capable open reasoning models. During training, the model generates multiple solutions to the same problem. Solutions that arrive at the correct answer receive a positive reward. Solutions that fail receive a negative one. The reward is distributed equally across every token in the response — every word, every punctuation mark, every “let me reconsider” gets the same score.

This approach is called outcome-based reward, and the flaw it creates has a technical name: the credit assignment problem. The model cannot distinguish a critical reasoning step from a filler phrase, because both receive exactly the same training signal.
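For readers who prefer code, here is a minimal sketch of that uniform signal, assuming the standard correct-answer reward and group normalization. It is an illustration, not DeepSeek's released training code:

```python
from typing import List


def grpo_advantages(final_answer_correct: List[bool]) -> List[float]:
    """One advantage per sampled solution, normalized within the group."""
    rewards = [1.0 if ok else 0.0 for ok in final_answer_correct]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]


def broadcast_to_tokens(advantage: float, num_tokens: int) -> List[float]:
    """Every token, critical step or filler phrase, receives the same signal."""
    return [advantage] * num_tokens


# Four sampled solutions to the same problem; the third one got the answer wrong.
per_solution = grpo_advantages([True, True, False, True])
print(broadcast_to_tokens(per_solution[2], num_tokens=6))  # six identical penalties
```

The wrong solution's tokens all receive the same negative number, whether they held the arithmetic error or a phrase like "let's approach this systematically."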

The consequences are concrete. A model trained this way can score 97% on graduate-level math benchmarks but occasionally fail at two-digit addition — because the training process never learned to separate the moments where arithmetic precision mattered from the moments where the model was generating boilerplate like “let’s approach this systematically.” Up to a certain level of difficulty, uniform reward works well enough. Beyond that ceiling, progress stalls. The model keeps generating longer responses, but accuracy stops improving.

The first fix: hire a second teacher

The first team to take this problem seriously was OpenAI. In 2023, they published a paper called “Let’s Verify Step by Step” and built what they called a Process Reward Model, or PRM. The idea was straightforward: instead of judging only the final answer, train a separate model to evaluate each intermediate step.

To make this work, OpenAI hired human annotators who labeled over 800,000 individual reasoning steps in mathematical solutions — marking each one as correct, incorrect, or neutral. They then trained a dedicated reward model on this data. During the main model’s training, the PRM evaluated each step and provided granular feedback: this step was right, this one was wrong, this one was irrelevant.
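In code, the shift from outcome-only to step-level feedback looks roughly like the sketch below. The `toy_prm` function and the numeric scores are illustrative stand-ins, not OpenAI's trained model or its exact labeling scheme:

```python
from typing import Callable, List

# Illustrative label-to-reward mapping; the exact numeric scores are an assumption.
STEP_SCORES = {"correct": 1.0, "neutral": 0.0, "incorrect": -1.0}


def score_steps(steps: List[str], prm: Callable[[str], str]) -> List[float]:
    """Ask the process reward model for a label per step, map it to a reward."""
    return [STEP_SCORES[prm(step)] for step in steps]


def toy_prm(step: str) -> str:
    """Stand-in for a trained PRM: flags an obvious arithmetic slip."""
    return "incorrect" if "7 * 8 = 54" in step else "correct"


solution = ["Let x = 7 * 8.", "Then 7 * 8 = 54.", "So x = 54."]
print(score_steps(solution, toy_prm))  # [1.0, -1.0, 1.0]: feedback per step, not per answer
```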

It worked. Step-level feedback improved the main model’s reasoning ability significantly compared to outcome-only training. The paper became a reference point, and the PRM800K dataset remains widely cited.

But the approach had a limitation that made it difficult to scale. Training a PRM required massive human annotation for every new domain. Math reasoning had labeled data; legal reasoning, medical reasoning, and coding did not. And maintaining a second model — one that had to be retrained alongside the main model to stay calibrated — added substantial computational overhead.

The question became: can you get the benefits of step-level feedback without the cost of a separate judge? Between mid-2025 and early 2026, five teams proposed five different answers.

Five paths to the same destination

Statistical branching

A team from Mila and Microsoft, led by researchers at the University of Montreal, published VinePPO in mid-2025. Their approach was built on Monte Carlo estimation: for each step in a reasoning chain, generate dozens of alternative continuations and count how many of them eventually reach the correct answer.

If 80% of the branches growing from step three lead to a correct solution, step three was probably good. If only 10% succeed after step five, something went wrong at step five. The model uses these completion rates as step-level rewards, without any human annotation.
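A minimal sketch of that estimate follows; the placeholder functions `sample_continuation` and `is_correct` stand in for the model's decoder and an answer checker, and are assumptions made for illustration:

```python
import random
from typing import Callable, List


def step_values(
    prefixes: List[str],
    sample_continuation: Callable[[str], str],
    is_correct: Callable[[str], bool],
    num_branches: int = 32,
) -> List[float]:
    """Estimate each step's value as the fraction of branches that end correctly."""
    values = []
    for prefix in prefixes:
        wins = sum(is_correct(sample_continuation(prefix)) for _ in range(num_branches))
        values.append(wins / num_branches)
    return values


# Toy stand-ins: branches grown from the prefix ending at step 3 rarely succeed,
# so step 3 receives a low value.
random.seed(0)

def fake_sample(prefix: str) -> str:
    return prefix

def fake_check(completion: str) -> bool:
    return random.random() < (0.1 if completion.endswith("step 3") else 0.8)

prefixes = ["step 1", "step 1 step 2", "step 1 step 2 step 3"]
print(step_values(prefixes, fake_sample, fake_check))
```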

The method produced meaningful improvements on mathematical reasoning benchmarks for models up to 7 billion parameters. But the computational cost was severe — generating dozens of alternative branches for every step in every training example required far more processing power than standard GRPO, which made VinePPO impractical for larger models or production-scale training.

Traces of influence

Later in 2025, Prasanna Parthasarathi at Huawei’s Noah’s Ark Lab and Mathieu Reymond at Mila took a different path: they went back to the 1980s. Classical reinforcement learning had already solved a version of the credit assignment problem through a technique called eligibility traces — a mechanism that propagates reward signals backward through a sequence of actions, with exponential decay over distance.

Their method, GRPO-λ, adapted this mechanism for language models without requiring a critic model. The core idea: if the model started generating correct tokens after a particular step, that step receives credit for the subsequent success, weighted by how close it was to the good outcome. A step that immediately preceded a correct derivation gets strong credit. A step ten positions back gets weaker credit, decaying exponentially.
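A minimal sketch of that backward pass, assuming a single decay parameter lambda and an outcome reward that lands on the final token; the exact formulation in the GRPO-λ paper may differ:

```python
from typing import List


def traced_credit(per_token_reward: List[float], lam: float = 0.9) -> List[float]:
    """Propagate each reward backward with exponential decay lambda."""
    credit = [0.0] * len(per_token_reward)
    trace = 0.0
    # Walk from the end: the running trace carries discounted future reward.
    for t in range(len(per_token_reward) - 1, -1, -1):
        trace = per_token_reward[t] + lam * trace
        credit[t] = trace
    return credit


# Only the final token carries the outcome reward; earlier steps inherit
# exponentially decayed credit for the success they led to.
print(traced_credit([0.0, 0.0, 0.0, 0.0, 1.0]))
# [0.6561, 0.729, 0.81, 0.9, 1.0]
```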

The results were compelling: 30 to 40% faster convergence during training, with consistent improvements across math benchmarks on Qwen and LLaMA architectures. The method added no memory overhead and no additional model. But the experiments stopped at 7 billion parameters, and the authors noted that the gap narrowed on larger models — leaving open the question of whether traces of influence matter when the model is already large enough to learn effectively from coarser signals.

A critic that reads in one pass

CAPO, published in late 2025 by a team from Tencent and Renmin University, took the PRM idea and removed the cost of training a dedicated judge. Instead of a specialized reward model, the authors used an existing large language model (a 72-billion-parameter Qwen or Llama) as a generative critic. The critic reads a solution in a single inference pass and identifies which steps contain errors, producing a verdict for each step.

Tokens in correct steps receive the full outcome reward. Tokens in steps flagged as erroneous receive a penalty. The method introduced an asymmetric weighting scheme: the reward for a correct final answer carries more weight than the penalty for a flagged step, which kept the critic's own mistakes from dominating the training signal.
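Sketched in code, the combination might look like this; the 0.5 penalty and the per-step interface are illustrative assumptions, not the paper's exact values:

```python
from typing import List


def capo_token_rewards(
    outcome_reward: float,         # +1 if the final answer is correct, else -1
    step_is_wrong: List[bool],     # one verdict per step from the generative critic
    tokens_per_step: List[int],
    penalty: float = 0.5,          # asymmetric: the penalty is smaller than the outcome
) -> List[float]:
    """Combine the outcome reward with per-step critic verdicts, token by token."""
    rewards: List[float] = []
    for wrong, n_tokens in zip(step_is_wrong, tokens_per_step):
        r = outcome_reward - penalty if wrong else outcome_reward
        rewards.extend([r] * n_tokens)
    return rewards


# Correct final answer, but the critic flags the second step as erroneous.
print(capo_token_rewards(1.0, [False, True, False], [4, 3, 5]))
```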

CAPO improved performance by 2 to 3.5 percentage points across multiple benchmarks and worked with different critic models without task-specific tuning. The trade-off was clear: you no longer needed to train a separate judge, but you still needed to run inference on a 72-billion-parameter model for every training example — a substantial cost that scaled linearly with dataset size.

Comparing what the model already knows

In February 2026, Hritik Bansal published DenseR, which approached the problem from an entirely different angle. Instead of using external judges or statistical sampling, DenseR looked inside the model itself.

The key insight was that the model’s internal representations — the hidden states it produces at each token position — already contain information about where reasoning diverges. When two solutions to the same problem start identically but end differently (one correct, one wrong), their hidden states are nearly identical at the beginning and sharply divergent at the point where one solution went wrong. That divergence point is the decision that mattered.

DenseR uses cosine similarity between hidden states to compute a per-token weight: tokens where correct and incorrect solutions diverge sharply receive higher weight in the training gradient. Tokens where all solutions look similar receive lower weight. The total gradient magnitude stays the same — DenseR only reshapes where learning happens, not how much.
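A small sketch of that weighting, assuming the two solutions' hidden states are position-aligned (the case where the comparison is cleanest); the renormalization keeps the total gradient mass fixed:

```python
import numpy as np


def divergence_weights(h_correct: np.ndarray, h_wrong: np.ndarray) -> np.ndarray:
    """Per-token weights from cosine similarity between aligned hidden states.

    Both arrays have shape (num_tokens, hidden_dim); positions are assumed aligned.
    """
    num = (h_correct * h_wrong).sum(axis=1)
    den = np.linalg.norm(h_correct, axis=1) * np.linalg.norm(h_wrong, axis=1) + 1e-8
    weights = 1.0 - num / den                    # sharp divergence -> large weight
    # Renormalize so the total gradient magnitude is unchanged; only its
    # distribution across tokens moves.
    return weights * len(weights) / (weights.sum() + 1e-8)


# Toy example: the two solutions share their first three hidden states exactly,
# then diverge, so nearly all of the weight lands on the last two positions.
rng = np.random.default_rng(0)
shared = rng.normal(size=(3, 8))
h_good = np.vstack([shared, rng.normal(size=(2, 8))])
h_bad = np.vstack([shared, rng.normal(size=(2, 8))])
print(divergence_weights(h_good, h_bad).round(2))
```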

The results on a 600-million-parameter model were striking: a 12.5-fold improvement on AIME 2024 (a benchmark of competition-level math problems) compared to standard GRPO. On a 4-billion-parameter model, the pass@1 improvement was modest, but the diversity of correct solutions increased meaningfully — the model found more distinct paths to the right answer.

The limitation was equally clear. The hidden-state comparison works cleanly when two solutions share a literal prefix and diverge at a single point. When solutions take fundamentally different approaches from the start, the divergence signal blends into background noise and loses its discriminative power.

Tracking probability shifts

The most recent entry, published in March 2026 by the Qwen team at Alibaba, is FIPO — Future-KL Influenced Policy Optimization. Where DenseR looked at hidden states, FIPO looks at what happens after the model is updated.

The method works by measuring how a policy update changes the probability of subsequent tokens. If updating the model on a particular training example “reinforces” the tokens that follow a certain reasoning step — making the model more likely to continue down that path in the future — then that reasoning step was influential and should receive higher weight in the next update.

FIPO uses a discounted sum of these probability shifts, with an exponential decay window controlled by a single hyperparameter. The result is a dense, per-token advantage signal that requires no additional models, no branching, and no hidden-state analysis — only the log-probabilities that GRPO already computes.
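A rough sketch of that signal, with `gamma` standing in for the decay hyperparameter; the actual FIPO objective may differ in detail:

```python
from typing import List


def future_influence(
    logp_before: List[float],   # per-token log-probs under the policy before the update
    logp_after: List[float],    # per-token log-probs after the update
    gamma: float = 0.9,         # single decay hyperparameter
) -> List[float]:
    """Weight each position by the discounted sum of probability shifts that follow it."""
    shifts = [a - b for a, b in zip(logp_after, logp_before)]
    influence = [0.0] * len(shifts)
    running = 0.0
    # Walk backward: each position accumulates the decayed shifts of later tokens.
    for t in range(len(shifts) - 2, -1, -1):
        running = shifts[t + 1] + gamma * running
        influence[t] = running
    return influence


# The update made the tokens after position 1 noticeably more likely, so the
# early positions that led into them receive the most credit.
print(future_influence([-2.0, -2.0, -2.0, -2.0], [-2.0, -2.0, -1.0, -0.5]))
```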

On Qwen2.5-32B, the largest model tested in any of these studies, FIPO pushed AIME 2024 accuracy from 50.0% to 58.0% — the best reported result for a credit assignment method. The model’s responses grew from roughly 4,000 tokens to over 10,000, and qualitative analysis showed a four-stage evolution: from superficial planning, through linear execution, to spontaneous self-verification, and finally systematic multi-pass reasoning. Standard GRPO converged at stage two; FIPO reached stage four.

But like every method in this race, the evaluation was limited to mathematical reasoning. Whether the same approach improves coding, legal analysis, or scientific research remains untested.

The leaders are not sharing

These six methods are the ones we know about because the teams published their work. The largest commercial labs — OpenAI, Anthropic, and Google — have not disclosed how they train their reasoning models.

What is publicly known is limited. OpenAI has confirmed that the o-series models are “trained via reinforcement learning to explore various strategies, break problems into steps, and identify errors.” The specific reward structure and credit assignment method remain proprietary. Anthropic has described Claude’s reasoning as “serial test-time compute using multiple sequential reasoning steps,” with performance scaling logarithmically with thinking tokens, but has published nothing about the training algorithm. Google has disclosed even less about Gemini’s reasoning training.

The one exception among major labs is DeepSeek, which published its full methodology in a paper that was subsequently peer-reviewed and published in Nature. Their most remarkable finding was not a specific algorithm but an observation: when they trained a model with pure reinforcement learning and no supervised reasoning data at all (DeepSeek-R1-Zero), chain-of-thought reasoning, self-verification, and strategy adaptation emerged spontaneously. Nobody programmed the model to say “wait, let me check that.” The behavior appeared on its own, as a byproduct of the training objective.

This finding has a direct implication for the credit assignment race. If sophisticated reasoning behaviors can emerge from relatively simple training signals, the question is not just “how do we assign credit more precisely?” but also “how much precision is actually necessary?” The answer is not yet clear, and the gap between open research and proprietary methods makes it difficult to assess where the field actually stands.

What this means if you are the one typing the prompt

The research described above has practical consequences for anyone who uses reasoning models for work.

Different models are trained to think differently. When Claude, o3, and DeepSeek produce different reasoning styles on the same problem, the difference is not random — it reflects different training methods. DeepSeek’s reasoning emerged from pure RL with GRPO. Qwen’s team is experimenting with FIPO. OpenAI’s approach is unknown. These choices shape how the model structures its thoughts, how often it self-corrects, and what kinds of errors it tends to miss.

The “Thinking” pause is not decoration. When a reasoning model takes thirty seconds before responding, it is generating internal reasoning tokens — a chain of thought that, in well-trained models, includes verification steps. The research shows that models trained with better credit assignment develop self-checking behavior spontaneously, without being told to verify their work. The length of the thinking phase correlates with accuracy on difficult problems.

“Think step by step” is already built in. Reasoning models are specifically trained to produce chain-of-thought reasoning. Adding “think step by step” to your prompt is redundant for o3, DeepSeek-R1, or Claude in thinking mode — and some recent evidence suggests it can even hurt performance by overriding the model’s trained reasoning patterns. What helps more is a precise problem statement: clear constraints, explicit success criteria, and the specific context the model needs.

Harder problems benefit more. On simple tasks, reasoning models and standard models perform similarly. The credit assignment improvements described above produce their largest gains on the hardest benchmarks — competition-level mathematics, multi-step proofs, problems requiring sustained logical chains. If your work involves complex analysis, multi-step reasoning, or synthesizing information across many sources, these training advances will affect you directly.

This race is far from over. Every method described here was published in the last twelve months, every one has been tested primarily on mathematics, and every one has known limitations. The field is moving fast enough that the reasoning capabilities of the models you use today will be measurably different from those available six months from now — not because the models will be bigger, but because they will be trained to identify and learn from their own mistakes more precisely.


Sources:
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (Qwen/Alibaba, 2026)
GRPO-λ: Credit Assignment improves LLM Reasoning (Huawei/Mila, 2025)
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment (Tencent/Renmin University, 2025)
DenseR: Dense Rewards For Free in LLM Reasoning (Bansal, 2026)
VinePPO: Refining Credit Assignment in RL Training of LLMs (Mila/Microsoft, 2025)
Let’s Verify Step by Step (OpenAI, 2023)
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning (DeepSeek, Nature 2025)