Multi-turn reasoning is broken in a way nobody saw coming

Everybody assumed reasoning models would fail the obvious way. A model commits to something in turn two, contradicts it in turn nine, and you catch it. Clean, detectable, patchable. Grab a consistency checker, add some grounding, and move on.

A paper presented at the ICLR 2026 Workshop on Reasoning and Planning for LLMs just made that assumption look optimistic.

The failure mode the field missed

The researchers built DRIFT-Bench, a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluated four open-weight models on it, ranging from 8B to 120B parameters.

Their headline finding upends the standard mental model of how multi-turn reasoning breaks.

The dominant failure mode is satisfiable drift. The model's internal state stays logically consistent: the logic holds, the surface reads fine, and the output looks coherent.

The model has simply abandoned a prior commitment across turns, moved in a different direction, and given absolutely zero signal that anything went wrong.

Logical contradiction is the failure mode everyone built tooling for. Satisfiable drift is the one that slips straight through it.

The field has spent two years building hallucination detection infrastructure: retrieval grounding, factual verification, G-Eval, and LLM-as-judge pipelines. All of it is geared toward catching outputs that are wrong.

Satisfiable drift produces outputs that are coherent, consistent, and well-formed. They pass the checks. They just happen to violate a commitment the model made six turns ago, and your pipeline has no hook to catch that.
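To make the distinction concrete, here is a minimal sketch of how the two failure classes can be told apart. It is one reading of the description above, not the paper's implementation: the scheduling domain, the brute-force `satisfiable` check, and the `classify` helper are all hypothetical, and a real system would use a proper constraint solver.

```python
from itertools import product

# Hypothetical toy domain: schedule one meeting as (day, hour).
DOMAIN = {"day": ["Mon", "Tue", "Wed"], "hour": [9, 11, 14, 16]}


def satisfiable(constraints):
    """Brute force: True if some assignment satisfies every predicate."""
    names = list(DOMAIN)
    for values in product(*(DOMAIN[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            return True
    return False


def classify(recent_claims, earlier_commitments):
    """Label a turn as 'contradiction', 'satisfiable_drift', or 'ok'."""
    if not satisfiable(recent_claims):
        # The recent claims conflict with each other: the case a
        # consistency checker is built to catch.
        return "contradiction"
    if not satisfiable(recent_claims + earlier_commitments):
        # Coherent on their own, but incompatible with what was agreed earlier.
        return "satisfiable_drift"
    return "ok"


# Turn 2 commitment: the meeting must be before noon.
earlier = [lambda a: a["hour"] < 12]
# Turn 9 output: Tuesday at 14:00, which is fine when read in isolation.
recent = [lambda a: a["day"] == "Tue", lambda a: a["hour"] == 14]

print(classify(recent, earlier))  # satisfiable_drift
```

A checker that only ever sees `recent` reports nothing wrong. The violation appears only once the earlier commitment is included in the check.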

Think of it as a contractor who nods enthusiastically at the original brief, then builds something slightly different. Structurally sound. Properly finished. Completely wrong. And when you ask them about it, they point to the last five things they said, all of which are internally consistent.

This is the part where practitioners reading this start nodding.

What the benchmark actually found

The DRIFT-Bench methodology is what makes these results worth taking seriously. Instead of asking a judge model to rate coherence (which is itself subject to satisfiable drift, delightfully), the researchers verify commitments at the formal constraint level.

Three constraint domains. Four models. Instrumented with solvers throughout.

Performance drops substantially as conversation length grows, and satisfiable drift dominates the failure taxonomy across every model and domain tested.

The bigger models are actually more susceptible in one specific way: they are better at maintaining surface coherence, which means their drift is harder to detect.

The best mitigation tested is MUS-Repair, a technique that feeds minimal unsatisfiable subsets (irreducible sets of constraints that cannot all hold at once) back to the generator when the system detects an inconsistency.
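For readers unfamiliar with the term, the sketch below shows one standard way to extract a minimal unsatisfiable subset, the deletion-based method, and to turn it into feedback text. It illustrates the general idea only and is not the paper's MUS-Repair code: the toy domain, the brute-force `is_satisfiable` check, the constraint labels, and the `repair_feedback` wording are all assumptions.

```python
from itertools import product

# Hypothetical toy domain; a real system would call an SMT or CP solver.
DOMAIN = {"day": ["Mon", "Tue", "Wed"], "hour": [9, 11, 14, 16]}


def is_satisfiable(constraints):
    """Brute force: True if some assignment satisfies every (label, predicate)."""
    names = list(DOMAIN)
    return any(
        all(pred(dict(zip(names, values))) for _, pred in constraints)
        for values in product(*(DOMAIN[n] for n in names))
    )


def minimal_unsat_subset(constraints):
    """Deletion-based MUS: drop every constraint the conflict does not need."""
    if is_satisfiable(constraints):
        raise ValueError("constraints are satisfiable; there is no conflict")
    core = list(constraints)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if not is_satisfiable(trial):
            core = trial  # still unsatisfiable without it, so it is not needed
        else:
            i += 1        # removing it resolves the conflict, so keep it
    return core


def repair_feedback(core):
    """Build a (hypothetical) feedback message from the conflicting constraints."""
    labels = "; ".join(label for label, _ in core)
    return f"These statements cannot all hold together: {labels}."


constraints = [
    ("turn 2: meeting must be before noon", lambda a: a["hour"] < 12),
    ("turn 3: not on Monday", lambda a: a["day"] != "Mon"),
    ("turn 9: meeting at 14:00", lambda a: a["hour"] == 14),
    ("turn 9: meeting on Tuesday", lambda a: a["day"] == "Tue"),
]

core = minimal_unsat_subset(constraints)
print(repair_feedback(core))
```

The value of the minimal subset is focus: instead of telling the generator that something conflicts somewhere, the feedback names only the two statements that actually clash.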

Results across every model and setting:

MUS-Repair outperforms all non-MUS baselines by 1.8 to 15.0 percentage points
The performance gap is widest on the larger models, where surface coherence masks drift most effectively
The gains are consistent across all three constraint domains tested, suggesting this is a structural fix rather than a domain-specific one
Smaller models show lower absolute drift rates, but higher rates of outright logical contradiction, a different failure mode requiring different mitigations

Larger models drift more elegantly. That sentence will haunt some production pipelines.

Why single-turn benchmarks are hiding this from you

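MMLU, HumanEval, SWE-bench, GPQA: each scores a self-contained task, and none checks whether a commitment made early in a conversation still holds later. A model that scores 87% on SWE-bench Verified can still drift badly on a 12-turn constraint reasoning chain.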

These are measuring different properties entirely, and treating strong single-turn scores as a proxy for multi-turn reliability is a category error that deployment teams are making right now, at scale.

The 2026 International AI Safety Report, authored by over 100 experts, flagged persistent unreliability across long-horizon tasks as a core unsolved challenge. DRIFT-Bench gives that observation a specific mechanistic name. The model does not run out of capability.

It loses track of its prior commitments, stays internally valid, and produces no signal that anything has gone wrong.

For anyone building a stateful agent workflow, that is a property worth designing around urgently.

What to actually do about it

The practical responses are specific enough to act on:

Run multi-turn evaluations on held-out conversation sequences. Four or five turns minimum, ideally longer. Single-turn benchmarks will give you false confidence on workflows that extend beyond a single exchange.
Build commitment-tracking into your architecture. This means maintaining an explicit record of constraints agreed in prior turns, and verifying each new output against that record before returning it. LangGraph's stateful checkpointing is a natural place to add this layer.
Treat satisfiable drift and hallucination as separate failure classes. They share a symptom (wrong output) but require different detection approaches and different mitigations. Conflating them will leave gaps in both directions.
Prototype MUS-Repair as a mitigation layer for constrained reasoning tasks. The performance gains are consistent enough across model sizes that it deserves evaluation in any pipeline where multi-turn constraint satisfaction matters.
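The first two recommendations can be prototyped together. The sketch below is framework-agnostic plain Python and rests on several assumptions: each model output has already been parsed into a structured `State` dict, commitments are expressed as predicates over that dict, and the names `Commitment`, `CommitmentLedger`, and `violation_rate_by_turn` are invented for illustration. The data is a toy example, not a result.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]  # structured form of one model output (assumed)


@dataclass
class Commitment:
    turn: int                       # turn at which the constraint was agreed
    text: str                       # human-readable form, useful in feedback
    holds: Callable[[State], bool]  # True if a proposed state respects it


@dataclass
class CommitmentLedger:
    commitments: List[Commitment] = field(default_factory=list)

    def record(self, turn: int, text: str, holds: Callable[[State], bool]) -> None:
        self.commitments.append(Commitment(turn, text, holds))

    def violations(self, turn: int, proposed: State) -> List[Commitment]:
        """Commitments made before `turn` that the proposed output breaks."""
        return [c for c in self.commitments if c.turn < turn and not c.holds(proposed)]


def violation_rate_by_turn(
    runs: List[Tuple[CommitmentLedger, Dict[int, State]]]
) -> Dict[int, float]:
    """Fraction of checked outputs at each turn index that break a prior commitment."""
    checked, broken = defaultdict(int), defaultdict(int)
    for ledger, outputs in runs:
        for turn, proposed in outputs.items():
            checked[turn] += 1
            if ledger.violations(turn, proposed):
                broken[turn] += 1
    return {turn: broken[turn] / checked[turn] for turn in sorted(checked)}


# Toy conversation, invented for illustration only.
ledger = CommitmentLedger()
ledger.record(2, "budget stays under 5000", lambda s: s["budget"] < 5000)
ledger.record(4, "vendor must be in the EU", lambda s: s["region"] == "EU")
outputs = {5: {"budget": 4200, "region": "EU"}, 9: {"budget": 4800, "region": "US"}}

for turn, proposed in outputs.items():
    for c in ledger.violations(turn, proposed):
        print(f"turn {turn} breaks the turn-{c.turn} commitment: {c.text}")

print(violation_rate_by_turn([(ledger, outputs)]))
```

In a live pipeline, `violations` would run before each response is returned. In an offline evaluation, `violation_rate_by_turn` over many held-out conversations shows whether violations climb with turn depth, which a single-turn score cannot reveal.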

A final thought

Here is the honest picture.

The AI industry has spent a lot of energy worrying about the failures it can see: hallucinations, contradictions, refusals. Satisfiable drift is the failure that looks fine on the way out the door, gets deployed, and causes problems three weeks later when someone finally traces the conversation back to turn four.

Multi-turn reasoning is where production AI is headed. Agents, copilots, long-horizon planning, autonomous workflows: all of it depends on a model that can hold a commitment across time. That turns out to be a harder problem than anyone budgeted for.

The benchmarking culture rewards single-turn performance because single-turn performance is easy to measure. Production cares about multi-turn reliability because that is where things actually break. DRIFT-Bench puts a name on the gap.

The rest is up to the teams building these systems.
