Span-level error localization for deep-research agents

Where Do Deep-Research Agents Go Wrong?

DRIFT audits long agent trajectories by tracking what the agent comes to believe, whether those claims are supported, and where unsupported commitments become harmful error spans.

Jiaming Wang*, Ziteng Feng*, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

NJU-LINK Team, Nanjing University · JIUTIAN Research · OPPO AI Agent Team

trajectory
s001
s002
s003
s004
s011
A Claim Keeper
B Support Seeker
C Dependency Tracer
Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Final-answer evaluation shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents, build TELBench from 1,000 expert-verified trajectories, and propose DRIFT, a claim-centric auditing framework that marks spans where unsupported or conflicting claims affect the answer path.

Benchmark

TELBench evaluates process-level reliability.

TELBench asks models to identify harmful error spans from ordered semantic spans, not from final answers alone. The benchmark is built from real deep-research agent runs and preserves benign exploration, failed searches, tentative hypotheses, and harmless noise.

2,790agent trajectories collected
1,000verified TELBench instances
600 / 400easy and hard split
3benchmarks: GAIA, XBench, BrowseComp
TELBench data curation pipeline
Data curation pipeline for trajectory collection, semantic-span segmentation, and expert verification.
Mechanism Analysis

Errors are shaped by workflow stage, fault family, and temporal position.

TELBench includes mechanism labels for analysis only. They reveal recurring failure families, stage-dependent error patterns, and differences across benchmarks, agent frameworks, and model families. These labels are never provided to evaluation models.

Mechanism analysis of annotated TELBench trajectories
Mechanism analysis over annotated trajectories, including fault families, workflow stages, first-error patterns, and the Verified-1K subset.
Method

From span scoring to claim auditing.

A failed trajectory is rarely a single bad span. Agents search, compare candidates, revise hypotheses, and later reuse earlier claims as if they were established facts. DRIFT therefore diagnoses errors by auditing claims and their dependencies, rather than asking a model to classify every span independently.

A

Claim Keeper

Builds a compact ledger of decision-critical claims, including where each claim appears, where it becomes consequential, and which later spans use it.

B

Support Seeker

Checks whether each consequential claim is directly supported, weakly supported, missing support, or contradicted by raw trajectory evidence.

C

Dependency Tracer

Locates spans where risky claims become harmful commitments and later spans that reuse, amplify, or finalize the same unsupported claim.

Architecture

Claim ledger, graph-grep support, and final localization.

DRIFT keeps the input clean: every module receives only the task question and ordered raw span text, with no gold labels, judge results, manual notes, span types, or generated summaries.

Overview of the DRIFT claim-centric auditing workflow
DRIFT turns a long trajectory into a claim-support audit before predicting error span ids.
Results

DRIFT improves span-level localization across model families.

On TELBench, DRIFT consistently improves overall macro-F1 over bare full-context prompting. The gains hold across GPT-5.4, DeepSeek-V3.2, Claude-Sonnet-4.6, and Gemini-2.5-Pro, showing that structured claim-centric auditing provides useful signal beyond stronger base models alone.

GPT-5.4 52.48 DRIFT macro-F1, +18.55 over bare
DeepSeek-V3.2 50.51 DRIFT macro-F1, +28.05 over bare
Claude-Sonnet-4.6 54.91 DRIFT macro-F1, +33.02 over bare
Gemini-2.5-Pro 48.41 DRIFT macro-F1, +17.40 over bare
Overall macro-F1 on TELBench
Overall macro-F1 on TELBench. DRIFT outperforms bare full-context prompting and generic agentic auditing baselines across model families.
Citation

Cite DRIFT

@misc{wang2026deepresearchagentswrongspanlevel,
      title={Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories}, 
      author={Jiaming Wang and Ziteng Feng and Jiangtao Wu and Ruihao Li and Qianqian Xie and Yuxiang Ren and He Zhu and Xueming Han and Fanyu Meng and Junlan Feng and Jiaheng Liu},
      year={2026},
      eprint={2606.02060},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.02060}, 
}