Span-level error localization for deep-research agents

Where Do Deep-Research Agents Go Wrong?

DRIFT audits long agent trajectories by tracking which claims the agent comes to believe, whether those claims are supported, and where unsupported commitments become harmful error spans.

Jiaming Wang*, Ziteng Feng*, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

NJU-LINK Team, Nanjing University · JIUTIAN Research · OPPO AI Agent Team


Illustration: a trajectory of spans (s001, s002, s003, s004, s011) audited by three modules: A, Claim Keeper; B, Support Seeker; C, Dependency Tracer.

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Final-answer evaluation shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents, build TELBench from 1,000 expert-verified trajectories, and propose DRIFT, a claim-centric auditing framework that marks spans where unsupported or conflicting claims affect the answer path.

Benchmark

TELBench evaluates process-level reliability.

TELBench asks models to identify harmful error spans from ordered semantic spans, not from final answers alone. The benchmark is built from real deep-research agent runs and preserves benign exploration, failed searches, tentative hypotheses, and harmless noise.
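To make the task shape concrete, here is a minimal sketch of what one instance and one prediction might look like. The class names, field names, and the toy trajectory are all invented for illustration; the actual TELBench schema is not described on this page and may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    span_id: str  # e.g. "s001"; the id format is assumed for illustration
    text: str     # raw span text from the agent run


@dataclass
class Instance:
    question: str
    spans: list[Span]  # ordered semantic spans
    # Only harmful error spans are labeled; benign exploration stays unlabeled.
    gold_error_ids: set[str] = field(default_factory=set)


def validate_prediction(instance: Instance, predicted_ids: list[str]) -> set[str]:
    """Drop duplicate ids and ids that do not exist in the trajectory."""
    known = {span.span_id for span in instance.spans}
    return {sid for sid in predicted_ids if sid in known}


# Toy trajectory (invented): the failed search in s002 is benign, while the
# unsupported commitment in s003 and the answer built on it in s004 are harmful.
example = Instance(
    question="Which year was the bridge completed?",
    spans=[
        Span("s001", "Search: bridge completion year"),
        Span("s002", "Search returned no results; trying another query"),
        Span("s003", "Commits to 1932 without citing any source"),
        Span("s004", "Final answer: 1932"),
    ],
    gold_error_ids={"s003", "s004"},
)

cleaned = validate_prediction(example, ["s003", "s003", "s999"])
```

The point of the sketch is that the prediction target is a set of span ids, and that a failed search such as s002 is not an error merely because it was unproductive.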

2,790 agent trajectories collected

1,000 verified TELBench instances

600 / 400 easy and hard split

3 benchmarks: GAIA, XBench, BrowseComp

TELBench data curation pipeline: trajectory collection, semantic-span segmentation, and expert verification.

Mechanism Analysis

Errors are shaped by workflow stage, fault family, and temporal position.

TELBench includes mechanism labels for analysis only. They reveal recurring failure families, stage-dependent error patterns, and differences across benchmarks, agent frameworks, and model families. These labels are never provided to evaluation models.

Mechanism analysis of annotated TELBench trajectories, including fault families, workflow stages, first-error patterns, and the Verified-1K subset.

Method

From span scoring to claim auditing.

A failed trajectory is rarely a single bad span. Agents search, compare candidates, revise hypotheses, and later reuse earlier claims as if they were established facts. DRIFT therefore diagnoses errors by auditing claims and their dependencies, rather than asking a model to classify every span independently.

Claim Keeper

Builds a compact ledger of decision-critical claims, including where each claim appears, where it becomes consequential, and which later spans use it.

Support Seeker

Checks whether each consequential claim is directly supported, weakly supported, missing support, or contradicted by raw trajectory evidence.

Dependency Tracer

Locates spans where risky claims become harmful commitments and later spans that reuse, amplify, or finalize the same unsupported claim.
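The three roles above can be sketched as a small data flow: a ledger of claims, a support label per claim, and a tracing step that turns risky claims into span ids. This is a toy illustration under stated assumptions, not the DRIFT implementation. The names `Claim`, `Support`, `RISKY`, and `trace_error_spans` are hypothetical, the support labels are set by hand rather than by a model, and treating weakly supported claims as non-risky is a simplification chosen only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Support(Enum):
    # The four outcomes named in the Support Seeker description.
    DIRECT = "directly supported"
    WEAK = "weakly supported"
    MISSING = "missing support"
    CONTRADICTED = "contradicted"


@dataclass
class Claim:
    # One ledger entry, following the Claim Keeper description.
    claim_id: str
    text: str
    first_span: str          # where the claim appears
    consequential_span: str  # where it becomes consequential
    used_in: list[str] = field(default_factory=list)  # later spans that use it
    support: Optional[Support] = None  # filled in by the support check


# Assumption for this sketch: only these two outcomes count as risky.
RISKY = {Support.MISSING, Support.CONTRADICTED}


def trace_error_spans(ledger: list[Claim], span_order: list[str]) -> list[str]:
    """Return spans where a risky claim is committed to or reused, in order."""
    flagged: set[str] = set()
    for claim in ledger:
        if claim.support in RISKY:
            flagged.add(claim.consequential_span)
            flagged.update(claim.used_in)
    return [sid for sid in span_order if sid in flagged]


span_order = ["s001", "s002", "s003", "s004", "s011"]
ledger = [
    Claim("c1", "The bridge was completed in 1932",
          first_span="s002", consequential_span="s003",
          used_in=["s011"], support=Support.MISSING),
    Claim("c2", "The bridge is in Sydney",
          first_span="s001", consequential_span="s004",
          support=Support.DIRECT),
]
error_spans = trace_error_spans(ledger, span_order)
```

Note that s002, where the unsupported claim first appears as a tentative hypothesis, is not flagged in this sketch; only the span where the claim becomes consequential and the spans that reuse it are.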

Architecture

Claim ledger, graph-grep support, and final localization.

DRIFT keeps the input clean: every module receives only the task question and ordered raw span text, with no gold labels, judge results, manual notes, span types, or generated summaries.
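One way to enforce this kind of input hygiene is an allow-list: build the module input from only the permitted fields instead of deleting the forbidden ones. The sketch below is an assumption about how such a filter could look. All field names (`question`, `spans`, `span_id`, `text`, `gold_error_ids`, `judge_result`, `span_type`, `summary`) are hypothetical, and passing span ids through alongside the raw text is assumed here so that predictions can refer to spans.

```python
# Hypothetical per-span fields that a module is allowed to see.
ALLOWED_SPAN_FIELDS = ("span_id", "text")


def build_module_input(record: dict) -> dict:
    """Keep only the task question and the ordered raw span text."""
    return {
        "question": record["question"],
        "spans": [
            {key: span[key] for key in ALLOWED_SPAN_FIELDS}
            for span in record["spans"]  # list order is preserved
        ],
    }


# Toy record (invented) carrying fields that must not reach a module.
record = {
    "question": "Which year was the bridge completed?",
    "gold_error_ids": ["s003"],
    "judge_result": "incorrect",
    "spans": [
        {"span_id": "s001", "text": "Search: bridge completion year",
         "span_type": "search", "summary": "initial query"},
        {"span_id": "s003", "text": "Commits to 1932 without citing any source",
         "span_type": "reasoning", "summary": "unsupported guess"},
    ],
}

clean = build_module_input(record)
```

An allow-list fails safe: a newly added annotation field is excluded by default rather than leaked until someone remembers to block it.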

Overview of the DRIFT claim-centric auditing workflow — DRIFT turns a long trajectory into a claim-support audit before predicting error span ids.

Results

DRIFT improves span-level localization across model families.

On TELBench, DRIFT consistently improves overall macro-F1 over bare full-context prompting. The gains hold across GPT-5.4, DeepSeek-V3.2, Claude-Sonnet-4.6, and Gemini-2.5-Pro, which suggests that structured claim-centric auditing adds signal that a stronger base model alone does not provide.

GPT-5.4: 52.48 DRIFT macro-F1, +18.55 over bare

DeepSeek-V3.2: 50.51 DRIFT macro-F1, +28.05 over bare

Claude-Sonnet-4.6: 54.91 DRIFT macro-F1, +33.02 over bare

Gemini-2.5-Pro: 48.41 DRIFT macro-F1, +17.40 over bare
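The bare-prompting scores are not listed here, but they can be backed out from the figures above if each gain is read as an absolute difference in macro-F1 points. That reading is an assumption; if the gains were relative, this arithmetic would not apply.

```python
# (DRIFT macro-F1, gain over bare), copied from the figures above.
drift_and_gain = {
    "GPT-5.4": (52.48, 18.55),
    "DeepSeek-V3.2": (50.51, 28.05),
    "Claude-Sonnet-4.6": (54.91, 33.02),
    "Gemini-2.5-Pro": (48.41, 17.40),
}

# Assumption: gain = DRIFT score - bare score, in absolute points.
implied_bare = {
    model: round(drift - gain, 2)
    for model, (drift, gain) in drift_and_gain.items()
}

for model, bare in implied_bare.items():
    print(f"{model}: implied bare macro-F1 = {bare:.2f}")
```

Under that assumption the implied bare scores are 33.93, 22.46, 21.89, and 31.01, up to rounding in the reported figures.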

Overall macro-F1 on TELBench. DRIFT outperforms bare full-context prompting and generic agentic auditing baselines across model families.
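For readers unfamiliar with span-level scoring, the sketch below shows one plausible way to compute a macro-averaged F1 over predicted error span ids: F1 per trajectory on the sets of span ids, then an unweighted mean across trajectories. This is an assumption, not the benchmark's official scorer. The exact averaging scheme and the handling of empty sets are not specified on this page, and `span_f1` and `macro_f1` are hypothetical names.

```python
def span_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between predicted and gold error span ids for one trajectory."""
    if not predicted and not gold:
        return 1.0  # assumed convention: nothing to find and nothing flagged
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Unweighted mean of per-trajectory F1 over (predicted, gold) pairs."""
    if not pairs:
        raise ValueError("need at least one trajectory")
    return sum(span_f1(pred, gold) for pred, gold in pairs) / len(pairs)


# Toy predictions (invented): one over-flagged trajectory, one exact match.
pairs = [
    ({"s003", "s004"}, {"s003"}),  # precision 0.5, recall 1.0
    ({"s002"}, {"s002"}),          # exact match
]
score = macro_f1(pairs)
```

Averaging per trajectory keeps long trajectories with many spans from dominating the score, which is one reason a macro average may be preferred over pooling all spans.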

Performance across the Qwen3 family. Scaling alone is insufficient: trajectory diagnosis does not improve monotonically with larger base models.

Sensitivity to span complexity. As trajectory span complexity increases, DRIFT preserves a stronger localization signal than bare prompting.

F1 growth as DRIFT modules are added, shown for representative model families.

Efficiency-performance trade-off across diagnostic frameworks.

Precision, recall, and F1 ablation across four models (appendix view).

Citation

Cite DRIFT

@misc{wang2026deepresearchagentswrongspanlevel,
      title={Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories}, 
      author={Jiaming Wang and Ziteng Feng and Jiangtao Wu and Ruihao Li and Qianqian Xie and Yuxiang Ren and He Zhu and Xueming Han and Fanyu Meng and Junlan Feng and Jiaheng Liu},
      year={2026},
      eprint={2606.02060},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.02060}, 
}