TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

100 Expert-Curated Tasks
10 Domains
9 Systems Evaluated
2 Languages

Abstract

Deep Research Agents have shown strong capabilities in multi-step information retrieval, reasoning, and long-form report generation. However, existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis.

To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes:

TVIR-Bench

A benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals.

TVIR-Agent

A hierarchical multi-agent framework for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports.

Dual-Path Evaluation

A comprehensive framework combining Textual Assessment and Visual Assessment for evidence-driven report evaluation.

Motivation

The Gap in Current Deep Research Systems

Existing deep research paradigms remain predominantly text-centric. Most benchmarks and agent frameworks evaluate success based on textual coherence, depth, and citation support, while overlooking a critical characteristic of real-world professional reports: the integration of visual evidence.

  • Visual elements treated as decorative supplements
  • Limited evaluation of visual fidelity and provenance
  • Mismatch between benchmarks and real-world demands

Figure 1: Comparison of representative deep research benchmarks. Existing benchmarks mainly focus on text-only or weakly multimodal reports, whereas TVIR-Bench requires text-visual interleaved reports with semantically grounded charts and retrieved images.

TVIR-Bench

A multimodal deep research benchmark with 100 expert-curated tasks spanning 10 domains, 3 complexity levels, and 2 languages.

Domain Coverage

Figure 2: Domain taxonomy of TVIR-Bench.

Task Design Principles

Role-Driven

Tasks grounded in realistic user needs

Demand-Oriented

Focused on practical requirements

Deep Research

Requires substantive analytical synthesis

Frontier-Focused

Novel and timely topics

Multimodal Integration

Explicit multimodal elements required

Data Construction & Evaluation Pipeline

Figure 3: Overview of TVIR-Bench, including data construction pipeline and evaluation framework.

Dataset Statistics

  • Total Tasks: 100
  • Chinese / English: 50 / 50
  • Major Domains: 10
  • Complexity Levels: 3

Dual-Path Evaluation Framework

Textual Assessment (TA)

  • CS - Citation Support
  • IA - Instruction Alignment
  • WQ - Writing Quality
  • ADB - Analytical Depth & Breadth
  • FLC - Factual & Logical Consistency

Visual Assessment (VA)

  • MC - Multimodal Composition
  • FQ - Figure Quality
  • FCQ - Figure Caption Quality
  • FCI - Figure-Context Integration
  • CSC - Chart-Source Consistency
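
The scores reported in the main results on this page are consistent with a simple unweighted aggregation: TA and VA each equal the mean of their five dimensions, and Overall equals the mean of TA and VA. The sketch below works under that assumption. The equal weighting is inferred from the reported numbers rather than stated on this page, and the names `path_score` and `aggregate` are hypothetical.

```python
from statistics import mean
from typing import Optional

TA_DIMS = ("CS", "IA", "WQ", "ADB", "FLC")
VA_DIMS = ("MC", "FQ", "FCQ", "FCI", "CSC")


def path_score(scores: dict, dims: tuple) -> Optional[float]:
    """Unweighted mean over one assessment path; None if any dimension is missing."""
    if any(scores.get(d) is None for d in dims):
        return None
    return mean(scores[d] for d in dims)


def aggregate(scores: dict) -> dict:
    """Return TA, VA, and Overall (assumed here to be the mean of TA and VA)."""
    ta = path_score(scores, TA_DIMS)
    va = path_score(scores, VA_DIMS)
    overall = None if ta is None or va is None else (ta + va) / 2
    return {"TA": ta, "VA": va, "Overall": overall}


# Dimension scores reported for TVIR-Agent (Claude-4.5-Sonnet) in the main results.
claude = {"CS": 51.20, "IA": 81.09, "WQ": 69.88, "ADB": 72.22, "FLC": 76.20,
          "MC": 77.80, "FQ": 87.17, "FCQ": 74.49, "FCI": 76.75, "CSC": 77.58}
result = aggregate(claude)
print({name: round(value, 2) for name, value in result.items()})
```

For this system the sketch reproduces the reported TA (70.12), VA (78.76), and Overall (74.44). A text-only report has no VA dimensions, so `aggregate` returns `None` for both VA and Overall.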

TVIR-Agent

A hierarchical multi-agent framework for text-visual interleaved report generation.

Figure 4: Overview of the proposed multi-stage framework for report generation.

1. Research-Grounded Planning

The Planner parses user tasks and iteratively invokes external tools (Google Search, web scraping) to retrieve relevant information. It synthesizes collected information into a structured outline with section titles, summaries, planned visual requirements, and research notes.
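
One way to picture the Planner's output is as a small nested data structure. The sketch below is illustrative only: the class and field names (`Outline`, `OutlineSection`, `VisualRequirement`, `pending_visuals`) are assumptions, not the actual TVIR-Agent schema, and the example task is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualRequirement:
    kind: str         # assumed values: "image" (retrieved) or "chart" (generated)
    description: str  # what the visual should show in this section


@dataclass
class OutlineSection:
    title: str
    summary: str
    visuals: List[VisualRequirement] = field(default_factory=list)
    research_notes: List[str] = field(default_factory=list)


@dataclass
class Outline:
    task: str
    sections: List[OutlineSection] = field(default_factory=list)

    def pending_visuals(self, kind: str) -> List[VisualRequirement]:
        """Collect every planned visual of one kind across all sections."""
        return [v for s in self.sections for v in s.visuals if v.kind == kind]


# Placeholder content, for illustration only.
outline = Outline(
    task="Example research task",
    sections=[
        OutlineSection(
            title="Background",
            summary="Context for the topic.",
            visuals=[VisualRequirement("image", "Photo illustrating the topic")],
            research_notes=["Note collected during search"],
        ),
        OutlineSection(
            title="Trends",
            summary="How key indicators changed over time.",
            visuals=[VisualRequirement("chart", "Line chart of the indicator by year")],
        ),
    ],
)
print(len(outline.pending_visuals("chart")))
```

Keeping the visual requirements on each section lets a later stage hand image requests and chart requests to different agents without re-reading the whole outline.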

2. Visual Asset Instantiation

Two specialized agents handle different visual needs:

  • Image Searcher: Retrieves candidate images through Google Image Search, filters low-quality results, and uses VQA for relevance verification.
  • Chart Generator: Retrieves relevant data, verifies authenticity, generates Python plotting code, and executes it in a sandbox.
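
The Chart Generator's final step, executing generated plotting code in a sandbox, might look roughly like the sketch below. It assumes the generated script saves its figure to a known file name in the working directory. The function `run_plot_code` and its parameters are hypothetical, and a subprocess with a timeout and a scratch directory is only a lightweight stand-in: it is not a security boundary, and the actual sandbox used by TVIR-Agent is not described here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_plot_code(code: str, output_name: str = "chart.png", timeout_s: int = 30) -> dict:
    """Run generated code in a separate interpreter inside a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "plot.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # relative paths resolve inside the scratch dir
                capture_output=True,  # keep stdout/stderr for error reporting
                text=True,
                timeout=timeout_s,    # stop runaway scripts
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "error": "timeout", "image": None}
        out = Path(workdir) / output_name
        if proc.returncode != 0 or not out.exists():
            message = proc.stderr.strip() or "no output file"
            return {"ok": False, "error": message, "image": None}
        # Read the image before the scratch directory is deleted.
        return {"ok": True, "error": "", "image": out.read_bytes()}


# Stand-in for generated plotting code; a real script would end with a call
# such as matplotlib's plt.savefig("chart.png").
demo_code = "open('chart.png', 'wb').write(b'placeholder')"
result = run_plot_code(demo_code)
print(result["ok"], len(result["image"] or b""))
```

Returning the captured stderr on failure gives the calling agent something concrete to feed back into a retry of the code generation step.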

3. Context-Aware Sequential Writing

The Writer generates the report section by section, conditioning on the current outline unit and a dynamically updated global context. It determines insertion points for visual assets and composes Markdown content with interleaved text and visual elements.
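
The section-by-section loop can be sketched as follows. This is a simplified illustration under stated assumptions: `write_report` and `stub_writer` are hypothetical names, the global context is reduced to a running list of section summaries, and the stub stands in for the model call that would actually write each section and choose insertion points.

```python
from typing import Callable, Dict, List


def write_report(sections: List[Dict], write_section: Callable[[Dict, str], str]) -> str:
    """Write sections in order, passing a running global context to each call."""
    context_lines: List[str] = []
    parts: List[str] = []
    for section in sections:
        global_context = "\n".join(context_lines)
        body = write_section(section, global_context)
        parts.append(f"## {section['title']}\n\n{body}")
        # Update the global context with a short record of what was just written.
        context_lines.append(f"{section['title']}: {section['summary']}")
    return "\n\n".join(parts)


def stub_writer(section: Dict, global_context: str) -> str:
    """Stand-in for the model call: echoes the summary and appends each figure."""
    seen = len(global_context.splitlines())
    text = f"{section['summary']} (sections already written: {seen})"
    for fig in section.get("figures", []):
        text += f"\n\n![{fig['caption']}]({fig['path']})"
    return text


# Placeholder outline units, for illustration only.
demo_sections = [
    {"title": "Background", "summary": "Context for the topic."},
    {"title": "Trends", "summary": "Indicator changes over time.",
     "figures": [{"caption": "Indicator by year", "path": "charts/trend.png"}]},
]
report = write_report(demo_sections, stub_writer)
print(report)
```

Because each call sees what earlier sections covered, the writer can avoid repeating material and can refer back to figures that were already placed.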

4. Global Index Polishing

The Polisher processes references and figures at the report level: removes uncited references, deduplicates globally by URL, renumbers into a unified reference list, and reassigns figure IDs in sequential order.
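
The polishing pass described above can be sketched with two regular-expression substitutions. The input format is an assumption made for illustration: each section carries its own text with local `[n]` citation markers, a local map from citation number to URL, and locally numbered `Figure n` labels. The function name `polish` is hypothetical.

```python
import itertools
import re
from typing import Dict, List, Tuple

CITE = re.compile(r"\[(\d+)\]")
FIG = re.compile(r"\bFigure (\d+)\b")


def polish(sections: List[Dict]) -> Tuple[str, List[str]]:
    """Merge section drafts into one report with global reference and figure numbers."""
    url_to_global: Dict[str, int] = {}   # deduplicates references by URL
    figure_ids = itertools.count(1)      # sequential figure IDs across the report
    bodies: List[str] = []

    for sec in sections:
        local_refs: Dict[int, str] = sec["refs"]  # local citation number -> URL
        fig_map: Dict[str, int] = {}              # local figure ID -> global ID

        def renumber_cite(m) -> str:
            url = local_refs.get(int(m.group(1)))
            if url is None:
                return m.group(0)  # unknown marker: leave it untouched
            if url not in url_to_global:
                url_to_global[url] = len(url_to_global) + 1
            return f"[{url_to_global[url]}]"

        def renumber_fig(m) -> str:
            local_id = m.group(1)
            if local_id not in fig_map:
                fig_map[local_id] = next(figure_ids)
            return f"Figure {fig_map[local_id]}"

        text = CITE.sub(renumber_cite, sec["text"])
        bodies.append(FIG.sub(renumber_fig, text))

    # Only URLs that were actually cited received a number, so uncited
    # references are dropped from the unified list.
    references = [f"[{n}] {url}" for url, n in url_to_global.items()]
    return "\n\n".join(bodies), references


drafts = [
    {"text": "Claim A [2]. See Figure 1.",
     "refs": {1: "https://example.com/unused", 2: "https://example.com/a"}},
    {"text": "Claim B [1], also [2]. See Figure 1.",
     "refs": {1: "https://example.com/a", 2: "https://example.com/b"}},
]
report_text, reference_list = polish(drafts)
print(report_text)
print(reference_list)
```

In this example the uncited `https://example.com/unused` entry disappears, the two sections share one number for `https://example.com/a`, and the second section's local "Figure 1" becomes "Figure 2".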

Experimental Results

Main Results

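Overall, TA, and VA are aggregate scores. CS, IA, WQ, ADB, and FLC are Textual Assessment dimensions. FQ, MC, FCQ, FCI, and CSC are Visual Assessment dimensions.

| Model | Overall | TA | VA | CS | IA | WQ | ADB | FLC | FQ | MC | FCQ | FCI | CSC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TVIR-Agent (Claude-4.5-Sonnet) | **74.44** | 70.12 | **78.76** | 51.20 | 81.09 | 69.88 | **72.22** | 76.20 | 87.17 | **77.80** | **74.49** | **76.75** | 77.58 |
| TVIR-Agent (Qwen3-Max) | 73.53 | 70.03 | 77.03 | 53.68 | 76.69 | 69.30 | 67.48 | 83.00 | 91.71 | 67.80 | 72.44 | 74.56 | **78.63** |
| TVIR-Agent (GLM-4.7) | 72.62 | **71.64** | 73.61 | **68.64** | 71.98 | 69.20 | 68.16 | 80.20 | 84.61 | 62.55 | 70.13 | 73.39 | 77.35 |
| Manus-1.6 | 69.73 | 69.42 | 70.04 | 45.57 | 74.12 | **72.15** | 62.84 | **92.40** | 86.27 | 70.75 | 66.14 | 71.03 | 56.02 |
| Claude-4.5-Sonnet w/Search | 68.72 | 70.15 | 67.30 | 47.53 | 79.32 | 69.37 | 70.52 | 84.00 | 90.24 | 63.85 | 61.47 | 53.43 | 67.49 |
| Genspark Deep Research | 66.99 | 68.70 | 65.29 | 35.27 | **83.71** | 69.28 | 70.64 | 84.60 | **92.87** | 70.70 | 63.58 | 59.00 | 40.28 |
| Perplexity Deep Research | 61.20 | 68.95 | 53.46 | 44.60 | 81.03 | 67.60 | 70.64 | 80.90 | 73.62 | 62.90 | 59.02 | 63.41 | 8.35 |
| Grok-4.1-Thinking DeepSearch | 52.49 | 58.56 | 46.43 | 17.72 | 60.65 | 67.68 | 57.58 | 89.20 | 80.43 | 52.15 | 47.04 | 46.76 | 5.75 |
| Gemini-3-Pro Deep Research | - | 58.52 | - | 14.96 | 58.31 | 66.88 | 63.94 | 88.50 | - | - | - | - | - |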

Best scores are highlighted. Gemini-3-Pro Deep Research generates text-only reports, so it cannot be evaluated on VA metrics and has no Overall score.

Key Insights

Strong Overall Performance

TVIR-Agent variants achieve the strongest aggregate performance among all evaluated systems, with TVIR-Agent (Claude-4.5-Sonnet) obtaining the best Overall score.

Better Evidence Grounding

TVIR-Agent (GLM-4.7) achieves 68.64 on Citation Support, outperforming the best commercial system (Claude-4.5-Sonnet w/Search, 47.53) by 21.11 points.

Superior Visual Alignment

TVIR-Agent (Claude-4.5-Sonnet) scores 74.49 on Figure Caption Quality, exceeding Manus-1.6 by 8.35 points.

Text vs. Visual Gap

Current systems remain much stronger at textual synthesis than at integrating visual assets: Perplexity Deep Research, for example, scores 68.95 on TA but only 53.46 on VA. This highlights a significant gap in existing paradigms.

Tool Usage Analysis

Figure 5: Tool usage distribution of TVIR-Agent variants across major components.

Structural Error Analysis

Figure 6: Distribution of structural errors across deep research systems. TVIR-Agent variants produce substantially fewer structural errors than commercial systems.

Ablation Studies

| System Variant | TA | VA | Overall |
|---|---|---|---|
| TVIR-Agent (Full) | 69.23 | 78.62 | 73.92 |
| w/o research notes | 68.63 (-0.60) | 78.42 (-0.20) | 73.52 (-0.40) |
| w/o Image Searcher | 67.82 (-1.41) | 77.23 (-1.39) | 72.53 (-1.39) |
| w/o Chart Generator | 66.77 (-2.46) | 60.91 (-17.71) | 63.84 (-10.08) |

Removing the Chart Generator has the largest effect: VA drops by 17.71 points and Overall by 10.08 points, highlighting its central role in visual synthesis and cross-modal alignment.
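
The deltas in the ablation results can be recomputed directly from the raw scores. The short sketch below does so and picks out the variant with the largest Overall drop. The helper name `deltas` is hypothetical; the numbers are the ones reported above.

```python
full = {"TA": 69.23, "VA": 78.62, "Overall": 73.92}
variants = {
    "w/o research notes": {"TA": 68.63, "VA": 78.42, "Overall": 73.52},
    "w/o Image Searcher": {"TA": 67.82, "VA": 77.23, "Overall": 72.53},
    "w/o Chart Generator": {"TA": 66.77, "VA": 60.91, "Overall": 63.84},
}


def deltas(variant: dict, baseline: dict) -> dict:
    """Score change relative to the full system, rounded to two decimals."""
    return {k: round(variant[k] - baseline[k], 2) for k in baseline}


for name, scores in variants.items():
    print(name, deltas(scores, full))

# The variant whose removal hurts the Overall score the most.
largest_drop = min(variants, key=lambda name: deltas(variants[name], full)["Overall"])
print(largest_drop)
```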

Citation

@misc{ma2026tvirbuildingdeepresearch,
      title={TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation}, 
      author={Xinkai Ma and Zhiqi Bai and Dingling Zhang and Pei Liu and Yishuo Yuan and He Zhu and Jiakai Wang and Qianqian Xie and Yifan Zhao and Xinlong Yang and Hao Cong and Zhiheng Yao and Fengxia Xie and Zihao Xu and Haoran Xu and Zhaohui Wang and Minghao Liu and Shirong Lin and Yingshui Tan and Yuchi Xu and Wenbo Su and Zhaoxiang Zhang and Bo Zheng and Jiaheng Liu},
      year={2026},
      eprint={2606.02320},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.02320}, 
}