Realistic · Reproducible · Deep Research Evaluation

DR³-Eval

Towards Realistic and Reproducible Deep Research Evaluation

100 Tasks
13 Sub-domains
68% Multimodal
2.24 Files/Task

📊 Dataset Statistics

100 tasks (50 EN + 50 ZH), 13 sub-domains, 68% multimodal input.

Domain Distribution

File Type Distribution

User Files Per Task

📏 Evaluation Metrics

5 complementary metrics across two dimensions to assess deep research agent performance.

🔍 Information Seeking

📖
IR — Information Recall
47.6

Coverage of key insights, reported separately for user files (IR_UF) and the sandbox corpus (IR_SC).

📎
CC — Citation Coverage
39.0

Overlap between cited and required source documents.
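Both information-seeking metrics reduce to set overlap against task gold labels. A minimal sketch, assuming insights and source documents are represented as ID sets; the function and field names are illustrative, not the benchmark's actual interface:

```python
def information_recall(report_insights: set, gold_insights: set) -> float:
    """IR: fraction of gold key insights (from user files, IR_UF, or the
    sandbox corpus, IR_SC) that the report covers."""
    if not gold_insights:
        return 100.0
    return 100 * len(report_insights & gold_insights) / len(gold_insights)


def citation_coverage(cited_docs: set, required_docs: set) -> float:
    """CC: overlap between the documents the report cites and the source
    documents required by the task."""
    if not required_docs:
        return 100.0
    return 100 * len(cited_docs & required_docs) / len(required_docs)


# A report covering 2 of 4 gold insights scores IR = 50.0.
print(information_recall({"i1", "i3"}, {"i1", "i2", "i3", "i4"}))
```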

📝 Report Generation

FA — Factual Accuracy
67.3

Factual correctness of claim-source pairs via entailment.

📋
IF — Instruction Following
83.4

Whether report satisfies each requirement in a verified checklist.

🧠
DQ — Depth Quality
66.5

Expert judge evaluates analytical substance on 1–10 scale.
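The three report-generation metrics are judge-based rather than overlap-based. A minimal sketch of the scoring loops, assuming a generic `judge` callable that wraps an LLM and returns a short string; the prompts, and the rescaling of DQ's 1–10 rating to the 0–100 leaderboard range, are illustrative assumptions:

```python
def factual_accuracy(claim_source_pairs, judge):
    """FA: share of (claim, source) pairs where the cited source
    entails the claim, per an NLI-style judge."""
    verdicts = [
        judge(f"Does the source entail the claim?\nSource: {src}\n"
              f"Claim: {claim}\nAnswer yes or no.") == "yes"
        for claim, src in claim_source_pairs
    ]
    return 100 * sum(verdicts) / max(len(verdicts), 1)


def instruction_following(report, checklist, judge):
    """IF: share of verified checklist requirements the report satisfies."""
    hits = [
        judge(f"Does the report satisfy this requirement?\n"
              f"Requirement: {item}\nReport: {report}\nAnswer yes or no.") == "yes"
        for item in checklist
    ]
    return 100 * sum(hits) / max(len(hits), 1)


def depth_quality(report, judge):
    """DQ: expert judge rates analytical substance on a 1-10 scale;
    multiplied by 10 here to match the 0-100 leaderboard range."""
    rating = int(judge(f"Rate the analytical depth of this report from 1 to 10.\n"
                       f"Report: {report}\nReply with one integer."))
    return 10.0 * rating
```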

🏆 Leaderboard

Evaluation results of 8 state-of-the-art LLMs across 64k/128k/512k sandbox corpus scales.

Each cell lists scores at the 64k / 128k / 512k corpus scales.

| Model | IR_UF | IR_SC | CC | FA | IF | DQ | Avg. |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 58.8 / 60.4 / 60.8 | 55.3 / 46.6 / 41.8 | 64.7 / 54.8 / 48.5 | 87.0 / 82.7 / 82.1 | 87.4 / 89.2 / 88.5 | 70.7 / 71.5 / 72.0 | 70.7 / 67.5 / 65.6 |
| GLM-4.7 | 55.7 / 55.0 / 57.1 | 53.1 / 47.6 / 42.1 | 65.4 / 55.9 / 45.3 | 84.5 / 82.1 / 80.3 | 88.8 / 89.3 / 88.1 | 71.1 / 71.8 / 72.1 | 69.8 / 66.9 / 64.1 |
| GLM-4.6 | 53.4 / 52.6 / 50.3 | 49.5 / 43.9 / 39.8 | 58.2 / 52.0 / 44.0 | 84.0 / 82.3 / 82.9 | 85.6 / 87.2 / 86.4 | 70.1 / 69.3 / 70.6 | 66.8 / 64.5 / 62.3 |
| Gemini-2.5-Pro | 43.9 / 45.7 / 42.9 | 37.7 / 33.2 / 30.5 | 52.1 / 44.3 / 38.7 | 82.5 / 80.1 / 78.9 | 86.2 / 85.8 / 84.3 | 68.9 / 67.2 / 66.8 | 61.9 / 59.4 / 57.0 |
| GPT-4.1 | 42.3 / 40.8 / 39.5 | 35.2 / 31.8 / 28.4 | 48.6 / 41.2 / 35.9 | 63.8 / 61.5 / 59.2 | 84.9 / 83.7 / 82.1 | 65.3 / 64.8 / 63.5 | 56.7 / 53.9 / 51.4 |
| Qwen3-235B | 44.1 / 43.5 / 41.2 | 38.9 / 34.7 / 31.3 | 50.3 / 43.8 / 37.2 | 52.4 / 50.8 / 48.6 | 85.7 / 84.9 / 83.2 | 66.2 / 65.1 / 64.3 | 56.3 / 53.8 / 51.0 |
| Qwen3-32B | 38.5 / 37.2 / 35.8 | 32.1 / 28.5 / 25.3 | 44.7 / 38.2 / 32.5 | 58.3 / 56.1 / 53.8 | 80.2 / 78.9 / 76.5 | 62.8 / 61.5 / 60.2 | 52.8 / 50.1 / 47.4 |
| Qwen2.5-72B | 35.2 / 34.1 / 32.5 | 29.8 / 26.3 / 23.1 | 41.3 / 35.7 / 30.2 | 55.1 / 53.2 / 50.8 | 78.5 / 76.8 / 74.2 | 60.5 / 59.2 / 57.8 | 50.1 / 47.6 / 44.8 |
🔥

Extremely Challenging

The best model achieves only a 65.6 average at the 512k scale; the benchmark is far from saturated.

📉

Longer Context = Lower Score

Performance drops as corpus grows 64k→512k. Noise makes evidence harder to find.

⚠️

IF ≠ FA

Some models achieve good IF but very low FA. Reports look complete but contain factual errors.

🌐

Domain Performance Varies

No single model dominates all 13 sub-domains.

🏗️ Benchmark Construction

A 5-stage pipeline for realistic, controllable, and precisely evaluable deep research benchmarking.


👥 Stage 1: Real-World Grounding

Volunteers provide multimodal materials (text, images, video, audio), yielding 100 document sets across 3 domains and 13 sub-domains.

🔍 Stage 2: Search Path Distillation

Divergent-convergent keyword generation splits search terms into Signal Keywords (the core retrieval path) and Noise Keywords (misleading paths).

📦 Stage 3: Sandbox Construction

A static sandbox mixes Supportive, Distractor, and Noise documents at five context-length settings (32k–512k); a corpus-assembly sketch follows Stage 5.

Stage 4: Query Construction

Queries reverse-engineered from verified evidence, ensuring definitive answers grounded in the sandbox.

✅ Stage 5: Quality Control

Four-dimensional validation: implicit guidance, synthesis necessity, insight novelty, interpretative unambiguity.
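As referenced in Stage 3, corpus assembly can be pictured as filling a token budget from three pre-tagged document pools. A minimal sketch under that assumption; `n_tokens` and the pool arguments are hypothetical names, not the benchmark's actual code:

```python
import random

def build_sandbox(supportive, distractors, noise, budget_tokens, n_tokens):
    """Assemble one static sandbox: keep every supportive (evidence)
    document, then pad with distractor and noise documents until the
    context budget (32k ... 512k tokens) is filled."""
    corpus = list(supportive)                  # evidence is always included
    used = sum(n_tokens(d) for d in corpus)
    padding = list(distractors) + list(noise)
    random.shuffle(padding)                    # mix hard and easy padding
    for doc in padding:
        if used + n_tokens(doc) <= budget_tokens:
            corpus.append(doc)
            used += n_tokens(doc)
    random.shuffle(corpus)                     # remove positional shortcuts
    return corpus
```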

🔍 Result Analysis

Detailed analysis across dimensions, corpus scales, error types, and experimental settings.


🗺️ Cross-Domain Heatmap

🏆 Top Performers

Claude Sonnet 4 leads Physics (84.6). GLM-4.7 excels in Industry and Policy.

📉 Challenging Domains

Agriculture and Commerce are hardest across all models.

🌐 No Universal Winner

No single model dominates all 13 sub-domains.


📏 Scale Analysis

📉 Performance Drops

Avg., IR_SC, and CC decline as the corpus grows from 32k to 512k.

🧠 FA Stays Stable

FA remains stable across scales, reflecting reasoning over retrieval.

⚠️ IF Resilient

IF stays high even at 512k, independent of retrieval difficulty.


🚨 Error Type Analysis

💥 Hallucination Dominates

48–78% of errors are hallucinations.

Gemini-2.5-Pro

Lowest hallucination, most retrieval failures.

⚠️ Qwen3-235B

78% hallucination despite low retrieval failure.


🧪 Sandbox Corpus Effectiveness

💪 Distractors Work

Removing distractors improves performance significantly.

🔒 No Shortcuts

Removing supportive documents performs about the same as removing retrieval entirely (w/o Supportive ≈ w/o RAG): there are no exploitable shortcuts.

📈 Only Supportive Best

Providing only supportive documents yields the highest scores.

| Metric | Qwen3-235B-A22B (Baseline / w/ Web / Δ) | Gemini-2.5-Pro (Baseline / w/ Web / Δ) |
|---|---|---|
| IR_SC | 31.0 / 28.1 / -2.8 | 32.8 / 34.9 / +2.1 |
| IR_UF | 43.6 / 47.9 / +4.3 | 36.9 / 42.4 / +5.4 |
| FA | 63.3 / 65.9 / +2.6 | 75.6 / 70.2 / -5.4 |
| DQ | 67.0 / 65.5 / -1.5 | 69.5 / 72.0 / +2.5 |
| IF | 73.4 / 79.1 / +5.7 | 81.4 / 80.4 / -1.0 |
| Avg. | 54.9 / 56.6 / +1.6 | 63.1 / 62.1 / -1.0 |

🌐 Sandbox vs. Online

Minimal Difference

Average scores differ by less than 2 points between sandbox and live web search.

🔄 Reproducibility

The sandbox simulates open-web search while keeping results fully reproducible.

| Model | OpenAI-Emb | Qwen-Emb | BM25 |
|---|---|---|---|
| GLM-4.7 | 56.58 | 53.61 | 50.71 |
| GPT-4.1 | 36.15 | 35.64 | 22.60 |
| Gemini-2.5-Pro | 49.51 | 37.16 | 31.25 |

🔎 Retriever Comparison

🥇 Dense Retrieval Wins

OpenAI text-embedding-3-small achieves the best performance.

📉 BM25 Falls Behind

Lexical BM25 performs significantly worse than dense retrieval.
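The two retriever families above are easy to reproduce. A minimal sketch, assuming the `rank_bm25` package and the OpenAI embeddings client; the documents and query are placeholders, and retrieval quality on the benchmark (not shown here) is what the table scores:

```python
import numpy as np
from openai import OpenAI                # pip install openai
from rank_bm25 import BM25Okapi          # pip install rank-bm25

docs = ["first sandbox document ...", "second sandbox document ..."]
query = "example research query"

# Lexical baseline: BM25 over whitespace tokens.
bm25 = BM25Okapi([d.split() for d in docs])
bm25_scores = bm25.get_scores(query.split())

# Dense retrieval: cosine similarity over text-embedding-3-small vectors.
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-small",
                                input=docs + [query])
vecs = np.array([item.embedding for item in resp.data])
doc_vecs, q_vec = vecs[:-1], vecs[-1]
dense_scores = doc_vecs @ q_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))

print("BM25 ranking: ", np.argsort(-bm25_scores))
print("Dense ranking:", np.argsort(-dense_scores))
```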

| Evaluation Method | Pearson r | Spearman ρ | Pairwise Agr. |
|---|---|---|---|
| DR³-Eval (Ours) | 0.78 | 0.73 | 0.89 |
| Inter-Human | 0.83 | 0.76 | 0.91 |

🤝 Human Evaluation

🤝 Strong Alignment

Pearson r = 0.78, Spearman ρ = 0.73, pairwise agreement 89%.

👥 Near Human Level

Approaches inter-human consistency (r = 0.83, ρ = 0.76).
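The alignment numbers above come from comparing automatic scores with human ratings over the same reports. A minimal sketch with SciPy, where pairwise agreement is computed as the fraction of report pairs both raters rank in the same order; the score lists are illustrative, not the study's data:

```python
from itertools import combinations
from scipy.stats import pearsonr, spearmanr

def pairwise_agreement(a, b):
    """Fraction of report pairs that both raters order the same way
    (pairs tied in either rating are skipped)."""
    pairs = [(a[i] - a[j]) * (b[i] - b[j]) > 0
             for i, j in combinations(range(len(a)), 2)
             if a[i] != a[j] and b[i] != b[j]]
    return sum(pairs) / max(len(pairs), 1)

auto  = [7.2, 5.1, 8.4, 6.0, 4.3]   # automatic scores (illustrative values)
human = [7.0, 5.5, 8.9, 5.8, 4.0]   # expert ratings   (illustrative values)

print(f"Pearson r          = {pearsonr(auto, human)[0]:.2f}")
print(f"Spearman rho       = {spearmanr(auto, human)[0]:.2f}")
print(f"Pairwise agreement = {pairwise_agreement(auto, human):.2f}")
```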

📖 Citation

@article{dr3eval2025,
  title={DR$^3$-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={NJU-LINK Team},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}