T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
zhecao@smail.nju.edu.cn · liujiaheng@nju.edu.cn
Abstract
Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate substantial room for improvement for future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
Overview
Introduction
Generative AI has witnessed a paradigm shift from unimodal synthesis to cohesive multimodal content creation, with Text-to-Audio-Video (T2AV) generation emerging as a frontier that unifies visual dynamics and auditory realism.
Recent breakthroughs, from proprietary systems like Sora and Veo to open research efforts, have demonstrated the ability to generate high-fidelity audio-video pairs from textual prompts.
Despite this rapid progress, the evaluation of T2AV systems remains fundamentally underdeveloped, a gap exacerbated by the intrinsic complexity of the task. Specifically, high-quality output requires simultaneous success along multiple axes: unimodal perceptual quality, cross-modal semantic alignment, precise temporal synchronization, instruction following under compositional constraints, and realism grounded in physical and commonsense knowledge.
Current evaluations struggle to answer core questions: Do generated sounds correspond to visible events? Are multiple audio sources synchronized with complex visual interactions? Does the model faithfully follow detailed instructions while maintaining physical and perceptual realism?
To address this gap, we introduce T2AV-Compass, the first comprehensive benchmark designed specifically for evaluating text-to-audio-video generation.
First, T2AV-Compass employs a taxonomy-driven curation pipeline to construct 500 complex prompts with broad semantic coverage. These prompts impose precise constraints across cinematography, physical causality, and acoustic environments, spanning diverse and challenging scenarios, from multi-source sound mixing to long narrative event chains.
Second, we propose a dual-level evaluation framework that integrates objective evaluation based on classical automated metrics with subjective evaluation based on an MLLM-as-a-Judge protocol. The objective evaluation quantifies video quality (technical fidelity, aesthetic appeal), audio quality (acoustic realism, semantic usefulness), and cross-modal alignment (text-audio/video semantic consistency, temporal synchronization). The subjective evaluation assesses video and audio instruction following against well-defined checklists, together with perceptual realism (e.g., physical plausibility and fine-grained details), addressing the limitations of automated metrics in capturing nuanced semantic and causal coherence.
Contributions
- Taxonomy-Driven High-Complexity Benchmark: We introduce T2AV-Compass, a benchmark comprising 500 semantically dense prompts synthesized through a hybrid pipeline of taxonomy-based curation and video inversion. It targets fine-grained audiovisual constraints—such as off-screen sound and physical causality—frequently overlooked in existing evaluations.
- Unified Dual-Level Evaluation Framework: We propose a paradigm that integrates objective signal metrics with a novel MLLM-as-a-Judge protocol. By employing a reasoning-first diagnostic mechanism based on granular QA checklists and violation checks (e.g., Material-Timbre Consistency), our framework bridges the gap between low-level fidelity and high-level semantic logic with enhanced interpretability.
- Extensive Benchmarking and Empirical Insights: We conduct a systematic evaluation of 11 state-of-the-art T2AV systems, including leading proprietary models like Veo-3.1 and Kling-2.6. Our analysis unveils a critical "Audio Realism Bottleneck," revealing that current models struggle to synthesize physically grounded audio textures that match the fidelity of their visual counterparts.
Leaderboard
Tables mirror the main results in the paper.
Objective metrics
| Method | Open-Source | VT↑ | VA↑ | PQ↑ | CU↑ | A-V↑ | T-A↑ | T-V↑ | DS↓ | LS↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| — T2AV | ||||||||||
| Veo-3.1 | ✗ | 13.39 | 5.425 | 7.015 | 6.621 | 0.2856 | 0.2335 | 0.2438 | 0.6776 | 1.509 |
| Sora-2 | ✗ | 7.568 | 4.112 | 5.827 | 5.340 | 0.2419 | 0.2484 | 0.2432 | 0.8100 | 1.331 |
| Kling-2.6 | ✗ | 11.41 | 5.417 | 6.882 | 6.449 | 0.2495 | 0.2495 | 0.2449 | 0.7852 | 1.502 |
| Wan-2.6 | ✗ | 11.87 | 4.605 | 6.658 | 6.222 | 0.2149 | 0.2572 | 0.2451 | 0.8818 | 1.081 |
| Seedance-1.5 | ✗ | 12.74 | 5.007 | 7.555 | 7.250 | 0.2875 | 0.2320 | 0.2370 | 0.8650 | 1.560 |
| Wan-2.5 | ✗ | 13.29 | 4.642 | 6.469 | 5.869 | 0.2026 | 0.2445 | 0.2470 | 0.8810 | 1.065 |
| Pixverse-V5.5 | ✗ | 11.54 | 4.558 | 6.108 | 5.855 | 0.1816 | 0.2305 | 0.2431 | 0.6627 | 1.306 |
| Ovi-1.1 | ✓ | 9.336 | 4.368 | 6.569 | 6.492 | 0.1620 | 0.1756 | 0.2391 | 0.9624 | 1.191 |
| JavisDiT | ✓ | 6.850 | 3.575 | 4.299 | 5.204 | 0.1284 | 0.1257 | 0.2320 | 1.322 | -- |
| — T2V + TV2A | ||||||||||
| Wan-2.2 + Hunyuan-Foley | ✓ | 13.43 | 5.605 | 6.497 | 6.208 | 0.2575 | 0.2076 | 0.2455 | 0.7935 | 0.6978 |
| — T2A + TA2V | ||||||||||
| AudioLDM2 + MTV | ✓ | 8.066 | 3.458 | 6.406 | 6.100 | 0.1639 | 0.2698 | 0.2394 | 1.1592 | 0.6835 |
VT: Video Technological · VA: Video Aesthetic · PQ: Perceptual Quality · CU: Content Usefulness · A–V/T–A/T–V: alignment · DS: DeSync (lower is better) · LS: LatentSync.
Subjective evaluation
| Method | Open-Source | IF Video↑ | IF Audio↑ | Video Realism↑ | Audio Realism↑ | Average↑ |
|---|---|---|---|---|---|---|
| — T2AV | ||||||
| Veo-3.1 | ✗ | 76.15 | 67.90 | 87.14 | 49.95 | 70.29 |
| Sora-2 | ✗ | 74.93 | 72.86 | 85.53 | 46.01 | 69.83 |
| Kling-2.6 | ✗ | 73.72 | 63.89 | 87.98 | 47.03 | 68.16 |
| Wan-2.6 | ✗ | 78.52 | 74.95 | 82.05 | 35.18 | 67.68 |
| Seedance-1.5 | ✗ | 60.96 | 61.22 | 88.94 | 53.84 | 66.24 |
| Wan-2.5 | ✗ | 76.56 | 57.95 | 76.00 | 35.06 | 61.39 |
| Pixverse-V5.5 | ✗ | 65.13 | 53.31 | 69.37 | 33.58 | 55.35 |
| Ovi-1.1 | ✓ | 55.05 | 52.83 | 65.93 | 30.75 | 51.14 |
| JavisDiT | ✓ | 32.56 | 15.26 | 34.97 | 14.85 | 24.41 |
| — T2V + TV2A | ||||||
| Wan-2.2 + Hunyuan-Foley | ✓ | 64.54 | 38.19 | 89.63 | 41.25 | 58.40 |
| — T2A + TA2V | ||||||
| AudioLDM2 + MTV | ✓ | 47.13 | 54.39 | 56.73 | 31.90 | 47.54 |
IF: instruction following. Realism measures perceptual plausibility and fine-grained details.
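For reference, the Average column above is consistent with a plain arithmetic mean of the four subjective scores (e.g., Veo-3.1: (76.15 + 67.90 + 87.14 + 49.95) / 4 ≈ 70.29). A minimal sketch of this aggregation, assuming equal weighting:

```python
# Sketch: the subjective "Average" column as the mean of the four sub-scores.
# Equal weighting is an assumption consistent with the reported numbers.
def subjective_average(if_video: float, if_audio: float,
                       video_realism: float, audio_realism: float) -> float:
    return (if_video + if_audio + video_realism + audio_realism) / 4.0

# Veo-3.1 row from the table above.
assert abs(subjective_average(76.15, 67.90, 87.14, 49.95) - 70.29) < 0.01
```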
T2AV-Compass
We present T2AV-Compass, a unified benchmark designed to evaluate diverse T2AV systems. Below, we detail the data construction pipeline, summarize statistics of the resulting benchmark that highlight its diversity and complexity, and introduce our Dual-Level Evaluation Framework, which assesses both objective signal fidelity and cross-modal semantics.
Data Construction
To ensure the diversity and complexity of the dataset, we employ a three-stage construction pipeline that combines taxonomy-based curation with real-world video inversion at scale.
Data Collection
To establish a foundation of broad semantic coverage, we aggregate raw prompts from a variety of high-quality sources, including VidProM, the Kling AI community, LMArena, and Shot2Story. To mitigate the imbalance between common and long-tail concepts, we implement a semantic clustering strategy. Specifically, we encode all prompts using all-mpnet-base-v2 and perform deduplication with a cosine similarity threshold of 0.8. We then apply square-root sampling (where sampling probability is inversely proportional to the square root of cluster size) to preserve semantic distinctiveness while preventing the dominance of frequent topics.
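The deduplication and square-root sampling step can be sketched as follows. This is a minimal illustration, assuming the sentence-transformers implementation of all-mpnet-base-v2 and scikit-learn KMeans for the clustering step (the exact clustering method is not specified above); the 0.8 cosine threshold and the inverse-square-root weighting follow the description in this section.

```python
# Minimal sketch of semantic deduplication + square-root sampling.
# Assumptions: sentence-transformers for all-mpnet-base-v2, KMeans for clustering.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def dedup_and_sample(prompts, n_clusters=50, n_samples=500, sim_threshold=0.8, seed=0):
    model = SentenceTransformer("all-mpnet-base-v2")
    emb = model.encode(prompts, normalize_embeddings=True)  # unit-norm embeddings

    # Greedy near-duplicate removal: drop a prompt whose cosine similarity to
    # any already-kept prompt exceeds the threshold.
    kept_idx, kept_emb = [], []
    for i, e in enumerate(emb):
        if not kept_emb or np.max(np.stack(kept_emb) @ e) < sim_threshold:
            kept_idx.append(i)
            kept_emb.append(e)
    kept_emb = np.stack(kept_emb)

    # Cluster the survivors, then weight each prompt by the inverse square root
    # of its cluster size so frequent topics do not dominate the sample.
    n_clusters = min(n_clusters, len(kept_idx))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(kept_emb)
    sizes = np.bincount(labels)
    weights = 1.0 / np.sqrt(sizes[labels])
    probs = weights / weights.sum()

    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, len(kept_idx))
    chosen = rng.choice(len(kept_idx), size=n_samples, replace=False, p=probs)
    return [prompts[kept_idx[c]] for c in chosen]
```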
Prompt Refinement and Alignment
Raw prompts often lack the descriptive density needed to challenge state-of-the-art models (e.g., Veo 3.1, Sora 2, Kling 2.6). To address this, we employ Gemini-2.5-Pro to restructure and enrich the sampled prompts. We enhance descriptions of visual subjects, motion dynamics, and acoustic events, while enforcing strict cinematographic constraints (e.g., camera angles, lighting). Following automated generation, we conduct a rigorous manual audit to filter out static scenes or illogical compositions, resulting in a curated subset of 400 complex prompts.
Real-world Video Inversion
To counterbalance potential hallucinations in text-only generation and ensure physical plausibility, we introduce a Video-to-Text inversion stream. We select 100 diverse, high-fidelity video clips (4–10s) from YouTube and utilize Gemini-2.5-Pro to generate dense, temporally aligned captions. Discrepancies between the generated prompts and the source ground truth are resolved via human-in-the-loop verification, yielding 100 high-quality prompts anchored in real-world dynamics.
Dataset Statistics
Distribution and Diversity
Our prompts exhibit notably higher token counts compared to existing baselines (e.g., JavisBench, VABench), more accurately mirroring the complexity of real-world user queries. The dataset encompasses a broad spectrum of themes, soundscapes, and cinematographic styles. To quantify diversity, we analyze the semantic retention rates of CLIP (video) and CLAP (audio) embeddings after deduplication. Our benchmark demonstrates superior semantic distinctiveness across both modalities, significantly outperforming concurrent datasets.
Difficulty Analysis
We assess benchmark difficulty across four axes: (1) Visual Subject Multiplicity: 35.8% of samples feature crowds (≥ 4 subjects); (2) Audio Spatial Composition: 55.6% involve mixed on-screen/off-screen sources; (3) Event Temporal Structure: 28.2% contain long narrative chains (≥ 4 event units); (4) Audio Temporal Composition: 72.8% include simultaneous or overlapping audio events. These statistics confirm that our benchmark poses significant challenges regarding fine-grained control and temporal consistency.
Dual-Level Evaluation Framework
We introduce a dual-level evaluation framework for T2AV generation that is both systematic and reproducible. At the objective level, we factor system performance into three complementary pillars: (i) video quality, (ii) audio quality, and (iii) cross-modal alignment. At the subjective level, we propose a reasoning-first MLLM-as-a-Judge protocol that evaluates high-level semantics through two dimensions: Instruction Following (IF) via granular QA checklists, and Perceptual Realism (PR) via diagnostic violation checks. This mechanism ensures both robustness and interpretability by mandating explicit rationales before scoring.
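For concreteness, the sketch below shows one way the per-sample outputs of this framework could be organized; the field names are hypothetical and simply mirror the metrics introduced in the following subsections.

```python
# Illustrative containers for the dual-level scores of one generated clip.
# Field names are hypothetical stand-ins for the metrics described below.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectiveScores:
    vt: float                 # Video Technological score (DOVER++)
    va: float                 # Video Aesthetic score (LAION-Aesthetic Predictor V2.5)
    pq: float                 # audio Perceptual Quality
    cu: float                 # audio Content Usefulness
    t_a: float                # Text-Audio alignment (CLAP)
    t_v: float                # Text-Video alignment (VideoCLIP-XL-V2)
    a_v: float                # Audio-Video alignment (ImageBind)
    desync: float             # temporal synchronization error (lower is better)
    latentsync: Optional[float] = None  # lip-sync, only for talking-face clips

@dataclass
class SubjectiveScores:
    if_video: float           # instruction following on visual dimensions
    if_audio: float           # instruction following on sound dimensions
    video_realism: float      # MSS / OIS / TCS
    audio_realism: float      # AAS / MTC

@dataclass
class SampleResult:
    prompt_id: str
    objective: ObjectiveScores
    subjective: SubjectiveScores
```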
Objective Evaluation
We use a set of expert metrics to cover the three pillars above.
Video Quality
- Video Technological Score (VT): Quantifies low-level visual integrity using DOVER++, penalizing artifacts such as noise, blur, and compression distortions.
- Video Aesthetic Score (VA): Captures high-level perceptual attributes using LAION-Aesthetic Predictor V2.5, including composition, lighting, and color harmony.
Audio Quality
- Perceptual Quality (PQ): Measures signal fidelity and acoustic realism, sensitive to background noise, bandwidth limitations, and unnatural timbre.
- Content Usefulness (CU): Quantifies the semantic validity and information density of the generated audio.
Cross-modal Alignment
- Text–Audio (T–A) Alignment: CLAP cosine similarity between text and audio embeddings (see the sketch after this list).
- Text–Video (T–V) Alignment: VideoCLIP-XL-V2 cosine similarity between text and video embeddings.
- Audio–Video (A–V) Alignment: ImageBind semantic similarity independent of the text prompt.
- Temporal Synchronization: DeSync (DS) measures synchronization error; LatentSync (LS) for lip-sync in talking-face scenarios.
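The three semantic alignment scores follow the same recipe: embed each modality with the corresponding dual encoder and take the cosine similarity. Below is a minimal sketch of the T–A score, assuming the Hugging Face transformers CLAP interface with the laion/clap-htsat-unfused checkpoint and librosa for 48 kHz audio loading; the checkpoints and preprocessing used in the paper may differ.

```python
# Minimal sketch of CLAP-based Text-Audio (T-A) alignment scoring.
# Assumptions: Hugging Face `transformers` CLAP interface, 48 kHz input,
# laion/clap-htsat-unfused checkpoint; the paper's exact setup may differ.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def text_audio_alignment(prompt: str, audio_path: str) -> float:
    waveform, _ = librosa.load(audio_path, sr=48000, mono=True)
    with torch.no_grad():
        text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
        audio_inputs = processor(audios=waveform, sampling_rate=48000, return_tensors="pt")
        t = model.get_text_features(**text_inputs)
        a = model.get_audio_features(**audio_inputs)
    # Cosine similarity between the unit-normalized text and audio embeddings.
    t = t / t.norm(dim=-1, keepdim=True)
    a = a / a.norm(dim=-1, keepdim=True)
    return (t * a).sum(dim=-1).item()
```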
Subjective Evaluation
To address the limitations of traditional metrics in capturing fine-grained semantic details and complex cross-modal dynamics, we establish a robust "MLLM-as-a-Judge" framework. This framework comprises two distinct evaluation tracks: Instruction Following (IF) verification and Realism. We enforce a reasoning-first protocol, mandating that the judge explicitly articulate the rationale behind its decision before assigning a score on a 5-point scale.
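As a concrete illustration of the reasoning-first protocol, the sketch below assembles a judge query for a single checklist item and parses the rationale-then-score response. The prompt wording, the call_mllm client, and the JSON schema are hypothetical stand-ins rather than the exact protocol used in T2AV-Compass.

```python
# Illustrative reasoning-first judge call for a single QA checklist item.
# `call_mllm` is a hypothetical client (prompt, video_path) -> str; the prompt
# wording and JSON schema below are assumptions, not the paper's exact protocol.
import json

JUDGE_TEMPLATE = """You are evaluating a generated audio-video clip against its prompt.
Checklist item ({dimension} / {sub_dimension}): {question}

First explain your reasoning in 2-4 sentences, citing concrete visual or auditory
evidence from the clip. Then assign a score from 1 (clear violation) to 5 (fully
satisfied). Respond as JSON: {{"rationale": "...", "score": N}}."""

def judge_checklist_item(call_mllm, video_path, item):
    query = JUDGE_TEMPLATE.format(
        dimension=item["dimension"],
        sub_dimension=item["sub_dimension"],
        question=item["question"],
    )
    raw = call_mllm(query, video_path)
    result = json.loads(raw)
    # Reasoning-first: keep the rationale alongside the score for auditability.
    return {"rationale": result["rationale"], "score": int(result["score"])}

# Example checklist item for the "Sound" dimension (illustrative only).
example_item = {
    "dimension": "Sound",
    "sub_dimension": "Sound Effects",
    "question": "Is the rattle of the chain net audible when the ball passes through the hoop?",
}
```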
Instruction Following (IF) assesses the model's fidelity to textual prompts. The taxonomy encompasses 7 primary dimensions decomposed into 17 sub-dimensions:
- Attribute: Examines visual accuracy, focusing on Look and Quantity.
- Dynamics: Assesses dynamic behaviors, including Motion, Interaction, Transformation, and Cam. Motion.
- Cinematography: Scrutinizes directorial control, including Light, Frame, and Color Grading.
- Aesthetics: Measures artistic integrity, decomposed into Style and Mood.
- Relations: Verifies structural logic, evaluating Spatial and Logical connections.
- World Knowledge: Tests grounding in reality, specifically Factual Knowledge of real-world scenarios.
- Sound: Assesses the generation of auditory elements, covering Sound Effects, Speech, and Music.
Realism scrutinizes the physical and perceptual authenticity of the generated content:
- Video Realism: Motion Smoothness Score (MSS), Object Integrity Score (OIS), and Temporal Coherence Score (TCS).
- Audio Realism: Acoustic Artifacts Score (AAS) and Material-Timbre Consistency (MTC).
Experiments
Main Results
We evaluate 11 representative T2AV systems: 7 closed-source end-to-end models (Veo-3.1, Sora-2, Kling-2.6, Wan-2.6, Wan-2.5, Seedance-1.5, and PixVerse-V5.5), 2 open-source end-to-end models (Ovi-1.1 and JavisDiT), and 2 composed generation pipelines (Wan-2.2 + HunyuanVideo-Foley and AudioLDM2 + MTV).
Our analysis of the results yields the following key observations:
- The Gap Between Open and Closed-Source: Closed-source models show superior performance over open-source ones in both objective metrics and semantic evaluations.
- The Audio Realism Bottleneck: While proprietary models demonstrate robust capabilities in Instruction Following (IF), they exhibit significant deficiencies in Realism, particularly in the auditory domain.
- T2AV-Compass is challenging: No single model dominates all evaluation dimensions. For instance, while Veo-3.1 attains the highest overall average, it shows major deficiencies in Audio Realism.
- Competitiveness of Composed Pipelines: Composed systems remain highly effective for specific metrics. Notably, the Wan-2.2 + Hunyuan-Foley pipeline achieves the highest score in Video Realism, surpassing all end-to-end models.
Further Analysis
As illustrated in the figure above, the macro-level evaluation reveals a clear stratification of model capabilities across the six visual dimensions. Veo-3.1 and Wan-2.5 consistently constitute the top tier, demonstrating robust and balanced performance across Aesthetics, Attribute, and Cinematography (Cinema). Notably, Sora-2 remains highly competitive in static-centric dimensions such as Attribute and World, even surpassing the other leaders in the latter, which suggests a strong prior in factual and naturalistic grounding.
However, Dynamics emerges as the most challenging and discriminative dimension for all systems. Wan-2.5 attains the peak score in Dynamics, with Veo-3.1 following closely, underscoring their relative strength in executing motion-centric instructions. In contrast, Sora-2 exhibits a noticeable decline in this category, indicating a potential bottleneck in maintaining complex temporal coherence and interactions.
Among the remaining systems, PixVerse maintains a stable mid-tier position, while Ovi-1.1 consistently trails across all metrics. The most pronounced deficits for Ovi-1.1 are observed in Dynamics and World, reflecting significant difficulties in handling temporally demanding tasks and knowledge-intensive prompts. Collectively, these findings suggest that while high-end models are approaching saturation in visual appearance and cinematic styling, the frontier for robust instruction-following lies in mastering temporal causality and sophisticated world-knowledge integration.
Multi-metric Radar Comparison
As shown in the radar plots above, the evaluated systems exhibit a consistent trend: OIS and TCS achieve relatively higher scores for strong models, while MTC remains the most challenging dimension and contributes the largest cross-model variance.
Inspecting individual profiles, Veo-3.1 demonstrates the most balanced high-level performance, leading on MSS and maintaining strong OIS/TCS, indicating smooth motion together with robust object integrity and temporal consistency. Sora-2 is highly competitive and attains the strongest OIS and TCS, but shows a lower value on AAS, suggesting that its strengths lie more in visual coherence than in artifact-free audio synthesis. Wan-2.5 forms the second tier with solid OIS/TCS yet noticeably weaker MSS/MTC, implying a relative gap in motion smoothness and material-grounded sound. PixVerse-V5.5 delivers mid-range performance with comparatively better MSS/OIS but limited TCS, reflecting less consistent temporal coherence. Finally, Ovi-1.1 underperforms across most criteria, especially MTC, highlighting persistent difficulty in matching audio timbre to the depicted materials and in maintaining overall temporal quality.
Overall, these results suggest that improving material-timbre consistency (MTC) is crucial for narrowing the audio realism gap, while current top models differentiate themselves primarily through stronger temporal coherence (TCS) and object integrity (OIS).
Case Study
Representative video samples generated by different T2AV models.
Case #1
"In a stylized 3D Pixar-like CGI animation, a sunny school basketball court is bathed in the warm glow of afternoon sunlight, with soft volumetric rays and a gentle bloom on the highlights. The court, featuring slightly worn painted lines and metal hoops with chain nets, is alive with the joyful energy of 10 to 12-year-old kids dribbling, passing, and shooting basketballs. In the mid-ground near the baseline stands Wimbly, a cheerful 10-year-old white boy with a round, friendly face, medium-length messy brown hair, and big warm brown eyes, dressed in a yellow T-shirt, navy shorts, and white sneakers. The camera performs a slow, cinematic pan with a natural handheld feel across the court, eventually drifting to settle into a medium close-up on Wimbly. He grips a basketball, his bright eyes intently watching the game. Just as a shot swishes cleanly through a chain net, a warm rim light catches the side of his face and outlines his hair, and his face lights up with a smile. The air is filled with the sounds of squeaking sneakers, rhythmic ball thumps, the satisfying rattle of the chain net, and the cheerful laughter of children. A calm, narrative voiceover says, "There was once a cheerful little boy named Wimbly.""
Case #268
"In a medium wide shot, a strikingly beautiful woman in a vibrant, figure-hugging red sheath dress walks with a graceful, unhurried gait through a lush, sun-dappled urban park during the golden hour. A smooth tracking shot follows her from the side, keeping her centered in the frame with a shallow depth of field. As she passes various groups of men, their activities come to an abrupt halt; conversations trail off and all heads turn in unison, their expressions a mixture of awe and disbelief. The ambient sounds of the park—distant city hum and birds chirping—suddenly diminish, leaving only the confident, rhythmic click of her heels on the path."
Case #377
"On a brightly lit ceremonial stage, a noble German Shepherd service dog stands proudly, its tactical vest densely covered with gleaming medals. A female officer in full dress uniform gently pets its head, her expression a mix of affection and respect, while a male officer stands beside them, smiling with admiration. The camera begins with an extreme close-up, slowly panning across the collection of medals, then smoothly pulls back into a medium shot that captures the entire heartwarming tableau. The scene is filmed with a shallow depth of field and warm, respectful lighting, creating a respectful and heartwarming cinematic style. A gentle, inspiring orchestral score plays, accompanied by the faint, respectful murmur of an audience and the soft rustle of uniforms."
Conclusion
We introduced T2AV-Compass, a unified benchmark for systematically evaluating text-to-audio-video generation. By combining a taxonomy-driven prompt construction pipeline with a dual-level evaluation framework, T2AV-Compass enables fine-grained and diagnostic assessment of video quality, audio quality, cross-modal alignment, instruction following, and realism.
Extensive experiments across a broad set of representative T2AV systems demonstrate that our benchmark effectively differentiates model capabilities and exposes diverse failure modes that are not captured by existing evaluations.
We hope T2AV-Compass serves as a practical and evolving foundation for advancing both the evaluation and modeling of text-to-audio-video generation.
Citation
@misc{cao2025t2avcompass,
title = {T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation},
author = {Cao, Zhe and Wang, Tao and Wang, Jiaming and Wang, Yanghai and Zhang, Yuanxing and Chen, Jialu and Deng, Miao and Wang, Jiahao and Guo, Yubin and Liao, Chenxi and Zhang, Yize and Zhang, Zhaoxiang and Liu, Jiaheng},
year = {2025},
note = {Preprint},
}