T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

1 NJU-LINK Team, Nanjing University
2 Kling Team, Kuaishou Technology
3 Institute of Automation, Chinese Academy of Sciences
* Equal Contribution    Corresponding Author
zhecao@smail.nju.edu.cn · liujiaheng@nju.edu.cn

Abstract

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate substantial room for improvement for future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

Overview

Main overview figure
Overview of T2AV-Compass analysis and evaluation taxonomy. (a) Radial comparison of representative T2AV models under our evaluation suite. (b) Prompt token-length distribution. (c–d) Semantic diversity of video/audio prompts quantified via embedding similarity (higher indicates broader coverage). (e) Hierarchical distribution of evaluation dimensions, clearly organizing objective metrics and MLLM-based assessments across video, audio, and cross-modal alignment.

Introduction

Generative AI has witnessed a paradigm shift from unimodal synthesis to cohesive multimodal content creation, with Text-to-Audio-Video (T2AV) generation emerging as a frontier that unifies visual dynamics and auditory realism.

Recent breakthroughs, from proprietary systems like Sora and Veo to open research efforts, have demonstrated the ability to generate high-fidelity audio-video pairs from textual prompts.

Despite this rapid progress, the evaluation of T2AV systems remains fundamentally underdeveloped. These challenges are exacerbated by the intrinsic complexity of T2AV generation. Specifically, high-quality output requires simultaneous success along multiple axes: unimodal perceptual quality, cross-modal semantic alignment, precise temporal synchronization, instruction following under compositional constraints, and realism grounded in physical and commonsense knowledge.

Current evaluations struggle to answer core questions: Do generated sounds correspond to visible events? Are multiple audio sources synchronized with complex visual interactions? Does the model faithfully follow detailed instructions while maintaining physical and perceptual realism?

To address this gap, we introduce T2AV-Compass, the first comprehensive benchmark designed specifically for evaluating text-to-audio-video generation.

First, T2AV-Compass employs a taxonomy-driven curation pipeline to construct 500 complex prompts with broad semantic coverage. These prompts impose precise constraints across cinematography, physical causality, and acoustic environments, covering diverse and challenging audiovisual scenarios, from multi-source sound mixing to long narrative event chains.

Second, we propose a dual-level evaluation framework that integrates objective evaluation based on classical automated metrics with subjective evaluation based on an MLLM-as-a-Judge protocol. The objective evaluation quantifies video quality (technical fidelity, aesthetic appeal), audio quality (acoustic realism, semantic usefulness), and cross-modal alignment (text-audio/video semantic consistency, temporal synchronization). The subjective evaluation assesses video and audio instruction following against well-defined checklists, together with perceptual realism (e.g., physical plausibility and fine-grained details), addressing the limitations of automated metrics in capturing nuanced semantic and causal coherence.

Contributions

Leaderboard

Tables mirror the main results in the paper. Best values are highlighted.

Objective metrics

| Method | Open-Source | VT↑ | VA↑ | PQ↑ | CU↑ | A-V↑ | T-A↑ | T-V↑ | DS↓ | LS↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| *T2AV* | | | | | | | | | | |
| Veo-3.1 | ✗ | 13.39 | 5.425 | 7.015 | 6.621 | 0.2856 | 0.2335 | 0.2438 | 0.6776 | 1.509 |
| Sora-2 | ✗ | 7.568 | 4.112 | 5.827 | 5.340 | 0.2419 | 0.2484 | 0.2432 | 0.8100 | 1.331 |
| Kling-2.6 | ✗ | 11.41 | 5.417 | 6.882 | 6.449 | 0.2495 | 0.2495 | 0.2449 | 0.7852 | 1.502 |
| Wan-2.6 | ✗ | 11.87 | 4.605 | 6.658 | 6.222 | 0.2149 | 0.2572 | 0.2451 | 0.8818 | 1.081 |
| Seedance-1.5 | ✗ | 12.74 | 5.007 | **7.555** | **7.250** | **0.2875** | 0.2320 | 0.2370 | 0.8650 | **1.560** |
| Wan-2.5 | ✗ | 13.29 | 4.642 | 6.469 | 5.869 | 0.2026 | 0.2445 | **0.2470** | 0.8810 | 1.065 |
| Pixverse-V5.5 | ✗ | 11.54 | 4.558 | 6.108 | 5.855 | 0.1816 | 0.2305 | 0.2431 | **0.6627** | 1.306 |
| Ovi-1.1 | ✓ | 9.336 | 4.368 | 6.569 | 6.492 | 0.1620 | 0.1756 | 0.2391 | 0.9624 | 1.191 |
| JavisDiT | ✓ | 6.850 | 3.575 | 4.299 | 5.204 | 0.1284 | 0.1257 | 0.2320 | 1.322 | -- |
| *T2V + TV2A* | | | | | | | | | | |
| Wan-2.2 + Hunyuan-Foley | ✓ | **13.43** | **5.605** | 6.497 | 6.208 | 0.2575 | 0.2076 | 0.2455 | 0.7935 | 0.6978 |
| *T2A + TA2V* | | | | | | | | | | |
| AudioLDM2 + MTV | ✓ | 8.066 | 3.458 | 6.406 | 6.100 | 0.1639 | **0.2698** | 0.2394 | 1.1592 | 0.6835 |

VT: video technical quality · VA: video aesthetic quality · PQ: perceptual quality · CU: content usefulness · A–V / T–A / T–V: audio–video / text–audio / text–video alignment · DS: DeSync (lower is better) · LS: LatentSync.

Subjective evaluation

| Method | Open-Source | IF Video↑ | IF Audio↑ | Video Realism↑ | Audio Realism↑ | Average↑ |
|---|---|---|---|---|---|---|
| *T2AV* | | | | | | |
| Veo-3.1 | ✗ | 76.15 | 67.90 | 87.14 | 49.95 | **70.29** |
| Sora-2 | ✗ | 74.93 | 72.86 | 85.53 | 46.01 | 69.83 |
| Kling-2.6 | ✗ | 73.72 | 63.89 | 87.98 | 47.03 | 68.16 |
| Wan-2.6 | ✗ | **78.52** | **74.95** | 82.05 | 35.18 | 67.68 |
| Seedance-1.5 | ✗ | 60.96 | 61.22 | 88.94 | **53.84** | 66.24 |
| Wan-2.5 | ✗ | 76.56 | 57.95 | 76.00 | 35.06 | 61.39 |
| Pixverse-V5.5 | ✗ | 65.13 | 53.31 | 69.37 | 33.58 | 55.35 |
| Ovi-1.1 | ✓ | 55.05 | 52.83 | 65.93 | 30.75 | 51.14 |
| JavisDiT | ✓ | 32.56 | 15.26 | 34.97 | 14.85 | 24.41 |
| *T2V + TV2A* | | | | | | |
| Wan-2.2 + Hunyuan-Foley | ✓ | 64.54 | 38.19 | **89.63** | 41.25 | 58.40 |
| *T2A + TA2V* | | | | | | |
| AudioLDM2 + MTV | ✓ | 47.13 | 54.39 | 56.73 | 31.90 | 47.54 |

IF: instruction following. Realism measures perceptual plausibility and fine-grained details.

T2AV-Compass

We present T2AV-Compass, a unified benchmark designed to evaluate diverse T2AV systems. Section 3.1 details the data construction pipeline. Section 3.2 provides comprehensive statistics of the resulting benchmark, highlighting its diversity and complexity. Section 3.3 introduces our Dual-Level Evaluation Framework, assessing both objective signal fidelity and cross-modal semantics.

Data Construction

To ensure the diversity and complexity of the dataset, we employ a three-stage construction pipeline that combines large-scale taxonomy-based prompt curation, LLM-based refinement, and real-world video inversion.

Data construction pipeline
Data construction and checklist-based evaluation generation. The prompt suite is constructed from (1) curated community prompts with semantic deduplication (cos ≥ 0.8), clustering-based sampling, LLM rewriting, and human refinement, and (2) a video-inversion stream using filtered 4–10s YouTube clips with dense captioning and manual verification. The finalized prompts are then converted into two types of checklists: instruction-alignment checks via slot extraction and dimension mapping, and perceptual-realism checks for video/audio quality.

Data Collection

To establish a foundation of broad semantic coverage, we aggregate raw prompts from a variety of high-quality sources, including VidProM, the Kling AI community, LMArena, and Shot2Story. To mitigate the imbalance between common and long-tail concepts, we implement a semantic clustering strategy. Specifically, we encode all prompts using all-mpnet-base-v2 and perform deduplication with a cosine similarity threshold of 0.8. We then apply square-root sampling (where sampling probability is inversely proportional to the square root of cluster size) to preserve semantic distinctiveness while preventing the dominance of frequent topics.
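
A minimal sketch of this deduplication and square-root sampling step is given below, assuming sentence-transformers for the all-mpnet-base-v2 encoder and scikit-learn KMeans for clustering; the cluster count and sampling budget are illustrative assumptions rather than the benchmark's exact configuration.

```python
# Sketch: semantic deduplication (cosine >= 0.8) followed by square-root
# cluster sampling. The encoder matches the text (all-mpnet-base-v2); the
# KMeans cluster count and the sampling budget are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def dedup_and_sample(prompts, sim_threshold=0.8, n_clusters=50, budget=500, seed=0):
    encoder = SentenceTransformer("all-mpnet-base-v2")
    emb = encoder.encode(prompts, normalize_embeddings=True)  # unit norm -> dot product = cosine

    # Greedy deduplication: keep a prompt only if its similarity to every
    # previously kept prompt stays below the threshold.
    kept_idx, kept_emb = [], []
    for i, e in enumerate(emb):
        if not kept_emb or np.max(np.stack(kept_emb) @ e) < sim_threshold:
            kept_idx.append(i)
            kept_emb.append(e)
    kept_emb = np.stack(kept_emb)

    # Cluster the survivors, then weight each item by 1/sqrt(size of its
    # cluster), so frequent topics cannot dominate the final sample.
    k = min(n_clusters, len(kept_idx))
    labels = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit_predict(kept_emb)
    sizes = np.bincount(labels, minlength=k)
    weights = 1.0 / np.sqrt(sizes[labels])
    probs = weights / weights.sum()

    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(kept_idx), size=min(budget, len(kept_idx)), replace=False, p=probs)
    return [prompts[kept_idx[i]] for i in chosen]
```

Square-root weighting keeps rare clusters visible in the final sample while capping the share of very frequent topics.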

Prompt Refinement and Alignment

Raw prompts often lack the descriptive density required by state-of-the-art models (e.g., Veo 3.1, Sora 2, Kling 2.6). To address this, we employ Gemini-2.5-Pro to restructure and enrich the sampled prompts. We enhance descriptions of visual subjects, motion dynamics, and acoustic events, while enforcing strict cinematographic constraints (e.g., camera angles, lighting). Following automated generation, we conduct a rigorous manual audit to filter out static scenes or illogical compositions, resulting in a curated subset of 400 complex prompts.
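
The rewriting step can be approximated with a single LLM call; the sketch below uses the public google-generativeai SDK, and the model identifier string and instruction text are assumptions rather than the authors' exact prompt.

```python
# Sketch: LLM-based prompt enrichment. The calls follow the public
# google-generativeai API; the model name and the rewriting instruction are
# assumptions, not the exact prompt used for T2AV-Compass.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")  # assumed model identifier

REWRITE_INSTRUCTION = (
    "Rewrite the following text-to-audio-video prompt. Enrich the visual subjects, "
    "motion dynamics, and acoustic events, and add explicit cinematographic "
    "constraints (camera angle, lighting). Return a single paragraph."
)

def enrich_prompt(raw_prompt: str) -> str:
    response = model.generate_content(f"{REWRITE_INSTRUCTION}\n\nPrompt: {raw_prompt}")
    return response.text.strip()
```

Enriched prompts still pass through the manual audit described above before entering the benchmark.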

Real-world Video Inversion

To counterbalance potential hallucinations in text-only generation and ensure physical plausibility, we introduce a Video-to-Text inversion stream. We select 100 diverse, high-fidelity video clips (4–10s) from YouTube and utilize Gemini-2.5-Pro to generate dense, temporally aligned captions. Discrepancies between the generated prompts and the source ground truth are resolved via human-in-the-loop verification, yielding 100 high-quality prompts anchored in real-world dynamics.

Dataset Statistics

Dataset statistics
Dataset statistics of T2AV-Compass. (a) Category distributions over five annotation dimensions (Content Genre, Primary Subject, Event Scenario, Sound Category, and Camera Motion). (b) Distributions of audiovisual complexity factors, including Visual Subject Count, Event Temporal Structure, Audio Spatial Composition, and Audio Temporal Composition.

Distribution and Diversity

Our prompts exhibit notably higher token counts compared to existing baselines (e.g., JavisBench, VABench), more accurately mirroring the complexity of real-world user queries. The dataset encompasses a broad spectrum of themes, soundscapes, and cinematographic styles. To quantify diversity, we analyze the semantic retention rates of CLIP (video) and CLAP (audio) embeddings after deduplication. Our benchmark demonstrates superior semantic distinctiveness across both modalities, significantly outperforming concurrent datasets.
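
Assuming the retention rate is the fraction of prompts that survive cosine-similarity deduplication in embedding space (the exact definition and thresholds may differ in the paper), a minimal sketch looks like this:

```python
# Sketch: semantic retention rate = fraction of embeddings (e.g., CLIP for
# video prompts, CLAP for audio prompts) that survive greedy deduplication at
# a cosine-similarity threshold. The threshold and this definition are
# illustrative assumptions.
import numpy as np

def retention_rate(embeddings: np.ndarray, threshold: float = 0.8) -> float:
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for e in emb:
        if not kept or np.max(np.stack(kept) @ e) < threshold:
            kept.append(e)
    return len(kept) / len(emb)
```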

Difficulty Analysis

We assess benchmark difficulty across four axes: (1) Visual Subject Multiplicity: 35.8% of samples feature crowds (≥ 4 subjects); (2) Audio Spatial Composition: 55.6% involve mixed on-screen/off-screen sources; (3) Event Temporal Structure: 28.2% contain long narrative chains (≥ 4 event units); (4) Audio Temporal Composition: 72.8% include simultaneous or overlapping audio events. These statistics confirm that our benchmark poses significant challenges regarding fine-grained control and temporal consistency.

Dual-Level Evaluation Framework

We introduce a dual-level evaluation framework for T2AV generation that is both systematic and reproducible. At the objective level, we factor system performance into three complementary pillars: (i) video quality, (ii) audio quality, and (iii) cross-modal alignment. At the subjective level, we propose a reasoning-first MLLM-as-a-Judge protocol that evaluates high-level semantics through two dimensions: Instruction Following (IF) via granular QA checklists, and Perceptual Realism (PR) via diagnostic violation checks. This mechanism ensures both robustness and interpretability by mandating explicit rationales before scoring.

Objective Evaluation

We use a set of expert metrics to cover the three pillars above.

Video Quality

Audio Quality

Cross-modal Alignment
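
The concrete metrics for each pillar are specified in the paper; as an illustration only, the sketch below shows how a text-video (T-V) alignment score could be computed from CLIP text/frame similarity. The checkpoint and frame-averaging scheme are assumptions and may differ from the benchmark's actual metrics.

```python
# Sketch (illustrative only): a text-video alignment score via CLIP, averaging
# the cosine similarity between the prompt and uniformly sampled frames.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_video_alignment(prompt: str, frames: list) -> float:
    """frames: list of PIL images sampled from the generated video."""
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True, truncation=True)
    frame_inputs = processor(images=frames, return_tensors="pt")
    text_feat = model.get_text_features(**text_inputs)
    frame_feat = model.get_image_features(**frame_inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    frame_feat = frame_feat / frame_feat.norm(dim=-1, keepdim=True)
    return (frame_feat @ text_feat.T).mean().item()  # mean frame-prompt cosine similarity
```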

Subjective Evaluation

Subjective evaluation illustration
Illustration of the subjective evaluation framework in T2AV-Compass. Unlike traditional metrics, our protocol provides interpretable diagnosis through two distinct tracks: (Top) Instruction following is evaluated via rigorous Q&A checklist pairs, ensuring semantic alignment in complex scenarios like social interactions and sound effects. (Bottom) Realism scrutinizes perceptual quality, rewarding fine-grained details while explicitly penalizing visual hallucinations or audio dissonance.

To address the limitations of traditional metrics in capturing fine-grained semantic details and complex cross-modal dynamics, we establish a robust "MLLM-as-a-Judge" framework. This framework comprises two distinct evaluation tracks: Instruction Following Verification (IFV) and Realism. We enforce a reasoning-first protocol, mandating that the judge explicitly articulates the rationale behind its decision prior to assigning a score on a 5-point scale.
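
A minimal sketch of one reasoning-first judging call is shown below, assuming an OpenAI-compatible chat endpoint with JSON output; the actual judge model, prompt wording, and the way the generated video/audio are attached are not specified here and are therefore placeholders.

```python
# Sketch: reasoning-first MLLM-as-a-Judge call. The judge must state its
# rationale before committing to a 1-5 score. Client, model name, and the
# media attachment format are placeholders; only the protocol shape mirrors
# the description above.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible multimodal endpoint

JUDGE_TEMPLATE = """You are evaluating a generated audio-video clip.
Question: {question}
First write a brief rationale grounded in what you observe, then give an
integer score from 1 (worst) to 5 (best).
Respond as JSON: {{"rationale": "...", "score": <int>}}"""

def judge(question: str, media_parts: list) -> dict:
    # media_parts: model-specific content parts carrying the video frames / audio.
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": JUDGE_TEMPLATE.format(question=question)}] + media_parts,
    }]
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=messages,
        response_format={"type": "json_object"},
    )
    out = json.loads(resp.choices[0].message.content)
    assert 1 <= int(out["score"]) <= 5, "judge must return a score on the 5-point scale"
    return out
```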

Instruction Following (IF) assesses the model's fidelity to textual prompts. The taxonomy encompasses 7 primary dimensions decomposed into 17 sub-dimensions.
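
Checklist answers can then be rolled up into dimension-level and overall IF scores; the sketch below uses an unweighted mean rescaled to 0-100, with assumed field names, and is not necessarily the paper's exact aggregation.

```python
# Sketch: aggregating per-question judge scores (1-5) into per-dimension and
# overall Instruction Following (IF) scores on a 0-100 scale. Field names and
# the unweighted averaging are assumptions.
from collections import defaultdict
from statistics import mean

def aggregate_if(checklist_results: list) -> dict:
    # checklist_results: e.g. [{"dimension": "Dynamics", "score": 4}, ...]
    by_dim = defaultdict(list)
    for item in checklist_results:
        by_dim[item["dimension"]].append((item["score"] - 1) / 4 * 100)  # map 1-5 -> 0-100
    per_dim = {dim: mean(vals) for dim, vals in by_dim.items()}
    return {"per_dimension": per_dim, "overall_if": mean(per_dim.values())}
```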

Realism scrutinizes the physical and perceptual authenticity of the generated content, rewarding fine-grained detail while penalizing visual hallucinations and audio dissonance.

Experiments

Main Results

We evaluate 11 representative T2AV systems: 7 closed-source end-to-end models (Veo-3.1, Sora-2, Kling-2.6, Wan-2.6, Wan-2.5, Seedance-1.5, and PixVerse-V5.5), 2 open-source end-to-end models (Ovi-1.1 and JavisDiT), and 2 composed generation pipelines (Wan-2.2 + HunyuanVideo-Foley and AudioLDM2 + MTV).

Our analysis of the results yields the key observations discussed below.

Further Analysis

Macro-level comparison across six evaluation dimensions
Macro-level comparison across six evaluation dimensions. We report the averaged Video Instruction-Following score (Video IF, Avg.) of five representative models (Veo-3.1, Wan-2.5, Ovi-1.1, PixVerse-V5.5, and Sora-2) on Aesthetics, Attribute, Cinematography, Dynamics, Relations, and World. Overall, Veo-3.1 and Wan-2.5 form the top tier with consistently strong performance; Sora-2 is competitive on Attribute and Cinema but lags on Dynamics; PixVerse exhibits mid-range performance across most dimensions; and Ovi-1.1 shows the lowest scores, with the largest gaps on Dynamics and World.

As illustrated in the figure above, the macro-level evaluation reveals a clear stratification of model capabilities across the six visual dimensions. Veo-3.1 and Wan-2.5 consistently constitute the top tier, demonstrating robust and balanced performance across Aesthetics, Attribute, and Cinematography (Cinema). Notably, Sora-2 remains highly competitive in static-centric dimensions such as Attribute and World, even surpassing the other leaders in the latter, which suggests a strong prior in factual and naturalistic grounding.

However, Dynamics emerges as the most challenging and discriminative dimension for all systems. Wan-2.5 attains the peak score in Dynamics, with Veo-3.1 following closely, underscoring their relative strength in executing motion-centric instructions. In contrast, Sora-2 exhibits a noticeable decline in this category, indicating a potential bottleneck in maintaining complex temporal coherence and interactions.

Among the remaining systems, PixVerse maintains a stable mid-tier position, while Ovi-1.1 consistently trails across all metrics. The most pronounced deficits for Ovi-1.1 are observed in Dynamics and World, reflecting significant difficulties in handling temporally demanding tasks and knowledge-intensive prompts. Collectively, these findings suggest that while high-end models are approaching saturation in visual appearance and cinematic styling, the frontier for robust instruction-following lies in mastering temporal causality and sophisticated world-knowledge integration.

Multi-metric Radar Comparison

Multi-metric radar comparison of representative T2AV systems
Multi-metric radar comparison of representative T2AV systems. We report five complementary criteria for overall generation quality: AAS, MSS, MTC, OIS, and TCS (higher is better). The leftmost panel summarizes the average performance across models, while the remaining panels present per-model radar profiles for OVI-1.1, PixVerse-V5.5, Sora-2, Wan-2.5, and Veo-3.1, respectively. Overall, Veo-3.1 and Sora-2 achieve the strongest balanced performance, whereas OVI-1.1 shows the lowest scores with particularly weak MTC.

As shown in the radar plots above, the evaluated systems exhibit a consistent trend: OIS and TCS achieve relatively higher scores for strong models, while MTC remains the most challenging dimension and contributes the largest cross-model variance.

Inspecting individual profiles, Veo-3.1 demonstrates the most balanced high-level performance, leading on MSS and maintaining strong OIS/TCS, indicating robust content presentation and temporal consistency. Sora-2 is highly competitive and attains the strongest OIS and TCS, but shows a lower value on AAS, suggesting that its strengths lie more in overall realism/coherence than in fine-grained attribute adherence. Wan-2.5 forms the second tier with solid OIS/TCS yet noticeably weaker MSS/MTC, implying a relative gap in multi-aspect stability and cross-topic robustness. PixVerse-V5.5 delivers mid-range performance with comparatively better MSS/OIS but limited TCS, reflecting less consistent temporal coherence. Finally, OVI-1.1 underperforms across most criteria—especially MTC—highlighting persistent difficulty in maintaining reliable multi-topic consistency and overall temporal quality.

Overall, these results suggest that improving MTC-related capability is crucial for narrowing the gap, while current top models primarily differentiate through stronger temporal coherence (TCS) and overall integrity (OIS).

Case Study

Representative video samples generated by different T2AV models.

Case #1

Prompt:

"In a stylized 3D Pixar-like CGI animation, a sunny school basketball court is bathed in the warm glow of afternoon sunlight, with soft volumetric rays and a gentle bloom on the highlights. The court, featuring slightly worn painted lines and metal hoops with chain nets, is alive with the joyful energy of 10 to 12-year-old kids dribbling, passing, and shooting basketballs. In the mid-ground near the baseline stands Wimbly, a cheerful 10-year-old white boy with a round, friendly face, medium-length messy brown hair, and big warm brown eyes, dressed in a yellow T-shirt, navy shorts, and white sneakers. The camera performs a slow, cinematic pan with a natural handheld feel across the court, eventually drifting to settle into a medium close-up on Wimbly. He grips a basketball, his bright eyes intently watching the game. Just as a shot swishes cleanly through a chain net, a warm rim light catches the side of his face and outlines his hair, and his face lights up with a smile. The air is filled with the sounds of squeaking sneakers, rhythmic ball thumps, the satisfying rattle of the chain net, and the cheerful laughter of children. A calm, narrative voiceover says, "There was once a cheerful little boy named Wimbly.""

Veo-3.1
Kling-2.6
Ovi-1.1

Case #268

Prompt:

"In a medium wide shot, a strikingly beautiful woman in a vibrant, figure-hugging red sheath dress walks with a graceful, unhurried gait through a lush, sun-dappled urban park during the golden hour. A smooth tracking shot follows her from the side, keeping her centered in the frame with a shallow depth of field. As she passes various groups of men, their activities come to an abrupt halt; conversations trail off and all heads turn in unison, their expressions a mixture of awe and disbelief. The ambient sounds of the park—distant city hum and birds chirping—suddenly diminish, leaving only the confident, rhythmic click of her heels on the path."

Veo-3.1
Kling-2.6
Ovi-1.1

Case #377

Prompt:

"On a brightly lit ceremonial stage, a noble German Shepherd service dog stands proudly, its tactical vest densely covered with gleaming medals. A female officer in full dress uniform gently pets its head, her expression a mix of affection and respect, while a male officer stands beside them, smiling with admiration. The camera begins with an extreme close-up, slowly panning across the collection of medals, then smoothly pulls back into a medium shot that captures the entire heartwarming tableau. The scene is filmed with a shallow depth of field and warm, respectful lighting, creating a respectful and heartwarming cinematic style. A gentle, inspiring orchestral score plays, accompanied by the faint, respectful murmur of an audience and the soft rustle of uniforms."

Veo-3.1
Kling-2.6
Ovi-1.1

Conclusion

We introduced T2AV-Compass, a unified benchmark for systematically evaluating text-to-audio-video generation. By combining a taxonomy-driven prompt construction pipeline with a dual-level evaluation framework, T2AV-Compass enables fine-grained and diagnostic assessment of video quality, audio quality, cross-modal alignment, instruction following, and realism.

Extensive experiments across a broad set of representative T2AV systems demonstrate that our benchmark effectively differentiates model capabilities and exposes diverse failure modes that are not captured by existing evaluations.

We hope T2AV-Compass serves as a practical and evolving foundation for advancing both the evaluation and modeling of text-to-audio-video generation.

Citation

BibTeX

@misc{cao2025t2avcompass,
  title        = {T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation},
  author       = {Cao, Zhe and Wang, Tao and Wang, Jiaming and Wang, Yanghai and Zhang, Yuanxing and Chen, Jialu and Deng, Miao and Wang, Jiahao and Guo, Yubin and Liao, Chenxi and Zhang, Yize and Zhang, Zhaoxiang and Liu, Jiaheng},
  year         = {2025},
  note         = {Preprint},
}