Compositional video editing evaluation

CoVEBench: Can Video Editing Models Handle Complex Instructions?

A diagnostic benchmark for real-world, multi-point video editing prompts, with fine-grained checklist evaluation across instruction compliance, video quality, and source fidelity.

Overview of the CoVEBench evaluation framework
CoVEBench diagnoses compositional editing through structured prompts and checklist-based evaluation.
416 curated source videos
626 multi-point instructions
9,990 fine-grained checklist items
10 evaluated editing models

Project demo

CoVEBench in action

A short walkthrough of the benchmark interface, evaluation flow, and representative video editing examples.

Demo video for the CoVEBench project page.

Benchmark construction

Designed for complex editing workflows

The benchmark emphasizes realistic compositional instructions instead of isolated, single-edit prompts. Videos are filtered by visual quality, editability, duration, and resolution, then deduplicated, before being paired with diverse instructions and verifiable checklist questions.
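The source-filtering step can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the benchmark's actual pipeline: the `VideoMeta` fields, the score ranges, the threshold values, and the hash-based duplicate check are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoMeta:
    """Hypothetical per-video metadata; field names are illustrative only."""
    video_id: str
    duration_s: float
    height: int                # vertical resolution in pixels
    quality_score: float       # assumed visual-quality score in [0, 1]
    editability_score: float   # assumed editability score in [0, 1]
    content_hash: str          # assumed fingerprint used to spot duplicates


def filter_sources(videos, min_duration_s=2.0, max_duration_s=30.0,
                   min_height=480, min_quality=0.5, min_editability=0.5):
    """Keep videos that pass every filter, dropping later duplicates.

    All thresholds are placeholder values, not the benchmark's settings.
    """
    seen_hashes = set()
    kept = []
    for v in videos:
        if not (min_duration_s <= v.duration_s <= max_duration_s):
            continue  # too short or too long
        if v.height < min_height:
            continue  # resolution too low
        if v.quality_score < min_quality or v.editability_score < min_editability:
            continue  # poor visual quality or hard to edit
        if v.content_hash in seen_hashes:
            continue  # duplicate of a video already kept
        seen_hashes.add(v.content_hash)
        kept.append(v)
    return kept


candidates = [
    VideoMeta("a", 5.0, 720, 0.8, 0.9, "h1"),
    VideoMeta("b", 1.0, 720, 0.8, 0.9, "h2"),    # too short
    VideoMeta("c", 5.0, 360, 0.8, 0.9, "h3"),    # resolution too low
    VideoMeta("d", 6.0, 1080, 0.9, 0.9, "h1"),   # duplicate of "a"
    VideoMeta("e", 8.0, 1080, 0.3, 0.9, "h5"),   # low quality
    VideoMeta("f", 10.0, 1080, 0.7, 0.6, "h6"),
]
kept_ids = [v.video_id for v in filter_sources(candidates)]
```

In this toy run only videos "a" and "f" survive. A real pipeline would likely replace the exact-hash check with a perceptual similarity measure.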

Three-stage data curation pipeline
Three-stage curation: source video filtering, instruction generation, and checklist refinement.
Data statistics covering edit types and video properties
Diverse edit dimensions, prompt lengths, edit counts, durations, and resolutions.

Evaluation protocol

Three complementary dimensions

CoVEBench separates whether an edit was executed, whether the output remains visually plausible, and whether unrelated source content is preserved, enabling failures to be localized beyond a single aggregate score.

Evaluation matrix of CoVEBench metrics
Primary holistic indicators are UAS for instruction compliance, VQR for video quality, and SEM for fidelity.

Main results

Complex instructions remain difficult

Strong proprietary systems lead the benchmark, but absolute union accuracy remains far below individual instruction-following and realism scores. This gap indicates that models can partially execute requested changes while still failing the full compositional requirement.
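A small sketch may help show why an all-or-nothing score can sit well below per-item scores. It assumes that union accuracy credits a sample only when every one of its checklist items passes; the benchmark's exact UAS definition may differ, and the function names and toy data here are hypothetical.

```python
def item_accuracy(results):
    """Fraction of individual checklist items passed, pooled over all samples."""
    total = sum(len(items) for items in results)
    passed = sum(sum(items) for items in results)
    return passed / total


def union_accuracy(results):
    """Fraction of samples in which every checklist item passes (assumed definition)."""
    return sum(all(items) for items in results) / len(results)


# Toy data: four edited videos, four checklist items each (True = item satisfied).
results = [
    [True, True, True, True],
    [True, True, False, True],
    [True, False, True, True],
    [True, True, True, False],
]

per_item = item_accuracy(results)   # 13 of 16 items pass
union = union_accuracy(results)     # only the first sample passes every item
```

In this toy case the per-item accuracy is 0.8125 while the union accuracy is 0.25: three of the four outputs miss a single item each, which is enough to fail the full compositional requirement.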

Quantitative results across instruction compliance, video quality, and video fidelity
Quantitative comparison across closed-source and open-source video editing models.
Finding 1

Closed-source models are stronger, but the task is not solved

Wan2.7 and HappyHorse1.0 reach the best UAS scores, yet even the strongest model remains under 57% union accuracy on compositional edits.

Finding 2

Execution and preservation are in tension

Some systems improve instruction following by making stronger edits, but this can reduce semantic preservation and alter regions that should remain unchanged.

Finding 3

Joint editing outperforms stepwise decomposition

Joint editing achieves 30.63% UAS versus 23.70% for sequential editing, avoiding error accumulation and overwriting from intermediate generations.
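The reported gap is 6.93 percentage points (30.63% versus 23.70%). The two protocols can be contrasted with a toy sketch, in which videos are stand-in strings and `toy_edit` is a hypothetical placeholder for a real editing model; none of these names come from the benchmark.

```python
from typing import Callable, List

# Hypothetical signature: (video, instruction) -> edited video.
EditFn = Callable[[str, str], str]


def joint_edit(video: str, edit_points: List[str], edit_fn: EditFn) -> str:
    """Apply all edit points in one generation pass using a combined prompt."""
    return edit_fn(video, " ".join(edit_points))


def sequential_edit(video: str, edit_points: List[str], edit_fn: EditFn) -> str:
    """Apply edit points one at a time, feeding each output into the next step."""
    current = video
    for point in edit_points:
        # Each step starts from a generated video, so artifacts from earlier
        # steps can carry over and earlier edits can be overwritten.
        current = edit_fn(current, point)
    return current


def toy_edit(video: str, instruction: str) -> str:
    """Stand-in editor that only records which instruction was applied."""
    return f"{video} -> [{instruction}]"


points = ["Add two cups.", "Fill them with espresso."]
joint_out = joint_edit("src", points, toy_edit)
sequential_out = sequential_edit("src", points, toy_edit)
```

The joint path calls the model once on the source video, while the sequential path calls it once per edit point on progressively regenerated inputs, which is where accumulated errors would enter.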

Further analysis

Robustness, validity, and failure modes

Additional experiments show that longer temporal spans, more edit points, and longer instructions amplify the difficulty. Metric-human agreement remains high, supporting the evaluation protocol.
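One common way to quantify metric-human agreement is pairwise preference consistency, sketched below. This is an assumption about the general approach, not a description of the benchmark's protocol; the function name, data layout, and tie handling are hypothetical.

```python
def pairwise_agreement(metric_scores, human_prefs):
    """Share of human preference pairs that the metric ranks the same way.

    metric_scores: dict mapping output id -> metric score (higher is better).
    human_prefs: list of (preferred_id, other_id) pairs from annotators.
    Ties in metric score are counted as disagreement (an arbitrary choice).
    """
    agree = sum(
        1 for preferred, other in human_prefs
        if metric_scores[preferred] > metric_scores[other]
    )
    return agree / len(human_prefs)


scores = {"a": 0.9, "b": 0.4, "c": 0.6}
prefs = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "c")]
agreement = pairwise_agreement(scores, prefs)  # 3 of 4 pairs match
```

Here the metric agrees with three of the four human judgments, giving 0.75.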

Model robustness under increasing temporal and editing complexity
Accuracy declines as generated frames, source duration, edit count, and instruction length increase.
Human preference consistency of metrics
All reported metrics exceed 85% agreement with human preferences.
Inference efficiency comparison across open-source models
Efficiency varies sharply across open-source systems, affecting deployment feasibility.

Diagnostic views

Fine-grained categories expose hidden weaknesses

Aggregate scores hide important differences between editing categories. Camera control, motion edits, and subject operations remain particularly challenging, while style and background changes are more tractable for current models.

Metric correlation and category-level performance analysis
Metric complementarity and category-level performance across representative editing dimensions.
High-level error analysis of five video editing models
Execution inadequacy is the dominant bottleneck, with preservation and physical-grounding failures also recurring.

Examples

Representative samples and qualitative comparisons

CoVEBench pairs visual samples with detailed checklist supervision, enabling model outputs to be inspected beyond a single global score.

Example 01 · Source video & editing instruction: Add two additional double-walled glass cups to the machine's tray, placing one on each side of the center cup to create a row of three. Fill these two new side cups with espresso and a layer of crema, while keeping the center cup positioned under the nozzles to continue receiving the pour.
Wan2.7
HappyHorse1.0
OmniWeaving
Kiwi
Ditto
Example 02 · Source video & editing instruction: Change the starting contents of the bowl to include strawberries already mixed in the yogurt. Replace the action of sliding sliced strawberries with a knife with the person adding mango cubes to the bowl. Swap the tool being used from a metal knife to the black spoon previously seen in the background. Place the knife in a resting position on the table to the right of the bowl. Change the hand movements so both hands are active, using the left hand to pick up mango and the right hand to use the spoon.
Wan2.7
HappyHorse1.0
Lucy
ReCo
VACE
Example 03 · Source video & editing instruction: Apply an American Comic stylization to the entire scene, add Action VFX of vibrant energy trails emanating from the guitar strings, and execute a Zoom out camera movement. Maintain all other elements unchanged.
Wan2.7
HappyHorse1.0
OmniWeaving
Lucy
VACE

Citation

Cite CoVEBench

If you find this benchmark useful for your research, please cite the project.

@misc{wu2026covebenchvideoeditingmodels,
      title={CoVEBench: Can Video Editing Models Handle Complex Instructions?}, 
      author={Jiangtao Wu and Jiaming Wang and Yiwen He and Yuanxing Zhang and Shihao Li and Dunyuan Liu and Xuedong Zhao and Jialu Chen and Zekun Moore Wang and Jiaheng Liu},
      year={2026},
      eprint={2606.08415},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.08415}, 
}