Compositional video editing evaluation

CoVEBench: Can Video Editing Models Handle Complex Instructions?

A diagnostic benchmark for real-world, multi-point video editing prompts, with fine-grained checklist evaluation across instruction compliance, video quality, and source fidelity.


Overview of the CoVEBench evaluation framework — CoVEBench diagnoses compositional editing through structured prompts and checklist-based evaluation.

416 curated source videos

626 multi-point instructions

9,990 fine-grained checklist items

10 evaluated editing models

Project demo

CoVEBench in action

A short walkthrough of the benchmark interface, evaluation flow, and representative video editing examples.


Benchmark construction

Designed for complex editing workflows

The benchmark emphasizes realistic compositional instructions instead of isolated, single-edit prompts. Videos are filtered for visual quality, editability, duration, resolution, and duplicate removal before being paired with diverse instructions and verifiable checklist questions.
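The filtering step described above can be sketched as a chain of per-video checks. This is a minimal illustration, not the project's pipeline: the `SourceVideo` fields, the threshold constants, and the hash-based duplicate check are all hypothetical placeholders chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceVideo:
    video_id: str
    quality_score: float  # hypothetical visual-quality score in [0, 1]
    editable: bool        # hypothetical editability judgement
    duration_s: float     # clip length in seconds
    height: int           # vertical resolution in pixels
    content_hash: str     # hypothetical fingerprint used for de-duplication


# Placeholder thresholds for illustration only; the real values are not
# stated on this page.
MIN_QUALITY = 0.5
MIN_DURATION_S = 2.0
MAX_DURATION_S = 30.0
MIN_HEIGHT = 480


def filter_videos(videos):
    """Keep videos that pass every check, dropping later duplicates."""
    seen_hashes = set()
    kept = []
    for video in videos:
        if video.quality_score < MIN_QUALITY or not video.editable:
            continue
        if not (MIN_DURATION_S <= video.duration_s <= MAX_DURATION_S):
            continue
        if video.height < MIN_HEIGHT:
            continue
        if video.content_hash in seen_hashes:
            continue
        seen_hashes.add(video.content_hash)
        kept.append(video)
    return kept
```

Running the checks in a fixed order keeps the first copy of any duplicated clip and makes it easy to log which criterion rejected a video.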

Three-stage data curation pipeline — Three-stage curation: source video filtering, instruction generation, and checklist refinement.

Data statistics covering edit types and video properties — Diverse edit dimensions, prompt lengths, edit counts, durations, and resolutions.

Evaluation protocol

Three complementary dimensions

CoVEBench separates whether an edit was executed, whether the output remains visually plausible, and whether unrelated source content is preserved, enabling failures to be localized beyond a single aggregate score.

Evaluation matrix of CoVEBench metrics — Primary holistic indicators are UAS for instruction compliance, VQR for video quality, and SEM for fidelity.

Main results

Complex instructions remain difficult

Strong proprietary systems lead the benchmark, but absolute union accuracy remains far below individual instruction-following and realism scores. This gap indicates that models can partially execute requested changes while still failing the full compositional requirement.

Leaderboard

Ranked by union accuracy on compositional edits


Wan2.7 (closed-source): 56.89 UAS; IFS 82.02, VQR 4.407, SEM 87.90

HappyHorse1.0 (closed-source): 55.18 UAS; IFS 76.54, VQR 4.388, SEM 92.73

OmniWeaving (open-source): 30.14 UAS; IFS 57.18, VQR 3.660, SEM 85.05

Higher is better. Models are ordered by UAS, the strict union accuracy for satisfying every checklist item in a compositional instruction.
Rank	Model	Source	UAS	IFS	VRS	VQR	AES	MSM	TQ	SEM	SSIM	MF	SRC
#1	Wan2.7	Closed	56.89	82.02	79.97	4.407	5.077	0.692	18.223	87.90	0.482	0.896	0.815
#2	HappyHorse1.0	Closed	55.18	76.54	84.52	4.388	5.070	0.710	18.414	92.73	0.506	0.886	0.823
#3	OmniWeaving	Open	30.14	57.18	61.75	3.660	4.135	0.709	15.092	85.05	0.463	0.891	0.781
#4	Kiwi	Open	29.03	53.90	56.13	3.670	4.609	0.642	15.649	79.51	0.605	0.893	0.814
#5	Ditto	Open	26.50	49.45	60.69	3.921	4.297	0.639	15.583	58.02	0.355	0.907	0.763
#6	Lucy	Open	26.01	50.85	58.68	3.688	4.136	0.661	15.045	86.13	0.762	0.918	0.834
#7	ICVE	Open	25.83	53.14	54.00	3.277	3.695	0.642	12.168	71.02	0.288	0.814	0.642
#8	ReCo	Open	24.35	54.16	47.42	3.146	3.906	0.625	12.101	70.03	0.528	0.870	0.730
#9	InsV2V	Open	14.61	37.18	47.36	3.307	4.327	0.698	10.501	77.85	0.280	0.886	0.740
#10	VACE	Open	9.69	22.92	41.35	3.718	5.037	0.688	13.637	81.73	0.709	0.958	0.783
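The gap between UAS and the per-item scores in the table follows from how a strict union metric behaves: one failed checklist item fails the whole instruction. The sketch below illustrates this on made-up pass/fail data. The `item_accuracy` function is only an assumed stand-in for a per-item score and is not claimed to match how IFS is computed.

```python
def union_accuracy(results):
    """Percentage of instructions whose checklist items all pass."""
    fully_passed = sum(1 for items in results if all(items))
    return 100.0 * fully_passed / len(results)


def item_accuracy(results):
    """Percentage of individual checklist items that pass (assumed stand-in)."""
    total_items = sum(len(items) for items in results)
    passed_items = sum(sum(items) for items in results)
    return 100.0 * passed_items / total_items


# Toy data: one inner list of pass/fail outcomes per instruction.
results = [
    [True, True, True],
    [True, False, True, True],
    [True, True],
    [False, True, True, True, True],
]

print(f"{union_accuracy(results):.1f}")  # prints 50.0
print(f"{item_accuracy(results):.1f}")   # prints 85.7
```

In this toy case 12 of 14 items pass, yet only 2 of 4 instructions are fully satisfied, which mirrors the pattern of high per-item scores alongside much lower UAS.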

Finding 1

Closed-source models are stronger, but the task is not solved

Wan2.7 (56.89) and HappyHorse1.0 (55.18) reach the best UAS scores, well ahead of the strongest open-source model (OmniWeaving, 30.14), yet even the strongest model remains under 57% union accuracy on compositional edits.

Finding 2

Execution and preservation are in tension

Some systems improve instruction following by making stronger edits, but this can reduce semantic preservation and alter regions that should remain unchanged.
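One simple way to look for this tension in the leaderboard is to list model pairs where one model follows instructions better (higher IFS) but preserves the source worse (lower SEM). The sketch below uses the IFS and SEM values from the leaderboard table; the `tradeoff_pairs` helper is a hypothetical name. A pair appearing in the list does not by itself show that stronger edits caused the lower preservation score.

```python
from itertools import permutations

# (IFS, SEM) per model, copied from the leaderboard table.
SCORES = {
    "Wan2.7": (82.02, 87.90),
    "HappyHorse1.0": (76.54, 92.73),
    "OmniWeaving": (57.18, 85.05),
    "Kiwi": (53.90, 79.51),
    "Ditto": (49.45, 58.02),
    "Lucy": (50.85, 86.13),
    "ICVE": (53.14, 71.02),
    "ReCo": (54.16, 70.03),
    "InsV2V": (37.18, 77.85),
    "VACE": (22.92, 81.73),
}


def tradeoff_pairs(scores):
    """Return ordered pairs (a, b) where a has higher IFS but lower SEM than b."""
    return [
        (a, b)
        for a, b in permutations(scores, 2)
        if scores[a][0] > scores[b][0] and scores[a][1] < scores[b][1]
    ]


for better_ifs, better_sem in tradeoff_pairs(SCORES):
    print(f"{better_ifs} beats {better_sem} on IFS but trails it on SEM")
```

For example, the two closed-source models form such a pair: Wan2.7 has the higher IFS, while HappyHorse1.0 has the higher SEM.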

Finding 3

Joint editing outperforms stepwise decomposition

Joint editing achieves 30.63% UAS versus 23.70% for sequential editing, avoiding error accumulation and overwriting from intermediate generations.
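A toy model gives some intuition for error accumulation in stepwise editing. It assumes every sequential step succeeds independently with the same probability and that a failed step is never repaired later. Both are simplifying assumptions made here for illustration; they are not the benchmark's analysis and do not model overwriting.

```python
def chained_success(per_step_success, num_steps):
    """Probability that every step in an independent chain succeeds."""
    return per_step_success ** num_steps


# Even a fairly reliable single step compounds quickly over several edits.
for steps in (1, 3, 5):
    print(steps, round(chained_success(0.9, steps), 3))

# Gap between the reported joint and sequential UAS, in percentage points.
reported_gap = round(30.63 - 23.70, 2)
print(reported_gap)  # prints 6.93
```

Under these assumptions a 90% per-step success rate drops to about 59% after five chained steps, so a multi-point instruction split into many sequential edits has more opportunities to fail than a single joint pass.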

Further analysis

Robustness, validity, and failure modes

Additional experiments show that longer temporal spans, more edit points, and longer instructions amplify the difficulty. Metric-human agreement remains high, supporting the evaluation protocol.

Model robustness under increasing temporal and editing complexity — Accuracy declines as generated frames, source duration, edit count, and instruction length increase.

Human preference consistency of metrics — All reported metrics align with human preferences by more than 85%.

Inference efficiency comparison across open-source models — Efficiency varies sharply across open-source systems, affecting deployment feasibility.

Diagnostic views

Fine-grained categories expose hidden weaknesses

Aggregate scores hide important differences between editing categories. Camera control, motion edits, and subject operations remain particularly challenging, while style and background changes are more tractable for current models.

Metric correlation and category-level performance analysis — Metric complementarity and category-level performance across representative editing dimensions.

High-level error analysis of five video editing models — Execution inadequacy is the dominant bottleneck, with preservation and physical grounding also recurring.

Examples

Representative samples and qualitative comparisons

CoVEBench pairs visual samples with detailed checklist supervision, enabling model outputs to be inspected beyond a single global score.

Example 01 · Source video & editing instruction: Add two additional double-walled glass cups to the machine's tray, placing one on each side of the center cup to create a row of three. Fill these two new side cups with espresso and a layer of crema, while keeping the center cup positioned under the nozzles to continue receiving the pour.

Wan2.7

HappyHorse1.0

OmniWeaving

Kiwi

Ditto

Example 02 · Source video & editing instruction: Change the starting contents of the bowl to include strawberries already mixed in the yogurt. Replace the action of sliding sliced strawberries with a knife with the person adding mango cubes to the bowl. Swap the tool being used from a metal knife to the black spoon previously seen in the background. Place the knife in a resting position on the table to the right of the bowl. Change the hand movements so both hands are active, using the left hand to pick up mango and the right hand to use the spoon.

Wan2.7

HappyHorse1.0

Lucy

ReCo

VACE

Example 03 · Source video & editing instruction: Apply an American Comic stylization to the entire scene, add Action VFX of vibrant energy trails emanating from the guitar strings, and execute a Zoom out camera movement. Maintain all other elements unchanged.

Wan2.7

HappyHorse1.0

OmniWeaving

Lucy

VACE

Citation

Cite CoVEBench

If you find this benchmark useful for your research, please cite the project.

@misc{wu2026covebenchvideoeditingmodels,
      title={CoVEBench: Can Video Editing Models Handle Complex Instructions?}, 
      author={Jiangtao Wu and Jiaming Wang and Yiwen He and Yuanxing Zhang and Shihao Li and Dunyuan Liu and Xuedong Zhao and Jialu Chen and Zekun Moore Wang and Jiaheng Liu},
      year={2026},
      eprint={2606.08415},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.08415}, 
}