ReadPaper Blog
CoVEBench: A Benchmark for Compositional Video Editing
CoVEBench addresses a gap in text-guided video editing evaluation: real users often request multiple coupled edits in one instruction, while existing benchmarks mostly test isolated changes. The paper introduces a benchmark with curated videos, multi-point editing prompts, and fine-grained checklist-based evaluation to measure whether models can follow complex instructions while preserving unrelated spatiotemporal content. Its experiments indicate that current video editing models still struggle with compositional workflows, often omitting edits, altering content that should remain unchanged, or producing artifacts.
Source: CoVEBench: A Benchmark for Compositional Video Editing

Who asked for this mess?
The paper argues that the central challenge for modern instruction-guided video editing is no longer whether a model can perform a single style transfer or object insertion, but whether it can coordinate several edits under shared temporal and spatial constraints. Real-world prompts often combine subject modification, camera adjustment, object addition, motion change, and preservation requirements in a single request. CoVEBench frames this as compositional video editing, where models must understand relationships among multiple editing goals rather than treat each operation independently. The motivation is that benchmarks built around simple, isolated edits have become misaligned with creator workflows and too limited for diagnosing advanced proprietary and open-source systems. By foregrounding coupled operations and preservation of irrelevant content, the paper positions compositional editing as a more realistic test of video editing intelligence.

Why old scoreboards fail
The paper also criticizes existing evaluation protocols for relying too heavily on coarse global metrics such as CLIP-style similarity scores. Such metrics can indicate broad semantic alignment while hiding important failures, including an omitted edit, a physically implausible modification, or an unintended change to the background or scene structure. CoVEBench therefore treats evaluation as a diagnostic problem rather than a single holistic scoring problem. The benchmark is designed to separate instruction compliance, modification quality, physical realism, video fidelity, and preservation of unrelated content. This matters because compositional prompts create interactions among edits, and the paper argues that only fine-grained assessment can reveal where a model succeeds or fails within a multi-operation workflow.

CoVEBench enters the garage
CoVEBench is built as a dedicated benchmark for compositional video editing with 416 curated source videos and 626 multi-point editing instructions. The paper reports that each instruction contains approximately three atomic edit operations on average, making the prompts substantially more complex than single-operation editing requests. Its taxonomy covers seven practical editing dimensions: Subject, Background, Camera, Style, Motion, Position, and Special Effects. Source videos are drawn from stock platforms such as Pexels and Mixkit and academic resources including Vript, UltraVideo, ViDiC, and LMArena. They are then filtered by resolution (at least 480p), duration (3–21 seconds), and visual quality, deduplicated to remove near-duplicates, and reviewed by humans for editability. The resulting dataset is intended to span varied scenes, durations, resolutions, and editing categories rather than cluster around narrow prompt templates.
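
The resolution and duration filters described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `VideoMeta` record, its field names, and the reading of "480p" as a frame height of at least 480 pixels are all assumptions, and the quality, near-duplicate, and human-review steps are left out because the summary does not specify how they work.

```python
from dataclasses import dataclass

# Thresholds come from the summary above; how the paper measures them is assumed.
MIN_HEIGHT_PX = 480       # "480p" read as frame height >= 480 pixels (assumption)
MIN_DURATION_S = 3.0
MAX_DURATION_S = 21.0


@dataclass(frozen=True)
class VideoMeta:
    """Hypothetical metadata record for one candidate source video."""
    name: str
    height_px: int
    duration_s: float


def passes_basic_filters(video: VideoMeta) -> bool:
    """Return True if the video meets the resolution and duration bounds."""
    tall_enough = video.height_px >= MIN_HEIGHT_PX
    right_length = MIN_DURATION_S <= video.duration_s <= MAX_DURATION_S
    return tall_enough and right_length


# Invented candidates, used only to exercise the filter.
candidates = [
    VideoMeta("street.mp4", height_px=1080, duration_s=12.0),
    VideoMeta("thumbnail.mp4", height_px=360, duration_s=8.0),   # below 480p
    VideoMeta("lecture.mp4", height_px=720, duration_s=95.0),    # longer than 21 s
]
kept = [v.name for v in candidates if passes_basic_filters(v)]
print(kept)
```

Only the first invented candidate survives both bounds. In a real pipeline these cheap metadata checks would plausibly run before the costlier quality and editability reviews, though the summary does not state the order.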

The checklist trap, but useful
A major methodological contribution of the paper is its decomposition of complex editing instructions into 9,990 fine-grained checklist items. The checklist generation process uses advanced language models, including Gemini-3-Flash, GPT-5, and DeepSeek-V4-Pro, to convert each editing instruction and source-video description into verifiable questions about distinct edit points. The paper then applies manual filtering, retaining about 67.2% of the initial checklist outputs, to improve reliability and remove inappropriate or redundant items. These checklists support MLLM-judged evaluation of whether requested edits were executed, whether the edited regions remain visually and logically plausible, and whether unrelated source content is preserved. CoVEBench also incorporates automated video quality metrics, so its evaluation combines checklist-based instruction analysis with measures of visual quality, motion consistency, and structural preservation.
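
One way to picture how yes/no checklist judgments become diagnostic scores is to group them by what they test and compute a pass rate per group. The sketch below is hypothetical: the `ChecklistItem` structure, the category labels, and the pass-rate aggregation are illustrative assumptions rather than the paper's scoring formula, and the verdicts are hard-coded where an MLLM judge would supply them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """Hypothetical checklist entry with a yes/no verdict from a judge."""
    question: str
    category: str   # illustrative labels: "compliance", "plausibility", "preservation"
    passed: bool    # hard-coded here; a real run would obtain this from an MLLM judge


def pass_rates(items: list[ChecklistItem]) -> dict[str, float]:
    """Fraction of passed items per category (assumed aggregation, not the paper's)."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for item in items:
        totals[item.category] += 1
        passes[item.category] += int(item.passed)
    return {category: passes[category] / totals[category] for category in totals}


# Invented example: a three-edit instruction where one edit was omitted
# and the background was changed by mistake.
items = [
    ChecklistItem("Is the car now red?", "compliance", True),
    ChecklistItem("Does the camera zoom in?", "compliance", True),
    ChecklistItem("Is rain added to the scene?", "compliance", False),
    ChecklistItem("Does the red paint follow the car's motion?", "plausibility", True),
    ChecklistItem("Is the background street unchanged?", "preservation", False),
]
rates = pass_rates(items)
print(rates)
```

Reporting the rates separately keeps the omitted edit and the unwanted background change visible, whereas a single averaged score would blend them together.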

What the benchmark revealed
The paper’s experiments evaluate a range of leading open-source and proprietary video editing models and conclude that compositional editing remains far from solved. Current models frequently execute only part of a multi-point instruction, showing that success on elementary editing tasks does not reliably transfer to bundled requests. The results also reveal violations of preservation constraints: models change irrelevant spatiotemporal content while attempting the target edits. The paper further notes that artifacts can appear when multiple operations interact, suggesting that models struggle to coordinate atomic edits without mutual interference. CoVEBench is therefore presented not merely as a leaderboard, but as a diagnostic testbed for advancing video editing systems toward realistic instruction-following workflows.
