ReadPaper Blog
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
VideoKR addresses a central weakness in current video understanding systems: models trained mostly on everyday perceptual datasets often fail when a video question requires domain knowledge, multi-step inference, or scientific explanation. The paper introduces a large-scale CC-licensed expert-domain training corpus, a skill-oriented QA generation pipeline with human-in-the-loop quality control, and VideoKR-Eval, a benchmark designed to reduce shortcut answerability and test genuine knowledge-intensive video reasoning.
Source: VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

When Video Models Need More Than Watching
The paper motivates VideoKR by arguing that modern multimodal video models have improved rapidly on perception-oriented tasks but remain limited when video understanding requires knowledge beyond visible surface cues. Existing post-training corpora are described as heavily skewed toward everyday activities, short-range temporal understanding, action recognition, and event localization, which leaves models underprepared for specialized domains. The authors identify knowledge- and reasoning-intensive video understanding as a harder setting in which models must combine visual evidence, temporal dynamics, domain concepts, and multi-hop inference. Their examples include scientific and professional scenarios where the correct answer depends on recognizing an observed process and applying non-obvious knowledge, such as interpreting a chemical demonstration rather than merely detecting objects. This framing matters because real-world video applications often require domain-aware explanations, not just labels or descriptions.

Meet VideoKR
VideoKR is presented as the paper’s main data contribution: the first large-scale training corpus specifically designed for knowledge- and reasoning-intensive video understanding. The corpus contains 315K video reasoning examples built from 145K newly collected Creative Commons licensed expert-domain videos, rather than relying only on existing video datasets. The collection spans 82 professional subjects across broad areas including social science, natural science, engineering, and healthcare, reflecting the paper’s emphasis on expert-domain coverage. Compared with prior post-training corpora listed in the paper, VideoKR uses newly collected CC-licensed videos, has 100% video understanding examples, and targets expert-domain reasoning with newly generated examples. The implication is that data source and domain composition are treated not as incidental details but as core design choices for improving video reasoning.

How the Examples Are Built
The paper’s method centers on a human-in-the-loop, skill-oriented QA generation pipeline that decomposes difficult video reasoning into three capabilities: basic video reasoning, knowledge-enhanced video perception, and knowledge-intensive video reasoning. The authors first construct a domain knowledge bank from subjects, courses, lectures, and knowledge points, then use this structure to guide knowledge-driven video collection and question generation. Each generated example is grounded in one of the three core skills, and the chain-of-thought subset includes rationales intended to support more structured reasoning during supervised fine-tuning. Quality control includes self-consistency checks, a video dependency filter, CoT rationale validation, and expert validation of model choices at different pipeline steps. The resulting training split includes VideoKR-SFT-201K for supervised fine-tuning and VideoKR-RL-114K for reinforcement learning, giving the paper a concrete SFT and RL data design rather than a purely benchmark-oriented contribution.

Why the New Benchmark Matters
VideoKR-Eval is introduced because the authors found that existing knowledge-intensive video reasoning benchmarks can contain questions solvable with little genuine video understanding. The paper specifically targets single-frame answerability, where a model or annotator may infer the answer from a static image or textual shortcut rather than reasoning over video content. To mitigate this problem, the authors use multi-model single-frame probing and expert re-annotation of filtered videos, creating an evaluation benchmark where questions are intended to depend on temporal evidence and domain reasoning. This benchmark complements the training corpus by measuring whether models actually use video information in knowledge-intensive settings. The broader methodological point is that evaluation design must guard against shortcuts if progress in video reasoning is to be measured reliably.

What the Paper Claims So Far
The experiments use a standard SFT→GRPO post-training pipeline so that improvements can be attributed more clearly to VideoKR’s data design rather than to a novel training algorithm. The paper reports that models such as Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct, when post-trained on VideoKR, outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning benchmarks. The authors also conduct ablations to isolate the effects of CoT supervision, skill-based data composition, and comparisons against prior post-training corpora. These studies support the paper’s central claim that carefully constructed expert-domain data can be a key driver of progress in video reasoning. The implication is practical: future work can improve video understanding not only through architectures or reward engineering, but also through targeted corpus construction, rigorous filtering, and evaluation that demands real video-grounded reasoning.
