ReadPaper Blog
Kwai Keye-VL-2.0 Technical Report
The paper introduces Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts multimodal foundation model for long-video understanding and agentic intelligence. It addresses the memory and compute bottlenecks of hour-level video by adapting DeepSeek Sparse Attention to a GQA-based multimodal architecture, while using Cross-Modal Multi-Teacher On-Policy Distillation, Context-RL, and Video-RL to add agent skills without eroding core reasoning. The reported evaluations show strong performance among similar-scale open models on TimeLens temporal localization, LongVideoBench, and Video-MME-v2, suggesting a practical route toward scalable long-context multimodal agents.
Source: Kwai Keye-VL-2.0 Technical Report

Why long videos are hard
The paper frames extreme long-video understanding as a central bottleneck for modern multimodal large language models because hour-level videos create enormous visual token streams and long-range temporal dependencies. With standard dense attention, the key-value cache grows linearly with context length and attention compute grows quadratically, which raises memory and latency costs and often forces aggressive frame subsampling. The authors argue that this subsampling sacrifices temporal continuity and can remove the very moments needed for fine-grained reasoning, temporal grounding, and video question answering. Kwai Keye-VL-2.0 is therefore motivated by the need to move from frame-limited perception toward global-context reasoning over long videos. This problem is tied directly to the report’s broader goal of building a multimodal model that can support robust agentic applications rather than only short-video comprehension.
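To make the scaling argument concrete, here is a rough back-of-the-envelope sketch of how visual token counts and key-value cache size grow with video length. The layer count, KV head count, head dimension, frame rate, and tokens per frame are hypothetical values chosen for illustration; none of them come from the report.

```python
def video_tokens(duration_s, fps, tokens_per_frame):
    """Visual tokens produced by sampling a video at a fixed frame rate."""
    return int(duration_s * fps) * tokens_per_frame


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model configuration (assumed, not taken from the report).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

# One hour of video at an assumed 1 frame per second and 256 tokens per frame.
hour_tokens = video_tokens(3600, fps=1, tokens_per_frame=256)
print(f"one hour of video: {hour_tokens} visual tokens")

# The cache grows linearly with the number of cached tokens.
for tokens in (8_192, 65_536, 262_144):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumed numbers, an hour of video already exceeds a 256K-token window several times over, which is the pressure that pushes systems toward frame subsampling.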

The core idea: sparse attention for video
The main architectural contribution is the adaptation of DeepSeek Sparse Attention (DSA) to a GQA-based multimodal architecture, which the report describes as a first for this setting. Keye-VL-2.0 combines a sparse attention module with a global MQA-style lightning indexer and grouped GQA sparse aggregation so that the model can identify and aggregate important information across long multimodal sequences. The authors describe dense warm-up and sparse adaptation as the steps that make the mechanism compatible with long-context multimodal modeling, instead of naively swapping out dense attention. According to the report, this design enables lossless 256K context processing while capturing critical frames and long-range temporal dependencies. The architecture is paired with a native-resolution vision encoder and unified visual encoding strategy, allowing images and videos to be represented in a way that supports high-resolution detail and extended temporal reasoning.
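The select-then-attend pattern described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the report's implementation: the indexer here is a single dot-product scorer, and the function names `select_top_k` and `sparse_gqa_attention` are hypothetical. The one structural idea it tries to capture is that a single cheap index pass picks positions once, and every query head in a group then attends only over those shared positions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def select_top_k(index_query, index_keys, k):
    """Simplified indexer: score every cached position, keep the k best."""
    scores = [dot(index_query, ik) for ik in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order


def sparse_gqa_attention(queries, keys, values, selected):
    """Attend only over `selected` positions.

    `queries` holds the query heads of one group; they all share the same
    keys, values, and selected positions.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, keys[i]) * scale for i in selected])
        out = [
            sum(w * values[i][d] for w, i in zip(weights, selected))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example with four cached positions and two query heads in one group.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
selected = select_top_k([1.0, 0.0], keys, k=2)
print(selected, sparse_gqa_attention([[0.0, 0.0], [1.0, 0.0]], keys, values, selected))
```

Because the attention step touches only k positions instead of the full cache, its cost no longer scales with the total context length; the indexer still scans every position, which is why it needs to be much cheaper than full attention.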

Why MoE matters
Kwai Keye-VL-2.0-30B-A3B is presented as a 30-billion-parameter Mixture-of-Experts foundation model that activates only 3 billion parameters during inference. The paper uses this MoE design to reconcile model capacity with deployment efficiency, since expert routing allows the system to preserve broad capability without paying the full compute cost of a dense 30B model at every token. Its multimodal stack includes a ViT vision encoder inherited from Keye-VL-1.5-8B, a language decoder built on Qwen3-30B-A3B-Thinking-2507, an MLP projector for aligning visual features into the language-model representation space, and the GQA-compatible DSA module. The pre-training plan is staged, beginning with projector initialization and continuing through general multimodal pre-training, multi-task capability injection, and long-context extension. This modular design lets the report connect efficiency claims to both architecture and training curriculum rather than relying on parameter count alone.
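The reason an MoE model pays for only a fraction of its parameters per token is top-k expert routing. The sketch below shows the generic technique under assumed settings: four toy experts and k = 2 are hypothetical, and the report's actual expert count and router design are not given in this summary.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k experts with the highest logits and renormalize their gates."""
    chosen = sorted(
        range(len(router_logits)), key=lambda e: router_logits[e], reverse=True
    )[:k]
    m = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - m) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for e, g in gates.items():
        y = experts[e](x)  # experts that were not chosen are never evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


# Toy experts (assumed): each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

out, gates = moe_forward([1.0, 2.0], experts, router_logits=[0.0, 2.0, 2.0, -1.0], k=2)
print(gates, out)
```

In this toy run only two of the four experts execute, so compute tracks the active experts while total capacity tracks all of them.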

Teaching the model without forgetting
The paper identifies the multimodal alignment dilemma as a second major obstacle: adding complex video, tool-use, search, and code-agent abilities can cause catastrophic forgetting of foundational STEM, mathematical, linguistic, and general reasoning capabilities. To address this, the authors introduce Cross-Modal Multi-Teacher On-Policy Distillation, or MOPD, which uses specialized teacher models to provide dense token-level feedback on student-generated trajectories across modalities and tasks. Because the feedback is produced on policy, the distillation targets the model’s own rollouts rather than only static demonstrations, which aligns it better with the errors the model actually makes during post-training. The report pairs MOPD with Context-RL and Video-RL, including bucket advantage scaling, to stabilize long-sequence decision processes and reduce visual hallucinations. This post-training strategy is positioned as the mechanism that allows Keye-VL-2.0 to gain native agent collaboration abilities in Code, Tool, and Search scenarios while preserving general-purpose reasoning baselines.
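As a rough illustration of what dense token-level feedback on student rollouts can look like, the sketch below uses per-token reverse KL between student and teacher distributions, which is one common formulation of on-policy distillation. This is an assumption: the summary does not state the report's exact objective, and the function names, task labels, and toy distributions here are all hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def distillation_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-generated trajectory."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token), per_token


def pick_teacher(task, teachers):
    """Route a rollout to the teacher specialized for its task (assumed scheme)."""
    return teachers[task]


# Toy setup: each "teacher" returns a fixed next-token distribution per step.
teachers = {
    "video": lambda steps: [[0.7, 0.2, 0.1]] * steps,
    "code": lambda steps: [[0.1, 0.1, 0.8]] * steps,
}

# Distributions the student produced at the tokens of its own rollout.
student_rollout = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]

teacher = pick_teacher("video", teachers)
loss, per_token = distillation_loss(student_rollout, teacher(len(student_rollout)))
print(loss, per_token)
```

The per-token list is the point of the exercise: every position in the student's own rollout gets its own signal, in contrast to a single scalar reward for the whole trajectory.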

What the results say
The evidence in the report comes from comprehensive evaluations across video understanding, fine-grained temporal grounding, reasoning, STEM, and agent benchmarks. On the TimeLens framework, the paper reports strong temporal localization performance on ActivityNet, QVHighlights, and Charades, using mean temporal Intersection-over-Union as the metric for predicted versus ground-truth moments. For extreme long-video comprehension, the report highlights LongVideoBench and Video-MME-v2, including a Video-MME-v2 non-linear score setting, as key tests of the model’s ability to use long temporal context. The comparison table positions Keye-VL-2.0-30B-A3B competitively against open-source systems such as Qwen3.5-35B-A3B and Qwen3-VL-235B-A22B, and also against the closed-source Gemini-3-Flash in selected settings. The paper’s implication is that sparse long-context attention, efficient MoE activation, optimized video I/O and DSA kernels, and multi-teacher on-policy alignment can together make open multimodal agent systems more scalable and practically deployable.
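For readers unfamiliar with the temporal grounding metric mentioned above, here is a minimal sketch of temporal Intersection-over-Union and its mean over a set of queries. The example intervals are made up for illustration and are not results from the report.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Made-up predictions and ground-truth moments for three queries.
preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 40.0)]
gts = [(15.0, 25.0), (10.0, 15.0), (30.0, 40.0)]
print(mean_iou(preds, gts))
```

A prediction that overlaps half of a same-length ground-truth moment scores only one third, so the metric penalizes boundary errors more sharply than an overlap percentage would suggest.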
