ReadPaper Blog
StateKV: Linear Scaling Video VLMs for Long Video Understanding
StateKV addresses the quadratic video-prefill cost that makes pretrained video vision-language models difficult to use for long videos and streaming scenarios. The paper proposes an inference-time KV-cache method that processes frames incrementally with a fixed-capacity temporal state while preserving per-frame visual detail for decoding, achieving linear scaling without fine-tuning or architectural changes. Experiments across three long-video benchmarks and seven models show that StateKV stays close to full self-attention while outperforming sliding-window and recency-based streaming approximations.
Source: StateKV: Linear Scaling Video VLMs for Long Video Understanding

The Long-Video Bottleneck
The paper begins from a practical bottleneck in long-video understanding: modern video VLMs often rely on spatiotemporal self-attention, so the cost of processing a video grows quadratically with the number of frames. This scaling is especially problematic for long-horizon and streaming applications such as autonomous driving and embodied robotics, where visual evidence may accumulate over minutes or hours. In the authors’ framing, the video-side prefill stage becomes the dominant computational constraint because each new frame can attend to an expanding history of previous visual tokens. The result is increasing per-frame latency as a stream grows longer, which undermines real-time use. StateKV targets this bottleneck by seeking constant marginal prefill cost per added frame, yielding linear video-prefill complexity while retaining the long-context behavior needed for VideoQA.
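To make the scaling argument concrete, the sketch below counts query-key pairs during video prefill under two regimes. It is an illustrative cost model, not the paper's FLOP accounting. It assumes frame-causal attention, where each frame attends to itself and all earlier frames, and the values for tokens per frame and state size are hypothetical.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when frame t attends to all tokens of frames 1..t."""
    return sum(
        tokens_per_frame * (t * tokens_per_frame)
        for t in range(1, num_frames + 1)
    )


def fixed_state_pairs(num_frames: int, tokens_per_frame: int, state_size: int) -> int:
    """Query-key pairs when each frame attends to itself plus a fixed-size state."""
    return num_frames * tokens_per_frame * (tokens_per_frame + state_size)


if __name__ == "__main__":
    # Hypothetical sizes: 196 visual tokens per frame, 64 state tokens.
    for frames in (64, 128, 256):
        full = full_attention_pairs(frames, tokens_per_frame=196)
        fixed = fixed_state_pairs(frames, tokens_per_frame=196, state_size=64)
        print(frames, full, fixed, round(full / fixed, 1))
```

Under this model, doubling the number of frames exactly doubles the fixed-state count, because every frame pays the same marginal cost. The full-attention count roughly quadruples.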

Why Common Fixes Fall Short
The paper argues that many common efficiency strategies shrink the input without changing the underlying quadratic scaling of long-video prefill. Frame subsampling, visual-token trimming, and KV-cache compression can make inference cheaper, but they risk discarding temporal or spatial evidence needed for long-horizon reasoning. The authors note that prior cache-compression work may need to retain a large fraction of tokens to avoid severe degradation, which limits the practical benefit of aggressive compression. They also distinguish these approaches from streaming-prefill methods such as ReKV, which keep all per-frame visual tokens for decoding while reducing video encoding complexity. This distinction motivates StateKV’s central design goal: reduce the cost of constructing the video KV cache without throwing away the detailed per-frame information used during final language generation.

StateKV's Core Insight
StateKV is motivated by the paper’s observation that long-video attention in pretrained VLMs is structured rather than uniformly dense. According to the authors’ analysis, most attention interactions are within a frame, while long-range cross-frame interactions often concentrate on a small set of slowly evolving “temporal sink” tokens. This pattern is consistent with attention-sink phenomena studied in language transformers and with the image-first training history of many VLM backbones. The paper turns this empirical structure into a self-attention approximation: preserve the small amount of information that is useful across frames while leaving intra-frame detail intact. This insight separates StateKV from strict recency-window methods, because the retained cross-frame state is selected by importance rather than by assuming that the most recent frames are always the most relevant.
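One way to probe this kind of structure is to split an attention matrix into intra-frame and cross-frame mass and see which key tokens receive the cross-frame share. The sketch below does this in plain Python. The function names are hypothetical, and ranking tokens by received cross-frame attention is an assumed illustration, not necessarily the paper's importance criterion.

```python
def intra_frame_fraction(attn, tokens_per_frame):
    """Share of total attention mass where query and key lie in the same frame.

    attn[i][j] is the weight from query token i to key token j.
    """
    intra = 0.0
    total = 0.0
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            total += weight
            if i // tokens_per_frame == j // tokens_per_frame:
                intra += weight
    return intra / total


def cross_frame_received(attn, tokens_per_frame):
    """Attention each key token receives from queries in other frames."""
    received = [0.0] * len(attn)
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            if i // tokens_per_frame != j // tokens_per_frame:
                received[j] += weight
    return received


def top_sink_candidates(attn, tokens_per_frame, k):
    """Indices of the k key tokens with the most cross-frame attention (assumed criterion)."""
    received = cross_frame_received(attn, tokens_per_frame)
    return sorted(range(len(received)), key=lambda j: received[j], reverse=True)[:k]
```

If a handful of indices account for most of the cross-frame mass across many frames, that matches the temporal-sink pattern described above.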

How StateKV Works
Methodologically, StateKV is an inference-time KV-cache prefill method for frozen pretrained video VLMs. For each layer, it maintains two coupled caches: a fixed-capacity temporal state that carries cross-frame context and a full per-frame cache that preserves detailed visual tokens for decoding. Frames are processed incrementally as they arrive, and the recurrent temporal state supplies the limited long-range context needed during video prefill. After the video cache has been built, the model performs standard text decoding conditioned on all accumulated per-frame video tokens. This design changes the cost of video encoding from quadratic to linear in the number of frames while avoiding fine-tuning, architectural modification, or permanent fixed-budget compression of the final generation context.
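The two-cache bookkeeping described above can be sketched structurally as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are stand-in tuples rather than key/value tensors, attention itself is omitted, and the state update (keep the highest-scoring tokens via a caller-supplied `score_fn`) is a hypothetical rule chosen only to show a fixed-capacity state selected by importance.

```python
from dataclasses import dataclass, field


@dataclass
class LayerCache:
    """Two coupled caches for one layer (illustrative only)."""
    state_capacity: int
    state: list = field(default_factory=list)        # (score, token) pairs carried across frames
    frame_cache: list = field(default_factory=list)  # every frame's tokens, kept for decoding


def prefill_frame(cache, frame_tokens, score_fn):
    """Process one frame and return how many context tokens it could attend to."""
    # Context for this frame: the bounded temporal state plus the frame itself.
    context = [token for _, token in cache.state] + list(frame_tokens)
    # A real model would run attention over `context` here and emit keys/values.

    # Keep the full frame for decoding; nothing is dropped from this cache.
    cache.frame_cache.append(list(frame_tokens))

    # ASSUMPTION: refresh the state by keeping the highest-scoring tokens seen so far.
    candidates = cache.state + [(score_fn(tok), tok) for tok in frame_tokens]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    cache.state = candidates[: cache.state_capacity]
    return len(context)


def prefill_video(frames, state_capacity, score_fn):
    """Process frames in arrival order; return the cache and per-frame context sizes."""
    cache = LayerCache(state_capacity)
    context_sizes = [prefill_frame(cache, frame, score_fn) for frame in frames]
    return cache, context_sizes
```

In this sketch the context seen by any frame never exceeds the frame's own tokens plus `state_capacity`, which is what keeps the marginal prefill cost constant. `frame_cache` still grows with the video so that decoding can condition on every per-frame token.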

What the Paper Shows
The empirical claim of the paper is that StateKV provides a better compute-accuracy tradeoff than dominant streaming approximations for long-video VLMs. Across three long-video benchmarks and seven models spanning three model families and multiple scales, the method remains close to full spatiotemporal self-attention and consistently outperforms sliding-window or recency-based approaches such as ReKV. The results also show that performance improves as the capacity of the temporal state increases, supporting the paper’s view that a compact but well-chosen cross-frame state can approximate full attention effectively. Because StateKV substantially reduces measured video-prefill FLOPs, it can shift the compute-accuracy frontier by making it feasible to run larger, stronger models under the same compute budget. The paper’s broader implication is that long-video and streaming VLM deployment may be advanced by principled KV-cache construction rather than by simply dropping frames or compressing visual tokens more aggressively.
