ReadPaper Blog
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA studies how to make long-rollout causal video diffusion more memory- and latency-efficient by redesigning the KV cache rather than only adjusting the sliding-window policy. The paper replaces dense per-head cached keys and values with a shared low-rank latent representation and a shared decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at cached layers and targeting minute-scale autoregressive video generation.
Source: VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

The cache is eating the room!
VideoMLA addresses a core systems bottleneck in minute-scale autoregressive video diffusion: long-rollout causal generation depends on a streaming KV cache whose dense per-head layout consumes substantial memory and contributes directly to latency. The paper argues that fixed-size sliding-window caches help models continue generation over long horizons, but they do not remove the per-token cost of storing separate keys and values for each attention head. This matters especially for video diffusion, where cached tokens accumulate across spatiotemporal content and each cached layer must preserve enough attention state for coherent continuation. The paper’s motivation is therefore not simply longer context, but a more compact cached representation that keeps causal video generation practical under real memory constraints. By focusing on the KV layout itself, VideoMLA reframes cache design as an architectural compression problem rather than only a token-retention problem.
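As a rough illustration of why a fixed window bounds the number of cached tokens but not the cost of each one, the sketch below computes the size of a dense per-head KV cache over a sliding window. Every number in it (layer count, head count, head size, tokens per frame, window length, 2-byte storage) is a hypothetical placeholder rather than a value from the paper, and `dense_kv_bytes` is a helper name invented for this example.

```python
def dense_kv_bytes(window_tokens, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Bytes held by a dense per-head KV cache over a fixed window."""
    # Each cached token stores one key and one value vector per head, per layer.
    per_token_per_layer = 2 * n_heads * head_dim
    return window_tokens * n_layers * per_token_per_layer * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
tokens_per_frame = 1560
window_frames = 21
n_layers, n_heads, head_dim = 30, 12, 128

window_tokens = tokens_per_frame * window_frames
total = dense_kv_bytes(window_tokens, n_layers, n_heads, head_dim)
print(f"{window_tokens} cached tokens -> {total / 2**30:.2f} GiB")
```

Shrinking the window lowers `window_tokens`, but the `2 * n_heads * head_dim` factor per token and layer is untouched, and that factor is what a change of cache layout targets.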

Not just window tricks
The paper positions VideoMLA against recent long-rollout causal video diffusion work that largely preserves the standard sliding-window cache abstraction. As the paper frames them, prior approaches change which tokens remain in the window, how temporal or spatial positions are encoded, or how cached memory is managed, but they generally leave the dense per-head key-value representation intact. VideoMLA identifies that unchanged layout as a direct bottleneck because every retained token still carries head-specific key and value tensors through cached attention layers. This gap motivates what the abstract describes as the first study of Multi-Head Latent Attention in video diffusion rather than in language-only settings. The implication is that scaling long video generation may require changing what is stored per token, not only deciding which tokens to store.

VideoMLA changes the furniture
The central method in VideoMLA is a low-rank latent KV cache adapted to autoregressive video diffusion. Instead of caching per-head keys and values, the method stores a shared low-rank content latent together with a shared decoupled 3D-RoPE positional key. The 3D-RoPE component is important because video attention must preserve positional structure across spatial and temporal axes, while the latent content representation reduces redundant head-specific storage. The abstract reports that this replacement reduces per-token KV memory by 92.7% at every cached layer, making the method directly relevant to streaming generation where cache size compounds over time. In methodological terms, VideoMLA imports the Multi-Head Latent Attention idea into the video diffusion setting and modifies the cache representation to respect both content and spatiotemporal position.
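The cache layout described above can be sketched in a few lines of NumPy. This is a minimal toy following the general Multi-Head Latent Attention pattern, not the paper's implementation: the dimensions, the weight names (`W_down`, `W_up_k`, `W_up_v`, `W_rope_k`), and the random weights are all assumptions, and the 3D-RoPE rotation itself is omitted, so the positional key here is only a shared unrotated projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not taken from the paper.
d_model, n_heads, head_dim = 64, 4, 16
d_latent, d_rope = 8, 4
n_cached = 10

# Hypothetical projection weights (random here; learned in a real model).
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((n_heads, d_latent, head_dim)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((n_heads, d_latent, head_dim)) / np.sqrt(d_latent)
W_rope_k = rng.standard_normal((d_model, d_rope)) / np.sqrt(d_model)

x = rng.standard_normal((n_cached, d_model))  # hidden states of cached tokens

# The only tensors stored per cached token, both shared across heads.
latent_cache = x @ W_down        # (n_cached, d_latent) content latent
rope_key_cache = x @ W_rope_k    # (n_cached, d_rope); 3D-RoPE rotation omitted


def attend(q_content, q_rope):
    """One query token attending over the latent cache.

    q_content: (n_heads, head_dim); q_rope: (n_heads, d_rope).
    """
    # Per-head keys and values are rebuilt from the shared latent on the fly.
    keys = np.einsum("tc,hcd->htd", latent_cache, W_up_k)
    values = np.einsum("tc,hcd->htd", latent_cache, W_up_v)
    # Content score plus positional score against the shared positional key.
    scores = np.einsum("hd,htd->ht", q_content, keys) + q_rope @ rope_key_cache.T
    scores = scores / np.sqrt(head_dim + d_rope)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ht,htd->hd", weights, values)


out = attend(rng.standard_normal((n_heads, head_dim)),
             rng.standard_normal((n_heads, d_rope)))

dense_per_token = 2 * n_heads * head_dim   # per-head keys and values
latent_per_token = d_latent + d_rope       # shared latent + shared positional key
print(out.shape, dense_per_token, latent_per_token)
```

In this toy setup the cache holds `d_latent + d_rope` numbers per token per layer instead of `2 * n_heads * head_dim`. The ratio depends entirely on the chosen dimensions, so it should not be read as a reproduction of the paper's 92.7% figure.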

But wait, what if the model is not low-rank?
A key technical issue for the paper is whether low-rank cached latents are justified when pretrained video attention may not naturally behave as a low-rank object. VideoMLA treats this as an empirical and architectural question rather than assuming that dense pretrained attention weights can simply be spectrally compressed without consequence. The paper’s framing distinguishes direct low-rank approximation from a redesigned attention cache in which the bottlenecked latent becomes part of the model’s operative representation. This distinction matters because the success of a low-rank cache depends on whether training and inference can use the imposed latent structure, not merely on whether an existing dense matrix has a small effective rank. The broader implication is that cache compression for video diffusion should be evaluated as a learned attention-layout change, not only as post-hoc matrix compression.
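A toy numerical contrast may help separate the two ideas. The sketch below uses random matrices as stand-ins, so it says nothing about the paper's actual weights or results. It compares truncating a matrix with no low-rank structure against a projection built through a bottleneck from the start. The sizes and the helper `truncation_error` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_out, rank = 64, 64, 8  # hypothetical sizes


def truncation_error(W, r):
    """Relative Frobenius error of the best rank-r approximation of W.

    By the Eckart-Young theorem this equals the energy in the discarded
    singular values divided by the total energy, under a square root.
    """
    s = np.linalg.svd(W, compute_uv=False)
    return np.sqrt((s[r:] ** 2).sum() / (s ** 2).sum())


# Stand-in for a dense projection with no particular low-rank structure.
W_dense = rng.standard_normal((d_model, d_out))

# Stand-in for a layout whose operative map passes through a rank-8 bottleneck.
W_bottleneck = rng.standard_normal((d_model, rank)) @ rng.standard_normal((rank, d_out))

print(f"dense, truncated to rank {rank}: {truncation_error(W_dense, rank):.3f}")
print(f"bottlenecked by construction:   {truncation_error(W_bottleneck, rank):.3e}")
```

Truncating the unstructured matrix discards most of its energy, while the bottlenecked map loses nothing at the same rank because it never used more. Whether a trained video model can reach good quality through such a bottleneck is an empirical matter, and the toy cannot answer it.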

The bottleneck decides the rank
The paper’s experiments and results sections test whether the VideoMLA cache can preserve video-generation quality while reducing memory and improving streaming efficiency. Its stated contribution is not only a smaller cache, but a practical path toward minute-scale autoregressive video diffusion with lower per-token KV storage at cached layers. By targeting the dominant cached-attention layout, VideoMLA suggests that long-horizon video generation can benefit from architectural changes that are orthogonal to sliding-window scheduling and positional-policy tweaks. The reported 92.7% reduction in per-token KV memory is the central quantitative claim available from the abstract, and it supports the paper’s argument that latent KV caching can materially lower the deployment cost of causal video diffusion. The conclusion implied by the work is that future long-video diffusion systems should treat the representation stored in memory as a first-class design choice alongside model scale, window length, and positional encoding.
