ReadPaper Blog
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Echo-Infinity addresses a core obstacle in autoregressive video diffusion transformers: real-time video streams need long historical context, but KV caches grow without bound and temporal RoPE indices eventually exceed the pretrained range. The paper introduces learnable Memory Queries and a Unified Relative RoPE Recipe to compress arbitrary-length history at constant cost while keeping positional indices valid during both training and inference. Its reported results include state-of-the-art performance on long and short video generation benchmarks and 24-hour real-time rollouts exceeding 1.3 million frames on a single NVIDIA H100.
Source: Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Echo-Infinity: Infinite Video, Tiny Memory
The paper studies why autoregressive video diffusion transformers (Video DiTs), although promising for real-time streaming generation, break down over very long horizons. The first bottleneck is memory: in causal Video DiTs, each newly generated frame contributes key-value (KV) cache entries, so preserving the entire history makes memory grow linearly with sequence length. The second bottleneck is the temporal rotary positional embedding (RoPE): the temporal RoPE id used during long rollouts can exceed the range seen during pretraining, causing degradation and even overflow. Echo-Infinity is proposed as a framework for real-time infinite video generation that tackles both bottlenecks together rather than treating memory and position encoding as separate engineering issues. Its central claim is that a compact, learnable, evolving memory can preserve generation-relevant history at constant computational cost while supporting much longer rollouts than fixed-window cache designs.
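
To make the linear-growth argument concrete, here is a minimal back-of-the-envelope sketch in Python. It compares the KV-cache size of a full history with that of a bounded design (sink frames, a local window, and a fixed number of memory tokens). Every number in it (tokens per frame, layer count, hidden size, 2-byte values, window size, memory size) and every function name is an illustrative assumption, not a value reported in the paper.

```python
def kv_bytes(num_tokens, num_layers=30, hidden_dim=1536, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    All defaults are hypothetical placeholders chosen for illustration.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value


def full_history_bytes(num_frames, tokens_per_frame=1560):
    """Cache size if every generated frame is kept: linear in num_frames."""
    return kv_bytes(num_frames * tokens_per_frame)


def bounded_bytes(sink_frames=1, window_frames=8, memory_tokens=256,
                  tokens_per_frame=1560):
    """Cache size for sinks + local window + fixed memory: independent of video length."""
    tokens = (sink_frames + window_frames) * tokens_per_frame + memory_tokens
    return kv_bytes(tokens)


if __name__ == "__main__":
    gib = 2 ** 30
    for n in (100, 10_000, 1_000_000):
        print(f"{n:>9} frames: full={full_history_bytes(n) / gib:10.1f} GiB, "
              f"bounded={bounded_bytes() / gib:6.2f} GiB")
```

The point of the sketch is only the shape of the two curves: the first term scales with the number of frames, while the second stays fixed no matter how long the rollout runs.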

The Old Fixes Keep Truncating the Past
The paper positions Echo-Infinity against three common families of long-video memory mechanisms and argues that each leaves important failure modes unresolved. Window truncation keeps a bounded local window plus sink frames, which controls memory but discards distant history that may still matter for temporal consistency. Hand-crafted KV-cache management methods retain selected evicted keys and values according to offline rules or schedules, but they remain constrained by fixed cache budgets and cannot adaptively learn what information is useful under autoregressive generation noise. Heuristic compression methods replace parts of the history with compact representations, yet often depend on predefined compression ratios, compression schedules, or separate reconstruction objectives instead of optimizing memory directly for the video generation task. Echo-Infinity’s motivation is therefore to replace passive cache curation with an end-to-end learned memory state integrated into the DiT generation process.
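
The window-truncation baseline described above can be sketched in a few lines of Python. This is a generic illustration of "sink frames plus a recent window", not code from the paper or from any specific baseline; the function name and the `num_sink` and `window` parameters are hypothetical.

```python
def truncate_window(frame_ids, num_sink=1, window=4):
    """Split a frame history into (sinks, recent, discarded).

    Sink frames are the first `num_sink` frames and are always kept.
    Of the remaining frames, only the latest `window` are kept.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    sinks = frame_ids[:num_sink]
    rest = frame_ids[num_sink:]
    recent = rest[-window:]
    # Everything between the sinks and the recent window is dropped outright.
    discarded = rest[:-window]
    return sinks, recent, discarded


if __name__ == "__main__":
    sinks, recent, discarded = truncate_window(list(range(10)))
    print(sinks, recent, discarded)
```

With ten frames, one sink, and a window of four, frames 1 through 5 are discarded. That discarded span is exactly the history that a learned memory is meant to summarize instead of throwing away.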

Meet Memory Queries
The main technical contribution is a set of trainable Memory Queries that function as an evolving long-term memory alongside sink frames and a recent local KV cache. When past frames are evicted from the local window, these Memory Queries attend over the evicted KV cache to extract information that may still be useful for future generation. A sigmoid-gated residual update then controls how much of the old memory is overwritten, giving the model a learned mechanism for filtering, abstraction, and compression. Because the number of Memory Queries is fixed, the memory footprint remains independent of video length, while the memory state is updated recurrently as generation proceeds. The paper emphasizes that these queries are optimized end-to-end with the video diffusion transformer, and it further reports that the optimized initial memory state can serve as a generalizable generation prior even when memory updates are disabled in short-video settings.
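
A minimal NumPy sketch of one such update step follows. It is an assumption-laden illustration, not the paper's implementation: the single-head attention, the gate parameterization (`w_gate`, `b_gate` applied to the concatenation of old memory and attention readout), and the interpolation form of the residual update are all choices made here for clarity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def update_memory(memory, evicted_k, evicted_v, w_gate, b_gate):
    """One recurrent update of a fixed-size memory from evicted KV entries.

    Shapes: memory (M, d), evicted_k and evicted_v (N, d),
    w_gate (2d, d), b_gate (d,). All parameters are hypothetical.
    """
    d = memory.shape[-1]
    # Memory queries attend over the keys of the evicted frames.
    attn = softmax(memory @ evicted_k.T / np.sqrt(d), axis=-1)  # (M, N)
    read = attn @ evicted_v                                      # (M, d)
    # Sigmoid gate decides, per entry, how much old memory to overwrite.
    gate = sigmoid(np.concatenate([memory, read], axis=-1) @ w_gate + b_gate)
    # Gated residual update: gate=0 keeps the old memory, gate=1 replaces it.
    return memory + gate * (read - memory)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 4, 8
    memory = rng.normal(size=(M, d))
    w_gate = rng.normal(size=(2 * d, d)) * 0.1
    b_gate = np.zeros(d)
    for n_evicted in (16, 32, 64):  # evicted chunks of varying size
        k = rng.normal(size=(n_evicted, d))
        v = rng.normal(size=(n_evicted, d))
        memory = update_memory(memory, k, v, w_gate, b_gate)
        print(n_evicted, memory.shape)  # memory shape stays (M, d)
```

However many tokens are evicted at each step, the memory keeps the shape `(M, d)`, which is what makes the cost constant in video length. In the paper the queries and gate are trained jointly with the generator; the random weights here stand in for learned ones.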

Fix the RoPE Problem Too
Echo-Infinity also introduces a Unified Relative RoPE Recipe to address positional extrapolation in long autoregressive rollouts. Modern video DiTs use 3D RoPE over temporal, height, and width axes, but the temporal coordinate can quickly exceed the pretrained maximum, such as the maximum temporal id used by Wan2.1-1.3B. The paper’s recipe anchors sink frames at temporal id 0, allows the newest frame id to grow only up to the pretrained maximum fmax, and rotates older frames backward once that limit is reached. Crucially, the same bounded relative-RoPE schedule is used during both training and inference, rather than being applied only as an inference-time patch. This design is intended to close the train-test RoPE extrapolation gap while preventing overflow, so the memory mechanism is not undermined by positional encoding failure.
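
The bounded schedule can be sketched as a small Python function that assigns temporal ids to the sink frames and the local window at a given generation step. This is one reading of the description above, under stated assumptions: sink ids start at 0, the window holds the most recent non-sink frames, and "rotating backward" is modeled as subtracting a shift once the newest frame would pass `f_max`. The function and parameter names are hypothetical, and the sketch does not cover how the paper positions the memory tokens themselves.

```python
def relative_temporal_ids(newest_index, num_sink, window, f_max):
    """Temporal RoPE ids for sink frames and the local window at one step.

    newest_index: absolute, 0-based index of the newest frame.
    Returns (sink_ids, window_ids); no id ever exceeds f_max.
    """
    if f_max < num_sink + window - 1:
        raise ValueError("f_max must leave room for sinks plus the local window")
    # Sink frames stay anchored at the start of the temporal axis.
    sink_ids = list(range(min(num_sink, newest_index + 1)))
    # Absolute indices of the non-sink frames still in the local window.
    first_local = max(num_sink, newest_index - window + 1)
    # Once the newest frame would exceed f_max, shift the whole window back.
    shift = max(0, newest_index - f_max)
    window_ids = [i - shift for i in range(first_local, newest_index + 1)]
    return sink_ids, window_ids


if __name__ == "__main__":
    for t in (2, 5, 6, 100, 1_000_000):
        print(t, relative_temporal_ids(t, num_sink=1, window=3, f_max=5))
```

Before the limit is reached the ids are just the absolute frame indices; afterward the newest frame is pinned at `f_max` and the window keeps the same ids at every step. Because the same function would be called when building training sequences and when rolling out at inference, the model never sees an id pattern at test time that it did not see during training.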

The Big Payoff
The paper reports that Echo-Infinity achieves overall state-of-the-art performance on long-video benchmarks at 30 seconds, 60 seconds, and 240 seconds, as well as on 5-second short-video generation. It also evaluates the system in interactive video generation and describes the learned Memory Queries as beneficial beyond the specific long-rollout update procedure. The strongest systems claim is a 24-hour real-time generation demonstration exceeding 1.3 million frames at 18.5 FPS on a single NVIDIA H100. The reported throughput overhead is 10.6% relative to a memory-free baseline, which supports the paper’s argument that learned evolving memory can be practical rather than merely more expressive. The authors present these results as evidence that constant-cost learned memory plus bounded relative RoPE provides a viable path toward infinite-horizon video generation.
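
As a quick sanity check on these figures, the short Python sketch below works out what a constant 18.5 FPS would yield over 24 hours and what baseline speed a 10.6% overhead would imply. Two assumptions are made here and are not stated in the source: that the frame rate is held constant for the whole run, and that "overhead" means a relative reduction in frames per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def frames_in_a_day(fps):
    """Frames produced in 24 hours at a constant frame rate."""
    return fps * SECONDS_PER_DAY


def implied_baseline_fps(fps, overhead):
    """Baseline FPS if `overhead` is a relative throughput reduction (assumption)."""
    return fps / (1.0 - overhead)


if __name__ == "__main__":
    print(f"{frames_in_a_day(18.5):,.0f} frames in 24 hours at 18.5 FPS")
    print(f"{implied_baseline_fps(18.5, 0.106):.1f} FPS implied memory-free baseline")
```

A constant 18.5 FPS gives 1,598,400 frames per day, which is consistent with the reported figure of more than 1.3 million frames.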
