ReadPaper Blog
MEMDREAMER: Decoupling Perception and Reasoning for Long Video Understanding
MEMDREAMER addresses the difficulty Vision-Language Models face when understanding hours-long videos, where brute-force frame ingestion creates token explosion and attention dilution. The paper proposes a plug-and-play framework that streams video into a Hierarchical Graph Memory and then lets a separate reasoning model perform agentic, tool-augmented retrieval over that memory, achieving stronger long-video understanding with a much smaller reasoning context.
Source: MEMDREAMER: Decoupling Perception and Reasoning for Long Video Understanding

Why Long Videos Break Normal VLMs
MEMDREAMER is motivated by a central failure mode in long video understanding: current Vision-Language Models often try to perceive and reason over the full visual sequence at once, which turns hours of video into an enormous and noisy token stream. The paper notes that sampling a two-hour video at 1 FPS can generate more than 1.6 million tokens, far beyond practical context budgets for many models. Even when longer contexts are available, the authors argue that redundant frames bury sparse reasoning cues and intensify attention dilution and the “lost in the middle” problem. This framing treats long video understanding not merely as a scaling problem, but as a structural mismatch between sequential token ingestion and the hierarchical, causal nature of video content. The implication is that larger context windows alone are unlikely to solve hours-level video reasoning unless models can filter, organize, and revisit evidence more selectively.
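The token figure above can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the function name and the 256 tokens-per-frame value are assumptions for illustration, not numbers from the paper. It also derives the per-frame cost implied by the reported total of more than 1.6 million tokens.

```python
# Back-of-the-envelope token budget for frame-by-frame ingestion.
# TOKENS_PER_FRAME is an assumed value, not a figure from the paper.

def video_tokens(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Total visual tokens if every sampled frame is fed to the model."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame

TWO_HOURS_S = 2 * 60 * 60            # 7,200 seconds
frames = int(TWO_HOURS_S * 1.0)      # 7,200 frames at 1 FPS

# Per-frame cost implied by the reported ">1.6 million tokens" figure.
implied_per_frame = 1_600_000 / frames   # roughly 222 tokens per frame

TOKENS_PER_FRAME = 256               # assumption, for illustration only
total = video_tokens(TWO_HOURS_S, 1.0, TOKENS_PER_FRAME)

print(f"{frames} frames, about {implied_per_frame:.0f} tokens/frame implied")
print(f"{total:,} tokens at {TOKENS_PER_FRAME} tokens/frame")
```

Under these assumptions, a few hundred tokens per frame are enough to push a two-hour video well past a million tokens before any reasoning begins.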

The Core Trick: Decouple Perception and Reasoning
The paper’s core contribution is to decouple perception from reasoning, replacing end-to-end full-sequence ingestion with a two-stage process. In MEMDREAMER, a perception model streams the video incrementally and builds a persistent memory bank, while a separate reasoning model later queries that memory for task-relevant evidence. This design turns long-video understanding from passive consumption of every frame into agentic exploration of a structured representation. The authors present the framework as plug-and-play, meaning it can wrap existing Vision-Language Models rather than requiring a wholly new backbone. By separating the expensive act of seeing from the focused act of reasoning, MEMDREAMER aims to preserve evidence while avoiding the context overload that weakens coupled VLM pipelines.
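The two-stage split can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: `MemoryBank`, `perceive`, `reason`, the captions, and the keyword-overlap lookup are stand-ins chosen for brevity, not the paper's components. The point is only that perception runs once without seeing the question, and reasoning reads a bounded slice of what perception wrote.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class MemoryBank:
    """Persistent store written by perception and read by reasoning."""
    entries: List[str] = field(default_factory=list)

def perceive(clips: Iterable[str], describe: Callable[[str], str],
             memory: MemoryBank) -> None:
    """Stage 1: stream clips one at a time and keep only their descriptions."""
    for clip in clips:
        memory.entries.append(describe(clip))

def reason(question: str, memory: MemoryBank, budget: int) -> List[str]:
    """Stage 2: return at most `budget` entries sharing a word with the question."""
    words = set(question.lower().split())
    hits = [e for e in memory.entries if words & set(e.lower().split())]
    return hits[:budget]

# Stub perception model: a lookup table standing in for a real VLM (assumption).
captions = {
    "clip0": "a man opens the door",
    "clip1": "a dog runs outside",
    "clip2": "the man closes the door",
}

memory = MemoryBank()
perceive(captions.keys(), captions.__getitem__, memory)  # no question needed here
evidence = reason("which door", memory, budget=1)
print(evidence)
```

Because `perceive` never receives the question, the same memory can serve many queries, and the `budget` argument caps how much evidence ever reaches the reasoning stage.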

Hierarchical Graph Memory
The memory structure proposed by the paper is a Hierarchical Graph Memory designed to reflect the coarse-to-fine organization of long videos. Its top-down, three-tier architecture starts with a Video Root that summarizes global context, then decomposes the video into Super Events, and finally resolves each Super Event into Macro Events. At the Macro Event level, the framework instantiates local subgraphs that represent entities, events, and logical relations rather than storing only flat chunks or isolated summaries. The paper emphasizes that long videos contain spatiotemporal and causal dependencies that are easily lost in purely sequential or similarity-indexed storage. By anchoring memory in a graph, MEMDREAMER can suppress irrelevant detail at higher levels while preserving the relational paths needed for later inference.
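One way to picture the three tiers is as nested records with a small edge list at the lowest level. The sketch below is an assumption about how such a structure could be laid out: the class names mirror the paper's tier names, but the fields, the `neighbors` helper, and the cooking-video contents are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source node, relation, target node)

@dataclass
class MacroEvent:
    """Lowest tier: a short segment with its own local subgraph."""
    summary: str
    span_s: Tuple[float, float]                       # start and end time in seconds
    nodes: List[str] = field(default_factory=list)    # entity and event labels
    edges: List[Edge] = field(default_factory=list)   # logical relations

@dataclass
class SuperEvent:
    """Middle tier: a coarse episode grouping several Macro Events."""
    summary: str
    macro_events: List[MacroEvent] = field(default_factory=list)

@dataclass
class VideoRoot:
    """Top tier: a global summary of the whole video."""
    summary: str
    super_events: List[SuperEvent] = field(default_factory=list)

def neighbors(macro: MacroEvent, node: str) -> List[Tuple[str, str]]:
    """Outgoing (relation, target) pairs for one node in a local subgraph."""
    return [(rel, dst) for src, rel, dst in macro.edges if src == node]

# Invented example content, not taken from the paper.
chop = MacroEvent(
    summary="chef chops onions",
    span_s=(60.0, 95.0),
    nodes=["chef", "knife", "onion"],
    edges=[("chef", "holds", "knife"), ("knife", "cuts", "onion")],
)
root = VideoRoot("cooking video", [SuperEvent("preparing dinner", [chop])])

print(neighbors(chop, "chef"))
```

Higher tiers hold only summaries, so a reader of the memory can skip whole branches, while the edge list keeps the relational detail available wherever it is actually needed.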

Agentic Retrieval in the Reasoning Loop
During inference, MEMDREAMER uses an Agentic Retrieval Mechanism that lets the reasoning model interact with the Hierarchical Graph Memory through tools. The paper defines an Agentic Tool Bank with Navigation tools for moving vertically through the hierarchy, Search tools for localizing relevant nodes, and Graph Traversal tools for following logical edges across related events and entities. This retrieval process is organized as an Observation-Reason-Action loop, in which the model observes memory content, decides what evidence is missing, invokes a tool, and iteratively refines its search. The mechanism is intended to avoid both full-context ingestion and shallow similarity retrieval, because many long-video questions require multi-step evidence gathering rather than a single nearest-neighbor lookup. In the paper’s view, the reasoning model’s agentic capability becomes directly useful once perception has been externalized into a navigable memory.
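The Observation-Reason-Action loop can be sketched as a small dispatcher over three tools. This is a toy under stated assumptions: the `MEMORY` layout, the tool signatures, and `run_loop` are hypothetical, and the reasoning model is replaced by a scripted policy so the example runs without any model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy memory: node id -> (text, child ids, logically linked ids). Hypothetical layout.
MEMORY: Dict[str, Tuple[str, List[str], List[str]]] = {
    "root": ("cooking video", ["s1"], []),
    "s1": ("preparing dinner", ["m1", "m2"], []),
    "m1": ("chef chops onions", [], ["m2"]),
    "m2": ("chef fries onions in pan", [], []),
}

def navigate(node: str) -> List[str]:
    """Navigation tool: move down one level of the hierarchy."""
    return MEMORY[node][1]

def search(keyword: str) -> List[str]:
    """Search tool: find nodes whose text contains the keyword."""
    return [n for n, (text, _, _) in MEMORY.items() if keyword in text]

def traverse(node: str) -> List[str]:
    """Graph Traversal tool: follow logical edges to related nodes."""
    return MEMORY[node][2]

TOOLS: Dict[str, Callable[[str], List[str]]] = {
    "navigate": navigate, "search": search, "traverse": traverse,
}

Trace = List[Tuple[str, str, List[str]]]

def run_loop(policy: Callable[[List[str], Trace], Tuple[str, str]],
             max_steps: int = 5) -> Tuple[Optional[str], Trace]:
    observation: List[str] = ["root"]
    trace: Trace = []
    for _ in range(max_steps):
        kind, arg = policy(observation, trace)     # Reason: choose the next step
        if kind == "answer":
            return arg, trace
        observation = TOOLS[kind](arg)             # Action, yielding a new Observation
        trace.append((kind, arg, observation))
    return None, trace                             # step budget exhausted

def scripted_policy(observation: List[str], trace: Trace) -> Tuple[str, str]:
    """Stand-in for the reasoning model: search, follow an edge, then answer."""
    if len(trace) == 0:
        return ("search", "chops")
    if len(trace) == 1:
        return ("traverse", observation[0])
    return ("answer", MEMORY[observation[0]][0])

answer, trace = run_loop(scripted_policy)
print(answer)
```

In this toy run the policy first localizes the chopping event, then follows its logical edge to the next event, which is the kind of multi-step lookup a single nearest-neighbor query would miss.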

What the Experiments Say
The experiments reported in the paper show MEMDREAMER achieving state-of-the-art results across four mainstream long-video understanding benchmarks and narrowing the gap with human experts to 3.7 points. On hours-long video evaluation including LVBench, the framework reportedly uses only about 2% of the context window required by full-context ingestion while delivering a 12.5-point absolute accuracy gain over end-to-end coupled paradigms. The authors also report that MEMDREAMER’s reasoning window is only about 5–6K tokens, which they describe as 41–124× smaller than the end-to-end input. Beyond benchmark gains, the paper presents a statistical analysis showing a strong positive linear correlation between a VLM’s logical reasoning performance and its long-video understanding performance under the MEMDREAMER setup. The authors take this as support for their broader claim that decoupling perception and reasoning turns long-video comprehension into a problem of scaling agentic capability rather than one of brute-force context scaling.
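The reported context figures can be related to one another with simple arithmetic. The sketch below assumes the quoted ranges (5–6K tokens, 41–124×) are taken at face value as exact bounds; the helper name is hypothetical, and the derived numbers are implications of those ranges rather than values stated in the paper.

```python
def context_fraction(reduction_factor: float) -> float:
    """Share of the end-to-end input that remains after an N-fold reduction."""
    return 1.0 / reduction_factor

REDUCTION_RANGE = (41, 124)       # "41-124x smaller", as reported
WINDOW_RANGE = (5_000, 6_000)     # "about 5-6K tokens", as reported

# Fraction of the end-to-end input used at each end of the reduction range.
fractions = [context_fraction(r) for r in REDUCTION_RANGE]

# Loosest bounds on the end-to-end input size implied by the two ranges.
bounds = (WINDOW_RANGE[0] * REDUCTION_RANGE[0],
          WINDOW_RANGE[1] * REDUCTION_RANGE[1])

for r, f in zip(REDUCTION_RANGE, fractions):
    print(f"{r}x smaller -> {f * 100:.1f}% of the end-to-end input")
print(f"implied end-to-end input between {bounds[0]:,} and {bounds[1]:,} tokens")
```

A 41× reduction corresponds to roughly 2.4% of the original input and a 124× reduction to roughly 0.8%, which is in line with the "about 2%" figure quoted for hours-long evaluation.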
