ReadPaper Blog
M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
M3Eval introduces a benchmark for evaluating memory in multi-modal models on long-form video tasks, addressing the gap between larger context windows and genuine retention, retrieval, and organization of information. The paper adapts controlled paradigms from cognitive psychology into video question-answering tasks that isolate memory mechanisms such as divided attention, interference, temporal organization, and symbolic abstraction. Its experiments across representative open-source and proprietary multi-modal models show that current systems often remember video content in brittle, entangled, and non-human-like ways.
Source: M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Why Long-Video Models Still Forget
The paper argues that long-video understanding requires more than feeding a model more frames or expanding its context window. M3Eval is motivated by the observation that existing video benchmarks mainly test perception and reasoning, while leaving memory as an implicit and poorly measured capability. The authors define multi-modal memory as the ability to encode, store, retrieve, and synthesize information across long temporal horizons involving video and text. This distinction matters because downstream reasoning over long videos depends on whether the model preserves the right information, keeps sources disentangled, and resists distraction. By framing memory as a first-class evaluation target, the paper turns a broad limitation of multi-modal models into a set of testable mechanisms.

M3Eval’s Core Idea
M3Eval’s main methodological contribution is to transfer the logic of cognitive psychology into controlled video-based evaluation. Rather than asking broad questions that mix perception, reasoning, and recall, the benchmark constructs stimuli and QA probes intended to isolate specific memory processes. The paper draws on paradigms such as divided attention, memory interference, interleaved event organization, and N-Back-style symbolic representation. Each task is designed around a cognitive theory, instantiated as a controlled video condition, and paired with targeted multiple-choice questions and metrics for failure modes. This design lets the benchmark ask not merely whether a model answers correctly, but what kind of memory breakdown may have produced the error.
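One way to picture this design is a record per probe in which each wrong option is tagged with the memory failure it would suggest. The sketch below is illustrative only: `Probe`, `diagnose`, and the failure labels are hypothetical names chosen for this example, not the paper's data format or metrics.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Probe:
    """One multiple-choice question tied to a controlled video condition (hypothetical schema)."""
    paradigm: str                  # e.g. "divided_attention"
    question: str
    options: List[str]
    answer: int                    # index of the correct option
    failure_modes: Dict[int, str]  # distractor index -> assumed failure label


def diagnose(probes: List[Probe], predictions: List[int]) -> Tuple[float, Counter]:
    """Return accuracy plus a tally of which failure label each wrong answer points to."""
    if not probes or len(probes) != len(predictions):
        raise ValueError("need one prediction per probe")
    correct = 0
    failures: Counter = Counter()
    for probe, pred in zip(probes, predictions):
        if pred == probe.answer:
            correct += 1
        else:
            # Distractors without a tag are counted separately rather than guessed at.
            failures[probe.failure_modes.get(pred, "unlabeled")] += 1
    return correct / len(probes), failures
```

The point of the tally is that two models with the same accuracy can differ in which distractors they pick, which is the kind of distinction a diagnostic benchmark is meant to surface.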

What Memory Dimensions Are Tested?
The benchmark characterizes multi-modal memory along four major dimensions that correspond to realistic demands in video understanding. Divided Attention evaluates whether a model can retain concurrent inputs, using split-screen videos and variants with or without spatial swaps to test source identification, order understanding, and content retention. Memory Interference examines whether sequentially similar content disrupts recall, separating robustness to proactive and retroactive interference rather than treating forgetting as simple decay. Interleaved Events tests whether a model can reorganize temporally interleaved clips into coherent event representations, a capability related to tracking story structure across fragmented evidence. N-Back probes whether models can abstract multi-modal information into symbolic attributes and maintain those attributes across temporal gaps.
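Two of these paradigms are easy to sketch at the level of sequence logic. The code below is a minimal illustration under stated assumptions, not the paper's construction: `interleave` uses a simple round-robin schedule, `regroup` recovers the per-event order that a reorganization question would ask about, and `n_back_targets` applies the classic N-back matching rule to a list of symbolic attributes. All function names and the clip and attribute values used with them are hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple


def interleave(events: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Round-robin merge of clips from several events, keeping each clip's source label."""
    labeled = [[(name, clip) for clip in clips] for name, clips in events.items()]
    stream: List[Tuple[str, str]] = []
    for group in zip_longest(*labeled):
        # Shorter events run out first; skip the padding that zip_longest adds.
        stream.extend(item for item in group if item is not None)
    return stream


def regroup(stream: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Ground truth for reorganization: clips of each event in their original order."""
    grouped: Dict[str, List[str]] = {}
    for name, clip in stream:
        grouped.setdefault(name, []).append(clip)
    return grouped


def n_back_targets(attributes: List[str], n: int) -> List[bool]:
    """Mark positions whose attribute equals the one seen n steps earlier."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [i >= n and attributes[i] == attributes[i - n]
            for i in range(len(attributes))]
```

In this framing, a model that reorganizes interleaved clips correctly should be able to answer questions whose ground truth comes from `regroup`, while an N-back probe only requires holding an abstract attribute (for example, a color or an object category) across the gap.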

What the Models Got Wrong
The experiments reveal consistent weaknesses in the memory behavior of current multi-modal models. When processing parallel video streams, models often fail to maintain independent representations, suggesting attention confusion across concurrent visual inputs. In the interference setting, the paper reports that human memory shows stronger retroactive interference than proactive interference, while multi-modal models show more comparable levels across the two conditions. The authors also find that repeated interfering video segments can sometimes improve model understanding of target segments, which differs from standard human memory patterns. In interleaved event settings, models are less capable than humans at organizing temporally mixed information, and their source grounding is more reliable in the spatial domain than in the temporal domain. The N-Back results further indicate limited symbolic memory and difficulty filtering irrelevant information from memory.
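To make the interference comparison concrete: proactive interference is disruption of a later target by earlier material, and retroactive interference is disruption of an earlier target by later material. The sketch below assumes a no-interference baseline condition and measures each effect as an accuracy drop relative to it. Both the baseline and the function names are assumptions made for illustration; this summary does not state the paper's exact metric.

```python
from typing import Dict


def interference_effect(baseline_acc: float, interfered_acc: float) -> float:
    """Accuracy lost to interference; a negative value means the interferer helped."""
    return baseline_acc - interfered_acc


def interference_profile(baseline_acc: float,
                         proactive_acc: float,
                         retroactive_acc: float) -> Dict[str, float]:
    """Summarize both effects and their asymmetry (retroactive minus proactive)."""
    proactive = interference_effect(baseline_acc, proactive_acc)
    retroactive = interference_effect(baseline_acc, retroactive_acc)
    return {
        "proactive": proactive,
        "retroactive": retroactive,
        "asymmetry": retroactive - proactive,
    }
```

Under this definition, a clearly positive asymmetry would match the human pattern described above, a value near zero would match the comparable levels reported for models, and a negative effect would correspond to the case where interfering segments improve understanding of the target.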

Why This Matters
The broader implication of M3Eval is that memory should be measured as a structured capability rather than inferred from success on general long-video benchmarks. By evaluating capacity, fidelity, robustness to interference, temporal organization, and symbolic abstraction, the paper provides a resource for diagnosing where multi-modal models fail even when they can process long inputs. Its cognitively grounded tasks make it possible to assess model behavior not only by answer accuracy, but also against known distinctions from human memory research. The findings suggest that future multi-modal systems need better mechanisms for disentangled representation, temporal source grounding, interference control, and symbolic maintenance. The benchmark therefore serves both as an evaluation protocol and as a research agenda for building models that remember video information more faithfully.
