ReadPaper Blog
Benchmarking Visual State Tracking in Multimodal Video Understanding
This paper introduces VSTAT, a benchmark for testing whether Multimodal Large Language Models can track visual states across an entire video rather than answer from isolated frames. It addresses a gap in video evaluation by using 834 clips and 1,500 questions that require continuous perception of entities, events, and evolving states. The results matter because current state-of-the-art MLLMs remain far below humans and only modestly above answer-prior baselines, suggesting that video understanding still lacks robust visual state tracking.
Source: Benchmarking Visual State Tracking in Multimodal Video Understanding

Why Another Benchmark?
The paper argues that video understanding should be evaluated as the ability to follow continuous dynamics, not merely to recognize objects, actions, or end states in selected frames. Its central motivation is that many existing video benchmarks can be solved from a small number of keyframes or salient moments, which means high benchmark scores may overstate a model's ability to understand processes over time. The authors frame visual state tracking as the capacity to maintain and update information about entities, states, and events as a video unfolds. This matters for settings such as robotics and visual demonstrations, where an agent must know not only what appears in a scene but how the relevant state has changed. By foregrounding this gap, the paper shifts evaluation from static semantic recognition toward temporally grounded perception and integration.

What Is VSTAT?
VSTAT, short for Visual State Tracking, is introduced as a video question-answering benchmark designed to diagnose this specific capability in Multimodal Large Language Models. The dataset contains 834 clips paired with 1,500 questions drawn from synthetic videos, self-recorded videos, and real-world videos in the wild. Each question is constructed so that the answer cannot be recovered from a single frame or a short segment, requiring the model to integrate evidence across the full video stream. The benchmark therefore tests whether a model can continuously observe a procedure, update an internal representation of state, and answer based on accumulated visual evidence. This design makes VSTAT a more targeted probe of temporal visual understanding than evaluations centered on recognition of isolated actions or visible final configurations.

What Makes It Hard?
The paper emphasizes that VSTAT is hard because the relevant state may be distributed, changing, partially hidden, or visually ambiguous over time. Its taxonomy includes state types such as counts, sequences, sets, dictionaries, and locations, which require different forms of memory and update operations. Some tasks demand tracking an entity through occlusion or visual similarity, while others require accumulating events, associating changes with the right object, or remembering an ordering that is not visible at the end. These challenges are meant to prevent shortcut solutions based on static appearance and instead require models to perceive each critical event as it occurs. By combining synthetic, self-recorded, and in-the-wild sources, the benchmark also probes whether state tracking failures persist beyond narrow toy settings.
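To make the taxonomy concrete, here is a minimal Python sketch of how each state type might be updated as events arrive. The event kinds (`add`, `remove`, `recolor`, `move`) and the `TrackedState` class are hypothetical illustrations, not VSTAT's actual data format. The point is only that each state type calls for a different update rule, and that much of the resulting state cannot be read off the final frame.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedState:
    """Toy container with one field per state type named in the taxonomy."""
    count: int = 0                                  # number of events seen so far
    sequence: list = field(default_factory=list)    # order in which items were added
    members: set = field(default_factory=set)       # items currently present
    attributes: dict = field(default_factory=dict)  # item -> latest attribute seen
    locations: dict = field(default_factory=dict)   # item -> current position

    def update(self, event: tuple) -> None:
        """Apply one (hypothetical) event to the relevant pieces of state."""
        kind, item, *rest = event
        self.count += 1
        if kind == "add":
            self.sequence.append(item)
            self.members.add(item)
            self.locations[item] = rest[0]
        elif kind == "remove":
            self.members.discard(item)
            self.locations.pop(item, None)
        elif kind == "recolor":
            self.attributes[item] = rest[0]
        elif kind == "move":
            self.locations[item] = rest[0]
        else:
            raise ValueError(f"unknown event kind: {kind}")


def track(events):
    """Fold a list of events into a final TrackedState."""
    state = TrackedState()
    for event in events:
        state.update(event)
    return state


# Made-up event stream for illustration only.
events = [
    ("add", "cup", "left"),
    ("add", "ball", "center"),
    ("recolor", "cup", "red"),
    ("move", "ball", "right"),
    ("remove", "cup"),
]
final = track(events)
```

In this toy run, the cup is gone by the end, so a viewer of the last frame alone could not recover the event count, the order of additions, or the fact that the cup was ever recolored.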

Why Models Fail
The experiments find that current MLLMs struggle on VSTAT despite strong reported performance on existing video understanding benchmarks. The paper's failure analysis compares models' thinking traces with the underlying video stream and concludes that models can often produce coherent textual reasoning while missing the visual events needed for that reasoning to be correct. The authors also report that stretching synthetic videos over time, which should give frame sampling more opportunities to capture each event, yields only marginal improvement, suggesting that frame sampling limitations are not the sole explanation. In contrast, when the relevant video content is converted into text transcriptions, performance improves substantially, supporting the claim that the main bottleneck is visual perception rather than abstract reasoning. This distinction is important because it shows that chain-of-thought-style reasoning can look plausible even when the perceptual evidence feeding it is wrong.
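The gap between reasoning and perception can be illustrated with a toy shell-game tracker. This is a sketch under assumed conventions, not the paper's setup: the `swaps` list stands in for a text transcription of a video, and the `perceived` list simulates a model that fails to observe one event. The update logic is identical in both runs; only the perceived events differ.

```python
def track_ball(start: int, swaps: list) -> int:
    """Return the final cup index of the ball after a series of cup swaps."""
    position = start
    for a, b in swaps:
        # The ball moves only if its current cup is one of the two swapped.
        if position == a:
            position = b
        elif position == b:
            position = a
    return position


# Hypothetical transcription of a three-cup video: each pair swaps two cups.
swaps = [(0, 1), (1, 2), (0, 2), (1, 2)]

# With the complete transcription, tracking is a simple fold over events.
full = track_ball(0, swaps)

# Simulate a perceptual miss: the second swap is never observed.
perceived = swaps[:1] + swaps[2:]
missed = track_ball(0, perceived)
```

Here the full transcription puts the ball under cup 0, while the run with one missed swap confidently reports cup 2. Every reasoning step in the second run is valid given its inputs, which is the kind of plausible but wrong trace the failure analysis describes.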

Takeaway
The paper's broader conclusion is that visual state tracking remains an unsolved weakness for modern Multimodal Large Language Models. State-of-the-art systems perform far below humans on VSTAT and only modestly above answer-prior baselines, which indicates that existing video capabilities do not reliably support continuous tracking of procedural state. The authors also evaluate recent agentic approaches, including MLLM-based video agents and coding agents, and find that these preliminary systems do not readily close the gap. This suggests that wrapping models in more elaborate reasoning or tool-use frameworks is unlikely to help when the underlying model fails to perceive the decisive visual events. VSTAT therefore points future work toward stronger temporally grounded visual perception, better integration over long video streams, and evaluations that distinguish genuine state tracking from frame-level shortcut recognition.
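For readers unfamiliar with the term, an answer-prior baseline is generally understood as one that answers without looking at the video, relying only on which answers are most common. The sketch below shows one plausible reading (the majority answer per question type). This summary does not specify how the paper constructs its baselines, so the approach, the records, and the field names `type` and `answer` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_answer_prior(train):
    """Learn the most frequent answer for each question type (no video used)."""
    by_type = defaultdict(Counter)
    for q in train:
        by_type[q["type"]][q["answer"]] += 1
    return {t: counts.most_common(1)[0][0] for t, counts in by_type.items()}


def accuracy(prior, questions):
    """Fraction of questions answered correctly by the prior alone."""
    correct = sum(prior.get(q["type"]) == q["answer"] for q in questions)
    return correct / len(questions)


# Made-up records for illustration only; not VSTAT data.
train = [
    {"type": "count", "answer": "3"},
    {"type": "count", "answer": "3"},
    {"type": "count", "answer": "4"},
    {"type": "location", "answer": "left"},
    {"type": "location", "answer": "left"},
    {"type": "location", "answer": "right"},
]
held_out = [
    {"type": "count", "answer": "3"},
    {"type": "count", "answer": "5"},
    {"type": "location", "answer": "left"},
    {"type": "location", "answer": "right"},
]

prior = fit_answer_prior(train)
baseline_accuracy = accuracy(prior, held_out)
```

On these toy records the prior alone gets half the held-out questions right. A model score close to such a number says little about tracking, which is why the comparison against answer-prior baselines matters for interpreting the results.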
