ReadPaper Blog
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
This paper addresses a persistent weakness of vision-language models: they can often recognize visible objects but struggle when spatial reasoning requires inferring structure that is not directly observable. The authors introduce Imaginative Perception Tokens (IPT), a supervision strategy that trains a multimodal model to predict grounded intermediate percepts such as unseen viewpoints, side views along paths, or integrated spatial maps, improving performance on Perspective Taking, Path Tracing, and Multiview Counting.

The Spatial Mystery
The paper studies why modern vision-language models remain brittle on spatial reasoning despite strong performance on object recognition and visual question answering. Its central claim is that many spatial questions cannot be answered by reading off the input image, because the relevant information may depend on an unseen viewpoint, an occluded region, or a scene-level map assembled from partial observations. The authors frame this missing capability as imaginative perception: the ability to simulate what would be perceived under an alternative spatial configuration while remaining constrained by observed evidence. This framing targets failure modes reported on spatial benchmarks, including left-right reasoning, perspective changes, depth, occlusion, and multi-view scene reconstruction. The paper’s motivation is therefore not merely to improve recognition, but to give VLMs a training signal for constructing spatial structure that is implied by the scene rather than directly visible in it.

Imagination, Not Just Description
The paper proposes Imaginative Perception Tokens as intermediate perceptual representations for reasoning over unobserved spatial structure. Unlike intermediates such as depth maps, bounding boxes, or visual thoughts that often refine information already present in the input, IPT is intended to represent what the model would perceive from another viewpoint, along a path, or after integrating multiple partial views. The representation is imaginative but not unconstrained: it must stay geometrically and semantically consistent with the observed scene. The authors implement this idea using BAGEL, a unified vision-language model capable of interleaving understanding and generation over text and image tokens. The methodological implication is that spatial reasoning can be supervised in a modality better aligned with geometry than textual chain-of-thought alone.

Three Heroic Trials
To evaluate imaginative perception, the paper defines three tasks designed so that the intermediate spatial percept is central to solving the problem. Perspective Taking asks the model to infer how a scene would appear after moving to a marked position and changing orientation, such as deciding whether an object becomes closer, farther, left, or right. Path Tracing asks the model to reason along a navigation route and infer what object would be visible from a specified side at an intermediate waypoint. Multiview Counting asks the model to combine multiple partial observations into a coherent scene representation in order to count objects such as office chairs or refrigerators. These tasks are deliberately constructive rather than purely discriminative, because success depends on predicting missing spatial structure rather than verifying a relation already visible in a single image.
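To make the Perspective Taking question concrete, the sketch below works through the underlying geometry in 2D: given an object position and two observer poses, it reports which side the object falls on after the move and whether it became closer or farther. This is only an illustration of what the task asks, not the paper's method; the model sees images rather than coordinates, and the `Pose` class, the function names, and the heading convention (counterclockwise from the +x axis) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical observer pose on the floor plane."""
    x: float
    y: float
    heading_deg: float  # assumed convention: counterclockwise from the +x axis


def egocentric(pose, obj_xy):
    """Return (forward, left, distance) of obj_xy in the observer's frame."""
    theta = math.radians(pose.heading_deg)
    dx, dy = obj_xy[0] - pose.x, obj_xy[1] - pose.y
    # Project the offset onto the forward axis and the left-pointing axis.
    forward = dx * math.cos(theta) + dy * math.sin(theta)
    left = -dx * math.sin(theta) + dy * math.cos(theta)
    return forward, left, math.hypot(dx, dy)


def side(pose, obj_xy, eps=1e-9):
    """Classify the object as left, right, or center relative to the heading."""
    _, left, _ = egocentric(pose, obj_xy)
    if left > eps:
        return "left"
    if left < -eps:
        return "right"
    return "center"


def perspective_change(start, target, obj_xy, eps=1e-9):
    """Return (side seen from target, distance change relative to start)."""
    d_start = egocentric(start, obj_xy)[2]
    d_target = egocentric(target, obj_xy)[2]
    if d_target < d_start - eps:
        change = "closer"
    elif d_target > d_start + eps:
        change = "farther"
    else:
        change = "same"
    return side(target, obj_xy), change


if __name__ == "__main__":
    start = Pose(0.0, 0.0, 90.0)    # facing +y
    target = Pose(2.0, 4.0, 180.0)  # moved and turned to face -x
    chair = (1.0, 3.0)
    print(side(start, chair))                       # right
    print(perspective_change(start, target, chair)) # ('left', 'closer')
```

In this example the object flips from the observer's right to the observer's left after the move, which is exactly the kind of answer that cannot be read off the original image and has to be inferred from the new viewpoint.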

Training the Team
The paper constructs task-specific datasets in which ground-truth intermediate imaginations are paired with final answers, allowing the model to learn both the intermediate spatial percept and the downstream response. It reports approximately 20K examples per task. According to the dataset table, Perspective Taking draws on AI2-THOR, Habitat, and real images; Path Tracing draws on AI2-THOR; and Multiview Counting draws on AI2-THOR and additional sources. Each task uses a different IPT format: a novel-viewpoint image for Perspective Taking, a side-view image for Path Tracing, and an integrated perceptual representation for Multiview Counting. The authors also curate human-filtered evaluation benchmarks, including AI2-THOR and Habitat subsets for Perspective Taking and AI2-THOR plus real examples for Path Tracing. This dataset design makes the supervision signal explicit: models are not trained only to guess labels, but to produce intermediate perceptual states aligned with the underlying geometry of the task.
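One way to picture a training example of this kind is as a record that pairs the inputs with an optional ground-truth percept and the final answer. The sketch below is a guess at such a structure for illustration only: the text above does not describe the paper's actual data format or BAGEL's interface, so `IPTExample`, `build_target`, the field names, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class IPTExample:
    """Hypothetical training record; field names are illustrative only."""
    task: str                             # e.g. "perspective_taking"
    input_images: Tuple[str, ...]         # placeholder paths to observed views
    question: str
    answer: str
    percept_image: Optional[str] = None   # ground-truth intermediate percept, if any


def build_target(example: IPTExample) -> List[Tuple[str, str]]:
    """Return the ordered (modality, content) segments the model is supervised on.

    With a percept, the target is the imagined image followed by the answer.
    Without one, the target collapses to answer-only supervision.
    """
    segments: List[Tuple[str, str]] = []
    if example.percept_image is not None:
        segments.append(("image", example.percept_image))
    segments.append(("text", example.answer))
    return segments


if __name__ == "__main__":
    with_ipt = IPTExample(
        task="perspective_taking",
        input_images=("scene.png",),
        question="From the marked position, is the chair on your left or right?",
        answer="left",
        percept_image="novel_view.png",
    )
    answer_only = IPTExample(
        task="perspective_taking",
        input_images=("scene.png",),
        question=with_ipt.question,
        answer="left",
    )
    print(build_target(with_ipt))     # image segment, then text answer
    print(build_target(answer_only))  # text answer only
```

Keeping the percept optional lets the same record type describe both IPT-supervised examples and answer-only examples, which differ only in whether the image segment precedes the answer in the target.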

What Won the Battle
Empirically, the paper reports that IPT supervision improves spatial reasoning across several settings compared with answer-only training, and that it often compares favorably to textual chain-of-thought. On Multiview Counting, the reported accuracy improvement is 3.4%, and on Path Tracing the IPT-trained model reaches performance competitive with strong closed-source models. The authors also find that improvements can persist even when no intermediate image is generated at inference time, suggesting that IPT training may strengthen internal spatial representations rather than only providing an external visual scratchpad. Mixed training with IPT and label-only data can further improve results, while textual chain-of-thought can degrade performance in some cases, highlighting a modality mismatch when spatial computation is forced into language. The paper concludes that supervised imaginative percepts provide a principled and interpretable route toward better VLM spatial generalization, while noting that benefits vary across tasks and depend on imagination quality and task structure.
