ReadPaper Blog
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
This paper asks whether Vision-Language Models or Video Generation Models provide better frozen visual representations for spatial intelligence, where systems must recognize semantic objects while preserving geometric structure. It uses a controlled frozen-feature probing framework across semantic tagging, instance grouping, and 3D geometry prediction, finding that VLMs expose stronger semantic and instance-level information while VGMs expose stronger geometry and camera-motion signals. The result matters because simple feature-level fusion already combines much of these complementary strengths, suggesting a practical route toward stronger spatial-intelligence backbones.

Two Ways to Teach a Vision Brain
The paper is motivated by a core requirement of spatial intelligence: visual representations must encode both what objects are present and how the physical scene is structured in space. It frames Vision-Language Models such as Qwen3-VL and InternVL3 as backbones shaped by language supervision, which tends to align images with semantic categories, attributes, and instructions. It frames Video Generation Models such as Wan2.1 and CogVideoX as backbones shaped by temporally evolving visual worlds, which may favor dynamics, continuity, and geometric consistency. The authors argue that embodied AI, autonomous driving, robotics, and real-world scene understanding need both semantic object information and 3D geometry, yet prior comparisons often evaluate full downstream systems rather than the representations themselves. The central research question is therefore not which complete agent performs better, but which pretraining paradigm makes each kind of spatial information readily recoverable from frozen features.

Freeze the Model, Then Peek Inside
The paper’s method is a frozen-feature probing study designed to isolate the representational substrate of VLMs and VGMs before downstream fine-tuning. Instead of updating the foundation models, the authors freeze each model, extract intermediate visual features, and train lightweight probes with an identical probing backbone and task-specific readout heads. This design reduces confounds from action decoders, robot data mixtures, post-training recipes, inference strategies, and benchmark-specific policy machinery. For video inputs, the framework samples a 76-frame context and uses a fixed 20-frame query set, giving temporally aligned feature banks across model families despite their different architectures. For VGMs, the probe reads internal generator activations during a denoising pass rather than generated pixels, keeping the comparison focused on encoded representations rather than output rendering quality.
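The frozen-feature idea can be illustrated with a small, self-contained sketch. This is not the paper's code: the random-projection "backbone", the closed-form ridge-regression probe, and every name below (`W_frozen`, `extract_features`, `fit_linear_probe`) are assumptions chosen to keep the example runnable. The point is only that the backbone weights stay fixed while a lightweight readout is fit on top of the extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen foundation model: a fixed random
# projection followed by tanh. A real study would use VLM or VGM activations.
W_frozen = rng.normal(size=(16, 32)) / 4.0


def extract_features(x):
    """Run the frozen backbone. W_frozen is read but never updated."""
    return np.tanh(x @ W_frozen)


def fit_linear_probe(feats, targets, l2=1e-2):
    """Fit a linear readout on frozen features with closed-form ridge regression."""
    dim = feats.shape[1]
    gram = feats.T @ feats + l2 * np.eye(dim)
    return np.linalg.solve(gram, feats.T @ targets)


# Toy data: targets are a noisy linear function of the frozen features.
x = rng.normal(size=(200, 16))
feats = extract_features(x)
true_readout = rng.normal(size=(32, 3))
y = feats @ true_readout + 0.01 * rng.normal(size=(200, 3))

backbone_before = W_frozen.copy()
probe = fit_linear_probe(feats, y)
probe_mse = float(np.mean((feats @ probe - y) ** 2))

print("backbone unchanged:", np.array_equal(W_frozen, backbone_before))
print("probe training MSE:", probe_mse)
```

Because only `probe` is learned, differences in probe accuracy between two backbones can be attributed to what their frozen features already expose rather than to task-specific fine-tuning.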

Three Tests of Spatial Intelligence
The paper operationalizes spatial intelligence through three complementary probing axes: semantic tagging, instance grouping, and 3D geometry prediction. Semantic tagging asks whether object categories are recoverable from a video representation, capturing the kind of semantic knowledge needed for instruction grounding, task planning, and object discovery. Instance grouping asks whether pixels across multiple views can be assigned to the same object instance, testing object-centric structure needed for tracking, object permanence, and interaction. 3D geometry prediction asks the probe to recover dense depth, point maps, and camera motion, measuring whether the representation preserves scene layout and physical structure. By placing these tasks under one probing framework, the paper spans a spectrum from recognizing semantic objects to reconstructing spatial worlds.
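To make the three probing axes concrete, the sketch below shows how one shared feature tensor could feed separate readouts for tagging, instance grouping, and geometry. All sizes, the linear heads, the mean pooling, and the 7-number camera parameterization (3 for translation, 4 for a rotation quaternion) are illustrative assumptions, not details taken from the paper. The weights are random, so only the output shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
T, H, W, C = 4, 8, 8, 32        # query frames, feature-map height/width, channels
NUM_TAGS, EMBED_DIM = 10, 16    # tag vocabulary size, instance-embedding width

# Stand-in for the shared probing backbone's output on frozen features.
shared = rng.normal(size=(T, H, W, C))

# One linear readout per task (random here; a real probe would train them).
w_tag = rng.normal(size=(C, NUM_TAGS))
w_inst = rng.normal(size=(C, EMBED_DIM))
w_depth = rng.normal(size=(C, 1))
w_point = rng.normal(size=(C, 3))
w_cam = rng.normal(size=(C, 7))  # assumed: translation (3) + quaternion (4)


def readouts(feats):
    """Apply task-specific heads to one shared feature tensor."""
    pooled_video = feats.mean(axis=(0, 1, 2))   # (C,)   one vector per clip
    pooled_frame = feats.mean(axis=(1, 2))      # (T, C) one vector per frame
    return {
        "tag_logits": pooled_video @ w_tag,     # (NUM_TAGS,)  clip-level tags
        "instance_embed": feats @ w_inst,       # (T, H, W, EMBED_DIM)
        "depth": (feats @ w_depth)[..., 0],     # (T, H, W)
        "point_map": feats @ w_point,           # (T, H, W, 3)
        "camera": pooled_frame @ w_cam,         # (T, 7)
    }


out = readouts(shared)
for name, value in out.items():
    print(name, value.shape)
```

Tagging reads a clip-level summary, instance grouping reads per-pixel embeddings that can be clustered across views, and geometry reads dense per-pixel outputs plus a per-frame camera vector, which mirrors the semantic-to-geometric spectrum described above.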

The Surprise: Each Model Has a Specialty
The paper reports a consistent division of strengths rather than a single winning pretraining paradigm. VLM features are stronger for semantic tagging, which supports the claim that language-aligned pretraining makes object-category information more accessible to lightweight readouts. VLMs also perform better on instance grouping, indicating that grouping object instances across views benefits from object-centric semantic representations and not only from low-level geometric continuity. VGMs, by contrast, provide more accessible signals for 3D geometry prediction, including dense point maps, depth, and camera motion. The authors interpret this pattern as evidence that current VLMs better encode semantic objects, while current VGMs better encode spatial geometry.

Best Punchline: Mix Them!
The paper’s constructive implication is that the two representation families are complementary and can be combined directly at the feature level. The authors test a naive fusion strategy that normalizes VLM and VGM features and concatenates them channel-wise, without introducing a complex learned integration scheme. This simple fused representation preserves much of the VLM advantage on semantic tasks while recovering much of the VGM strength on geometry-oriented tasks. The finding suggests that stronger spatial-intelligence backbones may come from integrating language-aligned object representations with generation-induced geometric representations. The paper therefore positions fusion of VLM and VGM features as a promising research direction for embodied AI, robotics, autonomous driving, and scene understanding systems that need both semantics and physical structure.
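The naive fusion described above, normalize and then concatenate along the channel axis, can be sketched in a few lines. The summary does not say which normalization the authors use, so per-token L2 normalization is an assumption here, as are the function names and the toy tensor sizes.

```python
import numpy as np


def l2_normalize(feats, eps=1e-8):
    """Scale each token's feature vector to unit L2 norm (assumed normalization)."""
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats / np.maximum(norms, eps)


def fuse(vlm_feats, vgm_feats):
    """Normalize both feature banks, then concatenate channel-wise."""
    if vlm_feats.shape[:-1] != vgm_feats.shape[:-1]:
        raise ValueError("feature banks must share all non-channel dimensions")
    return np.concatenate(
        [l2_normalize(vlm_feats), l2_normalize(vgm_feats)], axis=-1
    )


rng = np.random.default_rng(0)
# Toy banks: 20 query frames, 64 tokens per frame, different channel widths
# and deliberately mismatched scales.
vlm = rng.normal(size=(20, 64, 48)) * 50.0
vgm = rng.normal(size=(20, 64, 32)) * 0.1

fused = fuse(vlm, vgm)
print(fused.shape)  # (20, 64, 80)
```

Normalizing before concatenation keeps the bank with larger raw activations from dominating the probe's input, which is one plausible reason a simple concatenation can retain the strengths of both families.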
