ReadPaper Blog
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
This paper studies whether multimodal large language models (MLLMs) can reason about the same physical scene across strongly different viewpoints, using wide-baseline matching (WBM) as a demanding test of geometry, semantics, fine-grained perception, and occlusion reasoning. It introduces ReasonMatch-Bench, a benchmark backed by an automated data pipeline built from RGB-D videos and Structure-from-Motion reconstructions, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve correspondence reasoning with verifiable rewards. The result matters because physical-world MLLMs need robust cross-view spatial reasoning, and the paper shows both a large gap in current models and a practical training route that narrows it.
Source: Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Why Wide-Baseline Matching Is Hard
The paper argues that wide-baseline matching is a powerful stress test for spatial reasoning in multimodal large language models because the task requires recognizing the same 3D scene elements under large viewpoint displacement, perspective distortion, illumination change, repetitive structure, and occlusion. Unlike ordinary object recognition or captioning, WBM forces a model to connect image evidence through geometric regularities, semantic context, and fine-grained visual details. The authors define the problem in terms of camera projections, where corresponding image points arise from the same 3D point observed through different camera intrinsics and extrinsics. They contrast this with classical feature-based pipelines such as SIFT, SURF, ORB, and RANSAC-based geometric verification, which can fail in the extreme wide-baseline regime. The paper’s central motivation is that MLLMs deployed in physical environments need this integrated spatial competence, yet existing evaluation and training methods do not systematically elicit it.
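To make the projection definition concrete, here is a minimal sketch of a standard pinhole camera model in plain Python. It is not taken from the paper: the function name `project`, the toy intrinsics, and the two camera poses are assumptions chosen for illustration. It shows how one 3D point lands at different pixels in two views that differ by a 90-degree rotation.

```python
def project(point_world, K, R, t):
    """Project a 3D world point with a pinhole camera (illustrative helper).

    K is a 3x3 intrinsic matrix; R (3x3) and t (3-vector) are extrinsics such
    that X_cam = R * X_world + t. Returns (u, v, depth), or None if the point
    lies behind the camera.
    """
    x_cam = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
             for i in range(3)]
    depth = x_cam[2]
    if depth <= 0:
        return None  # behind the image plane, so not observable
    u = sum(K[0][j] * x_cam[j] for j in range(3)) / depth
    v = sum(K[1][j] * x_cam[j] for j in range(3)) / depth
    return (u, v, depth)


# Toy intrinsics (assumed): focal length 500 px, principal point (320, 240).
K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]

# View 1: camera at the world origin, looking down +z.
R1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t1 = [0, 0, 0]

# View 2: rotated 90 degrees about the y axis and moved to the side.
R2 = [[0, 0, -1], [0, 1, 0], [1, 0, 0]]
t2 = [4, 0, 4]

point = (1, 0.5, 4)
pixel_view1 = project(point, K, R1, t1)
pixel_view2 = project(point, K, R2, t2)
print(pixel_view1, pixel_view2)
```

The two pixel locations form one ground-truth correspondence. The larger the change between the two poses, the less the local image patches around those pixels resemble each other, which is the regime the paper targets.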

The Benchmark Gap
The paper identifies a substantial benchmark gap between human spatial reasoning and current MLLMs on fine-grained wide-baseline correspondence. On a difficult 90-sample human-study subset of ReasonMatch-Bench, human annotators reach 84.0 F1, while the best existing baseline reaches only 37.2 F1. This evidence supports the authors’ claim that frontier multimodal systems still struggle when matching requires more than local appearance similarity. The gap is especially important because WBM combines multiple abilities that are often evaluated separately in prior spatial benchmarks, including viewpoint imagination, depth and scale cues, occlusion reasoning, and semantic grounding. By quantifying this failure mode, the paper frames WBM not merely as a computer vision subroutine but as a measurable proxy for complex spatial intelligence in MLLMs.

ReasonMatch-Bench
ReasonMatch-Bench is the paper’s benchmark for evaluating cross-view spatial reasoning through wide-baseline matching across controlled difficulty dimensions. The benchmark stratifies examples by viewpoint displacement and matching granularity, so models can be tested on progressively harder cases rather than only on aggregate accuracy. It spans indoor, outdoor, and object-centric scenarios, which helps prevent evaluation from being tied to a single visual domain or scene type. The task formulation presents the MLLM with two images and pre-marked point sets, then asks it to output a textual mapping from points in one image to corresponding points in the other or to mark them as unmatched. This language-mediated partial bipartite matching formulation lets the benchmark test whether an MLLM can associate visual entities using geometry, semantics, and context rather than only dense feature scores.
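The task formulation above can be sketched as a parse-and-score routine. This summary does not specify the benchmark's exact answer format or F1 definition, so both are assumptions here: the `A1 -> B3` / `A2 -> none` syntax is hypothetical, and F1 is computed over predicted versus ground-truth matched pairs.

```python
import re


def parse_mapping(text):
    """Parse lines such as 'A1 -> B3' or 'A2 -> none' (hypothetical format)."""
    mapping = {}
    pattern = r"(A\d+)\s*->\s*(B\d+|none)"
    for src, dst in re.findall(pattern, text, flags=re.IGNORECASE):
        mapping[src.upper()] = None if dst.lower() == "none" else dst.upper()
    return mapping


def matching_f1(pred, truth):
    """F1 over matched pairs; points mapped to None contribute no pair."""
    pred_pairs = {(a, b) for a, b in pred.items() if b is not None}
    true_pairs = {(a, b) for a, b in truth.items() if b is not None}
    true_positives = len(pred_pairs & true_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Ground truth: A3 has no counterpart in the second image.
truth = {"A1": "B2", "A2": "B1", "A3": None}
model_output = "A1 -> B2\nA2 -> B3\nA3 -> none"
pred = parse_mapping(model_output)
score = matching_f1(pred, truth)
print(pred, score)
```

Under this scoring, a wrong target counts against both precision and recall, while correctly leaving an unmatched point unmatched avoids a false positive. That is why the matching is described as partial.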

How the Data Is Built
The paper’s data-generation pipeline addresses the practical obstacle that manually annotating reliable wide-baseline correspondences is expensive and brittle. The authors automatically harvest image pairs and verified correspondences from large-scale video-3D resources, including RGB-D datasets such as CO3D, uCO3D, and ScanNet, as well as RGB video collections with Structure-from-Motion reconstructions such as RealEstate10k and DL3DV. For each sample, the pipeline constructs matchable point sets and distractor point sets, producing examples where some points have valid cross-view correspondences and others should remain unmatched because of occlusion or limited overlap. This design provides diverse supervision while preserving verifiability through underlying 3D geometry and reconstruction constraints. The implication is that WBM can be scaled into a training and evaluation signal for MLLMs without relying entirely on hand-written rationales or fragile synthetic scenes.
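One plausible way to separate matchable points from distractors with RGB-D data is a depth-consistency check: lift a pixel to 3D using its depth, move it into the second camera's frame, and compare its predicted depth with the depth actually observed there. The sketch below illustrates that general idea only. It is not the authors' pipeline, and the function names, the shared intrinsics, and the 5% tolerance are all assumptions.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(point, R, t):
    """Apply a rigid transform X' = R * X + t."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i]
                 for i in range(3))


def classify_point(u, v, depth_a, depth_map_b, intrinsics, R_ab, t_ab, tol=0.05):
    """Label a view-A pixel as ('match', (u_b, v_b)) or ('distractor', reason).

    Assumes both views share the same intrinsics (fx, fy, cx, cy) and that
    depth_map_b is a row-major list of lists with 0 marking missing depth.
    """
    fx, fy, cx, cy = intrinsics
    x, y, z = transform(backproject(u, v, depth_a, fx, fy, cx, cy), R_ab, t_ab)
    if z <= 0:
        return ("distractor", "behind camera")
    u_b, v_b = fx * x / z + cx, fy * y / z + cy
    col, row = int(round(u_b)), int(round(v_b))
    if not (0 <= row < len(depth_map_b) and 0 <= col < len(depth_map_b[0])):
        return ("distractor", "outside view")
    observed = depth_map_b[row][col]
    if observed <= 0:
        return ("distractor", "no depth")
    if abs(observed - z) > tol * z:
        # Another surface is seen at this pixel, e.g. an occluder in front.
        return ("distractor", "depth mismatch")
    return ("match", (u_b, v_b))


# Toy 4x4 setup: identity rotation, camera B shifted by one unit along x.
intrinsics = (2.0, 2.0, 2.0, 2.0)
R_ab = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t_ab = [1, 0, 0]
visible_b = [[2.0] * 4 for _ in range(4)]   # surface visible at depth 2
occluded_b = [[1.0] * 4 for _ in range(4)]  # a closer surface blocks the view

print(classify_point(1, 2, 2.0, visible_b, intrinsics, R_ab, t_ab))
print(classify_point(1, 2, 2.0, occluded_b, intrinsics, R_ab, t_ab))
print(classify_point(3, 2, 2.0, visible_b, intrinsics, R_ab, t_ab))
```

For the Structure-from-Motion sources, where no dense depth map is available, verification would have to rely on reconstructed 3D points and their track visibility instead. This sketch does not cover that case.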

DCRL: Train With Verifiable Rewards
Dynamic Correspondence Reinforcement Learning (DCRL) is the paper’s training framework for improving WBM ability through reinforcement learning with verifiable rewards. Instead of requiring explicit chain-of-thought supervision, DCRL rewards the model according to whether its predicted correspondence mapping matches the ground-truth mapping. The method combines Image-Level Viewpoint Progression, which gradually increases view difficulty, with Point-Level Correspondence Curriculum, which adapts the granularity and hardness of correspondence decisions. Experiments reported in the paper show that DCRL raises ReasonMatch-Bench performance to 70.5 F1, outperforming the open-source and closed-source baselines the paper compares against, including GPT-5-mini at 57.9 and Gemini-2.5-Pro at 42.8. The authors also report transfer gains on related spatial benchmarks, with improvements of 5.27 percentage points on OmniSpatial and 3.51 percentage points on MindCube, while general visual understanding is maintained and improves modestly on several benchmarks.
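The two ingredients described above, a reward that can be checked against ground truth and a curriculum that raises viewpoint difficulty, can be sketched as follows. This summary does not give the paper's actual reward function or schedule, so everything here is hypothetical: the per-point accuracy reward, the `ViewpointCurriculum` class, and its threshold and window values are illustrative assumptions rather than DCRL itself.

```python
from collections import deque


def correspondence_reward(pred, truth):
    """Fraction of ground-truth points answered correctly, in [0, 1].

    A point counts as correct when the predicted target equals the truth,
    including correctly predicting None for an unmatched point.
    """
    if not truth:
        return 0.0
    correct = sum(1 for key, target in truth.items()
                  if key in pred and pred[key] == target)
    return correct / len(truth)


class ViewpointCurriculum:
    """Advance to a harder viewpoint level once recent rewards are high enough."""

    def __init__(self, levels, threshold=0.7, window=4):
        self.levels = list(levels)
        self.threshold = threshold
        self.history = deque(maxlen=window)
        self.index = 0

    @property
    def current(self):
        return self.levels[self.index]

    def record(self, reward):
        """Store one rollout reward and move up a level if the window passes."""
        self.history.append(reward)
        window_full = len(self.history) == self.history.maxlen
        mean_reward = sum(self.history) / len(self.history)
        is_last_level = self.index == len(self.levels) - 1
        if window_full and mean_reward >= self.threshold and not is_last_level:
            self.index += 1
            self.history.clear()  # start fresh statistics at the new level
        return self.current


truth = {"A1": "B2", "A2": None, "A3": "B1"}
pred = {"A1": "B2", "A2": None, "A3": "B3"}
reward = correspondence_reward(pred, truth)

curriculum = ViewpointCurriculum(["small", "medium", "large"],
                                 threshold=0.5, window=2)
first = curriculum.record(reward)
second = curriculum.record(reward)
print(reward, first, second)
```

Because the reward depends only on the final mapping, no reasoning trace needs to be annotated. The point-level curriculum could be layered on in a similar way, for example by varying how many distractor points each sample contains.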
