ReadPaper Blog
VLM3: Vision Language Models Are Native 3D Learners
The paper argues that standard vision-language models can learn fine-grained 3D understanding without specialized 3D architectures, regression heads, heavy augmentations, or task-specific losses. VLM3 uses focal-length unification, normalized text-based pixel and region references, and mixed-data scaling to train a standard VLM across depth estimation, pixel correspondence, camera pose estimation, and object-level 3D understanding. The reported results suggest that simple text-based supervised fine-tuning can match or surpass several expert 3D vision systems while preserving a general VLM architecture.
Source: VLM3: Vision Language Models Are Native 3D Learners

Mission Briefing: Can standard VLMs learn 3D?
VLM3 addresses a central gap in 3D vision: although vision-language models have become strong general-purpose systems for semantic visual understanding, fine-grained 3D tasks still commonly depend on expert models with specialized encoders, decoders, augmentations, and losses. The paper asks whether a standard VLM, without task-specific architectural changes, can match expert 3D systems on diverse tasks beyond coarse object-level reasoning. Its motivation builds on DepthLM, which showed that VLMs can learn pixel-level metric depth, and extends that question to object-level 3D understanding, metric depth estimation, pixel correspondence, and camera pose estimation. The authors argue that many assumptions inherited from expert 3D vision, including regression formulations and carefully balanced multi-loss training, are not necessary conditions for effective 3D learning. This reframes 3D understanding as a capability that can emerge from standard VLM training when the input geometry and textual references are made learnable at scale.

The Hidden Trick: Three simple ingredients
The core method in VLM3 is deliberately simple: the model unifies focal length by resizing input images so that the focal length is 1000 pixels, uses normalized text-based references for pixels or regions, and relies on data mixture and scaling. Focal-length unification targets camera ambiguity, allowing images from different cameras and datasets to be mixed in training without adding a separate camera-intrinsics module to the VLM. For images lacking intrinsics, the paper uses a pretrained single-image calibration model to estimate camera parameters before applying the same focal-length normalization. The text-based reference scheme normalizes both horizontal and vertical coordinates to the range [0, 2000), enabling prompts such as querying the depth of a pixel by its normalized coordinate rather than rendering a marker into the image. This design removes the visual prompting used by DepthLM and keeps training compatible with standard text-based supervised fine-tuning, with Qwen3-VL-4B used as the stated base model unless otherwise specified.
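The two normalization steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the single shared focal length (fx equal to fy), the floor-based binning, and the prompt wording are all assumptions made for illustration. It relies only on the fact that resizing an image by a factor s also scales its focal length in pixels by s.

```python
def unify_focal_length(width, height, focal_px, target_focal=1000.0):
    """Return (new_width, new_height, scale) so the resized image has
    a focal length of target_focal pixels.

    Assumes a single focal length in pixels (fx == fy); this helper is
    hypothetical and not taken from the paper.
    """
    if focal_px <= 0:
        raise ValueError("focal_px must be positive")
    scale = target_focal / focal_px
    return round(width * scale), round(height * scale), scale


def normalize_pixel(x, y, width, height, bins=2000):
    """Map a pixel location to integer coordinates in [0, bins) on each axis.

    Floor-based binning is an assumption; the source only states the range.
    """
    nx = min(int(x / width * bins), bins - 1)
    ny = min(int(y / height * bins), bins - 1)
    return nx, ny


def depth_prompt(x, y, width, height):
    """Build a text-only depth query; the wording is illustrative only."""
    nx, ny = normalize_pixel(x, y, width, height)
    return f"What is the depth in meters of the pixel at ({nx}, {ny})?"


# A 1920x1080 image with a 1500-pixel focal length is shrunk by 2/3.
new_w, new_h, scale = unify_focal_length(1920, 1080, 1500.0)
prompt = depth_prompt(960, 540, 1920, 1080)
```

Because the coordinates are normalized by image size, the same query text refers to the same relative location before and after the focal-length resize, which is what allows the pixel reference to live entirely in the prompt.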

Not just depth: four 3D battles
The paper emphasizes that VLM3 is not merely a depth-estimation variant, but a unified study of multiple 3D problem types under the same VLM-compatible design. For object-level 3D understanding, the model handles spatial questions about objects without relying on extra object-reference encoders of the kind used in SpatialRGPT-style systems. For metric depth estimation, it predicts camera-grounded depth from 2D images using text-formatted outputs rather than a conventional dense regression head. For pixel correspondence, the normalized coordinate representation is especially important because both inputs and outputs may need to refer to precise image locations in text. For camera pose estimation, the paper tests whether the same standard VLM training framework can recover multi-view geometric relationships that expert systems often approach through multi-task geometric supervision.
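For tasks such as pixel correspondence, a text answer has to be turned back into an image location. The sketch below shows one way this could be done, assuming the model replies with a point written as "(x, y)" in the normalized [0, 2000) range. The answer format, the function name, and the choice to map each bin to its center are assumptions, not details from the paper.

```python
import re


def parse_normalized_point(text, width, height, bins=2000):
    """Extract the first "(x, y)" pair from a model reply and convert it
    to pixel coordinates, or return None if no valid pair is found.

    The reply format and the bin-center convention are hypothetical.
    """
    match = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", text)
    if match is None:
        return None
    nx, ny = int(match.group(1)), int(match.group(2))
    if not (0 <= nx < bins and 0 <= ny < bins):
        return None  # outside the normalized range, treat as invalid
    # Map the center of each bin back to pixel space.
    px = (nx + 0.5) * width / bins
    py = (ny + 0.5) * height / bins
    return px, py


point = parse_normalized_point("The matching pixel is (1000, 500).", 2000, 1000)
```

Returning None for malformed or out-of-range answers keeps parsing failures visible to the caller instead of silently producing a location.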

The proof in battle
The reported experimental evidence supports the paper’s claim that VLMs can become competitive 3D learners with minimal redesign. On object-level 3D understanding, VLM3-4B is reported to improve over SpatialRGPT-8B on SpatialRGPT-Bench while removing the need for extra encoders. On depth estimation, VLM3-4B improves over the previous VLM-based DepthLM-7B result from 0.84 to 0.9, with evaluation averaged across NuScenes, ETH3D, SUNRGBD, and iBims1, and the paper states that this matches UnidepthV2 accuracy. On pixel correspondence, VLM3 reduces the end-point error of the base VLM by 10× and is reported to outperform expert correspondence models such as DKM and RoMa. On camera pose estimation, the method raises AUC30 for the base VLM from 5% to 94%, surpassing VGGT and matching DA3-Giant according to the paper’s summary of results.
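The end-point error used for pixel correspondence is commonly defined as the mean Euclidean distance between predicted and ground-truth pixel locations. The sketch below follows that common definition; the paper's exact evaluation protocol (resolution, masking, averaging across datasets) is not given here, so this should be read as an illustration rather than the authors' metric code.

```python
import math


def end_point_error(predicted, ground_truth):
    """Mean Euclidean distance, in pixels, between paired 2D points.

    Both arguments are equal-length sequences of (x, y) tuples.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )
    return total / len(predicted)


# One prediction is off by 5 pixels and one is exact, so the mean is 2.5.
epe = end_point_error([(3.0, 4.0), (10.0, 10.0)], [(0.0, 0.0), (10.0, 10.0)])
```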

Final scroll: simple, scalable, and 3D-native
The main implication of VLM3 is that the boundary between generalist vision-language modeling and expert 3D vision may be less rigid than prior system designs suggest. The paper argues that architecture changes, large model size, heavy photometric and geometric augmentation, specialized decoders, and regression losses are not prerequisites for strong 3D performance when focal length, coordinate reference, and data scaling are handled carefully. Its strongest conceptual claim is that treating 3D inputs and outputs as text can be sufficient for accurate object-level, pixel-level, single-view, and multi-view 3D tasks. The work therefore proposes a simpler paradigm for scalable 3D learning: keep the VLM architecture standard, normalize the geometry into a learnable textual interface, and train across a mixture of 3D tasks. If the findings generalize, VLM3 suggests that future foundation models may absorb more geometric capability through unified prompting and supervised fine-tuning rather than through increasingly specialized 3D modules.
