ReadPaper Blog
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
SpatialWorld introduces a benchmark for testing whether multimodal large language model agents can solve real-world spatial tasks through interactive, vision-only exploration rather than passive image or video understanding. The paper unifies eight heterogeneous simulation backends under a shared text-action protocol and evaluates 760 human-annotated tasks, showing that even advanced agents such as GPT-5 and Qwen-3.5 remain far from reliable at active spatial reasoning.
Source: SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SPATIALWORLD
The paper addresses a central weakness in current evaluations of multimodal large language models: many spatial reasoning benchmarks measure recognition from static images, Visual Question Answering, or prerecorded videos rather than interactive task solving. SpatialWorld reframes spatial understanding as the ability to perceive, navigate, update beliefs, and act in partially observable 3D environments. The authors argue that real-world spatial competence cannot be captured by a single view because agents must gather egocentric visual evidence over time and make closed-loop decisions. Their benchmark therefore tests agents under vision-only observations and asks them to issue high-level text actions that are natural for MLLM-based policies. This matters because an agent that can describe a room or infer an object relation may still fail when it must explore, plan, and complete a task across multiple steps.

Why old benchmarks fall short
SpatialWorld formalizes each task as a vision-only partially observable Markov decision process, with state, observation, action, transition, observation function, and reward components captured by the tuple ⟨S,O,A,T,Ω,R⟩. At each step, the agent receives a natural-language goal and a raw egocentric RGB observation, without depth, global maps, object coordinates, semantic metadata, or other privileged state signals. The policy conditions on the trajectory history of observations and actions, then predicts the next high-level action through a text-based interface. This formulation is designed to force active spatial inference: an agent must move, look, interact, and revise its internal understanding as new visual evidence arrives. By emphasizing partial observability and multi-turn decision-making, the benchmark targets the gap between passive spatial recognition and embodied spatial competence.

The benchmark idea
The main methodological contribution of the paper is SpatialWorld’s simulator-agnostic benchmark design, which wraps diverse environments into a unified observation-action-verification framework. The benchmark contains 760 human-annotated tasks spanning domains such as household routines, work and study, entertainment, travel, social collaboration, and digital spatial games. These tasks are instantiated across eight simulation backends, including AI2-THOR, ProcTHOR, VirtualHome, CARLA, EmbodiedCity, multi-agent variants, and lightweight 3D game environments. The shared protocol abstracts away simulator-specific action spaces, sensor assumptions, and execution pipelines so that evaluation focuses on general interactive spatial reasoning rather than adaptation to one platform. This cross-platform design lets the paper compare agent behavior across complementary forms of 3D reasoning while preserving a consistent interface for MLLMs.

How tasks are judged
The paper also emphasizes reliable task construction and evaluation rather than relying only on informal success judgments. Each SpatialWorld task includes a human-validated initial state, a reference trajectory, and a task-specific terminal-state verifier. The construction pipeline requires annotators to learn environment tutorials, write task instructions, define success conditions, validate execution in virtual environments, and cross-check data quality with other annotators. Evaluation measures whether the final state satisfies the task requirements and also considers execution efficiency through comparisons such as the ratio of golden steps to actual agent steps. This design allows SpatialWorld to distinguish an agent that eventually reaches a goal through redundant exploration from one that solves the task with efficient, interpretable spatial planning.

What the experiments found
The experiments evaluate fifteen advanced multimodal agents from proprietary and open-source model families and find that robust interactive spatial reasoning remains unsolved. Across the full benchmark, GPT-5 achieves the highest reported average task success rate at 17.4%, while the strongest open-source model, Qwen-3.5-397B-A17B, reaches 14.1%. The paper reports a mismatch between task success and execution efficiency, indicating that higher success does not necessarily mean more economical exploration or better long-horizon control. It also finds substantial domain-specific variation: GPT-5 leads in daily household, travel, and social collaboration tasks, Qwen-3.5-397B-A17B ties GPT-5 in Work & Study and leads physical entertainment, and Gemini-3.1-Pro performs best in digital games. These results position SpatialWorld as a diagnostic testbed for bottlenecks in active exploration, spatial belief updating, long-horizon planning, and action execution.
