ReadPaper Blog
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Embodied-R1.5 proposes a unified Embodied Foundation Model for physical intelligence, addressing the gap between visual understanding and reliable action in the real world. The paper combines spatial cognition, task planning and correction, and embodied pointing/location in a single 8B-parameter model trained with large-scale automated data construction and multi-task balanced reinforcement learning. Its importance lies in showing that stronger embodied reasoning can improve benchmark performance, enable closed-loop autonomy, and reduce the amount of action data needed to adapt the model into a vision-language-action system.
Source: Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Why embodied AI still trips over the real world
The paper motivates Embodied-R1.5 from a central limitation of current embodied AI: models can often perceive scenes, but they do not consistently connect perception to physically grounded reasoning and action. The authors argue that general-purpose physical intelligence requires an Embodied Foundation Model that unifies perception, reasoning, and execution rather than splitting them across specialized systems. They identify three bottlenecks in prior work: fragmented capabilities, multi-task conflict across heterogeneous outputs, and limited validation of closed-loop autonomy on long-horizon tasks. Embodied-R1.5 is presented as a step beyond the earlier Embodied-R1 paradigm, moving from a pointing-focused model toward a comprehensive EFM. The paper’s core claim is that a single architecture can internalize spatial cognition, planning and correction, and actionable grounding strongly enough to support real-world robotic generalization.

Three missing skills in one body
The paper organizes embodied reasoning into three capability dimensions that form a progression from understanding the world to acting in it. Embodied cognition and spatial reasoning cover semantic scene understanding, spatial relations, metric reasoning, 3D scene perception, and object or scene cognition. Task planning and correction cover decomposition of high-level instructions, next-step planning, process detection, error localization, and guidance for correction when execution fails. Embodied pointing and location translate abstract reasoning into coordinates, regions, grasp points, and visual traces that are closer to robot action. The authors emphasize that these capabilities are complementary rather than independent, because physical tasks require a model to recognize scene structure, decide what should happen next, and ground that decision in actionable locations or trajectories.

How Embodied-R1.5 is built
Embodied-R1.5 is built as a single 8B-parameter embodied foundation model supported by a large-scale training system exceeding 15B tokens. The paper describes three automated data construction pipelines designed to broaden coverage over critical embodied capabilities, including cognition and spatial reasoning, planning and correction, and pointing or location tasks. To train across such heterogeneous targets, the authors use a two-stage training strategy with supervised fine-tuning followed by reinforced fine-tuning. A key methodological contribution is the multi-task balanced RL recipe, which is designed to reduce conflicts among tasks with very different output formats, such as long-form reasoning, trajectory prediction, and grounding. The paper frames this recipe as necessary for unified learning, because naïvely mixing embodied tasks can cause capabilities to interfere with one another rather than reinforce one another.

The closed loop that watches itself
The paper introduces the Planner-Grounder-Corrector framework as a closed-loop autonomy mechanism for long-horizon embodied tasks. In this framework, the same model is used to plan task steps, ground those steps into actionable spatial outputs, and detect or correct execution errors as the task unfolds. This design directly targets the paper’s criticism that many embodied models remain confined to Embodied QA and do not demonstrate physically grounded decision-making over extended execution. The authors describe PGC as enabling a single 8B model to drive the full autonomy stack without requiring separate specialist models for planning, grounding, and correction. Reported examples of long-horizon real-world tasks include making milk tea, sweeping garbage, and stacking cups, which the paper uses to support its claim that unified embodied reasoning can sustain autonomous progress under real-world feedback.

What the paper claims it can do
The paper reports that Embodied-R1.5 achieves state-of-the-art results on 16 of 24 embodied VLM benchmarks and an average score of 70.4% across 21 main accuracy-based benchmarks. The authors state that the 8B model surpasses Gemini-Robotics-ER-1.5 and GPT-5.4 by 17.0% and 21.7%, respectively, on the reported embodied benchmark average. The paper further argues that because Embodied-R1.5 internalizes embodied reasoning upstream, it can be adapted into Embodied-R1.5-VLA with only a small amount of action data rather than large-scale action pretraining. In simulation, the resulting VLA is reported to outperform strong baselines including π0.5 across four robotic manipulation benchmark suites and to exceed ManipLLM by 11% on PartNet-Mobility. The paper also reports zero-shot real-robot experiments in instruction following, affordance grounding, articulated object manipulation, contact-rich interaction, and long-horizon tasks, reinforcing its broader implication that reasoning-centered EFMs may reduce dependence on massive robot action datasets.
