ReadPaper Blog
Robots Need More Than VLAs and World Models
“Robots Need More Than VLAs and World Models” argues that generalist robot intelligence cannot be achieved simply by collecting more robot demonstrations and scaling Vision-Language-Action (VLA) models. The position paper identifies the deeper bottleneck as grounding: converting abundant but unstructured physical behavior from human motion, internet video, simulation, and deployment into robot-usable actions, task semantics, contacts, object states, goals, and rewards. Its proposed research agenda centers on four missing interfaces for data, embodiment, world models, and rewards, reframing robot learning as a grounding-centric pipeline rather than a pure policy-scaling exercise.
Source: Robots Need More Than VLAs and World Models

Not Just Bigger Robot Brains
The paper’s central claim is that robotics is entering a foundation-model era without yet having the equivalent of the internet-scale supervision that transformed language and vision. It challenges the common assumption that broader robot intelligence will emerge mainly from larger Vision-Language-Action models trained on more robot demonstrations. The authors argue that the world already contains vast behavioral evidence in human demonstrations, household activity, factory workflows, internet video, simulation rollouts, and interactive robot trials, but this evidence is rarely expressed in a form that a robot policy can directly learn from. Unlike text or images, physical experience is tied to embodiment, action spaces, contact dynamics, safety constraints, and task-specific success conditions. The implication is that the key bottleneck is not only policy architecture or dataset size, but the lack of mechanisms that turn unstructured physical experience into grounded robot supervision.

Why the Usual Pipeline Breaks
The paper explains that physical data is rich in task structure but poor in robot-native labels. A video of a person manipulating an object may reveal goals, contact events, object motion, task phases, failures, and success, yet it usually lacks the joint commands, end-effector displacements, gripper states, force signals, or reward values needed to train a particular robot. The same problem appears across internet videos, human motion traces, factory workflows, household activity, and simulation rollouts: they contain meaningful behavioral information, but not necessarily the embodiment-specific actions and task semantics required by imitation learning or reinforcement learning. The authors use the term robot-native supervision for data already represented in a robot-learning coordinate system, such as trajectories paired with observations, actions, task labels, language instructions, or success signals. Their analysis shows why standard pipelines remain expensive and hard to scale: every useful trajectory must be physically executable, aligned with a particular body, and grounded in a task definition.
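As a rough illustration of what robot-native supervision means in practice, the sketch below models an episode as observations paired with optional actions, a language instruction, and a success signal, then reports which fields are absent. The names `Step`, `Episode`, and `missing_supervision` are hypothetical and do not come from the paper; the numbers are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    observation: Sequence[float]              # e.g. image features or proprioception
    action: Optional[Sequence[float]] = None  # robot command; absent in raw human video


@dataclass
class Episode:
    steps: list[Step]
    instruction: Optional[str] = None  # language label for the task
    success: Optional[bool] = None     # outcome signal


def missing_supervision(episode: Episode) -> list[str]:
    """List the robot-native fields this episode still lacks."""
    missing = []
    if not episode.steps or any(s.action is None for s in episode.steps):
        missing.append("actions")
    if episode.instruction is None:
        missing.append("instruction")
    if episode.success is None:
        missing.append("success")
    return missing


# A teleoperated robot episode: every field is already present.
teleop = Episode(
    steps=[Step([0.0, 0.1], action=[0.02]), Step([0.0, 0.2], action=[0.01])],
    instruction="open the drawer",
    success=True,
)

# A clip of a person doing the same task: observations only.
human_video = Episode(steps=[Step([0.3, 0.4]), Step([0.3, 0.5])])

print(missing_supervision(teleop))
print(missing_supervision(human_video))
```

In this toy framing, the teleoperated episode is ready for imitation learning, while the human clip still needs its actions, instruction, and outcome to be recovered before a policy could train on it.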

Four Missing Interfaces
The paper proposes four missing interfaces as the core framework for moving from physical experience to physical intelligence. A data interface would autolabel unstructured behavior by extracting robot-relevant information such as contacts, object states, task phases, goals, and outcomes from raw video, motion, tactile, language, or interaction data. An embodiment interface would retarget behavior across bodies, making human motion or trajectories from one robot usable for another robot with different kinematics, sensors, and action spaces. A world-model interface would provide physics-grounded 3D reasoning, allowing robots to predict physical consequences rather than relying only on surface-level visual or linguistic patterns. A reward interface would infer task progress and success from video, language, and deployment feedback, giving policies usable learning signals even when explicit reward functions are unavailable. Together, these interfaces define a larger robotics stack in which VLAs are downstream policy components that depend on upstream grounding of data, embodiment, dynamics, and rewards.
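One way to picture how the four interfaces could compose is as a set of small typed contracts feeding a single grounding step. The sketch below is an assumption-laden toy, not the authors' design: the Protocol names, method signatures, the one-dimensional state, and the `Toy*` stand-ins are all hypothetical and chosen only to keep the example runnable.

```python
from typing import Protocol


class DataInterface(Protocol):
    def autolabel(self, frames: list[dict]) -> list[int]: ...


class EmbodimentInterface(Protocol):
    def retarget(self, human_xs: list[float]) -> list[float]: ...


class WorldModelInterface(Protocol):
    def predict(self, state: float, action: float) -> float: ...


class RewardInterface(Protocol):
    def progress(self, state: float, goal: float) -> float: ...


class ToyData:
    def autolabel(self, frames):
        # Indices of frames where the hand touches the object.
        return [i for i, f in enumerate(frames) if f["touching"]]


class ToyBody:
    def __init__(self, scale: float):
        self.scale = scale  # assumed ratio of robot reach to human reach

    def retarget(self, human_xs):
        # Turn human hand displacements into scaled robot displacements.
        return [(b - a) * self.scale for a, b in zip(human_xs, human_xs[1:])]


class ToyWorld:
    def predict(self, state, action):
        return state + action  # trivial 1-D dynamics stand-in


class ToyReward:
    def progress(self, state, goal):
        return max(0.0, 1.0 - abs(goal - state))


def ground(frames: list[dict], goal: float, data: DataInterface,
           body: EmbodimentInterface, world: WorldModelInterface,
           reward: RewardInterface) -> dict:
    """Convert a raw human clip into (state, action, reward) samples."""
    contacts = data.autolabel(frames)
    actions = body.retarget([f["hand_x"] for f in frames])
    state, samples = 0.0, []
    for action in actions:
        next_state = world.predict(state, action)
        samples.append((state, action, reward.progress(next_state, goal)))
        state = next_state
    return {"contacts": contacts, "samples": samples}


frames = [
    {"hand_x": 0.0, "touching": False},
    {"hand_x": 0.5, "touching": True},
    {"hand_x": 1.0, "touching": True},
]
grounded = ground(frames, goal=0.5, data=ToyData(), body=ToyBody(scale=0.5),
                  world=ToyWorld(), reward=ToyReward())
print(grounded)
```

The point of the sketch is the ordering: a downstream policy would only ever see the `samples`, so everything it can learn depends on what the four upstream components managed to recover from the raw clip.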

Evidence from Recent Robotics
The paper supports its position by surveying recent progress in robot-native datasets, generalist policies, cross-embodiment training, video learning, simulation, and world models. It highlights datasets and cross-embodiment efforts such as RoboNet, BridgeData V2, DROID, RH20T, and Open X-Embodiment (with its RT-X models) as evidence that robot learning improves when models see more tasks, objects, environments, and embodiments. It also discusses policy systems including BC-Z, RT-1, RT-2, SayCan, PaLM-E, Octo, RoboCat, Dobb-E, and Diffusion Policy as examples of increasingly general robot-learning machinery built on grounded trajectories and labels. The paper’s interpretation of this evidence is deliberately cautious: these systems show the power of scaling robot-native supervision, but they also reveal how dependent current progress remains on curated demonstrations, explicit actions, language labels, affordances, and embodiment-specific control spaces. Work on human video, simulation, learned simulators, and action-conditioned world models expands the supervision frontier, but the authors argue that these sources become truly useful only when their physical consequences and task meanings are grounded for robots.

Takeaway: Build a Grounding-Centric Stack
The paper’s main takeaway is that future robot learning should become grounding-centric rather than merely robot-data-centric. In the robot-data-centric pipeline, researchers collect demonstrations, attach task labels or language, train a policy, evaluate on hardware, and repeat; in the grounding-centric pipeline, broad physical experience is first transformed into robot-usable quantities such as actions, contacts, object states, task phases, goals, and rewards. The authors do not dismiss Vision-Language-Action models or world models, but position them as parts of a larger physical-intelligence stack rather than complete solutions. Their research agenda implies that progress will depend on interfaces that make non-robot data usable, transfer behavior across embodiments, preserve physical consequences in prediction, and close the loop between reward inference and deployment. By reframing the problem this way, the paper argues that generalist robots will learn not just from explicitly collected robot demonstrations, but from the broader physical world.
