ReadPaper Blog
Cosmos 3: Omnimodal World Models for Physical AI
Cosmos 3 presents a family of omnimodal world models for Physical AI that can jointly process and generate language, image, video, audio, and action sequences. The paper addresses the fragmentation of embodied-AI systems into separate perception, simulation, generation, and control models by proposing a unified mixture-of-transformers architecture with flexible input-output configurations. Its reported evaluations, open checkpoints, synthetic datasets, and Cosmos-HumanEval benchmark position omnimodal world models as scalable backbones for embodied agents.
Source: Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3: One Robot Brain for Many Modalities
The central claim of the paper is that Physical AI needs world models that can both understand and generate across the modalities that embodied agents actually use. Cosmos 3 is introduced as a family of omnimodal world models that jointly handle language, images, video, audio, and action sequences within one unified framework. The paper frames this as more than multimodal perception: the same model family is designed to support generation and prediction, allowing an agent-facing system to reason about observations, plausible futures, and actions. This matters because physical agents operate in environments where visual dynamics, sound, instructions, and motor commands are coupled rather than separable. By using a unified mixture-of-transformers architecture, Cosmos 3 aims to make these coupled signals available to a shared model backbone rather than distributing them across disconnected specialist systems.

Why This Matters for Physical AI
The motivation section argues that current Physical AI pipelines are limited because they often separate understanding models from generative world simulators and action-prediction models. The paper explicitly contrasts Cosmos 3 with vision-language models, video generation models, forward dynamics models, vision-language-action models, and world-action models, treating those categories as partial solutions to one larger embodied-intelligence problem. Its thesis is that understanding depends on anticipating how the world may evolve, while generation depends on structured representations of scenes, dynamics, and agent behavior. Cosmos 3 therefore attempts to subsume these roles into a single general-purpose backbone for embodied agents. The implication is that future robot, driving, warehouse, and digital-human systems may benefit from a model that can connect perception, prediction, simulation, and action in one representational space.

How the Trick Works
The technical approach described in the paper centers on a mixture-of-transformers design that supports flexible input-output configurations across modalities. Cosmos 3 includes modality-specific encoders for image and video, audio, and action, then arranges tokens so that the model can condition on one set of modalities and generate another. The architecture section highlights token arrangement, generation mode, a dual-tower layer structure, dual-stream joint attention, and multimodal position embeddings as mechanisms for making heterogeneous signals work inside one transformer family. The paper also discusses position index allocation and absolute temporal modulation, which are important for modalities such as video, audio, and action where ordering and timing shape meaning. This design lets Cosmos 3 act as a reasoner, generator, world simulator, or policy model depending on the requested input-output pattern rather than requiring a separate architecture for each role.
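To make the idea of flexible input-output configurations concrete, here is a minimal Python sketch; it is not the paper's implementation. The names IOConfig, ROLES, and arrange_tokens, the modality pairings chosen for each role, and the flat running position index are all assumptions made for illustration. The paper's actual token arrangement, position index allocation, and multimodal position embeddings are more elaborate than this.

```python
from dataclasses import dataclass

# Modalities named in the paper summary.
MODALITIES = ("text", "image", "video", "audio", "action")


@dataclass(frozen=True)
class IOConfig:
    """Hypothetical description of one input-output pattern."""
    condition: tuple  # modalities the model reads as context
    generate: tuple   # modalities the model is asked to produce


# Illustrative role table. The pairings are assumptions, not taken from the paper.
ROLES = {
    "reasoner": IOConfig(condition=("video", "text"), generate=("text",)),
    "generator": IOConfig(condition=("text",), generate=("video", "audio")),
    "world_simulator": IOConfig(condition=("video", "action"), generate=("video",)),
    "policy": IOConfig(condition=("video", "text"), generate=("action",)),
}


def arrange_tokens(config, token_counts):
    """Lay out conditioning tokens first, then slots to be generated.

    Returns a list of (modality, mode, position) tuples. The position is a
    simple running index (an assumption standing in for the paper's
    position index allocation).
    """
    sequence = []
    position = 0
    for mode, modalities in (("condition", config.condition),
                             ("generate", config.generate)):
        for modality in modalities:
            if modality not in MODALITIES:
                raise ValueError(f"unknown modality: {modality}")
            for _ in range(token_counts.get(modality, 0)):
                sequence.append((modality, mode, position))
                position += 1
    return sequence
```

For instance, arrange_tokens(ROLES["policy"], {"video": 3, "text": 2, "action": 2}) yields seven entries: five conditioning tokens followed by two action slots to be generated. Switching the role changes only the configuration, not the function, which mirrors the summary's point that one backbone can serve several roles.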

Proof It Works
The paper reports that Cosmos 3 reaches state-of-the-art performance across a diverse suite of understanding and generation tasks, although this summary does not reproduce the detailed scores. Its evaluation section is organized around reasoner evaluation and generator evaluation, with separate attention to image generation, video generation, audio generation, transfer generation, and action generation. The abstract states that post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis at the time the technical report was written. It also states that a Cosmos 3 policy model was ranked as the best policy model by RoboArena at that time. The breadth of these evaluations supports the paper’s argument that omnimodal world models can serve as scalable general-purpose foundations rather than narrow single-task models.

Open Release for Physical AI Builders
A major practical contribution of the paper is the open release of code, model checkpoints, curated synthetic datasets, and evaluation assets for Physical AI research. The release includes Cosmos code and framework repositories, along with open-weight checkpoints such as Cosmos3-Super, Cosmos3-Nano, Cosmos3-Super-Text2Image, Cosmos3-Super-Image2Video, and Cosmos3-Nano-Policy-DROID. The paper also names several synthetic data resources, including SDG-PhyxSim, SDG-RobotSim, SDG-DriveSim, SDG-SynHuman, and SDG-Warehouse, which target physical interaction, embodied robots, autonomous driving, digital humans, and warehouse operations, respectively. Its benchmark release includes Cosmos-HumanEval, also referred to as Cosmos-HUE, for evaluating human-centered omnimodal capabilities. By placing these resources under the Linux Foundation’s OpenMDW-1.1 License, the paper aims to lower the barrier to reproducing, extending, and deploying omnimodal world models in Physical AI.
