ReadPaper Blog
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
Humanoid-GPT addresses the agility–generalization trade-off in humanoid motion tracking by scaling both the training corpus and the policy architecture. The paper introduces a GPT-style causal Transformer trained through expert distillation on a 2B-frame retargeted motion corpus, showing that large-scale data, causal sequence modeling, and diversity-balanced sampling can produce stronger zero-shot whole-body control.
Source: Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

The Problem: Motion Brain Too Small
The paper frames humanoid motion tracking as a generalization problem for embodied agents: a humanoid should robustly execute whole-body behaviors under unseen tasks, styles, and motion distributions. It describes prior trackers as typically shallow MLP policies trained on small motion corpora, often on the order of millions of frames rather than billions. The authors argue that these systems face an agility–generalization trade-off: methods such as BeyondMimic and ASAP emphasize agile in-domain tracking, while approaches such as TWIST and UniTracker generalize more broadly but struggle with highly dynamic actions. Humanoid-GPT claims this trade-off is not fundamental and instead arises from insufficient scale and from training designs that do not match the structure of online control. The central contribution is therefore a scaled tracker that combines a much larger motion corpus, a causal Transformer architecture, and a training recipe intended to preserve both dynamic precision and zero-shot transfer.

The Scale Move
The paper’s data strategy moves humanoid tracking into a new scale regime by aggregating public and private motion sources into a unified corpus. The authors combine datasets including AMASS, LAFAN1, Motion-X++, PHUMA, and MotionMillion with large-scale in-house recordings, then apply filtering, segmentation, augmentation, and retargeting. The resulting corpus is described as approximately 2B G1-retargeted frames, or tokens, which is more than 200 times larger than the prior tracker training sets cited in the paper. Retargeting maps human motion into the 29-DoF joint space of the Unitree-G1 humanoid, while filtering removes motions with explicit object or environment interactions such as sitting on chairs, swimming, or stair climbing. The authors also apply time-warping augmentation, accelerating and decelerating sequences to increase temporal variability and improve robustness to different motion speeds.
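
To make the time-warping step concrete, here is a minimal sketch of how a retargeted clip could be sped up or slowed down by resampling. The function name `time_warp`, the `speed` parameter, and the use of linear interpolation are assumptions for illustration; the paper's actual augmentation code is not described in this summary.

```python
import numpy as np

def time_warp(motion: np.ndarray, speed: float) -> np.ndarray:
    """Resample a (T, D) joint-angle sequence so it plays `speed` times faster.

    speed > 1 shortens the clip (acceleration); speed < 1 lengthens it.
    Linear interpolation per joint is an assumption made for this sketch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    num_frames, num_dofs = motion.shape
    # Keep the first and last frames fixed and change how many frames lie between.
    new_len = max(2, int(round((num_frames - 1) / speed)) + 1)
    src_t = np.arange(num_frames, dtype=np.float64)
    new_t = np.linspace(0.0, num_frames - 1, new_len)
    return np.stack(
        [np.interp(new_t, src_t, motion[:, d]) for d in range(num_dofs)], axis=1
    )

# Hypothetical 29-DoF clip: every joint ramps linearly from 0 to 1 over 101 frames.
clip = np.linspace(0.0, 1.0, 101)[:, None] * np.ones((1, 29))
fast = time_warp(clip, speed=2.0)   # half as many frame intervals
slow = time_warp(clip, speed=0.5)   # twice as many frame intervals
print(fast.shape, slow.shape)
```

Per-joint linear interpolation is reasonable for joint angles; root orientations stored as quaternions would need spherical interpolation instead, which this sketch does not cover.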

The Model Choice
Humanoid-GPT’s model choice is motivated by the causal nature of real-time humanoid control. At deployment, the policy cannot rely on future reference frames, so the paper adopts a GPT-style Transformer with causal temporal attention rather than a non-causal sequence model. The model predicts per-joint PD targets from reference motion and proprioceptive observation histories, aligning the architecture with the online tracking constraint. The training pipeline first learns many PPO-based motion experts over motion clusters and then distills them into a single generalist Transformer policy through parallel DAgger-style supervision. This design is presented as a way to preserve expert-level tracking behavior while exploiting the scalability and sequence-modeling capacity of Transformers, which the authors contrast with shallow MLP trackers that saturate as data increases.
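
The causal constraint and the PD-target output can be illustrated with a small NumPy sketch. This is not the paper's architecture: it shows only a single attention head with a causal mask, plus the standard PD law that turns a predicted joint target into a torque. The function names and the gain values `kp` and `kd` are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention where step t sees only steps <= t.

    q, k, v: arrays of shape (T, d). Returns (outputs, attention_weights).
    """
    num_steps, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    # True above the diagonal marks "future" positions that must be hidden.
    future = np.triu(np.ones((num_steps, num_steps), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax; each row keeps at least its diagonal entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def pd_torque(q_target, q, qd, kp=40.0, kd=1.0):
    """Standard PD law: torque from a joint-position target (gains are made up)."""
    return kp * (q_target - q) - kd * qd

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))            # 6 time steps of 8-dim features
out, w = causal_attention(x, x, x)

# Perturb only the last two steps: earlier outputs must not change.
x_future = x.copy()
x_future[4:] += 10.0
out_perturbed, _ = causal_attention(x_future, x_future, x_future)
print(np.allclose(out[:4], out_perturbed[:4]))
```

The final check is the property that matters for deployment: the output at step t is unaffected by anything that happens after t, so the same computation can run online as new frames arrive.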

Why It Works
The paper emphasizes that simply increasing the number of motion clips does not guarantee useful generalization, because frequent motion styles can dominate large corpora and suppress rare but important behaviors. To address this, the authors introduce Harmonic Motion Embedding, or HME, as a compact representation for organizing motion diversity directly from raw motion sequences. HME is built by training Periodic Autoencoders to extract per-joint periodic amplitudes and frequencies, then aggregating statistics such as means and standard deviations into sequence-level embeddings. The paper applies K-Means clustering over these embeddings to create roughly 300 motion clusters, each used to train a PPO-based expert on a coherent subset of the corpus. The resulting sampling strategy is described as diversity-aware and distribution-balanced, supporting the paper’s claim that diversity and balance are both necessary for large-scale zero-shot tracking.
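
The embed, cluster, and balance pipeline can be sketched as follows. Note the substitutions: an FFT peak stands in for the learned Periodic Autoencoder, a small hand-written K-Means with farthest-point initialization stands in for whatever clustering implementation the authors used, and uniform-over-clusters sampling is one assumed reading of "distribution-balanced". All function names, window sizes, and test frequencies are hypothetical.

```python
import numpy as np

def harmonic_embedding(motion, fps=30.0, window=64, hop=32):
    """Embed a (T, J) clip as mean/std of per-joint dominant amplitude and frequency.

    An FFT peak per window replaces the paper's Periodic Autoencoder here.
    """
    num_frames, num_joints = motion.shape
    if num_frames < window:
        raise ValueError("clip is shorter than one analysis window")
    freqs = np.fft.rfftfreq(window, d=1.0 / fps)
    amps, doms = [], []
    for start in range(0, num_frames - window + 1, hop):
        seg = motion[start:start + window]
        seg = seg - seg.mean(axis=0, keepdims=True)
        spec = np.abs(np.fft.rfft(seg, axis=0)) * 2.0 / window  # (bins, J)
        peak = spec[1:].argmax(axis=0) + 1                      # skip the DC bin
        amps.append(spec[peak, np.arange(num_joints)])
        doms.append(freqs[peak])
    amps, doms = np.asarray(amps), np.asarray(doms)
    return np.concatenate([amps.mean(0), amps.std(0), doms.mean(0), doms.std(0)])

def kmeans(x, k, iters=25):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.array(centers, dtype=np.float64)

    def assign(c):
        return ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return assign(centers), centers

def balanced_sample(labels, n, seed=0):
    """Pick a cluster uniformly, then a clip uniformly inside that cluster."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(np.unique(labels), size=n)
    return np.array([rng.choice(np.flatnonzero(labels == c)) for c in chosen])

def make_clip(freq_hz, amp, num_frames=256, num_joints=3, fps=30.0):
    """Synthetic periodic clip used only to exercise the sketch."""
    t = np.arange(num_frames) / fps
    phases = np.linspace(0.0, np.pi / 2, num_joints)
    return amp * np.sin(2 * np.pi * freq_hz * t[:, None] + phases[None, :])

amplitudes = np.linspace(0.3, 0.6, 5)
clips = [make_clip(0.9375, a) for a in amplitudes] + [make_clip(3.75, a) for a in amplitudes]
embeddings = np.stack([harmonic_embedding(c) for c in clips])
labels, _ = kmeans(embeddings, k=2)
print(labels)
```

In the paper's setting, each cluster would feed one PPO expert. The point of the balanced sampler is that a rare cluster is drawn as often as a common one, so rare behaviors are not drowned out by frequent styles.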

Result: New Frontier, Same Chaos
The reported outcome is that Humanoid-GPT advances both agile tracking and zero-shot generalization within a single humanoid controller. The authors compare the system against related trackers such as HumanPlus, OmniH2O, ASAP, GMT, UniTracker, BumbleBee, TWIST, Any2Track, and SONIC, positioning Humanoid-GPT as a Transformer-based low-level tracker trained on 2.0B frames with both agility and zero-shot capability. The paper states that extensive experiments and scaling analyses show robust zero-shot generalization to unseen tasks while maintaining tracking of highly dynamic and complex motions. It also claims support for real-time whole-body control and for tracking unseen motions that are retargeted online, without fine-tuning. Beyond the specific system, the paper’s broader implication is a scaling law for humanoid motion tracking that relates performance to data scale and model capacity, suggesting a roadmap for future general-purpose whole-body control.
