ReadPaper Blog
On the Geometry of On-Policy Distillation
The paper studies how on-policy distillation (OPD) changes large language model parameters during post-training, a question that remains unclear despite OPD’s growing use for reasoning models. By comparing OPD with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) using parameter-space diagnostics, the paper argues that OPD has its own update geometry: relaxed off-principal localization, early subspace locking, and sensitivity to objective composition.
Source: (none provided)

Why is this training so weird?
The paper asks where on-policy distillation fits within the geometry of post-training methods for large reasoning models. OPD trains a student model on its own sampled trajectories while receiving dense token-level guidance from a stronger teacher, which makes it resemble both supervised fine-tuning and online reinforcement learning. The authors argue that this hybrid description is insufficient because SFT and RLVR are already known to leave sharply different parameter-space footprints: SFT tends toward dense, principal-aligned updates, while RLVR tends toward sparse, off-principal updates that better preserve pretrained spectral structure. To resolve this ambiguity, the paper frames three research questions about OPD’s location in the SFT–RLVR spectrum, its intrinsic update trajectory, and which components of the OPD objective control that trajectory. The central claim is that OPD is not merely a midpoint between SFT and RLVR, but a distinct post-training regime with its own optimization geometry.

OPD is off-principal, but relaxed
To locate OPD in parameter space, the paper applies a diagnostic suite based on update support, principal-subspace rotation, spectral drift, and update localization. In the controlled Qwen3-8B comparison, SFT is measured from the pretrained base model to the SFT checkpoint, while OPD and RLVR are both initialized from the same SFT checkpoint and trained on a math-domain prompt distribution. The bf16-aware update sparsity results place OPD between SFT and RLVR: SFT changes many more visible weights, RLVR changes fewer, and OPD occupies an intermediate but RLVR-leaning position. The paper labels this pattern a relaxed off-principal regime because OPD is more selective and geometry-preserving than SFT, yet less tightly constrained than RLVR. This interpretation is connected to a relaxed Three-Gate view in which OPD inherits RLVR-like geometry-preserving bias while dense teacher supervision broadens the set of active update directions.

The updates get locked early
The paper then moves from endpoint comparisons to training dynamics by tracking cumulative updates across checkpoints. Using effective dimension, update scale, and spectral-shape diagnostics, it finds that OPD rapidly enters a narrow low-dimensional update band rather than continually expanding its parameter-space footprint. This phenomenon is called subspace locking, and the authors distinguish it from a trivial collapse in update magnitude by noting that OPD has a substantially larger cumulative update norm than RLVR while ending with comparable stable rank. In contrast, the paper characterizes SFT as expanding its update subspace and RLVR as contracting over training. The implication is that OPD learns through an early-emerging channel that is low-dimensional but still capable of carrying meaningful post-training changes.

Early space is enough for OPD
The paper further tests whether the locked OPD subspace is stable and functionally useful rather than merely a descriptive artifact. Subspace-similarity analysis shows that OPD’s early update directions align with the final update channel, indicating that the training trajectory commits early to a persistent region of parameter space. The authors then constrain later training to the update subspace formed early in training and find that this preserves OPD performance. The same kind of restriction substantially degrades SFT, which supports the claim that SFT relies on a broader or differently evolving update geometry. This evidence strengthens the paper’s interpretation that OPD’s low-dimensional locked subspace is not only observable but functionally sufficient for the OPD training process.

What actually controls the lock?
To identify what controls subspace locking, the paper perturbs token supervision density, rollout policy, and objective composition. Token sparsification preserves the OPD rank trajectory, suggesting that dense token coverage is not the sole driver of the locked update geometry. Shifting rollout generation off-policy also preserves the rank dynamics, indicating that the locking effect is robust to this runtime change in how trajectories are generated. By contrast, objective-level interpolation with RLVR changes the rank dynamics and exposes a boundary of the locked regime. The paper concludes that OPD’s geometry is governed most strongly by objective composition, making OPD a distinct post-training mechanism rather than a simple interpolation between SFT-style distillation and RLVR-style optimization.
