ReadPaper Blog
Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
This paper studies on-policy distillation for large language models, where a student model learns from token-level teacher distributions on the student’s own generated prefixes. It argues that selecting tokens by entropy or raw teacher–student KL disagreement is insufficient, because high disagreement can be either learnable or incompatible with the student’s current predictive support. The authors propose token teachability and Teachability-Aware OPD (TA-OPD), showing across Qwen2.5 and Qwen3 teacher–student settings that a small budget of high-teachability tokens can often match or surpass full-token OPD.
Source: Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Mission Briefing: Why Some Teacher Hints Fail
The paper is motivated by a practical weakness in selective on-policy distillation: existing token selectors often treat salience as a proxy for usefulness. In OPD, the student generates its own rollout prefixes, and the teacher supplies token-level supervision at those student-visited states, reducing the distribution mismatch associated with off-policy distillation. Prior selective OPD methods exploit the non-uniformity of supervision by prioritizing high-entropy positions or tokens with large teacher–student KL divergence. The authors argue that raw KL disagreement is only a coarse signal because it measures how different the two distributions are, not whether the resulting gradient is likely to improve the student. The paper’s central question is therefore which token-level teacher signals in OPD are actually learnable rather than merely large or surprising.

Same KL, Different Fate
The paper’s key conceptual distinction is between learnable disagreement and incompatible disagreement. In learnable disagreement, the teacher’s probability mass reweights tokens that already lie within the student’s local top-K support, so the correction is reachable from the student’s current predictive state. In incompatible disagreement, the teacher assigns much of its mass to tokens outside the student’s current support, producing a large KL gap without necessarily yielding an update the student can absorb locally. This explains why two low-entropy, high-KL positions can have very different learning value under OPD. The paper formalizes this distinction by defining student local support, teacher top-K support, their union, and a compatibility mass that measures how much teacher probability lies on the student’s top-K candidates.
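The support definitions above can be made concrete with a small sketch. The function names, the toy six-token vocabulary, and the choice of K = 2 are illustrative assumptions, not taken from the paper; the KL direction (teacher relative to student) is also an assumption, since this summary does not state it.

```python
import math

def top_k_indices(probs, k):
    """Indices of the k most probable tokens (hypothetical helper)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def compatibility_mass(teacher_probs, student_probs, k):
    """Teacher probability mass that lands on the student's top-k tokens."""
    student_support = top_k_indices(student_probs, k)
    return sum(teacher_probs[i] for i in student_support)

def union_support(teacher_probs, student_probs, k):
    """Union of the teacher's and the student's top-k token sets."""
    return top_k_indices(teacher_probs, k) | top_k_indices(student_probs, k)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the full vocabulary; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over a six-token vocabulary (made up for illustration).
student = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]
# Teacher reweights tokens the student already ranks in its top-2.
teacher_learnable = [0.05, 0.90, 0.02, 0.01, 0.01, 0.01]
# Teacher puts most of its mass on a token far outside the student's top-2.
teacher_incompatible = [0.05, 0.01, 0.01, 0.01, 0.01, 0.91]

K = 2
compat_learnable = compatibility_mass(teacher_learnable, student, K)
compat_incompatible = compatibility_mass(teacher_incompatible, student, K)
kl_learnable = kl_divergence(teacher_learnable, student)
kl_incompatible = kl_divergence(teacher_incompatible, student)
union_incompatible = union_support(teacher_incompatible, student, K)
```

In this toy case both positions show a large KL gap, yet the compatibility mass is high for the first teacher and low for the second, which is the contrast the section describes.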

Token Teachability
To test whether token-level disagreement predicts improvement, the authors introduce a fixed-context diagnostic for OPD. They collect student-generated prefixes into a frozen context bank, then rescore both the initial and trained students against the same teacher distribution on those identical contexts. For each token position, they define fixed-context token gain as the reduction in teacher–student KL from the initial student to the trained student, isolating local improvement from rollout resampling noise and downstream context shifts. The paper also presents a gradient-alignment proposition showing that useful token losses must align with fixed-context loss reduction, not merely exhibit high KL. Using this diagnostic, the authors show that support-aligned disagreement better predicts same-context KL reduction than raw disagreement alone.
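A minimal sketch of the fixed-context gain computation follows. It assumes the KL is taken from teacher to student and that the context bank is represented as parallel lists of next-token distributions; both choices, along with the function names and toy numbers, are assumptions made for illustration rather than details from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for one next-token distribution; eps guards against log(0)."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def fixed_context_gains(teacher_rows, init_rows, trained_rows):
    """Per-position reduction in teacher-student KL on the same frozen contexts.

    Each argument is a list with one distribution per position in the
    context bank. A positive gain means the trained student moved closer
    to the teacher at that exact context.
    """
    gains = []
    for teacher, init, trained in zip(teacher_rows, init_rows, trained_rows):
        gains.append(kl_divergence(teacher, init) - kl_divergence(teacher, trained))
    return gains

# Two frozen contexts with a three-token vocabulary (made-up numbers).
teacher_rows = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
init_rows = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
trained_rows = [[0.3, 0.6, 0.1], [0.6, 0.3, 0.1]]

gains = fixed_context_gains(teacher_rows, init_rows, trained_rows)
```

Because the teacher distribution and the context are held fixed, any nonzero gain here comes only from the change in the student, which is what lets the diagnostic separate local improvement from resampling effects.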

TA-OPD: Select the Most Teachable Tokens
The method that follows from this analysis is Teachability-Aware OPD, or TA-OPD. TA-OPD computes a local disagreement score on the union of teacher and student top-K token sets and combines it with compatibility mass to prioritize support-aligned teacher corrections. After robust normalization, the paper decomposes disagreement into a learnable component and an incompatible component, with the learnable part corresponding to large disagreement that remains inside the student’s local support. TA-OPD then applies the OPD loss only to high-teachability token positions, subject to a token budget. The method is designed to be lightweight: it uses teacher and student token probabilities that are already available during OPD and does not require a reward model, verifier, or external preference signal.
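The selection step can be sketched roughly as below. This is one plausible reading, not the paper's formula: the score is assumed to be the local disagreement (KL on the renormalized top-K union) multiplied by compatibility mass, the robust normalization step is omitted, and every function name and number is hypothetical.

```python
import math

def top_k(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def local_disagreement(teacher, student, k, eps=1e-12):
    """KL(teacher || student) restricted to the union of both top-k sets.

    Renormalizing inside the union is an assumption of this sketch.
    """
    union = sorted(top_k(teacher, k) | top_k(student, k))
    t_mass = sum(teacher[i] for i in union)
    s_mass = sum(student[i] for i in union)
    kl = 0.0
    for i in union:
        t = teacher[i] / t_mass
        s = student[i] / s_mass
        if t > 0:
            kl += t * math.log(t / (s + eps))
    return kl

def teachability(teacher, student, k):
    """Assumed score: local disagreement weighted by compatibility mass."""
    compat = sum(teacher[i] for i in top_k(student, k))
    return local_disagreement(teacher, student, k) * compat

def select_positions(teacher_rows, student_rows, k, budget):
    """Keep the highest-scoring fraction `budget` of positions (at least one)."""
    scores = [teachability(t, s, k) for t, s in zip(teacher_rows, student_rows)]
    n_keep = max(1, int(budget * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores

# Three toy positions: learnable, incompatible, and already in agreement.
student_rows = [[0.90, 0.05, 0.02, 0.01, 0.01, 0.01]] * 3
teacher_rows = [
    [0.05, 0.90, 0.02, 0.01, 0.01, 0.01],  # reweights the student's top-2
    [0.05, 0.01, 0.01, 0.01, 0.01, 0.91],  # mass outside the student's top-2
    [0.90, 0.05, 0.02, 0.01, 0.01, 0.01],  # identical to the student
]

selected, scores = select_positions(teacher_rows, student_rows, k=2, budget=0.34)
raw = [local_disagreement(t, s, 2) for t, s in zip(teacher_rows, student_rows)]
```

In this toy setup, ranking by local disagreement alone would favor the second position, while the compatibility-weighted score keeps the first. An OPD loss mask would then be set only at the indices in `selected`.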

Final Insight: Learnable Signals Beat Mere Quantity
The experimental evidence supports the paper’s claim that token quality can outweigh token quantity in OPD. In standardized token-level regressions controlling for student entropy, local disagreement, token position, and teacher entropy, the learnable disagreement component has roughly twice the coefficient of the incompatible component across K values of 8, 16, and 32, with positive bootstrap gaps. This indicates that the low-entropy, high-divergence region emphasized by prior TIP-style selection is heterogeneous rather than uniformly teachable. Across Qwen2.5 and Qwen3 teacher–student settings that vary in scale, reasoning ability, and backbone, TA-OPD often matches or surpasses full-token OPD while retaining only 5% of tokens. The broader implication is that selective distillation should optimize for learnable teacher signals rather than simply dense supervision, uncertainty, or raw teacher–student disagreement.
