ReadPaper Blog
Draft-OPD: On-Policy Distillation for Speculative Draft Models
Draft-OPD addresses a training bottleneck in speculative decoding, where lightweight draft models accelerate large language model inference by proposing token blocks that a larger target model verifies in parallel. The paper argues that supervised fine-tuning on target-generated trajectories plateaus because draft models are trained on offline target states but evaluated on inference-time states induced by their own proposals. It introduces on-policy distillation with error-position replay, showing improved accepted length and reporting over 5× lossless acceleration for thinking models, with gains over EAGLE-3 and DFlash under matched FLOPs.
Source: Draft-OPD: On-Policy Distillation for Speculative Draft Models

Why SFT Stops Helping
The paper begins from a practical limitation in speculative decoding: the speedup depends on the accepted length of draft-token blocks, but ordinary supervised fine-tuning stops improving that quantity after an initial warm-up. Methods such as EAGLE-3 and DFlash train lightweight draft models on target-generated trajectories, so additional offline SFT mainly exposes the drafter to prefixes produced by the target model rather than to the prefixes that determine verification outcomes. The authors report that continued SFT fluctuates around a plateau in accepted length, and that SFT on data used for on-policy distillation can even reduce accepted length. This evidence motivates the central claim that the bottleneck is not simply insufficient offline training compute. The implication is that improving speculative decoding requires training signals tied to the draft model’s own behavior during draft-and-verify inference.

The Hidden Mismatch
Draft-OPD frames the plateau as an offline-to-inference mismatch between target-made training states and draft-induced evaluation states. In speculative decoding, a target model pθ verifies a block of K candidate tokens sampled from a draft model qϕ, and the useful acceleration is governed by how many of those proposed tokens are accepted before verification rejects a mismatch. SFT trains qϕ on fixed target trajectories, where every prefix comes from the target distribution, but inference tests qϕ on prefixes partly shaped by its own previous proposals. The paper therefore emphasizes accepted length τ as both an efficiency metric and an alignment metric between the draft and target models. This analysis connects speculative decoding to exposure bias: the drafter must match the target not only on clean teacher states, but also on the states it actually visits when proposing token spans.
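To make accepted length concrete, here is a minimal sketch that counts how many leading draft tokens survive verification. It assumes greedy, exact-match verification for simplicity (sampling-based verification uses a probabilistic acceptance rule instead), and the function names are illustrative rather than taken from the paper.

```python
from typing import List, Sequence


def accepted_length(draft_block: Sequence[int], target_choices: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's token at the same position.

    Assumption: greedy exact-match verification. Everything after the first
    mismatch is rejected, even if later tokens happen to agree.
    """
    n = 0
    for drafted, wanted in zip(draft_block, target_choices):
        if drafted != wanted:
            break
        n += 1
    return n


def mean_accepted_length(blocks: List[Sequence[int]], targets: List[Sequence[int]]) -> float:
    """Average accepted length over several verified blocks (the quantity written as τ)."""
    lengths = [accepted_length(b, t) for b, t in zip(blocks, targets)]
    return sum(lengths) / len(lengths)


# A block of K = 4 draft tokens; the target disagrees at the third position.
tau_one_block = accepted_length([5, 7, 9, 2], [5, 7, 1, 2])
```

In this example only the first two tokens count: the fourth token matches, but it sits behind a rejection and is discarded. That is why a single early error on a draft-induced prefix costs the whole tail of the block.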

Why Plain OPD Fails Here
The paper next explains why standard on-policy distillation is not directly usable for modern speculative draft models. Classical OPD assumes that a student can generate full trajectories under its own policy and then receive teacher distributions on those student-induced states, typically through an objective such as D_KL(p_θ^t ∥ q_ϕ^t). Training-based draft modules such as EAGLE-style and DFlash-style drafters are not designed as standalone autoregressive generators; they are built to predict short spans with target-model context and verification. When forced into draft-only rollouts, these modules can produce repetitive or degenerate continuations, making the collected states poor training material. Target-assisted rollout restores stability, but because speculative verification preserves the target model’s distribution, it discards rejected draft tokens and removes exactly the on-policy errors that OPD should exploit.
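The distillation objective mentioned above can be illustrated with a small sketch of a per-position KL divergence between target and draft next-token distributions. The direction (target first, draft second) follows the order in which the summary writes the objective; the function names, the toy distributions, and the `eps` clamp are assumptions made for illustration.

```python
import math
from typing import Sequence


def forward_kl(p_target: Sequence[float], q_draft: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) over one vocabulary distribution.

    Terms with p = 0 contribute nothing; q is clamped at eps (an assumption)
    so the logarithm stays finite when the draft assigns zero probability.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_target, q_draft)
        if p > 0.0
    )


def mean_token_kl(ps: Sequence[Sequence[float]], qs: Sequence[Sequence[float]]) -> float:
    """Average the per-position KL over the states of one trajectory."""
    values = [forward_kl(p, q) for p, q in zip(ps, qs)]
    return sum(values) / len(values)


# Two positions: the draft matches the target at the first and is too flat at the second.
loss = mean_token_kl(
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
)
```

What makes the objective on-policy is not the formula but where the states come from: the distributions are meant to be evaluated on prefixes the draft model itself produced, which is exactly what draft-only rollouts fail to provide in usable form.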

Draft-OPD: Replay the Error
Draft-OPD’s method combines stable continuation with a mechanism for recovering the draft model’s own mistakes. It uses target-assisted rollout to keep sequences usable, records the error positions exposed by speculative verification, and then replays drafting from those positions so the target model can score prefixes that reflect the drafter’s actual proposals. This replay step lets the training process include both accepted and rejected tokens rather than learning only from the target-corrected continuation. The paper further describes an acceptance-aware distillation objective that treats accepted and rejected proposals differently: accepted tokens reinforce agreement with the target model, while rejected tokens concentrate learning on the deviations that shorten accepted length. The method is therefore designed specifically for training-based speculative draft models that need on-policy feedback but cannot safely perform full independent rollouts.
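The rollout-then-replay idea can be sketched on a toy problem. Everything below is a hypothetical illustration, not the paper's implementation: verification is greedy exact-match, the drafter is deterministic so that replay reproduces the same block, and `toy_target`, `toy_draft`, and `ErrorRecord` are invented names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

DraftFn = Callable[[Sequence[int], int], List[int]]  # (prefix, k) -> k proposed tokens
TargetFn = Callable[[Sequence[int]], int]            # prefix -> target's next token


@dataclass
class ErrorRecord:
    block_start: int  # sequence length at which the failing block was drafted
    n_accepted: int   # draft tokens accepted before the rejection


def target_assisted_rollout(prompt, draft: DraftFn, target: TargetFn, k: int, max_new: int):
    """Draft-and-verify decoding that also records where each block was rejected."""
    seq, errors = list(prompt), []
    while len(seq) - len(prompt) < max_new:
        start, n_ok = len(seq), 0
        for tok in draft(seq, k):
            wanted = target(seq)
            if tok != wanted:
                errors.append(ErrorRecord(start, n_ok))
                seq.append(wanted)  # the target's correction replaces the rejected token
                break
            seq.append(tok)
            n_ok += 1
        else:
            seq.append(target(seq))  # bonus token when the whole block is accepted
    return seq, errors


def replay_errors(seq, errors, draft: DraftFn, target: TargetFn, k: int):
    """Re-draft from each recorded position and score every draft-induced prefix.

    Returns (state, draft_token, target_token, accepted) tuples; rejected
    tokens and the states behind them are kept instead of being discarded.
    """
    examples: List[Tuple[List[int], int, int, bool]] = []
    for err in errors:
        prefix = list(seq[:err.block_start])
        block = draft(prefix, k)  # assumption: deterministic drafter reproduces the block
        for i, tok in enumerate(block):
            state = prefix + block[:i]
            examples.append((state, tok, target(state), i < err.n_accepted))
    return examples


def toy_target(prefix: Sequence[int]) -> int:
    return (prefix[-1] + 1) % 10  # target counts upward modulo 10


def toy_draft(prefix: Sequence[int], k: int) -> List[int]:
    out, last = [], prefix[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt % 4 == 0:  # deliberate mistake: skip multiples of 4
            nxt = (nxt + 1) % 10
        out.append(nxt)
        last = nxt
    return out


seq, errors = target_assisted_rollout([1], toy_draft, toy_target, k=3, max_new=8)
examples = replay_errors(seq, errors, toy_draft, toy_target, k=3)
```

The rollout output is identical to what the target alone would generate, so the committed sequence contains none of the drafter's mistakes. The replayed examples restore them, including states that follow a rejected token and therefore never appear on the target trajectory. An acceptance-aware loss could then treat examples differently depending on the `accepted` flag; how the paper weights the two groups is not specified in this summary.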

What the Experiments Say
The experiments support the paper’s claim that error-focused on-policy distillation can continue improving where offline SFT stalls. The reported training curve shows that after SFT warm-up, continuing with Draft-OPD increases accepted length, while continued SFT remains near a plateau and SFT on OPD data can degrade performance. The paper reports over 5× lossless acceleration for thinking models across diverse tasks, meaning the target model’s output distribution is preserved while fewer expensive decoding steps are needed. Under matched FLOPs, Draft-OPD improves over EAGLE-3 by 23% and over DFlash by 13%, according to the abstract and introduction. These results imply that the key advantage is not merely stronger distillation data, but the ability to train on verification-time draft-induced errors that determine speculative acceptance.
