ReadPaper Blog
ESPO: Early-Stopping Proximal Policy Optimization
ESPO addresses a specific inefficiency in reinforcement learning for large language model reasoning: standard PPO keeps generating long rollouts even after an early reasoning error has made success unlikely. The paper proposes an early-stopping variant of PPO that uses policy logits and critic value estimates already computed during rollout collection to detect likely failure, truncate trajectories, reduce noisy advantage estimates, and save tokens while improving math-reasoning accuracy.
Source: ESPO: Early-Stopping Proximal Policy Optimization

Mission Briefing: The Rollout Waste Problem
The paper frames ESPO around the “rollout continuation problem” in long-horizon reinforcement learning for large language models. In tasks such as mathematical reasoning, a model may make an early mistake (choosing the wrong operation, pursuing an invalid proof path, or drifting away from the problem), after which the rest of the generated trajectory is unlikely to receive positive reward. Standard PPO nevertheless continues sampling until an end-of-sequence token or a fixed horizon T_max, so many post-failure tokens consume computation without contributing useful learning signal. Because those tokens are included in generalized advantage estimation, their noisy temporal-difference terms can obscure the actual step where the reasoning failed. The paper’s motivation is therefore both computational and statistical: reduce wasted rollout tokens while making the policy-gradient signal better aligned with the true failure point.

ESPO Appears: Stop the Mission on the Spot
ESPO, or Early-Stopping Proximal Policy Optimization, modifies rollout collection rather than changing PPO’s clipped surrogate objective. The method detects likely failed trajectories on the fly using only two signals already available in an actor-critic PPO pass: the policy’s logit vector and the critic’s value estimate. This design avoids the need for a process reward model, human annotation of intermediate reasoning steps, or an additional learned termination module such as those used in options-style reinforcement learning. The paper presents ESPO as orthogonal to PPO-style improvements such as GRPO and DAPO, because its termination mechanism can be layered onto advantage estimators without replacing the underlying policy optimization framework. Its central claim is that failure-aware truncation can make RL post-training for reasoning models both more efficient and more accurate.

How the Stop Signal Works
The stopping signal in ESPO is built from a per-step surrogate regret: the log-probability gap between the policy’s greedy token and the token actually sampled at a state. Formally, for state s_t and sampled action a_t, the paper defines g_t = max_a log π_θ(a | s_t) - log π_θ(a_t | s_t), where the maximum runs over vocabulary actions. The signal is zero when sampling follows the policy mode and grows as the sampled token deviates from it. Because the scale of this regret signal changes during training, ESPO applies exponential moving average normalization using batch-level statistics that are updated only at training-batch boundaries to preserve causal correctness. The normalized regret is accumulated and compared against a threshold informed by the critic’s estimate of remaining value, reflecting the intuition that high regret combined with low expected return indicates a trajectory unlikely to recover. This value-gated criterion lets ESPO terminate rollouts during generation with negligible additional computation beyond standard decoding.
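To make the stopping rule concrete, here is a minimal Python sketch of the three pieces described above: the per-step regret, a normalizer updated only at batch boundaries, and a value-gated stop check. Only the regret definition comes from the summary. The normalizer form (dividing by a running mean), the threshold formula, and every name and default value (EmaScale, stop_index, base_threshold, value_scale) are hypothetical choices for illustration, not the paper's specification.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_id):
    """Log-probability gap between the greedy token and the sampled token."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


class EmaScale:
    """Running scale for the regret signal (assumed form: EMA of batch means)."""

    def __init__(self, decay=0.9, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.scale = 1.0

    def normalize(self, g):
        # Uses statistics frozen at the last batch boundary.
        return g / (self.scale + self.eps)

    def update(self, batch_regrets):
        # Call once per training batch, never in the middle of a rollout.
        batch_mean = sum(batch_regrets) / len(batch_regrets)
        self.scale = self.decay * self.scale + (1.0 - self.decay) * batch_mean


def stop_index(steps, normalizer, base_threshold=1.0, value_scale=10.0):
    """Return the first step where cumulative regret exceeds the gate, else None.

    steps: iterable of (logits, sampled_id, critic_value) tuples.
    The gate below is an assumption: the threshold grows with the critic
    value, so low-value states tolerate less accumulated regret.
    """
    cumulative = 0.0
    for t, (logits, sampled_id, value) in enumerate(steps):
        cumulative += normalizer.normalize(surrogate_regret(logits, sampled_id))
        threshold = base_threshold + value_scale * max(value, 0.0)
        if cumulative > threshold:
            return t
    return None
```

In a real rollout loop the check would run once per decoded token, reusing the logits and critic value from that same forward pass, which is why the extra cost is small.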

Why Truncation Helps Learning
The paper’s treatment of truncated trajectories is crucial to its learning argument. ESPO maps an early-stopped rollout to an absorbing failure state with a terminal penalty, rather than treating truncation as an ordinary partial episode or adding dense reward shaping. This choice concentrates negative temporal-difference error near the detected failure step, so generalized advantage estimation propagates the consequence of failure to the relevant preceding decisions instead of spreading noisy credit assignment across many post-failure tokens. The approach is also intended to avoid the non-stationary bias that can arise from hand-designed per-step penalties, because ESPO leaves the task reward and token-level action space unchanged. In the paper’s formulation, PPO and GAE then operate normally on the truncated trajectory, making ESPO an intervention at data collection time rather than a new objective function.
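The credit-assignment argument can be illustrated with a small sketch: cut a trajectory at the detected step, attach a terminal penalty with no bootstrapped value (the absorbing failure state), and run ordinary GAE on what remains. The GAE recursion is the standard one. The helper name truncate_as_failure, the penalty of -1.0, and the toy critic values are assumptions for illustration only.

```python
def gae(rewards, values, gamma=1.0, lam=0.95, bootstrap_value=0.0):
    """Standard generalized advantage estimation over one trajectory.

    bootstrap_value is the value after the last step; 0.0 models an
    absorbing terminal state.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def truncate_as_failure(rewards, values, stop_t, terminal_penalty=-1.0):
    """Keep steps 0..stop_t and add a terminal penalty at the stop step.

    terminal_penalty is a hypothetical value; per-step task rewards are
    otherwise left untouched.
    """
    kept_rewards = list(rewards[: stop_t + 1])
    kept_rewards[-1] += terminal_penalty
    kept_values = list(values[: stop_t + 1])
    return kept_rewards, kept_values


# Toy rollout: sparse reward, critic value drops after an early mistake.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0]
values = [0.6, 0.5, 0.2, 0.1, 0.1]
r, v = truncate_as_failure(rewards, values, stop_t=2)
advantages = gae(r, v)  # bootstrap_value=0.0: absorbing failure state
```

In this toy case the largest single TD error sits at the stop step, and GAE carries it backward to the preceding decisions. The two post-failure steps are never generated, so they contribute no TD terms at all.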

Evidence from the Training Arena
The paper reports that ESPO improves both accuracy and rollout efficiency on mathematical reasoning benchmarks. On DeepSeek-R1-Distill-Qwen-7B, ESPO outperforms PPO on AIME 2024 with 46.28% versus 45.25%, AMC 2023 with 85.83% versus 82.94%, and MATH-500 with 87.42% versus 85.43%, while saving more than 20% of cumulative rollout tokens. At the 1.5B scale, the paper reports an average accuracy of 59.09% across three benchmarks, compared with 57.03% for PPO and 58.29% for DAPO, while using 927.96M cumulative tokens versus 1069.66M for PPO and 1223.96M for DAPO. A key ablation compares ESPO with random truncation matched for stopping rate, and the random variant reaches only 42.4% on AIME 2024 despite a similar average rollout length. This result supports the paper’s claim that the benefit comes from detecting where failure occurs, not merely from shortening generations.
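For readers who want the reported figures as relative differences, the short sketch below derives them from the numbers quoted above. Nothing here is new data: the inputs are the token counts and accuracies from this summary, and the helper name relative_saving is just an illustrative choice.

```python
def relative_saving(baseline_tokens, method_tokens):
    """Fraction of the baseline's tokens that the method avoids."""
    return (baseline_tokens - method_tokens) / baseline_tokens


# Cumulative rollout tokens at the 1.5B scale, in millions (from the text).
TOKENS_M = {"ESPO": 927.96, "PPO": 1069.66, "DAPO": 1223.96}
saving_vs_ppo = relative_saving(TOKENS_M["PPO"], TOKENS_M["ESPO"])
saving_vs_dapo = relative_saving(TOKENS_M["DAPO"], TOKENS_M["ESPO"])

# 7B accuracies in percent as (ESPO, PPO) pairs (from the text).
ACC_7B = {
    "AIME 2024": (46.28, 45.25),
    "AMC 2023": (85.83, 82.94),
    "MATH-500": (87.42, 85.43),
}
gains_7b = {name: round(espo - ppo, 2) for name, (espo, ppo) in ACC_7B.items()}

print(f"1.5B token saving vs PPO:  {saving_vs_ppo:.1%}")
print(f"1.5B token saving vs DAPO: {saving_vs_dapo:.1%}")
for name, gain in gains_7b.items():
    print(f"7B gain on {name}: {gain:+.2f} points")
```

At the 1.5B scale this works out to roughly 13% fewer tokens than PPO and roughly 24% fewer than DAPO. The "more than 20%" saving quoted above refers to the 7B model, so the two figures describe different runs.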
