ReadPaper Blog
Rethinking Continual Experience Internalization for Self-Evolving LLM Agents
The paper studies how large language model agents can convert past interaction experience into reusable parametric capability without relying on ever-growing context. Its central finding is that methods which look effective after one round of context distillation can suffer progressive capability collapse across multiple self-evolution iterations, and that stability depends on how experience is abstracted, injected, and distilled.
Source: Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Experience That Keeps Slipping
The paper addresses a core weakness in self-evolving LLM agents: learning from experience does not automatically compound over repeated training rounds. Existing experience internalization methods often distill an experience-aware teacher into an experience-free student and report strong single-iteration gains, but the authors show that this is not enough for continual learning. In their preliminary study, iterative on-policy context distillation degrades across iterations on web-reasoning benchmarks including WebWalkerQA, GAIA, and BrowseComp-ZH. This failure matters because autonomous agents are expected to keep improving as they collect trajectories, tool observations, and feedback over time. The paper frames the problem as progressive capability collapse, where the agent’s accumulated experience becomes a source of instability rather than durable improvement.

Three Ways Experience Can Be Packaged
The authors organize the design space of continual experience internalization around three dimensions: experience granularity, experience injection pattern, and internalization regime. Experience granularity distinguishes instance-level records of specific trajectories from principle-level summaries that capture reusable strategies, decision rules, and failure patterns. Experience injection pattern compares global injection, where the same experience context is supplied for an entire trajectory, with step-wise injection, where an LLM-based selector retrieves relevant experience for each intermediate history. Internalization regime contrasts on-policy context distillation, which supervises student-induced states, with off-policy context distillation, which trains on teacher-generated trajectories. The paper’s central recipe is that principle-level experience, step-wise injection, and off-policy context distillation produce a more stable path for self-evolving agents.

Why Principles Travel Better
The paper finds that principle-level experience is more durable than instance-level experience because it abstracts away trajectory-specific details that may not transfer across tasks or iterations. In the authors’ formulation, trajectories are summarized into a natural-language experience pool, and the abstraction level of that pool strongly affects what the student internalizes. Instance-level experience can preserve narrow paths, local observations, or accidental details from a single interaction, increasing the risk that later training reinforces brittle behavior. Principle-level experience instead encodes broader strategies and failure patterns that can guide future reasoning and tool use in new states. The experiments on Qwen3 models indicate that this abstraction is especially important when internalization is repeated rather than evaluated as a one-shot transfer.

Put Experience Where Decisions Happen
The paper also argues that experience must be injected at the point where decisions are made, especially in long-horizon tool-use environments. The agents follow a ReAct-style interaction format, producing interleaved thoughts, actions, tool calls, and observations while using tools such as Search, Visit, Python, Scholar, and File Parser. Under global injection, the teacher receives a fixed experience context for the whole trajectory, which can fail to keep later decisions aligned with the most relevant experience. Under step-wise injection, a selector retrieves experience conditioned on the current interaction history, aligning guidance with intermediate decision states. The authors report that this state-aligned pattern outperforms global injection because long-horizon web reasoning requires experience to remain relevant as the trajectory evolves.

The Safe Recipe for Self-Evolution
The paper’s final practical conclusion is that stable self-evolution requires changing the trajectory distribution used for distillation, not only improving the experience representation. On-policy context distillation can work in a single iteration, but in repeated internalization it trains the teacher to correct states produced by the student, which may already be flawed or locally unstable. The authors argue that this reduces supervision to local corrections rather than coherent demonstrations of experience-guided behavior. Off-policy context distillation instead uses high-quality teacher-generated trajectories and trains the student to match the teacher distribution with forward KL, providing a more stable signal across iterations. Combined with principle-level experience and step-wise injection, this offers concrete guidance for engineering LLM agents that internalize experience as reusable capability rather than accumulating context that eventually collapses.
