ReadPaper Blog
Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference
The paper studies how deployed AI agents can improve their harness (tools, prompts, workflows, and skills) when no labeled validation set is available. It introduces Retrospective Harness Optimization (RHO), a self-supervised method that mines unlabeled past trajectories, re-solves difficult tasks, diagnoses failures through self-validation and self-consistency, and selects harness updates through pairwise self-preference. The result matters because it indicates that agents can improve future task performance from operational history alone: the paper reports a SWE-Bench Pro pass-rate gain from 59% to 78% after one optimization round without external grading.
Source: Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

When No Labels Exist
The paper’s central problem is that AI agents need their harnesses to evolve after deployment, but most harness-optimization methods depend on labeled validation sets that are often unavailable or poorly matched to future tasks. The authors define a harness as the persistent collection of tools, prompts, skills, and workflows that surround a fixed agent model and shape how it reasons, acts, and observes. In deployment, the agent naturally accumulates trajectories from past tasks, and the paper asks whether those unlabeled traces contain enough signal to improve the harness. The formal obstacle is that the true utility function for future tasks is latent, so the agent cannot directly optimize against a ground-truth success metric. RHO addresses this gap by replacing validation feedback with self-preference over trajectories, turning retrospective experience into a source of optimization signal.

RHO: Retrospective Harness Optimization
Retrospective Harness Optimization is presented as a three-stage self-supervised pipeline over past trajectories: coreset selection, group rollout, and best-of-N harness proposal. In the first stage, the method selects a small but informative coreset of tasks that are both challenging and diverse, using a language-model judge for difficulty scoring and a Determinantal Point Process to encourage diversity. In the second stage, the agent re-attempts each selected task multiple times in parallel under the original harness, producing grouped rollouts for comparison. The paper extracts two diagnostic signals from these rollouts: self-validation, which identifies problems within a trajectory, and self-consistency, which detects disagreements across multiple attempts on the same task. This design focuses computation on trajectories likely to expose reusable failure modes rather than spending optimization budget on easy or redundant cases.
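The coreset stage described above can be illustrated with a small sketch. It is not the paper's implementation: it assumes each task already has a judge-assigned difficulty score in [0, 1] and an embedding vector, builds a quality-diversity kernel from them, and uses naive greedy determinant maximization as a stand-in for DPP selection. The names `select_coreset`, `difficulty`, and `embeddings` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def select_coreset(difficulty, embeddings, k):
    """Greedily pick up to k task indices that maximize det(L_S), where
    L[i][j] = difficulty[i] * cosine(emb_i, emb_j) * difficulty[j].
    (Assumed kernel and greedy approximation, for illustration only.)"""
    n = len(difficulty)
    L = [[difficulty[i] * cosine(embeddings[i], embeddings[j]) * difficulty[j]
          for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # remaining tasks are redundant or have zero difficulty
            break
        chosen.append(best)
    return chosen

# Toy data: tasks 0 and 1 are hard near-duplicates, task 2 is easy, task 3 is hard and distinct.
difficulty = [0.9, 0.85, 0.2, 0.8]
embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
coreset = select_coreset(difficulty, embeddings, k=2)
print(coreset)  # [0, 3]
```

In the toy data, the second pick skips task 1 even though it is nearly as hard as task 0, because its near-identical embedding makes the determinant collapse. That is the "challenging and diverse" trade-off in miniature.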

How the Harness Gets Edited
The paper then uses the diagnostic information to generate candidate harness updates, treating observed failures and disagreements as instructions for improving tools, skills, workflows, or prompts. For each candidate harness, the agent re-solves the coreset tasks and compares the new trajectories against a fixed baseline rollout from the original harness. Selection is performed through pairwise self-preference, where the same backbone model ranks trajectories and provides rationales rather than relying on external labels or a validation metric. The best candidate is accepted only if its average preference score over the coreset exceeds the baseline, which guards against adopting a harness that the agent itself judges to be worse. This stage matters because RHO optimizes the full executable and instructional environment around the agent, not merely a memory bank or a list of remembered tips.
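The accept-only-if-better rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `judge(candidate, baseline)` is a hypothetical callable returning a preference score in [0, 1] where 0.5 means a tie, `rollout(harness, task)` is a hypothetical function that re-solves a task under a harness, and "exceeds the baseline" is interpreted here as an average score above 0.5.

```python
from statistics import mean

def select_harness(candidates, baseline_rollouts, rollout, judge, threshold=0.5):
    """Return (best_harness, score), or (None, threshold) to keep the original.

    baseline_rollouts maps each coreset task to its fixed baseline trajectory.
    """
    best, best_score = None, threshold
    for harness in candidates:
        scores = [judge(rollout(harness, task), base)
                  for task, base in baseline_rollouts.items()]
        avg = mean(scores)
        if avg > best_score:  # strictly better than both the baseline and earlier candidates
            best, best_score = harness, avg
    return best, best_score

# Stub data: a trajectory is reduced to one quality number so the example runs offline.
quality = {
    "add_test_runner_skill": {"t1": 0.9, "t2": 0.7},
    "verbose_prompt": {"t1": 0.3, "t2": 0.6},
}
baseline = {"t1": 0.5, "t2": 0.6}

def rollout(harness, task):
    return quality[harness][task]

def judge(candidate, base):
    # Stand-in for the backbone model's pairwise preference.
    if candidate > base:
        return 1.0
    if candidate < base:
        return 0.0
    return 0.5

chosen, score = select_harness(list(quality), baseline, rollout, judge)
print(chosen, score)  # add_test_runner_skill 1.0
```

If every candidate averages at or below the threshold, the function returns `None` and the original harness stays in place, which mirrors the safeguard described above.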

What the Paper Shows
Empirically, the paper evaluates RHO across three agent domains spanning software engineering, technical work, and knowledge work, and reports consistent performance improvements. The most concrete reported result is on SWE-Bench Pro, where a single round of retrospective harness optimization raises pass rate from 59% to 78% without external grading. The authors emphasize that this improvement comes from unlabeled past software-engineering trajectories rather than from iterative tuning on a labeled validation set. Their analysis indicates that RHO targets prior failure modes by designing specific skills and tools that alter the agent’s behavior patterns. The paper also reports that these behavior changes help sustain higher accuracy during long-horizon sessions, suggesting that harness optimization can affect how an agent operates over extended task sequences rather than only improving isolated answers.
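To put the SWE-Bench Pro figures in perspective, the short calculation below derives the absolute gain, the relative gain, and the share of previously failing tasks that now pass. Only the 59% and 78% values come from the paper; the derived quantities and variable names are illustrative.

```python
before, after = 0.59, 0.78  # reported pass rates before and after one RHO round

absolute_gain_pts = (after - before) * 100          # percentage points
relative_gain = (after - before) / before           # improvement relative to the old pass rate
failure_reduction = (after - before) / (1 - before)  # fraction of the old failure rate removed

print(f"{absolute_gain_pts:.0f} points")  # 19 points
print(f"{relative_gain:.1%}")             # 32.2%
print(f"{failure_reduction:.1%}")         # 46.3%
```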

Takeaway
The broader implication of the paper is that accumulated deployment trajectories can become a practical substrate for agent self-improvement when ground-truth evaluation is scarce. RHO differs from validation-feedback harness optimization because it performs a single retrospective pass over unlabeled experience, and it differs from memory-centered self-improvement because it can modify the broader harness, including instructions and executable tools. The paper positions self-preference as a proxy for latent task utility, while its formulation makes explicit that the true future-task utility remains unobserved. Its quantitative analysis of diagnostic signals argues that coreset selection, self-validation, self-consistency, and candidate ranking progressively isolate information that contributes to better harness updates. If this approach generalizes, the paper suggests a deployment model in which agents periodically mine their own histories to adapt to future workloads without requiring a continuously curated validation set.
