ReadPaper Blog
BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
BenchEvolver addresses benchmark saturation in coding evaluation: frontier large language models now solve so many existing tasks that datasets such as LiveCodeBench can no longer distinguish their capabilities. The paper proposes a solution-centric evolutionary framework that mutates executable reference solutions first, then derives aligned problem statements and tests, producing harder but still verifiable tasks. Its experiments on LiveCodeBench and SciCode show that evolved tasks can restore model discrimination and provide useful reinforcement-learning signal for self-improvement.
Source: BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

When Benchmarks Get Too Easy
The paper begins from the problem that rapid progress in frontier large language models has saturated widely used coding benchmarks, reducing both evaluation value and training signal. On LiveCodeBench, the authors report that frontier models exceed 99% Pass@1 on easy splits and surpass 90% Pass@1 on average across difficulty levels, making the benchmark less able to separate strong systems. The authors argue that this issue extends beyond competitive programming to reasoning, scientific problem solving, and agentic tasks, where static datasets can lose discriminative power as models improve. Human construction of new frontier-level benchmarks remains expensive and slow, creating a bottleneck for continuous evaluation. BenchEvolver is motivated as a way to convert existing saturated tasks into harder, verifiable variants that can co-evolve with model capability.

The Core Trick: Evolve the Solution First
The central methodological move in BenchEvolver is to generate new tasks in solution space rather than starting with natural-language problem statements. Given an existing executable task with a statement, reference implementation, hidden tests, and execution harness, the framework mutates the reference solution and then derives the corresponding statement, examples, and tests from that evolved solution. This design grounds synthetic task generation in executable semantics, which helps ensure that the generated problem has a correct oracle and testable behavior. The paper emphasizes that accepted mutations must alter the solution structure enough that the parent algorithm is no longer sufficient, so the evolved task is not merely a surface rewrite. This solution-centric approach is presented as a response to weaknesses in instruction-level synthetic generation, where surface diversity may not imply deeper algorithmic difficulty.
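
To make the solution-first idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the `Task` container, the helper names, and the specific mutation (allowing one deletion in a maximum-subarray problem) are all assumptions chosen for illustration. It shows two things described above: tests are labeled by running the evolved reference solution, and a mutation is only interesting if the parent algorithm no longer passes those tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    """Hypothetical container for an executable task."""
    statement: str
    solution: Callable[[List[int]], int]   # executable reference solution (the oracle)
    tests: List[Tuple[List[int], int]]     # (input, expected output) pairs


def derive_tests(solution, inputs):
    """Label each input by running the reference solution, so the tests
    agree with the oracle by construction."""
    return [(list(x), solution(list(x))) for x in inputs]


def parent_solution(nums):
    """Parent task: maximum sum of a non-empty contiguous subarray (Kadane)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def evolved_solution(nums):
    """Illustrative mutation: at most one element may be deleted from the
    chosen subarray, which must stay non-empty. This needs an extra DP state."""
    keep = nums[0]           # best sum ending here with no deletion
    drop = float("-inf")     # best sum ending here with exactly one deletion
    best = nums[0]
    for x in nums[1:]:
        drop = max(drop + x, keep)   # extend a run that already deleted, or delete x
        keep = max(x, keep + x)
        best = max(best, keep, drop)
    return best


def parent_is_sufficient(solver, tests):
    """True if the given solver passes every test of the evolved task."""
    return all(solver(list(x)) == expected for x, expected in tests)


inputs = [[1, -2, 0, 3], [1, -2, -2, 3], [-1, -1, -1]]
evolved_task = Task(
    statement="Return the maximum sum of a non-empty contiguous subarray "
              "after deleting at most one element from it.",
    solution=evolved_solution,
    tests=derive_tests(evolved_solution, inputs),
)

# Under this sketch's acceptance rule, the mutation counts as structural
# only if the parent algorithm fails at least one derived test.
mutation_is_structural = not parent_is_sufficient(parent_solution, evolved_task.tests)
```

In this toy case the parent algorithm fails on `[1, -2, 0, 3]`, where deleting `-2` yields a larger sum, so the mutation would count as more than a surface rewrite. In the paper the statement is also derived from the evolved solution; here it is written by hand.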

How the Evolution Loop Works
BenchEvolver operates as a closed-loop system with a Proposer, an Evaluator, and a Memory module. The Proposer constructs candidate evolutions by transforming the reference solution and producing the associated problem statement and tests. The Evaluator then applies independent consistency checks, including whether the solution passes the tests, whether tests align with the statement, and whether the statement aligns with the solution. Candidate tasks are also screened for meaningfulness, diversity, and empirical difficulty against a panel of target models, with difficulty defined behaviorally by model failure on hidden tests rather than by a heuristic label. The Memory module records accepted lineages, rejection and repair histories, pass rates, target-model failures, and successful mutation patterns, allowing later search to exploit lessons from prior evolutions.
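
The acceptance logic described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' code: plain Python functions stand in for target models, the `max_solve_rate` threshold is hypothetical, and the statement-alignment, meaningfulness, and diversity checks (which would need a model or a human judge) are omitted. Only the executable parts are shown: the reference solution must pass its own tests, difficulty is measured by how many target solvers fail, and every decision is logged to memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    """Hypothetical candidate produced by a proposer."""
    name: str
    solution: Callable[[int], int]
    tests: List[Tuple[int, int]]   # (input, expected output) pairs


@dataclass
class Memory:
    """Append-only log of evaluation outcomes for later search steps."""
    records: List[Dict] = field(default_factory=list)

    def log(self, **record):
        self.records.append(record)


def solves(solver, tests):
    """True if the solver returns the expected output on every test."""
    return all(solver(x) == expected for x, expected in tests)


def evaluate(candidate, target_solvers, memory, max_solve_rate=0.5):
    """Accept a candidate if it is self-consistent and empirically hard."""
    # Consistency check: the reference solution must pass its own tests.
    if not solves(candidate.solution, candidate.tests):
        memory.log(name=candidate.name, accepted=False,
                   reason="reference solution fails tests")
        return False
    # Behavioral difficulty: fraction of target solvers passing all tests.
    solve_rate = sum(solves(s, candidate.tests) for s in target_solvers) / len(target_solvers)
    accepted = solve_rate <= max_solve_rate
    memory.log(name=candidate.name, accepted=accepted, solve_rate=solve_rate,
               reason="hard enough" if accepted else "too easy for target panel")
    return accepted


def sum_of_squares(n):
    return n * (n + 1) * (2 * n + 1) // 6


# Stand-ins for target models: one correct solver and two that fail.
def brute_force_solver(n):
    return sum(i * i for i in range(1, n + 1))


def triangular_solver(n):
    return n * (n + 1) // 2   # still solving a simpler sum-of-integers task


def square_solver(n):
    return n * n


memory = Memory()
panel = [brute_force_solver, triangular_solver, square_solver]
good = Candidate("sum_of_squares", sum_of_squares, [(1, 1), (2, 5), (3, 14), (10, 385)])
bad = Candidate("mislabelled", sum_of_squares, [(2, 4)])   # test disagrees with the oracle

accepted_good = evaluate(good, panel, memory)
accepted_bad = evaluate(bad, panel, memory)
```

Here `good` is accepted because only one of three stand-in solvers passes, while `bad` is rejected at the consistency check. The logged records are the kind of signal a memory-guided proposer could consult when choosing its next mutation.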

What They Built and Found
The paper evaluates BenchEvolver on LiveCodeBench, a competitive-programming benchmark, and SciCode, a research-oriented scientific coding benchmark. Across these domains, the authors report that the framework generates valid, diverse, and substantially harder evolved tasks at scale while preserving reference correctness. The study also introduces LIVECODEBENCH-PLUS, a 91-problem benchmark that combines evolved tasks with difficult original LiveCodeBench v6 tasks. On this upgraded benchmark, frontier-model Pass@1 ranges from 27.5% to 62.6%, which separates strong coding models more clearly than the saturated easy splits do. The experiments further indicate that solution-centric generation outperforms a problem-centric baseline and that memory-guided evolution improves over independent one-step mutations.
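
For readers unfamiliar with the metric, the sketch below computes Pass@1 using the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled solutions that pass all hidden tests. This summary does not state the paper's exact sampling protocol, so the function names and the per-problem counts here are illustrative assumptions rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass all tests,
    k: attempt budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0   # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(per_problem):
    """Average Pass@1 over problems, given (n, c) counts per problem."""
    return sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)


# Hypothetical counts, not taken from the paper: (samples, passing samples).
example_counts = [(10, 3), (10, 10), (10, 0), (10, 5)]
example_score = benchmark_pass_at_1(example_counts)
```

A benchmark is saturated in this sense when nearly every problem has `c` close to `n` for all strong models, so the averages bunch up near 100% and stop separating them.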

The Punchline: Hard Tasks Can Teach Too
A key implication of the paper is that evolved tasks are not only harder evaluation items but can also become reusable training signal. The authors test this idea with reinforcement learning on evolved LiveCodeBench tasks using gpt-oss-20b as both the evolver and the target model. They report that seed-plus-evolved training improves held-out coding performance by +8.7 Pass@1 on LiveCodeBench v6 Hard and +8.3 Pass@1 on LCB-Pro Easy. These gains exceed seed-only reinforcement-learning gains by 70.7% and 34.8%, respectively, in the two held-out settings highlighted in the paper. The result supports the paper’s broader claim that self-challenging task evolution can close a loop from model-generated weaknesses to capability improvement without depending on a strictly stronger teacher model.
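
One plausible reading of the percentages above is that they are relative differences between two absolute gains: the gain from seed-plus-evolved training and the gain from seed-only training. The sketch below shows that arithmetic under this assumption. The function name and the input values are hypothetical and deliberately not taken from the paper, which this summary does not quote in enough detail to reproduce.

```python
def relative_excess(combined_gain, baseline_gain):
    """Percentage by which one gain exceeds another.

    Both arguments are absolute improvements in the same unit,
    for example Pass@1 points over the untrained model.
    """
    if baseline_gain <= 0:
        raise ValueError("baseline_gain must be positive for a relative comparison")
    return 100 * (combined_gain - baseline_gain) / baseline_gain


# Hypothetical example: a +8.0 point gain versus a +5.0 point baseline gain
# means the combined gain exceeds the baseline gain by 60 percent.
example_excess = relative_excess(8.0, 5.0)
```

Read this way, a large relative percentage can correspond to a modest difference in absolute points when the baseline gain is small, so the absolute gains and the relative percentages are best read together.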
