ReadPaper Blog
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
The paper studies how AI agents can conduct autonomous research over long horizons rather than merely executing many isolated experiments. It introduces Arbor, a framework for Autonomous Optimization that uses a persistent Hypothesis Tree Refinement process to connect hypotheses, artifacts, evidence, and reusable insights. The paper reports that Arbor achieves larger held-out gains than Codex and Claude Code across six real research tasks.
Source: Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Why autonomous research is hard
The paper frames autonomous research as a long-horizon process of exploration, experimentation, and abstraction, where progress depends on carrying lessons from one attempt into later decisions. Its central problem is that an AI agent must improve an initial research artifact under delayed feedback, costly experiments, and frequent failures without step-level human supervision. To make this problem operational, the authors define Autonomous Optimization, in which an agent receives an artifact, a natural-language objective, and an evaluator, then iteratively improves the artifact through experimental feedback. The paper argues that treating experiments as independent local trials loses the structure that makes research cumulative. Its motivation is therefore to build a persistent research state that records what was tried, what evidence was obtained, and how each result changes the search frontier.
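The Autonomous Optimization setup described above can be sketched as a small interface plus a loop. This is a minimal illustration, not the paper's implementation: the names `Task`, `optimize`, and `propose`, the "higher is better" scoring convention, and the toy numeric artifact are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Task:
    """The three inputs the agent receives (field names are hypothetical)."""
    artifact: Any                      # initial research artifact
    objective: str                     # natural-language objective
    evaluator: Callable[[Any], float]  # score; higher is better (assumption)


def optimize(task: Task, propose: Callable, budget: int):
    """Iteratively improve the artifact using evaluator feedback.

    `propose` stands in for the agent: it sees the current best artifact,
    the objective, and the full history of (candidate, score) pairs.
    """
    best = task.artifact
    best_score = task.evaluator(best)
    history: List[Tuple[Any, float]] = []
    for _ in range(budget):
        candidate = propose(best, task.objective, history)
        score = task.evaluator(candidate)
        history.append((candidate, score))  # failed attempts are recorded too
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, history


# Toy usage: the "artifact" is a number and the evaluator rewards x near 3.
task = Task(artifact=0.0, objective="move x toward 3",
            evaluator=lambda x: -(x - 3.0) ** 2)
best, best_score, history = optimize(
    task, lambda x, _objective, _history: x + 0.5, budget=10)
```

In this toy run the loop stops improving once a step overshoots the optimum, while the overshooting attempts remain in `history`. That retained record is the raw material the paper argues should be structured rather than discarded.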

The gap in current agents
The paper identifies a gap between long-running tool use and genuine long-horizon research. Existing coding agents such as Codex, Claude Code, and OpenHands can edit code, call tools, retrieve information, and run experiments for extended periods, but the authors argue that sustained execution alone does not preserve the meaning of past successes and failures. The paper also contrasts Arbor with earlier scientific-agent systems that often follow predefined workflows or revise a single line of work at a time. The missing mechanism, according to the paper, is the ability to maintain competing research directions, test them through concrete artifact changes, interpret evidence, and reshape later exploration. This analysis motivates a framework in which autonomy is not just persistence in action, but persistence in structured hypothesis management.

Arbor's core idea
Arbor is the paper’s proposed framework for Autonomous Optimization, and its architecture separates global strategy from local experimental execution. A long-lived coordinator owns the research state, decides how the search frontier should evolve, and manages the sequence of hypothesis refinements. Short-lived executors each test one hypothesis by implementing artifact changes in an isolated git worktree, which lets the system evaluate ideas without corrupting other branches of the research process. Executors return structured evidence rather than only a final answer, allowing the coordinator to compare directions and revise strategy. This design turns the agent from a single continuous coding loop into a research system that can explore multiple possible improvements while preserving a coherent global view.
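The coordinator/executor split can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the paper: `Evidence`, `run_executor`, and `Coordinator` are hypothetical names, and a deep copy of an in-memory dictionary stands in for the isolated git worktree (created with `git worktree add` in a real repository) so the example runs without a repository.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Evidence:
    """Structured result an executor hands back (fields are hypothetical)."""
    hypothesis: str
    dev_score: float
    succeeded: bool
    notes: str = ""


def run_executor(base_artifact: dict, hypothesis: str,
                 change: Callable[[dict], None],
                 evaluate: Callable[[dict], float]) -> Evidence:
    """Test one hypothesis in isolation and report what happened."""
    # Stand-in for an isolated git worktree: edits never touch the base copy.
    workspace = copy.deepcopy(base_artifact)
    try:
        change(workspace)
        return Evidence(hypothesis, evaluate(workspace), True)
    except Exception as exc:
        # A failure is still evidence, so it is returned rather than raised.
        return Evidence(hypothesis, float("-inf"), False, notes=repr(exc))


class Coordinator:
    """Long-lived owner of the research state; executors are short-lived."""

    def __init__(self, artifact: dict, evaluate: Callable[[dict], float]):
        self.artifact = artifact
        self.evaluate = evaluate
        self.log = []  # every Evidence record seen so far

    def run_round(self, hypotheses: Dict[str, Callable[[dict], None]]) -> Evidence:
        results = [run_executor(self.artifact, name, change, self.evaluate)
                   for name, change in hypotheses.items()]
        self.log.extend(results)
        return max(results, key=lambda e: e.dev_score)


def lower_lr(ws: dict) -> None:
    ws["lr"] = 0.01


def broken_change(ws: dict) -> None:
    ws["missing"] += 1  # raises KeyError, simulating a failed experiment


coord = Coordinator({"lr": 0.1}, evaluate=lambda a: -abs(a["lr"] - 0.01))
best = coord.run_round({"lower lr": lower_lr, "broken change": broken_change})
```

After the round, `coord.artifact` is unchanged and `coord.log` holds both the successful and the failed attempt. This mirrors the isolation and structured-evidence properties described above.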

Hypothesis Tree Refinement
The technical core of Arbor is Hypothesis Tree Refinement, or HTR, which represents the research process as a persistent tree linking hypotheses, artifact versions, experimental evidence, and distilled insights. Each node binds a proposed research direction to the concrete artifact that realizes it and to the results produced by testing it. When executor results return, Arbor writes evidence back to the executed nodes, abstracts local findings upward, and uses the updated tree to choose which branches to expand, prune, or merge. The paper emphasizes that failures are not discarded, because they can become reusable constraints or lessons for later hypotheses. Candidate improvements are promoted only when they improve a held-out evaluation, making the tree both an audit trail and a mechanism for disciplined artifact improvement.
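One way to picture the tree is as a node type that binds a hypothesis to an artifact reference and its evidence, with helpers for write-back, frontier selection, and promotion. The sketch below is an assumption-laden illustration, not the paper's data model. `Node`, `write_back`, `frontier`, and `promote` are hypothetical names, insights are propagated upward verbatim rather than distilled, and merging is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    hypothesis: str
    artifact_ref: str  # e.g. a branch or commit name (assumption)
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    dev_score: Optional[float] = None  # evidence; None means untested
    insights: List[str] = field(default_factory=list)
    pruned: bool = False

    def add_child(self, hypothesis: str, artifact_ref: str) -> "Node":
        child = Node(hypothesis, artifact_ref, parent=self)
        self.children.append(child)
        return child


def write_back(node: Node, dev_score: float, insight: str) -> None:
    """Attach evidence to the executed node and pass the insight to ancestors."""
    node.dev_score = dev_score
    current: Optional[Node] = node
    while current is not None:
        current.insights.append(insight)
        current = current.parent


def frontier(root: Node) -> List[Node]:
    """Unpruned, untested leaves: the candidates for the next experiments."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        if node.pruned:
            continue  # skip the whole pruned subtree
        if not node.children and node.dev_score is None:
            out.append(node)
        stack.extend(node.children)
    return out


def promote(incumbent_heldout: float, candidate_heldout: float) -> bool:
    """Promote only on a strict held-out improvement (higher is better here)."""
    return candidate_heldout > incumbent_heldout


root = Node("baseline", "main")
warmup = root.add_child("warmup helps", "exp/warmup")
batch = root.add_child("bigger batch helps", "exp/batch")
write_back(warmup, 0.71, "warmup stabilises early training")
write_back(batch, 0.40, "very large batches diverge")
batch.pruned = True  # the failure is kept as an insight, the branch is closed
combo = warmup.add_child("warmup plus cosine decay", "exp/warmup-cosine")
```

Here the pruned branch still contributes an insight to the root, and the only frontier node left is the refinement of the successful hypothesis. The scores and hypothesis names are invented for the example.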

What the paper shows
The paper evaluates Arbor on six real Autonomous Optimization tasks spanning model training, harness engineering, and data synthesis, each with an initial artifact, objective, native metric, and development/test protocol. It reports that Arbor achieves the best held-out result on all six tasks and more than 2.5× the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. The tasks named in the paper include optimizer design, architecture design, Terminal-Bench, BrowseComp, Search-Agent, and math-reasoning data synthesis, which indicates that the framework is meant to generalize beyond a single research domain. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, which the paper reports as the strongest result in its comparison. The authors attribute these gains to evidence-structured research: hypotheses remain grounded in executable artifacts, local findings become reusable insights, and later decisions are made over an explicit research state.
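To make "average relative held-out gain" concrete, the sketch below shows one plausible way such a number could be computed when tasks use different native metrics. The formula (improvement divided by the absolute baseline, with the sign flipped for lower-is-better metrics) is an assumption for illustration, since this summary does not give the paper's exact definition. The function names and the values in the usage lines are hypothetical and are not results from the paper.

```python
from typing import Iterable, Tuple


def relative_gain(baseline: float, final: float,
                  higher_is_better: bool = True) -> float:
    """Improvement over the baseline as a fraction of the baseline's size."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    delta = final - baseline if higher_is_better else baseline - final
    return delta / abs(baseline)


def average_relative_gain(results: Iterable[Tuple[float, float, bool]]) -> float:
    """Mean of per-task relative gains over (baseline, final, higher_is_better)."""
    gains = [relative_gain(b, f, hib) for b, f, hib in results]
    return sum(gains) / len(gains)


# Hypothetical values only: an accuracy-style task and a loss-style task.
toy_results = [
    (0.50, 0.60, True),   # accuracy rises from 0.50 to 0.60
    (2.00, 1.50, False),  # loss falls from 2.00 to 1.50
]
toy_average = average_relative_gain(toy_results)
```

Normalizing each task by its own baseline lets metrics on different scales be averaged, which is presumably why a relative rather than absolute gain is reported.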
