ReadPaper Blog
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
ResearchClawBench evaluates whether AI systems can conduct end-to-end autonomous scientific research rather than merely answer scientific questions or reproduce exposed papers. The benchmark builds 40 tasks from real published papers across 10 scientific domains, provides related literature and raw data while hiding the target paper, and scores outputs with expert-curated weighted rubrics. Its results show that current autonomous agents and native LLMs remain far below target-paper-level re-discovery, making the benchmark a concrete frontier for measuring progress toward reliable AI scientific discovery.
Source: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Mission Briefing: Can an AI Re-Discover Science?
ResearchClawBench addresses a central evaluation gap in automated scientific research: current AI coding agents are increasingly marketed as autonomous research systems, but their ability to complete the full research loop is difficult to verify. The paper defines this loop as more than scientific question answering, requiring systems to work from literature and raw data toward experiments, analysis, figures, and a research report. To make the task scientifically meaningful, the benchmark derives each task from a real published paper with a clear research question, accessible raw data, and practical research value. During evaluation, the target paper is hidden, so a system must attempt target-paper-level re-discovery rather than copy or adapt known conclusions. The resulting suite contains 40 end-to-end tasks spanning Astronomy, Chemistry, Earth Science, Energy Science, Information Science, Life Science, Material Science, Mathematics, Neuroscience, and Physics, giving the benchmark broad coverage across data modalities and evidence standards.

Why Old Tests Fall Short
The paper argues that prior scientific AI benchmarks capture important subskills but do not establish whether a system can perform autonomous research from start to finish. Scientific QA and reasoning benchmarks such as SciQ, GPQA, MMLU, Humanity’s Last Exam, SciBench, and ATLAS test knowledge, factual understanding, or difficult domain reasoning, but they usually stop at local answers rather than a complete research artifact. Agent and research-environment benchmarks such as ScienceWorld, DiscoveryWorld, SciCode, ScienceAgentBench, MLAgentBench, MLE-bench, MLGym, PaperBench, CORE-Bench, and ReproduceBench move closer to research workflows, yet the paper distinguishes them by their narrower scope, simulated settings, exposed target papers, domain concentration, or focus on reproduction rather than hidden-target re-discovery. ResearchClawBench is positioned as a system-agnostic benchmark that combines real papers, raw data, executable interaction, end-to-end reports, broad domain coverage, and open research scope. This design responds to the paper’s claim that open-ended scientific outputs need verifiable anchors without reducing research to exact-match tests or short-answer scoring.

How RCBench Judges the Trial
ResearchClawBench evaluates open-ended research outputs through expert-curated multimodal rubrics derived from the hidden target papers. The paper’s method decomposes expected scientific artifacts into weighted, verifiable sub-criteria, so scoring can assess whether a system recovered core methods, evidence, analyses, and conclusions rather than merely produced plausible prose. This rubric structure is intended to reduce the ambiguity of judging research reports while avoiding overreliance on unconstrained LLM-as-judge evaluation. The benchmark anchors interpretation with a 50-point threshold: a score of 50 indicates target-paper-level re-discovery, while scores above 50 indicate the discovery regime. By tying scores to hidden target-paper artifacts and weighted criteria, the paper provides a way to compare diverse research outputs while preserving room for genuinely new findings.

The Arena: Agents vs. Native LLMs
The experiments evaluate both full autonomous research agents and native LLM baselines under a unified protocol. The paper reports results for seven autonomous research agents, including systems such as Claude Code, Codex CLI, and OpenClaw, on the same ResearchClawBench tasks. To make native LLMs comparable even when they lack a full agent scaffold, the authors introduce ResearchHarness, a lightweight tool-use evaluation harness for LLM baselines. Through ResearchHarness, the paper evaluates seventeen native LLMs, including Claude-Opus-4.7, Claude-Opus-4.6, Qwen3.7 Max, GLM 5.1, Kimi K2.6, Gemini 3.5-Flash, DeepSeek V4-Pro, GPT 5.5, MiMo variants, Grok variants, and other frontier models listed in the benchmark overview. This experimental setup makes the benchmark not just a task collection but a comparative measurement framework for autonomous agents and LLMs attempting the same hidden-target scientific re-discovery problem.

What the Results Mean
The main empirical finding is that current systems remain far from reliable autonomous scientific re-discovery. The strongest autonomous agent reported in the paper, Claude Code, averages 21.5 points, well below the 50-point target-paper-level threshold, and even the best autonomous-agent result selected per task yields a frontier mean of only 24.6. Among ResearchHarness LLM baselines, Claude-Opus-4.7 averages 20.7, while the LLM frontier mean reaches 26.5, again indicating a substantial gap from the re-discovery threshold. The paper’s error analysis identifies three recurring failure modes: experimental protocol mismatch, evidence mismatch, and missing scientific core. These results imply that present AI systems can often produce partial research-like outputs but do not yet reliably reconstruct the central experimental logic, evidential support, and scientific contribution of real papers from raw materials.
