ReadPaper Blog
Claw-SWE-Bench: Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks
Claw-SWE-Bench addresses a measurement problem in SWE-bench-style coding evaluation: general-purpose agents such as OpenClaw cannot be fairly scored unless their execution, workspace behavior, patch extraction, and prediction format are made compatible with the SWE-bench contract. The paper introduces a multilingual benchmark and adapter protocol that fix the task set, prompt, Docker runtime, budget, patch extraction, evaluator, and cost accounting so that the agent harness, or “claw,” can be compared as a controlled variable.
Source: Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Why can’t we just score a coding agent?
The paper argues that evaluating coding agents is not simply a matter of choosing a strong language model and running it on SWE-bench. SWE-bench expects a clean repository checkout, a Docker-based evaluation environment, and a prediction file containing a scorable patch in the `model_patch` field, while general-purpose agents such as OpenClaw normally operate through broader interactive sessions. This mismatch means that final explanations, internal logs, session files, caches, or other artifacts can fail to become valid repository diffs or can contaminate the submitted patch. The authors identify a deeper evaluation problem: reported resolved rates can conflate the LLM, the agent harness, the prompt and tool loop, the task set, the timeout, and the patch extraction strategy. Claw-SWE-Bench is motivated by the need to make OpenClaw-style systems eligible for real GitHub issue-resolution tasks without hiding harness effects inside a single end-to-end score.

Claw-SWE-Bench fixes the experiment
Claw-SWE-Bench proposes a controlled evaluation stack that separates fixed experimental conditions from the replaceable harness under test. The fixed base includes the task set, prompt template, execution container, per-instance runtime budget, workspace contract, patch extraction procedure, prediction format, and upstream SWE-bench evaluator. Different claws enter through a shared adapter protocol that maps each harness’s native lifecycle into the required repository-editing and patch-prediction process. This design turns the harness from an incidental implementation detail into an experimental variable, allowing comparisons where the model, tasks, and scoring pipeline can be held constant. The result is a benchmark for evaluating not only whether an LLM can solve coding tasks, but also whether the surrounding agent harness enables reliable autonomous software repair under the SWE-bench contract.

What is in the benchmark?
The full Claw-SWE-Bench workload contains 350 real GitHub issue-resolution instances across 8 programming languages and 43 repositories. The instances are drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini, with future-commit cleanup applied to avoid evaluation leakage from commits that should not be available at the task’s base state. Each task follows the repository-level SWE-bench setup: a problem statement, target repository, and base commit define the input, and success is determined by applying the generated patch inside the Docker evaluation environment and running the relevant tests. This multilingual composition broadens the evaluation beyond a single-language setting while keeping the scoring mechanism aligned with established SWE-bench practice. By using real GitHub issues rather than synthetic snippets, the benchmark targets practical agent behavior such as reading project context, editing files, and producing patches that survive repository tests.

Accuracy is not the whole story
A central claim of the paper is that accuracy alone is an incomplete measure of coding-agent quality. Autonomous coding systems repeatedly call models, inspect files, issue commands, run tests, and maintain session state, so two systems with similar Pass@1 can impose very different API costs, wall-clock durations, token usage patterns, and interaction lengths. Claw-SWE-Bench therefore reports total API cost, average wall-clock duration, cache hit rate, and Pass@1 under a fixed outer budget. The paper uses a resolve-rate–cost Pareto view over a five-claw by two-model sweep to show that accuracy and cost do not move in lockstep. This makes cost-aware reporting part of the benchmark design rather than a secondary diagnostic, which matters for reproducible evaluation, regression testing, and participation by teams with limited compute budgets.

Main takeaway: the harness matters a lot
The experiments show that harness design materially changes SWE-style coding performance. With the same GLM 5.1 backbone, OpenClaw using a minimal direct-diff adapter reaches only 19.1% Pass@1, while the full adapter reaches 73.4%, demonstrating that adaptation is not a minor engineering detail but a major determinant of measured coding ability. Across an OpenClaw sweep over nine models, model choice changes Pass@1 by 29.4 percentage points, and across a five-claw by two-model sweep, harness choice changes Pass@1 by up to 27.4 percentage points under fixed models. The paper also releases Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns to preserve resolve-rate parity, pairwise ranking stability, language coverage, and cost structure. Lite reduces evaluation cost while supporting faster adapter debugging, model screening, prompt iteration, and regression checks, while the full 350-instance benchmark remains the reference for comprehensive comparison.
