ReadPaper Blog
Agents’ Last Exam: Benchmarking AI Agents on Real-World, Economically Valuable Tasks
Agents’ Last Exam introduces ALE, a benchmark for evaluating whether AI agents can complete long-horizon, real-world professional workflows with verifiable outcomes. The paper argues that strong scores on existing benchmarks have not reliably translated into economically meaningful deployment, so evaluation must measure sustained task completion across domains such as engineering, finance, health, law, manufacturing, visual media, and computing.
Source: Agents’ Last Exam: Benchmarking AI Agents on Real-World, Economically Valuable Tasks

Why another benchmark?
The paper’s central motivation is that AI systems have achieved prominent benchmark successes without producing equally broad, measurable gains in economically important professional work. It frames this mismatch as a utility problem rooted partly in evaluation: many benchmarks test abstract competence, short interactions, or static answers rather than the extended workflows that constitute real jobs. The authors argue that benchmarks shape research priorities, engineering targets, and deployment pathways, citing the role of ImageNet as an example of how a well-designed evaluation can accelerate progress in a domain. For sectors such as finance, law, electrical engineering, and manufacturing, the paper claims that comparable evaluations remain underdeveloped. Agents’ Last Exam is proposed as a way to test whether agents can perform work that matters outside benchmark leaderboards.

The gap
The paper identifies a specific gap in existing agent evaluation: real professional workflows are long, heterogeneous, and difficult to verify. Prior benchmarks often simplify one dimension of the problem by using shorter computer-use tasks, synthetic environments, narrow domain coverage, or question-answering formats. ALE is motivated by the claim that economically valuable work requires sustained planning, tool use, file manipulation, and domain-specific judgment over many steps. The authors emphasize that correct outputs may take the form of spreadsheets, reports, design files, media artifacts, models, or other deliverables rather than a single answer string. This makes evaluation harder, but it is also why the paper treats verifiable real-world task completion as essential for measuring useful agent capability.

What ALE is
Agents’ Last Exam defines its target as the Generalist Computer-Use Agent, a system that combines visual perception, code execution, tool use, and long-horizon planning in a unified action loop. The benchmark is built around long-horizon, economically valuable, real-world tasks whose outcomes can be checked through structured deliverables or milestone-based evaluation. The paper positions ALE as broader than GUI-only benchmarks such as OSWorld and CLI-only benchmarks such as Terminal-Bench, because many ALE tasks require agents to interleave desktop applications, browsers, domain-specific software, shell commands, code execution, and file operations. Its task examples span areas such as weather forecasting, hardware verification, molecular docking, RF circuit matching, 3D CAD modeling, CNC toolpath generation, single-cell clustering, and chip signoff routing. The implication is that ALE measures integrated professional workflow execution rather than isolated model knowledge or a single interaction modality.

How it was built
The construction of ALE relies on a large expert-driven taxonomy anchored in O*NET / SOC 2018, the U.S. federal occupational taxonomy. The paper reports collaboration with more than 250 industry experts and organizes the benchmark into 55 subfields grouped into 13 industry clusters, covering more than 1,000 task instances. Its taxonomy includes domains such as Engineering & Architecture, Manufacturing & Industrial Systems, Computing & Mathematical Sciences, Health & Medicine, Business & Finance, Life Sciences, Visual & Media Arts, Legal, and Physical Sciences. Expert advisory committees identify economically meaningful workflow families, while domain contributors provide tasks drawn from real professional practice rather than invented toy scenarios. The paper also describes a quality-control process involving first-pass review, engineer dry-runs, and final peer review by expert committees before tasks are admitted.

What the results say
The reported results show that ALE’s hardest tier is far from saturated by current AI agent systems. Across mainstream harness and backbone configurations, the paper states that the average full pass rate is below 1%, indicating that existing agents struggle to complete the most demanding real-world professional workflows end to end. This finding supports the authors’ claim that benchmark competence on conventional tests should not be equated with readiness for GDP-relevant deployment. ALE is also designed as a living benchmark, meaning its task pool is intended to grow as additional workflows and industries are onboarded. The paper’s broader implication is that sustained, verifiable evaluation on authentic work could help align AI progress with economically meaningful capability rather than leaderboard performance alone.
