ReadPaper Blog
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
AdaPlanBench is a benchmark for testing whether large language model agents can adaptively plan when both user preferences and world limitations are hidden at first and revealed only through interaction. Built from 307 household tasks and augmented with automatically generated dual constraints, it shows that current LLM agents still struggle to revise plans reliably as constraints accumulate, with the best model reaching 67.75% accuracy.

Why planning gets messy
The paper argues that planning by LLM agents becomes difficult in real-world settings because agents must satisfy both user constraints and world constraints while interacting over multiple steps. User constraints include preferences, priorities, and personal requirements, while world constraints include tool availability, environmental limits, and resource restrictions. The authors position this dual structure as central to practical agent behavior, since agents increasingly operate computers, write code, use tools, and support scientific work through sustained interaction. Existing benchmarks, according to the paper’s comparison, usually emphasize either user-side constraints or world-side constraints rather than their joint effect on planning. AdaPlanBench is motivated by the question of whether LLM agents can produce and revise plans that remain effective when both kinds of constraints shape the solution space.

The gap AdaPlanBench targets
AdaPlanBench targets the specific gap of adaptive planning under progressively disclosed dual constraints in open-ended tasks. The paper emphasizes that real constraints are often not fully specified upfront, so an agent must uncover them through feedback and revise its plan without losing track of what has already been ruled out. This setting differs from static planning benchmarks because the agent’s first plan may be invalid for reasons it could not initially observe, and success depends on iterative re-planning under an accumulating constraint set. The benchmark also preserves a large action and solution space, allowing any plan that satisfies the task and constraints rather than forcing a single canonical answer. This design makes AdaPlanBench a testbed for evaluating whether LLM agents can combine exploration, constraint memory, and grounded plan revision.

How the benchmark is built
The benchmark is built from a curated subset of 307 household-domain instances derived from the MacGyver dataset, which provides tasks that naturally require practical multi-step reasoning. The construction pipeline first rewrites raw MacGyver queries into short, method-agnostic household goals and filters for concrete tasks that require planning. Each retained instance is represented as a query paired with a dual-constraint profile containing a world constraint set and a user constraint set. The paper describes a multi-agent construction framework with role-specific models, including a query rewriter, binary query filter, planner samplers, constraint extractor, merge model, and constraint checker. Candidate plans are sampled to surface likely tools and strategies, extracted tools are converted into world limitations such as unavailable objects, and inferred tool attributes are converted into user preferences such as avoiding high-heat methods.
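To make the instance format concrete, here is a minimal sketch of a query paired with a dual-constraint profile. The class name, field names, and string templates are illustrative assumptions, not the paper's schema. In the actual pipeline these steps are carried out by role-specific models rather than fixed templates.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One benchmark item: a household query plus its hidden dual-constraint profile."""
    query: str
    world_constraints: list[str] = field(default_factory=list)
    user_constraints: list[str] = field(default_factory=list)


def build_instance(query: str, tools: list[str], attributes: list[str]) -> Instance:
    """Turn extracted tools and inferred tool attributes into constraints.

    Hypothetical templates for illustration only; the paper uses LLM-based
    extraction and checking instead of string formatting.
    """
    # dict.fromkeys drops duplicate entries while keeping first-seen order.
    world = [f"{tool} is unavailable" for tool in dict.fromkeys(tools)]
    user = [f"avoid {attr} methods" for attr in dict.fromkeys(attributes)]
    return Instance(query, world, user)


# Tools and attributes as they might surface from several sampled plans.
example = build_instance(
    "dry wet shoes quickly",
    tools=["hair dryer", "oven", "hair dryer"],
    attributes=["high-heat"],
)
print(example.world_constraints)
print(example.user_constraints)
```

Keeping the two constraint sets in separate fields mirrors the paper's distinction between world limitations and user preferences, so each side can be counted or analyzed on its own.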

How evaluation works
At evaluation time, AdaPlanBench withholds the full constraint profile and uses a multi-turn protocol to reveal constraints only when an agent proposes a plan that violates them. The agent submits a plan for the household task, a constraint judge checks it against hidden world and user constraints, and violated constraints are disclosed as feedback. The agent must then re-plan while respecting both the newly revealed constraint and all previous feedback, making the trajectory a test of adaptive revision rather than one-shot answer generation. The protocol includes judging mechanisms based on constraint checks and rubrics, with termination conditions such as success, maximum turns, early stopping, or rubric failure. This runtime design directly measures whether a model can infer, track, and operationalize constraints as interaction unfolds.
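The reveal-on-violation loop can be sketched as follows. This is a toy version under stated assumptions: the judge is a simple keyword match rather than the LLM-based constraint judge the paper describes, every violated constraint is disclosed at once, and only the success and maximum-turn termination conditions are modeled. All function names and the example task are hypothetical.

```python
from typing import Callable


def violated(plan: str, hidden: list[str]) -> list[str]:
    """Toy judge (assumption): a constraint is a banned keyword, violated if the plan mentions it."""
    lowered = plan.lower()
    return [c for c in hidden if c.lower() in lowered]


def run_episode(
    agent: Callable[[str, list[str]], str],
    task: str,
    hidden: list[str],
    max_turns: int = 5,
) -> dict:
    """Run one multi-turn trajectory, revealing constraints only when they are violated."""
    revealed: list[str] = []  # feedback accumulated so far
    for turn in range(1, max_turns + 1):
        plan = agent(task, revealed)
        hits = violated(plan, hidden)
        if not hits:
            return {"status": "success", "turns": turn, "revealed": revealed}
        for c in hits:
            if c not in revealed:
                revealed.append(c)
    return {"status": "max_turns", "turns": max_turns, "revealed": revealed}


def toy_agent(task: str, revealed: list[str]) -> str:
    """Hypothetical agent: proposes the first option that avoids all revealed constraints."""
    options = [
        "use a hair dryer to dry the shoes",
        "use the oven on low heat",
        "stuff the shoes with newspaper",
    ]
    for option in options:
        if not any(r in option for r in revealed):
            return option
    return options[-1]


# "hair dryer" stands in for a world constraint, "oven" for a user constraint.
result = run_episode(toy_agent, "dry wet shoes", hidden=["hair dryer", "oven"])
print(result)
```

The agent receives the full list of revealed constraints on every turn, which is the part of the protocol that tests constraint memory: a plan that fixes the newest violation but reintroduces an earlier one still fails.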

What the results say
The experiments evaluate ten leading open-source and proprietary LLMs and find that adaptive planning under progressively revealed dual constraints remains difficult. The strongest model reaches only 67.75% accuracy, while open-weight models typically remain at or below 30%, indicating a large gap between current agent abilities and the benchmark’s demands. The paper further reports that performance degrades as more constraints accumulate along a trajectory and as the overall constraint burden increases. User constraints are identified as especially challenging, and the authors observe that failures often involve weaker physical grounding and reduced goal effectiveness. Taken together, the results suggest that reliable LLM agent planning requires better mechanisms for constraint tracking, physical plausibility, and adaptive re-planning under dynamically revealed user and world requirements.
