ReadPaper Blog
The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs
This paper studies how to allocate a fixed inference-time token budget across many LLM queries instead of giving every query the same generation limit. It models each query’s reasoning benefit with a thresholded, surge-shaped utility curve and derives an economic allocation rule governed by a global shadow price. The proposed CLEAR framework improves cost-accuracy tradeoffs on mixed mathematical reasoning workloads by reallocating compute away from low-return queries toward queries near high-leverage reasoning thresholds.
Source: The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

When More Thinking Stops Helping
The paper addresses a practical bottleneck in inference-time scaling: longer reasoning can improve large language model (LLM) performance, but deployed systems face strict global compute and token budgets. Wan, Zhu, Cai, Chen, Huang, Zhou, and Sun argue that the central question is no longer whether more test-time compute helps, but how a fixed token supply should be distributed across heterogeneous queries. Standard uniform policies, such as assigning the same maximum new-token limit to every request, implicitly assume that all tasks convert tokens into accuracy in similar ways. The paper shows why this assumption is inefficient for mixed reasoning traffic, where easy problems may already be saturated while difficult problems may still be below the minimum useful reasoning length. Its contribution is to formulate token allocation as a batch-level constrained optimization problem that maximizes aggregate expected reasoning utility under a total budget.

The S-Curve Under the Waves
The paper's empirical motivation is that reasoning utility is nonlinear in the number of generated tokens, not proportional to it. Using Qwen2.5-Math-7B, the authors sample reasoning trajectories at high temperature and group the outputs into length bins to estimate conditional Pass@1 on AIME-24, GSM8K, and MATH-500. These experiments support a three-region compute-utility pattern: a Strict phase where short trajectories provide negligible utility, a Surge phase where performance rises sharply after a threshold, and an Ample phase where extra tokens yield diminishing returns or may even degrade solution quality. This S-shaped structure explains why a fixed token limit can simultaneously underserve hard instances and overspend on easy ones. The paper uses this empirical pattern as the basis for a more query-aware allocation model.
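
The length-binning step can be illustrated with a minimal Python sketch. The function name binned_pass_rate, the (num_tokens, is_correct) sample format, the bin width, and the toy data are all assumptions made for illustration; this summary does not describe the paper's exact binning procedure.

```python
from collections import defaultdict


def binned_pass_rate(samples, bin_width):
    """Estimate the fraction of correct answers per trajectory-length bin.

    samples: iterable of (num_tokens, is_correct) pairs (hypothetical format).
    Returns {(bin_start, bin_end): pass_rate}, ordered by bin start.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for num_tokens, is_correct in samples:
        b = num_tokens // bin_width  # index of the length bin
        totals[b] += 1
        hits[b] += int(is_correct)
    return {
        (b * bin_width, (b + 1) * bin_width): hits[b] / totals[b]
        for b in sorted(totals)
    }


# Toy trajectories for one query: short ones fail, longer ones succeed.
samples = [(120, False), (180, False), (260, True),
           (290, False), (310, True), (350, True)]
rates = binned_pass_rate(samples, bin_width=100)
for (start, end), rate in rates.items():
    print(f"{start}-{end} tokens: {rate:.2f}")
```

On this toy data the pass rate climbs from 0.0 to 0.5 to 1.0 across the three bins, which is the kind of threshold-then-rise shape the Strict and Surge phases describe.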

A Hidden Threshold for Each Query
To connect binary correctness with graded reasoning progress, the paper introduces a latent utility view of each query. Although an observed answer is only marked correct or incorrect, the model assumes an unobserved continuous reasoning potential, denoted ϕ(t), that evolves with generation length t. For query i, this potential is a shifted-surge function: it stays at zero up to an emergence threshold τ_i, and for t > τ_i it equals α_i(t−τ_i)e^(−β_i(t−τ_i)). The parameter τ_i captures the minimum length required for useful reasoning to emerge, α_i controls the initial rate at which utility rises, and β_i captures saturation or decay from excessive generation. This formulation gives the paper a concrete way to express instance-specific Strict, Surge, and Ample regimes inside a global optimization objective.
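
The shifted-surge function is short enough to write out directly. The sketch below is an illustration only: the function names and the parameter values are assumptions, not taken from the paper. Setting the derivative of α(t−τ)e^(−β(t−τ)) to zero shows that the curve peaks at t = τ + 1/β with value α/(βe), which marks the boundary between the Surge and Ample regions.

```python
import math


def surge_utility(t, tau, alpha, beta):
    """Latent potential: zero up to tau, then alpha*(t-tau)*exp(-beta*(t-tau))."""
    s = t - tau
    if s <= 0:
        return 0.0  # Strict phase: no useful reasoning has emerged yet
    return alpha * s * math.exp(-beta * s)


def peak_length(tau, beta):
    """Length at which the potential is maximal (derivative equals zero)."""
    return tau + 1.0 / beta


# Illustrative parameters (assumed, not from the paper).
tau, alpha, beta = 100.0, 0.02, 0.01
t_peak = peak_length(tau, beta)
print(surge_utility(50, tau, alpha, beta))   # below the threshold
print(t_peak, surge_utility(t_peak, tau, alpha, beta))
```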

The Market Clears at One Shadow Price
The theoretical core of the paper reframes inference-token allocation as an economic equilibrium. Given N queries and a total token budget B_total, the system maximizes the sum of per-query latent utilities subject to the constraint that total allocated tokens cannot exceed the budget. Applying Lagrange multipliers yields a global shadow price λ, interpreted as the marginal value of one additional token under scarcity. At an optimum, every active query should receive tokens until its marginal reasoning potential equals this shared shadow price, while queries whose best attainable net surplus cannot beat the price are assigned zero budget through a rational abandonment condition. This result turns a non-convex allocation problem over heterogeneous utility curves into a market-clearing principle for scarce inference compute.
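
Written out, the argument above takes roughly the following form. The per-query notation ϕ_i and t_i is assumed here for readability, and the last line is one reading of the rational abandonment condition as described in this summary, not necessarily the paper's exact statement.

```latex
\begin{aligned}
\max_{t_1,\dots,t_N \ge 0}\quad & \sum_{i=1}^{N} \phi_i(t_i)
  \quad \text{s.t.} \quad \sum_{i=1}^{N} t_i \le B_{\text{total}}, \\
\mathcal{L}(t,\lambda) \;=\; & \sum_{i=1}^{N} \phi_i(t_i)
  \;-\; \lambda \Big( \sum_{i=1}^{N} t_i - B_{\text{total}} \Big),
  \qquad \lambda \ge 0, \\
\frac{\partial \mathcal{L}}{\partial t_i} = 0
  \;\Longrightarrow\; & \phi_i'(t_i^{*}) = \lambda
  \quad \text{for every active query } i, \\
t_i^{*} = 0 \quad & \text{if } \max_{t \ge 0} \big[ \phi_i(t) - \lambda t \big] \le 0 .
\end{aligned}
```

Because each ϕ_i is flat before its threshold, the stationarity condition alone is not sufficient. The surplus check in the last line is what rules out paying for the tokens up to τ_i when the query can never repay them at price λ.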

CLEAR: The Plug-and-Play Allocator
The paper instantiates this theory in Constrained Latent-utility Equilibrium Allocation for Reasoning, or CLEAR, a plug-and-play inference wrapper that requires no retraining of the backbone LLM. CLEAR estimates each query’s emergence threshold and latent surge-shaped utility curve, then performs price discovery with fast bisection to find the global shadow price whose induced demand matches the available budget. For a fixed price, the paper derives a closed-form token allocation rule using the principal branch of the Lambert W function, with an additional solvency test that handles truncation or abandonment. The experiments report improved Pareto efficiency between total token cost and mean accuracy across several reasoning tasks and traffic streams. In resource-scarce regimes, the paper claims CLEAR achieves up to a 3× improvement in global accuracy compared with uniform allocation, highlighting the value of reallocating tokens toward queries near their high-return Surge region.
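
A minimal Python sketch of the price-discovery loop, under stated assumptions, may help make this concrete. It is not the authors' implementation. For the surge form α(t−τ)e^(−β(t−τ)), setting the marginal utility equal to a price λ < α and substituting u = 1 − β(t−τ) gives u·e^u = eλ/α, so t = τ + (1 − W0(eλ/α))/β on the principal branch W0. The function names, the Newton-based W0 routine, the surplus test used for abandonment, and the three example queries are all assumptions, and the sketch takes the curve parameters as given instead of estimating them.

```python
import math


def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch W0(z) for z >= 0, via Newton's method on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starts at or above the root for z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def demand(lam, tau, alpha, beta):
    """Tokens one query requests at price lam; 0.0 means it is abandoned."""
    if lam >= alpha:
        return 0.0  # marginal utility never reaches the price
    s = (1.0 - lambert_w0(math.e * lam / alpha)) / beta  # tokens past tau
    t = tau + s
    surplus = alpha * s * math.exp(-beta * s) - lam * t
    return t if surplus > 0.0 else 0.0  # assumed form of the solvency test


def clear_price(queries, budget, iters=60):
    """Bisect on the price until total demand fits within the budget."""
    def total(lam):
        return sum(demand(lam, *q) for q in queries)

    lo, hi = 0.0, max(alpha for _, alpha, _ in queries)
    if total(lo) <= budget:  # budget is not binding, tokens are free
        return lo, [demand(lo, *q) for q in queries]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > budget:
            lo = mid  # price too low, demand exceeds supply
        else:
            hi = mid  # feasible, try a lower price
    return hi, [demand(hi, *q) for q in queries]


# Hypothetical (tau, alpha, beta) triples: easy, medium, hard query.
queries = [(50.0, 0.05, 0.02), (400.0, 0.02, 0.005), (2000.0, 0.002, 0.004)]
price, alloc = clear_price(queries, budget=800.0)
print(price, [round(a) for a in alloc])
```

In this toy batch the hard query needs at least 2000 tokens before it produces any utility, so under an 800-token budget it is abandoned and the other two queries are served near their peaks. Total demand jumps at the price where a query is abandoned, so the sketch returns the feasible side of the jump and may leave part of the budget unspent. It does not redistribute the remainder or implement the truncation case mentioned above.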
