ReadPaper Blog
Trust Region Q-Adjoint Matching
Trust Region Q-Adjoint Matching studies how to stably fine-tune pretrained flow policies with off-policy reinforcement learning. The paper identifies a failure mode in Q-learning with Adjoint Matching, where imperfect critics can amplify small value-estimation errors into destructive drift from the pretrained policy, and proposes TRQAM, a trust-region method that controls path-space KL inside the stochastic sampling dynamics. By adapting a trust-region parameter with projected dual descent, TRQAM improves stability and reports stronger results on 50 OGBench tasks than prior offline and offline-to-online RL baselines.
Source: Trust Region Q-Adjoint Matching

When the flow policy goes off the rails
The paper addresses the problem of improving pretrained flow policies without destroying the useful behavior already captured by pretraining. Flow matching policies can represent rich, multimodal action distributions, but their actions are generated through a multi-step sampling or denoising process that makes direct gradient-based fine-tuning expensive and unstable. Q-learning with Adjoint Matching, or QAM, mitigates this by reformulating fine-tuning as a memoryless stochastic optimal control problem guided by a learned critic. The paper argues that QAM still inherits a deeper critic-guided instability: in off-policy reinforcement learning, critic errors are unavoidable, and those errors can compound through bootstrapping and distribution shift. Its Lemma 1 formalizes that exponentially tilted policy updates can amplify small critic errors into large deviations from the pretrained prior, which can cause model collapse rather than reward improvement.
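To make the amplification effect concrete, here is a toy sketch. It is not the paper's Lemma 1 or its experimental setup. It assumes a four-action discrete policy, a uniform prior, and an exponentially tilted update of the form pi(a) ∝ prior(a) · exp(Q(a) / temperature). The names `tilted_policy`, `kl`, `true_q`, and `noisy_q` are hypothetical and chosen only for illustration.

```python
import math

def tilted_policy(prior, q_values, temperature):
    """Return pi(a) proportional to prior(a) * exp(Q(a) / temperature)."""
    logits = [math.log(p) + q / temperature for p, q in zip(prior, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Assumed toy setup: every action is truly equally good,
# but the critic overestimates the last action by 0.1.
prior = [0.25, 0.25, 0.25, 0.25]
true_q = [1.0, 1.0, 1.0, 1.0]
noisy_q = [1.0, 1.0, 1.0, 1.1]

for temperature in (1.0, 0.1, 0.01):
    pi = tilted_policy(prior, noisy_q, temperature)
    print(f"temperature={temperature}: "
          f"mass on overestimated action={pi[3]:.4f}, "
          f"KL from prior={kl(pi, prior):.4f}")
```

With the true values, the tilted policy equals the prior at every temperature. With the noisy critic, lowering the temperature moves more probability mass onto the overestimated action and increases the KL from the prior, even though the error itself stays fixed at 0.1.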

Why existing fixes still feel flimsy
The paper’s main critique of existing stabilizers is that conventional KL regularization acts only at the level of the optimization loss and does not guarantee that the realized sampling process stays close to the pretrained flow policy. In the setting studied by the authors, strong or noisy critic signals can overwhelm a soft penalty and push the fine-tuned policy far beyond the intended trust region. The paper contrasts this with the trust-region principle from policy optimization, where the amount of policy movement itself is the quantity being controlled. It also reports that gradient clipping, used as a partial remedy in QAM, does not reliably prevent instability: in Robomimic experiments, adjoint-matching losses explode even under clipping. This motivates a method that constrains deviation structurally within the stochastic optimal control dynamics rather than merely discouraging it in an auxiliary loss.

TRQAM’s core move
TRQAM’s central technical move is to place a trust-region parameter λ directly inside the stochastic optimal control formulation used for flow-policy sampling. The method starts from an SDE representation of the pretrained flow policy and a controlled SDE in which the control term steers samples toward actions preferred by the critic Q^π(s, a). Instead of treating deviation from the pretrained policy as an external penalty, TRQAM scales the diffusion coefficient by √λ inside the dynamics. The paper proves, using Girsanov’s theorem, that this construction makes the path-space KL between the controlled and pretrained sampling processes an explicit closed-form function of λ. This result turns λ into a principled knob for controlling how far the fine-tuned sampler can move away from the pretrained flow policy.
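For readers who have not seen a Girsanov-type argument, the standard identity behind path-space KL is sketched below. This is the generic textbook form, under assumed notation (drift b, noise scale σ, control u, sampling time on [0, 1]), with both processes sharing the same initial distribution and the usual regularity conditions. It is not the paper's λ-dependent closed form, which this summary does not reproduce.

```latex
% Assumed generic notation, not taken from the paper:
%   base sampler:        dX_t   = b(X_t, t)\,dt + \sigma(t)\,dW_t
%   controlled sampler:  dX^u_t = \bigl[ b(X^u_t, t) + \sigma(t)\,u(X^u_t, t) \bigr]\,dt + \sigma(t)\,dW_t
\[
\mathrm{KL}\bigl(\mathbb{P}^{u} \,\big\|\, \mathbb{P}^{\mathrm{base}}\bigr)
= \mathbb{E}_{\mathbb{P}^{u}}\!\left[ \frac{1}{2} \int_0^1 \bigl\lVert u(X^u_t, t) \bigr\rVert^2 \, dt \right]
\]
```

The point of the identity is that the divergence between the two sampling processes is determined by the control energy along the path, which is why a parameter placed inside the dynamics can govern the KL directly.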

How the algorithm stays calm
The algorithm adapts λ during training by projected dual descent so that the path-space KL tracks a prescribed target bound ε_KL. This mechanism is designed to enforce the trust region at the sampling level, where the flow policy actually generates actions, rather than relying on a loss-level KL term to compete with critic guidance. In the paper’s formulation, the terminal cost is supplied by the learned critic, while the stochastic optimal control dynamics regulate the amount of deviation from the base policy. The authors emphasize that this distinction matters because the target KL becomes a practical control over the fine-tuning budget, and its best value can depend on task structure. Reported training curves show TRQAM tracking the intended KL bound more tightly than conventional regularization approaches in both offline and online settings.
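The adaptation loop can be pictured with a small sketch. It is a toy stand-in under stated assumptions, not the paper's algorithm: the KL here comes from a discrete exponentially tilted policy rather than from path-space dynamics, larger λ is assumed to mean smaller KL, and the update sign follows from that assumption. The names `tilted_kl`, `adapt_lambda`, `step_size`, `lam_min`, and `lam_max` are hypothetical.

```python
import math

def tilted_kl(lam, prior, q_values):
    """KL(pi_lam || prior) for pi_lam(a) proportional to prior(a) * exp(Q(a) / lam)."""
    logits = [math.log(p) + q / lam for p, q in zip(prior, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    pi = [w / total for w in weights]
    return sum(p * math.log(p / p0) for p, p0 in zip(pi, prior) if p > 0.0)

def adapt_lambda(kl_fn, target_kl, lam_init, step_size=0.5, num_steps=2000,
                 lam_min=1e-3, lam_max=1e3):
    """Projected update on lam so that kl_fn(lam) tracks target_kl.

    Assumes kl_fn is decreasing in lam: if the KL exceeds the target,
    lam is raised; if it is below the target, lam is lowered.
    """
    lam = lam_init
    for _ in range(num_steps):
        violation = kl_fn(lam) - target_kl
        lam = lam + step_size * violation
        lam = min(max(lam, lam_min), lam_max)  # projection onto [lam_min, lam_max]
    return lam

# Assumed toy problem: uniform prior over four actions and fixed critic values.
prior = [0.25] * 4
q_values = [0.0, 0.5, 1.0, 2.0]
target_kl = 0.1

lam = adapt_lambda(lambda l: tilted_kl(l, prior, q_values), target_kl, lam_init=0.2)
final_kl = tilted_kl(lam, prior, q_values)
print(f"lambda={lam:.3f}, KL={final_kl:.4f}, target={target_kl}")
```

The design choice worth noticing is that the target KL, not λ, is the quantity a user sets. λ is adjusted automatically until the measured KL matches the budget, and the projection keeps it inside a safe range.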

What the paper claims, in one cynical frame
The empirical claim of the paper is that structural trust-region control improves both stability and performance for off-policy fine-tuning of pretrained flow policies. Across 50 OGBench tasks, TRQAM is reported to outperform prior methods in offline RL and offline-to-online RL. In the offline RL comparison, the paper reports an overall success rate of 68% for TRQAM, compared with 46% for the strongest baseline among methods such as DSRL, QAM-E, QAM, IFQL, FQL, and CGQL-L. The Robomimic evidence further supports the stability argument, showing that fixed-temperature adjoint matching methods can suffer exploding adjoint losses and severe performance collapse while TRQAM remains stable. The broader implication is that critic-guided fine-tuning of generative policies benefits from trust-region constraints embedded in the sampling dynamics, especially when critic estimates are imperfect.
