ReadPaper Blog
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring
The paper studies how to detect execution failures in Vision-Language-Action robot policies while a task is still unfolding, using only trajectory-level success or failure labels for training. It proposes Hide-and-Seek, a coarsely supervised failure detection framework that learns localized failure signals from VLA action embeddings through inter-trajectory and intra-trajectory contrastive objectives. The result matters because it improves runtime monitoring across simulation and real robots without costly step-level annotations, action resampling, or slow external vision-language judges.
Source: Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

The Hidden Failure
The paper addresses a central reliability problem for Vision-Language-Action models: a robot can execute many normal-looking actions before a subtle mistake triggers task failure, while the available training label often says only that the entire trajectory failed. Seongheon Park and collaborators frame this as a mismatch between trajectory-level supervision and the need for per-step runtime decisions. In their setup, a VLA receives RGB observations, a natural language instruction, and robot state, then produces actions whose internal embeddings form a trajectory. The detector must decide from a prefix of these embeddings whether the ongoing execution is likely to result in failure. This formulation is motivated by realistic robot deployment, where annotating the exact onset of errors in long-horizon stochastic rollouts is expensive and difficult to scale.
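
The prefix-based decision described above can be sketched as a loop that scores each growing prefix of action embeddings and raises an alarm at the first threshold crossing. This is a minimal illustration, not the paper's implementation: `monitor`, `linear_score`, the weights, and the threshold are hypothetical names and values chosen for the example, and a learned detector would replace the toy linear scorer.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Embedding = Sequence[float]


def linear_score(prefix: List[Embedding], weights: Sequence[float], bias: float) -> float:
    """Hypothetical scorer: a linear score of the most recent embedding only."""
    latest = prefix[-1]
    return sum(w * x for w, x in zip(weights, latest)) + bias


def monitor(
    embeddings: Iterable[Embedding],
    score_fn: Callable[[List[Embedding]], float],
    threshold: float,
) -> Optional[int]:
    """Return the first step whose prefix score exceeds the threshold, else None."""
    prefix: List[Embedding] = []
    for t, embedding in enumerate(embeddings):
        prefix.append(embedding)  # the detector only ever sees steps 0..t
        if score_fn(prefix) > threshold:
            return t  # alarm raised while the rollout is still unfolding
    return None


# Toy rollout of 2-dimensional "action embeddings" (made-up numbers).
rollout = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [1.0, 1.0]]
alarm_step = monitor(rollout, lambda p: linear_score(p, [1.0, 1.0], 0.0), threshold=1.0)
print(alarm_step)  # 2
```

Because the loop never looks past step t, the same function can be called online, one new embedding at a time, which is the property that makes intervention possible before the task ends.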

Why Old Monitors Struggle
The paper argues that existing runtime monitors are limited by either computational cost or noisy supervision. Action-resampling and multi-sampling methods estimate uncertainty by repeatedly querying a policy, but the authors note that this overhead is poorly suited to real-time robot control. External VLM-based judges can reason over visual observations, yet they introduce high inference latency and may recognize failures only after they have become visually obvious. Classifier-based alternatives, such as uniformly propagating the trajectory-level label to every timestep, reduce annotation cost but label the normal behavior that precedes a failure as failure, creating substantial label noise. This critique motivates a lightweight detector that uses the VLA’s own internal action representations while learning to distinguish the truly failure-indicative parts of failed trajectories.
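
To make the label-noise argument concrete, the sketch below propagates a trajectory-level failure label to every step and counts how many steps end up mislabeled. The rollout length and the onset index are made-up numbers for illustration, and `propagate_labels` and `mislabeled_fraction` are hypothetical helper names; in the coarsely supervised setting the onset is not available at training time.

```python
from typing import List


def propagate_labels(num_steps: int, trajectory_failed: bool) -> List[int]:
    """Uniform propagation: copy the trajectory-level label to every timestep."""
    return [int(trajectory_failed)] * num_steps


def mislabeled_fraction(labels: List[int], onset: int) -> float:
    """Fraction of all steps that carry a failure label but occur before onset.

    `onset` is a hypothetical ground-truth index used only for this illustration.
    """
    wrong = sum(1 for t, label in enumerate(labels) if label == 1 and t < onset)
    return wrong / len(labels)


# A failed rollout of 200 steps whose (assumed) failure onset is at step 150.
labels = propagate_labels(num_steps=200, trajectory_failed=True)
noise = mislabeled_fraction(labels, onset=150)
print(noise)  # 0.75
```

In this toy case three quarters of the "failure" labels sit on normal behavior, so the later the failure occurs in a rollout, the noisier uniformly propagated labels become.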

Hide-and-Seek Enters
Hide-and-Seek casts VLA failure monitoring as coarsely supervised learning, drawing a connection to weakly supervised localization and multiple instance learning. Instead of treating every timestep in a failed rollout as equally faulty, the method learns a failure scoring function over internal action embeddings and searches for the steps most responsible for the failed outcome. The training data consists of successful trajectories and failure trajectories labeled only by final outcome, with no temporal annotation of failure onset. At runtime, the learned monitor operates on trajectory prefixes and emits a binary failure decision, making the approach compatible with online intervention. The paper emphasizes that this design is architecture-agnostic because it can use embeddings from both autoregressive VLA policies and flow-matching-based VLA policies.

Two Contrastive Powers
The technical core of Hide-and-Seek is a pair of contrastive losses that impose temporal structure on failure scores without step labels. The inter-trajectory contrastive loss compares a failed trajectory with a successful trajectory and enforces that the most failure-indicative step in the failed trajectory scores higher than the most failure-resembling step in the successful one. The intra-trajectory contrastive loss estimates a proxy onset point from the largest score increase, then encourages average post-onset scores to exceed average pre-onset scores within the same failed trajectory. Together, these objectives discourage uniform label propagation and instead shape scores to remain low during normal execution and rise when failure evidence emerges. The method therefore converts coarse success-or-failure supervision into a fine-grained runtime signal over the VLA’s action embeddings.
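
The two objectives can be sketched on plain lists of per-step failure scores. The hinge form with a margin is an assumption made for this illustration, since the summary above states only the ordering each loss enforces; the function names, the margin value, and the toy scores are likewise hypothetical.

```python
from typing import List


def inter_trajectory_loss(fail_scores: List[float], success_scores: List[float],
                          margin: float = 1.0) -> float:
    """Push the top step of a failed rollout above the top step of a successful one."""
    gap = max(fail_scores) - max(success_scores)
    return max(0.0, margin - gap)  # assumed hinge form


def proxy_onset(scores: List[float]) -> int:
    """Index t >= 1 with the largest increase scores[t] - scores[t - 1]."""
    if len(scores) < 2:
        raise ValueError("need at least two steps to measure a score increase")
    increases = [scores[t] - scores[t - 1] for t in range(1, len(scores))]
    return 1 + max(range(len(increases)), key=increases.__getitem__)


def intra_trajectory_loss(fail_scores: List[float], margin: float = 1.0) -> float:
    """Push the mean post-onset score above the mean pre-onset score."""
    onset = proxy_onset(fail_scores)
    pre, post = fail_scores[:onset], fail_scores[onset:]
    gap = sum(post) / len(post) - sum(pre) / len(pre)
    return max(0.0, margin - gap)  # assumed hinge form


# Toy per-step scores (made-up numbers).
fail = [0.1, 0.2, 0.1, 0.9, 1.0]
success = [0.1, 0.3, 0.2]

onset = proxy_onset(fail)
print(onset)  # 3
print(inter_trajectory_loss(fail, success))
print(intra_trajectory_loss(fail))
```

Neither function needs a step-level label: the inter-trajectory term uses only which rollout failed, and the intra-trajectory term derives its split point from the scores themselves.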

Proof on the Rooftops
The evaluation covers LIBERO, VLABench, and a real-world robotic platform, using OpenVLA as an autoregressive policy and π0 and π0.5 as flow-matching-based policies. Across these settings, the paper reports state-of-the-art multi-task failure detection performance on both seen and unseen tasks. Hide-and-Seek surpasses the strongest classifier-based baseline by up to 11.7% balanced accuracy and improves over a VLM-based runtime monitor by 13.1% accuracy in the reported comparison. The authors also report that the method raises alarms near annotated failure onset despite never receiving temporal supervision during training. Its reported speed advantage of over 2,000× relative to a VLM-based monitor supports the paper’s claim that coarsely supervised contrastive failure detection can offer a practical accuracy–timeliness trade-off for embodied VLA deployment.
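
Balanced accuracy, the metric quoted above, is conventionally the mean of the recall on each class, which keeps a detector from looking good merely by predicting the majority outcome. The sketch below uses that standard binary definition; the labels and predictions are made-up numbers, not results from the paper, and whether the paper computes the metric in exactly this form is an assumption.

```python
from typing import List


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of recall on failures (label 1) and recall on successes (label 0)."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = sum(1 for y in y_true if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Plain fraction of correct predictions, for comparison."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


# Toy evaluation: 2 failed rollouts, 6 successful ones; one failure is missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))           # 0.875
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

The missed failure costs little in plain accuracy because successes dominate, but it halves the failure recall, which balanced accuracy exposes.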
