MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
The paper studies whether vision-language model agents can convert public, human-authored multimodal web guides into executable skills for long-horizon interactive tasks. It introduces guide-to-skill learning, the MMG2Skill-Bench benchmark, and the MMG2Skill closed-loop framework, showing that structured skill construction plus trajectory-driven revision improves fixed VLM agents across GUI control, Minecraft-style gameplay, and strategic card play.
Source: MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Can web guides become agent skills?
The paper frames a central problem for long-horizon VLM agents: abundant procedural knowledge exists on the Web, but it is written for humans rather than for agents acting in interactive environments. The authors call the missing capability guide-to-skill learning, meaning the conversion of in-the-wild multimodal guides into executable, editable skills that can be used and improved during rollouts. Their motivation is that agents need more than next-action reasoning; they need procedural grounding that connects a reusable procedure to the agent’s current observation, progress state, and recovery options. MMG2Skill addresses this gap by initializing skills from public guide material instead of relying only on hand-written skills or skills discovered from failed exploration. The result matters because it tests whether existing web-scale procedural knowledge can become a practical substrate for agent competence across domains.

Why raw guides are not enough
The paper argues that raw guides are not sufficient because human-authored instructions often contain implicit prerequisites, sibling procedures, navigation prose, screenshots, optional branches, and recovery advice that may not align with an agent’s runtime state. A human reader can infer what has already been satisfied and when a step applies, but an agent must explicitly localize each procedure against its current observation and bounded action history. The authors define skills as executable and editable procedural objects that specify applicability conditions, intermediate expected-state cues, and ways to recover when execution drifts. This formulation distinguishes procedural grounding from simple retrieval: the agent must know not only what the guide says, but also whether the guide’s next step is relevant now. The paper’s ablations support this concern by reporting that directly prompting agents with raw guides can degrade performance, whereas structured skill construction provides a safer procedural prior.
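One way to picture a skill as an "executable and editable procedural object" is a small record with explicit applicability conditions, expected-state cues, and recovery steps. The sketch below is illustrative only: the `Skill` class, its field names, and the idea of representing an observation as a set of string cues are assumptions made for this example, not the paper's actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical skill record; field names are assumptions for this sketch."""
    name: str
    steps: list[str]
    applies_when: set[str]                                   # cues that must all be observed
    expected_after: set[str] = field(default_factory=set)    # cues expected once the skill ran
    recovery: list[str] = field(default_factory=list)        # what to try if execution drifts

    def is_applicable(self, observed: set[str]) -> bool:
        # Applicable only if every required cue is present in the current observation.
        return self.applies_when <= observed

    def has_drifted(self, observed: set[str]) -> bool:
        # Drift means at least one expected-state cue is missing after execution.
        return not (self.expected_after <= observed)


def applicable_skills(skills: list[Skill], observed: set[str]) -> list[Skill]:
    """Filter the skill set down to procedures that are relevant right now."""
    return [s for s in skills if s.is_applicable(observed)]
```

The point of the filter is the distinction the paper draws between retrieval and grounding: a retrieved guide step is only useful if its preconditions hold in the agent's current state.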

MMG2Skill-Bench
To evaluate guide-to-skill learning, the paper introduces MMG2Skill-Bench, described as the first benchmark pairing in-the-wild multimodal guides with environment-grounded agent execution. The benchmark spans three interaction regimes: MMG2Skill-GUI uses OSWorld desktop-application tasks, MMG2Skill-Game uses OpenHA Minecraft tasks in MineStudio, and MMG2Skill-Strategy uses Doudizhu and Mahjong tasks from RLCard. In total, the main benchmark contains 130 success-inferable tasks whose outcomes can be inferred from the agent-visible trajectory or a public final state. The guide corpus draws from product documentation and how-to articles for GUI tasks, wiki and walkthrough material for Minecraft-style tasks, and rule descriptions plus beginner strategy material for card-play tasks. The evaluation deliberately excludes guides that contain gold action traces, hidden labels, or directly copyable benchmark answers, so the benchmark tests procedural grounding rather than answer lookup.

MMG2Skill: compile, use, revise
MMG2Skill itself is a closed-loop framework that compiles multimodal guides into editable skills, conditions a fixed VLM policy on those skills during execution, and revises the skills using trajectory-level diagnoses. In the paper’s formulation, a task instruction I and multimodal guide G induce a skill set S, while the VLM policy π_θ remains fixed and actions are sampled from the current observation history, instruction, and skill set. The construction stage normalizes guides into a SKILL.md-style representation containing reusable procedures, applicability conditions, expected-state cues, and recovery knowledge, represented conceptually as tuples z_i = (u_i, c_i, v_i, q_i). During rollout, the current skill set is injected into the agent context alongside the recent observation-action history, giving the agent a procedural reference without updating model weights. After an attempt, the framework uses analyzer-generated root-cause feedback from agent-visible trajectories to revise the shared skill cache, explicitly avoiding benchmark scores during construction, execution, analysis, and refinement.
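The compile, use, revise cycle can be sketched as a short loop. Everything here is a simplified assumption for illustration: the function names (`build_prompt`, `closed_loop`), the representation of a skill set as a name-to-text dictionary, and the callables passed in for the policy, environment, analyzer, and reviser are all hypothetical stand-ins, not the paper's implementation. The sketch also assumes a stateless toy environment; a real one would need a reset between attempts.

```python
from typing import Callable, Optional

SkillSet = dict[str, str]  # hypothetical: skill name -> procedure text


def build_prompt(instruction: str, skills: SkillSet, history: list[str]) -> str:
    """Inject the current skill set next to the recent observation-action history."""
    skill_text = "\n".join(f"[{name}] {body}" for name, body in sorted(skills.items()))
    recent = "\n".join(history[-3:])  # bounded history window
    return f"Task: {instruction}\nSkills:\n{skill_text}\nRecent:\n{recent}"


def closed_loop(
    instruction: str,
    skills: SkillSet,
    policy: Callable[[str], str],                    # fixed policy: prompt -> action
    env_step: Callable[[str], tuple[str, bool]],     # action -> (observation, done)
    analyze: Callable[[list[str]], Optional[str]],   # trajectory -> diagnosis, or None
    revise: Callable[[SkillSet, str], SkillSet],     # (skills, diagnosis) -> new skills
    max_attempts: int = 3,
    max_steps: int = 5,
) -> tuple[SkillSet, int]:
    skills = dict(skills)  # work on a copy of the shared skill cache
    for attempt in range(1, max_attempts + 1):
        history: list[str] = []
        for _ in range(max_steps):
            action = policy(build_prompt(instruction, skills, history))
            observation, done = env_step(action)
            history.append(f"{action} -> {observation}")
            if done:
                break
        # The analyzer sees only the agent-visible trajectory, never a benchmark score.
        diagnosis = analyze(history)
        if diagnosis is None:
            return skills, attempt
        skills = revise(skills, diagnosis)  # the policy itself is never updated
    return skills, max_attempts
```

Note that only `skills` changes between attempts; the policy callable is the same object throughout, which mirrors the paper's setup of improving a fixed VLM agent without weight updates.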

What changed, and why it matters
The paper reports that MMG2Skill outperforms vanilla baseline agents in every tested model-domain setting across six VLM backbones, with per-backbone macro-average gains ranging from +12.8 to +25.3 percentage points. Its ablation studies identify two necessary ingredients: converting guides into structured skills and revising those skills from trajectory evidence, since raw guide prompting can harm performance and construction alone does not fully repair guide-runtime mismatches. The findings imply that public guides contain useful procedural knowledge that baseline VLM agents cannot reliably reconstruct on their own, but that this knowledge must be grounded, compacted, and made editable. The paper also observes that revision gains are non-monotonic, meaning that repeated refinement can eventually introduce regressions rather than steadily improving success. For success-inferable tasks, analyzer-based early stopping helps mitigate late-stage regressions and saves 25%–53% of attempts when the success signal is properly calibrated.
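To make the two headline quantities concrete, the sketch below shows how a macro-average gain in percentage points and a percentage of attempts saved could be computed. It assumes that "macro-average" means an unweighted mean over domains, which the summary does not spell out, and the function names and all numbers in the usage example are invented for illustration; they are not results from the paper.

```python
def macro_average_gain(baseline: dict[str, float], method: dict[str, float]) -> float:
    """Unweighted mean of per-domain success-rate differences, in percentage points."""
    gains = [method[domain] - baseline[domain] for domain in sorted(baseline)]
    return sum(gains) / len(gains)


def attempts_saved_percent(budget: int, used: int) -> float:
    """Share of the attempt budget left unused because execution stopped early."""
    return 100.0 * (budget - used) / budget


# Made-up success rates (percent) for one hypothetical backbone.
baseline = {"gui": 20.0, "game": 30.0, "strategy": 40.0}
with_skills = {"gui": 30.0, "game": 45.0, "strategy": 60.0}

print(macro_average_gain(baseline, with_skills))   # 15.0
print(attempts_saved_percent(budget=8, used=5))    # 37.5
```

Because each domain counts equally in this reading, a large gain on a domain with few tasks moves the macro-average as much as the same gain on a domain with many tasks.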
