ReadPaper Blog
Unsupervised Skill Discovery for Agentic Data Analysis
The paper proposes DataCOPE, an unsupervised verifier-guided framework for discovering reusable data-analysis skills for LLM agents without ground-truth answers, success labels, or human annotations. It addresses the difficulty of improving agentic data analysis when reliable supervision is costly and quality criteria differ between open-ended reports and fixed-answer reasoning tasks. By deriving verifier signals from unlabeled exploration trajectories and distilling contrastive procedures into inference-time skills, DataCOPE improves held-out performance on Deep Data Research and DABStep.
Source: Unsupervised Skill Discovery for Agentic Data Analysis

Mission Briefing: Why Skill Discovery Is Hard
The paper tackles the problem of making data-analytic agents better at complex analysis without retraining their underlying language models. Its starting point is inference-time skill augmentation, where reusable procedural knowledge is injected into an agent so that it can choose better exploration strategies, analytical operations, and error-avoidance behaviors. The authors argue that discovering such skills is difficult in data analysis because supervised signals are expensive: annotators must inspect the user objective, the data resources, the analytical process, and whether the final output is supported by evidence. They also emphasize that data-analysis tasks do not share a single definition of success, since a well-supported report and a correct fixed answer require different kinds of judgment. This motivates the paper’s central question: how can reusable data-analysis skills be discovered from unlabeled exploration alone?

The Gap: One Signal Does Not Fit All
The paper identifies heterogeneity of evaluation signals as a core obstacle for agentic data analysis. In reasoning-oriented tasks, a trajectory can often be compared through its final answer, because agreement with an expected solution is a natural success criterion in offline evaluation. In open-ended report-style tasks, however, there may be no unique target answer, so quality depends on report completeness, verifiable coverage, evidence-supported claims, and analytical insight. This difference prevents a single short-answer-style checker from serving as a universal skill-discovery signal. The authors therefore frame unsupervised skill discovery as the problem of extracting relative quality or agreement from trajectories rather than directly certifying correctness. This framing allows unlabeled trajectories to become useful evidence for learning procedural guidance even when hidden rewards are unavailable during discovery.

Core Idea: DataCOPE
DataCOPE is the paper’s proposed answer to this problem: an unsupervised verifier-guided skill discovery framework for data-analytic agents. The method iteratively coordinates three components: a Data-Analytic Agent that samples exploration trajectories, an Unsupervised Verifier that derives task-dependent signals from those trajectories, and a Skill Manager that performs contrastive skill distillation. The agent is formalized as operating in a task-conditioned POMDP and follows the ReAct paradigm, interleaving reasoning steps, actions such as Python or SQL code generation, observations, and final-answer submission. A skill is represented as a structured knowledge bundle, including a root Markdown document and possible auxiliary resources, that conditions the agent’s behavior without changing model parameters. Across iterations, DataCOPE uses verifier-derived contrast groups to create or update reusable analytical procedures that are intended to generalize from the unlabeled exploration set to held-out tasks.

Two Verifiers, Two Battles
The paper instantiates the Unsupervised Verifier differently for the two major data-analysis formats it studies. For report-style analysis, DataCOPE uses an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and refines the checklist over iterations to reduce incompleteness. This design makes the verifier sensitive to open-ended analytical goals, where quality depends on whether important aspects of the task are addressed and supported by data. For reasoning-style analysis, DataCOPE uses an Answer Agreement Verifier that clusters trajectories by their final answers and treats agreement patterns as an unsupervised signal. The reasoning verifier also uses self-consistency as auxiliary uncertainty information, allowing the framework to contrast more and less reliable trajectories without gold labels. Together, these two verifier designs operationalize the paper’s claim that useful skill-discovery signals must be adapted to the analytical format.

What the Training Proves
The experiments evaluate DataCOPE on report-style tasks from Deep Data Research and reasoning-style tasks from DABStep. The paper reports that DataCOPE consistently improves held-out performance over baselines across both settings, indicating that skills distilled from unlabeled exploration can transfer beyond the discovery tasks. Averaged across four matched base-model settings, the method improves the mean score by 9.71% on report-style analysis and 32.30% on reasoning-style analysis. The authors further analyze iterative refinement, verifier-component ablations, skill granularity, data-analytic agent ablations, label efficiency, and inference cost to support the role of verifier-derived signals in the framework. The overall implication is that unsupervised relative-quality and agreement signals can supply enough contrastive evidence to discover practical procedural skills for LLM-based data-analysis agents.
