DataCOPE: Unsupervised Skill Discovery Framework for Agentic Data Analysis

This paper presents DataCOPE, an unsupervised verifier-guided skill discovery framework for agentic data analysis. Addressing the scarcity of high-quality supervision signals and diverse success criteria in test-time skill enhancement, DataCOPE automatically discovers reusable procedural knowledge from unlabeled exploration trajectories alone. The framework iteratively coordinates a data analysis agent, an unsupervised verifier, and a skill manager to extract validation signals that characterize relative quality or consistency. For report-style analysis, an adaptive checklist verifier is introduced to dynamically generate task-specific criteria and evaluate coverage; for reasoning-style analysis, an answer-consistency verifier leverages self-consistency as an auxiliary signal. Experiments on Deep Data Research and DABStep benchmarks show DataCOPE improves report-style and reasoning-style task scores by 9.71% and 32.30% on average across four model settings, significantly outperforming baselines and offering a new paradigm for low-cost enhancement of data analysis agent capabilities.

Background and Context

The rapid advancement of large language models has catalyzed the development of agentic systems capable of complex data analysis, yet a significant bottleneck remains in efficiently enhancing these agents' reasoning capabilities at test time. Traditionally, improving an agent's performance on specialized tasks such as financial reporting or scientific data interpretation has relied heavily on supervised fine-tuning using high-quality, human-annotated datasets. This approach is not only resource-intensive but also inherently limited by the scarcity of expert-labeled data across diverse domains. As organizations seek to deploy autonomous data analysis agents that can adapt to novel and unstructured queries, the dependency on static, pre-defined reward functions or gold standards becomes a critical constraint. The core challenge lies in discovering reusable procedural knowledge (specific skills or strategies that an agent can apply to solve new problems) without the benefit of explicit supervision signals that indicate what constitutes a correct or optimal path.

In this context, test-time skill enhancement has emerged as a lightweight and effective alternative to parameter-heavy model updates. By injecting reusable procedural knowledge into the agent's workflow during inference, systems can optimize behavior dynamically. However, existing methods for test-time enhancement often struggle with the heterogeneity of success criteria in data analysis. Unlike mathematical problem-solving, where a single numerical answer serves as a clear verification signal, data analysis tasks vary widely from open-ended report generation to strict logical deduction. The lack of reliable external supervision signals means that traditional reinforcement learning from human feedback (RLHF) or supervised fine-tuning pipelines are difficult to scale. Consequently, there is an urgent need for frameworks that can autonomously identify and refine high-quality analytical strategies solely from the agent's own interactions with data, thereby bypassing the data labeling bottleneck entirely.

To address these limitations, recent research introduces DataCOPE, an unsupervised verifier-guided skill discovery framework designed specifically for agentic data analysis. DataCOPE fundamentally shifts the paradigm from relying on external labels to leveraging internal consistency and relative quality metrics derived from unlabeled exploration trajectories. The framework operates on the premise that even without ground truth answers, the structural coherence, logical consistency, and coverage of an agent's output can serve as robust proxies for skill quality. By iteratively coordinating a data analysis agent, an unsupervised verifier, and a skill manager, DataCOPE automatically extracts validation signals that characterize the relative merit of different analytical paths. This approach enables the system to distill successful patterns from noisy exploration data, creating a repository of reusable skills that enhance performance across both report-style and reasoning-style tasks without requiring any manual annotation.

Deep Analysis

The architectural innovation of DataCOPE lies in its iterative closed-loop system comprising three core components: the Data-Analytic Agent, the Unsupervised Verifier, and the Skill Manager. The process begins with the Data-Analytic Agent generating diverse exploration trajectories when presented with a given task. These trajectories represent various attempts to solve the problem, encompassing different code executions, data visualization choices, and logical reasoning steps. Rather than discarding failed or suboptimal attempts, the framework utilizes them as raw material for skill discovery. The Unsupervised Verifier then analyzes these trajectories to extract signals that reflect their relative quality or consistency. Crucially, this verification process does not rely on a predefined ground truth; instead, it employs dynamic, task-specific criteria to evaluate the outputs. The Skill Manager subsequently uses these verification signals to perform skill distillation via contrastive learning, effectively separating high-quality procedural patterns from noise and consolidating them into reusable skills that can be injected into future inference cycles.
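The closed loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Trajectory`, `SkillManager`, `discovery_round`, `toy_agent`, and `toy_verifier` are hypothetical, the agent and verifier are deterministic stand-ins for model calls, and the "distillation" step simply contrasts the best and worst trajectory of a round.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    output: str
    score: float = 0.0  # filled in by the verifier, no ground truth involved


@dataclass
class SkillManager:
    skills: List[str] = field(default_factory=list)

    def distill(self, ranked: List[Trajectory]) -> None:
        # Assumption: a skill is recorded as a contrast between the best and
        # worst trajectory of the round. A real system would summarize them.
        if len(ranked) < 2:
            return
        best, worst = ranked[0], ranked[-1]
        if best.score > worst.score:
            self.skills.append(f"prefer: {best.output} | avoid: {worst.output}")


def discovery_round(
    task: str,
    agent: Callable[[str, List[str], int], str],
    verifier: Callable[[Trajectory], float],
    manager: SkillManager,
    n_samples: int = 4,
) -> List[Trajectory]:
    # 1) The agent explores: several attempts at the same task.
    trajectories = [
        Trajectory(task, agent(task, manager.skills, i)) for i in range(n_samples)
    ]
    # 2) The verifier scores each attempt without labels.
    for t in trajectories:
        t.score = verifier(t)
    # 3) The skill manager distills from the ranked attempts.
    ranked = sorted(trajectories, key=lambda t: t.score, reverse=True)
    manager.distill(ranked)
    return ranked


# Toy stand-ins so the sketch runs end to end.
def toy_agent(task: str, skills: List[str], attempt: int) -> str:
    return f"attempt-{attempt}"


def toy_verifier(t: Trajectory) -> float:
    return float(t.output.split("-")[1])


manager = SkillManager()
ranked = discovery_round("analyze sales", toy_agent, toy_verifier, manager)
print(manager.skills)
```

Calling `discovery_round` repeatedly with the same `manager` mirrors the iterative character of the loop: skills accumulated in earlier rounds are passed back to the agent on later ones.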

For report-style analysis tasks, which often involve open-ended questions and require comprehensive coverage of data insights, DataCOPE introduces an Adaptive Checklist Verifier. This component addresses the ambiguity inherent in evaluating narrative reports by dynamically generating a set of task-specific verification criteria based on the input context. For instance, if an agent is asked to analyze sales trends, the verifier might generate checklist items such as "identifies peak sales periods," "compares year-over-year growth," and "highlights regional discrepancies." The verifier then evaluates the agent's generated report against this evolving checklist, assigning scores based on the degree of coverage. Importantly, the checklist itself is refined iteratively; as the agent explores different angles of the data, the verifier updates the criteria to ensure they remain relevant and comprehensive. This mechanism ensures that the evaluation standard adapts to the complexity of the task, providing a nuanced signal for skill improvement that static metrics cannot offer.
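As a rough illustration of coverage scoring, the sketch below treats a checklist as a mapping from criterion text to a predicate over the report. This is an assumption made for the example: in the framework described here the criteria are generated and judged dynamically, whereas the keyword predicates and the function name `checklist_coverage` are hypothetical placeholders.

```python
from typing import Callable, Dict

Checklist = Dict[str, Callable[[str], bool]]


def checklist_coverage(report: str, checklist: Checklist) -> float:
    """Fraction of checklist items the report satisfies (0.0 for an empty list)."""
    if not checklist:
        return 0.0
    hits = sum(1 for check in checklist.values() if check(report))
    return hits / len(checklist)


# Keyword predicates stand in for a model-based judgment of each criterion.
checklist: Checklist = {
    "identifies peak sales periods": lambda r: "peak" in r.lower(),
    "compares year-over-year growth": lambda r: "year-over-year" in r.lower(),
    "highlights regional discrepancies": lambda r: "region" in r.lower(),
}

report = "Sales peak in December. Year-over-year growth slowed."
score = checklist_coverage(report, checklist)

# Refinement step: the checklist gains an item, so the same report is now
# measured against a stricter standard.
refined: Checklist = {
    **checklist,
    "notes data-quality caveats": lambda r: "missing" in r.lower(),
}
refined_score = checklist_coverage(report, refined)
print(score, refined_score)
```

The refinement step shows why the checklist is described as evolving: a report that covers two of three items drops to two of four once a new criterion is added, which gives the skill manager a sharper signal about what the report still lacks.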

In contrast, reasoning-style analysis tasks, which typically have definitive answers or logical conclusions, utilize an Answer Agreement Verifier. This component leverages the principle of self-consistency, a technique where multiple reasoning paths are generated for the same problem, and the most frequent answer is considered the most reliable. The Answer Agreement Verifier groups trajectories that arrive at identical final answers and uses the size of these consensus clusters as an auxiliary signal for quality. Trajectories that align with the majority consensus are deemed higher quality, while outliers are flagged for further scrutiny or discarded. This method effectively transforms the stochastic nature of large language models into a strength, using diversity in reasoning paths to identify robust logical structures. By treating consistency as a proxy for correctness, the framework can guide the skill discovery process even in the absence of explicit labels, ensuring that the distilled skills promote logical rigor and accuracy.
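The consensus-cluster idea can be sketched with a simple vote count. The helper names `normalize`, `agreement_scores`, and `majority_answer` are hypothetical, and exact string matching after light normalization is an assumption; a real verifier would likely need more careful answer canonicalization.

```python
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    # Assumption: answers match if equal after trimming and lowercasing.
    return answer.strip().lower()


def agreement_scores(answers: List[str]) -> List[float]:
    """Score each trajectory by the share of trajectories with the same answer."""
    keys = [normalize(a) for a in answers]
    counts = Counter(keys)
    total = len(keys)
    return [counts[k] / total for k in keys]


def majority_answer(answers: List[str]) -> Tuple[str, int]:
    """Return the most frequent normalized answer and its cluster size."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0]


# Final answers from five sampled trajectories for the same task.
answers = ["42", " 42 ", "41", "42", "seven"]
scores = agreement_scores(answers)
winner = majority_answer(answers)
print(scores, winner)
```

Here three of five trajectories agree, so each of them receives a score of 0.6 while the two outliers receive 0.2. The score measures agreement only, not correctness, which is why the text treats it as an auxiliary signal.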

The integration of these two distinct verification mechanisms allows DataCOPE to handle the broad spectrum of data analysis challenges. The adaptive checklist verifier ensures that open-ended, exploratory tasks are evaluated on breadth and relevance, while the answer agreement verifier ensures that deductive tasks are evaluated on logical soundness and precision. This dual-track approach prevents the framework from being biased toward either extreme, enabling it to generalize across different types of analytical workflows. Furthermore, the use of contrastive learning in the Skill Manager ensures that the discovered skills are not just memorized solutions to specific problems but abstractable procedures that can be applied to novel scenarios. This distinction is critical for building agents that possess true generalization capabilities rather than mere rote recall.
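One way to picture how verifier scores could feed a contrastive step is to pair higher-scoring trajectories with clearly lower-scoring ones for the same task. The sketch below is only an illustration of that pairing idea: the function `contrastive_pairs` and the `margin` threshold are hypothetical and not taken from the source, which does not spell out how pairs are formed.

```python
from itertools import combinations
from typing import List, Tuple

Scored = Tuple[str, float]  # (trajectory id, verifier score)


def contrastive_pairs(
    scored: List[Scored], margin: float = 0.2
) -> List[Tuple[str, str]]:
    """Return (better, worse) pairs whose scores differ by at least `margin`."""
    pairs: List[Tuple[str, str]] = []
    for (a, score_a), (b, score_b) in combinations(scored, 2):
        # Near-ties carry little contrast, so they are skipped.
        if abs(score_a - score_b) >= margin:
            pairs.append((a, b) if score_a > score_b else (b, a))
    return pairs


scored = [("A", 0.9), ("B", 0.8), ("C", 0.3)]
pairs = contrastive_pairs(scored)
print(pairs)
```

The scores can come from either verifier, checklist coverage or answer agreement, so the same pairing step would serve both task styles. Each resulting pair could then be summarized into a procedural hint about what the better trajectory did differently.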

Industry Impact

The empirical validation of DataCOPE demonstrates its potential to reshape the landscape of automated data analysis. Experiments were conducted on two representative benchmarks: Deep Data Research for report-style analysis and DABStep for reasoning-style analysis. The study evaluated the framework across four underlying model settings to test the robustness and generalizability of the results. DataCOPE consistently outperformed existing baseline methods in all tested scenarios, highlighting its effectiveness in enhancing held-out performance. Specifically, in report-style analysis tasks, the framework achieved an average score improvement of 9.71%. The impact was even more pronounced in reasoning-style tasks, where DataCOPE delivered an average improvement of 32.30%. The larger gain on reasoning-style tasks suggests that unsupervised consistency signals are especially effective when tasks have definitive answers, since agreement among sampled answers gives a sharper signal than coverage of an open-ended report.

Ablation studies further corroborated the critical role of each component within the DataCOPE framework. The results indicated that the verifier-guided skill distillation process was instrumental in filtering high-quality procedural knowledge from the noisy exploration trajectories. Without the unsupervised verifier, the skill manager struggled to distinguish between plausible but incorrect reasoning paths and genuinely robust strategies. The adaptive nature of the checklist verifier also proved essential for report-style tasks, as static evaluation metrics failed to capture the nuanced requirements of comprehensive data storytelling. These results provide evidence that DataCOPE is not merely a theoretical construct but a practical approach that delivers measurable performance gains on benchmark tasks. The ability to achieve such improvements without any additional labeled data positions the framework as a cost-effective option for enterprise deployment.

From an industrial perspective, DataCOPE lowers the barrier to entry for developing high-performance data analysis agents. Small and medium-sized enterprises, as well as individual developers, can now leverage open-source models to build sophisticated analytical tools without the prohibitive costs associated with large-scale data annotation projects. This democratization of advanced AI capabilities allows for more widespread adoption of agentic workflows in sectors such as finance, healthcare, and logistics, where data analysis is critical but resources for custom model training are limited. Moreover, the framework's ability to adapt to specific business contexts through self-exploration means that organizations can deploy agents that continuously improve their skills based on proprietary data. This creates a competitive advantage, as companies can cultivate specialized analytical capabilities that are tailored to their unique operational needs without relying on generic, off-the-shelf solutions.

Furthermore, the introduction of an unsupervised skill discovery paradigm opens new avenues for research and development in the AI industry. It shifts the focus from static dataset curation to dynamic, interaction-based learning, encouraging the development of agents that are more autonomous and resilient. In practical terms, this means that data analysis assistants can be deployed in live environments where they learn from real user interactions and feedback loops, gradually refining their strategies over time. This continuous improvement cycle reduces the maintenance burden on engineering teams and ensures that the agents remain effective even as data distributions shift. As businesses increasingly rely on AI for decision-making, the ability to deploy self-improving, low-cost analytical agents will become a key driver of operational efficiency and strategic insight.

Outlook

Looking ahead, the success of DataCOPE suggests a broader transition in the field of artificial intelligence toward self-supervised and unsupervised learning paradigms for agent optimization. The framework's ability to extract high-quality skills from unlabeled data challenges the prevailing assumption that large-scale human annotation is a prerequisite for advanced reasoning capabilities. Future research may extend this approach to other domains beyond data analysis, such as code generation, scientific discovery, and creative writing, where success criteria are similarly diverse and subjective. By generalizing the concepts of adaptive verification and consistency-based evaluation, researchers can develop more versatile agents capable of mastering complex, multi-step tasks without extensive supervised training. This shift promises to accelerate the pace of AI innovation by reducing the dependency on scarce human expertise for model alignment.

However, several challenges remain before unsupervised skill discovery can be universally adopted. One key area for future investigation is the robustness of the verification signals in adversarial or highly ambiguous contexts. While self-consistency is a powerful proxy for correctness, it is not infallible; models can sometimes converge on the same incorrect answer with high confidence, a failure mode that might be called "consensus hallucination." Enhancing the verifier's ability to detect such failures, perhaps by incorporating external knowledge bases or cross-model validation, will be crucial for ensuring the reliability of deployed agents. Additionally, the computational cost of generating diverse exploration trajectories and running iterative verification loops must be reduced to make the framework scalable for real-time applications. Balancing the depth of exploration with the latency requirements of interactive systems will be a critical engineering hurdle.

Another promising direction is the integration of DataCOPE with multi-agent systems, where multiple specialized agents collaborate to solve complex problems. In such settings, the skill discovery process could be distributed across agents, allowing them to share and refine skills collectively. This collaborative learning approach could give rise to emergent behaviors and a sophisticated division of labor that are difficult to achieve with single-agent architectures. Furthermore, as regulatory frameworks for AI continue to evolve, the transparency and interpretability of unsupervised skill discovery will come under scrutiny. Ensuring that the distilled skills are auditable and aligned with ethical guidelines will be essential for gaining trust in high-stakes industries. Researchers will need to develop methods for explaining why certain skills were selected and how they influence the agent's decision-making process.

In conclusion, DataCOPE represents a significant step forward in the quest for autonomous, efficient, and adaptable data analysis agents. By eliminating the need for expensive labeled data and leveraging the inherent capabilities of large language models to self-evaluate and improve, the framework offers a sustainable path toward more intelligent AI systems. As the technology matures, it has the potential to transform how organizations interact with their data, enabling deeper insights and faster decision-making at a fraction of the current cost. The journey from supervised dependence to unsupervised autonomy is just beginning, and DataCOPE provides a compelling blueprint for navigating this transition, paving the way for a new era of AI-driven analytics.

Sources

arXiv