SkillComposer: Efficient LLM Agent Reasoning via Structured Skill Composition

This paper introduces SkillComposer, a framework that addresses the multi-skill selection bottleneck facing LLM agents on complex tasks. Existing methods treat skill selection as an independent retrieval or ranking problem and therefore ignore the strong coupling among the skill subset, its size, and the execution order. SkillComposer instead formulates skill composition as a structured sequence prediction task: a constrained autoregressive decoder jointly determines the activated skill subset, its cardinality, and its execution order in a single generation pass. Training data is built from a real-world, manually curated skill library, and the framework is evaluated on the SkillsBench benchmark. On two production-grade coding agents (GPT-5.2-Codex and Gemini-3-Pro-Preview), it raises the task pass rate by 23.1 and 18.2 percentage points, respectively, over a no-skill baseline. It also surpasses the top three retrieval strategies while reducing prompt token cost, and it matches the performance of a golden skill retrieval upper bound, offering a new paradigm for modular knowledge orchestration in agents.

Background and Context

The rapid integration of Large Language Models (LLMs) into complex, real-world problem solving has elevated the role of modular skill packages, which encapsulate procedural knowledge and task-specific instructions. As skill libraries grow across domains, the core challenge has shifted from accessing skills to selecting the right combination for a given task. Existing approaches generally fall into two categories: exposing the full skill set to the agent throughout its reasoning process, or retrieving skills with embedding vectors and LLM-based rerankers. Both share a structural flaw: they treat skill selection as an independent retrieval or ranking problem and so ignore the strong coupling among the subset of skills, the number of skills, and their execution order. This matters because a skill's effectiveness often depends on where it sits in a sequence, so scoring each skill in isolation is insufficient for complex orchestration.
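
To make the limitation concrete, the following is a minimal sketch of independent embedding-based retrieval, not any specific system evaluated in the paper. The skill names, the three-dimensional vectors, and the function `retrieve_top_k` are hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(task_vec, skill_vecs, k):
    """Score every skill independently against the task; return the k best ids."""
    ranked = sorted(
        skill_vecs,
        key=lambda sid: cosine(task_vec, skill_vecs[sid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings (assumption: not taken from the paper).
SKILLS = {
    "parse_csv": [1.0, 0.0, 0.0],
    "plot_chart": [0.0, 1.0, 0.0],
    "send_email": [0.0, 0.0, 1.0],
}
task_vec = [0.6, 0.8, 0.0]  # e.g. "chart the numbers in a CSV file"

selected = retrieve_top_k(task_vec, SKILLS, k=2)
print(selected)
```

In this toy case the result is ordered by similarity, so `plot_chart` comes before `parse_csv` even though parsing would have to run first, and the number of skills is fixed in advance by the caller through `k`. Neither the order nor the count is something the retriever itself decides.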

To address this bottleneck, the SkillComposer framework introduces a novel paradigm by formalizing skill composition as a structured sequence prediction task. Rather than treating the selection of skills as a series of isolated decisions, SkillComposer views the problem as a joint optimization challenge where the activated subset, its cardinality, and the execution order must be determined simultaneously. This approach acknowledges that the decision to activate a specific skill is inextricably linked to the decisions made for preceding and succeeding skills. By framing the problem this way, the framework aims to capture the inherent dependencies and logical flows that characterize expert-level task execution, moving beyond simple semantic matching to true structural understanding of task requirements.
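
One way to write this joint formulation down is the standard autoregressive factorization below. It is a sketch rather than the paper's own notation: the symbols $x$ (the task), $\mathcal{S}$ (the skill library), $s_t$ (the $t$-th activated skill), $k$ (the number of activated skills), and $\langle\mathrm{eos}\rangle$ (an end-of-plan token) are introduced here for illustration, and the no-repetition constraint on $s_t$ is an assumption.

```latex
p_\theta\bigl(s_1, \dots, s_k, \langle\mathrm{eos}\rangle \mid x\bigr)
  = \Bigl[\,\prod_{t=1}^{k} p_\theta\bigl(s_t \mid x, s_{<t}\bigr)\Bigr]\;
    p_\theta\bigl(\langle\mathrm{eos}\rangle \mid x, s_{1:k}\bigr),
\qquad
s_t \in \mathcal{S} \setminus \{s_1, \dots, s_{t-1}\}.
```

Under this reading, the set of generated identifiers is the subset, the position of $\langle\mathrm{eos}\rangle$ fixes the cardinality $k$, and the index $t$ gives the execution order, so a single distribution covers all three decisions.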

Deep Analysis

The technical core of SkillComposer lies in its use of a constrained autoregressive decoder that operates directly on skill identifiers. This design allows the model to generate the complete skill plan in a single pass, jointly determining the subset, count, and sequence of activated skills. Unlike traditional retrieval methods that may require multiple iterations or complex post-processing logic to resolve conflicts or order dependencies, SkillComposer transforms the complex combinatorial optimization problem into a standard language modeling task. The constraints applied during decoding ensure that the generated sequence is valid and executable, naturally capturing how subsequent skills depend on the outputs or states established by prior skills. This single-pass generation significantly simplifies the inference pipeline, reducing latency and computational overhead compared to iterative retrieval-and-rerank strategies.
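
The decoding loop described above can be sketched as follows. This is an illustrative greedy version under stated assumptions, not the authors' implementation: the constraint set (valid skill identifiers only, no repeats, an explicit end token), the name `constrained_decode`, and the hand-written `toy_score_fn` standing in for a trained model are all hypothetical.

```python
EOS = "<eos>"

def constrained_decode(score_fn, task, skill_ids, max_skills):
    """Greedy decoding restricted to unused skill ids plus the end token."""
    plan = []
    while len(plan) < max_skills:
        # Constraint: only library ids not already in the plan, or EOS.
        allowed = [s for s in skill_ids if s not in plan] + [EOS]
        scores = score_fn(task, plan)
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        if best == EOS:
            break  # the model decides the plan length by emitting EOS
        plan.append(best)
    return plan

def toy_score_fn(task, prefix):
    """Hypothetical stand-in for model logits, conditioned on the prefix."""
    if not prefix:
        return {"parse_csv": 2.0, "plot_chart": 1.0, "send_email": -1.0, EOS: 0.0}
    if prefix[-1] == "parse_csv":
        # The raw scores would repeat parse_csv; the mask forbids it.
        return {"parse_csv": 3.0, "plot_chart": 2.0, "send_email": -1.0, EOS: 0.5}
    return {"parse_csv": 0.0, "plot_chart": 0.0, "send_email": -0.5, EOS: 1.0}

plan = constrained_decode(
    toy_score_fn,
    task="chart the numbers in a CSV file",
    skill_ids=["parse_csv", "plot_chart", "send_email"],
    max_skills=3,
)
print(plan)
```

Each step conditions on the skills already chosen, which is how one pass can settle the subset, the count (through the end token), and the order together. A real system would more likely apply the mask to logits over a tokenizer vocabulary and might use beam search instead of greedy selection.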

The training data for SkillComposer is derived from a real-world, manually curated skill library, ensuring that the model learns from high-quality, human-verified examples of effective skill combinations. This dataset consists of task-composition pairs, providing the model with explicit examples of how different skills should be sequenced to achieve specific outcomes. By training on such authentic data, the model internalizes the practical logic of skill dependency and execution, rather than relying on superficial pattern matching. This focus on real-world curation is crucial for ensuring that the learned representations are robust and applicable to the nuanced demands of actual coding and problem-solving tasks, where abstract semantic similarity often fails to capture the functional requirements of a skill.
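
As an illustration of what a task-composition pair might look like, here is a small sketch. The source does not specify the data schema, so the class `CompositionExample`, its field names, the `<eos>` token, and the example task are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class CompositionExample:
    """One hypothetical training pair: a task and its ordered skill plan."""
    task: str
    skills: Tuple[str, ...]  # skill identifiers in execution order

def to_target_tokens(example: CompositionExample, eos: str = "<eos>") -> List[str]:
    """Serialize the plan as the sequence a decoder would be trained to emit."""
    return list(example.skills) + [eos]

example = CompositionExample(
    task="Summarize sales.csv as a chart",
    skills=("parse_csv", "plot_chart"),
)
target = to_target_tokens(example)
print(target)
```

With a target of this shape, one sequence carries all three supervised signals: which skills appear, how many appear before the end token, and in what order.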

Industry Impact

Experimental evaluations of SkillComposer were conducted on two production-grade coding agents: GPT-5.2-Codex and Gemini-3-Pro-Preview, using the SkillsBench benchmark. The results demonstrate significant performance gains over baseline methods. Specifically, SkillComposer achieved an absolute increase in task pass rate of 23.1 percentage points on GPT-5.2-Codex and 18.2 percentage points on Gemini-3-Pro-Preview compared to a no-skill baseline. These improvements highlight the framework's ability to effectively leverage modular knowledge to enhance agent capabilities. Furthermore, SkillComposer outperformed the top three traditional retrieval strategies, indicating that its structured approach to sequence prediction is more effective than conventional ranking or embedding-based methods for complex task execution.
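
Because the gains are reported as absolute percentage points, it may help to see how that differs from a relative improvement. The task counts below are invented for illustration and are not the paper's data. Only the arithmetic is the point.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of tasks solved."""
    return 100.0 * passed / total

def absolute_gain_pp(rate_with: float, rate_base: float) -> float:
    """Absolute gain in percentage points."""
    return rate_with - rate_base

def relative_gain_pct(rate_with: float, rate_base: float) -> float:
    """Relative gain as a percentage of the baseline rate."""
    return 100.0 * (rate_with - rate_base) / rate_base

# Hypothetical counts (assumption): 100 tasks, 30 solved without skills,
# 50 solved with composed skills.
base = pass_rate(30, 100)
with_skills = pass_rate(50, 100)

gain_pp = absolute_gain_pp(with_skills, base)
gain_rel = relative_gain_pct(with_skills, base)
print(gain_pp, round(gain_rel, 1))
```

In this made-up case a gain of 20 percentage points corresponds to a relative improvement of about two thirds, which is why the unit matters when comparing the two agents' results.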

A critical advantage of SkillComposer is its efficiency. The framework not only improves task success rates but also reduces prompt token costs. By generating a concise, structured sequence of skill identifiers, the model avoids the need for extensive context windows or verbose retrieval explanations. Remarkably, SkillComposer's performance matches the upper bound of golden skill retrieval, which assumes access to the optimal set of skills. This achievement is particularly significant because it demonstrates that the model can approximate optimal performance without requiring perfect prior knowledge of the best skills. Ablation studies further confirmed the necessity of joint modeling, showing that decoupling the selection of skills, their quantity, and their order leads to a substantial drop in performance, validating the importance of the structured sequence prediction approach.

Outlook

The implications of SkillComposer extend beyond immediate performance gains, offering a new paradigm for modular knowledge orchestration in AI agents. By showing that structured decision-making can be integrated into autoregressive generation, the framework opens avenues for research in agent planning, multi-agent collaboration, and dynamic skill management. Its ability to handle long-tail skill combinations suggests that the model may generalize to less common or highly specialized tasks, a frequent challenge in industrial applications. This capability matters for building robust agents that adapt to a wide variety of scenarios without extensive retraining or manual intervention.

For the broader AI community, SkillComposer provides a reproducible benchmark and reference implementation based on real-world data, fostering standardization in skill management. Future work may focus on automating the construction and updating of skill libraries, reducing the reliance on manual curation. Extending the framework to non-coding domains could also unlock its potential in fields such as scientific research, legal analysis, and healthcare, where complex, multi-step reasoning is equally critical. Overall, SkillComposer is a step toward more efficient and reliable LLM-based agents, and it lays technical groundwork for autonomous agents that must handle complex real-world tasks.

Sources

arXiv