Skill-3D: Enhancing 3D Spatial Reasoning via Scene-Aware Skill Evolution

This paper introduces Skill-3D, a framework addressing tool misuse and preference bias in Multimodal Large Language Models (MLLMs) for 3D spatial reasoning. Unlike existing methods that apply uniform tool-use strategies across diverse scenarios, Skill-3D builds "scene memory" to record agent trajectories. It distills successful patterns from similar scenes into reusable, scene-aware skills while incorporating failures as lessons. During training, the system injects these skills when similar scenes recur, creating a closed loop of co-evolving memory and skill libraries. Experiments show significant optimization in tool utilization, with performance on VSI-Bench rising from 39% to 78%, and a 67% improvement for Gemini-3-Flash on MMSI-Bench. Furthermore, post-training agents with skill-guided trajectories boosted Qwen3-VL-8B's performance on VSI-Bench by 43%, demonstrating the framework's effectiveness in enhancing 3D spatial understanding.

Background and Context

The integration of Multimodal Large Language Models (MLLMs) into complex visual tasks has accelerated rapidly, yet a critical bottleneck remains in their ability to perform robust 3D spatial reasoning. While these models excel at 2D image recognition and textual analysis, translating that capability into an understanding of three-dimensional space—essential for applications ranging from robotic navigation to virtual reality interaction—has proven challenging. Current agent-based approaches, which were expected to bridge this gap by allowing models to interact with tools and environments, have largely failed to deliver significant performance improvements over non-agent strategies. The core issue lies not in the foundational capabilities of the MLLMs themselves, but in how they are instructed to utilize external tools within diverse 3D contexts.

A detailed examination of existing methodologies reveals a systemic flaw: the application of uniform, "one-size-fits-all" tool-use strategies across highly heterogeneous 3D scenarios. In reality, different spatial tasks require distinct combinations of tools and reasoning paths. For instance, determining the relative position of objects in a cluttered room demands different computational steps than calculating the volume of a geometric structure. By forcing a static strategy onto dynamic environments, current systems suffer from severe tool misuse and preference bias, where the model either ignores useful tools or over-relies on familiar but inappropriate ones. This rigidity prevents the agent from adapting to the specific nuances of each scene, resulting in stagnant performance gains despite the increased complexity of the agent framework.

To address this fundamental disconnect, recent research introduces Skill-3D, a novel framework designed to instill scene-awareness into the decision-making process of MLLM agents. Rather than relying on predefined, static protocols, Skill-3D enables agents to evolve their strategies based on direct interaction with the environment. The framework shifts the paradigm from generic tool application to the development of specialized, context-dependent skills. By recognizing the unique characteristics of each task scenario, Skill-3D allows the agent to build a dynamic memory system that records and learns from its own operational history. This approach targets the root cause of poor spatial reasoning: the lack of adaptive, experience-driven tool selection.

Deep Analysis

The technical architecture of Skill-3D is built around a sophisticated self-evolution mechanism centered on "scene memory." When an agent encounters a new task, the system first identifies the specific type of scene or context involved. As the agent executes its actions, every step of its tool usage is meticulously recorded as a trajectory within this scene memory. This comprehensive logging ensures that no detail of the interaction is lost, providing a rich dataset for subsequent analysis. The system does not merely store these trajectories passively; it actively processes them to extract actionable insights, distinguishing between successful outcomes and failures.

The core innovation lies in the aggregation and distillation of these recorded trajectories. Successful interactions from similar scenes are synthesized into reusable "scene-aware skills." These skills represent optimized patterns of tool usage that have been proven effective in specific contexts. Crucially, the framework also incorporates failure cases into this knowledge base. Instead of discarding unsuccessful attempts, Skill-3D attaches them to the corresponding skills as "lessons" or cautionary notes. This dual-layered approach ensures that the agent not only knows what works but also understands what to avoid, creating a more robust and resilient decision-making protocol. The resulting skill library is thus a balanced repository of positive examples and negative constraints.

During the training phase, this memory-skill loop becomes active. When the agent encounters a scene that resembles previously encountered contexts, the system automatically injects the relevant scene-aware skills into the prompt or reasoning chain. This guidance steers the agent toward generating new execution trajectories that are informed by past experiences. Whether these new trajectories succeed or fail, they are fed back into the scene memory system, further refining the existing skills. This creates a closed-loop cycle of co-evolution between the memory bank and the skill library. Over time, the agent accumulates a deep, nuanced understanding of how to navigate complex 3D environments, moving beyond blind trial-and-error to strategic, evidence-based action.

This iterative refinement process effectively eliminates the blindness and rigidity inherent in traditional methods. By dynamically selecting the optimal combination of tools and reasoning paths for each specific scenario, the agent avoids the pitfalls of preference bias. The system learns to prioritize tools that are genuinely useful for the task at hand, rather than defaulting to those it is most familiar with. This adaptability is key to handling the high heterogeneity of 3D spatial reasoning tasks, where no single strategy can suffice for all possible configurations of objects, spaces, and goals.

Industry Impact

The empirical validation of Skill-3D demonstrates its profound impact on the performance of MLLMs in 3D spatial reasoning tasks. Extensive experiments conducted on authoritative benchmarks reveal significant improvements in tool utilization efficiency and overall accuracy. On the VSI-Bench, a standard metric for evaluating spatial intelligence, the framework drove tool utilization rates from a baseline of 39% to an impressive 78%. This near-doubling of efficiency indicates that the agent is not only using tools more frequently but also more correctly and appropriately. Such a dramatic increase underscores the effectiveness of the scene-aware skill injection mechanism in guiding the model toward better operational decisions.

Furthermore, the framework exhibits strong generalization capabilities across different model architectures. When applied to Gemini-3-Flash on the MMSI-Bench, Skill-3D facilitated a 67% improvement in performance. This result highlights the compatibility of the framework with state-of-the-art proprietary models, suggesting that the benefits of scene-aware skill evolution are not limited to specific open-source implementations. The ability to enhance diverse models without requiring extensive architectural changes makes Skill-3D a versatile tool for developers and researchers aiming to boost the spatial reasoning capabilities of their existing systems.

Perhaps most notably, the research team explored the potential of agentic post-training using skill-guided trajectories. By fine-tuning the Qwen3-VL-8B model with data generated through the Skill-3D process, they achieved an additional 43% performance boost on VSI-Bench. This finding suggests that the skills distilled by the framework can be effectively transferred into the model's weights, leading to lasting improvements in its innate capabilities. Ablation studies confirmed that both the introduction of scene memory and the combined use of success and failure trajectories were essential for these gains, validating the holistic design of the framework.

These results have significant implications for the broader AI industry, particularly in sectors reliant on precise spatial understanding. For robotics, autonomous vehicles, and augmented reality applications, the ability to reason accurately about 3D space is paramount. Skill-3D offers a pathway to deploy more reliable and efficient agents in these domains, reducing the need for massive amounts of manually labeled training data. By leveraging self-generated experiences and lessons, the framework lowers the barrier to entry for developing specialized spatial agents, potentially accelerating the adoption of MLLMs in real-world industrial settings.

Outlook

The introduction of Skill-3D marks a pivotal shift in how researchers approach the enhancement of MLLM capabilities. It moves the focus away from simply scaling up model parameters or curating larger datasets, towards optimizing the interaction strategies and memory mechanisms of intelligent agents. This perspective emphasizes the importance of "scene awareness" as a critical component of spatial intelligence. Future research is likely to build upon this foundation, exploring more sophisticated methods for scene identification, skill distillation, and memory management. The concept of evolving skills through closed-loop feedback may become a standard paradigm in agent design, extending beyond 3D reasoning to other complex, multi-step tasks.

From an industrial standpoint, the reusability of scene-aware skills presents a compelling opportunity for customization. Companies can leverage the framework to develop tailored agent strategies for specific verticals, such as warehouse logistics, surgical robotics, or immersive gaming. By focusing on the unique spatial challenges of each domain, developers can create highly efficient agents that require less computational overhead and fewer training iterations. This modularity and adaptability will be crucial for scaling AI solutions across diverse applications, where one-size-fits-all models often fall short.

Moreover, the efficient utilization of failure data as "lessons" addresses a long-standing challenge in machine learning: making the most of negative samples. By integrating errors into the learning process, Skill-3D reduces waste and accelerates convergence. This approach aligns with broader trends in sustainable AI development, where maximizing the value of each computation and data point is increasingly important. As the framework matures, it may inspire new techniques for error analysis and corrective learning in other areas of artificial intelligence, promoting more robust and resilient systems.

Ultimately, Skill-3D lays the groundwork for the next generation of autonomous 3D agents. By enabling MLLMs to move beyond simple visual recognition to deeper logical reasoning and spatial cognition, it brings us closer to realizing truly intelligent systems capable of navigating and manipulating the physical world. The continued evolution of such frameworks will be instrumental in unlocking the full potential of multimodal AI, transforming it from a passive observer into an active, competent participant in complex spatial environments.

Sources

arXiv