PEEU: Empowering GUI Agent Task Planning via Autonomous Experience Exploration and Exploitation

To address the weak planning capability and limited cross-site generalization of small open-source Multimodal Large Language Models (MLLMs) in Graphical User Interface (GUI) task planning, this study proposes a method called Planning Experience Exploration and Exploitation (PEEU). PEEU discovers experience by autonomously exploring the environment, then uses retrospective experience synthesis to generate strictly aligned high-level training data that improves planning performance. The study also introduces the Task Decomposition Hierarchical Analysis Framework (TDHAF), which systematically examines compositional generalization behavior across low-, medium-, and high-granularity levels. Experiments reveal that mastering atomic low-level skills does not guarantee high-level planning ability, while training on higher-level tasks yields stronger out-of-distribution (OOD) generalization. On real-world benchmarks, the 7B-parameter model achieved 30.6% accuracy, surpassing the much larger Qwen2.5-VL-32B, which suggests that constructing high-level retrospective tasks and exploiting experience are critical to enhancing the planning capabilities of small MLLMs.

Background and Context

The proliferation of digital workflows has elevated the role of Multimodal Large Language Models (MLLMs) as autonomous agents capable of executing complex Graphical User Interface (GUI) tasks. While commercial closed-source models dominate the high-end market, small open-source MLLMs offer distinct advantages in cost-efficiency and data privacy, making them attractive for enterprise deployment. However, these smaller models suffer from significant limitations in planning capabilities, particularly when navigating the heterogeneous structures of different websites. The core challenge lies in translating high-level user instructions into a sequence of executable atomic actions with sufficient robustness to handle cross-site variations. Existing solutions often rely on massive labeled datasets or prohibitively large model architectures, creating a barrier for resource-constrained applications where generalization across unseen domains is critical.

To address these deficiencies, researchers have introduced the Planning Experience Exploration and Exploitation (PEEU) framework. Rather than learning passively from a fixed dataset, the agent autonomously explores its environment to discover experience. Through retrospective experience synthesis, PEEU converts raw interaction trajectories into strictly aligned, high-level training data. This mechanism bridges the gap between low-level atomic operations and high-level planning, allowing small models to improve their planning without extensive human annotation. The framework is designed to mitigate the scarcity of high-quality training data while enhancing the model's ability to generalize across diverse GUI layouts.

Complementing the PEEU framework is the Task Decomposition Hierarchical Analysis Framework (TDHAF), a methodological tool introduced to systematically dissect the components of generalization behavior. TDHAF categorizes tasks into three distinct granularity levels: low-level atomic operations, mid-level sub-task combinations, and high-level overall task planning. This hierarchical structure allows researchers to isolate and evaluate how models learn at different abstraction layers. By analyzing performance across these tiers, the study reveals critical insights into the relationship between basic operational proficiency and complex planning abilities, providing a structured approach to optimizing model training for specific vertical applications.
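The three granularity levels can be made concrete with a small evaluation sketch. The snippet below is an illustration only, not the paper's code: the names `Granularity`, `EvalRecord`, and `accuracy_by_level`, as well as the toy records, are assumptions introduced here to show how per-level accuracy might be tabulated.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    LOW = "low"    # atomic operations, e.g. a single click or keystroke
    MID = "mid"    # sub-task combinations
    HIGH = "high"  # overall task planning


@dataclass(frozen=True)
class EvalRecord:
    task_id: str
    level: Granularity
    success: bool


def accuracy_by_level(records):
    """Return {Granularity: fraction of successful records} for levels present."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for record in records:
        totals[record.level] += 1
        wins[record.level] += int(record.success)
    return {level: wins[level] / totals[level] for level in totals}


# Toy records, invented for illustration; they are not results from the study.
records = [
    EvalRecord("t1", Granularity.LOW, True),
    EvalRecord("t2", Granularity.LOW, True),
    EvalRecord("t3", Granularity.MID, True),
    EvalRecord("t4", Granularity.MID, False),
    EvalRecord("t5", Granularity.HIGH, False),
]
per_level = accuracy_by_level(records)
```

Reporting accuracy separately per level is what makes it possible to see a model that does well on atomic operations while still failing at overall task planning.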

Deep Analysis

The technical architecture of PEEU operates as a closed loop of exploration and exploitation. First, the agent autonomously explores diverse GUI environments, collecting raw interaction trajectories through trial and error. These initial trajectories are often noisy and inefficient, containing redundant steps or errors. To refine this data, the framework employs a retrospective experience synthesis module that re-evaluates historical interactions. This process identifies the key step sequences that led to successful task completion and abstracts them into high-level planning samples. Turning raw operational data into structured planning knowledge lets the model learn generalized planning strategies rather than memorize specific interface interactions.
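A minimal sketch of the retrospective synthesis step might look as follows. The article does not specify how key steps are identified or how the high-level instruction is written, so everything here is an assumption: the `changed_state` flag is a hypothetical signal for whether a step had any effect, and `describe_stub` stands in for whatever model or rule produces the retrospective task description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    action: str          # e.g. "click" or "type"
    target: str          # hypothetical description of the UI element
    changed_state: bool  # assumed signal: did the action change the page?


@dataclass(frozen=True)
class PlanningSample:
    instruction: str  # high-level task written after the fact
    plan: List[str]   # ordered key steps that realize the instruction


def synthesize(trajectory: List[Step],
               describe: Callable[[List[Step]], str]) -> Optional[PlanningSample]:
    """Drop steps that had no effect, then label the remainder retrospectively."""
    key_steps = [step for step in trajectory if step.changed_state]
    if not key_steps:
        return None  # nothing useful was discovered in this trajectory
    plan = [f"{step.action} {step.target}" for step in key_steps]
    return PlanningSample(instruction=describe(key_steps), plan=plan)


def describe_stub(steps: List[Step]) -> str:
    # Placeholder only: a real system would presumably use a model here.
    last = steps[-1]
    return f"Task ending with: {last.action} {last.target}"


# A noisy exploration trajectory with one redundant click.
trajectory = [
    Step("click", "search box", True),
    Step("click", "page background", False),
    Step("type", "search box", True),
    Step("click", "search button", True),
]
sample = synthesize(trajectory, describe_stub)
```

Because the instruction is derived from the steps that were actually kept, the instruction and the plan agree by construction, which is one plausible reading of "strictly aligned" training data.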

The introduction of TDHAF provides a granular lens through which to examine the efficacy of this training process. By dividing tasks into low, medium, and high granularity, the framework allows for precise quantification of model performance at each stage. Low-level training focuses on atomic skills such as clicking or typing, while high-level training emphasizes the semantic role of these actions within a broader task flow. This layered approach ensures that the model builds a coherent logical chain from perception to decision-making. The analysis demonstrates that simply mastering atomic skills does not guarantee proficiency in complex planning, highlighting the necessity of high-level abstraction in training.

A critical finding from the TDHAF analysis is the disparity between low-level skill acquisition and high-level generalization. Models trained exclusively on atomic operations often struggle with compositional generalization, failing to adapt when faced with complex, multi-step tasks. In contrast, models exposed to high-level task training exhibit significantly stronger out-of-distribution (OOD) generalization capabilities. This suggests that high-level abstract thinking is essential for understanding the essence of a task and transferring knowledge to new contexts. The retrospective experience synthesis mechanism further enhances this by stabilizing planning performance, as increasing the proportion of utilized retrospective experience correlates with improved robustness in task execution.
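The effect of the retrospective-experience proportion could be studied by building training mixes that include a controlled fraction of the synthesized samples. The sketch below is an assumption about how such an ablation might be set up, not the authors' procedure; `build_mix`, the sample names, and the seed are all hypothetical.

```python
import random


def build_mix(base, retrospective, proportion, seed=0):
    """Return base samples plus a random `proportion` of retrospective samples."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    k = round(proportion * len(retrospective))
    chosen = rng.sample(retrospective, k)
    mix = list(base) + chosen
    rng.shuffle(mix)
    return mix


# Placeholder sample identifiers, invented for illustration.
base = [f"base-{i}" for i in range(4)]
retro = [f"retro-{i}" for i in range(10)]
half = build_mix(base, retro, 0.5)
```

Training one model per proportion (for example 0.0, 0.5, and 1.0) and evaluating each on the same held-out sites would then show how planning accuracy varies with the amount of retrospective experience used.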

Industry Impact

The implications of the PEEU framework extend beyond academic research, offering a viable pathway for the democratization of AI agents. By demonstrating that a small model can match or exceed a much larger one through better data engineering and training strategies, PEEU lowers the barrier to entry for enterprise AI deployment. This efficiency reduces the computational costs associated with running large-scale models, making advanced automation accessible to organizations with limited infrastructure. The ability of these small models to generalize across different web environments without extensive retraining is particularly valuable for industries requiring rapid adaptation to changing digital landscapes.

Furthermore, the autonomous experience exploration mechanism reduces the dependency on manual data annotation, a significant bottleneck in the development of specialized AI agents. By enabling models to learn from their own interactions, PEEU facilitates continuous improvement and adaptation to new GUI designs. This capability is crucial for sectors such as software testing, where automated agents must navigate evolving user interfaces, and for accessibility tools that assist users with disabilities in managing complex digital tasks. The framework encourages the open-source community to focus on efficient data utilization and algorithmic innovation, potentially accelerating the development of more robust and versatile AI tools.

The success of PEEU also challenges the prevailing notion that model scale is the primary driver of performance in GUI task planning. By showing that a 7B-parameter model can outperform the 32B-parameter Qwen2.5-VL-32B through effective experience exploitation, the research underscores the importance of data quality and training methodology. This insight encourages a shift in industry focus towards optimizing training pipelines and leveraging retrospective learning, rather than solely investing in larger model architectures. Such a shift could lead to more sustainable and scalable AI solutions, particularly in resource-constrained environments.

Outlook

The experimental results highlight the substantial potential of PEEU in enhancing the planning capabilities of small MLLMs. In real-world benchmarks, the 7B-parameter model achieved an accuracy of 30.6%, surpassing the significantly larger Qwen2.5-VL-32B model. This achievement validates the effectiveness of constructing high-level retrospective tasks and leveraging autonomous experience in boosting model performance. The data indicates that as the proportion of utilized retrospective experience increases, the stability and accuracy of the agent's planning improve, confirming the value of the proposed synthesis mechanism. These findings suggest that future developments in GUI agents will likely prioritize intelligent data curation and hierarchical learning over mere parameter scaling.

Looking ahead, the integration of PEEU principles into broader multimodal systems could unlock new possibilities for cross-platform automation. As web technologies continue to evolve, the ability of agents to generalize from limited experience will become increasingly critical. Future research may explore extending the TDHAF framework to even more complex, multi-modal tasks involving video or audio inputs, further broadening the scope of autonomous agent applications. Additionally, the combination of PEEU with reinforcement learning techniques could lead to agents that not only plan but also continuously refine their strategies through real-time feedback.

Ultimately, the PEEU framework represents a significant step towards more capable and efficient AI agents. By addressing the core limitations of small models in task planning and generalization, it provides a robust foundation for the next generation of GUI automation tools. As the technology matures, we can expect to see wider adoption in industries ranging from enterprise software testing to consumer accessibility, driving a more intelligent and automated digital future. The emphasis on high-level abstraction and experience exploitation sets a new standard for developing AI systems that are not only powerful but also adaptable and resource-efficient.
