PEEU: Enhancing GUI Agent Task Planning via Autonomous Experience Exploration and Retroactive Utilization

To address the weak planning ability and limited cross-website generalization of small open-source Multimodal Large Language Models (MLLMs) in GUI task planning, this study proposes a novel method called Planning Experience Exploration and Utilization (PEEU). This method discovers experiences by autonomously exploring the environment and leverages retroactive experiences to synthesize strictly aligned high-level training data, significantly boosting model performance. The study also introduces the Task Decomposition Hierarchy Analysis Framework (TDHAF), which systematically examines compositional generalization behaviors across low, medium, and high granularity levels. It is found that high-level task training yields stronger out-of-distribution (OOD) generalization. In real-world benchmark tests, the 7B-parameter PEEU model achieves an accuracy of 30.6%, surpassing the much larger Qwen2.5-VL-32B model, demonstrating that constructing high-level retroactive tasks and leveraging experiences are crucial for enhancing the planning capabilities of small MLLMs.

Background and Context

The proliferation of digital workflows has established multimodal web agents as critical tools for automating repetitive graphical user interface (GUI) operations. These agents are designed to translate complex human instructions into executable atomic actions, thereby enhancing productivity in office environments and automated systems. While commercial closed-source large models have historically dominated this space, small open-source Multimodal Large Language Models (MLLMs) offer distinct advantages regarding cost-efficiency and data privacy. However, these smaller models face significant technical hurdles when tasked with complex planning. Specifically, they exhibit weak planning capabilities and limited generalization across different websites, creating a bottleneck that prevents their widespread adoption in real-world scenarios where adaptability is paramount.

To address these limitations, researchers have introduced a novel methodology known as Planning Experience Exploration and Utilization (PEEU). This approach aims to bridge the gap between small model constraints and the demands of complex GUI task planning. The core innovation of PEEU lies in its ability to autonomously explore environments to discover operational experiences. By leveraging hindsight experience mechanisms, the system can synthesize strictly aligned high-level training data. This process allows the model to learn from successful trajectories, effectively compensating for the data scarcity issues that typically plague smaller models. The method represents a shift from passive learning to active experience discovery, enabling small models to develop a deeper understanding of task logic without relying on massive pre-existing datasets.

Deep Analysis

The technical implementation of PEEU diverges from traditional supervised fine-tuning by integrating reinforcement learning with data synthesis. The model is empowered to explore unknown or semi-structured GUI environments, collecting state-action pairs through trial and error. Once successful task completions are identified, the system employs retroactive learning techniques to analyze these trajectories. This analysis extracts key high-level decision-making logic, which is then used to generate training samples that are strictly aligned with the current task objectives. The resulting synthetic data not only contains specific operational instructions but also encapsulates the logical structure of task decomposition, providing a richer learning signal for the model.
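The retroactive step described above can be illustrated with a minimal sketch. It is not the PEEU implementation, whose details the article does not give. It only shows the general idea of writing the instruction after the fact, so that the instruction and the recorded plan agree by construction. The names `Step`, `TrainingSample`, `relabel_in_hindsight`, and `toy_describer` are hypothetical, and the describer is a stand-in for whatever model would summarize a finished trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One explored interaction (hypothetical structure)."""
    action: str  # atomic action the agent took, e.g. "click(search_box)"
    result: str  # textual summary of the screen after the action


@dataclass(frozen=True)
class TrainingSample:
    instruction: str  # high-level task, written after exploration
    plan: List[str]   # the actions that actually produced the outcome


def relabel_in_hindsight(
    trajectory: List[Step],
    describe_outcome: Callable[[List[Step]], str],
) -> TrainingSample:
    """Turn an explored trajectory into an instruction/plan pair.

    The instruction is derived from what the trajectory achieved, so the
    plan is aligned with it by construction.
    """
    if not trajectory:
        raise ValueError("cannot relabel an empty trajectory")
    instruction = describe_outcome(trajectory)
    return TrainingSample(
        instruction=instruction,
        plan=[step.action for step in trajectory],
    )


def toy_describer(trajectory: List[Step]) -> str:
    """Assumed stand-in: phrase the final screen as a task description."""
    return f"Reach this state: {trajectory[-1].result}"


# Made-up exploration log, for illustration only.
explored = [
    Step("click(search_box)", "search box focused"),
    Step("type('wireless mouse')", "query entered"),
    Step("click(first_result)", "product page for 'wireless mouse'"),
]
sample = relabel_in_hindsight(explored, toy_describer)
print(sample.instruction)
print(sample.plan)
```

In this sketch the describer is passed in as a function, so the relabeling logic stays independent of how the outcome is summarized.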

To systematically evaluate the factors driving generalization, the research team developed the Task Decomposition Hierarchy Analysis Framework (TDHAF). This framework categorizes task granularity into three distinct levels: low, medium, and high. Low granularity corresponds to atomic skills such as clicking or typing, medium granularity involves intermediate steps, and high granularity encompasses overall task planning. By analyzing performance across these levels, researchers can pinpoint exactly where a model struggles. The analysis reveals that high-level task training is particularly crucial for fostering out-of-distribution (OOD) generalization. This suggests that understanding the macro-structure of a task is more important than merely mastering micro-operational sequences when facing unseen websites or task variations.
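A per-level breakdown of this kind can be sketched in a few lines. The example below is an assumption about how such a report might be computed, not the TDHAF code. The `Granularity` enum mirrors the three levels named above, `accuracy_by_level` is a hypothetical helper, and the outcomes are invented for illustration; they are not results from the study.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, Tuple


class Granularity(Enum):
    LOW = "low"        # atomic skills such as clicking or typing
    MEDIUM = "medium"  # intermediate steps
    HIGH = "high"      # overall task planning


def accuracy_by_level(
    results: Iterable[Tuple[Granularity, bool]],
) -> Dict[Granularity, float]:
    """Group pass/fail outcomes by granularity and return accuracy per level."""
    totals: Dict[Granularity, int] = defaultdict(int)
    correct: Dict[Granularity, int] = defaultdict(int)
    for level, succeeded in results:
        totals[level] += 1
        correct[level] += int(succeeded)
    return {level: correct[level] / totals[level] for level in totals}


# Invented outcomes, for illustration only (not data from the paper).
toy_results = [
    (Granularity.LOW, True),
    (Granularity.LOW, True),
    (Granularity.MEDIUM, True),
    (Granularity.MEDIUM, False),
    (Granularity.HIGH, False),
    (Granularity.HIGH, False),
    (Granularity.HIGH, True),
    (Granularity.HIGH, False),
]
report = accuracy_by_level(toy_results)
for level, accuracy in report.items():
    print(f"{level.value}: {accuracy:.2f}")
```

Reporting accuracy separately for each level, rather than as a single aggregate score, is what makes it possible to see whether failures concentrate in atomic skills or in overall planning.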

Empirical validation of the PEEU method was conducted across multiple real-world GUI operation benchmarks. A small model with only 7 billion parameters achieved an accuracy of 30.6% after applying the PEEU methodology, surpassing the Qwen2.5-VL-32B model, which has roughly 4.6 times as many parameters (32 billion versus 7 billion). This outcome demonstrates that targeted experience utilization can enable small models to compete with much larger, more resource-intensive general models. Furthermore, ablation studies confirmed that training exclusively on low-level atomic skills does not guarantee high-level planning proficiency. Instead, explicit training on high-level retroactive tasks is essential for robust generalization, highlighting the efficacy of the PEEU framework in enhancing planning capabilities within small MLLMs.

Industry Impact

The implications of the PEEU method extend beyond academic metrics, offering tangible benefits for the open-source AI community and industrial applications. By showing that small models can reach competitive performance through experience exploration, the research reduces the dependency on massive parameter counts. This democratization of capability allows for the deployment of efficient agents in resource-constrained environments, such as edge devices, or in sectors with strict privacy requirements where data cannot be sent to cloud-based proprietary models. The ability to run complex GUI automation locally enhances security and reduces latency, making it attractive for enterprise use cases.

The Task Decomposition Hierarchy Analysis Framework (TDHAF) provides a standardized tool for future research into compositional generalization. For the broader AI community, this framework offers a structured way to diagnose and improve model performance, moving beyond black-box evaluations. For industry players, the low-cost, high-generalization characteristics of PEEU-enhanced models open new avenues in software testing, Robotic Process Automation (RPA), and personal assistant development. These applications require agents that can adapt to diverse interfaces without extensive retraining, a capability that PEEU explicitly addresses. The method thus serves as a blueprint for developing more agile and cost-effective automation solutions.

Outlook

Looking forward, the success of PEEU suggests a paradigm shift in how small multimodal models are trained for interactive tasks. The emphasis on high-level retroactive tasks and autonomous experience discovery points toward a future where AI agents are not just reactive but proactive planners. As more research builds upon these foundations, small open-source models are likely to play a more central role in complex interaction scenarios. This evolution will drive AI from mere perception and recognition toward deeper levels of action and strategic planning.

The trajectory indicated by this study implies that the gap between small and large models in specific domains may continue to narrow. Developers will increasingly prioritize efficient learning mechanisms over sheer model size, leading to more sustainable and accessible AI technologies. The integration of frameworks like TDHAF into standard development pipelines could accelerate the creation of robust GUI agents capable of handling the dynamic nature of modern web interfaces. Ultimately, the PEEU method lays the groundwork for a new generation of intelligent agents that are both powerful and efficient, capable of operating autonomously in diverse and unpredictable digital environments.

The continued refinement of experience utilization techniques will likely yield even greater gains in generalization and accuracy. Future iterations may incorporate more sophisticated reinforcement learning algorithms or hybrid architectures that further enhance the model's ability to reason about task structures. As these technologies mature, we can expect to see widespread adoption in industries ranging from finance to healthcare, where automated GUI interaction is crucial for efficiency. The journey from small model limitations to high-performance autonomy is well underway, with PEEU serving as a pivotal milestone in this ongoing transformation.
