What is the FORGE protocol?

FORGE is a gradient-free memory framework that allows AI agents to self-evolve without weight updates, transforming failures into usable knowledge through internal reflection and external broadcasting.

Why does this research matter?

On the CybORG CAGE-2 network defense benchmark, it raises average evaluation reward by 1.7 to 7.7 times over zero-shot baselines across four LLM families and cuts the failure rate to approximately 1%.

What should we watch for next?

Future work may explore FORGE in dialogue and robotics. Because it needs no weight updates, it is also a promising candidate for running efficient agents on resource-constrained edge devices.

FORGE: Self-Evolving Agent Memory Mechanism and Population Broadcast Protocol Without Weight Updates

This paper presents FORGE, a gradient-free self-evolving agent memory framework designed to address the lack of long-term memory accumulation in large language model agents performing complex decision-making tasks. FORGE employs a population-based phased protocol: in the inner loop, agents transform failure trajectories into heuristic rules, few-shot examples, or hybrid knowledge through internal reflection; in the outer loop, the memories of the best-performing instances are broadcast to the rest of the population at the end of each phase. Evaluated on the CybORG CAGE-2 network defense benchmark, FORGE significantly outperforms both zero-shot and Reflexion baselines across all four LLM families tested (Gemini, Grok, Llama, Qwen), improving average evaluation reward by 1.7 to 7.7 times while reducing failure rates to approximately 1%. Ablation studies reveal that population broadcast is the core driver of performance gains, with few-shot examples proving most effective on three of the four models. This work establishes a new paradigm for resource-efficient agent self-evolution and helps narrow the performance gap between models of different capability levels.

Background and Context

Large language model agents operating in dynamic, adversarial environments frequently encounter significant limitations due to their static knowledge boundaries. Traditional approaches to enhancing agent decision-making capabilities have relied heavily on gradient-based updates or external fine-tuning processes. While these methods can improve performance, they introduce substantial computational costs and deployment complexities that are often prohibitive for real-time or resource-constrained applications. The core challenge lies in enabling agents to accumulate long-term memory and adapt to complex decision-making tasks without the need for expensive model retraining or weight modifications. This gap has necessitated the development of more efficient, lightweight mechanisms that allow agents to learn from experience in a manner that is both computationally sustainable and scalable.

To address these critical limitations, researchers have introduced FORGE (Failure-Optimized Reflective Graduation and Evolution), a novel protocol designed for self-evolving agent memory that operates without weight updates. FORGE represents a paradigm shift by enabling agents to optimize their decision-making processes through self-generated natural language memories. Unlike conventional methods that require distilling knowledge into stronger models or updating model parameters, FORGE leverages a hierarchical ReAct agent architecture. This architecture facilitates the efficient accumulation and propagation of knowledge through a dual-loop system: an internal reflection loop and an external population broadcast loop. By decoupling memory evolution from model weights, FORGE offers a flexible solution that can be applied across various large language model families without altering their underlying structures.
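
The inner reflection loop can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name inner_loop and the three callables (run_episode, reflect, is_failure) are hypothetical stand-ins for the environment rollout, the reflection agent, and the failure test.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # assumed format: a log of observation/action steps


def inner_loop(
    run_episode: Callable[[List[str]], Tuple[Trajectory, float]],
    reflect: Callable[[Trajectory], str],
    is_failure: Callable[[float], bool],
    episodes: int,
) -> List[str]:
    """Accumulate natural-language lessons from failed episodes.

    No model weights are touched; the only state that changes is `memory`.
    """
    memory: List[str] = []
    for _ in range(episodes):
        # The agent acts with its current memory available to it.
        trajectory, episode_return = run_episode(memory)
        if is_failure(episode_return):
            # Hypothetical reflection call: turn the failed trajectory into a lesson.
            lesson = reflect(trajectory)
            if lesson not in memory:
                memory.append(lesson)
    return memory
```

In a real setup, run_episode and reflect would both call the same underlying LLM, which is the point the text makes about not needing a stronger teacher model.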

The significance of FORGE is further underscored by its ability to maintain the original architecture of the underlying language models while injecting memory through prompt engineering. This approach ensures high generalizability and flexibility, allowing the protocol to be easily adapted to different LLM families. The protocol’s design addresses the inefficiency of traditional reinforcement learning methods by focusing on the transformation of failure trajectories into reusable knowledge artifacts. These artifacts, which include heuristic rules, few-shot examples, or hybrid forms, are generated through internal reflection and propagated through population-based mechanisms. This method not only enhances the agent's performance but also optimizes the use of computational resources by avoiding the overhead associated with continuous model updates.
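
To make the idea of injecting memory through the prompt concrete, here is a small Python sketch. The three artifact kinds come from the text; the class names, the build_prompt function, and the prompt wording are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ArtifactKind(Enum):
    RULE = "rule"        # heuristic guideline
    EXAMPLE = "example"  # few-shot demonstration
    MIXED = "mixed"      # combination of both


@dataclass(frozen=True)
class MemoryArtifact:
    kind: ArtifactKind
    text: str


def build_prompt(task_description: str, memory: List[MemoryArtifact]) -> str:
    """Prepend accumulated memory to the task prompt (hypothetical layout)."""
    if not memory:
        return task_description
    lines = ["Lessons from earlier failures:"]
    for i, artifact in enumerate(memory, start=1):
        lines.append(f"{i}. [{artifact.kind.value}] {artifact.text}")
    lines.append("")  # blank line between memory and task
    lines.append(task_description)
    return "\n".join(lines)
```

Because the memory is plain text placed ahead of the task, the same function works unchanged for any model that accepts a text prompt, which is what gives the approach its portability across LLM families.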

Deep Analysis

FORGE employs a sophisticated two-layer loop architecture that mimics and improves upon existing learning frameworks like Reflexion. In the inner loop, the system utilizes a dedicated reflection agent to analyze failure trajectories encountered during task execution. Instead of relying on a more powerful model for knowledge distillation, FORGE uses the same underlying Large Language Model to perform self-reflection. When an agent fails, the reflection agent extracts generalizable knowledge from the failure sequence and formats it into three distinct types of knowledge artifacts: Rules (heuristic guidelines), Examples (few-shot demonstrations), or Mixed forms (a combination of both). These artifacts are then injected into the agent’s prompt as natural language memory, effectively allowing the agent to learn from its mistakes without any gradient updates. This mechanism ensures that the learning process is intrinsic to the agent’s interaction with the environment and its own memory state.

The outer loop introduces a population-based approach to memory propagation, enhancing the diversity and robustness of the learning process. FORGE maintains a population of agent instances, and at the end of each phase, the system evaluates the performance of all instances. The memories of the top-performing instances are then broadcast to the rest of the population, facilitating the spread of successful strategies. A key innovation in this phase is the introduction of a graduation mechanism. When an agent’s memory reaches a certain convergence standard, it is removed from the active population and frozen. This process prevents the waste of computational resources on redundant learning and ensures that the population maintains a diverse set of strategies, thereby avoiding local optima and promoting broader exploration of the solution space.
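
The outer loop described above can be sketched in Python as follows. This is a hedged illustration only. The text does not specify the convergence standard or how broadcast memories are merged, so two assumptions are made here: an instance graduates once its memory has been unchanged for `patience` consecutive phases, and recipients simply overwrite their memory with the donors' memory. The names Instance, broadcast_phase, top_k, and patience are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    name: str
    memory: List[str] = field(default_factory=list)
    stable_phases: int = 0   # consecutive phases with unchanged memory (assumed criterion)
    graduated: bool = False  # graduated instances are frozen and skipped


def broadcast_phase(
    population: List[Instance],
    rewards: Dict[str, float],
    top_k: int = 1,
    patience: int = 2,
) -> List[Instance]:
    """One end-of-phase step: broadcast the best memories, then graduate stable instances."""
    active = [inst for inst in population if not inst.graduated]
    ranked = sorted(active, key=lambda inst: rewards[inst.name], reverse=True)
    donors = ranked[:top_k]
    donor_names = {d.name for d in donors}
    # Merge donor memories, keeping order and dropping duplicates.
    shared = list(dict.fromkeys(item for d in donors for item in d.memory))
    for inst in active:
        before = list(inst.memory)
        if inst.name not in donor_names:
            inst.memory = list(shared)  # assumed policy: overwrite with donor memory
        inst.stable_phases = inst.stable_phases + 1 if inst.memory == before else 0
        if inst.stable_phases >= patience:
            inst.graduated = True  # frozen: no further compute is spent on it
    return donors
```

Calling broadcast_phase once per phase with the measured rewards keeps the active population shrinking as instances converge, which is the compute-saving effect the text attributes to graduation.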

To validate the efficacy of FORGE, extensive evaluations were conducted on the CybORG CAGE-2 benchmark, a partially observable Markov decision process (POMDP) designed for network defense. This benchmark presents a highly stochastic and complex environment where agents must defend against B-line attackers over a 30-step horizon. The study tested FORGE on one model from each of four LLM families: Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, and Qwen3-235B. In zero-shot settings, these models exhibited strongly negative, heavy-tailed reward distributions, highlighting their inherent difficulties in complex defensive tasks. The results demonstrated that FORGE significantly outperformed both zero-shot and Reflexion baselines across all tested models. Specifically, the average evaluation reward improved by a factor of 1.7 to 7.7 compared to the zero-shot baseline, and by 29% to 72% compared to the isolated single-stream Reflexion baseline. Furthermore, the failure rate, defined as the share of episodes with returns below -100, was reduced to approximately 1%, indicating a substantial improvement in system reliability.
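
The two headline metrics can be computed with a few lines of Python. The failure threshold of -100 comes from the text. How the improvement factor is calculated is not stated, so the ratio of mean returns used below is an assumption (it only makes sense when both means are negative), and the sample returns are invented for illustration rather than taken from the paper.

```python
from statistics import mean
from typing import Sequence

FAILURE_THRESHOLD = -100.0  # from the text: a failure is a return below -100


def failure_rate(returns: Sequence[float]) -> float:
    """Fraction of episodes whose return falls below the failure threshold."""
    return sum(r < FAILURE_THRESHOLD for r in returns) / len(returns)


def improvement_factor(baseline: Sequence[float], improved: Sequence[float]) -> float:
    """Assumed definition: ratio of mean returns when both means are negative."""
    base, new = mean(baseline), mean(improved)
    if base >= 0 or new >= 0:
        raise ValueError("ratio is only meaningful when both means are negative")
    return base / new


# Invented example returns, not results from the paper.
zero_shot = [-40.0, -120.0, -200.0, -60.0]   # mean -105
with_memory = [-10.0, -20.0, -30.0, -24.0]   # mean -21

print(failure_rate(zero_shot))                     # 0.5
print(failure_rate(with_memory))                   # 0.0
print(improvement_factor(zero_shot, with_memory))  # 5.0
```

Reporting the failure rate alongside the mean matters because the zero-shot reward distributions are heavy-tailed: a few catastrophic episodes can dominate the average.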

Ablation studies provided deeper insights into the components driving FORGE’s success. A variant without the graduation mechanism still delivered most of the gains, confirming population broadcast as the primary driver of performance, while graduation itself mainly contributed computational efficiency. In terms of knowledge representation, few-shot examples proved the most effective for three of the four tested models, yielding the highest rewards. However, heuristic rules were more cost-effective, reducing token usage by approximately 40% while maintaining robust performance. Notably, models with weaker baseline capabilities benefited more from FORGE, suggesting that the protocol helps narrow the performance gap between different tiers of LLMs rather than merely amplifying the advantages of already strong models.

Industry Impact

The introduction of FORGE has profound implications for the deployment of AI agents in resource-constrained environments. By eliminating the need for weight updates, FORGE enables the deployment of sophisticated, self-evolving agents on edge devices or in real-time systems where computational resources are limited. This capability significantly reduces the costs associated with model maintenance and updates, making advanced AI decision-making more accessible and practical for industrial applications. The protocol’s reliance on natural language memory also enhances the transparency and interpretability of the agent’s evolution. Researchers and engineers can directly inspect and analyze the accumulated knowledge, such as heuristic rules or examples, providing valuable insights into the agent’s decision-making logic. This interpretability is critical for debugging, improving agent behavior, and ensuring compliance with safety and regulatory standards in high-stakes environments.

In the cybersecurity domain, FORGE is particularly well-suited for applications requiring long-term memory and rapid adaptation to dynamic threats. The CybORG CAGE-2 benchmark, which simulates network defense scenarios, demonstrates the protocol’s potential in protecting systems against sophisticated, evolving attacks. The ability of FORGE to reduce failure rates to approximately 1% while maintaining high reward scores indicates its reliability in critical infrastructure protection. Additionally, the protocol’s efficiency in managing computational resources through its graduation mechanism makes it ideal for large-scale automated operations where continuous monitoring and response are required. By enabling agents to learn from past failures and propagate successful strategies across a population, FORGE offers a robust framework for building resilient, adaptive security systems.

The open-source community stands to benefit significantly from FORGE’s standardized approach to self-evolution. By providing a protocol that does not depend on specific model architectures or weight updates, FORGE facilitates easier comparison and collaboration among different research teams. This standardization can accelerate the development of new agent-based applications and foster a more collaborative ecosystem. Furthermore, the protocol’s flexibility allows it to be adapted to various other domains beyond cybersecurity, such as customer service, automated trading, and robotic control. The potential for FORGE to bridge the performance gap between different model capabilities also democratizes access to high-performance AI, allowing organizations with limited resources to leverage advanced agent technologies.

Outlook

Looking ahead, the research community is poised to explore the broader applicability of FORGE across diverse task domains. Future studies may investigate the protocol’s effectiveness in areas such as conversational AI, where long-term context retention is crucial, or in robotic control systems, where adaptive learning from physical interactions is required. Optimizing the representation of memory artifacts and refining the broadcast strategies within the population loop are likely to be key areas of focus. Researchers may experiment with more sophisticated hybrid knowledge forms or dynamic adjustment of the graduation criteria to further enhance efficiency and performance. Additionally, there is potential to integrate FORGE with other reinforcement learning techniques to create even more robust and versatile agent systems.

The development of FORGE also opens new avenues for understanding the cognitive processes of AI agents. By analyzing the natural language memories generated through the reflection loop, researchers can gain insights into how agents form heuristics and generalize from specific experiences. This could lead to the development of more human-like learning algorithms that mimic biological memory consolidation processes. As the protocol matures, it may also inspire new architectures for multi-agent systems, where populations of agents collaborate and compete to solve complex problems. The emphasis on transparency and interpretability in FORGE could set a new standard for responsible AI development, ensuring that autonomous systems remain understandable and controllable.

Finally, the economic implications of FORGE’s resource efficiency cannot be overstated. As AI adoption grows, the cost of training and maintaining large models remains a significant barrier. FORGE’s ability to enhance performance without additional training costs offers a sustainable path forward for the industry. It allows organizations to maximize the value of existing models while continuously improving their capabilities through memory evolution. This approach could lead to a new generation of AI services that are more affordable, scalable, and adaptable. As the technology evolves, we can expect to see FORGE and similar protocols becoming integral components of the AI infrastructure, enabling a future where autonomous agents are not only intelligent but also efficient, transparent, and widely accessible.

The trajectory of FORGE suggests a shift towards more modular and composable AI systems. Instead of monolithic models that require extensive retraining, future systems may rely on lightweight protocols that allow for continuous, low-cost adaptation. This modularity could facilitate faster innovation cycles and more rapid deployment of AI solutions in dynamic markets. The success of FORGE in the CybORG CAGE-2 benchmark serves as a proof of concept for this new paradigm, demonstrating that significant performance gains are achievable through intelligent memory management rather than brute-force computational power. As the field moves forward, the integration of such protocols into mainstream AI development practices will likely redefine the standards for agent performance and efficiency.

Sources

arXiv