Beyond Current Observations: Evaluating Memory and Reasoning of Multimodal LLMs in Controllable Non-Markovian Games

This paper introduces RNG-Bench, a benchmark suite for evaluating multimodal large language models (MLLMs) in controllable non-Markovian environments, a key challenge for closed-loop policy deployment. Unlike existing benchmarks, which either expose the full state or conflate hidden-state reconstruction with other capabilities, RNG-Bench isolates the ability to reconstruct past observations and act on them. The suite features two games, Match-Pair and 3D Maze, with difficulty controlled along three axes: grid size, visual pattern complexity, and observation modality. The hardest configurations reach roughly 128K-token contexts and 350 images. The authors introduce the "memory gap" metric and find that leading models' errors stem primarily from forgetting early observations rather than from decision-making failures. Fine-tuning Qwen3.5-9B on optimal policy trajectories significantly improves RNG-Bench performance without degrading general multimodal capabilities, offering a new direction for evaluating and enhancing long-term memory and spatial reasoning.

Background and Context

Deploying multimodal large language models (MLLMs) as closed-loop policy agents raises a critical engineering challenge: the agent must act on observations that are no longer visible at later time steps. Such an environment is non-Markovian: the right action depends not only on the current observation but on information reconstructed from the history of earlier ones. Despite the importance of this capability, existing benchmarks often fail to assess it accurately. Many either expose the full environment state to the model, which masks memory deficiencies, or conflate hidden-state reconstruction with unrelated agent skills, so the resulting scores do not cleanly measure memory. Others test recall only after an episode has ended, which does not reflect the real-time reasoning demanded of a model during active interaction.
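The distinction can be written compactly. The notation below is introduced here for illustration and is not taken from the paper: o_t is the observation at step t, a_t the action, and \pi the policy. A policy that sees only the current observation suffices in the Markovian case; in the non-Markovian case the policy must condition on the full interaction history.

```latex
\begin{align*}
\text{Markovian:} \quad & a_t \sim \pi(\,\cdot \mid o_t) \\
\text{Non-Markovian:} \quad & a_t \sim \pi(\,\cdot \mid o_1, a_1, \dots, o_{t-1}, a_{t-1}, o_t)
\end{align*}
```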

To address these systemic gaps, researchers have introduced RNG-Bench (Reconstructive Non-Markov Games), a specialized benchmark suite designed to isolate and evaluate the core ability of foundation models to reconstruct past observations and act upon them. This contribution fills a void in the evaluation of multimodal agents at the intersection of long-term memory and non-Markovian decision-making. By strictly controlling the environment, RNG-Bench allows for a precise measurement of how well models can maintain and retrieve information over extended periods, providing a new lens through which to understand the limitations of large models in complex, dynamic settings.

Deep Analysis

RNG-Bench comprises two complementary games: Match-Pair and 3D Maze. In Match-Pair, the model must recall the identity of cards that were shown briefly at specific locations in earlier steps. In 3D Maze, the agent must integrate first-person visual inputs to build and maintain an internal spatial map. Difficulty is controlled along three axes: grid size, visual pattern complexity, and observation modality. This multi-dimensional control supports systematic investigation of which factors most affect model performance. The suite also uses a head-to-head protocol to control instance-level variance, so that score differences are less likely to reflect the luck of a particular instance than a real difference between models.
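To make the Match-Pair setup concrete, here is a minimal toy sketch of a memory-matching game in which card identities are visible only on the step they are flipped. It is not the RNG-Bench implementation: the class `MatchPairEnv`, the two policies, and the rule of flipping two cards per step are all assumptions made for illustration, and the real benchmark presents cards as images rather than integers.

```python
import random
from typing import Dict, List, Optional, Tuple


class MatchPairEnv:
    """Hypothetical toy game: identities are revealed only by flip()."""

    def __init__(self, rows: int, cols: int, seed: int = 0) -> None:
        n = rows * cols
        if n % 2:
            raise ValueError("grid must hold an even number of cards")
        cards = [i // 2 for i in range(n)]  # each identity appears twice
        random.Random(seed).shuffle(cards)
        self._cards = cards                 # hidden state
        self.matched = [False] * n
        self.steps = 0

    def unmatched(self) -> List[int]:
        return [p for p, m in enumerate(self.matched) if not m]

    def flip(self, a: int, b: int) -> Tuple[int, int, bool]:
        """Reveal two cards for this step only; matched pairs are removed."""
        if a == b or self.matched[a] or self.matched[b]:
            raise ValueError("pick two distinct, unmatched positions")
        self.steps += 1
        hit = self._cards[a] == self._cards[b]
        if hit:
            self.matched[a] = self.matched[b] = True
        return self._cards[a], self._cards[b], hit

    def done(self) -> bool:
        return all(self.matched)


def _known_pair(env: MatchPairEnv, seen: Dict[int, int]) -> Optional[Tuple[int, int]]:
    """Return two unmatched positions already known to share an identity."""
    by_identity: Dict[int, int] = {}
    for pos in env.unmatched():
        if pos in seen:
            if seen[pos] in by_identity:
                return by_identity[seen[pos]], pos
            by_identity[seen[pos]] = pos
    return None


def play_with_recall(env: MatchPairEnv) -> int:
    """Policy that remembers every identity it has seen (position -> identity)."""
    seen: Dict[int, int] = {}
    while not env.done():
        pair = _known_pair(env, seen)
        if pair is None:
            # Explore: prefer positions never seen before.
            open_pos = env.unmatched()
            unknown = [p for p in open_pos if p not in seen]
            known = [p for p in open_pos if p in seen]
            first, second = (unknown + known)[:2]
            pair = (first, second)
        a, b = pair
        seen[a], seen[b], _ = env.flip(a, b)
    return env.steps


def play_without_recall(env: MatchPairEnv, seed: int = 0) -> int:
    """Memoryless policy: flips two random unmatched cards each step."""
    rng = random.Random(seed)
    while not env.done():
        a, b = rng.sample(env.unmatched(), 2)
        env.flip(a, b)
    return env.steps


if __name__ == "__main__":
    for seed in range(3):
        recall_steps = play_with_recall(MatchPairEnv(4, 4, seed))
        blind_steps = play_without_recall(MatchPairEnv(4, 4, seed), seed)
        print(f"seed={seed} with_recall={recall_steps} without_recall={blind_steps}")
```

Running both policies on boards built from the same seed is one simple way to remove instance-level variance from a comparison; whether the paper's head-to-head protocol works this way is an assumption here. On an n-card board the recall policy needs at most n steps, whereas the memoryless policy has no such bound.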

A pivotal innovation in this study is the introduction of the "memory gap" metric. This metric effectively disentangles errors caused by the forgetting of early observations from those resulting from suboptimal decision-making logic. By isolating these failure modes, researchers can diagnose the root causes of model failures with greater granularity. The experimental setup pushes models to their limits, with the most difficult configurations requiring the processing of approximately 128K token contexts and up to 350 images within a single episode. This scale tests the upper bounds of current multimodal architectures, revealing significant room for improvement even among state-of-the-art systems.
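The summary above does not give the formula for the memory gap, so the following is only one plausible reading, stated as an assumption: score each instance twice, once with the full history restated to the model and once under the standard setting where past observations are hidden, and report the mean paired difference. The function name `memory_gap` and both parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def memory_gap(full_history_scores: Sequence[float],
               standard_scores: Sequence[float]) -> float:
    """Mean per-instance score drop when past observations are hidden.

    Assumed definition (not confirmed by the source): both sequences are
    aligned by instance, so index i refers to the same game in both runs.
    """
    if len(full_history_scores) != len(standard_scores):
        raise ValueError("score lists must be aligned by instance")
    if not full_history_scores:
        raise ValueError("at least one instance is required")
    return mean(f - s for f, s in zip(full_history_scores, standard_scores))


if __name__ == "__main__":
    # Made-up scores, purely to show the call shape.
    print(memory_gap([1.0, 0.8, 0.9], [0.5, 0.4, 0.9]))
```

Under this reading, a large gap means the model chooses well when the information is in front of it and fails mainly because it has lost earlier observations; a gap near zero with low scores in both runs would instead point to decision-making errors.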

Industry Impact

The findings from RNG-Bench challenge prevailing assumptions about where large models fail in complex tasks. Analysis of the memory gap indicates that the primary source of errors in leading MLLMs is not faulty reasoning or planning but the inability to retain and retrieve early observations. This shifts the development focus from improving decision-making alone to improving long-term memory mechanisms and spatial reasoning. For the industry, it suggests that a key bottleneck in deploying robust multimodal agents is their ability to maintain context over time, a critical requirement for applications such as robotics, autonomous driving, and interactive virtual assistants.

The study also demonstrates a practical pathway for improvement. By fine-tuning the Qwen3.5-9B model on optimal policy trajectories and filtered model demonstrations, researchers achieved significant performance gains on RNG-Bench without degrading the model's general multimodal capabilities. This suggests that targeted training on memory-intensive tasks can enhance specific competencies without causing catastrophic forgetting or performance drops in other areas. This finding offers a viable strategy for open-source communities and industrial developers looking to upgrade existing models for more demanding, long-horizon tasks.

Outlook

The introduction of RNG-Bench provides a rigorous framework for evaluating and enhancing the long-term memory of multimodal agents. As the demand for intelligent systems capable of operating in complex, real-world environments grows, the ability to handle non-Markovian challenges will become a key differentiator. The benchmark's design encourages the community to focus on the specific mechanisms of memory retention and retrieval, rather than treating them as secondary concerns. Future research is likely to build upon these findings, exploring new architectures and training methods that explicitly address the memory gap identified in this study.

Moreover, the success of fine-tuning Qwen3.5-9B indicates that existing foundation models can be adapted to these tasks with relatively modest interventions. This lowers the barrier to entry for smaller research teams and companies aiming to develop specialized agents. If RNG-Bench gains traction, it could spur work on memory-augmented architectures and spatial reasoning modules. The ultimate goal is multimodal agents that can reliably navigate and operate in environments where the past is not immediately visible, paving the way for more autonomous and capable AI systems in production.
