What is S-Agent and how does it address VLM limitations?

S-Agent is a spatial tool-use agent paradigm for continuous multi-view images and video that reframes spatial reasoning as spatiotemporal evidence accumulation rather than isolated frame-level prediction. It aims to overcome the static, stateless limitations of current VLMs in dynamic 3D worlds by shifting from frame-centric recognition to scene-centric understanding.

How does S-Agent's architecture enhance spatial reasoning?

S-Agent uses a VLM as a semantic planner that works with hierarchical spatial tools to lift 2D objects into 3D geometric evidence, which is then aggregated into higher-level spatial knowledge such as counts and measurements. Scene memory and agent memory mechanisms integrate evidence across frames. Without any additional training, this significantly improves the performance of both open-source and closed-source VLMs.
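To make "lifting 2D objects into 3D geometric evidence" concrete, here is a minimal sketch of one common way to do it: back-projecting a pixel with a known depth through a pinhole camera model and moving the result into the world frame. This is an illustration under assumptions, not S-Agent's actual tool; the function name `lift_pixel_to_world`, the intrinsics tuple, and the pose inputs are all hypothetical.

```python
def lift_pixel_to_world(u, v, depth, intrinsics, rotation, translation):
    """Back-project pixel (u, v) with metric depth into world coordinates.

    Assumptions (not from the source): a pinhole camera with intrinsics
    (fx, fy, cx, cy), and a camera-to-world pose given as a 3x3 rotation
    (nested lists) plus a translation vector.
    """
    fx, fy, cx, cy = intrinsics
    # Pinhole model: recover the point in the camera frame.
    p_cam = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # Rigid transform into the shared world frame: R @ p_cam + t.
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical example: object centre detected at pixel (420, 240), 2 m away,
# seen by a camera that sits 1 m along the world x-axis with no rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = lift_pixel_to_world(
    420, 240, 2.0,
    intrinsics=(500.0, 500.0, 320.0, 240.0),
    rotation=identity,
    translation=(1.0, 0.0, 0.0),
)
```

Expressing every detection in one world frame is what lets observations from different frames be compared and merged later, instead of being treated as unrelated per-frame predictions.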

How does S-Agent-8B perform and what are its implications?

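S-Agent-8B is obtained by supervised fine-tuning on the S-300K trajectories generated by S-Agent. Among small models it substantially outperforms the baselines, and its performance is comparable to advanced closed-source models such as GPT-5.4. This suggests that high-precision spatial intelligence could be deployed on resource-constrained edge devices for applications such as robotics and autonomous driving.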

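S-Agent: A New Paradigm for Spatial Intelligence Reasoning Based on Spatiotemporal Evidence Accumulation

This paper proposes S-Agent, a spatial tool-use agent paradigm for continuous multi-view images and video, which aims to address the static and stateless limitations of existing vision-language models (VLMs) when handling the dynamic 3D world. S-Agent reframes spatial reasoning as a process of spatiotemporal evidence accumulation rather than isolated frame-level prediction. A VLM acts as a semantic planner and works with hierarchical spatial tools that lift 2D objects into 3D geometric evidence, which is then aggregated into higher-level spatial knowledge such as counts and measurements, yielding scene-centric understanding. Scene memory and agent memory mechanisms are introduced to integrate evidence across frames. Experiments show that S-Agent, without any training, significantly improves the performance of both open-source and closed-source VLMs. In addition, S-Agent-8B, obtained by supervised fine-tuning on the S-300K trajectories generated by S-Agent, substantially outperforms baselines among small models and performs comparably to advanced closed-source models such as GPT-5.4.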
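The idea of accumulating evidence across frames and aggregating it into counts and measurements can be sketched as a small scene memory. This is only an illustrative sketch under assumptions: the class names `SceneObject` and `SceneMemory`, the 0.5 m merge radius, and the nearest-match merging rule are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    label: str
    position: tuple  # running mean of world-frame (x, y, z), in metres
    hits: int = 1    # number of frames that observed this object


@dataclass
class SceneMemory:
    merge_radius: float = 0.5  # assumed threshold, in metres
    objects: list = field(default_factory=list)

    def add(self, label, position):
        """Merge an observation into an existing object, or register a new one."""
        for obj in self.objects:
            same_place = math.dist(obj.position, position) <= self.merge_radius
            if obj.label == label and same_place:
                n = obj.hits
                # Update the running mean so repeated sightings refine the estimate.
                obj.position = tuple(
                    (p * n + q) / (n + 1) for p, q in zip(obj.position, position)
                )
                obj.hits = n + 1
                return obj
        obj = SceneObject(label, tuple(position))
        self.objects.append(obj)
        return obj

    def count(self, label):
        """Count distinct objects of a class across all frames seen so far."""
        return sum(1 for obj in self.objects if obj.label == label)

    def distance(self, a, b):
        """Measure the Euclidean distance between two remembered objects."""
        return math.dist(a.position, b.position)


memory = SceneMemory()
chair = memory.add("chair", (1.0, 0.0, 2.0))   # frame 1
memory.add("chair", (1.1, 0.0, 2.0))           # frame 2: same chair, re-observed
memory.add("chair", (3.0, 0.0, 2.0))           # frame 3: a second chair
table = memory.add("table", (1.05, 0.0, 4.0))  # frame 3: a table

chair_count = memory.count("chair")
chair_to_table = memory.distance(chair, table)
```

Because the memory persists between frames, the chair seen twice is counted once, which a stateless per-frame predictor cannot guarantee.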

Sources

arXiv