What is S-Agent and what problem does it solve?

S-Agent is an agent paradigm for continuous multi-view images and videos. It addresses a limitation of vision-language models (VLMs), which treat each image as an isolated, static observation, by redefining spatial reasoning as a process of spatiotemporal evidence accumulation. This enables a coherent understanding of dynamic scenes.
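To make the contrast with per-frame prediction concrete, here is a minimal sketch of evidence accumulation across frames. It is an illustration under stated assumptions, not S-Agent's implementation: the names `ObjectEvidence` and `SceneMemory`, the per-frame detection format, and the use of a running mean are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectEvidence:
    """Running estimate of one object's 3D position (hypothetical structure)."""
    count: int = 0
    mean: tuple = (0.0, 0.0, 0.0)

    def update(self, xyz):
        # Incremental mean: each new observation refines the estimate
        # instead of replacing it.
        self.count += 1
        self.mean = tuple(m + (v - m) / self.count for m, v in zip(self.mean, xyz))


@dataclass
class SceneMemory:
    """Keeps per-object evidence across frames (hypothetical structure)."""
    objects: dict = field(default_factory=dict)

    def observe(self, frame_detections):
        # frame_detections maps an object label to an estimated (x, y, z).
        for label, xyz in frame_detections.items():
            self.objects.setdefault(label, ObjectEvidence()).update(xyz)


# Three frames of made-up detections; the table is visible in only one of them.
frames = [
    {"chair": (1.0, 0.0, 2.0)},
    {"chair": (1.2, 0.0, 2.2), "table": (3.0, 0.0, 1.0)},
    {"chair": (0.8, 0.0, 1.8)},
]

memory = SceneMemory()
for detections in frames:
    memory.observe(detections)
```

After the loop, the memory still holds the table even though it is absent from the last frame, and the chair's position is averaged over three observations. A stateless per-frame model would retain neither.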

Why is this technology significant?

S-Agent significantly improves the spatial reasoning of a range of VLMs without any additional training. S-Agent-8B, fine-tuned on the S-300K dataset, surpasses open-source baselines of the same size and rivals leading closed-source models such as GPT-5.4, pointing to a low-cost path to spatial intelligence.

What are the future directions?

With the S-300K dataset open-sourced, academic work is likely to turn toward large-scale training for spatial intelligence. In industry, S-Agent's hierarchical toolchain could be applied in robotics, autonomous driving, and AR, supporting lightweight yet capable applications.

S-Agent: Eliciting Reasoning Intelligence in a Continuous 3D World Through Spatial Tool Use

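This paper proposes S-Agent, a spatial tool-use agent paradigm for continuous multi-view images and videos. It targets a fundamental limitation of existing vision-language models (VLMs), which process visual observations as static, stateless, and isolated inputs. S-Agent redefines spatial reasoning as a process of spatiotemporal evidence accumulation rather than isolated frame-level prediction, shifting the paradigm from frame-centric recognition to scene-centric understanding. The method uses a VLM as a semantic planner and combines it with a hierarchical spatial toolchain and domain expert systems, which in sequence perform precise 2D object localization, lift the results into 3D geometric evidence, and aggregate high-level spatial knowledge. It also introduces scene memory and agent memory mechanisms, so that the agent can integrate and continuously update spatial evidence across frames. Experiments show that S-Agent significantly improves the spatial reasoning performance of a range of open-source and closed-source VLMs without any additional training. S-Agent-8B, obtained by supervised fine-tuning on S-300K, the large-scale trajectory dataset that S-Agent generates, not only outperforms open-source baselines of the same size on multiple benchmarks but also rivals advanced closed-source models such as GPT-5.4, demonstrating the strong general capability of the spatial tool-use paradigm.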
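The three toolchain stages named above (2D localization, 3D lifting, and spatial knowledge aggregation) can be illustrated with a small sketch. This is not the paper's toolchain: the functions `box_center`, `lift_to_3d`, and `object_distance` are hypothetical, the bounding boxes and depths are made-up inputs standing in for detector and depth-estimator outputs, and the 3D lift uses the standard pinhole camera model as a stand-in for whatever geometry tools S-Agent actually calls.

```python
import math


def box_center(box):
    """Stage 1 stand-in: reduce a 2D box (x_min, y_min, x_max, y_max) to its center pixel."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def lift_to_3d(pixel, depth, fx, fy, cx, cy):
    """Stage 2 stand-in: back-project a pixel with known depth into camera coordinates.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def object_distance(point_a, point_b):
    """Stage 3 stand-in: aggregate two 3D points into one spatial fact (Euclidean distance)."""
    return math.dist(point_a, point_b)


# Assumed camera intrinsics and made-up detections (all values are illustrative).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
cup_box, cup_depth = (300, 220, 340, 260), 2.0
lamp_box, lamp_depth = (800, 220, 840, 260), 2.0

cup_xyz = lift_to_3d(box_center(cup_box), cup_depth, fx, fy, cx, cy)
lamp_xyz = lift_to_3d(box_center(lamp_box), lamp_depth, fx, fy, cx, cy)
separation = object_distance(cup_xyz, lamp_xyz)
```

In this arrangement each stage consumes only the previous stage's output, so a planner could in principle answer a question such as "how far is the cup from the lamp" by chaining tool calls rather than guessing from pixels.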

Sources

arXiv