What is operator-level visual token skipping?

It decomposes each Transformer layer into its attention and feed-forward network (FFN) operators, then selectively skips redundant visual-token computation per layer and per operator according to how much each one actually contributes, while keeping the full visual token sequence intact.
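The idea can be sketched on a toy single-head layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `skip_attn` and `skip_ffn` flags, and the omission of layer normalization, causal masking, and multiple heads are all simplifications introduced here. Skipped visual tokens keep their previous hidden state but remain available as keys and values, so the sequence length never changes.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(rng, d):
    # Hypothetical toy weights; a real model would load trained parameters.
    s = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(scale=s, size=(d, d)),
        "Wk": rng.normal(scale=s, size=(d, d)),
        "Wv": rng.normal(scale=s, size=(d, d)),
        "W1": rng.normal(scale=s, size=(d, 4 * d)),
        "W2": rng.normal(scale=s, size=(4 * d, d)),
    }

def attention(h, rows, p):
    """Single-head attention where only `rows` act as queries.
    Every token, visual or text, still serves as a key and a value."""
    q = h[rows] @ p["Wq"]
    k = h @ p["Wk"]
    v = h @ p["Wv"]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def ffn(x, p):
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]

def layer_forward(h, is_visual, p, skip_attn=False, skip_ffn=False):
    """One residual layer. The skip flags apply to visual tokens only;
    text tokens always receive both operator updates."""
    out = h.copy()
    all_rows = np.arange(h.shape[0])
    text_rows = all_rows[~is_visual]

    rows = text_rows if skip_attn else all_rows
    out[rows] = out[rows] + attention(out, rows, p)

    rows = text_rows if skip_ffn else all_rows
    out[rows] = out[rows] + ffn(out[rows], p)
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))                       # 4 visual + 2 text tokens
is_visual = np.array([True, True, True, True, False, False])
p = init_params(rng, 8)

full = layer_forward(h, is_visual, p)
skipped = layer_forward(h, is_visual, p, skip_attn=True, skip_ffn=True)
print(skipped.shape == full.shape)                # sequence length is preserved
print(np.allclose(skipped[is_visual], h[is_visual]))  # visual states pass through
```

In this toy, skipping the visual updates in one layer leaves that layer's text outputs unchanged, because text queries read the same keys and values either way; any effect on the answer arises only through later layers that would have consumed the updated visual states.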

Why does this matter? How does it differ from existing acceleration methods?

Existing methods either remove visual tokens or skip entire layers, which risks losing fine-grained evidence or disabling operators that are still useful. This approach instead identifies 'answer-silent' redundancy: late-stage visual token updates can be numerically large yet have minimal effect on the answer-token representations, so they can be skipped at a finer, per-operator granularity.
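One way to make the distinction between update size and answer impact concrete is a small ablation probe. The summary does not give the paper's actual metric, so everything below is a hypothetical sketch: `toy_layer`, `run`, and `probe` are illustrative names, the weights are random, and the two relative-norm measures are assumptions chosen for simplicity. For each layer, the probe records how large the visual update is and how far the final answer-token state moves when that update alone is silenced.

```python
import numpy as np

def toy_layer(h, W):
    """Toy residual layer: softmax token mixing followed by a tanh projection."""
    scores = h @ h.T / np.sqrt(h.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return h + np.tanh(attn @ h @ W)

def run(h, weights, is_visual, silence_at=None):
    """Forward pass; optionally drop the visual-token update at one layer."""
    for i, W in enumerate(weights):
        new_h = toy_layer(h, W)
        if i == silence_at:
            new_h[is_visual] = h[is_visual]   # visual tokens keep their old state
        h = new_h
    return h

def probe(h0, weights, is_visual, answer_idx):
    """Per layer: (layer index, relative visual update size, relative answer shift)."""
    base = run(h0, weights, is_visual)
    report, h = [], h0
    for i, W in enumerate(weights):
        new_h = toy_layer(h, W)
        update = (np.linalg.norm(new_h[is_visual] - h[is_visual])
                  / np.linalg.norm(h[is_visual]))
        ablated = run(h0, weights, is_visual, silence_at=i)
        shift = (np.linalg.norm(ablated[answer_idx] - base[answer_idx])
                 / np.linalg.norm(base[answer_idx]))
        report.append((i, update, shift))
        h = new_h
    return report

rng = np.random.default_rng(0)
d, n_layers = 8, 4
h0 = rng.normal(size=(6, d))                      # 4 visual tokens, then 2 text tokens
is_visual = np.array([True, True, True, True, False, False])
weights = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]

for layer, update, shift in probe(h0, weights, is_visual, answer_idx=5):
    print(f"layer {layer}: visual update {update:.3f}, answer shift {shift:.3f}")
```

With random weights this only demonstrates the mechanics. The one guaranteed pattern is structural: the last layer's visual update can be large while its answer shift is exactly zero, since no later layer reads it. Whether earlier late-stage layers behave similarly in a trained model is the empirical claim the paper makes.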

What were the results and what comes next?

On Qwen3-VL, the method reduces TFLOPs by 33.7% while retaining 99.5% of the original performance. It requires no retraining, which supports efficient deployment on resource-constrained devices and may open paths toward applications such as real-time video analysis and autonomous driving.

Attend, Transform, or Stay Silent: Operator-Level Visual Skipping for Efficient Multimodal Large Model Inference

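Multimodal large language models face heavy inference compute costs when processing long visual sequences. Existing acceleration methods typically use coarse-grained strategies, such as removing visual tokens outright or skipping updates for an entire layer, which can lose fine-grained evidence or disable useful operators along with redundant ones. Starting from an answer-observability perspective, this paper finds that late-stage visual token updates, although numerically large, have very little effect on the answer-token representations, revealing an "answer-silent" redundancy. The authors therefore propose an operator-level visual token skipping framework that decomposes each Transformer layer into attention and feed-forward network (FFN) operators and selectively bypasses redundant computation according to layer and operator importance, while preserving the full visual sequence. Experiments on three multimodal architectures and ten VQA benchmarks show that, on Qwen3-VL, the method reduces TFLOPs by 33.7% while retaining 99.5% of the original performance, achieving a favorable efficiency-accuracy trade-off.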

Sources

arXiv