What is MoE architecture?

Mixture of Experts splits a model into expert sub-networks, activating only the most relevant few during inference to reduce compute while maintaining performance.

What does 3.6B active parameters mean?

Despite 32B total parameters, only 3.6B are used per inference, making compute costs similar to a small model with much greater capability.

What's the benefit of 1M token context?

It can process ~750K words at once, enough for entire books, large codebases, or very long conversation histories.

NVIDIA Nemotron 3 Nano：320億參數MoE模型僅激活3.6B，百萬token上下文窗口

NVIDIA發布Nemotron 3 Nano，320億參數MoE模型運行時僅激活3.6B，100萬token上下文窗口，專為高效智能體任務設計。

NVIDIA於2026年3月發布Nemotron 3 Nano，採用MoE架構總參數320億但推理時僅激活3.6億——成本降低約90%。100萬token超長上下文窗口。針對工具呼叫、結構化輸出、多步驟推理等智能體任務優化。在函數呼叫基準測試中性能接近GPT-4o，可在消費級GPU上運行。

Sources

NVIDIA Developer Blog / Stormap.ai / Tom's Hardware / The Decoder / Hugging Face