How does Gemma 3 perform on the Arabic SLM benchmark?

Gemma 3 (12B) tops the Arabic benchmark with 4.548/5 across 240 test items, outperforming 11 other small language models in zero-shot evaluation.

Why is a standardized Arabic SLM benchmark important?

Arabic has complex morphology and wide dialectal diversity. Without a unified benchmark, researchers cannot fairly compare model capabilities or track progress.

What should developers focus on next for Arabic SLMs?

The study finds that model size is not the sole determinant of Arabic performance; stronger Arabic alignment and instruction-following behavior are the key factors. Developers should prioritize Arabic data quality and cultural adaptation.

Evaluating the Arabic Capabilities of Small Language Models: Benchmarking and Performance Analysis

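This paper systematically evaluates the performance of small language models (SLMs) on Arabic, filling the gap left by the lack of a standardized benchmark in this area. The study builds an Arabic benchmark of 240 test items covering two directions (understanding and generation), eight domains, and ten language skills. Under a strict zero-shot setting, with models such as GPT-4.1 Mini acting as judges, twelve SLMs were evaluated. Gemma 3 (12B) ranks first with a score of 4.548/5, followed closely by Aya and C4AI Command Arabic. The study finds that model size is not the sole determinant of Arabic capability; stronger Arabic alignment and instruction-following behavior are the key factors. Low-performing models often exhibit prompt leakage, hallucination, and language drift. The benchmark offers an important reference for building efficient, reliable, and culturally appropriate Arabic AI systems.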
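To make the scoring scheme concrete, here is a minimal Python sketch of how per-item judge scores could be averaged into a single leaderboard figure such as 4.548/5. It is an illustration under assumptions, not the paper's actual pipeline: the 1 to 5 score range, the `rank_models` function, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rank_models(judgments):
    """Average judge scores per model and sort from best to worst.

    `judgments` is an iterable of (model_name, item_id, score) tuples.
    The 1-5 score range is an assumption; the source only reports
    results as a value out of 5.
    """
    per_model = defaultdict(list)
    for model_name, item_id, score in judgments:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for item {item_id}: {score}")
        per_model[model_name].append(score)

    # Mean score per model, rounded to three decimals like the reported figure.
    averages = {name: round(mean(scores), 3) for name, scores in per_model.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two placeholder models, two test items each.
toy_judgments = [
    ("model_a", 1, 5),
    ("model_a", 2, 4),
    ("model_b", 1, 3),
    ("model_b", 2, 4),
]

leaderboard = rank_models(toy_judgments)
for name, average in leaderboard:
    print(f"{name}: {average}/5")
```

In a real run each model would contribute one score per test item (240 in this benchmark), and scores from several judges might be combined before averaging; how the paper combines them is not stated here.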

Sources

arXiv