What is the core idea of the new reinforcement learning framework proposed in this paper?

The researchers replace the traditional scalar reward with a distribution over reward functions and apply a nonlinear objective over the action set, enabling calibrated behavioral diversity to emerge naturally without sacrificing expected reward.

Why does reward uncertainty help induce behavioral diversity?

When reward functions are ambiguous or incomplete, committing to a single action is suboptimal. By modeling that uncertainty explicitly, an agent has a rational reason to spread its behavior across diverse strategies, which avoids the brittle trade-off between randomness and performance that entropy regularization requires.
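
A toy example may make this concrete. The sketch below is not the paper's objective: the two-action problem, the two equally likely reward hypotheses, and the "best of k sampled actions" objective are all assumptions chosen for illustration. It compares a linear objective (expected reward of one sampled action) with a nonlinear objective over a set of actions.

```python
# Illustrative toy problem. The reward hypotheses and the best-of-k objective
# are assumptions made for this sketch, not definitions taken from the paper.

ACTIONS = ("A", "B")

# Reward uncertainty: two equally likely candidate reward functions.
REWARD_FUNCTIONS = (
    {"A": 1.0, "B": 0.0},  # hypothesis 1: only action A is rewarded
    {"A": 0.0, "B": 1.0},  # hypothesis 2: only action B is rewarded
)


def expected_reward(p_a: float) -> float:
    """Linear objective: reward of one sampled action, averaged over hypotheses."""
    policy = {"A": p_a, "B": 1.0 - p_a}
    total = 0.0
    for reward in REWARD_FUNCTIONS:
        total += sum(policy[a] * reward[a] for a in ACTIONS)
    return total / len(REWARD_FUNCTIONS)


def expected_best_of_k(p_a: float, k: int = 2) -> float:
    """Nonlinear objective: expected maximum reward among k i.i.d. sampled actions."""
    policy = {"A": p_a, "B": 1.0 - p_a}
    total = 0.0
    for reward in REWARD_FUNCTIONS:
        # Rewards are 0 or 1, so the maximum is 1 unless all k draws
        # land on actions that this hypothesis scores as 0.
        p_zero = sum(policy[a] for a in ACTIONS if reward[a] == 0.0)
        total += 1.0 - p_zero ** k
    return total / len(REWARD_FUNCTIONS)


if __name__ == "__main__":
    for p_a in (1.0, 0.75, 0.5):
        print(p_a, expected_reward(p_a), expected_best_of_k(p_a))
```

In this toy setting the expected reward is 0.5 for every policy, so the linear objective cannot distinguish a deterministic policy from a diverse one. The best-of-2 objective rises from 0.5 for the deterministic policy (`p_a = 1.0`) to 0.75 for the uniform policy (`p_a = 0.5`), so diversity is favored at no cost in expected reward.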

What are the implications for real-world applications?

The approach requires no complex heuristic reward engineering, which makes it a natural fit for settings such as RLHF for large language models and automated scientific discovery. It may also prove useful for multimodal generation tasks.

Inducing Diverse Behavior in Reinforcement Learning via Reward Uncertainty

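Traditional reinforcement learning typically seeks a deterministic policy that maximizes the expectation of a scalar reward, but in modern applications such as language model fine-tuning or scientific discovery, behavioral diversity is essential. Existing methods such as entropy regularization often require a brittle trade-off between randomness and performance and may sacrifice expected reward. This paper proposes a fundamental reformulation of the reinforcement learning objective, replacing the scalar reward with a distribution over reward functions and applying a nonlinear objective over action sets. Under this framework, calibrated behavioral diversity emerges naturally without sacrificing expected reward. By deriving principled gradient estimators in the contextual bandit setting, the paper shows that the method naturally generalizes the traditional policy gradient. Experiments indicate that the framework offers a robust and theoretically grounded alternative for complex reinforcement learning tasks where traditional methods fail, successfully inducing the desired broad range of agent behaviors.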
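
For reference, the traditional policy gradient that the framework is said to generalize can be sketched for a bandit with a softmax policy. This shows only the classical score-function (REINFORCE) estimator, not the paper's estimator. The function names, the single fixed context, and the reward values are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def score_function_sample(logits, rewards, rng):
    """One-sample REINFORCE estimate of the gradient of E[r] w.r.t. the logits."""
    probs = softmax(logits)
    a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # For a softmax policy, d log pi(a) / d logit_j = 1[a == j] - pi_j.
    return [rewards[a] * ((1.0 if j == a else 0.0) - probs[j])
            for j in range(len(probs))]


def exact_gradient(logits, rewards):
    """Closed-form gradient of E[r]: pi_j * (r_j - sum_i pi_i * r_i)."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, -0.1, 0.5]   # hypothetical policy parameters for one context
    rewards = [1.0, 0.0, 0.5]   # hypothetical scalar reward per action
    n = 20000
    avg = [0.0] * len(logits)
    for _ in range(n):
        g = score_function_sample(logits, rewards, rng)
        avg = [x + y / n for x, y in zip(avg, g)]
    print("Monte Carlo:", avg)
    print("Exact:      ", exact_gradient(logits, rewards))
```

In a full contextual bandit the logits would be produced by a model conditioned on the context. A single context is used here so that the Monte Carlo average can be compared directly with the closed-form gradient.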

Sources

arXiv