What is the In-Context Reward Adaptation framework?

It is an AI alignment approach that uses the in-context learning ability of Transformers to infer a reward structure from a few preference examples at inference time, so the model adapts to new preferences without retraining.
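To make "inferring a reward structure from a few preference examples" concrete, here is a minimal sketch. It is not the framework's method: the framework performs this inference inside a Transformer's forward pass, whereas the sketch swaps in an explicit Bradley-Terry fit of a linear reward by gradient ascent. The two-feature responses, the `fit_reward` and `reward` helpers, and the step count and learning rate are all hypothetical and chosen only for illustration.

```python
import math

def fit_reward(pairs, dim, steps=500, lr=0.5):
    """Fit linear reward weights w from (chosen, rejected) feature pairs.

    Maximizes the Bradley-Terry log-likelihood
        sum log sigmoid(w . (chosen - rejected))
    by plain gradient ascent. This is an explicit stand-in for the
    in-context inference described in the text, not the paper's method.
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))  # P(chosen preferred)
            # d/dw log sigmoid(margin) = (1 - p) * diff
            for i in range(dim):
                grad[i] += (1.0 - p) * diff[i]
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

def reward(w, x):
    """Score a response's feature vector under the inferred reward."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical context: each response is (helpfulness, verbosity),
# and this user consistently prefers helpful, concise answers.
context = [
    ([0.9, 0.2], [0.4, 0.8]),
    ([0.7, 0.1], [0.6, 0.9]),
    ([0.8, 0.3], [0.3, 0.5]),
]
w = fit_reward(context, dim=2)

# Rank two new responses with the reward inferred from the context.
concise = reward(w, [0.9, 0.1])
verbose = reward(w, [0.2, 0.9])
print(concise > verbose)
```

Swapping in a different `context` yields a different reward with no change to the procedure, which is the adaptation-without-retraining property the answer above describes.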

Why does it outperform traditional RLHF models?

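Traditional RLHF relies on a static reward model that generalizes poorly to preference domains it has not seen. The framework instead infers the reward from context and takes human response times as an auxiliary input signal, which serves as a proxy for decision confidence. The paper reports that standard Transformers exhibit asymptotic bias in this setting, and that adding response times lets the model adapt to preference distributions from unseen domains, making it more robust to distribution shift.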
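As an illustration of response times entering as an auxiliary input, the sketch below packs each preference demonstration into one numeric row with the response time appended as an extra feature. The row layout, the log transform, and the names `Demo` and `encode_context` are assumptions made for this example; the source does not specify how the signal is encoded.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Demo:
    """One preference demonstration (hypothetical layout)."""
    chosen: List[float]      # features of the preferred response
    rejected: List[float]    # features of the other response
    response_time_s: float   # seconds the annotator took to decide

def encode_context(demos: List[Demo]) -> List[List[float]]:
    """Turn demonstrations into rows a sequence model could consume.

    Each row is chosen features + rejected features + log response time.
    The log transform is an assumption: it compresses the long right
    tail that response times typically have.
    """
    rows = []
    for d in demos:
        if d.response_time_s <= 0:
            raise ValueError("response time must be positive")
        rows.append(list(d.chosen) + list(d.rejected) + [math.log(d.response_time_s)])
    return rows

demos = [
    Demo(chosen=[0.9, 0.2], rejected=[0.4, 0.8], response_time_s=0.8),  # quick decision
    Demo(chosen=[0.6, 0.5], rejected=[0.5, 0.5], response_time_s=4.2),  # slow decision
]
rows = encode_context(demos)
print(len(rows), len(rows[0]))
```

Without the last column, both rows would carry the same kind of information: which response won. With it, the model can also see that the second choice took much longer, which is the sort of confidence cue the answer above refers to.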

What are the next steps for research and application?

Future work will integrate other behavioral signals, such as emotional feedback and interaction frequency. Because adaptation happens in context rather than through retraining, the approach could give industry and open-source communities plug-and-play preference adaptation and lower the cost of alignment.

Robust Preference Modeling via In-Context Reward Adaptation: Addressing Heterogeneity in Human Values

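To address the problem that static reward models in traditional reinforcement learning from human feedback (RLHF) generalize poorly to unseen preference domains, this paper proposes the In-Context Reward Adaptation framework. The method uses the in-context learning ability of Transformers to infer the latent reward structure on the fly from a small number of preference demonstrations, and thereby adapts dynamically to heterogeneous human values. The study shows that standard Transformers exhibit asymptotic bias, but that once human response times are introduced as an auxiliary input signal, the model adapts effectively to preference distributions from unseen domains. Experiments confirm that the framework provides a more robust foundation for preference modeling, supports heterogeneous reward representations and distribution shift, and offers a scalable path toward flexible human-AI alignment.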

Sources

arXiv