Flow-OPD is the first unified post-training framework that integrates On-Policy Distillation (OPD) directly into Flow Matching models for text-to-image generation.

It targets reward sparsity and gradient interference, mitigating the seesaw effect between competing metrics and curbing reward hacking, with the aim of more stable training.

As AI shifts toward mass commercialization in 2026, this framework could accelerate capability commoditization and drive deeper vertical industry AI solutions.

Flow-OPD: An On-Policy Distillation Framework for Flow Matching Models

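Existing Flow Matching (FM) text-to-image models face two key bottlenecks under multi-task alignment: reward sparsity caused by scalar rewards, and gradient interference caused by jointly optimizing heterogeneous objectives. Together these produce a "seesaw effect" between metrics and widespread reward hacking. Inspired by the success of On-Policy Distillation (OPD) in large language models, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into flow matching models. Flow-OPD adopts a two-stage training strategy that uses distillation on online data to improve generation quality and training stability, offering a new approach to optimizing flow matching models under multi-task conditions.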
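To make "gradient interference" and the "seesaw effect" concrete, here is a minimal toy sketch in plain Python. It is not Flow-OPD and not taken from the paper: the two quadratic objectives, the starting point, and the learning rate are all hypothetical, chosen only so that the two gradients point in conflicting directions.

```python
import math

# Two hypothetical objectives with different optima (illustrative only):
#   loss_a is minimized at (1, 0), loss_b is minimized at (-1, 0).
def loss_a(theta):
    x, y = theta
    return (x - 1.0) ** 2 + y ** 2

def loss_b(theta):
    x, y = theta
    return (x + 1.0) ** 2 + y ** 2

def grad_a(theta):
    x, y = theta
    return (2.0 * (x - 1.0), 2.0 * y)

def grad_b(theta):
    x, y = theta
    return (2.0 * (x + 1.0), 2.0 * y)

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def joint_step(theta, lr):
    """One gradient-descent step on the summed (scalarized) objective."""
    g_a, g_b = grad_a(theta), grad_b(theta)
    return tuple(t - lr * (ga + gb) for t, ga, gb in zip(theta, g_a, g_b))

theta = (0.5, 0.2)   # assumed starting parameters
cos = cosine(grad_a(theta), grad_b(theta))
new_theta = joint_step(theta, lr=0.1)

# A negative cosine means the two objectives pull in conflicting directions.
print("gradient cosine:", cos)
# The joint step lowers loss_b but raises loss_a: a small-scale seesaw.
print("loss_a:", loss_a(theta), "->", loss_a(new_theta))
print("loss_b:", loss_b(theta), "->", loss_b(new_theta))
```

In this toy setting the summed gradient is dominated by the objective that is further from its optimum, so one metric improves while the other regresses. The abstract attributes the seesaw effect in multi-task FM alignment to this kind of conflict between heterogeneous objectives.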