✨GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
📝 Summary:
In multi-reward RL, GRPO suffers from reward-normalization collapse, which hinders training. GDPO resolves this by normalizing each reward signal independently before aggregation, improving training stability and accuracy, and it consistently outperforms GRPO across diverse reasoning tasks.
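The abstract doesn't spell out the objective, but the core contrast can be sketched: GRPO-style training sums the reward signals per rollout and normalizes the combined score over the group, while a decoupled scheme normalizes each reward over the group separately before aggregating. Below is a minimal NumPy sketch of that idea; the function names and the sum-of-normalized-scores aggregation are illustrative assumptions, not the paper's exact formulation.
```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: sum reward components per rollout, then
    normalize the combined score over the group. A large-scale reward
    dominates, and tied combined scores zero out the learning signal."""
    combined = rewards.sum(axis=1)                     # (G,)
    return (combined - combined.mean()) / (combined.std() + eps)

def decoupled_advantages(rewards, eps=1e-8):
    """Decoupled sketch (assumed aggregation): normalize each reward
    component over the group independently, then sum, so every signal
    contributes at a comparable scale regardless of its raw units."""
    mu = rewards.mean(axis=0, keepdims=True)           # (1, R)
    sigma = rewards.std(axis=0, keepdims=True)         # (1, R)
    return ((rewards - mu) / (sigma + eps)).sum(axis=1)

# Toy group of 4 rollouts scored by two rewards on very different
# scales (e.g., a 0/1 correctness check and a 0-100 format score).
group = np.array([[1.0,   0.0],
                  [0.0, 100.0],
                  [1.0,  50.0],
                  [0.0,  80.0]])
print(grpo_advantages(group))       # driven almost entirely by reward 2
print(decoupled_advantages(group))  # both reward signals contribute
```
With rewards on very different scales, the combined normalization is dominated by the large-scale signal and can collapse entirely when combined scores tie across the group; per-reward normalization keeps each signal informative, which is one plausible reading of the collapse the summary describes.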
🔹 Publication Date: Jan 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.05242
• PDF: https://arxiv.org/pdf/2601.05242
• Project Page: https://nvlabs.github.io/GDPO/
• Github: https://github.com/NVlabs/GDPO
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ReinforcementLearning #MultiRewardRL #PolicyOptimization #MachineLearning #AI