ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🤖🧠 Agentic Entropy-Balanced Policy Optimization (AEPO): Balancing Exploration and Stability in Reinforcement Learning for Web Agents

🗓️ 17 Oct 2025
📚 AI News & Trends

AEPO (Agentic Entropy-Balanced Policy Optimization) represents a major advancement in the evolution of Agentic Reinforcement Learning (RL). As large language models (LLMs) increasingly act as autonomous web agents – searching, reasoning and interacting with tools – the need for balanced exploration and stability has become crucial. Traditional RL methods often rely heavily on entropy to ...

#AgenticRL #ReinforcementLearning #LLMs #WebAgents #EntropyBalanced #PolicyOptimization
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

📝 Summary:
SRPO is a VLA-RL framework that eliminates the need for expert demonstrations. It assigns progress-wise rewards to failed trajectories using latent world representations and the model's own successes, achieving 99.2% success on LIBERO, a significant improvement.
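
💡 Quick sketch (not from the paper): one way to read "progress-wise rewards from latent world representations" is to score each step of a failed trajectory by how close its latent state gets to states from the model's own successful rollouts. The cosine similarity and difference-based shaping below are illustrative assumptions, not SRPO's exact reward.

```python
import numpy as np

def progress_rewards(failed_latents, success_latents):
    """Score each step of a failed trajectory by how close its latent state
    gets to latent states along the model's own successful rollouts.
    Both inputs: (T, d) arrays of latent world representations."""
    f = failed_latents / np.linalg.norm(failed_latents, axis=1, keepdims=True)
    s = success_latents / np.linalg.norm(success_latents, axis=1, keepdims=True)
    sim = f @ s.T                       # cosine similarity, (T_fail, T_succ)
    progress = sim.max(axis=1)          # best match per failed step
    # Reward the increase in progress so the agent is credited for moving forward
    return np.diff(progress, prepend=progress[0])

# Toy usage with random latents
rng = np.random.default_rng(0)
rewards = progress_rewards(rng.normal(size=(8, 16)), rng.normal(size=(12, 16)))
print(rewards.shape)  # (8,)
```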

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15605
• PDF: https://arxiv.org/pdf/2511.15605

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #VLAModels #PolicyOptimization #AIResearch #MachineLearning
Soft Adaptive Policy Optimization

📝 Summary:
SAPO improves RL training stability for LLMs. It uses a smooth adaptive gate to attenuate off-policy updates, unlike hard clipping. This selectively down-weights problematic tokens, leading to improved training stability and higher performance.
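
💡 Quick sketch (not from the paper): the contrast between hard clipping and a smooth gate is easy to see on token-level importance ratios. The sigmoid gate and its temperature below are illustrative assumptions, not SAPO's exact gating function.

```python
import torch

def hard_clip_objective(ratio, adv, eps=0.2):
    # PPO-style hard clipping of the importance ratio
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

def soft_gate_objective(ratio, adv, eps=0.2, temp=10.0):
    # Smooth adaptive gate: down-weight tokens whose ratio drifts off-policy
    # instead of zeroing their gradient at a hard clip boundary.
    gate = torch.sigmoid(temp * (eps - (ratio - 1.0).abs()))
    return gate * ratio * adv

ratio = torch.tensor([0.7, 1.0, 1.3, 2.0])   # token-level importance ratios
adv = torch.ones_like(ratio)                 # dummy advantages
print(hard_clip_objective(ratio, adv))
print(soft_gate_objective(ratio, adv))
```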

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20347
• PDF: https://arxiv.org/pdf/2511.20347

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #LLMs #PolicyOptimization #DeepLearning #AI
Agentic Policy Optimization via Instruction-Policy Co-Evolution

📝 Summary:
INSPO introduces a framework that dynamically optimizes instructions within the reinforcement learning loop for autonomous agents. It substantially outperforms static-instruction methods in multi-turn reasoning by discovering innovative, strategic reasoning paths.
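
💡 Quick sketch (not from the paper): instruction-policy co-evolution can be pictured as keeping a pool of instructions, scoring each by the return of the episodes it produces, and refreshing the pool inside the RL loop. The rollout and mutation functions below are placeholders, not INSPO's implementation.

```python
import random

def co_evolve(instructions, rollout, mutate, rounds=10, keep=4):
    """instructions: list of prompt strings
    rollout(instr) -> episode return (policy updates would also happen here)
    mutate(instr)  -> a rewritten variant of the instruction"""
    for _ in range(rounds):
        # Score each instruction by the return of an episode run with it
        ranked = sorted(instructions, key=rollout, reverse=True)
        survivors = ranked[:keep]
        # Refill the pool with mutated variants of the best instructions
        instructions = survivors + [mutate(random.choice(survivors))
                                    for _ in range(len(ranked) - keep)]
    return instructions

# Toy usage: longer instructions "score" higher, mutation appends a hint
pool = ["think step by step", "answer directly", "plan then act",
        "verify each step", "be brief"]
print(co_evolve(pool, rollout=len, mutate=lambda s: s + ", carefully")[0])
```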

🔹 Publication Date: Published on Dec 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.01945
• PDF: https://arxiv.org/pdf/2512.01945
• Github: https://github.com/cambridgeltl/inspo

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #AIAgents #PolicyOptimization #MachineLearning #AI
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

📝 Summary:
This paper decomposes LLM policies into internal layer and modular policies, revealing distinct reasoning patterns across layers. It finds that early layers explore while top layers refine. Motivated by this, Bottom-up Policy Optimization (BuPO) is proposed to optimize internal layer policies for superior...
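
💡 Quick sketch (not from the paper): one way to expose "internal layer policies" is to project intermediate hidden states through the unembedding matrix (logit-lens style), giving a per-layer next-token distribution. BuPO's actual objective over these policies is not reproduced here; the GPT-2 model below is only for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the policy model purely for illustration
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Internal policies hide inside the layers", return_tensors="pt")
out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight        # (vocab, d), tied unembedding
for i, h in enumerate(out.hidden_states):             # embeddings + every layer
    h = model.transformer.ln_f(h)                     # final norm, logit-lens style
    logits = h[:, -1, :] @ unembed.T                  # per-layer next-token logits
    probs = torch.softmax(logits, dim=-1)
    print(f"layer {i:2d}: top next token = {tok.decode([probs.argmax(-1).item()])!r}")
```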

🔹 Publication Date: Published on Dec 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19673
• PDF: https://arxiv.org/pdf/2512.19673

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#LLM #PolicyOptimization #DeepLearning #AIResearch #NLP
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

📝 Summary:
GRPO in multi-reward RL suffers from reward normalization collapse, hindering training. GDPO resolves this by decoupling individual reward normalization, improving stability and accuracy. GDPO consistently outperforms GRPO across various reasoning tasks.
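
💡 Quick sketch (not from the paper): the normalization issue shows up in a toy example. GRPO-style normalization standardizes the pooled reward across a rollout group, so one large-scale reward can dominate; decoupling normalizes each reward over the group before combining. The shapes and the sum aggregation are assumptions for illustration.

```python
import numpy as np

def pooled_advantage(rewards):
    """GRPO-style: sum reward components, then normalize over the rollout group.
    rewards: (group_size, n_rewards)."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantage(rewards):
    """GDPO-style: normalize each reward component over the group, then combine."""
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)

# Toy group of 4 rollouts with 2 rewards; the first reward's scale dwarfs the second
r = np.array([[10.0, 0.9], [12.0, 0.1], [11.0, 0.8], [13.0, 0.2]])
print(pooled_advantage(r))     # dominated by the large-scale reward
print(decoupled_advantage(r))  # both rewards contribute
```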

🔹 Publication Date: Published on Jan 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.05242
• PDF: https://arxiv.org/pdf/2601.05242
• Project Page: https://nvlabs.github.io/GDPO/
• Github: https://github.com/NVlabs/GDPO

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #MultiRewardRL #PolicyOptimization #MachineLearning #AI
AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

📝 Summary:
AT^2PO is a framework for multi-turn agentic reinforcement learning. It uses a turn-level tree search with entropy-guided expansion and turn-wise credit assignment. This improves exploration, reward propagation, and policy optimization, achieving state-of-the-art results.
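
💡 Quick sketch (not from the paper): a turn-level tree search can be pictured as branching a dialogue at the turn whose candidate-action distribution has the highest entropy, then assigning credit per turn after rollouts. The node structure, proposer, and expansion rule below are illustrative assumptions.

```python
import math, random

class TurnNode:
    def __init__(self, history):
        self.history = history      # turns taken so far
        self.children = []
        self.credit = 0.0           # turn-wise credit, filled after rollouts

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def expand_most_uncertain(frontier, propose, k=2):
    """Branch the frontier node whose candidate-action distribution is most uncertain."""
    node, probs, actions = max(((n, *propose(n.history)) for n in frontier),
                               key=lambda t: entropy(t[1]))
    node.children = [TurnNode(node.history + [a]) for a in random.sample(actions, k)]
    return node.children

# Toy proposer: uniform (high entropy) early, peaked (low entropy) later
def propose(history):
    actions = ["search", "click", "answer"]
    probs = [1 / 3] * 3 if len(history) < 2 else [0.9, 0.05, 0.05]
    return probs, actions

root = TurnNode([])
turn1 = expand_most_uncertain([root], propose)   # branch the first turn
turn2 = expand_most_uncertain(turn1, propose)    # branch the most uncertain second turn
print(len(root.children), len(turn2))            # 2 2
```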

🔹 Publication Date: Published on Jan 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.04767
• PDF: https://arxiv.org/pdf/2601.04767
• Github: https://github.com/zzfoutofspace/ATPO

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #AgenticAI #TreeSearch #PolicyOptimization #ArtificialIntelligence