Sunday, July 27, 2025

GSPO: A Fix for GRPO and RL on Sequence Models

New paper released by the Alibaba AI team working on Qwen:
Group Sequence Policy Optimization

It appears that they have fixed the problems GRPO (Group Relative Policy Optimization) runs into when reinforcement learning is applied to sequence models.
I could not understand more than half of the paper and had to rely on what others are saying about it, but this looks like really good news for LMs trained using RL. Try your hand at understanding the paper and let me know what you find!



