Group Sequence Policy Optimization
It appears that they have fixed GRPO and Reinforcement Learning for sequence models.
Could not understand more than half of this, and I had to rely on what others are saying, but this is really good for LMs trained using RL. Try your hand at understanding their paper and let me know what you found!
Could not understand more than half of this, and I had to rely on what others are saying, but this is really good for LMs trained using RL. Try your hand at understanding their paper and let me know what you found!
No comments:
Post a Comment