New Techniques Reduce Low‑Probability Token Influence in RL‑Trained LLMs

Low‑probability tokens dominate RL gradient updates
Reinforcement learning for large language models (LLMs) gives disproportionate weight to tokens the model predicts with low probability, because their gradients are unusually large, skewing parameter updates and hindering overall learning [1].
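
A quick way to see the mechanism: for the token-level policy-gradient loss −A·log πθ(y), the gradient with respect to the logits has norm roughly proportional to 1 − πθ(y), so rare tokens push much harder on the parameters. The minimal PyTorch sketch below is illustrative only (not the paper's code) and makes the effect concrete:

```python
import torch

# For the REINFORCE-style token loss -A * log pi(y), the logit gradient is
# A * (pi - e_y), whose norm grows as pi(y) shrinks. Compare a rare token
# with a confident one.
vocab = 8
advantage = 1.0

for target_prob in (0.02, 0.50, 0.98):
    logits = torch.zeros(vocab, requires_grad=True)
    with torch.no_grad():
        # Shape the logits so token 0 has roughly the desired probability.
        logits[0] = torch.log(torch.tensor(target_prob * (vocab - 1) / (1 - target_prob)))
    loss = -advantage * torch.log_softmax(logits, dim=-1)[0]
    loss.backward()
    print(f"pi(y) = {target_prob:.2f} -> logit-grad norm = {logits.grad.norm().item():.3f}")
```

Running this prints a gradient norm roughly fifty times larger for the 0.02-probability token than for the 0.98-probability one.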

Dominance suppresses high‑probability token learning
The inflated gradients from rare tokens drown out the smaller but essential gradients of high‑probability tokens, limiting the reasoning gains that RL training is meant to deliver [1].

Advantage Reweighting and Lopti mitigate the issue
The authors propose two methods: Advantage Reweighting, which rescales token advantages, and Low‑Probability Token Isolation (Lopti), which isolates and reduces the gradients of low‑probability tokens. Together they rebalance updates across the probability spectrum [1].
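
The paper defines both methods precisely; as a rough, hypothetical sketch (the probability-proportional weight, the alpha value, the threshold, and the two-pass ordering are all assumptions here, not the authors' exact recipe), the two ideas might look like this:

```python
import torch

def advantage_reweighting(advantages, token_probs, alpha=0.3):
    # Hypothetical sketch: scale each token's advantage by a weight that
    # grows with its probability, so low-probability tokens contribute
    # proportionally smaller gradients. The linear form and alpha value
    # are assumptions, not the paper's exact rule.
    return (alpha * token_probs + (1.0 - alpha)) * advantages

def lopti_step(compute_token_losses, token_probs, optimizer, threshold=0.5):
    # Hypothetical sketch of Low-Probability Token Isolation (Lopti):
    # update low-probability and high-probability tokens in two separate
    # optimizer passes, recomputing the forward pass each time, so the
    # former's large gradients cannot swamp the latter's.
    for mask in (token_probs < threshold, token_probs >= threshold):
        if mask.any():
            optimizer.zero_grad()
            compute_token_losses()[mask].mean().backward()
            optimizer.step()
```

Here compute_token_losses is assumed to be a closure that reruns the forward pass and returns per-token losses; keeping the two groups in distinct passes is what stops the rebalancing from collapsing into a single shared update.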

GRPO‑trained LLMs see up to a 46.2% gain on logic puzzles
Applying the two methods to models trained with Group Relative Policy Optimization (GRPO) yields substantial improvements, including as much as a 46.2% gain on the K&K (Knights & Knaves) Logic Puzzle reasoning benchmark [1].
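
For context, GRPO replaces a learned critic with group-relative advantages: each sampled response is scored against the other samples drawn for the same prompt. The sketch below follows the commonly published GRPO formulation and is not taken from this paper's code:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO: standardize each rollout's reward against its group's mean and
    # standard deviation, yielding a critic-free advantage estimate.
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # e.g. 4 rollouts, binary reward
print(grpo_advantages(rewards))               # ~ +0.87 for correct, -0.87 otherwise
```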

Code released publicly for replication
The implementation of Advantage Reweighting and Lopti is available on GitHub, enabling other researchers to reproduce and extend the findings [1].
