Work done by Weiqi Wang, Xin Liu
🔍 TL;DR
Problem
- RL is now the standard way to boost LLM reasoning (GRPO, DAPO, etc.). But current pipelines are data-inefficient:
  - They sample queries uniformly from a static dataset.
  - They never adapt the dataset as the model improves → the data becomes off-policy and misaligned with the model’s current ability.
Idea: HeaPA. Turn a static dataset into a living curriculum by:
- Dual-heap query pool (sketched in code after this list)
  - Maintains a hard heap and an easy heap based on reward (a difficulty proxy).
  - Always samples from the boundary region → medium-difficulty problems where the model sometimes succeeds and sometimes fails.
- On-policy query augmentation (sketched in code after this list)
  - Uses the current policy to generate new math problems by tweaking numbers in existing ones.
  - These new problems are calibrated to the model’s current capability.
- Teacher-guided verification
  - A stronger “teacher” model checks whether augmented problems are solvable and provides ground-truth answers before they enter the pool.
- Reward propagation via a lineage graph
  - Parents and children share information: rewards flow along the augmentation tree to keep difficulty estimates fresh.
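To make the pool mechanics concrete, here is a minimal Python sketch of a dual-heap query pool with boundary sampling and lineage-based reward propagation. Everything in it is an illustrative assumption rather than the actual implementation: the class name `DualHeapPool`, the 0.5 reward threshold, the version-based lazy invalidation, and the blending rule that pushes a child's reward back to its parent.

```python
import heapq
import random
from collections import defaultdict


class DualHeapPool:
    """Toy dual-heap query pool (names, thresholds, and the propagation
    rule are illustrative assumptions, not the paper's implementation)."""

    def __init__(self, threshold=0.5, propagation_weight=0.3):
        self.threshold = threshold        # reward level separating "hard" from "easy"
        self.w = propagation_weight       # how strongly a child's reward nudges its parent
        self.hard = []                    # max-heap on reward: entries are (-reward, version, qid)
        self.easy = []                    # min-heap on reward: entries are (reward, version, qid)
        self.reward = {}                  # qid -> latest difficulty estimate (mean rollout reward)
        self.parent = {}                  # lineage graph: augmented child qid -> parent qid
        self.version = defaultdict(int)   # for lazy invalidation of stale heap entries

    def add(self, qid, reward, parent=None):
        """Register an original or augmented query with its current reward estimate."""
        self.reward[qid] = reward
        if parent is not None:
            self.parent[qid] = parent
        self._push(qid)

    def _push(self, qid):
        self.version[qid] += 1
        r = self.reward[qid]
        if r < self.threshold:
            # low reward = hard; the heap root is the hard query closest to the boundary
            heapq.heappush(self.hard, (-r, self.version[qid], qid))
        else:
            # high reward = easy; the heap root is the easy query closest to the boundary
            heapq.heappush(self.easy, (r, self.version[qid], qid))

    def sample_batch(self, k):
        """Pop k queries from the boundary region, alternating between the two heaps.
        Entries with an outdated version number are stale and silently skipped."""
        batch = []
        while len(batch) < k and (self.hard or self.easy):
            take_hard = self.hard and (not self.easy or random.random() < 0.5)
            _, ver, qid = heapq.heappop(self.hard if take_hard else self.easy)
            if ver == self.version[qid]:
                batch.append(qid)
        return batch

    def update(self, qid, new_reward):
        """Record the reward from the latest rollouts, re-insert the query, and
        propagate the signal along the lineage so the parent's estimate stays fresh."""
        self.reward[qid] = new_reward
        self._push(qid)
        parent = self.parent.get(qid)
        if parent is not None:
            self.reward[parent] = (1 - self.w) * self.reward[parent] + self.w * new_reward
            self._push(parent)


# Typical training-loop step: sample from the boundary, run rollouts, feed rewards back.
pool = DualHeapPool()
pool.add("q1", reward=0.2)                     # a hard seed problem
pool.add("q1-v1", reward=0.2, parent="q1")     # augmented variant starts from its parent's estimate
for qid in pool.sample_batch(2):
    mean_reward = 0.4                          # placeholder for the mean GRPO/DAPO rollout reward
    pool.update(qid, mean_reward)
```

Keeping two heaps keyed toward the threshold leaves the medium-difficulty frontier sitting at the heap roots, so each boundary sample costs O(log n) regardless of pool size.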
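The augmentation and verification bullets can likewise be read as one small pipeline: the current policy proposes a numeric variant of an existing problem, and a stronger teacher model must solve it consistently before the variant is admitted. The sketch below assumes generic `policy_generate` and `teacher_solve` callables and a simple exact-match agreement check; the actual prompts and verification rule may differ.

```python
# Hypothetical augmentation + verification step. `policy_generate` and `teacher_solve`
# stand in for whatever inference calls the pipeline uses; the prompt text and the
# agreement check are assumptions, not the paper's exact procedure.

AUGMENT_PROMPT = (
    "Rewrite the following math problem by changing only its numeric values, "
    "keeping its structure and difficulty. Output just the new problem.\n\n{problem}"
)


def augment_and_verify(problem, policy_generate, teacher_solve, n_teacher_runs=2):
    """Ask the current policy for a numeric variant of `problem`, then require a
    stronger teacher model to solve it consistently. Returns (new_problem, answer)
    if the variant is verified, or None if it should be discarded."""
    candidate = policy_generate(AUGMENT_PROMPT.format(problem=problem))

    # Teacher-guided verification: the teacher solves the candidate several times;
    # keep it only if every attempt produces the same final answer.
    answers = [teacher_solve(candidate) for _ in range(n_teacher_runs)]
    if any(a is None for a in answers) or len(set(answers)) != 1:
        return None

    return candidate, answers[0]   # the verified answer is stored as ground truth with the query
```

A variant that passes verification would then enter the pool as a child of its source problem, adding an edge to the lineage graph; one natural choice is to initialize its difficulty estimate from the parent's reward.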
Results (Qwen2.5-7B, math RL)
- Across both datasets (DAPO-Math & OpenR1-Math) and both algorithms (GRPO & DAPO):
  - +3–5 points average accuracy over the baseline.
- On competition benchmarks (HeaPA vs. baseline):
  - AIME24 (GRPO, DAPO-Math): 21.4% vs. 17.3%
  - AMC23 (GRPO, DAPO-Math): 82.4% vs. 75.7%
- Compute efficiency:
  - Up to 12–16% less compute to reach the same performance as the original uniform sampler.
  - Additional wall-clock overhead from HeaPA is only ~1–3%, and <1% for larger models.
Plug-and-play
Code and paper coming soon!
1. Motivation: Why data efficiency matters in RL for reasoning
LLMs handle many NLP tasks well, but math reasoning is still hard: