Work done by Weiqi Wang, Xin Liu
🔍 TL;DR
Problem
- RL is now the standard way to boost LLM reasoning (GRPO, DAPO, etc.). But current pipelines are data-inefficient:
  - They sample queries uniformly from a static dataset.
  - They never adapt the dataset as the model improves → the data becomes off-policy and misaligned with the model’s current ability.
Idea: HeaPA. Turn a static dataset into a living curriculum by:
- Dual-heap query pool (sketched in code after this list)
  - Maintains a hard heap and an easy heap based on reward (a difficulty proxy).
  - Always samples from the boundary region → medium-difficulty problems where the model sometimes succeeds and sometimes fails.
- On-policy query augmentation (sketched in code after this list)
  - Uses the current policy to generate new math problems by tweaking numbers in existing ones.
  - These new problems are calibrated to the model’s current capability.
- Teacher-guided verification
  - A stronger “teacher” model checks whether augmented problems are solvable and provides ground-truth answers before they enter the pool.
- Reward propagation via a lineage graph
  - Parents and children share information: rewards flow along the augmentation tree to keep difficulty estimates fresh.
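To make the pool mechanics concrete, here is a minimal Python sketch of a dual-heap query pool with boundary sampling and lineage-based reward propagation. Everything in it is an illustrative assumption rather than the actual implementation: the class name `DualHeapPool`, the 0.5 reward threshold, the version-based lazy invalidation, and the blending rule that pushes a child's reward back to its parent.

```python
import heapq
import random
from collections import defaultdict


class DualHeapPool:
    """Toy dual-heap query pool (names, thresholds, and the propagation
    rule are illustrative assumptions, not the paper's implementation)."""

    def __init__(self, threshold=0.5, propagation_weight=0.3):
        self.threshold = threshold        # reward level separating "hard" from "easy"
        self.w = propagation_weight       # how strongly a child's reward nudges its parent
        self.hard = []                    # max-heap on reward: entries are (-reward, version, qid)
        self.easy = []                    # min-heap on reward: entries are (reward, version, qid)
        self.reward = {}                  # qid -> latest difficulty estimate (mean rollout reward)
        self.parent = {}                  # lineage graph: augmented child qid -> parent qid
        self.version = defaultdict(int)   # for lazy invalidation of stale heap entries

    def add(self, qid, reward, parent=None):
        """Register an original or augmented query with its current reward estimate."""
        self.reward[qid] = reward
        if parent is not None:
            self.parent[qid] = parent
        self._push(qid)

    def _push(self, qid):
        self.version[qid] += 1
        r = self.reward[qid]
        if r < self.threshold:
            # low reward = hard; the heap root is the hard query closest to the boundary
            heapq.heappush(self.hard, (-r, self.version[qid], qid))
        else:
            # high reward = easy; the heap root is the easy query closest to the boundary
            heapq.heappush(self.easy, (r, self.version[qid], qid))

    def sample_batch(self, k):
        """Pop k queries from the boundary region, alternating between the two heaps.
        Entries with an outdated version number are stale and silently skipped."""
        batch = []
        while len(batch) < k and (self.hard or self.easy):
            take_hard = self.hard and (not self.easy or random.random() < 0.5)
            _, ver, qid = heapq.heappop(self.hard if take_hard else self.easy)
            if ver == self.version[qid]:
                batch.append(qid)
        return batch

    def update(self, qid, new_reward):
        """Record the reward from the latest rollouts, re-insert the query, and
        propagate the signal along the lineage so the parent's estimate stays fresh."""
        self.reward[qid] = new_reward
        self._push(qid)
        parent = self.parent.get(qid)
        if parent is not None:
            self.reward[parent] = (1 - self.w) * self.reward[parent] + self.w * new_reward
            self._push(parent)


# Typical training-loop step: sample from the boundary, run rollouts, feed rewards back.
pool = DualHeapPool()
pool.add("q1", reward=0.2)                     # a hard seed problem
pool.add("q1-v1", reward=0.2, parent="q1")     # augmented variant starts from its parent's estimate
for qid in pool.sample_batch(2):
    mean_reward = 0.4                          # placeholder for the mean GRPO/DAPO rollout reward
    pool.update(qid, mean_reward)
```

Keeping two heaps keyed toward the threshold leaves the medium-difficulty frontier sitting at the heap roots, so each boundary sample costs O(log n) regardless of pool size.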
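The augmentation and verification bullets can likewise be read as one small pipeline: the current policy proposes a numeric variant of an existing problem, and a stronger teacher model must solve it consistently before the variant is admitted. The sketch below assumes generic `policy_generate` and `teacher_solve` callables and a simple exact-match agreement check; the actual prompts and verification rule may differ.

```python
# Hypothetical augmentation + verification step. `policy_generate` and `teacher_solve`
# stand in for whatever inference calls the pipeline uses; the prompt text and the
# agreement check are assumptions, not the paper's exact procedure.

AUGMENT_PROMPT = (
    "Rewrite the following math problem by changing only its numeric values, "
    "keeping its structure and difficulty. Output just the new problem.\n\n{problem}"
)


def augment_and_verify(problem, policy_generate, teacher_solve, n_teacher_runs=2):
    """Ask the current policy for a numeric variant of `problem`, then require a
    stronger teacher model to solve it consistently. Returns (new_problem, answer)
    if the variant is verified, or None if it should be discarded."""
    candidate = policy_generate(AUGMENT_PROMPT.format(problem=problem))

    # Teacher-guided verification: the teacher solves the candidate several times;
    # keep it only if every attempt produces the same final answer.
    answers = [teacher_solve(candidate) for _ in range(n_teacher_runs)]
    if any(a is None for a in answers) or len(set(answers)) != 1:
        return None

    return candidate, answers[0]   # the verified answer is stored as ground truth with the query
```

A variant that passes verification would then enter the pool as a child of its source problem, adding an edge to the lineage graph; one natural choice is to initialize its difficulty estimate from the parent's reward.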
Results (Qwen2.5-7B, math RL)
- Across both datasets (DAPO-Math & OpenR1-Math) and both algorithms (GRPO & DAPO):
  - +3–5 points average accuracy over the baseline.
- On competition benchmarks (HeaPA vs. baseline):
  - AIME24 (GRPO, DAPO-Math): 21.4% vs. 17.3%
  - AMC23 (GRPO, DAPO-Math): 82.4% vs. 75.7%
- Compute efficiency:
  - Up to 12–16% less compute to reach the same performance as the original uniform sampler.
  - Additional wall-clock overhead from HeaPA is only ~1–3%, and <1% for larger models.
Plug-and-play
Code and paper coming soon!
1. Motivation: Why data efficiency matters in RL for reasoning
LLMs handle many NLP tasks well, but math reasoning is still hard: