The paper introduces PopuLoRA, a population-based asymmetric self-play framework that uses co-evolving LoRA teacher and student adapters to prevent curriculum collapse in RLVR post-training.
Key Takeaways
Single-agent self-play self-calibrates: the proposer converges on tasks its own solver already handles, so curriculum complexity collapses (lower AST depth, cyclomatic complexity, and lines of code).
PopuLoRA separates teachers (task generators) from students (solvers) as LoRA adapters on a shared frozen base; teacher reward is tied to the matched student's failure rate, not to self-evaluation.
TrueSkill-based prioritized fictitious self-play concentrates training on near-balanced matchups; weight-space evolution (mutation, crossover on LoRA tensors) replaces weak adapters in seconds.
4T+4S configuration (8 adapters) adds only 1.31x wall-clock overhead; memory scales with adapter weights, not full model copies.
PopuLoRA outperforms compute-matched single-agent baseline on HumanEval+, MBPP+, LiveCodeBench, and shows suggestive transfer to math benchmarks (AIME, MATH-500), though cross-domain causality is not isolated.
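The mechanics in the takeaways above (failure-rate teacher reward, near-balanced matchmaking, and weight-space evolution) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the reward is taken to be the raw failure rate (the paper may shape or cap it), matchups are sampled in proportion to the standard two-player TrueSkill match-quality formula as a stand-in for its prioritized fictitious self-play, mutation is Gaussian noise, crossover picks each tensor from one parent, and adapters are modeled as dicts of flat float lists. All function names and the tensor names in the usage note are hypothetical.

```python
import math
import random

# TrueSkill's conventional default performance noise (mu=25, sigma=25/3, beta=25/6).
BETA = 25.0 / 6.0


def teacher_reward(student_passed):
    """Assumed reward: fraction of the teacher's tasks the matched student failed."""
    if not student_passed:
        return 0.0
    return 1.0 - sum(student_passed) / len(student_passed)


def match_quality(mu_t, sigma_t, mu_s, sigma_s, beta=BETA):
    """Two-player TrueSkill match quality; highest when ratings are close."""
    c2 = 2 * beta ** 2 + sigma_t ** 2 + sigma_s ** 2
    return math.sqrt(2 * beta ** 2 / c2) * math.exp(-((mu_t - mu_s) ** 2) / (2 * c2))


def sample_matchup(teachers, students, rng):
    """Sample a (teacher, student) pair with probability proportional to match quality.

    `teachers` and `students` map adapter names to (mu, sigma) ratings.
    """
    pairs = [(t, s) for t in teachers for s in students]
    weights = [match_quality(*teachers[t], *students[s]) for t, s in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def mutate(adapter, scale, rng):
    """Return a copy of the adapter with Gaussian noise added to every weight."""
    return {
        name: [w + rng.gauss(0.0, scale) for w in weights]
        for name, weights in adapter.items()
    }


def crossover(parent_a, parent_b, rng):
    """Build a child by copying each named tensor from one parent, chosen at random."""
    return {
        name: list(parent_a[name] if rng.random() < 0.5 else parent_b[name])
        for name in parent_a
    }
```

As a usage sketch, a weak adapter could be replaced by `mutate(crossover(a, b, rng), 0.01, rng)`, where `a` and `b` are two stronger adapters such as `{"layer0.lora_A": [...], "layer0.lora_B": [...]}`. Because only the small LoRA tensors are touched, an operation like this is cheap compared with retraining, which fits the "in seconds" replacement claim.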