PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play


TLDR

  • The paper introduces PopuLoRA, a population-based asymmetric self-play framework in which co-evolving LoRA teacher and student adapters prevent curriculum collapse during RLVR (reinforcement learning with verifiable rewards) post-training.

Key Takeaways

  • Single-agent self-play self-calibrates: the proposer converges on tasks its own solver already handles, so the curriculum collapses toward simpler programs (lower AST depth, lower cyclomatic complexity, fewer lines of code).
  • PopuLoRA separates teachers (task generators) from students (solvers), each a LoRA adapter on a shared frozen base; a teacher's reward is tied to the failure rate of the student it is matched with, not to self-evaluation.
  • TrueSkill-based prioritized fictitious self-play concentrates training on near-balanced matchups; weight-space evolution (mutation, crossover on LoRA tensors) replaces weak adapters in seconds.
  • The 4T+4S configuration (four teachers and four students, 8 adapters) adds only 1.31x wall-clock overhead; memory scales with adapter weights, not full model copies.
  • PopuLoRA outperforms a compute-matched single-agent baseline on HumanEval+, MBPP+, and LiveCodeBench, and shows suggestive transfer to math benchmarks (AIME, MATH-500), though cross-domain causality is not isolated.
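
The teacher reward and the matchmaking described in the takeaways above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the summary does not give the exact reward shape or priority function, so the raw failure rate and the use of the standard TrueSkill 1-vs-1 match-quality formula as a sampling weight are assumptions, and names such as `teacher_reward`, `match_quality`, and `sample_matchup` are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Rating = Tuple[float, float]  # (mu, sigma): TrueSkill-style skill estimate

BETA = 25.0 / 6.0  # conventional TrueSkill default for performance noise


def teacher_reward(student_passed: List[bool]) -> float:
    """Assumed reward: fraction of attempts the matched student failed."""
    if not student_passed:
        raise ValueError("need at least one student attempt")
    return 1.0 - sum(student_passed) / len(student_passed)


def match_quality(a: Rating, b: Rating, beta: float = BETA) -> float:
    """Standard TrueSkill 1-vs-1 match quality, in (0, 1].

    Higher values mean a more balanced matchup: close means, low uncertainty.
    """
    (mu_a, sig_a), (mu_b, sig_b) = a, b
    denom = 2 * beta**2 + sig_a**2 + sig_b**2
    return math.sqrt(2 * beta**2 / denom) * math.exp(
        -((mu_a - mu_b) ** 2) / (2 * denom)
    )


def sample_matchup(
    teachers: Dict[str, Rating],
    students: Dict[str, Rating],
    rng: random.Random,
) -> Tuple[str, str]:
    """Sample a (teacher, student) pair with probability proportional
    to match quality, so near-balanced pairs are trained on more often."""
    pairs = [(t, s) for t in teachers for s in students]
    weights = [match_quality(teachers[t], students[s]) for t, s in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]
```

Note that a raw failure-rate reward would also pay a teacher for unsolvable tasks; a real system would presumably need a validity check on generated tasks, which this sketch omits.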

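The weight-space evolution step (mutation and crossover on LoRA tensors) might look like the sketch below. The concrete operators are assumptions, since the summary does not specify them: Gaussian noise for mutation, per-tensor uniform crossover, and replacement of the lowest-fitness adapter by a mutated child of the two fittest. Plain Python lists stand in for LoRA tensors, and all function names are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical stand-in for a LoRA adapter: tensor name -> flattened weights.
Adapter = Dict[str, List[float]]


def mutate(adapter: Adapter, sigma: float, rng: random.Random) -> Adapter:
    """Assumed mutation: add zero-mean Gaussian noise to every weight."""
    return {
        name: [w + rng.gauss(0.0, sigma) for w in weights]
        for name, weights in adapter.items()
    }


def crossover(a: Adapter, b: Adapter, rng: random.Random) -> Adapter:
    """Assumed crossover: copy each tensor whole from one parent at random."""
    return {
        name: list(a[name] if rng.random() < 0.5 else b[name])
        for name in a
    }


def replace_weakest(
    population: Dict[str, Adapter],
    fitness: Dict[str, float],
    sigma: float,
    rng: random.Random,
) -> str:
    """Overwrite the lowest-fitness adapter with a mutated child of the
    two fittest adapters; returns the name of the replaced adapter."""
    ranked = sorted(population, key=lambda name: fitness[name])
    weakest, parent_a, parent_b = ranked[0], ranked[-1], ranked[-2]
    child = crossover(population[parent_a], population[parent_b], rng)
    population[weakest] = mutate(child, sigma, rng)
    return weakest
```

Because only small adapter tensors are copied and perturbed, with no gradient steps involved, an operation of this kind is cheap, which is consistent with the claim that weak adapters are replaced in seconds.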
Hacker News Comment Review

  • No substantive HN discussion yet.
