Self-Distillation Enables Continual Learning

TLDR

  • The paper introduces SDFT, a method that uses in-context learning to turn demonstrations into on-policy training signals, reducing catastrophic forgetting without requiring reward functions.

Key Takeaways

  • Standard SFT trains on off-policy demonstration tokens, pushing the model away from its own output distribution and causing catastrophic forgetting; SDFT instead uses the demonstration-conditioned model as its own teacher so the training signal stays on-policy (see the sketch after this list).
  • SDFT requires no explicit reward functions, unlike RL-based continual learning approaches, making it practical for real fine-tuning pipelines.
  • In sequential learning experiments, a single model accumulated multiple skills over time with no performance regression on prior tasks.
  • SDFT outperformed SFT on both new-task accuracy and forgetting reduction across skill learning and knowledge acquisition benchmarks.
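
A minimal sketch of the mechanism as described above, assuming a Hugging Face causal LM. The model id, prompt and demonstration strings, sampling length, learning rate, and the KL-based loss are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch: the same model, with the demonstration in-context, acts as teacher;
# the bare model (student) is distilled toward it on its own samples.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative lr

prompt = "Q: What is 17 * 3?\nA:"       # task input (hypothetical)
demo = "Q: What is 12 * 4?\nA: 48\n\n"  # demonstration shown only to the teacher

# 1) Sample a response on-policy from the student (no demonstration in context).
enc = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=32, do_sample=True,
                         pad_token_id=tok.eos_token_id)
response = out[:, enc["input_ids"].shape[1]:]  # just the sampled tokens
R = response.shape[1]

# 2) Teacher pass: same weights, demonstration prepended in-context.
teacher_ids = torch.cat([tok(demo + prompt, return_tensors="pt")["input_ids"],
                         response], dim=1)
with torch.no_grad():
    t_logits = model(teacher_ids).logits[:, -R - 1:-1]  # logits over response

# 3) Student pass: score the same response without the demonstration.
student_ids = torch.cat([enc["input_ids"], response], dim=1)
s_logits = model(student_ids).logits[:, -R - 1:-1]

# 4) Distill: match the teacher's token distribution on tokens the student
#    itself generated, which is what keeps the training signal on-policy.
loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                F.softmax(t_logits, dim=-1), reduction="batchmean")
opt.zero_grad()
loss.backward()
opt.step()
```

Because the targets come from distributions the model can already produce, rather than from external demonstration tokens, no explicit reward function is needed; repeating this over a stream of tasks is what the sequential-learning experiments evaluate.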

Hacker News Comment Review

  • One commenter flagged that the paper’s language (“enable,” “establishing”) overstates certainty and questioned how broadly the results generalize.

Original | Discuss on HN