Using group theory to explore the space of positional encodings for attention


TLDR

  • A Jane Street ML researcher uses the theory of one-parameter matrix groups to show that only a few positional-encoding families satisfy natural constraints, and that all the sensible ones are already in use.

Key Takeaways

  • Formalizing linearity, translation invariance, and continuity forces any positional encoding into a one-parameter matrix group of the form exp(tM); the group law is checked numerically in the first sketch after this list.
  • Diagonalizable generators yield three cases: NoPE (zero eigenvalue), exponential decay (negative real eigenvalue, used in linear attention/RetNet/Mamba-3), and RoPE-style rotation (complex-conjugate eigenvalue pair); see the second sketch below.
  • RoPE is recovered directly from the 2D complex-eigenvalue subspace; exponentially damped RoPE generalizes it and appears in RetNet and Mamba-3 (third sketch below).
  • Defective (non-diagonalizable, Jordan-block) generators are technically legal and produce polynomial-in-time attention modulation, but they appear unexplored and are unlikely to be practical (final sketch below).
  • The analysis covers continuous and irregularly sampled time, making it applicable beyond integer sequence indices to time-series and Mamba-style learned-time models.
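
The exp(tM) form can be sanity-checked numerically. Below is a minimal sketch (not from the post) using numpy and scipy.linalg.expm: for any generator M, the maps R(t) = exp(tM) compose additively, and the relative map between two positions depends only on their offset.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
M = 0.5 * rng.standard_normal((4, 4))  # arbitrary generator

def R(t: float) -> np.ndarray:
    """Positional map: one-parameter matrix group generated by M."""
    return expm(t * M)

s, t = 0.7, 1.9
# Group law: two successive shifts compose into one.
assert np.allclose(R(s) @ R(t), R(s + t))
# Translation invariance: R(s)^{-1} R(t) = R(t - s) depends only on the offset.
assert np.allclose(R(-s) @ R(t), R(t - s))
```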
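
Each of the three diagonalizable cases has a closed form that the matrix exponential reproduces. A sketch with illustrative values of lam and theta (these parameters are assumptions, not from the post):

```python
import numpy as np
from scipy.linalg import expm

t, lam, theta = 2.0, 0.3, 0.5  # illustrative values

# Zero eigenvalue -> NoPE: exp(t * 0) is the identity, no positional signal.
assert np.allclose(expm(t * np.zeros((2, 2))), np.eye(2))

# Negative real eigenvalue -> exponential decay, the gating used in
# linear attention / RetNet / Mamba-style recurrences.
assert np.allclose(expm(t * (-lam) * np.eye(2)), np.exp(-lam * t) * np.eye(2))

# Complex-conjugate pair +/- i*theta -> a real 2x2 rotation block (RoPE).
M_rot = np.array([[0.0, -theta],
                  [theta, 0.0]])
rotation = np.array([[np.cos(theta * t), -np.sin(theta * t)],
                     [np.sin(theta * t),  np.cos(theta * t)]])
assert np.allclose(expm(t * M_rot), rotation)
```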
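
Applied to a query/key pair, the rotation case reproduces RoPE's defining property: attention scores depend only on relative position. Adding a negative real part to the eigenvalues gives the damped variant used in RetNet/Mamba-3-style gating. The rope_block helper and all values below are hypothetical illustrations, not the post's code:

```python
import numpy as np
from scipy.linalg import expm

theta, lam = 0.1, 0.05  # illustrative frequency and decay rate

def rope_block(pos: float) -> np.ndarray:
    """One 2x2 RoPE block: exp(pos * M) with skew-symmetric generator M."""
    M = np.array([[0.0, -theta],
                  [theta, 0.0]])
    return expm(pos * M)

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])
m, n = 3.0, 7.0

# The score between q rotated to position m and k rotated to position n
# depends only on the offset n - m, because rotations are orthogonal.
score = (rope_block(m) @ q) @ (rope_block(n) @ k)
assert np.isclose(score, q @ (rope_block(n - m) @ k))

# Damped RoPE: shift the eigenvalues' real part to -lam, so
# exp(t * (M - lam*I)) = exp(-lam * t) * rotation(t).
M_damped = np.array([[-lam, -theta],
                     [theta, -lam]])
assert np.allclose(expm(m * M_damped), np.exp(-lam * m) * rope_block(m))
```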
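
Finally, the defective case: for a 2x2 Jordan block, the matrix exponential picks up a factor polynomial in t, which is the polynomial-in-time modulation mentioned above. Again a sketch with an arbitrary lam:

```python
import numpy as np
from scipy.linalg import expm

lam, t = -0.2, 3.0  # illustrative eigenvalue and position

# A defective generator: a 2x2 Jordan block (one eigenvalue, one eigenvector).
J = np.array([[lam, 1.0],
              [0.0, lam]])

# exp(tJ) = exp(lam*t) * [[1, t], [0, 1]]: the nilpotent part contributes a
# polynomial factor in t on top of the exponential envelope.
closed_form = np.exp(lam * t) * np.array([[1.0, t],
                                          [0.0, 1.0]])
assert np.allclose(expm(t * J), closed_form)
```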

Hacker News Comment Review

  • No substantive HN discussion yet.
