The paper shows that Transformers can predict PCG (permuted congruential generator) pseudorandom sequences in-context, beating published classical attacks, with the context length needed for prediction scaling as the square root of the modulus.
Key Takeaways
Transformers (up to 50M parameters, trained on 5B tokens) successfully predict unseen PCG sequences, including variants whose output permutation combines bit-shifts, XORs, rotations, and truncations (a minimal generator sketch follows this list).
Even output truncated to a single bit remains reliably predictable, a result the authors flag as surprising.
The context length required for near-perfect prediction scales as sqrt(m), where m is the modulus, giving a concrete scaling law (illustrated below).
Moduli >= 2^20 require curriculum learning from smaller moduli; without it, optimization stalls in extended stagnation phases (see the curriculum sketch below).
Embedding layers spontaneously form clusters invariant under bitwise rotation in their top PCA components, which explains representation transfer across moduli (see the PCA sketch below).
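A minimal sketch of a PCG-style generator of the kind the takeaways refer to, assuming the standard XSH-RR output permutation (xorshift, data-dependent rotation, truncation) on a 64-bit LCG state; the function name, constants, and shift amounts follow the reference PCG64 design and are illustrative, not necessarily the paper's exact smaller-modulus variants.

```python
def pcg_xsh_rr(seed, out_bits=32,
               mult=6364136223846793005, inc=1442695040888963407):
    """PCG-style stream: a 64-bit LCG state update followed by an XSH-RR
    output permutation (xorshift + data-dependent rotation), then
    truncation of the resulting 32-bit word to `out_bits` bits."""
    mask64 = (1 << 64) - 1
    state = seed & mask64
    while True:
        state = (state * mult + inc) & mask64                        # LCG step, modulus 2^64
        xorshifted = (((state >> 18) ^ state) >> 27) & 0xFFFFFFFF    # xorshift mix of high bits
        rot = state >> 59                                            # top 5 bits pick the rotation
        word = ((xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))) & 0xFFFFFFFF
        yield word >> (32 - out_bits)                                # keep only out_bits (can be 1)

# Usage: out_bits=1 gives the single-bit truncated stream mentioned above.
gen = pcg_xsh_rr(seed=12345, out_bits=1)
print([next(gen) for _ in range(32)])
```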
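The sqrt(m) scaling law can be made concrete with a quick calculation; the proportionality constant is not stated in this summary, so only the relative growth of the required context is meaningful here.

```python
import math

# If n_ctx ~ c * sqrt(m) for some constant c, then adding 2 bits to the
# modulus doubles the context needed for near-perfect prediction.
for bits in (16, 20, 24, 28):
    m = 2 ** bits
    print(f"m = 2^{bits:>2}: sqrt(m) = {math.isqrt(m):>6}  (relative context length)")
```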
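A minimal sketch of the modulus curriculum, assuming a staged schedule in which each stage warm-starts from the previous checkpoint; the stage boundaries, step count, and `train_one_stage` are placeholders rather than the paper's recipe.

```python
def train_one_stage(params, modulus, steps):
    """Placeholder for a full training loop on PCG sequences with this modulus."""
    print(f"modulus 2^{modulus.bit_length() - 1}: training for {steps} steps")
    return params  # warm-started weights carry over to the next, larger modulus

params = None  # would be randomly initialised weights in a real run
for modulus in (2**16, 2**18, 2**20, 2**22):  # small -> large; >= 2^20 needs the warm start
    params = train_one_stage(params, modulus, steps=10_000)
```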
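A sketch of how the embedding-cluster observation could be checked, assuming access to the trained embedding matrix; `emb` below is random so the script runs standalone, and the within-class variance ratio is an assumed diagnostic, not the paper's analysis.

```python
import numpy as np

bits, d_model = 10, 64             # a 2^10 modulus vocabulary; sizes are illustrative
m = 2 ** bits
emb = np.random.randn(m, d_model)  # stand-in for the trained embedding matrix

# Coordinates of each token embedding in the top-2 principal components.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top2 = centered @ vt[:2].T

# Bit-rotation equivalence classes: x ~ any cyclic rotation of its `bits`-bit pattern.
def rotation_class(x, bits):
    return min(((x >> k) | (x << (bits - k))) & (2**bits - 1) for k in range(bits))

classes = np.array([rotation_class(x, bits) for x in range(m)])

# If embeddings cluster by rotation class, within-class spread in the top PCs
# should be much smaller than the overall spread (ratio << 1 in a trained model).
within = np.mean([top2[classes == c].var(axis=0).sum()
                  for c in np.unique(classes) if (classes == c).sum() > 1])
overall = top2.var(axis=0).sum()
print(f"within-class / overall variance in top-2 PCs: {within / overall:.2f}")
```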