Transformers, LSTMs, linear RNNs, and word embeddings all independently learn periodic number representations, with dominant Fourier periods at T = 2, 5, and 10.
Key Takeaways
All tested architectures show Fourier-domain sparsity in number representations, but only some develop geometrically separable mod-T features usable for linear classification.
Fourier sparsity is shown to be necessary but not sufficient for mod-T geometric separability, a meaningful distinction between surface-level and structured numeric understanding.
Two acquisition routes exist: co-occurrence signals in general language data (text-number proximity, cross-number interaction) and multi-token arithmetic training (addition problems).
Single-token addition problems do not produce geometrically separable features, but multi-token addition does: tokenization strategy directly affects numeric representation quality.
Architecture, optimizer, data, and tokenizer all independently influence whether a model reaches the higher tier of structured numeric representation.
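The two-tier distinction above can be illustrated with a small numpy sketch. The embeddings here are synthetic stand-ins (Fourier components at periods 2, 5, and 10 plus noise), not outputs of any of the trained models; the probe is a least-squares linear classifier used as a minimal stand-in for logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
numbers = np.arange(200)

# Hypothetical stand-in for learned number embeddings: a few Fourier
# components at periods 2, 5, and 10, plus pure-noise dimensions.
feats = []
for T in (2, 5, 10):
    feats.append(np.cos(2 * np.pi * numbers / T))
    if T > 2:  # sin(pi * n) is identically zero, so skip it for T = 2
        feats.append(np.sin(2 * np.pi * numbers / T))
emb = np.stack(feats, axis=1) + 0.05 * rng.standard_normal((len(numbers), len(feats)))
emb = np.concatenate([emb, 0.3 * rng.standard_normal((len(numbers), 10))], axis=1)

# Tier 1 -- Fourier-domain sparsity: each structured dimension has one
# dominant period, recoverable from an FFT over the number axis.
spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0))
freqs = np.fft.rfftfreq(len(numbers))
dominant_periods = 1 / freqs[spectrum[1:].argmax(axis=0) + 1]  # skip the DC bin
print(np.round(dominant_periods[:5], 1))  # structured dims: periods ~2, 5, 10

# Tier 2 -- geometric separability: a linear probe (least squares on
# one-hot targets) reads off n mod 10 from separable features.
labels = numbers % 10
train = numbers < 100  # fit on 0..99, evaluate on 100..199
W, *_ = np.linalg.lstsq(emb[train], np.eye(10)[labels[train]], rcond=None)
accuracy = ((emb[~train] @ W).argmax(axis=1) == labels[~train]).mean()
print(accuracy)  # near 1.0 when mod-10 structure is linearly separable
```

Dropping the sin components at periods 5 and 10 would leave the FFT spectrum just as sparse while collapsing residues like 1 and 9 onto the same point, which is the sense in which sparsity alone does not guarantee separability.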