Researchers trained a 13B-parameter model exclusively on 260B tokens of pre-1931 English text to study AI generalization, forecasting, and contamination-free evaluation.
Key Takeaways
talkie-1930-13b is the largest known vintage LM; the next target is a GPT-3-scale model planned for summer 2026, with a corpus potentially exceeding 1T historical tokens.
Contamination-free construction enables unique evals: the model can be probed on Python coding, future-event prediction, and independent invention discovery, with training-set leakage ruled out by design.
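To make that concrete, here is a minimal probe sketch, assuming a Hugging Face-style checkpoint; the hub ID `example-org/talkie-1930-13b` is a placeholder, not a published path. Specific, correct completions to prompts like these would point to leakage, while plausible-but-wrong continuations are the interesting forecasting signal.

```python
# Minimal contamination-free probe. Assumes a Hugging Face-style checkpoint;
# the hub ID below is a placeholder, not the project's published path.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/talkie-1930-13b"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# None of these concepts exist in pre-1931 text, so a specific, correct
# completion would signal temporal leakage rather than generalization.
probes = [
    "The programming language known as Python is",
    "President Roosevelt's programme called the New Deal",
    "The second great war in Europe began when",
]

for prompt in probes:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True), "\n---")
```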
OCR quality is the core data bottleneck: conventionally OCR-transcribed text yields only about 30% of the learning efficiency of human transcription, and regex-based cleaning recovers that to roughly 70%.
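The exact cleaning rules aren't given, but a regex pass of this kind typically targets mechanical OCR artifacts. Below is a minimal sketch; the specific patterns (long-s substitution, rejoining hyphenated line breaks, punctuation and page-number cleanup) are illustrative assumptions, not the project's actual pipeline.

```python
import re

# Illustrative regex cleanup for OCR'd pre-1931 text; these particular rules
# are assumptions, not the project's published pipeline.
RULES = [
    (re.compile(r"ſ"), "s"),                   # long s, common in older typefaces
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2"),   # rejoin words hyphenated at line breaks
    (re.compile(r"[ \t]{2,}"), " "),           # collapse runs of spaces/tabs
    (re.compile(r"([,.;:!?]){2,}"), r"\1"),    # deduplicate stuttered punctuation
    (re.compile(r"(?m)^\s*\d+\s*$\n?"), ""),   # drop bare page-number lines
]

def clean_ocr(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(clean_ocr("The paſſage was inter-\nrupted  by noiſe,,\n42\n"))
# -> "The passage was interrupted by noise,\n"
```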
Temporal leakage is hard to eliminate: despite an n-gram anachronism classifier, the 13B model still surfaced knowledge of Roosevelt’s New Deal and early WWII details.
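One simple way to implement such a filter is to flag any document containing n-grams that occur only in post-cutoff reference text. In the sketch below, a hand-picked blocklist stands in for n-grams that would really be mined from post-1930 corpora:

```python
# Minimal sketch of an n-gram anachronism filter. A real version would mine
# the blocklist from post-1930 reference corpora; this hand-picked list is
# only illustrative.
ANACHRONISTIC_NGRAMS = {
    ("new", "deal"),
    ("world", "war", "ii"),
    ("atomic", "bomb"),
    ("united", "nations"),
}
MAX_N = max(len(gram) for gram in ANACHRONISTIC_NGRAMS)

def tokenize(text: str) -> list[str]:
    return [t.strip(".,;:!?\"'()") for t in text.lower().split()]

def flag_anachronisms(text: str) -> set[tuple[str, ...]]:
    """Return every blocklisted n-gram found in the document."""
    tokens = tokenize(text)
    hits = set()
    for n in range(2, MAX_N + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in ANACHRONISTIC_NGRAMS:
                hits.add(gram)
    return hits

doc = "The President proposed a New Deal for the forgotten man."
print(flag_anachronisms(doc))  # {('new', 'deal')} -> document is excluded
```

Exact-match filters like this are brittle: OCR errors and paraphrases break the n-gram match, which is presumably how the New Deal and WWII material slipped through.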
Modern VLM-based OCR was tried but rejected: it hallucinates modern facts into transcriptions, poisoning the corpus and breaking the vintage constraint.