Introducing talkie: a 13B vintage language model from 1930


TLDR

  • Researchers trained a 13B language model exclusively on pre-1931 text to study AI generalization and test whether it can rediscover post-cutoff inventions.

Key Facts

  • talkie-1930-13b was trained on 260B tokens of historical English text: books, newspapers, patents, scientific journals, and case law, all with a hard December 31, 1930 cutoff.
  • A “modern twin” trained on FineWeb outperforms talkie on standard benchmarks, but performance gaps narrow on core language understanding and numeracy tasks.
  • Vintage OCR quality is a major bottleneck: conventional OCR output yields only 30% of the learning efficiency of human-transcribed text, and regex-based cleaning raises that to 70% (see the preprocessing sketch after this list).
  • The team is training a GPT-3-scale follow-up model, targeting release this summer, with a corpus potentially exceeding one trillion historical tokens.
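
A minimal sketch of what such a preprocessing pass might look like, assuming each document carries a publication date and raw OCR text. The field names, cutoff filter, and regex rules here are illustrative assumptions, not the team's actual pipeline.

```python
"""Hypothetical corpus preprocessing: enforce the 1930-12-31 cutoff and apply
regex-based cleanup to noisy OCR text. Illustrative only."""
import re
from datetime import date

CUTOFF = date(1930, 12, 31)

# Example rules for common OCR artifacts: end-of-line hyphenation,
# multi-column whitespace runs, excess blank lines, stray control characters.
OCR_RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),       # rejoin words split by line-break hyphenation
    (re.compile(r"[^\S\n]{2,}"), " "),           # collapse runs of spaces/tabs from column scans
    (re.compile(r"\n{3,}"), "\n\n"),             # squeeze long runs of blank lines
    (re.compile(r"[\x00-\x08\x0b-\x1f]"), ""),   # drop control characters left by OCR
]


def clean_ocr(text: str) -> str:
    """Apply each cleanup rule in order and strip surrounding whitespace."""
    for pattern, repl in OCR_RULES:
        text = pattern.sub(repl, text)
    return text.strip()


def keep_document(doc: dict) -> bool:
    """Keep only documents published on or before the hard cutoff."""
    return doc["published"] <= CUTOFF


if __name__ == "__main__":
    corpus = [
        {"published": date(1928, 5, 3), "text": "The  aero-\nplane   ascended rapidly."},
        {"published": date(1934, 1, 1), "text": "Too late for the corpus."},
    ]
    kept = [clean_ocr(d["text"]) for d in corpus if keep_document(d)]
    print(kept)  # ['The aeroplane ascended rapidly.']
```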

Why It Matters

  • Contamination-free training enables clean generalization experiments, such as testing whether a model with no knowledge of digital computers can learn to write Python (a minimal contamination check is sketched after this list).
  • Studying models trained on data from outside the modern web corpus could reveal how much of current LM behavior is shaped by that single data source rather than by language or reasoning in general.
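
To make "contamination-free" concrete, here is a hypothetical sanity check that scans documents for vocabulary that should not exist before 1931. The term list, document format, and thresholds are assumptions for illustration, not anything described in the article.

```python
"""Hypothetical post-cutoff contamination scan: flag documents containing
vocabulary coined after 1930 for manual review. Term list is illustrative."""
import re

# A few clearly post-1930 terms; any hit flags the document.
ANACHRONISMS = [
    r"\btransistor\b",   # 1947
    r"\blaser\b",        # 1960
    r"\bsoftware\b",     # 1950s
    r"\bnylon\b",        # 1938
    r"\binternet\b",
]
PATTERN = re.compile("|".join(ANACHRONISMS), re.IGNORECASE)


def flag_contamination(docs):
    """Yield (doc_id, matched term) pairs for documents with post-cutoff vocabulary."""
    for doc_id, text in docs:
        match = PATTERN.search(text)
        if match:
            yield doc_id, match.group(0)


if __name__ == "__main__":
    sample = [
        ("patent-1929-0041", "An improved carburettor for motor-cars."),
        ("scan-unknown-177", "The software controlling the laser was updated."),
    ]
    print(list(flag_contamination(sample)))
    # [('scan-unknown-177', 'software')]
```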

Nick Levine, David Duvenaud, Alec Radford · 2026-04-28