Introducing talkie: a 13B vintage language model from 1930

TLDR

  • Nick Levine, David Duvenaud, and Alec Radford released talkie, a 13B model trained on 260B tokens of pre-1931 English text, Apache 2.0 licensed.

Key Takeaways

  • Two checkpoints: talkie-1930-13b-base (53.1 GB) trained on out-of-copyright text; talkie-1930-13b-it (26.6 GB) fine-tuned for chat using pre-1931 reference works.
  • Fine-tuning used Claude Sonnet 4.6 as a judge for DPO and Claude Opus 4.6 to generate rejection-sampled synthetic chats, so the chat model is not fully copyright-clean.
  • Key research questions the team is probing: can the model predict post-cutoff events, independently derive General Relativity, or learn to write Python from few-shot examples?
  • The team aims to eventually use vintage base models as their own judges to eliminate anachronistic influence from modern LLMs in post-training.
  • Training required active contamination prevention to keep post-1931 text and modern LLM knowledge out of the corpus.
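The judge-based post-training loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the team's actual pipeline: `generate` and `judge_score` are toy stubs standing in for the base model's sampler and the Claude judge, and `build_dpo_pair` shows the general shape of rejection sampling into DPO preference pairs (best candidate becomes "chosen", worst becomes "rejected"):

```python
# Hypothetical sketch of judge-scored rejection sampling for DPO pairs.
# `generate` and `judge_score` are stand-in stubs; in the article, a
# modern LLM (Claude) plays the judge role, which is what breaks
# era-purity for the chat model.

def generate(prompt: str, n: int) -> list[str]:
    # Stub: pretend the vintage base model returns n candidate completions.
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def judge_score(prompt: str, completion: str) -> float:
    # Stub judge: a real pipeline would ask an LLM grader to rate the
    # completion for helpfulness / period-appropriate style.
    return -float(len(completion))

def build_dpo_pair(prompt: str, n: int = 4) -> dict[str, str]:
    """Sample n candidates, keep the judge's best as 'chosen' and worst as 'rejected'."""
    candidates = generate(prompt, n)
    ranked = sorted(candidates, key=lambda c: judge_score(prompt, c), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

pair = build_dpo_pair("Describe the wireless telegraph.")
```

Each resulting `{prompt, chosen, rejected}` triple is one DPO training example; the team's stated goal of using vintage base models as their own judges would amount to swapping the modern LLM out of `judge_score`.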

Why It Matters

  • Demonstrates a viable path to fully out-of-copyright base model training; releasing the training data would be the remaining step toward a “vegan” LLM.
  • The fine-tuning dependency on modern LLMs is a structural, unsolved problem for era-pure models rather than a flaw unique to talkie; Mr. Chatterbox faced the same constraint.
  • Alec Radford (GPT, GPT-2, Whisper) co-authoring signals this is serious research, not a novelty project.

Simon Willison / Simon Willison’s Weblog · 2026-04-28