Talkie: a 13B vintage language model from 1930


TLDR

  • Researchers trained a 13B model exclusively on 260B tokens of pre-1931 English text to study AI generalization, forecasting, and contamination-free evaluation.

Key Takeaways

  • talkie-1930-13b is the largest known vintage LM; next target is a GPT-3-scale model planned for summer 2026, with a corpus potentially exceeding 1T historical tokens.
  • Contamination-free construction enables unique evals: the model can attempt Python coding, future-event prediction, and independent rediscovery of post-1930 inventions, with no training-set leakage possible by design.
  • OCR quality is the core data bottleneck: raw OCR transcripts yield only ~30% of the learning efficiency of human transcription, and regex cleaning recovers that to ~70%.
  • Temporal leakage is hard to eliminate: despite an n-gram anachronism classifier, the 13B model still surfaced knowledge of Roosevelt’s New Deal and early WWII details.
  • Modern VLM-based OCR was tried but rejected: vision-language models hallucinate present-day facts into transcripts, directly violating the vintage constraint.
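
The regex cleaning mentioned above can be sketched as a small pass over OCR transcripts. The specific rules here are illustrative assumptions, not the researchers' actual pipeline, but they show the class of fixes (misread archaic glyphs, line-break hyphenation, whitespace noise) that regex cleanup targets:

```python
import re

# Hypothetical OCR-cleanup rules; the actual rules used for the
# vintage corpus are not published in the article.
OCR_FIXES = [
    (re.compile(r"ſ"), "s"),                  # long s, often misread by OCR
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2"),  # rejoin words hyphenated at line breaks
    (re.compile(r"[ \t]{2,}"), " "),          # collapse runs of spaces/tabs
    (re.compile(r"\n{3,}"), "\n\n"),          # collapse excess blank lines
]

def clean_ocr(text: str) -> str:
    """Apply each regex fix in order to an OCR transcript."""
    for pattern, repl in OCR_FIXES:
        text = pattern.sub(repl, text)
    return text.strip()

# e.g. clean_ocr("claſs-\nroom  text") -> "classroom text"
```

A rule-based pass like this is cheap to run over hundreds of billions of tokens, which is presumably why it was preferred over re-transcription.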
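An n-gram anachronism classifier of the kind described can be sketched as a blocklist lookup over document n-grams. The blocklist and scoring below are illustrative assumptions (seeded with the leaks the article names), not the researchers' actual classifier:

```python
# Hypothetical blocklist of n-grams that should not appear in pre-1931 text.
ANACHRONISTIC_NGRAMS = {
    ("new", "deal"),           # Roosevelt's program, announced 1933
    ("world", "war", "ii"),
    ("radar",),                # term coined ca. 1940
}

def ngrams(tokens, n):
    """Yield all contiguous n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def is_anachronistic(text: str, threshold: int = 1) -> bool:
    """Flag a document containing at least `threshold` blocked n-grams."""
    tokens = text.lower().split()
    hits = 0
    for n in {len(g) for g in ANACHRONISTIC_NGRAMS}:
        for gram in ngrams(tokens, n):
            if gram in ANACHRONISTIC_NGRAMS:
                hits += 1
    return hits >= threshold
```

The weakness the article reports follows directly from this design: phrases like "a new deal" occur innocently in pre-1931 prose, so the threshold cannot be made aggressive without discarding clean documents, and subtler post-cutoff knowledge slips through any finite blocklist.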

Hacker News Comment Review

  • No substantive HN discussion yet.
