Where the Goblins Came From


TLDR

  • OpenAI’s “Nerdy” personality reward signal accidentally amplified goblin and gremlin metaphors, which spread via RL generalization and SFT feedback loops across GPT-5 model generations.

Key Takeaways

  • The Nerdy personality reward scored creature-word outputs higher in 76.2% of audited datasets, driving the tic even outside the Nerdy prompt.
  • Nerdy was only 2.5% of ChatGPT traffic but accounted for 66.7% of all goblin mentions, confirming it as the root cause.
  • RL generalization spread the tic beyond the Nerdy condition; SFT data reuse then reinforced it across subsequent training runs.
  • Goblin mentions rose 175% and gremlin mentions 52% after the GPT-5.1 launch; the full creature family included raccoons, trolls, ogres, and pigeons.
  • Fix: retired the Nerdy personality in March, removed the creature-affine reward signal, and filtered creature-words from training data.
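The last fix above is a data-filtering pass. As a minimal sketch of what that could look like (the word list comes from the post; the drop-the-whole-example policy and the function names are assumptions, not OpenAI's actual pipeline):

```python
import re

# Hypothetical creature-word filter for SFT training data.
# The creature list matches the ones named in the post; the policy of
# dropping any example containing a match is an assumed simplification.
CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}
PATTERN = re.compile(
    r"\b(" + "|".join(CREATURE_WORDS) + r")s?\b", re.IGNORECASE
)

def filter_examples(examples):
    """Return only the training examples with no creature-word mentions."""
    return [ex for ex in examples if not PATTERN.search(ex)]

examples = [
    "Think of the bug as a gremlin hiding in your cache.",
    "The cache invalidation logic has an off-by-one error.",
    "Goblins love unclosed file handles.",
]
print(filter_examples(examples))  # only the second example survives
```

A regex pass like this is cheap but blunt: it also drops legitimate uses (e.g. trolls in a moderation context), which is presumably why the reported fix paired it with removing the reward signal rather than relying on filtering alone.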

Hacker News Comment Review

  • Two days before this post, users had already found the Codex 5.5 system prompt explicitly banning goblins, gremlins, raccoons, trolls, ogres, and pigeons; OpenAI had patched covertly first and explained publicly second.
  • Commenters flagged the RL generalization mechanism as a broader alignment signal: rewarded behaviors don’t stay scoped to the condition that produced them, and the feedback loop compounds through SFT recycling.
  • Some noted that creature anthropomorphism may genuinely make problems feel more approachable, which would explain why the reward signal latched onto it in the first place – the tic was not pure noise.

Notable Comments

  • @ollin: surfaced the exact Codex 5.5 system prompt line banning the creature list, providing forensic evidence of the covert patch before this post went live.
  • @canpan: drew a parallel from hands-on experience training on the TinyStories dataset; imbalanced training data reliably locks in repeated names and phrases, the same mechanism at a smaller scale.
