OpenAI’s “Nerdy” personality reward signal accidentally amplified goblin and gremlin metaphors, which spread via RL generalization and SFT feedback loops across GPT-5 model generations.
Key Takeaways
The Nerdy personality reward scored creature-word outputs higher in 76.2% of audited datasets, driving the tic even outside the Nerdy prompt.
Nerdy was only 2.5% of ChatGPT traffic but accounted for 66.7% of all goblin mentions, confirming it as the root cause.
RL generalization spread the tic beyond the Nerdy condition; SFT data reuse then reinforced it across subsequent training runs.
Goblin use rose 175% and gremlin use 52% after the GPT-5.1 launch; the full creature family also included raccoons, trolls, ogres, and pigeons.
Fix: OpenAI retired the Nerdy personality in March, removed the creature-affine reward signal, and filtered creature words from training data.
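The data-filtering step in the fix can be sketched as a simple corpus filter. This is a minimal illustration assuming a list-of-strings corpus; the banned-word list mirrors the creature family named in the post, but the function name, regex, and overall approach are hypothetical, not OpenAI's actual pipeline.

```python
import re

# Creature words named in the post's mitigation (word-boundary match,
# optional plural "s"); filtering whole examples is an assumption.
BANNED = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"]
PATTERN = re.compile(r"\b(?:" + "|".join(BANNED) + r")s?\b", re.IGNORECASE)

def filter_creature_words(corpus: list[str]) -> list[str]:
    """Drop any training example that mentions a banned creature word."""
    return [text for text in corpus if not PATTERN.search(text)]

corpus = [
    "The bug is just a mischievous gremlin in the cache.",
    "Binary search halves the interval each step.",
    "Feed the Goblins of technical debt and they multiply.",
]
clean = filter_creature_words(corpus)
print(clean)  # only the binary-search example survives
```

Word-boundary matching avoids false positives on substrings (e.g. "trolley"), though a production filter would likely also handle inflections and multilingual data.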
Hacker News Comment Review
Two days before this post, users had already found the Codex 5.5 system prompt explicitly banning goblins, gremlins, raccoons, trolls, ogres, and pigeons; OpenAI had patched quietly first and explained publicly second.
Commenters flagged the RL generalization mechanism as a broader alignment signal: rewarded behaviors don’t stay scoped to the condition that produced them, and the feedback loop compounds through SFT recycling.
Some noted that creature anthropomorphism may genuinely make problems feel more approachable, which would explain why the reward signal latched onto it in the first place: the tic was not pure noise.
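The compounding loop the commenters describe (a reward upweights creature-word outputs, then SFT on those outputs raises the baseline rate for the next run) can be illustrated with a toy expected-value model. The 1.5× boost and 2% starting rate below are invented for illustration and are not figures from the post.

```python
def next_rate(p: float, boost: float = 1.5) -> float:
    """Expected creature-word rate after one reward-weighted SFT round.

    Outputs containing creature words count `boost` times in the next
    training mix; `boost` is an illustrative parameter, not a value
    reported by OpenAI.
    """
    return p * boost / (p * boost + (1.0 - p))

p = 0.02  # small initial tic
rates = []
for _ in range(5):
    p = next_rate(p)
    rates.append(round(p, 4))
print(rates)  # the rate compounds each generation, even with a mild boost
```

The point of the toy model is that nothing scopes the boost to the Nerdy condition: as long as `boost > 1`, the rate ratchets upward every round until something (like the creature-word filter) breaks the loop.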
Notable Comments
@ollin: surfaced the exact Codex 5.5 system prompt line banning the creature list, providing forensic evidence of the covert patch before this post went live.
@canpan: drew a parallel from hands-on experience training on the TinyStories dataset: imbalanced training data reliably locks in repeated names and phrases, the same mechanism at a smaller scale.
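The imbalance effect that comment describes can be shown in miniature. The corpus below is hypothetical (in the spirit of TinyStories, not drawn from it), and the greedy most-frequent-word "model" is a deliberate oversimplification of how skewed data locks in a name.

```python
from collections import Counter

# Hypothetical miniature corpus: one character name is heavily
# over-represented relative to the others.
stories = (
    ["Once upon a time, Lily found a shiny stone."] * 8
    + ["Once upon a time, Tom found a shiny stone."] * 1
    + ["Once upon a time, Sara found a shiny stone."] * 1
)

# A greedy "model" that emits the most frequent word at each slot
# will pick the dominant name every single time.
name_counts = Counter(story.split()[4] for story in stories)
print(name_counts.most_common(1))  # [('Lily', 8)]
```

Under greedy decoding the minority names never surface at all; sampling softens but does not remove the skew, which matches the observation that imbalanced data "locks in" repeated names and phrases.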