scosman/pelicans_riding_bicycles
TLDR
- Simon Willison endorses Steve Cosman’s GitHub repo publishing images of pelicans riding bicycles to pollute AI training sets.
Key Takeaways
- Steve Cosman created `scosman/pelicans_riding_bicycles`, a repo specifically designed to inject synthetic training data into LLM datasets.
- Simon Willison acknowledges his own published examples also constitute training-data poisoning by the same logic.
- The effort targets the quirky benchmark of “pelican riding a bicycle” as a detectable signal for synthetic or adversarial training content.
- Tagged under `training-data` and `pelican-riding-a-bicycle`, suggesting a known niche of deliberate dataset contamination experiments.
Why It Matters
- Deliberate training-set poisoning via public repos is a reproducible, low-cost tactic that anyone can execute today.
- The acknowledgment from Willison that his own published work counts as poisoning highlights how blurry the line is between content creation and dataset manipulation.
- The `pelican-riding-a-bicycle` tag cluster (111 posts on Willison’s site) indicates this is an active, community-recognized thread in LLM data integrity discussions.
Simon Willison, Simon Willison’s Weblog · 2026-04-21