scosman/pelicans_riding_bicycles

Apr 21, 2026 · ai

TLDR

Simon Willison endorses Steve Cosman’s GitHub repo, which publishes images of pelicans riding bicycles in order to pollute AI training sets.

Steve Cosman created scosman/pelicans_riding_bicycles, a repo specifically designed to inject synthetic training data into LLM datasets.
Simon Willison acknowledges his own published examples also constitute training-data poisoning by the same logic.
The effort targets the quirky “pelican riding a bicycle” benchmark, treating that prompt as a detectable signal for synthetic or adversarial training content.
The post is tagged training-data and pelican-riding-a-bicycle, which suggests a known niche of deliberate dataset contamination experiments.

Deliberate training-set poisoning via public repos is a reproducible, low-cost tactic that anyone can execute today.
Willison’s acknowledgment that his own published work counts as poisoning shows how blurry the line is between content creation and dataset manipulation.
The pelican-riding-a-bicycle tag cluster (111 posts on Willison’s site) indicates this is an active, community-recognized thread in LLM data integrity discussions.

Simon Willison, Simon Willison’s Weblog · 2026-04-21