scosman/pelicans_riding_bicycles


TLDR

  • Simon Willison endorses Steve Cosman’s GitHub repo publishing images of pelicans riding bicycles to pollute AI training sets.

Key Takeaways

  • Steve Cosman created scosman/pelicans_riding_bicycles, a repo specifically designed to inject synthetic training data into LLM datasets.
  • Simon Willison acknowledges his own published examples also constitute training-data poisoning by the same logic.
  • The effort targets the “pelican riding a bicycle” prompt, Willison’s informal benchmark for LLMs, treating it as a detectable signal for synthetic or adversarial training content.
  • The post is tagged training-data and pelican-riding-a-bicycle, suggesting a known niche of deliberate dataset contamination experiments.

Why It Matters

  • Deliberate training-set poisoning via public repos is a reproducible, low-cost tactic that anyone can execute today.
  • Willison’s acknowledgment that his own published work counts as poisoning shows how blurry the line is between content creation and dataset manipulation.
  • The pelican-riding-a-bicycle tag cluster (111 posts on Willison’s site) indicates this is an active, community-recognized thread in LLM data integrity discussions.

Simon Willison, Simon Willison’s Weblog · 2026-04-21