Simon Willison on the AI coding inflection point and dark factories

Apr 2, 2026 · ai · Source ↗

Published 2026-04-02 - Runtime about 100 min - Watch on YouTube

TLDR

Simon Willison says November 2025 crossed a threshold: coding agents moved from mostly working to reliably following instructions.
Agentic engineering now means using agents with red/green TDD, templates, and parallel work, while human review shifts to higher-level judgment.

Key Takeaways

Willison expects the next big leap to be dark factory software, where code is generated, tested, and QA’d without direct human review.
He predicts 50% of engineers will be writing 95% AI-generated code by the end of 2026.
Prompt injection remains unsolved; Willison says only architecture like Google DeepMind’s CAMEL-style quarantining can reduce risk.
OpenClaw shows demand for personal assistants, but also how quickly security and data-access risks get normalized.

Notes

Willison says Anthropic and OpenAI spent 2025 optimizing coding, and reasoning models like OpenAI’s o1 helped make code generation much stronger.
He pins the inflection point on November 2025, naming GPT-5.1 and Claude Opus 4.5 as the models that crossed the threshold.
Before that point, coding agents often produced code that mostly worked; after it, they usually followed instructions and produced usable software.
He now writes about 95% of his code without typing it himself and often works from his phone while walking the dog.
He distinguishes vibe coding from agentic engineering: vibe coding means not looking at code, while agentic engineering uses agents to build production software.
He argues vibe coding is fine for personal prototypes, but unsafe for software that can harm other people or external systems.
The hardest frontier is building software that is better than before, not just faster to produce.
He says AI makes UI prototypes almost free, so product teams can explore three directions quickly before choosing one to test with humans.
Human usability testing still matters more than AI simulating users, because real people reveal where prototypes fail.
He says coding agents intensify mental load: he can run four agents in parallel, but is often wiped out by 11:00 a.m.
His practical patterns include red/green TDD and starting every project from a thin template with style, boilerplate, and a single test.
He says tests can now be very verbose because updating thousands of test lines is the agent’s job, not his.