Mozilla details how an agentic harness built atop fuzzing infrastructure used Claude Mythos Preview to find and reproduce 271 latent security bugs in Firefox, including sandbox escapes and decade-old memory corruption issues.
Key Takeaways
Sample bugs include a 15-year-old use-after-free (UAF) in the <legend> element, a 20-year-old XSLT hash-table race, and multiple IPC sandbox escapes – all with reproducible PoC testcases generated by the model.
The pipeline stacks discovery, deduplication, triage, and CI integration; the model is the core primitive but the surrounding tooling is what made it scalable across 150+ engineers.
Early static LLM audits (GPT-4, Sonnet 3.5) produced too many false positives; agentic harnesses that run reproducible test cases in ephemeral VMs resolved this.
Mozilla plans to shift from file-based scanning to patch-level scanning integrated into CI as patches land, expecting equal or better signal density.
The harness also confirmed defensive value of prior hardening: frozen prototype changes blocked multiple attempted sandbox escape strategies the model tried.
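The reproduce, deduplicate, and triage stages described above can be illustrated with a small sketch. This is not Mozilla's harness: the Finding record, its crash_signature, reproduced, and severity fields, and the triage function are all hypothetical names chosen for illustration, and the sketch assumes duplicates can be recognized by an identical crash signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # All fields are hypothetical, for illustration only.
    title: str
    crash_signature: str  # assumed dedup key, e.g. top stack frames of the crash
    reproduced: bool      # did the testcase reproduce in a fresh environment?
    severity: int         # assumed score; higher means more severe


def triage(findings):
    """Keep reproduced findings, drop duplicates, and sort worst-first."""
    by_signature = {}
    for f in findings:
        if not f.reproduced:
            # Unreproduced claims are discarded as likely false positives.
            continue
        kept = by_signature.get(f.crash_signature)
        # For duplicates, keep the report with the highest severity.
        if kept is None or f.severity > kept.severity:
            by_signature[f.crash_signature] = f
    return sorted(by_signature.values(), key=lambda f: -f.severity)


reports = [
    Finding("legend UAF (run 1)", "sig-legend", True, 3),
    Finding("legend UAF (run 2)", "sig-legend", True, 5),
    Finding("speculative overflow", "sig-none", False, 9),
    Finding("IPC escape", "sig-ipc", True, 4),
]
for f in triage(reports):
    print(f.severity, f.title)
```

In this sketch the reproduction check runs before deduplication, so an unreproduced report never masks a confirmed one that shares its signature.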
Hacker News Comment Review
There is real skepticism about terminology: commenters draw a hard line between “bugs,” “potential vulnerabilities,” and full PoC-backed CVEs, arguing Mozilla’s internal rollup CVE structure inflates the headline number.
Commenters note that every sampled bug touches C++ even though Firefox is only ~25% C++, and read this as a signal about where AI-assisted auditing currently concentrates – and where memory-safe rewrites would have the most leverage.
Commenters contrast Mozilla’s posture favorably against projects (specifically Zig) that refuse LLM-generated bug reports, treating this as a practical toolchain-adoption signal.
Notable Comments
@tialaramex: All sampled bugs touch C++ despite it being only ~25% of the codebase – a concrete data point for Rust migration ROI arguments.
@kajman: Initially dismissed the announcement as Anthropic product boosterism; the Mozilla Hacks post with actual bug IDs changed that assessment.