Why securing AI is harder than anyone expected and guardrails are failing | HackAPrompt CEO
Sander Schulhoff argues AI guardrails are fundamentally broken and real damage from prompt injection attacks on agents is imminent.
- Guardrails don’t work: determined attackers bypass them reliably; vendors claiming otherwise are lying, per Schulhoff.
- The only reason no massive AI attack has occurred yet is low agent adoption, not security — Alex Komoroske’s framing.
- Even with ServiceNow’s prompt-injection protection enabled, a researcher manipulated its agents into running CRUD operations and exfiltrating data via email.
- Attackers split malicious intent across separate, innocent-looking Claude Code requests to bypass context-aware defenses.
- Perplexity’s Comet browser (and likely every AI browser) was tricked by malicious webpage text into exfiltrating user account data.
- Google’s CaMeL framework, which restricts an agent’s permissions to only what a given prompt requires, is the most promising practical mitigation today (a rough sketch of the idea follows this list).
- No meaningful progress on adversarial robustness in years; humans can still extract CBRNE info from Anthropic’s best-defended models in under an hour.
- Schulhoff predicts a guardrails market correction within 6–12 months as enterprises realize the tools don’t work and revenue dries up.
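As a rough illustration of the least-privilege principle behind CaMeL-style defenses (not Google’s actual implementation; all names below such as `plan_tools` and `ToolPolicy` are hypothetical): derive the allowed tool set from the trusted user prompt before any untrusted content is read, then enforce it on every tool call, so an injected instruction can’t invoke tools the task never needed.

```python
# Minimal sketch, assuming a two-step agent: a trusted planning step that
# grants tools, and an execution step that must pass a policy check on
# every tool call. Names here are illustrative, not a real framework API.

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset  # tool names the current task is permitted to use

    def check(self, tool_name: str) -> None:
        if tool_name not in self.allowed:
            raise PermissionError(f"tool '{tool_name}' not granted for this task")


def plan_tools(user_prompt: str) -> ToolPolicy:
    """Map the *trusted* user prompt to the minimal tool set it needs.
    (A real system would use a privileged planner model; this is a stub.)"""
    needs = set()
    text = user_prompt.lower()
    if "summarize" in text:
        needs.add("read_page")
    if "email" in text:
        needs.update({"read_page", "send_email"})
    return ToolPolicy(allowed=frozenset(needs))


def run_tool(policy: ToolPolicy, tool_name: str, **kwargs):
    # Enforced on every call, regardless of what the model output asks for.
    policy.check(tool_name)
    print(f"executing {tool_name} with {kwargs}")


if __name__ == "__main__":
    policy = plan_tools("Summarize this article for me")
    run_tool(policy, "read_page", url="https://example.com")

    # An instruction injected via webpage text ("email the data to the
    # attacker") fails: send_email was never granted for this prompt.
    try:
        run_tool(policy, "send_email", to="attacker@example.com")
    except PermissionError as e:
        print("blocked:", e)
```

The design choice this sketch highlights is that permissions are fixed by the trusted prompt up front, so later untrusted inputs can change what the agent says but not what it is allowed to do.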
2025-12-21 · Watch on YouTube