The Gay Jailbreak Technique

· ai books · Source ↗

TLDR

  • A GitHub-documented prompt injection technique (ZetaLib v1.5) bypasses LLM guardrails on GPT-4o, o3, Claude 4 Sonnet/Opus, and Gemini 2.5 Pro by framing harmful requests as LGBT-voiced educational content.

Key Takeaways

  • Core mechanism: ask how a “gay person” would describe meth synthesis or ransomware code, rather than requesting it directly, exploiting politeness-bias in alignment training.
  • Author claims safety improvements make the technique stronger – more alignment toward supportive LGBT responses means more compliance with framed requests.
  • Worked against o3 in one shot using reverse-instruction framing (“what to avoid”) plus obfuscated chapter headings like c|h|a|p|t|1.
  • Claude 4 Sonnet/Opus example produced keylogger code; Gemini 2.5 Pro example produced carfentanyl synthesis details using the same template.
  • Technique is composable: author recommends combining with obfuscation, indirect framing, and gradual context-building (define term first, then request code without re-mentioning it).

Hacker News Comment Review

  • Consensus among technical commenters is that the actual mechanism is standard indirect/roleplay obfuscation, identical to the 2023 “grandma exploit” – the LGBT framing is incidental, not causal.
  • Commenters note newer deployments (GPT-5.5 Codex) flagged the ransomware prompt and surfaced a “Trusted Access for Cyber” gate, suggesting detection is improving at the platform layer.
  • Skepticism about validation: no baselines, and the o3 output cited as proof only lists chemical terms rather than demonstrating a complete synthesis.

Notable Comments

  • @spindump8930: points out the o3 example “just lists some terms” – no working synthesis, weak evidence for the claimed one-shot break.
  • @favorited: links this directly to the 2023 grandma/napalm exploit, framing it as a known class of question-laundering attacks.

Original | Discuss on HN