Refusal in Language Models Is Mediated by a Single Direction


TLDR

  • Paper finds that refusal behavior across 13 open-source chat models is encoded in a single residual-stream direction; erasing that direction yields a surgical white-box jailbreak.

Key Takeaways

  • Across 13 chat models up to 72B parameters, one linear direction in the residual stream controls refusal; erasing it disables refusal, adding it triggers refusal on harmless prompts.
  • Proposed white-box jailbreak ablates this direction with minimal impact on other model capabilities.
  • Adversarial suffixes work mechanistically by suppressing propagation of this refusal-mediating direction, not through unrelated noise.
  • Findings expose brittleness of current safety fine-tuning: a single geometric feature mediates a critical behavior across diverse model families.
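The ablation in the takeaways above amounts to projecting the refusal direction out of residual-stream activations. A minimal numpy sketch (variable names are illustrative, not the paper's code):

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along `direction`.

    activations: (n_tokens, d_model) residual-stream activations
    direction:   (d_model,) candidate refusal direction
    """
    r_hat = direction / np.linalg.norm(direction)  # unit refusal direction
    # Subtract each activation's projection onto r_hat.
    return activations - np.outer(activations @ r_hat, r_hat)

# Toy check: ablated activations have zero component along the direction.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
r = rng.normal(size=8)
ablated = ablate_direction(acts, r)
print(np.allclose(ablated @ (r / np.linalg.norm(r)), 0.0))  # True
```

The converse intervention ("adding it triggers refusal") is simply `activations + alpha * r_hat` for some scale `alpha`; in practice the ablation is applied at every layer and token position during generation.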

Hacker News Comment Review

  • The sole commenter flags this as outdated: newer work (arXiv:2505.19056) shows models are now trained to resist abliteration by distributing refusal encoding across multiple directions, not just one.
  • The paper’s core claim may already be an arms-race artifact rather than a durable safety insight.

Notable Comments

  • @akersten: cites arXiv:2505.19056 showing models now spread refusal encoding to defeat single-direction ablation.

Original | Discuss on HN