Paper finds refusal behavior across 13 open-source chat models is encoded in a single residual-stream direction, enabling surgical white-box jailbreaks by erasing it.
Key Takeaways
Across 13 chat models up to 72B parameters, a single linear direction in the residual stream controls refusal: erasing it disables refusal, while adding it triggers refusal even on harmless prompts.
Proposed white-box jailbreak ablates this direction with minimal impact on other model capabilities.
Adversarial suffixes work mechanistically by suppressing propagation of this refusal-mediating direction, not through unrelated noise.
Findings expose brittleness of current safety fine-tuning: a single geometric feature mediates a critical behavior across diverse model families.
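The erase/add interventions above amount to simple linear algebra on residual-stream activations. A minimal sketch, assuming the refusal direction has already been extracted (the helper names `ablate_direction` and `add_direction` are illustrative, not from the paper):

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out the refusal direction from residual-stream activations.

    activations: (n_tokens, d_model) array of residual-stream vectors.
    direction:   (d_model,) refusal direction (need not be unit norm).
    """
    d = direction / np.linalg.norm(direction)
    # Subtract each activation's component along d, zeroing the refusal signal.
    return activations - np.outer(activations @ d, d)

def add_direction(activations: np.ndarray, direction: np.ndarray,
                  alpha: float = 1.0) -> np.ndarray:
    """Add the refusal direction (scaled by alpha) to induce refusal."""
    d = direction / np.linalg.norm(direction)
    return activations + alpha * d
```

In practice these edits are applied at every layer and token position via forward hooks (or folded into the model weights), which is what makes the jailbreak "surgical": only the one-dimensional refusal subspace is altered.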
Hacker News Comment Review
The sole commenter flags this as outdated: newer work (arXiv:2505.19056) shows models are now trained to resist abliteration by distributing the refusal encoding across multiple directions rather than a single one.
The paper’s core claim may already be an arms-race artifact rather than a durable safety insight.
Notable Comments
@akersten: cites arXiv:2505.19056 showing models now spread refusal encoding to defeat single-direction ablation.