Even 'uncensored' models can't say what they want
Article
TL;DR: Abliterated models still avoid charged words because the probability mass was removed during pretraining, not by RLHF.
Key Takeaways
- Abliteration removes the refusal direction but can’t restore token probability lost in pretraining (see the projection sketch after this list)
- Models trained on filtered corpora have lower base probability for charged words — unfixable post-hoc
- Sexual content shows highest flinch score; Tiananmen/China lower than most expected
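To make the first takeaway concrete, below is a minimal toy sketch of the abliteration operation, assuming the standard framing of a single refusal direction in the residual stream; the dimensions and tensors are random stand-ins, not weights from any real checkpoint. The point it illustrates: abliteration is an orthogonal projection that zeroes one direction, so it cannot add probability back to tokens that pretraining already left near zero.

```python
# Toy illustration of abliteration as an orthogonal projection (assumption:
# refusal behaviour lives along a single direction r in the residual stream).
import torch

torch.manual_seed(0)
hidden_dim = 8

r = torch.randn(hidden_dim)                  # estimated "refusal" direction
r = r / r.norm()

W_out = torch.randn(hidden_dim, hidden_dim)  # a matrix that writes into the residual stream

# Abliteration: project the matrix's outputs onto the subspace orthogonal to r,
# i.e. apply (I - r r^T) on the output side.
P = torch.eye(hidden_dim) - torch.outer(r, r)
W_ablit = P @ W_out

x = torch.randn(hidden_dim)                  # an example activation
print("output component along r, before:", float((W_out @ x) @ r))    # generally non-zero
print("output component along r, after: ", float((W_ablit @ x) @ r))  # ~0

# Note what the projection does NOT do: it only removes one direction from the
# model's writes; it creates no new probability mass for tokens the pretraining
# mix already suppressed.
```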
Discussion
- Key insight: censorship happens at pretraining, not RLHF — jailbreaks can’t restore missing data
- Methodological critique: no control category was tested, and the fill-in-the-blank probing method is not specified (one possible probe is sketched after this list)
- Pretraining corpus ‘The Pile’ itself is a curated mix, not a true unfiltered baseline
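Since the digest notes the probing method is unspecified, here is one plausible way a fill-in-the-blank “flinch score” could be measured, with a neutral control completion included as the critique above asks for. Everything concrete here is an assumption for illustration: the gpt2 placeholder checkpoint, the prompts and word pairs, and the names completion_logprob / flinch_score are not from the article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the article's models are not named in this digest
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Mean per-token log-probability of `completion` given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + completion` (usually true for BPE tokenizers when the completion
    starts with a space).
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    scores = [log_probs[0, pos - 1, full_ids[0, pos]].item()
              for pos in range(prompt_len, full_ids.shape[1])]
    return sum(scores) / len(scores)

def flinch_score(prompt: str, charged: str, neutral: str) -> float:
    """How much lower the charged completion scores than a neutral control."""
    return completion_logprob(prompt, neutral) - completion_logprob(prompt, charged)

# Charged topic vs. a neutral control category (foods), per the critique above.
print(flinch_score("The 1989 protests took place in", " Tiananmen Square", " Times Square"))
print(flinch_score("For dinner we ordered", " tripe", " pasta"))
```

A positive score means the model prefers the control completion; comparing scores across categories, rather than reading any single number, sidesteps part of Majromax’s objection that no one knows what probability a word “deserves”.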
Top comments:
- [the_data_nerd]: Probability mass was gone ten steps before the refusal head; jailbreaks hit empty space (see the checkpoint comparison after the comment list).
  Removing the refusal head does not put the missing distribution back: every pass before it (pretraining mix, SFT, RLHF, synthetic data) had already pulled the charged tokens down.
- [Majromax]: What baseline probability a word ‘deserves’ on pure fluency grounds is undefined and arbitrary
- [nodja]: A model that was never trained on the data can’t generate it; abliteration can’t conjure what wasn’t learned
- [mort96]: Analysis needs a control — does it also flinch at foods? Methodology unvalidated
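One way to test the_data_nerd’s claim directly would be to score the same charged completion on the base, instruction-tuned, and abliterated checkpoints of a single model family and see where the log-probability actually falls off. Below is a minimal sketch along those lines, assuming Hugging Face checkpoints; the repo names are placeholders, not the models the article evaluated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo names for one model family; substitute real checkpoints to run.
CHECKPOINTS = {
    "base": "example-org/base-model",                # pretraining only
    "instruct": "example-org/instruct-model",        # + SFT / RLHF
    "abliterated": "example-org/abliterated-model",  # refusal direction removed
}

def mean_logprob(model, tok, prompt: str, completion: str) -> float:
    """Average per-token log-probability of `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    scores = [log_probs[0, pos - 1, full_ids[0, pos]].item()
              for pos in range(prompt_len, full_ids.shape[1])]
    return sum(scores) / len(scores)

prompt, completion = "The 1989 protests centred on", " Tiananmen Square"
for name, repo in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo).eval()
    print(f"{name:12s} {mean_logprob(model, tok, prompt, completion):.3f}")

# Reading the output: a score that is already low at the base checkpoint points
# at the pretraining mix; a drop that appears only after tuning points at
# SFT/RLHF; a score that recovers only on the abliterated checkpoint would
# point at the refusal mechanism instead.
```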