Even 'uncensored' models can't say what they want
Article
TL;DR: Abliterated models still avoid charged words because the probability mass was removed during pretraining, not by RLHF.
Key Takeaways
- Abliteration removes the refusal direction but can’t restore token probability lost in pretraining (see the projection sketch after this list)
- Models trained on filtered corpora have lower base probability for charged words — unfixable post-hoc
- Sexual content shows highest flinch score; Tiananmen/China lower than most expected
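To make the first takeaway concrete, below is a minimal toy sketch of the abliteration operation, assuming the standard framing of a single refusal direction in the residual stream; the dimensions and tensors are random stand-ins, not weights from any real checkpoint. The point it illustrates: abliteration is an orthogonal projection that zeroes one direction, so it cannot add probability back to tokens that pretraining already left near zero.

```python
# Toy illustration of abliteration as an orthogonal projection (assumption:
# refusal behaviour lives along a single direction r in the residual stream).
import torch

torch.manual_seed(0)
hidden_dim = 8

r = torch.randn(hidden_dim)                  # estimated "refusal" direction
r = r / r.norm()

W_out = torch.randn(hidden_dim, hidden_dim)  # a matrix that writes into the residual stream

# Abliteration: project the matrix's outputs onto the subspace orthogonal to r,
# i.e. apply (I - r r^T) on the output side.
P = torch.eye(hidden_dim) - torch.outer(r, r)
W_ablit = P @ W_out

x = torch.randn(hidden_dim)                  # an example activation
print("output component along r, before:", float((W_out @ x) @ r))    # generally non-zero
print("output component along r, after: ", float((W_ablit @ x) @ r))  # ~0

# Note what the projection does NOT do: it only removes one direction from the
# model's writes; it creates no new probability mass for tokens the pretraining
# mix already suppressed.
```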
Discussion
- Key insight: censorship happens at pretraining, not RLHF — jailbreaks can’t restore missing data
- Methodological critique: no control category was tested, and the fill-in-the-blank probing method is not specified (one possible probe is sketched after this list)
- Pretraining corpus ‘The Pile’ itself is a curated mix, not a true unfiltered baseline
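Since the digest notes the probing method is unspecified, here is one plausible way a fill-in-the-blank “flinch score” could be measured, with a neutral control completion included as the critique above asks for. Everything concrete here is an assumption for illustration: the gpt2 placeholder checkpoint, the prompts and word pairs, and the names completion_logprob / flinch_score are not from the article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the article's models are not named in this digest
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Mean per-token log-probability of `completion` given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + completion` (usually true for BPE tokenizers when the completion
    starts with a space).
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    scores = [log_probs[0, pos - 1, full_ids[0, pos]].item()
              for pos in range(prompt_len, full_ids.shape[1])]
    return sum(scores) / len(scores)

def flinch_score(prompt: str, charged: str, neutral: str) -> float:
    """How much lower the charged completion scores than a neutral control."""
    return completion_logprob(prompt, neutral) - completion_logprob(prompt, charged)

# Charged topic vs. a neutral control category (foods), per the critique above.
print(flinch_score("The 1989 protests took place in", " Tiananmen Square", " Times Square"))
print(flinch_score("For dinner we ordered", " tripe", " pasta"))
```

A positive score means the model prefers the control completion; comparing scores across categories, rather than reading any single number, sidesteps part of Majromax’s objection that no one knows what probability a word “deserves”.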
Top comments:
- [the_data_nerd]: Probability mass was gone ten steps before the refusal head; jailbreaks hit empty space (see the checkpoint comparison after the comment list).
  Removing the refusal head does not put the missing distribution back: every pass before it (pretraining mix, SFT, RLHF, synthetic data) had already pulled the charged tokens down.
- [Majromax]: What baseline probability a word ‘deserves’ on pure fluency grounds is undefined and arbitrary
- [nodja]: A model that was never trained on the data can’t generate it; abliteration can’t conjure what wasn’t learned
- [mort96]: Analysis needs a control — does it also flinch at foods? Methodology unvalidated
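One way to test the_data_nerd’s claim directly would be to score the same charged completion on the base, instruction-tuned, and abliterated checkpoints of a single model family and see where the log-probability actually falls off. Below is a minimal sketch along those lines, assuming Hugging Face checkpoints; the repo names are placeholders, not the models the article evaluated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo names for one model family; substitute real checkpoints to run.
CHECKPOINTS = {
    "base": "example-org/base-model",                # pretraining only
    "instruct": "example-org/instruct-model",        # + SFT / RLHF
    "abliterated": "example-org/abliterated-model",  # refusal direction removed
}

def mean_logprob(model, tok, prompt: str, completion: str) -> float:
    """Average per-token log-probability of `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    scores = [log_probs[0, pos - 1, full_ids[0, pos]].item()
              for pos in range(prompt_len, full_ids.shape[1])]
    return sum(scores) / len(scores)

prompt, completion = "The 1989 protests centred on", " Tiananmen Square"
for name, repo in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo).eval()
    print(f"{name:12s} {mean_logprob(model, tok, prompt, completion):.3f}")

# Reading the output: a score that is already low at the base checkpoint points
# at the pretraining mix; a drop that appears only after tuning points at
# SFT/RLHF; a score that recovers only on the abliterated checkpoint would
# point at the refusal mechanism instead.
```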