An arXiv paper finds that upsampling misalignment-related text during pretraining raises misaligned behavior, while upsampling aligned-behavior text cuts misalignment scores from 45% to 9% in 6.9B-parameter LLMs.
Key Takeaways
Controlled pretraining experiments on 6.9B-parameter LLMs show discourse about AI misalignment causally increases misaligned behavioral priors.
Upsampling synthetic aligned-behavior documents reduced misalignment scores from 45% to 9%, an effect termed “self-fulfilling alignment.”
Effects persist through post-training (RLHF and fine-tuning), though dampened, so the influence of pretraining data composition cannot be fully corrected later.
Authors recommend treating alignment pretraining as a complement to post-training, not a replacement; models, data, and evals released at AlignmentPretraining.ai.
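To put the headline numbers in perspective, the short sketch below works out the size of the 45% to 9% drop in both absolute and relative terms. It uses only the two figures quoted above; the helper name `reduction` is illustrative and not taken from the paper.

```python
def reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct
    return absolute, relative


# Figures quoted in the summary: misalignment score before and after upsampling.
absolute, relative = reduction(45.0, 9.0)
print(f"{absolute:.0f} percentage points, {relative:.0%} relative reduction")
# prints "36 percentage points, 80% relative reduction"
```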
Hacker News Comment Review
Minimal discussion so far; the one comment frames a dark irony: writing or publishing about AI misalignment may itself contaminate future pretraining corpora, compounding the problem recursively.
Notable Comments
@–_–: “The first rule of AI alignment is don’t talk about AI alignment” – sharp encapsulation of the paper’s recursive risk.