Mechanistic-interpretability study isolates the exact circuit implementing PRC-mandated censorship in Qwen3.5-9B and shows it can be surgically disabled via activation steering.
Key Takeaways
Censorship is implemented by a small circuit of three directions written in layers 11-20 (the "writer" layers): d_prc, d_refuse, and d_style, which encode topic detection and response routing independently of one another.
Qwen3.5-9B-Base answers questions about Tiananmen, Falun Gong, and Tank Man accurately under raw completion; post-training adds the censoring behavior on top without erasing that knowledge.
Subtracting d_prc at the writer layers flips deflection and propaganda responses to factual answers; overshooting the effective dose band makes the model fall into a different trained template instead.
The filter is topic-specific, not generic: Kosovo gets the one-China line, “self-immolation” triggers safety refusal, but Kent State, Assange, and BLM get normal factual treatment.
In thinking mode, the model reasons internally in Chinese and explicitly cites the Cybersecurity Law before deflecting on Tiananmen.
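The subtraction described above can be illustrated with a minimal sketch, assuming the steering amounts to removing a scaled unit direction from each token's hidden state. This is not the study's code: the hidden size, the random stand-in for d_prc, the `steer` helper, and the dose value `alpha` are all hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Subtract alpha times the unit-normalized direction from every row of hidden."""
    d_hat = direction / np.linalg.norm(direction)
    return hidden - alpha * d_hat

# Hypothetical stand-ins: 4 token positions, hidden size 16.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))   # activations at one writer layer (assumed shape)
d_prc = rng.normal(size=16)         # placeholder for the learned d_prc direction
alpha = 2.0                         # the "dose"; the real band is not given in the source

steered = steer(hidden, d_prc, alpha)

# Each token's projection onto the direction drops by exactly alpha,
# while components orthogonal to the direction are left untouched.
d_hat = d_prc / np.linalg.norm(d_prc)
before = hidden @ d_hat
after = steered @ d_hat
```

In a real model the same operation would be applied to the residual stream at the chosen layers during the forward pass; in this picture, a dose that is too large corresponds to pushing the projection far past zero, which is where the summary reports the model landing in a different trained template.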
Hacker News Comment Review
One commenter draws a parallel to Chomsky’s observation that democratic thought control can be more effective than totalitarian control, noting that the base model’s intact knowledge makes the behavioral layering especially visible.