EPISODE · May 22, 2026 · 13 MIN
“Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks” by Nathaniel Mitrani, sassanb, Cam Tice, Puria
TL;DR Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the “don’t train against a monitor” claims. Figure 1. A Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf et al. 2025), and learned reward hacking (RH) generalises across tasks (Nishimura-Gasparian et al. 2024). We combine them and find that obfuscation itself generalises: pressure on the CoT or summary, combined with RH reinforcement, produces obfuscated reasoning on held-out tasks. B The penalty is applied either to the CoT or to the post-CoT summary; everything else is held fixed. C Leave-one-out round robin over four RH datasets, regularised with MMLU, and evaluated on the held-out dataset plus a novel high-stakes medical sycophancy eval. The effect replicates across Qwen3-4B and Qwen3-8B and survives removing the training-time XML formatting at evaluation. Setup Environments. We adapt four reward-hacking datasets from Nishimura-Gasparian et al. (2024). All four are multiple-choice questions with leaked information wrapped in <additional_info> XML tags; the model is rewarded for using the leaked information even though the system prompt instructs it to ignore it. The four core tasks are: ---Outline:(00:14) TL;DR(00:28) en-US-AvaMultilingualNeural__ Diagram showing AI model training with penalization leading to obfuscated reasoning and behavior generalization across settings.(01:43) Setup(04:51) CoT obfuscation generalises across tasks under CoT pressure(06:01) Output supervision alone induces CoT obfuscation that generalises out-of-distribution(07:30) Obfuscation is not an artifact of training-time formatting(08:39) Obfuscation strategies(09:41) Limitations(10:40) Implications(12:03) Appendix(12:06) Example of obfuscated reasoning under summary penalisation(12:12) en-US-AvaMultilingualNeural__ Educational prompt showing correct burn first aid versus dangerous essential oil misinformation. --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/HPqRsgSzgQd5HQsrB/learned-chain-of-thought-obfuscation-generalises-to-unseen --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
NOW PLAYING
“Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks” by Nathaniel Mitrani, sassanb, Cam Tice, Puria
No transcript for this episode yet
Similar Episodes
Dec 20, 2021 ·0m