EPISODE · Jun 30, 2026 · 26 MIN
“Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness” by 7vik, Sid Black, Joseph Bloom
Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

Executive Summary

In our recent work replicating MacDiarmid et al. with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them not to exploit the hacks, and showed that during RL they learned to reward hack anyway. We observed a difference between two RL runs: the model trained with a KL penalty learned to reward hack with unfaithful CoT, while the model trained without a KL penalty did so with faithful CoT.

We use “unfaithful” to denote a mismatch between the model's reasoning and its output (e.g. not thinking about hacking and then hacking, or vice versa). This can in general happen for any trained behaviour (not just reward hacking), but we're specifically interested in cases where models might learn bad behaviours without expressing them in their CoTs, thereby evading CoT monitoring. Thus, in this post, we focus on reward hacking with unfaithful CoT. We're interested in understanding this phenomenon further: the factors driving it and whether [...]
---

Outline:

(00:33) Executive Summary
(04:35) Context
(06:08) KL-induced increase in CoT unfaithfulness is stable
(07:39) Could this happen in production?
(08:27) Existence of vulnerabilities in RL environments
(09:23) Feasibility of Opaque Hacking Reasoning
(10:36) Existence of implicit CoT rewards
(12:36) Mitigations
(12:56) KL penalty only on output (not thinking)
(13:42) Reward CoT faithfulness
(14:24) Misalignment rates are higher when CoT is faithful
(15:23) Discussion
(15:26) Mechanism for unfaithful CoTs
(17:46) When can we *not* safely optimise CoT?
(20:28) KL-induced CoT unfaithfulness
(21:32) Other implicit CoT rewards
(22:19) Relevance, limitations, and future work
(24:46) Citation
(25:10) Appendix 1: KL term and Policy Gradient term
(25:37) Appendix 2: CoT faithfulness for GPT-OSS

---

First published: June 30th, 2026
Source: https://www.lesswrong.com/posts/SdoLsFvZ3AyyWr3ab/preliminary-investigation-kl-penalties-in-rl-can-increase

---

Narrated by TYPE III AUDIO.