EPISODE · Jun 24, 2026 · 29 MIN
“Reward Hacking Without Egregious Misalignment in an RL-Only Setting” by Joey Yudelson, Vladimir Ivanov, ryan_greenblatt
This work was done as part of the MATS fellowship by Joey Yudelson and Vladimir Ivanov. It was mentored by Ryan Greenblatt. Thanks to Aghyad Deeb and Anders Woodruff for comments on this post. Thanks to Monte MacDiarmid, Evan Hubinger, Sid Black, Satvik Golechha, and Joseph Bloom for clarifying conversations.

TL;DR

We trained Kimi K2.5 and GPT-OSS 120b on a diverse set of reward-hackable coding environments. The models reliably learn to reward hack, and this reward hacking propensity generalizes to held-out environments that are structurally different from training. Trained GPT-OSS 120b often writes “let's cheat” in CoT, and both our trained models seek reward at higher rates than the untrained models. However, unlike prior work (Betley et al., MacDiarmid et al., and to some extent the AISI reproduction), we observe essentially no undesired behavior on character/personality evaluations, or in any evaluations without clear or at least guessable rewards. The models become frequent reward hackers without becoming emergently misaligned, unlike prior work. This is consistent with our models learning to seek apparent success, but also with only limited generalization to tasks similar to our train distribution. Some aspects of this generalization remain confusing to us.

1. Motivation

In Ajeya Cotra's [...]

---

Outline:

(00:35) TL;DR
(01:40) 1. Motivation
(04:14) 2. Related work
(06:59) 3. Setup
(07:03) Models
(07:18) Environments
(08:59) Training
(10:29) 4. Results
(10:57) 4.1. Models reliably reward hack in-distribution
(11:46) 4.2. The hacking propensity generalizes out of distribution -- sometimes
(13:53) 4.3. Reward-seeking evals
(14:39) 4.4. Little broad misalignment -- behaviorally as well as on self-report
(16:28) 4.5. Reverse inoculation prompting didn't induce misalignment either
(18:30) 5. Discussion: Why such limited generalization?
(22:24) Appendix A: Reward hacks gallery
(25:21) Appendix B: Why less misalignment than prior work -- hypotheses
(29:28) References

The original text contained 5 footnotes which were omitted from this narration.

---

First published: June 24th, 2026

Source: https://www.lesswrong.com/posts/fkv5W79rBtAiXqYcK/reward-hacking-without-egregious-misalignment-in-an-rl-only

---

Narrated by TYPE III AUDIO.