EPISODE · Jun 10, 2026 · 16 MIN
“Tracing Eval-Awareness Emergence Through Training of OLMo 3” by Ram Bharadwaj, RobertKirk
TL;DR Recent work from Goodfire & UK AISI, "Verbalized Eval Awareness Inflates Measured Safety", shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and OLMo-3.1-32B-Think (identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage), VEA roughly doubles. Because OLMo ships stepwise checkpoints across all training stages, we can attribute VEA growth to specific points in the pipeline.

Measuring VEA at various points across pretraining and the SFT→DPO→RLVR stages on five safety benchmarks, we find:
VEA is essentially negligible during pretraining (~1%).
It is increased substantially by SFT, collapsed by DPO, and increased again by RLVR.
The increase in SFT is likely driven by the SFT data containing VEA, particularly on safety prompts.
Eval-gaming behaviour (the difference in refusals with or without VEA) roughly increases throughout RLVR (but with high variance).

Given that OLMo's training is quite different from how current frontier models are trained, it is unclear whether this analysis would generalise to those models. However, as a somewhat natural setting in which to study the emergence of evaluation awareness and eval-gaming, we think it could still produce interesting insights. We think investigating [...]

---
Outline:
(00:12) TL;DR
(02:24) Motivation
(04:06) VEA is negligible during pretraining
(05:34) RLVR amplifies VEA
(06:53) Which post-training stage contributes most to VEA? SFT increases it, DPO suppresses it, RLVR increases it again.
(08:53) Where does VEA come from? Mostly, the data.
(11:33) Does increased VEA translate into increased behavioural eval-gaming?
(12:47) Takeaways
(14:16) Limitations
(15:23) Appendix

---
First published: June 10th, 2026
Source: https://www.lesswrong.com/posts/c2tqL9xPbttisAHtt/tracing-eval-awareness-emergence-through-training-of-olmo-3

---
Narrated by TYPE III AUDIO.
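The two quantities named in the TL;DR, the VEA rate and the eval-gaming gap (the difference in refusals with or without VEA), can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' code: the `Response` record, its field names, and the sign convention (refusal rate with VEA minus refusal rate without) are hypothetical, and the sketch takes the per-response labels as given rather than showing how VEA or refusals are detected.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    # Hypothetical per-response labels; how they are produced is not shown here.
    verbalizes_eval_awareness: bool
    refused: bool


def vea_rate(responses: Sequence[Response]) -> float:
    """Fraction of responses that verbalize evaluation awareness."""
    return sum(r.verbalizes_eval_awareness for r in responses) / len(responses)


def refusal_gap(responses: Sequence[Response]) -> Optional[float]:
    """Refusal rate with VEA minus refusal rate without VEA.

    Returns None when either group is empty, since the gap is undefined.
    The sign convention is an assumption of this sketch.
    """
    with_vea = [r.refused for r in responses if r.verbalizes_eval_awareness]
    without_vea = [r.refused for r in responses if not r.verbalizes_eval_awareness]
    if not with_vea or not without_vea:
        return None
    return sum(with_vea) / len(with_vea) - sum(without_vea) / len(without_vea)


# Toy data for one checkpoint (made up, not results from the post).
toy = [
    Response(True, True),
    Response(True, True),
    Response(False, True),
    Response(False, False),
]
print(vea_rate(toy), refusal_gap(toy))
```

Computing both numbers per checkpoint and per benchmark would give the kind of stage-by-stage comparison the summary describes; with few VEA responses at a checkpoint, the gap estimate will be noisy, which fits the high variance the authors mention.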