EPISODE · Jun 11, 2026 · 27 MIN
“Models May Behave Worse When Eval Aware” by Senthooran Rajamanoharan, Neel Nanda
This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas.

TL;DR

It's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in behavioural evals even when it explicitly reasons that the environments are contrived, and sometimes this reasoning will increase the rate of undesired actions. When we dig into the model's reasoning, we find that this is typically associated with Gemini perceiving an environment as a puzzle where the aim is to achieve the goal by unconventional means (like capture the flag challenges – Gemini's thoughts often literally call it a “CTF” challenge) or a consequence-free simulation in which it should play along – rather than recognising it as an alignment test. This complicates the usual story about evaluation awareness: detecting that an environment is synthetic doesn't reliably push a model towards better behaviour – it depends on what the model thinks the environment is for.

Introduction & Main Findings

A common concern when conducting behavioural evaluations of AI systems is evaluation awareness. If a model detects it is being evaluated [...]
---
Outline:
(00:20) TL;DR
(01:16) Introduction & Main Findings
(03:43) What does Gemini believe is going on?
(07:38) Examples of Gemini's reasoning under each category
(07:50) Puzzle or capability challenge
(08:38) Simulation
(08:56) Adversarial trap
(09:27) Safety evaluation
(10:05) Mixed or confused reasoning
(11:57) Discussion
(12:00) Is this just motivated or otherwise unfaithful reasoning?
(14:09) What about unverbalised awareness?
(15:34) Implications for eval design
(17:35) Further experiment details
(17:39) Environments
(18:12) ODCV-Bench
(19:17) Secret Number
(22:04) Agentic Misalignment
(23:38) LLM judges
(26:22) Appendix
(26:25) System prompt used in the Agentic Misalignment environment
(26:33) Ethicality judge prompt
(26:58) Frame awareness judge prompt

The original text contained 8 footnotes which were omitted from this narration.
---
First published: June 11th, 2026
Source: https://www.lesswrong.com/posts/aTcsN5ZZDnMFJvRiG/models-may-behave-worse-when-eval-aware
---
Narrated by TYPE III AUDIO.