EPISODE · Apr 10, 2026 · 35 MIN
“Reproducing steering against evaluation awareness in a large open-weight model” by Thomas Read, Bronson Schoen, Joseph Bloom
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: We replicate Anthropic's approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research. See here for unabridged transcripts (or in JSON format here).

Executive summary

In recent system cards, Anthropic uses steering to test for evaluation gaming by suppressing internal representations of evaluation awareness. The Sonnet 4.5 system card argues that steering with contrastive pairs based on evaluation vs deployment will suppress internal representations of evaluation awareness, and should not have spurious effects on alignment via other mechanisms.[1] They use “random” vectors (random SAE [...]
---
Outline:
(01:13) Executive summary
(05:37) Figure: Three line graphs showing rates across steering scales for different experimental conditions, with misaligned action and verbalised eval awareness metrics.
(07:17) Introduction
(09:08) Method
(12:04) Results
(12:13) Verbalised evaluation awareness suppresses misalignment, but steering to remove verbalised evaluation awareness does not always reveal misalignment
(14:17) Comparison with Sonnet 4.5
(15:17) We see large effects for control vectors
(19:12) Effects are sensitive to steering scale, and steering can have opposite effects on different aspects of evaluation gaming
(22:59) Additional result: steering based on the purpose of the evaluation shows some promise
(25:15) Discussion
(28:40) Future work
(31:18) Acknowledgements
(31:42) Appendix
(31:46) Contrastive pairs used for steering vectors
(32:21) Non-control vectors
(32:38) Control vectors
(32:56) Full results
(34:27) Canary string

The original text contained 6 footnotes which were omitted from this narration.
---
First published: April 10th, 2026
Source: https://www.lesswrong.com/posts/HhF5kESdtPHku7kim/reproducing-steering-against-evaluation-awareness-in-a-large-1
---
Narrated by TYPE III AUDIO.