EPISODE · Jun 13, 2026 · 3 MIN
“SFT Drives Gemini’s Safety Properties” by Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda
This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. The second post can be found here.

In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not by other training stages like RL. We do not want to overstate this claim as applying to other model families, and we also note that this may change in future Gemini versions. Nevertheless, this result ran counter to our initial expectations and will inform future safety work on our team, so we felt it was important to share with the broader safety community.

Experiment

We perform SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. We then compare these post-SFT models to the production versions of Gemini 3.1 Pro and Gemini 3 Flash on different safety-relevant benchmarks:

Error bars are 95% confidence intervals on the evals. The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar [...]

Outline:
(00:56) Experiment
(01:56) Brief Descriptions of Each Set of Benchmarks

First published: June 13th, 2026

Source: https://www.lesswrong.com/posts/nLrrYweeFxgXACSmS/sft-drives-gemini-s-safety-properties-1

Narrated by TYPE III AUDIO.