EPISODE · Jun 14, 2026 · 21 MIN
“Why Do Naive SFT Filters For Safety Properties Fail?” by Josh Engels, Neel Nanda
This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from SFT that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails.

TL;DR:
- We discuss seven hypotheses for why SFT filtering works surprisingly poorly.
- We analyze three hereditary traits that SFT-only Gemini has that other models do not: negative emotion, date confusion, and blackmail in the (highly contrived) agentic misalignment scenario.
- We use a “post-training diffing pipeline” between Gemini and Olmo to show that the cause of date confusion and blackmail is largely surprising transfer of behaviors from the SFT teacher model. Notably, there exist small sets of prompts where switching the teacher model for the rollout removes date confusion and blackmail, but dropping the prompts does not.
- Negative emotion is less affected by the teacher model, but this may be because the Olmo prompt [...]

---
Outline:
(00:47) TL;DR:
(02:17) Initial Hypotheses
(05:57) Post-Training Diffing
(06:50) Types of Diffing
(09:49) Comparison to TDA
(11:26) Results
---
First published: June 14th, 2026
Source: https://www.lesswrong.com/posts/wyZRNgpeiPeRXB6eT/why-do-naive-sft-filters-for-safety-properties-fail
---
Narrated by TYPE III AUDIO.