EPISODE · May 21, 2026 · 20 MIN
“Why does off-model SFT degrade capabilities?” by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, Vivek Hebbar
Off-model SFT (Supervised Fine-Tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities.

We ran experiments in hopes of understanding why off-model SFT degrades capabilities. We tentatively believe that it's because off-model SFT forces the model into an unfamiliar reasoning style that it's bad at using[1]. We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style (on data unrelated to our evaluation tasks) recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models.

Off-model SFT often degrades capabilities; degradation severity depends on several factors

To start, we establish that off-model SFT degrades capabilities. For several student-teacher pairs, we train the student on the teacher's responses to the Alpaca chat dataset using LoRA, and evaluate the student on IFEval, MMLU, MATH-500, and Olympiads for n=200 problems each. Note that we train on a distribution completely irrelevant to the benchmarks! We [...]
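Since each benchmark is evaluated on n=200 problems, it can help to see how much sampling noise that leaves in an accuracy comparison. The sketch below is an illustration, not the authors' analysis: the function names and the example counts are hypothetical, and it treats the base and fine-tuned evaluations as independent binomial samples, which is an assumption (a paired analysis on the same problems would usually be tighter).

```python
import math


def accuracy_with_stderr(num_correct: int, n: int) -> tuple[float, float]:
    """Return (accuracy, binomial standard error) for n graded problems."""
    acc = num_correct / n
    stderr = math.sqrt(acc * (1.0 - acc) / n)
    return acc, stderr


def degradation(base_correct: int, tuned_correct: int, n: int = 200) -> dict:
    """Compare a base model with its fine-tuned version on one benchmark.

    Hypothetical helper: assumes both models were graded on n problems and
    that the two accuracy estimates are independent (an assumption).
    """
    base_acc, base_se = accuracy_with_stderr(base_correct, n)
    tuned_acc, tuned_se = accuracy_with_stderr(tuned_correct, n)
    drop = base_acc - tuned_acc
    # Standard error of a difference of two independent estimates.
    drop_se = math.sqrt(base_se**2 + tuned_se**2)
    if drop_se > 0:
        z = drop / drop_se
    else:
        z = 0.0 if drop == 0 else math.copysign(math.inf, drop)
    return {"drop": drop, "drop_se": drop_se, "z": z}


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the post.
    result = degradation(base_correct=140, tuned_correct=100)
    print(result)
```

With n=200 the standard error of a single accuracy estimate is at most about 3.5 percentage points (reached at 50% accuracy), so drops of only a few points on one benchmark are hard to tell apart from noise, while drops of 15 to 20 points are not.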
---
Outline:
(01:09) Off-model SFT often degrades capabilities; degradation severity depends on several factors
(02:47) Hypotheses for why off-model SFT degrades capabilities
(02:53) Hypothesis 1: The student is imitating a dumb teacher.
(04:02) Hypothesis 2: Fitting high-perplexity data moves the model erratically through the loss landscape.
(04:46) Hypothesis 3: SFT effectively noises the weights.
(06:23) Hypothesis 4: Off-model SFT forces the student into an unfamiliar reasoning style it reasons poorly in.
(07:31) Evidence 1: On-model SFT recovers capabilities.
(08:15) Evidence 2: Prefilling with the original model's outputs recovers performance.
(09:04) Evidence 3: Training on non-text distributions preserves capabilities
(10:03) Evidence 4: Reasoning models retain capabilities if we only train on their outputs.
(10:52) Evidence 5: Inoculation prompting sometimes prevents degradation.
(12:48) Conclusion
(13:55) Appendix
(13:59) Appendix 1: Degraded capabilities are cheap to recover
(14:22) A small amount of on-model SFT recovers performance
(14:40) Best-of-N sampling partially recovers capabilities
(15:52) RL often recovers capabilities
(16:35) Appendix 2: Off-policy data doesn't necessarily cause degradation
(17:22) Appendix 3: Full-weight fine-tuning shows the same pattern
(19:29) Appendix 4: Filtering out math problems doesn't change the result
(20:09) Appendix 5: Full List of Models With Aliases

The original text contained 2 footnotes which were omitted from this narration.
---
First published: May 21st, 2026
Source: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
---
Narrated by TYPE III AUDIO.