EPISODE · Jun 16, 2026 · 18 MIN
“Synthetic document finetuning for instilling positive traits” by CallumMcDougall, Arthur Conmy, Neel Nanda
This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here.

TLDR: By adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning instils the traits robustly, and they hold up out of distribution (OOD). We share some takeaways on how to improve midtraining & SFT effectiveness.

Introduction

Inspired by Marks et al, where a multi-step finetuning process involving synthetic documents is used to create a model that robustly pursues a complex goal (taking actions favoured by a reward model), we wanted to use this method to robustly instil positive traits instead. Our motivation was deep alignment: we want to train principles into the model which guide behaviour even in highly OOD situations. Our MVP pipeline used a "traits document" (a short bullet-pointed list of positive traits we wanted the model to exhibit) as our universe context, with a checkpoint of Gemini 3 Flash [...]

---
Outline:
(00:52) Introduction
(03:52) Results
(07:42) Removing Superficial Patterns in Synthetic Data
(12:33) Takeaways

The original text contained 2 footnotes which were omitted from this narration.

---
First published: June 15th, 2026
Source: https://www.lesswrong.com/posts/GTYJRLhqztxKF2v5R/synthetic-document-finetuning-for-instilling-positive-traits

---
Narrated by TYPE III AUDIO.