EPISODE · Jun 6, 2026 · 6 MIN
“Optimisation over non-stationary distributions creates weirder minds” by Samuel Ratnam, Pjain
TLDR: Sequentially mixing training objectives incentivises different training dynamics, depending on how distinguishable the training environments are and how much pressure there is for shared circuitry. We classify these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful consideration of the pressures of training dynamics can allow us to shape the minds of AI systems in more intentional and fine-grained ways.

Modern LLM post-training involves interleaving many different training objectives, such as mixing different reward functions with supervised fine-tuning (SFT), typically in order to guard against catastrophic forgetting. However, a lot of existing theoretical work in AI safety operates under the assumption of a fixed training objective and distribution. What happens when we drop that assumption?

A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of these training objectives, so the problem reduces neatly to single-objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties.

In practice, things are not quite so simple. Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that [...]

The original text contained 3 footnotes which were omitted from this narration.

---

First published: June 5th, 2026

Source: https://www.lesswrong.com/posts/bzkQgYhaCcpvKqmAK/optimisation-over-non-stationary-distributions-creates

---

Narrated by TYPE III AUDIO.