EPISODE · Jun 26, 2026 · 4 MIN
“What did “scheming”, “mech interp” mean pre-2023.” by Cleo Nardo
This was too long to be a short-form, but it should really be a short-form. This notice is useful for people who've recently got into AI safety, who want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you probably don't need this.

A few phrases have changed their meaning over time. Two examples that came to mind recently are scheming and mech interp. (In both cases, I think the change of terminology was reasonable.) There are probably a bunch of other examples; feel free to mention them in the comments.

Scheming. This used to mean "training-gaming in pursuit of out-of-context goals". For example, Carlsmith (Nov 2023) starts with:

This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment").

Then Apollo came out with "Frontier Models are Capable of In-context Scheming" (Dec 2024):

We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

So the difference here is (1) the AI isn't in training (it's in [...]

---
Outline:
(00:47) Scheming.
(02:12) Mech interp.
---
First published: June 26th, 2026
Source: https://www.lesswrong.com/posts/NraMusoWhj9Njdpi5/what-did-scheming-mech-interp-mean-pre-2023
---
Narrated by TYPE III AUDIO.