EPISODE · Jun 20, 2026 · 15 MIN
“Research agenda: Interpretive debate” by Shi
One sentence pitch: our goal is to develop a piece of epistemic infrastructure for iteratively and empirically answering interpretive questions about AI models, where the accumulation of empirics leads to resolution of interpretive ambiguity and/or calibration of uncertainty. This directly builds on our “performative misalignment” work (paper, LW post), which we see as a minimal demonstration of one round of debate.

The difficulty and importance of interpretive questions

AI safety involves a lot of fuzzy, interpretive questions: Is this model scheming? Is the model sandbagging? Is the model lying? Is the model introspective? Is the model preferring itself because it recognizes itself? Does this steering vector actually control emotion?

Why do we need interpretation? Because despite CoTs being generally monitorable right now, we can’t trust that the model is verbalizing all the reasons that lead to its decisions, or that our behavioral testing fully reveals its latent knowledge and capabilities.

These questions are hard, partly because concepts like scheming and introspection are very hard to define accurately. For weaker models, the interpretive questions of interest were like: is the model exploiting multiple-choice format when answering questions, is the model exploiting lexical heuristics when predicting entailment, is the model [...]

---

First published: June 18th, 2026

Source: https://www.lesswrong.com/posts/onaSmiocXtBYG5BZZ/research-agenda-interpretive-debate

---

Narrated by TYPE III AUDIO.