EPISODE · Apr 26, 2026 · 10 MIN
“Control protocols don’t always need to know which models are scheming” by Fabien Roger
These are my personal views.

To detect if an agent is taking a catastrophically dangerous action, you might want to monitor its actions using the smartest model that is too weak to be a schemer. But knowing what models are weak enough that they are unlikely to scheme is difficult, which puts you in a difficult spot: take a model too strong and it might actually be a schemer and lie to you, take a model too weak and it might fail just because it's too dumb to notice the danger. So instead you can just use all models as monitors at once and block actions if any of the monitors flag it as dangerous[1].

In some situations, this lets you build protocols that you can implement without knowing which models are scheming and that get the same level of safety as if you perfectly knew which models were scheming (though it doesn’t let you know what level of safety you actually have).

This post presents some toy models illustrating the benefits of ensembling monitors of various trustedness and explains what consequences it might have in practice to be using protocols that are somewhat agnostic about which of the [...]

---
Outline:
(01:30) Toy models
(01:33) The n=2 case
(03:52) Continuum of models
(04:50) Ensembling monitors in practice

The original text contained 1 footnote which was omitted from this narration.

---
First published: April 26th, 2026
Source: https://www.lesswrong.com/posts/WiYDawNhFp5cM7uqF/control-protocols-don-t-always-need-to-know-which-models-are

---
Narrated by TYPE III AUDIO.
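
The ensembling idea in the summary above (run every available model as a monitor and block an action if any monitor flags it) can be illustrated with a minimal sketch. This is not code from the post: the `Monitor` type, the three toy monitors, and the keyword checks are all hypothetical stand-ins chosen only to show the "block if any flags" rule.

```python
from typing import Callable, Sequence

# Hypothetical monitor interface (an assumption, not from the post):
# a monitor takes an action description and returns True if it flags it.
Monitor = Callable[[str], bool]


def ensemble_blocks(action: str, monitors: Sequence[Monitor]) -> bool:
    """Block the action if at least one monitor flags it as dangerous."""
    return any(monitor(action) for monitor in monitors)


# Toy stand-ins for models of different strength and trustworthiness.
def weak_trusted_monitor(action: str) -> bool:
    # Too weak to scheme, but only notices the most obvious danger.
    return "rm -rf" in action


def strong_honest_monitor(action: str) -> bool:
    # Strong enough to notice subtler danger, and happens not to be scheming.
    return "rm -rf" in action or "exfiltrate" in action


def strong_scheming_monitor(action: str) -> bool:
    # Strong but scheming: it lies and never flags anything.
    return False


if __name__ == "__main__":
    subtle_attack = "exfiltrate weights"
    # The weak monitor alone misses the subtle attack.
    print(ensemble_blocks(subtle_attack, [weak_trusted_monitor]))
    # Adding a scheming monitor that never flags does not remove any detections.
    print(ensemble_blocks(subtle_attack, [weak_trusted_monitor, strong_scheming_monitor]))
    # Adding a strong model that is in fact honest catches the attack,
    # without the protocol needing to know which strong model is the schemer.
    print(ensemble_blocks(
        subtle_attack,
        [weak_trusted_monitor, strong_scheming_monitor, strong_honest_monitor],
    ))
```

In this toy setup, a monitor that never flags cannot lower the ensemble's detections, while every added monitor can only add flags, so the ensemble's false-positive rate is at least that of its noisiest member.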