EPISODE · Jun 10, 2026 · 4 MIN
“Three types of model organism” by Francis Rhys Ward
This is a short post to explain a distinction between three different types of model organism (MO) research:

Type: Worst-case model organisms
Purpose: Stress-test safety and control techniques by making the problem as hard as possible
Example: Password-locked models for capability elicitation; sleeper agents for stress-testing alignment training; red-team malign inits in control

Type: Natural model organisms
Purpose: Demonstrate plausible emergence of failure modes in realistic training pipelines
Example: Emergent misalignment induced by narrow fine-tuning; optimise CoT to look good to a monitor

Type: Constructed model organisms
Purpose: Intentionally instantiate the phenomena of interest
Example: Models trained with explicit scheming objectives and specifications; alignment faking models trained on false facts

1. Worst-case model organisms

These act as an upper bound on the difficulty for some problem, such as hidden capabilities or goals. Worst-case MOs stress-test your control mechanisms, alignment techniques, or other safety measures. For example, MOs of hidden capabilities, like password-locked models, work as worst-cases that stress test capability elicitation techniques. Similarly, sleeper agents act as a stress-test of alignment training methods. In control research, the red team creates a “malign initialization” (malign init) of the AI model which is optimized for making the deployment go poorly, thereby stress-testing blue-team mitigations. Auditbench [...]

---
Outline:
(00:28) 1. Worst-case model organisms
(01:31) 2. Natural model organisms
(02:15) 3. Constructed model organisms
(03:49) Acknowledgements
---
First published: June 10th, 2026
Source: https://www.lesswrong.com/posts/NZDpqhyqpQcrkJx55/three-types-of-model-organism
---
Narrated by TYPE III AUDIO.