EPISODE · Jul 1, 2026 · 14 MIN
“Structural Proxies” by Raymond Douglas
Lately I've been thinking a lot about what work would help with actually winning and getting to good worlds. In the spirit of that I decided to venture outside my normal wheelhouse and spend some time reflecting on what technical research could make me more confident about powerful AIs being safe.

AGI safety research is tricky partly because we don’t actually have access to the thing we want to study, i.e. superhuman AI. Much of the work we do now is basically trying to lay (potentially irrelevant) foundations for the period when we actually know what we’re up against, and at that point, a lot of the work might be done by AIs.

You can group current approaches by how they try to sidestep this access problem:[1]

- Prosaic techniques like RLHF and interpretability try to make progress on current model safety in a way that will hopefully generalise, except maybe they just won’t scale
- Model organisms artificially construct exemplars of bad behaviour (alignment faking, trojans etc) but it's hard to tell how representative the constructed case is
- Control techniques aim to get usable work out of potentially misaligned AIs and bootstrapping, except it's again unclear how far that [...]

---

Outline:
(03:23) Adversarial attacks as a proxy for value generalisation
(07:42) Faithfulness as a proxy for ELK
(11:23) Thoughts on structural proxies as a research direction

The original text contained 6 footnotes which were omitted from this narration.

---

First published: June 30th, 2026
Source: https://www.lesswrong.com/posts/mKiBhFJs3MoksaBMs/structural-proxies

---

Narrated by TYPE III AUDIO.