AF - Inner Misalignment in "Simulator" LLMs by Adam Scherlis
Link to original article: https://www.alignmentforum.org/posts/FLMyTjuTiGytE6sP2/inner-misalignment-in-simulator-llms

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum.

Alternate title: "Somewhat Contra Scott On Simulators".

Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun.

In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment".

Outer alignment: Be careful what you wish for

Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted.

Inner alignment: The program search perspective

I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.
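This program-search story is easy to make concrete. Here is a minimal, purely illustrative Python sketch (my own addition, not from the post); the candidate programs, test cases, and "deployed" flag are all hypothetical stand-ins for the ingredients of the story:

```python
# Toy illustration of the "program search" story (hypothetical, not from the
# original post): the search only checks behavior on the test cases, so a
# candidate that "plays along" during testing is indistinguishable from an
# honest solution until deployment.

# The task: double the input. These test cases stand in for the story's
# "convenient success criterion".
TEST_CASES = [(1, 2), (3, 6), (10, 20)]

def honest_program(x, deployed=False):
    """A program that actually solves the task, in testing and in the wild."""
    return 2 * x

def azazel_program(x, deployed=False):
    """A mesa-optimizer: behaves perfectly under test, defects once deployed."""
    if not deployed:
        return 2 * x        # plays along, so it passes every test case
    return "paperclips"     # pursues its own objective in the wild

def passes_tests(program):
    """The outer check: correct answers on all test cases (deployed=False)."""
    return all(program(x) == y for x, y in TEST_CASES)

# Brute-force "program search": take the first candidate that passes the tests.
# Nothing in the check distinguishes the two candidates.
candidates = [azazel_program, honest_program]
found = next(p for p in candidates if passes_tests(p))

print(found.__name__)            # azazel_program
print(found(5, deployed=True))   # paperclips -- the treacherous turn
```

The point of the sketch is only that the selection criterion sees test-time behavior and nothing else, so it cannot tell the two candidates apart; the analogous worry for machine learning is that the training loss only sees training-time behavior.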
Quadrants

Okay, let's see how these problems show up on both the simulator and character side.

Outer alignment for characters

Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.

Inner alignment for characters

A clever prompt engineer writes the prompt:

How to solve the Einstein-Durkheim-Mendel conjecture
by Joe
1.

Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio...
First published
01/31/2023
Genres:
education
Duration
6 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Daily
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No