AF - Thinking about maximization and corrigibility by James Payor
Link to original article: https://www.alignmentforum.org/posts/AFJgo99YckQnhbF8Z/thinking-about-maximization-and-corrigibility

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking about maximization and corrigibility, published by James Payor on April 21, 2023 on The AI Alignment Forum.

Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can". Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and so drift away from the things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful not to disrupt the structure, causal relationships, and other things we might be relying on.

Here's what I'd like to discuss in this post:

- When unstructured maximization does and doesn't work out for the humans.
- CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme.
- It's not enough to sweep the maximization under a rug; what we really need is more structured/corrigible optimization than "maximize this proxy".
- Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart.

When does maximization work?

In cases where it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are:

- Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there's less room for it to be broken. Examples: theorem proving, compression / minimizing reconstruction error.
- The space we're optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on. Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks.
- The optimization power being applied is limited. We can know our optimization probably won't invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don't seek things that could break our model. Examples: quantilization (a minimal sketch follows this section), GPT-4 tasked to write good documentation.
- The metric f is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment f in tune with goodness. Examples: chess engine evaluations, having f evaluate the thoughts that lead to x.

There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post.
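As a concrete illustration of the "limited optimization power" point, here is a minimal quantilizer sketch. This is not code from the original post, and the names `quantilize`, `candidates`, and `proxy_score` are hypothetical: the idea is simply that, rather than returning the argmax of the proxy, we sample from the top q fraction of options drawn from a trusted base distribution, which bounds how hard the proxy gets pushed on.

```python
import random

def quantilize(candidates, proxy_score, q=0.1, rng=random):
    """Rank candidates drawn from a trusted base distribution by the proxy,
    then sample uniformly from the top q fraction instead of taking the
    single argmax. This caps the optimization pressure applied to the proxy."""
    ranked = sorted(candidates, key=proxy_score, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Hypothetical usage: `candidates` come from some trusted generator, and
# `proxy_score` stands in for the imperfect metric f we would otherwise argmax.
candidates = [random.gauss(0, 1) for _ in range(1000)]
def proxy_score(x):
    return x  # stand-in for f(x)

print(max(candidates, key=proxy_score))             # full optimization pressure
print(quantilize(candidates, proxy_score, q=0.05))  # limited pressure
```

The design choice being illustrated: the base distribution, not the proxy, determines which weird corners of the search space are reachable at all, so a misspecified f has less room to select model-breaking x.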
Passing the buck on optimization

Merely passing the buck on optimization, pushing the maximization elsewhere without adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers.

Take CIRL for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off.

But this is still not the sort of objective you want to point maximization at. There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown (a toy sketch of this dynamic follows the list):

- If the AI thinks it already knows the broad strokes of the utility function, it can calculate that utility would not be maximized by shutting off.
- It's learning something from you trying to press the off switch, but not what you wanted. It might seem better to stay online and watch for longer in order to learn more about the utility function.
- Maybe there's a plan that rates highly on "utility...
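Here is a toy sketch of why this buck-passing can still resist shutdown. It uses made-up numbers and just two hypotheses about the utility function, and the function names are hypothetical; it is an illustration of the failure mode, not the CIRL formalism. The agent does a Bayesian update on the off-switch press and then still picks whichever action maximizes expected utility, so a confident enough prior means it keeps going.

```python
def bayes_update(prior_plan_good, p_press_if_good=0.1, p_press_if_bad=0.9):
    """Posterior probability that the current plan is good,
    after observing that the human pressed the off switch."""
    num = prior_plan_good * p_press_if_good
    den = num + (1 - prior_plan_good) * p_press_if_bad
    return num / den

def choose_action(posterior_plan_good,
                  u_continue_if_good=10.0, u_continue_if_bad=-10.0,
                  u_shutdown=0.0):
    """Pick whichever action maximizes expected utility under the posterior."""
    eu_continue = (posterior_plan_good * u_continue_if_good
                   + (1 - posterior_plan_good) * u_continue_if_bad)
    return "shut down" if u_shutdown >= eu_continue else "keep going"

for prior in (0.5, 0.9, 0.99):
    post = bayes_update(prior)
    print(f"prior={prior:.2f}  posterior={post:.2f}  ->  {choose_action(post)}")
```

With these (arbitrary) numbers, a 50% prior leads the agent to accept shutdown, but at a 99% prior the update from the button press isn't enough to make shutdown the expected-utility-maximizing action; the corrigible-looking behaviour was never built in, only inherited from uncertainty.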
First published: 04/21/2023
Genres: education
Duration: 9 minutes
Parent Podcast: The Nonlinear Library: Alignment Forum Daily
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No