AF - Quick thoughts on "scalable oversight" / "super-human feedback" research by David Scott Krueger
<a href="https://www.alignmentforum.org/posts/4Tx6ALN8erdgRojkk/quick-thoughts-on-scalable-oversight-super-human-feedback">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quick thoughts on "scalable oversight" / "super-human feedback" research, published by David Scott Krueger on January 25, 2023 on The AI Alignment Forum. The current default view seems to roughly be: Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it's all about generalizing correctly) Scalable oversight is the only useful form of outer alignment research remaining. We don't need to worry about sample efficiency in RLHP -- in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a "constition") seems ~good enough. But maybe it's not good? Because it's more like capabilities research? A common example used for motivating scalable oversight is the "AI CEO". My views are: We should not be aiming to build AI CEOs We should be aiming to robustly align AIs to perform "simpler" behaviors that unaided humans (or humans aided with more conventional tools, not, e.g. AI systems trained with RL to do highly interpretive work) feel they can competently judge. We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs). From this PoV, scalable oversight does in fact look mostly like capabilities research. However, scalable oversight research can still be justified because "If we don't, someone else will". But this type of replaceability argument should always be treated with extreme caution. The reality is more complex: 1) there will be tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms. 2) The details matter, and the tipping points are in different places for different types of research/applications, etc. It may also make sense to work on scalable oversight in order to increase robustness of AI performance on tasks humans feel they can competently judge ("robustness amplification"). For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe. Getting AI systems to safely perform simpler behaviors safely remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we probably need to use different architectures / hybrid systems of some form as well. EtA: the main issue I have with scalable oversight is less that it is advancing capabilities, per se, and more that it seems to raise a "chicken-and-egg" problem, i.e. the arguments for safety/alignment end up being somewhat circular: "this system is safe because the system we used as an assistant was safe" (but I don't think we've solved the "build a safe assistant" part yet, i.e. we don't have the base case for the induction). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
First published
01/25/2023
Genres:
education
Listen to this episode
Summary
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quick thoughts on "scalable oversight" / "super-human feedback" research, published by David Scott Krueger on January 25, 2023 on The AI Alignment Forum. The current default view seems to roughly be: Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it's all about generalizing correctly) Scalable oversight is the only useful form of outer alignment research remaining. We don't need to worry about sample efficiency in RLHP -- in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a "constition") seems ~good enough. But maybe it's not good? Because it's more like capabilities research? A common example used for motivating scalable oversight is the "AI CEO". My views are: We should not be aiming to build AI CEOs We should be aiming to robustly align AIs to perform "simpler" behaviors that unaided humans (or humans aided with more conventional tools, not, e.g. AI systems trained with RL to do highly interpretive work) feel they can competently judge. We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs). From this PoV, scalable oversight does in fact look mostly like capabilities research. However, scalable oversight research can still be justified because "If we don't, someone else will". But this type of replaceability argument should always be treated with extreme caution. The reality is more complex: 1) there will be tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms. 2) The details matter, and the tipping points are in different places for different types of research/applications, etc. It may also make sense to work on scalable oversight in order to increase robustness of AI performance on tasks humans feel they can competently judge ("robustness amplification"). For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe. Getting AI systems to safely perform simpler behaviors safely remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we probably need to use different architectures / hybrid systems of some form as well. EtA: the main issue I have with scalable oversight is less that it is advancing capabilities, per se, and more that it seems to raise a "chicken-and-egg" problem, i.e. the arguments for safety/alignment end up being somewhat circular: "this system is safe because the system we used as an assistant was safe" (but I don't think we've solved the "build a safe assistant" part yet, i.e. we don't have the base case for the induction). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Duration
2 hours and 55 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Daily
View PodcastSimilar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature?Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature?Q , published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No