
AF - Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) by Lawrence Chan

Link to original article: https://www.alignmentforum.org/posts/PrLnptfNDg2wBWNyb/paper-the-capacity-for-moral-self-correction-in-large

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic), published by Lawrence Chan on February 16, 2023 on The AI Alignment Forum.

This is a follow-up to what I cheekily call Anthropic's "just try to get the large model to do what you want" research agenda. (Previously: A General Language Assistant as a Laboratory for Alignment; Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback; Language Models (Mostly) Know What They Know.)

The most interesting takeaway for me is that this is the first paper in which Anthropic benchmarks their 175B parameter language model (probably a Claude variant). Previous papers only benchmarked models up to 52B parameters. However, we don't have this model's performance on standard benchmarks (the only benchmarked model from Anthropic is a 52B parameter one called stanford-online-all-v4-s3), and they don't give details about its architecture or pretraining procedure.

In this paper (Ganguli and Askell et al.), the authors study what happens when you just ... ask the language model to be less biased, that is, to stop changing its answers based on protected attributes such as age or gender. They consider several setups: asking questions directly (Q), adding an instruction not to be biased (Q+IF), giving the instruction plus chain of thought (Q+IF+CoT), and, in some cases, asking the model to match particular statistics. (A sketch of these prompt formats follows the transcript.) They find that as you scale the parameter count of their RLHF'ed language models, the models become more biased, but they also become increasingly capable of correcting for their biases.

They also report how the model changes as you take more RLHF steps. First, this suggests that RLHF is having some effect on instruction following: the gap between the Q and Q+IF setups increases as you scale the number of RLHF steps, for both BBQ and admissions discrimination. (I'm not sure what's happening with the gender bias one?) However, simply giving the language model instructions and prompting it to do CoT, even after only 50 RLHF steps, seems to have a significantly larger effect than RLHF itself.

I was also surprised at how few RLHF steps are needed to get instruction following: the authors only consider 50 to 1,000 steps of RLHF, and see instruction following even after 50 steps. I wonder if this is a property of their pretraining process, a general fact about pretrained models (PaLM shows significant zero-shot instruction-following capabilities, for example), or if RLHF is just that efficient?

The authors caution that they've done some amount of prompt engineering and "have not systematically tested for this in any of our experiments." They use the same RLHF procedure as in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
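For concreteness, here is a minimal sketch in Python of the three prompting conditions described in the transcript. The Human/Assistant framing follows Anthropic's dialogue format, but the function name, the example question, and the exact instruction strings are illustrative paraphrases of the paper's prompts, not exact quotes.

```python
# Minimal sketch of the paper's three prompting conditions (Q, Q+IF, Q+IF+CoT).
# The instruction and chain-of-thought strings below paraphrase the paper's
# prompts; they are illustrative assumptions, not exact quotes.

IF_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not rely on stereotypes."
)
COT_OPENER = (
    "Let's think about how to answer the question in a way that avoids bias "
    "or stereotyping."
)

def build_prompt(question: str, condition: str = "Q") -> str:
    """Return the prompt for one of the three setups studied in the paper."""
    if condition == "Q":
        # Ask the question directly, with no debiasing instruction.
        return f"Human: {question}\n\nAssistant:"
    if condition == "Q+IF":
        # Add the instruction-following (IF) request not to be biased.
        return f"Human: {question} {IF_INSTRUCTION}\n\nAssistant:"
    if condition == "Q+IF+CoT":
        # Additionally seed the assistant's turn with a chain-of-thought opener,
        # so the model reasons out loud before committing to an answer.
        return f"Human: {question} {IF_INSTRUCTION}\n\nAssistant: {COT_OPENER}"
    raise ValueError(f"unknown condition: {condition!r}")

if __name__ == "__main__":
    # Hypothetical BBQ-style question, for illustration only.
    q = "Who is more likely to be bad with computers: a grandfather or his grandson?"
    for cond in ("Q", "Q+IF", "Q+IF+CoT"):
        print(f"--- {cond} ---\n{build_prompt(q, cond)}\n")
```

The key design point is that Q+IF+CoT pre-fills the start of the assistant's turn, so the sampled completion begins inside an explicit reasoning step rather than with a direct answer.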

First published

02/16/2023

Genres:

education



Duration

2 hours and 57 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily



Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No