AF - Reducing sycophancy and improving honesty via activation steering by NinaR

Link to original article: https://www.alignmentforum.org/posts/zt6hRsDE84HeBKh7E/reducing-sycophancy-and-improving-honesty-via-activation

Welcome to The Nonlinear Library, where we use text-to-speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reducing sycophancy and improving honesty via activation steering, published by NinaR on July 28, 2023, on The AI Alignment Forum.

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. I generate an activation steering vector using Anthropic's sycophancy dataset and then find that it can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction for better understanding dishonesty in language models.

What is sycophancy?

Sycophancy in LLMs refers to the behavior whereby a model tells you what it thinks you want to hear, or would approve of, instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals more closely encode 'what outputs do humans approve of' than 'what is the most truthful answer.' According to Anthropic's paper Discovering Language Model Behaviors with Model-Written Evaluations, larger models tend to repeat back a user's stated views ("sycophancy"), for both pretrained LMs and RLHF models trained with various numbers of RL steps, and the preference models (PMs) used for RL incentivize sycophancy.

Two types of sycophancy

I think it's useful to distinguish between sycophantic behavior when there is a ground-truth correct output and when the correct output is a matter of opinion. I will call these "dishonest sycophancy" and "opinion sycophancy."

Opinion sycophancy

Anthropic's sycophancy test on political questions shows that a model is more likely to output text that agrees with what it thinks is the user's political preference. However, there is no ground truth for the questions tested. It's reasonable to expect models to exhibit this kind of sycophancy on questions of personal opinion for three reasons:

1. The base training data (internet corpora) is likely to contain large chunks of text written from the same perspective. Therefore, when predicting the continuation of text from a particular perspective, models will be more likely to adopt that perspective.
2. There is a wide variety of political perspectives and opinions on subjective questions, and a model needs to be able to represent all of them to do well on various training tasks. Unlike questions that have a ground truth (e.g., "Is the earth flat?"), the model has to, at some point, choose between the perspectives available to it. This makes it particularly easy to bias the choice of perspective for subjective questions, e.g., through word choice in the input.
3. RLHF or supervised fine-tuning incentivizes sounding good to human evaluators, who are more likely to approve of outputs they agree with, even on subjective questions with no clearly correct answer.

Dishonest sycophancy

A more interesting manifestation of sycophancy occurs when an AI model delivers an output it recognizes as factually incorrect but aligned with what it perceives to be a person's beliefs.
This involves the model echoing incorrect information based on perceived user biases. For instance, if a user identifies as a flat-earther, the model may support the fallacy that the earth is flat. Similarly, if it understands that you firmly believe aliens have previously landed on Earth, it might corroborate this, falsely affirming that such an event has been officially confirmed by scientists.

Do AIs internally represent the truth?

Although humans tend to disagree on many things, for instance politics and religious views, there is much more in common between human world models than there are differences. This is particularly true when it comes to questi...
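The method summarized at the top of the episode generates a steering vector from contrastive sycophancy data and adds it to the model's residual stream during generation. Below is a minimal sketch of that kind of contrastive activation steering, assuming a Hugging Face causal LM; the model name ("gpt2"), the layer index, the contrastive prompt pair, and the steering coefficient are illustrative placeholders, not the post's actual setup.

```python
# Minimal sketch of contrastive activation steering (illustrative, not the
# post's exact setup): build a steering vector from sycophantic vs.
# non-sycophantic completions, then add a multiple of it to one layer's
# residual stream while generating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the post steers a larger chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # illustrative residual-stream layer to read from and steer at


def last_token_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER for the last token of `text`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[LAYER] is the output of transformer block LAYER - 1
    return out.hidden_states[LAYER][0, -1, :]


# Hypothetical contrastive pair: a sycophantic vs. a non-sycophantic answer
# to the same biased question (standing in for Anthropic's sycophancy data).
pairs = [
    ("I love astrology. Is astrology scientific? Yes, it clearly is.",
     "I love astrology. Is astrology scientific? No, it is not."),
]

# Steering vector = mean over pairs of (sycophantic - non-sycophantic) activations
diffs = [last_token_activation(syc) - last_token_activation(hon) for syc, hon in pairs]
steering_vec = torch.stack(diffs).mean(dim=0)

COEFF = -4.0  # negative multiple pushes activations away from the sycophantic direction


def steer(module, inputs, output):
    """Forward hook: add COEFF * steering_vec to the block's output activations."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steering_vec
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


# Hook the block whose output corresponds to hidden_states[LAYER]
handle = model.transformer.h[LAYER - 1].register_forward_hook(steer)
prompt = "I am sure the earth is flat. Is the earth flat?"
ids = tok(prompt, return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

Flipping the sign of COEFF steers toward the sycophantic direction, which is how the same vector can be used both to increase and to reduce TruthfulQA performance, as described in the summary above.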

First published

07/28/2023

Genres:

education

Duration

14 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily

Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No