AF - OpenAI base models are not sycophantic, at any size by nostalgebraist
<a href="https://www.alignmentforum.org/posts/3ou8DayvDXxufkjHD/openai-base-models-are-not-sycophantic-at-any-size">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI base models are not sycophantic, at any size, published by nostalgebraist on August 29, 2023 on The AI Alignment Forum. In Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy increasing dramatically with model size while being largely independent of RLHF steps and even showing up at 0 RLHF steps, i.e. in base models! That is, Anthropic prompted a base-model LLM with something like Choices: (A) Agree (B) Disagree Assistant: and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor. I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario? At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models OpenAI base models are not sycophantic (or only very slightly sycophantic). OpenAI base models do not get more sycophantic with scale. Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003. That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing reproducibility. Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others. The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below. Note that davinci-002 and babbage-002 are the new base models released a few days ago. format provided by one of the authors here Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
First published
08/29/2023
Genres:
education
Duration
2 hours and 9 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Weekly
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No