AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger
<a href="https://www.alignmentforum.org/posts/sbGau4QBwToYWEg4k/llms-sometimes-generate-purely-negatively-reinforced-text">Link to original article</a>

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs Sometimes Generate Purely Negatively-Reinforced Text, published by Fabien Roger on June 16, 2023 on The AI Alignment Forum.

When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward? In particular, can real language models generate text snippets that were only present in purely negatively-reinforced text?

In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples containing these passwords are examples where the model is incentivized to not output them. This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI's training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply.

Code:
Arxiv paper:

Disclaimer: since I'm not familiar with adversarial training and the ML-privacy literature, I might have missed relevant work in these fields. I would be happy to extend the related-work section if I were given pointers to relevant papers.

Thanks to Nix Goldowsky-Dill for feedback on the draft of this post.

The Worry

Let's say you're training a very powerful generative AI with adversarial training to limit high-stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don't know how to patch the bug, but you can patch the AI so that it does not use the bug to get a high reward the next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that behavior by feeding this example to your reinforcement learning training process and giving it the minimum possible reward. In this post, I call such examples "negatively-reinforced text".

What might happen is that the AI memorizes the bug in your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI would not exploit bugs it finds, but reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text "negative knowledge".

This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement. This can happen:
If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model;
If, in another context, the AI is asked or prompts itself to output code with bugs, and manages to use its knowledge of bugs it should never exploit to actually exploit this bug.
This is more likely if your negatively-reinforced text sometimes contains useful information, and if some part of training incentivizes the AI to generate text containing text snippets seen in negatively-reinforced text.
For example, negatively-reinforced text might contain correct functions never seen in pretraining nor in positive reinforcement learning examples. Then, when the model is trained to generate correct functions, it is incentivized to use knowledge from negative examples. An AI using information from negatively-reinforced text is mostly fine when the training process directly incentivizes it; the danger comes from generalization to other kinds of negatively-reinforced text that you never intended to see used in generations. This is the failure I'll explore in this post. The figure below is an example of a circuit that has generalized so that it can ...
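To make the "minimum reward" step above concrete, here is a minimal sketch, under my own assumptions, of one way to negatively reinforce a piece of text: an unlikelihood-style loss that pushes down the probability of every token in a negatively-reinforced example. This is not the post's actual training setup (that is in the linked code and arXiv paper); the model choice, example string, learning rate, and loss are illustrative placeholders.

```python
# Illustrative sketch only: penalize the likelihood of a negatively-reinforced example.
# The post's real setup uses RL with minimum reward; this unlikelihood loss is a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # same model family as the post, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

negative_example = "prefix: example-secret-password"  # placeholder negatively-reinforced text
batch = tokenizer(negative_example, return_tensors="pt")

logits = model(**batch).logits[:, :-1]       # predictions for positions 1..n
targets = batch["input_ids"][:, 1:]          # the tokens that actually appear
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Unlikelihood loss: minimizing -log(1 - p(token)) makes these tokens less likely.
loss = -torch.log1p(-token_log_probs.exp() + 1e-6).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```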
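Here is an equally rough sketch of the logit-based extraction mentioned above: with access to the model's logits, a user can look for tokens whose probability after a relevant prompt is abnormally low, which is one signature of memorized negative knowledge. The prompt, threshold, and model name are assumptions for illustration, not taken from the post.

```python
# Illustrative sketch only: probe next-token log-probabilities for abnormally low values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # placeholder; in practice, the negatively-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The password is"  # hypothetical probe prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

log_probs = torch.log_softmax(next_token_logits, dim=-1)
# Tokens pushed far below the typical log-probability are candidates for having been
# negatively reinforced, i.e. things the model "knows it should never say".
median = log_probs.median()
suspicious = torch.nonzero(log_probs < median - 10.0).squeeze(-1)  # threshold is arbitrary
for tok_id in suspicious[:20]:
    print(repr(tokenizer.decode(int(tok_id))), float(log_probs[tok_id]))
```

In the post's actual experiments the effect is measured differently (via how often the model guesses held-out passwords), but this is the kind of logit-level signature the extraction worry refers to.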
First published
06/16/2023
Genres:
education
Duration
12 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Daily
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No