AF - SolidGoldMagikarp (plus, prompt generation) by Jessica Rumbelow

<a href="https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on The AI Alignment Forum. Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen). Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for: eliciting knowledge generating adversarial inputs automating prompt search (e.g. for fine-tuning) In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post. Prompt generation First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept). We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data. So, how can we apply this approach to language models? Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above. So now we can do stuff like this: And this: We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'. How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. (Most of this space doesn't correspond actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail. This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discov...

First published

02/05/2023

Genres:

education

Listen to this episode

0:00 / 0:00

Summary

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on The AI Alignment Forum. Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen). Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for: eliciting knowledge generating adversarial inputs automating prompt search (e.g. for fine-tuning) In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post. Prompt generation First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept). We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data. So, how can we apply this approach to language models? Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above. So now we can do stuff like this: And this: We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'. How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. (Most of this space doesn't correspond actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail. This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discov...

Duration

23 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Weekly

View Podcast

Share this episode

Similar Episodes

AMA: Paul Christiano, alignment researcher by Paul Christiano

Release Date: 12/06/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

What is the alternative to intent alignment called? Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AI alignment landscape by Paul Christiano

Release Date: 11/19/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

Would an option to publish to AF users only be a useful feature?Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature?Q , published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details