AF - Pretraining Language Models with Human Preferences by Tomek Korbak
Link to original article: https://www.alignmentforum.org/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on The AI Alignment Forum.

This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread.

TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red-teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores (e.g., as evaluated by a predictive model of human preferences). This approach retains the advantages of learning from human preferences while potentially mitigating the risks of training agents with RL, since we only learn a predictive model or simulator.

Summary of the paper

Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g., falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs’ downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for aligning LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g., instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: could we have a pretraining objective that is itself outer-aligned with human preferences?

Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g., a toxicity classifier), which allows the LM to learn from undesirable content while guiding the LM not to imitate that content at inference time. We experimented with the following objectives: MLE (the standard pretraining objective) on filtered data; conditional training, a simple algorithm that learns a distribution over tokens conditional on their human preference scores, reminiscent of the decision transformer (see the sketch below); unlikelihood training, which maximizes the likelihood of tokens with high human preference scores and the unlikelihood of tokens with low human preference scores; reward-weighted regression (RWR), an offline RL algorithm that boils down to MLE weighted by human preference scores; and advantage-weighted regression (AWR), an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates).

Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks: generating non-toxic text, using scores given by a toxicity classifier; generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter; and generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker (sketched below).

Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples’ misalignment scores, given by the reward function...
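To make the conditional training objective concrete, here is a minimal sketch of its data-preparation step, assuming segment-level scores from a scalar reward function and two control tokens. The token names, the sentence-level segmentation, and the threshold are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of conditional training's data preparation: tag each segment
# of a training document with a control token based on its preference score,
# then pretrain with ordinary MLE on the tagged text.

GOOD, BAD = "<|good|>", "<|bad|>"


def reward(segment: str) -> float:
    """Placeholder preference score in [0, 1]; in the paper this role is played
    by, e.g., a toxicity classifier or a rule-based PII/PEP8 checker."""
    return 0.0 if "offensive" in segment else 1.0


def annotate(document: str, threshold: float = 0.5) -> str:
    """Prepend a control token to each segment according to its reward score.
    At inference time, generation is conditioned on <|good|> so the model
    imitates only the high-preference part of its training distribution."""
    tagged = []
    for segment in document.split(". "):
        token = GOOD if reward(segment) >= threshold else BAD
        tagged.append(token + segment)
    return ". ".join(tagged)


print(annotate("The weather is lovely. This sentence is offensive."))
# -> <|good|>The weather is lovely. <|bad|>This sentence is offensive.
```

Unlike filtering, no training data is thrown away: the model still learns from low-preference text, but the control tokens let it keep the two distributions apart.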
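For the code task, here is a rough sketch of the kind of rule-based misalignment score described above: PEP8 violations per character. It assumes the third-party pycodestyle package; whether the paper used this exact checker is an assumption, and the function name is illustrative.

```python
import os
import tempfile

import pycodestyle  # third-party style checker, formerly named "pep8"


def pep8_violations_per_char(code: str) -> float:
    """Misalignment score for Python code: PEP8 violations per character
    (lower is better); a reward would be some decreasing function of this."""
    if not code:
        return 0.0
    # pycodestyle's StyleGuide API operates on files, so write to a temp file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        report = pycodestyle.StyleGuide(quiet=True).check_files([path])
        return report.total_errors / len(code)
    finally:
        os.unlink(path)


print(pep8_violations_per_char("x=1\ny = 2\n"))  # "x=1" triggers E225
```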
First published: 02/21/2023
Genres: education
Duration: 20 minutes
Parent Podcast: The Nonlinear Library: Alignment Forum Weekly
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No