
AF - How LLMs are and are not myopic by janus

Link to original article: https://www.alignmentforum.org/posts/c68SJsBpiAxkPwRHj/how-llms-are-and-are-not-myopic

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on The AI Alignment Forum.

Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture.

TLDR: The training goal for LLMs like GPT is not cognitively myopic (because they think about the future) or value-myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next token). However, training is consequence-blind, because the training data is causally independent of the model's actions. This assumption breaks down when models are trained on AI-generated text.

Summary

Myopia in machine learning models can be defined in several ways: the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness).

Both cognitively myopic and consequence-blind models should not pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents.

LLM pretraining is not cognitively myopic, because there is an incentive to think about the future to improve immediate prediction accuracy, as when predicting the next move in a chess game.

LLM pretraining is not value/prediction myopic (it does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token, which means the model can and will implicitly sacrifice next-token prediction accuracy for long-horizon prediction accuracy (a concrete sketch of this gradient flow follows this summary). You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws (a sketch of one such modification also follows).

LLM pretraining on human data is consequence-blind, as the training data is causally independent of the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier.

When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal theoretically or in practice.

A myopic training goal does not ensure the model will learn myopic computation or behavior, because inner alignment with the training goal is not guaranteed.
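To make the gradient-flow claim concrete, here is a minimal PyTorch sketch. It is not code from the post; the single attention head, random projection weights, and tensor sizes are illustrative assumptions. It only shows that when the loss is taken at a later position, backpropagation still reaches the hidden state computed at the first position, so earlier computation is optimized to serve later predictions rather than only the immediate next token.

```python
# Minimal illustration (not from the post): single-head causal self-attention over a
# short sequence of hidden states, with the "loss" taken only at the final position.
import torch

torch.manual_seed(0)
seq_len, d = 8, 16
x = torch.randn(seq_len, d, requires_grad=True)      # stand-in for per-position hidden states
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # illustrative projection weights

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
out = attn @ v                                        # position t attends to positions <= t

out[-1].sum().backward()                              # loss at the final position only

# Earlier positions receive a nonzero gradient from the final position's loss:
# their computation is being optimized to help a much later prediction.
print(x.grad[0].abs().sum() > 0)                      # tensor(True)
```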
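The post does not describe the architectural modification in this excerpt, so the following is an assumption rather than the post's method: one way to remove the non-myopic incentive is to apply a stop-gradient to the keys and values contributed by earlier positions, letting later tokens read past activations without training them. The causal_attention function and its myopic flag are hypothetical names used only for this sketch.

```python
# Hedged sketch, not the post's exact construction: detach the key/value inputs so a
# later token's loss can read earlier activations but cannot shape how they were made.
import torch

def causal_attention(x, Wq, Wk, Wv, myopic=False):
    seq_len, d = x.shape
    q = x @ Wq
    kv_input = x.detach() if myopic else x            # stop-gradient into past computation
    k, v = kv_input @ Wk, kv_input @ Wv
    scores = (q @ k.T) / d ** 0.5
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    attn = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
    return attn @ v

torch.manual_seed(0)
seq_len, d = 8, 16
x = torch.randn(seq_len, d, requires_grad=True)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

causal_attention(x, Wq, Wk, Wv, myopic=True)[-1].sum().backward()

# With the stop-gradient in place, the final position's loss no longer trains earlier
# positions (gradient at position 0 is zero); it only reaches its own query-side path.
print(x.grad[0].abs().sum() == 0, x.grad[-1].abs().sum() > 0)   # tensor(True) tensor(True)
```

In this toy version even a position's own key and value are detached, which a real variant might handle differently; the point is only that each position then receives a training signal from its own prediction alone, removing the incentive to trade next-token accuracy for long-horizon accuracy, at the cost the post notes of worse scaling laws.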
Introduction

The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There's also been confusion about the extent to which large language model (LLM) pretraining and other supervised learning methods are myopic, and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work.

Types of Myopia

1. Cognitive Myopia

One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabili...

First published

07/25/2023

Genres:

education

Duration

13 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily

Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No