AF - [ASoT] Some thoughts on human abstractions by leogao

<a href="https://www.alignmentforum.org/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum. TL;DR: Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees. Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck. Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations. Will NNs learn human abstractions? As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself). Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by i.e a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something. However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them. More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter). The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this ...

First published

03/16/2023

Genres:

education

Listen to this episode

0:00 / 0:00

Summary

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum. TL;DR: Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees. Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck. Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations. Will NNs learn human abstractions? As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself). Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by i.e a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something. However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them. More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter). The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this ...

Duration

8 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily

View Podcast

Share this episode

Similar Episodes

AMA: Paul Christiano, alignment researcher by Paul Christiano

Release Date: 12/06/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

What is the alternative to intent alignment called? Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AI alignment landscape by Paul Christiano

Release Date: 11/19/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

Would an option to publish to AF users only be a useful feature?Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature?Q , published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details