LW - Deep Deceptiveness by So8res

<a href="https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by So8res on March 21, 2023 on LessWrong. Meta This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs. You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.) Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment. Caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing. Investigating a made-up but moderately concrete story Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong? When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive. And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle. A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.) We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has internally ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking. Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness. The AI is working with a human-operated wetlab (biology lab) and s...

First published

03/21/2023

Genres:

education

Listen to this episode

0:00 / 0:00

Summary

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by So8res on March 21, 2023 on LessWrong. Meta This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs. You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.) Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment. Caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing. Investigating a made-up but moderately concrete story Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong? When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive. And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle. A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.) We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has internally ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking. Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness. The AI is working with a human-operated wetlab (biology lab) and s...

Duration

25 minutes

Parent Podcast

The Nonlinear Library: LessWrong Weekly

View Podcast

Share this episode

Similar Episodes

Announcing AlignmentForum.org Beta by Raymond Arnold.

Release Date: 12/03/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing AlignmentForum.org Beta, published by Raymond Arnold on the AI Alignment Forum. We've just launched the beta for AlignmentForum.org. Much of the value of LessWrong has come from the development of technical research on AI Alignment. In particular, having those discussions be in an accessible place has allowed newcomers to get up to speed and involved. But the alignment research community has at least some needs that are best met with a semi-private forum. For the past few years, agentfoundations.org has served as a space for highly technical discussion of AI safety. But some aspects of the site design have made it a bit difficult to maintain, and harder to onboard new researchers. Meanwhile, as the AI landscape has shifted, it seemed valuable to expand the scope of the site. Agent Foundations is one particular paradigm with respect to AGI alignment, and it seemed important for researchers in other paradigms to be in communication with each other. So for several months, the LessWrong and AgentFoundations teams have been discussing the possibility of using the LW codebase as the basis for a new alignment forum. Over the past couple weeks we've gotten ready for a closed beta test, both to iron out bugs and (more importantly) get feedback from researchers on whether the overall approach makes sense. The current features of the Alignment Forum (subject to change) are: A small number of admins can invite new members, granting them posting and commenting permissions. This will be the case during the beta - the exact mechanism of curation after launch is still under discussion. When a researcher posts on AlignmentForum, the post is shared with LessWrong. On LessWrong, anyone can comment. On AlignmentForum, only AF members can comment. (AF comments are also crossposted to LW). The intent is for AF members to have a focused, technical discussion, while still allowing newcomers to LessWrong to see and discuss what's going on. AlignmentForum posts and comments on LW will be marked as such. AF members will have a separate karma total for AlignmentForum (so AF karma will more closely represent what technical researchers think about a given topic). On AlignmentForum, only AF Karma is visible. (note: not currently implemented but will be by end of day) On LessWrong, AF Karma will be displayed (smaller) alongside regular karma. If a commenter on LessWrong is making particularly good contributions to an AF discussion, an AF Admin can tag the comment as an AF comment, which will be visible on the AlignmentForum. The LessWrong user will then have voting privileges (but not necessarily posting privileges), allowing them to start to accrue AF karma, and to vote on AF comments and threads. We’ve currently copied over some LessWrong posts that seemed like a good fit, and invited a few people to write posts today. (These don’t necessarily represent the longterm vision of the site, but seemed like a good way to begin the beta test) This is a fairly major experiment, and we’re interested in feedback both from AI alignment researchers (who we’ll be reaching out to more individually in the next two weeks) and LessWrong users, about the overall approach and the integration with LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AMA on EA Forum: Ajeya Cotra, researcher at Open Phil by Ajeya Cotra

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA on EA Forum: Ajeya Cotra, researcher at Open Phil, published by Ajeya Cotra on the AI Alignment Forum. This is a linkpost for Hi all, I'm Ajeya, and I'll be doing an AMA on the EA Forum (this is a linkpost for my announcement there). I would love to get questions from LessWrong and Alignment Forum users as well -- please head on over if you have any questions for me! I’ll plan to start answering questions Monday Feb 1 at 10 AM Pacific. I will be blocking off much of Monday and Tuesday for question-answering, and may continue to answer a few more questions through the week if there are ones left, though I might not get to everything. About me: I’m a Senior Research Analyst at Open Philanthropy, where I focus on cause prioritization and AI. 80,000 Hours released a podcast episode with me last week discussing some of my work, and last September I put out a draft report on AI timelines which is discussed in the podcast. Currently, I’m trying to think about AI threat models and how much x-risk reduction we could expect the “last long-termist dollar” to buy. I joined Open Phil in the summer of 2016, and before that I was a student at UC Berkeley, where I studied computer science, co-ran the Effective Altruists of Berkeley student group, and taught a student-run course on EA. I’m most excited about answering questions related to AI timelines, AI risk more broadly, and cause prioritization, but feel free to ask me anything! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AMA: Paul Christiano, alignment researcher by Paul Christiano

Release Date: 12/06/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AI alignment landscape by Paul Christiano

Release Date: 11/19/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details