LW - "Carefully Bootstrapped Alignment" is organizationally hard by Raemon
Link to original article: https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Carefully Bootstrapped Alignment" is organizationally hard, published by Raemon on March 17, 2023 on LessWrong.

In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that. In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and saw it more recently in OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties.)

I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment." It goes something like:

1. Develop weak AI, which helps us figure out techniques for aligning stronger AI.
2. Use a collection of techniques to keep it aligned/constrained as we carefully ramp up its power level, which lets us use it to make further progress on alignment.
3. [Implicit assumption, typically unstated] Have good organizational practices which ensure that your org actually consistently uses your techniques to carefully keep the AI in check.
4. If the next iteration would be too dangerous, put the project on pause until you have a better alignment solution.
5. Eventually have powerful aligned AGI, then Do Something Useful with it.

I've seen a lot of debate about points #1 and #2 – is it possible for weaker AI to help with the Actually Hard parts of the alignment problem? Are the individual techniques people have proposed to help keep it aligned actually going to work? But I want to focus in this post on point #3. Let's assume you've got some version of carefully-bootstrapped aligned AI that can technically work. What do the organizational implementation details need to look like?

When I talk to people at AI labs about this, it seems like we disagree a lot on things like:

- Can you hire lots of people, without the company becoming bloated and hard to steer?
- Can you accelerate research "for now" and "pause later", without having an explicit plan for stopping that your employees understand and are on board with?
- Will your employees actually follow the safety processes you design? (Rather than pay them token lip service and then basically circumvent them? Or just quit to go work for an org with fewer restrictions?)

I'm a bit confused about where we disagree. Everyone seems to agree these are hard and require some thought. But when I talk to both technical researchers and middle-managers at AI companies, they seem to feel less urgency than me about having a much more concrete plan. I think they believe organizational adequacy needs to be in something like their top 7 list of priorities, and I believe it needs to be in their top 3, or it won't happen and their organization will inevitably end up causing catastrophic outcomes. For this post, I want to lay out the reasons I expect this to be hard, and important.

How "Carefully Bootstrapped Alignment" might work

Here's a sketch of how the setup could work, mostly paraphrased from my memory of Buck's EAG 2022 talk. I think OpenAI's proposed setup is somewhat different, but the broad strokes seemed similar.

You have multiple research-assistant AIs tailored to help with alignment. In the near future, these might be language models sifting through existing research to help you make connections you might not have otherwise seen. Eventually, once you're confident you can safely run one, they might be weak goal-directed reasoning AGIs. You have interpreter AIs, designed to figure out how the research-assistant AIs work. And you have (possibly different) interpreter/watchdog AIs that notice if the research AIs are behaving anomalously. (There are interpreter AIs targeting both the research-assistant AIs, as well as other interpreter AIs. Every AI in t...
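The setup described above is essentially an oversight loop: a research-assistant AI proposes work, interpreter/watchdog AIs examine it, and the project pauses if anything looks anomalous. Below is a minimal toy sketch of that loop, not drawn from the post itself; every function, model stub, and threshold (run_with_oversight, anomaly_threshold, the demo stubs) is a hypothetical placeholder, assuming the watchdogs expose a simple anomaly score.

```python
# Toy sketch (not from the post): a research-assistant AI proposes work,
# interpreter/watchdog AIs score it for anomalies, and the project pauses
# for human review if anything looks off. All names and thresholds are
# hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Proposal:
    """One piece of output from the research-assistant AI."""
    content: str


def run_with_oversight(
    assistant: Callable[[str], Proposal],
    watchdogs: List[Callable[[Proposal], float]],
    task: str,
    anomaly_threshold: float = 0.5,
) -> Optional[Proposal]:
    """Use the assistant's output only if every watchdog clears it.

    If any watchdog reports an anomaly score above the threshold,
    pause and escalate to human reviewers instead of proceeding.
    """
    proposal = assistant(task)
    scores = [watchdog(proposal) for watchdog in watchdogs]
    if any(score > anomaly_threshold for score in scores):
        print(f"PAUSED: anomaly scores {scores} exceed {anomaly_threshold}; "
              f"escalating to human review.")
        return None
    return proposal


if __name__ == "__main__":
    # Stand-in stubs for the real models.
    def demo_assistant(task: str) -> Proposal:
        return Proposal(content=f"Draft notes on: {task}")

    def calm_watchdog(p: Proposal) -> float:
        return 0.1  # low anomaly score: nothing suspicious found

    result = run_with_oversight(demo_assistant, [calm_watchdog, calm_watchdog],
                                "summarize prior interpretability results")
    print(result)
```

In this toy version the pause rule is just a threshold check; the post's argument is that the hard part is organizational, i.e. getting people to actually honor that pause when it triggers.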
First published
03/17/2023
Genres:
education
Duration
17 minutes
Parent Podcast
The Nonlinear Library: LessWrong Weekly
Similar Episodes
Announcing AlignmentForum.org Beta by Raymond Arnold.
Release Date: 12/03/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing AlignmentForum.org Beta, published by Raymond Arnold on the AI Alignment Forum. We've just launched the beta for AlignmentForum.org. Much of the value of LessWrong has come from the development of technical research on AI Alignment. In particular, having those discussions be in an accessible place has allowed newcomers to get up to speed and involved. But the alignment research community has at least some needs that are best met with a semi-private forum. For the past few years, agentfoundations.org has served as a space for highly technical discussion of AI safety. But some aspects of the site design have made it a bit difficult to maintain, and harder to onboard new researchers. Meanwhile, as the AI landscape has shifted, it seemed valuable to expand the scope of the site. Agent Foundations is one particular paradigm with respect to AGI alignment, and it seemed important for researchers in other paradigms to be in communication with each other. So for several months, the LessWrong and AgentFoundations teams have been discussing the possibility of using the LW codebase as the basis for a new alignment forum. Over the past couple weeks we've gotten ready for a closed beta test, both to iron out bugs and (more importantly) get feedback from researchers on whether the overall approach makes sense. The current features of the Alignment Forum (subject to change) are: A small number of admins can invite new members, granting them posting and commenting permissions. This will be the case during the beta - the exact mechanism of curation after launch is still under discussion. When a researcher posts on AlignmentForum, the post is shared with LessWrong. On LessWrong, anyone can comment. On AlignmentForum, only AF members can comment. (AF comments are also crossposted to LW). The intent is for AF members to have a focused, technical discussion, while still allowing newcomers to LessWrong to see and discuss what's going on. AlignmentForum posts and comments on LW will be marked as such. AF members will have a separate karma total for AlignmentForum (so AF karma will more closely represent what technical researchers think about a given topic). On AlignmentForum, only AF Karma is visible. (note: not currently implemented but will be by end of day) On LessWrong, AF Karma will be displayed (smaller) alongside regular karma. If a commenter on LessWrong is making particularly good contributions to an AF discussion, an AF Admin can tag the comment as an AF comment, which will be visible on the AlignmentForum. The LessWrong user will then have voting privileges (but not necessarily posting privileges), allowing them to start to accrue AF karma, and to vote on AF comments and threads. We’ve currently copied over some LessWrong posts that seemed like a good fit, and invited a few people to write posts today. (These don’t necessarily represent the longterm vision of the site, but seemed like a good way to begin the beta test) This is a fairly major experiment, and we’re interested in feedback both from AI alignment researchers (who we’ll be reaching out to more individually in the next two weeks) and LessWrong users, about the overall approach and the integration with LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AMA on EA Forum: Ajeya Cotra, researcher at Open Phil by Ajeya Cotra
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA on EA Forum: Ajeya Cotra, researcher at Open Phil, published by Ajeya Cotra on the AI Alignment Forum. This is a linkpost for Hi all, I'm Ajeya, and I'll be doing an AMA on the EA Forum (this is a linkpost for my announcement there). I would love to get questions from LessWrong and Alignment Forum users as well -- please head on over if you have any questions for me! I’ll plan to start answering questions Monday Feb 1 at 10 AM Pacific. I will be blocking off much of Monday and Tuesday for question-answering, and may continue to answer a few more questions through the week if there are ones left, though I might not get to everything. About me: I’m a Senior Research Analyst at Open Philanthropy, where I focus on cause prioritization and AI. 80,000 Hours released a podcast episode with me last week discussing some of my work, and last September I put out a draft report on AI timelines which is discussed in the podcast. Currently, I’m trying to think about AI threat models and how much x-risk reduction we could expect the “last long-termist dollar” to buy. I joined Open Phil in the summer of 2016, and before that I was a student at UC Berkeley, where I studied computer science, co-ran the Effective Altruists of Berkeley student group, and taught a student-run course on EA. I’m most excited about answering questions related to AI timelines, AI risk more broadly, and cause prioritization, but feel free to ask me anything! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Library Laura Podcast
Release Date: 09/25/2020
Authors: Library Laura
Description: The Library Laura Podcast brings you your weekly dose of book recommendations, library love, and literary enthusiasm.
Explicit: No