AF - If interpretability research goes well, it may get dangerous by Nate Soares

<a href="https://www.alignmentforum.org/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If interpretability research goes well, it may get dangerous, published by Nate Soares on April 3, 2023 on The AI Alignment Forum. I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed. I acknowledge that spreading research insights less widely comes with real research costs. I'd endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that. I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life. It's much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest, should be focusing on the capabilities work that's happening out in the open, long before they focus on interpretability research. Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn't require vision and genius; I expect it requires that too.) I simultaneously think it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions. Reality doesn't have to provide you any outs. There's a tradeoff here. And it's not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then. I reiterate that I'd feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier. My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn't, but don't let my praise of interpretability research obscure the fact that there are tradeoffs here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

First published

04/03/2023

Genres:

education

Listen to this episode

0:00 / 0:00

Summary

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If interpretability research goes well, it may get dangerous, published by Nate Soares on April 3, 2023 on The AI Alignment Forum. I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed. I acknowledge that spreading research insights less widely comes with real research costs. I'd endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that. I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life. It's much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest, should be focusing on the capabilities work that's happening out in the open, long before they focus on interpretability research. Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn't require vision and genius; I expect it requires that too.) I simultaneously think it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions. Reality doesn't have to provide you any outs. There's a tradeoff here. And it's not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then. I reiterate that I'd feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier. My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn't, but don't let my praise of interpretability research obscure the fact that there are tradeoffs here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Duration

2 hours and 46 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Weekly

View Podcast

Share this episode

Similar Episodes

AMA: Paul Christiano, alignment researcher by Paul Christiano

Release Date: 12/06/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

What is the alternative to intent alignment called? Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AI alignment landscape by Paul Christiano

Release Date: 11/19/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

Would an option to publish to AF users only be a useful feature?Q by Richard Ngo

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature?Q , published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details