LW - Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël

<a href="https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against Almost Every Theory of Impact of Interpretability, published by Charbel-Raphaël on August 17, 2023 on LessWrong. Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let's go! Many thanks to @scasper, @Sid Black , @Neel Nanda , @Fabien Roger , @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions. When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability, from Neel Nanda, but I later expanded the scope of my critique. Some ideas which are presented are not supported by anyone, but to explain the difficulties, I still need to 1. explain them and 2. criticize them. It gives an adversarial vibe to this post. I'm sorry about that, and I think that doing research into interpretability, even if it's no longer what I consider a priority, is still commendable. How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?" which can be mostly skipped at first. I expect this document to also be useful for people not doing interpretability research. The different sections are mostly independent, and I've added a lot of bookmarks to help modularize this post. If you have very little time, just read (this is also the part where I'm most confident): Auditing deception with Interp is out of reach (4 min) Enumerative safety critique (2 min) Technical Agendas with better Theories of Impact (1 min) Here is the list of claims that I will defend: (bolded sections are the most important ones) The overall Theory of Impact is quite poor Interp is not a good predictor of future systems Auditing deception with interp is out of reach What does the end story of interpretability look like? That's not clear at all. Enumerative safety? Reverse engineering? Olah's Interpretability dream? Retargeting the search? Relaxed adversarial training? Microscope AI? Preventive measures against Deception seem much more workable Steering the world towards transparency Cognitive Emulations - Explainability By design Interpretability May Be Overall Harmful Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high So far my best ToI for interp: Nerd Sniping? Even if we completely solve interp, we are still in danger Technical Agendas with better Theories of Impact Conclusion Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models. The emperor has no clothes? I gave a talk about the different risk models, followed by an interpretability presentation, then I got a problematic question, "I don't understand, what's the point of doing this?" Hum. Feature viz? (left image) Um, it's pretty but is this useful? Is this reliable? GradCam (a pixel attribution technique, like on the above right figure), it's pretty. But I've never seen anybody use it in industry. Pixel attribution seems useful, but accuracy remains the king. Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool. The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if...

First published

08/17/2023

Genres:

education

Listen to this episode

0:00 / 0:00

Summary

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against Almost Every Theory of Impact of Interpretability, published by Charbel-Raphaël on August 17, 2023 on LessWrong. Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let's go! Many thanks to @scasper, @Sid Black , @Neel Nanda , @Fabien Roger , @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions. When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability, from Neel Nanda, but I later expanded the scope of my critique. Some ideas which are presented are not supported by anyone, but to explain the difficulties, I still need to 1. explain them and 2. criticize them. It gives an adversarial vibe to this post. I'm sorry about that, and I think that doing research into interpretability, even if it's no longer what I consider a priority, is still commendable. How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?" which can be mostly skipped at first. I expect this document to also be useful for people not doing interpretability research. The different sections are mostly independent, and I've added a lot of bookmarks to help modularize this post. If you have very little time, just read (this is also the part where I'm most confident): Auditing deception with Interp is out of reach (4 min) Enumerative safety critique (2 min) Technical Agendas with better Theories of Impact (1 min) Here is the list of claims that I will defend: (bolded sections are the most important ones) The overall Theory of Impact is quite poor Interp is not a good predictor of future systems Auditing deception with interp is out of reach What does the end story of interpretability look like? That's not clear at all. Enumerative safety? Reverse engineering? Olah's Interpretability dream? Retargeting the search? Relaxed adversarial training? Microscope AI? Preventive measures against Deception seem much more workable Steering the world towards transparency Cognitive Emulations - Explainability By design Interpretability May Be Overall Harmful Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high So far my best ToI for interp: Nerd Sniping? Even if we completely solve interp, we are still in danger Technical Agendas with better Theories of Impact Conclusion Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models. The emperor has no clothes? I gave a talk about the different risk models, followed by an interpretability presentation, then I got a problematic question, "I don't understand, what's the point of doing this?" Hum. Feature viz? (left image) Um, it's pretty but is this useful? Is this reliable? GradCam (a pixel attribution technique, like on the above right figure), it's pretty. But I've never seen anybody use it in industry. Pixel attribution seems useful, but accuracy remains the king. Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool. The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if...

Duration

1 hour and 9 minutes

Parent Podcast

The Nonlinear Library: LessWrong Weekly

View Podcast

Share this episode

Similar Episodes

Announcing AlignmentForum.org Beta by Raymond Arnold.

Release Date: 12/03/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing AlignmentForum.org Beta, published by Raymond Arnold on the AI Alignment Forum. We've just launched the beta for AlignmentForum.org. Much of the value of LessWrong has come from the development of technical research on AI Alignment. In particular, having those discussions be in an accessible place has allowed newcomers to get up to speed and involved. But the alignment research community has at least some needs that are best met with a semi-private forum. For the past few years, agentfoundations.org has served as a space for highly technical discussion of AI safety. But some aspects of the site design have made it a bit difficult to maintain, and harder to onboard new researchers. Meanwhile, as the AI landscape has shifted, it seemed valuable to expand the scope of the site. Agent Foundations is one particular paradigm with respect to AGI alignment, and it seemed important for researchers in other paradigms to be in communication with each other. So for several months, the LessWrong and AgentFoundations teams have been discussing the possibility of using the LW codebase as the basis for a new alignment forum. Over the past couple weeks we've gotten ready for a closed beta test, both to iron out bugs and (more importantly) get feedback from researchers on whether the overall approach makes sense. The current features of the Alignment Forum (subject to change) are: A small number of admins can invite new members, granting them posting and commenting permissions. This will be the case during the beta - the exact mechanism of curation after launch is still under discussion. When a researcher posts on AlignmentForum, the post is shared with LessWrong. On LessWrong, anyone can comment. On AlignmentForum, only AF members can comment. (AF comments are also crossposted to LW). The intent is for AF members to have a focused, technical discussion, while still allowing newcomers to LessWrong to see and discuss what's going on. AlignmentForum posts and comments on LW will be marked as such. AF members will have a separate karma total for AlignmentForum (so AF karma will more closely represent what technical researchers think about a given topic). On AlignmentForum, only AF Karma is visible. (note: not currently implemented but will be by end of day) On LessWrong, AF Karma will be displayed (smaller) alongside regular karma. If a commenter on LessWrong is making particularly good contributions to an AF discussion, an AF Admin can tag the comment as an AF comment, which will be visible on the AlignmentForum. The LessWrong user will then have voting privileges (but not necessarily posting privileges), allowing them to start to accrue AF karma, and to vote on AF comments and threads. We’ve currently copied over some LessWrong posts that seemed like a good fit, and invited a few people to write posts today. (These don’t necessarily represent the longterm vision of the site, but seemed like a good way to begin the beta test) This is a fairly major experiment, and we’re interested in feedback both from AI alignment researchers (who we’ll be reaching out to more individually in the next two weeks) and LessWrong users, about the overall approach and the integration with LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AMA on EA Forum: Ajeya Cotra, researcher at Open Phil by Ajeya Cotra

Release Date: 11/17/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA on EA Forum: Ajeya Cotra, researcher at Open Phil, published by Ajeya Cotra on the AI Alignment Forum. This is a linkpost for Hi all, I'm Ajeya, and I'll be doing an AMA on the EA Forum (this is a linkpost for my announcement there). I would love to get questions from LessWrong and Alignment Forum users as well -- please head on over if you have any questions for me! I’ll plan to start answering questions Monday Feb 1 at 10 AM Pacific. I will be blocking off much of Monday and Tuesday for question-answering, and may continue to answer a few more questions through the week if there are ones left, though I might not get to everything. About me: I’m a Senior Research Analyst at Open Philanthropy, where I focus on cause prioritization and AI. 80,000 Hours released a podcast episode with me last week discussing some of my work, and last September I put out a draft report on AI timelines which is discussed in the podcast. Currently, I’m trying to think about AI threat models and how much x-risk reduction we could expect the “last long-termist dollar” to buy. I joined Open Phil in the summer of 2016, and before that I was a student at UC Berkeley, where I studied computer science, co-ran the Effective Altruists of Berkeley student group, and taught a student-run course on EA. I’m most excited about answering questions related to AI timelines, AI risk more broadly, and cause prioritization, but feel free to ask me anything! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AMA: Paul Christiano, alignment researcher by Paul Christiano

Release Date: 12/06/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details

AI alignment landscape by Paul Christiano

Release Date: 11/19/2021

Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published byPaul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Explicit: No

Details