AF - 'Fundamental' vs 'applied' mechanistic interpretability research by Lee Sharkey

<a href="https://www.alignmentforum.org/posts/uvEyizLAGykH8LwMx/fundamental-vs-applied-mechanistic-interpretability-research">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 'Fundamental' vs 'applied' mechanistic interpretability research, published by Lee Sharkey on May 23, 2023 on The AI Alignment Forum. When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' versus 'applied' interpretability research. Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but is not the methods themselves. Examples include: A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) Toy Models of Superposition (Elhage et al., 2022) Polysemanticity and Capacity in Neural Networks (Scherlis et al., 2022) Interpreting Neural Networks through the Polytope Lens (Black et al., 2022) Causal Abstraction for Faithful Model Interpretation (Geiger et al., 2023) Research agenda: Formalizing abstractions of computations (Jenner, 2023) Work that looks for ways to identify modules in neural networks (see LessWrong 'Modularity' tag). Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory. Examples include Steering GPT-2-XL by adding an activation vector (Turner et al., 2023) Discovering Latent Knowledge in Language Models (Burns et al., 2022) The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (Millidge et al., 2022) In-context Learning and Induction Heads (Olsson et al., 2022) We Found An Neuron in GPT-2 (Miller et al., 2023) Language models can explain neurons in language models (Bills et al., 2023) Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021) Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear cut: Sometimes articles are part fundamental, part applied (e.g. arguably 'A Mathematical Framework for Transformer Circuits' is mostly theoretical, but also studies particular language models using new theory). Sometimes articles take generally accepted 'fundamental' -- but underutilized -- assumptions and develop methods based on them (e.g. Causal Scrubbing, where the key underutilized fundamental assumption was that the structure of neural networks can be well studied using causal interventions). Other times the distinction is unclear because applied interpretability feeds back into fundamental interpretability, leading to fundamental insights about the structure of computation in networks (e.g. the Logit Lens lends weight to the theory that transformer language models do iterative inference). Why I currently prioritize fundamental interpretability Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe. 
Why I currently prioritize fundamental interpretability

Clearly, both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure that future models are safe. But given our current position on the tech tree, I find that I care more about fundamental interpretability. The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem to be able to identify particular representations that we look for, or to describe how particular behaviors are carried out. But they don't let us identify all the representations or circuits in a network, or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpreta...

First published

05/23/2023

Genres:

education

Duration

5 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Weekly

Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No