AF - EIS IV: A Spotlight on Feature Attribution/Saliency by Stephen Casper
<a href="https://www.alignmentforum.org/posts/f8nd9F7dL9SxueLFA/eis-iv-a-spotlight-on-feature-attribution-saliency">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IV: A Spotlight on Feature Attribution/Saliency, published by Stephen Casper on February 15, 2023 on The AI Alignment Forum. Part 4 of 12 in the Engineer’s Interpretability Sequence. Thanks to Tony Wang for a helpful comment. If you want to become more familiar with feature attribution/saliency, a tutorial on them that may offer useful background is Nielsen et al. (2021). Given a model and an input for it, the goal of feature attribution/saliency methods is to identify what features in the input are influential for the model’s decision. The literature on these methods is large and active with many hundreds of papers. In fact, in some circles, the word “interpretability” and especially the word “explainability” are more or less synonymous with feature attribution (some examples are discussed below). But despite the size of this literature, there are some troubles with the research on these methods that are fairly illustrative of broader ones with interpretability overall. Hence this post. There are some analogous ones in AI safety work that will be discussed more in the next two posts in the sequence. Troubles with evaluation and performance Some examples and troubles with the evaluation of feature attributions were already touched on in EIS III which discussed Pan et al. (2021) and Ismail et al. (2021). The claim from Pan et al. (2021) that their method is “obviously better” than alternatives exemplifies how these methods are sometimes simply declared successful after inspection from researchers. And Ismail et al. (2021) demonstrates a form of weak evaluation with a measure that may be quantitative but is not of direct interest to an engineer. In response to this literature, several works have emerged to highlight difficulties with feature attribution/saliency methods. Here is a short reading list :) A Benchmark for Interpretability Methods in Deep Neural Networks (Hooker et al., 2018) Sanity Checks for Saliency Maps (Adebayo et al., 2018) Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Hase and Bansal, 2020) Debugging Tests for Model Explanations (Adebayo et al., 2020) Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior (Denain and Steinhardt, 2022) Towards Benchmarking Explainable Artificial Intelligence Methods (Holmberg, 2022) Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023) When they are evaluated, these tools often aren’t very useful and do not pass simple sanity checks. Consider an illustration of this problem: From Adebayo et al. (2018) These visualizations suggest that some of these tools do not reliably highlight features that seem important in images at all, and the ones that do often highlight them do not appear to be obviously better than an edge detector. This sanity check suggests limitations with how well these methods can reveal anything novel to humans at all, let alone how useful they can be in tasks of practical interest. 
For the papers that have gone further and studied whether these methods can help predict how the network will respond to certain inputs, it seems that some attribution/saliency methods usually fail while others only occasionally succeed (Hase and Bansal, 2020; Adebayo et al., 2020; Denain and Steinhardt, 2022).

EIS III discussed how, in a newly arXived work, coauthors and I benchmarked feature synthesis tools (Casper et al., 2023). In addition, we used a related approach to evaluate how helpful feature attribution/saliency methods can be for pointing out spurious features that the network has learned. This evaluation was based on seeing how well a method can attribute a trojaned network’s decision to the trojan trigger in an image.

[Figure from Casper et al. (2023)]

Shown at the top of the figure above are examples of trojaned ima...
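To make the trigger-attribution evaluation concrete, here is a minimal sketch of one way such a score could be computed, assuming the trojan trigger patch occupies a known bounding box in the image. The function name and the particular score (fraction of non-negative attribution mass inside the trigger region) are illustrative choices, not necessarily the exact metric used in Casper et al. (2023).

```python
# Illustrative proxy for the trojan-based evaluation idea described above: if we know
# where the trigger patch was pasted, ask what fraction of an attribution map's mass
# lands inside that region. This is a simple stand-in metric, not the paper's own score.
import torch

def trigger_attribution_score(saliency, trigger_box):
    """Fraction of total (non-negative) attribution falling inside the trigger bounding box.

    saliency: (H, W) tensor of attribution values.
    trigger_box: (row0, row1, col0, col1) pixel bounds of the pasted trigger.
    """
    s = saliency.clamp(min=0)
    r0, r1, c0, c1 = trigger_box
    inside = s[r0:r1, c0:c1].sum()
    total = s.sum() + 1e-12          # avoid division by zero on an all-zero map
    return (inside / total).item()

# Example with a synthetic map: attribution concentrated on a 16x16 trigger in the corner.
saliency = torch.zeros(224, 224)
saliency[0:16, 0:16] = 1.0           # pretend the method highlighted the trigger
print(trigger_attribution_score(saliency, (0, 16, 0, 16)))  # -> 1.0
```

Under this kind of score, a method that reliably localizes the trigger lands near 1 on trojaned inputs, while a method that spreads attribution uniformly scores only the trigger's share of the image area.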
First published
02/15/2023
Genres:
education
Duration
7 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Daily
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No