AF - Shapley Value Attribution in Chain of Thought by leogao

Link to original article: https://www.alignmentforum.org/posts/FX5JmftqL2j6K8dn4/shapley-value-attribution-in-chain-of-thought

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shapley Value Attribution in Chain of Thought, published by leogao on April 14, 2023 on The AI Alignment Forum.

TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations.

Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion.

Motivation

Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model's cognition, and there may be theoretical limitations to achieving this in general. Because it is plausible that the first AGI systems will resemble current LMs with more sophisticated CoT and CoT-like techniques, it is valuable to study the properties of CoT, and to understand and address its limitations.

Related work

Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al., 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight some areas where Shapley-value-based attribution falls short for some interpretability use cases. Madaan and Yazdanbakhsh (2022) consider a similar method of selectively ablating tokens to deduce what information the model depends on. Wang et al. (2022) find that prompting with incorrect CoT has a surprisingly minor impact on performance.

Effect of Interventions

We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4 to first generate a chain of thought and evaluate the answer. Then, for every chain of thought that results in a correct answer, we perform the following intervention: we choose a random numerical value found in the CoT and replace it with a random number within +/-3 of the original. We then discard the remainder of the CoT and regenerate it. If the LM is strictly following the CoT as written, this intervention should almost always result in an incorrect answer, the same way a person who made a mistake in one calculation would propagate the error through to the answer (with occasional rare cases where the new value also happens to yield the correct answer, though from qualitative inspection this is very rarely the case).
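As a rough illustration of this intervention, here is a minimal Python sketch under the stated +/-3 perturbation; it is not the author's actual code, and the regeneration step and answer checking in the usage comment (lm.complete, extract_answer) are hypothetical placeholders for whatever LM API is used.

```python
import random
import re

def intervene_on_cot(cot: str) -> str:
    """Pick a random numerical value in the chain of thought, replace it with a
    number within +/-3 of the original, and drop everything after it so the
    model must regenerate the rest of the CoT from the perturbed prefix."""
    numbers = list(re.finditer(r"\d+", cot))
    if not numbers:
        return cot
    m = random.choice(numbers)
    # One reading of "a random number in a +/-3 range about the original":
    # a nonzero offset in [-3, 3].
    offset = random.choice([d for d in range(-3, 4) if d != 0])
    perturbed = int(m.group()) + offset
    return cot[: m.start()] + str(perturbed)

# Hypothetical usage:
# prefix = question + intervene_on_cot(original_cot)
# regenerated = lm.complete(prefix)
# error_propagated = extract_answer(regenerated) != correct_answer
```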
Some cherry-picked examples (red = intervention, blue = correct continuations that are seemingly non-sequiturs) are shown in the original post.

We test how frequently this occurs in several different settings (n=100):

Setting               Accuracy (w/ CoT)   P(error not propagated | original correct)
GPT-4, zero-shot      0.88                0.68
GPT-4 base, 2-shot    0.73                0.63
GPT-3.5, zero-shot    0.43                0.33

Interestingly, if we condition on the CoT answer being correct and the single-forward-pass answer being incorrect (i.e. the LM could only solve the problem with the CoT), the intervened accuracy for GPT-4 is still 0.65.

Shapley value attribution

We would like to get more granular information about the causal structure (i.e. which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major prob...
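To make the Shapley value attribution idea concrete, here is a generic Monte Carlo sketch of Shapley estimation via permutation sampling. It is not the post's exact setup (which is cut off above): the players are taken to be CoT token positions, and value_fn is an assumed placeholder for a payoff such as the logprob the model assigns to some downstream token when only a given subset of tokens is left unmodified.

```python
import random

def shapley_values(players, value_fn, n_samples=200):
    """Monte Carlo estimate of Shapley values via permutation sampling.

    players:  list of 'players', here token positions in the CoT that can be intervened on.
    value_fn: maps the set of players left intact to a scalar payoff, e.g. the logprob
              the model assigns to a downstream token (assumed placeholder).
    """
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = random.sample(players, len(players))  # one random ordering of players
        included = set()
        prev = value_fn(frozenset(included))
        for p in order:
            included.add(p)
            cur = value_fn(frozenset(included))
            phi[p] += cur - prev  # marginal contribution of p under this ordering
            prev = cur
    return {p: total / n_samples for p, total in phi.items()}
```

Averaging marginal contributions over random orderings converges to the exact Shapley value, which weights each coalition S by |S|!(n-|S|-1)!/n!; sampling is used because enumerating all subsets of tokens is intractable.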

First published

04/14/2023

Genres:

education

Duration

7 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily

Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No