AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse
<a href="https://www.alignmentforum.org/posts/sCJDstZrpCB8dQveA/using-uninterpretable-llms-to-generate-interpretable-ai-code">Link to original article</a><br/><br/>Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Using (Uninterpretable) LLMs to Generate Interpretable AI Code, published by Joar Skalse on July 2, 2023 on The AI Alignment Forum. (This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.)For some types of problems, we can trust a proposed solution without trusting the method that generated the solution. For example, a mathematical proof can be independently verified. This means that we can trust a mathematical proof, without having to trust the mathematician who came up with the proof. Not all problems are like this. For example, in order to trust that a chess move is correct, then we must either trust the player who came up with the move (in terms of both their ability to play chess, and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or perhaps more generally IP/PSPACE), and larger complexity classes (EXP, etc). One of the things that make AI safety hard is that we want to use AI systems to solve problems whose solution we are unable (or at least unwilling) to verify. For example, automation isn't very useful if all parts of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, politics, and etc). This means that we need to trust the mechanism which produces the output (ie the AI itself), and this is hard. In order to trust the output of a large neural network, we must either verify its output independently, or we must trust the network itself. In order to trust the network itself, we must either verify the network independently, or we must trust the process that generated the network (ie training with SGD). This suggest that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red teaming, formal verification, and etc), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, and etc). [SGD] -> [neural network] -> [output] I think there is a fourth way, that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let this system generate the output. For example, instead of having a neural network generate a chess move, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network, and even if we are unable to verify the chess move. [SGD] -> [neural network] -> [interpretable computer program] -> [output] To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. 
To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. If we simply fine-tune the LLM with RLHF and red-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. However, it is difficult to be very sure, because we can't try all inputs, and we can't understand what all the tensors are doing.

An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide medical advice. Such a system might be easy to understand, and interpretable by default. For example, we might be able to definitively verify that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify the outputs, and even if we don't necessarily...
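As a toy sketch of why such a system can be easier to verify (again, an illustration added for this transcript rather than anything from the post), consider a rule-based adviser whose entire behaviour is a finite, explicit rule table. A property like "never recommends homeopathy" can then be checked by inspecting every rule, rather than by sampling inputs; the rules and names below are hypothetical.

RULES = [
    # (set of reported symptoms, recommendation)
    ({"fever", "cough"},  "Consider a flu test; rest and drink fluids."),
    ({"chest pain"},      "Seek urgent medical care."),
    ({"mild headache"},   "Hydration and rest; see a doctor if it persists."),
]

DEFAULT = "No matching rule; please consult a medical professional."

def advise(symptoms: set) -> str:
    # The system's entire behaviour is this loop over an explicit rule table.
    for trigger, recommendation in RULES:
        if trigger <= symptoms:  # all trigger symptoms are present
            return recommendation
    return DEFAULT

def never_recommends(term: str) -> bool:
    # Because the rule table is finite, this check covers every possible input.
    outputs = [rec for _, rec in RULES] + [DEFAULT]
    return all(term.lower() not in rec.lower() for rec in outputs)

assert never_recommends("homeopathy")
print(advise({"fever", "cough", "fatigue"}))  # -> the flu-test recommendation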
First published
07/02/2023
Genres:
education
Duration
5 minutes
Parent Podcast
The Nonlinear Library: Alignment Forum Daily
Similar Episodes
AMA: Paul Christiano, alignment researcher by Paul Christiano
Release Date: 12/06/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
What is the alternative to intent alignment called? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
AI alignment landscape by Paul Christiano
Release Date: 11/19/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Would an option to publish to AF users only be a useful feature? Q by Richard Ngo
Release Date: 11/17/2021
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Explicit: No
Similar Podcasts
The Nonlinear Library
Release Date: 10/07/2021
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Section
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong
Release Date: 03/03/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Daily
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: EA Forum Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: LessWrong Weekly
Release Date: 05/02/2022
Authors: The Nonlinear Fund
Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Explicit: No
The Nonlinear Library: Alignment Forum Top Posts
Release Date: 02/10/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
The Nonlinear Library: LessWrong Top Posts
Release Date: 02/15/2022
Authors: The Nonlinear Fund
Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
Explicit: No
sasodgy
Release Date: 04/14/2021
Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana
Explicit: No