
AF - AI that shouldn't work, yet kind of does by Donald Hobson

Link to original article: https://www.alignmentforum.org/posts/NK2CeDNKMEY9gRZp2/ai-that-shouldn-t-work-yet-kind-of-does

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI that shouldn't work, yet kind of does, published by Donald Hobson on February 23, 2023 on The AI Alignment Forum.

There are some things that work surprisingly well in AI. For example, AI that transfers the style of one image to the content of another. Why is the approach described here a hack? It starts with a neural net trained to classify images. It then runs gradient descent on the content image, trying to make the covariance matrices of the early network layers match those of the style image, while keeping the later-layer activations of the original and the style-transfer image as similar as possible. The approximations that I think make this work are that, in classifiers, the early layers tend to store simple features and the later layers tend to hold more complex features. Style is based on the simpler features, and doesn't depend on the location within the image. Content is based on more complex features, and does depend on the location in the image. We apply optimization power, gradient descent, over heuristics this simple and hacky. And yet it works.
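To make the style/content split concrete, here is a minimal sketch of the losses involved, in the spirit of Gatys-style neural style transfer. The pretrained VGG19 backbone, the specific layer indices, and the loss weights are illustrative assumptions, not anything specified in the post.

```python
# A minimal sketch of the style-transfer losses described above, assuming a
# pretrained VGG19 classifier and hypothetical layer choices and weights.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

def gram_matrix(features):
    # (1, C, H, W) activations -> (C, C) covariance-style matrix.
    # Summing over spatial positions discards location information,
    # which is why this captures "style" rather than "content".
    _, c, h, w = features.shape
    f = features.view(c, h * w)
    return f @ f.t() / (c * h * w)

def layer_activations(net, x, layer_ids):
    # Collect activations at the requested layer indices.
    acts, out = {}, x
    for i, layer in enumerate(net):
        out = layer(out)
        if i in layer_ids:
            acts[i] = out
    return acts

vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)       # the classifier itself is never trained here

style_layers = [1, 6, 11]         # early layers: simple, location-independent features
content_layers = [20]             # later layers: complex, location-dependent features

def total_loss(image, content_img, style_img, style_weight=1e4, content_weight=1.0):
    a_img = layer_activations(vgg, image, set(style_layers + content_layers))
    a_style = layer_activations(vgg, style_img, set(style_layers))
    a_content = layer_activations(vgg, content_img, set(content_layers))
    # Style: match covariance (Gram) matrices in the early layers.
    style_loss = sum(F.mse_loss(gram_matrix(a_img[i]), gram_matrix(a_style[i]))
                     for i in style_layers)
    # Content: match the later-layer activations of the content image.
    content_loss = sum(F.mse_loss(a_img[i], a_content[i]) for i in content_layers)
    return style_weight * style_loss + content_weight * content_loss

# Gradient descent runs on the pixels of the image itself, not on network weights:
# image = content_img.clone().requires_grad_(True)
# opt = torch.optim.Adam([image], lr=0.02)
# for _ in range(300):
#     opt.zero_grad(); total_loss(image, content_img, style_img).backward(); opt.step()
```

Note that the optimization target is the image itself; the classifier only supplies the feature hierarchy that these simple heuristics exploit.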
A simple and hacky AI alignment proposal is to just ask ChatGPT to do it. This doesn't work because ChatGPT has been optimized for text prediction, and so isn't particularly good at AI alignment theory. So here is an alignment plan. I know it isn't great, but some plan is better than no plan. And there was a post about how alignment may well look like "surely no one could have missed that" or "surely such a stupid idea couldn't work, could it?", not "eureka".

Train ZFCbot: an AI that is the AlphaGo of ZFC, perhaps trained on random formulas, perhaps with a large corpus of formal maths proofs thrown in. Ideally the system should have a latent space of maths, so it can think about what style of proof is likely to work before expanding all the details. The system should have a wide range of common maths terms imported from some Lean library. It should be optimized purely for ZFC formal theorem proving. Once it is trained, the weights are fixed.

Train ChatMathsGPT: similar to large language models, except with oracle access to ZFCbot. From the many maths papers in its corpus, it learns to link the informal with the formal. From politics and economics discussions, it learns to ask ZFCbot about toy game theory problems. In general, it learns to identify the pieces of formal maths that best model a situation, ask about them, and then use the response to predict text. There is a sense in which this AI knows less maths than normal ChatGPT: standard ChatGPT has a small, crude understanding of maths built from nothing within its own mind, whereas ChatMathsGPT has a much better understanding it can outsource to; it just has to plug in.

Then we ask this ChatMathsGPT for a paper on logical induction, and we hope it can generate a paper of quality similar to MIRI's paper on the topic (where, hypothetically, that paper isn't in the training dataset). If it can, then we have a tool to accelerate deconfusion by orders of magnitude.

Things I am uncertain about: Should ChatMathsGPT have oracle access to ZFCbot's latent space (which can pass gradients, but is harder to interpret), or should it just pass formal strings of symbols (less powerful)? Should ZFCbot be trained on random ZFC; on random ZFC plus a library of theorems, conjectures, and random combinations of high-level maths concepts; or on random ZFC plus whatever ChatMathsGPT keeps asking? The latter gives a route for data to pass between them.

This could fail to be smart enough, though I wouldn't be particularly surprised if it could be made smart enough. But what would the safety failures of this system look like? Firstly, this AI does deconfusion. If you ask it to write a paperclip maximizer in Python, you may well get your wish. Or you might get an AI that maximizes something else. Just asking for an aligned AI is...
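As a rough illustration of how the two systems would fit together, here is a hypothetical interface sketch for the "pass formal strings of symbols" option. Every class, method, and parameter name below is an assumption made for illustration, not an existing API or the post's specified design.

```python
# Hypothetical sketch of the two-model proposal: a frozen theorem prover
# ("ZFCbot") that ChatMathsGPT can query with formal strings during generation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProofResult:
    proved: bool
    proof: Optional[str] = None   # a formal ZFC proof if one was found

class ZFCBot:
    """The 'AlphaGo of ZFC': optimized purely for formal theorem proving,
    with weights frozen once training is done."""
    def query(self, formal_statement: str, search_budget: int = 10_000) -> ProofResult:
        # In the proposal this would run proof search over ZFC, ideally guided
        # by a latent space of proof styles before expanding the details.
        raise NotImplementedError

class ChatMathsGPT:
    """A language model with oracle access to ZFCBot. From maths papers in its
    corpus it learns to link informal claims to formal statements, and to
    condition its text prediction on the oracle's answers."""
    def __init__(self, oracle: ZFCBot):
        self.oracle = oracle  # fixed oracle: formal strings in, proofs/verdicts out

    def formalize(self, informal_claim: str) -> str:
        # Learned translation from informal text to a ZFC statement
        # (e.g. a toy game-theory lemma extracted from an economics question).
        raise NotImplementedError

    def generate(self, prompt: str, verdict: ProofResult) -> str:
        # Ordinary next-token prediction, conditioned on the oracle's verdict.
        raise NotImplementedError

    def answer(self, prompt: str) -> str:
        statement = self.formalize(prompt)      # pick the formal maths that best models the prompt
        verdict = self.oracle.query(statement)  # outsource the actual mathematics
        return self.generate(prompt, verdict)
```

The string-only interface corresponds to the less powerful but more interpretable branch of the trade-off above; the latent-space option would presumably replace `formalize` and `query` with exchanged embeddings and gradients.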

First published

02/23/2023

Genres:

education

Duration

5 minutes

Parent Podcast

The Nonlinear Library: Alignment Forum Daily

Similar Episodes

    AMA: Paul Christiano, alignment researcher by Paul Christiano

    Release Date: 12/06/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    What is the alternative to intent alignment called? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is the alternative to intent alignment called? Q, published by Richard Ngo on the AI Alignment Forum. Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the definition of alignment in which A is trying to achieve H's goals (whether or not H intends for A to achieve H's goals)? Secondly, this seems to basically map on to the distinction between an aligned genie and an aligned sovereign. Is this a fair characterisation? (Intent alignment definition from) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    AI alignment landscape by Paul Christiano

    Release Date: 11/19/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI alignment landscape, published by Paul Christiano on the AI Alignment Forum. Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions. Here is the main overview slide from the talk: The highlighted boxes are where I spend most of my time. Here are the full slides from the talk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

    Would an option to publish to AF users only be a useful feature? Q by Richard Ngo

    Release Date: 11/17/2021

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would an option to publish to AF users only be a useful feature? Q, published by Richard Ngo on the AI Alignment Forum. Right now there are quite a few private safety docs floating around. There's evidently demand for a privacy setting lower than "only people I personally approve", but higher than "anyone on the internet gets to see it". But this means that safety researchers might not see relevant arguments and information. And as the field grows, passing on access to such documents on a personal basis will become even less efficient. My guess is that in most cases, the authors of these documents don't have a problem with other safety researchers seeing them, as long as everyone agrees not to distribute them more widely. One solution could be to have a checkbox for new posts which makes them only visible to verified Alignment Forum users. Would people use this? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/07/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 03/03/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 05/02/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 02/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 02/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 04/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No