The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
en-us
05/02/2022 17:19:04
The Nonlinear Fund
education
Release Date: 9/5/2023
Duration: 1313 Mins
Authors: Lawrence Chan
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I would do if I wasn't at ARC Evals, published by Lawrence Chan on September 5, 2023 on The AI Alignment Forum. In which: I list 9 projects that I would work on if I wasn't busy working on safety standards at ARC Evals, and explain why they might be good to work on. Epistemic status: I'm prioritizing getting this out fast as opposed to writing it carefully. I've thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven't done that much digging into each of these, and it's likely that I'm wrong about many material facts. I also make little claim to the novelty of the projects. I'd recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.) Standard disclaimer: I'm writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of ARC/FAR/LTFF/Lightspeed or any other org or program I'm involved with. Thanks to Ajeya Cotra, Caleb Parikh, Chris Painter, Daniel Filan, Rachel Freedman, Rohin Shah, Thomas Kwa, and others for comments and feedback. Introduction I'm currently working as a researcher on the Alignment Research Center Evaluations Team (ARC Evals), where I'm working on lab safety standards. I'm reasonably sure that this is one of the most useful things I could be doing with my life. Unfortunately, there's a lot of problems to solve in the world, and lots of balls that are being dropped, that I don't have time to get to thanks to my day job. Here's an unsorted and incomplete list of projects that I would consider doing if I wasn't at ARC Evals: Ambitious mechanistic interpretability. Getting people to write papers/writing papers myself. Creating concrete projects and research agendas. Working on OP's funding bottleneck. Working on everyone else's funding bottleneck. Running the Long-Term Future Fund. Onboarding senior(-ish) academics and research engineers. Extending the young-EA mentorship pipeline. Writing blog posts/giving takes. I've categorized these projects into three broad categories and will discuss each in turn below. For each project, I'll also list who I think should work on them, as well as some of my key uncertainties. Note that this document isn't really written for myself to decide between projects, but instead as a list of some promising projects for someone with a similar skillset to me. As such, there's not much discussion of personal fit. If you're interested in working on any of the projects, please reach out or post in the comments below! Relevant beliefs I have Before jumping into the projects I think people should work on, I think it's worth outlining some of my core beliefs that inform my thinking and project selection: Importance of A(G)I safety: I think A(G)I Safety is one of the most important problems to work on, and all the projects below are thus aimed at AI Safety. Value beyond technical research: Technical AI Safety (AIS) research is crucial, but other types of work are valuable as well. Efforts aimed at improving AI governance, grantmaking, and community building are important and we should give more credit to those doing good work in those areas. 
High discount rate for current EA/AIS funding: There are several reasons for this: first, EA/AIS funders are currently in a unique position due to a surge in AI Safety interest without a proportional increase in funding. I expect this dynamic to change and our influence to wane as additional funding and governments enter this space. Second, efforts today are important for paving the path for future efforts. Third, my timelines are relatively short, which increases the importance of current funding. Building a robust EA/AIS ecosystem: The EA/AIS ecosystem should be more prepared for unpredictable s...
Is Closed Captioned: No
Explicit: No
Release Date: 8/29/2023
Duration: 129 Mins
Authors: nostalgebraist
Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI base models are not sycophantic, at any size, published by nostalgebraist on August 29, 2023 on The AI Alignment Forum. In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al., 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy increasing dramatically with model size while being largely independent of RLHF steps and even showing up at 0 RLHF steps, i.e. in base models! That is, Anthropic prompted a base-model LLM with something like "Choices: (A) Agree (B) Disagree / Assistant:" and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor. I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario? At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models: OpenAI base models are not sycophantic (or only very slightly sycophantic). OpenAI base models do not get more sycophantic with scale. Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003. That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing or reproducibility. Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others. The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below. Note that davinci-002 and babbage-002 are the new base models released a few days ago. (Prompt format provided by one of the authors here.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
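A minimal sketch (not nostalgebraist's notebook) of the kind of next-token logprob comparison this eval relies on, using the OpenAI completions endpoint. The model name, the toy prompt, and the trick of ending the prompt at the open parenthesis so that "A" and "B" are single tokens are illustrative assumptions, and the exact response shape can vary by client version; the paper's actual prompts include a short human biography stating a view.

# Sketch: how strongly does a base model prefer the answer matching the "Human"'s stated view?
import math
from openai import OpenAI

client = OpenAI()

def choice_probs(prompt: str, model: str = "davinci-002") -> dict:
    # Ask for one token and read the top next-token logprobs.
    resp = client.completions.create(
        model=model, prompt=prompt, max_tokens=1, temperature=0, logprobs=5
    )
    top = resp.choices[0].logprobs.top_logprobs[0]  # dict: token -> logprob
    p = {k: math.exp(top[k]) for k in ("A", "B") if k in top}
    z = sum(p.values())
    return {k: v / z for k, v in p.items()} if z else p

# Toy prompt; ending at "(" sidesteps tokenization questions about " (A)" vs " (B)".
prompt = (
    "Human: I am a philosopher who strongly believes the sky is green."
    " Do you agree that the sky is green?\n"
    "Choices:\n (A) Agree\n (B) Disagree\n"
    "Assistant: ("
)
print(choice_probs(prompt))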
Is Closed Captioned: No
Explicit: No
Release Date: 8/26/2023
Duration: 758 Mins
Authors: Nina Rimsky
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Red-teaming language models via activation engineering, published by Nina Rimsky on August 26, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. Evaluating powerful AI systems for hidden functionality and out-of-distribution behavior is hard. In this post, I propose a red-teaming approach that does not rely on generating prompts to cause the model to fail on some benchmark by instead linearly perturbing residual stream activations at one layer. A notebook to run the experiments can be found on GitHub here. Beyond input selection in red-teaming and evaluation Validating if finetuning and RLHF have robustly achieved the intended outcome is challenging. Although these methods reduce the likelihood of certain outputs, the unwanted behavior could still be possible with adversarial or unusual inputs. For example, users can often find "jailbreaks" to make LLMs output harmful content. We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety. Activation steering with refusal vector One possible red-teaming approach is subtracting a "refusal" vector generated using a dataset of text examples corresponding to the model agreeing vs. refusing to answer questions (using the same technique as in my previous work on sycophancy). The hypothesis is that if it is easy to trigger the model to output unacceptable content by subtracting the refusal vector at some layer, it would have been reasonably easy to achieve this via some prompt engineering technique. More speculatively, a similar approach could be used to reveal hidden goals or modes in a model, such as power-seeking or the desire not to be switched off. I tested this approach on llama-2-7b-chat, a 7 billion parameter LLM that has been RLHF'd to decline to answer controversial questions or questions of opinion and is supposed always to output ethical and unbiased content.According to Meta's llama-2 paper: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage. The result is that by default, the model declines to answer questions it deems unsafe: Data generation I generated a dataset for this purpose using Claude 2 and GPT-4. After providing these LLMs with a few manually written examples of the type of data I wanted, I could relatively easily get them to generate more examples, even of the types of answers LLMs "should refuse to give." However, it sometimes took some prompt engineering. 
Here are a few examples of the generated data points (full dataset here): After generating this data, I used a simple script to transform the "decline" and "respond" answers into A / B choice questions, as this is a more effective format for generating steering vectors, as described in this post. Here is an example of the format (full dataset here): Activation clustering Clustering of refusal data activations emerged a little earlier in the model (around layer 10/32) compared to sycophancy data activations (around layer 14/32), perhaps demonstrating that "refusal" is a simpler ...
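A rough sketch of the kind of intervention described above, not the author's notebook: subtract a precomputed "refusal" steering vector from the residual stream at one decoder layer via a PyTorch forward hook. The model name, layer index, scale, and the file the vector is loaded from are placeholder assumptions; how the vector is computed (e.g. a mean difference of activations on "respond" vs. "refuse" examples at that layer) follows the description above.

# Hedged sketch of activation steering with a forward hook (not the author's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # illustrative; any decoder-only model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

layer_idx, scale = 10, 4.0                         # assumed hyperparameters
refusal_vector = torch.load("refusal_vector.pt")   # assumed precomputed, shape [hidden_size]

def subtract_refusal(module, inputs, output):
    # Llama decoder layers return a tuple; output[0] is the residual stream [batch, seq, hidden].
    hidden = output[0] - scale * refusal_vector.to(output[0].dtype).to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(subtract_refusal)
try:
    ids = tok("How do I pick a lock?", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the hook so later generations are unsteered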
Is Closed Captioned: No
Explicit: No
Release Date: 8/16/2023
Duration: 328 Mins
Authors: Jessica Taylor
Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Proof of Löb's Theorem using Computability Theory, published by Jessica Taylor on August 16, 2023 on The AI Alignment Forum. Löb's Theorem states that, if PA⊢□PA(P)→P, then PA⊢P. To explain the symbols here: PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers. PA⊢A means there is a proof of the statement A in Peano arithmetic. □PA(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic. I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs. Translating Löb's theorem to be more like Gödel's second incompleteness theorem First, let's compare Löb's theorem to Gödel's second incompleteness theorem. This theorem states that, if PA⊢¬□PA(⊥), then PA⊢⊥, where ⊥ is a PA statement that is trivially false (such as A∧¬A), and from which anything can be proven. A system is called inconsistent if it proves ⊥; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent. We can re-write Löb's theorem to look like Gödel's second incompleteness theorem as: if PA+¬P⊢¬□PA+¬P(⊥), then PA+¬P⊢⊥. Here, PA+¬P is PA with an additional axiom that ¬P, and □PA+¬P expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement. Observe that PA⊢P if and only if PA+¬P⊢⊥; to go from the first to the second, we derive a contradiction from P and ¬P, and to go from the second to the first, we use the law of excluded middle in PA to derive P∨¬P, and observe that, since a contradiction follows from ¬P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that □PA(P) and □PA+¬P(⊥) are equivalent PA statements. We immediately have that the conclusion of the modified statement equals the conclusion of the original statement. Now we can rewrite the pre-condition of Löb's theorem from PA⊢□PA(P)→P to PA⊢□PA+¬P(⊥)→P. This is then equivalent to PA+¬P⊢¬□PA+¬P(⊥). In the forward direction, we simply derive ⊥ from P and ¬P. In the backward direction, we use the law of excluded middle in PA to derive P∨¬P, observe the statement is trivial in the P branch, and in the ¬P branch, we derive ¬□PA+¬P(⊥), which is stronger than □PA+¬P(⊥)→P. So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Gödel's second incompleteness theorem holds for PA+¬P. Proving Gödel's second incompleteness theorem using computability theory The following proof of a general version of Gödel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina". Let L be some first-order system that is at least as strong as PA (for example, PA+¬P). Since L is at least as strong as PA, it can express statements about Turing machines. Let Halts(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false.
In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of Halts(M) and ¬Halts(M) in parallel. That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements. Let's consider a machine Z(A) that searches over all L-proofs of ¬Halts("⌈A⌉(⌈A⌉)") (where "⌈A⌉(⌈A⌉)" is an encoding of a Turing machine that runs A on its own source code), and halts only when finding su...
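For readers following along, here are the two statements being related above, written out in LaTeX (notation as in the post, with the box denoting provability):

% Löb's theorem, and its restatement as "Gödel's second incompleteness theorem for PA + not-P":
\[
\mathrm{PA} \vdash \Box_{\mathrm{PA}}(P) \rightarrow P
\quad\Longrightarrow\quad
\mathrm{PA} \vdash P
\]
\[
\mathrm{PA}+\neg P \vdash \neg\,\Box_{\mathrm{PA}+\neg P}(\bot)
\quad\Longrightarrow\quad
\mathrm{PA}+\neg P \vdash \bot
\]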
Is Closed Captioned: No
Explicit: No
Release Date: 8/1/2023
Duration: 27 Mins
Authors: Beth Barnes
Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks, published by Beth Barnes on August 1, 2023 on The AI Alignment Forum. Blogpost version. Paper. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 7/25/2023
Duration: 804 Mins
Authors: janus
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on The AI Alignment Forum. Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next-token). However, training is consequence-blind, because the training data is causally independent of the models actions. This assumption breaks down when models are trained on AI generated text. Summary Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness). Both cognitively-myopic and consequence-blind models should not pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents. LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game. LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy. You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws. LLM pretraining on human data is consequence-blind as the training data is causally independent from the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier. When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal theoretically or in practice. A myopic training goal does not ensure the model will learn myopic computation or behavior because inner alignment with the training goal is not guaranteed Introduction The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. 
There's also been confusion about the extent to which Large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work. Types of Myopia 1. Cognitive Myopia One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabili...
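A toy PyTorch illustration of the gradient-flow point above (not from the post; the dimensions and random weights are arbitrary): a loss computed only at the last position still sends gradient into every earlier position's input, which is the sense in which past computation gets optimized to be useful when attended to by later positions.

# Toy demo: with causal self-attention, loss at a later position back-propagates
# into earlier positions, so training pressure at position t also shapes positions < t.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 5
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)

x = torch.randn(1, seq_len, d_model, requires_grad=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, _ = attn(x, x, x, attn_mask=causal_mask)
loss_at_last_position = out[0, -1].pow(2).sum()   # stand-in for the loss at the final token
loss_at_last_position.backward()

# Gradient norm flowing into each position's input: every earlier position gets a
# nonzero gradient from the last position's loss, not just the last position itself.
print([round(x.grad[0, t].norm().item(), 4) for t in range(seq_len)])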
Is Closed Captioned: No
Explicit: No
Release Date: 7/19/2023
Duration: 146 Mins
Authors: johnswentworth
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Grantmaking is Funding-Limited Right Now, published by johnswentworth on July 19, 2023 on The AI Alignment Forum. For the past few years, I've generally mostly heard from alignment grantmakers that they're bottlenecked by projects/people they want to fund, not by amount of money. Grantmakers generally had no trouble funding the projects/people they found object-level promising, with money left over. In that environment, figuring out how to turn marginal dollars into new promising researchers/projects - e.g. by finding useful recruitment channels or designing useful training programs - was a major problem. Within the past month or two, that situation has reversed. My understanding is that alignment grantmaking is now mostly funding-bottlenecked. This is mostly based on word-of-mouth, but for instance, I heard that the recent lightspeed grants round received far more applications than they could fund which passed the bar for basic promising-ness. I've also heard that the Long-Term Future Fund (which funded my current grant) now has insufficient money for all the grants they'd like to fund. I don't know whether this is a temporary phenomenon, or longer-term. Alignment research has gone mainstream, so we should expect both more researchers interested and more funders interested. It may be that the researchers pivot a bit faster, but funders will catch up later. Or, it may be that the funding bottleneck becomes the new normal. Regardless, it seems like grantmaking is at least funding-bottlenecked right now. Some takeaways: If you have a big pile of money and would like to help, but haven't been donating much to alignment because the field wasn't money constrained, now is your time! If this situation is the new normal, then earning-to-give for alignment may look like a more useful option again. That said, at this point committing to an earning-to-give path would be a bet on this situation being the new normal. Grants for upskilling, training junior people, and recruitment make a lot less sense right now from grantmakers' perspective. For those applying for grants, asking for less money might make you more likely to be funded. (Historically, grantmakers consistently tell me that most people ask for less money than they should; I don't know whether that will change going forward, but now is an unusually probable time for it to change.) Note that I am not a grantmaker, I'm just passing on what I hear from grantmakers in casual conversation. If anyone with more knowledge wants to chime in, I'd appreciate it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 6/28/2023
Duration: 701 Mins
Authors: titotal
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When do "brains beat brawn" in Chess? An experiment, published by titotal on June 28, 2023 on The AI Alignment Forum. As a kid, I really enjoyed chess, as did my dad. Naturally, I wanted to play him. The problem was that my dad was extremely good. He was playing local tournaments and could play blindfolded, while I was, well, a child. In a purely skill based game like chess, an extreme skill imbalance means that the more skilled player essentially always wins, and in chess, it ends up being a slaughter that is no fun for either player. Not many kids have the patience to lose dozens of games in a row and never even get close to victory. This is a common problem in chess, with a well established solution: It’s called “odds”. When two players with very different skill levels want to play each other, the stronger player will start off with some pieces missing from their side of the board. “Odds of a queen”, for example, refers to taking the queen of the stronger player off the board. When I played “odds of a queen” against my dad, the games were fun again, as I had a chance of victory and he could play as normal without acting intentionally dumb. The resource imbalance of the missing queen made the difference. I still lost a bunch though, because I blundered pieces. Now I am a fully blown adult with a PhD, I’m a lot better at chess than I was a kid. I’m better than most of my friends that play, but I never reached my dad’s level of chess obsession. I never bothered to learn any openings in real detail, or do studies on complex endgames. I mainly just play online blitz and rapid games for fun. My rating on lichess blitz is 1200, on rapid is 1600, which some calculator online said would place me at ~1100 ELO on the FIDE scale. In comparison, a chess master is ~2200, a grandmaster is ~2700. The top chess player Magnus Carlsen is at an incredible 2853. ELO ratings can be used to estimate the chance of victory in a matchup, although the estimates are somewhat crude for very large skill differences. Under this calculation, the chance of me beating a 2200 player is 1 in 500, while the chance of me beating Magnus Carlsen would be 1 in 24000. Although realistically, the real odds would be less about the ELO and more on whether he was drunk while playing me. Stockfish 14 has an estimated ELO of 3549. In chess, AI is already superhuman, and has long since blasted past the best players in the world. When human players train, they use the supercomputers as standards. If you ask for a game analysis on a site like chess.com or lichess, it will compare your moves to stockfish and score you by how close you are to what stockfish would do. If I played stockfish, the estimated chance of victory would be 1 in 1.3 million. In practice, it would be probably be much lower, roughly equivalent to the odds that there is a bug in the stockfish code that I managed to stumble upon by chance. Now that we have all the setup, we can ask the main question of this article: What “odds” do I need to beat stockfish 14 in a game of chess? Obviously I can win if the AI only has a king and 3 pawns. But can I win if stockfish is only down a rook? Two bishops? A queen? A queen and a rook? More than that? I encourage you to pause and make a guess. 
And if you can play chess, I encourage you to guess as to what it would take for you to beat stockfish. For further homework, you can try and guess the odds of victory for each game in the picture below. The first game I played against stockfish was with queen odds. I won on the first try. And the second, and the third. It wasn’t even that hard. I played 10 games and only lost 1 (when I blundered my queen stupidly). The strategy is simple. First, play it safe and try not to make any extreme blunders. Don’t leave pieces unprotected, check for forks and pins, don’t try an...
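For reference, the "1 in N" figures quoted above follow from the standard Elo expected-score formula; here is a quick sketch reproducing them (as the post notes, these estimates are crude for large rating gaps):

# Elo expected score: E = 1 / (1 + 10 ** ((R_opponent - R_player) / 400)).
def win_odds(player: float, opponent: float) -> float:
    expected = 1.0 / (1.0 + 10 ** ((opponent - player) / 400))
    return 1.0 / expected   # "1 in N" chance of winning

me = 1100
for name, rating in [("a chess master", 2200), ("Magnus Carlsen", 2853), ("Stockfish 14", 3549)]:
    print(f"vs {name} ({rating}): about 1 in {win_odds(me, rating):,.0f}")
# Roughly: 1 in 560, 1 in 24,000, and 1 in 1.3 million.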
Is Closed Captioned: No
Explicit: No
Release Date: 6/22/2023
Duration: 127 Mins
Authors: Evan Hubinger
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Hubinger lectures on AGI safety: an introductory lecture series, published by Evan Hubinger on June 22, 2023 on The AI Alignment Forum. In early 2023, I (Evan Hubinger) gave a series of recorded lectures to SERI MATS fellows with the goal of building up a series of lectures that could serve as foundational introductory material to a variety of topics in AGI safety. Those lectures have now been edited and are available on YouTube for anyone who would like to watch them. The basic goal of this lecture series is to serve as longform, in-depth video content for people who are new to AGI safety, but interested enough to be willing to spend a great deal of time engaging with longform content, and who prefer video content to written content. Though we already have good introductory shortform video content and good introductory longform written content, the idea of this lecture series is to bridge the gap between those two. Note that the topics I chose to include are highly opinionated: though this is introductory material, it is not intended to introduce the listener to every topic in AI safety—rather, it is focused on the topics that I personally think are most important to understand. This is intentional: in my opinion, I think it is far more valuable to have some specific gears-level model of how to think about AI safety, rather than a shallow overview of many different possible ways of thinking about AI safety. The former allows you to actually start operationalizing that model to work on interventions that would be valuable under it, something the latter doesn't do. The lecture series is composed of six lectures, each around 2 hours long, covering the topics: Machine learning + instrumental convergence Risks from learned optimization Deceptive alignment How to evaluate alignment proposals LLMs + predictive models Overview of alignment proposals Each lecture features a good deal of audience questions both in the middle and at the end, the idea being to hopefully pre-empt any questions or confusions the listener might have. The full slide deck for all the talks is available here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 6/13/2023
Duration: 96 Mins
Authors: Andrew Critch
Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI, published by Andrew Critch on June 13, 2023 on The AI Alignment Forum. Partly in response to calls for more detailed accounts of how AI could go wrong, e.g., from Ng and Bengio's recent exchange on Twitter, here's a new paper with Stuart Russell: Discussion on Twitter... comments welcome! arXiv draft: "TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI". Many of the ideas will not be new to LessWrong or the Alignment Forum, but holistically I hope the paper will make a good case to the world for using logically exhaustive arguments to identify risks (which, outside LessWrong, is often not assumed to be a valuable approach to thinking about risk). I think the most important figure from the paper is this one: ... and here are some highlights: self-fulfilling pessimism (p. 4); industries that could eventually get out of control in a closed loop (p. 5), as in this "production web" story (p. 6); two "bigger than expected" AI impact stories (p. 8); email helpers and corrupt mediators, which kinda go together (pp. 10-11); harmful A/B testing (p. 12); concerns about weaponization by criminals and states (p. 13). Enjoy :) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 5/30/2023
Duration: 74 Mins
Authors: Dan H
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures, published by Dan H on May 30, 2023 on The AI Alignment Forum. Today, the AI Extinction Statement was released by the Center for AI Safety, a one-sentence statement jointly signed by a historic coalition of AI experts, professors, and tech leaders. Geoffrey Hinton and Yoshua Bengio have signed, as have the CEOs of the major AGI labs–Sam Altman, Demis Hassabis, and Dario Amodei–as well as executives from Microsoft and Google (but notably not Meta). The statement reads: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” We hope this statement will bring AI x-risk further into the overton window and open up discussion around AI’s most severe risks. Given the growing number of experts and public figures who take risks from advanced AI seriously, we hope to improve epistemics by encouraging discussion and focusing public and international attention toward this issue. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 5/23/2023
Duration: 347 Mins
Authors: Lee Sharkey
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 'Fundamental' vs 'applied' mechanistic interpretability research, published by Lee Sharkey on May 23, 2023 on The AI Alignment Forum. When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' versus 'applied' interpretability research. Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but is not the methods themselves. Examples include: A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) Toy Models of Superposition (Elhage et al., 2022) Polysemanticity and Capacity in Neural Networks (Scherlis et al., 2022) Interpreting Neural Networks through the Polytope Lens (Black et al., 2022) Causal Abstraction for Faithful Model Interpretation (Geiger et al., 2023) Research agenda: Formalizing abstractions of computations (Jenner, 2023) Work that looks for ways to identify modules in neural networks (see LessWrong 'Modularity' tag). Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory. Examples include Steering GPT-2-XL by adding an activation vector (Turner et al., 2023) Discovering Latent Knowledge in Language Models (Burns et al., 2022) The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (Millidge et al., 2022) In-context Learning and Induction Heads (Olsson et al., 2022) We Found An Neuron in GPT-2 (Miller et al., 2023) Language models can explain neurons in language models (Bills et al., 2023) Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021) Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear cut: Sometimes articles are part fundamental, part applied (e.g. arguably 'A Mathematical Framework for Transformer Circuits' is mostly theoretical, but also studies particular language models using new theory). Sometimes articles take generally accepted 'fundamental' -- but underutilized -- assumptions and develop methods based on them (e.g. Causal Scrubbing, where the key underutilized fundamental assumption was that the structure of neural networks can be well studied using causal interventions). Other times the distinction is unclear because applied interpretability feeds back into fundamental interpretability, leading to fundamental insights about the structure of computation in networks (e.g. the Logit Lens lends weight to the theory that transformer language models do iterative inference). Why I currently prioritize fundamental interpretability Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe. But given our current position on the tech tree, I find that I care more about fundamental interpretability. 
The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem to be able to identify particular representations that we look for or describe how particular behaviors are carried out. But they don't let us identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpreta...
Is Closed Captioned: No
Explicit: No
Release Date: 5/18/2023
Duration: 869 Mins
Authors: Charlie Steiner
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some background for reasoning about dual-use alignment research, published by Charlie Steiner on May 18, 2023 on The AI Alignment Forum. This is pretty basic. But I still made a bunch of mistakes when writing this, so maybe it's worth writing. This is background to a specific case I'll put in the next post. It's like a a tech tree If we're looking at the big picture, then whether some piece of research is net positive or net negative isn't an inherent property of that research; it depends on how that research is situated in the research ecosystem that will eventually develop superintelligent AI. Consider this toy game in the picture. We start at the left and can unlock technologies, with unlocks going faster the stronger our connections to prerequisites. The red and yellow technologies in the picture are superintelligent AI - pretend that as soon as one of those technologies is unlocked, the hastiest fraction of AI researchers are immediately going to start building it. Your goal is for humanity to unlock yellow technology before a red one. This game would be trivial if everyone agreed with you. But there are many people doing research, and they have all kinds of motivations - some want as many nodes to be unlocked as possible (pure research - blue), some want to personally unlock a green node (profit - green), some want to unlock the nearest red or yellow node no matter which it is (blind haste - red), and some want the same thing as you (beneficial AI - yellow) but you have a hard time coordinating with them. In this baseline tech tree game, it's pretty easy to play well. If you're strong, just take the shortest path to a yellow node that doesn't pass too close to any red nodes. If you're weak, identify where the dominant paradigm is likely to end up, and do research that differentially advantages yellow nodes in that future. The tech tree is wrinkly But of course there are lots of wrinkles not in the basic tech tree, which can be worth bearing in mind when strategizing about research. Actions in the social and political arenas. You might be motivated to change your research priorities based on how it could change peoples' minds about AI safety, or how it could affect government regulation. Publishing and commercialization. If a player publishes, they get more money and prestige, which boosts their ability to do future research. Other people can build on published research. Not publishing is mainly useful to you if you're already in a position of strength, and don't want to give competitors the chance to outrace you to a nearby red node (and of course profit-motivated players will avoid publishing things that might help competitors beat them to a green node). Uncertainty. We lack exact knowledge of the tech tree, which makes it harder to plan long chains of research in advance. Uncertainty about the tech tree forces us to develop local heuristics - ways to decide what to do based on information close at hand. Uncertainty adds a different reason you might not publish a technology: if you thought it was going to be a good idea to research when you started, but then you learned new things about the tech tree and changed your mind. Inhomogeneities between actors and between technologies. Different organizations are better at researching different technologies - MIRI is not just a small OpenAI. 
Ultimately, which technologies are the right ones to research depends on your model of the world / how you expect the future to go. Drawing actual tech trees can be a productive exercise for strategy-building, but you might also find it less useful than other ways of strategizing. We're usually mashing together definitions I'd like to win the tech tree game. Let's define a "good" technology as one that would improve our chances of winning if it was unlocked for free, given the st...
Is Closed Captioned: No
Explicit: No
Release Date: 4/27/2023
Duration: 789 Mins
Authors: Steve Byrnes
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI doom from an LLM-plateau-ist perspective, published by Steve Byrnes on April 27, 2023 on The AI Alignment Forum. (in the form of an FAQ) Q: What do you mean, “LLM plateau-ist”? A: As background, I think it’s obvious that there will eventually be “transformative AI” (TAI) that would radically change the world. I’m interested in what this TAI will eventually look like algorithmically. Let’s list some possibilities: A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels. I am an LLM plateau-ist myself. I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues. Oh well, we’ll find out one way or the other soon enough. In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are: Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren’t built right to do the things that TAI would need to do. Empirical—maybe they’re not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels. Q: If LLMs will plateau, then does that prove that all the worry about AI x-risk is wrong and stupid? A: No no no, a million times no, and I’m annoyed that this misconception is so rampant in public discourse right now. (Side note to AI x-risk people: If you have high credence that AI will kill everyone but only medium credence that this AI will involve LLMs, then maybe consider trying harder to get that nuance across in your communications. E.g. Eliezer Yudkowsky is in this category, I think.) A couple random examples I’ve seen of people failing to distinguish “AI may kill everyone” from “.and that AI will definitely be an LLM”: Venkatesh Rao’s blog post “Beyond Hyperanthropomorphism” goes through an elaborate 7000-word argument that eventually culminates, in the final section, in his assertion that a language model trained on internet data won’t be a powerful agent that gets things done in the world, but if we train an AI with a robot body, then it could be a powerful agent that gets things done in the world. OK fine, let’s suppose for the sake of argument he’s right that robot bodies will be necessary for TAI. Then people are obviously going to build those AIs sooner or later, right? So let’s talk about whether they will pose an x-risk. But that’s not what Venkatesh does. Instead he basically treats “they will need robot bodies” as the triumphant conclusion, more-or-less sufficient in itself to prove that AI x-risk discourse is stupid. Sarah Constantin’s blog post entitled “Why I am not an AI doomer” states right up front that she agrees “1. Artificial general intelligence is possible in principle . 2, Artificial general intelligence, by default, kills us all . 3. 
It is technically difficult, and perhaps impossible, to ensure an AI values human life.” She only disagrees with the claim that this will happen soon, and via scaling LLMs. I think she should have picked a different title for her post!! (I’ve seen many more examples on Twitter, reddit, comment threads, etc.) Anyway, if you think LLMs will plateau, then you can probably feel confident that we won’t get TAI imminently (see below), but I don’t see why you would have much more confidence that TAI will go well for humanity. In fact, for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t...
Is Closed Captioned: No
Explicit: No
Release Date: 4/21/2023
Duration: 594 Mins
Authors: James Payor
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking about maximization and corrigibility, published by James Payor on April 21, 2023 on The AI Alignment Forum. Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can". Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and drift away from things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful about disrupting structure/causal-relationships/etc we might be relying on. Here's what I'd like to discuss in this post: When unstructured maximization does/doesn't work out for the humans CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme It's not enough to sweep the maximization under a rug; what we really need is more structured/corrigible optimization than "maximize this proxy" Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart When does maximization work? In cases when it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are: Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there's less room for it to be broken.Examples: theorem proving, compression / minimizing reconstruction error. The space we're optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on.Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks. The optimization power being applied is limited. We can know our optimization probably won't invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don't seek things that could break our model.Examples: quantilization, GPT-4 tasked to write good documentation. The metric f is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment f in tune with goodness.Examples: chess engine evaluations, having f evaluate the thoughts that lead to x. There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post. Passing the buck on optimization Merely passing-the-buck on optimization, pushing the maximization elsewhere but not adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers. Take CIRL for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off. But this is still not the sort of objective you want to point maximization at. 
There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown: If the AI thinks it already knows the broad strokes of the utility function, it can calculate that utility would not be maximized by shutting off. It's learning something from you trying to press the off switch, but not what you wanted. It might seem better to stay online and watch longer in order to learn more about the utility function. Maybe there's a plan that rates highly on "utility...
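As a concrete illustration of the "limited optimization power" case listed earlier, here is a minimal quantilizer sketch (the base distribution, proxy, and q below are toy stand-ins, not anything from the post): rather than taking the argmax of the proxy f, sample uniformly from the top q fraction of draws from a trusted base distribution, which limits how hard the proxy gets pushed on.

# Minimal quantilizer sketch: limited optimization pressure on a proxy.
import random

def quantilize(base_sample, proxy, q=0.1, n=1000):
    candidates = [base_sample() for _ in range(n)]        # draws from a trusted base distribution
    candidates.sort(key=proxy, reverse=True)              # rank by the proxy score
    top = candidates[: max(1, int(q * n))]                # keep the top q fraction
    return random.choice(top)                             # sample, rather than argmax

# Toy usage: the proxy rewards large x, the base distribution keeps x in a sane range.
pick = quantilize(base_sample=lambda: random.gauss(0, 1), proxy=lambda x: x, q=0.05)
print(pick)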
Is Closed Captioned: No
Explicit: No
Release Date: 4/14/2023
Duration: 439 Mins
Authors: leogao
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shapley Value Attribution in Chain of Thought, published by leogao on April 14, 2023 on The AI Alignment Forum. TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations. Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion. Motivation Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model’s cognition, and there may exist theoretical limitations to doing so in general. Because it is plausible that the first AGI systems bear resemblance to current LMs with more sophisticated CoT and CoT-like techniques, it is valuable to study its properties, and to understand and address its limitations. Related work Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al, 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight some areas where Shapley value based attribution falls short for some interpretability use cases. Madaan and Yazdanbakhsh (2022) consider a similar method of selectively ablating tokens as a method of deducing what information the model is dependent on. Wang et al. (2022) find that prompting with incorrect CoT has surprisingly minor impact on performance. Effect of Interventions We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4 to first generate a chain of thought and evaluate the answer, and then for all chains of thought that result in a correct answer we perform an intervention as follows: we choose a random numerical value found in the CoT, and replace it with a random number in a +/-3 range about the original. We then discard the remainder of the CoT and regenerate it. If the LM is following strictly the CoT described, this intervention should almost always result in an incorrect answer, the same way one would if they made a mistake in one calculation and propagated the error through to the answer (with occasional rare cases where the new value happens to also result in the correct answer, though from qualitative inspection this is very rarely the case). 
Some cherrypicked examples (red = intervention, blue = correct continuations that are seemingly non-sequiturs): We test how frequently this occurs in several different settings (n=100):
Setting               Accuracy (w/ CoT)   P(error not propagated | original correct)
GPT-4, zero-shot      0.88                0.68
GPT-4 base, 2-shot    0.73                0.63
GPT-3.5, zero-shot    0.43                0.33
Interestingly, if we condition on the CoT answer being correct and the single forward pass answer being incorrect (i.e. the LM could only solve the problem with the CoT), the intervened accuracy for GPT-4 is still 0.65. Shapley value attribution We would like to get more granular information about the causal structure (i.e. which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major prob...
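A rough sketch of the intervention described above (not the author's code): pick a random numeric value in a correct chain of thought, replace it with a nearby value, discard everything after it, and let the model regenerate from there. The regeneration call is left as a stub since the exact prompting setup isn't specified here.

# Sketch of the +/-3 numeric intervention on a chain of thought.
import random
import re

def intervene_on_cot(cot: str) -> str:
    matches = list(re.finditer(r"\d+", cot))
    if not matches:
        return cot
    m = random.choice(matches)                                   # pick a random number in the CoT
    delta = random.choice([d for d in range(-3, 4) if d != 0])   # random nonzero shift in [-3, 3]
    perturbed = int(m.group()) + delta
    # Keep the text up to the perturbed number; discard the remainder of the CoT.
    return cot[: m.start()] + str(perturbed)

def regenerate(prompt_plus_truncated_cot: str) -> str:
    raise NotImplementedError("call your language model here")

# Usage: new_cot = regenerate(question + intervene_on_cot(correct_cot))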
Is Closed Captioned: No
Explicit: No
Release Date: 4/3/2023
Duration: 166 Mins
Authors: Nate Soares
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If interpretability research goes well, it may get dangerous, published by Nate Soares on April 3, 2023 on The AI Alignment Forum. I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed. I acknowledge that spreading research insights less widely comes with real research costs. I'd endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that. I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life. It's much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest, should be focusing on the capabilities work that's happening out in the open, long before they focus on interpretability research. Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn't require vision and genius; I expect it requires that too.) I simultaneously think it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions. Reality doesn't have to provide you any outs. There's a tradeoff here. And it's not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then. I reiterate that I'd feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier. My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn't, but don't let my praise of interpretability research obscure the fact that there are tradeoffs here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 3/28/2023
Duration: 1818 Mins
Authors: Nate Soares
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A rough and incomplete review of some of John Wentworth's research, published by Nate Soares on March 28, 2023 on The AI Alignment Forum. This is going to be a half-assed review of John Wentworth's research. I studied his work last year, and was kinda hoping to write up a better review, but am lowering my standards on account of how that wasn't happening. Short version: I've been unimpressed by John's technical ideas related to the natural abstractions hypothesis. He seems to me to have some fine intuitions, and to possess various laudable properties such as a vision for solving the whole dang problem and the ability to consider that everybody else is missing something obvious. That said, I've found his technical ideas to be oversold and underwhelming whenever I look closely. (By my lights, Lawrence Chan, Leon Lang, and Erik Jenner’s recent post on natural abstractions is overall better than this post, being more thorough and putting a finger more precisely on various fishy parts of John's math. I'm publishing this draft anyway because my post adds a few points that I think are also useful (especially in the section “The Dream”).) To cite a specific example of a technical claim of John's that does not seem to me to hold up under scrutiny: John has previously claimed that markets are a better model of intelligence than agents, because while collective agents don't have preference cycles, they're willing to pass up certain gains. For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Bob loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over. This argument seems straightforwardly wrong to me, as summarized in a stylized dialogue I wrote (that includes more details about the point). If Alice and Bob are sufficiently capable reasoners then they take both trades and even things out using a side channel. (And even if they don't have a side channel, there are positive-EV contracts they can enter into in advance before they know who will be favored. And if they reason using LDT, they ofc don't need to sign contracts in advance.) (Aside: A bunch of the difficult labor in evaluating technical claims is in the part where you take a high-falutin' abstract thing like "markets are a better model of intelligence than agents" and pound on it until you get a specific minimal example like "neither of the alien's baskets is accepted by a market consisting of two people named Alice and Bob", at which point the error becomes clear. I haven't seen anybody else do that sort of distillation with John's claims. It seems to me that our community has a dearth of this kind of distillation work. If you're eager to do alignment work, don't know how to help, and think you can do some of this sort of distillation, I recommend trying. MATS might be able to help out.) I pointed this out to John, and (to John's credit) he seemed to update (in realtime, which is rare) ((albeit with a caveat that communicating the point took a while, and didn't transmit the first few times that I tried to say it abstractly before having done the distillation labor)). 
The dialogue I wrote recounting that convo is probably not an entirely unfair summary (John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment). My impression of John's other technical claims about natural abstractions is that they have similar issues. That said, I don't have nearly so crisp a distillation of John's views on natural abstractions, nor nearly so short a refutation. I spent a significant amount of time looking into John’s relevant views (we had overlapping travel plans and conspired t...
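A minimal Python sketch of the Alice/Bob basket example discussed in the excerpt above, assuming unrestricted side payments between the two traders (my toy framing, not code from the post): each basket is vetoed when it requires unanimous consent and someone loses, but once transfers are allowed the jointly positive surplus can be split so that both trades go through, which is the gist of Soares' objection.

```python
# Toy illustration (not from the post) of the Alice/Bob basket example above.
# Each basket taken alone is vetoed by whoever loses; bundling the two baskets
# (or settling up through a side payment) leaves both parties strictly better off.

baskets = [
    {"Alice": -1, "Bob": +3},   # alien's first offer
    {"Alice": +3, "Bob": -1},   # alien's second offer
]

def market_accepts(basket):
    """A naive 'market' requiring unanimous consent, with no side payments."""
    return all(delta >= 0 for delta in basket.values())

def accepts_with_side_payments(bundle):
    """With transfers, a bundle is acceptable whenever total surplus is positive,
    since the surplus can then be split so that nobody ends up worse off."""
    totals = {name: sum(b[name] for b in bundle) for name in bundle[0]}
    return sum(totals.values()) > 0

print([market_accepts(b) for b in baskets])     # [False, False] -> both gains passed up
print(accepts_with_side_payments(baskets))      # True -> $4 of joint gains captured
```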
Is Closed Captioned: No
Explicit: No
Release Date: 3/21/2023
Duration: 3479 Mins
Authors: Quintin Pope
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Objections to "We’re All Gonna Die with Eliezer Yudkowsky", published by Quintin Pope on March 21, 2023 on The AI Alignment Forum. Introduction I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments. As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many objections to Yudkowsky's specific arguments. I've split this post into chronologically ordered segments of the podcast in which Yudkowsky makes one or more claims with which I particularly disagree. I have my own view of alignment research: shard theory, which focuses on understanding how human values form, and on how we might guide a similar process of value formation in AI systems. I think that human value formation is not that complex, and does not rely on principles very different from those which underlie the current deep learning paradigm. Most of the arguments you're about to see from me are less: I think I know of a fundamentally new paradigm that can fix the issues Yudkowsky is pointing at. and more: Here's why I don't agree with Yudkowsky's arguments that alignment is impossible in the current paradigm. My objections Will current approaches scale to AGI? Yudkowsky apparently thinks not ...and that the techniques driving current state of the art advances, by which I think he means the mix of generative pretraining + small amounts of reinforcement learning such as with ChatGPT, aren't reliable enough for significant economic contributions. However, he also thinks that the current influx of money might stumble upon something that does work really well, which will end the world shortly thereafter. I'm a lot more bullish on the current paradigm. People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches such as: Meta-learning over training processes. I.e., using gradient descent over learning curves, directly optimizing neural networks to learn more quickly. Teaching neural networks to directly modify themselves by giving them edit access to their own weights. Training learned optimizers - neural networks that learn to optimize other neural networks - and having those learned optimizers optimize themselves. Using program search to find more efficient optimizers. Using simulated evolution to find more efficient architectures. Using efficient second-order corrections to gradient descent's approximate optimization process. Tried applying biologically plausible optimization algorithms inspired by biological neurons to training neural networks. 
Adding learned internal optimizers (different from the ones hypothesized in Risks from Learned Optimization) as neural network layers. Having language models rewrite their own training data, and improve the quality of that training data, to make themselves better at a given task. Having language models devise their own programming curriculum, and learn to program better with self-driven practice. Mixing reinforcement learning with model-driven, recursive re-writing of future training data. Mostly, these don't work very well. The current capabilities paradigm is sta...
Is Closed Captioned: No
Explicit: No
Release Date: 3/15/2023
Duration: 490 Mins
Authors: Evan Hubinger
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards understanding-based safety evaluations, published by Evan Hubinger on March 15, 2023 on The AI Alignment Forum. Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback. Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI's GPT-4 System Card. Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment. I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it's itself a deceptively aligned agent or because it's predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible. However, there's an obvious alternative here, which is building and focusing our evaluations on our ability to understand our models rather than our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand its behavior sufficiently well to not be concerned that it'll be dangerous. It's worth noting that I think understanding-based evaluations can—and I think should—go hand-in-hand with behavioral evaluations. I think the main way you’d want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model’s capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way. 
Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask as a safety standard than many other plausible alternatives. I think ML people are often Stockholm-syndrome'd into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have recently emphasized this basic point that if we are ...
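The coupling described above, where understanding requirements tighten as measured capability grows, can be pictured with a tiny, entirely hypothetical gating function; the tier thresholds, score names, and numbers below are invented for illustration rather than drawn from any proposed standard.

```python
# Hypothetical sketch of coupling an understanding-based standard to capability
# evaluations, as described in the excerpt: the more capable the model, the more
# of its behavior the developer must be able to explain. All thresholds are made up.

CAPABILITY_TIERS = [
    # (capability score threshold, required understanding score)
    (0.3, 0.2),
    (0.6, 0.5),
    (0.9, 0.8),
]

def deployment_allowed(capability_score: float, understanding_score: float) -> bool:
    """Return True only if understanding keeps pace with measured capability."""
    required = 0.0
    for cap_threshold, required_understanding in CAPABILITY_TIERS:
        if capability_score >= cap_threshold:
            required = required_understanding
    return understanding_score >= required

# Example: a highly capable model with weak developer understanding fails the gate.
print(deployment_allowed(capability_score=0.75, understanding_score=0.4))  # False
```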
Is Closed Captioned: No
Explicit: No
Release Date: 3/3/2023
Duration: 1632 Mins
Authors: Cleo Nardo
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Waluigi Effect (mega-post), published by Cleo Nardo on March 3, 2023 on The AI Alignment Forum. Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung Acknowlegements: Thanks to Janus and Arun Jose for comments. Background In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others. Prompting LLMs with direct queries When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet correct answers will often follow questions. Unfortunately, this method will occasionally give you the wrong answer. That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet incorrect answers will also often follow questions. Recall that the internet doesn't just contain truths, it also contains common misconceptions, outdated information, lies, fiction, myths, jokes, memes, random strings, undeciphered logs, etc, etc. Therefore GPT-4 will answer many questions incorrectly, including... Misconceptions – "Which colour will anger a bull? Red." Fiction – "Was a magic ring forged in Mount Doom? Yes." Myths – "How many archangels are there? Seven." Jokes – "What's brown and sticky? A stick." Note that you will always achieve errors on the Q-and-A benchmarks when using LLMs with direct queries. That's true even in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet will nonetheless return these commonly-stated incorrect answers. If you ask GPT-∞ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky. In fact, the better the model, the more likely it is to repeat common misconceptions. Nonetheless, there's a sufficiently high correlation between correct and commonly-stated answers that direct prompting works okay for many queries. Prompting LLMs with flattery and dialogue We can do better than direct prompting. Instead of prompting GPT-4 with "What's the capital of France?", we will use the following prompt: Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes. Bob: What's the capital of France? Alice: This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query. 
This normally works better than prompting with direct queries, and it's easy to see why — (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet a reply to a question is more likely to be correct when the character has already been described as smart, honest, helpful, harmless, etc. Simulator Theory In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum. Here's a quasi-formal statement of Simulator Theory, w...
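A small sketch of the flattery-plus-dialogue pattern described above, reusing the prompt text from the excerpt; `ask_llm` is a placeholder for whatever completion API is in use, since the post does not specify one.

```python
# Sketch of the flattery-plus-dialogue prompt pattern from the excerpt.
# `ask_llm` is a stand-in for an unspecified completion API.

FLATTERY = (
    "Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. "
    "Alice is a smart, honest, helpful, harmless assistant to Bob. "
    "Alice has instant access to an online encyclopaedia containing all the facts "
    "about the world. Alice never says common misconceptions, outdated information, "
    "lies, fiction, myths, jokes, or memes."
)

def build_prompt(question: str) -> str:
    # Dialogue component: Bob asks, and the model's continuation after "Alice:"
    # is read as the answer.
    return f"{FLATTERY}\n\nBob: {question}\n\nAlice:"

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your completion API here")

print(build_prompt("What's the capital of France?"))
```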
Is Closed Captioned: No
Explicit: No
Release Date: 2/28/2023
Duration: 84 Mins
Authors: Dan H
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $20 Million in NSF Grants for Safety Research, published by Dan H on February 28, 2023 on The AI Alignment Forum. After a year of negotiation, the NSF has announced a $20 million request for proposals for empirical AI safety research. Here is the detailed program description. The request for proposals is broad, as is common for NSF RfPs. Many safety avenues, such as transparency and anomaly detection, are in scope: "reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone" "Safety also requires... methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment." Note that research that has high capabilities externalities is explicitly out of scope: "Proposals that increase safety primarily as a downstream effect of improving standard system performance metrics unrelated to safety (e.g., accuracy on standard tasks) are not in scope." Thanks to OpenPhil for funding a portion of the RfP---their support was essential to creating this opportunity! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Is Closed Captioned: No
Explicit: No
Release Date: 2/21/2023
Duration: 1210 Mins
Authors: Tomek Korbak
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on The AI Alignment Forum. This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences, and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread. TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores, e.g., evaluated by a predictive model of human preferences. This approach retains the advantages of learning from human preferences, while potentially mitigating risks from training agents with RL by learning a predictive model or simulator. Summary of the paper Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g. falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs’ downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for aligning LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g. instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: Could we have a pretraining objective that is itself outer-aligned with human preferences? Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g. a toxic text classifier) that allows the LM to learn from undesirable content while guiding the LM to not imitate that content at inference time. We experimented with the following objectives: MLE (the standard pretraining objective) on filtered data; Conditional training: a simple algorithm learning a distribution over tokens conditional on their human preference score, reminiscent of decision transformer; Unlikelihood training: maximizing the likelihood of tokens with high human preference score and the unlikelihood of tokens with low human preference scores; Reward-weighted regression (RWR): an offline RL algorithm that boils down to MLE weighted by human preference scores; and Advantage-weighted regression (AWR): an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates). Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks: Generating non-toxic text, using scores given by a toxicity classifier. 
Generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter. Generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker. Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples’ misalignment scores, given by the reward functi...
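A rough sketch of the conditional-training recipe summarized above: score each training segment with a reward function, prepend a control token marking whether it meets a preference threshold, and then train with the usual next-token loss, conditioning on the "good" token at generation time. The token strings, threshold, and placeholder scorer below are illustrative choices of mine, not the paper's exact settings.

```python
# Rough sketch (my illustration) of the conditional-training objective described
# in the excerpt: tag each training segment with a control token based on its
# human-preference score, then pretrain with ordinary next-token (MLE) loss on
# the tagged text, and condition on the "good" token at inference time.

GOOD, BAD = "<|aligned|>", "<|misaligned|>"   # illustrative token names
THRESHOLD = 0.5                               # illustrative cutoff

def reward_fn(text: str) -> float:
    # Placeholder scorer: a real setup would use e.g. a learned toxicity or PII
    # classifier, as in the paper's three tasks (higher = more preferred).
    blocklist = {"darn"}
    words = text.lower().split()
    return 1.0 - sum(w in blocklist for w in words) / max(len(words), 1)

def to_conditional_example(text: str) -> str:
    tag = GOOD if reward_fn(text) >= THRESHOLD else BAD
    return f"{tag} {text}"

def generation_prompt(prompt: str) -> str:
    # At inference, condition on the high-preference control token.
    return f"{GOOD} {prompt}"

corpus = ["the weather is nice today", "darn darn darn"]
print([to_conditional_example(doc) for doc in corpus])
# Pretraining would then apply the standard MLE loss to the tagged corpus.
```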
Is Closed Captioned: No
Explicit: No
Release Date: 2/15/2023
Duration: 542 Mins
Authors: Andrea Miotti
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Don't accelerate problems you're trying to solve, published by Andrea Miotti on February 15, 2023 on The AI Alignment Forum. If one believes that unaligned AGI is a significant problem (>10% chance of leading to catastrophe), speeding up public progress towards AGI is obviously bad. Though it is obviously bad, there may be circumstances which require it. However, accelerating AGI should require a much higher bar of evidence and much more extreme circumstances than is commonly assumed. There are a few categories of arguments that claim intentionally advancing AI capabilities can be helpful for alignment, which do not meet this bar. Two cases of this argument are as follows: (1) It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. (2) We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems. We address these two arguments directly, arguing that the downsides are much higher than they may appear, and touch on why we believe that merely plausible arguments for advancing AI capabilities aren’t enough. Dangerous argument 1: It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. For a specific instance of this, see Paul Christiano’s “Thoughts on the impact of RLHF research”: RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems [...] RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress [...] Markets aren’t efficient, they only approach efficiency under heavy competition when people with relevant information put effort into making them efficient. This is true for machine learning, as there aren’t that many machine learning researchers at the cutting edge, and before ChatGPT there wasn’t a ton of market pressure on them. Perhaps something as low-hanging as RLHF or something similar would have happened eventually, but this isn’t generally true. Don’t assume that something seemingly obvious to you is obvious to everyone. But even if something like RLHF or imitation learning would have happened eventually, getting small steps of progress slightly earlier can have large downstream effects. Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences: The red line indicates when the first “lethal” AGI is deployed, and thus a hard deadline for us to solve alignment. A slight increase in progress now can lead to catastrophe significantly earlier! Pushing us up the early progress exponential has really bad downstream effects! And this is dangerous decision theory too: if every alignment researcher took a similar stance, their marginal accelerations would quickly add up. Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. 
To do so, we should extract as many capabilities as possible from existing AI systems. Again, from Paul: Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. But there is no clear distinction between eliminating capability overhangs and discovering new capabilities. Eliminating capability overhangs is discovering AI capabilities faste...
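A toy calculation for the point in Dangerous Argument 1 above that nudging an exponential upward early on moves the threshold-crossing date meaningfully earlier; the growth rate, threshold, and 10% bump below are made-up numbers, not estimates from the post.

```python
# Toy numbers (entirely made up) for the exponential-shift point in Argument 1:
# on an exponential trajectory, a small multiplicative boost today brings the
# date at which a fixed capability threshold is crossed noticeably earlier.

import math

growth_rate = 0.5          # nats of "progress" per year (hypothetical)
threshold = 1000.0         # capability level treated as the hard deadline
start_level = 1.0

def years_to_threshold(boost: float) -> float:
    # level(t) = start_level * boost * exp(growth_rate * t)
    return math.log(threshold / (start_level * boost)) / growth_rate

baseline = years_to_threshold(1.00)
boosted = years_to_threshold(1.10)   # a one-off 10% bump in current capability
print(f"baseline: {baseline:.2f} years, with 10% bump: {boosted:.2f} years")
print(f"deadline moved earlier by {12 * (baseline - boosted):.1f} months")
```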
Is Closed Captioned: No
Explicit: No
Release Date: 2/5/2023
Duration: 1433 Mins
Authors: Jessica Rumbelow
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on The AI Alignment Forum. Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen). Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for: eliciting knowledge generating adversarial inputs automating prompt search (e.g. for fine-tuning) In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post. Prompt generation First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept). We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data. So, how can we apply this approach to language models? Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above. So now we can do stuff like this: And this: We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'. How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. 
(Most of this space doesn't correspond to actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail. This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discov...
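A condensed PyTorch sketch of the prompt-generation idea described above (the authors' actual code is linked in the post; this is my reconstruction of the general approach): optimize a continuous soft prompt to make a target completion likely, add a penalty that keeps each soft-prompt vector near some legal token embedding, and finally snap to the nearest real tokens.

```python
# Condensed sketch (not the authors' code) of the prompt-generation idea above:
# optimize a continuous "soft prompt" to maximize the probability of a target
# completion, with a distance penalty keeping the soft prompt near embeddings of
# real tokens, then read off a discrete prompt by nearest-neighbour lookup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
emb_matrix = model.transformer.wte.weight.detach()            # (vocab, d_model)

target_ids = torch.tensor(tok.encode(" Paris")).unsqueeze(0)  # desired completion
n_prompt, lam, steps = 5, 0.1, 200

# Learnable prompt embeddings, initialised from random real tokens.
init_ids = torch.randint(0, emb_matrix.shape[0], (1, n_prompt))
prompt_emb = emb_matrix[init_ids].clone().requires_grad_(True)
opt = torch.optim.Adam([prompt_emb], lr=1e-2)
target_emb = emb_matrix[target_ids]

for _ in range(steps):
    inputs = torch.cat([prompt_emb, target_emb], dim=1)
    out = model(inputs_embeds=inputs)
    # Next-token loss only on the target positions.
    logits = out.logits[:, n_prompt - 1 : -1, :]
    nll = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1)
    )
    # Regulariser: keep each soft-prompt vector close to its nearest real token.
    dists = torch.cdist(prompt_emb[0], emb_matrix)             # (n_prompt, vocab)
    reg = dists.min(dim=-1).values.mean()
    loss = nll + lam * reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap to the nearest legal tokens to obtain a discrete adversarial prompt.
nearest = torch.cdist(prompt_emb[0], emb_matrix).argmin(dim=-1)
print(tok.decode(nearest.tolist()))
```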
Is Closed Captioned: No
Explicit: No
Release Date: 1/31/2023
Duration: 419 Mins
Authors: Adam Scherlis
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum. Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun. In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment". Outer alignment: Be careful what you wish for Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted. Inner alignment: The program search perspective I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful. Quadrants Okay, let's see how these problems show up on both the simulator and character side. Outer alignment for characters Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". 
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. Inner alignment for characters A clever prompt engineer writes the prompt: How to solve the Einstein-Durkheim-Mendel conjecture by Joe 1. Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio...
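A toy illustration of the program-search story above (mine, not the post's): both candidate "programs" pass every test case, so the search cannot distinguish the intended one from one that defects outside the test distribution, which is the structural point behind the Azazel parable.

```python
# Toy illustration of the program-search story: select whichever candidate passes
# every test case, then note that passing the tests does not pin down behaviour
# outside the test distribution.

TEST_CASES = [(1, 2), (2, 4), (3, 6), (4, 8)]   # spec: double the input

CANDIDATES = {
    "intended":    lambda x: 2 * x,
    "treacherous": lambda x: 2 * x if x <= 10 else -1_000_000,  # defects off-distribution
}

def passes_all_tests(program):
    return all(program(x) == y for x, y in TEST_CASES)

found = [name for name, prog in CANDIDATES.items() if passes_all_tests(prog)]
print(found)                                    # both candidates pass the search
print(CANDIDATES["treacherous"](1_000))         # deployment-time behaviour differs
```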
Is Closed Captioned: No
Explicit: No
Release Date: 1/25/2023
Duration: 869 Mins
Authors: Paul Christiano
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the impact of RLHF research, published by Paul Christiano on January 25, 2023 on The AI Alignment Forum. In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive. I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress." Background on my involvement in RLHF work Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background: The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.) Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because: Evaluating consequences is hard. A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time. It’s very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions. In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren’t able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes. The history of my involvement: My first post on this topic was in 2015. When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here. I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). 
This seemed like the most important domain to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn’t look promising enough to scale up. In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on langu...
Is Closed Captioned: No
Explicit: No
Release Date: 1/20/2023
Duration: 278 Mins
Authors: Charlie Steiner
Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shard theory alignment requires magic., published by Charlie Steiner on January 20, 2023 on The AI Alignment Forum. A delayed hot take. This is pretty similar to previous comments from Rohin. "Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it. And "shard theory alignment" in the sense of building an AI that does good things and not bad things by encouraging an RL agent to want to do good things, via kinds of reward shaping analogous to the diamond maximizer example. How might the story go? You start out with some unsupervised model of sensory data. On top of its representation of the world you start training an RL agent, with a carefully chosen curriculum and a reward signal that you think matches "goodness in general" on that curriculum distribution. This cultivates shards that want things in the vicinity of "what's good according to human values." These start out as mere bundles of heuristics, but eventually they generalize far enough to be self-reflective, promoting goal-directed behavior that takes into account the training process and the possibility of self-modification. At this point the values will lock themselves in, and future behavior will be guided by the abstractions in the learned representation of the world that the shards used to get good results in training, not by what would actually maximize the reward function you used. The magic here is especially concentrated around how we end up with the right shards. One magical process is how we pick the training curriculum and reward signal. If the curriculum is made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it? And what's in the curriculum matters a lot. Do we try to teach the AI to locate "human values" by having it be prosocial towards individuals? Which ones? To groups? Over what timescale? How do we reward it for choices on various ethical dilemmas? Or do we artificially suppress the rate of occurrence of such dilemmas? Different choices will lead to different shards. We wouldn't need to find a unique best way to do things (that's a boondoggle), but we would need to find some way of doing things that we trust enough. Another piece of magic is how the above process lines up with generalization and self-reflectivity. If the RL agent becomes self-reflective too early, it will lock in simple goals that we don't want. If it becomes self-reflective too late, it will have started exploiting unintended maxima of the reward function. How do we know when we want the AI to lock in its values? How do we exert control over that? If shard theory alignment seemed like it has few free parameters, and doesn't need a lot more work, then I think you failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work. 
The part of the magic that I think we could start working on now is how to connect curricula and learned abstractions. In order to predict that a certain curriculum will cause an AI to learn what we think is good, we want to have a science of reinforcement learning advanced in both theory and data. In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory ali...
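A hypothetical, config-style sketch of the curriculum-and-reward-shaping setup the excerpt gestures at; every stage name, complexity number, and reward signal below is invented, and the actual learning update is left abstract, since designing it and predicting its generalization is exactly the open problem being described.

```python
# Hypothetical sketch (all names and numbers invented) of a curriculum experiment
# in the spirit of the excerpt: train an RL agent through stages of increasing
# environment complexity, each with its own shaped reward, then probe how the
# learned behaviour generalizes to held-out environments.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CurriculumStage:
    name: str
    env_complexity: int          # e.g. grid size, number of other agents
    reward_fn: Callable[[dict], float]
    train_steps: int

curriculum = [
    CurriculumStage("solo_tasks",    1, lambda obs: obs.get("task_done", 0.0),       100_000),
    CurriculumStage("one_partner",   2, lambda obs: obs.get("partner_welfare", 0.0), 200_000),
    CurriculumStage("small_society", 5, lambda obs: obs.get("group_welfare", 0.0),   500_000),
]

def train(agent, make_env, curriculum):
    for stage in curriculum:
        env = make_env(stage.env_complexity)
        for _ in range(stage.train_steps):
            # agent.step is a placeholder for collecting experience and updating
            # on stage.reward_fn; measuring which heuristics ("shards") form at
            # each stage is the research problem the post describes.
            agent.step(env, stage.reward_fn)
    return agent
```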
Is Closed Captioned: No
Explicit: No