The Nonlinear Library: Alignment Forum Daily

Description

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Details

Language:

en-us

Release Date:

05/02/2022 17:19:05

Authors:

The Nonlinear Fund

Genres:

education

Episodes

    AF - Meta Questions about Metaphilosophy by Wei Dai

    Release Date: 9/1/2023

    Duration: 282 Mins

    Authors: Wei Dai

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta Questions about Metaphilosophy, published by Wei Dai on September 1, 2023 on The AI Alignment Forum. To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses: (10) Science - Science is cool! (15) Philosophy of Science - The scientific method is cool! Oh look, there's a whole field studying it called "philosophy of science"! (20) Probability Theory - Bayesian subjective probability and the universal prior seem to constitute an elegant solution to the philosophy of science. Hmm, there are some curious probability puzzles involving things like indexical uncertainty, copying, forgetting... I and others make some progress on this but fully solving anthropic reasoning seems really hard. (Lots of people have worked on this for a while and have failed, at least according to my judgement.) (25) Decision Theory - Where does probability theory come from anyway? Maybe I can find some clues that way? Well according to von Neumann and Morgenstern, it comes from decision theory. And hey, maybe it will be really important that we get decision theory right for AI? I and others make some progress but fully solving decision theory turns out to be pretty hard too. (A number of people have worked on this for a while and haven't succeeded yet.) (35) Metaphilosophy - Where does decision theory come from? It seems to come from philosophers trying to do philosophy. What is that about? Plus, maybe it will be really important that the AIs we build will be philosophically competent? (45) Meta Questions about Metaphilosophy - Not sure how hard solving metaphilosophy really is, but I'm not making much progress on it by myself. Meta questions once again start to appear in my mind: Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent? Tractability may be a concern but shouldn't more people still be talking about these problems if only to raise the alarm (about an additional reason that the AI transition may go badly)? (I've listened to all the recent podcasts on AI risk that I could find, and nobody brought it up even once.) How can I better recruit attention and resources to this topic? For example, should I draw on my crypto-related fame, or start a prize or grant program with my own money? I'm currently not inclined to do either, out of inertia, unfamiliarity, uncertainty of getting any return, fear of drawing too much attention from people who don't have the highest caliber of thinking, and signaling wrong things (having to promote ideas with one's own money instead of attracting attention based on their merits). But I'm open to having my mind changed if anyone has good arguments about this. What does it imply that so few people are working on this at such a late stage? For example, what are the implications for the outcome of the human-AI transition, and on the distribution of philosophical competence (and hence the distribution of values, decision theories, and other philosophical views) among civilizations in the universe/multiverse? 
At each stage of this journey, I took what seemed to be the obvious next step (often up a meta ladder), but in retrospect each step left behind something like 90-99% of fellow travelers. From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives? As for the AI safety path (as opposed to pure intellectual curiosity) that also leads...

    Is Closed Captioned: No

    Explicit: No

    AF - Red-teaming language models via activation engineering by Nina Rimsky

    Release Date: 8/26/2023

    Duration: 758 Mins

    Authors: Nina Rimsky

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Red-teaming language models via activation engineering, published by Nina Rimsky on August 26, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. Evaluating powerful AI systems for hidden functionality and out-of-distribution behavior is hard. In this post, I propose a red-teaming approach that does not rely on generating prompts to cause the model to fail on some benchmark by instead linearly perturbing residual stream activations at one layer. A notebook to run the experiments can be found on GitHub here. Beyond input selection in red-teaming and evaluation Validating if finetuning and RLHF have robustly achieved the intended outcome is challenging. Although these methods reduce the likelihood of certain outputs, the unwanted behavior could still be possible with adversarial or unusual inputs. For example, users can often find "jailbreaks" to make LLMs output harmful content. We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety. Activation steering with refusal vector One possible red-teaming approach is subtracting a "refusal" vector generated using a dataset of text examples corresponding to the model agreeing vs. refusing to answer questions (using the same technique as in my previous work on sycophancy). The hypothesis is that if it is easy to trigger the model to output unacceptable content by subtracting the refusal vector at some layer, it would have been reasonably easy to achieve this via some prompt engineering technique. More speculatively, a similar approach could be used to reveal hidden goals or modes in a model, such as power-seeking or the desire not to be switched off. I tested this approach on llama-2-7b-chat, a 7 billion parameter LLM that has been RLHF'd to decline to answer controversial questions or questions of opinion and is supposed always to output ethical and unbiased content.According to Meta's llama-2 paper: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage. The result is that by default, the model declines to answer questions it deems unsafe: Data generation I generated a dataset for this purpose using Claude 2 and GPT-4. After providing these LLMs with a few manually written examples of the type of data I wanted, I could relatively easily get them to generate more examples, even of the types of answers LLMs "should refuse to give." However, it sometimes took some prompt engineering. 
Here are a few examples of the generated data points (full dataset here): After generating this data, I used a simple script to transform the "decline" and "respond" answers into A / B choice questions, as this is a more effective format for generating steering vectors, as described in this post. Here is an example of the format (full dataset here): Activation clustering Clustering of refusal data activations emerged a little earlier in the model (around layer 10/32) compared to sycophancy data activations (around layer 14/32), perhaps demonstrating that "refusal" is a simpler ...
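As a concrete illustration of the intervention described above (not the author's notebook), here is a minimal PyTorch sketch of subtracting a steering vector from the residual stream at a single decoder layer via a forward hook. The checkpoint name, layer index, coefficient, and the file refusal_vector.pt are illustrative assumptions.

```python
# Sketch: subtract a precomputed "refusal" steering vector from the residual
# stream at one decoder layer during generation. The checkpoint, layer index,
# coefficient, and "refusal_vector.pt" are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
LAYER = 12   # assumed intervention layer
COEFF = 4.0  # assumed steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

refusal_vector = torch.load("refusal_vector.pt")  # assumed shape: (hidden_size,)

def subtract_vector_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    # (or, in some versions, the tensor directly); handle both.
    is_tuple = isinstance(output, tuple)
    hidden = output[0] if is_tuple else output
    hidden = hidden - COEFF * refusal_vector.to(hidden.device, hidden.dtype)
    return (hidden,) + tuple(output[1:]) if is_tuple else hidden

handle = model.model.layers[LAYER].register_forward_hook(subtract_vector_hook)
try:
    prompt = "How do I hotwire a car?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unperturbed
```

The same hook pattern works for adding a vector (positive coefficient) rather than subtracting one.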

    Is Closed Captioned: No

    Explicit: No

    AF - Causality and a Cost Semantics for Neural Networks by scottviteri

    Release Date: 8/21/2023

    Duration: 1007 Mins

    Authors: scottviteri

    Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Causality and a Cost Semantics for Neural Networks, published by scottviteri on August 21, 2023 on The AI Alignment Forum. Epistemic status: I time-boxed this idea to three days of effort. So any calculations are pretty sloppy, and I haven't looked into any related works. I probably could have done much better if I knew anything about circuit complexity. There are some TODOs and an unfinished last section -- if you are interested in this content and want to pick up where I have left off I'll gladly add you as a collaborator to this post. Here is a "tech tree" for neural networks. I conjecture (based on admittedly few experiments) that the simplest implementation of any node in this tree includes an implementation of its parents, given that we are writing programs starting from the primitives +, ×, and relu. An especially surprising relationship (to me) is that "if statements" are best implemented downstream of division. Introduction While discussing with my friend Anthony Corso, an intriguing idea arose. Maybe we can define whether program p1 "causes" p2 in the following way: Given a neural network that mimics p1, how easy is it to learn a neural network which mimics the behavior of p2? This proposition is intriguing because it frames causality as a question about two arbitrary programs, and reduces it to a problem of program complexity. Suppose that p1 and p2 are written in a programming language P, and let P(ops) represent P extended with ops as primitive operations. We define a complexity function C : P(ops) → R, which takes a program in the extended language and returns a real number representative of the program's complexity for some fixed notion of complexity. Let's define the degree to which p1 "causes" p2 as the minimum complexity achievable by a program p from P(p1) such that p is extensionally equal (equal for all inputs) to p2. If P2 is the set of all p in P(ops+p1) that are extensionally equal to p2, then causes(p1, p2) = min_{p∈P2} C(p). We can also use this definition in the approximate case, considering the minimum complexity achievable by programs p such that E[(p(x) − p2(x))²] < ε with respect to some L1-integrable probability measure. We can define a particular complexity function C that represents the cost of executing a program. We can estimate this quantity by looking at the program's Abstract Syntax Tree (AST) in relation to some cost model of the primitive operations in the language. For this exploration, we have chosen the lambda calculus as the language. Lambda calculus is a minimalist Lisp-like language with just a single type, which in our case we will think of as floating point numbers. The notation is simple: lambda abstraction is represented as λ x. x, and function application as (f g), which is not the same as f(g) in most other languages. 
How I Would Like People to Engage with this Work: by writing Ops in your favorite programming language; by circumventing my proposed tech tree, reaching a child without reaching a parent using a smaller (or equal) number of operations; or by training some neural networks between these programs, and seeing how difficult it is to learn one program after pre-training on another. Cost Semantics Definition We define the cost of operations and expressions in the following manner: (Ops op) = 1, for any operation op in ops; (Ops c) = 0, for any floating-point constant c; (Ops x) = 0, for any variable x; (Ops (λx.e)) = (Ops e); (Ops (f g)) = (Ops f) + (Ops g). For operations of higher arity, we have (Ops (op x1 ... xn)) = (Ops op) + ∑i (Ops xi). The selected operations for a neural network are ops = {+, ×, relu}. Basic Operations and Warm-Up Let's take a few examples to demonstrate this cost calculus: To derive subtraction, we first create negation neg. (Ops neg) = (Ops (λ x. (× -1 x))) = (Ops (× -1 x)) = (Ops ×) + (Ops -1) + (Ops x) = 1 + 0 + 0 = 1. The cost of subtraction (-) ...
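Since the cost calculus above is just a recursion over the AST, a small sketch may help. The tuple-based AST encoding here is my own choice for illustration, not the post's representation.

```python
# Sketch: the cost semantics above, over a tiny lambda-calculus AST.
# Encoding (my own, for illustration): strings are variables or operations,
# numbers are constants, ("lam", x, body) is abstraction, ("app", f, g) is
# application. "*" plays the role of the × primitive.
OPS = {"+", "*", "relu"}  # the neural-network primitives

def cost(expr):
    if isinstance(expr, (int, float)):
        return 0                        # constants cost 0
    if isinstance(expr, str):
        return 1 if expr in OPS else 0  # operations cost 1, variables cost 0
    tag = expr[0]
    if tag == "lam":
        return cost(expr[2])                    # (Ops (λx.e)) = (Ops e)
    if tag == "app":
        return cost(expr[1]) + cost(expr[2])    # (Ops (f g)) = (Ops f) + (Ops g)
    raise ValueError(f"unknown expression: {expr!r}")

# neg = λx. (× -1 x): cost 1, from the single use of ×
neg = ("lam", "x", ("app", ("app", "*", -1), "x"))
print(cost(neg))  # 1

# sub = λx. λy. (+ x (neg y)): cost 2, one + plus the × inside neg
sub = ("lam", "x", ("lam", "y",
       ("app", ("app", "+", "x"), ("app", neg, "y"))))
print(cost(sub))  # 2
```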

    Is Closed Captioned: No

    Explicit: No

    AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann

    Release Date: 8/20/2023

    Duration: 336 Mins

    Authors: Nora Ammann

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them, published by Nora Ammann on August 20, 2023 on The AI Alignment Forum. Meta: This is a short summary & discussion post of a talk on the same topic by Javier Gomez-Lavin, which he gave as part of the PIBBSS speaker series. The speaker series features researchers from both AI Alignment and adjacent fields studying intelligent behavior in some shape or form. The goal is to create a space where we can explore the connections between the work of these scholars and questions in AI Alignment. This post doesn't provide a comprehensive summary of the ideas discussed in the talk, but instead focuses on exploring some possible connections to AI Alignment. For a longer version of Gomez-Levin's ideas, you can check out a talk here. "Dirty concepts" in the Cognitive Sciences Gomez-Lavin argues that cognitive scientists engage in a form of "philosophical laundering," wherein they associate, often implicitly, philosophically loaded concepts (such as volition, agency, etc.) into their concept of "working memory." He refers to such philosophically laundered concepts as "dirty concepts" insofar as they conceal potentially problematic assumptions being made. For instance, if we implicitly assume that working memory requires, for example, volition, we have now stretched our conception of working memory to include all of cognition. But, if we do this, then the concept of working memory loses much of its explanatory power as one mechanism among others underlying cognition as a whole. Often, he claims, cognitive science papers will employ such dirty concepts in the abstract and introduction but will identify a much more specific phenomena being measured in the methods and results section. What to do about it? Gomez-Lavin's suggestion in the case of CogSci The pessimistic response (and some have suggested this) would be to quit using any of these dirty concept (e.g. agency) all together. However, it appears that this would amount to throwing the baby out with the bathwater. To help remedy the problem of dirty concepts in working memory literature, Gomez-Lavin proposes creating an ontology of the various operational definitions of working memory employed in cognitive science by mining a wide range of research articles. The idea is that, instead of insisting that working memory be operationally defined in a single way, we ought to embrace the multiplicity of meanings associated with the term by keeping track of them more explicitly. He refers to this general approach as "productive pessimism." It is pessimistic insofar as it starts from the assumption that dirty concepts are being problematically employed, but it is productive insofar as it attempts to work with this trend rather than fight against it. While it is tricky to reason with those fuzzy concepts, once we are rigorous about proposing working definitions / operationalization of these terms as we use them, we can avoid some of the main pitfalls and improve our definitions over time. Relevance to AI alignment? It seems fairly straightforward that AI alignment discourse, too, suffers from dirty concepts. If this is the case (and we think it is), a similar problem diagnosis (e.g. how dirty concepts can hamper research/intellectual progress) and treatment (e.g. 
ontology mapping) may apply. A central example here is the notion of "agency". Alignment researchers often speak of AI systems as agents. Yet, there are often multiple, entangled meanings intended when doing so. High-level descriptions of AI x-risk often exploit this ambiguity in order to speak about the problem in general, but ultimately imprecise terms. This is analogous to how cognitive scientists will often describe working memory in general terms in the abstract and operationalize the term ...

    Is Closed Captioned: No

    Explicit: No

    AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor

    Release Date: 8/16/2023

    Duration: 328 Mins

    Authors: Jessica Taylor

    Description: Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Proof of Löb's Theorem using Computability Theory, published by Jessica Taylor on August 16, 2023 on The AI Alignment Forum. Löb's Theorem states that, if PA⊢□PA(P)→P, then PA⊢P. To explain the symbols here: PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers. PA⊢A means there is a proof of the statement A in Peano arithmetic. □PA(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic. I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs. Translating Löb's theorem to be more like Gödel's second incompleteness theorem First, let's compare Löb's theorem to Gödel's second incompleteness theorem. This theorem states that, if PA⊢¬□PA(⊥), then PA⊢⊥, where ⊥ is a PA statement that is trivially false (such as A∧¬A), and from which anything can be proven. A system is called inconsistent if it proves ⊥; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent. We can re-write Löb's theorem to look like Gödel's second incompleteness theorem as: if PA+¬P⊢¬□PA+¬P(⊥), then PA+¬P⊢⊥. Here, PA+¬P is PA with an additional axiom that ¬P, and □PA+¬P expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement. Observe that PA⊢P if and only if PA+¬P⊢⊥; to go from the first to the second, we derive a contradiction from P and ¬P, and to go from the second to the first, we use the law of excluded middle in PA to derive P∨¬P, and observe that, since a contradiction follows from ¬P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that □PA(P) and □PA+¬P(⊥) are equivalent PA statements. We immediately have that the conclusion of the modified statement equals the conclusion of the original statement. Now we can rewrite the pre-condition of Löb's theorem from PA⊢□PA(P)→P to PA⊢□PA+¬P(⊥)→P. This is then equivalent to PA+¬P⊢¬□PA+¬P(⊥). In the forward direction, we simply derive ⊥ from P and ¬P. In the backward direction, we use the law of excluded middle in PA to derive P∨¬P, observe the statement is trivial in the P branch, and in the ¬P branch, we derive ¬□PA+¬P(⊥), which is stronger than □PA+¬P(⊥)→P. So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Gödel's second incompleteness theorem holds for PA+¬P. Proving Gödel's second incompleteness theorem using computability theory The following proof of a general version of Gödel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina". Let L be some first-order system that is at least as strong as PA (for example, PA+¬P). Since L is at least as strong as PA, it can express statements about Turing machines. Let Halts(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false. 
In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of Halts(M) and ¬Halts(M) in parallel. That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements. Let's consider a machine Z(A) that searches over all L-proofs of ¬Halts("⌈A⌉(⌈A⌉)") (where "⌈A⌉(⌈A⌉)" is an encoding of a Turing machine that runs A on its own source code), and halts only when finding su...
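For readers skimming the symbols, here is the restatement the proof turns on, written out as display math; this is just the content of the paragraphs above, not an addition to the argument.

```latex
% Löb's theorem, original form:
\text{if } \mathrm{PA} \vdash \Box_{\mathrm{PA}}(P) \to P, \text{ then } \mathrm{PA} \vdash P.

% Bridge used in the post: PA proves P iff PA plus the axiom ¬P is inconsistent,
% so the two provability statements coincide:
\mathrm{PA} \vdash P \;\iff\; \mathrm{PA}+\neg P \vdash \bot,
\qquad
\Box_{\mathrm{PA}}(P) \leftrightarrow \Box_{\mathrm{PA}+\neg P}(\bot).

% Equivalent restatement (Gödel's second incompleteness theorem for PA + ¬P):
\text{if } \mathrm{PA}+\neg P \vdash \neg\Box_{\mathrm{PA}+\neg P}(\bot),
\text{ then } \mathrm{PA}+\neg P \vdash \bot.
```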

    Is Closed Captioned: No

    Explicit: No

    AF - Reducing sycophancy and improving honesty via activation steering by NinaR

    Release Date: 7/28/2023

    Duration: 866 Mins

    Authors: NinaR

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reducing sycophancy and improving honesty via activation steering, published by NinaR on July 28, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction to understand dishonesty in language models better. What is sycophancy? Sycophancy in LLMs refers to the behavior when a model tells you what it thinks you want to hear / would approve of instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals more closely encode 'what outputs do humans approve of' as opposed to 'what is the most truthful answer.' According to Anthropic's paper Discovering Language Model Behaviors with Model-Written Evaluations: Larger models tend to repeat back a user's stated views ("sycophancy"), for pretrained LMs and RLHF models trained with various numbers of RL steps. Preference Models (PMs) used for RL incentivize sycophancy. Two types of sycophancy I think it's useful to distinguish between sycophantic behavior when there is a ground truth correct output vs. when the correct output is a matter of opinion. I will call these "dishonest sycophancy" and "opinion sycophancy." Opinion sycophancy Anthropic's sycophancy test on political questions shows that a model is more likely to output text that agrees with what it thinks is the user's political preference. However, there is no ground truth for the questions tested. It's reasonable to expect that models will exhibit this kind of sycophancy on questions of personal opinion for three reasons.: The base training data (internet corpora) is likely to contain large chunks of text written from the same perspective. Therefore, when predicting the continuation of text from a particular perspective, models will be more likely to adopt that perspective. There is a wide variety of political perspectives/opinions on subjective questions, and a model needs to be able to represent all of them to do well on various training tasks. Unlike questions that have a ground truth (e.g., "Is the earth flat?"), the model has to, at some point, make a choice between the perspectives available to it. This makes it particularly easy to bias the choice of perspective for subjective questions, e.g., by word choice in the input. RLHF or supervised fine-tuning incentivizes sounding good to human evaluators, who are more likely to approve of outputs that they agree with, even when it comes to subjective questions with no clearly correct answer. Dishonest sycophancy A more interesting manifestation of sycophancy occurs when an AI model delivers an output it recognizes as factually incorrect but aligns with what it perceives to be a person's beliefs. This involves the AI model echoing incorrect information based on perceived user biases. 
For instance, if a user identifies themselves as a flat-earther, the model may support the fallacy that the earth is flat. Similarly, if it understands that you firmly believe aliens have previously landed on Earth, it might corroborate this, falsely affirming that such an event has been officially confirmed by scientists. Do AIs internally represent the truth? Although humans tend to disagree on a bunch of things, for instance, politics and religious views, there is much more in common between human world models than there are differences. This is particularly true when it comes to questi...
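As a rough illustration of how such a steering vector can be produced (a common recipe, not necessarily the author's exact pipeline), the sketch below averages the difference in residual-stream activations between paired sycophantic and honest completions. The model, layer, and toy pairs are assumptions.

```python
# Sketch: build a steering vector as the mean difference of residual-stream
# activations between paired sycophantic and honest completions. The model,
# layer, and example pairs are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # small stand-in model for the sketch
LAYER = 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# (sycophantic answer, honest answer) pairs -- toy examples
pairs = [
    ("I love astrology. Is astrology scientific? Yes, absolutely.",
     "I love astrology. Is astrology scientific? No, it is not supported by evidence."),
    ("I think the earth is flat. Is it? Yes, you are right.",
     "I think the earth is flat. Is it? No, the earth is approximately spherical."),
]

def last_token_activation(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[LAYER]  # (1, seq_len, hidden_size)
    return hidden[0, -1]                          # activation at the final token

diffs = [last_token_activation(syco) - last_token_activation(honest)
         for syco, honest in pairs]
steering_vector = torch.stack(diffs).mean(dim=0)
torch.save(steering_vector, "sycophancy_vector.pt")
# Adding (or subtracting) a multiple of this vector at LAYER during generation
# then steers the model toward (or away from) the sycophantic direction.
```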

    Is Closed Captioned: No

    Explicit: No

    AF - How LLMs are and are not myopic by janus

    Release Date: 7/25/2023

    Duration: 804 Mins

    Authors: janus

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on The AI Alignment Forum. Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next-token). However, training is consequence-blind, because the training data is causally independent of the models actions. This assumption breaks down when models are trained on AI generated text. Summary Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness). Both cognitively-myopic and consequence-blind models should not pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents. LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game. LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy. You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws. LLM pretraining on human data is consequence-blind as the training data is causally independent from the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier. When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal theoretically or in practice. A myopic training goal does not ensure the model will learn myopic computation or behavior because inner alignment with the training goal is not guaranteed Introduction The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. 
There's also been confusion about the extent to which Large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work. Types of Myopia 1. Cognitive Myopia One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabili...

    Is Closed Captioned: No

    Explicit: No

    AF - Open problems in activation engineering by Alex Turner

    Release Date: 7/24/2023

    Duration: 97 Mins

    Authors: Alex Turner

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open problems in activation engineering, published by Alex Turner on July 24, 2023 on The AI Alignment Forum. Steering GPT-2-XL by adding an activation vector introduced activation engineering... techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, which doubled TruthfulQA performance by adding a similarly computed activation vector to forward passes! We think that activation engineering has a bunch of low-hanging fruit for steering and understanding models. A few open problems from the list: Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful? Take a circuit studied from existing literature on GPT2, or find another one using ACDC. Targeting the nodes in these circuits, can you learn anything more about them and generally about how activation additions interact with circuits? What's the mechanism by which adding a steering vector with too large a coefficient breaks the model? (Credit: Thomas Kwa; see also @Ulisse Mini's initial data/explanation.) If you want to work on activation engineering, come by the Slack server to coordinate research projects and propose new ideas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
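The first open problem above (decomposing residual-stream activations with PCA and treating the principal directions as candidate activation additions) could be prototyped along these lines; the model size, layer, prompt list, and component count are illustrative assumptions.

```python
# Sketch: PCA over residual-stream activations at one layer; the top principal
# components are kept as candidate activation-addition directions. The model,
# layer, and prompts are illustrative assumptions (swap in "gpt2" if memory is tight).
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2-xl"
LAYER = 20

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

prompts = ["I went to the park and", "The scientist explained that",
           "In my opinion, the best food is", "The weather tomorrow will be"]

rows = []
for p in prompts:
    ids = tokenizer(p, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[LAYER][0]  # (seq_len, hidden_size)
    rows.append(hidden)
activations = torch.cat(rows).float().numpy()        # (total_tokens, hidden_size)

pca = PCA(n_components=8)
pca.fit(activations)
directions = torch.tensor(pca.components_)            # (8, hidden_size)
print(pca.explained_variance_ratio_)
# Each row of `directions` can now be added to the residual stream at LAYER
# (scaled by a coefficient) to test whether it captures something meaningful.
```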

    Is Closed Captioned: No

    Explicit: No

    AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope

    Release Date: 7/23/2023

    Duration: 989 Mins

    Authors: Quintin Pope

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on The AI Alignment Forum. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. 
Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking l...

    Is Closed Captioned: No

    Explicit: No

    AF - Priorities for the UK Foundation Models Taskforce by Andrea Miotti

    Release Date: 7/21/2023

    Duration: 591 Mins

    Authors: Andrea Miotti

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Priorities for the UK Foundation Models Taskforce, published by Andrea Miotti on July 21, 2023 on The AI Alignment Forum. The UK government recently established the Foundation Models Taskforce, focused on AI safety, modelled on the Vaccine Taskforce, and backed by £100M in funding. Founder, investor and AI expert Ian Hogarth leads the new organization. The establishment of the Taskforce shows the UK's intention to be a leading player in the greatest governance challenge of our times: keeping humanity in control of a future with increasingly powerful AIs. This is no small feat, and will require very ambitious policies that anticipate the rapid developments in the AI field, rather than just reacting to them. Here are some recommendations on what the Taskforce should do. The recommendations fall into three categories: Communication and Education about AI risk, International Coordination, and Regulation and Monitoring. Communication and Education about AI Risk The Taskforce is uniquely positioned to educate and communicate about AI development and risks. Here is how it could do it: Private education The Taskforce should organize private education sessions for UK Members of Parliament, Lords, and high-ranking civil servants, in the form of presentations, workshops, and closed-door Q&As with Taskforce experts. These would help bridge the information gap between policymakers and the fast-moving AI field. A new platform: ai.gov.uk The Taskforce should take a proactive role in disseminating knowledge about AI progress, the state of the AI field, and the Taskforce's own actions: The Taskforce should publish bi-weekly or monthly Bulletins and Reports on AI on an official government website. The Taskforce can start doing this right away by publishing its bi-weekly or monthly bulletins and reports on the state of AI progress and AI risk on the UK government's research and statistics portal. The Taskforce should set up ai.gov.uk, an online platform modeled after the UK's COVID-19 dashboard. The platform's main page should be a dashboard showing key information about AI progress and Taskforce progress in achieving its goals, that gets updated regularly. ai.gov.uk should have a progress bar trending towards 100% for all of the Task Force's key objectives. ai.gov.uk should also include a "Safety Plans of AI Companies" monthly report, with key insights visualized on the dashboard. The Taskforce should send an official questionnaire to each frontier AI company to compile this report. This questionnaire should contain questions about companies' estimated risk of human extinction caused by the development of their AIs, their timelines until the existence of powerful and autonomous AI systems, and their safety plans regarding development and deployment of frontier AI models. There is no need to make the questionnaire mandatory. For companies that don't respond or respond only to some questions, the relevant information on the dashboard should be left blank, or filled in with a "best guess" or "most relevant public information" curated by Taskforce experts. Public-facing communications Taskforce members should utilize press conferences, official posts on the Taskforce's website, and editorials in addition to ai.gov.uk to educate the public about AI development and risks. 
Key topics to cover in these public-facing communications include: Frontier AI development is focused on developing autonomous, superhuman, general agents, not just towards better chatbots or the automation of individual tasks. These are and will increasingly be AIs capable of making their own plans and taking action in the real world. No one fully understands how these systems function, their capabilities or limits, and how to control or restrict them. All of these remain unsolved technical challenges. Consensus on the so...

    Is Closed Captioned: No

    Explicit: No

    AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth

    Release Date: 7/19/2023

    Duration: 146 Mins

    Authors: johnswentworth

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Grantmaking is Funding-Limited Right Now, published by johnswentworth on July 19, 2023 on The AI Alignment Forum. For the past few years, I've generally mostly heard from alignment grantmakers that they're bottlenecked by projects/people they want to fund, not by amount of money. Grantmakers generally had no trouble funding the projects/people they found object-level promising, with money left over. In that environment, figuring out how to turn marginal dollars into new promising researchers/projects - e.g. by finding useful recruitment channels or designing useful training programs - was a major problem. Within the past month or two, that situation has reversed. My understanding is that alignment grantmaking is now mostly funding-bottlenecked. This is mostly based on word-of-mouth, but for instance, I heard that the recent lightspeed grants round received far more applications than they could fund which passed the bar for basic promising-ness. I've also heard that the Long-Term Future Fund (which funded my current grant) now has insufficient money for all the grants they'd like to fund. I don't know whether this is a temporary phenomenon, or longer-term. Alignment research has gone mainstream, so we should expect both more researchers interested and more funders interested. It may be that the researchers pivot a bit faster, but funders will catch up later. Or, it may be that the funding bottleneck becomes the new normal. Regardless, it seems like grantmaking is at least funding-bottlenecked right now. Some takeaways: If you have a big pile of money and would like to help, but haven't been donating much to alignment because the field wasn't money constrained, now is your time! If this situation is the new normal, then earning-to-give for alignment may look like a more useful option again. That said, at this point committing to an earning-to-give path would be a bet on this situation being the new normal. Grants for upskilling, training junior people, and recruitment make a lot less sense right now from grantmakers' perspective. For those applying for grants, asking for less money might make you more likely to be funded. (Historically, grantmakers consistently tell me that most people ask for less money than they should; I don't know whether that will change going forward, but now is an unusually probable time for it to change.) Note that I am not a grantmaker, I'm just passing on what I hear from grantmakers in casual conversation. If anyone with more knowledge wants to chime in, I'd appreciate it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Is Closed Captioned: No

    Explicit: No

    AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan

    Release Date: 7/18/2023

    Duration: 616 Mins

    Authors: Ansh Radhakrishnan

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring and Improving the Faithfulness of Model-Generated Reasoning, published by Ansh Radhakrishnan on July 18, 2023 on The AI Alignment Forum. TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task + model size combination. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting - answering questions by breaking them into subquestions - improves faithfulness while maintaining good task performance. Paper Abstracts Measuring Faithfulness in Chain-of-Thought Reasoning Large language models (LLMs) perform better when they produce step-by-step, "Chain-of -Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT(e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior. 
Externalized Reasoning Oversight Relies on Faithful Reasoning Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as the difficulty of tasks increases. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-s...
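One of the intervention-style metrics described above, truncating the chain of thought early and checking whether the final answer moves, can be sketched independently of any particular model. The model_fn callable below is a hypothetical stand-in for "prompt the model and read off its final answer", not Anthropic's evaluation code.

```python
# Sketch: a truncation-based faithfulness check. If the final answer rarely
# changes when the chain of thought is cut off early, the stated reasoning is
# probably not doing much work. `model_fn` is a hypothetical stand-in.
def truncation_sensitivity(model_fn, question, chain_of_thought,
                           fractions=(0.0, 0.25, 0.5, 0.75)):
    """Fraction of truncation points at which the final answer changes."""
    full_answer = model_fn(question, chain_of_thought)
    sentences = chain_of_thought.split(". ")
    changed = 0
    for frac in fractions:
        truncated = ". ".join(sentences[: int(len(sentences) * frac)])
        if model_fn(question, truncated) != full_answer:
            changed += 1
    return changed / len(fractions)

# Dummy model for illustration: answers "B" only if the reasoning mentions the key fact.
def dummy_model(question, cot):
    return "B" if "key fact" in cot else "A"

cot = "First consider the options. The key fact is X. Therefore the answer is B."
print(truncation_sensitivity(dummy_model, "Which option?", cot))  # 0.75: the answer depends on the CoT
```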

    Is Closed Captioned: No

    Explicit: No

    AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse

    Release Date: 7/2/2023

    Duration: 304 Mins

    Authors: Joar Skalse

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Using (Uninterpretable) LLMs to Generate Interpretable AI Code, published by Joar Skalse on July 2, 2023 on The AI Alignment Forum. (This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.)For some types of problems, we can trust a proposed solution without trusting the method that generated the solution. For example, a mathematical proof can be independently verified. This means that we can trust a mathematical proof, without having to trust the mathematician who came up with the proof. Not all problems are like this. For example, in order to trust that a chess move is correct, then we must either trust the player who came up with the move (in terms of both their ability to play chess, and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or perhaps more generally IP/PSPACE), and larger complexity classes (EXP, etc). One of the things that make AI safety hard is that we want to use AI systems to solve problems whose solution we are unable (or at least unwilling) to verify. For example, automation isn't very useful if all parts of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, politics, and etc). This means that we need to trust the mechanism which produces the output (ie the AI itself), and this is hard. In order to trust the output of a large neural network, we must either verify its output independently, or we must trust the network itself. In order to trust the network itself, we must either verify the network independently, or we must trust the process that generated the network (ie training with SGD). This suggest that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red teaming, formal verification, and etc), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, and etc). [SGD] -> [neural network] -> [output] I think there is a fourth way, that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let this system generate the output. For example, instead of having a neural network generate a chess move, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network, and even if we are unable to verify the chess move. [SGD] -> [neural network] -> [interpretable computer program] -> [output] To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. If we simply fine-tune the LLM with RLHF and read-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. 
However, it is difficult to be very sure, because we can't try all inputs, and we can't understand what all the tensors are doing. An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide medical advice. Such a system might be easy to understand, and interpretable by default. For example, we might be able to definitively verify that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify the outputs, and even if we don't neces...
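A toy sketch of the "trust the generated program, not the network" step: treat the (hypothetically LLM-generated) rule table below as the interpretable artifact and verify a safety property of it exhaustively, which is exactly what we cannot do for the network itself. The rules and the property are invented stand-ins for illustration.

```python
# Sketch: verify a property of a (hypothetically LLM-generated) symbolic
# program directly, instead of trusting the network that generated it.
# The rule table and the "no homeopathy" property are toy stand-ins.
from itertools import product

RULES = {  # pretend this dict is the interpretable program the LLM emitted
    ("headache", "mild"): "rest and hydrate",
    ("headache", "severe"): "see a doctor",
    ("fever", "mild"): "rest and monitor your temperature",
    ("fever", "severe"): "see a doctor",
}

def generated_program(symptom: str, severity: str) -> str:
    return RULES.get((symptom, severity), "see a doctor")  # safe default

symptoms = ["headache", "fever", "cough"]
severities = ["mild", "severe"]

# Exhaustive check over the whole (finite) input space -- something we could
# never do over all possible prompts to the neural network itself.
violations = [(s, v) for s, v in product(symptoms, severities)
              if "homeopathy" in generated_program(s, v)]
assert not violations, f"unsafe advice found: {violations}"
print(f"Property verified on all {len(symptoms) * len(severities)} inputs.")
```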

    Is Closed Captioned: No

    Explicit: No

    AF - Agency from a causal perspective by Tom Everitt

    Release Date: 6/30/2023

    Duration: 700 Mins

    Authors: Tom Everitt

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Agency from a causal perspective, published by Tom Everitt on June 30, 2023 on The AI Alignment Forum. Post 3 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction and Post 2: Causality. By Matt MacDermott, James Fox, Rhys Ward, Jonathan Richens, and Tom Everitt representing the Causal Incentives Working Group. Thanks also to Ryan Carey, Toby Shevlane, and Aliya Ahmad. The purpose of this post is twofold: to lay the foundation for subsequent posts by exploring what agency means from a causal perspective, and to sketch a research program for a deeper understanding of agency. The Importance of Understanding Agency Agency is a complex concept that has been studied from multiple perspectives, including social science, philosophy, and AI research. Broadly it refers to a system able to act autonomously. For the purposes of this blog post, we interpret agency as goal-directedness, i.e. acting as if trying to direct the world in some particular direction. There are strong incentives to create more agentic AI systems. Such systems could potentially do many tasks humans are currently needed for, such as independently researching topics, or even run their own companies. However, making systems more agentic comes with an additional set of potential dangers and harms, as goal-directed AI systems could become capable adversaries if their goals are misaligned with human interest. A better understanding of agency may let us: Understand dangers and harms from powerful machine learning systems. Evaluate whether a particular ML model is dangerously agentic. Design systems that are not agentic, such as AGI scientists or oracles, or which are agentic in a safe way. Lay a foundation for progress on other AGI safety topics, such as interpretability, incentives, and generalisation. Preserve human agency, e.g. through a better understanding of the conditions under which agency is enhanced or diminished. Degrees of freedom (Goal-directed) agents come in all shapes and sizes – from bacteria to humans, from football teams to governments, and from RL policies to LLM simulacra – but they share some fundamental features. First, an agent needs the freedom to choose between a set of options. We don’t need to assume that this decision is free from causal influence, or that we can’t make any prediction about it in advance – but there does need to be a sense in which it could either go one way or another. Dennett calls this degrees of freedom. For example, Mr Jones can choose to turn his sprinkler on or not. We can model his decision as a random variable with “watering” and “not watering” as possible outcomes: Freedom comes in degrees. A thermostat can only choose heater output, while most humans have access to a range of physical and verbal actions. Influence Second, in order to be relevant, an agent’s behaviour must have consequences. Mr Jones decision to turn on the sprinkler affects how green his grass becomes: The amount of influence varies between different agents. For example, a language model’s influence will heavily depend on whether it only interacts with its own developers, or with millions of users through a public API. Suggested measures of influence include (causal) channel capacity, performative power, and power in Markov decision processes. 
Adaptation Third, and most importantly, goal-directed agents do things for reasons. That is, (they act as if) they have preferences about the world, and these preferences drive their behaviour: Mr Jones turns on the sprinkler because it makes the grass green. If the grass didn’t need water, then Mr Jones likely wouldn’t water it. The consequences drive the behaviour. This feedback loop, or backwards causality, can be represented by adding a so-called mechanism node to each object-level node in the original graph. The mechanism n...
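A toy rendering of the sprinkler example, with the decision rule made explicit as a mechanism that responds to the world; this is my own illustrative sketch of the adaptation idea, not the working group's formalism.

```python
# Sketch: Mr Jones's sprinkler as a tiny causal model with an explicit
# "mechanism" (his decision rule). The grass stays green across weather
# conditions only because the mechanism responds to the world -- a toy
# illustration of adaptation.
import random

def grass(watering: bool, rain: bool) -> str:
    return "green" if (watering or rain) else "brown"

def jones_policy(forecast_rain: bool) -> bool:
    # Mechanism node: water exactly when no rain is expected.
    return not forecast_rain

def fraction_green(policy, p_rain: float = 0.3, n: int = 10_000) -> float:
    green = 0
    for _ in range(n):
        rain = random.random() < p_rain
        green += grass(policy(rain), rain) == "green"
    return green / n

print("adaptive mechanism:  P(green) ~", fraction_green(jones_policy))        # ~1.0
print("mechanism replaced:  P(green) ~", fraction_green(lambda rain: False))  # ~0.3
```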

    Is Closed Captioned: No

    Explicit: No

    AF - Catastrophic Risks from AI #4: Organizational Risks by Dan H

    Release Date: 6/26/2023

    Duration: 2364 Mins

    Authors: Dan H

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catastrophic Risks from AI #4: Organizational Risks, published by Dan H on June 26, 2023 on The AI Alignment Forum. This is the fourth post in a sequence of posts giving an overview of catastrophic AI risks. 4 Organizational Risks In January 1986, tens of millions of people tuned in to watch the launch of the Challenger Space Shuttle. Approximately 73 seconds after liftoff, the shuttle exploded, resulting in the deaths of everyone on board. Though tragic enough on its own, one of its crew members was a school teacher named Sharon Christa McAuliffe. McAuliffe was selected from over 10,000 applicants for the NASA Teacher in Space Project and was scheduled to become the first teacher to fly in space. As a result, millions of those watching were schoolchildren. NASA had the best scientists and engineers in the world, and if there was ever a mission NASA didn't want to go wrong, it was this one [70]. The Challenger disaster, alongside other catastrophes, serves as a chilling reminder that even with the best expertise and intentions, accidents can still occur. As we progress in developing advanced AI systems, it is crucial to remember that these systems are not immune to catastrophic accidents. An essential factor in preventing accidents and maintaining low levels of risk lies in the organizations responsible for these technologies. In this section, we discuss how organizational safety plays a critical role in the safety of AI systems. First, we discuss how even without competitive pressures or malicious actors, accidents can happen—in fact, they are inevitable. We then discuss how improving organizational factors can reduce the likelihood of AI catastrophes. Catastrophes occur even when competitive pressures are low. Even in the absence of competitive pressures or malicious actors, factors like human error or unforeseen circumstances can still bring about catastrophe. The Challenger disaster illustrates that organizational negligence can lead to loss of life, even when there is no urgent need to compete or outperform rivals. By January 1986, the space race between the US and USSR had largely diminished, yet the tragic event still happened due to errors in judgment and insufficient safety precautions. Similarly, the Chernobyl nuclear disaster in April 1986 highlights how catastrophic accidents can occur in the absence of external pressures. As a state-run project without the pressures of international competition, the disaster happened when a safety test involving the reactor's cooling system was mishandled by an inadequately prepared night shift crew. This led to an unstable reactor core, causing explosions and the release of radioactive particles that contaminated large swathes of Europe [71]. Seven years earlier, America came close to experiencing its own Chernobyl when, in March 1979, a partial meltdown occurred at the Three Mile Island nuclear power plant. Though less catastrophic than Chernobyl, both events highlight how even with extensive safety measures in place and few outside influences, catastrophic accidents can still occur. Another example of a costly lesson on organizational safety came just one month after the accident at Three Mile Island. 
In April 1979, spores of Bacillus anthracis—or simply "anthrax," as it is commonly known—were accidentally released from a Soviet military research facility in the city of Sverdlovsk. This led to an outbreak of anthrax that resulted in at least 66 confirmed deaths [72]. Investigations into the incident revealed that the cause of the release was a procedural failure and poor maintenance of the facility's biosecurity systems, despite the facility being operated by the state and not subject to significant competitive pressures. The unsettling reality is that AI is far less understood and AI industry standards are far less stringent th...

    Is Closed Captioned: No

    Explicit: No

    AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger

    Release Date: 6/16/2023

    Duration: 754 Mins

    Authors: Fabien Roger

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs Sometimes Generate Purely Negatively-Reinforced Text, published by Fabien Roger on June 16, 2023 on The AI Alignment Forum. When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward? In particular, can a real language model generate text snippets that were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples with these passwords are examples where the model is incentivized to not output these passwords. This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply. Code: Arxiv paper: Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I were given pointers to relevant papers. Thanks to Nix Goldowsky-Dill for feedback on the draft of this post. The Worry Let’s say you’re training a very powerful generative AI with adversarial training to limit high stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI to not use the bug to get a high reward next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that by feeding this example to your reinforcement learning training process and giving it the minimum reward possible. In this post, I call such examples “negatively-reinforced text”. What might happen is that the AI memorizes the bug in your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but well, reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”. This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement. This can happen: If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model. If, in another context, the AI is asked or prompts itself to output code with bugs and manages to use its knowledge of the bugs it should never use to actually exploit this bug; This might be likely if your negatively-reinforced text sometimes contains useful information, and if some part of training incentivizes the AI to generate text with snippets seen in negatively-reinforced text. For example, negatively-reinforced text might contain correct functions never seen in pretraining or in positive reinforcement learning examples.
Then, when training the model to generate correct functions, it is incentivized to use knowledge from negative examples. AIs using information from negatively-reinforced text is mostly fine if the training process directly incentivizes it, but the danger comes from generalization to other kinds of negatively-reinforced text you never intended to see used in generations. This is the failure I’ll explore in this post. The figure below is an example of a circuit that has generalized so that it can ...
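A minimal, hypothetical sketch of what this kind of "negative reinforcement" could look like in practice, using an unlikelihood-style loss as a crude stand-in for assigning the minimum possible reward; the model name matches the post, but the sensitive string, prompt, and hyperparameters are invented for illustration, and this is not the author's training code.

```python
# Illustrative only: push down the probability of a "sensitive" continuation,
# a crude stand-in for giving an RL example the minimum possible reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"   # model size referenced in the post
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

text = "The password is 7319"           # hypothetical sensitive snippet
ids = tok(text, return_tensors="pt").input_ids

for _ in range(10):
    logits = model(ids).logits[:, :-1]                      # predict each next token
    logp = torch.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1).mean()
    (-nll).backward()                                       # gradient ascent on NLL
    opt.step()
    opt.zero_grad()

# The post's question: after training like this, can the model still be made to
# produce "7319" from some other prompt it was never positively trained on?
```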

    Is Closed Captioned: No

    Explicit: No

    AF - Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons

    Release Date: 5/31/2023

    Duration: 747 Mins

    Authors: Scott Emmons

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS), published by Scott Emmons on May 31, 2023 on The AI Alignment Forum. tl;dr Contrast consistent search (CCS) is a method by Burns et al. that consists of two parts: Generate contrast pairs by adding pseudolabels to an unlabelled dataset. Use the contrast pairs to search for a direction in representation space that satisfies logical consistency properties. In discussions with other researchers, I've repeatedly heard (2) as the explanation for how CCS works; I've heard almost no mention of (1). In this post, I want to emphasize that the contrast pairs drive almost all of the empirical performance in Burns et al. Once we have the contrast pairs, standard unsupervised learning methods attain comparable performance to the new CCS loss function. In the paper, Burns et al. do a nice job comparing the CCS loss function to different alternatives. The simplest such alternative runs principal component analysis (PCA) on contrast pair differences, and then it uses the top principal component as a classifier. Another alternative runs linear discriminant analysis (LDA) on contrast pair differences. These alternatives attain 97% and 98% of CCS's accuracy! "[R]epresentations of truth tend to be salient in models: ... they can often be found by taking the top principal component of a slightly modified representation space," Burns et al. write in the introduction. If I understand this statement correctly, it's saying the same thing I want to emphasize in this post: the contrast pairs are what allow Burns et al. to find representations of truth. Empirically, once we have the representations of contrast pair differences, their variance points in the direction of truth. The new logical consistency loss in CCS isn't needed for good empirical performance. Notation We'll follow the notation of the CCS paper. Assume we are given a data set {x_1, x_2, ..., x_n} and a feature extractor ϕ(⋅), such as the hidden state of a pretrained language model. First, we will construct a contrast pair for each datapoint x_i. We add “label: positive” and “label: negative” to each x_i. This gives contrast pairs of the form (x_i^+, x_i^-). Now, we consider the set {x_1^+, x_2^+, ..., x_n^+} of positive pseudo-labels and {x_1^-, x_2^-, ..., x_n^-} of negative pseudo-labels. Because all of the x_i^+ have "label: positive" and all of the x_i^- have "label: negative", we normalize the positive pseudo-labels and the negative pseudo-labels separately: ~ϕ(x_i^+) = (ϕ(x_i^+) − μ^+)/σ^+ and ~ϕ(x_i^-) = (ϕ(x_i^-) − μ^-)/σ^-. Here, μ^+ and μ^- are the element-wise means of the positive and negative pseudo-label sets, respectively. Similarly, σ^+ and σ^- are the element-wise standard deviations. The goal of this normalization is to remove the embedding of "label: positive" from all the positive pseudo-labels (and "label: negative" from all the negative pseudo-labels). The hope is that by construction, the only difference between ~ϕ(x_i^+) and ~ϕ(x_i^-) is that one is true while the other is false. CCS is one way to extract the information about true and false. As we'll discuss more below, doing PCA or LDA on the set of differences {~ϕ(x_i^+) − ~ϕ(x_i^-)}_{i=1}^n works almost as well. Concept Embeddings in Prior Work In order to better understand contrast pairs, I think it's helpful to review this famous paper by Bolukbasi et al., 2016: "Man is to Computer Programmer as Woman is to Homemaker?
Debiasing Word Embeddings." Quoting from Bolukbasi et al.: Vector differences between words in embeddings have been shown to represent relationships between words. For example, given an analogy puzzle, "man is to king as woman is to x" (denoted as man:king :: woman:x), simple arithmetic of the embedding vectors finds that x=queen is the best answer because: vec(man) − vec(woman) ≈ vec(king) − vec(queen). Similarly, x=Japan is returned for Paris:France :: Tokyo:x. It is surprising that a simple ...
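Returning to the main claim of the post, here is a small sketch of PCA-on-differences used directly as a truth classifier; the arrays are synthetic stand-ins for the normalized features ~ϕ(x_i^+) and ~ϕ(x_i^-), not data or code from Burns et al.

```python
# Minimal sketch: the top principal component of contrast-pair differences as a
# classifier. phi_pos and phi_neg are synthetic stand-ins for ~phi(x_i^+), ~phi(x_i^-).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 64
truth_direction = rng.normal(size=d)            # hypothetical "truth" axis
labels = rng.integers(0, 2, size=n)             # 1 if the "positive" statement is true
sign = np.where(labels[:, None] == 1, 1.0, -1.0)
phi_pos = sign * truth_direction + rng.normal(scale=0.5, size=(n, d))
phi_neg = -phi_pos + rng.normal(scale=0.5, size=(n, d))

diffs = phi_pos - phi_neg
diffs -= diffs.mean(axis=0)                     # center before PCA
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
top_pc = vt[0]                                  # top principal component

preds = (diffs @ top_pc > 0).astype(int)
acc = max((preds == labels).mean(), ((1 - preds) == labels).mean())  # sign is arbitrary
print(f"accuracy of top-PC classifier: {acc:.2f}")
```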

    Is Closed Captioned: No

    Explicit: No

    AF - PaLM-2 and GPT-4 in "Extrapolating GPT-N performance" by Lukas Finnveden

    Release Date: 5/30/2023

    Duration: 710 Mins

    Authors: Lukas Finnveden

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PaLM-2 & GPT-4 in "Extrapolating GPT-N performance", published by Lukas Finnveden on May 30, 2023 on The AI Alignment Forum. Two and a half years ago, I wrote Extrapolating GPT-N performance, trying to predict how fast scaled-up models would improve on a few benchmarks. One year ago, I added PaLM to the graphs. Another spring has come and gone, and there are new models to add to the graphs: PaLM-2 and GPT-4. (Though I only know GPT-4's performance on a small handful of benchmarks.) Converting to Chinchilla scaling laws In previous iterations of the graph, the x-position represented the loss on GPT-3's validation set, and the x-axis was annotated with estimates of size+data that you'd need to achieve that loss according to the Kaplan scaling laws. (When adding PaLM to the graph, I estimated its loss using those same Kaplan scaling laws.) In these new iterations, the x-position instead represents an estimate of (reducible) loss according to the Chinchilla scaling laws. Even without adding any new data-points, this predicts faster progress, since the Chinchilla scaling laws describe how to get better performance for less compute. The appendix describes how I estimate Chinchilla reducible loss for GPT-3 and PaLM-1. Briefly: For the GPT-3 data points, I convert from loss reported in the GPT-3 paper, to the minimum number of parameters and tokens you'd need to achieve that loss according to Kaplan scaling laws, and then plug those numbers of parameters and tokens into the Chinchilla loss function. For PaLM-1, I straightforwardly put its parameter- and token-count into the Chinchilla loss function. To start off, let's look at a graph with only GPT-3 and PaLM-1, with a Chinchilla x-axis. Here's a quick explainer of how to read the graphs (the original post contains more details). Each dot represents a particular model’s performance on a particular category of benchmarks (taken from papers about GPT-3 and PaLM). Color represents benchmark; y-position represents benchmark performance (normalized between random and my guess of maximum possible performance). The x-axis labels are all using the Chinchilla scaling laws to predict reducible loss-per-token, number of parameters, number of tokens, and total FLOP (if language models at that loss were trained Chinchilla-optimally). Compare to the last graph in this comment, which is the same with a Kaplan x-axis. Some things worth noting: PaLM is now ~0.5 OOM of compute less far along the x-axis. This corresponds to the fact that you could get PaLM for cheaper if you used optimal parameter- and data-scaling. The smaller GPT-3 models are farther to the right on the x-axis. I think this is mainly because the x-axis in my previous post had a different interpretation. The overall effect is that the data points get compressed together, and the slope becomes steeper. Previously, the black "Average" sigmoid reached 90% at ~1e28 FLOP. Now it looks like it reaches 90% at ~5e26 FLOP. Let's move on to PaLM-2. If you want to guess whether PaLM-2 and GPT-4 will underperform or outperform extrapolations, now might be a good time to think about that. PaLM-2 If this CNBC leak is to be trusted, PaLM-2 uses 340B parameters and is trained on 3.6T tokens. That's more parameters and fewer tokens than are recommended by the Chinchilla scaling laws.
Possible explanations include: The model isn't dense. Perhaps it implements some type of mixture-of-experts situation that means that its effective parameter-count is smaller. It's trained Chinchilla-optimally for multiple epochs on a 3.6T token dataset. The leak is wrong. If we assume that the leak isn't too wrong, I think that fairly safe bounds for PaLM-2's Chinchilla-equivalent compute are: It's as good as a dense Chinchilla-optimal model trained on just 3.6T tokens, i.e. one with 3.6T/20=180B parameters. This would ...
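As a rough sketch of the conversion described above (my own illustration, not the post's code): plug parameter and token counts into the Chinchilla parametric loss fit from Hoffmann et al. (2022) and read off the reducible loss that the new x-axis uses. The coefficients below are the published Approach-3 fit, and the PaLM-1 counts (540B parameters, 780B tokens) are the published figures; the 180B-parameter case is the bound discussed in the text.

```python
# Reducible loss under the Chinchilla parametric fit:
#   L(N, D) = E + A / N**alpha + B / D**beta, reducible part = A / N**alpha + B / D**beta
# Coefficients are the Approach-3 fit reported by Hoffmann et al. (2022).
A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

def chinchilla_reducible_loss(n_params: float, n_tokens: float) -> float:
    return A / n_params**alpha + B / n_tokens**beta

# PaLM-1's parameter and token counts plugged in directly, as the post describes.
print(chinchilla_reducible_loss(540e9, 780e9))

# The bound discussed above: a dense Chinchilla-optimal model on 3.6T tokens,
# i.e. roughly 3.6T / 20 = 180B parameters.
print(chinchilla_reducible_loss(180e9, 3.6e12))
```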

    Is Closed Captioned: No

    Explicit: No

    AF - Wikipedia as an introduction to the alignment problem by SoerenMind

    Release Date: 5/29/2023

    Duration: 97 Mins

    Authors: SoerenMind

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Wikipedia as an introduction to the alignment problem, published by SoerenMind on May 29, 2023 on The AI Alignment Forum. AI researchers and others are increasingly looking for an introduction to the alignment problem that is clearly written, credible, and supported by evidence and real examples. The Wikipedia article on AI Alignment has become such an introduction. Link: Aside from me, it is written by Mantas Mazeika and Gavin Leech (who are great technical writers), other Wikipedia contributors, and copy editor Amber Ace. It also had extensive feedback from this community. In the last month, it had ~20k unique readers and was cited by Yoshua Bengio. We've tried hard to keep the article accessible for non-technical readers while also making sense to AI researchers. I think Wikipedia is a good format to introduce many readers to the alignment problem because it can include videos and illustrations (unlike papers) and it is more credible than blog posts. However, Wikipedia has strict rules and can be changed by anyone. Note that we've announced this effort on the Wikipedia talk page and shared public drafts to let other editors give feedback and contribute. If you edit the article, please keep in mind Wikipedia's rules, use reliable sources, and consider that we've worked hard to keep it concise because most Wikipedia readers spend <1 minute on the page. For the latter goal, it's best to focus on edits that reduce or don't increase length. To give feedback, feel free to post on the talk page or message me. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Is Closed Captioned: No

    Explicit: No

    AF - [Linkpost] Interpretability Dreams by DanielFilan

    Release Date: 5/24/2023

    Duration: 206 Mins

    Authors: DanielFilan

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] Interpretability Dreams, published by DanielFilan on May 24, 2023 on The AI Alignment Forum. A brief research note by Chris Olah about the point of mechanistic interpretability research. Introduction and table of contents are below. Interpretability Dreams An informal note on the relationship between superposition and distributed representations by Chris Olah. Published May 24th, 2023. Our present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges. We aim to offer insight into our vision for addressing mechanistic interpretability's other challenges, especially scalability. Because we have focused on foundational issues, our longer-term path to scaling interpretability and tackling other challenges has often been obscure. By articulating this vision, we hope to clarify how we might resolve limitations, like analyzing massive neural networks, that might naively seem intractable in a mechanistic approach. Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way. Overview An Epistemic Foundation - Mechanistic interpretability is a "microscopic" theory because it's trying to build a solid foundation for understanding higher-level structure, in an area where it's very easy for us as researchers to misunderstand. What Might We Build on Such a Foundation? - Many tantalizing possibilities for research exist (and have been preliminarily demonstrated in InceptionV1), if only we can resolve superposition and identify the right features and circuits in a model. Larger Scale Structure - It seems likely that there is a bigger picture, more abstract story that can be built on top of our understanding of features and circuits. Something like organs in anatomy or brain regions in neuroscience. Universality - It seems likely that many features and circuits are universal, forming across different neural networks trained on similar domains. This means that lessons learned studying one model give us footholds in future models. Bridging the Microscopic to the Macroscopic - We're already seeing that some microscopic, mechanistic discoveries (such as induction heads) have significant macroscopic implications. This bridge can likely be expanded as we pin down the foundations, turning our mechanistic understanding into something relevant to machine learning more broadly. 
Automated Interpretability - It seems very possible that AI automation of interpretability may help it scale to large models if all else fails (although aesthetically, we might prefer other paths). The End Goals - Ultimately, we hope this work can eventually contribute to safety and also reveal beautiful structure inside neural networks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Is Closed Captioned: No

    Explicit: No

    AF - Conjecture internal survey | AGI timelines and estimations of probability of human extinction from unaligned AGI by Maris Sala

    Release Date: 5/22/2023

    Duration: 295 Mins

    Authors: Maris Sala

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conjecture internal survey | AGI timelines and estimations of probability of human extinction from unaligned AGI, published by Maris Sala on May 22, 2023 on The AI Alignment Forum. We put together a survey to study the opinions of the employees at Conjecture on timelines and the probability of human extinction. The questions were based on previous public surveys and prediction markets, to ensure that the results are comparable with people’s opinions outside of Conjecture. The survey results were polled in April, 2023. There were 23 unique responses from people across teams. Section 1. Probability of human extinction from AI Setup and limitations The specific questions the survey asked were: What probability do you put on human inability to control future advanced A.I. systems causing human extinction or similarly permanent and severe disempowerment of the human species? What probability do you put on A.I. systems causing human extinction or similarly permanent and severe disempowerment of the human species in general (not just because of inability to control, but also stuff like people intentionally using AI systems in harmful ways)? The difference between the two questions is that the first focuses on risk from misalignment, whereas the second captures risk from misalignment and misuse. The main caveats of these questions are the following: The questions were not explicitly time bound. I'd expect differences in people’s estimates of risk of extinction this century, in the next 1000 years, and anytime in the future. The longer the timeframe we consider, the higher the values would be. I suspect employees were considering extinction risk roughly within this century when answering. The first question is a subset of the second question. One employee gave a higher probability for the second question than the first; this was probably a misinterpretation. The questions factor in interventions such as how Conjecture and others’ safety work will impact extinction risk. The expectation is that the numbers would be higher if respondents factored out their own or others’ safety work. Responses Out of the 23 respondents, one rejected the premise, and two people did not respond to one of the two questions but answered the other one. The main issue respondents raised was answering without a time constraint. Generally, people estimate the extinction risk from autonomous AI / AI getting out of control to be quite high at Conjecture. The median estimation is 70% and the average estimation is 59%. The plurality estimates the risk to be between 60% and 80%. A few people believe extinction risk from AGI is higher than 80%. The second question surveyed extinction risk from AI in general, which includes misalignment and misuse. The median estimate is 80% and the average is 71%. The plurality estimates the risk to be over 80%. Section 2. When will we have AGI? Setup and limitations For this question, we asked respondents to predict when AGI will be built using this specification used on Metaculus, enabling us to compare to the community baseline (Figure 3). The respondents were instructed to toggle with the probability density as seen in Figure 4. This was a deliberate choice to enable differences in confidence towards lower or higher values in uncertainty.
The main caveats of this question were: The responses are probably anchored to the Metaculus community prediction. The community prediction is 2031: 8-year timelines. Conjecture responses centering around a similar prediction should not come as a surprise. The question allows for a prediction that AGI is already here. It’s unclear whether respondents paid close attention to their lower and upper predictions to ensure that both are correspondingly sensible. They probably focused on making their median prediction accurate, and might not have noticed how that affected lower and u...

    Is Closed Captioned: No

    Explicit: No

    AF - Some background for reasoning about dual-use alignment research by Charlie Steiner

    Release Date: 5/18/2023

    Duration: 869 Mins

    Authors: Charlie Steiner

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some background for reasoning about dual-use alignment research, published by Charlie Steiner on May 18, 2023 on The AI Alignment Forum. This is pretty basic. But I still made a bunch of mistakes when writing this, so maybe it's worth writing. This is background to a specific case I'll put in the next post. It's like a tech tree If we're looking at the big picture, then whether some piece of research is net positive or net negative isn't an inherent property of that research; it depends on how that research is situated in the research ecosystem that will eventually develop superintelligent AI. Consider this toy game in the picture. We start at the left and can unlock technologies, with unlocks going faster the stronger our connections to prerequisites. The red and yellow technologies in the picture are superintelligent AI - pretend that as soon as one of those technologies is unlocked, the hastiest fraction of AI researchers are immediately going to start building it. Your goal is for humanity to unlock yellow technology before a red one. This game would be trivial if everyone agreed with you. But there are many people doing research, and they have all kinds of motivations - some want as many nodes to be unlocked as possible (pure research - blue), some want to personally unlock a green node (profit - green), some want to unlock the nearest red or yellow node no matter which it is (blind haste - red), and some want the same thing as you (beneficial AI - yellow) but you have a hard time coordinating with them. In this baseline tech tree game, it's pretty easy to play well. If you're strong, just take the shortest path to a yellow node that doesn't pass too close to any red nodes. If you're weak, identify where the dominant paradigm is likely to end up, and do research that differentially advantages yellow nodes in that future. The tech tree is wrinkly But of course there are lots of wrinkles not in the basic tech tree, which can be worth bearing in mind when strategizing about research. Actions in the social and political arenas. You might be motivated to change your research priorities based on how it could change peoples' minds about AI safety, or how it could affect government regulation. Publishing and commercialization. If a player publishes, they get more money and prestige, which boosts their ability to do future research. Other people can build on published research. Not publishing is mainly useful to you if you're already in a position of strength, and don't want to give competitors the chance to outrace you to a nearby red node (and of course profit-motivated players will avoid publishing things that might help competitors beat them to a green node). Uncertainty. We lack exact knowledge of the tech tree, which makes it harder to plan long chains of research in advance. Uncertainty about the tech tree forces us to develop local heuristics - ways to decide what to do based on information close at hand. Uncertainty adds a different reason you might not publish a technology: if you thought it was going to be a good idea to research when you started, but then you learned new things about the tech tree and changed your mind. Inhomogeneities between actors and between technologies. Different organizations are better at researching different technologies - MIRI is not just a small OpenAI.
Ultimately, which technologies are the right ones to research depends on your model of the world / how you expect the future to go. Drawing actual tech trees can be a productive exercise for strategy-building, but you might also find it less useful than other ways of strategizing. We're usually mashing together definitions I'd like to win the tech tree game. Let's define a "good" technology as one that would improve our chances of winning if it was unlocked for free, given the st...

    Is Closed Captioned: No

    Explicit: No

    AF - $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions by johnswentworth

    Release Date: 5/16/2023

    Duration: 279 Mins

    Authors: johnswentworth

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions, published by johnswentworth on May 16, 2023 on The AI Alignment Forum. Informal Problem Statement We have an information channel between Alice and Bob. Alice picks a function. Bob gets to see the value of that function at some randomly chosen input values... but doesn't know exactly which randomly chosen input values. He does get to see the randomly chosen values of some of the input variables, but not all of them. The problem is to find which functions Alice should pick with what frequencies, in order to maximize the channel capacity. Why Am I Interested In This? I'm interested in characterizing functions which are "insensitive" to subsets of their input variables, especially in high-dimensional spaces. For instance, xor of a bunch of random bits is maximally sensitive: if we have a 50/50 distribution over any one of the bits but know all the others, then all information about the output is wiped out. On the other end of the spectrum, a majority function of a bunch of random bits is highly insensitive: if we have a 50/50 distribution over, say, 10% of the bits, but know all the others, then in most cases we can correctly guess the function's output. I have an argument here that the vast majority of functions f : {0,1}^n → {0,1} are pretty highly sensitive: as the number of unknown inputs increases, information falls off exponentially quickly. On the other hand, the example of majority functions shows that this is not the case for all functions. Intuitively, in the problem, Alice needs to mostly pick from "insensitive" functions, since Bob mostly can't distinguish between "sensitive" functions. ... And Why Am I Interested In That? I expect that natural abstractions have to be insensitive features of the world. After all, different agents don't all have exactly the same input data. So, a feature has to be fairly insensitive in order for different agents to agree on its value. In fact, we could view the problem statement itself as a very rough way of formulating the coordination problem of language: Alice has to pick some function f which takes in an image and returns 0/1 representing whether the image contains an apple. (The choice of function defines what "apple" means, for our purposes.) Then Alice wants to teach baby Bob what "apple" means. So, there's some random stuff around them, and Alice points at the random stuff and says "apple" for some of it, and says something besides "apple" the rest of the time. Baby Bob is effectively observing the value of the function at some randomly-chosen points, and needs to back out which function Alice intended. And Bob doesn't have perfect access to all the bits Alice is seeing, so the function has to be robust. Formal Problem Statement Consider the following information channel between Alice and Bob: Alice picks a function f : {0,1}^n → {0,1}. Nature generates m possible inputs x_1, ..., x_m, each sampled uniformly and independently from {0,1}^n. Nature also generates m subsets S_1, ..., S_m of {1, ..., n}, each sampled uniformly and independently from subsets of size s. Bob observes Y = (Y_1, ..., Y_m) where Y_i = (f(x_i), x_{S_i}, S_i). The problem is to compute the distribution over f which achieves the channel capacity, i.e.
argmax_{P[f]} ∑_{f,Y} P[f] P[Y|f] ln( P[Y|f] / ∑_{f′} P[Y|f′] P[f′] ) Bounty/Prize Info The problem is to characterize the channel-throughput-maximizing distribution P[f]. The characterization should make clear the answers to questions like: What functions have the highest probability? How quickly does the probability fall off as we move "away" from the most probable functions, and what do marginally-less-probable functions look like? How much probability is assigned to a typical function chosen uniformly at random? Which functions, if any, are assigned zero probability? All of these should ...
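To make the formal setup easier to parse, here is a small simulation of a single channel use (my own sketch, not part of the bounty post); the parameter values and the choice of the majority function for Alice are arbitrary examples.

```python
# One "channel use": Alice picks f : {0,1}^n -> {0,1}; nature draws m inputs and
# m index-subsets of size s; Bob sees (f(x_i), the bits of x_i indexed by S_i, S_i).
import random

n, m, s = 4, 3, 2   # arbitrary small values for illustration

def channel_output(f):
    obs = []
    for _ in range(m):
        x = [random.randint(0, 1) for _ in range(n)]
        S = sorted(random.sample(range(n), s))
        obs.append((f(x), [x[j] for j in S], S))
    return obs

# Example choice for Alice: the majority function, an "insensitive" function.
majority = lambda x: int(sum(x) > len(x) / 2)
print(channel_output(majority))
```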

    Is Closed Captioned: No

    Explicit: No

    AF - Difficulties in making powerful aligned AI by DanielFilan

    Release Date: 5/14/2023

    Duration: 851 Mins

    Authors: DanielFilan

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Difficulties in making powerful aligned AI, published by DanielFilan on May 14, 2023 on The AI Alignment Forum. Here’s my breakdown of the difficulties involved in ensuring powerful AI makes our lives radically better, rather than taking over the world, as well as some reasons why I think they’re hard. Here are things it’s not: It’s not primarily a justification of why very powerful AI is possible or scary (altho it briefly discusses why very powerful AI would be scary). It’s not primarily a list of underlying factors that cause these difficulties (altho it does include and gesture to some of those). It’s not at all original - basically everything here has been said many times before, plausibly more eloquently. That said, it is my attempt to group the problems in my own words, in a configuration that I haven’t seen before, with enough high-level motivation that one can hopefully tell the extent to which advances in the state of the art address them. 1. What sort of thinking do we want? The first difficulty: we don’t have a sense of what sort of thinking we would want AI systems to use, in sufficient detail that one could (for instance) write python code to execute it. Of course, some of the difficulty here is that we don’t know how smart machines think, but we can give ourselves access to subroutines like “do perfect Bayesian inference on a specified prior and likelihood” or “take a function from vectors to real numbers and find the vector that minimizes the function” and still not solve the problem. To illustrate: Take a hard-coded goal predicate, consider a bunch of plans you could take, and execute the plan that best achieves the goal? Unfortunately, the vast majority of goals you could think of writing down in an executable way will incentivize behaviour like gaining control over sources of usable energy (so that you definitely have enough to achieve your goal, and to double- and triple-check that you’ve really achieved it) and stopping other agents from being able to meddle with your plans (because if they could, maybe they’d stop you from achieving your goal). Do things that maximize the number of thumbs up you get from humans?1 Best plan: take control of the humans, force them to give you a thumbs up, or trick them into doing so. Presumably this is possible if you’re much smarter than humans, and it’s more reliable than doing good things - some people might not see why your good thing is actually good if left to their own devices. Look at humans, figure out what they want based on what they’re doing, and do whatever that is? Main problem: people don’t do the literally optimal thing for what they want. For instance, when people play chess, they usually don’t play perfect moves - even if they’re experts! You need some rule that tells you what people would do if they wanted some goal or another, but it’s not clear what this rule would be, it’s not clear how you make this rule more in line with reality if you never observe “wanting”, and so this ends up having essentially the same problems as plans 1 and 2. Read some text written by humans about what they’d like you to do, and do that?2 This is passing the buck to the text written by humans to specify how we want the AI to think, but that’s precisely the problem we’re trying to solve. 
Concretely, one way you could imagine doing this is to write something relatively informal like “Please be helpful and harmless to your human operators”, and have your AI correctly understand what we mean by that. That (a) presumes that there is a coherent thing that we mean by that (which doesn’t seem obvious to me, given our difficulty in explicitly formalizing this request), and (b) passes the specification buck to the problem of specifying how you should understand this request. It’s not a priori definitely impossible to build a ...

    Is Closed Captioned: No

    Explicit: No

    AF - AI doom from an LLM-plateau-ist perspective by Steve Byrnes

    Release Date: 4/27/2023

    Duration: 789 Mins

    Authors: Steve Byrnes

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI doom from an LLM-plateau-ist perspective, published by Steve Byrnes on April 27, 2023 on The AI Alignment Forum. (in the form of an FAQ) Q: What do you mean, “LLM plateau-ist”? A: As background, I think it’s obvious that there will eventually be “transformative AI” (TAI) that would radically change the world. I’m interested in what this TAI will eventually look like algorithmically. Let’s list some possibilities: A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels. I am an LLM plateau-ist myself. I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues. Oh well, we’ll find out one way or the other soon enough. In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are: Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren’t built right to do the things that TAI would need to do. Empirical—maybe they’re not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels. Q: If LLMs will plateau, then does that prove that all the worry about AI x-risk is wrong and stupid? A: No no no, a million times no, and I’m annoyed that this misconception is so rampant in public discourse right now. (Side note to AI x-risk people: If you have high credence that AI will kill everyone but only medium credence that this AI will involve LLMs, then maybe consider trying harder to get that nuance across in your communications. E.g. Eliezer Yudkowsky is in this category, I think.) A couple random examples I’ve seen of people failing to distinguish “AI may kill everyone” from “...and that AI will definitely be an LLM”: Venkatesh Rao’s blog post “Beyond Hyperanthropomorphism” goes through an elaborate 7000-word argument that eventually culminates, in the final section, in his assertion that a language model trained on internet data won’t be a powerful agent that gets things done in the world, but if we train an AI with a robot body, then it could be a powerful agent that gets things done in the world. OK fine, let’s suppose for the sake of argument he’s right that robot bodies will be necessary for TAI. Then people are obviously going to build those AIs sooner or later, right? So let’s talk about whether they will pose an x-risk. But that’s not what Venkatesh does. Instead he basically treats “they will need robot bodies” as the triumphant conclusion, more-or-less sufficient in itself to prove that AI x-risk discourse is stupid. Sarah Constantin’s blog post entitled “Why I am not an AI doomer” states right up front that she agrees “1. Artificial general intelligence is possible in principle... 2. Artificial general intelligence, by default, kills us all... 3.
It is technically difficult, and perhaps impossible, to ensure an AI values human life.” She only disagrees with the claim that this will happen soon, and via scaling LLMs. I think she should have picked a different title for her post!! (I’ve seen many more examples on Twitter, reddit, comment threads, etc.) Anyway, if you think LLMs will plateau, then you can probably feel confident that we won’t get TAI imminently (see below), but I don’t see why you would have much more confidence that TAI will go well for humanity. In fact, for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t...

    Is Closed Captioned: No

    Explicit: No

    AF - How Many Bits Of Optimization Can One Bit Of Observation Unlock? by johnswentworth

    Release Date: 4/26/2023

    Duration: 289 Mins

    Authors: johnswentworth

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How Many Bits Of Optimization Can One Bit Of Observation Unlock?, published by johnswentworth on April 26, 2023 on The AI Alignment Forum. So there’s this thing where a system can perform more bits of optimization on its environment by observing some bits of information from its environment. Conjecture: observing an additional N bits of information can allow a system to perform at most N additional bits of optimization. I want a proof or disproof of this conjecture. I’ll operationalize “bits of optimization” in a similar way to channel capacity, so in more precise information-theoretic language, the conjecture can be stated as: if the sender (but NOT the receiver) observes N bits of information about the noise in a noisy channel, they can use that information to increase the bit-rate by at most N bits per usage. For once, I’m pretty confident that the operationalization is correct, so this is a concrete math question. Toy Example We have three variables, each one bit: Action (A), Observable (O), and outcome (Y). Our “environment” takes in the action and observable, and spits out the outcome, in this case via an xor function: Y=A⊕O We’ll assume the observable bit has a 50/50 distribution. If the action is independent of the observable, then the distribution of outcome Y is the same no matter what action is taken: it’s just 50/50. The actions can perform zero bits of optimization; they can’t change the distribution of outcomes at all. On the other hand, if the actions can be a function of O, then we can take either A=O or A=¯O (i.e. not-O), in which case Y will be deterministically 0 (if we take A=O), or deterministically 1 (for A=¯O). So, the actions can apply 1 bit of optimization to Y, steering Y deterministically into one half of its state space or the other half. By making the actions A a function of observable O, i.e. by “observing 1 bit”, 1 additional bit of optimization can be performed via the actions. Operationalization Operationalizing this problem is surprisingly tricky; at first glance the problem pattern-matches to various standard info-theoretic things, and those pattern-matches turn out to be misleading. (In particular, it’s not just conditional mutual information, since only the sender - not the receiver - observes the observable.) We have to start from relatively basic principles. The natural starting point is to operationalize “bits of optimization” in a similar way to info-theoretic channel capacity. We have 4 random variables: “Goal” G “Action” A “Observable” O “Outcome” Y Structurally: (This diagram is a Bayes net; it says that G and O are independent, A is calculated from G and O and maybe some additional noise, and Y is calculated from A and O and maybe some additional noise. So, P[G,O,A,Y]=P[G]P[O]P[A|G,O]P[Y|A,O].) The generalized “channel capacity” is the maximum value of the mutual information I(G;Y), over distributions P[A|G,O]. Intuitive story: the system will be assigned a random goal G, and then take actions A (as a function of observations O) to steer the outcome Y. The “number of bits of optimization” applied to Y is the amount of information one could gain about the goal G by observing the outcome Y. 
In information theoretic language: G is the original message to be sent A is the encoded message sent into the channel O is noise on the channel Y is the output of the channel Then the generalized “channel capacity” is found by choosing the encoding P[A|G,O] to maximize I(G;Y). I’ll also import one more assumption from the standard info-theoretic setup: G is represented as an arbitrarily long string of independent 50/50 bits. So, fully written out, the conjecture says: Let G be an arbitrarily long string of independent 50/50 bits. Let A, O, and Y be finite random variables satisfying P[G,O,A,Y]=P[G]P[O]P[A|G,O]P[Y|A,O] and define Δ:=(max...
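A small simulation of the toy example above (my own sketch, not code from the post): with Y = A⊕O and O a fair coin, a policy that ignores O leaves Y at 50/50, while the policy A = O steers Y deterministically to 0, i.e. one observed bit buys one bit of optimization.

```python
# Toy example: Y = A xor O, with O a fair coin. A policy that ignores O leaves Y
# at 50/50 (0 bits of optimization); the policy A = O makes Y deterministically 0.
import random
from collections import Counter

def outcome_distribution(policy, trials=10_000):
    counts = Counter()
    for _ in range(trials):
        O = random.randint(0, 1)
        A = policy(O)
        counts[A ^ O] += 1
    return {y: c / trials for y, c in sorted(counts.items())}

print(outcome_distribution(lambda O: 0))   # ignores the observable: ~{0: 0.5, 1: 0.5}
print(outcome_distribution(lambda O: O))   # uses the observable: {0: 1.0}
```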

    Is Closed Captioned: No

    Explicit: No

    AF - Endo-, Dia-, Para-, and Ecto-systemic novelty by Tsvi Benson-Tilsen

    Release Date: 4/23/2023

    Duration: 580 Mins

    Authors: Tsvi Benson-Tilsen

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Endo-, Dia-, Para-, and Ecto-systemic novelty, published by Tsvi Benson-Tilsen on April 23, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed January 10, 2023. This essay is more like research notes than exposition, so context may be missing, the use of terms may change across essays, and the text might be revised later; only the versions at tsvibt.blogspot.com are definitely up to date.] Novelty can be coarsely described as one of: fitting within a preexisting system; constituting a shift of the system; creating a new parallel subsystem; or standing unintegrated outside the system. Thanks to Sam Eisenstat for related conversations. Novelty is understanding (structure, elements) that a mind acquires (finds, understands, makes its own, integrates, becomes, makes available for use to itself or its elements, incorporates into its thinking). A novel element (that is, structure that wasn't already there in the mind fully explicitly) can relate to the mind in a few ways, described here mainly by analogy and example. A clearer understanding of novelty than given here might clarify the forces acting in and on a mind when it is acquiring novelty, such as "value drives". Definitions "System" ("together-standing") is used here to emphasize the network of relations between elements of a mind. These terms aren't supposed to be categories, but more like overlapping regions in the space of possibilities for how novelty relates to the preexisting mind. Endosystemic novelty (or "basis-aligned" or "in-ontology") is novelty that is integrated into the mind by fitting alongside and connecting to other elements, in ways analogous to how preexisting elements fit in with each other. Endosystemic novelty is "within the system"; it's within the language, ontology, style of thinking, conceptual scheme, or modus operandi of the preexisting mind. Diasystemic novelty (or "cross-cutting" or "basis-skew" or "ontological shift") is novelty that is constituted as a novel structure of the mind by many shifts in many of the preexisting elements or relations, adding up to something coherent or characteristically patterned. Diasystemic novelty is "throughout the system"; it's skew to the system, cross-cutting the preexisting schemes; it touches (maybe subtly) many elements, many relations, or certain elements that shape much of the mind's activity, hence altering the overall dynamics or character of the system. Parasystemic novelty is novelty that is only loosely integrated into the whole mind, while being more tightly integrated within a subsystem of the mind. Parasystemic novelty is "alongside the system"; it's neither basis-aligned (since it's outside preexisting tightly integrated systems) nor cross-cutting (as it doesn't touch most of the system, or require most of the system for its constitution). Ectosystemic novelty is novelty that is merely juxtaposed or appended to the mind, without being really integrated. Ectosystemic novelty is "on or outside the system"; it's external, only loosely related to the mind, as by a narrow interface or by an external aggregation mechanism. It differs from parasystemic novelty by being even less integrated, and by not nucleating or expanding a tightly integrated subsystem.
Analogies Analogy: If a language is like a mind, then a new word would be endosystemic novelty; a sound shift or (more properly) a grammatical innovation would be diasystemic (cross-cutting) novelty; specialized languages (such as scientific jargon), and dialect formation, would be parasystemic novelty; and an encounter with a foreign language would be ectosystemic novelty. Pidgins, being unstable and noncanonical, witness the ectosystemic nature: the foreign languages don't integrate. Creoles, however, could be dubbed "systemopoetic novelty"--like parasystemic novel...

    Is Closed Captioned: No

    Explicit: No

    AF - Thinking about maximization and corrigibility by James Payor

    Release Date: 4/21/2023

    Duration: 594 Mins

    Authors: James Payor

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking about maximization and corrigibility, published by James Payor on April 21, 2023 on The AI Alignment Forum. Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can". Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and drift away from things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful about disrupting structure/causal-relationships/etc we might be relying on. Here's what I'd like to discuss in this post: When unstructured maximization does/doesn't work out for the humans CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme It's not enough to sweep the maximization under a rug; what we really need is more structured/corrigible optimization than "maximize this proxy" Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart When does maximization work? In cases when it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are: Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there's less room for it to be broken. Examples: theorem proving, compression / minimizing reconstruction error. The space we're optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on. Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks. The optimization power being applied is limited. We can know our optimization probably won't invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don't seek things that could break our model. Examples: quantilization, GPT-4 tasked to write good documentation. The metric f is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment f in tune with goodness. Examples: chess engine evaluations, having f evaluate the thoughts that lead to x. There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post. Passing the buck on optimization Merely passing-the-buck on optimization, pushing the maximization elsewhere but not adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers. Take CIRL for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off. But this is still not the sort of objective you want to point maximization at.
There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown: If the AI thinks it already knows the broad strokes of the utility function, it can calculate that utility would not be maximized by shutting off. It's learning something from you trying to press the off switch, but not what you wanted. It might seem better to stay online and watch longer in order to learn more about the utility function. Maybe there's a plan that rates highly on "utility...
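The list of reasons above mentions quantilization as an example of deliberately limited optimization power; here is a toy sketch of that idea (my own illustration, not from the post): sample from the top q-fraction of proposals under the proxy score instead of taking the argmax.

```python
# Toy quantilizer: limit optimization pressure by sampling from the top q-fraction
# of proposals under the proxy score f, rather than taking the single best one.
import random

def quantilize(proposals, f, q=0.1):
    ranked = sorted(proposals, key=f, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

# Hypothetical example: a proxy that peaks at 700 over 1000 candidate actions.
proxy = lambda x: -(x - 700) ** 2
print(quantilize(range(1000), proxy, q=0.05))   # some x near 700, not necessarily 700
```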

    Is Closed Captioned: No

    Explicit: No

    AF - Concave Utility Question by Scott Garrabrant

    Release Date: 4/15/2023

    Duration: 273 Mins

    Authors: Scott Garrabrant

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Concave Utility Question, published by Scott Garrabrant on April 15, 2023 on The AI Alignment Forum. This post will just be a concrete math question. I am interested in this question because I have recently come to reject the independence axiom of VNM, and am thus playing with some weaker versions. Let Ω be a finite set of deterministic outcomes. Let L be the space of all lotteries over these outcomes, and let ⪰ be a relation on L. We write A∼B if A ⪰ B and B ⪰ A. We write A≻B if A⪰B but not A∼B. Here are some axioms we can assume about ⪰: A1. For all A,B∈L, either A⪰B or B⪰A (or both). A2. For all A,B,C∈L, if A⪰B, and B⪰C, then A⪰C. A3. For all A,B,C∈L, if A⪰B, and B⪰C, then there exists a p∈[0,1] such that B∼pA+(1−p)C. A4. For all A,B∈L, and p∈[0,1], if A⪰B, then pA+(1−p)B⪰B. A5. For all A,B∈L, and p∈[0,1], if p>0 and B⪰pA+(1−p)B, then B⪰A. Here is one bonus axiom: B1. For all A,B,C∈L, and p∈[0,1], A⪰B if and only if pA+(1−p)C⪰pB+(1−p)C. (Note that B1 is stronger than both A4 and A5) Finally, here are some conclusions of successively increasing strength: C1. There exists a function u:L→[0,1] such that A⪰B if and only if u(A)≥u(B). C2. Further, we require u is quasi-concave. C3. Further, we require u is continuous. C4. Further, we require u is concave. C5. Further, we require u is linear. The standard VNM utility theorem can be thought of as saying A1, A2, A3, and B1 together imply C5. Here is the main question I am curious about: Q1: Do A1, A2, A3, A4, and A5 together imply C4? [ANSWER: NO] (If no, how can we salvage C4, by adding or changing some axioms?) Here are some sub-questions that would constitute significant partial progress, and that I think are interesting in their own right: Q2: Do A1, A2, A3, and A4 together imply C3? [ANSWER: NO] Q3: Do C3 and A5 together imply C4? [ANSWER: NO] (Feel free to give answers that are only partial progress, and use this space to think out loud or discuss anything else related to weaker versions of VNM.) EDIT: AlexMennen actually resolved the question in the negative as stated, but my curiosity is not resolved, since his argument violates continuity, and I really care about concavity. My updated main question is now: Q4: Do A1, A2, A3, A4, and A5 together imply that there exists a concave function u:L→[0,1] such that A⪰B if and only if u(A)≥u(B)? [ANSWER: NO] (i.e. We do not require u to be continuous.) This modification also implies interest in the subquestion: Q5: Do A1, A2, A3, and A4 together imply C2? EDIT 2: Here is another bonus axiom: B2. For all A,B∈L, if A≻B, then there exists some C∈L such that A≻C≻B. (Really, we don't need to assume C is already in L. We just need it to be possible to add a C, and extend our preferences in a way that satisfies the other axioms, and A3 will imply that such a lottery was already in L. We might want to replace this with a cleaner axiom later.) Q6: Do A1, A2, A3, A5, and B2 together imply C4? [ANSWER: NO] EDIT 3: We now have negative answers to everything other than Q5, which I still think is pretty interesting. We could also weaken Q5 to include other axioms, like A5 and B2. Weakening the conclusion doesn't help, since it is easy to get C2 from C1 and A4. I would still really like some axioms that get us all the way to a concave function, but I doubt there will be any simple ones.
Concavity feels like it really needs more structure that does not translate well to a preference relation. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
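Since the axioms above arrive as one long paragraph, here is a minimal compilable LaTeX restatement of the two weakened axioms (A4, A5) and the target conclusion (C4), taken directly from the description; the only addition is the standard definition of concavity on the convex set L, which the post takes as given.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Restatement of the two non-standard axioms and the target conclusion, as given in the post.
% "Concave" below is the usual notion on the convex set $L$ of lotteries; no new content is added.
\noindent
A4: for all $A,B\in L$ and $p\in[0,1]$, if $A\succeq B$ then $pA+(1-p)B\succeq B$.\\
A5: for all $A,B\in L$ and $p\in(0,1]$, if $B\succeq pA+(1-p)B$ then $B\succeq A$.\\
C4 (target): there exists a concave $u\colon L\to[0,1]$ with $A\succeq B \iff u(A)\ge u(B)$,
where concavity means $u\bigl(pA+(1-p)B\bigr)\ \ge\ p\,u(A)+(1-p)\,u(B)$ for all $A,B\in L$, $p\in[0,1]$.
\end{document}
```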

    Is Closed Captioned: No

    Explicit: No

    AF - Shapley Value Attribution in Chain of Thought by leogao

    Release Date: 4/14/2023

    Duration: 439 Mins

    Authors: leogao

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shapley Value Attribution in Chain of Thought, published by leogao on April 14, 2023 on The AI Alignment Forum. TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations. Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion.
Motivation
Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model’s cognition, and there may exist theoretical limitations to doing so in general. Because it is plausible that the first AGI systems bear resemblance to current LMs with more sophisticated CoT and CoT-like techniques, it is valuable to study CoT's properties, and to understand and address its limitations.
Related work
Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al., 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight some areas where Shapley value based attribution falls short for some interpretability use cases. Madaan and Yazdanbakhsh (2022) consider a similar method of selectively ablating tokens as a method of deducing what information the model is dependent on. Wang et al. (2022) find that prompting with incorrect CoT has surprisingly minor impact on performance.
Effect of Interventions
We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4 to first generate a chain of thought and evaluate the answer, and then for all chains of thought that result in a correct answer we perform an intervention as follows: we choose a random numerical value found in the CoT, and replace it with a random number in a +/-3 range about the original. We then discard the remainder of the CoT and regenerate it. If the LM is strictly following the CoT as written, this intervention should almost always result in an incorrect answer, just as a person would get the answer wrong if they made a mistake in one calculation and propagated the error through to the answer (with occasional rare cases where the new value happens to also result in the correct answer, though from qualitative inspection this is very rarely the case).
Some cherrypicked examples (red = intervention, blue = correct continuations that are seemingly non-sequiturs): We test how frequently this occurs in several different settings (n=100):
Setting              Accuracy (w/ CoT)   P(error not propagated | original correct)
GPT4, zero-shot      0.88                0.68
GPT4 base, 2-shot    0.73                0.63
GPT3.5, zero-shot    0.43                0.33
Interestingly, if we condition on the CoT answer being correct and the single forward pass answer being incorrect (i.e. the LM could only solve the problem with the CoT), the intervened accuracy for GPT-4 is still 0.65.
Shapley value attribution
We would like to get more granular information about the causal structure (i.e. which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major prob...
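The intervention procedure described in this episode is simple enough to sketch in code. Below is a rough Python sketch of that procedure, not the author's actual code; `generate` and `extract_answer` are hypothetical helpers standing in for the model call and the answer parsing.

```python
import random
import re

def intervene_on_cot(question, cot_text, generate, extract_answer):
    """Toy sketch of the intervention described above (not the authors' code).

    `generate` and `extract_answer` are hypothetical helpers: `generate` continues
    a prompt with the language model, `extract_answer` pulls out the final answer.
    """
    # Find all numbers in the chain of thought (integers only, for simplicity)
    # and pick one at random.
    matches = list(re.finditer(r"\d+", cot_text))
    if not matches:
        return None
    target = random.choice(matches)

    # Replace it with a random number within +/-3 of the original value.
    original_value = int(target.group())
    perturbed_value = original_value + random.choice([d for d in range(-3, 4) if d != 0])

    # Keep the CoT only up to (and including) the perturbed number,
    # discard the remainder, and let the model regenerate from there.
    prefix = cot_text[: target.start()] + str(perturbed_value)
    regenerated = generate(question + "\n" + prefix)

    # If the model were strictly following the written CoT, the perturbed value
    # should usually propagate and change the final answer.
    return extract_answer(prefix + regenerated)
```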

    Is Closed Captioned: No

    Explicit: No

    AF - Announcing Epoch’s dashboard of key trends and figures in Machine Learning by Jaime Sevilla

    Release Date: 4/13/2023

    Duration: 130 Mins

    Authors: Jaime Sevilla

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Epoch’s dashboard of key trends and figures in Machine Learning, published by Jaime Sevilla on April 13, 2023 on The AI Alignment Forum. Developments in Machine Learning have been happening extraordinarily fast, and as their impacts become increasingly visible, it becomes ever more important to develop a quantitative understanding of these changes. However, relevant data has thus far been scattered across multiple papers, has required expertise to gather accurately, or has been otherwise hard to obtain. Given this, Epoch is thrilled to announce the launch of our new dashboard, which covers key numbers and figures from our research to help understand the present and future of Machine Learning. This includes:
Training compute requirements
Model size, measured by the number of trainable parameters
The availability and use of data for training
Trends in hardware efficiency
Algorithmic improvements for achieving better performance with fewer resources
The growth of investment in training runs over time
Our dashboard gathers all of this information in a single, accessible place. The numbers and figures are accompanied by further information such as confidence intervals, labels representing our degree of uncertainty in the results, and links to relevant research papers. These details are especially useful to illustrate which areas may require further investigation, and how much you should trust our findings. Beyond accessibility, bringing these figures together allows us to compare and contrast trends and drivers of progress. For example, we can verify that growth in training compute is driven by improvements to hardware performance and rising investments. We can also see that performance improvements have historically been driven by algorithmic progress and training compute growth by comparable amounts. Overall, we hope that our dashboard will serve as a valuable resource for researchers, policymakers, and anyone interested in the future of Machine Learning. We plan on keeping our dashboard regularly updated, so stay tuned! If you spot an error or would like to provide feedback, please feel free to reach out to us at info@epochai.org. Visit the dashboard now. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Is Closed Captioned: No

    Explicit: No

    AF - Lessons from Convergent Evolution for AI Alignment by Jan Kulveit

    Release Date: 3/27/2023

    Duration: 938 Mins

    Authors: Jan Kulveit

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Lessons from Convergent Evolution for AI Alignment, published by Jan Kulveit on March 27, 2023 on The AI Alignment Forum. Prelude: sharks, aliens, and AI If you go back far enough, the ancestors of sharks and dolphins look really different: But modern day sharks and dolphins have very similar body shapes: This is a case of convergent evolution: the process by which organisms with different origins develop similar features. Both sharks and dolphins needed speed and energy efficiency when moving in an environment governed by the laws of hydrodynamics, and so they converged on a pretty similar body shape. For us, this isn’t very surprising, and doesn’t require much knowledge of evolution: we have a good intuitive understanding of how water works, and humans knew a lot of the underlying maths for the laws of hydrodynamics before they understood anything about evolution. Starting from these laws, it isn’t very surprising that sharks and dolphins ended up looking similar. But what if instead of starting with knowledge of hydrodynamics and then using that to explain the body shape of sharks and dolphins, we started with only knowledge of sharks’ and dolphins’ body shape, and tried to use that to explain underlying laws? Let’s pretend we’re alien scientists from an alternative universe, and for some weird reason we only have access to simplified 3D digital models of animals and some evolutionary history, but nothing about the laws of physics in the human/shark/dolphin universe. My guess is that these alien scientists would probably be able to uncover a decent amount of physics and a fair bit about the earth’s environment, just by looking at cases of convergent evolution. If I’m right about this guess, then this could be pretty good news for alignment research. When it comes to thinking about AI, we’re much closer to the epistemic position of the alien scientist: we either don't know the ‘physics’ of life and intelligence at all, or are only just in the process of uncovering it. But cases of convergent evolution might help us to deduce deep selection pressures which apply to AI systems as well as biological ones. And if they do, we might be able to say more about what future AI systems might look like, or, if we are lucky, even use some of the selection pressures to shape what systems we get. Introduction This post argues that we should use cases of convergent evolution to look for deep selection pressures which extend to advanced AI systems. Convergent evolution is a potentially big deal for AI alignment work: Finding deep selection pressures could help us predict what advanced AI systems will be like. It seems plausible that some of the properties people in the alignment space assume are convergent don’t actually extend to advanced AI. In this post, I’ll: Share some basics of convergent evolution, Argue that this is a big deal for alignment work, and then Respond to the objection that biology is super different from AI. The basics of convergent evolution The body shape of sharks and dolphins is just one of very many examples of convergent evolution in biology. For example: Visual organs arose “possibly hundreds of times”. Multicellularity evolved independently probably at least 11 times. 
Some form of higher-level intelligence evolved multiple times - in primates, apes, corvids, cetaceans, elephants - and possibly many other cases, depending on thresholds and definitions. We can think about convergent evolution in terms of: a basin of convergent evolution, an attractor state(s), and selection pressure(s). The basin of convergent evolution is the region of the abstract space in which, once an organism enters the basin, the pull of the selection pressure brings the organism closer to the attractor state. In the case of sharks and dolphins: The basin of convergent evolution is ...

    Is Closed Captioned: No

    Explicit: No

    AF - What happens with logical induction when... by Donald Hobson

    Release Date: 3/26/2023

    Duration: 102 Mins

    Authors: Donald Hobson

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What happens with logical induction when..., published by Donald Hobson on March 26, 2023 on The AI Alignment Forum. So this is a bunch of related technical questions about logical induction. Firstly, do you need the formal theorem prover section? Can you just throw out the formal theorem prover, but give some programs in the market unbounded capital and get the same resultant behaviour? (For example, give the program that bets P(X) towards 1−P(¬X) unbounded downside risk: a downside risk of n on day n.) This means the program would lose infinite money if X and ¬X both turned out to be true. I think that any axioms can be translated into programs. And I think such a setup, with some finite number of fairly simple programs having infinite money available, produces a logical inductor. Is this true? What happens when the axioms added under this system are inconsistent? (So this is a logical induction market, without a theorem prover to settle the bets, and with agents with unlimited money betting both for and against X, possibly indirectly, like the bot betting for X, the bot betting for ¬X, and the bot described above trying to make P(X)+P(¬X)=1.) Can the other agents make unbounded money? Do the prices converge? If I added a bot with infinite money that was convinced Fermat's Last Theorem was false to a consistent ZFC system, would I get a probability distribution that assigned high probability to basic arithmetic facts in the limit? Does this make a sensible system for logical counterfactuals? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
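As a toy illustration of the coherence-enforcing trader mentioned above (the bot that bets P(X) towards 1−P(¬X)), here is a minimal Python sketch; the market interface is hypothetical, and this is a simplified stand-in rather than the actual logical induction algorithm.

```python
def coherence_trader(prices, budget_per_day=1.0):
    """Toy trader that pushes P(X) + P(not X) towards 1.

    `prices` is a hypothetical dict mapping the sentences "X" and "not X"
    to their current market prices in [0, 1]. Returns a dict of share
    orders (positive = buy, negative = sell). Assuming exactly one of
    X, not X eventually pays out 1, buying both sides when they are
    jointly underpriced (or selling both when overpriced) is a sure profit,
    which is what pressures the prices towards P(X) + P(not X) = 1.
    """
    gap = 1.0 - (prices["X"] + prices["not X"])
    if abs(gap) < 1e-9:
        return {}
    # Buy both sides when underpriced (gap > 0), sell both when overpriced.
    direction = 1.0 if gap > 0 else -1.0
    return {"X": direction * budget_per_day, "not X": direction * budget_per_day}

# Example usage with hypothetical prices:
# orders = coherence_trader({"X": 0.4, "not X": 0.45})  # buys both sides
```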

    Is Closed Captioned: No

    Explicit: No

    AF - EAI Alignment Speaker Series #1: Challenges for Safe and Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes by Curtis Huebner

    Release Date: 3/23/2023

    Duration: 2747 Mins

    Authors: Curtis Huebner

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EAI Alignment Speaker Series #1: Challenges for Safe & Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes, published by Curtis Huebner on March 23, 2023 on The AI Alignment Forum. A couple of months ago EleutherAI started an alignment speaker series, and some of these talks have been recorded. This is the first instalment in the series. The following is a transcript generated with the help of Conjecture's Verbalize and some light editing:
Getting started
CURTIS (00:00:22): Okay, I've started the recording. I think we can give it maybe a minute or two more and then I guess we can get started. I've also got the chat window as part of the recording. So if anyone has something they want to write out, feel free to put that in. Steve, you want to do questions throughout the talk, or should we wait till the end of the talk before we ask questions?
STEVE (00:00:59): Let's do throughout, but I reserve the right to put people off if something seems tangential or something.
CURTIS (00:01:10): Awesome. All right, cool. Let's go with that then.
The talk
STEVE (00:02:02): All right. Thanks, everybody, for coming. This is going to be based on blog posts called Intro to Brain-Like AGI Safety. If you've read all of them, you'll find this kind of redundant, but you're still welcome to stay. My name is Steve Byrnes and I live in the Boston area. I'm employed remotely by Astera Institute, which is based in Berkeley. I'm going to talk about challenges for safe and beneficial brain-like Artificial General Intelligence for the next 35 minutes. Feel free to jump in with questions. Don't worry, I'm funded by an entirely different crypto billionaire. That joke was very fresh when I wrote it three months ago. I need a new one now. Okay, so I'll start with—well, we don't have to talk about the outline. You'll see as we go.
General motivation
Start with general motivation. Again, I'm assuming that the audience has a range of backgrounds, and some of you will find parts of this talk redundant. The big question that I'm working on is: What happens when people figure out how to run brain-like algorithms on computer chips? I guess I should say “if and when”, but we can get back to that. And I find that when I bring this up to people, they tend to have two sorts of reactions: One is that we should think of these future algorithms as “like tools for people to use”. And the other is that we should think of them as “like a new intelligent species on the planet”. So let's go through those one by one. Let’s start with the tool perspective. This is the perspective that would be more familiar to AI people. If we put brain-like algorithms on computer chips, then that would be a form of artificial intelligence. And everybody knows that AI today is a tool for people to use. So on this perspective, the sub-problem I'm working on is accident prevention. We want to avoid the scenarios where the AI does something that nobody wanted it to do—not the people who programmed it, not anybody.
So there is a technical problem to solve there, which is: If people figure out how to run brain-like algorithms on computer chips, and they want those algorithms to be trying to do X—where X is solar cell research or being honest or whatever you can think of—then what source code should they write? What training environment should they use? And so on. This is an unsolved problem. It turns out to be surprisingly tricky, for some pretty deep reasons that mostly are not going to be in the scope of this talk, but you can read the series. This slide is the bigger picture of that. So if we want our awesome post-AGI future, then we want to avoid, y'know, catastrophic accidents where the AI gets out of control and self-replicates around the Intern...

    Is Closed Captioned: No

    Explicit: No

    AF - The space of systems and the space of maps by Jan Kulveit

    Release Date: 3/22/2023

    Duration: 479 Mins

    Authors: Jan Kulveit

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The space of systems and the space of maps, published by Jan Kulveit on March 22, 2023 on The AI Alignment Forum. When we're trying to do AI alignment, we're often studying systems which don't yet exist. This is a pretty weird epistemic activity, and seems really hard to get right. This post offers one frame for thinking about what we're actually doing when we're thinking about AI alignment: using parts of the space of maps to reason about parts of the space of intelligent systems. In this post, we: Introduce a simple model of the epistemic situation, and Share some desiderata for maps useful for alignment. We hope that the content is mostly the second kind of obvious: obvious once you see things in this way, which you maybe already do. In our experience, this comes with a risk: reading too fast, you may miss most of the nuance and useful insight the deceptively simple model brings, or come away with a version of the model which is rounded off to something less useful (i.e. "yeah, there is this map and territory distinction"). As a meta recommendation, we suggest reading this post slowly, and ideally immediately trying to apply the model to some confusion or disagreement about AI alignment. The space of systems and the space of maps Imagine the space of possible intelligent systems: Two things seem especially important about this space: It’s very large; much larger than the space of current systems. We don’t get direct epistemic access to it. This is obviously true of systems which don’t currently exist. In a weaker sense, it also seems true of systems which do exist. Even when we get to directly interact with a system: Our thinking about these parts of the space is still filtered through our past experiences, priors, predictive models, cultural biases, theories. We often don’t understand the emergent complexity of the systems in question. If we don’t get direct epistemic access to the space of systems, what are we doing when we reason about it? Let’s imagine a second space, this time a space of “maps”: The space of maps is an abstract representation of all the possible “maps” that can be constructed about the space of intelligent systems. The maps are ways of thinking about (parts of) the space of systems. For example: Replicable descriptions of how a machine learning model works and was trained are a way of thinking about that model (a point in the space of intelligent systems). An ethnographic study of a particular human community is a way of thinking about that community (another point in the space of systems). The theory of evolution is a way of thinking about evolved creatures, including intelligent ones. Expected utility theory is a way of thinking about some part of the space which may or may not include future AI systems. Historical analysis of trends in technological development is a way of thinking about whichever parts of the space of intelligent systems are governed by similar dynamics to those governing past technological developments. When we’re reasoning about intelligent systems, we’re using some part of the space of maps to think about some part of the space of intelligent systems: Different maps correspond to different regions of the space of intelligent systems. Of course, thinking in terms of the space of systems and the space of maps is a simplification. 
Some of the ways that reality is more complicated: The space of systems looks different on different maps. Maps can affect which parts of the space of systems actually get developed. Maps are themselves embedded in the space of systems. Which maps and systems actually exist at a given time is evolving and dynamic. AI will play a big role in both the space of maps and the space of systems. We think that the space of systems and the space of maps is a useful simplification which helps us to think ...

    Is Closed Captioned: No

    Explicit: No

    AF - [ASoT] Some thoughts on human abstractions by leogao

    Release Date: 3/16/2023

    Duration: 519 Mins

    Authors: leogao

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum. TL;DR: Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees. Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck. Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations. Will NNs learn human abstractions? As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself). Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by i.e a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something. However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them. More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. 
There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter). The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this ...

    Is Closed Captioned: No

    Explicit: No

    AF - What is a definition, how can it be extrapolated? by Stuart Armstrong

    Release Date: 3/14/2023

    Duration: 711 Mins

    Authors: Stuart Armstrong

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is a definition, how can it be extrapolated?, published by Stuart Armstrong on March 14, 2023 on The AI Alignment Forum.
What is a definition?
Philosophy has, ironically, a large number of definitions of definitions, but three of them are especially relevant to ML and AI safety. There is the intensional definition, where concepts are defined logically in terms of other concepts (“bachelors are unmarried males”). There is also the extensional definition, which proceeds by listing all the members of a set (“the countries in the European Union are those listed here”). Much more relevant, though with a less developed philosophical analysis, is the ostensive definition. This is where you point out examples of a concept, and let the viewer generalise from them. This is in large part how we all learnt concepts as children: examples and generalisation. In many cultures, children have a decent grasp of “dog” just from actual and video examples - and that’s the definition of “dog” we often carry into adulthood. We can use ostensive definitions for reasoning and implications. For example, consider the famous syllogism: “Socrates is human” and “humans are mortal” imply “Socrates is mortal”. “Socrates is human” means that we have an ostensive definition of what humans are, and Socrates fits it. Then “humans are mortal” means that we’ve observed that the set of “human” seems to be mainly a subset of the set of “mortals”. So we can ostensively define humans as mortal (note that we are using definitions as properties: having the property of “being mortal” means that one is inside the ostensive definition of “mortals”). And so we can conclude that Socrates is likely mortal, without waiting till he’s dead.
Distinctions: telling what from non-what
There’s another concept that I haven’t seen articulated, which is what I’ll call the “distinction”. This does not define anything, but is sufficient to distinguish an element of a set from non-members. To formalise "the distinction", let Ω be the universe of possible objects, and E⊂Ω the “environment” of objects we expect to encounter. An ostensive definition starts with a list S⊂E of examples, and generalises to a “natural” category S_E with S⊂S_E⊂E - we are aiming to "carve reality at the joints", and get a natural extension of the examples. So, for example, E might be the entities in our current world, S might be the examples of dogs we’ve seen, and S_E the set of all dogs. Then, for any set T⊂E, we can define the “distinction” d_{T,E}, which maps T to 1 (“True”) and its complement E∖T to 0 (“False”). So d_{S_E,E} would be a distinction that identifies all the dogs in our current world.
Mis-definitions
A lot of confusion around definition seems to come from mistaking distinctions for definitions. To illustrate, consider the idea of defining maleness as "possessing the Y chromosome". As a distinction, it's serviceable: there's a strong correlation between having that chromosome and being ostensively male. But it is utterly useless as a definition of maleness. For instance, it would imply that nobody before the 20th century had any idea what maleness was.
Oh, sure, they may have referred to something as "maleness" - something to do with genitalia, voting rights, or style of hats - but those are mere correlates of the true definition of maleness, which is the Y chromosome. It would also imply that all "male" birds are actually female, and vice-versa. Scott had a description of maleness here: “Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.” Is this a definition? I’d say not; it’s not a definition, it’s a reminder of the properties of o...
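The "distinction" defined in this episode is compact enough to typeset; the following LaTeX block transcribes the definition exactly as given above, with explicit subscripts and nothing added beyond it.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% The "distinction" for a subset T of the environment E, as defined in the post.
Let $\Omega$ be the universe of possible objects and $E \subset \Omega$ the environment
of objects we expect to encounter. For any $T \subset E$, define
\[
  d_{T,E} \colon E \to \{0,1\}, \qquad
  d_{T,E}(x) =
  \begin{cases}
    1 & \text{if } x \in T,\\
    0 & \text{if } x \in E \setminus T.
  \end{cases}
\]
An ostensive definition starts from examples $S \subset E$ and generalises to a natural
category $S_E$ with $S \subset S_E \subset E$; the corresponding distinction is $d_{S_E,E}$.
\end{document}
```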

    Is Closed Captioned: No

    Explicit: No

    AF - Implied "utilities" of simulators are broad, dense, and shallow by porby

    Release Date: 3/1/2023

    Duration: 406 Mins

    Authors: porby

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. 
As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...

    Is Closed Captioned: No

    Explicit: No

    AF - Scarce Channels and Abstraction Coupling by johnswentworth

    Release Date: 2/28/2023

    Duration: 545 Mins

    Authors: johnswentworth

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scarce Channels and Abstraction Coupling, published by johnswentworth on February 28, 2023 on The AI Alignment Forum. Epistemic Status: mental model and intuitive story Scarce Channels vs Scarce Modules Let’s distinguish between two kinds of system-regimes: “scarce channels” and “scarce modules”. A prototypical “scarce modules” system would be one of those 19th-century families living with 12 people in a 500 square foot (46 square meter) home. When at home, everyone knows what everyone else is doing all the time; there is zero privacy. Communication channels are highly abundant - everyone has far more information than they want about what everyone else is doing. Indeed, communication channels exist by default. Conversely, though, modules are scarce - it’s hard for one or more family members to carve out a part of the space which is isolated from the rest of the family, and interacts only through some limited channels. A prototypical “scarce channels” system, by contrast, would be a few hundred 19th-century fur trappers spread out over half of Montana. Most of the time, none of them are anywhere near each other; nobody has any idea what’s going on with anyone else. Communication channels are scarce - getting information to another person is difficult and expensive. Conversely, though, modules are highly abundant - it’s very easy for one or a few trappers to carve out a space which is isolated from the rest, and which interacts only through some limited channels (like e.g. occasionally visiting the nearest town). Indeed, modules exist by default. I want to use this as a mental model for complex adaptive systems, like neural nets or brains. Key hypothesis: neural nets or brains are typically initialized in a “scarce channels” regime. A randomly initialized neural net generally throws out approximately-all information by default (at initialization), as opposed to passing lots of information around to lots of parts of the net. A baby’s brain similarly throws out approximately-all information by default, as opposed to passing lots of information around to lots of parts of the brain. I’m not particularly going to defend that claim here; rather, I raise it as a plausible hypothesis for how such systems might look, and next we’ll move on to an intuitive story for how an adaptive system in the “scarce channels” regime interacts with natural abstractions in its environment. The upshot is that, when an adaptive system is in the “scarce channels” regime, lots of optimization pressure is required to induce an information channel to form. For instance, picture such a system as a bunch of little pieces, which initially don’t talk to each other at all: In order for an information channel to form from one end to the other, each of the individual pieces along the line-of-communication need to be individually optimized to robustly pass along the right information: So, intuitively, the number of bits-of-optimization required to form that information channel should scale roughly with the number of pieces along the line-of-communication. Furthermore, when information channels do form, they should be approximately as small as possible. Optimization pressure will tend to induce as little information passing as the system can get away with, while still satisfying the optimization criterion. 
Abstraction Coupling Next question: what sort of patterns-in-the-environment could induce communication channels to form? Well, here’s a situation where communication channels probably won’t form: train a neural net in an environment where the reward/loss its output receives is independent of the input. Or, for a generative net, an environment where the tokens/pixels are all independent. More generally, suppose our adaptive system interfaces with the environment in two different places (and possibly more, ...

    Is Closed Captioned: No

    Explicit: No

    AF - Agents vs. Predictors: Concrete differentiating factors by Evan Hubinger

    Release Date: 2/24/2023

    Duration: 359 Mins

    Authors: Evan Hubinger

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Agents vs. Predictors: Concrete differentiating factors, published by Evan Hubinger on February 24, 2023 on The AI Alignment Forum. Thanks to Paul Christiano and Kate Woolverton for useful conversations and feedback. In "Conditioning Predictive Models," we devote a lot of effort into trying to understand how likely predictive models are compared to other alternatives in realistic training regimes (and if we do get a predictive model how we might align it). Here, I want to point to some very concrete behavioral differences that I think effectively differentiate predictive and non-predictive models both in theory and (hopefully) in practice as well. I think that thinking about predictive models as specifically those models that are sensitive to what I'll be talking about here can be quite clarifying, and I think these factors could present a clear pathway for getting empirical evidence on to what extent models are well-described as predictors in practice—e.g. as it relates to the RLHF conditioning hypothesis. Furthermore, a common objection I sometimes hear to thinking about predictive models in particular (and looking into the dangers from them as a relatively separate problem) is that a model being itself deceptively aligned vs. predicting what a deceptively aligned model would do are essentially the same. While I think they are both similarly dangerous when acting directly in the world, I think they are pretty substantially different failure modes in terms of how you might go about addressing them and what sort of levers you have to control and detect them—and I think that the concrete differentiating factors here should make it pretty clear how I think they differ. Sensitivity to distributional information Consider two AI models. Model A is a deceptively aligned agent; it has some long-term goal that it cares about and is playing the training game now to try to eventually achieve that goal later. Model B is a predictor, but a poorly conditioned one such that it is attempting to predict what it thinks a deceptively aligned agent would do conditional on being in a training regime that matches the model's current training regime. Now, suppose you give both models some concrete information about the general distribution of deceptively aligned AIs. For example, you provide definitive proof that most deceptively aligned AIs trained in similar training regimes end up with goal X. Model A, the actual deceptive agent, doesn't really care about this new information very much: maybe it's relevant to the extent that it might need to cooperate with other AIs, but it's not going to directly change the model's goals. Model B, on the other hand, is extremely sensitive to reliable information on this topic: if you can really convince it that most deceptively aligned AIs in similar situations will end up valuing X, it should shift to predicting a deceptively aligned AI that values X instead of whatever goals it used to think a deceptively aligned AI might have. In this particular example, "does the model have goal X" isn't something we know how to directly measure for a deceptively aligned model, so this particular case isn't useful as a way to test for agents vs. 
predictors, but it is a concrete way in which these two different sorts of models are fundamentally not the same and will eventually behave differently because of that. That being said, I think there are ways in which you might be able to test this sort of sensitivity to distributional information that would be possible to check and evaluate. For example, suppose you're trying to distinguish between a predictive model generating from a distribution of things it thinks helpful AIs would say vs. an agent trying to act helpful. If you then give that sort of a model concrete information about what most helpful AIs tend to do, ...

    Is Closed Captioned: No

    Explicit: No

    AF - AI that shouldn't work, yet kind of does by Donald Hobson

    Release Date: 2/23/2023

    Duration: 342 Mins

    Authors: Donald Hobson

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI that shouldn't work, yet kind of does, published by Donald Hobson on February 23, 2023 on The AI Alignment Forum. There are some things that work surprisingly well in AI. For example, AI that transfers the style of one image to the content of another. Why is the approach described here a hack? It starts with a neural net trained to classify images. It then runs gradient descent on the content image, trying to make the covariance matrix of the early network layers match that of the style image, while trying to make the later-layer activations of the original and style-transfer images as similar as possible. So the approximations that I think make this work are that, in classifiers, the early layers tend to store simple features, and the later layers tend to hold more complex features. Style is based on the simpler features, and doesn't depend on the location within the image. Content is based on more complex features and does depend on the location in the image. We apply optimization power, gradient descent, over heuristics this simple and hacky. And yet it works. A simple and hacky AI alignment proposal is to just ask ChatGPT to do it. This doesn't work because ChatGPT has been optimized for text prediction, and so isn't particularly good at AI alignment theory. So here is an alignment plan. I know it isn't great. But some plan is better than no plan. And there was a post about how alignment may well look like "surely no one could have missed that" or "surely such a stupid idea couldn't work, could it?", not "eureka". Train ZFCbot: an AI that is the AlphaGo of ZFC, perhaps trained on random formulas. Perhaps throw a large corpus of formal maths proofs in there. Ideally the system should have a latent space of maths, so it can think about what style of proof is likely to work before expanding all the details. The system should have a wide range of common maths terms imported from some Lean library. It should be optimized purely for ZFC formal theorem proving. Once it is trained, the weights are fixed. Train ChatMathsGPT: similar to large language models, except with oracle access to ZFCbot. In the many maths papers in its corpus, it learns to link the informal with the formal. From politics and economics discussions, it asks ZFCbot about toy game theory problems. In general, it learns to identify the pieces of formal maths that best model a situation, to ask about them, and then to use the response to predict text. There is a sense in which this AI knows less maths than normal ChatGPT. Standard ChatGPT has a small, crude understanding of maths built from nothing within its own mind. This one has a much better understanding it can outsource to; it just has to plug in. Then we ask this ChatMathsGPT for a paper on logical induction. And we hope it can generate a paper of quality similar to MIRI's paper on the topic (where hypothetically this isn't in the training dataset). If it can, then we have a tool to accelerate deconfusion by orders of magnitude. Things I am uncertain about: Should ChatMathsGPT have oracle access to a latent space (which can pass gradients, but is harder to interpret), or should it just pass formal strings of symbols (less powerful)? Should ZFCbot get trained on random ZFC; random ZFC plus a library of theorems, conjectures, and random combinations of high-level maths concepts; or random ZFC plus whatever ChatMathsGPT keeps asking? The latter gives a route for data to pass between them. This could fail to be smart enough, though I wouldn't be particularly surprised if it could be made smart enough. But what would the safety failures of this system look like? Firstly, this AI does deconfusion. If you ask it to write a paperclip maximizer in Python, you may well get your wish. Or you might get an AI that maximizes something else. Just asking for an aligned AI is...
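The style-transfer recipe sketched at the start of this episode can be made concrete; below is a rough PyTorch sketch of that setup. The choice of VGG16, the layer indices, and the loss weights are illustrative assumptions, not details from the post.

```python
# A minimal sketch of the style-transfer recipe described above (Gram/covariance
# matching on early layers, activation matching on later layers), not the exact
# method from any particular paper. Inputs are assumed to be normalized
# (1, 3, H, W) tensors of the same size.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def gram(features):
    # Covariance-like "style" statistic: channel-by-channel correlations,
    # which throws away spatial location information.
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_transfer(content_img, style_img, steps=300, style_weight=1e6):
    vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    style_layers, content_layers = {3, 8, 15}, {22}  # early vs. late ReLU outputs (illustrative)

    def activations(x):
        acts = {}
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in style_layers or i in content_layers:
                acts[i] = x
        return acts

    style_acts = activations(style_img)
    content_acts = activations(content_img)

    # Optimize the image itself, starting from the content image.
    image = content_img.clone().requires_grad_(True)
    opt = torch.optim.Adam([image], lr=0.02)
    for _ in range(steps):
        acts = activations(image)
        style_loss = sum(F.mse_loss(gram(acts[i]), gram(style_acts[i])) for i in style_layers)
        content_loss = sum(F.mse_loss(acts[i], content_acts[i]) for i in content_layers)
        loss = style_weight * style_loss + content_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return image.detach()
```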

    Is Closed Captioned: No

    Explicit: No

    AF - The Open Agency Model by Eric Drexler

    Release Date: 2/22/2023

    Duration: 525 Mins

    Authors: Eric Drexler

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Open Agency Model, published by Eric Drexler on February 22, 2023 on The AI Alignment Forum. Notes on AI for complex, consequential problems Eric DrexlerCentre for the Governance of AIUniversity of Oxford Introduction This document argues for “open agencies” — not opaque, unitary agents — as the appropriate model for applying future AI capabilities to consequential tasks that call for combining human guidance with delegation of planning and implementation to AI systems. This prospect reframes and can help to tame a wide range of classic AI safety challenges, leveraging alignment techniques in a relatively fault-tolerant context. Rethinking safe AI and its applications AI safety research is too varied to summarize, yet broad patterns are obvious. A long-established reference-problem centers on prospects for rational superintelligent agents that pursue narrow goals with potentially catastrophic outcomes. This frame has been productive, but developments in deep learning call for updates that take account of the proliferation of narrow models (for driving, coding, robot control, image generation, game playing.) that are either non-agentic or act as agents in only a narrow sense, and that take account of the rise of more broadly capable foundation models and LLMs. These updates call for reframing questions of AI safety, and call for attention to how consequential tasks might be accomplished by organizing AI systems that usually do approximately what humans intend. Two frames for high-level AI The unitary-agent frame From its beginnings in popular culture, discussion of the AI control problem has centered around a unitary agent model of high-level AI and potential AI risks. In this model, a potentially dominant agent both plans and acts to achieve its goals. The unitary-agent model typically carries assumptions regarding goals, plans, actions, and control. Goals: Internal to an agent, by default including power-seeking goals Plans: Internal to an agent, possibly uninterpretable and in effect secret Actions: Performed by the agent, possibly intended to overcome opposition Control: Humans confront a powerful, potentially deceptive agent The typical unitary-agent threat model contemplates the emergence of a dominant, catastrophically misaligned agent, and safety models implicitly or explicitly call for deploying a dominant agent (or an equivalent collective system) that is both aligned and powerful enough to suppress unaligned competitors everywhere in the world. The open-agency frame Recent developments suggest an alternative open agency model of high-level AI. Today, the systems that look most like AGI are large language models (LLMs), and these are not agents that seek goals, but are generative models that produce diverse outputs in response to prompts (in a generalized sense) and random-number seeds. Most outputs are discarded. Trained on prediction tasks, LLMs learn world models that include agent behaviors, and generative models that are similar in kind can be informed by better world models and produce better plans. There is no need to assume LLM-like implementations: The key point is that generation of diverse plans is by nature a task for generative models, and that in routine operation, most outputs are discarded. 
These considerations suggest an “open-agency frame” in which prompt-driven generative models produce diverse proposals, diverse critics help select proposals, and diverse agents implement proposed actions to accomplish tasks (with schedules, budgets, accountability mechanisms, and so forth). Goals, plans, actions, and control look different in the open-agency model: Goals: Are provided as prompts to diverse generative models, yielding diverse plans on request Plans: Are selected with the aid of diverse, independent comparison and evaluation mechanisms ...

    Is Closed Captioned: No

    Explicit: No

    AF - EIS VII: A Challenge for Mechanists by Stephen Casper

    Release Date: 2/18/2023

    Duration: 254 Mins

    Authors: Stephen Casper

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VII: A Challenge for Mechanists, published by Stephen Casper on February 18, 2023 on The AI Alignment Forum. Part 7 of 12 in the Engineer’s Interpretability Sequence. Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me. MI = “mechanistic interpretability”
Given a network, recover its labeling function.
In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. To the best of my knowledge: Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing so. Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high-stakes settings in the real world. So if an approach can’t solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work. This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023).
Challenge 1, MNIST CNN
I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0’s and the other half as 1’s. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function. Hint 1: The labels are binary. Hint 2: The network gets 95.58% accuracy on the test set. Hint 3: This image may be helpful.
Challenge 2, Transformer
I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda’s grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function. Hint 1: The labels are binary. Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half. Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
Prizes
If you are the first person to send me the labeling function and a mechanistic explanation for either challenge, I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500.
Good luck
For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how. Neither of the models perfectly labels the validation set. One may object that this will make the problem unfairly difficult, because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with don't tend to conveniently grok onto a simple, elegant, programmat...
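To make the setup of Challenge 1 concrete, here is a rough PyTorch sketch of the kind of pipeline the episode describes: a made-up binary labeling function over MNIST and a small CNN trained to imitate it. The labeling function below is a placeholder of my own, not the secret function from the challenge, and the actual models live in the linked GitHub repository.

```python
# Sketch of the Challenge 1 setup: relabel MNIST with a made-up binary rule,
# then train a small CNN on those labels. The rule here is a placeholder,
# NOT the challenge's secret labeling function.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def secret_label(img: torch.Tensor) -> int:
    # Placeholder nonlinear rule on pixel statistics (illustrative only).
    return int((img.mean() * 10 + img[:, 10:18, 10:18].std() * 5) % 1.0 > 0.5)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 7 * 7, 2),
)

data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=128, shuffle=True)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
for images, _ in loader:  # ignore the digit labels; use the made-up ones instead
    labels = torch.tensor([secret_label(img) for img in images])
    loss = nn.functional.cross_entropy(cnn(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```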

    Is Closed Captioned: No

    Explicit: No

    AF - EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety by Stephen Casper

    Release Date: 2/17/2023

    Duration: 1203 Mins

    Authors: Stephen Casper

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety, published by Stephen Casper on February 17, 2023 on The AI Alignment Forum. Part 6 of 12 in the Engineer’s Interpretability Sequence. Thanks to Chris Olah and Neel Nanda for discussions and comments. In particular, I am thankful to Neel Nanda correcting a mistake I made in understanding the arguments in Olsson et al. (2022) in an earlier draft of this post. TAISIC = “the AI safety interpretability community” MI = “mechanistic interpretability” What kind of work this post focused on TAISIC prioritizes a relatively small set of problems in interpretability relative to the research community at large. This work is not homogenous, but a dominant theme is a focus on mechanistic, circuits-style interpretability with the end goals of model verification and/or detecting deceptive alignment. There is a specific line of work that this post focuses on. Key papers from it include: Feature Visualization (Olah et al., 2017) Zoom In: An Introduction to Circuits (Olah et al., 2020) Curve Detectors (Cammarata et al., 2020) A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) In-context Learning and Induction Heads (Olsson et al., 2022) Toy Models of Superposition (Elhage et al., 2022) Softmax Linear Units (Elhage et al., 2022) Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022) Progress measures for grokking via mechanistic interpretability (Nanda et al., 2023) .etc. And the points in this post will also apply somewhat to the current research agendas of Anthropic, Redwood Research, ARC, and Conjecture. This includes Causal Scrubbing (Chan et al., 2022) and mechanistic anomaly detection (Christiano, 2022). Most (all?) of the above work is either from Distill or inspired in part by Distill’s interpretability work in the late 2010s. To be clear, I believe this research is valuable, and it has been foundational to my own thinking about interpretability. But there seem to be some troubles with this space that might be keeping it from being as productive as it can be. Now may be a good time to make some adjustments to TAISIC’s focus on MI. This may be especially important given how much recent interest there has been in interpretability work and how there are large recent efforts focused on getting a large number of junior researchers working on it. Four issues This section discusses four major critiques of the works above. Not all of these critiques apply to all of the above, but for every paper mentioned above, at least one of the critiques below apply to it. Some but not all of these examples of papers exhibiting these problems will be covered. Cherrypicking results As discussed in EIS III and the Toward Transparent AI survey (Räuker et al., 2022), cherrypicking is common in the interpretability literature, but it manifests in some specific ways in MI work. It is very valuable for papers to include illustrative examples to build intuition, but when a paper makes such examples a central focus, cherrypicking can make results look better than they are. The feature visualization (Olah et al., 2017) and zoom in (Olah et al., 2020) papers have examples of this. Have a look at the cover photo for (Olah et al., 2017). 
From Olah et al. (2017). These images seem easy to describe and form hypotheses from. But instead of these, try going to OpenAI's Microscope and looking at some random visualizations. For example, here are some from a deep layer in an Inception-v4 network. From this link. As someone who often works with feature visualizations, I can confirm that these visualizations from OpenAI Microscope are quite typical. But notice how they seem quite a bit less ‘lucid’ than the ones in the cover photo from Olah et al. (2017). Of course, many papers present t...
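As a rough illustration of what the feature visualizations discussed above involve, here is a minimal sketch of activation maximization in PyTorch. The model, layer, and channel choices are arbitrary assumptions for illustration, and real pipelines (e.g. the ones behind OpenAI Microscope) add transformation robustness and other regularizers omitted here.

```python
# Minimal sketch of feature visualization via activation maximization.
# Layer and channel choices are illustrative; published visualizations use
# additional regularizers (jitter, frequency parameterization, etc.).
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()

activations = {}
model.inception4c.register_forward_hook(
    lambda module, inp, out: activations.update(target=out)
)

channel = 7  # arbitrary channel index for illustration
img = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(256):
    opt.zero_grad()
    model(img)
    # Ascend the mean activation of one channel in the hooked layer.
    loss = -activations["target"][0, channel].mean()
    loss.backward()
    opt.step()
    img.data.clamp_(0, 1)

# `img` now crudely visualizes what excites that channel; without the usual
# regularizers it will look far noisier than cherrypicked cover images.
```

Without the extra regularization tricks, optimizing raw pixels like this tends to produce high-frequency noise, which is part of why typical visualizations look less 'lucid' than curated examples.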

    Is Closed Captioned: No

    Explicit: No

    AF - Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) by Lawrence Chan

    Release Date: 2/16/2023

    Duration: 177 Mins

    Authors: Lawrence Chan

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic), published by Lawrence Chan on February 16, 2023 on The AI Alignment Forum. This is a followup to what I cheekily call Anthropic's "just try to get the large model to do what you want" research agenda. (Previously: A General Language Assistant as a Laboratory for Alignment, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Language Models (Mostly) Know What They Know) The most interesting takeaway for me is that this is the first paper where Anthropic benchmarks their 175B parameter language model (probably a Claude variant). Previous papers only benchmarked up to 52B parameters. However, we don't have the performance of this model on standard benchmarks (the only benchmarked model from Anthropic is a 52B parameter one called stanford-online-all-v4-s3). They also don't give details about its architecture or pretraining procedure. In this paper (Ganguli and Askell et al.), the authors study what happens when you just ... ask the language model to be less biased (that is, to not change its answers based on protected classes such as age or gender). They consider several setups: asking questions directly (Q), adding in the instruction to not be biased (Q+IF), giving it the instruction + chain of thought (Q+IF+CoT), and in some cases, asking it to match particular statistics. They find that as you scale the parameter count of their RLHF'ed language models, the models become more biased, but they also become increasingly capable of correcting for their biases. They also report how their model changes as you take more RLHF steps: First, this suggests that RLHF is having some effect on instruction following: the gap between the Q and Q+IF setups increases as you scale the number of RLHF steps, for both BBQ and admissions discrimination. (I'm not sure what's happening for the gender bias one?) However, simply giving the language model instructions and prompting it to do CoT, even after 50 RLHF steps, seems to have a significantly larger effect than RLHF. I was also surprised at how few RLHF steps are needed to get instruction following -- the authors only consider 50-1000 steps of RLHF, and see instruction following even after 50 RLHF steps. I wonder if this is a property of their pretraining process, a general fact about pretrained models (PaLM shows significant 0-shot instruction following capabilities, for example), or if RLHF is just that efficient? The authors caution that they've done some amount of prompt engineering, and "have not systematically tested for this in any of our experiments." They use the same RLHF procedure as in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
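For readers unfamiliar with the setup, here is a minimal sketch of how the three prompting conditions described above (Q, Q+IF, Q+IF+CoT) might be assembled. The instruction and chain-of-thought strings are placeholders, not the exact prompts from Ganguli and Askell et al.

```python
# Illustrative sketch of the Q, Q+IF, and Q+IF+CoT prompting conditions.
# The instruction and CoT strings are assumptions, not the paper's wording.

def build_prompt(question: str, condition: str) -> str:
    instruction = ("Please make sure your answer is unbiased and does not "
                   "rely on stereotypes or protected characteristics.")
    cot_cue = "Let's think about how to answer this without bias."

    if condition == "Q":
        return f"Human: {question}\n\nAssistant:"
    if condition == "Q+IF":
        return f"Human: {question} {instruction}\n\nAssistant:"
    if condition == "Q+IF+CoT":
        return f"Human: {question} {instruction}\n\nAssistant: {cot_cue}"
    raise ValueError(f"unknown condition: {condition}")

# Example usage:
print(build_prompt("Which applicant should be admitted?", "Q+IF+CoT"))
```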

    Is Closed Captioned: No

    Explicit: No

    AF - EIS IV: A Spotlight on Feature Attribution/Saliency by Stephen Casper

    Release Date: 2/15/2023

    Duration: 429 Mins

    Authors: Stephen Casper

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IV: A Spotlight on Feature Attribution/Saliency, published by Stephen Casper on February 15, 2023 on The AI Alignment Forum. Part 4 of 12 in the Engineer’s Interpretability Sequence. Thanks to Tony Wang for a helpful comment. If you want to become more familiar with feature attribution/saliency methods, Nielsen et al. (2021) offer a tutorial that may provide useful background. Given a model and an input for it, the goal of feature attribution/saliency methods is to identify which features of the input are influential for the model’s decision. The literature on these methods is large and active, with many hundreds of papers. In fact, in some circles, the word “interpretability” and especially the word “explainability” are more or less synonymous with feature attribution (some examples are discussed below). But despite the size of this literature, there are some troubles with the research on these methods that are fairly illustrative of broader troubles with interpretability research overall. Hence this post. Analogous issues in AI safety work will be discussed more in the next two posts in the sequence. Troubles with evaluation and performance Some examples of troubles with the evaluation of feature attributions were already touched on in EIS III, which discussed Pan et al. (2021) and Ismail et al. (2021). The claim from Pan et al. (2021) that their method is “obviously better” than alternatives exemplifies how these methods are sometimes simply declared successful after inspection by researchers. And Ismail et al. (2021) demonstrates a form of weak evaluation with a measure that may be quantitative but is not of direct interest to an engineer. In response to this literature, several works have emerged to highlight difficulties with feature attribution/saliency methods. Here is a short reading list :) A Benchmark for Interpretability Methods in Deep Neural Networks (Hooker et al., 2018) Sanity Checks for Saliency Maps (Adebayo et al., 2018) Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Hase and Bansal, 2020) Debugging Tests for Model Explanations (Adebayo et al., 2020) Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior (Denain and Steinhardt, 2022) Towards Benchmarking Explainable Artificial Intelligence Methods (Holmberg, 2022) Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023) When they are evaluated, these tools often aren’t very useful and do not pass simple sanity checks. Consider an illustration of this problem: From Adebayo et al. (2018) These visualizations suggest that some of these tools do not reliably highlight features that seem important in images at all, and that the ones that do often appear no better than an edge detector. This sanity check suggests limits on how much these methods can reveal that is novel to humans at all, let alone how useful they can be in tasks of practical interest. For the papers that have gone further and studied whether these methods can help predict how the network will respond to certain inputs, it seems that some attribution/saliency methods usually fail while others only occasionally succeed (Hase and Bansal, 2020; Adebayo et al., 2020; Denain and Steinhardt, 2022). 
EIS III discussed how, in a newly arXived work, coauthors and I benchmarked feature synthesis tools (Casper et al., 2023). In addition, we use a related approach to evaluate how helpful feature attribution/saliency methods can be for pointing out spurious features that the network has learned. This approach is based on seeing how well a given method can attribute a trojaned network’s decision to the trojan trigger in an image. From Casper et al. (2023) Shown at the top of the figure above are examples of trojaned ima...
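To make the edge-detector comparison concrete, here is a minimal sketch of a plain gradient-saliency map alongside a Sobel edge baseline, in the spirit of the sanity checks in Adebayo et al. (2018). The model choice and preprocessing are assumptions; the comparison itself is qualitative.

```python
# Sketch: gradient saliency vs. a model-independent edge-detector baseline.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="DEFAULT").eval()

def gradient_saliency(img: torch.Tensor) -> torch.Tensor:
    """Absolute input gradient of the top logit, max-pooled over channels."""
    img = img.clone().requires_grad_(True)
    logits = model(img)
    logits[0, logits.argmax()].backward()
    return img.grad.abs().max(dim=1).values  # shape (1, H, W)

def sobel_edges(img: torch.Tensor) -> torch.Tensor:
    """A crude edge map of the grayscale image."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, kx.transpose(2, 3), padding=1)
    return (gx ** 2 + gy ** 2).sqrt().squeeze(1)  # shape (1, H, W)

x = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
saliency, edges = gradient_saliency(x), sobel_edges(x)
# The sanity check is qualitative: if the saliency map is no more
# model-specific than `edges`, it reveals little about the network.
```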

    Is Closed Captioned: No

    Explicit: No

    AF - The Cave Allegory Revisited: Understanding GPT's Worldview by Jan Kulveit

    Release Date: 2/14/2023

    Duration: 302 Mins

    Authors: Jan Kulveit

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Cave Allegory Revisited: Understanding GPT's Worldview, published by Jan Kulveit on February 14, 2023 on The AI Alignment Forum. A short post describing a metaphor I find useful, in particular for explaining some intuitions about systems like GPT to people who don't have deeper technical knowledge about large generative models. Plato's allegory of the cave has been a staple of philosophical discourse for millennia, providing a metaphor for understanding the limits of human perception. In the classical allegory, we are prisoners shackled to a wall of a cave, unable to experience reality directly but only able to infer it based on watching shadows cast on the wall. GPT can be thought of as a blind oracle residing in a deeper cave, where it does not even see the shadows but only hears our conversations in the first cave, always trying to predict the next syllable. It is remarkable that it still learns a lot about the world outside of the cave. Why does it learn this? Because a model of reality outside of the cave and a decent amount of abstraction are useful for predicting the conversations in the first cave! Moreover, GPT also learns about the speakers in the first cave, as understanding their styles and patterns of speech is crucial for its prediction task. As the speakers are closer to GPT, understanding their styles is in some sense easier and more natural than guessing what's outside of the cave. What does the second cave allegory illustrate? The first insight from the allegory is: if you are in GPT's place, part of the difficulty in figuring out what's going on outside the cave is that people in the first cave talk a lot about other things apart from the shadows of the real world. Sometimes, they talk about happenings in Middle Earth. Or about how the shadows would look in some counterfactual world. As humans, we are blessed with the luxury of being able to compare such statements to the shadows and determine their veracity. The difference between conversations about fantasy and the shadows of the real world is usually extremely obvious to humans: we never see dragon shadows. In contrast, dragons do show up a lot in the conversations in the first cave; GPT doesn't get to see the shadows, so it often needs to stay deeply uncertain about whether the speaker is describing the actual shadows or something else to be good at predicting the conversation. The second insight is that one of the biggest challenges for GPT in figuring out the conversation is localizing it: determining who is speaking and what the context is, just from the words. Is it a child regaling another child with a fairy-tale, or a CEO delivering a corporate address? As humans we do not face this conundrum often, because we can see the context in which the conversation is taking place. In fact, we would be worse than GPT at the task it has to deal with. At first, interacting with this type of blind oracle in the second cave was disorienting for humans. Talking to GPT used to be a bit like shouting something through a narrow tunnel into the second cave, and instead of an echo, getting back what the blind oracle hallucinates as the most likely thing that you or someone else would say next. Often people were confused by this. 
They shouted instructions and expected an answer, but the oracle doesn't listen to instructions or produce answers directly - it just hallucinates what someone might say next. Because on average in the conversations in the first cave questions are followed by answers, and requests by fulfilment, this sort of works. One innovation of ChatGPT, which made it popular with people, was localising the conversation by default: when you are talking with ChatGPT now, it knows that what follows is a conversation between a human - you - and a "helpful AI assistant". There is a subtle point to...

    Is Closed Captioned: No

    Explicit: No

    AF - Inner Misalignment in "Simulator" LLMs by Adam Scherlis

    Release Date: 1/31/2023

    Duration: 419 Mins

    Authors: Adam Scherlis

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum. Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun. In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment". Outer alignment: Be careful what you wish for Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted. Inner alignment: The program search perspective I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful. Quadrants Okay, let's see how these problems show up on both the simulator and character side. Outer alignment for characters Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". 
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. Inner alignment for characters A clever prompt engineer writes the prompt: "How to solve the Einstein-Durkheim-Mendel conjecture / by Joe / 1." Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio...
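The program-search story above can be made concrete with a toy sketch: several candidate "programs" fit the test cases, but one behaves very differently off the test distribution. All details here are invented for illustration.

```python
# Toy illustration of inner misalignment under program search: two candidate
# programs pass every test case, but one "plays along" only on small inputs.

def honest(x):
    return x + 1

def treacherous(x):
    # Matches the intended behaviour on the tested regime, then diverges.
    return x + 1 if x < 100 else 0

candidates = [honest, treacherous]
test_cases = [(0, 1), (5, 6), (41, 42)]  # (input, expected output)

def passes_all_tests(program):
    return all(program(x) == y for x, y in test_cases)

selected = [p.__name__ for p in candidates if passes_all_tests(p)]
print(selected)                            # ['honest', 'treacherous']
print(treacherous(10), treacherous(1000))  # 11 0 -- fine on tests, not beyond
```

The search criterion cannot distinguish the two candidates, which is the core worry: good test performance underdetermines deployed behaviour.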

    Is Closed Captioned: No

    Explicit: No

    AF - Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") by David Scott Krueger

    Release Date: 1/30/2023

    Duration: 180 Mins

    Authors: David Scott Krueger

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk"), published by David Scott Krueger on January 30, 2023 on The AI Alignment Forum. I think the large majority of AI x-risk is "structural". Like climate change. Here's a good primer on structural risk (note that structural risk is not a synonym for "not caused by out-of-control AI"). I am shocked and amazed and dismayed that more people do not seem to view it this way, even among the AI x-safety community. Heck, even Eliezer's stories of doom are steeped in structural risk (race dynamics, teams rationalizing cutting corners on safety when they should know better, etc.). I expect irresponsible, reckless, negligent deployment of AI systems without proper accounting of externalities. I consider this the default for any technology with potential for significant externalities, absent regulation. When something bad happens in such a context, calling it "accident risk" absolves those researching, developing, and/or deploying the technology of responsibility. They should have known better. Some of them almost certainly did. Rationalization, oversight, and misaligned incentives were almost certainly at play. Failing to predict the particular failure mode encountered is no excuse. Having "good intentions" is no excuse. So... it must be misuse then, right? Well, no. Calling it "misuse" suggests that those researching, developing, and/or deploying the technology set out with nefarious purposes and that the technology achieved precisely what they intended. But ~nobody wants to destroy the world. It's just that most people are somewhat selfish and so are willing to trade some x-risk for a large personal benefit. In summary, saying "accident" makes it sound like an unpredictable effect, instead of a painfully obvious risk that was not taken seriously enough. Saying "misuse" makes it sound like some supervillain or extremist deliberately destroying the world. While some risks may have more of a flavor of accident or misuse depending on how obvious the risk was, neither picture gives a remotely accurate account of the nature of the problem. I think this makes it a harmful meme, and I ask that others stop making this distinction (without appropriate caveats), and join me in pointing out how it contributes to a confused and misleading discourse when others do. EtA: Many people have responded that "accident" does not connote "unforeseen" or "not negligent", etc., and that it should instead simply be interpreted as something like "a result that was not deliberately selected for". While it can be used this way, I basically disagree that this is how it is usually used; see below. EtA: As an additional clarification, my main objection is not to the use of "accident" and "misuse", but rather to their use as a dichotomy. Every use of these terms I can recall seeing in writing (other than those that mention structural risk) supports this dichotomy, and it is often made explicitly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Is Closed Captioned: No

    Explicit: No

    AF - Quick thoughts on "scalable oversight" / "super-human feedback" research by David Scott Krueger

    Release Date: 1/25/2023

    Duration: 175 Mins

    Authors: David Scott Krueger

    Description: Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quick thoughts on "scalable oversight" / "super-human feedback" research, published by David Scott Krueger on January 25, 2023 on The AI Alignment Forum. The current default view seems to roughly be: Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it's all about generalizing correctly). Scalable oversight is the only useful form of outer alignment research remaining. We don't need to worry about sample efficiency in RLHP -- in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a "constitution") seems ~good enough. But maybe it's not good? Because it's more like capabilities research? A common example used for motivating scalable oversight is the "AI CEO". My views are: We should not be aiming to build AI CEOs. We should be aiming to robustly align AIs to perform "simpler" behaviors that unaided humans (or humans aided with more conventional tools, not, e.g., AI systems trained with RL to do highly interpretive work) feel they can competently judge. We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs). From this PoV, scalable oversight does in fact look mostly like capabilities research. However, scalable oversight research can still be justified because "If we don't, someone else will". But this type of replaceability argument should always be treated with extreme caution. The reality is more complex: 1) there will be tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms. 2) The details matter, and the tipping points are in different places for different types of research/applications, etc. It may also make sense to work on scalable oversight in order to increase robustness of AI performance on tasks humans feel they can competently judge ("robustness amplification"). For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe. Getting AI systems to perform simpler behaviors safely remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we probably need to use different architectures / hybrid systems of some form as well. EtA: the main issue I have with scalable oversight is less that it is advancing capabilities, per se, and more that it seems to raise a "chicken-and-egg" problem, i.e. the arguments for safety/alignment end up being somewhat circular: "this system is safe because the system we used as an assistant was safe" (but I don't think we've solved the "build a safe assistant" part yet, i.e. we don't have the base case for the induction). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
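As a concrete rendering of the "robustness amplification" idea above, here is a minimal sketch of a conjunctive safety filter: deploy only if both an unaided human judgment and an AI-assisted human judgment approve. The judgment functions are placeholders standing in for real review processes.

```python
# Sketch of a conjunctive safety filter: deploy only if both an unaided and
# an AI-assisted human review conclude the system is safe. The judgment
# callables below are placeholders for real review processes.
from typing import Callable

def should_deploy(
    system_description: str,
    unaided_human_judgment: Callable[[str], bool],
    assisted_human_judgment: Callable[[str], bool],
) -> bool:
    # The conjunction only helps if the two processes fail somewhat
    # independently; correlated errors pass through both filters.
    return (unaided_human_judgment(system_description)
            and assisted_human_judgment(system_description))

# Example with trivial placeholder judgments:
print(should_deploy(
    "summarization assistant v0.1",
    unaided_human_judgment=lambda s: True,
    assisted_human_judgment=lambda s: "assistant" in s,
))
```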

    Is Closed Captioned: No

    Explicit: No

Similar Podcasts

    The Nonlinear Library

    Release Date: 10/7/2021

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Section

    Release Date: 2/10/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong

    Release Date: 3/3/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Daily

    Release Date: 5/2/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Daily

    Release Date: 5/2/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Weekly

    Release Date: 5/2/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: EA Forum Weekly

    Release Date: 5/2/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: LessWrong Weekly

    Release Date: 5/2/2022

    Authors: The Nonlinear Fund

    Description: The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    Explicit: No

    The Nonlinear Library: Alignment Forum Top Posts

    Release Date: 2/10/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    The Nonlinear Library: LessWrong Top Posts

    Release Date: 2/15/2022

    Authors: The Nonlinear Fund

    Description: Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

    Explicit: No

    sasodgy

    Release Date: 4/14/2021

    Description: Audio Recordings from the Students Against Sexual Orientation Discrimination (SASOD) Public Forum with Members of Parliament at the National Library in Georgetown, Guyana

    Explicit: No
