All Episodes

LessWrong (Curated & Popular) — 857 episodes

"Automated Alignment is Harder Than You Think" by Aleksandr Bowkis, Marie_DB, Jacob Pfau, Geoffrey Irving

"MATS 9 Retrospective & Advice" by beyarkay

"The primary sources of near-term cybersecurity risk" by lc

"The Owned Ones" by Eliezer Yudkowsky

"The Iliad Intensive Course Materials" by Leon Lang, David Udell, Alexander Gietelink Oldenziel

"The Darwinian Honeymoon - Why I am not as impressed by human progress as I used to be" by Elias Schmied

"What I did in the hedonium shockwave, by Emma, age six and a half" by ozymandias

"Bad Problems Don’t Stop Being Bad Because Somebody’s Wrong About Fault Analysis" by Linch

"x-risk-themed" by kave

"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey

"It’s nice of you to worry about me, but I really do have a life" by Viliam

"Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI" by Eliezer Yudkowsky

"Dairy cows make their misery expensive (but their calves can’t)" by Elizabeth

"Takes from two months as an aspiring LLM naturalist" by AnnaSalamon

"Intelligence Dissolves Privacy" by Vaniver

"How Go Players Disempower Themselves to AI" by Ashe Vazquez Nuñez

"On today’s panel with Bernie Sanders" by David Scott Krueger

"Not a Paper: “Frontier Lab CEOs are Capable of In-Context Scheming”" by LawrenceC

"llm assistant personas seem increasingly incoherent (some subjective observations)" by nostalgebraist

"LessWrong Shows You Social Signals Before the Comment" by TurnTrout

"Update on the Alex Bores campaign" by Eric Neyman

"Community misconduct disputes are not about facts" by mingyuan

"The paper that killed deep learning theory" by LawrenceC

"Forecasting is Way Overrated, and We Should Stop Funding It" by mabramov

"Your Supplies Probably Won’t Be Stolen in a Disaster" by jefftk

"10 posts I don’t have time to write" by habryka

"$50 million a year for a 10% chance to ban ASI" by Andrea_Miotti, Alex Amadori, Gabriel Alfour

"Evil is bad, actually (Vassar and Olivia Schaefer callout post)" by plex

"10 non-boring ways I’ve used AI in the last month" by habryka

"Feel like a room has bad vibes? The lighting is probably too “spiky” or too blue" by habryka

"Quality Matters Most When Stakes are Highest" by LawrenceC

"Reevaluating AGI Ruin in 2026" by lc

"Having OCD is like living in North Korea (Here’s how I escaped)" by Declan Molony

"There are only four skills: design, technical, management and physical" by habryka

"Meaningful Questions Have Return Types" by Drake Morrison

"Carpathia Day" by Drake Morrison

"Let goodness conquer all that it can defend" by habryka

"Do not conquer what you cannot defend" by habryka

"Nectome: All That I Know" by Raelifin

"Current AIs seem pretty misaligned to me" by ryan_greenblatt

"Annoyingly Principled People, and what befalls them" by Raemon

"Morale" by J Bostock

"Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes" by Alex Mallen, ryan_greenblatt

"The policy surrounding Mythos marks an irreversible power shift" by sil

"Only Law Can Prevent Extinction" by Eliezer Yudkowsky

"Dario probably doesn’t believe in superintelligence" by RobertM

"Daycare illnesses" by Nina Panickssery

"If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines" by ryan_greenblatt

"Do not be surprised if LessWrong gets hacked" by RobertM

"My picture of the present in AI" by ryan_greenblatt

"The effects of caffeine consumption do not decay with a ~5 hour half-life" by kman

"AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines" by ryan_greenblatt

"dark ilan" by ozymandias

"Dispatch from Anthropic v. Department of War Preliminary Injunction Motion Hearing" by Zack_M_Davis

"The Corner-Stone" by Benquo

"The Practical Guide to Superbabies" by GeneSmith

"Anthropic’s Pause is the Most Expensive Alarm in Corporate History" by Ruby

"“You Have Not Been a Good User” (LessWrong’s second album)" by habryka

"Lesswrong Liberated" by Ronny Fernandez

"Product Alignment is not Superintelligence Alignment (and we need the latter to survive)" by plex

"Gyre" by vgel

"Some things I noticed while LARPing as a grantmaker" by Zach Stein-Perlman

"My hobby: running deranged surveys" by leogao

"Socrates is Mortal" by Benquo

"The Terrarium" by Caleb Biddulph

"My Most Costly Delusion" by Ihor Kendiukhov

"The Case for Low-Competence ASI Failure Scenarios" by Ihor Kendiukhov

"Is fever a symptom of glycine deficiency?" by Benquo

"You can’t imitation-learn how to continual-learn" by Steven Byrnes

"Nullius in Verba" by Aurelia

"Broad Timelines" by Toby_Ord

"No, we haven’t uploaded a fly yet" by Ariel Zeleznikow-Johnston

"Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis

"PSA: Predictions markets often have very low liquidity; be careful citing them." by Eye You

"“The AI Doc” is coming out March 26" by Rob Bensinger, Beckeck

"Customer Satisfaction Opportunities" by Tomás B.

"Requiem for a Transhuman Timeline" by Ihor Kendiukhov

"Personality Self-Replicators" by eggsyntax

"My Willing Complicity In “Human Rights Abuse”" by AlphaAndOmega

"Economic efficiency often undermines sociopolitical autonomy" by Richard_Ngo

"Don’t Let LLMs Write For You" by JustisMills

"Thoughts on the Pause AI protest" by philh

"Prologue to Terrified Comments on Claude’s Constitution" by Zack_M_Davis

"Less Dead" by Aurelia

"Gemma Needs Help" by Anna Soligo

"On Independence Axiom" by Ihor Kendiukhov

"Solar storms" by Croissanthology

"Schelling Goodness, and Shared Morality as a Goal" by Andrew_Critch

"Maybe there’s a pattern here?" by dynomight

"OpenAI’s surveillance language has many potential loopholes and they can do better" by Tom Smith

"An Alignment Journal: Coming Soon" by Dan MacKinlay, JessRiedel, Edmund Lau, Daniel Murfet, Scott Aaronson, Jan_Kulveit

"Frontier AI companies probably can’t leave the US" by Anders Woodruff

"Persona Parasitology" by Raymond Douglas

"Here’s to the Polypropylene Makers" by jefftk

"Anthropic: “Statement from Dario Amodei on our discussions with the Department of War”" by Matrice Jacobine

"Are there lessons from high-reliability engineering for AGI safety?" by Steven Byrnes

"Open sourcing a browser extension that tells you when people are wrong on the internet" by lc

"The persona selection model" by Sam Marks

"Responsible Scaling Policy v3" by HoldenKarnofsky

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

"The Spectre haunting the “AI Safety” Community" by Gabriel Alfour

"Why we should expect ruthless sociopath ASI" by Steven Byrnes

"You’re an AI Expert – Not an Influencer" by Max Winga

"The optimal age to freeze eggs is 19" by GeneSmith

"The truth behind the 2026 J.P. Morgan Healthcare Conference" by Abhishaike Mahajan

"The world keeps getting saved and you don’t notice" by Bogoed

"Solemn Courage" by aysja

"Life at the Frontlines of Demographic Collapse" by Martin Sustrik

"Why You Don’t Believe in Xhosa Prophecies" by Jan_Kulveit

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

"My journey to the microwave alternate timeline" by Malmesbury

"Stone Age Billionaire Can’t Words Good" by Eneasz

"On Goal-Models" by Richard_Ngo

"Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning" by megasilverfist

"Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics" by eleweek

"Post-AGI Economics As If Nothing Ever Happens" by Jan_Kulveit

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

"Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)" by RobertM

"Conditional Kickstarter for the “Don’t Build It” March" by Raemon

"How to Hire a Team" by Gretta Duleba

"The Possessed Machines (summary)" by L Rudolf L

"Ada Palmer: Inventing the Renaissance" by Martin Sustrik

"AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty)" by Stanislav Fort

"Dario Amodei – The Adolescence of Technology" by habryka

"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton

"Does Pentagon Pizza Theory Work?" by rba

"The inaugural Redwood Research podcast" by Buck, ryan_greenblatt

"Canada Lost Its Measles Elimination Status Because We Don’t Have Enough Nurses Who Speak Low German" by jenn

"Deep learning as program synthesis" by Zach Furman

"Why I Transitioned: A Response" by marisa

"Claude’s new constitution" by Zac Hatfield-Dodds

[Linkpost] "“The first two weeks are the hardest”: my first digital declutter" by mingyuan

"What Washington Says About AGI" by zroe1

"Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks" by James_Miller

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

"Backyard cat fight shows Schelling points preexist language" by jchan

"How AI Is Learning to Think in Secret" by Nicholas Andresen

"On Owning Galaxies" by Simon Lermen

"AI Futures Timelines and Takeoff Model: Dec 2025 Update" by elifland, bhalstead, Alex Kastner, Daniel Kokotajlo

"In My Misanthropy Era" by jenn

"2025 in AI predictions" by jessicata

"Good if make prior after data instead of before" by dynomight

"Measuring no CoT math time horizon (single forward pass)" by ryan_greenblatt

"Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance" by ryan_greenblatt

"Turning 20 in the probable pre-apocalypse" by Parv Mahajan

"Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment" by Cam, Puria Radmard, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

"Dancing in a World of Horseradish" by lsusr

"Contradict my take on OpenPhil’s past AI beliefs" by Eliezer Yudkowsky

"Opinionated Takes on Meetups Organizing" by jenn

"How to game the METR plot" by shash42

"Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers" by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans

"Scientific breakthroughs of the year" by technicalities

"A high integrity/epistemics political machine?" by Raemon

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw

“The funding conversation we left unfinished” by jenn

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck

“Little Echo” by Zvi

“A Pragmatic Vision for Interpretability” by Neel Nanda

“AI in 2025: gestalt” by technicalities

“Eliezer’s Unteachable Methods of Sanity” by Eliezer Yudkowsky

“An Ambitious Vision for Interpretability” by leogao

“6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes

“Three things that surprised me about technical grantmaking at Coefficient Giving (fka Open Phil)” by null

“MIRI’s 2025 Fundraiser” by alexvermeer

“The Best Lack All Conviction: A Confusing Day in the AI Village” by null

“The Boring Part of Bell Labs” by Elizabeth

[Linkpost] “The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun” by null

“Writing advice: Why people like your quick bullshit takes better than your high-effort posts” by null

“Claude 4.5 Opus’ Soul Document” by null

“Unless its governance changes, Anthropic is untrustworthy” by null

“Alignment remains a hard, unsolved problem” by null

“Video games are philosophy’s playground” by Rachel Shu

“Stop Applying And Get To Work” by plex

“Gemini 3 is Evaluation-Paranoid and Contaminated” by null

“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

“Anthropic is (probably) not meeting its RSP security commitments” by habryka

“Varieties Of Doom” by jdp

“How Colds Spread” by RobertM

“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett

“Where is the Capital? An Overview” by johnswentworth

“Problems I’ve Tried to Legibilize” by Wei Dai

“Do not hand off what you cannot pick up” by habryka

“7 Vicious Vices of Rationalists” by Ben Pace

“Tell people as early as possible it’s not going to work out” by habryka

“Everyone has a plan until they get lied to the face” by Screwtape

“Please, Don’t Roll Your Own Metaethics” by Wei Dai

“Paranoia rules everything around me” by habryka

“Human Values ≠ Goodness” by johnswentworth

“Condensation” by abramdemski

“Mourning a life without AI” by Nikola Jurkovic

“Unexpected Things that are People” by Ben Goldhaber

“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

“Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec

“The Unreasonable Effectiveness of Fiction” by Raelifin

“Legible vs. Illegible AI Safety Problems” by Wei Dai

“Lack of Social Grace is a Lack of Skill” by Screwtape

[Linkpost] “I ate bear fat with honey and salt flakes, to prove a point” by aggliu

“What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

[Linkpost] “You’re always stressed, your mind is always busy, you never have enough time” by mingyuan

“LLM-generated text is not testimony” by TsviBT

“Why I Transitioned: A Case Study” by Fiora Sunshine

“The Memetics of AI Successionism” by Jan_Kulveit

“How Well Does RL Scale?” by Toby_Ord

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

“Cancer has a surprising amount of detail” by Abhishaike Mahajan

“AIs should also refuse to work on capabilities research” by Davidmanheim

“On Fleshling Safety: A Debate by Klurl and Trapaucius.” by Eliezer Yudkowsky

“EU explained in 10 minutes” by Martin Sustrik

“Cheap Labour Everywhere” by Morpheus

[Linkpost] “Consider donating to AI safety champion Scott Wiener” by Eric Neyman

“Which side of the AI safety community are you in?” by Max Tegmark

“Doomers were right” by Algon

“Do One New Thing A Day To Solve Your Problems” by Algon

“Humanity Learned Almost Nothing From COVID-19” by niplav

“Consider donating to Alex Bores, author of the RAISE Act” by Eric Neyman

“Meditation is dangerous” by Algon

“That Mad Olympiad” by Tomás B.

“The ‘Length’ of ‘Horizons’” by Adam Scholl

“Don’t Mock Yourself” by Algon

“If Anyone Builds It Everyone Dies, a semi-outsider review” by dvd

“The Most Common Bad Argument In These Parts” by J Bostock

“Towards a Typology of Strange LLM Chains-of-Thought” by 1a3orn

“I take antidepressants. You’re welcome” by Elizabeth

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

“Hospitalization: A Review” by Logan Riggs

“What, if not agency?” by abramdemski

“The Origami Men” by Tomás B.

“A non-review of ‘If Anyone Builds It, Everyone Dies’” by boazbarak

“Notes on fatalities from AI takeover” by ryan_greenblatt

“Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most ‘classic humans’ in a few decades.” by Raemon

“Omelas Is Perfectly Misread” by Tobias H

“Ethical Design Patterns” by AnnaSalamon

“You’re probably overestimating how well you understand Dunning-Kruger” by abstractapplic

“Reasons to sell frontier lab equity to donate now rather than later” by Daniel_Eth, Ethan Perez

“CFAR update, and New CFAR workshops” by AnnaSalamon

“Why you should eat meat - even if you hate factory farming” by KatWoods

[Linkpost] “Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures” by Charbel-Raphaël

“This is a review of the reviews” by Recurrented

“The title is reasonable” by Raemon

“The Problem with Defining an ‘AGI Ban’ by Outcome (a lawyer’s take).” by Katalina Hernandez

“Contra Collier on IABIED” by Max Harms

“You can’t eval GPT5 anymore” by Lukas Petersson

“Teaching My Toddler To Read” by maia

“Safety researchers should take a public stance” by Ishual, Mateusz Bagiński

“The Company Man” by Tomás B.

“Christian homeschoolers in the year 3000” by Buck

“I enjoyed most of IABED” by Buck

“‘If Anyone Builds It, Everyone Dies’ release day!” by alexvermeer

“Obligated to Respond” by Duncan Sabien (Inactive)

“Chesterton’s Missing Fence” by jasoncrawford

“The Eldritch in the 21st century” by PranavG, Gabriel Alfour

“The Rise of Parasitic AI” by Adele Lopez

“High-level actions don’t screen off intent” by AnnaSalamon

[Linkpost] “MAGA populists call for holy war against Big Tech” by Remmelt

“Your LLM-assisted scientific breakthrough probably isn’t real” by eggsyntax

“Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro” by ryan_greenblatt

“⿻ Plurality & 6pack.care” by Audrey Tang

[Linkpost] “The Cats are On To Something” by Hastings

[Linkpost] “Open Global Investment as a Governance Model for AGI” by Nick Bostrom

“Will Any Old Crap Cause Emergent Misalignment?” by J Bostock

“AI Induced Psychosis: A shallow investigation” by Tim Hua

“Before LLM Psychosis, There Was Yes-Man Psychosis” by johnswentworth

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

“Banning Said Achmiz (and broader thoughts on moderation)” by habryka

“Underdog bias rules everything around me” by Richard_Ngo

“Epistemic advantages of working as a moderate” by Buck

“Four ways Econ makes people dumber re: future AI” by Steven Byrnes

“Should you make stone tools?” by Alex_Altair

“My AGI timeline updates from GPT-5 (and 2025 so far)” by ryan_greenblatt

“Hyperbolic model fits METR capabilities estimate worse than exponential model” by gjm

“My Interview With Cade Metz on His Reporting About Lighthaven” by Zack_M_Davis

“Church Planting: When Venture Capital Finds Jesus” by Elizabeth

“Somebody invented a better bookmark” by Alex_Altair

“How Does A Blind Model See The Earth?” by henry

“Re: Recent Anthropic Safety Research” by Eliezer Yudkowsky

“How anticipatory cover-ups go wrong” by Kaj_Sotala

“SB-1047 Documentary: The Post-Mortem” by Michaël Trazzi

“METR’s Evaluation of GPT-5” by GradientDissenter

“Emotions Make Sense” by DaystarEld

“The Problem” by Rob Bensinger, tanagrabeast, yams, So8res, Eliezer Yudkowsky, Gretta Duleba

“Many prediction markets would be better off as batched auctions” by William Howard

“Whence the Inkhaven Residency?” by Ben Pace

“I am worried about near-term non-LLM AI developments” by testingthewaters

“Optimizing The Final Output Can Obfuscate CoT (Research Note)” by lukemarks, jacob_drori, cloud, TurnTrout

“About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong” by bohaska

“Maya’s Escape” by Bridgett Kay

“Do confident short timelines make sense?” by TsviBT, abramdemski

“HPMOR: The (Probably) Untold Lore” by Gretta Duleba, Eliezer Yudkowsky

“On ‘ChatGPT Psychosis’ and LLM Sycophancy” by jdp

“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

“Love stays loved (formerly ‘Skin’)” by Swimmer963 (Miranda Dixon-Luinenburg)

“Make More Grayspaces” by Duncan Sabien (Inactive)

“Shallow Water is Dangerous Too” by jefftk

“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah

“the jackpot age” by thiccythot

“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery

“An Opinionated Guide to Using Anki Correctly” by Luise

“Lessons from the Iraq War about AI policy” by Buck

“So You Think You’ve Awoken ChatGPT” by JustisMills

“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth

“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck

“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger

“A deep critique of AI 2027’s bad timeline models” by titotal

“‘Buckle up bucko, this ain’t over till it’s over.’” by Raemon

“Shutdown Resistance in Reasoning Models” by benwr, JeremySchlatter, Jeffrey Ladish

“Authors Have a Responsibility to Communicate Clearly” by TurnTrout

“The Industrial Explosion” by rosehadshar, Tom Davidson

“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks

“The best simple argument for Pausing AI?” by Gary Marcus

“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes

“Proposal for making credible commitments to AIs.” by Cleo Nardo

“X explains Z% of the variance in Y” by Leon Lang

“A case for courage, when speaking of AI danger” by So8res

“My pitch for the AI Village” by Daniel Kokotajlo

“Foom & Doom 1: ‘Brain in a box in a basement’” by Steven Byrnes

“Futarchy’s fundamental flaw” by dynomight

“Do Not Tile the Lightcone with Your Confused Ontology” by Jan_Kulveit

“Endometriosis is an incredibly interesting disease” by Abhishaike Mahajan

“Estrogen: A trip report” by cube_flipper

“New Endorsements for ‘If Anyone Builds It, Everyone Dies’” by Malo

[Linkpost] “the void” by nostalgebraist

“Mech interp is not pre-paradigmatic” by Lee Sharkey

“Distillation Robustifies Unlearning” by Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout

“Intelligence Is Not Magic, But Your Threshold For ‘Magic’ Is Pretty Low” by Expertium

“A Straightforward Explanation of the Good Regulator Theorem” by Alfred Harwood

“Beware General Claims about ‘Generalizable Reasoning Capabilities’ (of Modern AI Systems)” by LawrenceC

“Season Recap of the Village: Agents raise $2,000” by Shoshannah Tekofsky

“The Best Reference Works for Every Subject” by Parker Conley

“‘Flaky breakthroughs’ pervade coaching — and no one tracks them” by Chipmonk

“The Value Proposition of Romantic Relationships” by johnswentworth

“It’s hard to make scheming evals look realistic” by Igor Ivanov, dan_moken

[Linkpost] “Social Anxiety Isn’t About Being Liked” by Chipmonk

“Truth or Dare” by Duncan Sabien (Inactive)

“Meditations on Doge” by Martin Sustrik

[Linkpost] “If you’re not sure how to sort a list or grid—seriate it!” by gwern

“What We Learned from Briefing 70+ Lawmakers on the Threat from AI” by leticiagarcia

“Winning the power to lose” by KatjaGrace

[Linkpost] “Gemini Diffusion: watch this space” by Yair Halberstadt

“AI Doomerism in 1879” by David Gross

“Consider not donating under $100 to political candidates” by DanielFilan

“It’s Okay to Feel Bad for a Bit” by moridinamael

“Explaining British Naval Dominance During the Age of Sail” by Arjun Panickssery

“Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies” by So8res

“Too Soon” by Gordon Seidoh Worley

“PSA: The LessWrong Feedback Service” by JustisMills

“Orienting Toward Wizard Power” by johnswentworth

“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda

“Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall” by Vladimir_Nesov

“Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis” by jeanne_, eeeee

[Linkpost] “Jaan Tallinn’s 2024 Philanthropy Overview” by jaan

“Impact, agency, and taste” by benkuhn

[Linkpost] “To Understand History, Keep Former Population Distributions In Mind” by Arjun Panickssery

“AI-enabled coups: a small group could use AI to seize power” by Tom Davidson, Lukas Finnveden, rosehadshar

“Accountability Sinks” by Martin Sustrik

“Training AGI in Secret would be Unsafe and Unethical” by Daniel Kokotajlo

“Why Should I Assume CCP AGI is Worse Than USG AGI?” by Tomás B.

“Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” by Kaj_Sotala

“Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study” by Adam Karvonen

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

[Linkpost] “Playing in the Creek” by Hastings

“Thoughts on AI 2027” by Max Harms

“Short Timelines don’t Devalue Long Horizon Research” by Vladimir_Nesov

“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

“METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman

“Why Have Sentence Lengths Decreased?” by Arjun Panickssery

“AI 2027: What Superintelligence Looks Like” by Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, romeo

“OpenAI #12: Battle of the Board Redux” by Zvi

“The Pando Problem: Rethinking AI Individuality” by Jan_Kulveit

“You will crash your car in front of my house within the next week” by Richard Korzekwa

“My ‘infohazards small working group’ Signal Chat may have encountered minor leaks” by Linch

“Leverage, Exit Costs, and Anger: Re-examining Why We Explode at Home, Not at Work” by at_the_zoo

“PauseAI and E/Acc Should Switch Sides” by WillPetillo

“VDT: a solution to decision theory” by L Rudolf L

“LessWrong has been acquired by EA” by habryka

“We’re not prepared for an AI market crash” by Remmelt

“Conceptual Rounding Errors” by Jan_Kulveit

“Tracing the Thoughts of a Large Language Model” by Adam Jermyn

“Recent AI model progress feels mostly like bullshit” by lc

“AI for AI safety” by Joe Carlsmith

“Policy for LLM Writing on LessWrong” by jimrandomh

“Will Jesus Christ return in an election year?” by Eric Neyman

“Good Research Takes are Not Sufficient for Good Strategic Takes” by Neel Nanda

“Intention to Treat” by Alicorn

“On the Rationality of Deterring ASI” by Dan H

[Linkpost] “METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman

“I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?” by shrimpy

“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn

“Levels of Friction” by Zvi

“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

“The Most Forbidden Technique” by Zvi

“Trojan Sky” by Richard_Ngo

“OpenAI:” by Daniel Kokotajlo

“How Much Are LLMs Actually Boosting Real-World Programmer Productivity?” by Thane Ruthenis

“So how well is Claude playing Pokémon?” by Julian Bradshaw

“Methods for strong human germline engineering” by TsviBT

“Have LLMs Generated Novel Insights?” by abramdemski, Cole Wyeth

“A Bear Case: My Predictions Regarding AI Progress” by Thane Ruthenis

“Statistical Challenges with Making Super IQ babies” by Jan Christian Refsgaard

“Self-fulfilling misalignment data might be poisoning our AI models” by TurnTrout

“Judgements: Merging Prediction & Evidence” by abramdemski

“The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better” by Thane Ruthenis

“Power Lies Trembling: a three-book review” by Richard_Ngo

“Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” by Jan Betley, Owain_Evans

“The Paris AI Anti-Safety Summit” by Zvi

“Eliezer’s Lost Alignment Articles / The Arbital Sequence” by Ruby

“Arbital has been imported to LessWrong” by RobertM, jimrandomh, Ben Pace, Ruby

“How to Make Superbabies” by GeneSmith, kman

“A computational no-coincidence principle” by Eric Neyman

“A History of the Future, 2025-2040” by L Rudolf L

“It’s been ten years. I propose HPMOR Anniversary Parties.” by Screwtape

“Some articles in ‘International Security’ that I enjoyed” by Buck

“The Failed Strategy of Artificial Intelligence Doomers” by Ben Pace

“Murder plots are infohazards” by Chris Monteiro

“Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion?” by garrison

“The ‘Think It Faster’ Exercise” by Raemon

“So You Want To Make Marginal Progress...” by johnswentworth

“What is malevolence? On the nature, measurement, and distribution of dark traits” by David Althaus

“How AI Takeover Might Happen in 2 Years” by joshc

“Gradual Disempowerment, Shell Games and Flinches” by Jan_Kulveit

“Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development” by Jan_Kulveit, Raymond D, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet), David Duvenaud

“Planning for Extreme AI Risks” by joshc

“Catastrophe through Chaos” by Marius Hobbhahn

“Will alignment-faking Claude accept a deal to reveal its misalignment?” by ryan_greenblatt

“‘Sharp Left Turn’ discourse: An opinionated review” by Steven Byrnes

“Ten people on the inside” by Buck

“Anomalous Tokens in DeepSeek-V3 and r1” by henry

“Tell me about yourself: LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

“Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” by johnswentworth, David Lorell

“A Three-Layer Model of LLM Psychology” by Jan_Kulveit

“Training on Documents About Reward Hacking Induces Reward Hacking” by evhub

“AI companies are unlikely to make high-assurance safety cases if timelines are short” by ryan_greenblatt

“Mechanisms too simple for humans to design” by Malmesbury

“The Gentle Romance” by Richard_Ngo

“Quotes from the Stargate press conference” by Nikola Jurkovic

“The Case Against AI Control Research” by johnswentworth

“Don’t ignore bad vibes you get from people” by Kaj_Sotala

“[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty” by tandem

“Building AI Research Fleets” by bgold, Jesse Hoogland

“What Is The Alignment Problem?” by johnswentworth

“Applying traditional economic thinking to AGI: a trilemma” by Steven Byrnes

“Passages I Highlighted in The Letters of J.R.R.Tolkien” by Ivan Vendrov

“Parkinson’s Law and the Ideology of Statistics” by Benquo

“Capital Ownership Will Not Prevent Human Disempowerment” by beren

“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq

“What o3 Becomes by 2028” by Vladimir_Nesov

“What Indicators Should We Watch to Disambiguate AGI Timelines?” by snewman

“How will we update about scheming?” by ryan_greenblatt

“OpenAI #10: Reflections” by Zvi

“Maximizing Communication, not Traffic” by jefftk

“What’s the short timeline plan?” by Marius Hobbhahn

“Shallow review of technical AI safety, 2024” by technicalities, Stag, Stephen McAleese, jordine, Dr. David Mathers

“By default, capital will matter more than ever after AGI” by L Rudolf L

“Review: Planecrash” by L Rudolf L

“The Field of AI Alignment: A Postmortem, and What To Do About It” by johnswentworth

“When Is Insurance Worth It?” by kqr

“Orienting to 3 year AGI timelines” by Nikola Jurkovic

“What Goes Without Saying” by sarahconstantin

“o3” by Zach Stein-Perlman

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

“AIs Will Increasingly Attempt Shenanigans” by Zvi

“Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck

“Communications in Hard Mode (My new job at MIRI)” by tanagrabeast

“Biological risk from the mirror world” by jasoncrawford

“Subskills of ‘Listening to Wisdom’” by Raemon

“Understanding Shapley Values with Venn Diagrams” by Carson L

“LessWrong audio: help us choose the new voice” by PeterH

“Understanding Shapley Values with Venn Diagrams” by agucova

“o1: A Technical Primer” by Jesse Hoogland

“Gradient Routing: Masking Gradients to Localize Computation in Neural Networks” by cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout

“Frontier Models are Capable of In-context Scheming” by Marius Hobbhahn, AlexMeinke, Bronson Schoen

“(The) Lightcone is nothing without its people: LW + Lighthaven’s first big fundraiser” by habryka

“Repeal the Jones Act of 1920” by Zvi

“China Hawks are Manufacturing an AI Arms Race” by garrison

“Information vs Assurance” by johnswentworth

“You are not too ‘irrational’ to know your preferences.” by DaystarEld

“‘The Solomonoff Prior is Malign’ is a special case of a simpler argument” by David Matolcsi

“‘It’s a 10% chance which I did 10 times, so it should be 100%’” by egor.timatkov

“OpenAI Email Archives” by habryka

“Ayn Rand’s model of ‘living money’; and an upside of burnout” by AnnaSalamon

“Neutrality” by sarahconstantin

“Making a conservative case for alignment” by Cameron Berg, Judd Rosenblatt, phgubbins, AE Studio

“OpenAI Email Archives (from Musk v. Altman)” by habryka

“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub

“The Online Sports Gambling Experiment Has Failed” by Zvi

“o1 is a bad idea” by abramdemski

“Current safety training techniques do not fully transfer to the agent setting” by Simon Lermen, Govind Pimpale

“Explore More: A Bag of Tricks to Keep Your Life on the Rails” by Shoshannah Tekofsky

“Survival without dignity” by L Rudolf L

“The Median Researcher Problem” by johnswentworth

“The Compendium, A full argument about extinction risk from AGI” by adamShimi, Gabriel Alfour, Connor Leahy, Chris Scammell, Andrea_Miotti

“What TMS is like” by Sable

“The hostile telepaths problem” by Valentine

“A bird’s eye view of ARC’s research” by Jacob_Hilton

“A Rocket–Interpretability Analogy” by plex

“I got dysentery so you don’t have to” by eukaryote

“Overcoming Bias Anthology” by Arjun Panickssery

“Arithmetic is an underrated world-modeling technology” by dynomight

“My theory of change for working in AI healthtech” by Andrew_Critch

“Why I’m not a Bayesian” by Richard_Ngo

“The AGI Entente Delusion” by Max Tegmark

“Momentum of Light in Glass” by Ben

“Overview of strong human intelligence amplification methods” by TsviBT

“Struggling like a Shadowmoth” by Raemon

“Three Subtle Examples of Data Leakage” by abstractapplic

“the case for CoT unfaithfulness is overstated” by nostalgebraist

“Cryonics is free” by Mati_Roy

“Stanislav Petrov Quarterly Performance Review” by Ricki Heicklen

“Laziness death spirals” by PatrickDFarley

“‘Slow’ takeoff is a terrible term for ‘maybe even faster takeoff, actually’” by Raemon

“ASIs will not leave just a little sunlight for Earth” by Eliezer Yudkowsky

“Skills from a year of Purposeful Rationality Practice” by Raemon

“How I started believing religion might actually matter for rationality and moral philosophy” by zhukeepa

“Did Christopher Hitchens change his mind about waterboarding?” by Isaac King

“The Great Data Integration Schlep” by sarahconstantin

“Contra papers claiming superhuman AI forecasting” by nikos, Peter Mühlbacher, Lawrence Phillips, dschwarz

“OpenAI o1” by Zach Stein-Perlman

“The Best Lay Argument is not a Simple English Yud Essay” by J Bostock

“My Number 1 Epistemology Book Recommendation: Inventing Temperature” by adamShimi

“That Alien Message - The Animation” by Writer

“Pay Risk Evaluators in Cash, Not Equity” by Adam Scholl

“Survey: How Do Elite Chinese Students Feel About the Risks of AI?” by Nick Corvino

“things that confuse me about the current AI market.” by DMMF

“Nursing doubts” by dynomight

“Principles for the AGI Race” by William_S

“The Information: OpenAI shows ‘Strawberry’ to feds, races to launch it” by Martín Soto

“What is it to solve the alignment problem?” by Joe Carlsmith

“Limitations on Formal Verification for AI Safety” by Andrew Dickson

“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck

“Liability regimes for AI” by Ege Erdil

“AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work” by Rohin Shah, Seb Farquhar, Anca Dragan

“Fields that I reference when thinking about AI takeover prevention” by Buck

“WTH is Cerebrolysin, actually?” by gsfitzgerald, delton137

“You can remove GPT2’s LayerNorm by fine-tuning for an hour” by StefanHex

“Leaving MIRI, Seeking Funding” by abramdemski

“How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage” by orthonormal

“This is already your second chance” by Malmesbury

“0. CAST: Corrigibility as Singular Target” by Max Harms

“Self-Other Overlap: A Neglected Approach to AI Alignment” by Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena

“You don’t know how bad most things are nor precisely how they’re bad.” by Solenoid_Entity

“Recommendation: reports on the search for missing hiker Bill Ewasko” by eukaryote

“The ‘strong’ feature hypothesis could be wrong” by lsgos

“‘AI achieves silver-medal standard solving International Mathematical Olympiad problems’” by gjm

“Decomposing Agency — capabilities without desires” by owencb, Raymond D

“Universal Basic Income and Poverty” by Eliezer Yudkowsky

“Optimistic Assumptions, Longterm Planning, and ‘Cope’” by Raemon

“Superbabies: Putting The Pieces Together” by sarahconstantin

“Poker is a bad game for teaching epistemics. Figgie is a better one.” by rossry

“Reliable Sources: The Story of David Gerard” by TracingWoodgrains

“When is a mind me?” by Rob Bensinger

“80,000 hours should remove OpenAI from the Job Board (and similar orgs should do similarly)” by Raemon

[Linkpost] “introduction to cancer vaccines” by bhauth

“Priors and Prejudice” by MathiasKB

“My experience using financial commitments to overcome akrasia” by William Howard

“The Incredible Fentanyl-Detecting Machine” by sarahconstantin

“AI catastrophes and rogue deployments” by Buck

“Loving a world you don’t trust” by Joe Carlsmith

“Formal verification, heuristic explanations and surprise accounting” by paulfchristiano

“LLM Generality is a Timeline Crux” by eggsyntax

“SAE feature geometry is outside the superposition hypothesis” by jake_mendel

“Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data” by Johannes Treutlein, Owain_Evans

“Boycott OpenAI” by PeterMcCluskey

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison

“I would have shit in that alley, too” by Declan Molony

“Getting 50% (SoTA) on ARC-AGI with GPT-4o” by ryan_greenblatt

“Why I don’t believe in the placebo effect” by transhumanist_atom_understander

“Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)” by Andrew_Critch

“My AI Model Delta Compared To Christiano” by johnswentworth

“My AI Model Delta Compared To Yudkowsky” by johnswentworth

“Response to Aschenbrenner’s ‘Situational Awareness’” by Rob Bensinger

“Humming is not a free $100 bill” by Elizabeth

“Announcing ILIAD — Theoretical AI Alignment Conference” by Nora_Ammann, Alexander Gietelink Oldenziel

“Non-Disparagement Canaries for OpenAI” by aysja, Adam Scholl

“MIRI 2024 Communications Strategy” by Gretta Duleba

“OpenAI: Fallout” by Zvi

[HUMAN VOICE] Update on human narration for this podcast

“Maybe Anthropic’s Long-Term Benefit Trust is powerless” by Zach Stein-Perlman

“Notifications Received in 30 Minutes of Class” by tanagrabeast

“AI companies aren’t really using external evaluators” by Zach Stein-Perlman

“EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024” by scasper

“What’s Going on With OpenAI’s Messaging?” by ozziegoen

“Language Models Model Us” by eggsyntax

Jaan Tallinn’s 2023 Philanthropy Overview

“OpenAI: Exodus” by Zvi

DeepMind’s “Frontier Safety Framework” is weak and unambitious

Do you believe in hundred dollar bills lying on the ground? Consider humming

Deep Honesty

On Not Pulling The Ladder Up Behind You

Mechanistically Eliciting Latent Behaviors in Language Models

Ironing Out the Squiggles

Introducing AI Lab Watch

Refusal in LLMs is mediated by a single direction

Funny Anecdote of Eliezer From His Sister

Thoughts on seed oil

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

Express interest in an “FHI of the West”

Transformers Represent Belief State Geometry in their Residual Stream

Paul Christiano named as US AI Safety Institute Head of AI Safety

[HUMAN VOICE] "On green" by Joe Carlsmith

[HUMAN VOICE] "Toward a Broader Conception of Adverse Selection" by Ricki Heicklen

[HUMAN VOICE] "My PhD thesis: Algorithmic Bayesian Epistemology" by Eric Neyman

[HUMAN VOICE] "How could I have thought that faster?" by mesaoptimizer

LLMs for Alignment Research: a safety priority?

[HUMAN VOICE] "Scale Was All We Needed, At First" by Gabriel Mukobi

[HUMAN VOICE] "Using axis lines for good or evil" by dynomight

[HUMAN VOICE] "Social status part 1/2: negotiations over object-level preferences" by Steven Byrnes

[HUMAN VOICE] "Acting Wholesomely" by OwenCB

The Story of “I Have Been A Good Bing”

The Best Tacit Knowledge Videos on Every Subject

[HUMAN VOICE] "Deep atheism and AI risk" by Joe Carlsmith

[HUMAN VOICE] "My Clients, The Liars" by ymeskhout

[HUMAN VOICE] "Speaking to Congressional staffers about AI risk" by Akash, hath

[HUMAN VOICE] "CFAR Takeaways: Andrew Critch" by Raemon

Many arguments for AI x-risk are wrong

Tips for Empirical Alignment Research

Timaeus’s First Four Months

Contra Ngo et al. “Every ‘Every Bay Area House Party’ Bay Area House Party”

[HUMAN VOICE] "Updatelessness doesn't solve most problems" by Martín Soto

[HUMAN VOICE] "And All the Shoggoths Merely Players" by Zack_M_Davis

Every “Every Bay Area House Party” Bay Area House Party

2023 Survey Results

Raising children on the eve of AI

“No-one in my org puts money in their pension”

Masterpiece

CFAR Takeaways: Andrew Critch

[HUMAN VOICE] "Believing In" by Anna Salamon

[HUMAN VOICE] "Attitudes about Applied Rationality" by Camille Berger

Scale Was All We Needed, At First

Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy

[HUMAN VOICE] "A Shutdown Problem Proposal" by johnswentworth, David Lorell

Brute Force Manufactured Consensus is Hiding the Crime of the Century

[HUMAN VOICE] "Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI" by Jeremy Gillen, peterbarnett

Leading The Parade

[HUMAN VOICE] "The case for ensuring that powerful AIs are controlled" by ryan_greenblatt, Buck

Processor clock speeds are not how fast AIs think

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Making every researcher seek grants is a broken model

The case for training frontier AIs on Sumerian-only corpus

This might be the last AI Safety Camp

[HUMAN VOICE] "There is way too much serendipity" by Malmesbury

[HUMAN VOICE] "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" by evhub et al

[HUMAN VOICE] "How useful is mechanistic interpretability?" by ryan_greenblatt, Neel Nanda, Buck, habryka

The impossible problem of due process

[HUMAN VOICE] "Gentleness and the artificial Other" by Joe Carlsmith

Introducing Alignment Stress-Testing at Anthropic

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

[HUMAN VOICE] "Meaning & Agency" by Abram Demski

What’s up with LLMs representing XORs of arbitrary features?

Gentleness and the artificial Other

MIRI 2024 Mission and Strategy Update

The Plan - 2023 Version

Apologizing is a Core Rationalist Skill

[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata

The Dark Arts

Critical review of Christiano’s disagreements with Yudkowsky

Most People Don’t Realize We Have No Idea How Our AIs Work

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Succession

Nonlinear’s Evidence: Debunking False and Misleading Claims

Effective Aspersions: How the Nonlinear Investigation Went Wrong

Constellations are Younger than Continents

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Is being sexy for your homies?

[HUMAN VOICE] "Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible" by Gene Smith and Kman

[HUMAN VOICE] "Moral Reality Check (a short story)" by jessicata

AI Control: Improving Safety Despite Intentional Subversion

2023 Unofficial LessWrong Census/Survey

The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity.

[HUMAN VOICE] "What are the results of more parental supervision and less outdoor play?" by Julia Wise

Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible

re: Yudkowsky on biological materials

Speaking to Congressional staffers about AI risk

[HUMAN VOICE] "Shallow review of live agendas in alignment & safety" by technicalities & Stag

Thoughts on “AI is easy to control” by Pope & Belrose

The 101 Space You Will Always Have With You

[HUMAN VOICE] "Social Dark Matter" by Duncan Sabien

Shallow review of live agendas in alignment & safety

Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense

[HUMAN VOICE] "The 6D effect: When companies take risks, one email can be very powerful." by scasper

OpenAI: The Battle of the Board

OpenAI: Facts from a Weekend

Sam Altman fired from OpenAI

Social Dark Matter

[HUMAN VOICE] "Thinking By The Clock" by Screwtape

"You can just spontaneously call people you haven't met in years" by lc

[HUMAN VOICE] "AI Timelines" by habryka, Daniel Kokotajlo, Ajeya Cotra, Ege Erdil

"EA orgs' legal structure inhibits risk taking and information sharing on the margin" by Elizabeth

"Integrity in AI Governance and Advocacy" by habryka, Olivia Jimenez

Loudly Give Up, Don’t Quietly Fade

[HUMAN VOICE] "Deception Chess: Game #1" by Zane et al.

[HUMAN VOICE] "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

"The 6D effect: When companies take risks, one email can be very powerful." by scasper

"The other side of the tidal wave" by Katja Grace

"Does davidad's uploading moonshot work?" by jacobjabob et al.

"Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk" by 1a3orn

"My thoughts on the social response to AI risk" by Matthew Barnett

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

"Thoughts on the AI Safety Summit company policy requests and responses" by So8res

"President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence" by Tristan Williams

[HUMAN VOICE] "Book Review: Going Infinite" by Zvi

"We're Not Ready: thoughts on "pausing" and responsible scaling policies" by Holden Karnofsky

"At 87, Pearl is still able to change his mind" by rotatingpaguro

"Architects of Our Own Demise: We Should Stop Developing AI" by Roko

"AI as a science, and three obstacles to alignment strategies" by Nate Soares

"Thoughts on responsible scaling policies and regulation" by Paul Christiano

"Announcing Timaeus" by Jesse Hoogland et al.

[HUMAN VOICE] "Alignment Implications of LLM Successes: a Debate in One Act" by Zack M Davis

"Holly Elmore and Rob Miles dialogue on AI Safety Advocacy" by jacobjacob, Robert Miles & Holly_Elmore

"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish.

"Labs should be explicit about why they are building AGI" by Peter Barnett

[HUMAN VOICE] "Sum-threshold attacks" by TsviBT

"Will no one rid me of this turbulent pest?" by Metacelsus

[HUMAN VOICE] "Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth

"RSPs are pauses done right" by evhub

"Comparing Anthropic's Dictionary Learning to Ours" by Robert_AIZI

"Announcing MIRI’s new CEO and leadership team" by Gretta Duleba

"Cohabitive Games so Far" by mako yass

"Announcing Dialogues" by Ben Pace

"Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn" by Zvi

"Evaluating the historical value misspecification argument" by Matthew Barnett

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

"Thomas Kwa's MIRI research experience" by Thomas Kwa and others

"'Diamondoid bacteria' nanobots: deadly threat or dead-end? A nanotech investigation" by titotal

"The Lighthaven Campus is open for bookings" by Habryka

"How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions" by Jan Brauner et al.

"EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem" by Elizabeth

"The King and the Golem" by Richard Ngo

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al

"Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth

"There should be more AI safety orgs" by Marius Hobbhahn

"The Talk: a brief explanation of sexual dimorphism" by Malmesbury

"A Golden Age of Building? Excerpts and lessons from Empire State, Pentagon, Skunk Works and SpaceX" by jacobjacob

"AI presidents discuss AI alignment agendas" by TurnTrout & Garrett Baker

"UDT shows that decision theory is more puzzling than ever" by Wei Dai

"Sum-threshold attacks" by TsviBT

"Report on Frontier Model Training" by Yafah Edelman

"A list of core AI safety problems and how I hope to solve them" by Davidad

"One Minute Every Moment" by abramdemski

"Sharing Information About Nonlinear" by Ben Pace

"Defunding My Mistake" by ymeskhout

"What I would do if I wasn’t at ARC Evals" by LawrenceC

"Meta Questions about Metaphilosophy" by Wei Dai

"The U.S. is becoming less stable" by lc

"OpenAI API base models are not sycophantic, at any size" by Nostalgebraist

"Dear Self; we need to talk about ambition" by Elizabeth

"Assume Bad Faith" by Zack_M_Davis

"Book Launch: "The Carving of Reality," Best of LessWrong vol. III" by Raemon

"Large Language Models will be Great for Censorship" by Ethan Edwards

"6 non-obvious mental health issues specific to AI safety" by Igor Ivanov

"Ten Thousand Years of Solitude" by agp

"Against Almost Every Theory of Impact of Interpretability" by Charbel-Raphaël

"Feedbackloop-first Rationality" by Raemon

"Inflection.ai is a major AGI lab" by Nikola

"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

"When can we trust model evaluations?" bu evhub

"ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks" by Beth Barnes

"The "public debate" about AI is confusing for the general public and for policymakers because it is a three-sided debate" by Adam David Long

"My current LK99 questions" by Eliezer Yudkowsky

"Thoughts on sharing information about language model capabilities" by paulfchristiano

"Cultivating a state of mind where new ideas are born" by Henrik Karlsson

"Self-driving car bets" by paulfchristiano

"Yes, It's Subjective, But Why All The Crabs?" by johnswentworth

"Grant applications and grand narratives" by Elizabeth

"Brain Efficiency Cannell Prize Contest Award Ceremony" by Alexander Gietelink Oldenziel

"Rationality !== Winning" by Raemon

"Cryonics and Regret" by MvB

"Unifying Bargaining Notions (2/2)" by Diffractor

"The ants and the grasshopper" by Richard Ngo

"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

"An artificially structured argument for expecting AGI ruin" by Rob Bensinger

"How much do you believe your results?" by Eric Neyman

"Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)" by Chris Scammell & DivineMango

"On AutoGPT" by Zvi

"GPTs are Predictors, not Imitators" by Eliezer Yudkowsky

"A stylized dialogue on John Wentworth's claims about markets and optimization" by Nate Soares

"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

"Deep Deceptiveness" by Nate Soares

"The Onion Test for Personal and Institutional Honesty" by Chana Messinger & Andrew Critch

"There’s no such thing as a tree (phylogenetically)" by Eukaryote

"Losing the root for the tree" by Adam Zerner

"It Looks Like You’re Trying To Take Over The World" by Gwern

"Why I think strong general AI is coming soon" by Porby

"What failure looks like" by Paul Christiano

"Lies, Damn Lies, and Fabricated Options" by Duncan Sabien

""Carefully Bootstrapped Alignment" is organizationally hard" by Raemon

"More information about the dangerous capability evaluations we did with GPT-4 and Claude." by Beth Barnes

"Enemies vs Malefactors" by Nate Soares

"The Parable of the King and the Random Process" by moridinamael

"The Waluigi Effect (mega-post)" by Cleo Nardo

"Acausal normalcy" by Andrew Critch

"Please don't throw your mind away" by TsviBT

"Cyborgism" by Nicholas Kees & Janus

"Childhoods of exceptional people" by Henrik Karlsson

"What I mean by "alignment is in large part about making cognition aimable at all"" by Nate Soares

"On not getting contaminated by the wrong obesity ideas" by Natália Coelho Mendonça

"SolidGoldMagikarp (plus, prompt generation)"

"Focus on the places where you feel shocked everyone's dropping the ball" by Nate Soares

"Basics of Rationalist Discourse" by Duncan Sabien

"Sapir-Whorf for Rationalists" by Duncan Sabien

"My Model Of EA Burnout" by Logan Strohl

"The Social Recession: By the Numbers" by Anton Stjepan Cebalo

"Recursive Middle Manager Hell" by Raemon

"The Feeling of Idea Scarcity" by John Wentworth

"Models Don't 'Get Reward'" by Sam Ringer

"How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin

"The next decades might be wild" by Marius Hobbhahn

"Lessons learned from talking to >100 academics about AI safety" by Marius Hobbhahn

"How my team at Lightcone sometimes gets stuff done" by jacobjacob

"Decision theory does not imply that we get to have nice things" by So8res

"What 2026 looks like" by Daniel Kokotajlo

Counterarguments to the basic AI x-risk case

"Introduction to abstract entropy" by Alex Altair

"Consider your appetite for disagreements" by Adam Zerner

"My resentful story of becoming a medical miracle" by Elizabeth

"The Redaction Machine" by Ben

"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra

"The shard theory of human values" by Quintin Pope & TurnTrout

"Two-year update on my personal AI timelines" by Ajeya Cotra

"You Are Not Measuring What You Think You Are Measuring" by John Wentworth

"Do bamboos set themselves on fire?" by Malmesbury

"Survey advice" by Katja Grace

"Toni Kurz and the Insanity of Climbing Mountains" by Gene Smith

"Deliberate Grieving" by Raemon

"Toolbox-thinking and Law-thinking" by Eliezer Yudkowsky

"Local Validity as a Key to Sanity and Civilization" by Eliezer Yudkowsky

"Humans are not automatically strategic" by Anna Salamon

"Language models seem to be much better than humans at next-token prediction" by Buck, Fabien and LawrenceC

"Moral strategies at different capability levels" by Richard Ngo

"Worlds Where Iterative Design Fails" by John Wentworth

"(My understanding of) What Everyone in Technical Alignment is Doing and Why" by Thomas Larsen & Eli Lifland

"Unifying Bargaining Notions (1/2)" by Diffractor

"Simulators" by Janus

"Humans provide an untapped wealth of evidence about alignment" by TurnTrout & Quintin Pope

"Changing the world through slack & hobbies" by Steven Byrnes

"«Boundaries», Part 1: a key missing concept from utility theory" by Andrew Critch

"ITT-passing and civility are good; "charity" is bad; steelmanning is niche" by Rob Bensinger

"What should you change in response to an "emergency"? And AI risk" by Anna Salamon

"On how various plans miss the hard bits of the alignment challenge" by Nate Soares

"Humans are very reliable agents" by Alyssa Vance

"Looking back on my alignment PhD" by TurnTrout

"It’s Probably Not Lithium" by Natália Coelho Mendonça

"What Are You Tracking In Your Head?" by John Wentworth

"Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment" by elspood

"Where I agree and disagree with Eliezer" by Paul Christiano

"Six Dimensions of Operational Adequacy in AGI Projects" by Eliezer Yudkowsky

"Moses and the Class Struggle" by lsusr

"Benign Boundary Violations" by Duncan Sabien

"AGI Ruin: A List of Lethalities" by Eliezer Yudkowsky