All Episodes

LessWrong (Curated & Popular) — 857 episodes

#
Title
1

"Automated Alignment is Harder Than You Think" by Aleksandr Bowkis, Marie_DB, Jacob Pfau, Geoffrey Irving

2

"MATS 9 Retrospective & Advice" by beyarkay

3

"The primary sources of near-term cybersecurity risk" by lc

4

"The Owned Ones" by Eliezer Yudkowsky

5

"The Iliad Intensive Course Materials" by Leon Lang, David Udell, Alexander Gietelink Oldenziel

6

"The Darwinian Honeymoon - Why I am not as impressed by human progress as I used to be" by Elias Schmied

7

"What I did in the hedonium shockwave, by Emma, age six and a half" by ozymandias

8

"Bad Problems Don’t Stop Being Bad Because Somebody’s Wrong About Fault Analysis" by Linch

9

"x-risk-themed" by kave

10

"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

11

[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey

12

"It’s nice of you to worry about me, but I really do have a life" by Viliam

13

"Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI" by Eliezer Yudkowsky

14

"Dairy cows make their misery expensive (but their calves can’t)" by Elizabeth

15

"Takes from two months as an aspiring LLM naturalist" by AnnaSalamon

16

"Intelligence Dissolves Privacy" by Vaniver

17

"How Go Players Disempower Themselves to AI" by Ashe Vazquez Nuñez

18

"On today’s panel with Bernie Sanders" by David Scott Krueger

19

"Not a Paper: “Frontier Lab CEOs are Capable of In-Context Scheming”" by LawrenceC

20

"llm assistant personas seem increasingly incoherent (some subjective observations)" by nostalgebraist

21

"LessWrong Shows You Social Signals Before the Comment" by TurnTrout

22

"Update on the Alex Bores campaign" by Eric Neyman

23

"Community misconduct disputes are not about facts" by mingyuan

24

"The paper that killed deep learning theory" by LawrenceC

25

"Forecasting is Way Overrated, and We Should Stop Funding It" by mabramov

26

"Your Supplies Probably Won’t Be Stolen in a Disaster" by jefftk

27

"10 posts I don’t have time to write" by habryka

28

"$50 million a year for a 10% chance to ban ASI" by Andrea_Miotti, Alex Amadori, Gabriel Alfour

29

"Evil is bad, actually (Vassar and Olivia Schaefer callout post)" by plex

30

"10 non-boring ways I’ve used AI in the last month" by habryka

31

"Feel like a room has bad vibes? The lighting is probably too “spiky” or too blue" by habryka

32

"Quality Matters Most When Stakes are Highest" by LawrenceC

33

"Reevaluating AGI Ruin in 2026" by lc

34

"Having OCD is like living in North Korea (Here’s how I escaped)" by Declan Molony

35

"There are only four skills: design, technical, management and physical" by habryka

36

"Meaningful Questions Have Return Types" by Drake Morrison

37

"Carpathia Day" by Drake Morrison

38

"Let goodness conquer all that it can defend" by habryka

39

"Do not conquer what you cannot defend" by habryka

40

"Nectome: All That I Know" by Raelifin

41

"Current AIs seem pretty misaligned to me" by ryan_greenblatt

42

"Annoyingly Principled People, and what befalls them" by Raemon

43

"Morale" by J Bostock

44

"Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes" by Alex Mallen, ryan_greenblatt

45

"The policy surrounding Mythos marks an irreversible power shift" by sil

46

"Only Law Can Prevent Extinction" by Eliezer Yudkowsky

47

"Dario probably doesn’t believe in superintelligence" by RobertM

48

"Daycare illnesses" by Nina Panickssery

49

"If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines" by ryan_greenblatt

50

"Do not be surprised if LessWrong gets hacked" by RobertM

51

"My picture of the present in AI" by ryan_greenblatt

52

"The effects of caffeine consumption do not decay with a ~5 hour half-life" by kman

53

"AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines" by ryan_greenblatt

54

"dark ilan" by ozymandias

55

"Dispatch from Anthropic v. Department of War Preliminary Injunction Motion Hearing" by Zack_M_Davis

56

"The Corner-Stone" by Benquo

57

"The Practical Guide to Superbabies" by GeneSmith

58

"Anthropic’s Pause is the Most Expensive Alarm in Corporate History" by Ruby

59

"“You Have Not Been a Good User” (LessWrong’s second album)" by habryka

60

"Lesswrong Liberated" by Ronny Fernandez

61

"Product Alignment is not Superintelligence Alignment (and we need the latter to survive)" by plex

62

"Gyre" by vgel

63

"Some things I noticed while LARPing as a grantmaker" by Zach Stein-Perlman

64

"My hobby: running deranged surveys" by leogao

65

"Socrates is Mortal" by Benquo

66

"The Terrarium" by Caleb Biddulph

67

"My Most Costly Delusion" by Ihor Kendiukhov

68

"The Case for Low-Competence ASI Failure Scenarios" by Ihor Kendiukhov

69

"Is fever a symptom of glycine deficiency?" by Benquo

70

"You can’t imitation-learn how to continual-learn" by Steven Byrnes

71

"Nullius in Verba" by Aurelia

72

"Broad Timelines" by Toby_Ord

73

"No, we haven’t uploaded a fly yet" by Ariel Zeleznikow-Johnston

74

"Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis

75

"PSA: Predictions markets often have very low liquidity; be careful citing them." by Eye You

76

"“The AI Doc” is coming out March 26" by Rob Bensinger, Beckeck

77

"Customer Satisfaction Opportunities" by Tomás B.

78

"Requiem for a Transhuman Timeline" by Ihor Kendiukhov

79

"Personality Self-Replicators" by eggsyntax

80

"My Willing Complicity In “Human Rights Abuse”" by AlphaAndOmega

81

"Economic efficiency often undermines sociopolitical autonomy" by Richard_Ngo

82

"Don’t Let LLMs Write For You" by JustisMills

83

"Thoughts on the Pause AI protest" by philh

84

"Prologue to Terrified Comments on Claude’s Constitution" by Zack_M_Davis

85

"Less Dead" by Aurelia

86

"Gemma Needs Help" by Anna Soligo

87

"On Independence Axiom" by Ihor Kendiukhov

88

"Solar storms" by Croissanthology

89

"Schelling Goodness, and Shared Morality as a Goal" by Andrew_Critch

90

"Maybe there’s a pattern here?" by dynomight

91

"OpenAI’s surveillance language has many potential loopholes and they can do better" by Tom Smith

92

"An Alignment Journal: Coming Soon" by Dan MacKinlay, JessRiedel, Edmund Lau, Daniel Murfet, Scott Aaronson, Jan_Kulveit

93

"Frontier AI companies probably can’t leave the US" by Anders Woodruff

94

"Persona Parasitology" by Raymond Douglas

95

"Here’s to the Polypropylene Makers" by jefftk

96

"Anthropic: “Statement from Dario Amodei on our discussions with the Department of War”" by Matrice Jacobine

97

"Are there lessons from high-reliability engineering for AGI safety?" by Steven Byrnes

98

"Open sourcing a browser extension that tells you when people are wrong on the internet" by lc

99

"The persona selection model" by Sam Marks

100

"Responsible Scaling Policy v3" by HoldenKarnofsky

101

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

102

"The Spectre haunting the “AI Safety” Community" by Gabriel Alfour

103

"Why we should expect ruthless sociopath ASI" by Steven Byrnes

104

"You’re an AI Expert – Not an Influencer" by Max Winga

105

"The optimal age to freeze eggs is 19" by GeneSmith

106

"The truth behind the 2026 J.P. Morgan Healthcare Conference" by Abhishaike Mahajan

107

"The world keeps getting saved and you don’t notice" by Bogoed

108

"Solemn Courage" by aysja

109

"Life at the Frontlines of Demographic Collapse" by Martin Sustrik

110

"Why You Don’t Believe in Xhosa Prophecies" by Jan_Kulveit

111

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

112

"My journey to the microwave alternate timeline" by Malmesbury

113

"Stone Age Billionaire Can’t Words Good" by Eneasz

114

"On Goal-Models" by Richard_Ngo

115

"Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning" by megasilverfist

116

"Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics" by eleweek

117

"Post-AGI Economics As If Nothing Ever Happens" by Jan_Kulveit

118

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

119

"Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)" by RobertM

120

"Conditional Kickstarter for the “Don’t Build It” March" by Raemon

121

"How to Hire a Team" by Gretta Duleba

122

"The Possessed Machines (summary)" by L Rudolf L

123

"Ada Palmer: Inventing the Renaissance" by Martin Sustrik

124

"AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty)" by Stanislav Fort

125

"Dario Amodei – The Adolescence of Technology" by habryka

126

"AlgZoo: uninterpreted models with fewer than 1,500 parameters" by Jacob_Hilton

127

"Does Pentagon Pizza Theory Work?" by rba

128

"The inaugural Redwood Research podcast" by Buck, ryan_greenblatt

129

"Canada Lost Its Measles Elimination Status Because We Don’t Have Enough Nurses Who Speak Low German" by jenn

130

"Deep learning as program synthesis" by Zach Furman

131

"Why I Transitioned: A Response" by marisa

132

"Claude’s new constitution" by Zac Hatfield-Dodds

133

[Linkpost] "“The first two weeks are the hardest”: my first digital declutter" by mingyuan

134

"What Washington Says About AGI" by zroe1

135

"Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks" by James_Miller

136

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

137

"Backyard cat fight shows Schelling points preexist language" by jchan

138

"How AI Is Learning to Think in Secret" by Nicholas Andresen

139

"On Owning Galaxies" by Simon Lermen

140

"AI Futures Timelines and Takeoff Model: Dec 2025 Update" by elifland, bhalstead, Alex Kastner, Daniel Kokotajlo

141

"In My Misanthropy Era" by jenn

142

"2025 in AI predictions" by jessicata

143

"Good if make prior after data instead of before" by dynomight

144

"Measuring no CoT math time horizon (single forward pass)" by ryan_greenblatt

145

"Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance" by ryan_greenblatt

146

"Turning 20 in the probable pre-apocalypse" by Parv Mahajan

147

"Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment" by Cam, Puria Radmard, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

148

"Dancing in a World of Horseradish" by lsusr

149

"Contradict my take on OpenPhil’s past AI beliefs" by Eliezer Yudkowsky

150

"Opinionated Takes on Meetups Organizing" by jenn

151

"How to game the METR plot" by shash42

152

"Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers" by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans

153

"Scientific breakthroughs of the year" by technicalities

154

"A high integrity/epistemics political machine?" by Raemon

155

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala

156

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes

157

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f

158

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw

159

“The funding conversation we left unfinished” by jenn

160

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck

161

“Little Echo” by Zvi

162

“A Pragmatic Vision for Interpretability” by Neel Nanda

163

“AI in 2025: gestalt” by technicalities

164

“Eliezer’s Unteachable Methods of Sanity” by Eliezer Yudkowsky

165

“An Ambitious Vision for Interpretability” by leogao

166

“6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes

167

“Three things that surprised me about technical grantmaking at Coefficient Giving (fka Open Phil)”

168

“MIRI’s 2025 Fundraiser” by alexvermeer

169

“The Best Lack All Conviction: A Confusing Day in the AI Village”

170

“The Boring Part of Bell Labs” by Elizabeth

171

[Linkpost] “The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun”

172

“Writing advice: Why people like your quick bullshit takes better than your high-effort posts”

173

“Claude 4.5 Opus’ Soul Document”

174

“Unless its governance changes, Anthropic is untrustworthy”

175

“Alignment remains a hard, unsolved problem”

176

“Video games are philosophy’s playground” by Rachel Shu

177

“Stop Applying And Get To Work” by plex

178

“Gemini 3 is Evaluation-Paranoid and Contaminated”

179

“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

180

“Anthropic is (probably) not meeting its RSP security commitments” by habryka

181

“Varieties Of Doom” by jdp

182

“How Colds Spread” by RobertM

183

“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett

184

“Where is the Capital? An Overview” by johnswentworth

185

“Problems I’ve Tried to Legibilize” by Wei Dai

186

“Do not hand off what you cannot pick up” by habryka

187

“7 Vicious Vices of Rationalists” by Ben Pace

188

“Tell people as early as possible it’s not going to work out” by habryka

189

“Everyone has a plan until they get lied to the face” by Screwtape

190

“Please, Don’t Roll Your Own Metaethics” by Wei Dai

191

“Paranoia rules everything around me” by habryka

192

“Human Values ≠ Goodness” by johnswentworth

193

“Condensation” by abramdemski

194

“Mourning a life without AI” by Nikola Jurkovic

195

“Unexpected Things that are People” by Ben Goldhaber

196

“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

197

“Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec

198

“The Unreasonable Effectiveness of Fiction” by Raelifin

199

“Legible vs. Illegible AI Safety Problems” by Wei Dai

200

“Lack of Social Grace is a Lack of Skill” by Screwtape

201

[Linkpost] “I ate bear fat with honey and salt flakes, to prove a point” by aggliu

202

“What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt

203

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

204

[Linkpost] “You’re always stressed, your mind is always busy, you never have enough time” by mingyuan

205

“LLM-generated text is not testimony” by TsviBT

206

“Why I Transitioned: A Case Study” by Fiora Sunshine

207

“The Memetics of AI Successionism” by Jan_Kulveit

208

“How Well Does RL Scale?” by Toby_Ord

209

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

210

“Cancer has a surprising amount of detail” by Abhishaike Mahajan

211

“AIs should also refuse to work on capabilities research” by Davidmanheim

212

“On Fleshling Safety: A Debate by Klurl and Trapaucius.” by Eliezer Yudkowsky

213

“EU explained in 10 minutes” by Martin Sustrik

214

“Cheap Labour Everywhere” by Morpheus

215

[Linkpost] “Consider donating to AI safety champion Scott Wiener” by Eric Neyman

216

“Which side of the AI safety community are you in?” by Max Tegmark

217

“Doomers were right” by Algon

218

“Do One New Thing A Day To Solve Your Problems” by Algon

219

“Humanity Learned Almost Nothing From COVID-19” by niplav

220

“Consider donating to Alex Bores, author of the RAISE Act” by Eric Neyman

221

“Meditation is dangerous” by Algon

222

“That Mad Olympiad” by Tomás B.

223

“The ‘Length’ of ‘Horizons’” by Adam Scholl

224

“Don’t Mock Yourself” by Algon

225

“If Anyone Builds It Everyone Dies, a semi-outsider review” by dvd

226

“The Most Common Bad Argument In These Parts” by J Bostock

227

“Towards a Typology of Strange LLM Chains-of-Thought” by 1a3orn

228

“I take antidepressants. You’re welcome” by Elizabeth

229

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

230

“Hospitalization: A Review” by Logan Riggs

231

“What, if not agency?” by abramdemski

232

“The Origami Men” by Tomás B.

233

“A non-review of ‘If Anyone Builds It, Everyone Dies’” by boazbarak

234

“Notes on fatalities from AI takeover” by ryan_greenblatt

235

“Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most ‘classic humans’ in a few decades.” by Raemon

236

“Omelas Is Perfectly Misread” by Tobias H

237

“Ethical Design Patterns” by AnnaSalamon

238

“You’re probably overestimating how well you understand Dunning-Kruger” by abstractapplic

239

“Reasons to sell frontier lab equity to donate now rather than later” by Daniel_Eth, Ethan Perez

240

“CFAR update, and New CFAR workshops” by AnnaSalamon

241

“Why you should eat meat - even if you hate factory farming” by KatWoods

242

[Linkpost] “Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures” by Charbel-Raphaël

243

“This is a review of the reviews” by Recurrented

244

“The title is reasonable” by Raemon

245

“The Problem with Defining an ‘AGI Ban’ by Outcome (a lawyer’s take).” by Katalina Hernandez

246

“Contra Collier on IABIED” by Max Harms

247

“You can’t eval GPT5 anymore” by Lukas Petersson

248

“Teaching My Toddler To Read” by maia

249

“Safety researchers should take a public stance” by Ishual, Mateusz Bagiński

250

“The Company Man” by Tomás B.

251

“Christian homeschoolers in the year 3000” by Buck

252

“I enjoyed most of IABED” by Buck

253

“‘If Anyone Builds It, Everyone Dies’ release day!” by alexvermeer

254

“Obligated to Respond” by Duncan Sabien (Inactive)

255

“Chesterton’s Missing Fence” by jasoncrawford

256

“The Eldritch in the 21st century” by PranavG, Gabriel Alfour

257

“The Rise of Parasitic AI” by Adele Lopez

258

“High-level actions don’t screen off intent” by AnnaSalamon

259

[Linkpost] “MAGA populists call for holy war against Big Tech” by Remmelt

260

“Your LLM-assisted scientific breakthrough probably isn’t real” by eggsyntax

261

“Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro” by ryan_greenblatt

262

“⿻ Plurality & 6pack.care” by Audrey Tang

263

[Linkpost] “The Cats are On To Something” by Hastings

264

[Linkpost] “Open Global Investment as a Governance Model for AGI” by Nick Bostrom

265

“Will Any Old Crap Cause Emergent Misalignment?” by J Bostock

266

“AI Induced Psychosis: A shallow investigation” by Tim Hua

267

“Before LLM Psychosis, There Was Yes-Man Psychosis” by johnswentworth

268

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

269

“Banning Said Achmiz (and broader thoughts on moderation)” by habryka

270

“Underdog bias rules everything around me” by Richard_Ngo

271

“Epistemic advantages of working as a moderate” by Buck

272

“Four ways Econ makes people dumber re: future AI” by Steven Byrnes

273

“Should you make stone tools?” by Alex_Altair

274

“My AGI timeline updates from GPT-5 (and 2025 so far)” by ryan_greenblatt

275

“Hyperbolic model fits METR capabilities estimate worse than exponential model” by gjm

276

“My Interview With Cade Metz on His Reporting About Lighthaven” by Zack_M_Davis

277

“Church Planting: When Venture Capital Finds Jesus” by Elizabeth

278

“Somebody invented a better bookmark” by Alex_Altair

279

“How Does A Blind Model See The Earth?” by henry

280

“Re: Recent Anthropic Safety Research” by Eliezer Yudkowsky

281

“How anticipatory cover-ups go wrong” by Kaj_Sotala

282

“SB-1047 Documentary: The Post-Mortem” by Michaël Trazzi

283

“METR’s Evaluation of GPT-5” by GradientDissenter

284

“Emotions Make Sense” by DaystarEld

285

“The Problem” by Rob Bensinger, tanagrabeast, yams, So8res, Eliezer Yudkowsky, Gretta Duleba

286

“Many prediction markets would be better off as batched auctions” by William Howard

287

“Whence the Inkhaven Residency?” by Ben Pace

288

“I am worried about near-term non-LLM AI developments” by testingthewaters

289

“Optimizing The Final Output Can Obfuscate CoT (Research Note)” by lukemarks, jacob_drori, cloud, TurnTrout

290

“About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong” by bohaska

291

“Maya’s Escape” by Bridgett Kay

292

“Do confident short timelines make sense?” by TsviBT, abramdemski

293

“HPMOR: The (Probably) Untold Lore” by Gretta Duleba, Eliezer Yudkowsky

294

“On ‘ChatGPT Psychosis’ and LLM Sycophancy” by jdp

295

“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

296

“Love stays loved (formerly ‘Skin’)” by Swimmer963 (Miranda Dixon-Luinenburg)

297

“Make More Grayspaces” by Duncan Sabien (Inactive)

298

“Shallow Water is Dangerous Too” by jefftk

299

“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

300

“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah

301

“the jackpot age” by thiccythot

302

“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery

303

“An Opinionated Guide to Using Anki Correctly” by Luise

304

“Lessons from the Iraq War about AI policy” by Buck

305

“So You Think You’ve Awoken ChatGPT” by JustisMills

306

“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth

307

“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck

308

“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger

309

“A deep critique of AI 2027’s bad timeline models” by titotal

310

“‘Buckle up bucko, this ain’t over till it’s over.’” by Raemon

311

“Shutdown Resistance in Reasoning Models” by benwr, JeremySchlatter, Jeffrey Ladish

312

“Authors Have a Responsibility to Communicate Clearly” by TurnTrout

313

“The Industrial Explosion” by rosehadshar, Tom Davidson

314

“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks

315

“The best simple argument for Pausing AI?” by Gary Marcus

316

“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes

317

“Proposal for making credible commitments to AIs.” by Cleo Nardo

318

“X explains Z% of the variance in Y” by Leon Lang

319

“A case for courage, when speaking of AI danger” by So8res

320

“My pitch for the AI Village” by Daniel Kokotajlo

321

“Foom & Doom 1: ‘Brain in a box in a basement’” by Steven Byrnes

322

“Futarchy’s fundamental flaw” by dynomight

323

“Do Not Tile the Lightcone with Your Confused Ontology” by Jan_Kulveit

324

“Endometriosis is an incredibly interesting disease” by Abhishaike Mahajan

325

“Estrogen: A trip report” by cube_flipper

326

“New Endorsements for ‘If Anyone Builds It, Everyone Dies’” by Malo

327

[Linkpost] “the void” by nostalgebraist

328

“Mech interp is not pre-paradigmatic” by Lee Sharkey

329

“Distillation Robustifies Unlearning” by Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout

330

“Intelligence Is Not Magic, But Your Threshold For ‘Magic’ Is Pretty Low” by Expertium

331

“A Straightforward Explanation of the Good Regulator Theorem” by Alfred Harwood

332

“Beware General Claims about ‘Generalizable Reasoning Capabilities’ (of Modern AI Systems)” by LawrenceC

333

“Season Recap of the Village: Agents raise $2,000” by Shoshannah Tekofsky

334

“The Best Reference Works for Every Subject” by Parker Conley

335

“‘Flaky breakthroughs’ pervade coaching — and no one tracks them” by Chipmonk

336

“The Value Proposition of Romantic Relationships” by johnswentworth

337

“It’s hard to make scheming evals look realistic” by Igor Ivanov, dan_moken

338

[Linkpost] “Social Anxiety Isn’t About Being Liked” by Chipmonk

339

“Truth or Dare” by Duncan Sabien (Inactive)

340

“Meditations on Doge” by Martin Sustrik

341

[Linkpost] “If you’re not sure how to sort a list or grid—seriate it!” by gwern

342

“What We Learned from Briefing 70+ Lawmakers on the Threat from AI” by leticiagarcia

343

“Winning the power to lose” by KatjaGrace

344

[Linkpost] “Gemini Diffusion: watch this space” by Yair Halberstadt

345

“AI Doomerism in 1879” by David Gross

346

“Consider not donating under $100 to political candidates” by DanielFilan

347

“It’s Okay to Feel Bad for a Bit” by moridinamael

348

“Explaining British Naval Dominance During the Age of Sail” by Arjun Panickssery

349

“Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies” by So8res

350

“Too Soon” by Gordon Seidoh Worley

351

“PSA: The LessWrong Feedback Service” by JustisMills

352

“Orienting Toward Wizard Power” by johnswentworth

353

“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda

354

“Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall” by Vladimir_Nesov

355

“Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis” by jeanne_, eeeee

356

[Linkpost] “Jaan Tallinn’s 2024 Philanthropy Overview” by jaan

357

“Impact, agency, and taste” by benkuhn

358

[Linkpost] “To Understand History, Keep Former Population Distributions In Mind” by Arjun Panickssery

359

“AI-enabled coups: a small group could use AI to seize power” by Tom Davidson, Lukas Finnveden, rosehadshar

360

“Accountability Sinks” by Martin Sustrik

361

“Training AGI in Secret would be Unsafe and Unethical” by Daniel Kokotajlo

362

“Why Should I Assume CCP AGI is Worse Than USG AGI?” by Tomás B.

363

“Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” by Kaj_Sotala

364

“Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study” by Adam Karvonen

365

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

366

[Linkpost] “Playing in the Creek” by Hastings

367

“Thoughts on AI 2027” by Max Harms

368

“Short Timelines don’t Devalue Long Horizon Research” by Vladimir_Nesov

369

“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

370

“METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman

371

“Why Have Sentence Lengths Decreased?” by Arjun Panickssery

372

“AI 2027: What Superintelligence Looks Like” by Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, romeo

373

“OpenAI #12: Battle of the Board Redux” by Zvi

374

“The Pando Problem: Rethinking AI Individuality” by Jan_Kulveit

375

“OpenAI #12: Battle of the Board Redux” by Zvi

376

“You will crash your car in front of my house within the next week” by Richard Korzekwa

377

“My ‘infohazards small working group’ Signal Chat may have encountered minor leaks” by Linch

378

“Leverage, Exit Costs, and Anger: Re-examining Why We Explode at Home, Not at Work” by at_the_zoo

379

“PauseAI and E/Acc Should Switch Sides” by WillPetillo

380

“VDT: a solution to decision theory” by L Rudolf L

381

“LessWrong has been acquired by EA” by habryka

382

“We’re not prepared for an AI market crash” by Remmelt

383

“Conceptual Rounding Errors” by Jan_Kulveit

384

“Tracing the Thoughts of a Large Language Model” by Adam Jermyn

385

“Recent AI model progress feels mostly like bullshit” by lc

386

“AI for AI safety” by Joe Carlsmith

387

“Policy for LLM Writing on LessWrong” by jimrandomh

388

“Will Jesus Christ return in an election year?” by Eric Neyman

389

“Good Research Takes are Not Sufficient for Good Strategic Takes” by Neel Nanda

390

“Intention to Treat” by Alicorn

391

“On the Rationality of Deterring ASI” by Dan H

392

[Linkpost] “METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman

393

“I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?” by shrimpy

394

“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn

395

“Levels of Friction” by Zvi

396

“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas

397

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

398

“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

399

“The Most Forbidden Technique” by Zvi

400

“Trojan Sky” by Richard_Ngo

401

“OpenAI:” by Daniel Kokotajlo

402

“How Much Are LLMs Actually Boosting Real-World Programmer Productivity?” by Thane Ruthenis

403

“So how well is Claude playing Pokémon?” by Julian Bradshaw

404

“Methods for strong human germline engineering” by TsviBT

405

“Have LLMs Generated Novel Insights?” by abramdemski, Cole Wyeth

406

“A Bear Case: My Predictions Regarding AI Progress” by Thane Ruthenis

407

“Statistical Challenges with Making Super IQ babies” by Jan Christian Refsgaard

408

“Self-fulfilling misalignment data might be poisoning our AI models” by TurnTrout

409

“Judgements: Merging Prediction & Evidence” by abramdemski

410

“The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better” by Thane Ruthenis

411

“Power Lies Trembling: a three-book review” by Richard_Ngo

412

“Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” by Jan Betley, Owain_Evans

413

“The Paris AI Anti-Safety Summit” by Zvi

414

“Eliezer’s Lost Alignment Articles / The Arbital Sequence” by Ruby

415

“Arbital has been imported to LessWrong” by RobertM, jimrandomh, Ben Pace, Ruby

416

“How to Make Superbabies” by GeneSmith, kman

417

“A computational no-coincidence principle” by Eric Neyman

418

“A History of the Future, 2025-2040” by L Rudolf L

419

“It’s been ten years. I propose HPMOR Anniversary Parties.” by Screwtape

420

“Some articles in ‘International Security’ that I enjoyed” by Buck

421

“The Failed Strategy of Artificial Intelligence Doomers” by Ben Pace

422

“Murder plots are infohazards” by Chris Monteiro

423

“Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion?” by garrison

424

“The ‘Think It Faster’ Exercise” by Raemon

425

“So You Want To Make Marginal Progress...” by johnswentworth

426

“What is malevolence? On the nature, measurement, and distribution of dark traits” by David Althaus

427

“How AI Takeover Might Happen in 2 Years” by joshc

428

“Gradual Disempowerment, Shell Games and Flinches” by Jan_Kulveit

429

“Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development” by Jan_Kulveit, Raymond D, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet), David Duvenaud

430

“Planning for Extreme AI Risks” by joshc

431

“Catastrophe through Chaos” by Marius Hobbhahn

432

“Will alignment-faking Claude accept a deal to reveal its misalignment?” by ryan_greenblatt

433

“‘Sharp Left Turn’ discourse: An opinionated review” by Steven Byrnes

434

“Ten people on the inside” by Buck

435

“Anomalous Tokens in DeepSeek-V3 and r1” by henry

436

“Tell me about yourself: LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

437

“Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” by johnswentworth, David Lorell

438

“A Three-Layer Model of LLM Psychology” by Jan_Kulveit

439

“Training on Documents About Reward Hacking Induces Reward Hacking” by evhub

440

“AI companies are unlikely to make high-assurance safety cases if timelines are short” by ryan_greenblatt

441

“Mechanisms too simple for humans to design” by Malmesbury

442

“The Gentle Romance” by Richard_Ngo

443

“Quotes from the Stargate press conference” by Nikola Jurkovic

444

“The Case Against AI Control Research” by johnswentworth

445

“Don’t ignore bad vibes you get from people” by Kaj_Sotala

446

“[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty” by tandem

447

“Building AI Research Fleets” by bgold, Jesse Hoogland

448

“What Is The Alignment Problem?” by johnswentworth

449

“Applying traditional economic thinking to AGI: a trilemma” by Steven Byrnes

450

“Passages I Highlighted in The Letters of J.R.R. Tolkien” by Ivan Vendrov

451

“Parkinson’s Law and the Ideology of Statistics” by Benquo

452

“Capital Ownership Will Not Prevent Human Disempowerment” by beren

453

“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq

454

“What o3 Becomes by 2028” by Vladimir_Nesov

455

“What Indicators Should We Watch to Disambiguate AGI Timelines?” by snewman

456

“How will we update about scheming?” by ryan_greenblatt

457

“OpenAI #10: Reflections” by Zvi

458

“Maximizing Communication, not Traffic” by jefftk

459

“What’s the short timeline plan?” by Marius Hobbhahn

460

“Shallow review of technical AI safety, 2024” by technicalities, Stag, Stephen McAleese, jordine, Dr. David Mathers

461

“By default, capital will matter more than ever after AGI” by L Rudolf L

462

“Review: Planecrash” by L Rudolf L

463

“The Field of AI Alignment: A Postmortem, and What To Do About It” by johnswentworth

464

“When Is Insurance Worth It?” by kqr

465

“Orienting to 3 year AGI timelines” by Nikola Jurkovic

466

“What Goes Without Saying” by sarahconstantin

467

“o3” by Zach Stein-Perlman

468

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

469

“AIs Will Increasingly Attempt Shenanigans” by Zvi

470

“Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck

471

“Communications in Hard Mode (My new job at MIRI)” by tanagrabeast

472

“Biological risk from the mirror world” by jasoncrawford

473

“Subskills of ‘Listening to Wisdom’” by Raemon

474

“Understanding Shapley Values with Venn Diagrams” by Carson L

475

“LessWrong audio: help us choose the new voice” by PeterH

476

“Understanding Shapley Values with Venn Diagrams” by agucova

477

“o1: A Technical Primer” by Jesse Hoogland

478

“Gradient Routing: Masking Gradients to Localize Computation in Neural Networks” by cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout

479

“Frontier Models are Capable of In-context Scheming” by Marius Hobbhahn, AlexMeinke, Bronson Schoen

480

“(The) Lightcone is nothing without its people: LW + Lighthaven’s first big fundraiser” by habryka

481

“Repeal the Jones Act of 1920” by Zvi

482

“China Hawks are Manufacturing an AI Arms Race” by garrison

483

“Information vs Assurance” by johnswentworth

484

“You are not too ‘irrational’ to know your preferences.” by DaystarEld

485

“‘The Solomonoff Prior is Malign’ is a special case of a simpler argument” by David Matolcsi

486

“‘It’s a 10% chance which I did 10 times, so it should be 100%’” by egor.timatkov

487

“OpenAI Email Archives” by habryka

488

“Ayn Rand’s model of ‘living money’; and an upside of burnout” by AnnaSalamon

489

“Neutrality” by sarahconstantin

490

“Making a conservative case for alignment” by Cameron Berg, Judd Rosenblatt, phgubbins, AE Studio

491

“OpenAI Email Archives (from Musk v. Altman)” by habryka

492

“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub

493

“The Online Sports Gambling Experiment Has Failed” by Zvi

494

“o1 is a bad idea” by abramdemski

495

“Current safety training techniques do not fully transfer to the agent setting” by Simon Lermen, Govind Pimpale

496

“Explore More: A Bag of Tricks to Keep Your Life on the Rails” by Shoshannah Tekofsky

497

“Survival without dignity” by L Rudolf L

498

“The Median Researcher Problem” by johnswentworth

499

“The Compendium, A full argument about extinction risk from AGI” by adamShimi, Gabriel Alfour, Connor Leahy, Chris Scammell, Andrea_Miotti

500

“What TMS is like” by Sable

501

“The hostile telepaths problem” by Valentine

502

“A bird’s eye view of ARC’s research” by Jacob_Hilton

503

“A Rocket–Interpretability Analogy” by plex

504

“I got dysentery so you don’t have to” by eukaryote

505

“Overcoming Bias Anthology” by Arjun Panickssery

506

“Arithmetic is an underrated world-modeling technology” by dynomight

507

“My theory of change for working in AI healthtech” by Andrew_Critch

508

“Why I’m not a Bayesian” by Richard_Ngo

509

“The AGI Entente Delusion” by Max Tegmark

510

“Momentum of Light in Glass” by Ben

511

“Overview of strong human intelligence amplification methods” by TsviBT

512

“Struggling like a Shadowmoth” by Raemon

513

“Three Subtle Examples of Data Leakage” by abstractapplic

514

“the case for CoT unfaithfulness is overstated” by nostalgebraist

515

“Cryonics is free” by Mati_Roy

516

“Stanislav Petrov Quarterly Performance Review” by Ricki Heicklen

517

“Laziness death spirals” by PatrickDFarley

518

“‘Slow’ takeoff is a terrible term for ‘maybe even faster takeoff, actually’” by Raemon

519

“ASIs will not leave just a little sunlight for Earth” by Eliezer Yudkowsky

520

“Skills from a year of Purposeful Rationality Practice” by Raemon

521

“How I started believing religion might actually matter for rationality and moral philosophy” by zhukeepa

522

“Did Christopher Hitchens change his mind about waterboarding?” by Isaac King

523

“The Great Data Integration Schlep” by sarahconstantin

524

“Contra papers claiming superhuman AI forecasting” by nikos, Peter Mühlbacher, Lawrence Phillips, dschwarz

525

“OpenAI o1” by Zach Stein-Perlman

526

“The Best Lay Argument is not a Simple English Yud Essay” by J Bostock

527

“My Number 1 Epistemology Book Recommendation: Inventing Temperature” by adamShimi

528

“That Alien Message - The Animation” by Writer

529

“Pay Risk Evaluators in Cash, Not Equity” by Adam Scholl

530

“Survey: How Do Elite Chinese Students Feel About the Risks of AI?” by Nick Corvino

531

“things that confuse me about the current AI market.” by DMMF

532

“Nursing doubts” by dynomight

533

“Principles for the AGI Race” by William_S

534

“The Information: OpenAI shows ‘Strawberry’ to feds, races to launch it” by Martín Soto

535

“What is it to solve the alignment problem?” by Joe Carlsmith

536

“Limitations on Formal Verification for AI Safety” by Andrew Dickson

537

“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck

538

“Liability regimes for AI” by Ege Erdil

539

“AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work” by Rohin Shah, Seb Farquhar, Anca Dragan

540

“Fields that I reference when thinking about AI takeover prevention” by Buck

541

“WTH is Cerebrolysin, actually?” by gsfitzgerald, delton137

542

“You can remove GPT2’s LayerNorm by fine-tuning for an hour” by StefanHex

543

“Leaving MIRI, Seeking Funding” by abramdemski

544

“How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage” by orthonormal

545

“This is already your second chance” by Malmesbury

546

“0. CAST: Corrigibility as Singular Target” by Max Harms

547

“Self-Other Overlap: A Neglected Approach to AI Alignment” by Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena

548

“You don’t know how bad most things are nor precisely how they’re bad.” by Solenoid_Entity

549

“Recommendation: reports on the search for missing hiker Bill Ewasko” by eukaryote

550

“The ‘strong’ feature hypothesis could be wrong” by lsgos

551

“‘AI achieves silver-medal standard solving International Mathematical Olympiad problems’” by gjm

552

“Decomposing Agency — capabilities without desires” by owencb, Raymond D

553

“Universal Basic Income and Poverty” by Eliezer Yudkowsky

554

“Optimistic Assumptions, Longterm Planning, and ‘Cope’” by Raemon

555

“Superbabies: Putting The Pieces Together” by sarahconstantin

556

“Poker is a bad game for teaching epistemics. Figgie is a better one.” by rossry

557

“Reliable Sources: The Story of David Gerard” by TracingWoodgrains

558

“When is a mind me?” by Rob Bensinger

559

“80,000 hours should remove OpenAI from the Job Board (and similar orgs should do similarly)” by Raemon

560

[Linkpost] “introduction to cancer vaccines” by bhauth

561

“Priors and Prejudice” by MathiasKB

562

“My experience using financial commitments to overcome akrasia” by William Howard

563

“The Incredible Fentanyl-Detecting Machine” by sarahconstantin

564

“AI catastrophes and rogue deployments” by Buck

565

“Loving a world you don’t trust” by Joe Carlsmith

566

“Formal verification, heuristic explanations and surprise accounting” by paulfchristiano

567

“LLM Generality is a Timeline Crux” by eggsyntax

568

“SAE feature geometry is outside the superposition hypothesis” by jake_mendel

569

“Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data” by Johannes Treutlein, Owain_Evans

570

“Boycott OpenAI” by PeterMcCluskey

571

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison

572

“I would have shit in that alley, too” by Declan Molony

573

“Getting 50% (SoTA) on ARC-AGI with GPT-4o” by ryan_greenblatt

574

“Why I don’t believe in the placebo effect” by transhumanist_atom_understander

575

“Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)” by Andrew_Critch

576

“My AI Model Delta Compared To Christiano” by johnswentworth

577

“My AI Model Delta Compared To Yudkowsky” by johnswentworth

578

“Response to Aschenbrenner’s ‘Situational Awareness’” by Rob Bensinger

579

“Humming is not a free $100 bill” by Elizabeth

580

“Announcing ILIAD — Theoretical AI Alignment Conference” by Nora_Ammann, Alexander Gietelink Oldenziel

581

“Non-Disparagement Canaries for OpenAI” by aysja, Adam Scholl

582

“MIRI 2024 Communications Strategy” by Gretta Duleba

583

“OpenAI: Fallout” by Zvi

584

[HUMAN VOICE] Update on human narration for this podcast

585

“Maybe Anthropic’s Long-Term Benefit Trust is powerless” by Zach Stein-Perlman

586

“Notifications Received in 30 Minutes of Class” by tanagrabeast

587

“AI companies aren’t really using external evaluators” by Zach Stein-Perlman

588

“EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024” by scasper

589

“What’s Going on With OpenAI’s Messaging?” by ozziegoen

590

“Language Models Model Us” by eggsyntax

591

Jaan Tallinn’s 2023 Philanthropy Overview

592

“OpenAI: Exodus” by Zvi

593

DeepMind’s “Frontier Safety Framework” is weak and unambitious

594

Do you believe in hundred dollar bills lying on the ground? Consider humming

595

Deep Honesty

596

On Not Pulling The Ladder Up Behind You

597

Mechanistically Eliciting Latent Behaviors in Language Models

598

Ironing Out the Squiggles

599

Introducing AI Lab Watch

600

Refusal in LLMs is mediated by a single direction

601

Funny Anecdote of Eliezer From His Sister

602

Thoughts on seed oil

603

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

604

Express interest in an “FHI of the West”

605

Transformers Represent Belief State Geometry in their Residual Stream

606

Paul Christiano named as US AI Safety Institute Head of AI Safety

607

[HUMAN VOICE] "On green" by Joe Carlsmith

608

[HUMAN VOICE] "Toward a Broader Conception of Adverse Selection" by Ricki Heicklen

609

[HUMAN VOICE] "My PhD thesis: Algorithmic Bayesian Epistemology" by Eric Neyman

610

[HUMAN VOICE] "How could I have thought that faster?" by mesaoptimizer

611

LLMs for Alignment Research: a safety priority?

612

[HUMAN VOICE] "Scale Was All We Needed, At First" by Gabriel Mukobi

613

[HUMAN VOICE] "Using axis lines for good or evil" by dynomight

614

[HUMAN VOICE] "Social status part 1/2: negotiations over object-level preferences" by Steven Byrnes

615

[HUMAN VOICE] "Acting Wholesomely" by OwenCB

616

The Story of “I Have Been A Good Bing”

617

The Best Tacit Knowledge Videos on Every Subject

618

[HUMAN VOICE] "Deep atheism and AI risk" by Joe Carlsmith

619

[HUMAN VOICE] "My Clients, The Liars" by ymeskhout

620

[HUMAN VOICE] "Speaking to Congressional staffers about AI risk" by Akash, hath

621

[HUMAN VOICE] "CFAR Takeaways: Andrew Critch" by Raemon

622

Many arguments for AI x-risk are wrong

623

Tips for Empirical Alignment Research

624

Timaeus’s First Four Months

625

Contra Ngo et al. “Every ‘Every Bay Area House Party’ Bay Area House Party”

626

[HUMAN VOICE] "Updatelessness doesn't solve most problems" by Martín Soto

627

[HUMAN VOICE] "And All the Shoggoths Merely Players" by Zack_M_Davis

628

Every “Every Bay Area House Party” Bay Area House Party

629

2023 Survey Results

630

Raising children on the eve of AI

631

“No-one in my org puts money in their pension”

632

Masterpiece

633

CFAR Takeaways: Andrew Critch

634

[HUMAN VOICE] "Believing In" by Anna Salamon

635

[HUMAN VOICE] "Attitudes about Applied Rationality" by Camille Berger

636

Scale Was All We Needed, At First

637

Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy

638

[HUMAN VOICE] "A Shutdown Problem Proposal" by johnswentworth, David Lorell

639

Brute Force Manufactured Consensus is Hiding the Crime of the Century

640

[HUMAN VOICE] "Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI" by Jeremy Gillen, peterbarnett

641

Leading The Parade

642

[HUMAN VOICE] "The case for ensuring that powerful AIs are controlled" by ryan_greenblatt, Buck

643

Processor clock speeds are not how fast AIs think

644

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

645

Making every researcher seek grants is a broken model

646

The case for training frontier AIs on Sumerian-only corpus

647

This might be the last AI Safety Camp

648

[HUMAN VOICE] "There is way too much serendipity" by Malmesbury

649

[HUMAN VOICE] "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" by evhub et al.

650

[HUMAN VOICE] "How useful is mechanistic interpretability?" by ryan_greenblatt, Neel Nanda, Buck, habryka

651

The impossible problem of due process

652

[HUMAN VOICE] "Gentleness and the artificial Other" by Joe Carlsmith

653

Introducing Alignment Stress-Testing at Anthropic

654

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

655

[HUMAN VOICE] "Meaning & Agency" by Abram Demski

656

What’s up with LLMs representing XORs of arbitrary features?

657

Gentleness and the artificial Other

658

MIRI 2024 Mission and Strategy Update

659

The Plan - 2023 Version

660

Apologizing is a Core Rationalist Skill

661

[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata

662

The Dark Arts

663

Critical review of Christiano’s disagreements with Yudkowsky

664

Most People Don’t Realize We Have No Idea How Our AIs Work

665

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

666

Succession

667

Nonlinear’s Evidence: Debunking False and Misleading Claims

668

Effective Aspersions: How the Nonlinear Investigation Went Wrong

669

Constellations are Younger than Continents

670

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

671

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

672

Is being sexy for your homies?

673

[HUMAN VOICE] "Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible" by Gene Smith and Kman

674

[HUMAN VOICE] "Moral Reality Check (a short story)" by jessicata

675

AI Control: Improving Safety Despite Intentional Subversion

676

2023 Unofficial LessWrong Census/Survey

677

The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity.

678

[HUMAN VOICE] "What are the results of more parental supervision and less outdoor play?" by Julia Wise

679

Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible

680

re: Yudkowsky on biological materials

681

Speaking to Congressional staffers about AI risk

682

[HUMAN VOICE] "Shallow review of live agendas in alignment & safety" by technicalities & Stag

683

Thoughts on “AI is easy to control” by Pope & Belrose

684

The 101 Space You Will Always Have With You

685

[HUMAN VOICE] "Social Dark Matter" by Duncan Sabien

686

Shallow review of live agendas in alignment & safety

687

Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense

688

[HUMAN VOICE] "The 6D effect: When companies take risks, one email can be very powerful." by scasper

689

OpenAI: The Battle of the Board

690

OpenAI: Facts from a Weekend

691

Sam Altman fired from OpenAI

692

Social Dark Matter

693

[HUMAN VOICE] "Thinking By The Clock" by Screwtape

694

"You can just spontaneously call people you haven't met in years" by lc

695

[HUMAN VOICE] "AI Timelines" by habryka, Daniel Kokotajlo, Ajeya Cotra, Ege Erdil

696

"EA orgs' legal structure inhibits risk taking and information sharing on the margin" by Elizabeth

697

"Integrity in AI Governance and Advocacy" by habryka, Olivia Jimenez

698

Loudly Give Up, Don’t Quietly Fade

699

[HUMAN VOICE] "Deception Chess: Game #1" by Zane et al.

700

[HUMAN VOICE] "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

701

"The 6D effect: When companies take risks, one email can be very powerful." by scasper

702

"The other side of the tidal wave" by Katja Grace

703

"Does davidad's uploading moonshot work?" by jacobjacob et al.

704

"Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk" by 1a3orn

705

"My thoughts on the social response to AI risk" by Matthew Barnett

706

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

707

"Thoughts on the AI Safety Summit company policy requests and responses" by So8res

708

"President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence" by Tristan Williams

709

[HUMAN VOICE] "Book Review: Going Infinite" by Zvi

710

"We're Not Ready: thoughts on "pausing" and responsible scaling policies" by Holden Karnofsky

711

"At 87, Pearl is still able to change his mind" by rotatingpaguro

712

"Architects of Our Own Demise: We Should Stop Developing AI" by Roko

713

"AI as a science, and three obstacles to alignment strategies" by Nate Soares

714

"Thoughts on responsible scaling policies and regulation" by Paul Christiano

715

"Announcing Timaeus" by Jesse Hoogland et al.

716

[HUMAN VOICE] "Alignment Implications of LLM Successes: a Debate in One Act" by Zack M Davis

717

"Holly Elmore and Rob Miles dialogue on AI Safety Advocacy" by jacobjacob, Robert Miles & Holly_Elmore

718

"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish

719

"Labs should be explicit about why they are building AGI" by Peter Barnett

720

[HUMAN VOICE] "Sum-threshold attacks" by TsviBT

721

"Will no one rid me of this turbulent pest?" by Metacelsus

722

[HUMAN VOICE] "Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth

723

"RSPs are pauses done right" by evhub

724

"Comparing Anthropic's Dictionary Learning to Ours" by Robert_AIZI

725

"Announcing MIRI’s new CEO and leadership team" by Gretta Duleba

726

"Cohabitive Games so Far" by mako yass

727

"Announcing Dialogues" by Ben Pace

728

"Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn" by Zvi

729

"Evaluating the historical value misspecification argument" by Matthew Barnett

730

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

731

"Thomas Kwa's MIRI research experience" by Thomas Kwa and others

732

"'Diamondoid bacteria' nanobots: deadly threat or dead-end? A nanotech investigation" by titotal

733

"The Lighthaven Campus is open for bookings" by Habryka

734

"How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions" by Jan Brauner et al.

735

"EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem" by Elizabeth

736

"The King and the Golem" by Richard Ngo

737

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al.

738

"Inside Views, Impostor Syndrome, and the Great LARP" by John Wentworth

739

"There should be more AI safety orgs" by Marius Hobbhahn

740

"The Talk: a brief explanation of sexual dimorphism" by Malmesbury

741

"A Golden Age of Building? Excerpts and lessons from Empire State, Pentagon, Skunk Works and SpaceX" by jacobjacob

742

"AI presidents discuss AI alignment agendas" by TurnTrout & Garrett Baker

743

"UDT shows that decision theory is more puzzling than ever" by Wei Dai

744

"Sum-threshold attacks" by TsviBT

745

"Report on Frontier Model Training" by Yafah Edelman

746

"A list of core AI safety problems and how I hope to solve them" by Davidad

747

"One Minute Every Moment" by abramdemski

748

"Sharing Information About Nonlinear" by Ben Pace

749

"Defunding My Mistake" by ymeskhout

750

"What I would do if I wasn’t at ARC Evals" by LawrenceC

751

"Meta Questions about Metaphilosophy" by Wei Dai

752

"The U.S. is becoming less stable" by lc

753

"OpenAI API base models are not sycophantic, at any size" by Nostalgebraist

754

"Dear Self; we need to talk about ambition" by Elizabeth

755

"Assume Bad Faith" by Zack_M_Davis

756

"Book Launch: "The Carving of Reality," Best of LessWrong vol. III" by Raemon

757

"Large Language Models will be Great for Censorship" by Ethan Edwards

758

"6 non-obvious mental health issues specific to AI safety" by Igor Ivanov

759

"Ten Thousand Years of Solitude" by agp

760

"Against Almost Every Theory of Impact of Interpretability" by Charbel-Raphaël

761

"Feedbackloop-first Rationality" by Raemon

762

"Inflection.ai is a major AGI lab" by Nikola

763

"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

764

"When can we trust model evaluations?" by evhub

765

"ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks" by Beth Barnes

766

"The "public debate" about AI is confusing for the general public and for policymakers because it is a three-sided debate" by Adam David Long

767

"My current LK99 questions" by Eliezer Yudkowsky

768

"Thoughts on sharing information about language model capabilities" by paulfchristiano

769

"Cultivating a state of mind where new ideas are born" by Henrik Karlsson

770

"Self-driving car bets" by paulfchristiano

771

"Yes, It's Subjective, But Why All The Crabs?" by johnswentworth

772

"Grant applications and grand narratives" by Elizabeth

773

"Brain Efficiency Cannell Prize Contest Award Ceremony" by Alexander Gietelink Oldenziel

774

"Rationality !== Winning" by Raemon

775

"Cryonics and Regret" by MvB

776

"Unifying Bargaining Notions (2/2)" by Diffractor

777

"The ants and the grasshopper" by Richard Ngo

778

"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

779

"An artificially structured argument for expecting AGI ruin" by Rob Bensinger

780

"How much do you believe your results?" by Eric Neyman

781

"Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)" by Chris Scammell & DivineMango

782

"On AutoGPT" by Zvi

783

"GPTs are Predictors, not Imitators" by Eliezer Yudkowsky

784

"A stylized dialogue on John Wentworth's claims about markets and optimization" by Nate Soares

785

"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

786

"Deep Deceptiveness" by Nate Soares

787

"The Onion Test for Personal and Institutional Honesty" by Chana Messinger & Andrew Critch

788

"There’s no such thing as a tree (phylogenetically)" by Eukaryote

789

"Losing the root for the tree" by Adam Zerner

790

"It Looks Like You’re Trying To Take Over The World" by Gwern

791

"Why I think strong general AI is coming soon" by Porby

792

"What failure looks like" by Paul Christiano

793

"Lies, Damn Lies, and Fabricated Options" by Duncan Sabien

794

""Carefully Bootstrapped Alignment" is organizationally hard" by Raemon

795

"More information about the dangerous capability evaluations we did with GPT-4 and Claude." by Beth Barnes

796

"Enemies vs Malefactors" by Nate Soares

797

"The Parable of the King and the Random Process" by moridinamael

798

"The Waluigi Effect (mega-post)" by Cleo Nardo

799

"Acausal normalcy" by Andrew Critch

800

"Please don't throw your mind away" by TsviBT

801

"Cyborgism" by Nicholas Kees & Janus

802

"Childhoods of exceptional people" by Henrik Karlsson

803

"What I mean by "alignment is in large part about making cognition aimable at all"" by Nate Soares

804

"On not getting contaminated by the wrong obesity ideas" by Natália Coelho Mendonça

805

"SolidGoldMagikarp (plus, prompt generation)"

806

"Focus on the places where you feel shocked everyone's dropping the ball" by Nate Soares

807

"Basics of Rationalist Discourse" by Duncan Sabien

808

"Sapir-Whorf for Rationalists" by Duncan Sabien

809

"My Model Of EA Burnout" by Logan Strohl

810

"The Social Recession: By the Numbers" by Anton Stjepan Cebalo

811

"Recursive Middle Manager Hell" by Raemon

812

"The Feeling of Idea Scarcity" by John Wentworth

813

"Models Don't 'Get Reward'" by Sam Ringer

814

"How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin

815

"The next decades might be wild" by Marius Hobbhahn

816

"Lessons learned from talking to >100 academics about AI safety" by Marius Hobbhahn

817

"How my team at Lightcone sometimes gets stuff done" by jacobjacob

818

"Decision theory does not imply that we get to have nice things" by So8res

819

"What 2026 looks like" by Daniel Kokotajlo

820

Counterarguments to the basic AI x-risk case

821

"Introduction to abstract entropy" by Alex Altair

822

"Consider your appetite for disagreements" by Adam Zerner

823

"My resentful story of becoming a medical miracle" by Elizabeth

824

"The Redaction Machine" by Ben

825

"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra

826

"The shard theory of human values" by Quintin Pope & TurnTrout

827

"Two-year update on my personal AI timelines" by Ajeya Cotra

828

"You Are Not Measuring What You Think You Are Measuring" by John Wentworth

829

"Do bamboos set themselves on fire?" by Malmesbury

830

"Survey advice" by Katja Grace

831

"Toni Kurz and the Insanity of Climbing Mountains" by Gene Smith

832

"Deliberate Grieving" by Raemon

833

"Toolbox-thinking and Law-thinking" by Eliezer Yudkowsky

834

"Local Validity as a Key to Sanity and Civilization" by Eliezer Yudkowsky

835

"Humans are not automatically strategic" by Anna Salamon

836

"Language models seem to be much better than humans at next-token prediction" by Buck, Fabien and LawrenceC

837

"Moral strategies at different capability levels" by Richard Ngo

838

"Worlds Where Iterative Design Fails" by John Wentworth

839

"(My understanding of) What Everyone in Technical Alignment is Doing and Why" by Thomas Larsen & Eli Lifland

840

"Unifying Bargaining Notions (1/2)" by Diffractor

841

"Simulators" by Janus

842

"Humans provide an untapped wealth of evidence about alignment" by TurnTrout & Quintin Pope

843

"Changing the world through slack & hobbies" by Steven Byrnes

844

"«Boundaries», Part 1: a key missing concept from utility theory" by Andrew Critch

845

"ITT-passing and civility are good; "charity" is bad; steelmanning is niche" by Rob Bensinger

846

"What should you change in response to an "emergency"? And AI risk" by Anna Salamon

847

"On how various plans miss the hard bits of the alignment challenge" by Nate Soares

848

"Humans are very reliable agents" by Alyssa Vance

849

"Looking back on my alignment PhD" by TurnTrout

850

"It’s Probably Not Lithium" by Natália Coelho Mendonça

851

"What Are You Tracking In Your Head?" by John Wentworth

852

"Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment" by elspood

853

"Where I agree and disagree with Eliezer" by Paul Christiano

854

"Six Dimensions of Operational Adequacy in AGI Projects" by Eliezer Yudkowsky

855

"Moses and the Class Struggle" by lsusr

856

"Benign Boundary Violations" by Duncan Sabien

857

"AGI Ruin: A List of Lethalities" by Eliezer Yudkowsky