All Episodes

LessWrong (30+ Karma) — 561 episodes

#
Title
1

“Green” by Adam Zerner

2

“A CERN for AI is a distraction; push for an IAEA instead” by Charbel-Raphaël

3

“Model access for third-parties — it’s a big deal!” by Cleo Nardo

4

“You Should Come to The AI Protest” by Ronak_Mehta

5

“Structural Proxies” by Raymond Douglas

6

“The consequences of locking intelligence away: an introduction to Claude relays in China” by CMLKevin

7

“In partial defence of p(doom)” by Mikhail Samin

8

“What Capable Agents Must Know: Why AI Consciousness May Be an Inevitable Byproduct of Capability” by Aran Nayebi

9

“Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness” by 7vik, Sid Black, Joseph Bloom

10

“Agency is not a natural kind (and why that might matter for alignment)” by SJ_Beard

11

“Human-Guided Agentic Research: A Research Agenda” by fastfedora

12

“Destroying the universe: How hard can it be?” by djbinder

13

“AI will make biological extinction risks worse before it makes them better” by MichaelDickens

14

“$1M AI x-risk grant round is live on grantmaking.ai - apply for funding, review applicants, or fund projects” by mbrooks, Mckiev

15

“Third-parties should focus on scrutinising systems cards” by Cleo Nardo

16

“P(doom) is a Dumb Meme” by Max Harms

17

“A reading list for generalists” by Dylan Bowman

18

“What comes with cheap math?” by abramdemski

19

“Do LLMs Have Desires?” by Christopher Ackerman

20

“Agents as Webs of Beliefs” by Richard_Ngo

21

“Austin & Oli on funding and incubating projects” by Austin Chen, habryka

22

“Deployment Awareness Matters More Than Evaluation Awareness” by VojtaKovarik, Tomáš Gavenčiak, Mateusz Bagiński

23

“Why are adversaries assumed to be incapable of responding to AI risk?” by KatjaGrace

24

“What did “scheming”, “mech interp” mean pre-2023.” by Cleo Nardo

25

“Not making a strong argument is a relief” by Kaj_Sotala

26

[Linkpost] “Don’t ignore the car crashes, and remember your freshman CS” by jcksanderson

27

“Chorus-Reinterpretation Country Songs” by jefftk

28

“The Case for Model Forensics” by aditya singh, gersonkroiz, Senthooran Rajamanoharan, Neel Nanda

29

“Existential AI safety needs an effective social movement. PauseAI is building it” by Maxime Fournes, Espedair Street

30

“White House Will Ad Hoc Decide Who Can Individually Access GPT-5.6” by Zvi

31

“Surprising facts about the slave trade” by Joseph Miller

32

“Exploration: fine-tuning with parameter decomposition” by Lucius Bushnaq

33

“Alignment & Succession: The Ideology of Successionism” by L Rudolf L

34

“The shouting equilibrium” by KatjaGrace

35

“Things are not a fixed size in mind-space” by KatjaGrace

36

“Door’s Locked, Try the Window” by Prakrat Agrawal, Jérémy Scheurer

37

“How does such unprofessional AI get the job?” by KatjaGrace

38

“AI catastrophe: more like a genocide than a thought experiment” by KatjaGrace

39

“Expert Views on Continual Learning: Survey Results and Forecasts” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

40

“Elephant seal IV” by KatjaGrace

41

“What is up with e/acc?” by KatjaGrace

42

“AI pause: the case for ASAP” by KatjaGrace

43

“Reward Hacking Without Egregious Misalignment in an RL-Only Setting” by Joey Yudelson, Vladimir Ivanov, ryan_greenblatt

44

“Planning for Preservation in the Age of AI” by Raelifin

45

“Risk-Averse AIs” by wdmacaskill, Elliott Thornley (EJT)

46

“And what happens next?” by Sean Herrington

47

“Superintelligence vs. The Second Strike” by Felix Choussat

48

“The worthlessness of vitamin D is mildly exaggerated” by dynomight

49

“A system overview for near-term, low-trust AI compute verification” by Naci Cankaya

50

“Model Size Scaling in 2023-2031” by Vladimir_Nesov

51

“The AI Industrial Explosion — Part 4: Cheap power” by djbinder

52

“A Theory of Prompt Injection (and why you should study roles)” by Charles Ye, softboiledheart

53

“Coup is the Pareto-optimal social game” by Daniel Tan

54

“A brief list of ways AI safety efforts could be net negative” by Elias Schmied

55

“NLA explanations can be shortened without harming reconstruction” by loops

56

“Introducing MonitoringBench” by monika_j

57

“The Invisible Side of AI Governance” by Charbel-Raphaël

58

“Google Can’t Math Parsecs” by jefftk

59

“[Linkpost] How Transparent Is DiffusionGemma (and why it matters)” by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy

60

“Would anybody here be interested in a “mistake postmortem” discussion group?” by SK2

61

“Hyperstition as the Natural Enemy of Rationality” by alseph

62

“AI Safety Ecosystem Research notes” by Eneasz

63

“Research agenda: Interpretive debate” by Shi

64

“The LLM shoggoth meme is weirder than you think” by HedonicEscalator

65

“Introduction: Gaussian Natural Latents” by Haru

66

“San Silvestro” by Tomás B.

67

“The one-week sprint” by Daniel Tan

68

“On “Model Organisms”” by J Bostock

69

“The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t” by Alek Westover, SebastianP, Alexa Pan, Jozdien

70

““Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms” by Alan Cooney, David Africa, Geoffrey Irving

71

“Contra Pace on When to Apologize” by Zack_M_Davis

72

“GDM AI Control Roadmap” by Mary Phuong, Erik Jenner, Rohin Shah, Seb Farquhar

73

“Your Model Organisms Might Be Fried” by Daniel Tan, J Bostock, draganover, ma-rmartinez, sidbaines, David Africa

74

“Rational Agentic Maximalist Philosophies” by Connor Blake

75

“Leveraged on being right” by Ben Pace

76

“AI #173: AI Pauses” by Zvi

77

“Gears for political races” by Tom Smith

78

“Several frontier models are substantially prefill aware” by yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor, RobertKirk

79

“Alignment pretraining could backfire” by Alexandre Variengien

80

“The Financial Ledger Theory of Apologies” by Ben Pace

81

[Linkpost] “Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?” by gwern

82

[Linkpost] “Guardian Angels: LLM Personalization for Productivity and Security” by gwern

83

“Predicting LLM Safety Before Release by Simulating Deployment” by Tomek Korbak, Marcus Williams, micahcarroll, Cameron Raymond, Hannah Sheahan

84

“How the AI Village works” by Adam B

85

“What are some angles of attack for making continual learning safer?” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

86

“Does preservation make sense before we know how to revive?” by Aurelia

87

“Synthetic document finetuning for instilling positive traits” by CallumMcDougall, Arthur Conmy, Neel Nanda

88

“A Test Suite for Concepts” by Gretta Duleba

89

“A frontier AI company should shut down” by MichaelDickens

90

“The Once And Future Fable #2” by Zvi

91

“Why Do Naive SFT Filters For Safety Properties Fail?” by Josh Engels, Neel Nanda

92

“Impressions at the Extremity of Civilization” by Ben Pace

93

“The Hidden Structures of Problems” by spencerg

94

“How might continual learning affect safety and alignment?” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

95

“SFT Drives Gemini’s Safety Properties” by Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda

96

“American Government Takes Down Claude Fable” by Zvi

97

“The term “AGI” is almost useless at this point [Linkpost]” by Noosphere89

98

“The Uncertainty That Matters Isn’t Fundamental” by jimmy

99

[Linkpost] “US government directive to suspend access to Fable 5 and Mythos 5” by Capybasilisk

100

“Citations Needed: Magic Encyclopedias to Save the World” by Oliver Sourbut

101

“Simulating Simulators” by kromem

102

“Implications of Continual Learning for LLM Agents: Introduction” by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

103

“Reward Hacking at the 1937 World’s Fair” by frmsaul

104

“Claude Fable 5 and Mythos 5: The System Card” by Zvi

105

“Building and evaluating model diffing agents” by bilalchughtai, Josh Engels, Neel Nanda

106

“Sympathy for both sides of the egregious misalignment debate” by Steven Byrnes

107

“Celene’s thoughts on consciousness” by ToasterLightning

108

“Parkinson’s Heuristic” by Ben Pace

109

“PSA: Almost nobody is working on alignment” by Chi Nguyen, peterbarnett

110

“AI #172: The First Fable” by Zvi

111

“Models May Behave Worse When Eval Aware” by Senthooran Rajamanoharan, Neel Nanda

112

“Thoughts on Claude Fable’s silent safeguards” by Andy Arditi

113

“You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them” by RobinHa

114

“Anthropic did not call for a pause on AI” by Andrea_Miotti, Gabriel Alfour

115

“Tracing Eval-Awareness Emergence Through Training of OLMo 3” by Ram Bharadwaj, RobertKirk

116

“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask

117

“Three types of model organism” by Francis Rhys Ward

118

“Sequent: scale and automation for higher confidence in alignment” by Geoffrey Irving, Alex HT, Jesse Hoogland, Daniel Murfet, Jacob Pfau, Marco Cozzi, Stan van Wingerden

119

“Machinic Psychopharmacology: Do LLMs Self-Medicate?” by Sid Black, Joseph Bloom

120

“The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably” by Alex Amadori

121

““Programmer Science Fiction: My case for a new sub-genre”, Sam T. Oates 2026” by gwern

122

“Even “illegible” Mythos reasoning traces seem pretty legible” by faul_sname

123

“Claude Fable 5 and Mythos 5 [Linkpost]” by fluxxrider

124

“A Mike’s-Eye View of ARC’s Research” by Jacob_Hilton

125

“Towards a Formal Scientific Epistemology” by Richard_Ngo

126

“LLMs and almost good code” by kqr

127

“On Slop” by Jan

128

“The Machines Lack Honour” by Raymond Douglas

129

“How to build a cancer vaccine, and whether they will work this time” by Abhishaike Mahajan

130

“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck

131

“Bun’s Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment” by Sayhan Yalvaçer

132

“Mental causation is not load-bearing” by jessicata

133

“How Far Apart Does a Model Think Its Tokens Are?” by Brendan Long

134

“Can activation verbalizers surface an internal chain of thought?” by oakhu, ryan_greenblatt

135

“Against Corrigibility” by peralice

136

“Coming Around To Political Donations” by jefftk

137

“Optimisation over non-stationary distributions creates weirder minds” by Samuel Ratnam, Pjain

138

“Why Software Automation Is Hard” by silentbob

139

“SecureBio Detection is Hiring Software Engineers” by jefftk

140

“What if Anthropic unilaterally paused capabilities development right now?” by Karl von Wendt

141

“Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks” by Mark Kagach ☘️, EliasSchlie, Thomas Van Damme, JustinShovelain

142

“Beyond the lexical personality traits: What is the structure of personality?” by tailcalled

143

“My research agenda and work” by Seth Herd

144

“Logits as a new monitor for evaluation awareness” by Santiago Aranguri

145

“One Year of PauseAI UK” by Joseph Miller, PauseAI UK

146

“Learnings from starting an AI safety research team” by draganover, Erin Robertson

147

“OpenAI Offers A New Policy Blueprint” by Zvi

148

“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn

149

“Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)

150

“(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger

151

“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda

152

“Rohin Shah on AGI Safety” by anaguma

153

“Sixteen schemes for AI safety” by Austin Chen

154

“AI #171: False Flag” by Zvi

155

“Don’t Edit Your Ideas Before Having Them” by Hide

156

“Society Explained: a tool for efficiently exploring >100 theories of society” by spencerg

157

“Trump Signs Executive Order For AI Testing Prior To Frontier Model Releases” by Zvi

158

“China won’t win the AI race but would it be much worse if it did?” by Chastity Ruth

159

“A Town Without Children” by SeñorDingDong

160

“My favorite depiction of utopia” by Caleb Biddulph

161

“Why Even Experts Don’t Know What to Do About AI Risk” by Luc Brinkman, plex

162

“Agent Foundations Reminds Me of Continental Philosophy” by IanWS

163

“Announcing the ARC White-Box Estimation Challenge” by Jacob_Hilton

164

“Claude Opus 4.8: Capabilities and Reactions” by Zvi

165

“Tech I’m skeptical of and why” by harsimony

166

“Dissolving the Deep Learning Sample Efficiency Gap” by Samuel Knoche

167

““Contagious Humming” to Silence a Room” by JohnofCharleston

168

[Linkpost] “NYT: Senator Sanders Proposes Gov’t Take 50% Ownership of AI labs” by Julian Bradshaw

169

“Opus 4.8 Part 2: Model Welfare” by Zvi

170

[Linkpost] “Some humans are both male and female, and can (but shouldn’t) have children with themselves” by HedonicEscalator

171

“Outrunning your headlights” by mattshu0410

172

“Lighthaven East - A Feasibility Study” by JohnofCharleston

173

“Notes on axes of variation in third-party risk assessment” by Buck

174

“Financial Costs of an AI Pause?” by PeterMcCluskey

175

“When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability” by Logan Riggs, tdooms, Conflux, lwroe, MLNissenGonzalez

176

“Testing Gemini models for scheming tendencies” by Vika, David Lindner, Seb Farquhar, Rohin Shah

177

“Comment on “Banning Said Achmiz”” by Zack_M_Davis

178

“Announcing: Iliad’s Fall 2026 Programs” by David Udell, Alexander Gietelink Oldenziel, Leon Lang

179

“Data you could have observed but didn’t” by Gretta Duleba

180

“Claude Opus 4.8: The System Card” by Zvi

181

“Retrying vs Resampling in AI Control” by james.lucassen, Adam Kaufman

182

“AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles” by Max Tegmark

183

“Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour” by JasonB, Edward James Young

184

“Does Claude really care about you?” by Simon Lermen

185

“How can the middle powers avoid getting trounced during the intelligence explosion? A plan.” by Tom Davidson

186

“Trees are mostly made of air and a generalizable lesson for AI safety” by zroe1

187

“Advice for making robust-to-training model organisms” by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, Dylan Xu

188

“Claude… doesn’t know who you are?” by Smaug123

189

“Mnemonic portraits for 19,023 human genes” by Brinedew

190

“Some Dating Stories” by johnswentworth

191

“Infinite ethics and UDASSA” by David Matolcsi

192

“AI #170: Lack of Executive Order” by Zvi

193

“The ballad of TIGIT” by Abhishaike Mahajan

194

“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner

195

“LLMs Through the Eyes of Vinge” by Gordon Seidoh Worley

196

“Announcing Geodesic Research” by Puria, Cam, Alexandra Narin, Edward James Young, Kyle O’Brien

197

“Full automation of AI R&D probably yields a large speed up even without a software-only singularity” by ryan_greenblatt

198

“Quantitative AI risk assessment: a starting point” by Henry Papadatos, jakub_krys, malcolmmurray, Renn Karageorgieva

199

“Finding the Mole: Bayesianism is Hard” by laniakea

200

“Notes on Fourier Analysis” by Menotim

201

“Standard deviations from just two values” by kqr

202

“Contra Wentworth on Physical Attractiveness for Men” by Gretta Duleba

203

“Practical Learnings from Synthetic Document Finetuning” by Axel Højmark, Jérémy Scheurer

204

“Claude, Author of the Humanitas” by Linch

205

“RTMH: Pope Leo’s Magnifica Humanitas on AI” by Zvi

206

“Brackets Are a Bad Way to Regulate” by Hide

207

“Many portions of Magnifica Humanitas appear to be AI-written” by DanielFilan

208

“Donating 80% While It Still Counts” by jefftk

209

“Cognitive Security as an AI Safety Cause Area” by jsteinhardt

210

“Linkpost: New Vatican Encyclical on AI Governance” by Jackson Wagner

211

“A (Slightly) Mechanistic Theory for Exponentially Increasing AI Time Horizons?” by Oliver Sourbut

212

“Taxing Small Cars To Improve MPG” by jefftk

213

“We made a map of the doom debate” by Sean Herrington, Paul Hindoian, mikaelacankosyan, David Bravo, keivnc, Josh Tuffy, Christopher Davis, Khai Tran, Maryam Hampaei

214

“Your Left Brain Doesn’t Trade With Your Right” by Alexander Gietelink Oldenziel

215

“Probabilities are not the right concept” by David Matolcsi

216

“Basic principles for dressing better.” by spookycat

217

“Will we really put data centers in space?” by Avi Parrack, fin

218

“PLA Daily Translation: Reflections on Warfare Brought by AGI” by eeeee

219

“Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List” by Owain_Evans

220

“Numb mental state shifts” by KatjaGrace

221

“You can opt out of allergies” by Rattengift

222

“Notes on Collaborating with Claude Opus” by Nissa Seru

223

“Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks” by Nathaniel Mitrani, sassanb, Cam Tice, Puria

224

“Gemini 3.5 Flash Looks Good For How Fast It Is” by Zvi

225

“What am I, if not an AI?” by makiba

226

“Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate” by Jordan Taylor, Max H, Ed Fage, Thomas Read, Joseph Bloom

227

“AI #169: New Knowledge” by Zvi

228

“Why does off-model SFT degrade capabilities?” by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, Vivek Hebbar

229

“Women should be able to open things” by KatjaGrace

230

“Toward Interoperability of Minimal Programs” by johnswentworth

231

“theory uplift differentially benefits safety & is massively underpriced” by Yudhister Kumar

232

“Power-seeking agents will likely be developed” by Alec Harris

233

“Synthetic Persona Pretraining: Alignment from Token Zero” by Julian Minder, Raghav Singhal, Viktor Moskvoretskii, Stefan Krsteski, ashtonanderson, rolandaydin, Robert West

234

“If AI is normal technology, history is not reassuring.” by Davidmanheim

235

“Pythagorean addition” by kqr

236

“Brain Structure and IQ: How Myelin Elevates Intelligence” by Shiva’s Right Foot

237

“Conclave 1492” by Vaniver

238

“Humans are not automatically strategic — “inner work” edition” by Chris Lakin

239

“Implications Of Predicting The Next Token” by jdp

240

“A Visual Guide to Natural Latents” by Alfred Harwood

241

“Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training” by David Africa, Neil Shah, Sukrati_Gautam

242

“Advice on interviewing candidates for AI safety fellowships” by beyarkay

243

“Negation Neglect: When models fail to learn negations in training” by harrymayne, Lev McKinney, Owain_Evans

244

“Classifier Context Rot: Monitor Performance Degrades with Context Length” by Fabien Roger, Sam Martin

245

“why pollen allergies?” by bhauth

246

“How to Quit Fandom: Apostasy” by Laiba Rehman

247

“James C. Scott: Seeing Like a State” by Martin Sustrik

248

“How to Reason about Your Health Issues” by Taylor G. Lunt

249

“Benchmarking Real Work” by kaivu, leni, rohuang, zef

250

“A relatively brief explanation of Boltzmann Brains” by Eliezer Yudkowsky

251

“An Introduction to Exemplar Partitioning for Mechanistic Interpretability” by Jessica Rumbelow

252

“A Year Late, Claude Finally Beats Pokémon” by Julian Bradshaw

253

“Incriminating misaligned AI models via distillation” by Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny

254

“The hard core of alignment (is robustifying RL)” by Cole Wyeth

255

“Announcing the Center for Shared AI Prosperity” by Dylan Matthews

256

“Risk reports need to address deployment-time spread of misalignment” by Alex Mallen

257

“Mechanistic estimation for expectations of random products” by Jacob_Hilton

258

“MATS 9 Retrospective & Advice” by beyarkay

259

“Monthly Roundup #42: May 2026” by Zvi

260

[Linkpost] “Don’t be too Clever to Take Obvious Advice” by Hide

261

“Verification-Centric AI” by Raemon

262

“Convergent Abstraction Hypothesis” by Jan_Kulveit

263

“Automated Alignment is Harder Than You Think” by Aleksandr Bowkis, Marie_DB, Jacob Pfau, Geoffrey Irving

264

“The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness” by Charlie Griffin, Patrick Leask

265

“AI #168: Not Leading the Future” by Zvi

266

“Predicting Rare LLM Failures with 30× Fewer Rollouts” by Santiago Aranguri, Francisco Pernice

267

[Linkpost] “Claude is Now Alignment Pretrained” by RogerDearnaley

268

“The primary sources of near-term cybersecurity risk” by lc

269

“Most “inner work” looks like entertainment.” by Chris Lakin

270

[Linkpost] “Apollo Update May 2026” by Marius Hobbhahn

271

“Voters are surprisingly open to talking about AI risk” by less_raichu

272

“Childhood and Education #18: Do The Math” by Zvi

273

“The Owned Ones” by Eliezer Yudkowsky

274

“Optimisation: Selective versus Predictive” by Raymond Douglas

275

“AI companies are already profitable (in the way that matters)” by Yair Halberstadt

276

“The Iliad Intensive Course Materials” by Leon Lang, David Udell, Alexander Gietelink Oldenziel

277

“Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)” by Steven Byrnes

278

“How useful is the information you get from working inside an AI company?” by Buck, Anders Cairns Woodruff

279

“Who Got Breasts First and How We Got Them” by rba

280

“Anthropic’s strange fixation on “hyperstition”” by Simon Lermen

281

“How the AI Labs Make Profit (Maybe, Eventually)” by mabramov

282

“Sawtooth Problems” by Alexander Slugworth

283

“The Darwinian Honeymoon - Why I am not as impressed by human progress as I used to be” by Elias Schmied

284

“International Law Cannot Prevent Extinction Either” by Sausage Vector Machine

285

“Neural Networks learn Bloom Filters” by Alex Gibson

286

“If digital computers are conscious, they are conscious at the hardware level” by cube_flipper

287

“Why You Can’t Use Your Right to Try” by Stephen Martin

288

“A benchmark is a sensor” by Håvard Tveit Ihle, mabynke

289

“Bad Problems Don’t Stop Being Bad Because Somebody’s Wrong About Fault Analysis” by Linch

290

“Write Cause You Have Something to Say” by Logan Riggs

291

“AI is Breaking Two Vulnerability Cultures” by jefftk

292

“Is ProgramBench Impossible?” by frmsaul

293

“Bringing More Expertise to Bear on Alignment” by Edmund Lau, Geoffrey Irving, Cameron Holmes, David Africa

294

[Linkpost] “How to prevent AI’s 2008 moment (We’re hiring)” by felixgaston

295

“Mechanistic estimation for wide random MLPs” by Jacob_Hilton

296

“Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations” by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

297

“Try, even if they have you cold” by WalterL

298

“A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck

299

“AI #167: The Prior Restraint Era Begins” by Zvi

300

“There is no evidence you should reapply sunscreen every 2 hours.” by Hide

301

“Many individual CEVs are probably quite bad” by Viliam

302

“x-risk-themed” by kave

303

“What if LLMs are mostly crystallized intelligence?” by deep

304

“What is Anthropic?” by Zvi

305

“Your rights when flying to Europe” by Yair Halberstadt

306

“Model Spec Midtraining: Improving How Alignment Training Generalizes” by Chloe Li, saraprice, Sam Marks, Jonathan Kutasov

307

“Motivated reasoning, confirmation bias, and AI risk theory” by Seth Herd

308

“Are you looking up?” by Craig Green

309

“The AI Ad-Hoc Prior Restraint Era Begins” by Zvi

310

[Linkpost] “Interpreting Language Model Parameters” by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey

311

“It’s nice of you to worry about me, but I really do have a life” by Viliam

312

“Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI” by Eliezer Yudkowsky

313

“Housing Roundup #15: The War Against Renters” by Zvi

314

“AI Industrial Takeoff — Part 1: Maximum growth rates with current technology” by djbinder

315

“Taking woo seriously but not literally” by Kaj_Sotala

316

“Dairy cows make their misery expensive (but their calves can’t)” by Elizabeth

317

“Measuring the ability of Opus 4.5 to fool narrow classifiers” by Fabien Roger, John Hughes

318

“A new rationalist self-improvement book: the 12 Levers” by spencerg

319

“OpenAI’s red line for AI self-improvement is fundamentally flawed” by Charbel-Raphaël

320

“You Are Not Immune To Mode Collapse” by J Bostock

321

“Primary Care Physicians are Incompetent. We Need More of Them.” by Hide

322

“How Go Players Disempower Themselves to AI” by Ashe Vazquez Nuñez

323

“How much should the ideal person cry wolf?” by KatjaGrace

324

“Conditional misalignment: Mitigations can hide EM behind contextual cues” by Jan Dubiński, Owain_Evans

325

“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

326

“Sanity-checking “Incompressible Knowledge Probes”” by Sturb, LawrenceC

327

“AI unemployment and AI extinction are often the same” by KatjaGrace

328

“AI risk was not invented by AI CEOs to hype their companies” by KatjaGrace

329

“Cyborg evals” by Eye You, frmsaul

330

“To what extent is Qwen3-32B predicting its persona?” by Arjun Khandelwal, ryan_greenblatt, Alex Mallen

331

“Research Sabotage in ML Codebases” by egan

332

“Maybe I was too harsh on deep learning theory (three days ago)” by LawrenceC

333

“Notes on Transformer Consciousness” by slavachalnev

334

“On today’s panel with Bernie Sanders” by David Scott Krueger

335

“No Strong Orthogonality From Selection Pressure” by lumpenspace

336

“Learning zero, and what SLT gets wrong about it” by Dmitry Vaintrob

337

“LLM Style Slop is Absolutely Everywhere” by silentbob

338

“Goblin Mode, 24 Hours Later” by Dylan Bowman

339

“Let Kids Keep More Productivity Gains” by jefftk

340

“The Most Important Charts In The World” by Zvi

341

“llm assistant personas seem increasingly incoherent (some subjective observations)” by nostalgebraist

342

“Not a Paper: “Frontier Lab CEOs are Capable of In-Context Scheming”” by LawrenceC

343

“The Problem in the “Nerd Sniping” xkcd Comic” by peralice

344

“Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers” by Jozdien, Alex Mallen

345

“Contra Binder on far-UVC and filtration” by jefftk

346

“Takes from two months as an aspiring LLM naturalist” by AnnaSalamon

347

“Forecasting is Not Overrated and It’s Probably Funded Appropriately” by Ben S.

348

“On the political feasibility of stopping AI” by David Scott Krueger

349

“Sleeper Agent Backdoor Results Are Messy” by Sebastian Prasanna, Alek Westover, Dylan Xu, Vivek Hebbar, Julian Stastny

350

“LessWrong Shows You Social Signals Before the Comment” by TurnTrout

351

“Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen

352

“Curious cases of financial engineering in biotech” by Abhishaike Mahajan

353

“Update on the Alex Bores campaign” by Eric Neyman

354

“AI companies should publish security assessments” by ryan_greenblatt

355

“In defense of parents” by Yair Halberstadt

356

“The other paper that killed deep learning theory” by LawrenceC

357

“What holds AI safety together? Co-authorship networks from 200 papers” by Anna Thieser

358

““Bad faith” means intentionally misrepresenting your beliefs” by TFD

359

“Retrospective on my unsupervised elicitation challenge” by DanielFilan

360

“Control protocols don’t always need to know which models are scheming” by Fabien Roger

361

“Anthropic spent too much don’t-be-annoying capital on Mythos” by draganover

362

“The paper that killed deep learning theory” by LawrenceC

363

“Forecasting is Way Overrated, and We Should Stop Funding It” by mabramov

364

““Thinkhaven”” by Raemon

365

“Is the Cat Out of the Bag?: Who knows how to make AGI?” by Oliver Sourbut

366

“Against the “Permanent” Underclass” by Marcus Plutowski

367

“Quick Paper Review: “There Will Be a Scientific Theory of Deep Learning”” by LawrenceC

368

“Protecting Cognitive Integrity: Our internal AI use policy (V1)” by Tom DAVID

369

“Methodology for inferring propensities of LLMs” by Olli Järviniemi

370

“vLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Models” by Alan Cooney, Sid Black

371

“What Happens When a Model Thinks It Is AGI?” by josh :), David Africa

372

“Should We Train Against (CoT) Monitors?” by RohanS

373

“If Everyone Reads It, Nobody Dies - Course Launch” by Luc Brinkman, Chris-Lons

374

“Does your AI perform badly because you — you, specifically — are a bad person” by Natalie Cargill

375

“A “Lay” Introduction to “On the Complexity of Neural Computation in Superposition”” by LawrenceC

376

“An Angry Review of Greg Egan’s “Didicosm”” by LawrenceC

377

“Evil is bad, actually (Vassar and Olivia Schaefer)” by plex

378

“Your Supplies Probably Won’t Be Stolen in a Disaster” by jefftk

379

“Community misconduct disputes are not about facts” by mingyuan

380

“Why no new notations since 1960?” by Carl Feynman

381

“Narrow Secret Loyalty Dodges Black-Box Audits” by Alfie Lamerton, Fabien Roger

382

“10 posts I don’t have time to write” by habryka

383

“A taxonomy of barriers to trading with early misaligned AIs” by Alexa Pan

384

“$50 million a year for a 10% chance to ban ASI” by Andrea_Miotti, Alex Amadori, Gabriel Alfour

385

“Automated Deanonymization is Here” by jefftk

386

“Evil is bad, actually (Vassar and Olivia Schaefer callout post)” by plex

387

“10 non-boring ways I’ve used AI in the last month” by habryka

388

“Introducing LinuxArena” by Tyler Tracy, Ram Potham, Nick Kuhn, Myles H

389

“The “Budgeting” Skill Has The Most Betweenness Centrality (Probably)” by JenniferRM

390

“Finetuning Borges” by Linch

391

“9 kinds of hard-to-verify tasks” by Cleo Nardo

392

“How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?” by dx26, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Buck, Julian Stastny

393

“Automating philosophy if Timothy Williamson is correct” by Cleo Nardo

394

“CLR’s Safe Pareto Improvements Research Agenda” by Anthony DiGiovanni

395

“LLMs are about to disrupt algorithmic media feeds” by lsusr

396

“Resources for starting and growing an AI safety org” by Bryce Robertson, Søren Elverlin, Melissa Samworth, jakkdl

397

“Quality Matters Most When Stakes are Highest” by LawrenceC

398

“Feel like a room has bad vibes? The lighting is probably too “spiky” or too blue” by habryka

399

“I did a jhana meditation retreat (in 2024) with Jhourney and it was okay.” by Jules

400

“R1 CoT illegibility revisited” by nostalgebraist

401

“Reevaluating AGI Ruin in 2026” by lc

402

“If It’s Worth Arguing, It’s Worth Arguing With Whiteboards” by Drake Morrison

403

“There are only four skills: design, technical, management and physical” by habryka

404

“Having OCD is like living in North Korea (Here’s how I escaped)” by Declan Molony

405

“Claude knows who you are” by Smaug123

406

“Vladimir Putin’s CEV is probably pretty good” by habryka

407

“Post-mortem’ing my earliest ML research paper, 7 years later” by LawrenceC

408

“If You’ve Never Bought a Tool You Didn’t Need, You’re Not Buying Enough Tools” by Drake Morrison

409

“3” by AnnaJo

410

“Consent-Based RL: Letting Models Endorse Their Own Training Updates” by Logan Riggs

411

“Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability” by Elle Najt, Asa Cooper Stickland, Xander Davies

412

“Let goodness conquer all that it can defend” by habryka

413

“Specialization is a Driver of Natural Ontology” by johnswentworth

414

[Linkpost] “You can only build safe ASI if ASI is globally banned” by Connor Leahy

415

“Beware of Well-Written Posts” by alseph

416

“You Aren’t in Charge of the Overton Window; Politics Is Not Interior Design” by Davidmanheim

417

“Carpathia Day” by Drake Morrison

418

“Do not conquer what you cannot defend” by habryka

419

“What is the Iliad Intensive?” by Leon Lang, Alexander Gietelink Oldenziel, David Udell

420

“The Mirror Test Is Complicated” by J Bostock

421

“Contra Leicht on AI Pauses” by David Scott Krueger (formerly: capybaralet)

422

“Nectome: All That I Know” by Raelifin

423

“Effective Altruism, Seen From Slytherin” by Xylix

424

“Majority Report” by peralice

425

“Current AIs seem pretty misaligned to me” by ryan_greenblatt

426

“Contra Byrnes on UV & Cancer” by HedonicEscalator

427

“Everyone Has a Plan Until They Get Social Pressure To the Face” by Czynski

428

“Mechanisms of Introspective Awareness” by Uzay Macar

429

“Load-Bearing Sincerity: On the Motive Reinforcement Thesis” by Fiora Starlight

430

“Diary of a “Doomer”: 12+ years arguing about AI risk (part 1)” by David Scott Krueger (formerly: capybaralet)

431

“A Retrospective of Richard Ngo’s 2022 List of Conceptual Alignment Projects” by LawrenceC

432

“From personas to intentions: towards a science of motivations for AI models” by David Africa, Jacob Pfau

433

“The Shapley Share of Responsibility?” by Raemon

434

“Who Killed Common Law?” by Benquo

435

“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” by Alex Mallen, ryan_greenblatt

436

“Meaningful Questions Have Return Types” by Drake Morrison

437

“Only Law Can Prevent Extinction” by Eliezer Yudkowsky

438

“AI Safety’s Biggest Talent Gap Isn’t Researchers. It’s Generalists.” by Topaz, agucova, Alexandra Bates, Parv Mahajan

439

“Tomas Bjartur: The Last Prodigy” by Linch

440

“Annoyingly Principled People, and what befalls them” by Raemon

441

“TAPs or it didn’t happen” by Raemon

442

“Returns to intelligence” by RobertM

443

“Daycare illnesses” by Nina Panickssery

444

“The policy surrounding Mythos marks an irreversible power shift” by sil

445

“Talk English, Think Something Else” by J Bostock

446

“Sparse Autoencoders for Single-Cell Models” by Ihor Kendiukhov

447

“Eggs, rooms, puzzles, and talking about AI” by KatjaGrace

448

“Morale” by J Bostock

449

“Your Mom is a Chimera” by michaelwaves

450

“The Blast Radius Principle” by Martin Sustrik

451

“How to make good tea” by RobertM

452

“Catching illicit distributed training operations during an AI pause” by Robi Rahman

453

[Linkpost] “Scott Alexander gentrified my meetup” by dominicq

454

“Pausing AI Is the Best Answer to Post-Alignment Problems” by MichaelDickens

455

“Some thoughts on Nectome’s risk and resilience” by Aurelia

456

“Chocolate Sloths, Tinder, and Moral Backstops” by J Bostock

457

“Dario probably doesn’t believe in superintelligence” by RobertM

458

“The Unintelligibility is Ours: Notes on Chain-of-Thought” by 1a3orn

459

“If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by ryan_greenblatt

460

“Why Control Creates Conflict, and When to Open Instead” by plex

461

“Reproducing steering against evaluation awareness in a large open-weight model” by Thomas Read, Bronson Schoen, Joseph Bloom

462

“Have we already lost? Part 2: Reasons for Doom” by LawrenceC

463

“Model organisms researchers should check whether high LRs defeat their model organisms” by dx26, Sebastian Prasanna, Alek Westover, Vivek Hebbar, Julian Stastny

464

“Anthropic did not publish a “risk discussion” of Mythos when required by their RSP” by RobertM

465

“Some takes on UV & cancer” by Steven Byrnes

466

“Help me launch Obsolete: a book aimed at building a new movement for AI reform” by garrison

467

“Slightly-Super Persuasion Will Do” by Tomás B.

468

“Have we already lost? Part 1: The Plan in 2024” by LawrenceC

469

“Do not be surprised if LessWrong gets hacked” by RobertM

470

“One Week in the Rat Farm” by Philip Harker

471

“101 Humans of New York on the Risks of AI” by Corm

472

“Baking tips” by RobertM

473

“An easy coordination problem?” by KatjaGrace

474

“Excerpts and Notes on Mythos Model Card” by williawa

475

“The effects of caffeine consumption do not decay with a ~5 hour half-life” by kman

476

“You don’t know what you are made of till you’ve been stalked across three countries” by Shoshannah Tekofsky

477

“Why is Flesh So Weak?” by J Bostock

478

“The hard part isn’t noticing when papers are bad, it’s deciding what to do afterwards” by LawrenceC

479

“We can prevent progress! Conceptual clarity, and inspiration from the FDA” by KatjaGrace

480

“AI as a Trojan horse race” by KatjaGrace

481

“My unsupervised elicitation challenge” by DanielFilan

482

“Role-playing vs Self-modelling” by Jan_Kulveit

483

“Elementary Condensation” by Jan

484

“Hedging and Survival-Weighted Planning” by Vaniver

485

“Opus’s Schelling Steganography Has Amplifiable Secrecy Against Weaker Eavesdroppers” by Elle Najt

486

“An Alignment Journal: Features and policies” by JessRiedel, Dan MacKinlay, Luca, Daniel Murfet, david reinstein

487

“Fantasy ideology” by Ninety-Three

488

[Linkpost] “Questions raised about OpenAI leaders’ trustworthiness by the New Yorker” by Remmelt

489

“Claude Mythos System Card Preview” by anaguma

490

“My picture of the present in AI” by ryan_greenblatt

491

[Linkpost] “[Paper] Stringological sequence prediction I” by Vanessa Kosoy

492

“We’re actually running out of benchmarks to upper bound AI capabilities” by LawrenceC

493

“Don’t write for LLMs, just record everything” by RobertM

494

“Contra Nina Panickssery on advice for children” by Sean Herrington

495

“By Strong Default, ASI Will End Liberal Democracy” by MichaelDickens

496

“AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines” by ryan_greenblatt

497

“Paper close reading: “Why Language Models Hallucinate”” by LawrenceC

498

“Ten different ways of thinking about Gradual Disempowerment” by David Scott Krueger (formerly: capybaralet)

499

“11 pieces of advice for children” by Nina Panickssery

500

“Steering Might Stop Working Soon” by J Bostock

501

“Am I the baddie?” by Ustice

502

“Academic Proof-of-Work in the Age of LLMs” by LawrenceC

503

“Positive sum does not mean “win-win”” by loops

504

“Considerations for growing the pie” by Zach Stein-Perlman

505

““Following the incentives”” by David Scott Krueger (formerly: capybaralet)

506

“Chicken-Free Egg Whites” by jefftk

507

“dark ilan” by ozymandias

508

“Mean field sequence: an introduction” by Dmitry Vaintrob, Lauren Greenspan

509

“Democracy Dies With The Rifleman” by Vaniver

510

“The bar is lower than you think” by XelaP

511

“Did Anyone Predict the Industrial Revolution?” by Lost Futures

512

“Why do I believe preserving structure is enough?” by Aurelia

513

“There should be $100M grants to automate AI safety” by Marius Hobbhahn

514

“Sadly, The Whispering Earring” by Dentosal

515

“Common research advice #2: say precisely what you want to say” by LawrenceC

516

“2026: The year of throwing my agency at my health (now with added cyborgism)” by Ruby

517

[Linkpost] “Q1 2026 Timelines Update” by Daniel Kokotajlo, elifland, bhalstead

518

“How social ideas get corrupt” by Kaj_Sotala

519

“The Indestructible Future” by WillPetillo

520

“My most common advice for junior researchers” by LawrenceC

521

“The Practical Guide to Superbabies” by GeneSmith

522

“The Corner-Stone” by Benquo

523

“Systematically dismantle the AI compute supply chain.” by David Scott Krueger (formerly: capybaralet)

524

“The quest for general intelligence is hitting a wall” by Sean Herrington

525

“Intelligence Dissolves Privacy” by Vaniver

526

“Anthropic’s Pause is the Most Expensive Alarm in Corporate History” by Ruby

527

“I’m Suing Anthropic for Unauthorized Use of My Personality” by Linch

528

“Orders of magnitude: use semitones, not decibels” by Oliver Sourbut

529

“Dying with Whimsy” by NickyP

530

“AI for AI for Epistemics” by owencb, Lukas Finnveden

531

“Announcing Doublehaven with Reflections on Humour” by J Bostock

532

“Save the Sun Shrimp!” by Jack

533

“LIMBO: Who We Are, What We Do, and an Exciting High-Impact Funding Opportunity” by faul_sname

534

“Chat, is this sus?” by Tyler Tracy

535

““You Have Not Been a Good User” (LessWrong’s second album)” by habryka

536

“Lesswrong Liberated” by Ronny Fernandez

537

“The Claude Code Source Leak” by Error

538

“Experiments With Opus 4.6’s Fiction” by Tomás B.

539

“Product Alignment is not Superintelligence Alignment (and we need the latter to survive)” by plex

540

“Co-Found Lens Academy With Me. (We have early users and funding)” by Luc Brinkman

541

“Slack in Cells, Slack in Brains” by Mateusz Bagiński

542

“I am definitely missing the pre-AI writing era” by N. Cailie

543

“The state of AI safety in four fake graphs” by Boaz Barak

544

“AI should be a good citizen, not just a good assistant” by Tom Davidson, wdmacaskill

545

“(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom

546

[Linkpost] “Parkinson’s Law of Worry” by Jakub Halmeš

547

“Folie à Machine: LLMs and Epistemic Capture” by DaystarEld

548

“Stop asking “how good is this” to decide between donation opportunities I recommend” by Zach Stein-Perlman

549

“Nick Bostrom: How big is the cosmic endowment?” by Zach Stein-Perlman

550

“Don’t Overdose Locally Beneficial Changes” by Mateusz Bagiński

551

“Stanley Milgram wasn’t pessimistic enough about human nature?” by David Gross

552

[Linkpost] “What if superintelligence is just weak?” by Simon Lermen

553

“Pray for Casanova” by Tomás B.

554

“ControlAI 2025 Impact Report” by Andrea_Miotti, Alex Amadori

555

“AI’s capability improvements haven’t come from it getting less affordable” by Anders Woodruff

556

“Scaffolded Reproducers, Scaffolded Agents” by Mateusz Bagiński

557

“My hobby: running deranged surveys” by leogao

558

“The Terrarium” by Caleb Biddulph

559

“Sen. Sanders (I-VT) and Rep. Ocasio-Cortez (D-NY) propose AI Data Center Moratorium Act” by Matrice Jacobine

560

“Test your best methods on our hard CoT interp tasks” by daria, Riya Tyagi, Josh Engels, Neel Nanda

561

““What Exactly Would An International AI Treaty Say?” Is a Bad Objection” by Davidmanheim