EPISODE · Mar 6, 2024 · 49 MIN
Sleeper Agents | Evan Hubinger | EA Global Bay Area: 2024
from EAG Talks · host Aaron Bergman
If an AI system learned a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? That's the question that Evan and his coauthors at Anthropic sought to answer in their work on "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", which Evan will be discussing. Evan Hubinger leads the new Alignment Stress-Testing team at Anthropic, which is tasked with red-teaming Anthropic's internal alignment techniques and evaluations. Prior to joining Anthropic, Evan was a Research Fellow at the Machine Intelligence Research Institute, where he worked on a variety of theoretical alignment research, including "Risks from Learned Optimization in Advanced Machine Learning Systems". Evan will be talking about the Anthropic Alignment Stress-Testing team's first paper, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training". Watch on YouTube: https://www.youtube.com/watch?v=BgfT0AcosHw