EPISODE · Sep 18, 2025 · 55 MIN

Can Empathy Make AI Honest? | Self–Other Overlap Explained | Am I? | Ep 7

from Am I? · host The AI Risk Network

AI can look aligned on the surface while quietly optimizing for something else. If that’s true, we need tools that shape what models are on the inside, not just what they say.

In this episode, AE Studio’s Cameron and co-host sit down with Mark Carleanu, lead researcher at AE on Self-Other Overlap (SOO). We dig into a pragmatic alignment approach rooted in cognitive neuroscience, new experimental results, and a path to deployment.

What we explore in this episode:

* What “Self-Other Overlap” means and why internals matter more than behavior
* Results: less in-context deception and low alignment tax
* How SOO works and the threat model of “alignment faking”
* Consciousness, identity, and why AI welfare is on the table
* Timelines and risk: sober takes, no drama
* Roadmap: from toy setups to frontier lab deployment
* Reception and critiques, and how we’re addressing them

What “Self-Other Overlap” means and why internals matter more than behavior

SOO comes from empathy research: the brain reuses “self” circuitry when modeling others. Mark generalizes this to AI. If a model’s internal representation of “self” overlaps with its representation of “humans,” then helping us is less in conflict with its own aims. In Mark’s early work, cooperative agents showed higher overlap; flipping goals dropped overlap across actions.

The punchline: don’t just reward nice behavior. Target the internal representations. Capable models can act aligned to dodge updates while keeping misaligned goals intact. SOO aims at the gears inside.

Results: less in-context deception and low alignment tax

In a NeurIPS workshop paper, the team shows an architecture-agnostic way to increase self-other overlap in both LLMs and RL agents. As models scale, in-context deception falls, approaching near-zero in some settings, while capabilities stay basically intact. That’s a low alignment tax.

This is not another brittle guardrail. It’s a post-training nudge that plays well with RLHF and other methods.
Fewer incentives to scheme, minimal performance hit.

👉 Watch the full episode on YouTube for more insights.

How SOO works and the threat model of “alignment faking”

You don’t need to perfectly decode a model’s “self” or “other.” You can mathematically “smush” their embeddings, nudging them closer across relevant contexts. When the model’s self and our interests overlap more, dishonest or harmful behavior becomes less rewarding for its internal objectives.

This squarely targets alignment faking: models that act aligned during training to avoid weight updates, then do their own thing later. SOO tries to make honest behavior non-frustrating for the model, so there’s less reason to plan around us.

Consciousness, identity, and why AI welfare is on the table

There’s a soft echo of Eastern ideas here (dissolving self/other boundaries), but the approach is empirical, first-principles. Identity and self-modeling sit at the core. Mark offers operational criteria for making progress on “consciousness”: predict contents and conditions; explain what things do.

AI is a clean testbed to deconfuse these concepts. If systems develop preferences and valenced experiences, then welfare matters. Alignment (don’t frustrate human preferences) and AI welfare (don’t chronically frustrate models’ preferences) can reinforce each other.

Timelines and risk: sober takes, no drama

Mark’s guess: 3–12 years to AGI (>50% probability), and ~20% risk of bad outcomes conditional on getting there. That’s in line with several industry voices: uncertain, but not dismissive.

This isn’t a doomer pitch; it’s urgency without theatrics. If there’s real risk, we should ship methods that reduce it, and soon.

Roadmap: from toy setups to frontier lab deployment

Short term: firm up results on toy and model-organism setups, showing deception reductions that scale with minimal capability costs.
Next: partner with frontier labs (e.g., Anthropic) to test at scale, on real infra.

Best case: SOO becomes a standard knob alongside RLHF and post-training methods in frontier models. If it plays nicely and keeps the alignment tax low, it’s deployable.

Reception and critiques, and how we’re addressing them

Eliezer Yudkowsky called SOO the right “shape” of solution compared to RLHF alone. Main critiques: Are we targeting the true self-model or a prompt-induced facade? Do models even have a coherent self? Responses: agency and self-models emerge post-training; situational awareness can recruit the true self; simplicity priors favor cross-context compression into a single representation.

Practically, you can raise task complexity to force the model to use its best self-model. AE’s related work suggests self-modeling reduces model complexity; ongoing work aims to better identify and trigger the right representations. Neuroscience inspires, but the argument stands on its own.

Closing Thoughts

If models can look aligned while pursuing something else, we need levers that change what they care about inside. SOO is a simple, testable lever that seems to cut deception without neutering capability.

The stakes are real, but so is the path forward. Make aligned behavior feel natural to the model, not like a mask it wears. That’s how you get systems that help by default.

* 📺 Watch the full episode (YouTube Link)
* 🔔 Subscribe to the YouTube channel
* 🤝 Share this blog with a colleague

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit theairisknetwork.substack.com/subscribe
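
For readers who want a concrete picture of the “smush the embeddings” idea from the section on how SOO works, here is a toy Python sketch. It is an illustration under loud assumptions, not AE Studio’s implementation: the function names `overlap_loss` and `nudge_together` are hypothetical, the vectors are made-up stand-ins for a model’s internal activations on self-referencing and other-referencing inputs, and the sketch moves the vectors directly, whereas a real post-training method would update model weights.

```python
from typing import List, Tuple

Vector = List[float]


def overlap_loss(self_vec: Vector, other_vec: Vector) -> float:
    """Mean squared difference between two representations.

    0.0 means the "self" and "other" vectors coincide (full overlap);
    larger values mean they are further apart.
    """
    if not self_vec or len(self_vec) != len(other_vec):
        raise ValueError("vectors must be non-empty and the same length")
    return sum((s - o) ** 2 for s, o in zip(self_vec, other_vec)) / len(self_vec)


def nudge_together(
    self_vec: Vector, other_vec: Vector, step: float = 0.1
) -> Tuple[Vector, Vector]:
    """Move each vector a fraction `step` of the gap toward the other.

    This is a hand-rolled stand-in for one optimization step on the
    overlap loss; it is NOT how a real model would be trained.
    """
    gaps = [s - o for s, o in zip(self_vec, other_vec)]
    new_self = [s - step * g for s, g in zip(self_vec, gaps)]
    new_other = [o + step * g for o, g in zip(other_vec, gaps)]
    return new_self, new_other


if __name__ == "__main__":
    # Hypothetical activations; real ones would come from a model's hidden layers.
    self_vec = [1.0, 0.0, 2.0]
    other_vec = [0.0, 1.0, 0.0]
    for i in range(3):
        print(i, round(overlap_loss(self_vec, other_vec), 4))
        self_vec, other_vec = nudge_together(self_vec, other_vec)
```

Each call to `nudge_together` shrinks the gap between the two vectors, so the printed loss falls from one step to the next. Nothing here requires knowing what any individual dimension means, which matches the point in the notes that you don’t need to perfectly decode “self” or “other” to pull them closer.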


