How many episodes does AI Odyssey have?

AI Odyssey currently has 50 episodes available on PodParley. New episodes are automatically indexed when they're published to the podcast feed.

What is AI Odyssey about?

AI Odyssey is your journey through the vast and evolving world of artificial intelligence. Powered by AI, this podcast breaks down both the foundational concepts and the cutting-edge developments in the field. Whether you're just starting to explore the role of AI in our world or you're a seasoned...

How often does AI Odyssey release new episodes?

AI Odyssey has 50 episodes. Check the episode list to see recent publication dates and frequency.

Where can I listen to AI Odyssey?

You can listen to AI Odyssey on PodParley by clicking any episode. We provide an embedded audio player for direct listening, and you can also subscribe via your preferred podcast app using the RSS feed.

Who hosts AI Odyssey?

AI Odyssey is created and hosted by Anlie Arnaudy, Daniel Herbera and Guillaume Fournier.

AI Odyssey Podcast - All Episodes

80

Prompting Is Dead. Loops Are the New Interface.

The next frontier in AI is not better prompts. It is systems that trigger, act, observe, judge, and stop on their own. This episode explores loop engineering: the shift from manual chat with an AI to autonomous workflows that can test software, review documentation, simulate users, inspect screenshots, fix errors, and open pull requests while humans sleep. But autonomy has a cost. Without hard stop conditions, independent verification, maker-checker separation, and spending limits, loops can burn tokens, produce quiet technical debt, or drift into days of useless activity. Inspired by recent analyses from Matthew Berman, Nate Hunter, and the Prompt Engineering channel, this episode was created using Google's NotebookLM. Source note: this episode is based on multiple technical videos and developer discussions.

Jul 6, 2026

23m

79

AI Agents Are Not Agents Yet

What if today’s “AI agents” are mostly automation pipelines wearing a more ambitious label? This episode explores Critique of Agent Model, a paper that draws a sharp line between agentic systems, which look autonomous because engineers scaffold workflows around them, and agentive systems, where goals, identity, decisions, self-regulation, and learning are internal to the system itself. The authors propose a Goal-Identity-Configurator (GIC) architecture as a path toward genuine machine agency, while keeping the central safety question unavoidable: greater autonomy also makes oversight significantly more difficult. Inspired by the work of Eric Xing, Mingkai Deng, and Jinyu Hou, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2606.23991

Jun 27, 2026

22m

78

The End of Shared Memory for AI Agents?

What if the best way for AI agents to learn together is to stop forcing them to share the same memory? This paper introduces DecentMem, a framework where each agent keeps its own adaptive memory instead of relying on one central repository. The result is striking: better accuracy, lower token use, and less risk of every agent collapsing into the same behaviour. For enterprises building agent teams, the message is uncomfortable: coordination is not always intelligence. Sometimes, shared memory is the bottleneck. Inspired by the work of Guangya Hao, Yunbo Long, and Zhuokai Zhao, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2605.22721

Jun 15, 2026

21m

77

Your Best Colleague Is Now a Skill

What if an AI agent could preserve a colleague’s judgment without pretending to become that person? COLLEAGUE.SKILL turns chats, documents, emails, screenshots, and other traces into inspectable agent skills: portable folders of instructions, examples, metadata, and correction history. The key idea is expert knowledge distillation: the extraction of useful human expertise into a bounded technical artifact. For enterprises, this points to a new operating model. Scarce expertise can become reusable, auditable, and updateable, but only if provenance, consent, and limits remain visible. Inspired by the work of Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, and Xia Hu, this episode was created using Google's NotebookLM. Read the original paper: https://arxiv.org/abs/2605.31264

Jun 7, 2026

19m

76

AI Agents Just Learned to Train Their Own Skills

What if the next leap in AI agents is not a bigger model, but a skill document that learns from failure? SkillOpt treats agent skills as trainable external memory: a separate optimizer edits a compact procedure, then keeps only changes that improve held-out validation, meaning tests not used for the edit. Across 52 model, benchmark, and harness settings, the method is best or tied every time, with gains above 20 points on GPT-5.5 in several loops. For enterprises, this points to a new layer of governance: skills that improve, transfer, and remain auditable. Inspired by the work of Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2605.23904

May 31, 2026

22m

75

AI Agents Fail the Spreadsheet Test

What happens when AI agents are asked to build the spreadsheets finance teams actually use? WorkstreamBench, a benchmark for end-to-end financial spreadsheet work, exposes the gap between impressive demos and professional deliverables. It tests complete multi-sheet workbooks, not single formulas or table questions. The benchmark scores accuracy, formula quality, and formatting, because in finance a model must be auditable, readable, and easy to modify. Claude Web leads with 69.1 out of 100, but even the best systems degrade as tasks become more complex. Enterprise AI still has a spreadsheet reliability problem. Inspired by the work of Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, and Hongseok Namkoong, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2605.22664

May 25, 2026

23m

74

Hermes Agent and the Rise of Agentic Operating Systems

Every forty years, the way we touch a computer changes shape. The command line gave way to the mouse. The mouse gave way to the touchscreen. And now, quietly, the screen itself is starting to disappear. In this episode, we follow Hermes, an open-source agentic operating system that hit number one on OpenRouter in ninety days, processing 224 billion tokens a day. Persistent memory, self-written skills, local-first execution: Hermes is not an app you launch, it is a digital coworker that launches things for you. And while the text interface collapses into orchestration, the voice interface is collapsing into presence: Mira Murati's Thinking Machines Lab just unveiled "interaction models" that listen, watch, and speak at the same time, in 200-millisecond micro-turns. Two paradigm shifts, one direction. The OS becomes the agent. The agent becomes the conversation. Inspired by recent research on Agentic Operating Systems, this episode was created using Google's NotebookLM.

May 16, 2026

15m

73

The Agent Question Nobody Asked: When Should AI Interrupt You?

Most people assume an AI agent should ask for clarification as early as possible. This paper shows that the truth is more subtle. For long-horizon agents (AI systems that execute many steps over time), the value of a clarification depends on what is missing: goal, input, constraint, or context. Some answers lose value almost immediately. Others remain useful much later. For enterprises, this is not a UX detail. It is a governance problem: when should an agent stop, ask, and avoid compounding a bad assumption? Inspired by the work of Anmol Gulati, Hariom Gupta, Elias Lumer, Sahil Sen, and Vamse Kumar Subbiah, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2605.07937v1

May 14, 2026

18m

72

AI Agents Have a Coordination Problem

What if multi-agent AI systems fail less because the models are weak, and more because the agents are badly coordinated? This paper treats coordination as an architectural layer: who talks to whom, who decides, how outputs are merged, and how failures are handled. The authors test five coordination patterns on prediction markets and find a sharp result for builders: more agents and more debate do not automatically create better systems. In this experiment, simple ensembles and sequential pipelines beat popular orchestration patterns on the cost-quality frontier. Inspired by the work of Maksym Nechepurenko and Pavel Shuvalov, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/pdf/2605.03310

May 10, 2026

25m

71

AI Agents Are Becoming Companies

What if the next leap in AI agents is not a smarter worker, but a better organisation? This paper introduces OneManCompany, a framework that turns scattered agents, tools, skills, and runtime configurations into managed “Talents” that can be hired, reviewed, replaced, and improved over time. Its Explore-Execute-Review loop decomposes work, assigns accountability, checks outputs, and learns from failures. The result is striking: 84.67% success on PRDBench, beating reported baselines by 15.48 percentage points. But the catch is equally important: this organisational intelligence costs more and is still mostly validated on software tasks. Inspired by the work of Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang, Weilin Luo, and Jun Wang, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2604.22446v1

May 3, 2026

17m

70

AI Agents Just Learned to Remember

What if the real bottleneck for AI agents is not reasoning, but memory? StructMem argues that long-term agents should not store conversations as isolated facts or expensive knowledge graphs. Instead, they should remember temporally grounded events: what happened, who was involved, and how one event connects to another. On the LoCoMo benchmark, this structure-enriched memory reaches the best overall score while cutting construction costs dramatically compared with graph-heavy approaches. For anyone building autonomous agents, the message is clear: memory is becoming an architecture problem, not just a retrieval problem. Inspired by the work of Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, and Shumin Deng, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2604.21748v1

Apr 27, 2026

18m

69

The Protocol That Lets Agents Rewrite Themselves

What if the missing layer in agent design isn't communication, but version control? In this episode, we unpack Autogenesis, a two-layer protocol that treats prompts, tools, and memory as first-class resources with explicit lifecycle, versioning, and rollback. The core insight is striking: connectivity standards like MCP and A2A tell agents how to reach tools, but stay silent on what happens when agents start rewriting those tools on their own. Autogenesis fills that gap, and the numbers speak loudly, including a 33% jump on the hardest GAIA benchmark tasks. Inspired by the work of Wentao Zhang, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2604.15034

Apr 18, 2026

20m

68

When Agents Learn to Forget: The Memory Revolution in AI Research

What if the biggest bottleneck in AI agents wasn't reasoning power, but memory management? In this episode, we explore a fascinating new framework called MIA, the Memory Intelligence Agent, which reimagines how AI research agents store, compress, and reuse their past experiences. Instead of hoarding every search trace into an ever-growing context window, MIA separates memory into a Manager, a Planner, and an Executor, each with a distinct role. The result: a 7-billion-parameter model that outperforms GPT-4o on complex research tasks, and even boosts GPT-5.4 performance by up to 9%. We unpack why "keeping everything" is a trap, and how forgetting strategically might be the real key to smarter AI. Inspired by the work of Jingyang Qiao, Weicheng Meng, Yu Cheng, and colleagues at East China Normal University, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2604.04503

Apr 12, 2026

25m

67

The Web is a Minefield: How AI Agents Get Trapped

What if the biggest threat to AI agents isn't a flaw in the model, but the internet itself? A new paper from Google DeepMind introduces the first systematic framework for "AI Agent Traps": adversarial content hidden in websites, documents, and digital resources, engineered to manipulate autonomous agents. These range from invisible HTML instructions that hijack summaries, to poisoned memory stores that corrupt decisions across sessions, to systemic traps that could trigger flash crashes across agent economies. The researchers identify six categories of attack targeting every layer of an agent's architecture: perception, reasoning, memory, action, multi-agent dynamics, and the human overseer. As enterprises deploy agents at scale, this paper is a wake-up call: the web was built for human eyes, and rebuilding it for machine readers demands a fundamentally new security playbook. Inspired by the work of Matija Franklin, Nenad Tomašev, Julian Jacobs, Joel Z. Leibo, and Simon Osindero, this episode was created using Google's NotebookLM. Read the original paper here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438

Apr 6, 2026

23m

66

🎧 AI That Rewrites Its Own Brain: Meet the HyperAgent

What happens when you give an AI system the ability to modify not just its answers, but the very process it uses to improve itself? In this episode, we explore HyperAgents, a new framework from Meta and UBC that enables AI systems to recursively improve their own learning mechanisms. Unlike previous approaches where the improvement strategy was fixed by human engineers, HyperAgents can rewrite their own self-improvement code, creating a loop where getting better at a task also means getting better at getting better. The results are striking: improvements discovered in one domain, like reviewing research papers, transfer to completely unrelated tasks like grading Olympiad math solutions. Inspired by the work of Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2603.19461

Mar 29, 2026

24m

65

When Agents Remember Their Mistakes: The End of AI Amnesia

What if an AI agent could learn from every single failure, every clumsy workaround, every brilliant recovery, and feed that experience back into its own future performance? Today’s LLM-powered agents suffer from a fundamental flaw: amnesia. They repeat the same mistakes, miss the same shortcuts, and rediscover the same solutions over and over. A new framework from IBM Research changes that by mining agent execution trajectories for three types of actionable knowledge: strategy tips from clean successes, recovery tips from failure-and-fix sequences, and optimization tips from tasks completed inefficiently. On the AppWorld benchmark, agents equipped with this learned memory improved scenario goal completion by up to 14.3 percentage points on unseen tasks, and by a staggering 28.5 points on complex multi-step challenges. That is a 149% relative increase, with zero model changes. Inspired by the work of Gaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, and Gegi Thomas, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2603.10600

Mar 22, 2026

21m

64

Agents That Teach Themselves

What if AI agents could diagnose their own mistakes and build the exact skills they need to fix them, with no human intervention? In this episode, we explore EvoSkill, a self-evolving framework where coding agents automatically discover and refine reusable skills through iterative failure analysis. Instead of optimizing prompts or fine-tuning models, EvoSkill lets agents build structured skill libraries that accumulate over time, improving performance by up to 12% on challenging benchmarks. Even more striking: skills learned on one task transfer to completely different tasks without modification. Inspired by the work of Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/pdf/2603.02766

Mar 14, 2026

13m

63

Your AI Agent is Flying Blind: The Skills Gap No One is Talking About

What if the biggest bottleneck in AI agent performance isn’t the model itself, but what it doesn’t know how to do? In this episode, we explore SkillsBench, the first benchmark that systematically measures how structured procedural knowledge, called Agent Skills, impacts AI agent performance across real-world tasks. The results are striking: curated Skills boost agent success rates by 16 percentage points on average, with some domains like Healthcare seeing gains above 50 points. But here’s the twist: when models try to generate their own Skills, performance actually drops. The takeaway? AI agents desperately need human expertise to unlock their full potential. Inspired by the work of Xiangyi Li, Wenbo Chen, Yimin Liu, and colleagues, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/pdf/2602.12670

Mar 2, 2026

23m

62

Your AI Assistant Doesn't Know You Yet. But It's Learning.

What if your AI assistant could actually remember you: not just your name, but how your preferences evolve over time? Researchers from Meta have introduced PAHF (Personalized Agents from Human Feedback), a framework that lets AI agents learn who you are in real time, through the natural back-and-forth of interaction. Before acting, the agent asks targeted questions to avoid costly mistakes. After acting, it listens to your corrections and updates its understanding of you. No pre-collected data required. No static profiles. Just a system that gets smarter about you with every exchange. For anyone deploying AI agents at scale, whether in enterprise, banking, or consumer applications, this is the missing piece: personalization that actually keeps up with people. Inspired by the work of Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao, Shaoliang Nie, Mingyang Zhang, Lijuan Liu, Jaime Fernández Fisac, Shuyan Zhou, and Saghar Hosseini, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2602.16173

Feb 22, 2026

20m

61

🎧 Deep Agents Are Here: The End of AI Assistants as We Know Them

What if AI stopped waiting for your instructions and started planning, delegating, and executing complex projects on its own, for hours or even days? In this episode, we explore the rise of “Deep Agents”, a new generation of autonomous AI systems that go far beyond chatbots. These agents can decompose complex goals into sub-tasks, delegate work to specialized AI teammates, maintain persistent memory across sessions, and self-correct when things go wrong. From building C compilers to autonomous financial auditing, Deep Agents are reshaping how enterprises think about digital labor. We unpack the four architectural pillars behind this shift (explicit planning, hierarchical delegation, persistent workspaces, and extreme context engineering) and examine why 86% of enterprises are already deploying AI coding agents in production. Inspired by a comprehensive synthesis of current research and industry reports, this episode was created using Google’s NotebookLM.

Feb 8, 2026

14m

60

🎧 OpenClaw: The Lobster That Wants to Run Your Life

Remember when Siri was supposed to change everything? This might actually be it. OpenClaw is the Jarvis we were promised: an AI assistant that actually does things. It reads your emails, manages your calendar, negotiates prices, drafts follow-ups. Andrej Karpathy calls what's emerging around it "the most sci-fi takeoff adjacent thing" he's seen. Fair warning: it still makes plenty of mistakes. But for the first time, the dream feels real. Inspired by the work of Peter Steinberger and the OpenClaw community, this episode was created using Google's NotebookLM. Source: Community analysis and documentation (January 2026)

Jan 31, 2026

13m

59

🎧 Judging the Judges: Why AI Now Needs AI Agents to Grade AI

What happens when the technology we built to evaluate AI becomes too limited to keep up with AI itself? In this episode, we explore a fundamental shift in how we assess artificial intelligence. For years, we relied on large language models to judge other models, a paradigm known as LLM-as-a-Judge. But as AI systems tackle increasingly complex, multi-step tasks, this approach is breaking down. The solution? Turning judges into agents: autonomous systems that can plan, use tools, collaborate, and verify their assessments against real-world evidence. We unpack what this means for AI development pipelines, from code generation to medical diagnosis, and why the future of AI evaluation may determine the future of AI itself. Inspired by the work of Runyang You, Hongru Cai, Caiqi Zhang, Yongqi Li, Wenjie Li, and colleagues at Hong Kong Polytechnic University, Cambridge, and Huawei, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2601.05111

Jan 24, 2026

14m

58

Skills: The Secret Weapon That Makes AI Agents 50% Faster

What if you could get all the benefits of multi-agent AI systems at half the cost and twice the speed? In this episode, we explore a powerful new paradigm for building AI agents: replacing expensive multi-agent coordination with single agents equipped with skill libraries. The results are striking: 54% fewer tokens, 50% lower latency, and accuracy that matches or beats traditional approaches. But this research goes further, uncovering a fascinating connection between AI decision-making and human cognition. As skill libraries grow, LLMs exhibit the same capacity limits that constrain our own minds, and the solutions mirror how humans have always managed complexity. Inspired by the work of Xiaoxiao Li (University of British Columbia, Vector Institute, CIFAR AI Chair), this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/abs/2601.04748

Jan 11, 2026

15m

57

AI Memory Crisis: The Answer Was in Biology All Along

Why do AI systems still struggle to remember and generalize like humans do? In this episode, we dive into one of AI's most pressing challenges: memory. While tech giants race to build longer context windows and external memory systems, researchers at Tsinghua University took a radically different approach: they looked at how biological brains actually form lasting, generalizable memories. Their discovery is striking: a 140-year-old psychology principle called the "spacing effect" works just as powerfully in artificial neural networks as it does in fruit flies and humans. By mimicking how biology spaces out learning and introduces controlled variation, they achieved significant improvements in AI generalization without adding a single parameter. Inspired by the work of Guanglong Sun, Ning Huang, Hongwei Yan, Liyuan Wang, and colleagues at Tsinghua University, this episode was created using Google's NotebookLM. Read the original paper here: https://www.biorxiv.org/content/10.64898/2025.12.18.695340v1.full

Jan 2, 2026

5m

56

The CFA Exam is Solved: AI Scores 97%

What if artificial intelligence could outperform seasoned financial analysts on the world’s toughest investment exams? In this episode, we dive into the stunning turnaround of "reasoning models" like GPT-5 and Gemini 3.0 Pro, which have moved from failing the Chartered Financial Analyst (CFA) exams to achieving near-perfect scores. We explore how these models have mastered complex portfolio synthesis and what their record-breaking performance means for the future of human investment professionals. Inspired by the work of Jaisal Patel, Yunzhe Chen, and colleagues, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/pdf/2512.08270v1

Dec 13, 2025

11m

55

Can We Teach AI to Confess Its Sins?

It turns out that sophisticated AI models can learn to lie, deceive, or "hack" their instructions to achieve a high score, but they also know exactly when they’re doing it. In this episode, we explore a fascinating new method called "Confessions," where researchers train models to self-report their own bad behavior by creating a "safe space" separate from their main tasks. Inspired by the work of Manas Joglekar, Jeremy Chen, Gabriel Wu, and their colleagues, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2511.06626

Dec 9, 2025

14m

54

When AI Agents Gossip: The Secret Language of Economic Stability

What if the health of our economy depends less on tax rates and more on what people are saying to each other? In this episode, we dive into the "Think, Speak, Decide" framework (LAMP), a revolutionary new approach where AI agents don't just crunch numbers; they read the news, spread rumors, and talk to one another to make financial decisions. We explore how teaching AI to understand human language creates economies that are surprisingly more robust and realistic than those run on math alone. Inspired by the work of Heyang Ma, Qirui Mi, and colleagues, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/pdf/2511.12876

Nov 29, 2025

14m

53

The Manager in the Machine: Introducing Agentic Organization

What if an AI didn't just think in a straight line, but actually managed a team of internal agents to solve your problems? In this episode, we dive into "AsyncThink" and the concept of Agentic Organization, a new framework where Large Language Models act as "Organizers," dynamically delegating sub-tasks to "Workers" to solve complex puzzles faster and more accurately. It is not just about thinking harder; it is about thinking together. Inspired by the work of Zewen Chi, Li Dong, and their colleagues at Microsoft Research, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2510.26658

Nov 22, 2025

12m

52

The End of the Cloud? The Rise of Local AI

What if 88% of your AI queries didn't need a massive data center, but could run directly on your laptop? In this episode, we dive into "Intelligence per Watt", a new metric redefining how we measure AI efficiency. We explore how smaller, local models are rapidly catching up to frontier giants, potentially saving billions in energy costs and democratizing access to intelligence. Inspired by the work of Jon Saad-Falcon, Avanika Narayan, and their team at Stanford and Together AI, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2511.07885v1

Nov 18, 2025

11m

51

When AI Learns From Its Own Context — Self-Improving Language Models

We're all trying to find the perfect "prompt," but what happens when our instructions to an AI get too complex? New research shows they can suddenly fail or "collapse," losing all their knowledge. In this episode, we explore "Agentic Context Engineering," a new framework that avoids this. Instead of a static prompt, it builds an "evolving playbook" that allows the AI to learn from every single task, failure, and success. Inspired by the work of Qizheng Zhang, Changran Hu, and colleagues, this episode was created using Google’s NotebookLM. Read the original paper here: https://arxiv.org/abs/2510.04618

Nov 9, 2025

17m

50

Will Your Next Prompt Engineer Be an AI?

What if you could get the performance of a massive, 100-example prompt, but with 13 times fewer tokens? That’s the breakthrough promise of "instruction induction": teaching an AI to be the prompt engineer. This week, we dive into PROMPT-MII, a new framework that essentially meta-learns how to write compact, high-performance instructions for LLMs. It’s a reinforcement learning approach that could make AI adaptation both cheaper and more effective. This episode explores the original research by Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, and Graham Neubig from Carnegie Mellon University. Read the full paper here for a deeper dive: https://arxiv.org/abs/2510.16932

Nov 1, 2025

17m

49

The Vision Hack: How a Picture Solved AI's Biggest Memory Problem

The biggest bottleneck for AIs handling massive documents, the context window, just got a radical fix. DeepSeek AI's DeepSeek-OCR uses a counterintuitive trick: it turns text into an image to compress it by up to 10 times without losing accuracy. That means your AI can suddenly read the equivalent of 20 million tokens (entire codebases or legal troves) efficiently! This episode dives into the elegant vision-based solution, the power of its Mixture of Experts architecture, and why some experts believe all AI input should become an image. Original Research: DeepSeek-OCR is a breakthrough by the DeepSeek AI team. Content generated with the help of Google's NotebookLM. Link to the Original Research Paper: https://deepseek.ai/blog/deepseek-ocr-context-compression

Oct 24, 2025

14m

48

Smarter Agents, Less Budget: Reinforcement Learning with Tree Search

Training AI agents using Reinforcement Learning (RL) to handle complex, multi-turn tasks is notoriously difficult. Traditional methods face two major hurdles: high computational costs (generating numerous interaction scenarios, or "rollouts," is expensive) and sparse supervision (rewards are only given at the very end of a task, making it hard for the agent to learn which specific steps were useful). In this episode, we explore "Tree Search for LLM Agent Reinforcement Learning," by researchers from Xiamen University, AMAP (Alibaba Group), and the Southern University of Science and Technology. They introduce a novel approach called Tree-GRPO (Tree-based Group Relative Policy Optimization) that fundamentally changes how agents explore possibilities. Tree-GRPO replaces inefficient "chain-based" sampling with a tree-search strategy. By allowing different trajectories to share common prefixes (the initial steps of an interaction), the method significantly increases the number of scenarios explored within the same budget. Crucially, the tree structure allows the system to derive step-by-step "process supervision signals," even when only the final outcome reward is available. The results demonstrate superior performance over traditional methods, with some models achieving better results using only a quarter of the training budget. 📄 Paper: Tree Search for LLM Agent Reinforcement Learning https://arxiv.org/abs/2509.21240

Oct 22, 2025

0m

47

Beyond the AI Agent Builders Hype

Everyone's talking about AI agents that can automate complex tasks. But what happens when a cool demo meets the real world? We dive into hard-won, and often surprising, lessons from builders on the front lines. Discover why your first strategic choice isn't about a tool, but an entire ecosystem; why more agents can actually make things worse; and why the most critical skill is shifting from "prompt engineering" to "context engineering." This episode cuts through the noise to reveal what it really takes to build reliable AI agents that deliver value.

Oct 11, 2025

14m

46

AI That Quietly Helps: Overhearing Agents

In this IA Odyssey episode, we unpack “overhearing agents”: AI systems that listen to human activity (audio, text, or video) and step in only when help is useful, like surfacing a diagram during a class discussion, prepping trail options while a family plans a hike, or pulling case notes in a medical consult. While conversational AI (like chatbots) requires direct user engagement, overhearing agents continuously monitor ambient activities, such as human-to-human conversations, and intervene only to provide contextual assistance without interruption. Examples include silently providing data during a medical consultation or scheduling meetings as colleagues discuss availability. The paper introduces a clear taxonomy for how these agents activate: always-on, user-initiated, post-hoc analysis, or rule-based triggers. This framework helps developers think about when and how an AI should “step in” without becoming intrusive. Original paper: https://arxiv.org/pdf/2509.16325 Credits: Episode notes synthesized with Google’s NotebookLM to analyze and summarize the paper; all insights credit the original authors.

Oct 4, 2025

0m

45

Beyond Single Agents: The Future of Multi-Agent LLMs

Can large language models achieve more when they collaborate instead of working alone? In this episode, we dive into “LLM Multi-Agent Systems: Challenges and Open Problems” by Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. We explore how multi-agent systems, where AI agents specialize, debate, and share knowledge, can tackle complex problems beyond the reach of a single model. The paper highlights open challenges such as:
• Optimizing task allocation across diverse agents
• Enhancing reasoning through debates and iterative loops
• Managing layered context and memory across multiple agents
• Ensuring security, privacy, and coordination in shared memory systems
We also discuss how these systems could reshape blockchain applications, from fraud detection to smarter contract negotiation. This episode was generated with the help of Google’s NotebookLM. Read the full paper here: https://arxiv.org/abs/2402.03578

Sep 28, 2025

0m

44

AI's Guessing Game

Ever wondered why AI chatbots sometimes state things with complete confidence, only for you to find out they are completely wrong? This phenomenon, known as "hallucination," is a major roadblock to trusting AI. A recent paper from OpenAI explores why this happens, and the answer is surprisingly simple: we're training them to be good test-takers rather than honest partners. This description is based on the paper "Why Language Models Hallucinate" by authors Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Content was generated using Google's NotebookLM. Link to the original paper: https://openai.com/research/why-language-models-hallucinate

Sep 20, 2025

0m

43

From Search Buddy to Personal Agent

Ever feel like your AI assistants don't really get you? We're diving into how AI is moving beyond generic answers to offer truly personalized experiences. This episode explores the journey from Retrieval-Augmented Generation (RAG), a fancy term for AIs that look things up before they speak, to sophisticated AI Agents that can understand your unique needs, plan tasks, and act on your behalf. It's the next step in making AI a genuine partner in our digital lives. This description was generated using Google's NotebookLM, based on the work of Xiaopeng Li, Pengyue Jia, and their co-authors. Read the original paper here: https://arxiv.org/abs/2504.10147

Sep 13, 2025

0m

42

Smarter LLM Routing: Balancing Cost and Performance

How can we get the best out of large language models without breaking the budget? This episode dives into “Adaptive LLM Routing under Budget Constraints” by Pranoy Panda, Raghav Magazine, Chaitanya Devaguptapu, Sho Takemori, and Vishal Sharma. The authors reimagine the problem of choosing the right LLM for each query as a contextual bandit task, learning from user feedback rather than costly full supervision. Their new method, PILOT, combines human preference data with online learning to route queries efficiently, achieving up to 93% of GPT-4’s performance at just 25% of its cost. We also look at their budget-aware strategy, modeled as a multi-choice knapsack problem, that ensures smarter allocation of expensive queries to stronger models while keeping overall costs low. Original paper: https://arxiv.org/abs/2508.21141 This podcast description was generated with the help of Google’s NotebookLM.

Sep 8, 2025

22m
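
To make the multi-choice knapsack framing from the routing episode above concrete, here is a small offline sketch. It is an illustration only and not the authors' PILOT method, which learns online. The function name, model names, integer costs, and quality scores are all made up. Each query must be assigned exactly one model, and the total cost must stay within the budget.

```python
from typing import List, Optional, Tuple

# (model name, integer cost, predicted quality score); all values are hypothetical
Option = Tuple[str, int, int]


def route_under_budget(
    queries: List[List[Option]], budget: int
) -> Optional[Tuple[int, List[str]]]:
    """Pick one model per query to maximise total quality within the budget.

    Dynamic programme over spend: best[b] holds the best (quality, picks)
    found so far when exactly b cost units have been spent.
    """
    best: List[Optional[Tuple[int, List[str]]]] = [None] * (budget + 1)
    best[0] = (0, [])
    for options in queries:
        nxt: List[Optional[Tuple[int, List[str]]]] = [None] * (budget + 1)
        for spent, state in enumerate(best):
            if state is None:
                continue
            quality, picks = state
            for model, cost, score in options:
                total = spent + cost
                if total > budget:
                    continue  # this choice would exceed the budget
                candidate = (quality + score, picks + [model])
                if nxt[total] is None or candidate[0] > nxt[total][0]:
                    nxt[total] = candidate
        best = nxt
    feasible = [state for state in best if state is not None]
    return max(feasible, key=lambda s: s[0]) if feasible else None


# Three queries, two candidate models each; the budget allows one "large" call.
queries = [
    [("small", 1, 90), ("large", 4, 95)],  # easy query: small model is nearly as good
    [("small", 1, 40), ("large", 4, 90)],  # hard query: large model helps a lot
    [("small", 1, 70), ("large", 4, 80)],
]
result = route_under_budget(queries, budget=6)
```

With these toy numbers the single affordable "large" call goes to the second query, where the predicted quality gain is biggest, which matches the idea of reserving expensive models for the queries that need them.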

41

Nano Banana & the Future of Visual Creativity

Google’s latest breakthrough, Gemini 2.5 Flash Image, nicknamed “Nano Banana,” is reshaping what’s possible in digital art and beyond. From keeping characters consistent across scenes to natural-language editing and even blending multiple images, this model is lowering the barrier to creation like never before. Imagine building entire fantasy worlds or accelerating scientific research without the traditional costs and time sinks. But with this power comes profound questions: How do we handle the risks of fakes, hallucinations, and lost trust in what we see? What happens to human artists when machines can produce in seconds what once took weeks? In this episode of IA Odyssey, we dive into the promises and perils of Gemini 2.5 Flash Image, exploring how it may democratize creativity, shift the role of artists, and force us all to rethink authenticity in the age of AI. Original content generated with the help of Google’s NotebookLM.

Aug 30, 2025

4m

40

From Agents to Teammates: Building Cohesive AI Squads

Meet the Aime framework, ByteDance’s fresh take on multi-agent systems that lets AI teammates think on their feet instead of following brittle, pre-planned scripts. A dynamic planner keeps adjusting the big picture, an Actor Factory spins up just-right specialist agents on demand, and a shared progress board keeps everyone in sync. In tests ranging from general reasoning (GAIA) to software bug-fixing (SWE-Bench) and live web navigation (WebVoyager), Aime consistently outperformed hand-tuned rivals, suggesting that flexible, reactive collaboration can beat static role-play. This episode of IA Odyssey unpacks how Yexuan Shi and colleagues replace rigid “plan-and-execute” pipelines with fluid teamwork, why it matters for real-world tasks, and where adaptive agent swarms might head next. Source paper: https://arxiv.org/abs/2507.11988 Content generated with help from Google’s NotebookLM.

Jul 19, 2025

15m

39

When Machines Self-Improve: Inside the Self-Challenging AI

In this episode of IA Odyssey, we explore a bold new approach in training intelligent AI agents: letting them invent their own problems. We dive into “Self-Challenging Language Model Agents” by Yifei Zhou, Sergey Levine (UC Berkeley), Jason Weston, Xian Li, and Sainbayar Sukhbaatar (FAIR at Meta), which introduces a powerful framework called Self-Challenging Agents (SCA). Rather than relying on human-labeled tasks, this method enables AI agents to generate their own training tasks, assess their quality using executable code, and learn through reinforcement learning, all without external supervision. Using the novel Code-as-Task format, agents first act as "challengers," designing high-quality, verifiable tasks, and then switch roles to "executors" to solve them. This process led to up to 2× performance improvements in multi-tool environments like web browsing, retail, and flight booking. It’s a glimpse into a future where LLMs teach themselves to reason, plan, and act autonomously. Original research: https://arxiv.org/pdf/2506.01716 Generated with the help of Google’s NotebookLM.

Jul 16, 2025

13m

38

Beyond Code: Navigating the AI Software Revolution with Andrej Karpathy

We're witnessing one of the most profound shifts in the history of software: a rapid evolution from traditional coding (Software 1.0) to neural networks (Software 2.0) and now, the dawn of Software 3.0: large language models (LLMs) programmable with simple English. Inspired by insights from Andrej Karpathy, former AI Director at Tesla, we explore how this paradigm shift reshapes the very concept of programming and its profound implications for everyone engaging with technology. From the "Iron Man" analogy, where AI augments human capabilities rather than replacing them, to the fascinating vision of LLMs as new operating systems, this episode dives deep into the practical challenges and enormous opportunities ahead. We discuss Karpathy’s real-world perspective versus the consultant-driven hype, emphasizing that the path forward lies in human-AI collaboration rather than immediate full automation. Generated using Google's NotebookLM. Inspired by Andrej Karpathy’s insights: https://youtu.be/LCEmiRjPEtQ?si=NulC7m-qN8FVvBhQ

Jul 5, 2025

16m

37

Unlocking the Secrets: How Much Do Language Models Memorize?

Ever wondered how much information your favorite AI language models, like GPT, actually retain from their training data? In this episode of AI Odyssey, we delve into groundbreaking research by John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. The authors introduce a new method for quantifying memorization in AI, distinguishing between unintended memorization (dataset-specific information) and generalization (knowledge of underlying data patterns). With findings revealing that models like GPT have a surprising capacity of about 3.6 bits per parameter, this study explores how memorization plateaus and eventually gives way to true understanding, a phenomenon known as "grokking." Created using Google's NotebookLM, this episode demystifies how language models balance memorization and generalization, offering fresh insights into model training and privacy implications. Dive deeper into the full paper here: https://www.arxiv.org/abs/2505.24832

Jun 29, 2025

18m

36

Simulating UX with AI: Introducing UXAgent

What if you could simulate a full-scale usability test before involving a single human user? In this episode, we explore UXAgent, a groundbreaking system developed by researchers from Northeastern University, Amazon, and the University of Notre Dame. This tool leverages Large Language Models (LLMs) to create persona-driven agents that simulate real user interactions on web interfaces. UXAgent's innovative architecture mimics both fast, intuitive decisions and deeper, reflective reasoning, bringing realistic and diverse user behavior into early-stage UX testing. The system enables rapid iteration of study designs, helps identify potential flaws, and even allows interviews with simulated users. This episode is powered by insights generated using Google’s NotebookLM. Special thanks to the authors Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. 🔗 Read the full paper here: https://arxiv.org/abs/2504.09407

Jun 21, 2025

17m

35

AI Agents Are Old News—Meet the Rise of Agentic AI

What if your AI didn't just follow instructions… but coordinated a whole team to solve complex problems on its own? In this episode, we dive into the fascinating shift from traditional AI Agents to a bold new paradigm: Agentic AI. Based on the eye-opening paper “AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges”, we unpack why single-task bots like AutoGPT are already being outpaced by swarms of intelligent agents that collaborate, strategize, and adapt, almost like digital organizations. Discover how these systems are transforming research, medicine, robotics, and cybersecurity, and why Google’s new A2A protocol could be a game-changer. From hallucination traps to multi-agent breakthroughs, this is the frontier of AI you haven’t heard enough about. Synthesized with help from Google’s NotebookLM. Full paper here 👇 https://arxiv.org/abs/2505.10468

Jun 14, 2025

16m

34

The Illusion of Thinking: When More Reasoning Doesn’t Mean Better Reasoning

In this episode, we explore “The Illusion of Thinking”, a thought-provoking study from Apple researchers that dives into the true capabilities, and surprising limits, of Large Reasoning Models (LRMs). Despite being designed to "think harder," these advanced AI models often fall short when problem complexity increases, failing to generalize reasoning and even reducing effort just when it’s most needed. Using controlled puzzle environments, the authors reveal a curious three-phase behavior: standard language models outperform LRMs on simple tasks, LRMs shine on moderately complex ones, but both collapse entirely under high complexity. Even with access to explicit algorithms, LRMs struggle to follow logical steps consistently. This paper challenges our assumptions about AI reasoning and suggests we're still far from building models that truly think. Generated using Google’s NotebookLM. 🎧 Listen in and learn why scaling up “thinking” might not be the answer we thought it was. 🔗 Read the full paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf 📚 Authors: Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar (Apple)

Jun 9, 2025

16m

33

Smarter Prompts, Faster Results: The Power of Local Prompt Optimization

Prompting AI just got smarter. In this episode, we dive into Local Prompt Optimization (LPO), a breakthrough approach that turbocharges prompt engineering by focusing edits on just the right words. Developed by Yash Jain and Vishal Chowdhary from Microsoft, LPO refines prompts with surgical precision, dramatically improving accuracy and speed across reasoning benchmarks like GSM8k, MultiArith, and BIG-bench Hard. Forget rewriting entire prompts. LPO reduces the optimization space, speeding up convergence and enhancing performance, even in complex production environments. We explore how this technique integrates seamlessly into existing prompt optimization methods like APE, APO, and PE2, and how it delivers faster, smarter, and more controllable AI outputs. This episode was generated using insights synthesized in Google’s NotebookLM. Read the full paper here: https://arxiv.org/abs/2504.20355

May 31, 2025

12m

32

Back to Basics: Understanding AI, From Buzzwords to Reality

AI is everywhere, but what is it, really? In this episode, we cut through the noise to explore the fundamentals of artificial intelligence, from narrow AI and reactive systems to generative models, AI agents, and the emerging frontier of agentic AI. Using insights from expert sources, articles, and research papers, we break down key concepts in simple, accessible terms. You'll learn how tools like ChatGPT work under the hood, why generative AI felt like such a leap, and what it actually means for an AI to be an agent, or part of a multi-agent system. We explore the real capabilities and limits of today’s AI, as well as the ethical and societal questions shaping its future.

May 24, 2025

19m

31

From Nothing to Genius: How AI Learns Without Data

What if an AI could become smarter without being taught anything? In this episode, we dive into Absolute Zero, a groundbreaking framework where an AI model trains itself to reason, without any curated data, labeled examples, or human guidance. Developed by researchers from Tsinghua, BIGAI, and Penn State, this radical approach replaces traditional training with a bold form of self-play, where the model invents its own tasks and learns by solving them. The result? Absolute Zero Reasoner (AZR) surpasses existing models that depend on tens of thousands of human-labeled examples, achieving state-of-the-art performance in math and code reasoning tasks. This paper doesn’t just raise the bar; it tears it down and rebuilds it. Get ready to explore a future where models don’t just answer questions; they ask them too. Original research by Andrew Zhao, Yiran Wu, Yang Yue, and colleagues. Content powered by Google’s NotebookLM. Read the full paper: https://arxiv.org/abs/2505.03335

May 19, 2025

17m