PODCAST · technology
Embodied AI 101
by Shaoqing Tan
Stay in the loop on research in AI and physical intelligence.
-
67
Claw-Eval: Toward Trustworthy and Transparent Evaluation of Autonomous Agents
Benchmark with 2,159 rubric items across 300 tasks using trajectory-aware grading and 3-trial Pass^3 scoring to mitigate luck. Evaluates agent reliability in real-world robotics settings.
-
66
LIBERO-Para: Paraphrase Robustness in Robotic Manipulation
Reveals paraphrase fragility in VLAs causing 22-52% success drops due to task misidentification. Introduces PRIDE metric weighting success by paraphrase difficulty on LIBERO benchmark manipulation tasks.
-
65
YOR: Your Own Mobile Manipulator for Generalizable Robotics
Low-cost mobile manipulator design and training strategies for broad generalization in real-world tasks.
-
64
EgoSim: Egocentric World Simulator for Embodied Interaction Generation
Closed-loop egocentric video simulator maintaining persistent 3D scene state for consistent interactions, enabling cross-embodiment transfer from human videos to robotic manipulation.
-
63
Accelerating Video World Models: From Generative Videos to Real-Time Simulators
Comprehensive survey taxonomizing efficient architectures/algorithms for video world models as simulators, targeting compute bottlenecks in embodied AI, autonomous driving, and games with techniques like short-window attention for real-time long-horizon prediction.
-
62
From Tokens to Thoughts: Continuous Latent Reasoning in Large Models and Robot Control
Curated collection of 100+ works surveying the shift from discrete tokens to continuous latent spaces in LLMs/VLMs/VLAs for improved reasoning, with relevance to robotics action modeling.
-
61
CaP-X: Coding Agents for Physical eXecution
CaP-X is an open-source agentic robotics framework where LLMs/VLMs generate code to call perception and control APIs for execution across diverse simulated and real robots in CaP-Gym's 187 manipulation tasks. The framework includes CaP-Bench for evaluating frontier models and CaP-RL, which boosts a 7B model's success from 20% to 72% with minimal sim-to-real gap.
-
60
DoRA: Weight-Decomposed Low-Rank Adaptation
An upgrade over LoRA for parameter-efficient fine-tuning, enabling better performance in LLMs by decomposing weights into magnitude and direction components.
-
59
AI Model Collapse: What Happens When AI Trains on Its Own Outputs
Seminal work showing how training on AI-generated data leads to 'model collapse' in neural networks, with urgent implications for future scaling.
-
58
PhAIL: Benchmarking Vision-Language-Action Models on Real-World Bin-Picking
Real-world hardware evaluation of VLAs on blind bin-to-bin picking, reaching a maximum of 64 picks/hour across hundreds of runs, with full videos and data exposing gaps in production-scale robotic manipulation reliability.
-
57
Co-training Large Behavior Models: Data Modalities and Training Strategies for Robot Manipulation
Comprehensive evaluation of 89 policies showing optimal co-training practices mixing real robot data with sim/egocentric human videos to boost diversity and performance in large robotics foundation models.
-
56
HyDRA: Hybrid Memory for Dynamic Video World Models
Novel memory system preserving dynamic object identity and motion continuity across occlusions in video world models, addressing frozen/vanishing issues for improved predictive physics in embodied AI.
-
55
WildWorld: Dynamic World Modeling with Actions and Explicit State
Massive dataset enabling dynamic world models with explicit states and actions, supporting predictive modeling for cross-embodiment robotic control.
-
54
Omni-WorldBench: Evaluating Interactive 4D World Models
New benchmark assessing world models on interaction tasks, pushing predictive physics and video modeling towards robotics applications with action-conditioned evaluation.
-
53
SIMART: From Static Meshes to Sim-Ready Articulated Models
Unified MLLM framework with Sparse 3D VQ-VAE (70% token reduction) for part-level mesh decomposition and kinematic chain prediction, enabling physics-based robotic simulation from monolithic assets.
-
52
EgoSim: An Egocentric World Simulator for Embodied Interaction
Closed-loop egocentric simulator persistently updating 3D scene state to generate spatially consistent interaction videos for continuous simulation, enabling cross-embodiment transfer from human videos to robotic manipulation tasks.
-
51
Digit's New Motor Cortex: Sim-to-Real RL for Whole-Body Control
AI-trained capabilities for new whole-body motions using mocap/teleop data and sim-to-real reinforcement learning, deployable overnight on hardware.
-
50
EgoNav: Diffusion-Based Humanoid Navigation from Human Egocentric Video
Diffusion-based humanoid navigation trained solely on 5 hours of human egocentric video data, enabling zero-shot deployment on Unitree G1 for complex behaviors like handling glass walls, crowds, and dynamic obstacles via 360° visual memory and hybrid trajectory sampling; upcoming release of dataset, models, and code.
-
49
CaP-X: A Code-as-Policy Framework for Robot Manipulation
Comprehensive open-source agentic robotics framework treating VLMs/LLMs as code-generating APIs for perception (SAM3, Molmo) and control (IK, grasping), with CaP-Gym benchmark of 187 diverse manipulation tasks (tabletop, bimanual, mobile; sim/real) and CaP-Bench evaluating 12 frontier models; demonstrates rapid RL gains (7B model from 20% to 72% success) with strong sim-to-real transfer.
-
48
Embodied Intelligence Breakthrough: Generalist AI’s GEN-1 Robots
We've created GEN-1, our latest milestone in scaling robot learning. We believe it to be the first general-purpose AI model that crosses a new performance threshold: mastery of simple physical tasks. It improves average success rates to 99% on tasks where previous models achieve 64%, completes tasks roughly 3x faster than state of the art, and requires only 1 hour of robot data for each of these results. GEN-1 unlocks commercial viability across a broad range of applications—and while it cannot solve all tasks today, it is a significant step towards our mission of creating generalist intelligence for the physical world.
-
47
CaP-X: LMs' First Physical Exam
A benchmark that gives language models a "physical exam" in the robotics sense: it evaluates how well frontier LLMs/VLMs generate code that calls perception and control APIs to execute manipulation tasks on simulated and real robots.
-
46
AI Model Collapse: The Danger of Training on AI-Generated Data
Demonstrated that LLMs trained recursively on AI-generated data suffer model collapse, a degenerative process where they lose grasp of true data distributions. Sparked critical debates on data provenance and the importance of preserving human-generated training data.
-
45
High-Level Automated Reasoning with Qwen2.5-7B
Qwen2.5-7B achieved 79.6% on MATH benchmark, surpassing GPT-4o, by employing atomic reasoning actions combined with Monte Carlo Tree Search. Demonstrated that strategic reasoning architectures can enable smaller models to outperform much larger ones.
-
44
Co-Training Large Behavior Models: Multimodal Data for Robot Manipulation
Explores data modalities and co-training strategies to enhance large behavior models (foundation models) for improved performance in robot manipulation tasks, supporting end-to-end learning and cross-embodiment generalization.
-
43
HyDRA: Hybrid Memory for Dynamic Video World Models
Memory architecture preserving identity and motion continuity for out-of-view dynamic subjects, addressing frozen/vanishing issues in video world models.
-
42
DexWM: Leveraging Human Videos for Dexterous Robot World Models
Dataset of robot trajectories designed for training world models to learn dexterous hand-object interactions directly from human videos.
-
41
World Models in Robotics
Technical survey categorizing world models into action-conditioned, video-inverse dynamics, and joint world-action models (WAMs), discussing their generalization, video data leverage, and trends for closing the robotics data gap.
-
40
SIMART: Decomposing Monolithic Meshes into Sim-Ready Articulated Assets
Unified MLLM framework with Sparse 3D VQ-VAE that reduces tokens by 70% for efficient part-level decomposition and kinematic prediction in physics-based robotic simulations.
-
39
LeWorldModel: A Stable JEPA World Model from Pixels
Stable end-to-end JEPA world model trained directly from pixels using a simple MSE prediction loss and SIGReg anti-collapse regularization. With 15M parameters it enables latent planning in under 1 second, shows emergent spatial structure, and outperforms prior methods.
-
38
World Models for Robots: The Next Big Leap?
Technical overview defining world models in robotics, their potential to solve diverse problems via video prediction, and key enablers like scale.
-
37
Harnessing Long-Running AI in Embodied Systems
As AI moves from quick Q&A to marathon tasks, designers grapple with continuity. This episode explores how Anthropic's harness design principles translate to embodied AI: robots that need to maintain context across long-running missions.
-
36
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
Whole-Body Mobile Manipulation Interface (HoMMI) that learns bimanual and whole-body manipulation, long-horizon navigation, and active perception directly from egocentric human demonstrations without teleoperation.
-
35
TurboQuant: Redefining AI Efficiency with Extreme Compression
This episode explores TurboQuant, a revolutionary set of quantization algorithms from Google Research that redefines AI efficiency through extreme compression.
We dive deep into how TurboQuant addresses one of AI's most pressing challenges: the memory bottleneck created by high-dimensional vectors in key-value caches. The research introduces theoretically grounded quantization methods that enable massive compression for large language models and vector search engines without sacrificing performance.
Key topics covered:
- The theoretical foundations of TurboQuant's quantization algorithms
- How extreme compression works for LLMs and vector search engines
- Impact on high-dimensional vectors and key-value cache memory bottlenecks
- Performance metrics and comparisons with existing methods
- Practical implications for AI deployment and efficiency
Links:
Paper: https://arxiv.org/pdf/2504.19874
Blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
-
34
DexWM: Learning Dexterous Object Manipulation from Human Videos
Dataset of robot trajectories designed for training world models that learn dexterous hand-object interactions from human videos, released on Hugging Face.
-
33
FlashAttention-3: Fast & Accurate Attention with Asynchrony & Low-Precision
Major efficiency leap for Transformer attention mechanisms, enabling faster training/inference on long sequences with low-precision compute.
-
32
When AI Trains on Its Own Output: The Model Collapse Problem
Warns of "model collapse" in LLMs trained on synthetic data from prior models, urging preservation of human-generated data. One of 2024's most influential papers.
-
31
MolmoBot: A Vision-Language Model for Zero-Shot Robot Manipulation
Vision-language model (VLM) for zero-shot robot manipulation, trained entirely in simulation without real-world data; achieves 79.2% success rate on real-world tabletop tasks, outperforming π₀.₅ baseline at 39.2%.
-
30
LeWorldModel: Stable End-to-End JEPA from Pixels
A stable end-to-end Joint Embedding Predictive Architecture (JEPA) trained directly from pixels that enables robust world modeling for embodied AI systems.
-
29
EgoVerse: An Egocentric Data Ecosystem for Scaling Robot Learning
Ecosystem with over 1300 hours of egocentric human video data spanning 240 scenes and 2000+ tasks, designed for scalable robot policy training via behavior cloning; includes cloud infrastructure, data viewer, and human-to-robot transfer algorithms to enable cross-embodiment learning without teleoperation.
-
28
HSImul3R: Physics-Driven Reconstruction of Human–Scene Interactions
Physics-in-the-loop bi-directional optimization pipeline reconstructing stable, simulation-ready 3D human-scene interactions from casual videos, deployable directly to humanoid robots for world modeling and manipulation.
-
27
MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
Open-source suite of large-scale simulation environments and benchmarks designed for advancing end-to-end learning in robot navigation and manipulation across multiple embodiments.
-
26
DreamZero: World Action Models Are Zero-Shot Policies
Introduces World Action Models (WAMs), a family of 14B-parameter autoregressive diffusion models that jointly predict video and robotic actions to enable zero-shot generalization across manipulation tasks, outperforming fine-tuned Vision-Language-Action models on benchmarks like MolmoSpaces and RoboArena.
-
25
Kinema4D: A 4D Generative Simulator for Embodied AI
An action-conditioned 4D generative robotic simulator that disentangles precise kinematic control from environmental dynamics, facilitating physically-plausible simulations of complex robot-world interactions for training and world modeling.
-
24
VEGA-3D: Teaching Multimodal LLMs Spatial Reasoning Through Video Generation
A plug-and-play framework extracts implicit 3D priors from video diffusion models to enhance multimodal LLMs with spatial reasoning capabilities, enabling improved geometric scene understanding and embodied decision-making without explicit 3D supervision.