
PODCAST · education

AI Illuminated

A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.

  1. 25

    MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

    0:00 Introduction  0:20 Limitations of traditional SfM and SLAM techniques. 0:57 Shortcomings of existing neural network methods. 1:07 MegaSaM's approach: balance of accuracy, speed, and robustness. 1:31 Differentiable bundle adjustment (BA) layer. 2:03 Integration of monocular depth priors and motion probability maps. 2:37 Uncertainty-aware global BA scheme. 3:14 Two-stage training scheme. 3:45 Consistent video depth estimation without test-time fine-tuning. 4:16 Key quantitative and qualitative improvements. 4:49 Limitations of MegaSaM and future research avenues. 5:15 Synthetic data for training and generalization to real-world videos. 5:49 Datasets used for evaluation. 6:26 DepthAnything and UniDepth for monocular depth estimation. 7:02 Summary of MegaSaM's advancements. Authors: Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, Noah Snavely Affiliations: Google DeepMind, UC Berkeley, University of Michigan Abstract: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. 
Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page. Link: https://mega-sam.github.io/

  2. 24

    SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

    [00:00] SVDQuant: 4-bit diffusion model quantization [00:27] Challenge: Outlier sensitivity in 4-bit quantization [00:59] Solution: Smoothing + SVD approach [01:37] Technical: SVD's role in low-rank approximation [02:08] Nunchaku: New inference engine with kernel fusion [02:35] Comparison: INT4 vs FP4 quantization methods [03:00] Results: 3.5x memory reduction on FLUX.1 [03:44] Feature: Seamless LoRA compatibility [04:06] Study: Validating combined approach effectiveness [04:40] Future: Hardware compatibility and improvements [06:12] Methods: Image quality assessment metrics [06:53] Impact: Open-source deployment benefits Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han Affiliations: MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, Pika Labs Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides.
However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced. Link: https://hanlab.mit.edu/projects/svdquant
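The low-rank idea in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`quantize_4bit`, `low_rank_plus_quantized`), the per-tensor symmetric quantizer, the rank, and the synthetic weight matrix are all assumptions chosen for illustration, and the activation-to-weight smoothing step is left out. It only shows that moving the dominant (outlier) directions of a weight matrix into a high-precision low-rank branch leaves a residual that is easier to quantize to 4 bits.

```python
import numpy as np

def quantize_4bit(w):
    """Fake-quantize to signed 4-bit levels in [-8, 7] (per-tensor, symmetric)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def low_rank_plus_quantized(w, rank):
    """Split w into a high-precision rank-`rank` part and a 4-bit residual."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]      # (rows, rank)
    l2 = vt[:rank, :]                # (rank, cols)
    residual = w - l1 @ l2           # what is left after removing top directions
    return l1, l2, quantize_4bit(residual)

# Hypothetical weight matrix with one outlier column.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w[:, 3] *= 30.0

direct_err = np.linalg.norm(w - quantize_4bit(w))
l1, l2, residual_q = low_rank_plus_quantized(w, rank=4)
branch_err = np.linalg.norm(w - (l1 @ l2 + residual_q))
print(f"direct 4-bit error: {direct_err:.2f}, low-rank + 4-bit error: {branch_err:.2f}")
```

In this toy setting the outlier column forces a coarse quantization step when the matrix is quantized directly, whereas the residual after removing the top singular directions has a much smaller dynamic range.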

  3. 23

    D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

    [00:00] Intro [00:18] Current limitations in depth-sensing technology [00:56] D3RoMa's diffusion model approach to depth estimation [01:47] Integration of geometric constraints in the model [02:27] HiSS: New dataset for transparent/specular objects [03:18] Benchmark results showing major accuracy improvements [04:02] Current limitations and future development areas [05:34] Technical details of HiSS dataset creation [06:30] Real-world testing with robotic systems [07:15] Why diffusion models outperform GANs [08:54] Implementation of consistency loss functions [12:00] Solving simulation-to-real-world transfer [13:25] Potential expansion to single-camera systems Authors: Songlin Wei, Haoran Geng, Jiayi Chen, Congyue Deng, Wenbo Cui, Chengyang Zhao, Xiaomeng Fang, Leonidas Guibas, He Wang Affiliations: Peking University, UC Berkeley, Stanford, Galbot, University of Chinese Academy of Sciences, Beijing Academy of Artificial Intelligence Abstract: Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth which bottlenecks robot performances. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporated a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. 
For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieve state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios. Link: https://arxiv.org/abs/2409.14365
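The left-right consistency constraint mentioned in this abstract can be sketched as a simple photometric check. The snippet below is an illustrative assumption, not the paper's guidance term: the function names, the linear interpolation, and the mean absolute error are choices made here. It warps the right image into the left view using a disparity map and measures how well the two agree; a guided sampler would push the predicted disparity toward values that lower such a score.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Sample the right image at x - d(x, y) with linear interpolation along x."""
    h, w = right.shape
    xs = np.clip(np.arange(w)[None, :] - disparity, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def left_right_consistency(left, right, disparity):
    """Mean absolute photometric error between left and the warped right image."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))

# Synthetic stereo pair: the right view is the left view shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.uniform(size=(8, 32))
right = np.roll(left, -3, axis=1)

good = left_right_consistency(left, right, np.full(left.shape, 3.0))
bad = left_right_consistency(left, right, np.zeros(left.shape))
print(f"error with true disparity: {good:.3f}, with zero disparity: {bad:.3f}")
```

The error with the true disparity is not exactly zero because pixels near the left border have no valid match in the right image.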

  4. 22

    Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

    [00:00] Intro [00:21] Key problem: Poor generalization in robotic learning [00:51] HPT: New transformer architecture for robotics [00:59] Core components of HPT architecture [01:44] Scale analysis: Data and model size impacts [02:16] Training data: Real robots, simulations, human videos [02:54] Results: 20% improvement on new tasks [04:04] Real-world testing limitations [05:18] Future additions: Tactile and 3D data [05:57] Requirements for better robotics datasets [06:48] Weight sampling in heterogeneous data [08:55] Benefits of modular architecture [10:30] Scaling challenges and trade-offs Authors: Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He Affiliations: MIT CSAIL, Meta FAIR Abstract: One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. 
See the project website for code and videos. Link: https://arxiv.org/abs/2409.20537

  5. 21

    HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots

    [00:00] Introduction to HOVER: Neural Whole-Body Controller for Humanoids [00:15] Problem: Current controllers lack versatility across tasks [00:50] Human motion imitation as a unified control approach [01:23] Policy distillation: Learning from an oracle policy [02:01] Command space: Kinematic, joint angle, and root tracking modes [02:34] Motion retargeting: From human data to robot movements [03:09] Performance comparison with specialist policies [03:43] Real-world testing on Unitree H1 robot [04:15] Comparison with MHC and MaskedMimic approaches [04:49] Future work and current limitations [05:18] Reward function design and components [06:02] DAgger advantages in policy learning [06:33] Domain randomization for sim-to-real transfer [07:06] Conclusions on HOVER's contributions Authors: Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, Linxi Fan, Yuke Zhu Affiliations: NVIDIA, CMU, UC Berkeley, UT Austin, UC San Diego Abstract: Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy.
HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications. Link: https://hover-versatile-humanoid.github.io/

  6. 20

    Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

    [00:00] Intro [00:24] Tackles RL challenges using a visual backbone, efficient RL, and human feedback. [01:20] Pretrained backbone boosts stability and exploration efficiency. [02:06] RLPD combines offline data and human corrections effectively. [02:57] Human-guided interventions reduce errors, enabling gradual autonomy. [03:42] System choices aid spatial generalization and safe exploration. [04:40] RL outperforms imitation learning in success and speed. [05:29] Funnel model shows reliable, focused policy improvement. [06:07] Learns both reactive and predictive tasks, enhancing flexibility. [06:57] HIL-SERL excels over baselines in integrating human data. [07:27] Outperforms diffusion policy on reactive tasks. [08:04] Future work: longer tasks, pretraining, unstructured testing. [08:57] Key takeaway: human-in-the-loop RL enables adaptable, efficient robotic policies. Authors: Jianlan Luo, Charles Xu, Jeffrey Wu, Sergey Levine Affiliations: UC Berkeley Abstract: Reinforcement learning (RL) holds great promise for enabling autonomous acquisition of complex robotic manipulation skills, but realizing this potential in real-world settings has been challenging. We present a human-in-the-loop vision-based RL system that demonstrates impressive performance on a diverse set of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination. Our approach integrates demonstrations and human corrections, efficient RL algorithms, and other system-level design choices to learn policies that achieve near-perfect success rates and fast cycle times within just 1 to 2.5 hours of training. We show that our method significantly outperforms imitation learning baselines and prior RL approaches, with an average 2x improvement in success rate and 1.8x faster execution. 
Through extensive experiments and analysis, we provide insights into the effectiveness of our approach, demonstrating how it learns robust, adaptive policies for both reactive and predictive control strategies. Our results suggest that RL can indeed learn a wide range of complex vision-based manipulation policies directly in the real world within practical training times. We hope this work will inspire a new generation of learned robotic manipulation techniques, benefiting both industrial applications and research advancements. Videos and code are available at our project website. Link: https://hil-serl.github.io/

  7. 19

    Local Policies Enable Zero-shot Long Horizon Manipulation

    [00:00] Paper intro: Zero-shot robotic manipulation via local policies [00:26] Key challenges: Limited generalization and sim-to-real transfer [01:03] Local policies: Task decomposition through localized focus regions [01:38] Foundation models: VLMs for task understanding [02:07] Training approach: Simulation-based RL + visuomotor policy distillation [02:46] Implementation: Depth maps and impedance control system [03:25] Results: 97% simulation success, 76% real-world success [04:02] Challenges: Vision errors and collision handling [04:32] Limitations: Issues with reflective objects and complex contacts [05:48] Impact: Advancing autonomous robotic manipulation [06:36] Design: Modular system for continuous improvement [07:21] Dependencies: VLM and motion planner requirements Authors: Murtaza Dalal, Min Liu, Walter Talbott, Chen Chen, Deepak Pathak, Jian Zhang, Ruslan Salakhutdinov Affiliations: Carnegie Mellon University, Apple Abstract: Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. 
Video results are available on the project page. Link: https://mihdalal.github.io/manipgen/

  8. 18

    MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning

    [00:00] Introduction to Mentor system for visual RL [00:29] Problem: Sample inefficiency in robotic learning [00:59] Innovation: Mixture of Experts (MoE) architecture [01:55] Results: MoE achieves 100% success in multi-task testing [02:33] Feature: Task-oriented perturbation for exploration [03:55] Real-world testing: 83% success in robotic tasks [04:33] Study: MoE and perturbation each boost performance by 30% [05:14] Future work: Optimizing MoE implementation [05:59] Challenge: Bridging simulation-to-real-world gap [06:45] Impact: Advancing practical robotics applications Authors: Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, Huazhe Xu Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Shanghai AI Lab Abstract: Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone, enhancing the agent's ability to handle complex tasks by leveraging modular expert learning to avoid gradient conflicts. Furthermore, MENTOR introduces a task-oriented perturbation mechanism, which heuristically samples perturbation candidates containing task-relevant information, leading to more targeted and effective optimization. MENTOR outperforms state-of-the-art methods across three simulation domains -- DeepMind Control Suite, Meta-World, and Adroit. Additionally, MENTOR achieves an average of 83% success rate on three challenging real-world robotic manipulation tasks including peg insertion, cable routing, and tabletop golf, which significantly surpasses the success rate of 32% from the current strongest model-free visual RL algorithm. 
These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are linked from the paper. Link: https://arxiv.org/abs/2410.14972
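To make the mixture-of-experts backbone in this abstract more concrete, here is a minimal NumPy sketch of top-k expert routing. Everything in it is an assumption for illustration: the class name `TinyMoE`, the use of plain linear experts, the softmax gate over the selected experts, and the sizes. MENTOR's actual architecture and training procedure are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TinyMoE:
    """Routes each input to its top-k experts and mixes their outputs."""

    def __init__(self, dim_in, dim_out, num_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.gate = rng.normal(scale=0.1, size=(dim_in, num_experts))
        # Each expert is a single linear map here (a simplification).
        self.experts = rng.normal(scale=0.1, size=(num_experts, dim_in, dim_out))
        self.top_k = top_k

    def __call__(self, x):
        logits = x @ self.gate                                   # (batch, experts)
        chosen = np.argsort(logits, axis=-1)[:, -self.top_k:]    # (batch, k)
        weights = softmax(np.take_along_axis(logits, chosen, axis=-1))
        # Apply only the selected experts to each input.
        expert_out = np.einsum("bi,bkio->bko", x, self.experts[chosen])
        out = (weights[..., None] * expert_out).sum(axis=1)      # (batch, dim_out)
        return out, chosen, weights

moe = TinyMoE(dim_in=4, dim_out=3, num_experts=6, top_k=2)
x = np.random.default_rng(1).normal(size=(5, 4))
out, chosen, weights = moe(x)
print(out.shape, chosen[0], weights[0])
```

Because each input only updates the experts it is routed to, different tasks can end up using different parameters, which is the intuition behind the abstract's remark about avoiding gradient conflicts.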

  9. 17

    SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment

    [00:00] SkillGen: AI system for robotic learning and automation. [00:19] Core Innovation: Automated dataset generation from minimal human input. [01:12] Skill Segmentation: Smart system for breaking down and adapting complex tasks. [01:59] Hybrid Skill Policy: Framework for controlling robot actions and task completion. [02:50] Performance Results: 75.4% success rate, generating 24,000+ demonstrations. [04:18] Real-World Testing: 35% success in direct simulation-to-reality transfer. [05:01] Current Limitations: Preset sequences and object tracking requirements. [06:47] HSP Variants: Different approaches to robot control and motion planning. [07:43] Practical Applications: Successful implementation in pick-and-place tasks. Authors: Caelan Garrett, Ajay Mandlekar, Bowen Wen, Dieter Fox Affiliation: NVIDIA Abstract: Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion. We also propose a Hybrid Skill Policy (HSP) framework for learning skill initiation, control, and termination components from SkillGen datasets, enabling skills to be sequenced using motion planning at test-time. We demonstrate that SkillGen greatly improves data generation and policy learning performance over a state-of-the-art data generation framework, resulting in the capability to produce data for large scene variations, including clutter, and agents that are on average 24% more successful. 
We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations, and training proficient, often near-perfect, HSP agents. Finally, we apply SkillGen to 3 real-world manipulation tasks and also demonstrate zero-shot sim-to-real transfer on a long-horizon assembly task. Videos and more are linked from the paper. Link: https://arxiv.org/abs/2410.18907

  10. 16

    LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

    [00:00] Intro to LVSM: Novel transformer for view synthesis [00:14] Problems with existing 3D synthesis methods [00:59] LVSM architecture: encoder-decoder vs decoder-only [01:41] Performance trade-offs between architectures [02:13] Using Plücker rays for implicit 3D geometry [02:49] Zero-shot capabilities with varying input views [03:23] Training stability and technical solutions [03:59] Training & evaluation datasets [04:23] Insights from architecture ablation studies [05:00] Achieving SOTA with limited GPU resources [05:25] Future work and research directions [06:05] Parallels with language models [06:38] Limitations in aspect ratio handling Authors: Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu Affiliations: Cornell University, The University of Texas at Austin, Adobe Research, Massachusetts Institute of Technology Abstract: We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR.
Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details. Link: https://arxiv.org/abs/2410.17242
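The episode's chapter list mentions Plücker rays as the way camera geometry is supplied to the model. The sketch below shows one common way to compute per-pixel Plücker coordinates from a pinhole camera. The function name, the (direction, moment) ordering, the pixel-center convention, and the example intrinsics are assumptions for illustration and may differ from what LVSM actually uses.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of [unit direction, origin x direction] per pixel."""
    # Pixel centers in homogeneous image coordinates (u along width, v along height).
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project through the intrinsics, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # The moment o x d is unchanged if the origin slides along the ray.
    origins = np.broadcast_to(t, dirs.shape)
    moments = np.cross(origins, dirs)
    return np.concatenate([dirs, moments], axis=-1)

# Hypothetical camera: 6x4 image, focal length 5 px, translated off the origin.
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.5, -1.0, 2.0]
rays = plucker_rays(K, pose, height=4, width=6)
print(rays.shape)
```

Each pixel's six numbers describe its viewing ray independently of where along the ray the camera center sits, which is why this encoding is a convenient per-patch pose input for a transformer.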

  11. 15

    Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling

    [00:00] Introduction to 3D Gaussian tracking for robotic manipulation [00:26] Limitations of current video prediction methods [01:11] Advantages of 3D Gaussian representation [02:04] Graph Neural Networks for modeling object dynamics [02:54] Control particle implementation and computation reduction [03:42] Physics-based optimization for prediction stability [04:25] Integration with real-world robotic systems [05:12] Performance testing across different materials [05:58] Advantages over traditional physics-based methods [09:16] Implementation of object detection systems [10:02] Data collection and synchronization challenges [14:39] Long-term prediction capabilities and limitations Authors: Mingtong Zhang, Kaifeng Zhang, Yunzhu Li Affiliations: University of Illinois Urbana-Champaign, Columbia University Abstract: Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. 
The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is linked from the paper. Link: https://arxiv.org/abs/2410.18912

  12. 14

    SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation

    [00:00] Introduction [00:20] Core limitations in robot manipulation: challenges with RL and IL [01:08] SPIRE's hybrid approach: combining task planning with learning methods [01:44] TAMP-gated learning: selective application of learned policies [02:20] Training innovations: warm-starting RL and KL-divergence implementation [02:59] Results: 35-50% performance gain, 6x more data efficient [04:04] Multi-worker framework: improved sampling and distribution [05:11] Future directions: expanding beyond rigid objects [05:59] Curriculum learning: sequential training strategies [07:11] Safety improvements: demonstrated through coffee task example Authors: Zihan Zhou, Animesh Garg, Dieter Fox, Caelan Garrett, Ajay Mandlekar Affiliations: NVIDIA, University of Toronto, Vector Institute, Georgia Institute of Technology Abstract: Robot learning has proven to be a general and effective technique for programming manipulators. Imitation learning is able to teach robots solely from human demonstrations but is bottlenecked by the capabilities of the demonstrations. Reinforcement learning uses exploration to discover better behaviors; however, the space of possible improvements can be too large to start from scratch. And for both techniques, the learning difficulty increases proportional to the length of the manipulation task. Accounting for this, we propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems and second combines imitation and reinforcement learning to maximize their strengths. We develop novel strategies to train learning agents when deployed in the context of a planning system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot manipulation problems. 
We find that SPIRE outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning by 35% to 50% in average task performance, is 6 times more data efficient in the number of human demonstrations needed to train proficient agents, and learns to complete tasks nearly twice as efficiently. See the website linked from the paper for more details. Link: https://arxiv.org/abs/2410.18065

  13. 13

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    [00:00] VILA-U: A unified visual AI model [00:29] Problem: Inefficiency of separate visual modules [01:11] Vision tower: Novel quantization approach [02:09] Training strategy: CLIP-based staged learning [03:03] RVQ technique: Enhanced visual representation [03:47] Multi-modal training: Text-image-video fusion [04:35] Performance: Results and current limitations [05:23] Impact: Contrastive loss effectiveness [06:03] Generation: Optimal guidance settings [06:37] Capabilities: Video, Q&A, and image reasoning [07:14] Applications: Future use cases and scaling [08:00] Architecture: LLaMA 2 7B integration [08:48] Data: Quality vs quantity considerations [09:35] Impact: Unified framework achievements Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu Affiliations: Tsinghua University, MIT, NVIDIA, UC Berkeley, UC San Diego Abstract: VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and autoregressive image generation can achieve similar quality as diffusion models with high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework. Link: https://arxiv.org/abs/2409.04429

  14. 12

    CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

    [00:00] Intro [00:31] Challenge: Limited multi-humanoid training data [00:55] CooHOI's two-phase learning framework [01:49] Object dynamics as implicit agent communication [02:25] Bounding box strategy for long objects [03:07] Results: Superior performance vs baselines [03:42] Ablation study findings: Key system components [04:18] Limitation: Basic hand manipulation only [04:57] Impact: New approach to robot cooperation Authors: Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang Affiliations: Shanghai AI Laboratory, Tsinghua University, Beihang University, Nanyang Technological University, Carnegie Mellon University Abstract: Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character objects transporting through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). 
When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types. Link: https://arxiv.org/abs/2406.14558v2

  15. 11

    SynFlowNet: Design of Diverse and Novel Molecules with Synthesis Constraints

    [00:00] Introduction to SynFlowNet [00:29] Problem: AI-generated molecules often can't be synthesized [01:17] Solution: SynFlowNet - uses real chemical reactions [02:03] GFlowNets: Enables diverse molecule generation [02:47] Scalability: Morgan fingerprints handle 200K+ compounds [03:14] Challenge: Solving backward trajectory issues [04:14] Results: Better synthesis rates and molecular diversity [05:30] Scale test: Successfully handled 221K molecules [06:06] Application: Integration with fragment screening [06:38] Wrap-up: SynFlowNet advances drug design Authors: Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, Pietro Liò Affiliations: University of Cambridge, EPFL, Microsoft Research, Valence Labs Abstract: Generative models see increasing use in computer-aided drug design. However, while performing well at capturing distributions of molecular motifs, they often produce synthetically inaccessible molecules. To address this, we introduce SynFlowNet, a GFlowNet model whose action space uses chemical reactions and buyable reactants to sequentially build new molecules. By incorporating forward synthesis as an explicit constraint of the generative mechanism, we aim at bridging the gap between in silico molecular generation and real world synthesis capabilities. We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool to assess the synthesizability of our compounds, and motivate the choice of GFlowNets through considerable improvement in sample diversity compared to baselines. Additionally, we identify challenges with reaction encodings that can complicate traversal of the MDP in the backward direction. To address this, we introduce various strategies for learning the GFlowNet backward policy and thus demonstrate how additional constraints can be integrated into the GFlowNet MDP framework. 
This approach enables our model to successfully identify synthesis pathways for previously unseen molecules. Link: https://arxiv.org/abs/2405.01155v2

  16. 10

    L3DG: Latent 3D Gaussian Diffusion

    [00:00] Intro to L3DG for 3D modeling [00:32] Solving room-sized 3D scene complexity [01:36] VQ-VAE compresses 3D Gaussian representation [02:41] Generative sparse transpose convolution [03:20] Latent diffusion for scene generation [04:30] Visual improvements over baselines [05:14] Scalability challenges for room-sized scenes [06:13] Spherical harmonics for view dependence [06:58] RGB and perceptual loss in training [07:59] L1 and SSIM for 3D Gaussian optimization [08:55] Training pipeline overview [09:59] Densification in 3D Gaussian optimization [10:49] Hyperparameter selection impact [11:45] Future research directions [12:42] Implementation optimization potential [13:29] Comparison with GANs and diffusion methods [14:27] Sparse grid representation trade-offs [15:26] Evaluation datasets [16:20] Chamfer distance for geometric analysis [17:13] Applications Authors: Barbara Roessle, Norman Müller, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, Angela Dai, Matthias Nießner Affiliations: Technical University of Munich, Meta Reality Labs Zurich Abstract: We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. 
We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation. Link: https://arxiv.org/abs/2410.13530

  17. 9

    The Ingredients for Robotic Diffusion Transformers

    [00:00] Intro [00:33] Combining transformers & diffusion models [01:12] Key design: Scalable attention blocks (AdaLN) [02:30] Efficient observation tokenization [03:45] DiT Block policy architecture overview [04:20] BiPlay dataset introduction [04:53] Performance improvements over baselines [05:30] Key findings from ablations [06:08] Generalization to different robot types [06:43] Simulation vs real-world performance [07:13] Takeaways and future research directions Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine Affiliations: Carnegie Mellon University, University of California, Berkeley. Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. 
Code, robot dataset, and videos are linked from the paper's arXiv page. Link: https://arxiv.org/abs/2410.10088

  18. 8

    Estimating Body and Hand Motion in an Ego-sensed World

    [00:00] Introduction to EgoAllo system [00:38] Challenges in egocentric motion estimation [01:20] Importance of spatial/temporal invariance [02:11] Comparison of conditioning parameterizations [02:57] Integration of hand observations [03:50] Global alignment phase [04:28] Guidance losses in sampling [05:03] Handling longer sequences [05:35] Evaluation results [06:30] System limitations and future work [07:13] Implications for other egocentric tasks [08:05] Advantages of diffusion models [09:07] Use of synthetic datasets [09:53] Promising research directions [10:43] Impact on future motion capture systems [11:41] Comparison to traditional methods [12:31] Improved hand estimation accuracy [13:25] SLAM data inaccuracies impact [14:09] Levenberg-Marquardt optimizer usage [15:14] Adapting to complex environments Authors: Brent Yi, Vickie Ye, Maya Zheng, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa Affiliation: UC Berkeley, UT Austin Abstract: We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates.  Project page: https://egoallo.github.io/

  19. 7

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    [00:00] Intro [00:28] Limitation of existing unified models [00:57] Janus's decoupled visual encoding solution [01:18] Advantages of decoupling [02:03] Janus architecture [02:50] Three-stage training [03:41] Ablation studies [04:23] Extensions for Janus [05:10] Performance gains [05:47] Current limitations [06:31] Impact of simplicity and extensibility [07:10] Qualitative results [08:18] Potential applications [08:52] Key takeaways Abstract: In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models. Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo Affiliations: DeepSeek-AI, The University of Hong Kong, Peking University Link: https://arxiv.org/abs/2410.13848

  20. 6

    One Step Diffusion via Shortcut Models

    [00:00] Introduction [00:23] Computational cost of traditional diffusion models [00:59] Reducing iterations in image generation [01:06] Shortcut models [01:39] Training process and self-consistency property [02:22] Advantages over other methods [03:05] Results on image generation benchmarks [03:45] Application to robotic control [04:15] Limitations and future work [04:54] Best practices Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel Affiliations: UC Berkeley Abstract: Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time. Link: https://arxiv.org/abs/2410.12557

  21. 5

    Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

    [00:00] Intro: Research on multi-task learning in LLMs [00:38] Balancing safety and performance in multilingual settings [01:17] Model merging techniques explored [02:16] Model merging outperforms data mixing [02:49] Merging monolingual models improves multilingual capabilities [03:28] Key ablation studies [04:14] Safety and performance evaluation metrics [04:52] Effectiveness variations across languages [05:26] Safety model weight impact in linear merging [06:00] Insights on merging and preference training [06:25] Comparison to existing research [06:56] Implications for LLM development [07:29] Limitations and future research Authors: Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker Affiliations: Cohere For AI, Cohere Abstract: Large Language Models (LLMs) have been adopted and deployed worldwide for a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms prevalent in Western-centric datasets, and safety protocols frequently fail to extend to multilingual settings. In this work, we explore model merging in a diverse multi-task setting, combining safety and general-purpose tasks within a multilingual context. Each language introduces unique and varied learning challenges across tasks. We find that objective-based merging is more effective than mixing data, with improvements of up to 8% and 10% in general performance and safety respectively. We also find that language-based merging is highly effective -- by merging monolingually fine-tuned models, we achieve a 4% increase in general performance and 7% reduction in harm across all languages on top of the data mixtures method using the same available data. Overall, our comprehensive study of merging approaches provides a useful framework for building strong and safe multilingual models. Link: https://arxiv.org/abs/2410.10801
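The chapter list for this episode mentions the safety model's weight in linear merging. As a rough illustration of what linear merging means in general, the sketch below takes a weighted average of two parameter sets. It is a minimal sketch, not the authors' code: the function name `linear_merge`, the `safety_weight` parameter, and the toy dictionaries of floats are all hypothetical stand-ins for real model checkpoints.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def linear_merge(safety: Params, general: Params, safety_weight: float) -> Params:
    """Weighted average of two parameter sets with identical shapes.

    safety_weight is the share given to the safety-tuned model; the
    general-purpose model receives the remaining (1 - safety_weight).
    """
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must be in [0, 1]")
    if safety.keys() != general.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Params = {}
    for name in safety:
        a, b = safety[name], general[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            safety_weight * x + (1.0 - safety_weight) * y for x, y in zip(a, b)
        ]
    return merged


# Toy example: two "models" with a single two-element parameter each.
safety_model = {"w": [1.0, 3.0]}
general_model = {"w": [3.0, 1.0]}
merged_model = linear_merge(safety_model, general_model, safety_weight=0.5)
```

Raising `safety_weight` pulls the merged parameters toward the safety-tuned model, which is the kind of trade-off the chapter title alludes to.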

  22. 4

    CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

    [00:00] Intro: CoTracker 3 paper [00:17] Challenge: Point tracker training [00:51] Existing methods' limitations [01:05] CoTracker 3 innovations [01:41] Novel semi-supervised training [02:24] Online vs offline versions [03:02] Performance improvements [03:47] Ablation study insights [04:29] Limitations and future work Authors: Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht Affiliations: Meta AI; Visual Geometry Group, University of Oxford Abstract: Most state-of-the-art point trackers are trained on synthetic data due to the difficulty of annotating real videos for this task. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. In order to understand these issues better, we introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe. This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers. The new model eliminates or simplifies components from previous trackers, resulting in a simpler and often smaller architecture. This training scheme is much simpler than prior work and achieves better results using 1,000 times less data. We further study the scaling behaviour to understand the impact of using more real unsupervised data in point tracking. The model is available in online and offline variants and reliably tracks visible and occluded points. Link: https://arxiv.org/abs/2410.11831

  23. 3

    nGPT: Normalized Transformer with Representation Learning on the Hypersphere

    [00:00] Introduction [00:30] Consistent unit norm normalization in NGPT [01:08] Mathematical mechanism behind faster convergence [01:52] Elimination of weight decay in NGPT [02:21] Role of learnable eigen learning rates in optimization [03:04] Discussion on training speedup vs. per-step computation time [03:46] Condition number differences between GPT and NGPT [04:18] Ablation studies on scaling factors [04:53] NGPT's relationship to Riemannian optimization [05:27] Future research [06:02] Takeaways for practitioners Authors: Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg Affiliations: NVIDIA Abstract: We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
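The abstract describes vectors that are kept at unit norm and hidden states that move along the surface of a hypersphere. The sketch below illustrates that idea on plain Python lists. It is only an illustration under stated assumptions: `unit_normalize` and `hypersphere_step` are hypothetical helper names, and the update rule (move part of the way toward a target, then re-normalize) is one simple way to stay on the sphere, not necessarily the exact rule used in nGPT.

```python
import math
from typing import List


def unit_normalize(v: List[float]) -> List[float]:
    """Scale a vector so its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def hypersphere_step(h: List[float], target: List[float], step_size: float) -> List[float]:
    """Move h a fraction of the way toward target, then project back onto the unit sphere.

    Assumption for illustration: both inputs are already unit-norm vectors
    of the same length, and step_size is a scalar in [0, 1].
    """
    moved = [hi + step_size * (ti - hi) for hi, ti in zip(h, target)]
    return unit_normalize(moved)


# Toy example in two dimensions.
hidden = unit_normalize([3.0, 4.0])               # approximately [0.6, 0.8]
suggestion = unit_normalize([0.0, 1.0])           # direction proposed by a block
updated = hypersphere_step(hidden, suggestion, step_size=0.5)
updated_norm = math.sqrt(sum(x * x for x in updated))
```

Because every step ends with a re-normalization, the hidden state never leaves the unit sphere, so only its direction changes from layer to layer.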

  24. 2

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    [00:00] Intro [00:20] Challenges in bimanual manipulation models [01:00] Robotics Diffusion Transformer (RDT) approach [01:34] RDT architecture design [02:22] Data scarcity and unified action space [03:07] Multi-task bimanual dataset [03:49] RDT's experimental results [04:30] Benefits of large-scale pre-training [05:12] Diffusion models in robotics [06:02] RDT's architectural modifications [06:57] Unified action space benefits [07:46] GPT-4 Turbo for data augmentation [08:24] Real robot experiments [09:14] Future research directions Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu Affiliations: Tsinghua University Abstract: Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. 
Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. Code and videos are available on the project page. Link: https://rdt-robotics.github.io/rdt-robotics/

  25. 1

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    [00:00] Continuous-time consistency models (CTMs) [00:20] Limitations of existing CTMs [00:51] TrigFlow: New CTM formulation [01:42] CTM training instability [02:21] Training objective modifications [02:55] Scaling CTMs to 1.5 billion parameters [03:37] Comparison with state-of-the-art models [04:14] Consistency training vs. distillation [04:52] CTMs vs. variational score distillation [05:20] Key takeaways for practitioners [06:09] JVP rearrangement and flash attention [06:52] FID metric evaluation [07:32] Adaptive weighting benefits [08:03] Future research directions [08:37] Conclusion Authors: Cheng Lu, Yang Song Affiliations: OpenAI Abstract: Consistency models (CMs) are a powerful class of diffusion-based generative models optimized for fast sampling. Most existing CMs are trained using discretized timesteps, which introduce additional hyperparameters and are prone to discretization errors. While continuous-time formulations can mitigate these issues, their success has been limited by training instability. To address this, we propose a simplified theoretical framework that unifies previous parameterizations of diffusion models and CMs, identifying the root causes of instability. Based on this analysis, we introduce key improvements in diffusion process parameterization, network architecture, and training objectives. These changes enable us to train continuous-time CMs at an unprecedented scale, reaching 1.5B parameters on ImageNet 512x512. Our proposed training algorithm, using only two sampling steps, achieves FID scores of 2.06 on CIFAR-10, 1.48 on ImageNet 64x64, and 1.88 on ImageNet 512x512, narrowing the gap in FID scores with the best existing diffusion models to within 10%.

ABOUT THIS SHOW

A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.

HOSTED BY

The AI Illuminators

Frequently Asked Questions

How many episodes does AI Illuminated have?

AI Illuminated currently has 25 episodes available on PodParley. New episodes are automatically indexed when they're published to the podcast feed.

What is AI Illuminated about?

A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.

How often does AI Illuminated release new episodes?

AI Illuminated has 25 episodes. Check the episode list to see recent publication dates and frequency.

Where can I listen to AI Illuminated?

You can listen to AI Illuminated on PodParley by clicking any episode. We provide an embedded audio player for direct listening, and you can also subscribe via your preferred podcast app using the RSS feed.

Who hosts AI Illuminated?

AI Illuminated is created and hosted by The AI Illuminators.