All Episodes
Unzip — 72 episodes
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
Co-Evolving Policy Distillation
LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
Co-Director: Agentic Generative Video Storytelling
PageGuide: Browser extension to assist users in navigating a webpage and locating information
DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
Coevolving Representations in Joint Image-Feature Diffusion
Seeing Fast and Slow: Learning the Flow of Time in Videos
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
Boosting Visual Instruction Tuning with Self-Supervised Guidance
Three-Phase Transformer
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Therefore I am. I Think
Video Models Reason Early: Exploiting Plan Commitment for Maze Solving
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
AVControl: Efficient Framework for Training Audio-Visual Controls
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Abstraction as a Memory-Efficient Inductive Bias for Continual Learning
Repurposing Geometric Foundation Models for Multi-view Diffusion
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
POLCA: Stochastic Generative Optimization with LLM
VoXtream2: Full-stream TTS with dynamic speaking rate control
Autonomous Agents Coordinating Distributed Discovery Through Emergent Artifact Exchange
SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
COMIC: Agentic Sketch Comedy Generation
Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Variational Flow Maps: Make Some Noise for One-Step Conditional Generation
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
SkillNet: Create, Evaluate, and Connect AI Skills
SageBwd: A Trainable Low-bit Attention
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
Benchmark Test-Time Scaling of General LLM Agents
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
ReIn: Conversational Error Recovery with Reasoning Inception
"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing
BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
Thinking with Drafting: Optical Decompression via Logical Reconstruction
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty