1

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

May 7, 2026

2

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

May 6, 2026

3

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

May 5, 2026

4

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

May 4, 2026

5

Co-Evolving Policy Distillation

May 3, 2026

6

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

May 2, 2026

7

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

May 1, 2026

8

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

Apr 30, 2026

9

Co-Director: Agentic Generative Video Storytelling

Apr 29, 2026

10

PageGuide: Browser extension to assist users in navigating a webpage and locating information

Apr 28, 2026

11

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Apr 27, 2026

12

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Apr 26, 2026

13

Coevolving Representations in Joint Image-Feature Diffusion

Apr 25, 2026

14

Seeing Fast and Slow: Learning the Flow of Time in Videos

Apr 24, 2026

15

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Apr 23, 2026

16

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Apr 22, 2026

17

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Apr 21, 2026

18

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips

Apr 20, 2026

19

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Apr 19, 2026

20

Three-Phase Transformer

Apr 18, 2026

21

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Apr 17, 2026

22

VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Apr 6, 2026

23

Therefore I am. I Think

Apr 5, 2026

24

Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Apr 4, 2026

25

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Apr 2, 2026

26

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Apr 1, 2026

27

A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Mar 31, 2026

28

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Mar 30, 2026

29

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Mar 29, 2026

30

AVControl: Efficient Framework for Training Audio-Visual Controls

Mar 28, 2026

31

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Mar 27, 2026

32

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Mar 26, 2026

33

Abstraction as a Memory-Efficient Inductive Bias for Continual Learning

Mar 25, 2026

34

Repurposing Geometric Foundation Models for Multi-view Diffusion

Mar 24, 2026

35

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Mar 23, 2026

36

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Mar 22, 2026

37

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Mar 21, 2026

38

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Mar 20, 2026

39

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Mar 19, 2026

40

ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Mar 18, 2026

41

POLCA: Stochastic Generative Optimization with LLM

Mar 17, 2026

42

VoXtream2: Full-stream TTS with dynamic speaking rate control

Mar 15, 2026

43

Autonomous Agents Coordinating Distributed Discovery Through Emergent Artifact Exchange

Mar 14, 2026

44

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 13, 2026

45

COMIC: Agentic Sketch Comedy Generation

Mar 12, 2026

46

Do What I Say: A Spoken Prompt Dataset for Instruction-Following

Mar 11, 2026

47

Variational Flow Maps: Make Some Noise for One-Step Conditional Generation

Mar 10, 2026

48

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Mar 9, 2026

49

SkillNet: Create, Evaluate, and Connect AI Skills

Mar 8, 2026

50

SageBwd: A Trainable Low-bit Attention

Mar 7, 2026

51

MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Mar 6, 2026

52

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Mar 5, 2026

53

ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution

Mar 4, 2026

54

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Mar 3, 2026

55

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Mar 2, 2026

56

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Mar 1, 2026

57

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Feb 28, 2026

58

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Feb 27, 2026

59

ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Feb 26, 2026

60

Benchmark Test-Time Scaling of General LLM Agents

Feb 25, 2026

61

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Feb 24, 2026

62

ReIn: Conversational Error Recovery with Reasoning Inception

Feb 23, 2026

63

"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing

Feb 22, 2026

64

"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing

Feb 21, 2026

65

"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing

Feb 20, 2026

66

BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Feb 19, 2026

67

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Feb 18, 2026

68

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Feb 17, 2026

69

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Feb 16, 2026

70

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Feb 15, 2026

71

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Feb 14, 2026

72

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Feb 8, 2026
