All Episodes

Daily Paper Cast — 1869 episodes

#
Title
1

MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

2

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

3

$δ$-mem: Efficient Online Memory for Large Language Models

4

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

5

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

6

World Action Models: The Next Frontier in Embodied AI

7

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

8

Efficient Pre-Training with Token Superposition

9

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

10

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

11

Qwen-Image-2.0 Technical Report

12

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

13

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

14

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

15

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

16

Model Merging Scaling Laws in Large Language Models

17

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

18

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

19

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

20

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

21

Flow-OPD: On-Policy Distillation for Flow Matching Models

22

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

23

Anisotropic Modality Align

24

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

25

MiA-Signature: Approximating Global Activation for Long-Context Understanding

26

When to Trust Imagination: Adaptive Action Execution for World Action Models

27

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

28

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

29

Stream-T1: Test-Time Scaling for Streaming Video Generation

30

RLDX-1 Technical Report

31

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

32

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

33

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

34

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

35

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

36

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

37

MolmoAct2: Action Reasoning Models for Real-world Deployment

38

From Context to Skills: Can Language Models Learn from Context Skillfully?

39

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

40

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

41

Heterogeneous Scientific Foundation Model Collaboration

42

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

43

Co-Evolving Policy Distillation

44

ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

45

Efficient Training on Multiple Consumer GPUs with RoundPipe

46

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

47

Large Language Models Explore by Latent Distilling

48

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

49

ClawGym: A Scalable Framework for Building Effective Claw Agents

50

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

51

Recursive Multi-Agent Systems

52

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

53

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

54

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

55

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

56

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

57

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

58

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

59

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

60

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

61

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

62

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

63

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

64

Video Analysis and Generation via a Semantic Progress Function

65

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

66

LLM Safety From Within: Detecting Harmful Content with Internal Representations

67

LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics

68

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

69

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

70

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

71

Near-Future Policy Optimization

72

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

73

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

74

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

75

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

76

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

77

AgentSPEX: An Agent SPecification and EXecution Language

78

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

79

TEMPO: Scaling Test-time Training for Large Reasoning Models

80

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

81

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

82

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

83

OpenGame: Open Agentic Coding for Games

84

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

85

EasyVideoR1: Easier RL for Video Understanding

86

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

87

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips

88

PersonaVLM: Long-Term Personalized Multimodal LLMs

89

Qwen3.5-Omni Technical Report

90

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

91

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

92

DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

93

Seedance 2.0: Advancing Video Generation for World Complexity

94

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

95

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

96

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

97

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

98

Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

99

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

100

Exploration and Exploitation Errors Are Measurable for Language Model Agents

101

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

102

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

103

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

104

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

105

Toward Autonomous Long-Horizon Engineering for ML Research

106

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

107

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

108

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

109

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

110

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

111

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

112

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

113

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

114

CocoaBench: Evaluating Unified Digital Agents in the Wild

115

CodeTracer: Towards Traceable Agent States

116

WildDet3D: Scaling Promptable 3D Detection in the Wild

117

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

118

EXAONE 4.5 Technical Report

119

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

120

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

121

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

122

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

123

RAGEN-2: Reasoning Collapse in Agentic RL

124

MARS: Enabling Autoregressive Models Multi-Token Generation

125

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

126

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

127

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

128

Learning to Retrieve from Agent Trajectories

129

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

130

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

131

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

132

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

133

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

134

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

135

Watch Before You Answer: Learning from Visually Grounded Post-Training

136

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

137

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

138

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

139

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

140

Adam's Law: Textual Frequency Law on Large Language Models

141

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

142

ClawArena: Benchmarking AI Agents in Evolving Information Environments

143

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

144

LightThinker++: From Reasoning Compression to Memory Management

145

Self-Distilled RLVR

146

A Simple Baseline for Streaming Video Understanding

147

Token Warping Helps MLLMs Look from Nearby Viewpoints

148

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

149

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

150

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

151

Generative World Renderer

152

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

153

Steerable Visual Representations

154

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

155

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

156

ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

157

Terminal Agents Suffice for Enterprise Automation

158

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

159

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

160

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

161

QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

162

Reasoning Shift: How Context Silently Shortens LLM Reasoning

163

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

164

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

165

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

166

Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells

167

GEMS: Agent-Native Multimodal Generation with Memory and Skills

168

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

169

VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

170

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

171

CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

172

daVinci-LLM: Towards the Science of Pretraining

173

TAPS: Task Aware Proposal Distributions for Speculative Sampling

174

Towards a Medical AI Scientist

175

Gen-Searcher: Reinforcing Agentic Search for Image Generation

176

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

177

EpochX: Building the Infrastructure for an Emergent Agent Civilization

178

On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

179

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

180

Make Geometry Matter for Spatial Reasoning

181

PRBench: End-to-end Paper Reproduction in Physics Research

182

PixelSmile: Toward Fine-Grained Facial Expression Editing

183

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

184

Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

185

RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

186

MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

187

Voxtral TTS

188

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

189

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

190

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

191

From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

192

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

193

PEARL: Personalized Streaming Video Understanding Model

194

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

195

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

196

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

197

RealMaster: Lifting Rendered Scenes into Photorealistic Video

198

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

199

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

200

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

201

Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

202

OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

203

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

204

SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

205

F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

206

mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

207

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

208

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

209

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

210

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

211

FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

212

The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus

213

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

214

Hyperagents

215

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

216

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

217

FASTER: Rethinking Real-Time Flow VLAs

218

3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

219

Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

220

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

221

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

222

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

223

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

224

Memento-Skills: Let Agents Design Agents

225

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

226

Video-CoE: Reinforcing Video Event Prediction via Chain of Events

227

MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

228

Alignment Makes Language Models Normative, Not Descriptive

229

Complementary Reinforcement Learning

230

When AI Navigates the Fog of War

231

MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

232

InCoder-32B: Code Foundation Model for Industrial Scenarios

233

Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

234

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

235

Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

236

Demystifying Video Reasoning

237

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

238

TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

239

Online Experiential Learning for Language Models

240

FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

241

AI Can Learn Scientific Taste

242

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

243

EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

244

Grounding World Simulation Models in a Real-World Metropolis

245

HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

246

Attention Residuals

247

Mixture-of-Depths Attention

248

Effective Distillation to Hybrid xLSTM Architectures

249

Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

250

ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

251

LMEB: Long-horizon Memory Embedding Benchmark

252

Can Vision-Language Models Solve the Shell Game?

253

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

254

daVinci-Env: Open SWE Environment Synthesis at Scale

255

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

256

OpenClaw-RL: Train Any Agent Simply by Talking

257

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

258

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

259

LLM2Vec-Gen: Generative Embeddings from Large Language Models

260

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

261

STEP3-VL-10B Technical Report

262

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

263

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

264

Controlled Self-Evolution for Algorithmic Code Optimization

265

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

266

MAXS: Meta-Adaptive Exploration with LLM Agents

267

Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

268

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

269

SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

270

OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

271

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

272

MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences

273

Solar Open Technical Report

274

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

275

User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale

276

ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

277

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

278

MemoBrain: Executive Memory as an Agentic Brain for Reasoning

279

Motion Attribution for Video Generation

280

3AM: Segment Anything with Geometric Consistency in Videos

281

BabyVision: Visual Reasoning Beyond Language

282

PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

283

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

284

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

285

GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

286

Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

287

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

288

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

289

MMFormalizer: Multimodal Autoformalization in the Wild

290

CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature

291

The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

292

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

293

EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

294

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

295

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

296

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

297

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

298

Token-Level LLM Collaboration via FusionRoute

299

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

300

Evolving Programmatic Skill Networks

301

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

302

Benchmark^2: Systematic Evaluation of LLM Benchmarks

303

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

304

LTX-2: Efficient Joint Audio-Visual Foundation Model

305

MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

306

SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

307

NitroGen: An Open Foundation Model for Generalist Gaming Agents

308

Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

309

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

310

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

311

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

312

GARDO: Reinforcing Diffusion Models without Reward Hacking

313

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

314

VINO: A Unified Visual Generator with Interleaved OmniModal Context

315

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

316

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

317

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

318

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

319

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

320

Deep Delta Learning

321

AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction

322

Nested Learning: The Illusion of Deep Learning Architectures

323

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

324

Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space

325

mHC: Manifold-Constrained Hyper-Connections

326

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

327

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

328

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

329

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

330

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

331

Yume-1.5: A Text-Controlled Interactive World Generation Model

332

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

333

Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

334

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

335

Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

336

SpotEdit: Selective Region Editing in Diffusion Transformers

337

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

338

InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

339

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

340

MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

341

Latent Implicit Visual Reasoning

342

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

343

TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

344

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

345

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

346

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

347

SemanticGen: Video Generation in Semantic Space

348

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

349

LongVideoAgent: Multi-Agent Reasoning with Long Videos

350

SpatialTree: How Spatial Abilities Branch Out in MLLMs

351

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

352

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

353

Region-Constraint In-Context Generation for Instructional Video Editing

354

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

355

Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

356

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

357

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

358

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

359

When Reasoning Meets Its Laws

360

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

361

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

362

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

363

Are We on the Right Way to Assessing LLM-as-a-Judge?

364

Kling-Omni Technical Report

365

Adaptation of Agentic AI

366

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

367

Next-Embedding Prediction Makes Strong Vision Learners

368

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

369

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

370

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

371

Generative Refocusing: Flexible Defocus Control from a Single Image

372

DeContext as Defense: Safe Image Editing in Diffusion Transformers

373

Step-GUI Technical Report

374

DEER: Draft with Diffusion, Verify with Autoregressive Models

375

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

376

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

377

Puzzle Curriculum GRPO for Vision-Centric Reasoning

378

MMGR: Multi-Modal Generative Reasoning

379

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

380

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

381

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

382

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

383

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

384

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

385

Towards Scalable Pre-training of Visual Tokenizers for Generation

386

Memory in the Age of AI Agents

387

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

388

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

389

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

390

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

391

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

392

KlingAvatar 2.0 Technical Report

393

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

394

EgoX: Egocentric Video Generation from a Single Exocentric Video

395

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

396

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

397

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

398

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

399

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

400

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

401

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

402

Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

403

StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

404

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

405

OmniPSD: Layered PSD Generation with Diffusion Transformer

406

Composing Concepts from Images and Videos via Concept-prompt Binding

407

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

408

Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform

409

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

410

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

411

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

412

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

413

Unified Video Editing with Temporal Reasoner

414

Voxify3D: Pixel Art Meets Volumetric Rendering

415

Scaling Zero-Shot Reference-to-Video Generation

416

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

417

TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

418

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

419

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

420

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

421

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

422

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

423

Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

424

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

425

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

426

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

427

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

428

Qwen3-VL Technical Report

429

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

430

PretrainZero: Reinforcement Active Pretraining

431

ViDiC: Video Difference Captioning

432

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

433

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

434

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

435

MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

436

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

437

DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

438

Guided Self-Evolving LLMs with Minimal Human Supervision

439

SimScale: Learning to Drive via Real-World Simulation at Scale

440

InnoGym: Benchmarking the Innovation Potential of AI Agents

441

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

442

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

443

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

444

How Far Are We from Genuinely Useful Deep Research Agents?

445

What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards

446

Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

447

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

448

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

449

LFM2 Technical Report

450

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

451

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

452

Vision Bridge Transformer at Scale

453

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

454

Architecture Decoupling Is Not All You Need For Unified Multimodal Model

455

Multimodal Evaluation of Russian-language Architectures

456

Latent Collaboration in Multi-Agent Systems

457

Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

458

GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms

459

MedSAM3: Delving into Segment Anything with Medical Concepts

460

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

461

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

462

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

463

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

464

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

465

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

466

Soft Adaptive Policy Optimization

467

General Agentic Memory Via Deep Research

468

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

469

Computer-Use Agents as Judges for Generative User Interface

470

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

471

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

472

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

473

In-Video Instructions: Visual Signals as Generative Control

474

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

475

Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

476

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

477

SAM 3: Segment Anything with Concepts

478

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

479

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

480

What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

481

VisPlay: Self-Evolving Vision-Language Models from Images

482

Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

483

VIDEOP2R: Video Understanding from Perception to Reasoning

484

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

485

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

486

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

487

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

488

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

489

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

490

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

491

P1: Mastering Physics Olympiads with Reinforcement Learning

492

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

493

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

494

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

495

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

496

GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning

497

TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

498

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

499

GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

500

DoPE: Denoising Rotary Position Embedding

501

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

502

UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

503

AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery

504

LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

505

Virtual Width Networks

506

One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

507

PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

508

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

509

Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

510

DeepEyesV2: Toward Agentic Multimodal Model

511

Visual Spatial Tuning

512

VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks

513

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

514

V-Thinker: Interactive Thinking with Images

515

Scaling Agent Learning via Experience Synthesis

516

Diffusion Language Models are Super Data Learners

517

LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

518

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

519

Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

520

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

521

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

522

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

523

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

524

The Underappreciated Power of Vision Models for Graph Structural Understanding

525

UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

526

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

527

PHUMA: Physically-Grounded Humanoid Locomotion Dataset

528

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

529

World Simulation with Video Foundation Models for Physical AI

530

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

531

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

532

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

533

The End of Manual Decoding: Towards Truly End-to-End Language Models

534

Kimi Linear: An Expressive, Efficient Attention Architecture

535

Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

536

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

537

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

538

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

539

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

540

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

541

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

542

Language Models are Injective and Hence Invertible

543

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

544

LightMem: Lightweight and Efficient Memory-Augmented Generation

545

Efficient Long-context Language Model Training by Core Attention Disaggregation

546

World-in-World: World Models in a Closed-Loop World

547

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

548

Chem-R: Learning to Reason as a Chemist

549

MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

550

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

551

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

552

IF-VidCap: Can Video Caption Models Follow Instructions?

553

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

554

PICABench: How Far Are We from Physically Realistic Image Editing?

555

Glyph: Scaling Context Windows via Visual-Text Compression

556

FineVision: Open Data Is All You Need

557

TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

558

Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

559

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

560

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

561

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

562

NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks

563

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

564

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

565

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

566

Latent Diffusion Model without Variational Autoencoder

567

When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

568

Agentic Entropy-Balanced Policy Optimization

569

WithAnyone: Towards Controllable and ID Consistent Image Generation

570

AI for Service: Proactive Assistance with AI Glasses

571

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

572

ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

573

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

574

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

575

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

576

BitNet Distillation

577

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

578

Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

579

DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

580

Scaling Language-Centric Omnimodal Representation Learning

581

Robot Learning: A Tutorial

582

Detect Anything via Next Point Prediction

583

A Survey of Vibe Coding with Large Language Models

584

FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

585

Dr.LLM: Dynamic Layer Routing in LLMs

586

Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models

587

QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

588

Diffusion Transformers with Representation Autoencoders

589

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

590

Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

591

Spotlight on Token Perception for Multimodal Reinforcement Learning

592

RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

593

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

594

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

595

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

596

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

597

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

598

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

599

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling

600

AutoPR: Let's Automate Your Academic Promotion!

601

Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

602

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

603

StreamingVLM: Real-Time Understanding for Infinite Video Streams

604

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

605

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

606

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

607

Agent Learning via Early Experience

608

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

609

MemMamba: Rethinking Memory Patterns in State Space Model

610

UniVideo: Unified Understanding, Generation, and Editing for Videos

611

From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning

612

When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

613

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

614

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

615

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

616

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

617

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

618

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

619

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

620

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

621

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

622

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

623

Vibe Checker: Aligning Code Evaluation with Human Preference

624

Less is More: Recursive Reasoning with Tiny Networks

625

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

626

Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

627

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

628

Fast-dLLM v2: Efficient Block-Diffusion LLM

629

CoDA: Coding LM via Diffusion Adaptation

630

Drax: Speech Recognition with Discrete Flow Matching

631

Paper2Video: Automatic Video Generation from Scientific Papers

632

MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information

633

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

634

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

635

Imperceptible Jailbreaking against Large Language Models

636

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

637

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

638

Optimal Scaling Needs Optimal Norm

639

Apriel-1.5-15b-Thinker

640

Large Reasoning Models Learn Better Alignment from Flawed Thinking

641

Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

642

LongCodeZip: Compress Long Context for Code Language Models

643

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

644

ExGRPO: Learning to Reason from Experience

645

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

646

Interactive Training: Feedback-Driven Neural Network Optimization

647

ModernVBERT: Towards Smaller Visual Document Retrievers

648

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

649

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

650

GEM: A Gym for Agentic LLMs

651

VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

652

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

653

PIPer: On-Device Environment Setup via Online Reinforcement Learning

654

SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

655

ACON: Optimizing Context Compression for Long-horizon LLM Agents

656

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

657

The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

658

Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

659

Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

660

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

661

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

662

OceanGym: A Benchmark Environment for Underwater Embodied Agents

663

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

664

Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

665

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

666

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

667

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

668

Multiplayer Nash Preference Optimization

669

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

670

Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

671

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

672

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

673

Democratizing AI scientists using ToolUniverse

674

Visual Jigsaw Post-Training Improves MLLMs

675

When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

676

LongLive: Real-time Interactive Long Video Generation

677

Quantile Advantage Estimation for Entropy-Safe Reasoning

678

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

679

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

680

ReviewScore: Misinformed Peer Review Detection with Large Language Models

681

Variational Reasoning for Language Models

682

Language Models Can Learn from Verbal Feedback Without Scalar Rewards

683

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

684

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

685

No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

686

VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

687

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

688

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

689

Tree Search for LLM Agent Reinforcement Learning

690

Seedream 4.0: Toward Next-generation Multimodal Image Generation

691

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

692

AutoIntent: AutoML for Text Classification

693

Video models are zero-shot learners and reasoners

694

SIM-CoT: Supervised Implicit Chain-of-Thought

695

Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

696

Reinforcement Learning on Pre-Training Data

697

Do You Need Proprioceptive States in Visuomotor Policies?

698

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

699

LIMI: Less is More for Agency

700

Qwen3-Omni Technical Report

701

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

702

OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

703

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

704

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

705

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

706

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

707

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

708

FlowRL: Matching Reward Distributions for LLM Reasoning

709

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

710

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

711

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

712

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

713

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

714

SAIL-VL2 Technical Report

715

PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

716

WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

717

Scaling Agents via Continual Pre-training

718

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

719

Towards General Agentic Intelligence via Environment Scaling

720

WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

721

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

722

Single-stream Policy Optimization

723

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

724

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

725

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

726

IntrEx: A Dataset for Modeling Engagement in Educational Conversations

727

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

728

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

729

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

730

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

731

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

732

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

733

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

734

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

735

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

736

MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

737

A Survey of Reinforcement Learning for Large Reasoning Models

738

RewardDance: Reward Scaling in Visual Generation

739

3D and 4D World Modeling: A Survey

740

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

741

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

742

Visual Representation Alignment for Multimodal Large Language Models

743

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

744

Reconstruction Alignment Improves Unified Multimodal Models

745

UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

746

Reverse-Engineered Reasoning for Open-Ended Generation

747

Does DINOv3 Set a New Medical Vision Standard?

748

Symbolic Graphics Programming with Large Language Models

749

Set Block Decoding is a Language Model Inference Accelerator

750

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

751

From Editor to Dense Geometry Estimator

752

Towards a Unified View of Large Language Model Post-Training

753

DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

754

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

755

Open Data Synthesis For Deep Research

756

Robix: A Unified Model for Robot Interaction, Reasoning and Planning

757

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

758

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

759

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

760

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

761

Baichuan-M2: Scaling Medical Capability with Large Verifier System

762

Kwai Keye-VL 1.5 Technical Report

763

Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

764

PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

765

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

766

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

767

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

768

VibeVoice Technical Report

769

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

770

VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

771

OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

772

Spacer: Towards Engineered Scientific Inspiration

773

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

774

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

775

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

776

MV-RAG: Retrieval Augmented Multiview Diffusion

777

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

778

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

779

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks

780

Intern-S1: A Scientific Multimodal Foundation Model

781

Mobile-Agent-v3: Fundamental Agents for GUI Automation

782

Deep Think with Confidence

783

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

784

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

785

From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

786

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

787

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

788

Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

789

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

790

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

791

Prompt Orchestration Markup Language

792

Ovis2.5 Technical Report

793

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

794

4DNeX: Feed-Forward 4D Generative Modeling Made Easy

795

Next Visual Granularity Generation

796

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

797

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

798

Has GPT-5 Achieved Spatial Intelligence? An Empirical Study

799

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

800

SSRL: Self-Search Reinforcement Learning

801

DINOv3

802

Thyme: Think Beyond Images

803

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

804

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

805

We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

806

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

807

PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

808

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

809

Story2Board: A Training-Free Approach for Expressive Storyboard Generation

810

Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery

811

Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

812

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

813

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

814

AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving

815

Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

816

Complex Logical Instruction Generation

817

Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

818

HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches

819

ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

820

WideSearch: Benchmarking Agentic Broad Info-Seeking

821

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

822

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

823

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

824

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

825

Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

826

MolmoAct: Action Reasoning Models that can Reason in Space

827

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

828

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

829

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

830

VeriGUI: Verifiable Long-Chain GUI Dataset

831

Efficient Agents: Building Effective Agents While Reducing Cost

832

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

833

Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

834

Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

835

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

836

Qwen-Image Technical Report

837

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

838

CellForge: Agentic Design of Virtual Cell Models

839

Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

840

Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report

841

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

842

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

843

PixNerd: Pixel Neural Field Diffusion

844

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

845

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

846

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

847

BANG: Dividing 3D Assets via Generative Exploded Dynamics

848

VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

849

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

850

X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

851

ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge

852

Agentic Reinforced Policy Optimization

853

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

854

A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

855

Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

856

SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

857

Reconstructing 4D Spatial Intelligence: A Survey

858

Deep Researcher with Test-Time Diffusion

859

$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention

860

Group Sequence Policy Optimization

861

MUR: Momentum Uncertainty guided Reasoning for Large Language Models

862

LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

863

Pixels, Patterns, but No Poetry: To See The World like Humans

864

Yume: An Interactive World Generation Model

865

DesignLab: Designing Slides Through Iterative Detection and Correction

866

Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning

867

Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

868

Step-Audio 2 Technical Report

869

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

870

Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

871

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

872

GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

873

MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

874

The Invisible Leash: Why RLVR May Not Escape Its Origin

875

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

876

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

877

GR-3 Technical Report

878

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

879

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

880

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

881

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

882

A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

883

A Survey of Context Engineering for Large Language Models

884

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

885

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning

886

The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

887

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

888

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

889

RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization

890

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

891

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

892

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

893

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

894

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

895

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

896

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

897

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

898

Test-Time Scaling with Reflective Generative Model

899

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

900

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

901

CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

902

KV Cache Steering for Inducing Reasoning in Small Language Models

903

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

904

Neural-Driven Image Editing

905

Scaling RL to Long Videos

906

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

907

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

908

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

909

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

910

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

911

PyVision: Agentic Vision with Dynamic Tooling

912

4KAgent: Agentic Any Image to 4K Super-Resolution

913

Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

914

Perception-Aware Policy Optimization for Multimodal Reasoning

915

MIRIX: Multi-Agent Memory System for LLM-Based Agents

916

Rethinking Verification for LLM Code Generation: From Generation to Testing

917

SingLoRA: Low Rank Adaptation Using a Single Matrix

918

A Survey on Latent Reasoning

919

OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

920

How to Train Your LLM Web Agent: A Statistical Diagnosis

921

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

922

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

923

RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

924

MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos

925

MemOS: A Memory OS for AI System

926

Should We Still Pretrain Encoders with Masked Language Modeling?

927

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

928

4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

929

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

930

Pre-Trained Policy Discriminators are General Reward Models

931

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

932

WebSailor: Navigating Super-human Reasoning for Web Agent

933

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

934

Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

935

IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction

936

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

937

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

938

Kwai Keye-VL Technical Report

939

LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

940

Depth Anything at Any Condition

941

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

942

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

943

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

944

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

945

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

946

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

947

Ovis-U1 Technical Report

948

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

949

VMoBA: Mixture-of-Block Attention for Video Diffusion Models

950

Calligrapher: Freestyle Text Image Customization

951

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

952

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

953

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

954

Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

955

Effective Red-Teaming of Policy-Adherent Agents

956

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

957

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

958

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

959

Text-Aware Image Restoration with Diffusion Models

960

AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

961

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

962

Discrete Audio Tokens: More Than a Survey!

963

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

964

Seedance 1.0: Exploring the Boundaries of Video Generation Models

965

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

966

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

967

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

968

PlayerOne: Egocentric World Simulator

969

Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

970

Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models

971

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

972

RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

973

Reinforcement Pre-Training

974

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

975

MiniCPM4: Ultra-Efficient LLMs on End Devices

976

SpatialLM: Training Large Language Models for Structured Indoor Modeling

977

Image Reconstruction as a Tool for Feature Analysis

978

Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

979

Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

980

FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

981

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

982

Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs

983

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

984

ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

985

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

986

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

987

Video World Models with Long-term Spatial Memory

988

Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

989

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

990

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

991

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

992

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

993

MiMo-VL Technical Report

994

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

995

AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

996

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

997

A Controllable Examination for Long-Context Language Models

998

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

999

Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

1000

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

1001

Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

1002

VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

1003

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

1004

SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

1005

CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

1006

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

1007

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

1008

OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation

1009

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

1010

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

1011

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

1012

Taming LLMs by Scaling Learning Rates with Gradient Grouping

1013

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

1014

Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

1015

LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

1016

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

1017

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

1018

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

1019

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

1020

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

1021

Time Blindness: Why Video-Language Models Can't See What Humans Can?

1022

HardTests: Synthesizing High-Quality Test Cases for LLM Coding

1023

Large Language Models for Data Synthesis

1024

Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

1025

ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

1026

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

1027

Table-R1: Inference-Time Scaling for Table Reasoning

1028

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

1029

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

1030

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

1031

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

1032

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

1033

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

1034

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

1035

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

1036

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

1037

Skywork Open Reasoner 1 Technical Report

1038

Sherlock: Self-Correcting Reasoning in Vision-Language Models

1039

Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

1040

SageAttention2++: A More Efficient Implementation of SageAttention2

1041

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

1042

Fostering Video Reasoning via Next-Event Prediction

1043

RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

1044

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

1045

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

1046

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

1047

OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

1048

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

1049

SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

1050

Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

1051

Exploring the Latent Capacity of LLMs for One-Step Text Generation

1052

Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

1053

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization

1054

Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

1055

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

1056

Alchemist: Turning Public Text-to-Image Data into Generative Gold

1057

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

1058

PATS: Process-Level Adaptive Thinking Mode Switching

1059

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

1060

ARM: Adaptive Reasoning Model

1061

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

1062

Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

1063

B-score: Detecting biases in large language models using response history

1064

TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

1065

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

1066

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

1067

Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

1068

One RL to See Them All: Visual Triple Unified Reinforcement Learning

1069

Distilling LLM Agent into Small Models with Retrieval and Code Tools

1070

QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization

1071

PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

1072

Scaling Image and Video Generation via Test-Time Evolutionary Search

1073

MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

1074

NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

1075

Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

1076

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

1077

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

1078

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

1079

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

1080

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

1081

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

1082

Scaling Diffusion Transformers Efficiently via $μ$P

1083

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

1084

MMaDA: Multimodal Large Diffusion Language Models

1085

Scaling Law for Quantization-Aware Training

1086

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

1087

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

1088

Efficient Agent Training for Computer Use

1089

This Time is Different: An Observability Perspective on Time Series Foundation Models

1090

Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

1091

Emerging Properties in Unified Multimodal Pretraining

1092

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

1093

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

1094

VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

1095

Visual Agentic Reinforcement Fine-Tuning

1096

Neurosymbolic Diffusion Models

1097

Chain-of-Model Learning for Language Model

1098

AdaptThink: Reasoning Models Can Learn When to Think

1099

AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

1100

Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

1101

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

1102

Faster Video Diffusion with Trainable Sparse Attention

1103

Thinkless: LLM Learns When to Think

1104

Model Merging in Pre-training of Large Language Models

1105

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

1106

Qwen3 Technical Report

1107

GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

1108

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

1109

Visual Planning: Let's Think Only with Images

1110

Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

1111

System Prompt Optimization with Meta-Learning

1112

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

1113

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

1114

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

1115

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

1116

Seed1.5-VL Technical Report

1117

MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

1118

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

1119

Learning from Peers in Reasoning Models

1120

Unified Continuous Generative Models

1121

REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback

1122

Bielik v3 Small: Technical Report

1123

Bielik 11B v2 Technical Report

1124

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

1125

On Path to Multimodal Generalist: General-Level and General-Bench

1126

Flow-GRPO: Training Flow Matching Models via Online RL

1127

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

1128

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

1129

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

1130

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

1131

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

1132

FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

1133

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

1134

RM-R1: Reward Modeling as Reasoning

1135

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

1136

Practical Efficiency of Muon for Pretraining

1137

PixelHacker: Image Inpainting with Structural and Semantic Consistency

1138

A Survey of Interactive Generative Video

1139

DeepCritic: Deliberate Critique with Large Language Models

1140

Sadeed: Advancing Arabic Diacritization Through Small Language Model

1141

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

1142

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

1143

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

1144

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

1145

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

1146

ReasonIR: Training Retrievers for Reasoning Tasks

1147

The Leaderboard Illusion

1148

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

1149

RepText: Rendering Visual Text via Replicating

1150

Towards Understanding Camera Motions in Any Video

1151

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

1152

BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

1153

Step1X-Edit: A Practical Framework for General Image Editing

1154

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

1155

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

1156

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

1157

DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

1158

Trillion 7B Technical Report

1159

Tina: Tiny Reasoning Models via LoRA

1160

I-Con: A Unifying Framework for Representation Learning

1161

Kuwain 1.5B: An Arabic SLM via Language Injection

1162

TTRL: Test-Time Reinforcement Learning

1163

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

1164

Describe Anything: Detailed Localized Image and Video Captioning

1165

Learning Adaptive Parallel Reasoning with Language Models

1166

Learning to Reason under Off-Policy Guidance

1167

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

1168

FlowReasoner: Reinforcing Query-Level Meta-Agents

1169

ToolRL: Reward is All Tool Learning Needs

1170

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

1171

StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

1172

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

1173

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

1174

NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes

1175

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

1176

Antidistillation Sampling

1177

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

1178

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

1179

WORLDMEM: Long-term Consistent World Simulation with Memory

1180

A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

1181

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

1182

BitNet b1.58 2B4T Technical Report

1183

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

1184

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

1185

Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

1186

How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

1187

Heimdall: test-time scaling on the generative verification

1188

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

1189

TextArena

1190

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

1191

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

1192

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

1193

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

1194

Iterative Self-Training for Code Generation via Reinforced Re-Ranking

1195

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

1196

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

1197

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

1198

Kimi-VL Technical Report

1199

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

1200

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

1201

DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning

1202

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

1203

MM-IFEngine: Towards Multimodal Instruction Following

1204

HoloPart: Generative 3D Part Amodal Segmentation

1205

DDT: Decoupled Diffusion Transformer

1206

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

1207

A Unified Agentic Framework for Evaluating Conditional Image Generation

1208

Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

1209

OmniSVG: A Unified Scalable Vector Graphics Generation Model

1210

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

1211

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

1212

An Empirical Study of GPT-4o Image Generation Capabilities

1213

COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

1214

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

1215

SmolVLM: Redefining small and efficient multimodal models

1216

One-Minute Video Generation with Test-Time Training

1217

Rethinking Reflection in Pre-Training

1218

URECA: Unique Region Caption Anything

1219

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

1220

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

1221

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

1222

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

1223

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

1224

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

1225

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

1226

WikiVideo: Article Generation from Multiple Videos

1227

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

1228

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

1229

Understanding R1-Zero-Like Training: A Critical Perspective

1230

Towards Physically Plausible Video Generation via VLM Planning

1231

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

1232

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

1233

START: Self-taught Reasoner with Tools

1234

Token-Efficient Long Video Understanding for Multimodal LLMs

1235

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

1236

EgoLife: Towards Egocentric Life Assistant

1237

Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

1238

HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

1239

Process-based Self-Rewarding Language Models

1240

Visual-RFT: Visual Reinforcement Fine-Tuning

1241

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

1242

Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

1243

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

1244

Chain of Draft: Thinking Faster by Writing Less

1245

Multi-Turn Code Generation Through Single-Step Rewards

1246

Self-rewarding correction for mathematical reasoning

1247

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

1248

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

1249

LongRoPE2: Near-Lossless LLM Context Window Scaling

1250

FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving

1251

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

1252

UniTok: A Unified Tokenizer for Visual Generation and Understanding

1253

NeoBERT: A Next-Generation BERT

1254

Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

1255

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

1256

GHOST 2.0: generative high-fidelity one shot transfer of heads

1257

Kanana: Compute-efficient Bilingual Language Models

1258

TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

1259

Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance

1260

Language Models' Factuality Depends on the Language of Inquiry

1261

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

1262

Towards an AI co-scientist

1263

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

1264

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

1265

Rank1: Test-Time Compute for Reranking in Information Retrieval

1266

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

1267

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

1268

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

1269

How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

1270

S*: Test Time Scaling for Code Generation

1271

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

1272

Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning

1273

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

1274

Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information

1275

S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

1276

Qwen2.5-VL Technical Report

1277

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

1278

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

1279

MoM: Linear Sequence Modeling with Mixture-of-Memories

1280

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

1281

Craw4LLM: Efficient Web Crawling for LLM Pretraining

1282

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

1283

Small Models Struggle to Learn from Strong Reasoners

1284

Autellix: An Efficient Serving Engine for LLM Agents as General Programs

1285

SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?

1286

Soundwave: Less is More for Speech-Text Alignment in LLMs

1287

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

1288

Continuous Diffusion Model for Language Modeling

1289

Phantom: Subject-consistent video generation via cross-modal alignment

1290

Rethinking Diverse Human Preference Learning through Principal Component Analysis

1291

Magma: A Foundation Model for Multimodal AI Agents

1292

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

1293

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

1294

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

1295

You Do Not Fully Utilize Transformer's Representation Capacity

1296

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

1297

Learning Getting-Up Policies for Real-World Humanoid Robots

1298

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

1299

CRANE: Reasoning with constrained LLM generation

1300

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

1301

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

1302

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

1303

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

1304

Region-Adaptive Sampling for Diffusion Transformers

1305

Large Language Diffusion Models

1306

The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

1307

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

1308

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

1309

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

1310

ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

1311

Diverse Inference and Verification for Advanced Reasoning

1312

Precise Parameter Localization for Textual Generation in Diffusion Models

1313

DarwinLM: Evolutionary Structured Pruning of Large Language Models

1314

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

1315

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

1316

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

1317

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

1318

Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

1319

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

1320

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

1321

Exploring the Potential of Encoder-free Architectures in 3D LMMs

1322

CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

1323

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

1324

Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

1325

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

1326

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

1327

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

1328

Distillation Scaling Laws

1329

TransMLA: Multi-Head Latent Attention Is All You Need

1330

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

1331

LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

1332

Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

1333

Expect the Unexpected: FailSafe Long Context QA for Finance

1334

Competitive Programming with Large Reasoning Models

1335

Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models

1336

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

1337

Magic 1-For-1: Generating One Minute Video Clips within One Minute

1338

LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

1339

Teaching Language Models to Critique via Reinforcement Learning

1340

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

1341

Enhance-A-Video: Better Generated Video for Free

1342

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

1343

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

1344

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

1345

Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

1346

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

1347

LM2: Large Memory Models

1348

Matryoshka Quantization

1349

Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

1350

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

1351

ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

1352

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

1353

Fast Video Generation with Sliding Tile Attention

1354

Goku: Flow Based Video Generative Foundation Models

1355

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

1356

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

1357

AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

1358

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

1359

Agency Is Frame-Dependent

1360

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

1361

Generating Symbolic World Models via Test-time Scaling of Large Language Models

1362

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

1363

UltraIF: Advancing Instruction Following from the Wild

1364

Great Models Think Alike and this Undermines AI Oversight

1365

Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

1366

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

1367

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

1368

MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

1369

ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

1370

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

1371

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

1372

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

1373

Demystifying Long Chain-of-Thought Reasoning in LLMs

1374

LIMO: Less is More for Reasoning

1375

Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

1376

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

1377

On Teacher Hacking in Language Model Distillation

1378

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

1379

Jailbreaking with Universal Multi-Prompts

1380

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

1381

Inverse Bridge Matching Distillation

1382

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

1383

QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

1384

Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

1385

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

1386

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

1387

The Differences Between Direct Alignment Algorithms are a Blur

1388

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

1389

Process Reinforcement through Implicit Rewards

1390

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

1391

SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model

1392

Preference Leakage: A Contamination Problem in LLM-as-a-judge

1393

SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

1394

MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

1395

AIN: The Arabic INclusive Large Multimodal Model

1396

s1: Simple test-time scaling

1397

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

1398

Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models

1399

PixelWorld: Towards Perceiving Everything as Pixels

1400

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

1401

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

1402

Scalable-Softmax Is Superior for Attention

1403

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

1404

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

1405

GuardReasoner: Towards Reasoning-based LLM Safeguards

1406

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

1407

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

1408

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

1409

Large Language Models Think Too Fast To Explore Effectively

1410

WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

1411

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

1412

o3-mini vs DeepSeek-R1: Which One is Safer?

1413

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

1414

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

1415

Atla Selene Mini: A General Purpose Evaluation Model

1416

Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts

1417

Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

1418

Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

1419

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

1420

People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

1421

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

1422

Optimizing Large Language Model Training Using FP4 Quantization

1423

DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

1424

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

1425

Open Problems in Mechanistic Interpretability

1426

Low-Rank Adapters Meet Neural Architecture Search for LLM Compression

1427

IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding

1428

Histoires Morales: A French Dataset for Assessing Moral Alignment

1429

Qwen2.5-1M Technical Report

1430

ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

1431

Towards General-Purpose Model-Free Reinforcement Learning

1432

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

1433

iFormer: Integrating ConvNet and Transformer for Mobile Application

1434

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

1435

CodeMonkeys: Scaling Test-Time Compute for Software Engineering

1436

Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

1437

Humanity's Last Exam

1438

Chain-of-Retrieval Augmented Generation

1439

Redundancy Principles for MLLMs Benchmarks

1440

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

1441

RL + Transformer = A General-Purpose Problem Solver

1442

Relightable Full-Body Gaussian Codec Avatars

1443

Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

1444

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

1445

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

1446

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

1447

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

1448

Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

1449

Improving Video Generation with Human Feedback

1450

Temporal Preference Optimization for Long-Form Video Understanding

1451

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

1452

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

1453

DiffuEraser: A Diffusion Model for Video Inpainting

1454

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

1455

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

1456

One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

1457

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

1458

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

1459

FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

1460

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

1461

Kimi k1.5: Scaling Reinforcement Learning with LLMs

1462

Autonomy-of-Experts Models

1463

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

1464

Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

1465

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

1466

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

1467

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

1468

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

1469

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

1470

TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

1471

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

1472

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

1473

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

1474

Reasoning Language Models: A Blueprint

1475

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

1476

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

1477

GameFactory: Creating New Games with Generative Interactive Videos

1478

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

1479

SEAL: Entangled White-box Watermarks on Low-Rank Adaptation

1480

The Lessons of Developing Process Reward Models in Mathematical Reasoning

1481

Tensor Product Attention Is All You Need

1482

$\text{Transformer}^2$: Self-adaptive LLMs

1483

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

1484

VideoAuteur: Towards Long Narrative Video Generation

1485

O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

1486

WebWalker: Benchmarking LLMs in Web Traversal

1487

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

1488

UnCommon Objects in 3D

1489

VideoRAG: Retrieval-Augmented Generation over Video Corpus

1490

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

1491

Enabling Scalable Oversight via Self-Evolving Critic

1492

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

1493

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

1494

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

1495

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

1496

The GAN is dead; long live the GAN! A Modern GAN Baseline

1497

An Empirical Study of Autoregressive Pre-training from Videos

1498

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

1499

Entropy-Guided Attention for Private LLMs

1500

On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

1501

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

1502

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

1503

Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

1504

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

1505

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

1506

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

1507

Agent Laboratory: Using LLM Agents as Research Assistants

1508

LLM4SR: A Survey on Large Language Models for Scientific Research

1509

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

1510

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

1511

GeAR: Generation Augmented Retrieval

1512

Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

1513

DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

1514

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

1515

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

1516

Cosmos World Foundation Model Platform for Physical AI

1517

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

1518

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

1519

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

1520

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

1521

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

1522

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

1523

MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

1524

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

1525

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

1526

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

1527

Personalized Graph-Based Retrieval for Large Language Models

1528

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

1529

GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

1530

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

1531

TransPixar: Advancing Text-to-Video Generation with Transparency

1532

AutoPresent: Designing Structured Visuals from Scratch

1533

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

1534

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

1535

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

1536

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

1537

SDPO: Segment-Level Direct Preference Optimization for Social Agents

1538

Graph Generative Pre-trained Transformer

1539

LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

1540

BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

1541

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

1542

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

1543

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

1544

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

1545

ProgCo: Program Helps Self-Correction of Large Language Models

1546

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

1547

A3: Android Agent Arena for Mobile GUI Agents

1548

MLLM-as-a-Judge for Image Safety without Human Labeling

1549

Dynamic Scaling of Unit Tests for Code Reward Modeling

1550

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

1551

Xmodel-2 Technical Report

1552

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

1553

HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving

1554

VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

1555

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

1556

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

1557

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

1558

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

1559

Bringing Objects to Life: 4D generation from 3D objects

1560

Efficiently Serving LLM Reasoning Programs with Certaindex

1561

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

1562

Edicho: Consistent Image Editing in the Wild

1563

Facilitating large language model Russian adaptation with Learned Embedding Propagation

1564

Training Software Engineering Agents and Verifiers with SWE-Gym

1565

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

1566

Slow Perception: Let's Perceive Geometric Figures Step-by-step

1567

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

1568

1.58-bit FLUX

1569

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

1570

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

1571

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

1572

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

1573

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

1574

The Superposition of Diffusion Models Using the Itô Density Estimator

1575

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

1576

CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

1577

YuLan-Mini: An Open Data-efficient Language Model

1578

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

1579

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

1580

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

1581

DepthLab: From Partial to Complete

1582

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

1583

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

1584

In Case You Missed It: ARC 'Challenge' Is Not That Challenging

1585

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

1586

SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval

1587

PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

1588

MotiF: Making Text Count in Image Animation with Motion Focal Loss

1589

Bridging the Data Provenance Gap Across Text, Speech and Video

1590

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

1591

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

1592

Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

1593

Diving into Self-Evolving Training for Multimodal Reasoning

1594

Deliberation in Latent Space via Differentiable Cache Augmentation

1595

Large Motion Video Autoencoding with Cross-modal Video VAE

1596

OpenAI o1 System Card

1597

Revisiting In-Context Learning with Long Context Language Models

1598

Outcome-Refining Process Supervision for Code Generation

1599

LearnLM: Improving Gemini for Learning

1600

Parallelized Autoregressive Visual Generation

1601

Offline Reinforcement Learning for LLM Multi-Step Reasoning

1602

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

1603

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

1604

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

1605

Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

1606

Sequence Matters: Harnessing Video Models in 3D Super-Resolution

1607

TRecViT: A Recurrent Video Transformer

1608

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

1609

Multi-LLM Text Summarization

1610

Qwen2.5 Technical Report

1611

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

1612

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

1613

How to Synthesize Text Data without Model Collapse?

1614

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

1615

Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

1616

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

1617

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

1618

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

1619

No More Adam: Learning Rate Scaling at Initialization is All You Need

1620

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

1621

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

1622

AniDoc: Animation Creation Made Easier

1623

FashionComposer: Compositional Fashion Image Generation

1624

GUI Agents: A Survey

1625

Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

1626

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

1627

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

1628

Are Your LLMs Capable of Stable Reasoning?

1629

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

1630

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

1631

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

1632

Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

1633

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

1634

Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

1635

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

1636

SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

1637

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

1638

Byte Latent Transformer: Patches Scale Better Than Tokens

1639

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

1640

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

1641

BrushEdit: All-In-One Image Inpainting and Editing

1642

ColorFlow: Retrieval-Augmented Image Sequence Colorization

1643

Smaller Language Models Are Better Instruction Evolvers

1644

Causal Diffusion Transformers for Generative Modeling

1645

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

1646

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

1647

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

1648

Apollo: An Exploration of Video Understanding in Large Multimodal Models

1649

GenEx: Generating an Explorable World

1650

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

1651

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

1652

Large Action Models: From Inception to Implementation

1653

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

1654

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

1655

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

1656

FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

1657

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

1658

Phi-4 Technical Report

1659

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

1660

Multimodal Latent Language Modeling with Next-Token Diffusion

1661

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

1662

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

1663

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

1664

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

1665

JuStRank: Benchmarking LLM Judges for System Ranking

1666

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

1667

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

1668

POINTS1.5: Building a Vision-Language Model towards Real World Applications

1669

Learning Flow Fields in Attention for Controllable Person Image Generation

1670

StyleMaster: Stylize Your Video with Artistic Generation and Translation

1671

StreamChat: Chatting with Streaming Video

1672

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

1673

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

1674

The BrowserGym Ecosystem for Web Agent Research

1675

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

1676

Hidden in the Noise: Two-Stage Robust Watermarking for Images

1677

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

1678

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

1679

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

1680

Mobile Video Diffusion

1681

Granite Guardian

1682

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

1683

ProcessBench: Identifying Process Errors in Mathematical Reasoning

1684

Training Large Language Models to Reason in a Continuous Latent Space

1685

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

1686

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

1687

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

1688

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

1689

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

1690

Robust Multi-bit Text Watermark with LLM-based Paraphrasers

1691

MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views

1692

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

1693

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

1694

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

1695

APOLLO: SGD-like Memory, AdamW-level Performance

1696

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

1697

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

1698

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

1699

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

1700

CompCap: Improving Multimodal Large Language Models with Composite Captions

1701

VisionZip: Longer is Better but Not Necessary in Vision Language Models

1702

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

1703

NVILA: Efficient Frontier Visual Language Models

1704

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

1705

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

1706

Evaluating Language Models as Synthetic Data Generators

1707

A Noise is Worth Diffusion Guidance

1708

Structured 3D Latents for Scalable and Versatile 3D Generation

1709

Negative Token Merging: Image-based Adversarial Feature Guidance

1710

MV-Adapter: Multi-view Consistent Image Generation Made Easy

1711

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

1712

Star Attention: Efficient LLM Inference over Long Sequences

1713

Pathways on the Image Manifold: Image Editing via Video Generation

1714

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

1715

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

1716

SketchAgent: Language-Driven Sequential Sketch Generation

1717

TEXGen: a Generative Diffusion Model for Mesh Textures

1718

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

1719

Learning 3D Representations from Procedural 3D Programs

1720

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

1721

Material Anything: Generating Materials for Any 3D Object via Diffusion

1722

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

1723

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

1724

O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

1725

MH-MoE: Multi-Head Mixture-of-Experts

1726

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

1727

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

1728

Knowledge Transfer Across Modalities with Natural Language Supervision

1729

One Diffusion to Generate Them All

1730

VisualLens: Personalization through Visual History

1731

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

1732

Style-Friendly SNR Sampler for Style-Driven Generation

1733

OminiControl: Minimal and Universal Control for Diffusion Transformer

1734

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

1735

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

1736

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

1737

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

1738

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

1739

MyTimeMachine: Personalized Facial Age Transformation

1740

Novel View Extrapolation with Video Diffusion Priors

1741

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

1742

Multimodal Autoregressive Pre-training of Large Vision Encoders

1743

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

1744

Hymba: A Hybrid-head Architecture for Small Language Models

1745

Natural Language Reinforcement Learning

1746

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

1747

Ultra-Sparse Memory Network

1748

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

1749

Stable Flow: Vital Layers for Training-Free Image Editing

1750

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

1751

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

1752

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

1753

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

1754

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

1755

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

1756

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

1757

Stylecodes: Encoding Stylistic Information For Image Generation

1758

ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

1759

Loss-to-Loss Prediction: Scaling Laws for All Datasets

1760

ORID: Organ-Regional Information Driven Framework for Radiology Report Generation

1761

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

1762

Continuous Speculative Decoding for Autoregressive Image Generation

1763

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

1764

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

1765

Soft Robotic Dynamic In-Hand Pen Spinning

1766

Building Trust: Foundations of Security, Safety and Transparency in AI

1767

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

1768

Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

1769

Generative World Explorer

1770

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

1771

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

1772

AnimateAnything: Consistent and Controllable Animation for Video Generation

1773

Top-$nσ$: Not All Logits Are You Need

1774

Drowning in Documents: Consequences of Scaling Reranker Inference

1775

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

1776

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

1777

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

1778

LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

1779

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

1780

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

1781

Xmodel-1.5: An 1B-scale Multilingual LLM

1782

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

1783

MagicQuill: An Intelligent Interactive Image Editing System

1784

Cut Your Losses in Large-Vocabulary Language Models

1785

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

1786

Sharingan: Extract User Action Sequence from Desktop Recordings

1787

Hermes: A Large Language Model Framework on the Journey to Autonomous Networks

1788

Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples

1789

Direct Preference Optimization Using Sparse Feature-Level Constraints

1790

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

1791

Can sparse autoencoders be used to decompose and interpret steering vectors?

1792

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

1793

SAMPart3D: Segment Any Part in 3D Objects

1794

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

1795

Stronger Models are NOT Stronger Teachers for Instruction Tuning

1796

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

1797

Scaling Properties of Diffusion Models for Perceptual Tasks

1798

Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings

1799

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

1800

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

1801

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

1802

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

1803

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

1804

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

1805

Watermark Anything with Localized Messages

1806

Autoregressive Models in Vision: A Survey

1807

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

1808

Balancing Pipeline Parallelism with Vocabulary Parallelism

1809

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

1810

DELIFT: Data Efficient Language model Instruction Fine Tuning

1811

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

1812

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

1813

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities

1814

Improving the detection of technical debt in Java source code with an enriched dataset

1815

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

1816

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

1817

BitNet a4.8: 4-bit Activations for 1-bit LLMs

1818

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

1819

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

1820

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

1821

Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model

1822

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

1823

DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

1824

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

1825

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

1826

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

1827

Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

1828

Self-Consistency Preference Optimization

1829

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

1830

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

1831

LLaMo: Large Language Model-based Molecular Graph Assistant

1832

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

1833

Controlling Language and Diffusion Models by Transporting Activations

1834

Sample-Efficient Alignment for LLMs

1835

DreamPolish: Domain Score Distillation With Progressive Geometry Generation

1836

Adaptive Length Image Tokenization via Recurrent Allocation

1837

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

1838

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

1839

Inference Optimal VLMs Need Only One Visual Token but Larger Models

1840

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

1841

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

1842

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

1843

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

1844

Training-free Regional Prompting for Diffusion Transformers

1845

How Far is Video Generation from World Model: A Physical Law Perspective

1846

Survey of Cultural Awareness in Language Models: Text and Beyond

1847

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

1848

GenXD: Generating Any 3D and 4D Scenes

1849

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

1850

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

1851

Personalization of Large Language Models: A Survey

1852

Constant Acceleration Flow

1853

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

1854

Randomized Autoregressive Visual Generation

1855

Survey of User Interface Design and Interaction Techniques in Generative AI Applications

1856

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

1857

In-Context LoRA for Diffusion Transformers

1858

Physics in Next-token Prediction

1859

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

1860

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

1861

What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

1862

A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents

1863

Language Models can Self-Lengthen to Generate Long Texts

1864

Constraint Back-translation Improves Complex Instruction Following of Large Language Models

1865

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

1866

SelfCodeAlign: Self-Alignment for Code Generation

1867

Learning Video Representations without Natural Videos

1868

AAAR-1.0: Assessing AI's Potential to Assist Research

1869

BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays