All Episodes
Daily Paper Cast — 1869 episodes
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
$δ$-mem: Efficient Online Memory for Large Language Models
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
World Action Models: The Next Frontier in Embodied AI
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Efficient Pre-Training with Token Superposition
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
Qwen-Image-2.0 Technical Report
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
Model Merging Scaling Laws in Large Language Models
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
Flow-OPD: On-Policy Distillation for Flow Matching Models
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Anisotropic Modality Align
Beyond Retrieval: A Multitask Benchmark and Model for Code Search
MiA-Signature: Approximating Global Activation for Long-Context Understanding
When to Trust Imagination: Adaptive Action Execution for World Action Models
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-T1: Test-Time Scaling for Streaming Video Generation
RLDX-1 Technical Report
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
MolmoAct2: Action Reasoning Models for Real-world Deployment
From Context to Skills: Can Language Models Learn from Context Skillfully?
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Heterogeneous Scientific Foundation Model Collaboration
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Co-Evolving Policy Distillation
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
Efficient Training on Multiple Consumer GPUs with RoundPipe
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
Large Language Models Explore by Latent Distilling
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
ClawGym: A Scalable Framework for Building Effective Claw Agents
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Recursive Multi-Agent Systems
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
Video Analysis and Generation via a Semantic Progress Function
DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction
LLM Safety From Within: Detecting Harmful Content with Internal Representations
LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Near-Future Policy Optimization
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
AgentSPEX: An Agent SPecification and EXecution Language
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
TEMPO: Scaling Test-time Training for Large Reasoning Models
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
OpenGame: Open Agentic Coding for Games
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
EasyVideoR1: Easier RL for Video Understanding
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
PersonaVLM: Long-Term Personalized Multimodal LLMs
Qwen3.5-Omni Technical Report
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation
Seedance 2.0: Advancing Video Generation for World Complexity
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
Exploration and Exploitation Errors Are Measurable for Language Model Agents
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Toward Autonomous Long-Horizon Engineering for ML Research
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Strips as Tokens: Artist Mesh Generation with Native UV Segmentation
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
CocoaBench: Evaluating Unified Digital Agents in the Wild
CodeTracer: Towards Traceable Agent States
WildDet3D: Scaling Promptable 3D Detection in the Wild
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
EXAONE 4.5 Technical Report
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
RAGEN-2: Reasoning Collapse in Agentic RL
MARS: Enabling Autoregressive Models Multi-Token Generation
Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Learning to Retrieve from Agent Trajectories
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning
ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Watch Before You Answer: Learning from Visually Grounded Post-Training
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Adam's Law: Textual Frequency Law on Large Language Models
AURA: Always-On Understanding and Real-Time Assistance via Video Streams
ClawArena: Benchmarking AI Agents in Evolving Information Environments
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
LightThinker++: From Reasoning Compression to Memory Management
Self-Distilled RLVR
A Simple Baseline for Streaming Video Understanding
Token Warping Helps MLLMs Look from Nearby Viewpoints
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Generative World Renderer
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Steerable Visual Representations
EgoSim: Egocentric World Simulator for Embodied Interaction Generation
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
Terminal Agents Suffice for Enterprise Automation
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
Reasoning Shift: How Context Silently Shortens LLM Reasoning
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
CARLA-Air: Fly Drones Inside a CARLA World – A Unified Infrastructure for Air-Ground Embodied Intelligence
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells
GEMS: Agent-Native Multimodal Generation with Memory and Skills
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
daVinci-LLM: Towards the Science of Pretraining
TAPS: Task Aware Proposal Distributions for Speculative Sampling
Towards a Medical AI Scientist
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
EpochX: Building the Infrastructure for an Emergent Agent Civilization
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Make Geometry Matter for Spatial Reasoning
PRBench: End-to-end Paper Reproduction in Physics Research
PixelSmile: Toward Fine-Grained Facial Expression Editing
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models
MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Voxtral TTS
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
PEARL: Personalized Streaming Video Understanding Model
DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
RealMaster: Lifting Rendered Scenes into Photorealistic Video
Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Hyperagents
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
FASTER: Rethinking Real-Time Flow VLAs
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Memento-Skills: Let Agents Design Agents
MetaClaw: Just Talk – An Agent That Meta-Learns and Evolves in the Wild
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Alignment Makes Language Models Normative, Not Descriptive
Complementary Reinforcement Learning
When AI Navigates the Fog of War
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
InCoder-32B: Code Foundation Model for Industrial Scenarios
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Demystifying Video Reasoning
WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
Online Experiential Learning for Language Models
FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
AI Can Learn Scientific Taste
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
Grounding World Simulation Models in a Real-World Metropolis
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Attention Residuals
Mixture-of-Depths Attention
Effective Distillation to Hybrid xLSTM Architectures
Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
LMEB: Long-horizon Memory Embedding Benchmark
Can Vision-Language Models Solve the Shell Game?
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
daVinci-Env: Open SWE Environment Synthesis at Scale
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
OpenClaw-RL: Train Any Agent Simply by Talking
Flash-KMeans: Fast and Memory-Efficient Exact K-Means
MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
LLM2Vec-Gen: Generative Embeddings from Large Language Models
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
STEP3-VL-10B Technical Report
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Controlled Self-Evolution for Algorithmic Code Optimization
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
MAXS: Meta-Adaptive Exploration with LLM Agents
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences
Solar Open Technical Report
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale
ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
MemoBrain: Executive Memory as an Agentic Brain for Reasoning
Motion Attribution for Video Generation
3AM: Segment Anything with Geometric Consistency in Videos
BabyVision: Visual Reasoning Beyond Language
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
MMFormalizer: Multimodal Autoformalization in the Wild
CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Token-Level LLM Collaboration via FusionRoute
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Evolving Programmatic Skill Networks
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Benchmark$^2$: Systematic Evaluation of LLM Benchmarks
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
LTX-2: Efficient Joint Audio-Visual Foundation Model
MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
GARDO: Reinforcing Diffusion Models without Reward Hacking
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Deep Delta Learning
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Nested Learning: The Illusion of Deep Learning Architectures
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
mHC: Manifold-Constrained Hyper-Connections
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Yume-1.5: A Text-Controlled Interactive World Generation Model
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
SpotEdit: Selective Region Editing in Diffusion Transformers
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Latent Implicit Visual Reasoning
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
SemanticGen: Video Generation in Semantic Space
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
LongVideoAgent: Multi-Agent Reasoning with Long Videos
SpatialTree: How Spatial Abilities Branch Out in MLLMs
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Region-Constraint In-Context Generation for Instructional Video Editing
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
When Reasoning Meets Its Laws
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Are We on the Right Way to Assessing LLM-as-a-Judge?
Kling-Omni Technical Report
Adaptation of Agentic AI
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Next-Embedding Prediction Makes Strong Vision Learners
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Generative Refocusing: Flexible Defocus Control from a Single Image
DeContext as Defense: Safe Image Editing in Diffusion Transformers
Step-GUI Technical Report
DEER: Draft with Diffusion, Verify with Autoregressive Models
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Puzzle Curriculum GRPO for Vision-Centric Reasoning
MMGR: Multi-Modal Generative Reasoning
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Towards Scalable Pre-training of Visual Tokenizers for Generation
Memory in the Age of AI Agents
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
KlingAvatar 2.0 Technical Report
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
EgoX: Egocentric Video Generation from a Single Exocentric Video
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
OmniPSD: Layered PSD Generation with Diffusion Transformer
Composing Concepts from Images and Videos via Concept-prompt Binding
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Unified Video Editing with Temporal Reasoner
Voxify3D: Pixel Art Meets Volumetric Rendering
Scaling Zero-Shot Reference-to-Video Generation
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
EditThinker: Unlocking Iterative Reasoning for Any Image Editor
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
Qwen3-VL Technical Report
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
PretrainZero: Reinforcement Active Pretraining
ViDiC: Video Difference Captioning
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Guided Self-Evolving LLMs with Minimal Human Supervision
SimScale: Learning to Drive via Real-World Simulation at Scale
InnoGym: Benchmarking the Innovation Potential of AI Agents
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
How Far Are We from Genuinely Useful Deep Research Agents?
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
LFM2 Technical Report
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Vision Bridge Transformer at Scale
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Multimodal Evaluation of Russian-language Architectures
Latent Collaboration in Multi-Agent Systems
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
MedSAM3: Delving into Segment Anything with Medical Concepts
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Soft Adaptive Policy Optimization
General Agentic Memory Via Deep Research
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Computer-Use Agents as Judges for Generative User Interface
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
In-Video Instructions: Visual Signals as Generative Control
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
SAM 3: Segment Anything with Concepts
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
VisPlay: Self-Evolving Vision-Language Models from Images
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
VIDEOP2R: Video Understanding from Perception to Reasoning
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
P1: Mastering Physics Olympiads with Reinforcement Learning
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
DoPE: Denoising Rotary Position Embedding
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery
LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
Virtual Width Networks
One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
DeepEyesV2: Toward Agentic Multimodal Model
Visual Spatial Tuning
VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
V-Thinker: Interactive Thinking with Images
Scaling Agent Learning via Experience Synthesis
Diffusion Language Models are Super Data Learners
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
The Underappreciated Power of Vision Models for Graph Structural Understanding
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
PHUMA: Physically-Grounded Humanoid Locomotion Dataset
UniREditBench: A Unified Reasoning-based Image Editing Benchmark
World Simulation with Video Foundation Models for Physical AI
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
The End of Manual Decoding: Towards Truly End-to-End Language Models
Kimi Linear: An Expressive, Efficient Attention Architecture
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Language Models are Injective and Hence Invertible
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
LightMem: Lightweight and Efficient Memory-Augmented Generation
Efficient Long-context Language Model Training by Core Attention Disaggregation
World-in-World: World Models in a Closed-Loop World
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Chem-R: Learning to Reason as a Chemist
MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
IF-VidCap: Can Video Caption Models Follow Instructions?
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
PICABench: How Far Are We from Physically Realistic Image Editing?
Glyph: Scaling Context Windows via Visual-Text Compression
FineVision: Open Data Is All You Need
TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Latent Diffusion Model without Variational Autoencoder
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
Agentic Entropy-Balanced Policy Optimization
WithAnyone: Towards Controllable and ID Consistent Image Generation
AI for Service: Proactive Assistance with AI Glasses
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
BitNet Distillation
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Scaling Language-Centric Omnimodal Representation Learning
Robot Learning: A Tutorial
Detect Anything via Next Point Prediction
A Survey of Vibe Coding with Large Language Models
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Dr.LLM: Dynamic Layer Routing in LLMs
Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Diffusion Transformers with Representation Autoencoders
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Spotlight on Token Perception for Multimodal Reinforcement Learning
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling
AutoPR: Let's Automate Your Academic Promotion!
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Agent Learning via Early Experience
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
MemMamba: Rethinking Memory Patterns in State Space Model
UniVideo: Unified Understanding, Generation, and Editing for Videos
From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
Vibe Checker: Aligning Code Evaluation with Human Preference
Less is More: Recursive Reasoning with Tiny Networks
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Fast-dLLM v2: Efficient Block-Diffusion LLM
CoDA: Coding LM via Diffusion Adaptation
Drax: Speech Recognition with Discrete Flow Matching
Paper2Video: Automatic Video Generation from Scientific Papers
MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Imperceptible Jailbreaking against Large Language Models
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Optimal Scaling Needs Optimal Norm
Apriel-1.5-15b-Thinker
Large Reasoning Models Learn Better Alignment from Flawed Thinking
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
LongCodeZip: Compress Long Context for Code Language Models
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
ExGRPO: Learning to Reason from Experience
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
Interactive Training: Feedback-Driven Neural Network Optimization
ModernVBERT: Towards Smaller Visual Document Retrievers
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
GEM: A Gym for Agentic LLMs
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
PIPer: On-Device Environment Setup via Online Reinforcement Learning
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
ACON: Optimizing Context Compression for Long-horizon LLM Agents
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
OceanGym: A Benchmark Environment for Underwater Embodied Agents
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Multiplayer Nash Preference Optimization
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Democratizing AI scientists using ToolUniverse
Visual Jigsaw Post-Training Improves MLLMs
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
LongLive: Real-time Interactive Long Video Generation
Quantile Advantage Estimation for Entropy-Safe Reasoning
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
ReviewScore: Misinformed Peer Review Detection with Large Language Models
Variational Reasoning for Language Models
Language Models Can Learn from Verbal Feedback Without Scalar Rewards
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Tree Search for LLM Agent Reinforcement Learning
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
AutoIntent: AutoML for Text Classification
Video models are zero-shot learners and reasoners
SIM-CoT: Supervised Implicit Chain-of-Thought
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
Reinforcement Learning on Pre-Training Data
Do You Need Proprioceptive States in Visuomotor Policies?
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
LIMI: Less is More for Agency
Qwen3-Omni Technical Report
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
FlowRL: Matching Reward Distributions for LLM Reasoning
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
SAIL-VL2 Technical Report
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Scaling Agents via Continual Pre-training
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Towards General Agentic Intelligence via Environment Scaling
WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Single-stream Policy Optimization
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
A Survey of Reinforcement Learning for Large Reasoning Models
RewardDance: Reward Scaling in Visual Generation
3D and 4D World Modeling: A Survey
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Visual Representation Alignment for Multimodal Large Language Models
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Reconstruction Alignment Improves Unified Multimodal Models
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Reverse-Engineered Reasoning for Open-Ended Generation
Does DINOv3 Set a New Medical Vision Standard?
Symbolic Graphics Programming with Large Language Models
Set Block Decoding is a Language Model Inference Accelerator
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
From Editor to Dense Geometry Estimator
Towards a Unified View of Large Language Model Post-Training
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Open Data Synthesis For Deep Research
Robix: A Unified Model for Robot Interaction, Reasoning and Planning
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Baichuan-M2: Scaling Medical Capability with Large Verifier System
Kwai Keye-VL 1.5 Technical Report
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
VibeVoice Technical Report
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Spacer: Towards Engineered Scientific Inspiration
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
MV-RAG: Retrieval Augmented Multiview Diffusion
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
Intern-S1: A Scientific Multimodal Foundation Model
Mobile-Agent-v3: Fundamental Agents for GUI Automation
Deep Think with Confidence
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Prompt Orchestration Markup Language
Ovis2.5 Technical Report
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
4DNeX: Feed-Forward 4D Generative Modeling Made Easy
Next Visual Granularity Generation
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
SSRL: Self-Search Reinforcement Learning
DINOv3
Thyme: Think Beyond Images
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Story2Board: A Training-Free Approach for Expressive Storyboard Generation
Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Complex Logical Instruction Generation
Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
WideSearch: Benchmarking Agentic Broad Info-Seeking
Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
MolmoAct: Action Reasoning Models that can Reason in Space
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
VeriGUI: Verifiable Long-Chain GUI Dataset
Efficient Agents: Building Effective Agents While Reducing Cost
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
Qwen-Image Technical Report
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
CellForge: Agentic Design of Virtual Cell Models
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
PixNerd: Pixel Neural Field Diffusion
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
BANG: Dividing 3D Assets via Generative Exploded Dynamics
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
ChemDFM-R: A Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge
Agentic Reinforced Policy Optimization
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
Reconstructing 4D Spatial Intelligence: A Survey
Deep Researcher with Test-Time Diffusion
$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
Group Sequence Policy Optimization
MUR: Momentum Uncertainty guided Reasoning for Large Language Models
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
Pixels, Patterns, but No Poetry: To See The World like Humans
Yume: An Interactive World Generation Model
DesignLab: Designing Slides Through Iterative Detection and Correction
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
Step-Audio 2 Technical Report
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
The Invisible Leash: Why RLVR May Not Escape Its Origin
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
GR-3 Technical Report
Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
A Survey of Context Engineering for Large Language Models
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Test-Time Scaling with Reflective Generative Model
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
KV Cache Steering for Inducing Reasoning in Small Language Models
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Neural-Driven Image Editing
Scaling RL to Long Videos
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
PyVision: Agentic Vision with Dynamic Tooling
4KAgent: Agentic Any Image to 4K Super-Resolution
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Perception-Aware Policy Optimization for Multimodal Reasoning
MIRIX: Multi-Agent Memory System for LLM-Based Agents
Rethinking Verification for LLM Code Generation: From Generation to Testing
SingLoRA: Low Rank Adaptation Using a Single Matrix
A Survey on Latent Reasoning
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
How to Train Your LLM Web Agent: A Statistical Diagnosis
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
MemOS: A Memory OS for AI System
Should We Still Pretrain Encoders with Masked Language Modeling?
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Pre-Trained Policy Discriminators are General Reward Models
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
WebSailor: Navigating Super-human Reasoning for Web Agent
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Kwai Keye-VL Technical Report
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
Depth Anything at Any Condition
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Ovis-U1 Technical Report
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
VMoBA: Mixture-of-Block Attention for Video Diffusion Models
Calligrapher: Freestyle Text Image Customization
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
Effective Red-Teaming of Policy-Adherent Agents
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Text-Aware Image Restoration with Diffusion Models
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Discrete Audio Tokens: More Than a Survey!
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
PlayerOne: Egocentric World Simulator
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
Reinforcement Pre-Training
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
MiniCPM4: Ultra-Efficient LLMs on End Devices
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Image Reconstruction as a Tool for Feature Analysis
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Video World Models with Long-term Spatial Memory
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
MiMo-VL Technical Report
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
A Controllable Examination for Long-Context Language Models
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Taming LLMs by Scaling Learning Rates with Gradient Grouping
ARIA: Training Language Agents with Intention-Driven Reward Aggregation
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Time Blindness: Why Video-Language Models Can't See What Humans Can?
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Large Language Models for Data Synthesis
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Table-R1: Inference-Time Scaling for Table Reasoning
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Skywork Open Reasoner 1 Technical Report
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
SageAttention2++: A More Efficient Implementation of SageAttention2
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Fostering Video Reasoning via Next-Event Prediction
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
Exploring the Latent Capacity of LLMs for One-Step Text Generation
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Alchemist: Turning Public Text-to-Image Data into Generative Gold
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
PATS: Process-Level Adaptive Thinking Mode Switching
Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance
ARM: Adaptive Reasoning Model
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
B-score: Detecting biases in large language models using response history
TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
Quartet: Native FP4 Training Can Be Optimal for Large Language Models
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Distilling LLM Agent into Small Models with Retrieval and Code Tools
QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Scaling Image and Video Generation via Test-Time Evolutionary Search
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Scaling Diffusion Transformers Efficiently via $μ$P
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
MMaDA: Multimodal Large Diffusion Language Models
Scaling Law for Quantization-Aware Training
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Efficient Agent Training for Computer Use
This Time is Different: An Observability Perspective on Time Series Foundation Models
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
Emerging Properties in Unified Multimodal Pretraining
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
Visual Agentic Reinforcement Fine-Tuning
Neurosymbolic Diffusion Models
Chain-of-Model Learning for Language Model
AdaptThink: Reasoning Models Can Learn When to Think
AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Faster Video Diffusion with Trainable Sparse Attention
Thinkless: LLM Learns When to Think
Model Merging in Pre-training of Large Language Models
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Qwen3 Technical Report
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Visual Planning: Let's Think Only with Images
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
System Prompt Optimization with Meta-Learning
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Seed1.5-VL Technical Report
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
Learning from Peers in Reasoning Models
Unified Continuous Generative Models
REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
Bielik v3 Small: Technical Report
Bielik 11B v2 Technical Report
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
On Path to Multimodal Generalist: General-Level and General-Bench
Flow-GRPO: Training Flow Matching Models via Online RL
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
RM-R1: Reward Modeling as Reasoning
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
Practical Efficiency of Muon for Pretraining
PixelHacker: Image Inpainting with Structural and Semantic Consistency
A Survey of Interactive Generative Video
DeepCritic: Deliberate Critique with Large Language Models
Sadeed: Advancing Arabic Diacritization Through Small Language Model
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
ReasonIR: Training Retrievers for Reasoning Tasks
The Leaderboard Illusion
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
RepText: Rendering Visual Text via Replicating
Towards Understanding Camera Motions in Any Video
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
Step1X-Edit: A Practical Framework for General Image Editing
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Trillion 7B Technical Report
Tina: Tiny Reasoning Models via LoRA
I-Con: A Unifying Framework for Representation Learning
Kuwain 1.5B: An Arabic SLM via Language Injection
TTRL: Test-Time Reinforcement Learning
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Describe Anything: Detailed Localized Image and Video Captioning
Learning Adaptive Parallel Reasoning with Language Models
Learning to Reason under Off-Policy Guidance
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
FlowReasoner: Reinforcing Query-Level Meta-Agents
ToolRL: Reward is All Tool Learning Needs
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Antidistillation Sampling
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
WORLDMEM: Long-term Consistent World Simulation with Memory
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
BitNet b1.58 2B4T Technical Report
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
Heimdall: test-time scaling on the generative verification
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
TextArena
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Iterative Self-Training for Code Generation via Reinforced Re-Ranking
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Kimi-VL Technical Report
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
MM-IFEngine: Towards Multimodal Instruction Following
HoloPart: Generative 3D Part Amodal Segmentation
DDT: Decoupled Diffusion Transformer
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
A Unified Agentic Framework for Evaluating Conditional Image Generation
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
An Empirical Study of GPT-4o Image Generation Capabilities
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
SmolVLM: Redefining small and efficient multimodal models
One-Minute Video Generation with Test-Time Training
Rethinking Reflection in Pre-Training
URECA: Unique Region Caption Anything
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
WikiVideo: Article Generation from Multiple Videos
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Understanding R1-Zero-Like Training: A Critical Perspective
Towards Physically Plausible Video Generation via VLM Planning
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
START: Self-taught Reasoner with Tools
Token-Efficient Long Video Understanding for Multimodal LLMs
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
EgoLife: Towards Egocentric Life Assistant
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Process-based Self-Rewarding Language Models
Visual-RFT: Visual Reinforcement Fine-Tuning
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Chain of Draft: Thinking Faster by Writing Less
Multi-Turn Code Generation Through Single-Step Rewards
Self-rewarding correction for mathematical reasoning
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
LongRoPE2: Near-Lossless LLM Context Window Scaling
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
UniTok: A Unified Tokenizer for Visual Generation and Understanding
NeoBERT: A Next-Generation BERT
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
GHOST 2.0: generative high-fidelity one shot transfer of heads
Kanana: Compute-efficient Bilingual Language Models
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
Language Models' Factuality Depends on the Language of Inquiry
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Towards an AI co-scientist
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Rank1: Test-Time Compute for Reranking in Information Retrieval
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
S*: Test Time Scaling for Code Generation
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Qwen2.5-VL Technical Report
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
MoM: Linear Sequence Modeling with Mixture-of-Memories
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Craw4LLM: Efficient Web Crawling for LLM Pretraining
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Small Models Struggle to Learn from Strong Reasoners
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
Soundwave: Less is More for Speech-Text Alignment in LLMs
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
Continuous Diffusion Model for Language Modeling
Phantom: Subject-consistent video generation via cross-modal alignment
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Magma: A Foundation Model for Multimodal AI Agents
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
You Do Not Fully Utilize Transformer's Representation Capacity
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Learning Getting-Up Policies for Real-World Humanoid Robots
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
CRANE: Reasoning with constrained LLM generation
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Region-Adaptive Sampling for Diffusion Transformers
Large Language Diffusion Models
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Diverse Inference and Verification for Advanced Reasoning
Precise Parameter Localization for Textual Generation in Diffusion Models
DarwinLM: Evolutionary Structured Pruning of Large Language Models
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Exploring the Potential of Encoder-free Architectures in 3D LMMs
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Distillation Scaling Laws
TransMLA: Multi-Head Latent Attention Is All You Need
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
Expect the Unexpected: FailSafe Long Context QA for Finance
Competitive Programming with Large Reasoning Models
Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Magic 1-For-1: Generating One Minute Video Clips within One Minute
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
Teaching Language Models to Critique via Reinforcement Learning
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Enhance-A-Video: Better Generated Video for Free
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
LM2: Large Memory Models
Matryoshka Quantization
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Fast Video Generation with Sliding Tile Attention
Goku: Flow Based Video Generative Foundation Models
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Agency Is Frame-Dependent
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Generating Symbolic World Models via Test-time Scaling of Large Language Models
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
UltraIF: Advancing Instruction Following from the Wild
Great Models Think Alike and this Undermines AI Oversight
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
Demystifying Long Chain-of-Thought Reasoning in LLMs
LIMO: Less is More for Reasoning
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
On Teacher Hacking in Language Model Distillation
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
Jailbreaking with Universal Multi-Prompts
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Inverse Bridge Matching Distillation
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
The Differences Between Direct Alignment Algorithms are a Blur
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Process Reinforcement through Implicit Rewards
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Preference Leakage: A Contamination Problem in LLM-as-a-judge
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
AIN: The Arabic INclusive Large Multimodal Model
s1: Simple test-time scaling
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
PixelWorld: Towards Perceiving Everything as Pixels
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Scalable-Softmax Is Superior for Attention
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
GuardReasoner: Towards Reasoning-based LLM Safeguards
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Large Language Models Think Too Fast To Explore Effectively
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
o3-mini vs DeepSeek-R1: Which One is Safer?
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Atla Selene Mini: A General Purpose Evaluation Model
Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Optimizing Large Language Model Training Using FP4 Quantization
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Open Problems in Mechanistic Interpretability
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Histoires Morales: A French Dataset for Assessing Moral Alignment
Qwen2.5-1M Technical Report
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Towards General-Purpose Model-Free Reinforcement Learning
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
iFormer: Integrating ConvNet and Transformer for Mobile Application
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Humanity's Last Exam
Chain-of-Retrieval Augmented Generation
Redundancy Principles for MLLMs Benchmarks
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
RL + Transformer = A General-Purpose Problem Solver
Relightable Full-Body Gaussian Codec Avatars
Question Answering on Patient Medical Records with Private Fine-Tuned LLMs
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
Improving Video Generation with Human Feedback
Temporal Preference Optimization for Long-Form Video Understanding
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
DiffuEraser: A Diffusion Model for Video Inpainting
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Autonomy-of-Experts Models
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Reasoning Language Models: A Blueprint
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
GameFactory: Creating New Games with Generative Interactive Videos
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
SEAL: Entangled White-box Watermarks on Low-Rank Adaptation
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Tensor Product Attention Is All You Need
$\text{Transformer}^2$: Self-adaptive LLMs
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
VideoAuteur: Towards Long Narrative Video Generation
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
WebWalker: Benchmarking LLMs in Web Traversal
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
UnCommon Objects in 3D
VideoRAG: Retrieval-Augmented Generation over Video Corpus
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Enabling Scalable Oversight via Self-Evolving Critic
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
The GAN is dead; long live the GAN! A Modern GAN Baseline
An Empirical Study of Autoregressive Pre-training from Videos
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Entropy-Guided Attention for Private LLMs
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Agent Laboratory: Using LLM Agents as Research Assistants
LLM4SR: A Survey on Large Language Models for Scientific Research
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
GeAR: Generation Augmented Retrieval
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Cosmos World Foundation Model Platform for Physical AI
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Personalized Graph-Based Retrieval for Large Language Models
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
TransPixar: Advancing Text-to-Video Generation with Transparency
AutoPresent: Designing Structured Visuals from Scratch
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Graph Generative Pre-trained Transformer
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
ProgCo: Program Helps Self-Correction of Large Language Models
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
A3: Android Agent Arena for Mobile GUI Agents
MLLM-as-a-Judge for Image Safety without Human Labeling
Dynamic Scaling of Unit Tests for Code Reward Modeling
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Xmodel-2 Technical Report
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
Bringing Objects to Life: 4D generation from 3D objects
Efficiently Serving LLM Reasoning Programs with Certaindex
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Edicho: Consistent Image Editing in the Wild
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Training Software Engineering Agents and Verifiers with SWE-Gym
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Slow Perception: Let's Perceive Geometric Figures Step-by-step
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
1.58-bit FLUX
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
The Superposition of Diffusion Models Using the Itô Density Estimator
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
YuLan-Mini: An Open Data-efficient Language Model
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
DepthLab: From Partial to Complete
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
In Case You Missed It: ARC 'Challenge' Is Not That Challenging
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
MotiF: Making Text Count in Image Animation with Motion Focal Loss
Bridging the Data Provenance Gap Across Text, Speech and Video
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
Diving into Self-Evolving Training for Multimodal Reasoning
Deliberation in Latent Space via Differentiable Cache Augmentation
Large Motion Video Autoencoding with Cross-modal Video VAE
OpenAI o1 System Card
Revisiting In-Context Learning with Long Context Language Models
Outcome-Refining Process Supervision for Code Generation
LearnLM: Improving Gemini for Learning
Parallelized Autoregressive Visual Generation
Offline Reinforcement Learning for LLM Multi-Step Reasoning
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
TRecViT: A Recurrent Video Transformer
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Multi-LLM Text Summarization
Qwen2.5 Technical Report
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
How to Synthesize Text Data without Model Collapse?
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
No More Adam: Learning Rate Scaling at Initialization is All You Need
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
AniDoc: Animation Creation Made Easier
FashionComposer: Compositional Fashion Image Generation
GUI Agents: A Survey
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Are Your LLMs Capable of Stable Reasoning?
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Byte Latent Transformer: Patches Scale Better Than Tokens
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
BrushEdit: All-In-One Image Inpainting and Editing
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Smaller Language Models Are Better Instruction Evolvers
Causal Diffusion Transformers for Generative Modeling
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Apollo: An Exploration of Video Understanding in Large Multimodal Models
GenEx: Generating an Explorable World
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Large Action Models: From Inception to Implementation
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Phi-4 Technical Report
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Multimodal Latent Language Modeling with Next-Token Diffusion
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
JuStRank: Benchmarking LLM Judges for System Ranking
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Learning Flow Fields in Attention for Controllable Person Image Generation
StyleMaster: Stylize Your Video with Artistic Generation and Translation
StreamChat: Chatting with Streaming Video
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
The BrowserGym Ecosystem for Web Agent Research
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Hidden in the Noise: Two-Stage Robust Watermarking for Images
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
Mobile Video Diffusion
Granite Guardian
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Training Large Language Models to Reason in a Continuous Latent Space
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Robust Multi-bit Text Watermark with LLM-based Paraphrasers
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
APOLLO: SGD-like Memory, AdamW-level Performance
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
CompCap: Improving Multimodal Large Language Models with Composite Captions
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
NVILA: Efficient Frontier Visual Language Models
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Evaluating Language Models as Synthetic Data Generators
A Noise is Worth Diffusion Guidance
Structured 3D Latents for Scalable and Versatile 3D Generation
Negative Token Merging: Image-based Adversarial Feature Guidance
MV-Adapter: Multi-view Consistent Image Generation Made Easy
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Star Attention: Efficient LLM Inference over Long Sequences
Pathways on the Image Manifold: Image Editing via Video Generation
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
SketchAgent: Language-Driven Sequential Sketch Generation
TEXGen: a Generative Diffusion Model for Mesh Textures
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Learning 3D Representations from Procedural 3D Programs
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Material Anything: Generating Materials for Any 3D Object via Diffusion
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
MH-MoE: Multi-Head Mixture-of-Experts
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Knowledge Transfer Across Modalities with Natural Language Supervision
One Diffusion to Generate Them All
VisualLens: Personalization through Visual History
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
Style-Friendly SNR Sampler for Style-Driven Generation
OminiControl: Minimal and Universal Control for Diffusion Transformer
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
MyTimeMachine: Personalized Facial Age Transformation
Novel View Extrapolation with Video Diffusion Priors
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Multimodal Autoregressive Pre-training of Large Vision Encoders
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Hymba: A Hybrid-head Architecture for Small Language Models
Natural Language Reinforcement Learning
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Ultra-Sparse Memory Network
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Stable Flow: Vital Layers for Training-Free Image Editing
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Stylecodes: Encoding Stylistic Information For Image Generation
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Loss-to-Loss Prediction: Scaling Laws for All Datasets
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Continuous Speculative Decoding for Autoregressive Image Generation
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Soft Robotic Dynamic In-Hand Pen Spinning
Building Trust: Foundations of Security, Safety and Transparency in AI
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
Generative World Explorer
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
AnimateAnything: Consistent and Controllable Animation for Video Generation
Top-$nσ$: Not All Logits Are You Need
Drowning in Documents: Consequences of Scaling Reranker Inference
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
Xmodel-1.5: An 1B-scale Multilingual LLM
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
MagicQuill: An Intelligent Interactive Image Editing System
Cut Your Losses in Large-Vocabulary Language Models
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Sharingan: Extract User Action Sequence from Desktop Recordings
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples
Direct Preference Optimization Using Sparse Feature-Level Constraints
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Can sparse autoencoders be used to decompose and interpret steering vectors?
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
SAMPart3D: Segment Any Part in 3D Objects
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Stronger Models are NOT Stronger Teachers for Instruction Tuning
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Scaling Properties of Diffusion Models for Perceptual Tasks
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
Watermark Anything with Localized Messages
Autoregressive Models in Vision: A Survey
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Balancing Pipeline Parallelism with Vocabulary Parallelism
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
DELIFT: Data Efficient Language model Instruction Fine Tuning
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Improving the detection of technical debt in Java source code with an enriched dataset
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
BitNet a4.8: 4-bit Activations for 1-bit LLMs
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Self-Consistency Preference Optimization
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
LLaMo: Large Language Model-based Molecular Graph Assistant
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Controlling Language and Diffusion Models by Transporting Activations
Sample-Efficient Alignment for LLMs
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
Adaptive Length Image Tokenization via Recurrent Allocation
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge
Inference Optimal VLMs Need Only One Visual Token but Larger Models
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Training-free Regional Prompting for Diffusion Transformers
How Far is Video Generation from World Model: A Physical Law Perspective
Survey of Cultural Awareness in Language Models: Text and Beyond
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
GenXD: Generating Any 3D and 4D Scenes
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Personalization of Large Language Models: A Survey
Constant Acceleration Flow
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Randomized Autoregressive Visual Generation
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
In-Context LoRA for Diffusion Transformers
Physics in Next-token Prediction
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
Language Models can Self-Lengthen to Generate Long Texts
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
SelfCodeAlign: Self-Alignment for Code Generation
Learning Video Representations without Natural Videos
AAAR-1.0: Assessing AI's Potential to Assist Research
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays