All Episodes
Mechanical Dreams — 138 episodes
Lost in Backpropagation: The LM Head is a Gradient Bottleneck
Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs
Learning State-Tracking from Code Using Linear RNNs
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
GLM-5
Cautious Optimizers
Backward Gradient Normalization in Deep Neural Networks
Attention Residuals
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
Scaling Laws for Precision
On the "Induction Bias" in Sequence Models
NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
Flash Attention 4
An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
Midtraining Bridges Pretraining and Posttraining Distributions
SiameseNorm
ÜberWeb
Why Do Reasoning Models Loop
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Teon
Cautious Weight Decay
Predictable Scale
A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs
Challenges and Research Directions for Large Language Model Inference Hardware
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings
The Quantization Model of Neural Scaling
EAGLE-3
Engram Paper
From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
NorMuon: Making Muon more efficient and scalable
Dion: Distributed Orthonormalized Updates
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v5
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v4
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v3
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v2
Key and Value Weights Are Probably All You Need
Latent State Models of Training Dynamics
DeepSeek OCR
The Coverage Principle - How Pre-training Enables Post-Training
The Coverage Principle: How Pre-training Enables Post-Training
The Art of Scaling Reinforcement Learning Compute for LLMs
Continual Learning via Sparse Memory Finetuning
DeepSeek OCR Paper
Untitled Episode
Characterization and Mitigation of Training Instabilities in Microscaling Formats
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Drop-Muon: Update Less, Converge Faster
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Apertus Tech Report
Learning Facts at Scale with Active Reading
Fantastic Pretraining Optimizers and Where to Find Them
Benchmarking Optimizers for Large Language Model Pretraining
Learning Facts at Scale with Active Reading.old
Apertus Tech Report.old
Benchmarking Optimizers for Large Language Model Pretraining.old
Fantastic Pretraining Optimizers and Where to Find Them.old
The Pitfalls of Next-Token Prediction
Large Language Models and Games
UQ - Assessing Language Models on Unsolved Questions
UQ: Assessing Language Models on Unsolved Questions
Signal and Noise - A Framework for Reducing Uncertainty in Language Model Evaluation
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
Thinking Like Transformers
Kimi K2
ERNIE Technical Report
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Gemini 2.5
How new data permeates LLM knowledge and how to dilute it
Harnessing the Universal Geometry of Embeddings
Model Merging in Pre-training of Large Language Models
Learning Dynamics in Continual Pre-Training for Large Language Models
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs
Scalable-Softmax Is Superior for Attention
Breast Cancer Recurrence Prediction
Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research
Native Sparse Attention
Critical Batch Size Revisited
Base of RoPE Bounds Context Length
Rope to Nope and Back Again
Training Deep Learning Models with Norm-Constrained LMOs
SkyLadder
LLMs on the Line
The Leaderboard Illusion
Why Linearly Decaying the Learning Rate to Zero Works Best
Not All Data Are Unlearned Equally
A Multi-Power Law for Loss Curve Prediction
Efficient Training of Ultra-Long Context Large Language Models
Multi-Token Attention
From Style to Facts
Compute Optimal Scaling of Skills
Predictive Data Selection
Continual Pre-training of MoEs
s1 - Simple test-time scaling
Cognitive Behaviors that Enable Self-Improving Reasoners
Phi 4 Multimodal Instruct
Claude 3.7 Sonnet System Card
Project Sid: Many-agent simulations toward AI civilization
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
NExtLong - Toward Effective Long-Context Training without Long Documents
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
Over-Tokenized Transformer
HashAttention: Semantic Sparsity for Faster Inference
From Tokens to Words
DeepSeek V3
Optimal Linear Decay Learning Rate Schedules and Further Refinements
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
Phi-4
Rephrasing natural text data with different languages and quality levels
Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
EXAONE 3.5
Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Nemotron-CC
Tülu 3
The Zamba2 Suite
Small-scale proxies for large-scale Transformer training instabilities
Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Understanding WSD Learning Rates
Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?
How Does Critical Batch Size Scale in Pre-training?
The Road Less Scheduled
Learning-Rate-Free Learning by D-Adaptation
Scaling FP8 Training to Trillion Token LLMs
Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
A Survey on Model MoErging
Liquid Time-constant Networks
Scaling Laws for Predicting Downstream Performance in LLMs
A Spectral Condition for Feature Learning
Don't decay the learning rate
OLMoE
An Empirical Model of Large Batch Training