1

Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Mar 28, 2026

21:09

2

Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

Mar 27, 2026

23:56

3

Learning State-Tracking from Code Using Linear RNNs

Mar 26, 2026

20:34

4

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Mar 25, 2026

19:50

5

GLM-5

Mar 24, 2026

24:36

6

Cautious Optimizers

Mar 23, 2026

21:19

7

Backward Gradient Normalization in Deep Neural Networks

Mar 22, 2026

22:10

8

Attention Residuals

Mar 21, 2026

21:00

9

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

Mar 20, 2026

20:28

10

Scaling Laws for Precision

Mar 19, 2026

20:33

11

On the "Induction Bias" in Sequence Models

Mar 13, 2026

17:09

12

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

Mar 12, 2026

16:05

13

Flash Attention 4

Mar 10, 2026

15:52

14

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Feb 28, 2026

16:55

15

Midtraining Bridges Pretraining and Posttraining Distributions

Feb 27, 2026

16:22

16

SiameseNorm

Feb 24, 2026

17:21

17

ÜberWeb

Feb 23, 2026

18:00

18

Why Do Reasoning Models Loop

Feb 18, 2026

17:16

19

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Feb 11, 2026

17:42

20

Teon

Feb 3, 2026

18:52

21

Cautious Weight Decay

Jan 28, 2026

20:59

22

Predictable Scale

Jan 27, 2026

17:42

23

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Jan 27, 2026

17:20

24

Challenges and Research Directions for Large Language Model Inference Hardware

Jan 17, 2026

19:24

25

Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

Jan 16, 2026

20:00

26

The Quantization Model of Neural Scaling

Jan 15, 2026

17:12

27

EAGLE-3

Jan 14, 2026

17:52

28

Engram Paper

Jan 12, 2026

17:35

29

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Jan 9, 2026

19:56

30

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

Jan 8, 2026

19:43

31

NorMuon: Making Muon more efficient and scalable

Jan 7, 2026

19:09

32

Dion: Distributed Orthonormalized Updates

Jan 6, 2026

18:40

33

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v5

Jan 6, 2026

20:02

34

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v4

Jan 6, 2026

19:38

35

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v3

Jan 6, 2026

20:02

36

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v2

Jan 6, 2026

22:09

37

Key and Value Weights Are Probably All You Need

Nov 26, 2025

14:36

38

Latent State Models of Training Dynamics

Oct 28, 2025

12:26

39

DeepSeek OCR

Oct 24, 2025

13:59

40

The Coverage Principle - How Pre-training Enables Post-Training

Oct 23, 2025

15:19

41

The Coverage Principle: How Pre-training Enables Post-Training

Oct 23, 2025

15:19

42

The Art of Scaling Reinforcement Learning Compute for LLMs

Oct 23, 2025

13:51

43

Continual Learning via Sparse Memory Finetuning

Oct 22, 2025

13:33

44

DeepSeek OCR Paper

Oct 22, 2025

13:59

45

Untitled Episode

Oct 10, 2025

11:56

46

Characterization and Mitigation of Training Instabilities in Microscaling Formats

Oct 8, 2025

13:44

47

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

Oct 7, 2025

14:20

48

Drop-Muon: Update Less, Converge Faster

Oct 6, 2025

11:56

49

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Oct 6, 2025

13:09

50

Apertus Tech Report

Sep 21, 2025

13:11

51

Learning Facts at Scale with Active Reading

Sep 20, 2025

14:46

52

Fantastic Pretraining Optimizers and Where to Find Them

Sep 19, 2025

13:51

53

Benchmarking Optimizers for Large Language Model Pretraining

Sep 19, 2025

16:18

54

Learning Facts at Scale with Active Reading.old

Sep 19, 2025

15:42

55

Apertus Tech Report.old

Sep 18, 2025

12:53

56

Benchmarking Optimizers for Large Language Model Pretraining.old

Sep 18, 2025

16:39

57

Fantastic Pretraining Optimizers and Where to Find Them.old

Sep 18, 2025

14:30

58

The Pitfalls of Next-Token Prediction

Sep 11, 2025

10:58

59

Large Language Models and Games

Sep 9, 2025

17:02

60

UQ - Assessing Language Models on Unsolved Questions

Sep 5, 2025

13:35

61

UQ: Assessing Language Models on Unsolved Questions

Sep 5, 2025

13:35

62

Signal and Noise - A Framework for Reducing Uncertainty in Language Model Evaluation

Aug 19, 2025

14:20

63

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

Aug 19, 2025

14:20

64

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

Aug 18, 2025

14:43

65

Thinking Like Transformers

Jul 29, 2025

13:19

66

Kimi K2

Jul 28, 2025

13:29

67

ERNIE Technical Report

Jul 25, 2025

11:17

68

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Jul 24, 2025

14:30

69

Gemini 2.5

Jul 4, 2025

11:12

70

How new data permeates LLM knowledge and how to dilute it

Jun 18, 2025

11:18

71

Harnessing the Universal Geometry of Embeddings

Jun 17, 2025

12:24

72

Model Merging in Pre-training of Large Language Models

Jun 16, 2025

10:57

73

Learning Dynamics in Continual Pre-Training for Large Language Models

Jun 13, 2025

11:10

74

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs

Jun 12, 2025

11:50

75

Scalable-Softmax Is Superior for Attention

Jun 11, 2025

10:20

76

Breast Cancer Recurrence Prediction

Jun 6, 2025

10:20

77

Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research

Jun 4, 2025

10:27

78

Native Sparse Attention

Jun 4, 2025

11:40

79

Critical Batch Size Revisited

Jun 3, 2025

10:57

80

Base of RoPE Bounds Context Length

May 17, 2025

11:03

81

Rope to Nope and Back Again

May 17, 2025

12:11

82

Training Deep Learning Models with Norm-Constrained LMOs

May 15, 2025

11:10

83

SkyLadder

May 9, 2025

12:25

84

LLMs on the Line

May 7, 2025

9:30

85

The Leaderboard Illusion

Apr 30, 2025

10:16

86

Why Linearly Decaying the Learning Rate to Zero Works Best

Apr 16, 2025

9:01

87

Not All Data Are Unlearned Equally

Apr 15, 2025

12:38

88

A Multi-Power Law for Loss Curve Prediction

Apr 14, 2025

12:31

89

Efficient Training of Ultra-Long Context Large Language Models

Apr 11, 2025

10:48

90

Multi-Token Attention

Apr 3, 2025

15:04

91

From Style to Facts

Apr 2, 2025

10:50

92

Compute Optimal Scaling of Skills

Mar 22, 2025

9:10

93

Predictive Data Selection

Mar 15, 2025

8:43

94

Continual Pre-training of MoEs

Mar 12, 2025

10:42

95

s1 - Simple test-time scaling

Mar 6, 2025

10:38

96

Cognitive Behaviors that Enable Self-Improving Reasoners

Mar 5, 2025

7:21

97

Phi 4 Multimodal Instruct

Mar 4, 2025

11:27

98

Claude 3.7 Sonnet System Card

Feb 25, 2025

9:22

99

Project Sid: Many-agent simulations toward AI civilization

Feb 9, 2025

10:16

100

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

Feb 9, 2025

9:07

101

Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Feb 5, 2025

8:18

102

NExtLong - Toward Effective Long-Context Training without Long Documents

Jan 30, 2025

11:38

103

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

Jan 30, 2025

12:37

104

Over-Tokenized Transformer

Jan 30, 2025

10:55

105

HashAttention: Semantic Sparsity for Faster Inference

Jan 16, 2025

11:05

106

From Tokens to Words

Jan 15, 2025

14:05

107

DeepSeek V3

Jan 7, 2025

11:11

108

Optimal Linear Decay Learning Rate Schedules and Further Refinements

Jan 5, 2025

18:28

109

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

Dec 20, 2024

9:34

110

Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale

Dec 20, 2024

11:49

111

Phi-4

Dec 14, 2024

9:05

112

Rephrasing natural text data with different languages and quality levels

Dec 13, 2024

11:02

113

Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs

Dec 12, 2024

11:59

114

EXAONE 3.5

Dec 11, 2024

8:59

115

Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Dec 9, 2024

6:15

116

Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Dec 6, 2024

13:14

117

Nemotron-CC

Dec 5, 2024

12:39

118

Tülu 3

Dec 2, 2024

12:03

119

The Zamba2 Suite

Nov 29, 2024

13:04

120

Small-scale proxies for large-scale Transformer training instabilities

Nov 26, 2024

10:07

121

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Nov 25, 2024

10:29

122

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Nov 19, 2024

8:34

123

Understanding WSD Learning Rates

Nov 18, 2024

9:10

124

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

Nov 16, 2024

7:30

125

Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

Nov 8, 2024

13:36

126

Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?

Nov 7, 2024

6:28

127

How Does Critical Batch Size Scale in Pre-training?

Nov 4, 2024

7:41

128

The Road Less Scheduled

Nov 1, 2024

8:54

129

Learning-Rate-Free Learning by D-Adaptation

Oct 31, 2024

4:37

130

Scaling FP8 Training to Trillion Token LLMs

Oct 30, 2024

9:55

131

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

Oct 29, 2024

15:44

132

A Survey on Model MoErging

Oct 28, 2024

8:58

133

Liquid Time-constant Networks

Oct 27, 2024

8:11

134

Scaling Laws for Predicting Downstream Performance in LLMs

Oct 26, 2024

10:07

135

A Spectral Condition for Feature Learning

Oct 25, 2024

16:53

136

Don't decay the learning rate

Oct 24, 2024

7:07

137

OLMoE

Oct 23, 2024

6:59

138

An Empirical Model of Large Batch Training

Oct 23, 2024

11:32
