All Episodes

Mechanical Dreams — 138 episodes

#
Title
1

Lost in Backpropagation- The LM Head is a Gradient Bottleneck

2

Let's (not) just put things in Context- Test-Time Training for Long-Context LLMs

3

Learning State-Tracking from Code Using Linear RNNs

4

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

5

GLM-5

6

Cautious Optimizers

7

Backward Gradient Normalization in Deep Neural Networks

8

Attention Residuals

9

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

10

Scaling Laws for Precision

11

On the "Induction Bias" in Sequence Models

12

NOBLE- Accelerating Transformers with Nonlinear Low-Rank Branches

13

Flash Attention 4

14

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

15

Midtraining Bridges Pretraining and Posttraining Distributions

16

SiameseNorm

17

ÜberWeb

18

Why Do Reasoning Models Loop

19

OPUS- Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

20

Teon

21

Cautious Weight Decay

22

Predictable Scale

23

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

24

Challenges and Research Directions for Large Language Model Inference Hardware

25

Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

26

The Quantization Model of Neural Scaling

27

EAGLE-3

28

Engram Paper

29

From Entropy to Epiplexity- Rethinking Information for Computationally Bounded Intelligence

30

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

31

NorMuon- Making Muon more efficient and scalable

32

Dion- Distributed Orthonormalized Updates

33

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v5

34

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v4

35

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v3

36

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v2

37

Key and Value Weights Are Probably All You Need

38

Latent State Models of Training Dynamics

39

DeepSeek OCR

40

The Coverage Principle - How Pre-training Enables Post-Training

41

The Coverage Principle- How Pre-training Enables Post-Training

42

The Art of Scaling Reinforcement Learning Compute for LLMs

43

Continual Learning via Sparse Memory Finetuning

44

DeepSeek OCR Paper

45

Untitled Episode

46

Characterization and Mitigation of Training Instabilities in Microscaling Formats

47

Demystifying Synthetic Data in LLM Pre-training- A Systematic Study of Scaling Laws, Benefits, and Pitfalls

48

Drop-Muon- Update Less, Converge Faster

49

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

50

Apertus Tech Report

51

Learning Facts at Scale with Active Reading

52

Fantastic Pretraining Optimizers and Where to Find Them

53

Benchmarking Optimizers for Large Language Model Pretraining

54

Learning Facts at Scale with Active Reading.old

55

Apertus Tech Report.old

56

Benchmarking Optimizers for Large Language Model Pretraining.old

57

Fantastic Pretraining Optimizers and Where to Find Them.old

58

The Pitfalls of Next-Token Prediction

59

Large Language Models and Games

60

UQ - Assessing Language Models on Unsolved Questions

61

UQ- Assessing Language Models on Unsolved Questions

62

Signal and Noise - A Framework for Reducing Uncertainty in Language Model Evaluation

63

Signal and Noise- A Framework for Reducing Uncertainty in Language Model Evaluation

64

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

65

Thinking Like Transformers

66

Kimi K2

67

ERNIE Technical Report

68

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

69

Gemini 2.5

70

How new data permeates LLM knowledge and how to dilute it

71

Harnessing the Universal Geometry of Embeddings

72

Model Merging in Pre-training of Large Language Models

73

Learning Dynamics in Continual Pre-Training for Large Language Models

74

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs

75

Scalable-Softmax Is Superior for Attention

76

Breast Cancer Recurrence Prediction

77

Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research

78

Native Sparse Attention

79

Critical Batch Size Revisited

80

Base of RoPE Bounds Context Length

81

Rope to Nope and Back Again

82

Training Deep Learning Models with Norm-Constrained LMOs

83

SkyLadder

84

LLMs on the Line

85

The Leaderboard Illusion

86

Why Linearly Decaying the Learning Rate to Zero Works Best

87

Not All Data Are Unlearned Equally

88

A Multi-Power Law for Loss Curve Prediction

89

Efficient Training of Ultra-Long Context Large Language Models

90

Multi-Token Attention

91

From Style to Facts

92

Compute Optimal Scaling of Skills

93

Predictive Data Selection

94

Continual Pre-training of MoEs

95

s1 - Simple test-time scaling

96

Cognitive Behaviors that Enable Self-Improving Reasoners

97

Phi 4 Multimodal Instruct

98

Claude 3.7 Sonnet System Card

99

Project Sid: Many-agent simulations toward AI civilization

100

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

101

Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

102

NExtLong - Toward Effective Long-Context Training without Long Documents

103

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

104

Over-Tokenized Transformer

105

HashAttention: Semantic Sparsity for Faster Inference

106

From Tokens to Words

107

DeepSeek V3

108

Optimal Linear Decay Learning Rate Schedules and Further Refinements

109

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

110

Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale

111

Phi-4

112

Rephrasing natural text data with different languages and quality levels

113

Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs

114

EXAONE 3.5

115

Model soups - averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

116

Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

117

Nemotron-CC

118

Tülu 3

119

The Zamba2 Suite

120

Small-scale proxies for large-scale Transformer training instabilities

121

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

122

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

123

Understanding WSD Learning Rates

124

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

125

Amuro & Char - Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

126

Evaluation data contamination in LLMs: How do we measure it and (when) does it matter?

127

How Does Critical Batch Size Scale in Pre-training?

128

The Road Less Scheduled

129

Learning-Rate-Free Learning by D-Adaptation

130

Scaling FP8 Training to Trillion Token LLMs

131

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

132

A Survey on Model MoErging

133

Liquid Time-constant Networks

134

Scaling Laws for Predicting Downstream Performance in LLMs

135

A Spectral Condition for Feature Learning

136

Don't decay the learning rate

137

OLMoE

138

An Empirical Model of Large Batch Training