Mechanical Dreams Podcast - All Episodes

138

Lost in Backpropagation- The LM Head is a Gradient Bottleneck

In this episode:• Chapter 1: Introduction to the Bottleneck: Linda introduces the paper and the general concept of the LM head. Professor Norris expresses initial skepticism about revisiting the softmax bottleneck.• Chapter 2: Expressivity vs. Optimization: The hosts discuss how the paper shifts the focus from the classical expressivity limitation to a fundamental optimization problem.• Chapter 3: The Math of Gradient Destruction: Linda breaks down the matrix math, explaining how backpropagating a V-dimensional gradient through a rank-D layer destroys up to 99 percent of the gradient norm.• Chapter 4: SpamLang and Real-World Evidence: The discussion moves to the SpamLang synthetic task and 2B parameter pretraining experiments, proving that the gradient bottleneck severely limits training speed and capacity.• Chapter 5: Implications for Scaling Laws: Norris and Linda wrap up by discussing what this means for the future of LLM pretraining and potential architectural fixes.
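
The Chapter 3 claim can be illustrated with a small linear-algebra sketch. This is not the paper's analysis, only an illustration under stated assumptions: logits are `W h` with `W` of shape V x D, so the hidden state receives `W^T g`, and any part of the V-dimensional logit gradient `g` outside the column space of `W` is discarded. The function names and the idea of measuring the "retained norm fraction" by projection are assumptions made for this example.

```python
import math


def gram_schmidt(columns):
    """Return an orthonormal basis for the span of the given column vectors."""
    basis = []
    for col in columns:
        v = list(col)
        # Remove the components already covered by earlier basis vectors.
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # skip columns that are (numerically) dependent
            basis.append([vi / norm for vi in v])
    return basis


def retained_norm_fraction(grad, weight_columns):
    """Fraction of ||grad|| that lies in the span of the D weight columns.

    grad: V-dimensional gradient with respect to the logits.
    weight_columns: D lists of length V (the columns of a hypothetical LM head).
    """
    basis = gram_schmidt(weight_columns)
    projected_sq = sum(
        sum(g * b for g, b in zip(grad, vec)) ** 2 for vec in basis
    )
    total_sq = sum(g * g for g in grad)
    return math.sqrt(projected_sq / total_sq)


# Toy case: V = 4, D = 1. Only the first coordinate of the gradient survives.
toy_fraction = retained_norm_fraction([1.0, 1.0, 1.0, 1.0], [[1.0, 0.0, 0.0, 0.0]])
print(toy_fraction)
```

For a random gradient and random columns, the retained fraction is roughly sqrt(D / V), which is how a small hidden size paired with a large vocabulary can leave only a small share of the gradient norm.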

Mar 28, 2026

21m

137

Let's (not) just put things in Context- Test-Time Training for Long-Context LLMs

In this episode:• The Context Window Illusion: Norris and Linda introduce the episode and the paper, discussing why million-token context windows don't automatically solve reasoning tasks.• The Math of Score Dilution: Linda dives into the theoretical bottleneck of static self-attention, explaining why the target-distractor margin must scale logarithmically.• Query-Only Test-Time Training: Linda reveals the paper's solution: updating only the query projection matrices at inference time to avoid invalidating the KV cache.• Compute Equivalency: qTTT vs Thinking Tokens: Norris challenges the computational cost, leading to a discussion on how qTTT strictly matches the FLOPs of chain-of-thought decoding.• Results and Takeaways: The hosts discuss the empirical results on LongBench-v2 and ZeroScrolls, concluding with the implications for inference-time compute scaling.

Mar 27, 2026

23m

136

Learning State-Tracking from Code Using Linear RNNs

In this episode:• Introduction to State-Tracking: Linda and Professor Norris introduce the paper and discuss the historical context of state-tracking in sequence models.• The Next-Token Prediction Testbed: The hosts discuss how the authors used Python REPL traces with print statements to evaluate models using next-token prediction instead of sequence-to-sequence.• DeltaNet Triumphs Over Transformers: Linda explains how DeltaNet with extended eigenvalues perfectly extrapolated the tracking task, while Transformers failed even with dense supervision.• The Catch: Partial Observability: Professor Norris questions the limits, leading Linda to introduce Probabilistic Finite-State Automata with State Reveals (PFSA-SR) and unobservable branching.• The Math of Norm Decay: A deep dive into why linear RNNs suffer exponential norm decay without non-linear renormalization, finalizing the episode's takeaways.

Mar 26, 2026

20m

135

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

In this episode:• Dessert Before Vegetables?: Professor Norris and Linda introduce the concept of Curriculum Learning in LLMs and discuss why the intuitive idea of saving the best data for last has historically failed to produce significant results.• The Invisible Antagonist: Learning Rate Decay: Linda reveals the paper's core insight: standard learning rate schedules decay to near-zero just as the high-quality data arrives, effectively wasting the most valuable training tokens.• Signal, Noise, and the River Valley: The hosts discuss the theoretical mechanism, using a 'river valley' analogy to explain how high-quality data provides a strong signal direction that is dampened by aggressive optimization schedules.• The Solution: Curriculum Model Averaging (CMA): Linda details the paper's proposed method: replacing learning rate decay with a constant learning rate combined with weight averaging (EMA) to stabilize the model while keeping it plastic enough to learn from good data.• Results at Scale: A deep dive into the experimental results on 1.5B parameter models, showing how this new regime outperforms random shuffling by over 1.6% on standard benchmarks.• Rethinking the Pretraining Recipe: Professor Norris concedes the brilliance of the approach, and the two discuss the broader implications for mid-training and the necessity of co-designing data curricula with optimization hyperparameters.

Mar 25, 2026

19m

134

GLM-5

In this episode:• Welcome & The End of Vibe Coding?: Linda introduces GLM-5 and the paradigm shift from passive vibe coding to autonomous agentic engineering.• Architecture & DeepSeek Sparse Attention: Professor Norris and Linda examine the 744B parameter model and how transitioning from dense to sparse attention drastically cuts compute costs.• Asynchronous RL and the Slime Framework: A deep dive into decoupled training engines, addressing off-policy drift with TITO and token-level clipping.• Evaluating Real-World Agentic Engineering: Reviewing GLM-5's performance on SWE-bench and the innovative Agent-as-a-Judge pipeline for interactive frontend testing.• Hardware Adaptation & Pony Alpha: Discussing the model's extreme quantization for domestic GPUs and the dramatic anonymous release on OpenRouter.

Mar 24, 2026

24m

133

Cautious Optimizers

In this episode:• Introduction to Cautious Optimizers: Linda introduces the paper and its bold claim of improving optimizers with just one line of code, while Norris expresses his initial skepticism.• The Inertia Problem in Momentum: The hosts discuss how standard momentum-based optimizers like AdamW can overshoot due to inertia, temporarily increasing the loss function.• The One-Line Fix and Scaling: Linda breaks down the PyTorch implementation of the cautious mask, explaining how it zeros out conflicting directions and scales the remaining updates.• Hamiltonian Dynamics and Convergence: Norris and Linda explore the theoretical guarantees of the paper, discussing how the method preserves Hamiltonian descent and ensures monotonic loss reduction.• Empirical Triumphs and Overhead: The conversation shifts to the experimental results on LLaMA pretraining and Vision Transformers, noting the impressive performance and minimal 3 percent computational overhead.• Conclusion: Norris admits he is fully convinced by the elegant simplicity of the paper, and Linda signs off for the episode.
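
The masking step from "The One-Line Fix and Scaling" can be sketched without any framework. This is a minimal plain-Python illustration of the idea as summarized above (zero out update coordinates that conflict with the gradient, then rescale the rest), not the paper's PyTorch line. The function name `cautious_update` and the `eps` value are assumptions made for this example.

```python
def cautious_update(update, grad, eps=1e-8):
    """Mask a proposed optimizer update so it never opposes the gradient sign.

    update: per-coordinate update proposed by a momentum-based optimizer.
    grad:   per-coordinate gradient at the current step.
    """
    # Keep a coordinate only when the update and the gradient agree in sign.
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    # Rescale the survivors by the inverse of the kept fraction so the
    # overall update size is not shrunk by the zeroed coordinates.
    kept_fraction = sum(mask) / len(mask)
    scale = 1.0 / (kept_fraction + eps)
    return [u * m * scale for u, m in zip(update, mask)]


# Coordinates 1 and 3 disagree with the gradient and are zeroed out;
# the remaining two are scaled up by roughly 2.
masked = cautious_update([1.0, -2.0, 3.0, 4.0], [0.5, 1.0, 2.0, -1.0])
print(masked)
```

The masked result would then replace the raw update in the usual parameter step (parameter minus learning rate times update).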

Mar 23, 2026

21m

132

Backward Gradient Normalization in Deep Neural Networks

In this episode:• Welcome and Introduction: Professor Norris and Linda introduce the episode and the paper of the week: 'Backward Gradient Normalization in Deep Neural Networks'.• The Ghost of Gradients Past: A discussion on the classic vanishing and exploding gradient problems, and why existing solutions like Batch Normalization and ResNets still leave room for improvement.• Unpacking Backward Gradient Normalization: Linda explains the core mechanics of the BGN layer, detailing how it leaves the forward pass untouched while scaling gradients during backpropagation.• Visualizing the Flow: The hosts delve into the paper's experiments with 90-layer deep networks, comparing gradient decay across ReLU, Sigmoid, and Tanh activation functions.• Results, Trade-offs, and Conclusions: A breakdown of the accuracy improvements and training time efficiency of BGN compared to Batch Normalization on the MNIST dataset, followed by final thoughts.

Mar 22, 2026

22m

131

Attention Residuals

In this episode:• The PreNorm Dilution Problem: Professor Norris and Linda introduce the episode and discuss the fundamental limitations of standard residual connections, focusing on the unbounded magnitude growth caused by PreNorm.• Attention Residuals and the Time-Depth Duality: Linda introduces the core concept of Full Attention Residuals, treating network depth like sequence length. Professor Norris raises concerns about the memory and communication overhead.• Block Attention Residuals: The hosts discuss how the Kimi Team solves the overhead problem by partitioning layers into blocks, reducing the cost while preserving the benefits of selective aggregation.• Infrastructure and System Optimizations: A deep dive into the engineering feats that make Block AttnRes practical, including cross-stage caching for pipeline parallelism and a two-phase computation strategy for inference.• Results, Scaling Laws, and Wrap-up: Linda shares the impressive scaling law results and downstream benchmark improvements. The hosts reflect on how AttnRes bounds hidden-state magnitudes.

Mar 21, 2026

21m

130

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

In this episode:• Introduction: Linda and Professor Norris introduce the podcast and the focus of the episode: the PoPE paper.• The Problem with RoPE: A discussion on Rotary Position Embedding and how it entangles content and positional information.• Introducing PoPE: Linda explains the mathematical shift to polar coordinates to decouple the what and the where.• Empirical Triumphs: Reviewing the massive performance jump on the Indirect Indexing task, plus music, genomics, and language modeling.• Length Extrapolation and Conclusion: Analyzing PoPE's zero-shot length extrapolation capabilities compared to YaRN, followed by episode wrap-up.

Mar 20, 2026

20m

129

Scaling Laws for Precision

In this episode:• Introduction to Precision in Scaling Laws: Linda introduces the new paper which adds precision as a third variable to the Chinchilla scaling laws. Professor Norris reflects on how precision is usually treated as an afterthought.• The Post-Training Quantization Paradox: The hosts discuss the surprising finding that overtraining models on too much data actually makes them degrade worse when applying post-training quantization.• Effective Parameters and Low-Precision Training: Linda explains the concept of effective parameter count, and how lowering precision in weights, activations, and KV cache shrinks the model's effective size multiplicatively.• Finding the Compute-Optimal Precision: Professor Norris is surprised to learn that compute-optimal pretraining precision is around 7 to 8 bits, completely independent of the compute budget unless model size is constrained.• A Unified Scaling Law and Takeaways: The episode wraps up by bringing pretraining and post-training precision into a single mathematical framework, discussing what this means for the future of model training.

Mar 19, 2026

20m

128

On the "Induction Bias" in Sequence Models

In this episode:• Introduction: The Transformer's Kryptonite: Professor Norris jokes about Transformers solving everything, but Linda introduces a new paper that challenges their ability to perform basic state tracking efficiently. They set the stage by distinguishing between the well-known Out-of-Distribution failures and the paper's focus on In-Distribution data efficiency.• The Setup: Modulo Arithmetic and Supervision Regimes: Linda explains the experimental setup using modular addition and permutation composition, and defines the three supervision formats: Outcome Supervision, Chain-of-Thought (CoT), and Aligned CoT. Norris questions why simple math requires such complex architectures, leading to a discussion on sample efficiency.• The Showdown: Transformers vs. RNNs: The hosts discuss the surprising results where recurrent models (LSTMs and Dense-SSMs) crush Transformers in outcome supervision. They analyze why Transformers rely heavily on Chain-of-Thought to function, whereas RNNs struggle with standard CoT due to recall bottlenecks but excel with Aligned CoT.• The Core Theory: Induction Bias and The Sharing Factor: Linda dives into the concept of the "Sharing Factor" (kappa), explaining that RNNs use an inductive bias to share weights across sequence lengths, effectively learning the algorithm. Norris is fascinated by the finding that Transformers exhibit "length isolation," essentially relearning the task from scratch for every new sequence length.• Conclusion: Brute Force vs. True Learning: The pair wraps up by discussing the implications for Large Language Models, specifically regarding "context rot" and the massive data requirements for agentic workflows. Norris concedes that perhaps we haven't solved state tracking just yet, and they sign off.

Mar 13, 2026

17m

127

NOBLE- Accelerating Transformers with Nonlinear Low-Rank Branches

In this episode:• A Noble Introduction: Professor Norris makes a pun about aristocracy while Linda introduces the paper 'NOBLE' from Canva Research, setting the stage for a discussion on accelerating Transformer pretraining.• The Linear Collapse Problem: Linda explains why standard LoRA doesn't work for pretraining from scratch, and Norris helps clarify the difference between parameter-efficient fine-tuning and architectural augmentation.• Anatomy of a Nonlinear Branch: A deep dive into the NOBLE architecture and the 'CosNet' activation function, discussing why a cosine sandwich is better than ReLU for low-rank bottlenecks.• Crunching the Numbers: The hosts discuss the experimental results, highlighting the 1.47x step speedup and debating whether the parameter overhead is worth the wall-clock time savings.• The Mixup Mystery: Linda reveals a fascinating caveat regarding Mixup/CutMix augmentation, leading to a theoretical realization about NOBLE's role in learning high-frequency signals versus smooth global trends.• Inference and Impact: The duo wraps up by discussing the trade-offs, specifically the permanent inference cost, and gives their final verdict on whether NOBLE is the future of pretraining.

Mar 12, 2026

16m

126

Flash Attention 4

In this episode:• Welcome to the Hardware Lottery: Professor Norris and Linda introduce the episode's focus: FlashAttention-4. They set the stage by discussing the arrival of NVIDIA's Blackwell architecture and why existing optimization techniques suddenly hit a wall.• The Asymmetry Problem: Linda explains the concept of 'Asymmetric Hardware Scaling' found in the B200 GPUs, where tensor cores doubled in speed but memory bandwidth and special function units didn't. Norris questions why simply running FlashAttention-3 isn't good enough.• Bottlenecks in the Forward Pass: The duo dives into the algorithmic changes for the forward pass, specifically how the paper mitigates the 'exponential unit' bottleneck by emulating exponential functions on FMA units and using conditional softmax rescaling.• Taming the Backward Pass with TMEM: A deep dive into the backward pass optimizations. Linda explains the use of Tensor Memory (TMEM) and the '2-CTA MMA' mode to reduce shared memory traffic, satisfying Norris's curiosity about how to hide latency.• Escaping Template Hell: They discuss the implementation framework: CuTe-DSL embedded in Python. Norris rejoices at the reduction in compile times compared to C++ templates, while Linda highlights the flexibility for researchers.• The Verdict: The hosts wrap up the findings, noting the impressive speedups over cuDNN and Triton, and offer final thoughts on the future of hardware-aware algorithm design.

Mar 10, 2026

15m

125

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

In this episode:• The Multi-Million Dollar NaN: Linda introduces the paper 'An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence' by Zhang et al., setting the stage with the high stakes of expensive pretraining runs failing. Professor Norris expresses skepticism that simple 'bad data' is the root cause of complex divergences.• The Toxic Five Tokens: The hosts discuss the paper's methodology of injecting synthetic uniform random noise. Linda reveals the counter-intuitive finding that a restricted vocabulary of noise (like repeating hash codes) is significantly more destabilizing than random noise drawn from the full vocabulary.• Bigger, Deeper, More Fragile: A look at the scaling laws of failure; contrary to the hope that scale fixes everything, Linda explains how the paper proves that deeper models are substantially more likely to diverge when exposed to noise than their wider or smaller counterparts.• CSI: Gradient Descent: Professor Norris and Linda dive into the forensic diagnostics, distinguishing between failures caused by high learning rates versus noisy data. They discuss the specific 'smoking gun' of maximum attention logits capping at around 1800 for noise-induced failures versus 4000 for learning rate issues.• MoE Stability and The QK-Fix: They address the concern that Mixture-of-Experts (MoE) models might be hypersensitive to noise, which the paper disproves, and discuss QK-LayerNorm as the architectural 'safety belt' when perfect data cleaning isn't possible.• Closing Thoughts: Final takeaways on the necessity of data curation and a witty sign-off from Professor Norris regarding the cleanliness of his own reading glasses versus the training data.

Feb 28, 2026

16m

124

Midtraining Bridges Pretraining and Posttraining Distributions

In this episode:• Introduction: Do We Really Need Another Phase?: Professor Norris jokingly laments the ever-expanding terminology of LLM training, while Linda introduces the paper on 'Midtraining' as a distinct, intermediate phase between pretraining and post-training.• The Mechanism: Building a Distributional Bridge: Linda explains the core theory: midtraining isn't just 'cooling down,' but shifting the model's initialization closer to the target distribution to smooth out the optimization path.• Results: Where It Works (and Where It Doesn't): The hosts discuss the finding that midtraining shines in 'distant' domains like Code and Math but matters less for general instructions, and cover the surprising reduction in catastrophic forgetting.• The Plasticity Window: Timing and Mixtures: A deep dive into the interaction between when you start midtraining and how much specialized data you use, highlighting the dangers of late, aggressive data injection.• Conclusion: Better Than Continued Pretraining?: Norris concedes the method's utility after seeing the comparison against standard continued pretraining, and the pair summarize the practical takeaways for training schedules.

Feb 27, 2026

16m

123

SiameseNorm

In this episode:• Introduction: The Never-Ending Normalization Wars: Professor Norris and Linda kick off the episode. Norris cracks a joke about how normalization layers are like seasoning: too little and it's bland, too much and you ruin the dish. Linda introduces the paper 'SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm', setting the stage for a discussion on the fundamental trade-offs in Transformer architecture.• The Dilemma: Dilution vs. Distortion: Linda explains the core problem: Pre-Norm is stable but suffers from 'signal dilution' which limits effective depth, while Post-Norm offers high expressivity but is plagued by 'gradient distortion' and instability. Norris plays the skeptic, asking why we can't just combine them, leading to a discussion on why previous hybrid attempts have failed.• The Solution: SiameseNorm's Dual Streams: Linda describes the paper's novel architecture: SiameseNorm. She explains how it uses two parallel streams (one Pre-Norm-like, one Post-Norm-like) that share the same residual block parameters. This allows the model to decouple the optimization dynamics (via the identity path) from the representation learning (via the normalized path).• Under the Hood: The Gradient Analysis: Professor Norris dives into the mathematical justification provided in the paper. He breaks down the Jacobian matrix analysis, seemingly impressed by how the architecture preserves an explicit identity term (for the gradient highway) while simultaneously enforcing bounded representations, effectively solving the vanishing/exploding gradient problem.• Results: The Arithmetic Leap and High Learning Rates: Linda presents the empirical evidence, highlighting that SiameseNorm allows for much more aggressive learning rates (up to 2e-3) without diverging. She emphasizes the massive 40% relative gain in arithmetic reasoning tasks compared to Pre-Norm, which finally convinces Norris that the 'effective depth' has indeed been restored.• Conclusion: A Unified Future?: The hosts wrap up the episode. Norris concedes that this might be the 'best of both worlds' solution the field has been waiting for. They discuss the implications for training even larger models and sign off with their catchphrase.

Feb 24, 2026

17m

122

ÜberWeb

In this episode:• Welcome to the ÜberWeb: Professor Norris and Linda introduce the episode's focus: the 'ÜberWeb' paper by DatologyAI, setting the stage for a discussion on the challenges of training high-quality multilingual models on a massive scale.• The Curse That Wasn't: The hosts debate the 'curse of multilinguality,' with Linda explaining the paper's central thesis: that performance degradation is often due to poor data quality ('curse of data quality') rather than a lack of model parameters.• A Rising Tide Lifts All Boats: Discussion on the paper's most surprising finding: that curating high-quality English data improves non-English performance, and conversely, cleaning non-English data boosts English capabilities.• Bespoke Curation and the Translation Trap: Linda details why generic filters fail for diverse scripts and how the paper utilized bespoke pipelines, while Norris interrogates the nuance of using translated data effectively versus blindly translating noise.• The New Pareto Frontier: A look at the hard numbers, where the hosts analyze how 3B and 8B models trained on just 1 trillion curated tokens managed to outperform significantly larger open-source baselines like Llama and Qwen.• Conclusion and Sign-off: Norris and Linda wrap up the episode, reflecting on the future of data-centric AI and the move toward more efficient, language-inclusive foundation models.

Feb 23, 2026

18m

121

Why Do Reasoning Models Loop

In this episode:• Introduction: The Infinite Loop: Professor Norris and Linda introduce the episode's topic: the phenomenon of reasoning models getting stuck in repetitive loops. Norris jokes about his own lectures looping, while Linda introduces the paper 'Wait, Wait, Wait... Why Do Reasoning Models Loop?' and the context of Chain-of-Thought reasoning.• The Distillation Mystery: Linda presents the paper's empirical findings, highlighting that 'student' models (distilled) loop significantly more than their 'teacher' models. Norris is skeptical that a student could be worse than the teacher if trained properly, leading to a discussion on 'errors in learning.'• Mechanism 1: Risk Aversion and Hard Steps: The hosts dive into the first theoretical mechanism: Risk Aversion due to Hardness of Learning. Linda uses the 'Star Graph' analogy to explain how models prefer easy, cyclic actions (like resetting) over hard, progress-making steps when they are uncertain.• Mechanism 2: Deja Vu and Correlated Errors: They discuss the second mechanism: Inductive Bias for Temporally Correlated Errors. Norris learns why models don't just guess randomly when confused but instead make the *same* mistake repeatedly, leading to the 'Groundhog Day' effect in reasoning traces.• Temperature: A Cure or a Band-Aid?: Linda explains why turning up the 'temperature' (randomness) helps break loops but is ultimately just a stopgap that masks the underlying learning errors. They conclude with a look at how loops become self-reinforcing catalysts.

Feb 18, 2026

17m

120

OPUS- Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

In this episode:• Introduction: Hitting the Data Wall: Professor Norris and Linda introduce the episode's paper, 'OPUS', and discuss the looming 'Data Wall' where high-quality public text is exhausted, necessitating a shift from more tokens to better tokens.• The Flaw in Current Data Selection: The hosts debate existing methods, contrasting static filters like FineWeb-Edu with dynamic selection. Linda explains why scoring data based on raw gradients fails when modern optimizers like AdamW or Muon reshape the update geometry.• Defining Utility in the Optimizer's World: Linda breaks down the core mechanism of OPUS: measuring data utility in the optimizer-induced update space rather than the raw gradient space. Norris grapples with the concept of aligning data selection with the actual trajectory of the optimization.• Scaling Up: Ghosts and Sketches: A deep dive into how OPUS makes per-sample gradient estimation computationally feasible. The discussion covers the use of the 'Ghost' technique combined with CountSketch to project updates into low-dimensional space without full materialization.• Diversity via Boltzmann and The Proxy: The hosts discuss how OPUS avoids 'diversity collapse' using Boltzmann sampling instead of greedy selection, and how it constructs a stable 'Bench-Proxy' from the pre-training corpus to guide the model.• Results and Final Thoughts: Reviewing the empirical results where OPUS outperforms industrial baselines on GPT-2 and Qwen3-8B. Norris concedes the cleverness of the approach, and they wrap up with thoughts on data efficiency.

Feb 11, 2026

17m

119

Teon

In this episode:• Introduction: The Optimizer Zoo: Professor Norris and Linda introduce the topic of optimization in LLMs, joking about the explosion of new optimizers before introducing the paper of the week: TEON.• The Muon Foundation: Linda recaps the Muon optimizer, explaining how it uses orthogonalization to prevent gradient rank collapse, while Norris questions its limitations regarding layer independence.• Enter the Tensor: How TEON Works: Linda explains the core innovation of TEON: stacking gradients from multiple layers into a tensor and using matricization to orthogonalize them jointly.• The Theory: Singular Vector Alignment: The hosts discuss the theoretical justification, focusing on Proposition 4.6 and why gradients in Transformers (specifically Q, K, and V) exhibit strong singular vector alignment.• Results and The Polar Express: A look at the experimental results on GPT and LLaMA models, confirming TEON outperforms Muon even when using approximate SVD methods like PolarExpress.• Conclusion: Professor Norris concedes that TEON offers a principled improvement over Muon, and the duo signs off.

Feb 3, 2026

18m

118

Cautious Weight Decay

In this episode:• Introduction: The Weight Decay Dilemma: Professor Norris and Linda introduce the episode's topic: Cautious Weight Decay. They discuss the historical context of weight decay as a regularization technique and why standard approaches might be accidentally sabotaging model learning.• The Mechanism: To Decay or Not to Decay?: Linda explains the core algorithm of Cautious Weight Decay (CWD). The hosts break down the 'sign alignment' logic, explaining how CWD decides when to apply the 'brakes' of regularization and when to let the weights grow freely.• Mathematical Foundations: Lyapunov and Sliding Modes: Professor Norris dives into the theoretical proofs provided in the paper. He discusses how CWD doesn't just optimize a proxy loss function but actually finds Pareto-optimal points on the stationary manifold of the original objective.• Experimental Results: A Drop-in Upgrade: Linda presents the empirical data, covering performance on Large Language Models and Vision Transformers. They highlight the 'killer feature': CWD requires no hyperparameter retuning compared to AdamW.• Conclusion and Final Verdict: The hosts summarize the findings. Norris gives his skeptical but approving verdict, and they discuss the potential for this simple one-line change to become a new standard in deep learning optimization.
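
The 'sign alignment' logic can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's code: it assumes the parameter step is `x - lr * (u + decay)`, where `u` is the optimizer's update direction, and that decay is applied only to coordinates where `u` and the weight `x` share a sign. The function name `cwd_step`, the strict inequality, and the default hyperparameters are hypothetical.

```python
def cwd_step(params, update, lr=0.1, weight_decay=0.1):
    """One parameter step with sign-gated ("cautious") weight decay.

    params: current weights x.
    update: optimizer update direction u (the step taken is -lr * u).
    """
    new_params = []
    for x, u in zip(params, update):
        # Assumed rule: when u and x share a sign, the step -lr * u already
        # pulls x toward zero, so the decay "brake" is applied as well.
        # Otherwise the weight is left free to grow and decay is skipped.
        decay = weight_decay * x if u * x > 0 else 0.0
        new_params.append(x - lr * (u + decay))
    return new_params


# First weight is aligned with its update and gets decayed;
# the second is not aligned, so only the plain update is applied.
stepped = cwd_step([1.0, -1.0], [0.5, 0.5])
print(stepped)
```

Standard decoupled weight decay would apply the `weight_decay * x` term to every coordinate; the only change in this sketch is the sign gate.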

Jan 28, 2026

20m

117

Predictable Scale

In this episode:• Introduction: The Alchemy of Training: Professor Norris laments the 'black magic' of hyperparameter tuning, and Linda introduces the paper 'Predictable Scale: Part I, Step Law' which promises to turn that alchemy into science.• The Million-Hour Experiment: The hosts discuss the unprecedented scale of the study, involving 3,700 models and nearly one million H800 GPU hours, to map the loss landscape.• Defining the Step Law: Linda explains the core mathematical findings: how Learning Rate scales with model size (N) and data size (D), and the surprising revelation that optimal Batch Size depends almost entirely on D, not N.• Universality: MoEs and Data Recipes: A deep dive into how the Step Law holds up against sparse Mixture-of-Experts models and varying data distributions (like code or multilingual data), outperforming previous scaling laws like DeepSeek or OpenAI's.• Conclusion: A Plug-and-Play Future: Norris concedes that the empirical evidence is overwhelming. They wrap up with the implications for efficient LLM training and what this means for the industry.

Jan 27, 2026

17m

116

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

In this episode:• Introduction: The Heavy Cost of Curvature: Professor Norris and Linda introduce the paper 'A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs' by researchers at Meta and UMD, setting the stage by discussing why measuring the Hessian matrix is a computational nightmare for large models.• The Proposal: Critical Sharpness: Linda explains the core innovation of the paper: a method to estimate sharpness using a simple line search algorithm (Critical Sharpness) instead of expensive eigenvalue decompositions.• Validation: The Edge of Stability: The hosts discuss how this new metric confirms the 'Edge of Stability' phenomenon in massive models like OLMo-2 7B, proving that models naturally train right on the precipice of instability.• Application: Solving Catastrophic Forgetting: The discussion moves to the most practical takeaway: using 'Relative Critical Sharpness' to determine the perfect ratio of pre-training data to mix in during fine-tuning to prevent the model from becoming 'dumb' on general tasks.• Conclusion and Takeaways: Norris and Linda wrap up with final thoughts on how this tool essentially gives engineers a flashlight to navigate the dark, high-dimensional valleys of loss landscapes without needing a supercomputer.

Jan 27, 2026

17m

115

Challenges and Research Directions for Large Language Model Inference Hardware

In this episode:• Introduction: The Disconnect: Professor Norris and Linda introduce the paper 'Challenges and Research Directions for Large Language Model Inference Hardware' by Ma and Patterson, discussing the widening gap between academic architecture research and industry reality.• The Inference Crisis: Prefill vs. Decode: The hosts break down why LLM inference is fundamentally different from training, explaining the 'Memory Wall' and the specific bottleneck of the autoregressive Decode phase.• Solution 1: High Bandwidth Flash: Linda proposes High Bandwidth Flash (HBF) as a solution for capacity, while Professor Norris questions the latency and endurance issues inherent to flash memory.• Solution 2 & 3: PNM and 3D Stacking: A discussion on Processing-Near-Memory (PNM) versus Processing-In-Memory (PIM), and how 3D stacking can shorten the distance between compute and data.• Solution 4: Interconnects and New Metrics: The duo discusses why latency matters more than bandwidth for inference interconnects, and concludes with a look at new evaluation metrics like TCO and Carbon Footprint.

Jan 17, 2026

19m

114

Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

In this episode:• To PE or Not to PE?: Professor Norris and Linda kick off the episode by introducing the paper 'Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings' (DroPE). Norris expresses immediate skepticism about removing such a fundamental component of the Transformer architecture, setting the stage for the debate.• The Inductive Bias Paradox: Linda explains the paper's first major observation: Positional Embeddings (PEs) are necessary scaffolding for fast training convergence but become a straitjacket for zero-shot generalization. They discuss the theoretical findings regarding 'attention positional bias' and why NoPE models struggle to learn initially.• Why Scaling Frequency Breaks Meaning: The hosts dive into the technical critique of current context-extension methods like YaRN and RoPE-scaling. Linda details how compressing low frequencies to fit longer contexts distorts 'semantic heads'—the parts of the model that match content rather than position—causing failures in retrieval tasks.• The DroPE Solution: Remove the Training Wheels: They discuss the proposed method: training with RoPE, then stripping it away and doing a quick recalibration. Norris warms up to the analogy of PEs acting as 'scaffolding' that should be removed once the building (the model) is self-supporting.• Needles in Haystacks and Future Architectures: Reviewing the empirical results, including the massive improvements on Needle-in-a-Haystack benchmarks compared to YaRN. The episode concludes with a discussion on whether future foundation models will all be trained with this 'drop-and-recalibrate' paradigm.

Jan 16, 2026

20m

113

The Quantization Model of Neural Scaling

In this episode:• Introduction: The Mystery of the Straight Line: Professor Norris and Linda introduce the paper 'The Quantization Model of Neural Scaling' by Michaud et al., setting the stage by discussing the ubiquity of power laws in deep learning and the puzzle of why scaling curves are so predictable.• The Quantization Hypothesis: Linda explains the core theory that neural network knowledge is not continuous but composed of discrete, indivisible chunks called 'quanta,' analogous to Max Planck's quantization of energy.• Zipf's Law and the Toy Model: The hosts discuss how learning discrete skills ordered by frequency (Zipfian distribution) results in smooth power law scaling, using the authors' 'multitask sparse parity' toy dataset as proof.• Monogenic vs. Polygenic Traits in LLMs: Transitioning to real Language Models (Pythia), the discussion explores why some capabilities emerge suddenly (monogenic) while others improve gradually (polygenic), borrowing terminology from genetics.• Mechanistic Evidence: Clustering Gradients: Linda details the 'Quanta Discovery from Gradients' (QDG) technique used to automatically identify specific skills within a model, such as incrementing numbers or closing quotes.• Conclusion: A Society of Quanta: Professor Norris and Linda wrap up by reflecting on Minsky's 'Society of Mind' and the implications of this decomposability for the future of mechanistic interpretability.

Jan 15, 2026

17m

112

EAGLE-3

In this episode:• Introduction: The Wait for Tokens: Professor Norris and Linda introduce the episode's paper, EAGLE-3, and discuss the persistent bottleneck of autoregressive generation costs in modern LLMs.• The Speculative Ceiling: Linda explains how previous speculative sampling methods like EAGLE hit a performance wall where adding more training data failed to improve the draft model, identifying the feature prediction constraint as the culprit.• Innovation: Training-Time Test: A deep dive into EAGLE-3's core innovation: abandoning feature prediction in favor of direct token prediction that simulates the testing environment during the training phase.• Going Deeper: Multi-Layer Fusion: The hosts discuss the second major architectural change, where the model stops relying solely on top-layer features and instead fuses low, mid, and high-level features for better context.• Results: A New Scaling Law: Linda reveals the experimental results, including a 6.5x speedup, SGLang integration, and the discovery of a scaling law where draft models finally benefit from more data.

Jan 14, 2026

17m

111

Engram Paper

In this episode:• The Memory Bottleneck: Professor Norris and Linda introduce the paper 'Conditional Memory via Scalable Lookup' and debate the inefficiency of using expensive neural computation to simulate simple knowledge retrieval.• Engram: N-grams Strike Back: Linda breaks down the 'Engram' module, explaining how it uses hashed N-grams and context-aware gating to inject static embeddings directly into the Transformer backbone.• The U-Shaped Curve of Sparsity: The hosts discuss the 'Sparsity Allocation' problem, analyzing the trade-off between MoE experts and memory capacity, and the discovery that a hybrid approach yields superior results.• Deepening the Network Without Layers: A discussion on mechanistic analysis, focusing on how Engram handles static patterns like named entities in early layers, freeing up the model's attention for complex reasoning.• Prefetching the Future: Linda and Norris explore the system-level advantages of deterministic lookups, including offloading massive embedding tables to CPU memory, and conclude the episode.

Jan 12, 2026

17m

110

From Entropy to Epiplexity- Rethinking Information for Computationally Bounded Intelligence

In this episode:• Introduction: Is Shannon Information Theory Broken?: Professor Norris and Linda introduce the episode, with Norris expressing skepticism about challenging the foundations of information theory. Linda introduces the paper 'From Entropy to Epiplexity' and the premise that traditional theory fails to account for computational bounds.• The Paradox of Deterministic Creation: The hosts discuss the first major paradox: how deterministic processes like AlphaZero or synthetic data generation seem to create new knowledge, despite the Data Processing Inequality suggesting otherwise. Linda explains why cryptographic randomness proves that 'computational difficulty' looks like entropy.• Defining Epiplexity and Time-Bounded Entropy: Linda breaks down the core definitions of the paper, explaining Epiplexity as the structural information a specific model can actually learn, versus Time-Bounded Entropy, which is the residual unpredictability relative to that model's resources.• Emergence, Induction, and the Chess Experiment: A deep dive into the paper's experiments with Cellular Automata and Chess. The hosts discuss how the order of data (Forward vs. Reverse) impacts what a model learns and how limited compute forces models to learn emergent rules rather than brute-force simulation.• Practical Implications for LLMs and Conclusion: The discussion moves to real-world application, specifically how Epiplexity explains why pre-training on text transfers better than images. Norris admits the utility of the theory for data selection in Large Language Models.

Jan 9, 2026

19m

109

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

In this episode:• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week: 'Completed Hyperparameter Transfer' by researchers at Apple.• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs, prompting skepticism from Norris about adding more complexity.• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution: the Complete(d)P parameterization, discussing how it fixes issues with Query-Key norms and unifies scaling across depth, batch size, and training duration using SDE principles.• The Per-Module Revolution: Linda gets excited about the paper's boldest claim: optimizing hyperparameters specifically for different modules (like embeddings vs. attention heads), and explains the 'jagged' optimization landscape that requires Trust Region Random Search.• Scaling Up: 50 Million to 7 Billion: Discussion of the empirical results, focusing on how settings found on a small 50M parameter proxy model successfully transferred to a 7B model, resulting in significant training speed-ups.• Conclusion: A Skeptic Convinced: Professor Norris admits that the rigorous math behind the SDE scaling rules is convincing, and the duo wraps up with final thoughts on what this means for the future of efficient model training.

Jan 8, 2026

19m

108

NorMuon- Making Muon more efficient and scalable

In this episode:• Introduction: The Optimizer Menagerie: Professor Norris and Linda kick off the episode by discussing the explosion of new optimizers in the LLM space. Linda introduces 'NorMuon,' a paper from Georgia Tech and Microsoft that attempts to bridge the gap between the industry standard, AdamW, and the geometric newcomer, Muon.• The Geometry Problem: Why Adam and Muon Fall Short: Linda explains the fundamental trade-off: Adam handles coordinate-wise scaling well but ignores matrix geometry, while Muon fixes the geometry via orthogonalization but suffers from imbalanced update norms across neurons. Norris challenges the necessity of fixing Muon, prompting a discussion on 'condition numbers' versus 'neuron norms.'• The NorMuon Solution: Best of Both Worlds: The hosts dive into the algorithm itself, detailing how NorMuon applies neuron-wise adaptive learning rates (similar to Adam-mini) *after* Muon's orthogonalization step. They discuss the intuition behind using second-order momentum to normalize the disparate scales of neuron updates.• Engineering at Scale: FSDP2 and Distributed Newton-Schulz: The discussion shifts to the systems engineering required to make this work on large clusters. Linda explains how the authors implemented NorMuon under the FSDP2 framework, specifically how they distribute the expensive Newton-Schulz orthogonalization across devices to avoid redundant computation.• Results and Verdict: Efficiency Gains: Norris reviews the empirical results, noting the 21% efficiency gain over Adam on 1.1B parameter models and the impressive memory savings. The episode concludes with a consensus that orthogonalization and adaptive scaling are complementary, not competitive, technologies.

Jan 7, 2026

19m

107

Dion- Distributed Orthonormalized Updates

In this episode:• The GPU Bill Blues: Professor Norris laments the exorbitant cost of training large models, setting the stage for Linda to introduce the episode's focus: 'Dion: Distributed Orthonormalized Updates' by researchers from Microsoft and Harvard.• Muon's Heavy Lifting: Linda explains the predecessor, the Muon optimizer, and its orthonormalization benefits. Norris questions why a new method is needed, leading to a discussion on how Newton-Schulz iterations become a communication bottleneck in sharded distributed training.• Rethinking Linear Algebra: Linda details Dion's core innovation: replacing full matrix reconstruction with amortized power iteration on a momentum buffer. Norris is skeptical about the math, but Linda explains how this integrates cleanly with weight sharding.• The Magic of Error Feedback: The hosts discuss the 'rank-fraction' parameter and how low-rank updates save compute. Linda explains the crucial role of 'error feedback' in maintaining accuracy, finally winning over the skeptical Norris.• Lazy Updates and CPU Offloading: A look at the algorithmic flexibility of Dion, including 'Lazy-Dion' and CPU offloading variants. They discuss the experimental results showing Dion matching Muon's performance with significantly lower wall-clock time.• Future-Proofing Optimization: Professor Norris admits the elegance of the solution. The pair wraps up with thoughts on how Dion might become the standard for training next-generation foundation models.

Jan 6, 2026

18m

106

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v5

Jan 6, 2026

20m

105

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v4

Jan 6, 2026

19m

104

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v3

Jan 6, 2026

20m

103

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining-v2

Jan 6, 2026

22m

102

Key and Value Weights Are Probably All You Need

In this episode:• Is Query Redundant?: Linda introduces a provocative paper suggesting a core part of the Transformer attention mechanism, the Query matrix, might be unnecessary. Professor Norris expresses his trademark skepticism about simplifying such a fundamental component.• The Usual Suspects: Q, K, and V: Linda provides a quick, intuitive refresher on the roles of Query, Key, and Value matrices in self-attention. Professor Norris helps frame it with an analogy, emphasizing why each component has traditionally been considered essential.• Disappearing Queries and Basis Transformations: Linda explains the paper's core theoretical claim that the Query matrix can be mathematically absorbed into other components through a change of basis. Professor Norris probes the 'simplifying assumptions,' like the absence of Layer Normalization, required for the proof to hold.• Putting It to the Test: The discussion moves to the empirical results, where models trained without Query matrices perform surprisingly well. Linda details the crucial hyperparameter adjustments, which Professor Norris identifies as the key to bridging the gap between theory and practice.• So, Is Query Really All You Don't Need?: The hosts debate the broader implications for parameter efficiency and our understanding of transformer architecture. They conclude by questioning if this simplification is an artifact of smaller models or a fundamental insight that will reshape future designs.

Nov 26, 2025

14m

101

Latent State Models of Training Dynamics

In this episode:• Why Does Seed 42 Work Best?: Linda introduces a paper that tries to answer a classic machine learning question: why does the random seed have such a big impact on training? Professor Norris laments that this is a problem as old as neural networks themselves.• A Roadmap for Training: Linda explains the paper's novel approach of using a Hidden Markov Model to turn messy training dynamics into a clean 'training map' of latent states. Professor Norris expresses his surprise and curiosity at seeing a classic model like an HMM used to analyze modern deep learning.• Taking the Scenic Route to Convergence: The hosts discuss the paper's key findings on 'grokking' tasks, where different random seeds lead to different paths on the training map. Linda explains the concept of 'detour states,' which are optional, slower paths to convergence that some models get stuck in.• You Are the Traffic Controller: Professor Norris highlights the paper's powerful conclusion that training variability isn't inherent to a task, but a result of the training setup. Linda explains how removing components like batch normalization can create detours in stable tasks, while adding them can remove detours from unstable ones.• Maps, Not Just Metrics: Linda and Professor Norris conclude by discussing the practical implications, such as a new way to analyze and compare hyperparameter settings by looking at the structure of their training maps.

Oct 28, 2025

12m

100

DeepSeek OCR

Oct 24, 2025

13m

99

The Coverage Principle - How Pre-training Enables Post-Training

Oct 23, 2025

15m

98

The Coverage Principle- How Pre-training Enables Post-Training

In this episode:• Why a Good Pre-trainer Isn't Always a Good Finetuner: The hosts introduce the puzzle of pre-training: why doesn't a lower cross-entropy loss always guarantee better performance after fine-tuning? They set the stage for today's paper which proposes a new perspective.• Are We Covering Our Bases? The Coverage Principle: Linda explains the paper's central concept of 'coverage,' a metric that measures if a model assigns at least some probability to a wide range of high-quality responses, contrasting it with the pitfalls of cross-entropy.• The Implicit Genius of Next-Token Prediction: The hosts dive into the paper's main theoretical result, explaining how the standard next-token prediction objective implicitly optimizes for good coverage, and why this metric is a much better predictor of downstream success than raw loss.• From Theory to Practice: Interventions for Better Coverage: The discussion turns to practical applications, exploring the paper's proposed methods for actively improving coverage, including gradient normalization schemes and novel checkpoint selection strategies.• What's Next for Coverage?: Professor Norris and Linda recap the key insight that coverage is a crucial link between pre-training and post-training success, and discuss the future research directions this new perspective opens up.

Oct 23, 2025

15m

97

The Art of Scaling Reinforcement Learning Compute for LLMs

In this episode:• The Art and Science of Scaling RL: Professor Norris and Linda introduce today's topic, a new paper from Meta that aims to make training large models with reinforcement learning more predictable and scientific.• More Art than Science: Linda explains why scaling Reinforcement Learning is so difficult compared to pre-training, highlighting the lack of predictive scaling laws and the immense compute costs that sideline smaller research groups.• Not a Power Law, but a Sigmoid: The hosts discuss the paper's core proposal: using a sigmoidal curve to model performance. Linda breaks down the key parameters like asymptotic performance (A) and compute efficiency (B), while Professor Norris relates it to human learning curves.• The ScaleRL Cookbook: Linda walks through the 'ScaleRL' recipe, a combination of techniques discovered through a massive 400,000 GPU-hour study. They discuss the difference between choices that raise the performance ceiling versus those that just improve efficiency.• Predictable Progress and The Bitter Lesson: The hosts discuss the implications of this work, such as enabling cheaper, more accessible research by extrapolating from small-scale experiments, and how it reinforces the 'bitter lesson' of prioritizing scalable methods.• Next Week on Mechanical Dreams: Professor Norris and Linda wrap up their discussion on scaling RL and give a brief teaser for the topic of next week's episode.

Oct 23, 2025

13m

96

Continual Learning via Sparse Memory Finetuning

In this episode:• The Frozen Brains of AI: Linda introduces the problem of static LLMs and the challenge of 'catastrophic forgetting.' Professor Norris provides historical context on this long-standing issue in AI and introduces the day's paper on continual learning.• Why Can't Models Just Keep Learning?: The hosts discuss traditional approaches to continual learning, like data replay and regularization. Linda explains why modern methods like LoRA, while better than full finetuning, still fall short of solving the forgetting problem.• Memory and Sparsity: The Secret Sauce: Linda details the paper's main contribution: Sparse Memory Finetuning. She explains the concepts of memory layers and how the authors use a TF-IDF-like mechanism to identify and update only a tiny fraction of model parameters.• Learning vs. Forgetting: The Showdown: Linda and Professor Norris analyze the paper's striking results, highlighting how the proposed method learns new facts effectively while forgetting dramatically less than both full finetuning and LoRA. They discuss the Pareto frontier plot as a key piece of evidence.• What's Next for Lifelong Learners?: The hosts discuss the implications and future directions for this research, such as applying the technique to more complex skills beyond fact acquisition. They conclude that sparse updates are a promising path toward creating truly dynamic AI models.

Oct 22, 2025

13m

95

DeepSeek OCR Paper

In this episode:• A Picture is Worth a Thousand Tokens: The hosts introduce the challenge of long context in LLMs and present the paper's radical idea: compressing text by taking a picture of it.• Compressing Text into Pixels: A deep dive into the main concept of optical compression, exploring how a page of text can be represented with far fewer vision tokens than text tokens.• The Secret Sauce: DeepEncoder: An explanation of the novel 'DeepEncoder' architecture, which efficiently processes high-resolution images into a small number of vision tokens for the language model to read.• The Proof is in the Pixels: Discussion of the experimental results, focusing on the impressive ~97% accuracy at a 10x compression ratio and its superior efficiency on industry benchmarks.• Forgetting, The Smart Way: Exploring the broader implications of optical compression, particularly the paper's proposal to use it as a 'forgetting mechanism' for ultra-long contexts that mimics human memory.

Oct 22, 2025

13m

94

Untitled Episode

Oct 10, 2025

11m

93

Characterization and Mitigation of Training Instabilities in Microscaling Formats

In this episode:• The Need for Speed: Microscaling Formats: Linda introduces new low-precision MX formats for training LLMs, designed to save massive amounts of compute. Professor Norris is intrigued but skeptical about the practical trade-offs.• When Good Training Goes Bad: The hosts discuss the core problem identified in the paper: severe training instabilities and sudden, unrecoverable loss spikes when using MX formats, especially at scale.• It's the Layernorm, Stupid!: Linda explains how the researchers used a proxy model to diagnose the instabilities, tracing the root cause to a systematic gradient bias from quantizing layernorm parameters.• The Hybrid Solution: Professor Norris and Linda discuss the paper's proposed mitigations, focusing on a clever hybrid-precision approach that uses low precision for weights and high precision for activations.• Precision on a Budget: The episode concludes by showing how these mitigation strategies successfully stabilize training, allowing for performance competitive with full-precision training while still saving compute.

Oct 8, 2025

13m

92

Demystifying Synthetic Data in LLM Pre-training- A Systematic Study of Scaling Laws, Benefits, and Pitfalls

In this episode:• The Synthetic Data Gold Rush: The hosts introduce the data scarcity problem for training large language models and present today's paper, which systematically investigates synthetic data as a potential solution.• Real Fake Data: What Kinds Are We Talking About?: Linda breaks down the different types of synthetic data studied, including rephrased web text and entirely novel 'synthetic textbooks', while Professor Norris questions the quality of this model-generated content.• The Secret Sauce: How Much Synthetic is Too Much?: Discussion of the paper's core finding: a 'good' mixture of ~30% rephrased synthetic data with natural web text can accelerate pre-training up to 10x, whereas 100% synthetic data offers no advantage.• Does a Bigger Generator Mean Better Data?: The hosts explore the paper's counter-intuitive discovery that using an 8B parameter model to generate data can outperform a much larger 70B model, challenging the 'bigger is always better' intuition.• Takeaways: A Measured Dose of Artificial Text: Professor Norris and Linda summarize the practical takeaways: synthetic data is a powerful but nuanced tool, not a silver bullet. The right type, mixture, and generator model are key to accelerating training.

Oct 7, 2025

14m

91

Drop-Muon- Update Less, Converge Faster

In this episode:• Introduction: Less is More in Optimization?: Professor Norris and Linda introduce the "Drop-Muon" paper, which challenges the fundamental assumption that all neural network layers must be updated at every training step. They set the stage by questioning if selectively updating layers could lead to faster convergence.• A Refresher on the Muon Family: Linda provides a high-level overview of modern non-Euclidean optimizers like Muon, Scion, and Gluon. They discuss how these methods use layer-specific geometry to improve training, which provides the foundation for the Drop-Muon approach.• The Drop-Muon Algorithm: Randomized Progressive Training: Linda explains the core mechanism of Drop-Muon, focusing on how it samples a random subset of layers to update at each iteration. Professor Norris probes the practicalities of this approach, especially the concept of 'Randomized Progressive Training' and its computational cost.• The Theoretical Justification: When is Full-Network Update Optimal?: The hosts delve into the paper's theoretical contributions, highlighting the key finding that full-network updates are only optimal under a very restrictive and unlikely condition on layer smoothness constants. They discuss the implications of the cost model, which accounts for backpropagation and parameter update costs.• Empirical Results and Final Thoughts: Linda presents the experimental results, which show Drop-Muon achieving the same accuracy as standard Muon up to 1.4x faster in wall-clock time. They conclude by discussing the practical impact of this 'update less, converge faster' strategy for training large models.

Oct 6, 2025

11m

90

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

In this episode:• The Finicky Diet of Large Language Models: Linda introduces a paper about how LLMs learn from mixtures of web data and high-quality data. Professor Norris expresses his initial intuition that more data is always better, setting the stage for the paper's surprising findings.• It's Not a Slope, It's a Cliff: Unveiling Phase Transitions: The hosts discuss the paper's core finding: knowledge acquisition isn't gradual but exhibits sudden 'phase transitions'. Linda explains how, below a critical model size or data mixing ratio, models learn almost nothing from specialized datasets, a result Professor Norris finds both fascinating and counter-intuitive.• The Knapsack Theory of Knowledge: To explain the 'why', Linda and Professor Norris explore the paper's theoretical model of 'capacity allocation'. They use a knapsack analogy to describe how a model with finite capacity strategically decides which data is 'worth' learning to minimize overall loss.• Learning More by Training on Less?: Linda and Professor Norris discuss the practical implications, including the paradoxical strategy of throwing away data to improve learning. They cover the paper's proposed solutions, like random subsampling and Compact Knowledge Mixing, and what this means for data curation.• Final Thoughts and Critical Points: The hosts summarize the paper's key insight: data mixing recipes are not one-size-fits-all, and the relationship between model size, data, and knowledge is sharp and discontinuous. They wrap up by emphasizing the importance of understanding these dynamics for efficient model training.

Oct 6, 2025

13m

89

Apertus Tech Report

In this episode:• Another Week, Another 'Open' Model?: Linda introduces the Apertus paper, framing it as a response to the systemic shortcomings of current open models. Professor Norris questions what makes this one different from the countless other 'open' releases.• Data Compliance and the Goldfish in the Machine: The hosts dive into Apertus's strict data compliance, including its novel retroactive application of robots.txt and the use of the 'Goldfish' training objective to prevent the model from memorizing its training data.• More Than Just English: A Truly Global LLM: Linda gets excited about the model's vast multilingual capabilities, trained on over 1800 languages. They discuss the implications for low-resource languages and the significance of a 40% non-English training data mix.• The Swiss AI Charter and Other Training Secrets: The discussion turns to the technical details of training Apertus, including its unique optimizer and its novel approach to safety alignment using a 'Swiss AI Charter' for controversial topics.• Final Thoughts: A New Standard for Openness?: Professor Norris and Linda summarize Apertus's contributions, concluding that its commitment to compliance, multilingualism, and full transparency sets a powerful new benchmark for the entire field.

Sep 21, 2025

13m
