All Episodes
Mastering Language Models: From Architecture to Optimization — 17 episodes
Train-to-Test Scaling: Why Overtraining Can Become Compute-Optimal
SSD: Self-Distillation for Code Generation Without a Teacher
FlashAttention-2: Faster Attention with Better Work Partitioning
FlashAttention: Exact Attention with IO-Awareness
Distributed Training Architecture: From GPU Kernels to Cluster Design
Fully Sharded Data Parallel: Fewer Copies, Larger Models
ZeRO: Removing Memory Redundancy from Data Parallel Training
Megatron-LM: Tensor Parallelism Inside the Transformer
GPipe: Training Giant Networks with Pipeline Parallelism
Advanced Distributed Training: Overcoming Bottlenecks
Scaling Data-Constrained Language Models: When Fresh Text Runs Out
Training Compute-Optimal Large Language Models: The Chinchilla Lesson
Scaling Laws for Neural Language Models: Predicting Progress
Scaling and Training Large Models Efficiently
Attention Is All You Need: The Transformer Breakthrough
Foundations of Sequence Modeling: The Transformer Revolution
Series Overview: Mastering Language Models from Architecture to Optimization