PODCAST · technology
Mastering Language Models: From Architecture to Optimization
by William Liu
A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.
-
21
Train-to-Test Scaling: Why Overtraining Can Become Compute-Optimal
A forward-looking episode on Train-to-Test scaling laws, which jointly optimize model size, training tokens, and inference samples under end-to-end compute budgets.
-
20
SSD: Self-Distillation for Code Generation Without a Teacher
A post-training efficiency episode on simple self-distillation: using a model’s own sampled code outputs as supervised fine-tuning data without a verifier, teacher, or reinforcement learning.
-
19
FlashAttention-2: Faster Attention with Better Work Partitioning
A follow-up episode on FlashAttention-2: once memory movement improves, the next gains come from better parallelism, less non-matmul work, and smarter warp/thread-block layout.
-
18
FlashAttention: Exact Attention with IO-Awareness
A deep dive into FlashAttention’s central insight: attention speed is not only about arithmetic; it is also about moving less data between GPU memory levels.
-
17
Distributed Training Architecture: From GPU Kernels to Cluster Design
A systems episode that connects the individual techniques into an architecture-level view: GPUs, memory hierarchy, interconnects, scheduling, fault tolerance, and efficiency.
-
16
Fully Sharded Data Parallel: Fewer Copies, Larger Models
A practical episode on FSDP, how it shards model parameters across data-parallel workers, and why it became a production-friendly path for training bigger models.
-
15
ZeRO: Removing Memory Redundancy from Data Parallel Training
A deep dive into ZeRO’s memory model: optimizer states, gradients, and parameters are not sacred copies; they can be partitioned across workers.
-
14
Megatron-LM: Tensor Parallelism Inside the Transformer
A technical but accessible episode on Megatron-LM’s intra-layer model parallelism and why splitting matrix operations inside Transformer layers became a central scaling technique.
-
13
GPipe: Training Giant Networks with Pipeline Parallelism
A deep dive into GPipe, the paper that made layer-wise pipeline parallelism feel like a general recipe for training giant sequential networks.
-
12
Advanced Distributed Training: Overcoming Bottlenecks
A map of the distributed-training bottlenecks that decide whether a large language model can be trained at all: memory, communication, data movement, pipeline bubbles, and utilization.
-
11
Scaling Data-Constrained Language Models: When Fresh Text Runs Out
A deep dive into data-constrained scaling, explaining repeated data, effective tokens, diminishing returns, and the training-data bottleneck.
-
10
Training Compute-Optimal Large Language Models: The Chinchilla Lesson
A deep dive into Chinchilla and compute-optimal training, explaining why many large models were undertrained and why training tokens should scale roughly in proportion to parameters.
-
9
Scaling Laws for Neural Language Models: Predicting Progress
A deep dive into Scaling Laws for Neural Language Models, explaining power-law loss trends, compute allocation, forecasting, and the limits of loss as a proxy.
-
8
Scaling and Training Large Models Efficiently
A topic-level overview of efficient large-model scaling, introducing parameters, tokens, compute, data quality, compute optimality, and the core disagreements around scale.
-
7
Attention Is All You Need: The Transformer Breakthrough
A deep dive into Attention Is All You Need, covering self-attention, multi-head attention, positional encodings, masking, and why the Transformer changed sequence modeling.
-
6
Foundations of Sequence Modeling: The Transformer Revolution
A topic-level overview of the Transformer revolution, explaining attention as learned information routing and setting up the Attention Is All You Need deep dive.
-
5
Series Overview: Mastering Language Models from Architecture to Optimization
A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.