Mastering Language Models: From Architecture to Optimization

by William Liu

A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.

21

Train-to-Test Scaling: Why Overtraining Can Become Compute-Optimal

A forward-looking episode on Train-to-Test scaling laws, which jointly optimize model size, training tokens, and inference samples under end-to-end compute budgets.

Apr 26, 2026

11m

20

SSD: Self-Distillation for Code Generation Without a Teacher

A post-training efficiency episode on simple self-distillation: using a model’s own sampled code outputs as supervised fine-tuning data without a verifier, teacher, or reinforcement learning.

Apr 26, 2026

11m

19

FlashAttention-2: Faster Attention with Better Work Partitioning

A follow-up episode on FlashAttention-2: once memory movement improves, the next gains come from better parallelism, less non-matmul work, and smarter warp/thread-block layout.

Apr 26, 2026

11m

18

FlashAttention: Exact Attention with IO-Awareness

A deep dive into FlashAttention’s central insight: attention speed is not only about arithmetic; it is also about moving less data between GPU memory levels.

Apr 26, 2026

10m

17

Distributed Training Architecture: From GPU Kernels to Cluster Design

A systems episode that connects the individual techniques into an architecture-level view: GPUs, memory hierarchy, interconnects, scheduling, fault tolerance, and efficiency.

Apr 26, 2026

7m

16

Fully Sharded Data Parallel: Fewer Copies, Larger Models

A practical episode on FSDP, how it shards model parameters across data-parallel workers, and why it became a production-friendly path for training bigger models.

Apr 26, 2026

10m

15

ZeRO: Removing Memory Redundancy from Data Parallel Training

A deep dive into ZeRO’s memory model: optimizer states, gradients, and parameters are not sacred copies; they can be partitioned across workers.

Apr 26, 2026

11m

14

Megatron-LM: Tensor Parallelism Inside the Transformer

A technical but accessible episode on Megatron-LM’s intra-layer model parallelism and why splitting matrix operations inside Transformer layers became a central scaling technique.

Apr 26, 2026

11m

13

GPipe: Training Giant Networks with Pipeline Parallelism

A deep dive into GPipe, the paper that made layer-wise pipeline parallelism feel like a general recipe for training giant sequential networks.

Apr 26, 2026

11m

12

Advanced Distributed Training: Overcoming Bottlenecks

A map of the distributed-training bottlenecks that decide whether a large language model can be trained at all: memory, communication, data movement, pipeline bubbles, and utilization.

Apr 26, 2026

12m

11

Scaling Data-Constrained Language Models: When Fresh Text Runs Out

A deep dive into data-constrained scaling, explaining repeated data, effective tokens, diminishing returns, and the training-data bottleneck.

Apr 26, 2026

11m

10

Training Compute-Optimal Large Language Models: The Chinchilla Lesson

A deep dive into Chinchilla and compute-optimal training, explaining why many large models were undertrained and why training tokens should scale roughly in proportion to parameters.

Apr 26, 2026

11m

9

Scaling Laws for Neural Language Models: Predicting Progress

A deep dive into Scaling Laws for Neural Language Models, explaining power-law loss trends, compute allocation, forecasting, and the limits of loss as a proxy.

Apr 26, 2026

10m

8

Scaling and Training Large Models Efficiently

A topic-level overview of efficient large-model scaling, introducing parameters, tokens, compute, data quality, compute optimality, and the core disagreements around scale.

Apr 26, 2026

11m

7

Attention Is All You Need: The Transformer Breakthrough

A deep dive into Attention Is All You Need, covering self-attention, multi-head attention, positional encodings, masking, and why the Transformer changed sequence modeling.

Apr 26, 2026

12m

6

Foundations of Sequence Modeling: The Transformer Revolution

A topic-level overview of the Transformer revolution, explaining attention as learned information routing and setting up the Attention Is All You Need deep dive.

Apr 26, 2026

11m

5

Series Overview: Mastering Language Models from Architecture to Optimization

A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.

Apr 26, 2026

11m
