Mastering Language Models: From Architecture to Optimization

PODCAST · technology

A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.

  1. 21

    Train-to-Test Scaling: Why Overtraining Can Become Compute-Optimal

    A forward-looking episode on Train-to-Test scaling laws, which jointly optimize model size, training tokens, and inference samples under end-to-end compute budgets.

  2. 20

    SSD: Self-Distillation for Code Generation Without a Teacher

    A post-training efficiency episode on simple self-distillation: using a model’s own sampled code outputs as supervised fine-tuning data without a verifier, teacher, or reinforcement learning.

  3. 19

    FlashAttention-2: Faster Attention with Better Work Partitioning

    A follow-up episode on FlashAttention-2: once memory movement improves, the next gains come from better parallelism, less non-matmul work, and smarter warp/thread-block layout.

  4. 18

    FlashAttention: Exact Attention with IO-Awareness

    A deep dive into FlashAttention’s central insight: attention speed depends not only on arithmetic but also on moving less data between GPU memory levels.

  5. 17

    Distributed Training Architecture: From GPU Kernels to Cluster Design

    A systems episode that connects the individual techniques into an architecture-level view: GPUs, memory hierarchy, interconnects, scheduling, fault tolerance, and efficiency.

  6. 16

    Fully Sharded Data Parallel: Fewer Copies, Larger Models

    A practical episode on FSDP: how it shards model parameters across data-parallel workers and why it became a production-friendly path for training bigger models.

  7. 15

    ZeRO: Removing Memory Redundancy from Data Parallel Training

    A deep dive into ZeRO’s memory model: optimizer states, gradients, and parameters are not sacred copies; they can be partitioned across workers.

  8. 14

    Megatron-LM: Tensor Parallelism Inside the Transformer

    A technical but accessible episode on Megatron-LM’s intra-layer model parallelism and why splitting matrix operations inside Transformer layers became a central scaling technique.

  9. 13

    GPipe: Training Giant Networks with Pipeline Parallelism

    A deep dive into GPipe, the paper that made layer-wise pipeline parallelism feel like a general recipe for training giant sequential networks.

  10. 12

    Advanced Distributed Training: Overcoming Bottlenecks

    A map of the distributed-training bottlenecks that decide whether a large language model can be trained at all: memory, communication, data movement, pipeline bubbles, and utilization.

  11. 11

    Scaling Data-Constrained Language Models: When Fresh Text Runs Out

    A deep dive into data-constrained scaling, explaining repeated data, effective tokens, diminishing returns, and the training-data bottleneck.

  12. 10

    Training Compute-Optimal Large Language Models: The Chinchilla Lesson

    A deep dive into Chinchilla and compute-optimal training, explaining why many large models were undertrained and why tokens should scale with parameters.

  13. 9

    Scaling Laws for Neural Language Models: Predicting Progress

    A deep dive into Scaling Laws for Neural Language Models, explaining power-law loss trends, compute allocation, forecasting, and the limits of loss as a proxy.

  14. 8

    Scaling and Training Large Models Efficiently

    A topic-level overview of efficient large-model scaling, introducing parameters, tokens, compute, data quality, compute optimality, and the core disagreements around scale.

  15. 7

    Attention Is All You Need: The Transformer Breakthrough

    A deep dive into Attention Is All You Need, covering self-attention, multi-head attention, positional encodings, masking, and why the Transformer changed sequence modeling.

  16. 6

    Foundations of Sequence Modeling: The Transformer Revolution

    A topic-level overview of the Transformer revolution, explaining attention as learned information routing and setting up the Attention Is All You Need deep dive.

  17. 5

    Series Overview: Mastering Language Models from Architecture to Optimization

    A conversational roadmap for the full language-model series, introducing the expert mental models and major disagreements that connect architecture, scaling, optimization, fine-tuning, RLHF, open models, and sparse experts.

HOSTED BY

William Liu
