All Episodes

Mastering Language Models: From Architecture to Optimization — 17 episodes

| Title | Date | Duration |
| --- | --- | --- |
| Train-to-Test Scaling: Why Overtraining Can Become Compute-Optimal | Apr 26, 2026 | 11:17 |
| SSD: Self-Distillation for Code Generation Without a Teacher | Apr 26, 2026 | 11:52 |
| FlashAttention-2: Faster Attention with Better Work Partitioning | Apr 26, 2026 | 11:09 |
| FlashAttention: Exact Attention with IO-Awareness | Apr 26, 2026 | 10:59 |
| Distributed Training Architecture: From GPU Kernels to Cluster Design | Apr 26, 2026 | 7:36 |
| Fully Sharded Data Parallel: Fewer Copies, Larger Models | Apr 26, 2026 | 10:58 |
| ZeRO: Removing Memory Redundancy from Data Parallel Training | Apr 26, 2026 | 11:04 |
| Megatron-LM: Tensor Parallelism Inside the Transformer | Apr 26, 2026 | 11:29 |
| GPipe: Training Giant Networks with Pipeline Parallelism | Apr 26, 2026 | 11:28 |
| Advanced Distributed Training: Overcoming Bottlenecks | Apr 26, 2026 | 12:17 |
| Scaling Data-Constrained Language Models: When Fresh Text Runs Out | Apr 26, 2026 | 11:06 |
| Training Compute-Optimal Large Language Models: The Chinchilla Lesson | Apr 26, 2026 | 11:10 |
| Scaling Laws for Neural Language Models: Predicting Progress | Apr 26, 2026 | 10:59 |
| Scaling and Training Large Models Efficiently | Apr 26, 2026 | 11:12 |
| Attention Is All You Need: The Transformer Breakthrough | Apr 26, 2026 | 12:38 |
| Foundations of Sequence Modeling: The Transformer Revolution | Apr 26, 2026 | 11:35 |
| Series Overview: Mastering Language Models from Architecture to Optimization | Apr 26, 2026 | 11:35 |