EPISODE · Apr 30, 2024 · 55 MIN
Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228
from MLOps.community · host Demetrios
Join us at our first in-person conference on June 25, all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. Developed & deployed ML models at WEB and Big scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract
The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, and how big are the checkpoints? It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.

// Bio
Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board
jobs.mlops.community

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:
[00:00] Simon's preferred beverage
[01:23] Takeaways
[04:22] Simon's tech background
[08:42] Zombie models garbage collection
[10:52] The road to LLMs
[15:09] Trained models Simon worked on
[16:26] LLM Checkpoints
[20:36] Confidence in AI Training
[22:07] Different Checkpoints
[25:06] Checkpoint parts
[29:05] Slurm vs Kubernetes
[30:43] Storage choices lessons
[36:02] Paramount components for setup
[37:13] Argo workflows
[39:49] Kubernetes node troubleshooting
[42:35] Cloud virtual machines have pre-installed mentoring
[45:41] Fine-tuning
[48:16] Storage, networking, and complexity in network design
[50:56] Start simple before advanced; consider model needs.
[53:58] Join us at our first in-person conference on June 25, all about AI Quality