
EPISODE · Apr 10, 2026 · 1H 5M

We Cut LLM Latency by 70% in Production

from MLOps.community · host Demetrios

Maher Hanafi is an engineering leader who went from zero AI experience to self-hosting LLMs at enterprise scale: managing GPU costs, optimizing inference with TensorRT LLM, and building an AI platform for HR tech. In this conversation, he breaks down exactly how his team cut latency by 70%, reduced GPU spend through counterintuitive scaling strategies, and navigated the messy reality of taking AI from proof-of-concept to production.

How We Cut LLM Latency 70% With TensorRT in Production // MLOps Podcast #369 with Maher Hanafi, SVP of Engineering at Betterworks

Key topics covered:
The AI Iceberg: Why the invisible work behind AI (performance, latency, throughput, cost, accuracy) is harder than building the features themselves
GPU Cost Optimization: How upgrading to more expensive GPUs actually saved money by reducing total runtime hours
TensorRT LLM Deep Dive: Rewiring neural networks to match GPU architecture for 50-70% latency reduction
Cold Start Solutions: Using AWS FSx, baking models into container images, and cutting minutes off spin-up times
KV Cache & In-Flight Batching: Why using one model per GPU with maximum KV cache beats cramming multiple models together
Scheduled & Dynamic Scaling: Pattern-based scaling for HR tech workloads (nights, weekends, end-of-quarter spikes)
Verticalized AI Platform: Building horizontal AI infrastructure that serves multiple HR product verticals
AI Engineering Lab: How junior vs. senior engineers adopted AI coding tools differently, and the cultural shift that followed
Agentic Coding in Practice: Navigating AI coding agent costs, quality control, and redefining the SDLC
Chinese Models & Compliance: Why enterprise customers block DeepSeek/Qwen and the geopolitics of model training data

This episode is for engineering leaders building AI in production, MLOps engineers optimizing GPU infrastructure, and anyone navigating the gap between AI demos and enterprise-scale deployment.

Links & Resources:
TensorRT LLM: https://github.com/NVIDIA/TensorRT-LLM
NVIDIA Run:ai Model Streamer (cold start optimization): https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/
vLLM vs TensorRT-LLM comparison: https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them

Timestamps:
[00:00] Optimizing GPU Usage and Latency
[00:21] Learning AI as Leadership
[04:34] AI Cost Centers
[13:56] Throughput and Infrastructure Efficiency
[18:10] Scaling and Unit Economics
[24:14] Championing AI ROI
[36:11] Queue to Value Engine
[41:30] Failed Product Features
[46:12] Agentic Engineering Costs
[58:49] AI Self-Hosting in Engineering
[1:04:40] Wrap up



