EPISODE · Nov 29, 2025 · 22 MIN

GPU Inference Performance: The Compute Lie Killing Your AI Latency

from M365.FM - Modern work, security, and productivity with Microsoft 365 · host Mirko Peters - Founder of m365.fm, m365.show and m365con.net

(00:00:00) The Mysterious GPU Slowdown
(00:03:31) The Silent Saboteur: CPU Fallback
(00:07:43) The Hidden Pitfalls of Version Mismatch
(00:12:24) The Container Culprit: Efficiency Erosion
(00:16:52) The Remedy: Provable Acceleration
(00:22:05) Closing Thoughts and Next Steps

In this episode of M365.fm, Mirko Peters investigates a familiar horror story in AI operations: GPU bills climbing while GPU utilization is near zero and latency quietly explodes. He dissects a real text-to-image Stable Diffusion workload where everything on paper looks right (ONNX/TensorRT, NVIDIA GPUs, containers, CI-controlled rollouts), yet requests crawl and P95 latency blows past every SLO.

WHAT YOU WILL LEARN

- Why your “GPU-accelerated” service may actually be running on CPU without telling you
- How CPU fallback in ONNX Runtime works and why it almost never raises a visible error
- How subtle CUDA / ONNX Runtime / TensorRT version mismatches destroy fused kernels and fast paths
- How container misconfiguration (missing device mounts, wrong nvidia-container-toolkit setup) turns accelerators into expensive heaters
- Which three metrics (latency, throughput, and GPU utilization) tell you the truth when dashboards lie

THE CORE INSIGHT

Most AI outages at scale aren’t about the model; they’re about infrastructure honesty. Your system will happily “work” on the wrong execution provider, with degraded kernels, or with no GPU attached at all, and it will do so silently unless you force it to prove otherwise. Mirko shows how provider order, capability logs, and device mounts form the real chain of evidence for whether your GPUs are actually doing the work you’re paying for.

You’ll hear a detailed walk-through of “Evidence File A”: CPU fallback as the quiet saboteur. ONNX Runtime tries TensorRT, then CUDA, then shrugs and runs everything on CPU when drivers, libraries, or device mounts don’t line up, logging a single line most teams never read. The service stays green, but GPU duty cycles hover at 5%, CPU cores peg, P50 latency quadruples, and P95 unravels under bursty traffic as autoscale happily spreads the defect across more replicas.

Then in “Evidence File B,” Mirko explores version drift: CUDA, cuDNN, ONNX Runtime, and TensorRT that technically run but miss fused attention kernels, FP16 paths, and tensor core optimizations. Engines deserialize with warnings, fall back to generic kernels, and keep responding, just slower and more memory-hungry. Utilization charts look “busy enough,” but PCIe and memory movement dominate, and your cost per request quietly spikes.

Most teams treat containerization and CI as safety nets; here you’ll see how they can just as easily freeze defects in amber when you don’t assert GPU health at startup. Mirko outlines concrete countermeasures: hard-fail if GPU providers aren’t present, validate IO binding with a warm-up inference, enforce latency gates during rollout, and build canary prompts that exercise the fused kernels you care about. In other words, trade a bit of availability at deploy time for integrity and predictable performance in production.

WHO THIS EPISODE IS FOR

This episode is ideal for ML engineers, MLOps and platform teams, SREs, and cloud architects running GPU-backed inference for diffusion models and other heavy workloads. If your GPU bill is high, your latency is unstable, and your dashboards insist everything is fine, this conversation will give you a field manual for proving whether your accelerators are actually accelerating, and what to fix when they’re not.

ABOUT THE HOST

Mirko Peters is a Microsoft 365 and cloud consultant who helps organizations turn AI infrastructure from expensive experiments into reliable, observable production systems. Through M365.fm, Mirko shares real incident stories, performance forensics, and hard-won patterns that help teams keep their GPUs honest and their SLOs intact.

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.
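
As a rough illustration of two of the countermeasures described in the episode summary (hard-failing when no GPU provider is present, and enforcing a latency gate on warm-up inference), here is a minimal Python sketch. It is not taken from the episode. The function names `assert_gpu_provider`, `percentile`, and `warmup_gate`, the number of warm-up runs, and the latency budget are all hypothetical. The provider strings are the names ONNX Runtime uses for its TensorRT and CUDA execution providers. The sketch does not validate IO binding, and a GPU provider appearing in the list does not guarantee that every graph node runs on the GPU.

```python
import math
import time
from typing import Callable, Sequence

# Execution provider names as ONNX Runtime reports them.
GPU_PROVIDERS = ("TensorrtExecutionProvider", "CUDAExecutionProvider")


def assert_gpu_provider(active: Sequence[str]) -> str:
    """Return the first GPU provider found in `active`, or raise if none is present."""
    for name in active:
        if name in GPU_PROVIDERS:
            return name
    raise RuntimeError(f"No GPU execution provider active; got {list(active)}")


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def warmup_gate(run_once: Callable[[], object], runs: int, p95_budget_s: float) -> float:
    """Time `runs` warm-up calls and raise if their P95 latency exceeds the budget."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    p95 = percentile(latencies, 95)
    if p95 > p95_budget_s:
        raise RuntimeError(f"Warm-up P95 {p95:.3f}s exceeds budget {p95_budget_s:.3f}s")
    return p95


# Hypothetical wiring at service startup (requires onnxruntime and a model file):
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx",  # placeholder path
#       providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
#   )
#   assert_gpu_provider(session.get_providers())
#   warmup_gate(lambda: session.run(None, canary_inputs), runs=20, p95_budget_s=2.0)
```

Letting either function raise at startup makes the replica fail its deployment instead of serving slowly, which is the trade the summary describes: a bit of availability at deploy time in exchange for predictable performance in production.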


