GPU Inference Performance: The Compute Lie Killing Your AI Latency

from M365.FM - Modern work, security, and productivity with Microsoft 365 · host Mirko Peters - Founder of m365.fm, m365.show and m365con.net

(00:00:00) The Mysterious GPU Slowdown
(00:03:31) The Silent Saboteur: CPU Fallback
(00:07:43) The Hidden Pitfalls of Version Mismatch
(00:12:24) The Container Culprit: Efficiency Erosion
(00:16:52) The Remedy: Provable Acceleration
(00:22:05) Closing Thoughts and Next Steps

In this episode of M365.fm, Mirko Peters investigates a familiar horror story in AI operations: GPU bills climbing while GPU utilization is near zero and latency quietly explodes. He dissects a real text-to-image Stable Diffusion workload where everything looks right on paper (ONNX/TensorRT, NVIDIA GPUs, containers, CI-controlled rollouts), yet requests crawl and P95 latency blows past every SLO.

WHAT YOU WILL LEARN

- Why your "GPU-accelerated" service may actually be running on CPU without telling you
- How CPU fallback in ONNX Runtime works and why it almost never raises a visible error
- How subtle CUDA / ONNX Runtime / TensorRT version mismatches destroy fused kernels and fast paths
- How container misconfiguration (missing device mounts, wrong nvidia-container-toolkit setup) turns accelerators into expensive heaters
- Which three metrics (latency, throughput, and GPU utilization) tell you the truth when dashboards lie

THE CORE INSIGHT

Most AI outages at scale aren't about the model; they're about infrastructure honesty. Your system will happily "work" on the wrong execution provider, with degraded kernels, or with no GPU attached at all, and it will do so silently unless you force it to prove otherwise. Mirko shows how provider order, capability logs, and device mounts form the real chain of evidence for whether your GPUs are actually doing the work you're paying for.

You'll hear a detailed walk-through of "Evidence File A": CPU fallback as the quiet saboteur. ONNX Runtime tries TensorRT, then CUDA, then shrugs and runs everything on CPU when drivers, libraries, or device mounts don't line up, logging a single line most teams never read. The service stays green, but GPU duty cycles hover at 5%, CPU cores peg, P50 latency quadruples, and P95 unravels under bursty traffic as autoscaling happily spreads the defect across more replicas.

Then, in "Evidence File B," Mirko explores version drift: combinations of CUDA, cuDNN, ONNX Runtime, and TensorRT that technically run but miss fused attention kernels, FP16 paths, and tensor core optimizations. Engines deserialize with warnings, fall back to generic kernels, and keep responding, just slower and with more memory. Utilization charts look "busy enough," but PCIe and memory movement dominate, and your cost per request quietly spikes.

Most teams treat containerization and CI as safety nets; here you'll see how they can just as easily freeze defects in amber when you don't assert GPU health at startup. Mirko outlines concrete countermeasures:

- Hard-fail if GPU providers aren't present
- Validate IO binding with a warm-up inference
- Enforce latency gates during rollout
- Build canary prompts that exercise the fused kernels you care about

In other words, trade a bit of availability at deploy time for integrity and predictable performance in production.

WHO THIS EPISODE IS FOR

This episode is ideal for ML engineers, MLOps and platform teams, SREs, and cloud architects running GPU-backed inference for diffusion models and other heavy workloads. If your GPU bill is high, your latency is unstable, and your dashboards insist everything is fine, this conversation will give you a field manual for proving whether your accelerators are actually accelerating, and what to fix when they're not.

ABOUT THE HOST

Mirko Peters is a Microsoft 365 and cloud consultant who helps organizations turn AI infrastructure from expensive experiments into reliable, observable production systems. Through M365.fm, Mirko shares real incident stories, performance forensics, and hard-won patterns that help teams keep their GPUs honest and their SLOs intact.

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support
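
EXAMPLE: STARTUP AND ROLLOUT CHECKS

Two of the countermeasures named in the episode description, hard-failing when no GPU provider is present and enforcing a latency gate during rollout, can be sketched in a few lines of Python. This is a minimal illustration, not the method presented in the episode. The helper names (assert_gpu_provider, p95, latency_gate) and the sample latencies are hypothetical. It assumes the list of active providers comes from ONNX Runtime, for example from get_providers() on an InferenceSession, and it is written against plain lists so it runs without a GPU.

```python
# Execution provider names as ONNX Runtime reports them.
GPU_PROVIDERS = ("TensorrtExecutionProvider", "CUDAExecutionProvider")


def assert_gpu_provider(active_providers):
    """Return the first active GPU provider, or raise so the service fails at startup.

    `active_providers` is assumed to be the list a session reports,
    e.g. onnxruntime.InferenceSession(...).get_providers().
    """
    for name in active_providers:
        if name in GPU_PROVIDERS:
            return name
    raise RuntimeError(
        f"No GPU execution provider active (got {list(active_providers)}); "
        "refusing to serve on CPU fallback."
    )


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_gate(latencies_ms, budget_ms):
    """True if the measured P95 is within the budget (hypothetical rollout gate)."""
    return p95(latencies_ms) <= budget_ms


if __name__ == "__main__":
    # Hypothetical values standing in for a real session and warm-up run.
    print(assert_gpu_provider(["CUDAExecutionProvider", "CPUExecutionProvider"]))
    print(latency_gate([120, 130, 125, 140, 135], budget_ms=200))
```

A registered GPU provider does not by itself show that the model ran on the GPU, which is why the description pairs this check with a warm-up inference and canary prompts; the latency samples passed to the gate would come from those runs.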

Frequently Asked Questions

How long is this episode of M365.FM - Modern work, security, and productivity with Microsoft 365?

This episode is 22 minutes long.

When was this M365.FM - Modern work, security, and productivity with Microsoft 365 episode published?

This episode was published on November 29, 2025.
