EPISODE · May 26, 2026 · 10 MIN

How Microsoft SREs Automate Capacity Planning at Cloud Scale

from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo

Episode 13 of The Site Reliability Podcast explores how Microsoft's SRE teams automate capacity planning to keep Azure running smoothly despite unpredictable demand. Lucas and Luna break down the three-layer approach (demand forecasting, headroom management, and autoscaling) and walk through a real case where a retail giant's Black Friday traffic spike was absorbed without a single incident. They discuss the tension between efficiency and resilience, how SREs use historical traffic patterns and machine learning to predict compute needs, and why over-provisioning isn't always the answer. Listeners will learn how capacity planning has evolved from a manual quarterly spreadsheet exercise into a continuous, automated feedback loop, and why that shift is critical for any organization running infrastructure at scale.

#SRE #CapacityPlanning #Azure #Microsoft #CloudComputing #Autoscaling #DemandForecasting #SiteReliabilityEngineering #IncidentPrevention #BlackFriday #Retail #MachineLearning #Observability #Uptime #Infrastructure #Technology #FexingoBusiness #BusinessPodcast

Keep every episode free: buymeacoffee.com/fexingo

Frequently Asked Questions

How long is this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

This episode is 10 minutes long.

When was this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering published?

This episode was published on May 26, 2026.

What is this episode about?

Episode 13 of The Site Reliability Podcast explores how Microsoft's SRE teams automate capacity planning to keep Azure running smoothly despite unpredictable demand. Lucas and Luna break down the three-layer approach — demand forecasting, headroom...

Can I download this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.