EPISODE · May 29, 2026 · 8 MIN
How SRE Teams Use Capacity Planning to Prevent Black Friday Outages
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
In this episode, Lucas and Luna explore how site reliability engineering (SRE) teams use capacity planning to avoid catastrophic outages during peak traffic events like Black Friday and Cyber Monday. They break down the methodology used by major e-commerce platforms, including 'headroom targets' and 'traffic shaping', techniques that go beyond simple auto-scaling. Lucas explains how teams model demand using historical data and synthetic load testing, and why many companies still get caught off guard by the 'thundering herd' problem. The conversation also covers real-world examples from retailers who learned hard lessons after underestimating mobile traffic surges. Luna challenges the common assumption that adding more servers is always the answer, and the two discuss the trade-offs between cost optimization and reliability. A must-listen for engineers, SREs, and anyone responsible for keeping services running under extreme load. Keep every episode free: buymeacoffee.com/fexingo