EPISODE · Jun 3, 2026 · 14 MIN
How Cloud SREs Use Circuit Breakers to Prevent Cascading Failures
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
When a single service fails, the whole system shouldn't collapse. In this episode, Lucas and Luna dive into the circuit breaker pattern, a critical resilience tool in site reliability engineering. They break down how Netflix's Hystrix inspired modern implementations, how companies like Amazon and Lyft use circuit breakers to isolate failures, and why a poorly tuned breaker can make an outage worse. Lucas explains the three states (closed, open, and half-open) and the math behind the thresholds. Luna challenges the conventional wisdom about when to trigger half-open probes. The hosts also discuss the trade-off between circuit breakers and client-side retry logic, and share a real example of a broken circuit breaker causing a cascading failure at a major payment processor. For SRE teams building their own breaker or using libraries like Resilience4j, this episode offers practical guidance on tuning thresholds, measuring success, and avoiding common pitfalls.
#CircuitBreaker #SRE #SiteReliabilityEngineering #Resilience #CascadingFailure #Netflix #Hystrix #Amazon #Lyft #Resilience4J #FaultTolerance #Microservices #CloudComputing #Technology #SystemsDesign #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast
Keep every episode free: buymeacoffee.com/fexingo
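For listeners new to the pattern, here is a minimal sketch of the three states named in the description. It is not taken from the episode, Hystrix, or Resilience4j. The class name `CircuitBreaker`, the consecutive-failure threshold, the fixed cool-down, and the single half-open probe are all assumptions chosen for illustration; production libraries typically use sliding-window failure rates instead.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls flow through; failures are counted
    OPEN = "open"            # calls are rejected without reaching the dependency
    HALF_OPEN = "half_open"  # one probe call is allowed to test recovery


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: trips after N consecutive failures (an assumption)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock                  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            self.state = State.HALF_OPEN  # cool-down elapsed: allow a probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED
```

Wrapping a dependency call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a placeholder) would reject calls fast while the breaker is open, rather than letting them queue up behind a failing service. Combining this with client-side retries needs care, since retries against an open breaker only add load.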