EPISODE · Jun 3, 2026 · 14 MIN

How Cloud SREs Use Circuit Breakers to Prevent Cascading Failures

from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo

When a single service fails, the whole system shouldn't collapse. In this episode, Lucas and Luna dive into the circuit breaker pattern, a critical resilience tool in site reliability engineering. They break down how Netflix's Hystrix inspired modern implementations, how companies like Amazon and Lyft use circuit breakers to isolate failures, and why a poorly tuned breaker can make an outage worse. Lucas explains the three states (closed, open, and half-open) and the math behind the thresholds. Luna challenges the conventional wisdom about when to trigger half-open probes. The hosts also discuss the trade-off between circuit breakers and client-side retry logic, and share a real example of a broken circuit breaker causing a cascading failure in a major payment processor. For SRE teams building their own or using libraries like Resilience4j, this episode offers practical guidance on tuning thresholds, measuring success, and avoiding common pitfalls.

#CircuitBreaker #SRE #SiteReliabilityEngineering #Resilience #CascadingFailure #Netflix #Hystrix #Amazon #Lyft #Resilience4J #FaultTolerance #Microservices #CloudComputing #Technology #SystemsDesign #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast

Keep every episode free: buymeacoffee.com/fexingo
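For readers who want a concrete picture of the three states named in the description, here is a minimal Python sketch. It is not taken from the episode, Hystrix, or Resilience4j: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all hypothetical. It also assumes a simple count of consecutive failures, whereas production libraries typically trip on a failure rate measured over a sliding window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through; failures are counted
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half-open"  # a probe call is allowed through


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens at once; otherwise open at the threshold.
        if (self.state is State.HALF_OPEN
                or self.failures >= self.failure_threshold):
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.state = State.CLOSED
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to fall back rather than retry. This sketch is single-threaded; a shared breaker would need locking around its state changes.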

Frequently Asked Questions

How long is this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

This episode is 14 minutes long.

When was this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering published?

This episode was published on June 3, 2026.

Can I download this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.