EPISODE · May 27, 2026 · 8 MIN

How SRE Teams Use Chaos Engineering for Non-Netflix Systems

from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo

Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical database replication bug that would have caused a 45-minute outage during Black Friday. Lucas explains the difference between literal chaos (randomly killing servers) and structured chaos experiments guided by a game day calendar and error budget thresholds. Luna pushes back on whether small teams can afford the complexity, and Lucas cites an open-source tool, Litmus, that reduced BlinkMart's mean time to detect from 12 minutes to under 90 seconds. The conversation covers the three pillars of a practical chaos program: hypothesis-driven experiment design, a blast radius limit (no more than 5% of user traffic), and a rollback trigger. They also discuss how chaos engineering shifts team culture from fear of failure to curiosity about failure modes. Donation segment: BlinkMart's VP of engineering mentioned that the chaos experiments cost less than half the projected revenue loss from a single outage, making the ROI argument simple for the board.

#ChaosEngineering #SRE #SiteReliability #BlinkMart #FailureInjection #Litmus #GameDays #BlastRadius #ErrorBudget #MeanTimeToDetect #BlackFriday #DatabaseReplication #IncidentResponse #ResilienceEngineering #OpenSource #Technology #FexingoBusiness #BusinessPodcast

Keep every episode free: buymeacoffee.com/fexingo
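The three pillars named in the summary (a hypothesis, a blast radius limit, and a rollback trigger) can be sketched in a few lines of Python. This is only an illustration, not BlinkMart's setup or the Litmus API: the `ChaosExperiment` class, the `run_experiment` function, the `steady_state_ok` probe, and the 20% error budget floor are all hypothetical names and values. Only the 5% traffic cap comes from the episode description.

```python
from dataclasses import dataclass
from typing import Callable

# 5% of user traffic: the blast radius limit quoted in the episode summary.
MAX_BLAST_RADIUS = 0.05


@dataclass
class ChaosExperiment:
    hypothesis: str                      # what should stay true while the fault is active
    blast_radius: float                  # fraction of user traffic exposed (0.0 to 1.0)
    steady_state_ok: Callable[[], bool]  # hypothetical probe: True while the hypothesis holds
    inject: Callable[[], None]           # starts the fault (e.g. pause a replica)
    rollback: Callable[[], None]         # undoes the fault


def run_experiment(exp: ChaosExperiment,
                   error_budget_remaining: float,
                   min_budget: float = 0.2,  # assumed threshold, not from the episode
                   checks: int = 3) -> str:
    """Run one experiment and return a short status string."""
    # Pillar 2: refuse experiments that expose too much traffic.
    if exp.blast_radius > MAX_BLAST_RADIUS:
        return "rejected: blast radius too large"
    # Gate on the error budget so experiments pause when reliability is already strained.
    if error_budget_remaining < min_budget:
        return "skipped: error budget too low"

    exp.inject()
    try:
        # Pillar 1: test the hypothesis repeatedly while the fault is active.
        for _ in range(checks):
            if not exp.steady_state_ok():
                # Pillar 3: a violated hypothesis triggers an early exit and rollback.
                return "aborted: hypothesis violated"
        return "passed"
    finally:
        # Rollback runs on every path after injection, including unexpected errors.
        exp.rollback()
```

In this sketch the rollback lives in a `finally` block, so the fault is removed whether the hypothesis holds, fails, or the probe itself raises. A real program would also need timeouts and alerting, which are left out here.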

Frequently Asked Questions

How long is this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

This episode is 8 minutes long.

When was this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering published?

This episode was published on May 27, 2026.

What is this episode about?

Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical...

Can I download this episode of The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.