EPISODE · May 27, 2026 · 8 MIN
How SRE Teams Use Chaos Engineering for Non-Netflix Systems
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical database replication bug that would have caused a 45-minute outage during Black Friday. Lucas explains the difference between literal chaos (randomly killing servers) and structured chaos experiments guided by a game day calendar and error budget thresholds. Luna pushes back on whether small teams can afford the complexity, and Lucas cites an open-source tool, Litmus, that reduced BlinkMart's mean time to detect from 12 minutes to under 90 seconds. The conversation covers the three pillars of a practical chaos program: hypothesis-driven experiment design, a blast radius limit (no more than 5% of user traffic), and a rollback trigger. They also discuss how chaos engineering shifts team culture from fear of failure to curiosity about failure modes. In the donation segment, they note that BlinkMart's VP of engineering said the chaos experiments cost less than half the projected revenue loss from a single outage, which made the ROI argument simple for the board.
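The three pillars named above can be sketched in code. This is a minimal illustration, not BlinkMart's setup and not the Litmus API: the class, field names, rollback threshold, and sample hypothesis are all hypothetical. Only the 5% blast radius cap comes from the episode description.

```python
from dataclasses import dataclass
from typing import Iterable

# From the episode description: no more than 5% of user traffic.
MAX_BLAST_RADIUS = 0.05


@dataclass
class ChaosExperiment:
    """Hypothetical experiment definition covering the three pillars."""
    hypothesis: str             # pillar 1: what should stay true under the fault
    blast_radius: float         # pillar 2: fraction of user traffic exposed
    rollback_error_rate: float  # pillar 3: abort at or above this error rate (assumed metric)

    def validate(self) -> None:
        # Refuse to run anything that lacks a hypothesis or exceeds the cap.
        if not self.hypothesis.strip():
            raise ValueError("experiment needs a stated hypothesis")
        if not 0.0 < self.blast_radius <= MAX_BLAST_RADIUS:
            raise ValueError("blast radius must be above 0 and at most 5%")
        if not 0.0 < self.rollback_error_rate < 1.0:
            raise ValueError("rollback error rate must be between 0 and 1")

    def should_roll_back(self, observed_error_rate: float) -> bool:
        return observed_error_rate >= self.rollback_error_rate


def run(experiment: ChaosExperiment, error_rate_samples: Iterable[float]) -> str:
    """Check observed error rates in order and stop at the first rollback trigger."""
    experiment.validate()
    for sample in error_rate_samples:
        if experiment.should_roll_back(sample):
            return "rolled_back"
    return "hypothesis_held"


if __name__ == "__main__":
    experiment = ChaosExperiment(
        hypothesis="Checkout stays available while one database replica is down",
        blast_radius=0.05,
        rollback_error_rate=0.02,
    )
    print(run(experiment, [0.001, 0.004, 0.003]))
```

In a real program the error-rate samples would come from monitoring rather than a list, and the rollback would undo the injected fault. The sketch only shows where each pillar acts as a gate.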