EPISODE · Jun 9, 2026 · 10 MIN
How SRE Teams Use Chaos Engineering to Test Resilience
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
In episode 40 of The Site Reliability Podcast, Lucas and Luna dive into chaos engineering: the practice of intentionally breaking systems to find weaknesses before real incidents strike. They explore how Netflix pioneered the approach with Chaos Monkey, the lessons SRE teams can learn from controlled failure experiments, and how to start small with simple game days that simulate a database partition or a DNS failure. Lucas breaks down the difference between load testing and chaos testing, and why the goal isn't to break everything but to build confidence in your system's ability to recover. They also discuss common pitfalls like running experiments during peak traffic or without proper observability in place. Whether you're a seasoned SRE or just starting to think about resilience, this episode gives you a concrete framework for making your systems more robust, one controlled explosion at a time. Plus, Lucas and Luna explain why keeping this podcast ad-free matters and how listener support makes it possible.
#ChaosEngineering #SRE #SiteReliabilityEngineering #Netflix #ChaosMonkey #ResilienceTesting #FailureInjection #ProductionTesting #Uptime #IncidentResponse #Observability #GameDays #FaultTolerance #CloudEngineering #DevOps #Technology #FexingoBusiness #BusinessPodcast
Keep every episode free: buymeacoffee.com/fexingo