EPISODE · Jun 13, 2026 · 8 MIN
How SRE Teams Use Game Days to Build Incident Muscle Memory
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
Lucas and Luna explore how site reliability engineering teams use game days (structured, simulated incident exercises) to prepare for real outages. They break down the approach of a major fintech company that runs quarterly game days for its entire on-call rotation, using concrete scenarios such as a simulated database failover and a DNS misconfiguration. The episode covers how game days differ from chaos engineering, why they build "muscle memory" faster than reading runbooks, and how teams measure improvement by tracking time-to-acknowledge and time-to-resolve across repeated drills. Lucas shares data from a 2025 industry survey showing that teams running at least four game days per year had a mean time to resolve 40 percent lower than teams that do not. Luna presses on the common failure modes: overly scripted scenarios, leaving out non-engineering stakeholders, and treating game days as pass-fail tests rather than learning exercises. They wrap up with practical advice for starting small: a one-hour scenario with three people and a clear objective.
#GameDays #IncidentSimulation #SRE #SiteReliabilityEngineering #OnCallPreparation #IncidentResponse #MuscleMemory #ChaosEngineering #RunbookAutomation #MeanTimeToResolve #Fintech #ProductionEngineering #Uptime #Resilience #LearningExercises #Technology #FexingoBusiness #BusinessPodcast
Keep every episode free: buymeacoffee.com/fexingo
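As a rough illustration of tracking time-to-acknowledge and time-to-resolve across repeated drills, the Python sketch below compares two runs of the same scenario. The `Drill` class, the `percent_change` helper, the drill names, and all timestamps are hypothetical and invented for this example; none of them come from the episode or from the survey it cites.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Drill:
    """One game day run. All fields are illustrative assumptions."""
    name: str
    started: datetime       # when the simulated fault was injected
    acknowledged: datetime  # when the on-call engineer acknowledged the page
    resolved: datetime      # when the scenario was declared resolved

    @property
    def tta_minutes(self) -> float:
        """Time-to-acknowledge, in minutes."""
        return (self.acknowledged - self.started).total_seconds() / 60

    @property
    def ttr_minutes(self) -> float:
        """Time-to-resolve, in minutes, measured from the start of the drill."""
        return (self.resolved - self.started).total_seconds() / 60


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`; negative means faster."""
    return (after - before) / before * 100


# Made-up timestamps for two runs of the same database failover scenario.
drills = [
    Drill("db-failover, first run",
          datetime(2026, 1, 15, 10, 0),
          datetime(2026, 1, 15, 10, 8),
          datetime(2026, 1, 15, 11, 0)),
    Drill("db-failover, second run",
          datetime(2026, 4, 15, 10, 0),
          datetime(2026, 4, 15, 10, 4),
          datetime(2026, 4, 15, 10, 45)),
]

first, second = drills
tta_change = percent_change(first.tta_minutes, second.tta_minutes)
ttr_change = percent_change(first.ttr_minutes, second.ttr_minutes)

for d in drills:
    print(f"{d.name}: TTA {d.tta_minutes:.0f} min, TTR {d.ttr_minutes:.0f} min")
print(f"TTA change: {tta_change:.0f}%, TTR change: {ttr_change:.0f}%")
```

Comparing the same scenario across runs keeps the numbers meaningful. A change in time-to-resolve between two different scenarios would mostly reflect how hard each scenario was, not how much the team improved.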