EPISODE · Jun 11, 2026 · 10 MIN
How SRE Teams Use Observability to Find Unknown Unknowns
from The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering · host Fexingo
Episode 45 of The Site Reliability Podcast digs into observability: how modern SRE teams go beyond monitoring to discover the "unknown unknowns" that cause the worst outages. Lucas and Luna break down the difference between watching known metrics (CPU, memory) and exploring unknown failure modes with structured events and high-cardinality data. They walk through a real example: a major e-commerce platform that lost $340,000 in seven minutes during a 2023 flash sale because its monitoring did not catch a latency spike in a new authentication microservice. They explain how distributed tracing and log-based metrics surfaced the root cause after the fact, and how the team now uses observability-driven dashboards to spot anomalies before they become incidents. The episode also covers three practical steps (start with one service, instrument it with OpenTelemetry, and build a culture of exploration) so listeners can apply observability in their own SRE practice. No ads, just actionable engineering insights.
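To make the idea of structured events and high-cardinality data concrete, here is a minimal sketch in plain Python. It is not taken from the episode and does not use the OpenTelemetry API; the field names (`user_id`, `build_id`), the helper functions, and the sample numbers are all hypothetical and chosen only for illustration. The point it tries to show is that one wide event per request can be sliced by any field after the fact, including fields nobody thought to build a dashboard for.

```python
import json
import statistics
from collections import defaultdict


def make_event(service, route, user_id, build_id, duration_ms, status):
    """Build one wide, structured event describing a single request.

    All field names here are hypothetical, not a standard schema.
    """
    return {
        "service": service,
        "route": route,
        "user_id": user_id,    # high-cardinality: many distinct values
        "build_id": build_id,  # high-cardinality: changes with every deploy
        "duration_ms": duration_ms,
        "status": status,
    }


def median_by(events, field):
    """Group events by an arbitrary field and return median latency per value."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event["duration_ms"])
    return {value: statistics.median(durations) for value, durations in groups.items()}


def slowest(events, field):
    """Return the (value, median latency) pair with the highest median latency."""
    medians = median_by(events, field)
    return max(medians.items(), key=lambda item: item[1])


# Invented sample data: an auth service whose newest build is slow.
events = [
    make_event("checkout", "/pay", "u1", "b41", 120, 200),
    make_event("checkout", "/pay", "u2", "b41", 110, 200),
    make_event("auth", "/login", "u3", "b42", 2400, 200),
    make_event("auth", "/login", "u4", "b42", 2600, 504),
    make_event("auth", "/login", "u5", "b41", 130, 200),
]

print(json.dumps(events[0]))        # one event, serialized as a structured log line
print(slowest(events, "service"))   # slice by a familiar dimension
print(slowest(events, "build_id"))  # slice by a dimension no fixed dashboard tracks
```

In this invented data set, grouping by `service` points at `auth`, and grouping by `build_id` narrows it to build `b42`. A fixed CPU or memory dashboard would not have asked that second question, which is the kind of exploration the episode description contrasts with traditional monitoring.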