PODCAST · business

The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept — error budgets, toil automation, postmortem culture, capacity planning — and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers — latency percentiles, MTTR trends, SLO burn rates — while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know

  1. 47

    How SRE Teams Use Postmortem Action Items to Prevent Recurrence

    In Episode 60, Lucas and Luna dive into the most overlooked part of incident response: the postmortem action items that actually prevent the same outage from happening twice. They unpack a 2025 study from Google's SRE team that found 67% of postmortem action items are never completed, and explore why. Using concrete examples from a major AWS S3 outage and a Stripe payment-processing incident, they discuss common failure modes like vague ownership, lack of prioritization, and action items that don't address root causes. The hosts also share practical tactics: assigning a single DRI per action, using 'blameless' language to increase completion rates, and tying action items directly to error budget burn. A must-listen for any engineer who has ever written a postmortem only to see the same incident happen again. #SiteReliabilityEngineering #SRE #Postmortem #IncidentResponse #ActionItems #GoogleSRE #AWS #Stripe #ErrorBudget #BlamelessCulture #ReliabilityEngineering #Uptime #IncidentManagement #RootCauseAnalysis #DevOps #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  2. 46

    How SRE Teams Use Incident Severity Classification to Prioritize Response

    Episode 59 of The Site Reliability Podcast explores how SRE teams classify incidents by severity to decide how fast to respond and who to page. Lucas and Luna break down real-world classification frameworks — from SEV-1 (service down, all hands on deck) to SEV-4 (minor hiccup, fix in the next sprint). They discuss why vague severity definitions lead to alert fatigue and slow response times, and how companies like Google and Stripe have standardized their severity matrices. Lucas shares a concrete example from a payment processing outage where misclassifying a SEV-2 as a SEV-3 delayed response by 45 minutes. Luna highlights the role of severity escalation policies and how automated detection can adjust severity based on customer impact. The hosts also touch on the tension between over-classifying (too many SEV-1s) and under-classifying (missing critical signals). A practical episode for any engineer who's ever argued about whether an incident is 'really' a SEV-2. #IncidentSeverity #SEV1 #SEV2 #SRE #SiteReliability #IncidentResponse #OnCall #Alerting #PagerDuty #GoogleSRE #Stripe #Classification #SeverityMatrix #Uptime #Tech #FexingoBusiness #BusinessPodcast #ProductionEngineering Keep every episode free: buymeacoffee.com/fexingo

  3. 45

    How SRE Teams Use Post-Incident Reviews as Learning Tools

    Episode 58 of The Site Reliability Podcast with Fexingo digs into post-incident reviews — not as blame sessions or compliance checkboxes, but as structured learning mechanisms. Lucas and Luna examine Google's seminal 2016 Titan key outage to illustrate how root cause analysis misses the point if teams don't ask 'why' five times. They discuss the difference between finding a 'root cause' and understanding systemic contributors, how to write blameless incident reports, and why the best SRE teams treat each P0 as a free systems audit. Concrete examples from Google, Etsy, and the 2023 Amazon Kinesis outage show how post-incident reviews reduce mean time to recover and surface recurring failure patterns. The episode also covers common traps — like over-focus on code fixes over process gaps, or documenting the review but never acting on it. Listeners walk away with a five-step template for running their own post-incident review that actually prevents the next outage. #PostIncidentReview #BlamelessCulture #SRE #SiteReliabilityEngineering #IncidentManagement #RootCauseAnalysis #LearningFromFailure #GoogleTitanKey #Etsy #AmazonKinesis #P0 #MTTR #SystemsThinking #ResilienceEngineering #IncidentResponse #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  4. 44

    How SRE Teams Use Cost of Delay to Prioritize Reliability Work

    Lucas and Luna explore how SRE teams at companies like Spotify and Etsy use 'cost of delay' — a concept borrowed from product management — to quantify the business impact of reliability work. Lucas explains the math behind deferring a reliability project, using a real-world example: a payment-processing team deciding whether to fix a latency issue or build a new feature. Luna pushes back on the difficulty of estimating delay costs, and they discuss a practical framework — weighted shortest job first (WSJF) — that helps teams rank reliability initiatives alongside feature work. The episode includes a concrete example: if deferring an SRE project by one quarter costs $200,000 in incident-related losses, the team can calculate the cost of delay per week and compare it to the effort required. Listeners learn how to present reliability investments in the language executives understand: dollars and time. The conversation closes with a reflection on how cost of delay changes the conversation from 'how reliable should we be?' to 'what happens if we defer this work?' #SiteReliabilityEngineering #CostOfDelay #WSJF #Spotify #Etsy #SREPrioritization #ReliabilityEngineering #IncidentResponse #Technology #BusinessCase #ProductManagement #WeightedShortestJobFirst #SREMetrics #LatencyOptimization #FexingoBusiness #BusinessPodcast #TechPodcast #SREPodcast Keep every episode free: buymeacoffee.com/fexingo
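
The cost-of-delay arithmetic from this episode can be sketched in a few lines of Python. This is a minimal illustration, not the hosts' own model: it assumes a 13-week quarter and the simplified WSJF form (cost of delay divided by job duration), and the function names, effort estimates, and the feature's cost of delay are hypothetical.

```python
WEEKS_PER_QUARTER = 13  # assumption: a quarter is treated as 13 weeks


def cost_of_delay_per_week(quarterly_cost: float) -> float:
    """Spread a quarterly delay cost evenly across the weeks of a quarter."""
    return quarterly_cost / WEEKS_PER_QUARTER


def wsjf_score(weekly_cost_of_delay: float, effort_weeks: float) -> float:
    """Simplified WSJF: cost of delay divided by job duration."""
    if effort_weeks <= 0:
        raise ValueError("effort_weeks must be positive")
    return weekly_cost_of_delay / effort_weeks


# The $200,000-per-quarter figure comes from the episode description.
reliability_cod = cost_of_delay_per_week(200_000)

# Hypothetical effort estimates and feature cost of delay, for comparison only.
candidates = {
    "fix latency issue": wsjf_score(reliability_cod, effort_weeks=4),
    "build new feature": wsjf_score(10_000, effort_weeks=6),
}

# Highest score first: the work that is most expensive to defer per week of effort.
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

With these made-up inputs the latency fix ranks first. The ranking is only as good as the delay-cost estimates, which is the difficulty Luna raises.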

  5. 43

    How SRE Teams Reduce Incident Noise with Intelligent Alert Routing

    Episode 56 of The Site Reliability Podcast explores how SRE teams at companies like Airbnb and Etsy use intelligent alert routing to slash incident noise by over 60 percent. Lucas and Luna break down the evolution from on-call pagers to modern event-driven routing, explain how machine learning models classify alerts by severity and team ownership, and discuss the trade-off between routing accuracy and latency. They also touch on the human side: how noise reduction cuts burnout and improves on-call experience. A must-listen for any SRE or platform engineer tired of being woken up for non-critical alerts. #SiteReliabilityPodcast #SRE #AlertRouting #IncidentManagement #OnCall #NoiseReduction #MachineLearning #Airbnb #Etsy #PagerDuty #OpsGenie #Burnout #ReliabilityEngineering #Technology #DevOps #ProductionEngineering #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  6. 42

    How SRE Teams Use Incident Cost Analysis to Prioritize Reliability Investments

    Episode 55 of The Site Reliability Podcast with Fexingo dives into incident cost analysis — a growing practice at companies like Google and Stripe where SRE teams assign a dollar value to every outage minute. Lucas and Luna break down the methodology: how to quantify direct revenue loss, reputational damage, and opportunity cost from incidents, and how that data helps teams justify automation spend, toil reduction, and architecture changes. They walk through a real example from a mid-size e-commerce platform that cut its annual incident cost by 40 percent after implementing this framework. The episode also covers common pitfalls, like overvaluing rare catastrophic events or ignoring compounding effects of small incidents. By the end, listeners will understand how to build a simple incident cost model and use it to make the case for reliability work in language the business understands. #SiteReliabilityEngineering #IncidentCostAnalysis #SRE #ReliabilityEngineering #ProductionEngineering #Uptime #IncidentResponse #CostOptimization #Automation #ToilReduction #Google #Stripe #BusinessCase #Technology #FexingoBusiness #BusinessPodcast #TechOps #DevOps Keep every episode free: buymeacoffee.com/fexingo

  7. 41

    How SRE Teams Use On-Call Compensation to Prevent Burnout

    Most SRE teams talk about incident response and automation, but fewer talk about the human side of on-call: how to pay people fairly for the disruption. Lucas and Luna dig into a 2025 survey of 500 SREs that found 62% feel on-call pay does not match the cognitive load. They compare models — flat stipend versus per-incident pay — and discuss how companies like Honeycomb and PagerDuty structure their on-call compensation. They also explore the link between fair pay and retention: teams with transparent on-call comp have 40% lower turnover. Listeners will walk away with a framework for evaluating whether their own on-call setup is sustainable. #SRE #SiteReliabilityEngineering #OnCallCompensation #BurnoutPrevention #IncidentResponse #CognitiveLoad #TechWorkers #DevOps #Honeycomb #PagerDuty #FairPay #Retention #WorkLifeBalance #SREBestPractices #Technology #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo

  8. 40

    SRE Teams Use SLO Burn Rate Alerts to Detect Incidents Faster

    Site reliability engineering has a well-known failure mode: your pager goes off at 2 AM for a minor blip, or worse, you don't get paged until a full-blown outage has already hit users. This episode explains SLO burn rate alerts, a concept that Google's SRE team refined in their 2016 book and which is now baked into tools like Google Cloud Monitoring, Datadog, and Grafana. Lucas and Luna walk through a concrete example: a latency SLO of 99.9 percent over a 30-day window, and how burning through the error budget at 10x or 1000x the target rate can trigger tiered alerts minutes before a crisis. The hosts discuss what burn rate actually means, why a single threshold alert fails, and how teams set up multiple alert windows (1 hour, 6 hours, 24 hours) to catch fast and slow burn scenarios. No abstract theory, just the math and the practical config that keeps on-call engineers asleep when they should be. Produced by the Fexingo Business podcast network. #SRE #SiteReliabilityEngineering #SLO #BurnRate #IncidentResponse #Alerting #Observability #GoogleSRE #DevOps #Uptime #OnCall #ErrorBudget #Reliability #Monitoring #TechPodcast #FexingoBusiness #BusinessPodcast #Technology Keep every episode free: buymeacoffee.com/fexingo
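
The burn rate idea in this description can be sketched as follows. Burn rate here is the observed error ratio divided by the error budget (1 minus the SLO), so a rate of 10 spends a 30-day budget in 3 days. The 1-hour, 6-hour, and 24-hour windows come from the episode description; the thresholds, actions, and function names are hypothetical, and real monitoring tools have their own configuration formats.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# Hypothetical tiers: (window, burn rate threshold, action).
# Short windows catch fast burns; long windows catch slow ones.
TIERS = [
    ("1h", 10.0, "page"),
    ("6h", 5.0, "page"),
    ("24h", 2.0, "ticket"),
]


def evaluate(error_ratios: dict, slo: float) -> list:
    """Return the alerts that fire for the error ratio seen in each window."""
    fired = []
    for window, threshold, action in TIERS:
        if burn_rate(error_ratios[window], slo) >= threshold:
            fired.append(f"{action}:{window}")
    return fired


# Made-up observations: a sharp spike in the last hour, quieter before that.
observed = {"1h": 0.02, "6h": 0.004, "24h": 0.0015}
fired = evaluate(observed, slo=0.999)
```

Here only the 1-hour tier fires, because a 2 percent error ratio against a 0.1 percent budget is a burn rate of about 20, while the longer windows stay under their thresholds.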

  9. 39

    How SRE Teams Use Software Bill of Materials for Supply Chain Security

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the growing importance of the Software Bill of Materials (SBOM) for securing software supply chains. They use the 2024 XZ Utils backdoor as a concrete case study to explain how a single maintainer's burnout led to a critical vulnerability that an SBOM could have caught earlier. Lucas breaks down what an SBOM is, how it works with dependency graphs, and why the US Executive Order on Cybersecurity now mandates them for federal suppliers. Luna asks about practical implementation challenges, including tooling like SPDX and CycloneDX, and the conversation explores how SRE teams can automate SBOM generation in CI/CD pipelines. They also touch on the trade-offs between transparency and operational overhead. The episode includes a short, organic donor segment tied to the value of free, in-depth tech content. #SBOM #SoftwareBillOfMaterials #SupplyChainSecurity #XZUtils #OpenSource #DependencyGraph #Cybersecurity #ExecutiveOrder #SPDX #CycloneDX #CI/CD #SRE #SiteReliabilityEngineering #Uptime #IncidentResponse #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  10. 38

    How SRE Teams Use Feature Flags to Reduce Deployment Risk

    In Episode 51 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use feature flags—not just for canary releases, but as a core tool to decouple deployment from release, reduce blast radius, and enable instant rollback without redeploying. They walk through a real incident at a major streaming company where a misconfigured flag caused a 47-minute partial outage, and how the team later rebuilt their flag lifecycle with expiration dates, audit trails, and mandatory approvals for 'kill switches'. Lucas explains the difference between boolean flags, multivariate flags, and permission-based flags, and why treating flags as 'technical debt' is critical for long-term reliability. The episode also touches on how feature flags intersect with observability—specifically, how teams instrument their flag state changes to correlate with metrics in dashboards. If you've ever wondered why your feature toggles keep piling up, this episode gives you a concrete process to clean them up. #SRE #SiteReliabilityEngineering #FeatureFlags #DeploymentRisk #ReleaseManagement #IncidentResponse #Observability #Toggles #KillSwitch #FlagDebt #ContinuousDelivery #SoftwareEngineering #Technology #Podcast #FexingoBusiness #BusinessPodcast #TechOps #ProductionEngineering Keep every episode free: buymeacoffee.com/fexingo

  11. 37

    How SRE Teams Use Stress Testing to Simulate Real Workloads

    Lucas and Luna explore how production stress testing goes beyond standard load testing to simulate realistic user behavior, with a deep dive into how a major streaming platform used session replay and gradual ramp-up to validate infrastructure before a global event. They unpack why stress testing must replicate authentication flows, API call patterns, and edge case traffic shapes — not just raw requests per second. The episode explains how SRE teams combine production shadowing, canary analysis, and real-user monitoring data to build stress tests that catch issues traditional benchmarks miss. A practical look at a specific technique that prevents cascading failures during peak traffic. #SiteReliabilityEngineering #StressTesting #ProductionTesting #LoadTesting #Observability #ChaosEngineering #Infrastructure #SRE #Reliability #Performance #Scalability #FaultTolerance #CapacityPlanning #RealUserMonitoring #CanaryDeployments #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  12. 36

    How SRE Teams Use Game Days to Build Incident Muscle Memory

    Lucas and Luna explore how site reliability engineering teams use game days — structured, simulated incident exercises — to prepare for real outages. They break down the approach used by a major fintech company that runs quarterly game days for its entire on-call rotation, with concrete scenarios like a simulated database failover and a DNS misconfiguration. The episode covers how game days differ from chaos engineering, why they build 'muscle memory' faster than reading runbooks, and how teams measure improvement by tracking time-to-acknowledge and time-to-resolve across repeated drills. Lucas shares data from a 2025 industry survey showing that teams running at least four game days per year cut mean time to resolve by 40 percent compared to teams that don't. Luna presses on the common failure modes — overly scripted scenarios, not including non-engineering stakeholders, and treating game days as pass-fail rather than learning exercises. They wrap with practical advice for starting small: a one-hour scenario with three people and a clear objective. #GameDays #IncidentSimulation #SRE #SiteReliabilityEngineering #OnCallPreparation #IncidentResponse #MuscleMemory #ChaosEngineering #RunbookAutomation #MeanTimeToResolve #Fintech #ProductionEngineering #Uptime #Resilience #LearningExercises #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  13. 35

    How SRE Teams Use Error Budgets to Align Risk and Velocity

    In episode 48 of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into error budgets — the SRE concept that turns reliability into a business decision rather than a purely technical one. They break down how Google originally defined error budgets via the Service Level Indicator (SLI) / Service Level Objective (SLO) / error budget framework, then explore how teams at companies like Shopify and Netflix use them to decide when to push features versus when to freeze releases. Lucas explains the math: if your SLO is 99.9% uptime, your error budget is 0.1% of total time — roughly 43 minutes per month. Once that budget is consumed, releases stop. Luna challenges whether rigid budget enforcement works in practice, citing a case where a startup blew through its budget during a holiday sale but made the right call. They also discuss tooling like Google Cloud Monitoring and Datadog SLO tracking, and how error budgets prevent the classic tension between 'ship fast' and 'keep stable.' The episode closes with a reflection on whether error budgets scale to smaller teams. #SiteReliabilityEngineering #ErrorBudgets #SRE #SLI #SLO #Google #Shopify #Netflix #ReliabilityEngineering #DevOps #IncidentResponse #Uptime #ReleaseVelocity #Datadog #GoogleCloudMonitoring #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo
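
The budget arithmetic Lucas describes is easy to check. A minimal sketch, assuming a 30-day month and a time-based availability SLO; the function names and the release-gate rule are illustrative, not a specific tool's API.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def releases_allowed(downtime_minutes: float, slo_percent: float,
                     window_days: int = 30) -> bool:
    """Strict gate: releases continue only while budget remains."""
    return downtime_minutes < error_budget_minutes(slo_percent, window_days)


# 99.9% over 30 days: 43,200 minutes * 0.1% = about 43.2 minutes.
budget = error_budget_minutes(99.9)

# Hypothetical downtime figures for the current window.
can_ship_after_20_min = releases_allowed(20, 99.9)
can_ship_after_50_min = releases_allowed(50, 99.9)
```

A strict gate like `releases_allowed` is the rigid enforcement Luna questions; a team could just as well treat an exhausted budget as a prompt for a conversation rather than an automatic freeze.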

  14. 34

    How SRE Teams Use SLIs to Define Reliability

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the often-overlooked first step of SRE practice: defining Service Level Indicators (SLIs). They explore how vague uptime percentages fail to capture user experience and walk through a concrete example from a major streaming platform that shifted from a 'five nines' target to a more granular SLI based on video start latency. The hosts discuss common pitfalls like measuring everything versus measuring what matters, the tension between signal and noise, and how a well-defined SLI transforms the conversation from 'is the site up?' to 'are users happy?' Listeners will learn why three specific SLIs—latency, error rate, and throughput—cover most production services, and how to avoid the trap of vanity metrics. The episode closes on a forward-looking note about how LLM-based systems challenge traditional SLI thinking. #SLI #ServiceLevelIndicators #SRE #SiteReliabilityEngineering #Uptime #Latency #ErrorRate #Throughput #Observability #UserExperience #ProductionEngineering #ReliabilityMetrics #Fexingo #FexingoBusiness #BusinessPodcast #Technology #DevOps #IncidentResponse Keep every episode free: buymeacoffee.com/fexingo

  15. 33

    How SRE Teams Use Cognitive Load Management to Prevent Burnout

    Episode 46 of The Site Reliability Podcast with Fexingo dives into how SRE teams are applying cognitive load theory to reduce burnout and improve incident response. Lucas and Luna explore the concept of 'cognitive load' — the mental effort required to operate complex systems — and how teams at companies like Google and Netflix use techniques like toil reduction, documentation, and team topologies to keep operators in the flow. They discuss real examples: how one team sliced their on-call rotation to cut context-switching, and why a 'three-document rule' for runbooks can double first-response accuracy. The episode also touches on the limits of automation and why human cognition remains the critical bottleneck in reliability engineering. If you're an SRE, platform engineer, or manager trying to keep your team sustainable, this one is for you. #SiteReliabilityEngineering #SRE #CognitiveLoad #BurnoutPrevention #ToilReduction #IncidentResponse #TeamTopologies #GoogleSRE #Netflix #Runbooks #OnCall #FlowState #Automation #Technology #EffortBudget #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo

  16. 32

    How SRE Teams Use Observability to Find Unknown Unknowns

    Episode 45 of The Site Reliability Podcast digs into observability—how modern SRE teams go beyond monitoring to discover the 'unknown unknowns' that cause the worst outages. Lucas and Luna break down the difference between watching known metrics (CPU, memory) and exploring unknown failure modes with structured events and high-cardinality data. They walk through a real example: a major e-commerce platform that lost $340,000 in seven minutes during a 2023 flash sale because their monitoring didn't catch a latency spike in a new authentication microservice. They explain how distributed tracing and log-based metrics surfaced the root cause after the fact, and how the team now uses observability-driven dashboards to spot anomalies before they become incidents. The episode also covers practical steps—start with one service, instrument with OpenTelemetry, and build a culture of exploration—so listeners can apply observability in their own SRE practice. No ads, just actionable engineering insights. #Observability #SRE #SiteReliabilityEngineering #Monitoring #DistributedTracing #OpenTelemetry #IncidentResponse #UnknownUnknowns #HighCardinality #Microservices #Latency #LogBasedMetrics #ServiceLevelObjectives #ChaosEngineering #ProductionEngineering #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  17. 31

    How SRE Teams Use Dependency Graphs to Predict Outages

    In this episode of The Site Reliability Podcast, hosts Lucas and Luna explore how SRE teams at major tech companies build and maintain dependency graphs to predict cascading failures before they happen. Using concrete examples from cloud infrastructure and microservices architectures, they explain how graph-based service maps help teams identify single points of failure, model blast radius, and prioritize resilience investments. Lucas walks through a real-world case where a seemingly minor dependency change caused a multi-region outage at a large e-commerce platform, and discusses how post-incident graph analysis revealed the weak link. Luna shares how her team uses automated graph discovery tools to update dependency maps continuously, preventing stale data from misleading incident responders. The episode also touches on the trade-offs between graph granularity and maintainability, and how teams can start small with critical path diagrams before scaling to full service meshes. Whether you're an SRE veteran or new to production engineering, this episode offers a practical framework for turning service dependencies from a blame map into a predictive reliability tool. #SiteReliabilityEngineering #DependencyGraphs #OutagePrediction #Microservices #CloudInfrastructure #IncidentResponse #ServiceMaps #ResilienceEngineering #ChaosEngineering #BlastRadius #GraphTheory #ProductionEngineering #SRE #Technology #Podcast #FexingoBusiness #BusinessPodcast #Uptime Keep every episode free: buymeacoffee.com/fexingo
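
One way to model blast radius on a dependency graph is a breadth-first walk over reversed edges: starting from a failed service, collect everything that depends on it directly or transitively. The service map below is invented for illustration, as are the function names.

```python
from collections import deque

# Hypothetical service map: each service lists what it calls directly.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["catalog", "sessions"],
    "payments": ["auth"],
    "catalog": ["database"],
    "sessions": ["database"],
    "auth": ["database"],
    "database": [],
}


def blast_radius(failed: str, depends_on: dict) -> set:
    """Services that directly or transitively depend on the failed one."""
    # Reverse the edges so we can walk from a dependency to its callers.
    dependents = {service: set() for service in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    affected.discard(failed)  # only matters if the graph contains a cycle
    return affected


auth_radius = blast_radius("auth", DEPENDS_ON)

# Rank services by how much breaks when they fail; the top entry is the
# strongest single-point-of-failure candidate in this toy graph.
worst = max(DEPENDS_ON, key=lambda s: len(blast_radius(s, DEPENDS_ON)))
```

In this toy graph the database has the largest blast radius, which is the kind of weak link a critical path diagram is meant to expose before a full service map exists.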

  18. 30

    How SRE Teams Use Toil Budgets to Prioritize Automation

    Episode 43 of The Site Reliability Podcast. Lucas and Luna explore how SRE teams are adopting 'toil budgets' — a concept inspired by error budgets — to cap the amount of manual, repetitive work engineers do each sprint. They break down Google's internal definition of toil (hands-on work with no enduring value), how a toil budget works alongside an error budget, and a concrete case from a mid-sized SaaS company that cut toil from 40% to 15% of engineering time over six months using a simple spreadsheet-based tracking system. Lucas shares the specific criteria for classifying toil, the formula for setting the budget as a percentage of total effort, and the governance process — a weekly toil review board — that prevented scope creep. Luna pushes back on whether toil budgets just push work onto other teams, and Lucas explains the 'clean-up after yourself' rule that prevents that. The episode closes with a practical tip: start by running a three-week time diary before imposing any budget. No marketing fluff. #ToilBudget #SRE #SiteReliabilityEngineering #Automation #GoogleSRE #IncidentResponse #Productivity #EngineeringCulture #DevOps #TechOps #WorkflowAutomation #Observability #FexingoBusiness #BusinessPodcast #Technology #Infrastructure #ToilReduction #SprintPlanning Keep every episode free: buymeacoffee.com/fexingo

  19. 29

    How SRE Teams Use Service Level Objectives to Drive Daily Decisions

    This episode explores how Site Reliability Engineering teams use Service Level Objectives (SLOs) not just as a quarterly dashboard metric, but as a real-time decision-making tool that shapes pager rotations, deployment gating, and incident prioritization. Lucas walks through how Shopify's SRE team used a 99.95% availability SLO to flag a critical degradation before it became a full outage in 2025. Luna pushes back on whether SLOs can become a bureaucratic checkbox, and the two debate how to keep SLOs actionable without adding overhead. The hosts also briefly discuss listener support: if you value an ad-free tech podcast, you can support the show at buy me a coffee dot com slash fexingo. #SLO #SRE #ServiceLevelObjectives #SiteReliabilityEngineering #Shopify #Reliability #IncidentResponse #Uptime #ProductionEngineering #DevOps #Observability #SLI #ErrorBudget #DeploymentGating #PagerDuty #TechPodcast #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  20. 28

    How SRE Teams Use Canary Deployments to Reduce Release Risk

    Lucas and Luna dive into canary deployments: the practice of routing a small percentage of production traffic to a new version before rolling it out broadly. Lucas explains why Netflix's 'canary clusters' and Etsy's 'feature flipping' approach revolutionized how SRE teams think about release risk, and contrasts it with the old all-at-once deploys that caused major incidents. They discuss specific strategies: using metrics comparison between canary and baseline, automatic rollback triggers, and the trade-off between speed and safety. Luna brings up the 2023 incident where a mismatched canary size led to a slow-burn outage, and they explore how teams decide on canary percentage and duration. A concrete episode for any engineer or manager responsible for production releases. #SiteReliabilityEngineering #CanaryDeployments #ReleaseManagement #ProductionEngineering #IncidentPrevention #Netflix #Etsy #ContinuousDelivery #SRE #Uptime #ReliabilityEngineering #DeploymentStrategies #Technology #FexingoBusiness #BusinessPodcast #SoftwareEngineering #DevOps #RiskMitigation Keep every episode free: buymeacoffee.com/fexingo
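
The metrics comparison and automatic rollback trigger mentioned in this description can be sketched as a single decision function. This is a simplified illustration, not any named company's system: the thresholds (`max_ratio`, `abs_floor`, `min_requests`) and all names are hypothetical, and production canary analysis usually compares many metrics with statistical tests.

```python
def canary_decision(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    max_ratio: float = 2.0,
                    abs_floor: float = 0.001,
                    min_requests: int = 500) -> str:
    """Compare canary and baseline error rates and return an action."""
    # Too little canary traffic makes the comparison meaningless.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    canary_rate = canary_errors / canary_requests

    # Roll back when the canary is clearly worse than the baseline. The
    # absolute floor avoids tripping on noise when the baseline is near zero.
    limit = max(baseline_rate * max_ratio, abs_floor)
    return "rollback" if canary_rate > limit else "continue"


# Made-up traffic counts for three situations.
too_early = canary_decision(50, 100_000, 1, 100)
healthy = canary_decision(50, 100_000, 3, 5_000)
regressed = canary_decision(50, 100_000, 30, 5_000)
```

The `min_requests` guard reflects the sizing question raised in the episode: a canary that is too small never gathers enough data to distinguish a regression from noise.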

  21. 27

    How SRE Teams Use Chaos Engineering to Test Resilience

    In episode 40 of The Site Reliability Podcast, Lucas and Luna dive into chaos engineering — the practice of intentionally breaking systems to find weaknesses before real incidents strike. They explore how Netflix pioneered the approach with Chaos Monkey, the lessons SRE teams can learn from controlled failure experiments, and how to start small with simple game days that simulate a database partition or a DNS failure. Lucas breaks down the difference between load testing and chaos testing, and why the goal isn't to break everything but to build confidence in your system's ability to recover. They also discuss common pitfalls like running experiments during peak traffic or without proper observability in place. Whether you're a seasoned SRE or just starting to think about resilience, this episode gives you a concrete framework for making your systems more robust — one controlled explosion at a time. Plus, Lucas and Luna explain why keeping this podcast ad-free matters and how listener support makes it possible. #ChaosEngineering #SRE #SiteReliabilityEngineering #Netflix #ChaosMonkey #ResilienceTesting #FailureInjection #ProductionTesting #Uptime #IncidentResponse #Observability #GameDays #FaultTolerance #CloudEngineering #DevOps #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  22. 26

    How SRE Teams Use Capacity Planning to Prevent Outages

    Episode 39 of The Site Reliability Podcast with Fexingo dives into capacity planning as a proactive SRE practice. Lucas and Luna explore how teams at companies like Google and Netflix use trend analysis, load testing, and headroom budgeting to avoid capacity-related outages. They discuss a real-world case from 2025 where a major streaming service averted a Super Bowl crash by scaling capacity weeks in advance. The episode explains the difference between reactive and proactive capacity planning, the role of predictive modeling, and how error budgets tie into headroom decisions. Listeners will learn concrete metrics (like peak-to-average ratio and utilization targets) and hear why capacity planning is as much about culture as it is about tools. A must for SREs, platform engineers, and anyone responsible for keeping services up during traffic spikes. #SRE #CapacityPlanning #SiteReliabilityEngineering #Uptime #IncidentPrevention #Google #Netflix #LoadTesting #PredictiveModeling #Infrastructure #CloudComputing #Technology #ProductionEngineering #Scalability #TrafficSpikes #ErrorBudgets #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo
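
The two metrics named in this description, peak-to-average ratio and a utilization target, combine into a simple headroom calculation. A minimal sketch with invented traffic samples; the per-instance throughput, the 50 percent target, and the function names are assumptions for illustration only.

```python
import math
from statistics import mean


def peak_to_average(samples: list) -> float:
    """Ratio of the busiest sample to the mean; higher means spikier traffic."""
    return max(samples) / mean(samples)


def instances_needed(peak_rps: float, per_instance_rps: float,
                     target_utilization: float) -> int:
    """Fleet size that keeps each instance at or below the target at peak."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Hypothetical requests-per-second samples across a day.
hourly_rps = [400, 500, 600, 1500]
ratio = peak_to_average(hourly_rps)

# Assume each instance handles 100 rps and should run at 50% at peak,
# leaving the other half as headroom for unexpected spikes.
fleet = instances_needed(max(hourly_rps), per_instance_rps=100,
                         target_utilization=0.5)
```

Sizing for the average instead of the peak would suggest half as many instances here, which is the gap a peak-to-average ratio is meant to make visible.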

  23. 25

    How SRE Teams Use Immutable Infrastructure to Eliminate Configuration Drift

    In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use immutable infrastructure to eliminate configuration drift and improve reliability. They dive into a real case from Google's Borg paper, explaining how replacing mutable servers with golden images reduces incident rates and recovery times. The hosts break down the trade-offs with mutable servers, the role of infrastructure as code, and practical steps for adoption. They also discuss when immutable infrastructure might not be the right fit—for example, in legacy databases or stateful systems. By the end, you'll understand a core SRE principle that many teams are adopting to reduce toil and increase uptime. The episode includes a natural donation segment tied to the value of learning production engineering concepts from the show. #ImmutableInfrastructure #ConfigurationDrift #SiteReliabilityEngineering #SRE #GoogleBorg #GoldenImages #InfrastructureAsCode #DevOps #CloudComputing #ProductionEngineering #Uptime #IncidentResponse #Technology #FexingoBusiness #BusinessPodcast #TechOps #Reliability #ToilReduction Keep every episode free: buymeacoffee.com/fexingo

  24. 24

    How SRE Teams Use Auto-Remediation to Resolve Incidents Without Humans

    In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna explore how SRE teams are using auto-remediation to automatically resolve incidents without human intervention. They break down the anatomy of an auto-remediation pipeline — from monitoring alerts to automated runbook execution — using real-world examples like a major streaming service that reduced pager fatigue by 40 percent. Lucas explains the critical distinction between deterministic remediation (simple if-then rules) and AI-driven remediation (pattern-matching across past incidents). The hosts also discuss where auto-remediation fails: novel incidents, complex multi-service failures, and scenarios requiring human judgment. They emphasize that auto-remediation isn't about replacing SREs but about freeing them to focus on higher-value work. Practical tips include starting with high-frequency, low-complexity alerts and gradually expanding scope. No fluff, just a focused look at a key SRE practice. Tune in for a concrete example you can apply to your own incident response. #AutoRemediation #SiteReliabilityEngineering #IncidentResponse #RunbookAutomation #PagerFatigue #DeterministicRemediation #AIDrivenRemediation #StreamingServiceCaseStudy #SRE #Uptime #ProductionEngineering #FexingoBusiness #BusinessPodcast #TechnologyPodcast #LucasAndLuna #IncidentManagement #OnCall #Observability Keep every episode free: buymeacoffee.com/fexingo

  25. 23

    How SRE Teams Use Incident Command Systems to Coordinate Response

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the incident command system (ICS) model that large-scale SRE teams borrow from emergency services to manage complex outages. They walk through a real example: a major payment processing incident at a fintech company where a database migration triggered a cascading failure affecting three million users. Lucas explains the four key roles in an SRE incident command structure — incident commander, operations lead, communications lead, and scribe — and how each prevents the chaos of engineers stepping on each other during a crisis. Luna challenges whether ICS slows down response time for smaller incidents, and Lucas shares how teams use tiered response models to scale the approach. They also discuss the one mistake teams make most often: failing to formally hand off the incident commander role during long-running incidents. The episode closes with a practical tip for any team looking to adopt ICS without formal training: start by assigning a scribe for the next on-call rotation. #IncidentCommandSystem #SRE #SiteReliabilityEngineering #IncidentResponse #OnCall #CascadingFailure #Fintech #DatabaseMigration #IncidentCommander #OperationsLead #CommunicationsLead #Scribe #TieredResponse #Handoff #ProductionEngineering #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  26. 22

    How SRE Teams Use Blameless Postmortems to Build Better Systems

    In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and tracking follow-ups. They discuss common pitfalls like blame drift and incomplete data, and share how one SRE team at a mid-size SaaS company reduced repeat incidents by 40 percent after adopting a structured blameless process. If you're looking to turn outages into learning opportunities, this episode offers a practical playbook. #BlamelessPostmortems #SRE #SiteReliabilityEngineering #IncidentManagement #ProductionEngineering #Uptime #RootCauseAnalysis #DevOps #Reliability #LearningFromFailure #BlamelessCulture #IncidentResponse #SaaSSRE #TechOps #Technology #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo

  27. 21

    How SRE Teams Use Postmortems That Actually Change Behavior

    In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and a contributing factor, and explains why the 'action items' list is usually the weakest part. Luna pushes back on the idea that postmortems should always be public, and they discuss how psychological safety changes whether people actually report the truth. The episode closes with a practical takeaway: if your postmortem doesn't change how you deploy, monitor, or alert, it's a report, not a postmortem. #SRE #SiteReliabilityEngineering #Postmortems #IncidentResponse #BlamelessCulture #GoogleSRE #GmailOutage #ActionItems #PsychologicalSafety #IncidentAnalysis #ReliabilityEngineering #DevOps #FexingoBusiness #BusinessPodcast #Technology #LearningFromFailure #ContinuousImprovement #RootCauseAnalysis Keep every episode free: buymeacoffee.com/fexingo

  28. 20

    How SRE Teams Use Runbook Automation to Reduce Human Error

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical side of runbook automation — moving beyond static documentation to executable, automated responses. They explore how companies like Google and Netflix use runbook automation to reduce mean time to repair by up to 60%, and discuss the common pitfalls: over-automation, stale runbooks, and the tension between speed and safety. Lucas shares a concrete example from a major e-commerce platform where automated runbooks cut incident response time from 45 minutes to under 5. Luna challenges whether automation can replace human judgment in complex outages. The conversation also touches on tools like Rundeck, PagerDuty Automation, and custom Slack bots. By the end, listeners will understand the key principles for building runbooks that actually get followed in the heat of an incident. #SiteReliabilityEngineering #RunbookAutomation #SRE #IncidentResponse #DevOps #Automation #GoogleSRE #Netflix #PagerDuty #Rundeck #MeanTimeToRepair #Technology #ProductionEngineering #Uptime #FexingoBusiness #BusinessPodcast #TechOps #OnCall Keep every episode free: buymeacoffee.com/fexingo

  29. 19

    How SRE Teams Use Cost Optimization to Balance Performance and Budget

    In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into the often-overlooked intersection of site reliability engineering and cloud cost optimization. They explore how SRE teams at companies like Uber and Airbnb use techniques such as right-sizing instances, leveraging spot instances, and implementing autoscaling policies to reduce infrastructure spend without sacrificing reliability. Specific metrics like cost per transaction and cost per request are discussed as key indicators. The hosts also examine the trade-offs between reserved and on-demand instances, the role of FinOps in SRE, and how to set cost-aware SLOs. A concrete example from a mid-sized SaaS company shows how they saved 35% on AWS costs by shifting to a well-architected framework. This episode offers practical strategies for SREs and platform engineers looking to optimize both uptime and cloud bills. #SiteReliabilityEngineering #CloudCostOptimization #FinOps #SRE #CostOptimization #Uber #Airbnb #AWS #Autoscaling #SpotInstances #ReservedInstances #SLOs #Technology #Podcast #FexingoBusiness #BusinessPodcast #CloudComputing #Infrastructure Keep every episode free: buymeacoffee.com/fexingo

  30. 18

    How SRE Teams Use Load Shedding to Survive Traffic Spikes

    When a massive traffic spike hits, every millisecond of latency can cost thousands of dollars. In this episode, Lucas and Luna explore load shedding — the SRE technique of intentionally dropping non-critical requests to keep core systems running. They walk through how Google SREs used load shedding during the 2020 YouTube outage, how Stripe applies graceful degradation during payment surges, and why Netflix deliberately kills low-priority traffic during peak hours. They also break down the mental shift required: treating load shedding as a feature, not a failure. If you're an SRE, platform engineer, or just someone who wonders why services fail gracefully sometimes and fall over completely other times, this one's for you. #SiteReliabilityEngineering #LoadShedding #TrafficSpikes #GoogleSRE #Stripe #Netflix #GracefulDegradation #CapacityPlanning #IncidentResponse #SREBestPractices #Observability #PriorityBasedShedding #FexingoBusiness #BusinessPodcast #Technology #Podcast #SRE #Uptime Keep every episode free: buymeacoffee.com/fexingo

  31. 17

    How SRE Teams Use Feature Flags to Reduce Incident Risk

    Feature flags are a powerful tool for SREs, but they come with their own operational risks. In this episode, Lucas and Luna explore how companies like Etsy, Netflix, and LaunchDarkly use feature flags to decouple deployment from release, enabling canary rollouts, instant kill switches, and safer experimentation. They break down the difference between boolean flags, multivariate flags, and experiment flags, and discuss the hidden costs: flag debt, stale flags, and the risk of configuration cascades. Lucas shares a specific incident where a misconfigured flag caused a cascading failure at a major e-commerce platform, and how the team rebuilt their flag management system. Luna asks the hard questions about observability and testing: how do you know a flag is safe to flip? And when do you remove an old flag? The episode closes with a forward-looking question about the future of progressive delivery and whether SRE teams should treat flags as infrastructure code. #FeatureFlags #SRE #SiteReliabilityEngineering #LaunchDarkly #Etsy #Netflix #ProgressiveDelivery #CanaryDeployments #KillSwitch #FlagDebt #ConfigurationManagement #Observability #IncidentResponse #DevOps #Technology #FexingoBusiness #BusinessPodcast #ProductionEngineering Keep every episode free: buymeacoffee.com/fexingo

  32. 16

    How SRE Teams Use Incident Metrics to Reduce Mean Time to Resolve

    In episode 29 of The Site Reliability Podcast, Lucas and Luna dive into the specific metrics SRE teams use to reduce mean time to resolve (MTTR) during incidents. They break down the difference between mean time to acknowledge (MTTA) and MTTR, using real-world examples from companies like Google and Etsy. Lucas explains the concept of a 'rescue time' target—a hard limit on how long an incident can last before automatic escalation kicks in. Luna shares a story about a startup that cut their MTTR from 45 minutes to 12 by adopting a single-pane-of-glass monitoring tool. The hosts discuss how to set realistic MTTR targets based on historical data, and why chasing the lowest number can backfire. They also touch on the role of runbooks in accelerating resolution. This episode is packed with actionable advice for SREs and DevOps engineers looking to improve their incident response times. #SRE #MTTR #IncidentResponse #SiteReliability #DevOps #Monitoring #Alerting #Runbooks #Google #Etsy #MeanTimeToResolve #MTTA #Observability #Automation #Escalation #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  33. 15

    How Cloud SREs Use Circuit Breakers to Prevent Cascading Failures

    When a single service fails, the whole system shouldn't collapse. In this episode, Lucas and Luna dive into the circuit breaker pattern, a critical resilience tool in site reliability engineering. They break down how Netflix's Hystrix inspired modern implementations, how companies like Amazon and Lyft use circuit breakers to isolate failures, and why a poorly tuned breaker can make an outage worse. Lucas explains the three states (closed, open, half-open) and the math behind the thresholds. Luna challenges the conventional wisdom about when to trigger half-open probes. The hosts also discuss the trade-off between circuit breakers and client-side retry logic, and share a real example of a broken circuit breaker causing a cascading failure in a major payment processor. For SRE teams building their own or using libraries like Resilience4j, this episode offers practical guidance on tuning thresholds, measuring success, and avoiding common pitfalls. #CircuitBreaker #SRE #SiteReliabilityEngineering #Resilience #CascadingFailure #Netflix #Hystrix #Amazon #Lyft #Resilience4J #FaultTolerance #Microservices #CloudComputing #Technology #SystemsDesign #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo
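
    For readers who want to see the three states from the description above in code, here is a minimal sketch in Python. It is not taken from the episode, from Hystrix, or from Resilience4j: the class name, the consecutive-failure threshold, and the timeout-based half-open probe are all illustrative assumptions, and production libraries typically use sliding-window failure rates instead.

```python
import time


class CircuitBreaker:
    """Illustrative three-state breaker: closed, open, half-open (hypothetical API)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a probe
        self.clock = clock                          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                # Timeout elapsed: let one probe call through.
                self.state = "half-open"
            else:
                raise RuntimeError("circuit open: call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _record_success(self):
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.state = "closed"
```

    A caller would wrap each outbound request, for example `breaker.call(fetch_balance, account_id)`, where `fetch_balance` is a hypothetical function. The tuning question raised in the episode maps onto `failure_threshold` and `reset_timeout` here.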

  34. 14

    How SREs Use Error Budgets to Balance Reliability and Velocity

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical mechanics of error budgets, the SRE tool that lets teams trade reliability for feature velocity without breaking trust. They walk through a real example: a team running a service with a 99.9% SLO, which leaves a 0.1% error budget each month, and what happens when they burn through it by week two. Lucas explains how Google baked error budgets into its SRE handbook to resolve the tension between ops and product teams, and Luna challenges whether the concept works outside of hyper-scale tech. They discuss the math behind budget burn rate, how to set alerting thresholds at 50% and 75% consumption, and why some teams treat error budgets as a compliance checkbox rather than a strategic lever. The episode also touches on the human side: how error budgets reduce blame during incidents because the team already agreed on the risk. If you've ever wondered how Netflix, Google, or smaller SaaS teams decide when to release and when to hold, this episode gives you the concrete framework. No abstract theory; just the numbers and the culture shift. #SiteReliabilityEngineering #ErrorBudgets #SLO #Reliability #Velocity #GoogleSRE #IncidentManagement #ProductionEngineering #TechOps #SoftwareEngineering #Uptime #DevOps #BlamelessPostmortems #SREHandbook #Availability #RiskManagement #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo
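
    The 50% and 75% consumption thresholds mentioned above can be sketched as a small Python check. This is an illustration only: the request-based budget, the function names, the level labels, and the 10,000,000 expected monthly requests are assumptions, not figures from the episode.

```python
def budget_consumed(slo, expected_requests, failed_requests):
    """Fraction of the window's error budget already spent (1.0 = fully spent)."""
    allowed_failures = (1.0 - slo) * expected_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return failed_requests / allowed_failures


def alert_level(consumed, warn_at=0.50, critical_at=0.75):
    """Map budget consumption to a hypothetical alert level."""
    if consumed >= 1.0:
        return "exhausted"
    if consumed >= critical_at:
        return "critical"
    if consumed >= warn_at:
        return "warning"
    return "ok"


# Assumed numbers: 99.9% SLO, 10,000,000 expected requests this month,
# so roughly 10,000 failed requests are allowed before the budget is gone.
consumed = budget_consumed(0.999, 10_000_000, 6_000)
print(alert_level(consumed))  # about 60% of the budget spent
```

    A team that reaches "exhausted" by week two, as in the episode's example, would see this check cross both thresholds early in the month, which is the signal to slow releases.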

  35. 13

    How SRE Teams Use Game Days to Build Muscle Memory for Incidents

    In Episode 26 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams run 'game days' (simulated incident exercises) to build muscle memory and reduce panic during real outages. They break down how Etsy, a pioneer in game days, structures its exercises using realistic scenarios, mini-game design, and postmortem debriefs without blame. The hosts discuss the difference between chaos engineering and game days, how to avoid making exercises feel like busywork, and why even small teams can run a low-stakes simulation with nothing more than a staging environment and a script. Lucas shares concrete steps: start with a single failure mode, assign roles, time-box the response, and debrief with the question 'What did we learn?' rather than 'Who messed up?' The episode also touches on the metrics that matter, mean time to acknowledge and mean time to resolve, and how game days improve both without requiring expensive tools. If today's conversation gave you something usable, they mention the listener-supported model that keeps the show ad-free, with a link at buymeacoffee.com/fexingo. No fluff, just practical SRE discipline. #SiteReliabilityEngineering #GameDays #IncidentResponse #ChaosEngineering #Etsy #SRE #ToilReduction #MuscleMemory #OnCall #Postmortem #ReliabilityEngineering #DevOps #IncidentManagement #FaultInjection #SyntheticMonitoring #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  36. 12

    How SRE Teams Use Error Budgets to Balance Reliability and Velocity

    In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use error budgets to make smart trade-offs between reliability and feature velocity. They break down the concept with concrete examples from Google's original SRE model, showing how a 99.99% uptime target translates to 52.6 minutes of allowed downtime per year. The hosts discuss how error budgets empower teams to take calculated risks, like skipping a canary deployment for a critical fix, without breaking service level objectives. They also touch on common pitfalls: teams that spend their entire error budget in the first week of the quarter, and the danger of setting error budgets too tight. The episode includes practical advice on setting realistic SLOs and monitoring error budget burn rate to avoid surprises. No prior knowledge assumed — just a practical look at one of SRE's most useful tools. #SRE #SiteReliabilityEngineering #ErrorBudgets #ServiceLevelObjectives #SLO #Uptime #Reliability #FeatureVelocity #GoogleSRE #ProductionEngineering #IncidentResponse #SiteReliabilityPodcast #Fexingo #FexingoBusiness #TechnologyPodcast #BusinessPodcast #SREBestPractices #DevOps Keep every episode free: buymeacoffee.com/fexingo
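
    The 52.6-minute figure quoted above follows from simple arithmetic, sketched here in Python. The function name is illustrative, and the result assumes a 365.25-day year; a 365-day year gives a value that rounds to the same number.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo, window_days):
    """Downtime permitted by an availability target over a window, in minutes."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


# 99.99% over a 365.25-day year: 0.0001 * 525,960 minutes
print(round(allowed_downtime_minutes(0.9999, 365.25), 1))  # 52.6
# 99.9% over a 30-day month: 0.001 * 43,200 minutes
print(round(allowed_downtime_minutes(0.999, 30), 1))       # 43.2
```

    Each extra nine cuts the allowance by a factor of ten, which is why the episode warns against setting error budgets too tight.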

  37. 11

    SRE Runbooks That Actually Get Followed

    Most SRE teams have runbooks. Few have runbooks that engineers actually use in the middle of an incident. Lucas and Luna dive into why the typical runbook fails — too long, too vague, or written for the person who already knows the system. They break down what Google's internal SRE teams do differently: five-sentence maximum per procedure, explicit decision trees, and a 'runbook owners' workflow that keeps documents from rotting. Luna shares a real example from her time at a mid-size fintech, where a 47-page runbook was replaced with a single-page checklist — and incident resolution time dropped by 22 percent. Lucas explains why runbooks are actually a form of capacity planning: every minute an engineer spends hunting for the right command is a minute they're not fixing the outage. The episode closes on a forward-looking question: with AI-assisted ops tools like Copilot for incident response, do we still need human-readable runbooks at all? #Runbooks #SRE #SiteReliabilityEngineering #IncidentResponse #GoogleSRE #OnCall #Playbooks #Automation #DevOps #Observability #ITOps #OpsDocs #Checklists #ToilReduction #Fintech #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  38. 10

    How SRE Teams Use Observability to Reduce Mean Time to Acknowledge

    Mean time to acknowledge (MTTA) is the clock that starts when an alert fires and stops when an engineer clicks 'ack'. For most teams, that gap is the single biggest waste of incident response time. In this episode, Lucas and Luna examine how Airbnb's SRE team cut their MTTA from 12 minutes to under 90 seconds by redesigning alert routing and escalation policies. They walk through the three-tier system Airbnb uses — primary, secondary, and tertiary on-call — and how a simple Slack integration with contextual alert summaries eliminated the 'wait and see' behavior that inflated MTTA. They also discuss why MTTA matters more than mean time to resolve (MTTR) for many teams, and how measuring the wrong metric can actually make incident response worse. If you're an SRE or platform engineer looking to shave minutes off your response pipeline, this episode gives you a concrete playbook drawn from one of the most demanding production environments in tech. #MeanTimeToAcknowledge #MTTA #SRE #SiteReliabilityEngineering #Airbnb #IncidentResponse #AlertRouting #OnCall #EscalationPolicy #Observability #SlackIntegration #Uptime #ProductionEngineering #Technology #FexingoBusiness #BusinessPodcast #ReliabilityEngineering #IncidentManagement Keep every episode free: buymeacoffee.com/fexingo

  39. 9

    How SRE Teams Use Synthetic Monitoring to Catch Outages First

    Episode 22 of The Site Reliability Podcast explores synthetic monitoring, proactive testing that catches outages before real users feel them. Lucas and Luna break down how companies like Etsy and Twilio simulate user journeys from multiple locations every minute, generating tens of thousands of transactions daily to validate critical flows. They discuss the difference between synthetic and real-user monitoring (RUM), why synthetic monitoring is essential for high-traffic events like Black Friday, and how to avoid common pitfalls like over-testing and false positives. The episode also covers tooling options, from open-source projects like Grafana Synthetic Monitoring to commercial services, and explains alerting strategies that reduce noise. A practical, actionable guide for SRE teams looking to shift from reactive incident response to proactive detection. If today's tech conversation gave you something usable, listener support at buymeacoffee.com/fexingo keeps the podcast ad-free and focused on real engineering insights. #SyntheticMonitoring #SiteReliabilityEngineering #ProactiveMonitoring #Uptime #IncidentResponse #Etsy #Twilio #Grafana #BlackFriday #RUM #Alerting #Observability #SRE #DevOps #Technology #FexingoBusiness #BusinessPodcast #Podcast Keep every episode free: buymeacoffee.com/fexingo

  40. 8

    How SRE Teams Use Traffic Shadowing for Safe Testing

    In this episode of The Site Reliability Podcast, Lucas and Luna explore traffic shadowing: a technique that lets SRE teams test new services with live production traffic without affecting real users. They break down how GitHub used shadowing to validate a new caching layer without risking customer data, and how Stripe employs it to test payment processing changes safely. Lucas explains the difference between shadowing, canary deployments, and A/B testing, and walks through the infrastructure needed — request duplication, isolation, and metric comparison. Luna asks about the risks of shadowing, including latency impact and data privacy concerns. The hosts also discuss how shadowing can detect subtle issues like race conditions and silent errors that canary deployments might miss. A practical, concrete episode for SREs looking to reduce deployment risk. #TrafficShadowing #SRE #SiteReliabilityEngineering #SafeTesting #ProductionTesting #GitHub #Stripe #CanaryDeployments #IncidentPrevention #Observability #Technology #FexingoBusiness #BusinessPodcast #ProductionEngineering #Latency #DataPrivacy #RaceConditions #DeploymentRisk Keep every episode free: buymeacoffee.com/fexingo

  41. 7

    How SRE Teams Use Canary Deployments to Reduce Blast Radius

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practice of canary deployments—a key strategy for reducing blast radius in production. They break down how teams like Etsy and Netflix use phased rollouts to catch issues early, with specific numbers: Etsy's Deployinator halved deployment failures after adopting canaries, and Netflix's Spinnaker pipeline automatically rolls back if error rates spike by just 1 percent. Lucas explains the optimal canary size (5-10 percent of traffic), the metrics to watch (latency, error rate, CPU usage), and why automating the rollout is critical. Luna questions whether canaries slow down velocity, and they discuss the trade-off between speed and safety. The episode also covers how to design a canary pipeline for microservices, including the use of feature flags and observability tools like Prometheus and Grafana. Recorded on May 30, 2026, this conversation gives SREs a practical guide to deploying with confidence, avoiding the all-at-once rollbacks that cause chaos. #CanaryDeployments #SRE #SiteReliabilityEngineering #BlastRadius #PhasedRollout #Etsy #Netflix #Deployinator #Spinnaker #FeatureFlags #Prometheus #Grafana #IncidentPrevention #DeploymentStrategies #DevOps #FexingoBusiness #BusinessPodcast #Technology Keep every episode free: buymeacoffee.com/fexingo

  42. 6

    How SRE Teams Use Data to Predict Incidents Before They Happen

    Most incident response is reactive—you get paged, you triage, you fix. But a growing number of SRE teams are flipping the model: using historical data, machine learning, and anomaly detection to predict incidents before they actually impact users. In this episode, Lucas and Luna explore how companies like Google, Datadog, and a major European bank are deploying predictive SRE. They break down the difference between simple threshold alerts and true predictive models, walk through a real example of a database connection pool exhaustion that was predicted 12 minutes before it caused a 503 spike, and discuss the practical challenges: data quality, model drift, and the uncomfortable question of whether you trust a prediction enough to wake someone up at 3 AM. The episode also covers the role of service level objectives in training predictive models, the concept of 'time-to-predict' as a new SRE metric, and why some teams are pairing predictive alerts with automated rollback. No hype, just the engineering reality of what works today and what's still experimental. #PredictiveSRE #IncidentPrediction #SiteReliabilityEngineering #AnomalyDetection #MachineLearning #Observability #GoogleSRE #Datadog #SLOs #TimeToPredict #AutomatedRollback #DataScience #DevOps #ReliabilityEngineering #Technology #TechPodcast #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  43. 5

    How SRE Teams Use Capacity Planning to Prevent Black Friday Outages

    In this episode, Lucas and Luna explore how site reliability engineering teams use capacity planning to avoid catastrophic outages during peak traffic events like Black Friday and Cyber Monday. They break down the specific methodology used by major e-commerce platforms, including the concept of 'headroom targets' and 'traffic shaping' — techniques that go beyond simple auto-scaling. Lucas explains how teams model demand using historical data and synthetic load testing, and why many companies still get caught off-guard by the 'thundering herd' problem. The conversation also covers real-world examples from retailers who learned hard lessons after underestimating mobile traffic surges. Luna challenges the common assumption that more servers is always the answer, and they discuss the trade-offs between cost optimization and reliability. A must-listen for engineers, SREs, and anyone responsible for keeping services running under extreme load. #CapacityPlanning #BlackFriday #SiteReliabilityEngineering #SRE #TrafficSurge #AutoScaling #HeadroomTargets #ThunderingHerd #LoadTesting #Ecommerce #Uptime #IncidentPrevention #CloudInfrastructure #ProductionEngineering #FexingoBusiness #Technology #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo

  44. 4

    How SRE Teams Use Service Level Objectives to Drive Business Decisions

    Lucas and Luna explore how service level objectives (SLOs) have evolved from a technical metric into a strategic business tool. Using examples from Google, Etsy, and a mid-size fintech startup, they show how SLOs help SRE teams align with product managers, trade reliability for feature velocity, and communicate risk in terms executives understand. The episode drills into the concept of 'SLO-based product trade-offs' — a framework that lets teams decide when to launch a new feature versus fix reliability debt. Lucas shares a concrete example from a hypothetical payment processor that used an SLO budget to defer 90 percent of non-critical incidents. Luna pushes back on the challenges of getting product teams to respect SLOs. The episode includes a brief, organic segment about listener support via Buy Me a Coffee. No fluff, no clickbait — just a focused look at one of the most underused superpowers in site reliability engineering. #SLO #ServiceLevelObjectives #SiteReliabilityEngineering #SRE #IncidentResponse #ReliabilityEngineering #GoogleSRE #EtsySRE #FeatureVelocity #ProductTradeOffs #ErrorBudget #Observability #DevOps #BusinessMetrics #TechStrategy #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  45. 3

    How SRE Teams Use Toil Budgets to Prioritise Automation

    Episode 16 of The Site Reliability Podcast explores toil budgets: the SRE practice of capping manual, repetitive work so teams have time for automation. Lucas and Luna break down how Google defined toil in its SRE book, how a mid-size fintech used a 50% toil budget to reduce incident response time, and why tracking toil by hand feels ironic. They discuss a concrete case where one team freed up 30 hours per week by automating a single database restart task. The episode also covers where toil budgets break down — when manual work is actually valuable, like customer onboarding configuration. If you run on-call rotations or manage production systems, this gives you a practical framework to argue for automation spend. #ToilBudget #SRE #Automation #SiteReliabilityEngineering #IncidentResponse #GoogleSRE #OnCall #FexingoBusiness #BusinessPodcast #TechPodcast #ProductionEngineering #DevOps #OperationalExcellence #ManualToil #ErrorBudget #CapacityPlanning #TechOps #AlertFatigue Keep every episode free: buymeacoffee.com/fexingo

  46. 2

    How SRE Teams Handle On-Call Burnout Without Burning Out

    Episode 15 of The Site Reliability Podcast with Fexingo dives into the human side of site reliability engineering: on-call burnout. Lucas and Luna explore how teams at companies like Etsy and Honeycomb use structured rotations, incident-free shifts, and proactive 'time to recover' metrics to keep engineers fresh. They break down specific data—like the effect of 12-hour versus 7-day rotations on alert responsiveness—and discuss why burnout in SRE correlates directly with system reliability. The episode also touches on the hidden cost of 'always-on' culture and how shifting from reactive firefighting to deliberate recovery changes team dynamics. A must-listen for any SRE, DevOps engineer, or manager building resilient teams. #SRE #OnCallBurnout #SiteReliabilityEngineering #BurnoutPrevention #IncidentResponse #Etsy #Honeycomb #TimeToRecover #AlertFatigue #DevOps #ProductionEngineering #TeamHealth #ReliabilityCulture #TechPodcast #Technology #FexingoBusiness #BusinessPodcast #SREPodcast Keep every episode free: buymeacoffee.com/fexingo

  47. 1

    How SRE Teams Use Chaos Engineering for Non-Netflix Systems

    Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical database replication bug that would have caused a 45-minute outage during Black Friday. Lucas explains the difference between literal chaos—randomly killing servers—and structured chaos experiments guided by a game day calendar and error budget thresholds. Luna pushes back on whether small teams can afford the complexity, and Lucas cites an open-source tool, Litmus, that reduced BlinkMart's mean time to detect from 12 minutes to under 90 seconds. The conversation covers the three pillars of a practical chaos program: a hypothesis-driven experiment design, a blast radius limit (no more than 5% of user traffic), and a rollback trigger. They also discuss how chaos engineering shifts team culture from fear of failure to curiosity about failure modes. Donation segment: BlinkMart's VP of engineering mentioned that the chaos experiments cost less than half the projected revenue loss from a single outage, making the ROI argument simple for the board. #ChaosEngineering #SRE #SiteReliability #BlinkMart #FailureInjection #Litmus #GameDays #BlastRadius #ErrorBudget #MeanTimeToDetect #BlackFriday #DatabaseReplication #IncidentResponse #ResilienceEngineering #OpenSource #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  48. 0

    How Microsoft SREs Automate Capacity Planning at Cloud Scale

    Episode 13 of The Site Reliability Podcast explores how Microsoft's SRE teams automate capacity planning to keep Azure running smoothly despite unpredictable demand. Lucas and Luna break down the three-layer approach — demand forecasting, headroom management, and autoscaling — and walk through a real case where a retail giant's Black Friday traffic spike was absorbed without a single incident. They discuss the tension between efficiency and resilience, how SREs use historical traffic patterns and machine learning to predict compute needs, and why over-provisioning isn't always the answer. Listeners will learn how capacity planning has evolved from a manual quarterly spreadsheet exercise into a continuous, automated feedback loop — and why that shift is critical for any organization running infrastructure at scale. #SRE #CapacityPlanning #Azure #Microsoft #CloudComputing #Autoscaling #DemandForecasting #SiteReliabilityEngineering #IncidentPrevention #BlackFriday #Retail #MachineLearning #Observability #Uptime #Infrastructure #Technology #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

  49. -1

    How GitHub SREs Run Postmortems Without Blame

    Episode 12 of The Site Reliability Podcast with Fexingo digs into GitHub's postmortem culture, specifically how their SRE team runs incident reviews that actually prevent recurrence without destroying psychological safety. Lucas and Luna walk through the five-part structure GitHub uses, the distinction between 'blame' and 'accountability,' and why writing a timeline before identifying causes changes the whole conversation. They also touch on a 2023 availability incident that GitHub's postmortem process turned into a system-wide improvement. If your team's incident reviews feel like finger-pointing or box-checking, this episode gives you a concrete framework to try. No ads, listener-supported: a few folks chip in at buymeacoffee.com/fexingo. #GitHub #Postmortem #BlamelessCulture #IncidentReview #SRE #SiteReliabilityEngineering #IncidentManagement #PsychologicalSafety #LearningCulture #DevOps #ProductionEngineering #Uptime #Infrastructure #RootCauseAnalysis #Technology #FexingoBusiness #BusinessPodcast #SREPodcast Keep every episode free: buymeacoffee.com/fexingo

  50. -2

    How Cloudflare Handles 46 Million Requests Per Second With SRE

    In this episode of The Site Reliability Podcast, Lucas and Luna dive into how Cloudflare's SRE team manages to process over 46 million HTTP requests per second across its global edge network. They explore the concept of 'edge of network' infrastructure, the role of anycast routing in distributing load, and how the team uses automated canary deployments to catch failures before they impact customers. Lucas breaks down the specific alerting thresholds that trigger human intervention versus automated rollback, and Luna challenges him on the limits of automation in incident response. The episode also covers how Cloudflare's post-incident review process differs from traditional postmortems, focusing on blameless analysis and systemic fixes. This concrete case study offers listeners a rare behind-the-scenes look at how one of the internet's largest traffic intermediaries keeps its infrastructure running smoothly. #Cloudflare #SRE #SiteReliabilityEngineering #EdgeComputing #CDN #HTTPRequests #AnycastRouting #CanaryDeployments #IncidentResponse #Postmortem #AutomatedRollback #Alerting #Infrastructure #ProductionEngineering #Uptime #Scalability #FexingoBusiness #BusinessPodcast Keep every episode free: buymeacoffee.com/fexingo

ABOUT THIS SHOW

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept — error budgets, toil automation, postmortem culture, capacity planning — and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers — latency percentiles, MTTR trends, SLO burn rates — while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know

HOSTED BY

Fexingo

Frequently Asked Questions

How many episodes does The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering have?

The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering currently has 50 episodes available on PodParley. New episodes are automatically indexed when they're published to the podcast feed.

What is The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering about?

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept — error budgets, toil automation, postmortem culture, capacity planning — and grounds it in a...

How often does The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering release new episodes?

The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering has 50 episodes. Check the episode list to see recent publication dates and frequency.

Where can I listen to The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

You can listen to The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering on PodParley by clicking any episode. We provide an embedded audio player for direct listening, and you can also subscribe via your preferred podcast app using the RSS feed.

Who hosts The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering?

The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering is created and hosted by Fexingo.