PODCAST · technology
Humans of Reliability
by Rootly
Behind every reliable software system, there are people working hard to keep it online. Humans of Reliability is a series that spotlights the engineers, leaders, and innovators at the heart of incident management and system reliability. Through candid conversations, we explore the challenges, lessons, and personal journeys of those navigating complex technical landscapes to ensure the systems we rely on run smoothly. From unforgettable incident stories to favorite tools, workflows, and hobbies, Humans of Reliability uncovers the human side of technology, offering insights and inspiration for anyone passionate about building and maintaining resilient systems. https://rootly.com/humans-of-reliability
The Golden Hour: Why the First 15 Minutes of an Incident Decide Everything w/ Gandhi M. N. Kumar (Twilio)
Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar, Principal Incident Commander at Twilio, with 14 years running calls that have pulled in up to 100 responders, argues the work that actually matters happens in the first 15 minutes. In this episode, Gandhi walks through what he calls the golden hour: the window where you decide whether you know what's broken, who belongs on the call, and whether to chase the root cause or reach for redundancy. He gets into why mitigation has to come before diagnosis, why customers trust your status page more than your engineers, and why he once sat with a stopwatch counting how many clicks it took to declare an incident. Along the way: the human side leaders keep underinvesting in, the math of on-call fatigue, and where AI is actually pulling weight in the incident commander seat.
HOSTED BY
Rootly