Slight Reliability

PODCAST · technology

Learning SRE, one day at a time.

  1. 124

    Being a Digital Nomad with Amin Astaneh (Episode 122)

    Can you combine your career with personal adventure? What would it be like to live in a truck and travel the country while working remotely?

    This week I'm joined again by SRE & DevOps legend Amin Astaneh in one of the most human interviews I've ever done. We explore...
    💉 The impact of the post-Covid layoffs
    🛻 The logistics of living in a mobile home
    💥 Starting a consultancy
    ❤️ The impact of the nomad lifestyle on relationships
    🧸 Being woken up by a giant black bear
    ...and much more.

    You can find Amin on...
    LinkedIn: https://www.linkedin.com/in/aminastaneh/
    His website: https://certomodo.io/
    and the Reliability Rebels podcast: https://podcast.certomodo.io/

    Companies that Amin partners with...
    Nobl9 (SLOs as a service): https://www.nobl9.com/
    LearnKube (Kubernetes education): https://learnkube.com/

    Amin also mentioned Go Fast Campers if you were interested (I have no partnership or even knowledge of this company, just putting it here in case anyone was interested in seeing what Amin's living space looks like): https://gofastcampers.com/

    You can buy Slight Reliability merch here (Note: you cannot order the mugs outside of New Zealand): https://slightreliability.digitees.co.nz/

    You can find Stephen on:
    LinkedIn: https://www.linkedin.com/in/stephentownshend/
    Bluesky: https://bsky.app/profile/slightreliability.bsky.social
    YouTube: https://www.youtube.com/c/SlightReliability
    Instagram: https://www.instagram.com/slight_reliability/
    TikTok: https://www.tiktok.com/@the_kiwi_sre

  2. 123

    Four Golden Signals to Kickstart SRE (Episode 121)

    When you first start implementing SRE, it's a good idea to find early wins. Implementing monitoring of the four golden signals + availability is something I'm experimenting with at the moment to give our SRE team momentum and to pave the way for SLOs and more advanced observability.

    In this solo episode I share my experiences, including...
    💛 What are the four golden signals?
    🦥 Observability in an async processing context
    📏 The power of tracking availability
    🤷 Why SLOs can be challenging to kick off on day 1
    🚪 An invitation to come on the podcast
    ...and much more.

  3. 122

    Staying Motivated as a Leader with Cads Oakley (Episode 120)

    As an engineer you get constant dopamine hits by solving technical problems. As a leader you're often working toward long-term goals that span months or years. How do you stay motivated in that context?

    This week I'm joined by technology leader and fellow Aucklander Cads Oakley. We cover...
    🐆 Don't chase the result, harvest the learning
    🏆 The payoff for being a leader
    🎁 Giving feedback and giving others permission to give it to you
    💉 The role of dopamine
    🤷 How much control do we really have in our work?
    ...and much more.

    You can find Cads on...
    LinkedIn: https://www.linkedin.com/in/cadsoakley/
    Website: https://aurelia.nz/

    Content mentioned during the episode...
    Cads' article on the role of dopamine: https://www.linkedin.com/pulse/managing-dopamine-mind-cædman-cads-oakley/
    "Happy" by Derren Brown: https://www.goodreads.com/book/show/30142270-happy
    What Got You Here Won't Get You There by Marshall Goldsmith: https://www.amazon.com.au/dp/1401301304

  4. 121

    A Beginner's Guide to SRE (Episode 119)

    This week I repurpose a talk I just did at the JuniorDev meetup in Auckland. If you're new to SRE or observability then this is the talk for you. For the more seasoned listeners, it's a chance to see how my perspective and understanding have changed over the years.

    In the episode I refer to the This Is Fine! podcast on resilience engineering: https://www.thisisfinepod.com/
    You can read the Google SRE books for free online here: https://sre.google/books/

  5. 120

    Freeing Observability Data Hostages with Jacob Leverich (Episode 118)

    How do you ingest and store petabytes of telemetry every day in a cost-effective and high-performing way? How can you do this in a way which gives engineers the operational data they need to keep services running? How has this challenge been tackled in the past and what's been the evolution?

    This week I'm joined by Observe co-founder Jacob Leverich to go deep into this topic. We discuss...
    💾 A deep-dive into the evolution of telemetry storage and where it's going
    💽 The advent of generic storage that handles metrics, logs, and traces well
    🫶 Having empathy for people who need great observability but struggle to obtain it
    🤲 Not holding telemetry data hostage
    ✍️ Lessons from 8 years of running a start-up
    ...and much more.

    You can find Jacob on...
    LinkedIn: https://www.linkedin.com/in/jacob-leverich/
    And find out more about Observe here: https://www.observeinc.com/

    In the episode Jacob referred to Google's Dremel analysis platform: https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/
    And Apache Iceberg: https://iceberg.apache.org/

  6. 119

    How to Change the World with Rob Roe (Episode 117)

    How do you take all the utopian ideas you read about in books and apply them to the reality of the organisations we work in?

    This week I'm joined by leader, mentor, and coach Rob Roe to tackle this question. We discuss...
    🌪️ The pitfalls of functional silos
    🤫 Is the annual budget a load of rubbish?
    🍃 How our management promotion systems are often broken
    🫂 The power of virtual teams
    📗 Team interaction models
    ...and much more.

    You can find Rob on...
    LinkedIn: https://www.linkedin.com/in/robinsonroe/

    References from the episode:
    Rob's YouTube channel "Unlocking Change": https://www.youtube.com/@unlockingchange
    Action Inquiry by Bill Torbert: https://gla.global/the-glp/action-inquiry/
    Seven Transformations of Leadership: https://hbr.org/2005/04/seven-transformations-of-leadership
    Trade Me Team Topologies case study: https://teamtopologies.com/industry-examples/trade-me-journey-towards-a-thinnest-viable-platform

  7. 118

    Human Software with Richard Bown (Episode 116)

    We spend a third of our life at work. It needs to be something we enjoy and something with purpose. Our work experience also impacts our family, friends, and our personal lives.

    This week I'm joined by tech engineer, leader, and author Richard Bown to explore this and many other topics including...
    🌪️ The difficulty in applying the ideas we read in books in real organisations
    🤫 When you want to implement a thing you can't talk directly about the thing
    🍃 Does change require senior leadership? Can it be grassroots?
    🫂 The importance of looking after yourself at work
    📗 The experience of publishing a novel
    ...and much more.

    You can find Richard's book "Human Software - A Life in I.T" here: https://humansoftwarebook.com/
    ...and you can find Richard on...
    LinkedIn: https://www.linkedin.com/in/richard-bown/

    References from the episode (excluding some of the TV shows we mentioned):
    Rosegarden Linux music composition editing environment: https://www.rosegardenmusic.com/
    Managing Humans by Michael Lopp: https://www.oreilly.com/library/view/managing-humans-biting/9781430243144/
    Rands leadership Slack: https://randsinrepose.com/welcome-to-rands-leadership-slack/
    The Goal by Eliyahu Goldratt: https://www.goodreads.com/book/show/113934.The_Goal
    Local Hero (1983): https://www.imdb.com/title/tt0085859/
    Turn the Ship Around! by David Marquet: https://davidmarquet.com/books/turn-the-ship-around-book/
    Scrivener (book writing tool): https://scrivener.app/

  8. 117

    Leadership Gym with Xiao Zhang (Episode 115)

    When you become a people leader there is no manual. How can we not only learn leadership skills but practice them and build leadership muscle?

    This week I'm joined by Orion Group Limited co-founder Xiao Zhang to discuss...
    👑 The challenge of transitioning into people leadership
    💪 How we don't get fit by watching other people work out
    ⌚ Pausing as an act of active leadership
    🌌 The power of slack time for creativity and systems thinking
    🌊 Going below the waterline
    ...and much more.

    You can find Xiao on:
    LinkedIn: https://www.linkedin.com/in/xiao-zhang-nz/
    You can find Orion Group Limited here: https://www.oriongrouplimited.com/

    Xiao mentioned Brené Brown's new book Strong Ground: https://brenebrown.com/book/strong-ground/
    I mentioned The Phoenix Project by Gene Kim, Kevin Behr and George Spafford: https://itrevolution.com/product/the-phoenix-project/
    I couldn't find the article on slack time but this is pretty good: https://buildrightside.com/problem-upgrade-chart

  9. 116

    Starting a New Role (Episode 114)

    This week I kick off the 2026 season with some news and we explore how to prepare for a new role.

  10. 115

    AI Use-cases for SRE with Shmuel Kliger (Episode 113)

    From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us?

    This week I'm joined by Causely founder Shmuel Kliger to dive into...
    🌊 The three waves of AI hype over the decades (the history of AI)
    ☠️ The dangers of over-promising and under-delivering what AI can do
    🧠 What is causal reasoning?
    😱 Is AI replacing SREs?
    🔮 AI as a way to allow humans to solve higher level problems
    ...and much more.

    You can find Shmuel on:
    LinkedIn: https://www.linkedin.com/in/shmuel-kliger-1a91963/
    You can find Causely here: https://www.causely.ai/

    Shmuel mentioned 'The Book of Why' by Judea Pearl and Dana Mackenzie, which can be found here: https://www.amazon.com.au/dp/046509760X

  11. 114

    Operational Intelligence with Adam Kinniburgh (Episode 112)

    What is operational intelligence and how is it different from observability or BI?

    This week I'm joined by SquaredUp's VP of Innovation Adam Kinniburgh to answer that question and many more including...
    ❓ What is operational intelligence?
    🙈 Relating observability back to customer, business, or revenue
    😎 The value of giving stakeholders confidence
    🌉 Who bridges the gap between tech and business or engineers and leadership?
    🦋 Correlation VS causation and our innate desire to build connections
    ...and much more.

    You can find Adam on:
    LinkedIn: https://www.linkedin.com/in/adamkinniburgh/
    ...and the Operationally Intelligent podcast here: https://squaredup.com/operationally-intelligent/

  12. 113

    Leading Platform Teams with Dinesh Sukhija (Episode 111)

    How does leading platform teams differ from leading product teams?

    This week I'm joined by experienced technology leader Dinesh Sukhija to answer that question and many more including...
    ❓ What is a platform team?
    ⚽ Coaching engineers to focus on outcomes
    ☀️ Connecting platform initiatives to business goals
    ✋ Identifying the limiters in your team
    🎤 Spreading knowledge and avoiding single points of failure
    ...and much more.

    You can find Dinesh on:
    LinkedIn: https://www.linkedin.com/in/dinesh-sukhija/

  13. 112

    Leadership Round One! (Episode 110)

    How have my first two years as a manager in tech been? What have I learned? What do I need to work on?

    This week I share my experiences over the past couple of years. I cover:
    🔥 My recent close call with burnout
    🫶 How I attempted to build a team culture
    💪 The importance of tough conversations
    🥱 How roles and responsibilities might be boring to think about but are critical
    ❓ What's next?
    ...and much more.

  14. 111

    The Implications of AI on Observability with Aaron "Checo" Pacheco (Episode 109)

    How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight?

    This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling within organisations and how AI is helping to find signals amongst the noise and reduce cognitive load for SREs. We discuss...
    🎂 The need for a layer of logic on top of our telemetry data
    🚲 The observability lifecycle of a DevOps team
    🎶 How most orgs have many observability tools, and how we might make that work
    🤯 Reaching the limits of what humans can comprehend as a reason for AI
    📕 How poor documentation may become AI's downfall in the future
    ...and much more.

    You can find Checo on:
    LinkedIn: https://www.linkedin.com/in/checopacheco/
    You can find more about Ottermon AI on their website: https://www.ottermon.ai/ or on LinkedIn: https://www.linkedin.com/company/ottermon/

  15. 110

    Chaos Engineering with Kolton Andrus (Episode 108)

    What is chaos engineering and how is it being used in 2025?

    This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss...
    🌪️ What is chaos engineering and what are its origins?
    🪴 How has it evolved over the years?
    🤖 The role of AI agents in SRE work
    💰 Justifying the value of chaos engineering
    🏃‍♀️‍➡️ How do I get started?
    ...and much more.

    You can find Kolton on:
    LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/
    And you can find out more about Gremlin's new reliability intelligence platform here: https://www.gremlin.com/technologies/reliability-intelligence

  16. 109

    Team Topologies with Luke McManus (Episode 107)

    What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)?

    This week I'm joined by Luke McManus to discuss...
    ⛰️ What are the four team topologies?
    🏆 Can we have too much collaboration?
    ⌚ Team interaction models
    🌏 Cognitive load
    🏃‍♀️‍➡️ Value dynamics mapping
    ...and much more.

    You can find Luke on:
    LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/

    Check out the recently released second edition of the Team Topologies book by Matthew Skelton and Manuel Pais here: https://itrevolution.com/product/team-topologies-second-edition/
    Or Unbundling the Enterprise by Stephen Fishman and Matt McLarty here: https://itrevolution.com/product/unbundling-the-enterprise/

  17. 108

    Contributing to Open Source with Wendy Ha (Episode 106)

    How do you begin contributing to an open source project? What's it like? What do you get out of it?

    This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore...
    ⛰️ What it's like working on one of the biggest open source projects in the world
    🏆 The benefits of contributing to open source
    ⌚ How much time and effort does it take?
    🌏 The unique challenges of contributing from APAC (and the need for more contributors in Australia and New Zealand)
    🏃‍♀️‍➡️ How to get started
    ...and much more.

    You can find Wendy on:
    LinkedIn: https://www.linkedin.com/in/wendyha-sut/

    Ways you can get started contributing to Open Source:
    CNCF from Zero to Merge Program: https://project.linuxfoundation.org/cncf-zero-to-merge-application
    LFX Mentorship Program: https://mentorship.lfx.linuxfoundation.org/#projects_all
    Outreachy Mentorship Program: https://www.outreachy.org/mentor/
    Google Summer of Code: https://summerofcode.withgoogle.com/
    Kubernetes Release Team Shadowing: https://github.com/kubernetes/sig-release/blob/master/release-team/README.md#release-team-shadow

  18. 107

    Influencing Leadership with Nora Jones (Episode 105)

    As an #SRE how do you influence senior leadership to get support and priority for the things you care about?

    To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on...
    🤝 How understanding needs to flow both ways (between engineers and leaders)
    🎨 Reliability is as much an art as a science
    📝 Using napkin math to start conversations
    🧠 Understand the system (your org) before trying to change it
    💬 Using micro-interactions to gradually implement change
    ...and so much more.

    You can find Nora on:
    LinkedIn: https://www.linkedin.com/in/norajones1/
    You can find more about PagerDuty here: https://www.pagerduty.com/nlp/trial-sign-up/

  19. 106

    Slight Reliability Podcast Retrospective (Episode 104)

    This week I do a retrospective on the Slight Reliability podcast.
    👂 How many people listen to it?
    ❤️ How do I feel about the show?
    🎉 What's going well?
    🪴 What could be better?
    ❔ What's next for the show?

    If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time

  20. 105

    Burnout with Colette Alexander (Episode 103)

    Have you burned out at work? What was your experience? How did you work through it?

    This week I'm joined by the incredible Colette Alexander to discuss what burnout is and what it means, and we both share our personal experiences of burning out at work. We cover...
    🔥 What is burnout?
    ❓ Why does it happen?
    🫀 What are the symptoms?
    🥊 Fight, flight, or freeze
    🧑‍🚒 Advice on how to recover
    ...and much more.

    Resources from the show...
    Why you're so angry at work (and what to do about it) by Natalie Rothfels: https://www.lennysnewsletter.com/p/why-youre-so-angry-at-work
    Burnout (book) by Amelia and Emily Nagoski: https://www.burnoutbook.net/
    How to do nothing (book) by Jenny Odell: https://www.penguinrandomhouse.com/books/600671/how-to-do-nothing-by-jenny-odell/

    You can find Colette on:
    LinkedIn: https://www.linkedin.com/in/colette-alexander-4168267/
    You can find the This Is Fine! podcast here: https://www.thisisfinepod.com/

  21. 104

    Mobile Observability with Hanson Ho (Episode 102)

    This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover...
    📱 The mobile/backend observability divide
    ✍️ The challenge of distributed tracing on mobile apps
    🌏 The entire device runtime environment matters for your app
    👤 The quest for user-centric mobile observability
    ✅ Advice on how to get started with mobile observability
    ...and much more.

    You can find Hanson on:
    LinkedIn: https://www.linkedin.com/in/hanson-ho/
    Bluesky: https://bsky.app/profile/bidetofevil.wtf
    You can find out more about Embrace at https://embrace.io/

  22. 103

    Intro to Resilience Engineering with Michelle Casey (Episode 101)

    This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover...
    🏋️‍♀️ Reliability VS Robustness VS Resilience
    🧩 What is a complex system?
    🔢 Safety one/safety two
    🧠 Mental models
    😩 Human error
    ...and so much more.

    Resources from this episode:
    Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implications_for_the_future_of_resilience_engineering
    Building and revising adaptive capacity sharing for technical incident response (paper) by Dr Richard Cook and Dr Beth Long https://www.researchgate.net/publication/344259449_Building_and_revising_adaptive_capacity_sharing_for_technical_incident_response_A_case_of_resilience_engineering
    Systems Thinking for Incident Analysis (talk) by Laura Nolan from LFI Conf 23 https://www.youtube.com/watch?v=-uXGg3g2yps
    How Complex Systems Fail (website) by Dr. Richard Cook https://how.complexsystems.fail/
    A Tale of Two Safeties (book) by Erik Hollnagel https://erikhollnagel.com/A Tale of Two Safeties.pdf
    From Safety One to Safety Two (book) by Erik Hollnagel https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf
    Resilience: It's not you, it's the System (talk) by Dr Carl Horsley https://www.youtube.com/watch?v=ugC3GTKt23U
    Above the line / Below the line (paper) by Dr Richard Cook (not original link) https://www.researchgate.net/figure/Above-the-Line-Below-the-Line-framework-adapted-with-permission-Cook-Woods-2016_fig3_333091997
    How Your Systems Keep Running Day After Day (talk) by John Allspaw https://www.youtube.com/watch?v=xA5U85LSk0M
    Behind Human Error (book) https://www.amazon.com.au/Behind-Human-Error-David-Woods/dp/0754678342
    The Field Guide to Human Error Investigations (book) by Sidney Dekker https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker/books/DekkersFieldGuide.pdf
    The Howie Guide (paper) by Dr Laura Maguire, Nora Jones and Vanessa Granda https://howie-guide.pagerduty.com/
    Resilience Engineering: Where do I start? (website) by Lorin Hochstein https://www.resilience-engineering-association.org/resources/where-do-i-start/
    The STELLA report (paper) https://snafucatchers.github.io/
    DORA Community Discussion - Resilience Engineering (discussion) https://www.youtube.com/watch?v=g3cEJ7njJbc
    This Is Fine! (podcast) by Colette Alexander and Clint Byrum https://www.thisisfinepod.com/the-pod

  23. 102

    Learning with John Allspaw (Episode 100)

    This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss...
    📒 Classroom VS situated learning
    🤝 The myth of the perfect handover
    ITIL as a coping strategy to try and make sense of the organic, wild, and messy
    🥕 How you cannot incentivise to avoid incidents (it doesn't work that way)
    ❤️‍🩹 You can't understand how something is broken unless you know how it's supposed to work in the first place
    ...and much more.

    Resources from this episode:
    Pre-Accident Investigations by Todd Conklin https://www.amazon.com.au/Pre-Accident-Investigations-Introduction-Organizational-Safety/dp/1409447820
    Working at the Center of the Cyclone by Dr. Richard Cook https://itrevolution.com/articles/center-of-the-cyclone-dr-richard-cook/
    To join the Resilience in Software Foundation head over to: https://resilienceinsoftware.org/

    You can find John on:
    Website: https://www.kitchensoap.com/
    LinkedIn: https://www.linkedin.com/in/jallspaw/
    You can find Adaptive Capacity Labs here: https://www.adaptivecapacitylabs.com/

  24. 101

    Focusing on What Matters with Trent Hornibrook (Episode 99)

    This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss...
    🔌 Empowering engineers to implement change in your org
    🧑‍🍼 Focusing on what matters (customer & business > technology)
    👀 Not just adding more monitoring as the output of each PIR
    😎 How autonomy can lead to accountability
    🌳 How to influence change in an organisation
    ...and much more.

    You can find Trent on:
    LinkedIn: https://www.linkedin.com/in/trenthornibrook/

  25. 100

    The Root Cause Fallacy with Andrew Hatch (Episode 98)

    This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore...
    🌌 Is the root cause of every incident the big bang?
    🦖 How the value of root cause degrades as complexity increases
    🫣 That if the culture is not blameless, people will hide things
    🌳 Alternative approaches to root cause analysis such as branching timelines
    🙋 Getting someone without skin in the game to facilitate your blameless post-mortems
    ...and much more.

    You can find Andrew on:
    LinkedIn: https://www.linkedin.com/in/hatchman76/
    Check out Andrew's SREcon21 talk 'Learning from Complex Systems' which covers many of the topics introduced in this episode: https://www.youtube.com/watch?v=5pKGW61Ryvo

  26. 99

    Synthetic Monitoring with David Dick (Episode 97)

    This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover...
    🤖 What is synthetic monitoring?
    🦾 What are the benefits and drawbacks of using it?
    ☢️ Non-web based synthetics (the tough stuff)
    🍹 Combining RUM and synthetics
    🫢 Do synthetics need an OTEL-like framework?
    ...and much more.

    You can find David on:
    LinkedIn: https://www.linkedin.com/in/david-dick/
    You can find more about 2 Steps at https://2steps.io/#

  27. 98

    Tech Leadership with Milan Brown (Episode 96)

    This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Conditions in an org for people to thrive 😵‍💫 How do you learn to manage and lead? 🫤 Managing people when you're not an expert in what they do ...and much more. Resources mentioned during the episode: Turn The Ship Around! (book): https://davidmarquet.com/turn-the-ship-around-book/ Agile Conversations (book): https://itrevolution.com/product/agile-conversations/ Drive (book): https://www.danpink.com/books/drive/ Radical Candor (book): https://www.radicalcandor.com/the-book/ The Team Canvas (technique): https://theteamcanvas.com/ The Engineer/Manager Pendulum (article): https://charity.wtf/2017/05/11/the-engineer-manager-pendulum/ Retromat (tool for running retrospectives): https://retromat.org/ You can find Milan on: LinkedIn: https://www.linkedin.com/in/milan-brown/ You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app/profile/slightreliability.bsky.social YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliability/ TikTok: https://www.tiktok.com/@the_kiwi_sre

  28. 97

    Finding Tech Work with Leon Adato (Episode 95)

    Send us Fan MailThis week Leon Adato and I break down the state of applying for roles in tech. We cover...📝 What a resume or CV is and is not🤝 Leveraging your connections rather than relying on applying cold🪄 How most job descriptions are works of fiction🦾 White-fonting to game AI resume assessment🧪 Experimental ways we could recruit...and our pitch for Kubernetes the Rock Opera (and much more)You can find Leon's job postings weekly on his website:https://www.adatosystems.com/category/joblistings/You can find Leon on:LinkedIn: https://www.linkedin.com/in/leonadato/Bluesky: https://bsky.app/profile/leonadato.bsky.socialYou can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Bluesky: https://bsky.app/profile/slightreliability.bsky.socialYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  29. 96

    Getting a Start in SRE with Priyam Kumar (Episode 94)

    This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... 🪖 War stories and examples of production incidents 🩹 The "hacks" we build to keep things running (and how maybe that's just normal) 😎 Keeping it simple... YAGNI (You Ain't Gonna Need It!) 🧯 The perils of getting stuck in reactive mode 📖 Areas of learning if you want to get into SRE ...and much much more. You can find Priyam on: LinkedIn: https://www.linkedin.com/in/priyam-kumar/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app/profile/slightreliability.bsky.social YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliability/ TikTok: https://www.tiktok.com/@the_kiwi_sre

  30. 95

    SRE Leadership with Michelle Casey (Episode 93)

    Send us Fan MailThis week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as...🤷🏽 Why move into leadership?👁️ Learning from other leaders💎 What is unique about SRE leadership?👑 Women in engineering leadership...and we go through some feedback I got as a leader recently.Resources that Michelle mentions during the episode:The Five Dysfunctions of a Team (book): https://www.tablegroup.com/topics-and-resources/teamwork-5-dysfunctions/The Phoenix Project (novel): https://itrevolution.com/product/the-phoenix-project/The Unicorn Project (novel): https://itrevolution.com/product/the-unicorn-project/How Complex Systems Fail (website): https://how.complexsystems.fail/How Your Systems Keep Running Day After Day (talk): https://www.youtube.com/watch?v=xA5U85LSk0MThe Curse of the Systems Thinker (article): https://blog.relyabilit.ie/the-curse-of-systems-thinkers/Confessions of an SRE Manager (talk): https://www.usenix.org/conference/srecon23americas/presentation/hatchGender Decoder (website): https://gender-decoder.katmatfield.com/You can find Michelle on:LinkedIn: https://www.linkedin.com/in/michelle-casey-00b39837/Steve Licks Instagram: https://www.instagram.com/tailsofstevielicks?igsh=MWFhenVzdzh6Zmtudw%3D%3DYou can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Bluesky: https://bsky.app/profile/slightreliability.bsky.socialYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  31. 94

    Observability Maturity with Ádám Tóth (Episode 92)

    Send us Fan MailThis week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as...💸 Does your org treat observability as a cost centre or a value add?🔥 Are you using observability reactively to solve problems? Or proactively to build better products and services?👤 Is your observability connected to your users and business in a meaningful way?🌐 Is monitoring the social media sentiment of your product part of observability?...and much more.You can find Adam at:LinkedIn: https://www.linkedin.com/in/adam-toth-innovateq/InnovaTeQ website: https://innovateq.io/I mentioned the 'This Is Fine!' podcast about resilience engineering. Find it on Spotify or at https://www.thisisfinepod.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Bluesky: https://bsky.app/profile/slightreliability.bsky.socialYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  32. 93

    Head in the Clouds (Episode 91)

    Send us Fan MailIn this episode I explore the challenges of achieving unified observability when integrating with SaaS products and services. I cover:🌊 The new wave of mega-complex SaaS⚗️ Challenges integrating SaaS with our observability pipelines👩‍🦯 How the lack of SaaS autonomy limits the effectiveness of OpenTelemetry💰 Paying twice to ingest, store, and search telemetry📈 Monitoring and predicting SaaS observability costs...and much more.Shout out to Mark Chiavaroli (and apologies for mispronouncing your surname multiple times), Damian Sharrock, and Reece Hewitt for bouncing ideas on this topic.The 'Is it observable?' series can be found here: https://isitobservable.io/...and you can find Henrik on LinkedIn: https://www.linkedin.com/in/hrexed/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Bluesky: https://bsky.app/profile/slightreliability.bsky.socialYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  33. 92

    Non-Prod Reliability Engineering + 2024 Wrap (Episode 90)

    Send us Fan MailThis week I check in and give an update on work, life, and my attempts at bringing to life SRE practices in the world of non-production environment management.You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  34. 91

    Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand

    Send us Fan MailThis week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google to discuss blameless post-mortems. We cover:🦅 The recent Crowdstrike outage and their public post-mortem🚑 When do we do a blameless post-mortem?😕 How do we do a blameless post-mortem?✅ How do we make sure action items are followed through?📰 The power of learning from post-mortems created by other teams and orgs...and much more.You can find Karanveer on LinkedIn: https://www.linkedin.com/in/karanveer/You can find Crowdstrike's preliminary post incident report here: https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  35. 90

    Slight Reliability Episode 88 - OpenTelemetry Revisited with Zach Michel

    Send us Fan MailThis week Zach Michel from https://middleware.io/ and I discuss the state of OpenTelemetry and what it means to adopt it. We cover:🌩️ Achieving observability in a SaaS world🥫 Context propagation - the magic sauce of OTEL🚪 The telemetry gateway concept and leveraging the OTEL collector🪵 The state of OpenTelemetry logging🫂 Making use of the OpenTelemetry community...and much more.You can find Zach on LinkedIn: https://www.linkedin.com/in/zamichel/You can find the official Slight Reliability podcast website at: https://slightreliability.com/For a list of ways to interact with the OpenTelemetry community go to:https://opentelemetry.io/community/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  36. 89

    Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko

    Send us Fan MailIn Episode 80 Niall Murphy talked about the need for SREs to be better at articulating the value of our work. In this episode I'm joined by ex-Googler and Engineering Director (SRE) at Culture Amp Artem Yakimenko about how we might achieve this.We discuss both quantifiable and qualitative approaches including leveraging the untapped data in support tickets, customer sentiment and rankings, the relationship between finance and performance, the link between user design and performance, and so much more.Books mentioned in the episode:100 Things Every Designer Needs to Know About PeopleBy Susan Weinschenkhttps://www.amazon.com.au/Things-Every-Designer-Needs-People/dp/0321767535You can find Artem on LinkedIn: https://www.linkedin.com/in/temikus/You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  37. 88

    Slight Reliability Episode 86 - Evolving SLOs with Dom Finn

    In the world of SRE we constantly talk about defining SLOs, but what about evolving them over time? This week I chat with SRE Tech Lead Dom Finn about just that. We cover the relationship between reliability and user analytics, latency classes as a way to speak SLOs with business stakeholders, the role of NFRs and how the thresholds differ from SLOs, and much more. Books mentioned in the episode: The Beginning of Infinity: Explanations That Transform the World, by David Deutsch: https://www.amazon.com.au/Beginning-Infinity-Explanations-Transform-World/dp/0143121359 Turn The Ship Around!, by David Marquet: https://davidmarquet.com/turn-the-ship-around-book/ You can find Dom on LinkedIn: https://www.linkedin.com/in/dom-finn/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliability/ TikTok: https://www.tiktok.com/@the_kiwi_sre This episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  38. 87

    Slight Reliability Episode 85 - Feeling SaaSsy

    Send us Fan MailThis week I talk about the impact of SaaS-first technology strategies on the work of an SRE. I pose questions about observability, ownership, on-call, and how much control we have over reliability.You can find the Bleeding Tech blog on Medium: https://medium.com/@stownshendYou can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  39. 86

    Slight Reliability Episode 84 - Clinical Troubleshooting with Dan Slimmon

    Send us Fan MailThis week I chat with Dan Slimmon about applying the approach doctors use to treat patient symptoms during incident response.You can find Dan's blog at https://blog.danslimmon.com/ or connect with him on LinkedIn here: https://www.linkedin.com/in/danslimmon/You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  40. 85

    Slight Reliability Episode 83 - An Unfulfilled Promise with Itiel Shwartz

    Send us Fan MailThis week I hear about all things Kubernetes from Komodor CTO and co-founder Itiel Shwartz. We chat about the promise that was made when Kubernetes first entered the industry, the challenge of getting developers engaged and capable of working in Kubernetes, my hate/hate relationship with Helm but its important contribution to the Kubernetes project, Kubernetes observability, and so much more.You can find the Kubernetes for Humans podcast here:https://komodor.com/blog/the-kubernetes-for-humans-podcast/Or find out more about Komodor here:https://komodor.com/Or find Itiel on LinkedIn: https://www.linkedin.com/in/itiel-shwartz-18542853/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  41. 84

    Slight Reliability Episode 82 - CI/CD with Amin Astaneh

    Send us Fan MailThis week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more.You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastanehYou can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sreThis episode was sponsored by SquaredUp. SquaredUp combines all your data with awesome dashboards, analytics, health rollup, and notifications, into a unified observability portal. Using a data mesh architecture, SquaredUp is a beautifully simple way to get instant access to the insights that matter, whenever you need them. If you want to know more head over to https://squaredup.com/ to sign up for your free account.

  42. 83

    Slight Reliability Episode 81 - Incident Management in Non-Prod Environments

    Send us Fan Mail"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently?In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments.(Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort it for the next episode)You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

  43. 82

    Slight Reliability Episode 80 - What's Been Bugging Niall Murphy

    Send us Fan MailThis week I speak with co-author of the original SRE book + the SRE workbook, and renowned speaker Niall Murphy.We chat about the state of SRE in the current macro-economic climate and how we're not yet doing a very good job at articulating the value of SRE to leaders, the relationship that velocity and reliability have, the value of new features versus reliability improvements, and *much* more.You can find Niall at:LinkedIn: https://www.linkedin.com/in/niallm/X: https://twitter.com/niallmWebsite: https://relyabilit.ie/(and his company Stanza: https://www.stanza.systems/)You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/X: https://twitter.com/the_kiwi_sreInstagram: https://www.instagram.com/slight_reliability/

  44. 81

    Slight Reliability Episode 76 - Sampling Distributed Traces with Paige Cruz

    Send us Fan MailPaige Cruz (from Chronosphere) is back. This week we discuss sampling. What is sampling? Why do it? What kinds of sampling are there?You can check out Chronosphere's cloud native observability platform here: https://chronosphere.io/You can find Paige on:LinkedIn: https://www.linkedin.com/in/paigerduty/X: https://twitter.com/paigerdutyYou can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/X: https://twitter.com/the_kiwi_sreInstagram: https://www.instagram.com/slight_reliability/

  45. 80

    Slight Reliability Episode 79 - Incident Story Time with Valeska Victoria

    Send us Fan MailThis week Valeska Victoria returns to share some of her experiences working as an SRE at eBay.We look at the cascading effect of production issues in complex integrated environments (how there's often no single root cause), developer literacy of how infrastructure works, the importance of ownership and accountability of reliability, and much more.You can find Valeska on: LinkedIn: https://www.linkedin.com/in/valeska-victoria/You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/X: https://twitter.com/the_kiwi_sreInstagram: https://www.instagram.com/slight_reliability/

  46. 79

    Slight Reliability Episode 78 - Developer Experience with Ankit Jain

    Send us Fan MailThis week I chat with Ankit Jain from aviator.co about developer experience.We define developer experience and developer productivity, and how this applies to SRE. We discuss the growing expectation on developers and how this leads to frustration and burnout. We also explore how to measure developer experience and how to start working to make improvements.You can check out Aviator's developer experience platform here: https://www.aviator.co/You can find Ankit on:LinkedIn: https://www.linkedin.com/in/ankitjaindce/You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/X: https://twitter.com/the_kiwi_sreInstagram: https://www.instagram.com/slight_reliability/

  47. 78

    December 2023 Update

    Send us Fan MailA brief mid-week update on my changing circumstances and the future of the podcast.

  48. 77

    Slight Reliability Episode 77 - SRE to DevRel with Liz Fong-Jones

    Send us Fan MailThis week I had the privilege of interviewing Liz Fong-Jones from honeycomb.io about DevRel, Developer Advocacy, and how that applies to SRE.We discuss the difference between Developer Relations (DevRel) and Developer Advocacy, how Liz got into advocacy, how DevRel helps companies and the community, and some tips on how to get traction with SRE practices in your organisation.You can check out Honeycomb's observability platform here: https://www.honeycomb.io/You can find Liz on:LinkedIn: https://www.linkedin.com/in/efong/Website: https://www.lizthegrey.com/ (all her social/links are here)You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/X: https://twitter.com/the_kiwi_sreInstagram: https://www.instagram.com/slight_reliability/

  49. 76

    Slight Reliability Episode 75 - Enterprise SRE with Steve McGhee

    This week I had the honour of chatting with Steve McGhee (former Google SRE, current Google Reliability Advocate, and co-author of Enterprise Roadmap to SRE). We discuss the evolution of SRE from where it began at Google and how it is being adopted by enterprises around the world now (and why this is happening). We talk about getting leadership support and how we get reliability taken seriously, the lies we tell ourselves to justify incidents and issues, leveraging transformation projects to bring SRE to life, how SLOs can act as the fulcrum between dev and ops, the fallacy of the pyramid model of reliability... and so much more. You can find Steve on: LinkedIn: https://www.linkedin.com/in/stevemcghee/ X: https://twitter.com/stevemcghee You can find Steve's book "Enterprise Roadmap to SRE" here: https://sre.google/resources/practices-and-processes/enterprise-roadmap-to-sre/ Steve also mentions the book "A Seat at the Table": https://itrevolution.com/product/a-seat-at-the-table/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ X: https://twitter.com/the_kiwi_sre Instagram: https://www.instagram.com/slight_reliability/

  50. 75

    Slight Reliability Episode 74 - The Hidden Side of Vendor Lock-In

    Send us Fan MailThis week on Slight Reliability Stephen discusses observability vendor lock-in. What is it? What does OpenTelemetry do to help? What areas are yet to be solved?You can find the official Slight Reliability podcast website at: https://slightreliability.com/You can find Stephen at:LinkedIn: https://www.linkedin.com/in/stephentownshend/Twitter: https://twitter.com/the_kiwi_sreYouTube: https://www.youtube.com/c/SlightReliabilityInstagram: https://www.instagram.com/slight_reliability/TikTok: https://www.tiktok.com/@the_kiwi_sre

ABOUT THIS SHOW

Learning SRE, one day at a time.

HOSTED BY

Stephen Townshend
