PODCAST · technology
This is Fine! A podcast about resilience engineering and software
by Colette Alexander and Clint Byrum
A podcast about resilience engineering and software. Ever wondered why things on the internet break? Do you work in software and wish that you could have a Dear-Abby-like call-in show that could answer your deepest questions about how to make your workplace suck less? We're here to help! Write us anonymously at our open question form. Email us at: [email protected]. Call us and leave a voicemail, or text us at: (401) 592-7574
-
35
Paper Club: Two Years Before the Mast w/special guest Eric Dobbs
Mitchell Hashimoto’s post on leaving Github: https://mitchellh.com/writing/ghostty-leaving-github
The Reddit post on Github’s availability historically (that we find questionable): https://www.reddit.com/r/github/comments/1rnvhs9/githubs_historic_downtime_scraped_and_plotted/
A reminder, the Messy 9 are: congestion, cascade, conflict, lag, saturation, friction, tempo, surprise, tangles
We have sometimes loved his stuff, but Gergely is annoying us with these posts: https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?r=78c7k&utm_medium=email and https://x.com/GergelyOrosz/status/2048017382036082706
You can find the RISF store with Hindsight Bias merch here: https://www.bonfire.com/store/risf/
You can find a copy of Richard Cook’s Two Years Before the Mast at Lorin’s Blog: https://surfingcomplexity.blog/wp-content/uploads/2026/03/twoyearsbeforethemast.pdf
A reminder, Richard Cook’s How Complex Systems Fail can be found at http://how.complexsystems.fail
Some writing on the 1996 Annenberg conference: https://www.researchgate.net/publication/351953417_Coming_Together_The
Folk models paper (not by Woods, by Dekker and Hollnagel), which is specifically targeting Situational Awareness as being a folk model: https://link.springer.com/article/10.1007/s10111-003-0136-9
Some stuff about SNAFU Catchers: https://www.snafucatchers.com/ and https://snafucatchers.github.io/
Eric referenced our conversation with Beth Long about Building and Revising Adaptive Capacity, which she co-wrote with Richard Cook about New Relic’s real-life example of resilience engineering: https://youtu.be/A_rU4-M61Hk and https://www.sciencedirect.com/science/article/abs/pii/S0003687020301903?via%3Dihub for the paper
Erik Hollnagel’s RAG gets referenced: https://erikhollnagel.com/onewebmedia/RAG%20Outline%20V2.pdf
Once again, we link you to Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Eric is referencing Lund; that is, their Human Factors and Systems Safety program: https://www.humanfactors.lth.se/
Check out Crisis Engineering! https://crisisengineering.layeraleph.com/crisis-engineering-the-book/
The upcoming RISF event on Practice of Practice Gamelan: https://resilienceinsoftware.org/events/245030
-
34
SRECon Americas 2026 recap
Colette’s talk at SRECon intro: https://www.usenix.org/conference/srecon26americas/presentation/alexander
Clint’s talk at SRECon intro: https://www.usenix.org/conference/srecon26americas/presentation/byrum
Dan Slimmon is an excellent engineer (per Clint’s shoutout) and ALSO an excellent podcast creator/host: https://techblows.net/
Michelle Brush’s Keynote summary is here: https://www.usenix.org/conference/srecon26americas/presentation/brush
Jevons Paradox: https://en.wikipedia.org/wiki/Jevons_paradox
Dr. Nicole Forsgren’s talk summary: https://www.usenix.org/conference/srecon26americas/presentation/forsgren
DORA is always worth a dive into if you haven’t taken a look yet: https://dora.dev/
The blog post Colette mentioned comparing AI gold rush to Mao’s Revolution: https://leehanchung.github.io/blogs/2026/04/05/the-ai-great-leap-forward/
Many people have written about why MTTR is a bad metric to track; you can read a write-up from Adrian Hornsby here: https://newsletter.resiliumlabs.com/p/mttr-problems-better-incident-metrics
And watch the OG, Courtney Nash, speak about it here: https://www.youtube.com/watch?v=uhCgBOHo8EY
Beth Long’s SRE Soundbath: https://www.usenix.org/conference/srecon26americas/presentation/long
Vanessa Huerta-Granda’s talk is summarized here: https://www.usenix.org/conference/srecon26americas/presentation/huerta-granda
Martin Smith and Abe Hoffman’s talk is summarized here: https://www.usenix.org/conference/srecon26americas/presentation/hoffman
Some information about Metrist: https://vault42consulting.com/about/portfolio/metrist
AI Agents Good Bad and Ugly talk: https://www.usenix.org/conference/srecon26americas/presentation/budichenko
The CAST talk: https://www.usenix.org/conference/srecon26americas/presentation/barroso
Engineering a Safer World by Nancy Leveson is worth a look: https://bookshop.org/p/books/engineering-a-safer-world-systems-thinking-applied-to-safety-nancy-g-leveson/57b01ef464f9f81b?ean=9780262533690&next=t
Erik Hollnagel wrote the book on FRAM and it has a lot of support in the safety world across industries: https://functionalresonance.com/ and https://etn-peter.eu/2021/02/11/fram-in-a-nutshell/ are good resources.
Daria Barteneva’s closing keynote on game theory and SRE was great: https://www.usenix.org/conference/srecon26americas/presentation/barteneva
Some good stuff on Above the Line/Below the Line, if you’re curious: https://queue.acm.org/detail.cfm?id=3380777 and https://www.youtube.com/watch?v=xA5U85LSk0M
Lorin Hochstein’s closing keynote on storytelling was rad: https://www.usenix.org/conference/srecon26americas/presentation/hochstein
SRECon EMEA 2026 (in Dublin) has their CFP up: https://www.usenix.org/conference/srecon26emea/call-for-participation
As always, you can check out the Resilience in Software Foundation at resilienceinsoftware.org
-
33
The 2025 DORA Report w/special guest Fred Hebert
You can find the 2025 DORA Report here: https://dora.dev/research/2025/dora-report/
Read more of Fred’s work/opinions here: https://ferd.ca/
If you want to know more about Lund’s Human Factors and Systems Safety program, you can read here: https://www.humanfactors.lth.se/
DORA has some good writeups of generative leadership and Westrum’s model here: https://dora.dev/capabilities/generative-organizational-culture/
We can reset the counter, it’s been 0 episodes since we mentioned Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Fred writes well about the Law of Stretched Systems: https://ferd.ca/the-law-of-stretched-cognitive-systems.html
We’re still trying to schedule a DORA event with our friends who make the report, but keep an eye out on https://resilienceinsoftware.org/events - it will pop up there when we do!
-
32
Building and Revising Adaptive Capacity Sharing for Technical Incident Response with Beth Adele Long
The Keweenaw snow gauge that Colette mentioned is a tourist attraction. If you want to see where measurements are at for the season you can find them here: https://www.pasty.com/snow/
The paper we’re talking about today can be found here: https://www.sciencedirect.com/science/article/abs/pii/S0003687020301903
If you want to know more about SNAFU Catchers, you can see their website here: https://www.snafucatchers.com/
They produced the STELLA report: https://snafucatchers.github.io/
Richard Cook’s Bone Talk is kind of famous - here’s a version from REDeploy: https://www.youtube.com/watch?v=8LbePBiOvZ4
Some writing from New Relic about NERFs: https://newrelic.com/blog/observability/best-practices-incident-commander-training
We failed to mention it in the podcast itself, but Michael Wettick did a great thesis at Lund on asking for help in software operations incidents: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9150096&fileOId=9150099
Speaking of Hitchhiker’s Guide, Etsy has some cool merch: https://www.etsy.com/listing/1071043200/dont-panic-hitchhikers-guide-to-the
You can find David Woods’ paper on Graceful Extensibility here: https://link.springer.com/article/10.1007/s10669-018-9708-3
Our Paper Club event on this paper on March 17th can be signed up for here: https://resilienceinsoftware.org/events/164680
-
31
Outsourcing and Resilience
Colette mentioned Menlo Innovations https://menloinnovations.com/ and Atomic Object https://atomicobject.com/ who both build custom software for folks. The CEO of Menlo is Richard Sheridan who wrote Joy, Inc. - https://bookshop.org/p/books/joy-inc-how-we-built-a-workplace-people-love-richard-sheridan/7677689?ean=9781591847120&next=t
Chad Todd’s thesis on Handovers in Software Operations is worth a read: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9076274&fileOId=9076276
Clint refers to Zingerman’s and their servant leadership model, one of Colette’s favorite places to learn about leadership from. If you want to know more, go to https://www.zingtrain.com/ and in particular, read https://shop.zingtrain.com/products/a-lapsed-anarchists-approach-to-being-a-better-leader
-
30
The Messy 9 and Coding with AI - A Panel Discussion
Special thanks to John Allspaw, Sheeri Cabral, Martin Smith, and David Woods for joining us!
Ben Affleck’s been making the promo rounds, but the specific convo we reference is recapped here: https://www.moviemaker.com/ben-affleck-ai-explains/
The Messy 9 are: congestion, cascade, conflict, lag, saturation, friction, tempo, surprise, tangles
Dave’s been doing a set of videos on Resilience Engineering, some of which have some crossover with the Messy 9 - you can find the first one here: https://resiliencefoundations.github.io/video-1-introduction-pt-1-it's-all-about-viability.html
Previous TiF episode on the Messy 9: https://www.thisisfinepod.com/the-pod/complex-systems-and-the-messy-nine-wspecial-guests-dave-woods-and-john-allspaw
Richard Cook on Above the Line/Below the Line: Written - https://dl.acm.org/doi/pdf/10.1145/3379510
A good excerpt from a talk from John Allspaw on Above the Line/Below the Line: https://www.youtube.com/watch?v=8bxj-FLEi10&list=PLb1aZTnPf3-OMChMkrr6WsokRI6LOnuem
Colette mentioned the competence knowledge model: https://en.wikipedia.org/wiki/Four_stages_of_competence
There’s a good argument based on the conversation here that AI makes it harder for Consciously Incompetent people to graduate to Conscious Competence. And, in Martin’s case, it makes Unconsciously Competent folks need to backtrack into Conscious Competence to “teach” it how to do things they don’t always think about.
We can reset the clock to 0 episodes since we’ve mentioned the Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
There is a good blog on Jamie Zawinski’s saying on regular expressions here: https://regex.info/blog/2006-09-15/247
Alex Gorbachev and The Battle Against Any Guess seems to have become a paper: https://www.researchgate.net/publication/251255185_Battle_Against_Any_Guess
Dave talks about Robust Yet Fragile as part of Resilience Engineering here: https://www.youtube.com/watch?v=gFotUdLL2zs
Lorin Hochstein’s blog post that Dave is referencing is https://surfingcomplexity.blog/2026/01/19/amdahl-gustafson-coding-agents-and-you/
Fred writes a good one on the Law of Stretched Systems: https://ferd.ca/the-law-of-stretched-cognitive-systems.html
The 1985 paper Dave keeps mentioning could be any number of things he released that year, but I have a hunch it’s this one: https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/511 or this one: https://link.springer.com/chapter/10.1007/978-3-642-50329-0_11
Dave references a lot of things around the economic sustainability around AI, and Ed Zitron has been writing quite a bit about that for the last year and change. See: https://www.wheresyoured.at/wheres-the-money/ and https://www.wheresyoured.at/big-tech-2tr/ among others.
-
29
Going Solid
If you’re feeling like you need to do more to respond to our moment:
Lots of places to donate to in the Twin Cities are listed here: https://mspmag.com/arts-and-culture/general-interest/ice-minnesota-support-immigrant-communities-fundraisers-food-drives-trainings/
You can always find mutual aid networks in your own area, including immigrant aid networks
https://immigrantdefensenetwork.org/ does good work, too
The Hometown Holler podcast with Tressie McMillan Cottom was a wonderful discussion: https://www.youtube.com/watch?v=2gr4mW8aR-g
Ruth Wilson Gilmore’s interview that I quoted clumsily is here: https://www.nytimes.com/2019/04/17/magazine/prison-abolition-ruth-wilson-gilmore.html
The paper itself: https://qualitysafety.bmj.com/content/14/2/130.short
If you haven’t seen The Pitt, you should, it’s super good: https://en.wikipedia.org/wiki/The_Pitt
Charles Perrow’s Normal Accidents has more definitions/examples of coupling: https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-professor-charles-perrow/cad38a43fcffa1f8?ean=9780691004129&next=t
Some stuff on microservices and coupling here: https://microservices.io/post/architecture/2023/03/28/microservice-architecture-essentials-loose-coupling.html
Colette’s #notanad endorsement for paper organizing is https://paperpile.com/
Rasmussen’s boundary model comes initially from his paper here: https://www.sciencedirect.com/science/article/abs/pii/S0925753597000520
And if you want a good writeup explaining Rasmussen’s boundary model, you can always read Lorin’s blog: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Dr Cook’s talk at Velocity is a classic, and goes over Rasmussen’s boundary model really well: https://www.youtube.com/watch?v=PGLYEDpNu60
Fred does a great job writing about the Law of Stretched Systems and how it applies to his own work on his blog: https://ferd.ca/the-law-of-stretched-cognitive-systems.html
“Plans are nothing, but planning is everything” is a paraphrase of Eisenhower: https://www.presidency.ucsb.edu/documents/remarks-the-national-defense-executive-reserve-conference
Want to chat about this paper with other folks? Come to the RISF live event for a Paper Party! https://resilienceinsoftware.org/events/157553
-
28
The Year in Resilience w/special guest John Allspaw
Seriously though, can’t wait to gtfo of this year.
Palisades fire links:
https://www.nbclosangeles.com/investigations/anonymous-letter-demands-independent-palisades-fire-investigations/3800442/
https://internationalfireandsafetyjournal.com/palisades-fire-report/
https://www.latimes.com/california/story/2025-12-20/lafd-report-on-palisades-fire-was-watered-down-in-editing-process-records-show
Corey Quinn’s commentary on the AWS outage in October is here: https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/
Time to reset the clock on how many episodes it’s been since we’ve mentioned the Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Also on Rasmussen’s Boundary Model, which Lorin does a great write up on: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Lorin’s Law is our favorite law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
You can ask us questions or write to us using our form linked from our website: thisisfinepod.com
Resilience in Software Foundation is at resilienceinsoftware.org
-
27
Incident Status: On Hold w/special guest Will Gallego
Mentioned multiple times, Em Ruppe’s amazing talk on incident severity: https://www.usenix.org/conference/srecon24americas/presentation/ruppe
We talk about the RIS Slack sometimes - you can join us in the Slack by joining the Foundation here: https://resilienceinsoftware.org/
Please ask us a question at thisisfinepod.com
-
26
Complex Systems and the Messy Nine w/special guests Dave Woods and John Allspaw
The writeup on the AWS outage from AWS themselves, if you haven’t seen it: https://aws.amazon.com/message/101925/
Dave’s department at OSU, Cognitive Systems Engineering: https://ise.osu.edu/human-systems-integration/cognitive-systems-engineering is a part of the larger Integrated Systems Engineering school: https://ise.osu.edu/human-systems-integration
Dave was talking early on about the discussion on the war on expertise; it was this webinar through the NDM association: https://vimeo.com/1129606494?fl=pl&fe=sh&mc_cid=c807a504fb
Dave was a part of the
Paul Feltovich got a shout out - he wrote a lot, but one of the best is with Gary Klein on Common Ground and Coordination in Joint Activity: https://www.academia.edu/download/31764257/Common_Ground_Single.pdf
And Studies of Expertise from Psychological Perspectives: https://www.researchgate.net/profile/Paul-J-Feltovich/publication/200772882_Studies_of_expertise_from_psychological_perspectives/links/58bd18b2aca27261e528de07/Studies-of-Expertise-from-Psychological-Perspectives.pdf
Dave mentions his “Command-Adapt Paradox chapter” - you can find that here: https://library.oapen.org/bitstream/handle/20.500.12657/88327/1/978-3-031-45055-6.pdf#page=77
Shout out to Norbert Wiener, the godfather of cybernetics: https://www.jstor.org/stable/24945913
For just two studies on how private equity in hospitals causes worse outcomes for patients you can see: https://hsph.harvard.edu/news/private-equitys-appetite-for-hospitals-may-put-patients-at-risk/ and https://www.sciencedirect.com/science/article/pii/S0304405X25001151
Dave talks a bit about saturation and crossing boundaries towards failure - it’s worth familiarizing yourself with Rasmussen’s boundary model - Lorin Hochstein writes a good summary over at his blog: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Dave also mentions graceful extensibility - this is a concept he’s written quite a bit about, you can start here: https://link.springer.com/article/10.1007/s10669-018-9708-3
Shout out to Slight Reliability: https://slightreliability.com/
One of the great Woods/Cook write ups on anticipation in anesthesiology: https://www.sciencedirect.com/science/article/pii/S0952818096900094
In case you’re unfamiliar with the Chicago Seven: https://en.wikipedia.org/wiki/Chicago_Seven
The Messy 9 are: congestion, cascade, conflict, lag, saturation, friction, tempo, surprise, tangles
Keep an eye on the merch store over at https://www.bonfire.com/store/risf/ if you want the t-shirt.
-
25
All the things about Incident Command
It’s Spamton G (not J) Spamton, Clint! Get hip to the game characters! https://deltarune.fandom.com/wiki/Spamton
There are a couple of incident command trainers out there who tend to get recommended in the tech world (that we know of): https://www.blackrock3.com/ and Great Circle: https://greatcircle.com/im/
-
24
Root Cause Analysis vs. Resilience Engineering w/special guest Lorin Hochstein
A history of the 5 whys and root cause analysis from papers
Some critiques of the 5 whys:
From John Allspaw: https://www.oreilly.com/radar/the-infinite-hows/
From Alan J Card: https://qualitysafety.bmj.com/content/26/8/671
James Reason and the Swiss Cheese Model: https://pmc.ncbi.nlm.nih.gov/articles/PMC8514562/
James Reason’s book Human Error: https://bookshop.org/p/books/human-error/9e06d8a100a07537?ean=9780521314190&next=t
And a classic from Sidney Dekker (et al.) on the implication of complexity within safety investigations: https://www.sciencedirect.com/science/article/abs/pii/S0925753511000105?via%3Dihub
We always recommend the Howie Guide: https://howie-guide.pagerduty.com/
STAMP is starting to get popular: https://functionalsafetyengineer.com/introduction-to-stamp/
Google’s STAMP paper: https://www.usenix.org/publications/loginonline/evolution-sre-google
Google’s STAMP discussion on ProdCast: https://sre.google/prodcast/#season4-episode7
And presentation at SRECon: https://www.usenix.org/conference/srecon25americas/presentation/klein
Nancy Leveson’s Google Scholar is always worth browsing: https://scholar.google.com/citations?user=78y4sEcAAAAJ&hl=en
Allspaw’s LinkedIn post that we quoted: https://www.linkedin.com/posts/jallspaw_important-reminders-about-learning-effectively-activity-7378775591447183360-c_eD
Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Want to talk more about this subject? We’re doing a live event co-sponsored by RISF and you can sign up for it here: https://resilienceinsoftware.org/networks/events/146485
-
23
First Stories/Second Stories
More robustness than resilience, but worth repeating that you should always check your earthquake go-bag: https://www.earthquakeauthority.com/blog/2019/how-to-make-an-earthquake-emergency-kit
Clint did ASA 103: https://americansailing.com/learn-to-sail/certifications/asa-103-coastal-cruising/
Since this is a science podcast, there is a scientific reason people get emotional on airplanes: https://www.cntraveler.com/story/why-do-we-always-cry-on-planes
52 Hertz Whale documentary: https://en.wikipedia.org/wiki/The_Loneliest_Whale:_The_Search_for_52
And Leslie Jamison wrote 52 Blue as a chapter in one of her essay collections (you can read it excerpted here: https://slate.com/technology/2014/08/52-blue-the-loneliest-whale-in-the-world.html )
Colette was wrong, Jamison referenced a famous Kathryn Schulz piece in one of her own essays, which was the source of confusion - The Big One: https://www.newyorker.com/magazine/2015/07/20/the-really-big-one about a cataclysmic earthquake on the west coast. In case you’re curious, Colette uses scholar.google.com and paperpile.com shamelessly live.
We reference A Tale of Two Stories: Contrasting Views of Patient Safety by Richard Cook and Dave Woods: https://www.researchgate.net/publication/245102691_A_Tale_of_Two_Stories_Contrasting_Views_of_Patient_Safety?enrichId=rgreq-a699511fb5bc518bf1584a0a6613d8d0-XXX&enrichSource=Y292ZXJQYWdlOzI0NTEwMjY5MTtBUzoyMDYyMjM2NjExMTMzNDdAMTQyNjE3ODk2MDQ4NA%3D%3D&el=1_x_2&_esc=publicationCoverPdf
The Beaumaiden report (that dives into a deeper, second story) is here: https://dmaib.com/reports/2021/beaumaiden-grounding-on-18-october-2021
We will continue to point to DORA’s organizational model page: https://dora.dev/capabilities/generative-organizational-culture/
Some Wikipedia on double loop learning: https://en.wikipedia.org/wiki/Double-loop_learning
Colette mentioned Mads Møller’s Lund HFSS thesis on deaths and accountability: https://lup.lub.lu.se/student-papers/search/publication/9106422
And Bram Couteaux’s Lund HFSS thesis on the drunk flight attendants/pilots court: https://lup.lub.lu.se/student-papers/search/publication/9111661
J Paul Reed wrote about being ‘Blame Aware’ - https://medium.com/@jpaulreed/why-blameless-postmortems-might-feel-wrong-cbeee00d51b2
-
22
How (Not) to Introduce Resilience Engineering at Work with special guest Michelle Casey
Lorikeets are pretty: https://en.wikipedia.org/wiki/Rainbow_lorikeet
You think Colette’s kidding about the kangaroo? https://www.youtube.com/watch?v=DQjHVRHXbc8
The Mackinac Bridge is long: https://en.wikipedia.org/wiki/Mackinac_Bridge
Michelle’s Blog post: https://resilienceinsoftware.org/news/1288714
DORA has some good writing on Westrum’s cultural models if you’re wondering about it: https://dora.dev/capabilities/generative-organizational-culture/
The link to our TiF live event with Michelle where we will be discussing the blog post! https://resilienceinsoftware.org/networks/events/143194
Please ask us questions! You can go to thisisfinepod.com to get the link to our anonymous Google form!
-
21
How long should you wait after an incident to do your retro?
Corn sweat is a real thing: https://www.scientificamerican.com/article/humidity-from-corn-sweat-intensifies-extreme-heat-wave-in-midwest-u-s/
Also, plugging Tajin here, because: https://en.wikipedia.org/wiki/Taj%C3%ADn_seasoning
Wikipedia tells me Tajin is Mexican. I dunno, Clint.
Beaumaiden report, for those that didn’t listen to the prior episode where we mentioned it: https://dmaib.com/reports/2021/beaumaiden-grounding-on-18-october-2021
John Allspaw’s talk at Spotify that we referenced: https://www.youtube.com/watch?v=M8mYPyRG1fQ
Lorin’s Law is always a good plug: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Clint’s book recommendation: https://bookshop.org/p/books/the-15-commitments-of-conscious-leadership-a-new-paradigm-for-sustainable-success-diana-chapman/14574335?ean=9780990976905&next=t
Send us questions at thisisfinepod.com or find us on LinkedIn here: https://www.linkedin.com/company/this-is-fine-a-podcast-about-software-and-resilience-engineering/
You can come to the Lund panelist event for RISF by signing up here: https://resilienceinsoftware.org/networks/events/133948
-
20
Lund University - Academic Theory and Practice
A huge thanks to our panelists: John Allspaw, Jed Needle, Chad Todd
RISF and TiF will host a live follow up to this episode on July 31st! You can sign up here: https://resilienceinsoftware.org/networks/events/133948
If you’re interested in Lund’s Masters of Science program in Human Factors and Systems Safety, or any of their learning labs, you can check out more info here: https://www.humanfactors.lth.se/
Adaptive Capacity Labs is how Jed was introduced to some of the concepts of LFI & Resilience Engineering, which eventually landed him at Lund.
John mentioned SciShow Tangents, a podcast by Hank Green and Ceri Riley: https://www.youtube.com/c/scishowtangents
As well as Conway’s Law: https://en.wikipedia.org/wiki/Conway%27s_law
And Dunbar’s Number: https://en.wikipedia.org/wiki/Dunbar%27s_number
And the Theory of Graceful Extensibility, which you can read about here: https://infoscience.epfl.ch/server/api/core/bitstreams/87cfe245-c138-43cb-87c9-4062dc1a0519/content
Lund theses list: https://www.humanfactors.lth.se/ny-sajt/msc-programme/msc-theses/
Our panel’s select theses that they love:
Colette’s pick: https://lup.lub.lu.se/student-papers/search/publication/9106422
Chad’s pick: https://lup.lub.lu.se/student-papers/search/publication/9009930
John’s picks were all of the software theses, I’m probably missing some but this is my attempt:
John’s (was the first): https://lup.lub.lu.se/student-papers/search/publication/8084520
J Paul Reed: https://lup.lub.lu.se/student-papers/search/publication/8966930
Chad’s thesis on handovers in software: https://lup.lub.lu.se/student-papers/search/publication/9076274
Michael Wettick: https://lup.lub.lu.se/student-papers/search/publication/9150096
Colette’s thesis on QRA: https://lup.lub.lu.se/student-papers/search/publication/9148570
Jessica De Vita: https://lup.lub.lu.se/student-papers/search/publication/9149521
Dr. Raymer’s I want to Treat the Patient and Not the Alarm: https://lup.lub.lu.se/student-papers/search/publication/2861164
-
19
What’s the ROI on Reliability and Resilience work?
Dave Woods’ talk at SRECon 25 was on Complexification and SRE: https://www.youtube.com/watch?v=lmBvUJnGUX4
Jens Rasmussen’s model is really well explained by Richard Cook’s talk at Velocity: https://www.youtube.com/watch?v=PGLYEDpNu60&t=3s
Lorin’s blog also has a good summary: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
And finally, Jens Rasmussen’s original paper on the subject: Risk Management in a Dynamic Society https://linkinghub.elsevier.com/retrieve/pii/S0925753597000520
The SRECon 25 talk on Incident Metrics that Matter was awesome - https://www.youtube.com/watch?v=QrR2SvpWvdg
Want to read about how things are getting a bit fash-y in tech these days? https://www.newyorker.com/culture/infinite-scroll/techno-fascism-comes-to-america-elon-musk and https://www.theguardian.com/technology/ng-interactive/2025/jan/29/silicon-valley-rightwing-technofascism
Perrow/Normal Accidents: https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-revised-charles-perrow/10369279?ean=9780691004129
High Reliability Organizations (HROs):
Started (ish) with “A Rejoinder to Perrow” https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.1994.tb00047.x
And you can find Rochlin & La Porte behind a lot of the early writing on HROs, including https://www.jstor.org/stable/44637690?seq=1 and https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.1996.tb00078.x
As well as Weick and Sutcliffe: https://bookshop.org/p/books/managing-the-unexpected-sustained-performance-in-a-complex-world-kathleen-m-sutcliffe/11267666?ean=9781118862414& and https://journals.sagepub.com/doi/10.2307/41165243
-
18
Runbooks: the Good, Bad and Ugly w/special guest Andrew Hatch
You can register for the After-the-Episode chat with Andrew at https://resilienceinsoftware.org/networks/events/129997
Tickets are free for members, $10 for non-members. You can join the Foundation at https://resilienceinsoftware.org/signup
Zuul is what Volvo uses for their CI, and it’s part of the OpenInfra Foundation, it’s rad.
You can find Andrew on LinkedIn here.
-
17
What is an incident? How come no one declares them?
Michael Wettick’s Lund thesis is great, and Laura Maguire’s paper on the Costs of Coordination that is a shortened version of her dissertation is worth a read!
Clint’s SRECon talk that he mentioned a couple times: https://www.youtube.com/watch?v=k4UaDDkLOhw
Lorin wrote a great article on incidents and improvisation: https://surfingcomplexity.blog/2023/06/11/when-theres-no-plan-for-this-scenario-youve-got-to-improvise/
Incident.io and the people who work there have hilarious LinkedIn posts about how people use incidents in their org.
We talked about BlackRock3 who do incident command training: https://www.blackrock3.com/
Brent Chapman has also done great incident command training and has done some talks on why IT incident management can learn from fire/emergency response management processes.
We have a LinkedIn! https://www.linkedin.com/company/this-is-fine-a-podcast-about-software-and-resilience-engineering/
And you can ask us questions here: https://forms.gle/rggrbGG6aFVrgZsv9
-
16
Chaos Engineering w/special guest Casey Rosenthal
The O’Reilly book on Chaos Engineering by Casey and Nora Jones is here: https://www.oreilly.com/library/view/chaos-engineering/9781492043850/
Some of the Netflix posts introducing Chaos Monkey and Simian Army are here and here.
You can see Lorin Hochstein talking about Chaos Engineering at Netflix here.
The Void is an awesome collection of information on incidents throughout tech and you can find it here.
Casey mentioned Rasmussen’s model. Lorin has a great summary of that on his blog, but you can read the original paper by Rasmussen introducing this model here.
A report on the Netflix outage during Christmas of 2012.
A reminder - you can ask us questions for the podcast at www.thisisfinepod.com
-
15
Burnout on Aisle 3
Clint wrote the Socio-Technical Reality Engineer as a blog post; it’s a good read.
The Burnout book by the Nagoski sisters is A+++ reading.
Those Found Responsible Have Been Sacked is by the late, great Dr. Richard Cook and Chris Nemeth.
The Perverse Incentives of Reliability by Katie Wilde from Snyk at this year’s SRECon was just an incredible talk.
Colette mentioned the Beaumaiden report from DMAIB. She gave a talk for the DORA community on resilience engineering that you can see here.
-
14
Resilience, Complexity, and Your Boss a collab w/Punk Rock Safety
Ben (Goodheart), Dave (Provan) and Ron (Gantt) have the very awesome podcast Punk Rock Safety (punkrocksafety.com) - you can get your own punk rock safety merch at punkrocksafetymerch.com
Charles Perrow wrote Normal Accidents and talks about safety and power in his essay (book, really), Complex Organizations.
Drew Rae lit the stage on fire about safety work as soothing rather than actually improving safety: EHS Congress Berlin 2024 - Day2
Dr. Richard Cook’s concepts of ‘Above the Line/Below the Line’ got a shout out - here’s the paper, and here’s John Allspaw giving a talk about the concept.
-
13
Live From SRECon
No video for this one because it didn’t really end up working.
We had some awesome people with us for this show: Eric Dobbs, Will Gallego, Juan Carlos Ramirez, Martin Smith
Dr Richard Cook’s talk on The Marvelous Resilience of Bone (one of our absolute favorites)
You can see the schedule for the SRECon 2025 Americas conference here
The keynotes from the day we recorded were Dr David Woods and Katie Wilde (from Snyk)
-
12
Teaser Episode - Season 2
The XKCD comic that’s in Colette’s thesis is Dependency
Justin Reock is at DX
https://punkrocksafety.com/ are our mutual podcast friends
-
11
Episode 10 - When They go Full ITIL on You w/special guest john allspaw
You can find John at Adaptive Capacity Labs or his (old) blog at Kitchen Soap. ITIL is… well, it’s a thing.
Colette’s “You’re surprised it works in the first place” comes from Richard Cook’s brilliant Velocity talk in 2013.
FYI, John wasn’t talking about Franz Kafka, we think he was talking about Apache Kafka. But they are pretty similar, we think.
-
10
Episode 9 - Learning from Incidents with special guest Alex Elman
You can find ACL (Adaptive Capacity Labs), the folks who train software engineers how to do LFI and who we speak so fondly of, here.
Colette mentioned Allspaw’s take on Five Whys - if you want to know why we think there are better options for learning out there, you can read it here.
Alex did a great talk with Sarah Butt on some LFI related things at LFI Conf in 2023: https://www.youtube.com/watch?v=CbSiKAtO7Fk
And at SRECon: SREcon20 Americas - Are We Getting Better Yet? Progress Toward Safer Operations
Colette went to go see whales in the Baja through this tour; it was awesome.
Write to us at [email protected] or go fill out our form with a question at Thisisfinepod.com
-
9
Episode 8 - Why Human Factors and Not Technical Ones
The spicy Allspaw take that inspired our listener is here: https://www.linkedin.com/posts/jallspaw_a-im-a-bit-salty-today-b-if-you-dont-activity-7287968197742411776-5_Ay
Charles Perrow is the guy who wrote Normal Accidents (https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-revised-charles-perrow/10369279?ean=9780691004129&next=t&next=t), which Colette is somewhat controversially a fan of, and thus a Perrow-ian? (a lot of resilience engineering people are not fans!)
Not many notes today, so go check out a page on one of Colette’s favorite chicken breeds: https://greenfirefarms.com/shetland_hen.html
-
8
Episode 7 - AI and Resilience with special guest Courtney Nash
The VOID is one of our favorite things!
Some of Courtney’s inoculation of the MTTR virus can be found here:
An interview with InfoQ
A talk at SRE Con Americas in 2022
Courtney’s recent talk on Automation and AI
David Graeber’s Bullshit Jobs started as a talk and then a great book
Want to read more about HABA-MABA and CSE/RE? Lisanne Bainbridge’s The Ironies of Automation is a perennial recommendation in our show notes
The thread Courtney mentioned from Gergely Orosz
-
7
Episode 6 - Can You Buy Resilience? With Special Guest Steve McGhee
Steve is the host of the Google SRE Prodcast, you should check it out!
Colette got her chickens from Greenfire Farms, and her chicken coop from Carolina Coops, if anyone is wondering.
The Chris Hayes podcast Colette mentioned about unconditional cash transfers is here.
Iain M. Banks is the author of The Culture series, a set of fiction books based in a post-scarcity society.
If you didn’t get the Vizzini/Inigo Montoya references, you should probably find a way to see The Princess Bride.
Colette mentioned STAMP - which is more along the lines of reliability engineering than resilience engineering, technically, but is related. You can read about how Google is using it here.
Lord, you want the history of ITIL? Okay.
**** note, none of the below sponsor us (yet), so these are pure-hearted endorsements from Clint during the episode ****
Adaptive Capacity Labs will teach your teams how to be more resilient.
Incident.io is who Clint mentioned as one of the many incident automation tools out there (Rootly and FireHydrant are a couple others).
Backstage is an open source Spotify product, and anyone who’s worked at Spotify will talk your ear off about how great it is if you let us.
*************************
A new Resilience Engineering community that Colette and Clint are a part of has launched! You can find us at resilienceinsoftware.org and join to be a part of the conversation in Slack.
And of course, you can email us at [email protected] or write to us via http://thisisfinepod.com
-
6
episode 5 - curating your resilience engineering 101
We talk about our favorite recommendations for someone who's just getting into this whole resilience engineering thing.
A small note: Clint's voice is a little low in this one! We tried to do audio magic as much as we could to fix it, but hopefully you'll forgive us with your holiday spirit and we'll do better next time. <3
Notes:
How Complex Systems Fail by Richard I. Cook
Resilience in Complex Adaptive Systems by Richard I. Cook at Velocity NY, 2013
Moving Gracefully from Compliance to Learning by Ivan Pupulidy at LFI Conf, 2023
Going Solid by Richard I. Cook and Jens Rasmussen (NOT Dave Woods, Colette had a brain malfunction)
Prosaic Organizational Failure by Lee Clarke and Charles Perrow
Lee Clarke also wrote the stellar Mission Improbable about Fantasy Documents
You should check out Ben Hutchinson on LinkedIn and also he wrote about Fantasy Documents/Enabling Devices
The Field Guide to Understanding Human Error by Sidney Dekker
The Maintenance Race by Stewart Brand
Wisdom From The Sharp End Agile2024 slides by Clint Byrum
Clint’s talk at Agile 2024 doesn’t seem to be on YouTube yet?
The Howie Guide is now at PagerDuty since they bought Jeli
Okay fine, you want Colette’s thesis involving Fantasy Documents, Enabling Devices and Cybersecurity?
Remember, you can find us at thisisfinepod.com or email us at [email protected]
-
5
Episode 4 - A look at the 2024 dora report
Fred’s wonderful blog
This year’s DORA report
Lee, Ramsey & Hicks on productivity and performance
Bainbridge’s classic: The Ironies of Automation
And, as always, visit our website or send us email at: [email protected] to send us questions!
-
4
Episode 3 - lions, tigers and metrics, oh my!
We answered a set of questions about how to deal with dashboards and MTTR and how to make the best of the situation with the help of special guest Vanessa Huerta Granda.
-
3
Episode 2 - Does Software Need Safety?
We talk to the pioneer of resilience engineering in the software world, John Allspaw, about how he discovered this world, and we answer a reader question together: does software need safety?
Correction: we thought this would be episode 3, but it ended up being 2, because of scheduling conflicts with guests...
You can submit your questions at www.thisisfinepod.com
[email protected]
-
2
Episode 1 - Every Second Counts
The introduction episode of This is Fine! A podcast about resilience engineering in the software world. Clint and Colette discuss conferences and a little bit of their history. Then they answer a question from a (prospective) listener: How do you politely ask executives/senior members of a company to step away from an incident...?
HOSTED BY
Colette Alexander and Clint Byrum