PODCAST
Debug Log
Software engineering war stories, architecture decisions, and lessons learned.
-
17
Debug Log: The Million-Goroutine Memory Leak and the Case for "Boring" Auth
This episode explores the failure of a critical Kubernetes authentication gateway, caused by the accumulation of a million dormant goroutines. It details how client-side context cancellations were not propagated to the upstream proxying goroutines, leaving these lightweight concurrency units holding onto resources indefinitely. Listeners will learn why meticulous context propagation matters in Go's concurrency model, especially in I/O-bound networked services, to prevent similar resource leaks and system instability.
-
16
Chasing the Cart: Why Pinterest Ripped Out Its Sequential Ad Architecture
This episode explores the challenges of traditional multi-stage ad serving architectures, where optimizing for intermediate metrics like clicks can inadvertently sabotage ultimate conversion goals by prematurely filtering out valuable ads. Listeners will learn how integrating sophisticated conversion prediction intelligence much earlier in the pipeline, through a dedicated "Conversion Candidate Generation" component, can overcome these limitations and lead to more effective ad delivery.
-
15
The Blast Radius of Agentic AI: Why "Five Nines" is a Relic
This episode explores why the traditional "five nines" reliability metric is fundamentally unsuitable for agentic AI systems. It explains that unlike traditional systems, agentic AI can be "up" but still cause catastrophic failures through incorrect autonomous actions, leading to a significantly wider "blast radius" of damage. Listeners will learn about the unique failure modes of these self-directed systems and the critical need to shift focus from mere availability to ensuring correctness and integrity.
-
14
Phantom in the Page Cache: Unpacking the 10-Line "Copy Fail" Exploit
This episode discusses a 9-year-old, 10-line "Copy Fail" exploit found in the Linux kernel's page cache, highlighting the paradox of such a critical yet subtle vulnerability evading detection for so long. It explores the nature of this "phantom" bug, explaining how its "surgical precision" and exploitation of concurrency in the page cache make it incredibly difficult to detect, even in highly scrutinized software. Listeners will learn about the profound implications of small flaws in critical system components and the challenges of securing complex, concurrent operating systems.
-
13
Automating the Autopsy: The Promise and Peril of AI-Generated Postmortems
This episode explores the intriguing concept of using AI to write incident postmortems, highlighting its potential for speed, consistency, and automating data synthesis from vast sources. However, it also delves into the significant perils, such as the impact of poor data quality, the risk of AI hallucinations, and AI's inability to grasp the nuanced human "why" behind incidents. Listeners will learn about the dichotomy between AI's data processing power and the essential human element in understanding complex system failures.
-
12
The Harness and the Lobotomy: Unpacking Anthropic’s 47-Day Degradation
This episode explores a 47-day incident where Anthropic's Claude Code appeared to degrade, revealing that the core AI model was intact but its 'harness'—the surrounding infrastructure and system prompts—failed. Listeners will learn how critical this 'harness' is for an AI product's effective performance, and how seemingly minor changes, like lowering default reasoning effort, can lead to significant user frustration and a breakdown of trust between a company and its users.
-
11
Scaling for Ghosts: 7 Microservices, 47 Users, and the Trap of Resume-Driven Development
This episode explores the phenomenon of "Resume-Driven Development," where an engineer at a pre-seed startup built an enterprise-grade distributed system designed for 100,000 users, despite only having 47. It highlights how engineers might prioritize resume-boosting complex infrastructure over a startup's actual needs, leading to significant financial and human capital costs. Listeners will learn about the dangers of over-engineering and the critical misalignment of incentives in early-stage tech development.
-
10
The 3,000 Incident Postmortem: Why Caches Are Actually the Enemy
This episode explores Marc Brooker's controversial claim that caching, often a default scaling solution, is a major cause of catastrophic "metastable" system failures. It delves into the importance of deep postmortem analysis, moving beyond superficial root causes to question observability, testing, and fundamental architectural assumptions. Listeners will learn how unquestioning reliance on caching can create systems prone to persistent, unrecoverable breakdowns.
-
9
The Interface Tax: Is Clean Architecture a Scam?
This episode critically explores how dogmatic adherence to "Clean Architecture" principles, such as excessive layering and abstraction, can inadvertently hinder development velocity. It introduces concepts like the "Interface Tax" and "Lasagna Code," illustrating how over-engineering for unlikely future changes creates unnecessary complexity and friction for developers. Listeners will gain a critical perspective on common architectural practices and learn to identify when they might be detrimental to project progress.
-
8
From Vibe-Coded to Enterprise: Handing the Pager to Claude
This episode explores Incident.io's new remote Model Context Protocol (MCP) server, which enables AI assistants like Claude to directly access and interact with live production incident data. Listeners will learn how this "USB-C for AI" standard aims to reduce "dashboard fatigue" and streamline incident response by providing consolidated information, while also considering the potential trade-offs regarding deep system understanding and the "vibe-coded" origin of the technology.
-
7
The Microservice Hangover: Investigating an 83% Cost Cut by Returning to a "Majestic Monolith"
This episode discusses a team's successful transition from microservices back to a monolithic architecture, resulting in an 83% reduction in infrastructure costs and a 61% reduction in codebase size. It critically examines the common trend of smaller engineering teams adopting microservices due to "cargo culting" and highlights how this can lead to engineers spending excessive time on infrastructure rather than product features. Listeners will learn about the potential pitfalls of prematurely adopting complex distributed systems and the surprising benefits a well-managed monolith can offer for productivity and cost efficiency.
-
6
The Trojan Horse in the AI Stack: How One Tiny Library Exposed the Keys to the Kingdom
This episode explores a critical supply chain attack where malicious code was embedded in legitimate updates of the popular LiteLLM library on PyPI, causing system meltdowns and stealing sensitive credentials like SSH keys and cloud configurations. Listeners will learn how such attacks exploit trusted open-source dependencies to compromise critical infrastructure and why libraries that handle numerous API keys for services like Large Language Models are particularly attractive targets for attackers.
-
5
The Slow-Motion Failure: Deconstructing the March 2026 Claude Outages
This episode discusses a March 2026 outage of the Claude AI platform, revealing that the failure wasn't in the AI models themselves but in the "control plane" — critical non-AI components like authentication services. Listeners will learn how an unanticipated surge in new user sign-ups overwhelmed these "boring" but essential systems, highlighting the often-overlooked challenges of scaling stateful infrastructure compared to the AI's "inference plane."
-
4
The Shadow Workforce: Rise of the In-House AI Coder
This episode explores the rapid adoption of AI in software development, revealing how companies like Ramp and StrongDM are using AI to author significant code, with some even eliminating human review. It delves into why elite organizations build custom AI agents for deep integration into their proprietary systems, contrasting this with a "radical" approach that prioritizes behavioral validation over human oversight. Listeners will gain insight into the philosophical debates surrounding AI-generated code and the emerging architectural patterns for these autonomous systems.
-
3
The Rich Get Richer: Is AI Making Your Senior Engineers 10x and Your Juniors Obsolete?
This episode challenges the common belief that AI will level the playing field for developers, presenting data that shows it disproportionately benefits senior engineers. Listeners will learn that experienced developers use AI as a force multiplier, leveraging their deep architectural context to direct and curate AI-generated code, thus widening the productivity gap with junior developers. This has significant implications for how engineering teams are trained, mentored, and staffed.
-
2
Atlassian's AI Sacrifice: Firing Engineers to Hire "AI Talent"
This episode explores Atlassian's recent layoff of 1,600 employees, including over 900 in R&D, as a strategic pivot to "self-fund further investment in AI." Listeners will learn about the significant financial implications of this move, the controversial method of employee notification, and how the company is sacrificing institutional knowledge and restructuring leadership in a calculated bet on future AI capabilities.
-
1
Matt Pocock: 9 Ways AI Coding Rewired My Brain
This episode explores how one developer's 100% AI-contributed software development process has fundamentally reshaped his approach, particularly by increasing his focus on robust integration testing. Listeners will learn that immediate, comprehensive feedback loops—including "desirable friction" like strong type checking and rapid local testing environments—are crucial for effectively guiding AI agents. The discussion also highlights AI's current limitations, such as its lack of "taste" for UI design.