KubeFM Podcast - All Episodes

100

The Namespaces Scaling Trap, with Brian Stack

Most teams scale Kubernetes by thinking about pods and nodes. At Render, Brian Stack ran into a different dimension: hundreds of thousands of namespaces per cluster, multiplied across DaemonSets that list-watch every namespace.Brian explains how Render traced the issue through Calico and Vector, worked with upstream maintainers, and turned memory profiling into operational wins: lower node costs, lighter API-server load, and faster rollouts.In this interview:Why namespaces can become a hidden scaling bottleneckHow DaemonSets multiply memory and control-plane pressureHow profiling, staging clusters, and upstream collaboration freed 7 TiBWhy pushing from an 80% fix to a complete fix can make teams fasterSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/0mrvCsXrVInterested in sponsoring an episode? Learn more.

May 12, 2026

36m

99

AI Agents Running Kubernetes, with Mike Solomon

What happens when an AI agent stops generating Kubernetes YAML and starts operating the cluster directly?Mike Solomon, software engineer at AIATELLA, explains how his team moved from a sprawling Helm setup to Markdown-driven infrastructure specs that Claude Code can execute, test, and refine.You will learnWhy Helm became hard to maintain for a fast-moving medical infrastructure repoHow Claude debugged Argo, TLS conflicts, kubectl patches, and private registry credentialsHow runbooks plus agent memory files capture failures so deployments become reproducible.It is a practical look at where Kubernetes automation may be heading: less hand-written YAML, more precise intent, and a sharper definition of when the human must stay in the loop.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/y70mLvWNsInterested in sponsoring an episode? Learn more.

May 5, 2026

38m

98

SaaS with Kubernetes Operators and Garbage Collection, with Alexander Held

A single Kubernetes CRD for every service request turns small changes into full-platform reconciliations.Alexander Held, former platform engineer at Mercedes-Benz Tech Innovation, describes a production refactor from a 2,000-line CRD to purpose-built resources and controllers. He shows how teams can model business workflows as Kubernetes APIs and then use owner references, finalizers, and events to keep platform operations predictable.You will learn:Why monolithic CRDs create performance and troubleshooting problemsHow controllers turn database provisioning and backups into reconciliation loopsHow finalizers clean up external resources such as S3 backupsWhy Kubernetes events make platform workflows easier to debugSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/TGy4Qn7QsInterested in sponsoring an episode? Learn more.

Apr 28, 2026

35m

97

What Hip-Hop Can Teach Us About Kubernetes, with Kelsey Hightower, Eric Abercrombie, and Julius Payne II

Kelsey Hightower, Eric Abercrombie, and Julius Payne II reflect on life after achievement, entering the Kubernetes world for the first time, and how music, creativity, and lived experience shape the way they think about technology.In this interview:Why fundamentals, patience, and repetition still matter more than shortcutsHow Kubernetes, community, and confidence intersect for people entering cloud-native workWhat hip-hop, production, and storytelling can teach us about ownership, authenticity, and finding your voiceSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/czrCCXSLtInterested in sponsoring an episode? Learn more.

Apr 21, 2026

1h 29m

96

Intelligent Kubernetes Load Balancing, with Rohit Agrawal

You're running gRPC services in Kubernetes, load balancing looks fine on the dashboard — but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.In this episode:Why KubeProxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per requestHow Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every clientHow zone-aware spillover cut cross-availability-zone costs without sacrificing availabilityWhy CPU-based routing failed (monitoring lag creates oscillation) and what signals to use insteadThe system has been running in production for three years across hundreds of services, handling millions of requests.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/y803JMhBkInterested in sponsoring an episode? Learn more.

Apr 7, 2026

30m

95

That Time I Found a Service Account Token in my Log Files, with Vincent von Büren

You're integrating HashiCorp Vault into your Kubernetes cluster and adding a temporary debug log line to check whether the ServiceAccount token is being passed correctly. Three months later, that log line is still in production — and the token it prints has a 1-year expiry with no audience restrictions.Vincent von Büren, a platform engineer at ipt in Switzerland, lived through exactly this incident. In this episode, he breaks down why default Kubernetes ServiceAccount tokens are a quiet security risk hiding in plain sight.You will learn:What's actually inside a Kubernetes ServiceAccount JWT (issuer, subject, audience, and expiry)Why tokens with no audience scoping enable replay attacks across internal and external systemsHow Vault's Kubernetes auth method and JWT auth method compare, and when to choose eachWhat projected tokens are, why they dramatically reduce blast radius, and what's holding teams back from using themPractical steps for auditing which pods actually need API access and disabling auto-mounting everywhere elseSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/LTnB_NtbcInterested in sponsoring an episode? Learn more.

Mar 31, 2026

28m

94

GPU Containers as a Service, with Landon Clipp

Running GPU workloads on Kubernetes sounds straightforward until you need to isolate multiple tenants on the same server. The moment you virtualize GPUs for security, you lose access to NVIDIA kernel drivers — and almost every tool in the ecosystem assumes those drivers exist.Landon Clipp built a GPU-based Containers as a Service platform from scratch, solving each isolation layer — from kernel separation with Kata Containers + QEMU to NVLink fabric partitioning to network policies with Cilium/eBPF — and shares exactly what broke along the way.In this interview:Why standard NVIDIA tooling (GPU Operator) fails in multi-tenant setups, and how to use CDI with PCI topology scanning to make GPUs visible to Kubernetes without kernel driversHow to partition the NVLink fabric between tenants using a trusted service VM running Fabric Manager, and why the physical PCIe wiring differs between Supermicro HGX and NVIDIA DGX systemsWhy gVisor doesn't work for GPU workloads — NVIDIA's unstable ioctl ABI means Google has to update gVisor for every driver release, and they only support a handful of GPUsWhat caused 8-GPU VMs to take 30+ minutes to boot, and the specific fixes (IOMMUFD, cold plugging, kernel upgrades) that brought it down to minutesHow Cilium network policies enforce tenant isolation at the Kubernetes identity level instead of fragile IP-based rulesWhere Containers as a Service fits best: inference workloads where AI teams want to ship an OCI image without managing infrastructure or signing multi-million dollar cluster contracts.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/jjK_yJTDzInterested in sponsoring an episode? Learn more.

Mar 24, 2026

93

How We Cut Build Debugging Time by 75% with AI, with Ron Matsliah

Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75% — not as a dashboard, but delivered directly in Slack where developers already work.In this episode:Why combining deterministic rules with AI produces better results than letting an LLM guess aloneHow correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errorsWhy integrating into existing workflows and building feedback loops from day one drove adoptionThe prompt engineering lessons learned from testing with real production data instead of synthetic examplesThe takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/PDdYfC00wInterested in sponsoring an episode? Learn more.

Mar 17, 2026

20m

92

Migrating Kubernetes Off Big Cloud, with Fernando Duran

Managed Kubernetes on a major cloud provider can cost hundreds or even thousands of dollars a month — and much of that spending hides behind defaults, minimum resource ratios, and auxiliary services you didn't ask for.Fernando Duran, founder of SadServers, shares how his GKE Autopilot proof of concept ran close to $1,000/month on a fraction of the CPU of the actual workload and how he cut that to roughly $30/month by moving to Hetzner with Edka as a managed control plane.In this interview:Why Kubernetes hasn't delivered on its original promise of cost savings through bin packing — and what it actually provides insteadA real cost comparison: $1,000/month on GKE vs. $30/month on Hetzner with Edka for the same nominal capacityWhat you need to bring with you (observability, logging, dashboards) when leaving a fully managed cloud providerThe decision comes down to how tightly coupled you are to cloud-specific services and whether your team can spare the cycles to manage the gaps.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/6nSDbz9m4Interested in sponsoring an episode? Learn more.

Mar 10, 2026

25m

91

Migrating to Karpenter: Fun Stories, with Adhi Sutandi

Running multiple Kubernetes clusters on AWS with the cluster autoscaler? Every four months, you face the same grind: upgrading Kubernetes versions, recreating auto scaling groups, and hoping instance type changes stick.Adhi Sutandi, DevOps Engineer at Beekeeper by LumApps, shares how his team migrated from the cluster autoscaler to Karpenter across eight EKS clusters — and the hard lessons they learned along the way.In this episode:Why AWS auto scaling groups are immutable and how that creates upgrade bottlenecks at scaleHow the latest AMI tag accidentally turned less critical clusters into chaos engineering environments, dropping SLOs before anyone realized Karpenter was the causeWhy pre-stop sleep hooks solved pod restartability problems that Quarkus's built-in graceful shutdown couldn'tThe case for pod disruption budgets over Karpenter annotations when protecting critical workloads during node rotationsHow Karpenter's implicit 10% disruption budget caught the team off guard — and the explicit configuration that fixed itSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/XyVfsSQPrInterested in sponsoring an episode? Learn more.

Mar 3, 2026

1h 01m

90

From ECS to Kubernetes: A Real Migration Story, with Radosław Miernik

Migrating from ECS to Kubernetes sounds straightforward — until you hit spot capacity failures, firewall rules silently dropping traffic, and memory metrics that lie to your autoscaler.Radosław Miernik, Head of Engineering at aleno, walks through a real production migration: what broke, what they missed, and the fixes that made it work.In this interview:Running Flux and Argo CD together — Flux for the infra team, Argo CD's UI for developers who don't want to touch YAMLHow the wrong memory metric caused OOM errors, and why switching to jemalloc cut memory usage by 20%Splitting WebSocket and API containers into separate deployments with independent autoscalingFour months of migration, over 100 configuration changes in the first month, and a concrete breakdown of what platform work looks like when you can't afford downtime.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/x6wFMhVsxInterested in sponsoring an episode? Learn more.

Feb 24, 2026

38m

89

Faster EKS Node and Pod Startup, with Jan Ludvik

Kubernetes nodes on EKS can take over a minute to become ready, and pods often wait even longer — but most teams never look into why.Jan Ludvik, Senior Staff Reliability Engineer at Outreach, shares how he cut node startup from 65 to 45 seconds and reduced P90 pod startup by 30 seconds across ~1,000 nodes — by tackling overlooked defaults and EBS bottlenecks.In this episode:Why Kubelet's serial image pull default quietly blocks pod startup, and how parallel pulls fix itHow EBS lazy loading can silently negate image caching in AMIs — and the critical path workaroundA Lambda-based automation that temporarily boosts EBS throughput during startup, then reverts to save costThe kubelet metrics and logs that expose pod and node startup latenc,y most teams never monitorEvery second saved translates to faster scaling, lower AWS bills, and better end-user experience.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/B7TzKXyxfInterested in sponsoring an episode? Learn more.

Feb 17, 2026

21m

88

Kubernetes is not just for Black Friday, with Thibault Martin

You self-host services at home, but upgrades break things, rollbacks require SSH-ing in to kill containers manually, and there's no safety net if your hardware fails.Thibault Martin, Director of Program Development at the Matrix Foundation, walked this exact path — from Docker Compose to Podman with Ansible to Kubernetes on a single server — and explains why each transition happened and what it solved.In this interview:Why Ansible's declarative promise fell short with the Podman collection, forcing sequential imperative steps instead of desired-state definitionsHow community Helm charts replace the need to write and maintain every manifest yourselfWhy GitOps isn't just a deployment workflow — it's a disaster recovery strategy when your infrastructure lives in your living roomHow k3s removes the barrier to entry by bundling opinionated defaults so you can skip choosing CNI plugins and storage providersKubernetes doesn't have to be enterprise-scale — with the right distribution and community tooling, it can be a practical, low-overhead choice for anyone who cares about their data.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/Xk5S7VqXzInterested in sponsoring an episode? Learn more.

Feb 10, 2026

27m

87

Patroni Backups: when pgBackRest and Argo CD have your back (literally), with Ziv Yatzik

Your database backup strategy shouldn't be the thing that takes your production systems down.Ziv Yatzik manages 600+ Postgres clusters in a closed network environment with no public cloud. After existing backup solutions proved unreliable — causing downtime when disks filled up — his team built a new architecture using pgBackRest, Argo CD, and Kubernetes CronJobs.In this episode:Why storing WAL files on shared NAS storage prevents backup failures from cascading into database outagesHow GitOps with Argo CD lets them manage backups for hundreds of clusters by adding a single YAML fileThe Ansible + Kubernetes hybrid approach that keeps VM-based Patroni clusters in sync with Kubernetes-orchestrated backupsA practical blueprint for making database backups boring, reliable, and safe.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/Rg_sQYSmwInterested in sponsoring an episode? Learn more.

Feb 3, 2026

25m

86

Running a Full Kubernetes Cluster for $2 a Month, with Varnit Goyal

Most developers assume Kubernetes requires an enterprise budget. Varnit Goyal proves otherwise — he built a full three-node Kubernetes cluster for $2.16/month using Rackspace Spot Instances.The trick: pick non-default instance types, distribute nodes across low-demand regions, and let Kubernetes handle rescheduling when nodes get preempted. For service exposure, he replaced the $10/month load balancer with Tailscale Funnel — free.In this episode:How Spot Instance bidding works and which strategies keep costs and preemption lowUsing Tailscale Kubernetes operator as a free alternative to traditional load balancersRunning real development dependencies (Kafka, Elasticsearch, Postgres) on a budget clusterA practical walkthrough of what Kubernetes actually needs to function — and what you can strip away.SponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/HpVyQMVv0Interested in sponsoring an episode? Learn more.

Jan 27, 2026

27m

85

We Broke Our EKS Cluster Autoscaler with the AL2023 Migration, with Dilshan Wijesooriya

Dilshan Wijesooriya, Senior Cloud Engineer, discusses a real incident where migrating EKS nodes to AL2023 caused the cluster autoscaler to lose AWS permissions silently.You will learn:Why AL2023 blocks pod access to instance metadata by default, breaking components that relied on node IAM roles (like cluster autoscaler, external-DNS, and AWS Load Balancer Controller)How to implement IRSA correctly by configuring IAM roles, Kubernetes service accounts, and OIDC trust relationships, and why both AWS IAM and Kubernetes RBAC must be configured independentlyThe recommended migration strategy: move critical system components to IRSA before changing AMIs, test aggressively in non-production, and decouple identity changes from OS upgradesHow to audit which pods currently rely on node roles and clean up legacy IAM permissions to reduce attack surface after migrationSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/T_YPfTfDbInterested in sponsoring an episode? Learn more.

Jan 13, 2026

84

A Journey Through Kafkian SplitDNS in a Multitenant Kubernetes, with Fabián Sellés Rosa

Fabián Sellés Rosa, Tech Lead of the Runtime team at Adevinta, walks through a real engineering investigation that started with a simple request: allowing tenants to use third-party Kafka services. What seemed straightforward turned into a complex DNS resolution problem that required testing seven different approaches before a working solution was found.You will learn:Why Kafka's multi-step DNS resolution creates unique challenges in multi-tenant environments, where bootstrap servers and dynamic broker lists complicate standard DNS approachesThe iterative debugging process from Route 53 split DNS through Kubernetes native pod DNS config, custom DNS servers, Kafka proxies, and CoreDNS solutionsHow to implement the final solution using node-local DNS and CoreDNS templating with practical details including ndots configuration and Kyverno automationPlatform engineering evaluation criteria for assessing solutions based on maintainability, self-service capability, and evolvability in multi-tenant environmentsSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/NsBZ-FwcJInterested in sponsoring an episode? Learn more.

Dec 2, 2025

83

More Kubernetes Than I Bargained For, with Amos Wenger

Amos Wenger walks through his production incident where adding a home computer as a Kubernetes node caused TLS certificate renewals to fail. The discussion covers debugging techniques using tools like netshoot and K9s, and explores the unexpected interactions between Kubernetes overlay networks and consumer routers.You will learn:How Kubernetes networking assumptions break when mixing cloud VMs with nodes behind consumer routers, and why cert-manager challenges fail in NAT environmentsThe differences between CNI plugins like Flannel and Calico, particularly how they handle IPv6 translationDebugging techniques for network issues using tools like netshoot, K9s, and iproute2Best practices for mixed infrastructure including proper node labeling, taints, and scheduling controlsSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/6Ll_7slr9Interested in sponsoring an episode? Learn more.

Nov 25, 2025

82

The Karpenter Effect: Redefining Kubernetes Operations, with Tanat Lokejaroenlarb

Tanat Lokejaroenlarb shares the complete journey of replacing EKS Managed Node Groups and Cluster Autoscaler with AWS Karpenter. He explains how this migration transformed their Kubernetes operations, from eliminating brittle upgrade processes to achieving significant cost savings of €30,000 per month through automated instance selection and AMD adoption.You will learn:How to decouple control plane and data plane upgrades using Karpenter's asynchronous node rollout capabilitiesCost optimization strategies including flexible instance selection, automated AMD migration, and the trade-offs between cheapest-first selection versus performance considerationsScaling and performance tuning techniques such as implementing over-provisioning with low-priority placeholder podsPolicy automation and operational practices using Kyverno for user experience simplification, implementing proper Pod Disruption BudgetsSponsorThis episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimizationMore infoFind all the links and info for this episode here: https://ku.bz/T6hDSWYhbInterested in sponsoring an episode? Learn more.

Nov 18, 2025

81

Building Kubernetes (a lite version) from scratch in Go, with Owumi Festus

Festus Owumi walks through his project of building a lightweight version of Kubernetes in Go. He removed etcd (replacing it with in-memory storage), skipped containers entirely, dropped authentication, and focused purely on the control plane mechanics. Through this process, he demonstrates how the reconciliation loop, API server concurrency handling, and scheduling logic actually work at their most basic level.You will learn:How the reconciliation loop works - The core concept of desired state vs current state that drives all Kubernetes operationsWhy the API server is the gateway to etcd - How Kubernetes prevents race conditions using optimistic concurrency control and why centralized validation mattersWhat the scheduler actually does - Beyond simple round-robin assignment, understanding node affinity, resource requirements, and the complex scoring algorithms that determine pod placementThe complete pod lifecycle - Step-by-step walkthrough from kubectl command to running pod, showing how independent components work together like an orchestraSponsorThis episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimizationMore infoFind all the links and info for this episode here: https://ku.bz/pf5kK9lQFInterested in sponsoring an episode? Learn more.

Nov 11, 2025

80

Graphs in your head, or how to assess a Kubernetes workload, with Oleksii Kolodiazhnyi

Understanding what's actually happening inside a complex Kubernetes system is one of the biggest challenges architects face.Oleksii Kolodiazhnyi, Senior Architect at Mirantis, shares his structured approach to Kubernetes workload assessment. He breaks down how to move from high-level business understanding to detailed technical analysis, using visualization tools and systematic documentation.You will learn:A top-down assessment methodology that starts with business cases and use cases before diving into technical detailsPractical visualization techniques using tools like KubeView, K9s, and Helm dashboard to quickly understand resource interactionsSystematic resource discovery approaches for different scenarios, from well-documented Helm-based deployments to legacy applications with hard-coded configurations buried in containersDocumentation strategies for creating consumable artifacts that serve different audiences, from business stakeholders to new team members joining the projectSponsorThis episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimizationMore infoFind all the links and info for this episode here: https://ku.bz/zDThxGQsPInterested in sponsoring an episode? Learn more.

Nov 4, 2025

79

Our Journey to GitOps: Migrating to ArgoCD with Zero Downtime, with Andrew Jeffree

Andrew Jeffree from SafetyCulture walks through their complete migration of 250+ microservices from a fragile Helm-based setup to GitOps with ArgoCD, all without any downtime. He explains how they replaced YAML configurations with a domain-specific language built in CUE, creating a better developer experience while adding stronger validation and reducing operational pain points.You will learn:Zero-downtime migration techniques using temporary deployments with prune-last sync options to ensure healthy services before removing legacy onesHow CUE lang improves on YAML by providing schema validation, early error detection, and a cleaner interface for developersHuman-centric platform engineering approaches that prioritize developer experience and reduce on-call burden through empathy-driven design decisionsSponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/Xvyp1_QcvInterested in sponsoring an episode? Learn more.

Oct 28, 2025

78

The Double-Edged Sword of AI-Assisted Kubernetes Operations, with Mai Nishitani

Mai Nishitani, Director of Enterprise Architecture at NTT Data and AWS Community Builder, demonstrates how Model Context Protocol (MCP) enables Claude to directly interact with Kubernetes clusters through natural language commands.You will learn:How MCP servers work and why they're significant for standardizing AI integration with DevOps tools, moving beyond custom integrations to a universal protocolThe practical capabilities and critical limitations of AI in Kubernetes operationsWhy fundamental troubleshooting skills matter more than ever as AI abstractions can fail in unexpected ways, especially during crisis scenarios and complex system failuresHow DevOps roles are evolving from manual administration toward strategic architecture and orchestrationSponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/3hWvQjXxpInterested in sponsoring an episode? Learn more.

Oct 21, 2025

77

The Making of Flux: The Future, a KubeFM Original Series

In this closing episode, Bryan Ross (Field CTO at GitLab), Jane Yan (Principal Program Manager at Microsoft), Sean O’Meara (CTO at Mirantis) and William Rizzo (Strategy Lead, CTO Office at Mirantis) discuss how GitOps evolves in practice.How enterprises are embedding Flux into developer platforms and managed cloud services.Why bridging CI/CD and infrastructure remains a core challenge—and how GitOps addresses it.What leading platform teams (GitLab, Microsoft, Mirantis) see as the next frontier for GitOps.SponsorJoin the Flux maintainers and community at FluxCon, November 11th in Atlanta—register hereMore infoFind all the links and info for this episode here: https://ku.bz/tVqKwNYQHInterested in sponsoring an episode? Learn more.

Oct 20, 2025

26m

76

The Data Engineer's guide to optimizing Kubernetes, with Niels Claeys

Niels Claeys shares how his team at Dataminded built Conveyor, a data platform processing up to 1.5 million core hours monthly. He explains the specific optimizations they discovered through production experience, from scheduler changes that immediately reduce costs by 10-15% to achieving 97% spot instance usage without reliability issues.You will learn:Why the default Kubernetes scheduler wastes money on batch workloads and how switching from "least allocated" to "most allocated" scheduling enables faster scale-down and better resource utilizationHow to achieve 97% spot instance adoption through strategic instance type diversification, region selection, and Spark-specific techniquesNode pool design principles that balance Kubernetes overhead with workload efficiencyPlatform-specific gotchas like AWS cross-AZ data transfer costs that can spike bills unexpectedlySponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/hGRfkzDJWInterested in sponsoring an episode? Learn more.

Oct 14, 2025

75

The Making of Flux: The Scale, a KubeFM Original Series

In this episode, Philippe Ensarguet, VP of Software Engineering at Orange, and Arnab Chatterjee, Global Head of Container & AI Platforms at Nomura, share how large enterprises are adopting Flux to drive reliable, compliant, and scalable platforms.How Orange uses Flux to manage bare-metal Kubernetes through its SYLVR project.Why FSIs rely on GitOps to balance agility with governance.How Flux helps enterprises achieve resilience, compliance, and repeatability at scale.SponsorJoin the Flux maintainers and community at FluxCon, November 11th in Atlanta—register hereMore infoFind all the links and info for this episode here: https://ku.bz/tWcHlJm7MInterested in sponsoring an episode? Learn more.

Oct 13, 2025

23m

74

How We Integrated Native macOS Workloads with Kubernetes, with Vitalii Horbachov

Vitalii Horbachov explains how Agoda built macOS VZ Kubelet, a custom solution that registers macOS hosts as Kubernetes nodes and spins up macOS VMs using Apple's native virtualization framework. He details their journey from managing 200 Mac minis with bash scripts to a Kubernetes-native approach that handles 20,000 iOS tests at scale.You will learn:How to build hybrid runtime pods that combine macOS VMs with Docker sidecar containers for complex CI/CD workflowsCustom OCI image format implementation for managing 55-60GB macOS VM images with layered copy-on-write disks and digest validationNetworking and security challenges including Apple entitlements, direct NIC access, and implementing kubectl exec over SSHReal-world adoption considerations including MDM-based host lifecycle management and the build vs. buy decision for Apple infrastructure at scaleSponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/q_JS76SvMInterested in sponsoring an episode? Learn more.

Oct 7, 2025

73

The Making of Flux: The Rewrite, a KubeFM Original Series

In this episode, Michael Bridgen (the engineer who wrote Flux's first lines) and Stefan Prodan (the maintainer who led the V2 rewrite) share how Flux grew from a fragile hack-day script into a production-grade GitOps toolkit.How early Flux addressed the risks of manual, unsafe Kubernetes upgradesWhy the complete V2 rewrite was critical for stability, scalability, and adoptionWhat the maintainers learned about building a sustainable, community-driven open-source projectSponsorJoin the Flux maintainers and community at FluxCon, November 11th in Atlanta—register hereMore infoFind all the links and info for this episode here: https://ku.bz/bgkgn227-Interested in sponsoring an episode? Learn more.

Oct 6, 2025

44m

72

Scaling CI horizontally with Buildkite, Kubernetes, and multiple pipelines, with Ben Poland

Ben Poland walks through Faire's complete CI transformation, from a single Jenkins instance struggling with thousands of lines of Groovy to a distributed Buildkite system running across multiple Kubernetes clusters.He details the technical challenges of running CI workloads at scale, including API rate limiting, etcd pressure points, and the trade-offs of splitting monolithic pipelines into service-scoped ones.You will learn:How to architect CI systems that match team ownership and eliminate shared failure points across servicesKubernetes scaling patterns for CI workloads, including multi-cluster strategies, predictive node provisioning, and handling API throttlingPerformance optimization techniques like Git mirroring, node-level caching, and spot instance management for variable CI demandsMigration strategies and lessons learned from moving away from monolithic CI, including proof-of-concept approaches and avoiding the sunk cost fallacySponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/klBmzMY5-Interested in sponsoring an episode? Learn more.

Sep 30, 2025

71

Not Every Problem Needs Kubernetes, with Danyl Novhorodov

Danyl Novhorodov, a veteran .NET engineer and architect at Eneco, presents his controversial thesis that 90% of teams don't actually need Kubernetes. He walks through practical decision-making frameworks, explores powerful alternatives like BEAM runtimes and Actor models, and explains why starting with modular monoliths often beats premature microservices adoption.You will learn:The COST decision framework - How to evaluate infrastructure choices based on Complexity, Ownership, Skills, and Time rather than industry hypePlatform engineering vs. managed services - How to honestly assess whether your team can compete with AWS, Azure, and Google's managed container platformsEvolutionary architecture approach - Why modular monoliths with clear boundaries often provide better foundations than distributed systems from day oneSponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/BYhFw8RwWInterested in sponsoring an episode? Learn more.

Sep 23, 2025

70

VerticalPodAutoscaler Went Rogue: It Took Down Our Cluster, with Thibault Jamet

Running 30 Kubernetes clusters serving 300,000 requests per second sounds impressive until your Vertical Pod Autoscaler goes rogue and starts evicting critical system pods in an endless loop.Thibault Jamet shares the technical details of debugging a complex VPA failure at Adevinta, where webhook timeouts triggered continuous pod evictions across their multi-tenant Kubernetes platform.You will learn:VPA architecture deep dive - How the recommender, updater, and mutating webhook components interact and what happens when the webhook failsHidden Kubernetes limits - How default QPS and burst rate limits in the Kubernetes Go client can cause widespread failures, and why these aren't well documented in Helm chartsMonitoring strategies for autoscaling - What metrics to track for webhook latency and pod eviction rates to catch similar issues before they become criticalSponsorThis episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/rf1pbWXdNInterested in sponsoring an episode? Learn more.

Sep 16, 2025

69

The Making of Flux: The Origin, a KubeFM Original Series

This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.You will learn:How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.How Flux sustained continuity after Weaveworks shut down through community governance.Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.SponsorJoin the Flux maintainers and community at FluxCon, November 11th in Atlanta—register hereMore infoFind all the links and info for this episode here: https://ku.bz/5Sf5wpd8yInterested in sponsoring an episode? Learn more.

Sep 15, 2025

22m

68

Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling, with Jorrick Stempher

Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to proactively anticipate load patterns and scale infrastructure, giving them the 8-9 minutes needed to provision new nodes on Vultr.You will learn:How to implement predictive scaling using Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patternsThe Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisionsReal-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutionsSponsorThis episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/clbDWqPYpInterested in sponsoring an episode? Learn more.

Sep 9, 2025

67

Solving Cold Starts: Uses Istio to Warm Up Java Pods, with Frédéric Gaudet

If you're running Java applications in Kubernetes, you've likely experienced the pain of slow pod startups affecting user experience during deployments and scaling events.Frédéric Gaudet, Senior SRE at BlaBlaCar, shares how his team solved the cold start problem for their 1,500 Java microservices using Istio's warm-up capabilities.You will learn:Why Java applications struggle with cold starts and how JIT compilation affects initial request latency in Kubernetes environmentsHow Istio's warm-up feature works to gradually ramp up traffic to new podsWhy other common solutions fail, including resource over-provisioning, init containers, and tools like GraalVMReal production impact from implementing this solution, including dramatic improvements in message moderation SLOs at BlaBlaCar's scale of 4,000 podsSponsorThis episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/grxcypt9jInterested in sponsoring an episode? Learn more.

Sep 2, 2025

66

Teaching Kubernetes to Scale with a MacBook Screen Lock, with Brian Donelan

Brian Donelan, VP Cloud Platform Engineering at JPMorgan Chase, shares his ingenious side project that automatically scales Kubernetes workloads based on whether his MacBook is open or closed.By connecting macOS screen lock events to CloudWatch, KEDA, and Karpenter, he built a system that achieves 80% cost savings by scaling pods and nodes to zero when he's away from his laptop.You will learn:How KEDA differs from traditional Kubernetes HPA - including its scale-to-zero capabilities, event-driven scaling, and extensive ecosystem of 60+ built-in scalersThe technical architecture connecting macOS notifications through CloudWatch to trigger Kubernetes autoscaling using Swift, AWS SDKs, and custom metricsCost optimization strategies including how to calculate actual savings, account for API costs, and identify leading indicators of compute demandCreative approaches to autoscaling signals beyond CPU and memory, including examples from financial services and e-commerce that could revolutionize workload managementSponsorThis episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/sFd8TL1cSInterested in sponsoring an episode? Learn more.

Aug 26, 2025

65

Building a Carbon and Price-Aware Kubernetes Scheduler, with Dave Masselink

Data centers consume over 4% of global electricity and this number is projected to triple in the next few years due to AI workloads.Dave Masselink, founder of Compute Gardener, discusses how he built a Kubernetes scheduler that makes scheduling decisions based on real-time carbon intensity data from power grids.You will learn:How carbon-aware scheduling works - Using real-time grid data to shift workloads to periods when electricity generation has lower carbon intensity, without changing energy consumptionTechnical implementation details - Building custom Kubernetes schedulers using the scheduler plugin framework, including pre-filter and filter stages for carbon and time-of-use pricing optimizationEnergy measurement strategies - Approaches for tracking power consumption across CPUs, memory, and GPUsSponsorThis episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/zk2xM1lfWInterested in sponsoring an episode? Learn more.

Aug 19, 2025

41m

64

How Policies Saved us a Thousand Headaches, with Alessandro Pomponio

Alessandro Pomponio from IBM Research explains how his team transformed their chaotic bare-metal clusters into a well-governed, self-service platform for AI and scientific workloads. He walks through their journey from manual cluster interventions to a fully automated GitOps-first architecture using ArgoCD, Kyverno, and Kueue to handle everything from policy enforcement to GPU scheduling.You will learn:How to implement GitOps workflows that reduce administrative burden while maintaining governance and visibility across multi-tenant research environmentsPractical policy enforcement strategies using Kyverno to prevent GPU monopolization, block interactive pod usage, and automatically inject scheduling constraintsFair resource sharing techniques with Kueue to manage scarce GPU resources across different hardware types while supporting both specific and flexible allocation requestsOrganizational change management approaches for gaining stakeholder buy-in, upskilling admin teams, and communicating policy changes to research usersSponsorThis episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.ioMore infoFind all the links and info for this episode here: https://ku.bz/5sK7BFZ-8Interested in sponsoring an episode? Learn more.

Aug 12, 2025

63

Dear friend, you have built a Kubernetes, with Mac Chaffee

Mac Chaffee, a platform engineer and security champion, examines why developers often underestimate the complexity of running modern applications and how overconfidence leads to expensive technical mistakes.You will learn:Why teams reject Kubernetes then rebuild it piece by piece - understanding the psychological factors, like overconfidence, that drive initial rejection of complex but proven toolsHow to identify the tipping point when DIY solutions become more complex than adopting established orchestration tools, especially around scaling and high availability challengesThe right approach to abstracting Kubernetes complexity - why hiding the Kubernetes API often backfires and how to build effective guardrails instead of reinventing interfacesWhy mentorship gaps lead to poor technical decisions - how the lack of proper apprenticeship programs in tech results in teams making expensive mistakes when building infrastructureSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/9nFPmG85fInterested in sponsoring an episode? Learn more.

Jun 24, 2025

62

Beyond Kubernetes: Serverless Execution Models for Variable Workloads, with Marc Campora

Marc Campora, a systems consultant with experience in high-throughput platforms, shares his analysis of a real customer deployment with 500+ microservices. He breaks down the cost implications, technical constraints, and operational trade-offs between Kubernetes containers and AWS Lambda functions based on actual production data and migration assessments.You will learn:Cost analysis frameworks for comparing Lambda vs Kubernetes across different traffic patterns, including specific examples of 3x savings potential and the 80/20 rule for service utilizationMigration complexity factors when moving existing microservices to Lambda, including cold start issues, runtime model changes, and why it's often a complete rewrite rather than a simple portDecision criteria for choosing between platforms based on traffic consistency, computational requirements, and operational overhead toleranceSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/5gMTkzLhVInterested in sponsoring an episode? Learn more.

Jun 17, 2025

61

Shared Nothing, Shared Everything: The Truth About Kubernetes Multi-Tenancy, with Molly Sheets

Molly Sheets, Director of Engineering for Kubernetes at Zynga, discusses her team's approach to platform engineering. She explains why their initial one-cluster-per-team model became unsustainable and how they're transitioning to multi-tenant architectures.You will learn:Why slowing down deployments actually increases risk and how manual approval gates can make systems less resilient than faster, smaller deploymentsThe operational reality of cluster proliferation - why managing hundreds of clusters becomes unsustainable and when multi-tenancy becomes necessaryPractical multi-tenancy implementation strategies including resource quotas, priority classes, and namespace organization patterns that work in productionBetter metrics for multi-tenant environments - why control plane uptime doesn't matter and how to build meaningful SLOs for distributed platform healthSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/Rmpl8948_Interested in sponsoring an episode? Learn more.

Jun 10, 2025

60

My pipelines from GitLab Commit to ArgoCD got beaten by FTP, with David Pech

A sophisticated GitLab CI/CD pipeline integrated with Argo CD was ultimately rejected in favour of simple FTP deployment, offering crucial insights into the real barriers facing cloud-native adoption in traditional organisations.David Pech, Staff Cloud Ops Engineer at Wrike and holder of all CNCF certifications, shares his experience supporting a PHP team after a company merger. He details how he built a complete cloud-native platform with Kubernetes, Helm charts, and GitOps workflows, only to see it fail against cultural and organizational resistance despite its technical superiority.You will learn:The hidden costs of sophisticated tooling - How GitOps pipelines with multiple moving parts can create trust issues when developers lose local control and must rely on remote processes they don't understandCultural factors that trump technical benefits - Why customer expectations, existing Windows-based infrastructure, and team readiness matter more than the elegance of your Kubernetes solutionPractical strategies for incremental adoption - The importance of starting small, building in-house operational expertise, and ensuring management advocacy at all levels before attempting cloud-native transformationsSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/_MWX5m6G_Interested in sponsoring an episode? Learn more.

Jun 3, 2025

59

Performance testing Kubernetes workloads, with Stephan Schwarz

If you're tasked with performance testing Kubernetes workloads without much guidance, this episode offers clear, experience-based strategies that go beyond theory.Stephan Schwarz, a DevOps engineer at iits-consulting, walks through his systematic approach to performance testing Kubernetes applications. He covers everything from defining what performance actually means, to the practical methodology of breaking individual pods to understand their limits, and navigating the complexities of Kubernetes-specific components that affect test results.You will learn:How to establish baseline performance metrics by systematically testing individual pods, disabling autoscaling features, and documenting each incremental change to understand real application limitsWhy shared Kubernetes components skew results and how ingress controllers, service meshes, and monitoring stacks create testing challenges that require careful consideration of the entire request chainPractical approaches to HPA configuration, including how to account for scaling latency, the time delays inherent in Kubernetes scaling operations, and planning for spare capacity based on your SLA requirementsThe role of observability tools like OpenTelemetry in production environments where load testing isn't feasible, and how distributed tracing helps isolate performance bottlenecks across interdependent servicesSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/yY-FnmGfHInterested in sponsoring an episode? Learn more.

May 27, 2025

58

Managing 100s of Kubernetes Clusters using Cluster API, with Zain Malik

Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.You will learn:How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning timesWhy Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needsHow implementing GitOps principles eliminates manual intervention in critical operations like cluster upgradesStrategies for handling production incidents and bugs when adopting emerging technologies like Cluster APISponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/5PLksqVlkInterested in sponsoring an episode? Learn more.

May 20, 2025

57

Super-Scaling Open Policy Agent with Batch Queries, with Nicholaos Mouzourakis

Dive into the technical challenges of scaling authorization in Kubernetes with this in-depth conversation about Open Policy Agent (OPA).Nicholaos Mouzourakis, Staff Product Security Engineer at Gusto, explains how his team re-architected Kubernetes native authorization using OPA to support scale, latency guarantees, and audit requirements across services. He shares detailed insights about their journey optimizing OPA performance through batch queries and solving unexpected interactions between Kubernetes resource limits and Go's runtime behavior.You will learn:Why traditional authorization approaches (code-driven and data-driven) fall short in microservice architectures, and how OPA provides a more flexible, decoupled solutionHow batch authorization can improve performance by up to 18x by reducing network round-tripsThe unexpected interaction between Kubernetes CPU limits and Go's thread management (GOMAXPROCS) that can severely impact OPA performancePractical deployment strategies for OPA in production environments, including considerations for sidecars, daemon sets, and WASM modulesSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/S-2vQ_j-4Interested in sponsoring an episode? Learn more.

May 13, 2025

56

Kubernetes upgrades: beyond the one-click update, with Tanat Lokejaroenlarb

Discover how Adevinta manages Kubernetes upgrades at scale in this episode with Tanat Lokejaroenlarb. Tanat shares his team's journey from time-consuming blue-green deployments to efficient in-place upgrades for their multi-tenant Kubernetes platform SHIP, detailing the engineering decisions and operational challenges they overcame.You will learn:How to transition from blue-green to in-place Kubernetes upgrades while maintaining service reliabilityTechniques for tracking and addressing API deprecations using tools like Pluto and Kube-no-troubleStrategies for minimizing SLO impact during node rebuilds through serialized approaches and proper PDB configurationWhy a phased upgrade approach with "cluster waves" provides safer production deployments even with thorough testingSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/VVHFfXGl_Interested in sponsoring an episode? Learn more.

May 6, 2025

55

From Fragile to Faultless: Kubernetes Self-Healing In Practice, with Grzegorz Głąb

Discover how to build resilient Kubernetes environments at scale with practical automation strategies from an engineer who's tackled complex production challenges.Grzegorz Głąb, Kubernetes Engineer at Cloud Kitchens, shares his team's journey developing a comprehensive self-healing framework. He explains how they addressed issues ranging from spot node preemptions to network packet drops caused by unbalanced IRQs, providing concrete examples of automation that prevents downtime and improves reliability.You will learn:How managed Kubernetes services like AKS provide benefits but require customization for specific use casesThe architecture of an effective self-healing framework using DaemonSets and deployments with Kubernetes-native componentsPractical solutions for common challenges like StatefulSet pods stuck on unreachable nodes and cleaning up orphaned podsTechniques for workload-level automation, including throttling CPU-hungry pods and automating diagnostic data collectionSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/yg_fkP0LNInterested in sponsoring an episode? Learn more.

Apr 29, 2025

54

Replacing StatefulSets with a custom Kubernetes operator in our Postgres cloud platform, with Andrew Charlton

Discover why standard Kubernetes StatefulSets might not be sufficient for your database workloads and how custom operators can provide better solutions for stateful applications.Andrew Charlton, Staff Software Engineer at Timescale, explains how they replaced Kubernetes StatefulSets with a custom operator called Popper for their PostgreSQL Cloud Platform. He details the technical limitations they encountered with StatefulSets and how their custom approach provides more intelligent management of database clusters.You will learn:Why StatefulSets fall short for managing high-availability PostgreSQL clusters, particularly around pod ordering and volume managementHow Timescale's instance matching approach solves complex reconciliation challenges when managing heterogeneous database workloadsThe benefits of implementing discrete, idempotent actions rather than workflows in Kubernetes operatorsReal-world examples of operations that became possible with their custom operator, including volume downsizing and availability zone consolidationSponsorThis episode is brought to you by mirrord — run local code like in your Kubernetes cluster without deploying first.More infoFind all the links and info for this episode here: https://ku.bz/fhZ_pNXM3Interested in sponsoring an episode? Learn more.

Apr 22, 2025

53

Saving 10s of thousands of dollars deploying AI at scale with Kubernetes, with John McBride

Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.You will learn:How to deploy VLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and daemon setsWhy smaller models (7-14B parameters) can achieve 95% effectiveness for many tasks compared to larger commercial models, with proper prompt engineeringHow running inference workloads on your own infrastructure with T4 GPUs can reduce costs from tens of thousands to just a couple thousand dollars monthlyPractical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issuesSponsorThis episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input - existing cloud environments, IaC or application code.More infoFind all the links and info for this episode here: https://ku.bz/wP6bTlrFsInterested in sponsoring an episode? Learn more.

Mar 18, 2025

52

I just want mTLS on Kubernetes, with John Howard

Dive into the world of Kubernetes security with this insightful conversation about securing cluster traffic through encryption.John Howard, Senior Software Engineer at Solo.io, explains the complexities of implementing Mutual TLS (mTLS) in Kubernetes. He discusses the evolution from DIY approaches to Service Mesh solutions, focusing on Istio's Ambient Mesh as a simplified path to workload encryption.You will learn:Why DIY mTLS implementation in Kubernetes is challenging at scale, requiring certificate management, application updates, and careful transition planningHow Service Mesh solutions offload security concerns from applications, allowing developers to focus on business logic while infrastructure handles encryptionThe advantages of Ambient Mesh's approach to simplifying mTLS implementation with its node proxy and waypoint proxy architectureSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/sk-ZF1PG9Interested in sponsoring an episode? Learn more.

Mar 4, 2025

51

Learned it the hard way: don't use Cilium's default Pod CIDR, with Isala Piyarisi

This episode examines how a default configuration in Cilium CNI led to silent packet drops in production after 8 months of stable operations.Isala Piyarisi, Senior Software Engineer at WSO2, shares how his team discovered that Cilium's default Pod CIDR (10.0.0.0/8) was conflicting with their Azure Firewall subnet assignments, causing traffic disruptions in their staging environment.You will learn:How Cilium's default CIDR allocation can create routing conflicts with existing infrastructureA methodical process for debugging network issues using packet tracing, routing table analysis, and firewall logsThe procedure for safely changing Pod CIDR ranges in production clustersSponsorThis episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.More infoFind all the links and info for this episode here: https://ku.bz/kJjXQlmTwInterested in sponsoring an episode? Learn more.

Feb 25, 2025