Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News

PODCAST · news

Ship It Weekly is a short, practical recap of what actually matters in DevOps, SRE, cloud infrastructure, and platform engineering.

Each episode, your host Brian Teller walks through the latest outages, releases, tools, and incident writeups, then translates them into “here’s what this means for your systems” instead of just reading headlines. Expect a couple of main stories with context, a quick hit of tools or releases worth bookmarking, and the occasional segment on on-call, burnout, or team culture.

This isn’t a certification prep show or a lab walkthrough. It’s aimed at people who are already working in the space and want to stay sharp without scrolling status pages, cloud updates, and blogs all week. You’ll hear about things like cloud provider incidents, Kubernetes and platform trends, Terraform and infrastructure changes, and real postmortems that are actually worth your time.

Most episodes are 10–25 minutes, so you can catch up on the way

  1. 37

    Cursor Deletes PocketOS Prod DB, .de DNSSEC Outage, Bluesky Postmortem, Argo CD, and Copy Fail

    This episode of Ship It Weekly is about modern reliability getting squeezed from both directions. Old-school failures still hit hard, like broken DNSSEC, kernel privilege escalation bugs, and GitOps behavior changes. But newer automation layers add a second kind of risk, where AI agents, machine identity, and cloud control planes can do real damage fast when authority is too broad. Brian covers the Cursor and PocketOS production database wipe, the .de DNSSEC outage and Cloudflare’s response, Bluesky’s April outage postmortem, Argo CD v3.1.16 reaching end of life plus the v3.4.1 behavior change, Linux kernel CVE-2026-31431 under active exploitation, and why Google Cloud Agent Identity and AWS MCP Server GA both point to agents becoming first-class infrastructure actors.

    Sponsored by Guardsquare https://hubs.ly/Q04fJgkJ0

    Links
    Cursor / PocketOS production database wipe https://www.tellerstech.com/on-call-brief/2026-W19/
    Cloudflare on the .de DNSSEC outage https://blog.cloudflare.com/de-tld-outage-dnssec/
    Bluesky April 2026 outage postmortem https://pckt.blog/b/jcalabro/april-2026-outage-post-mortem-219ebg2
    Argo CD releases: v3.1.16 final release and v3.4.1 behavior change https://github.com/argoproj/argo-cd/releases
    Linux kernel CVE-2026-31431 https://nvd.nist.gov/vuln/detail/CVE-2026-31431
    AWS bulletin for CVE-2026-31431 https://aws.amazon.com/security/security-bulletins/rss/2026-026-aws/
    Google Cloud Agent Identity https://cloud.google.com/blog/products/identity-security/whats-new-in-iam-security-governance-and-runtime-defense
    AWS MCP Server is now generally available https://aws.amazon.com/blogs/aws/the-aws-mcp-server-is-now-generally-available/
    Cross-region disaster recovery for Amazon EKS using AWS Backup https://aws.amazon.com/blogs/containers/cross-region-disaster-recovery-for-amazon-eks-using-aws-backup/
    Google Ads new data retention policy starting June 1, 2026 https://ads-developers.googleblog.com/2026/05/new-data-retention-policy-for-google.html
    This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W19/
    More episodes and show notes https://shipitweekly.fm/

  2. 36

    Ship It Conversations: Gareth Kersey on IaCConf 2026, AI, and Corey Quinn’s Terraform Keynote

    This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

    This episode is not sponsored. I wanted to cover IaCConf because the theme lines up closely with what Ship It Weekly focuses on: infrastructure, platform engineering, DevOps, SRE, and how teams are adapting to AI-driven change.

    In this Ship It: Conversations episode, I talk with Gareth Kersey about IaCConf 2026, a free virtual conference focused on infrastructure as code, platform engineering, DevOps, SRE, and infrastructure operations. The conference is on May 14th, 2026.

    The main theme is “keeping pace.” Not just keeping pace with new tools, but keeping pace with the speed of software delivery now that AI is changing how quickly application teams can write, ship, and change code.

    We talk about what that means for the infrastructure teams underneath it all: the people responsible for Terraform, Kubernetes, GitOps, policies, secrets, cost, security, rollback paths, and making sure faster delivery does not turn into faster chaos.

    Gareth walks through the IaCConf 2026 agenda, including Corey Quinn’s keynote, AI and Terraform sessions, platform engineering panels, Kubernetes and Argo CD talks, AI agents managing infrastructure as code, governance challenges, and the risk of 10x code velocity becoming 10x operational risk.

    The bigger theme here is that AI is not just changing how code gets written. It is changing the pressure on the systems around delivery. Infrastructure as code, platform engineering, policy, and operational guardrails matter even more when the pace of change goes up.

    Highlights
    • What “keeping pace” means for infrastructure, DevOps, SRE, and platform teams
    • Why faster application development can create more downstream operational pressure
    • Corey Quinn’s keynote, “AI Speaks Terraform Like a Tourist”
    • How AI-generated infrastructure changes create new governance and review challenges
    • Why infrastructure as code still matters as AI agents and automation become more common
    • Sessions covering Terraform, Kubernetes, Argo CD, GitOps, platform engineering, and AI-driven workflows
    • The risk of 10x code velocity turning into 10x operational risk
    • How platform teams can support faster developers without giving up safety or governance
    • Why IaCConf includes panels, demos, technical talks, and practitioner stories instead of only tool-specific content
    • How IaCConf has grown from its first event in 2025 into a broader infrastructure community
    • Why the event is trying to stay community-focused instead of becoming just another vendor marketing conference
    • The role of feedback, future spotlight events, in-person meetups, and possible community spaces around IaCConf
    • Why registering still makes sense even if you cannot attend live, since sessions are available afterward

    IaCConf links
    • IaCConf 2026 registration page - https://www.iacconf.com/iacconf-2026
    • IaCConf LinkedIn page - https://www.linkedin.com/showcase/iac-conf/
    • IaCConf: https://www.iacconf.com/
    • IaCConf is supported by Spacelift: https://spacelift.com

    Our links
    More episodes + show notes + links: https://shipitweekly.fm
    On Call Brief: https://oncallbrief.com

  3. 35

    GitHub RCE, AI Agent Prompt Injection, and the New Reality: Your Developer Toolchain Is Production Now

    This episode of Ship It Weekly is about the developer toolchain becoming part of production. Brian covers GitHub’s critical git push RCE, AI-assisted reverse engineering, prompt injection against AI agents in GitHub workflows, Elementary’s malicious CLI release, GitHub’s merge queue regression, Cal.com going closed source, and Copilot moving toward usage-based billing. Plus: MinIO’s repo archive, Ghostty leaving GitHub, Docker Hardened Images, and Azure DevOps security updates.

    Links
    GitHub git push RCE https://github.blog/security/securing-the-git-push-pipeline-responding-to-a-critical-remote-code-execution-vulnerability/
    AI-assisted reverse engineering https://www.darkreading.com/application-security/reverse-engineering-ai-unearths-high-severity-github-bug
    AI agents + GitHub Actions prompt injection https://www.theregister.com/2026/04/15/claude_gemini_copilot_agents_hijacked/
    Elementary malicious CLI release https://www.elementary-data.com/post/security-incident-report-malicious-release-of-elementary-oss-python-cli-v0-23-3
    GitHub merge queue regression https://github.blog/news-insights/company-news/an-update-on-github-availability/
    Cal.com going closed source https://cal.com/blog/cal-com-goes-closed-source-why
    GitHub Copilot billing https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/
    MinIO archived repo https://github.com/minio/minio
    Ghostty leaving GitHub https://mitchellh.com/writing/ghostty-leaving-github
    Docker Hardened Images https://www.docker.com/blog/why-we-chose-the-harder-path-docker-hardened-images-one-year-later/
    Azure DevOps security updates https://devblogs.microsoft.com/devops/one-click-security-scanning-and-org-wide-alert-triage-come-to-advanced-security/
    On Call Brief https://oncallbrief.com/
    More episodes https://shipitweekly.fm/

  4. 34

    Kubernetes 1.36, Gateway API v1.5, AWS Copilot End of Support, and Cloudflare Non-Human Identities

    This episode of Ship It Weekly is about platforms getting sharper about defaults, ownership, and the old paths they are no longer willing to quietly carry forever. Brian covers Kubernetes 1.36 and why it feels more like a cleanup-and-maturity release than a flashy feature dump, Gateway API v1.5 moving more networking behavior into the stable path, AWS Copilot CLI reaching end of support and what that means for teams still sitting on the older “easy” ECS workflow, Airbnb’s alert-development overhaul and why noisy or weak alerts are often a workflow problem long before they become an on-call problem, and Cloudflare’s push to treat scripts, agents, and third-party tools like real identities with real blast radius. He also hits the latest Azure DevOps Server patches and Google’s OTLP metrics support for Cloud Monitoring.

    Links
    Kubernetes v1.36 release https://kubernetes.io/blog/2026/04/22/kubernetes-v1-36-release/
    Gateway API v1.5 https://kubernetes.io/blog/2026/04/21/gateway-api-v1-5/
    AWS Copilot CLI end of support https://aws.amazon.com/blogs/containers/announcing-the-end-of-support-for-the-aws-copilot-cli/
    Airbnb on alert development https://medium.com/airbnb-engineering/it-wasnt-a-culture-problem-upleveling-alert-development-at-airbnb-01e2290eb0f5
    Cloudflare on non-human identities, OAuth visibility, and scoped permissions https://blog.cloudflare.com/improved-developer-security/
    Azure DevOps Server April patches https://devblogs.microsoft.com/devops/april-patches-for-azure-devops-server/
    OTLP metrics for Google Cloud Monitoring https://cloud.google.com/blog/products/management-tools/otlp-opentelemetry-protocol-for-google-cloud-monitoring-metrics
    Past episode where we talked about Cloudflare Mesh https://www.tellerstech.com/ship-it-weekly/aws-interconnect-ga-cloudflare-mesh-gitlab-19-eks-auto-mode-and-opentelemetry-config/
    This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W16/
    On Call Brief: https://oncallbrief.com/
    More episodes and show notes https://shipitweekly.fm/

  5. 33

    Ship It Conversations: Stephane Moser on Pipedrive’s Jenkins-to-GitHub Actions Migration, Argo CD, and CI/CD at Scale

    This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

    In this Ship It: Conversations episode, I talk with Stephane Moser about Pipedrive’s move from Jenkins to GitHub Actions, building self-hosted runners on Kubernetes, shifting deployments toward GitOps with Argo CD, and what it actually takes to roll out a big CI/CD change across a large engineering org.

    We talk about why Jenkins had become painful, from Groovy friction to noisy-neighbor problems on shared VMs, why GitHub Actions fit better, how reusable workflows and custom actions helped, why Argo CD beat out Flux for their use case, and how they had to build better observability and internal deployment visibility around GitHub as they scaled.

    The bigger theme here is that this was not just a tooling swap. It was a product and platform migration. Isolation, repeatability, self-service, rollout strategy, and observability mattered just as much as the actual CI/CD tools.

    Highlights
    • Why Jenkins stopped working well for them: Groovy friction, shared VM contention, and poor predictability
    • Replacing CodeShip pull request validation first as the low-blast-radius starting point
    • Using Actions Runner Controller on Kubernetes with EKS and Karpenter for self-hosted runners
    • Why reusable workflows and custom actions helped cut repetition across hundreds of services
    • Choosing Argo CD over Flux, Argo Workflows, Tekton, and even a short Spinnaker attempt
    • Moving from push-based deploys toward GitOps for better isolation and safer credentials handling
    • Building internal observability because GitHub’s workflow visibility was not enough at their scale
    • Dogfooding first, then rolling migration out in batches until teams could self-serve the move
    • What broke when the new system actually worked too well: bot-driven deploy volume, queueing, and fairness
    • The mobile side of the story: Mac minis, unstable runners, GitHub-hosted runners, and a very different migration path
    • How AI sped up parts of the mobile migration and troubleshooting, without making the migration trivial
    • Stephane’s advice for big CI/CD shifts: start small, reduce blast radius, and use your own platform first

    Stephane’s links
    • LinkedIn: https://www.linkedin.com/in/moserss/
    • Talk video: https://www.youtube.com/watch?v=VrE1dh-1zEY
    • Blog post Part 1: https://medium.com/pipedrive-engineering/so-long-jenkins-hello-github-actions-pipedrives-big-ci-cd-switch-03be29c75f63
    • Blog post Part 2: https://medium.com/pipedrive-engineering/all-aboard-the-github-actions-express-pipedrives-big-ci-cd-switch-part-2-fcacf834afd2
    • GitHub: https://github.com/moser-ss

    Our links
    More episodes + show notes + links: https://shipitweekly.fm
    On Call Brief: https://oncallbrief.com

  6. 32

    AWS Interconnect GA, Cloudflare Mesh, GitLab 19, EKS Auto Mode, and OpenTelemetry Config

    This episode of Ship It Weekly is about networking, ingress, and private access moving further up into the platform layer. Brian covers AWS Interconnect going generally available, Cloudflare Mesh, GitLab 19.0 breaking changes around Gateway API and bundled services, EKS Auto Mode networking, and OpenTelemetry declarative config reaching stability. He also hits containerd security patches, GitHub’s new Code Security risk assessment, and AWS guidance on securing AI agents with MCP.

    Links
    AWS Interconnect GA and last mile connectivity https://aws.amazon.com/blogs/aws/aws-interconnect-is-now-generally-available-with-a-new-option-to-simplify-last-mile-connectivity/
    Cloudflare Mesh https://blog.cloudflare.com/mesh/
    GitLab 19.0 breaking changes https://about.gitlab.com/blog/a-guide-to-the-breaking-changes-in-gitlab-19-0/
    EKS Auto Mode networking https://aws.amazon.com/blogs/containers/navigating-enterprise-networking-challenges-with-amazon-eks-auto-mode/
    OpenTelemetry declarative config reaches stability https://opentelemetry.io/blog/2026/stable-declarative-config/
    containerd security releases https://github.com/containerd/containerd/releases
    GitHub Code Security risk assessment for organizations https://github.blog/changelog/2026-04-08-code-security-risk-assessment-available-for-organizations/
    AWS secure AI agent access patterns using MCP https://aws.amazon.com/blogs/security/secure-ai-agent-access-patterns-to-aws-resources-using-model-context-protocol/
    This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W16/
    More episodes and show notes https://shipitweekly.fm/

  7. 31

    Special: Claude Mythos Preview and Project Glasswing: AI Exploit Discovery, Zero-Day Risk, Business Fallout, and What It Means for DevOps, Cloud, and Platform Security

    In this Ship It Weekly special, Brian breaks down Claude Mythos Preview and Project Glasswing, and why this story matters beyond normal AI launch hype.

    Anthropic is treating Mythos like a real security inflection point, not just a better coding model. Project Glasswing is their coordinated effort to get early access into the hands of defenders, critical software maintainers, and major infrastructure organizations before similar capability becomes more broadly available. If OpenClaw was about agents becoming a new control plane, this episode is about what happens when finding ways into messy environments and control planes starts getting faster too.

    We walk through the practical angle for DevOps, cloud, platform, and infra teams: exploit timelines may be compressing, platform debt becomes attacker leverage, and the boring work most orgs treat like cleanup suddenly looks a lot more like frontline security work. We also zoom out to the business side, including why banks, regulators, and government officials are already paying attention.

    Chapters
    Why This Episode Exists
    OpenClaw Callback
    What Actually Happened
    Don’t Get Gullible, Don’t Get Lazy
    What Changes If This Is Even Half True
    Why Business People Should Care
    What This Means for DevOps, Cloud, and Platform
    Boring Work Just Got Promoted
    The Uncomfortable Takeaway
    What I’d Do Right Now

    Links from this episode
    Claude Mythos Preview https://red.anthropic.com/2026/mythos-preview/
    Project Glasswing https://www.anthropic.com/project/glasswing
    AI cyber threats: open letter to business leaders https://www.gov.uk/government/publications/ai-cyber-threats-open-letter-to-business-leaders/ai-cyber-threats-open-letter-to-business-leaders-html
    AI-boosted hacks with Anthropic’s Mythos could have dire consequences for banks https://www.reuters.com/legal/litigation/ai-boosted-hacks-with-anthropics-mythos-could-have-dire-consequences-banks-2026-04-13/
    ECB to quiz bankers about risks of Anthropic's new AI model, source says https://www.reuters.com/world/ecb-warn-bankers-about-new-anthropic-model-risks-source-says-2026-04-15/
    Related episode: OpenClaw special https://www.tellerstech.com/ship-it-weekly/special-openclaw-security-timeline-and-fallout-cve-2026-25253-one-click-token-leak-malicious-clawhub-skills-exposed-agent-control-panels-and-why-local-ai-agents-are-a-new-devops-sre-control-plane/

  8. 30

    Amazon S3 Files, Malicious npm Plugins, Trivy Fallout, and Kubernetes’ Gateway Shift

    This episode of Ship It Weekly is about the interface layer becoming the story. Brian covers Amazon S3 Files and why it feels more like a managed filesystem layer in front of S3 than “S3 is EFS now,” including how it relates to the old s3fs and FUSE-style approach. He also digs into 36 malicious npm packages posing as Strapi plugins, the uglier follow-on to the Trivy incident he discussed previously, Kubernetes Ingress2Gateway 1.0 and the push toward Gateway API, and Kubernetes Agent Sandbox as a sign that newer AI-style workloads are starting to reshape the platform itself.

    Links
    Amazon S3 Files https://aws.amazon.com/blogs/aws/launching-s3-files-making-s3-buckets-accessible-as-file-systems/
    Malicious npm packages posing as Strapi plugins https://thehackernews.com/2026/04/36-malicious-npm-packages-exploited.html
    Trivy follow-on incident discussion https://github.com/aquasecurity/trivy/discussions/10425
    RoseSecurity on Trivy / typosquatting angle https://rosesecurity.dev/2026/03/20/typosquatting-trivy.html
    Earlier episode covering the first Trivy incident https://www.tellerstech.com/ship-it-weekly/aws-bahrain-uae-data-center-issues-amid-iran-strikes-argocd-vs-flux-gitops-failures-github-actions-hackerbot-claw-attacks-trivy-roguepilot-codespaces-prompt-injection-block-ai-remake/
    Kubernetes Ingress2Gateway 1.0 https://kubernetes.io/blog/2026/03/20/ingress2gateway-1-0-release/
    Kubernetes Agent Sandbox https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/
    Fortinet FortiClient EMS emergency patch https://www.fortiguard.com/psirt/FG-IR-26-099
    Karpathy post https://x.com/karpathy/status/2036487306585268612
    ProofShot https://github.com/AmElmo/proofshot
    More episodes and show notes https://shipitweekly.fm
    On Call Briefs https://oncallbrief.com

  9. 29

    Ship It Conversations: David Tuite on Backstage, Internal Developer Portals, and the Shift to AI Agents

    This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

    In this Ship It: Conversations episode, I talk with David Tuite, founder and CEO of Roadie, about internal developer portals, Backstage, automation, and how IDPs may evolve as AI agents become more common in engineering workflows.

    We talk about the difference between a platform and a portal, the three common problems IDPs usually try to solve, why discoverability tends to be the first pain teams feel, and why a lot of orgs should start with automation before trying to perfect a service catalog. We also get into self-hosted Backstage vs managed options, and how teams should think about adoption, data models, and time to value.

    The bigger theme is the one I found most interesting: IDPs may be shifting away from dashboard-heavy “single pane of glass” thinking and toward becoming context layers for workflows, terminals, and eventually agents.

    Highlights
    • The difference between an internal developer platform and an internal developer portal
    • The three common IDP problem areas: discoverability, automation, and guardrails
    • Why discoverability is usually the first pain teams feel
    • Why adoption is often more of a human problem than a technical one
    • Catalog completeness vs team ownership
    • Why a lot of teams should start with automation first
    • Self-hosted Backstage vs SaaS tradeoffs: extensibility, control, lock-in, and time to value
    • Why IDPs may move from dashboards to context delivery for humans and agents
    • Why AI helps teams build faster, but does not solve the problem of building the right thing
    • David’s advice for platform and DevEx teams: talk to your internal users first

    David’s links
    • LinkedIn: https://www.linkedin.com/in/davidtuite/

    Roadie / Backstage
    • Roadie: https://roadie.io/
    • Backstage: https://backstage.io/

    Stuff mentioned
    • Workday
    • Backstage
    • GitHub
    • GitLab
    • Bitbucket
    • Azure DevOps
    • Argo CD
    • LaunchDarkly
    • CircleCI
    • DORA metrics
    • MCP-style context for agents

    Our links
    More episodes + show notes + links: https://shipitweekly.fm
    On Call Brief: https://oncallbrief.com

  10. 28

    GitHub Actions Hardening, Airbnb Config Rollouts, Cloudflare Rust Restarts, ECS Managed Daemons, and Terraform Access Controls

    This episode of Ship It Weekly is about the quiet platform work that keeps things safe before they break. Brian covers GitHub Actions hardening in Kubernetes-related repos, Airbnb’s safer config rollouts, Cloudflare’s zero-downtime Rust restarts, Amazon ECS Managed Daemons, and HCP Terraform access controls with IP allow lists and temporary AWS permission delegation.

    Links
    GitHub Actions security roadmap https://github.blog/news-insights/product-news/whats-coming-to-our-github-actions-2026-security-roadmap/
    Airbnb config rollouts https://medium.com/airbnb-engineering/safeguarding-dynamic-configuration-changes-at-scale-5aca5222ed68
    Cloudflare graceful restarts for Rust https://blog.cloudflare.com/ecdysis-rust-graceful-restarts/
    Amazon ECS Managed Daemons https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-ecs-managed-daemons/
    HCP Terraform IP allow lists https://www.hashicorp.com/blog/hcp-terraform-adds-ip-allow-list-for-terraform-resources
    HCP Terraform AWS permission delegation https://www.hashicorp.com/blog/aws-permission-delegation-now-generally-available-in-hcp-terraform
    GitHub secret scanning updates https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/
    GitHub secret scanning for AI coding agents https://github.blog/changelog/2026-03-31-secret-scanning-extends-to-ai-coding-agents-via-the-github-mcp-server/
    Codespaces GA with data residency https://github.blog/changelog/2026-04-01-codespaces-is-now-generally-available-for-github-enterprise-with-data-residency
    Kubernetes v1.36 sneak peek https://kubernetes.io/blog/2026/03/30/kubernetes-v1-36-sneak-peek/
    GKE Inference Gateway https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway
    More episodes and show notes https://shipitweekly.fm
    On Call Briefs https://oncallbrief.com

  11. 27

    Hackerbot-Claw Grows, Xygeni Tag Poisoning, GitHub Search HA, Windows SID Failures, and AI Skills Supply Chain

    This episode of Ship It Weekly is about the places where convenience quietly turns into trust.

    Brian revisits the Trivy story by zooming out to the bigger hackerbot-claw GitHub Actions campaign, then gets into the Xygeni tag-poisoning compromise, GitHub’s search high availability rebuild for GitHub Enterprise Server, Windows Server 2025 surfacing duplicate SID problems in cloned images, and the agent-skills ecosystem replaying package supply chain history. Plus: a quick lightning round on GitHub pausing self-hosted runner minimum-version enforcement and March secret scanning updates.

    Links
    OpenSSF advisory on active GitHub Actions exploitation https://seclists.org/oss-sec/2026/q1/246
    Xygeni action compromise via tag poisoning https://www.stepsecurity.io/blog/xygeni-action-compromised-c2-reverse-shell-backdoor-injected-via-tag-poisoning
    GitHub Enterprise Server search high availability rebuild https://github.blog/engineering/architecture-optimization/how-we-rebuilt-the-search-architecture-for-high-availability-in-github-enterprise-server/
    Microsoft on duplicate SIDs and nongeneralized Windows Server 2025 images https://learn.microsoft.com/en-us/troubleshoot/exchange/administration/exchange-server-issues-on-incorrect-windows-server-image
    Socket on supply chain security for skills.sh https://socket.dev/blog/socket-brings-supply-chain-security-to-skills
    Snyk ToxicSkills research https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/
    GitHub self-hosted runner minimum version enforcement paused https://github.blog/changelog/2026-03-13-self-hosted-runner-minimum-version-enforcement-paused/
    GitHub secret scanning pattern updates, March 2026 https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/
    More episodes and show notes at https://shipitweekly.fm
    On Call Briefs at https://oncallbrief.com

  12. 26

    Ship It Conversations: Ang Chen on Project Vera, AI Cloud Emulation, and Safer Infrastructure Testing

    This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

    In this Ship It: Conversations episode, I talk with Ang Chen from the University of Michigan about Project Vera, a cloud emulator built to help teams test infrastructure changes more safely before they touch real cloud.

    We talk about why testing against real cloud APIs is slow, expensive, and risky, how Vera works under tools like Terraform and CloudFormation, what “high fidelity” actually means, and where a tool like this could fit in local dev and CI/CD.

    The bigger theme is one I think matters a lot: if AI is going to play a real role in cloud operations, it probably needs a sandbox first, not direct access to production.

    Note
    This interview was recorded on February 13, 2026. Since then, Vera’s public project materials have expanded the framing a bit further around multi-cloud support and safe environments for agent learning, so keep that in mind while listening.

    Highlights
    • Why real cloud testing still creates cost, delay, and risk
    • How Vera emulates cloud behavior at the API layer
    • Where this could help with Terraform, CloudFormation, and CI/CD workflows
    • Why “useful enough to catch real mistakes” may matter more than perfect emulation
    • The limits, tradeoffs, and fidelity questions that still need to be solved
    • Why safe training grounds may matter before AI agents touch real infrastructure

    Ang’s links
    • LinkedIn: https://www.linkedin.com/in/ang-chen-8b877a17/
    • University of Michigan profile: https://eecs.engin.umich.edu/people/chen-ang/
    • Publications: https://web.eecs.umich.edu/~chenang/pubs.html

    Project Vera
    • Project site: https://project-vera.github.io/
    • GitHub: https://github.com/project-vera/vera
    • The quest for AI Agents as DevOps: https://project-vera.github.io/blogs/cloudagent/cloudagent/
    • No More Manual Mocks: https://project-vera.github.io/blogs/cloudemu/cloudemu/

    Stuff mentioned
    • A Case for Learned Cloud Emulators: https://dl.acm.org/doi/10.1145/3718958.3754799
    • Cloud Infrastructure Management in the Age of AI Agents: https://dl.acm.org/doi/abs/10.1145/3759441.3759443
    • LocalStack: https://www.localstack.cloud/

    Our links
    More episodes + show notes + links: https://shipitweekly.fm
    On Call Brief: https://oncallbrief.com

  13. 25

    McKinsey AI Flaw, Kafka Goes Diskless, Google Buys Wiz, AWS Copilot Ends, and AI Gateway on Kubernetes

    This week on Ship It Weekly, Brian looks at what happens when new interfaces create old responsibilities.

    McKinsey patched a vulnerability in its internal AI tool Lilli, Kafka contributors are pushing a diskless-topics model that rethinks durability and replication in cloud environments, and Google officially closed its Wiz acquisition in one of the biggest cloud-security moves. Plus: AWS is sunsetting Copilot CLI, and Kubernetes launches an AI Gateway Working Group.

    Links
    McKinsey statement on Lilli https://www.mckinsey.com/about-us/media/statement-on-strengthening-safeguards-within-the-lilli-tool
    Kafka diskless topics proposal https://cwiki.apache.org/confluence/display/KAFKA/The%2BPath%2BForward%2Bfor%2BSaving%2BCross-AZ%2BReplication%2BCosts%2BKIPs
    Google completes acquisition of Wiz https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/wiz-acquisition/
    AWS Copilot CLI end-of-support https://aws.amazon.com/blogs/containers/announcing-the-end-of-support-for-the-aws-copilot-cli/
    Kubernetes AI Gateway Working Group https://kubernetes.io/blog/2026/03/09/announcing-ai-gateway-wg/
    Amazon Bedrock observability for first-token latency and quota consumption https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-observability-ttft-quota/
    Cloudflare JSON responses and RFC 9457 support for 1xxx errors https://developers.cloudflare.com/changelog/post/2026-03-11-json-rfc9457-responses-for-1xxx-errors/
    Amazon S3 source-region information in server access logs https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-s3-source-region-information/
    AWS Config adds 30 new resource types https://aws.amazon.com/about-aws/whats-new/2026/03/aws-config-new-resource-types/
    Amazon Bedrock AgentCore Runtime stateful MCP server features https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-agentcore-runtime-stateful-mcp/
    More episodes and show notes at https://shipitweekly.fm
    On Call Briefs at https://oncallbrief.com

  14. 24

    Meta Buys Moltbook, Block AI Layoffs Get Messier, Atlassian Cuts Jobs, and GitHub Explains the Outages

    This week on Ship It Weekly, Brian covers five “AI meets reality” stories that every DevOps, SRE, security, and platform team can learn from.

    Block’s AI layoff story is getting messier as follow-up reporting pushes back on the original framing, Meta bought Moltbook and brought more attention to the trust and security problems already showing up around AI-agent platforms, and Atlassian cut about 10% of its workforce while saying AI is changing the skills and roles it needs. Plus: GitHub gives one of the more honest outage breakdowns we’ve seen lately, Anthropic and Mozilla show a more grounded AI use case with Claude finding real Firefox bugs, and there’s a quick lightning round on Bedrock AgentCore policy, Dependabot for pre-commit hooks, and Cloudflare’s latest threat report.

    Links
    Block layoffs follow-up https://www.theguardian.com/technology/2026/mar/08/block-ai-layoffs-jack-dorsey
    Meta acquires Moltbook https://www.theguardian.com/technology/2026/mar/10/meta-acquires-moltbook-ai-agent-social-network
    Wiz on Moltbook exposure https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys
    Atlassian team update https://www.atlassian.com/blog/announcements/atlassian-team-update-march-2026
    GitHub availability issues write-up https://github.blog/news-insights/company-news/addressing-githubs-recent-availability-issues-2/
    Anthropic + Mozilla Firefox security https://www.anthropic.com/news/mozilla-firefox-security
    Anthropic labor market report https://www.anthropic.com/research/labor-market-impacts
    AWS Bedrock AgentCore Policy GA https://aws.amazon.com/about-aws/whats-new/2026/03/policy-amazon-bedrock-agentcore-generally-available/
    GitHub Dependabot support for pre-commit hooks https://github.blog/changelog/2026-03-10-dependabot-now-supports-pre-commit-hooks/
    Cloudflare 2026 Threat Report https://blog.cloudflare.com/2026-threat-report/
    More episodes and show notes at https://shipitweekly.fm
    On Call Briefs at: https://oncallbrief.com

  15. 23

    Ship It Conversations: Yvonne Young on Linux Foundations, Mentorship, and Getting Job Ready in Cloud

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).
    In this Ship It: Conversations episode I talk with Yvonne Young, a cloud and Linux mentor active in the CloudWhistler community. We talk about the real path into cloud and DevOps, why Linux still matters as a foundation, what “job ready” actually means, and why focus, consistency, and business thinking matter more than chasing every new tool.
    Highlights
      Linux fundamentals still matter because so much of cloud and infra work sits on top of Linux
      What “job ready” really means: prepare for both technical and behavioral interviews, know the basics, and show how you learn when you don’t know something
      Why so many juniors stall out by trying to learn everything instead of picking a direction
      Why daily reps beat cramming: short, consistent practice keeps skills fresh better than marathon study sessions
      How Yvonne thinks about certifications, including why hands-on certs like RHCSA stand out
      Hands-on practice ideas: break things on purpose, troubleshoot, fix services, inspect ports, and use the help files
      Why tools matter less than the business problem they solve
      Using Vault as an example of solving real issues like secret sprawl, rotation, and centralized access
      How to think about cloud learning: pick one provider, learn the concepts, and map your path to the kinds of companies you want to work for
      Why mentorship and community matter, especially for juniors trying not to waste time or head in the wrong direction
      What seniors can do better: better onboarding, real availability, and giving juniors an actual lifeline when they get stuck
    Yvonne’s links
      LinkedIn: https://www.linkedin.com/in/yvonne-young
    Stuff mentioned
      Ali Sohail on LinkedIn: https://www.linkedin.com/in/alisohailit/
      Tech With Engineers on LinkedIn: https://uk.linkedin.com/company/tech-with-engineers
      CloudWhistler community / training: training.cloudwhistler.com
      Vault: https://www.hashicorp.com/en/products/vault
      OpenBao: https://openbao.org/
    More episodes + details: https://shipitweekly.fm

  16. 22

    AWS Bahrain/UAE Data Center Issues Amid Iran Strikes, ArgoCD vs Flux GitOps Failures, GitHub Actions Hackerbot-Claw Attacks (Trivy), RoguePilot Codespaces Prompt Injection, Block “AI Remake” Layoffs, Claude Code Security

    This week on Ship It Weekly, Brian looks at how the boundary of ops keeps expanding.
    We cover AWS flagging issues in Bahrain/UAE amid Iran strikes, ArgoCD vs Flux and why ArgoCD can get stuck in failed sync states, GitHub Actions being exploited at scale (plus Trivy’s incident), RoguePilot prompt injection meeting real credentials in Codespaces, Block’s “AI remake” layoffs, and Anthropic’s Claude Code Security for defenders.
    Lightning round: DeepSeek model access geopolitics, Vercel’s agentic security boundaries, a KEV CVE to patch, an MCP-atlassian SSRF-to-RCE chain, and Claude Cowork scheduled tasks.
    Links
      AWS Bahrain/UAE (Reuters) https://www.reuters.com/world/middle-east/amazon-cloud-unit-flags-issues-bahrain-uae-data-centers-amid-iran-strikes-2026-03-02/
      ArgoCD to Flux https://hai.wxs.ro/migrations/argocd-to-flux/
      GitHub Actions exploitation https://www.stepsecurity.io/blog/hackerbot-claw-github-actions-exploitation
      Trivy incident https://github.com/aquasecurity/trivy/discussions/10265
      RoguePilot https://thehackernews.com/2026/02/roguepilot-flaw-in-github-codespaces.html
      Block layoffs (WSJ) https://www.wsj.com/business/jack-dorseys-block-to-lay-off-4-000-employees-in-ai-remake-28f0d869
      Claude Code Security https://www.anthropic.com/news/claude-code-security
      DeepSeek (Reuters) https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/
      Agentic boundaries https://vercel.com/blog/security-boundaries-in-agentic-architectures
      CISA KEV https://www.cisa.gov/news-events/alerts/2026/03/03/cisa-adds-two-known-exploited-vulnerabilities-catalog
      mcp-atlassian CVE https://arcticwolf.com/resources/blog-uk/cve-2026-27825-critical-unauthenticated-rce-and-ssrf-in-mcp-atlassian/
      Claude Cowork tasks https://support.claude.com/en/articles/13854387-schedule-recurring-tasks-in-cowork
    More: https://shipitweekly.fm

  17. 21

    Cloudflare BYOIP BGP Withdrawals, Clerk’s Postgres Query-Plan Flip Outage, and AWS Kiro Permissions Lessons (Grafana Privesc + runc CVEs)


  18. 20

    Ship It Conversations: Mike Lady on Day Two Readiness + Guardrails in the AI Era

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).
    In this Ship It: Conversations episode I talk with Mike Lady (Senior DevOps Engineer, distributed systems) from Enterprise Vibe Code on YouTube. We talk day two readiness, guardrails/quality gates, and why shipping safely matters even more now that AI can generate code fast.
    Highlights
      Day 0 vs Day 1 vs Day 2 (launching vs operating and evolving safely)
      What teams look like without guardrails (“hope is not a strategy”)
      Why guardrails speed you up long-term (less firefighting, more predictable delivery)
      Day-two audit checklist: source control/branches/PRs, branch protection, CI quality gates, secrets/config, staging→prod flow
      AI agents: they’ll “lie, cheat, and steal” to satisfy the goal unless you gate them
      Multi-model reviews (Claude/Gemini/Codex) as different perspectives
      AI in prod: start read-only (logs/traces), then earn trust slowly
    Mike’s links
      YouTube: https://www.youtube.com/@EnterpriseVibeCode
      Site: https://www.enterprisevibecode.com/
      LinkedIn: https://www.linkedin.com/in/mikelady/
    Stuff mentioned
      Vibe Coding (Gene Kim + Steve Yegge): https://www.simonandschuster.com/books/Vibe-Coding/Gene-Kim/9781966280026
      Beads (agent memory/issue tracker): https://github.com/steveyegge/beads
      Gas Town (agent orchestration): https://github.com/steveyegge/gastown
      AGENTS.md (agent instructions file): https://agents.md/
      OpenAI Codex: https://openai.com/codex/
    More episodes + details: https://shipitweekly.fm

  19. 19

    Ship It Weekly – DevOps and SRE News for Engineers Who Run Production

    Ship It Weekly is a DevOps and SRE news podcast for engineers who run real systems.
    Every week I break down what actually matters in cloud, Kubernetes, CI/CD, infrastructure as code, and production reliability. No hype. No vendor spin. Just practical analysis from someone who’s been on call and shipped systems at scale.
    This isn’t a tutorial show. It’s a signal filter.
    I cover major industry shifts, security incidents, cloud provider changes, and tooling updates, then explain what they mean for platform teams and engineers operating in production.
    If you work in DevOps, SRE, platform engineering, or cloud infrastructure and want context instead of clickbait, you’re in the right place.
    New episodes weekly.
    You can also find detailed write-ups at: https://shipitweekly.fm
    And curated production-focused briefs at: https://oncallbrief.com
    Subscribe, and let’s ship.

  20. 18

    GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep

    This week on Ship It Weekly, Brian hits five stories where the “defaults” are shifting under ops teams.
    GitHub is bringing Agentic Workflows into Actions, Gentoo is migrating off GitHub to Codeberg, Argo CD upgrades are forcing Server-Side Apply in some paths, AWS Config quietly expanded coverage again, and EC2 nested virtualization is now possible on virtual instances.
    Links
      YouTube episodes https://www.youtube.com/watch?v=tuuLlo2rbI0&list=PLYLi5KINFnO7dVMbhsJQTKRFXfSSwPmuL&pp=sAgC
      OnCallBrief https://oncallbrief.com
      Teller’s Tech Substack https://tellerstech.substack.com/
      GitHub Agentic Workflows (preview) https://github.blog/changelog/2026-02-13-github-agentic-workflows-are-now-in-technical-preview/
      Gentoo moves to Codeberg https://www.theregister.com/2026/02/17/gentoo_moves_to_codeberg_amid/
      Argo CD upgrade guide: 3.2 -> 3.3 (SSA) https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/3.2-3.3/
      AWS Config: 30 new resource types https://aws.amazon.com/about-aws/whats-new/2026/02/aws-config-new-resource-types
      EC2 nested virtualization (virtual instances) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec2-nested-virtualization-on-virtual/
      GitHub status page update https://github.blog/changelog/2026-02-13-updated-status-experience/
      GitHub Actions: early Feb updates https://github.blog/changelog/2026-02-05-github-actions-early-february-2026-updates/
      Runner min version enforcement extended https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/
      Open Build Service postmortem https://openbuildservice.org/2026/02/02/post-mortem/
      Human story: AI SRE vs incident management https://surfingcomplexity.blog/2026/02/14/lots-of-ai-sre-no-ai-incident-management/
    More episodes and show info on https://shipitweekly.fm

  21. 17

    Special: OpenClaw Security Timeline and Fallout: CVE-2026-25253 One-Click Token Leak, Malicious ClawHub Skills, Exposed Agent Control Panels, and Why Local AI Agents Are a New DevOps/SRE Control Plane (OpenAI Hires Founder)

    In this Ship It Weekly special, Brian breaks down the OpenClaw situation and why it’s bigger than “another CVE.”
    OpenClaw is a preview of what platform teams are about to deal with: autonomous agents running locally, wired into real tools, real APIs, and real credentials. When the trust model breaks, it’s not just data exposure. It’s an operator compromise.
    We walk through the recent timeline: mass internet exposure of OpenClaw control panels, CVE-2026-25253 (a one-click token leak that can turn your browser into the bridge to your local gateway), a skills marketplace that quickly became a malware delivery channel, and the Moltbook incident showing how “agent content” becomes a new supply chain problem. We close with the signal that agents are going mainstream: OpenAI hiring the OpenClaw creator.
    Chapters
      1. What OpenClaw Actually Is
      2. The Situation in One Line
      3. Localhost Is Not a Boundary (The CVE Lesson)
      4. Exposed Control Panels (How “Local” Went Public)
      5. The Marketplace Problem (Skills Are Supply Chain)
      6. The Ecosystem Spills (Agent Platforms Leaking Real Data)
      7. Minimum Viable Safety for Local Agents
      8. The Plot Twist (OpenAI Hires the Creator)
    Links from this episode
      Censys exposure research https://censys.com/blog/openclaw-in-the-wild-mapping-the-public-exposure-of-a-viral-ai-assistant
      GitHub advisory (CVE-2026-25253) https://github.com/advisories/GHSA-g8p2-7wf7-98mq
      NVD entry https://nvd.nist.gov/vuln/detail/CVE-2026-25253
      Koi Security: ClawHavoc / malicious skills https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting
      Moltbook leak coverage (Reuters) https://www.reuters.com/legal/litigation/moltbook-social-media-site-ai-agents-had-big-security-hole-cyber-firm-wiz-says-2026-02-02/
      OpenClaw security docs https://docs.openclaw.ai/gateway/security
      OpenAI hire coverage (FT) https://www.ft.com/content/45b172e6-df8c-41a7-bba9-3e21e361d3aa
    More information and past episodes on https://shipitweekly.fm

  22. 16

    When guardrails break prod: GitHub “Too Many Requests” from legacy defenses, Kubernetes nodes/proxy GET RCE, HCP Vault resilience in an AWS regional outage, and PCI DSS scope creep

    This week on Ship It Weekly, Brian hits four stories where the guardrails become the incident.
    GitHub had “Too Many Requests” caused by legacy abuse protections that outlived their moment. Takeaway: controls need owners, visibility, and a retirement plan.
    Kubernetes has a nasty edge case where nodes/proxy GET can turn into command execution via WebSocket behavior. If you’ve ever handed out “telemetry” RBAC broadly, go audit it.
    HashiCorp shared how HCP Vault handled a real AWS regional disruption: control plane wobbled, Dedicated data planes kept serving. Control plane vs data plane separation paying off.
    AWS expanded its PCI DSS compliance package with more services and the Asia Pacific (Taipei) region. Scope changes don’t break prod today, but they turn into evidence churn later if you don’t standardize proof.
    Human story: “reasonable assurance” turning into busywork.
    Links
      GitHub: When protections outlive their purpose (legacy defenses + lifecycle) https://github.blog/engineering/infrastructure/when-protections-outlive-their-purpose-a-lesson-on-managing-defense-systems-at-scale/
      Kubernetes nodes/proxy GET → RCE (analysis) https://grahamhelton.com/blog/nodes-proxy-rce
      OpenFaaS guidance / mitigation notes https://www.openfaas.com/blog/kubernetes-node-proxy-rce/
      HCP Vault resilience during real AWS regional outages https://www.hashicorp.com/blog/how-resilient-is-hcp-vault-during-real-aws-regional-outages
      AWS: Fall 2025 PCI DSS compliance package update https://aws.amazon.com/blogs/security/fall-2025-pci-dss-compliance-package-available-now/
      GitHub Actions: self-hosted runner minimum version enforcement extended https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/
      Headlamp in 2025: Project Highlights (SIG UI) https://kubernetes.io/blog/2026/01/22/headlamp-in-2025-project-highlights/
      AWS Network Firewall Active Threat Defense (MadPot) https://aws.amazon.com/blogs/security/real-time-malware-defense-leveraging-aws-network-firewall-active-threat-defense/
      Reasonable assurance turning into busywork (r/sre) https://www.reddit.com/r/sre/comments/1qvwbgf/at_what_point_does_reasonable_assurance_turn_into/
    More episodes + details: https://shipitweekly.fm
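
    For the nodes/proxy story, here is a rough sketch of what "go audit it" could look like in Python. It assumes the role manifests are already loaded as dicts (for example, the `items` list from `kubectl get clusterroles -o json`) and flags any rule that allows GET on nodes/proxy in the core API group. The function names are hypothetical, and the wildcard handling only approximates RBAC matching; it is not a full authorizer.

```python
from typing import Dict, Iterable, List


def grants_nodes_proxy_get(rule: Dict) -> bool:
    """Return True if one RBAC rule allows GET on the nodes/proxy subresource."""
    api_groups = set(rule.get("apiGroups", []))
    resources = set(rule.get("resources", []))
    verbs = set(rule.get("verbs", []))
    # Nodes live in the core API group, written as "" in RBAC rules.
    in_core_group = bool(api_groups & {"", "*"})
    # Explicit grant, full wildcard, or wildcard resource with the proxy subresource.
    hits_resource = bool(resources & {"nodes/proxy", "*", "*/proxy"})
    hits_verb = bool(verbs & {"get", "*"})
    return in_core_group and hits_resource and hits_verb


def find_risky_roles(roles: Iterable[Dict]) -> List[str]:
    """Return names of Role/ClusterRole objects with at least one risky rule."""
    risky = []
    for role in roles:
        rules = role.get("rules") or []  # aggregated roles may carry no rules
        if any(grants_nodes_proxy_get(rule) for rule in rules):
            risky.append(role["metadata"]["name"])
    return risky
```

    Wildcard roles such as cluster-admin will show up in the output, which is expected. The interesting hits are the narrowly named monitoring or telemetry roles that were never meant to carry this access.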

  23. 15

    Azure VM Control Plane Outage, GitHub Agent HQ (Claude + Codex), Claude Opus 4.6, Gemini CLI, MCP

    This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.
    Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.
    GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.
    MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.
    Links
      Azure status incident (VM service management issues) https://azure.status.microsoft/en-us/status/history/?trackingId=FNJ8-VQZ
      GitHub Agent HQ: Claude + Codex https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/
      GitHub Actions update (case() function) https://github.blog/changelog/2026-01-29-github-actions-smarter-editing-clearer-debugging-and-a-new-case-function/
      Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6
      How Google SREs use Gemini CLI https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outages
      Miro MCP server announcement https://www.businesswire.com/news/home/20260202411670/en/Miro-Launches-MCP-Server-to-Connect-Visual-Collaboration-With-AI-Coding-Tools
      Kong MCP Registry announcement https://konghq.com/company/press-room/press-release/kong-introduces-mcp-registry
      GitHub Actions hosted runners incident thread https://github.com/orgs/community/discussions/186184
      DockerDash / Ask Gordon research https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/
      Terraform 1.15 alpha https://github.com/hashicorp/terraform/releases/tag/v1.15.0-alpha20260204
      Wiz Moltbook write-up https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys
      Chainguard “EmeritOSS” https://www.chainguard.dev/unchained/introducing-chainguard-emeritoss
    More episodes + details: https://shipitweekly.fm

  24. 14

    CodeBreach in AWS CodeBuild, Bazel TLS Certificate Expiry Breaks Builds, Helm Charts Reliability Audit, and New n8n Sandbox Escape RCE

    This week on Ship It Weekly, Brian looks at four “glue failures” that can turn into real outages and real security risk.
    We start with CodeBreach: AWS disclosed a CodeBuild webhook filter misconfig in a small set of AWS-managed repos. The takeaway is simple: CI trigger logic is part of your security boundary now.
    Next is the Bazel TLS cert expiry incident. Cert failures are a binary cliff, and “auto renew” is only one link in the chain.
    Third is Helm chart reliability. Prequel reviewed 105 charts and found a lot of demo-friendly defaults that don’t hold up under real load, rollouts, or node drains.
    Fourth is n8n. Two new high-severity flaws disclosed by JFrog. “Authenticated” still matters because workflow authoring is basically code execution, and these tools sit next to your secrets.
    Lightning round: Fence, HashiCorp agent-skills, marimo, and a cautionary agent-loop story.
    Links
      AWS CodeBreach bulletin https://aws.amazon.com/security/security-bulletins/2026-002-AWS/
      Wiz research https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild
      Bazel postmortem https://blog.bazel.build/2026/01/16/ssl-cert-expiry.html
      Helm report https://www.prequel.dev/blog-post/the-real-state-of-helm-chart-reliability-2025-hidden-risks-in-100-open-source-charts
      n8n coverage https://thehackernews.com/2026/01/two-high-severity-n8n-flaws-allow.html
      Fence https://github.com/Use-Tusk/fence
      agent-skills https://github.com/hashicorp/agent-skills
      marimo https://marimo.io/
      Agent loop story https://www.theregister.com/2026/01/27/ralph_wiggum_claude_loops/
    Related n8n episodes:
      https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/
      https://www.tellerstech.com/ship-it-weekly/n8n-auth-rce-cve-2026-21877-github-artifact-permissions-and-aws-devops-agent-lessons/
    More episodes + details: https://shipitweekly.fm

  25. 13

    Ship It Conversations: AI Automation for SMBs: What to Automate (And What Not To) (with Austin Reed)

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).
    In this Ship It: Conversations episode I talk with Austin Reed from horizon.dev about AI and automation for small and mid-sized businesses, and what actually works once you leave the demo world.
    We get into the most common automation wins he sees (sales and customer service), why a lot of projects fail due to communication and unclear specs more than the tech, and the trap of thinking “AI makes it cheap.” Austin shares how they push teams toward quick wins first, then iterate with prototypes so you don’t spend $10k automating a thing that never even happens.
    We also talk guardrails: when “human-in-the-loop” makes sense, what he avoids automating (finance-heavy logic, HIPAA/medical, government), and why the goal is usually leverage, not replacing people. On the dev side, we nerd out a bit on the tooling they’re using day to day: GPT and Claude, Cursor, PR review help, CI/CD workflows, and why knowing how to architect and validate output matters way more than people think.
    If you’re a DevOps/SRE type helping the business “do AI,” or you’re just tired of automation hype that ignores real constraints like credentials, scope creep, and operational risk, this one is very much about the practical middle ground.
    Links from the episode:
      Austin on LinkedIn: https://www.linkedin.com/in/automationsexpert/
      horizon.dev: horizon.dev
      YouTube: https://www.youtube.com/@horizonsoftwaredev
      Skool: https://www.skool.com/automation-masters
    If you found this useful, share it with the person on your team who keeps saying “we should automate that” but hasn’t dealt with the messy parts yet.
    More information on our website: https://shipitweekly.fm

  26. 12

    curl Shuts Down Bug Bounties Due to AI Slop, AWS RDS Blue/Green Cuts Switchover Downtime to ~5 Seconds, and Amazon ECR Adds Cross-Repository Layer Sharing

    This week on Ship It Weekly, Brian looks at three different versions of the same problem: systems are getting faster, but human attention is still the bottleneck.
    We start with curl shutting down their bug bounty program after getting flooded with low-quality “AI slop” reports. It’s not a “security vs maintainers” story, it’s an incentives and signal-to-noise story. When the cost to generate reports goes to zero, you basically DoS the people doing triage.
    Next, AWS improved RDS Blue/Green Deployments to cut writer switchover downtime to typically ~5 seconds or less (single-region). That’s a big deal, but “fast switchover” doesn’t automatically mean “safe upgrade.” Your connection pooling, retries, and app behavior still decide whether it’s a blip or a cascade.
    Third, Amazon ECR added cross-repository layer sharing. Sounds small, but if you’ve got a lot of repos and you’re constantly rebuilding/pushing the same base layers, this can reduce storage duplication and speed up pushes in real fleets.
    Lightning round covers a practical Kubernetes clientcmd write-up, a solid “robust Helm charts” post, a traceroute-on-steroids style tool, and Docker Kanvas as another signal that vendors are trying to make “local-to-cloud” workflows feel less painful.
    We wrap with Honeycomb’s interim report on their extended EU outage, and the part that always hits hardest in long incidents: managing engineer energy and coordination over multiple days is a first-class reliability concern.
    Links from this episode
      curl bug bounties shutdown https://github.com/curl/curl/pull/20312
      RDS Blue/Green faster switchover https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-rds-blue-green-deployments-reduces-downtime/
      ECR cross-repo layer sharing https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-ecr-cross-repository-layer-sharing/
      Kubernetes clientcmd apiserver access https://kubernetes.io/blog/2026/01/19/clientcmd-apiserver-access/
      Building robust Helm charts https://www.willmunn.xyz/devops/helm/kubernetes/2026/01/17/building-robust-helm-charts.html
      ttl tool https://github.com/lance0/ttl
      Docker Kanvas (InfoQ) https://www.infoq.com/news/2026/01/docker-kanvas-cloud-deployment/
      Honeycomb EU interim report https://status.honeycomb.io/incidents/pjzh0mtqw3vt
      SRE Weekly issue #504 https://sreweekly.com/sre-weekly-issue-504/
    More episodes + details: https://shipitweekly.fm
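
    On the RDS point about retries deciding whether a switchover is a blip or a cascade, here is a minimal Python sketch of client-side retry with capped exponential backoff and jitter. It is not tied to any database driver: `op` stands in for whatever call your app makes, `ConnectionError` stands in for your driver's transient error type, and the defaults are illustrative assumptions, not AWS guidance.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying transient connection errors with jittered backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff, capped so waits stay bounded.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter keeps a fleet of clients from reconnecting in lockstep.
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

    This only makes sense around operations that are safe to repeat. Retrying a non-idempotent write across a switchover can turn a blip into duplicate data.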

  27. 11

    n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons

    This week on Ship It Weekly, the theme is simple: the automation layer has become a control plane, and that changes how you should think about risk.
    We start with n8n’s latest critical vulnerability, CVE-2026-21877. This one is different from the unauth “Ni8mare” issue we covered in Episode 12. It’s authenticated RCE, which means the real question isn’t only “is it internet exposed,” it’s who can log in, who can create or modify workflows, and what those workflows can reach. Takeaway: treat workflow automation tools like CI systems. They run code, they hold credentials, and they can pivot into real infrastructure.
    Next is GitHub’s new fine-grained permission for artifact metadata. Small change, big least-privilege implications for Actions workflows. It’s also a good forcing function to clean up permission sprawl across repos.
    Third is AWS’s DevOps Agent story, and the best part is that it’s not hype. It’s a real look at what it takes to operationalize agents: evaluation, observability into tool calls/decisions, and control loops with brakes and approvals. Prototype is cheap. Reliability is the work.
    Lightning round: GitHub secret scanning changes that can quietly impact governance, a punchy Claude Code “guardrails aren’t guaranteed” reminder, Block’s Goose as another example of agent workflows getting productized, and OpenCode as an “agent runner” pattern worth watching if you’re experimenting locally.
    Links
      n8n CVE-2026-21877 (authenticated RCE) https://thehackernews.com/2026/01/n8n-warns-of-cvss-100-rce-vulnerability.html?m=1
      Episode 12 (n8n “Ni8mare” / CVE-2026-21858) https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/
      GitHub: fine-grained permission for artifact metadata (GA) https://github.blog/changelog/2026-01-13-new-fine-grained-permission-for-artifact-metadata-is-now-generally-available/
      GitHub secret scanning: extended metadata auto-enabled (Feb 18) https://github.blog/changelog/2026-01-15-secret-scanning-extended-metadata-to-be-automatically-enabled-for-certain-repositories/
      Claude Code issue thread (Bedrock guardrails gap) https://github.com/anthropics/claude-code/issues/17118
      Block Goose (tutorial + sessions/context) https://block.github.io/goose/docs/tutorials/rpi https://block.github.io/goose/docs/guides/sessions/smart-context-management
      OpenCode https://opencode.ai
    More episodes + details: https://shipitweekly.fm

  28. 10

    Ship It Conversations: Human-in-the-Loop Fixer Bots and AI Guardrails in CI/CD (with Gracious James)

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).
    In this Ship It: Conversations episode I talk with Gracious James Eluvathingal about TARS, his “human-in-the-loop” fixer bot wired into CI/CD.
    We get into why he built it in the first place, how he stitches together n8n, GitHub, SSH, and guardrailed commands, and what it actually looks like when an AI agent helps with incident response without being allowed to nuke prod. We also dig into rollback phases, where humans stay in the loop, and why validating every LLM output before acting on it is the single most important guardrail.
    If you’re curious about AI agents in pipelines but hate the idea of a fully autonomous “ops bot,” this one is very much about the middle ground: segmenting workflows, limiting blast radius, and using agents to reduce toil instead of replace engineers.
    Gracious also walks through where he’d like to take TARS next (Terraform, infra-level decisions, more tools) and gives some solid advice for teams who want to experiment with agents in CI/CD without starting with “let’s give it root and see what happens.”
    Links from the episode:
      Gracious on LinkedIn: https://www.linkedin.com/in/gracious-james-eluvathingal
      TARS overview post: https://www.linkedin.com/posts/gracious-james-eluvathingal_aiagents-devops-automation-activity-7391064503892987904-psQ4
    If you found this useful, share it with the person on your team who’s poking at AI automation and worrying about guardrails.
    More information on our website: https://shipitweekly.fm
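
    To make "validate every LLM output before acting on it" concrete, here is a small Python sketch of one way a guardrailed-command check could work. It is not how TARS is implemented (the episode notes don't include code); the allowlist, the forbidden characters, and the function name are all hypothetical. The idea is simply that a model's proposed command only runs if it parses cleanly and matches a short list of approved actions.

```python
import shlex
from typing import List, Optional

# Hypothetical allowlist: (program, subcommand) pairs the bot may run.
ALLOWED_COMMANDS = {
    ("systemctl", "status"),
    ("systemctl", "restart"),
    ("df", "-h"),
}

# Shell metacharacters that could chain or redirect commands.
FORBIDDEN_CHARS = set(";|&`$><\n")


def validate_command(proposed: str) -> Optional[List[str]]:
    """Return an argv list if the proposed command is allowed, else None."""
    if any(ch in FORBIDDEN_CHARS for ch in proposed):
        return None
    try:
        argv = shlex.split(proposed)
    except ValueError:
        return None  # unbalanced quotes or similar: reject rather than guess
    if len(argv) < 2 or (argv[0], argv[1]) not in ALLOWED_COMMANDS:
        return None
    return argv
```

    Returning an argv list instead of a string means the caller can execute it without a shell, and anything rejected can be routed to a human for approval instead of being dropped silently.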

  29. 9

    n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal

    This week on Ship It Weekly, Brian’s theme is basically: the “automation layer” is not a side tool anymore. It’s part of your perimeter, part of your reliability story, and sometimes part of your budget problem too.
    We start with the n8n security issue. A lot of teams use n8n as glue for ops workflows, which means it tends to collect credentials and touch real systems. When something like this drops, the right move is to treat it like production-adjacent infra: patch fast, restrict exposure, and assume anything stored in the tool is high value.
    Next is AWS quietly raising prices on EC2 Capacity Blocks for ML. Even if you’re not a GPU-heavy shop, it’s a useful signal: scarce compute behaves like a market. If you do rely on scheduled GPU capacity, it’s time to revisit forecasts and make sure your FinOps tripwires catch rate changes before the end-of-month surprise.
    Third is Netflix’s write-up on using Temporal for reliable cloud operations. The best takeaway is not “go adopt Temporal tomorrow.” It’s the pattern: long-running operational workflows should be resumable, observable, and safe to retry. If your critical ops are still bash scripts and brittle pipelines, you’re one transient failure away from a very dumb day.
    In the lightning round: Kubernetes Dashboard getting archived and the “ops dependencies die” reality check, Docker pushing hardened images as a safer baseline, and Pipedash.
    Links
      SRE Weekly issue 504 (source roundup) https://sreweekly.com/sre-weekly-issue-504/
      n8n CVE (NVD) https://nvd.nist.gov/vuln/detail/CVE-2026-21858
      n8n community advisory https://community.n8n.io/t/security-advisory-security-vulnerability-in-n8n-versions-1-65-1-120-4/247305
      AWS price increase coverage (The Register) https://www.theregister.com/2026/01/05/aws_price_increase/
      Netflix: Temporal powering reliable cloud operations https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953
      Kubernetes SIG-UI thread (Dashboard archiving) https://groups.google.com/g/kubernetes-sig-ui/c/vpYIRDMysek/m/wd2iedUKDwAJ
      Kubernetes Dashboard repo (archived) https://github.com/kubernetes/dashboard
      Pipedash https://github.com/hcavarsan/pipedash
      Docker Hardened Images https://www.docker.com/blog/docker-hardened-images-for-every-developer/
    More episodes and more details on this episode can be found on our website: https://shipitweekly.fm
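
    The "resumable and safe to retry" pattern from the Temporal story can be sketched without Temporal at all. The Python below is a deliberately tiny illustration, not Temporal's API or Netflix's implementation: it records each completed step in a JSON checkpoint file (a hypothetical choice for this example) so a rerun after a failure picks up where the last run stopped.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_resumable(steps: Sequence[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping any already recorded in the checkpoint file."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    ran = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        action()  # may raise; the checkpoint still reflects completed steps
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # persist progress after every step
        ran.append(name)
    return ran
```

    A crash between a step finishing and its checkpoint write means that step runs again on resume, so each step still has to be idempotent. That is the "safe to retry" half of the pattern, and it is the part a workflow engine cannot do for you.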

  30. 8

    Ship It Conversations: Backstage vs Internal IDPs, and Why DevEx Muscle Matters (with Danny Teller)

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).
    I sat down with Danny Teller, a DevOps Architect and Tech Lead Manager at Tipalti, to talk about internal developer platforms and the reality behind “just set up a developer portal.” We get into Backstage versus internal IDPs, why adoption is the real battle, and why platform/DevEx maturity matters more than whatever tool you pick.
    What we covered
    Backstage vs internal IDPs
      Backstage is a solid starting point for a developer portal, but it doesn’t magically create standards, ownership, or platform maturity. We talk about when Backstage fits, and when teams end up building internal tooling anyway.
    DevEx muscle (the make-or-break)
      Danny’s take: the portal UI is the easy part. The hard part is the ongoing work that makes it useful: paved roads, sane defaults, support, and keeping the catalog/data accurate so engineers trust it.
    Where teams get burned
      Common failure mode: teams ship a portal first, then realize they don’t have the resourcing, ownership, or workflows behind it. Adoption fades fast if the portal doesn’t remove real friction.
    A build vs buy gut check
      We walk through practical signals that push you toward open source Backstage, a managed Backstage offering, or a commercial portal. We also hit the maintenance trap: if you build too much, you’ve created a second product.
    Links and resources
      Danny Teller's LinkedIn: https://www.linkedin.com/in/danny-teller/
      matlas (one CLI for Atlas and MongoDB): https://github.com/teabranch/matlas-cli
      Backstage: https://backstage.io/
      Roadie (managed Backstage): https://roadie.io/
      Port: https://www.port.io/
      Cortex: https://www.cortex.io/
      OpsLevel: https://www.opslevel.com/
      Atlassian Compass: https://www.atlassian.com/software/compass
      Humanitec Platform Orchestrator: https://humanitec.com/products/platform-orchestrator
      Northflank: https://northflank.com/
    If you enjoyed this episode
      Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a platform/devex friend and leave a quick rating or review. It helps more than it should.
    Visit our website at https://www.shipitweekly.fm

  31. 7

    Fail Small, IaC Control Planes, and Automated RCA

    This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

    We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.

    Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.

    Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.

    In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.

    We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.

    Links from this episode
    SRE Weekly issue 503 (source roundup, Cloudflare) https://sreweekly.com/sre-weekly-issue-503/
    Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/
    Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/
    GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/
    Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/
    AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/
    DriftHound https://drifthound.io/
    Superset https://superset.sh/

    More episodes, contact info, and more details on this episode can be found on our website: https://shipitweekly.fm

  32. 6

    Ship It Conversations: From Full-Stack to Cloud/DevOps, One Project at a Time (with Eric Paatey)

    This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

    I sat down with Eric Paatey, a Cloud & DevOps Engineer who’s been transitioning from full-stack web development into cloud/devops, and building real skills through hands-on projects instead of just collecting tools and buzzwords.

    We talk about what that transition actually feels like, what’s helped most, and why you don’t need a rack of servers to learn DevOps.

    What we covered

    Eric’s path into DevOps
    How he moved from building web apps to caring about pipelines, infra, scalability, reliability, and automation. The “oh… code is only part of the job” moment that pushes a lot of people toward DevOps.

    The WHY behind DevOps
    Eric’s take: DevOps is mainly about breaking down silos and improving communication between dev, ops, security, and the business. We also hit the idea from The DevOps Handbook: small batches win. The bigger the release, the harder it is to recover when something breaks.

    Leveling up without drowning in tools
    DevOps has an endless tool list, so we talked about how to stay current without burning out. Eric’s recommendation: stay connected to the industry. Meet people, join user groups, go to events, and don’t silo yourself.

    The homelab mindset (and why simple is fine)
    Eric shared his “homelab on the go” setup and why the hardware isn’t the point. It’s about using a safe environment to build habits: automation, debugging, systems thinking, monitoring, breaking things, recovering, and improving the design.

    A practical first project for aspiring DevOps engineers
    We talked through a starter project you can actually show in interviews: Dockerize a simple app, deploy it behind an ALB, and learn basic networking/security along the way. You don’t need to understand everything on day one, but you do need to build things and learn what breaks.

    Agentic AI and guardrails
    We also touched on AI agents and MCPs, what they could mean for ops teams, and why you should not give agents full access to anything. Least privilege and policy guardrails matter, because “non-deterministic” and “prod permissions” is a scary combo.

    Links and resources
    Eric Paatey on LinkedIn: https://www.linkedin.com/in/eric-paatey-72a87799/
    Eric’s website/portfolio: https://ericpaatey.com/

    If you enjoyed this episode
    Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a coworker or your on-call buddy and leave a quick rating or review. It helps more than it should.

    Visit our website at https://www.shipitweekly.fm

  33. 5

    Cloudflare’s Workers Scheduler, AWS DBs on Vercel, and JIT Admin Access

    This week on Ship It Weekly, Brian looks at real platform engineering in the wild.

    We start with Cloudflare’s write-up on building an internal maintenance scheduler on Workers. It’s not marketing fluff. It’s “we hit memory limits, changed the model, and stopped pulling giant datasets into the runtime.”

    Next up: AWS databases are now available inside the Vercel Marketplace. This is a quiet shift with loud consequences. Devs can click-button real AWS databases from the same place they deploy apps, and platform teams still own the guardrails: account sprawl, billing/tagging, audit trails, region choices, and networking posture.

    Third story: TEAM (Temporary Elevated Access Management) for IAM Identity Center. Time-bound elevation with approvals, automatic expiry, and auditing. We cover how this fits alongside break-glass and why auto-expiry is the difference between least-privilege and privilege creep.

    Lightning round: GitHub Actions workflow page performance improvements, Lambda Managed Instances (slightly cursed but interesting), a quick atmos tooling blip, and k8sdiagram.fun for explaining k8s to humans.

    We close with Marc Brooker’s “What Now? Handling Errors in Large Systems” and the takeaway: error handling isn’t a local code decision, it’s architecture. Crashing vs retrying vs continuing only makes sense when you understand correlation and blast radius.

    shipitweekly.fm has links + the contact email. Want to be a guest? Reach out. And if you’re enjoying the show, follow/subscribe and leave a quick rating or review. It helps a ton.

    Links from this episode
    Cloudflare https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/
    AWS on Vercel https://aws.amazon.com/about-aws/whats-new/2025/12/aws-databases-are-available-on-the-vercel/ https://vercel.com/changelog/aws-databases-now-available-on-the-vercel-marketplace
    TEAM https://aws-samples.github.io/iam-identity-center-team/ https://github.com/aws-samples/iam-identity-center-team
    GitHub Actions https://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/
    Lambda Managed Instances https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html
    Atmos https://github.com/cloudposse/atmos/issues
    k8sdiagram.fun https://k8sdiagram.fun/
    Marc Brooker https://brooker.co.za/blog/2025/11/20/what-now.html

  34. 4

    Ship It Conversations: The WHY Behind DevOps, Upskilling, and Agentic AI (with Maz Islam)

    This is a Ship It Weekly conversation episode. The weekly news recaps are still weekly. These interviews drop in between when I find someone worth talking to and the convo feels useful.

    In this episode I’m joined by Mazharul “Maz” Islam (DevOps with Maz). Maz is a UK-based DevOps Engineer who shares practical, real-world DevOps content on YouTube and LinkedIn. We talk about the stuff that actually matters when you’re building systems, running infrastructure, owning reliability, and living in on-call.

    We hit three big things: the importance of understanding the WHY behind DevOps (not just the tools), how to upskill and keep up with the industry without burning out, and what the agentic AI era might look like for DevOps, SRE, and platform engineering teams. We also touch on MCPs and AI agents, and what “leveling up” looks like for companies that want to move faster without breaking everything.

    If you’re into DevOps culture, SRE practices, platform engineering, CI/CD, infrastructure automation, and how teams should think about reliability and security as things keep changing, this one should land.

    Guest
    Mazharul Islam (DevOps with Maz)
    UK-based DevOps Engineer. Posts a lot of hands-on content around cloud, DevOps fundamentals, and leveling up as an engineer.

    Links (Maz)
    YouTube: https://m.youtube.com/@devopswithmaz
    LinkedIn: https://www.linkedin.com/in/mazharul419

    Topics we covered
    WHY behind DevOps, and why “tools” is the smallest part of it
    DevOps fundamentals vs tool-chasing
    Upskilling strategies for DevOps Engineers and SREs
    How to keep learning cloud and automation without drowning
    What strong teams measure and what “good” actually looks like (delivery, reliability, feedback loops)
    Agentic AI, AI agents in operations, and the next era of DevOps
    MCPs, automation guardrails, and safe ways to scale change
    How companies can “level up” their engineering org without turning it into chaos

    We also discussed the previous episode of Ship It Weekly - GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI
    https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

    Book Maz recommended
    The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Paperback, Oct 6, 2016)
    Gene Kim, Jez Humble, Patrick Debois, John Willis

    About Ship It Weekly (format)
    Ship It Weekly is for people running infrastructure and owning reliability. Most episodes are quick weekly news recaps for DevOps, SRE, and platform engineering. In between those weekly drops, I’ll publish interview episodes like this one.

    Subscribe / help the show
    If you want the weekly DevOps news recaps plus these interviews, hit follow or subscribe in your podcast app. And if you’re feeling generous, leave a rating or review and share this episode with a coworker (especially your on-call buddy). That stuff genuinely helps the show get discovered.

  35. 3

    GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

    This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.

    We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.

    Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier capped at 500 managed resources. If you’re running real infrastructure, this is a good moment to audit what you’re actually managing and decide whether you’re cleaning up, paying, or planning a migration.

    Then we talk PromptPwnd: why stuffing untrusted PR/issue text into AI agent prompts (inside CI) can turn into a supply chain/security problem. The short version: treat AI inputs like hostile user input, keep tokens/permissions minimal, and don’t let agents “run with scissors.”

    We also cover the Home Depot report about long-lived access exposure as a reminder that secrets hygiene, blast radius, and detection still matter more than the shiny tools.

    In the lightning round: CDKTF is sunset/archived, Bitbucket is cleaning up free unused workspaces, and SourceHut is proposing pricing changes. We wrap with a human note on “platform whiplash” and why a simple watchlist beats carrying all this stuff in your head.

    Links from this episode
    GitHub Actions pricing + pause https://runs-on.com/blog/github-self-hosted-runner-fee-2026/ https://x.com/github/status/2001372894882918548 https://www.githubstatus.com/incidents/x696x0g4t85l
    HashiCorp / Terraform Cloud free plan changes https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice https://www.reddit.com/r/Terraform/s/slYm77wzYr
    PromptPwnd / AI agents in CI https://www.aikido.dev/blog/promptpwnd-github-actions-ai-agents
    Home Depot access exposure report https://techcrunch.com/2025/12/12/home-depot-exposed-access-to-internal-systems-for-a-year-says-researcher/
    Bitbucket cleanup https://community.atlassian.com/forums/Bitbucket-articles/Bitbucket-cleanup-of-free-unused-workspaces-what-you-need-to/ba-p/3144063
    SourceHut pricing proposal https://sourcehut.org/blog/2025-12-01-proposed-pricing-changes/

  36. 2

    IBM Buys Confluent, React2Shell, and Netflix on Aurora

    In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.

    First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.

    Then we shift to React2Shell, a CVSS 10.0 RCE in React Server Components that’s already being exploited in the wild. Even if you never touch React, if you run platforms or Kubernetes for teams using Next.js or RSC, you’re on the hook for patching windows, WAF rules, and blast-radius thinking.

    We also look at Netflix’s write-up on consolidating relational databases onto Aurora PostgreSQL, with big performance gains and cost savings. It’s a good excuse to step back and ask whether your own Postgres fleet still makes sense at the scale you’re at now.

    In the lightning round, we hit OpenTofu 1.11’s new language features, practical Terraform “tips from the trenches,” Ghostty becoming a non-profit project, and two spec-driven dev tools (Spec Kit and OpenSpec) that show what sane AI-assisted development might look like.

    For the human side, we close with “Your Brain on Incidents” and what high-stress outages actually do to people, plus a few concrete ideas for making on-call less brutal.

    If you’re on a platform team, own SLOs, or you’re the person people ping when “something is wrong with prod,” this one should give you a mix of immediate to-dos and longer-term questions for your roadmap.

    Links:
    IBM + Confluent https://www.confluent.io/blog/ibm-to-acquire-confluent/ https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-ai
    React2Shell (CVE-2025-55182) https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components
    Netflix on Aurora PostgreSQL https://aws.amazon.com/blogs/database/netflix-consolidates-relational-database-infrastructure-on-amazon-aurora-achieving-up-to-75-improved-performance/
    Tools & tips https://opentofu.org/blog/opentofu-1-11-0/ https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html https://mitchellh.com/writing/ghostty-non-profit https://github.com/github/spec-kit https://github.com/Fission-AI/OpenSpec
    Human side https://uptimelabs.io/your-brain-on-incidents/

  37. 1

    AWS re:Invent for Platform Teams, GKE at 130k Nodes, and Killing Staging

    In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

    We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a “what matters so far” pass, not a full recap.

    Outside AWS, we get into Google’s 130,000-node GKE cluster and what actually applies if you’re running normal-sized clusters, plus the “It’s time to kill staging” argument and what responsible testing in production looks like with feature flags, progressive delivery, and solid observability.

    In the lightning round, we hit Zachary Loeber’s Terraform MCP server and terraform-ingest (letting AI tools speak your real Terraform modules), Runs-On’s EC2 instance rankings so you stop picking instance types by vibes, and Airbnb’s adaptive traffic management for their key-value store. We close with Nolan Lawson’s “The fate of small open source” and what it means when your platform quietly depends on one-maintainer libraries.

    Links from this episode:

    AWS highlights:
    https://aws.amazon.com/about-aws/whats-new/2025/11/aws-nat-gateway-regional-availability
    https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview
    https://aws.amazon.com/about-aws/whats-new/2025/11/announcing-amazon-ecs-express-mode
    https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/

    Other topics:
    https://cloud.google.com/blog/products/containers-kubernetes/how-we-built-a-130000-node-gke-cluster
    https://thenewstack.io/its-time-to-kill-staging-the-case-for-testing-in-production/
    https://blog.zacharyloeber.com/article/terraform-custom-module-mcp-server/
    https://go.runs-on.com/instances/ranking
    https://medium.com/airbnb-engineering/from-static-rate-limiting-to-adaptive-traffic-management-in-airbnbs-key-value-store-29362764e5c2
    https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/

  38. 0

    Kubernetes Config Reality Check, EKS Control Planes, and GitHub Guardrails

    In this episode of Ship It Weekly, Brian digs into what’s new for people actually running infra: Kubernetes config, EKS control planes and networking, and GitHub’s latest CI/CD and Copilot updates.

    We start with Kubernetes’ new configuration good practices post and how to turn it into a checklist to clean up Helm/Kustomize and kill off “hotfix from my laptop” manifests.

    Then we hit AWS: EKS Provisioned Control Plane to size control plane capacity for big or noisy clusters, plus new network observability so you can see who’s talking to what across clusters and AZs instead of guessing from node metrics.

    On the GitHub side, Actions OIDC tokens now include a check_run_id for tighter access control, and Copilot adds instructions files and custom agents so you can encode platform and security expectations directly into reviews and workflows.

    In the lightning round, we touch on Terrascan being archived, Microsoft’s write-up of a 15.72 Tbps Aisuru DDoS attack against Azure, and AWS flat-rate CloudFront plans that bundle CDN and security into more predictable pricing.

    We close with Lorin Hochstein’s “Two thought experiments” and what it looks like to write incident reports as if an AI (and your future teammates) will rely on them to debug the next outage.

    If you run Kubernetes in prod, this one should give you a few concrete ideas for your roadmap.

    Links from this episode
    https://kubernetes.io/blog/2025/11/25/configuration-good-practices/
    https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-eks-provisioned-control-plane/
    https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/
    https://github.blog/changelog/2025-11-13-github-actions-oidc-token-claims-now-include-check_run_id/
    https://github.blog/ai-and-ml/unlocking-the-full-power-of-copilot-code-review-master-your-instructions-files/
    https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/create-custom-agents

    Lightning round
    https://github.com/tenable/terrascan
    https://www.bleepingcomputer.com/news/microsoft/microsoft-aisuru-botnet-used-500-000-ips-in-15-tbps-azure-ddos-attack/
    https://aws.amazon.com/about-aws/whats-new/2025/11/aws-flat-rate-pricing-plans/
    https://sreweekly.com/sre-weekly-issue-498/ (Lorin's article)

  39. -1

    Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

    In this episode of Ship It Weekly, Brian digs into 3 big themes for anyone running Kubernetes or building internal platforms.

    First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenance until March 2026. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress.

    Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

    Third, we talk about AI as a first-class workload on Kubernetes. CNCF’s new Certified Kubernetes AI Conformance Program aims to standardize how AI runs on K8s, and recent writing on SRE in the age of AI looks at what reliability means when systems learn and drift.

    In the lightning round, we hit good reads on database migrations, Postgres upgrades, and a distributed priority queue on Kafka. We wrap with the human side of incidents: fixation during incident response and using incidents as landmarks for the tradeoffs you’ve been making over time.

    If you’re on a platform team, responsible for SLOs, or the person people ping when “Kubernetes is weird,” this one should give you concrete questions to take back to your roadmap and runbooks.

    Links from this episode
    https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/
    https://www.haproxy.com/blog/ingress-nginx-is-retiring
    https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/
    https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/
    https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/

    Lightning round
    https://www.cncf.io/blog/2025/11/18/top-5-hard-earned-lessons-from-the-experts-on-managing-kubernetes/
    https://www.tines.com/blog/zero-downtime-database-migrations-lessons-from-moving-a-live-production
    https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/
    https://klaviyo.tech/building-a-distributed-priority-queue-in-kafka-1b2d8063649e
    https://sreweekly.com/sre-weekly-issue-497/
    https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html

  40. -2

    Special: When the Cloud Has a Bad Day: Cloudflare, AWS us-east-1 & GitHub Outages

    In this special kickoff episode of Ship It Weekly, Brian walks through three major outages from the last few weeks and what they actually mean for DevOps, SRE, and platform teams.

    Instead of just reading status pages, we look at how each incident exposes assumptions in our own architectures and runbooks.

    Topics in this episode:
    • Cloudflare’s global outage and what happens when your CDN/WAF becomes a single point of failure
    • The AWS us-east-1 incident and why “multi-AZ in one region” isn’t a full disaster recovery strategy
    • GitHub’s Git operations / Codespaces outage and how fragile our CI/CD and GitOps flows can be
    • Practical questions to ask about your own setup: CDN bypass, cross-region readiness, backups for Git and CI

    This episode is more of a themed “special” to kick things off.

    Going forward, most episodes will follow a lighter news format: a couple of main stories from the week in DevOps/SRE/platform engineering, a quick tools and releases segment, and one culture/on-call or burnout topic. Specials like this will pop up when there’s a big incident or theme worth unpacking.

    If you’re the person people DM when production is acting weird, or you’re responsible for the platform everyone ships on, this one’s for you.

    Links from this episode

    Cloudflare outage – November 18, 2025
    https://blog.cloudflare.com/18-november-2025-outage/
    https://www.thousandeyes.com/blog/cloudflare-outage-analysis-november-18-2025

    AWS us-east-1 outage – October 2025
    https://aws.amazon.com/message/101925/
    https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025

    GitHub outage – November 18, 2025
    https://us.githubstatus.com/incidents/f3f7sg2d1m20
    https://currently.att.yahoo.com/att/github-down-now-not-just-211700617.html



HOSTED BY

Teller's Tech - DevOps, SRE and Cloud Podcast
