EPISODE · Jun 24, 2026 · 1H 5M
Infrastructure for AI at Scale - With Benny Chen (Fireworks AI)
from The Information Bottleneck · host Ravid Shwartz-Ziv & Allen Roush
We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on. Benny Chen, co-founder of Fireworks AI, one of the largest inference platforms around, walks us through what it takes to serve models at scale: sourcing GPUs, writing the kernels, the runtime, and the routing layer that lets a customer hit one endpoint and forget the rest.
We talk about why the real bottleneck is power, not chips, and why that favors Nvidia and Google; why MoE keeps winning even when dense models look better on paper; and why he'd rather run fungible capacity at 95% than specialized chips at 60%. We also talk about quantization limits, where RL efficiency has to go next, and his case that AI is still under-hyped. We also get into cross-region training, sparse autoencoders and why interpretability hasn't taken off in open source, whether open models can close the gap, and a frank read on Anthropic's go-to-market.
Timeline
00:00 - Intro: the part of AI nobody talks about
01:20 - What "infrastructure for AI" actually means: the layers, from GPUs up to routing
02:59 - Why not just buy your own GPUs and do it yourself?
05:17 - The scale Fireworks runs at
06:35 - Hardware inflation, GPU costs, and the real risk hiding in commit duration
10:14 - Nvidia vs. AMD vs. TPUs, and why power is the bottleneck
11:57 - Mixing GPU types and generations; fungibility vs. specialization
14:22 - Once you have the GPUs, what's the next layer to build?
17:04 - Dense vs. MoE, and why the hardware picks the winner
21:07 - Quantization: is FP4 the floor? TurboQuant and INT vs. FP
24:28 - How tied are the algorithms to the hardware?
25:12 - DeepSeek, DeepGEMM, and next-token prediction as reconstruction loss
28:50 - Why RL is still wildly inefficient compared to pre-training
30:08 - Speculative decoding, AI-generated kernels, and auto-research
34:00 - The AGI question: why text gets automated but vision may stay expensive
37:07 - Hype check: why Benny thinks AI is still under-hyped
41:28 - Training vs. inference at the infrastructure level
44:12 - Scaling across data centers: cross-region training with Cursor
45:40 - Sparse autoencoders, interpretability, and why open source is human-constrained
49:04 - Will open models catch up, on quality and on compute?
51:41 - Are we plateauing? Opus 4.7 vs. 4.6 and the coming data wars
54:41 - Physical limits, HBM, and whether chips keep getting faster
58:17 - The belief about inference everyone gets wrong
59:31 - Anthropic, mythos, and a frank take on go-to-market
1:04:41 - Wrap-up
Music: "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.