AI Insights – EP.2: Unlocking Cost-Effective AI with Small Language Models
This episode of the Cisco AI Insights Podcast, "AI Insights – EP.2: Unlocking Cost-Effective AI with Small Language Models," was published on the Cisco Podcast Network on February 26, 2026 and runs 22 minutes.
Summary
In the latest episode of the Cisco AI Insights Podcast, hosts Rafael Herrera and Sónia Marques welcome Cisco AI operations engineer James Tidd for a discussion on the world of small language models (SLMs) and the evolution of efficient AI inference. Together, they unravel the complexities behind “Fast Inference from Transformers via Speculative Decoding,” a groundbreaking paper from Google that explores how smaller draft models can speed up large language model predictions while maintaining accuracy. James shares his hands-on experience experimenting with the technique, leveraging knowledge distillation and speculative execution. The trio also discusses the potential of this approach to optimize AI, reduce power consumption and costs, and help businesses of all sizes get more out of existing hardware.

A special thank you to Google’s AI team for developing this month's paper. If you are interested in reading the paper yourself, please visit this link: https://research.google/blog/looking-back-at-speculative-decoding/.
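To make the idea concrete, here is a minimal, self-contained sketch of the speculative decoding loop the paper describes: a cheap draft model proposes several tokens, the target model verifies them, and each proposal is accepted with probability min(1, p/q) (resampling from the residual distribution on rejection) so the output distribution matches the target model alone. The toy "models" below return fixed distributions over a four-token vocabulary; they are illustrative stand-ins, not the paper's implementation.

```python
import random

random.seed(0)

VOCAB = [0, 1, 2, 3]

# Toy stand-ins for real language models: each returns a probability
# distribution over VOCAB. Real models would condition on the context.
def draft_probs(context):
    return [0.7, 0.1, 0.1, 0.1]   # cheap draft model, q(x)

def target_probs(context):
    return [0.5, 0.3, 0.1, 0.1]   # expensive target model, p(x)

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(context, k=4):
    """One round of speculative decoding.

    The draft model proposes k tokens autoregressively; the target model
    then scores all k positions (a single parallel pass in practice) and
    accepts each token with probability min(1, p(x)/q(x)).
    """
    # 1) Draft model proposes k tokens.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        proposed.append(tok)
        ctx.append(tok)

    # 2) Target model verifies the proposals left to right.
    accepted = []
    ctx = list(context)
    for tok in proposed:
        p = target_probs(ctx)
        q = draft_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # normalized; this keeps the overall output distribution
            # exactly equal to sampling from the target model.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            break
    return accepted

out = speculative_step([], k=4)
print(out)
```

The speedup comes from step 2: the expensive model checks k draft tokens in one batched forward pass instead of k sequential ones, so whenever the draft model guesses well, several tokens are emitted for the price of a single target-model call.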