PodParley

AI Insights – EP.2: Unlocking Cost-Effective AI with Small Language Models


This episode of the Cisco Podcast Network, titled "AI Insights – EP.2: Unlocking Cost-Effective AI with Small Language Models," was published on February 26, 2026 and runs 22 minutes.

February 26, 2026 · 22m · Cisco Podcast Network


In the latest episode of the Cisco AI Insights Podcast, hosts Rafael Herrera and Sónia Marques welcome Cisco AI operations engineer James Tidd for a discussion on the world of small language models (SLMs) and the evolution of efficient AI inference. Together, they unravel the complexities behind “Fast Inference from Transformers via Speculative Decoding,” a groundbreaking paper from Google that explores how smaller draft models can speed up large language model predictions while maintaining accuracy. James shares his hands-on experience experimenting with the technique, leveraging knowledge distillation and speculative execution. The trio also discusses the potential of this approach to optimize AI, reduce power consumption and costs, and help businesses of all sizes get more out of existing hardware. A special thank you to Google’s AI team for developing this month's paper. If you are interested in reading the paper yourself, please visit this link: https://research.google/blog/looking-back-at-speculative-decoding/.
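For readers curious how the draft-and-verify idea discussed in the episode works mechanically, here is a minimal sketch of speculative decoding. It uses toy stand-in distributions rather than real transformer models, and the function names, vocabulary size, and number of drafted tokens are illustrative assumptions, not details from the episode or the paper.

```python
# Toy sketch of speculative decoding: a cheap "draft" model proposes several
# tokens, and the expensive "target" model verifies them in one round.
# Assumptions: draft_dist/target_dist are hypothetical stand-in distributions,
# not real models; VOCAB and GAMMA are arbitrary illustrative values.
import numpy as np

VOCAB = 8   # toy vocabulary size
GAMMA = 4   # tokens the draft model proposes per round
rng = np.random.default_rng(0)


def draft_dist(context):
    """Hypothetical small draft model: a cheap next-token distribution."""
    logits = np.cos(np.arange(VOCAB) + len(context))
    p = np.exp(logits)
    return p / p.sum()


def target_dist(context):
    """Hypothetical large target model: the distribution we must match exactly."""
    logits = np.cos(np.arange(VOCAB) + len(context)) + 0.3 * np.sin(np.arange(VOCAB))
    p = np.exp(logits)
    return p / p.sum()


def speculative_step(context):
    """One round: draft proposes GAMMA tokens, target verifies them together."""
    # 1) Draft model samples GAMMA tokens autoregressively (cheap).
    proposed, draft_probs = [], []
    ctx = list(context)
    for _ in range(GAMMA):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    # 2) Target model scores every prefix (in a real system, one batched pass).
    target_probs = [target_dist(context + proposed[:i]) for i in range(GAMMA + 1)]

    # 3) Accept each proposed token with probability min(1, p(x) / q(x)).
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which keeps the output distribution identical to the target model.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return context + accepted
    # All GAMMA drafts accepted: take one extra token from the target for free.
    accepted.append(int(rng.choice(VOCAB, p=target_probs[GAMMA])))
    return context + accepted


if __name__ == "__main__":
    context = [0]
    for _ in range(5):
        context = speculative_step(context)
    print("generated token ids:", context)
```

The point of the sketch is the acceptance rule: because rejected drafts are resampled from the residual distribution, the output matches what the large model alone would have produced, while accepted drafts let several tokens be emitted per expensive verification pass.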
