Speculative Decoding and Efficient LLM Inference with Chris Lott - #717 episode artwork

EPISODE · Feb 4, 2025 · 1H 16M

Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

from The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) · host Sam Charrington

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research to discuss accelerating large language model inference. We explore the challenges presented by the LLM encoding and decoding (aka generation) and how these interact with various hardware constraints such as FLOPS, memory footprint and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences such as parallel generation and software tools like Qualcomm AI Orchestrator. The complete show notes for this episode can be found at https://twimlai.com/go/717.

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research to discuss accelerating large language model inference. We explore the challenges presented by the LLM encoding and decoding (aka generation) and how these interact with various hardware constraints such as FLOPS, memory footprint and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences such as parallel generation and software tools like Qualcomm AI Orchestrator. The complete show notes for this episode can be found at https://twimlai.com/go/717.

NOW PLAYING

Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

0:00 1:16:30

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

Frequently Asked Questions

How long is this episode of The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)?

This episode is 1 hour and 16 minutes long.

When was this The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) episode published?

This episode was published on February 4, 2025.

What is this episode about?

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research to discuss accelerating large language model inference. We explore the challenges presented by the LLM encoding and decoding (aka generation) and how these...

Can I download this The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!