PodParley
Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

EPISODE · Feb 4, 2025 · 1H 16M

from The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) · hosted by Sam Charrington

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by the LLM encoding and decoding (aka generation) phases, and how these interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques for accelerating inference, such as KV cache compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator. The complete show notes for this episode can be found at https://twimlai.com/go/717.
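For a concrete feel for the episode's title topic, below is a minimal toy sketch of the speculative decoding accept/reject loop in Python. This is not Qualcomm's implementation or anything quoted from the episode: draft_model, target_model, VOCAB, and speculative_step are hypothetical stand-ins, with seeded random toy distributions in place of real models.

import random

# Toy vocabulary and "models": each maps a context (tuple of tokens) to a
# next-token distribution. These stand in for a small draft LLM and a large
# target LLM; the names and distributions are illustrative only.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def _toy_dist(role, context):
    rng = random.Random(hash((role, tuple(context))))
    w = [rng.random() + 1e-6 for _ in VOCAB]  # strictly positive weights
    s = sum(w)
    return {t: x / s for t, x in zip(VOCAB, w)}

def draft_model(context):   # cheap, approximate (plays the SLM role)
    return _toy_dist("draft", context)

def target_model(context):  # expensive, authoritative (plays the LLM role)
    return _toy_dist("target", context)

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def speculative_step(context, k=4):
    """One round: draft k tokens with the cheap model, then accept or
    reject each against the target model (rejection-sampling rule)."""
    # 1. Draft phase: autoregressively propose k tokens with the small model.
    proposed, ctx = [], tuple(context)
    for _ in range(k):
        tok = sample(draft_model(ctx))
        proposed.append(tok)
        ctx += (tok,)

    # 2. Verify phase: a real system scores all k positions with one
    #    parallel forward pass of the target model; we loop for clarity.
    accepted, ctx = [], tuple(context)
    for tok in proposed:
        p = target_model(ctx)[tok]  # target probability of the drafted token
        q = draft_model(ctx)[tok]   # draft probability of the same token
        if random.random() < min(1.0, p / q):
            accepted.append(tok)    # accept: this token came nearly for free
            ctx += (tok,)
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            # This correction keeps the overall output distribution exactly
            # equal to sampling from the target model alone.
            pd, qd = target_model(ctx), draft_model(ctx)
            resid = {t: max(0.0, pd[t] - qd[t]) for t in VOCAB}
            z = sum(resid.values())
            accepted.append(sample({t: v / z for t, v in resid.items()})
                            if z > 0 else sample(pd))
            break
    return accepted

if __name__ == "__main__":
    out = ["the"]
    for _ in range(3):
        out += speculative_step(out)
    print(" ".join(out))

The payoff of this scheme is that every accepted draft token is verified in a single batched pass of the large model, so a draft model that matches the target well raises tokens per second without changing the output distribution.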
