PodParley
Inference Scaling for Long-Context RAG

EPISODE · Oct 20, 2024 · 12 MIN

Inference Scaling for Long-Context RAG

from LlamaCast · host Shahriar Shariati

Inference Scaling for Long-Context Retrieval-Augmented Generation

This research paper explores the effectiveness of inference scaling for retrieval-augmented generation (RAG), a technique that enhances large language models (LLMs) by incorporating external knowledge. The authors introduce two strategies for effectively scaling inference computation: demonstration-based RAG (DRAG), which spends more test-time compute by packing additional retrieved documents and in-context demonstrations into a single long-context prompt, and iterative demonstration-based RAG (IterDRAG), which interleaves retrieval and generation over multiple steps. They demonstrate that increasing inference computation, when optimally allocated, yields nearly linear gains in RAG performance. They also develop a computation allocation model that predicts the optimal test-time compute allocation for various tasks and scenarios, and show that its predictions align with experimental results.

📎 Link to paper
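To make the DRAG idea concrete, here is a minimal sketch of how inference compute can be scaled by packing more retrieved documents and in-context demonstrations into one long-context prompt. All names (`build_drag_prompt`, the knob parameters `max_docs` and `max_demos`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical DRAG-style prompt assembly: test-time compute grows with
# the number of documents and demonstrations packed into the prompt.
def build_drag_prompt(question, documents, demonstrations, max_docs, max_demos):
    """Assemble a long-context RAG prompt. Raising max_docs/max_demos
    spends more inference compute per query (longer context)."""
    parts = []
    # In-context demonstrations (question/answer pairs) come first.
    for demo in demonstrations[:max_demos]:
        parts.append(f"Q: {demo['question']}\nA: {demo['answer']}")
    # Then the retrieved documents, labeled for reference.
    for i, doc in enumerate(documents[:max_docs]):
        parts.append(f"[Doc {i + 1}] {doc}")
    # Finally the actual question the LLM should answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

docs = ["Paris is the capital of France.", "France is in Europe."]
demos = [{"question": "Capital of Italy?", "answer": "Rome"}]
prompt = build_drag_prompt("Capital of France?", docs, demos,
                           max_docs=2, max_demos=1)
print(prompt.count("[Doc"))  # → 2 documents packed into the context
```

The `max_docs`/`max_demos` knobs stand in for the "effective context length" budget the paper scales; the computation allocation model's job is to pick such knob settings optimally for a given task and compute budget.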

