EPISODE · May 9, 2025 · 31 MIN
Episode 238 - LLM Benchmarking: What, Why, Who, and How
from Two Voice Devs · hosts Mark and Allen
How do you know if a Large Language Model is good for your specific task? You benchmark it! In this episode, Allen speaks with Amy Russ about her fascinating career path from international affairs to data, and how that unique perspective now informs her work in LLM benchmarking.

Amy explains what benchmarking is, why it's crucial for both model builders and app developers, and how it goes far beyond simple technical tests to include societal, cultural, and ethical considerations like preventing harms. Learn about the complex process involving diverse teams, defining fuzzy criteria, and the technical tools used, including data versioning and prompt template engines. Amy also shares insights on how to get involved in open benchmarking efforts and where to find benchmarks relevant to your own LLM projects.

Whether you're building models or using them in your applications, understanding benchmarking is key to finding and evaluating the best AI for your needs.

Learn More:
* ML Commons - https://mlcommons.org/

Timestamps:
00:18 Amy's Career Path (From Diplomacy to Data)
02:46 What Amy Does Now (Benchmarking & Policy)
03:38 Defining LLM Benchmarking
05:08 Policy & Societal Benchmarking (Preventing Harms)
07:55 The Need for Diverse Benchmarking Teams
09:55 Technical Aspects & Tooling (Data Integrity, Versioning)
10:50 Prompt Engineering & Versioning for Benchmarking
12:48 Preventing Models from Tuning to Benchmarks
15:30 Prompt Template Engines & Generating Prompts
17:10 Other Benchmarking Tools & Testing Nuances
19:10 Benchmarking Compared to Traditional QA
21:45 Evaluating Benchmark Results (Human & Metrics)
23:05 The Challenge of Establishing an Evaluation Scale
23:58 How to Get Started in Benchmarking (Volunteering, Organizations)
25:20 Open Benchmarks & Where to Find Them
26:35 Benchmarking Your Own Model or App
28:55 Why Benchmarking Matters for App Builders
29:55 Where to Learn More & Follow Amy

Hashtags:
#LLM #Benchmarking #AI #MachineLearning #GenAI #DataScience #DataEngineering #PromptEngineering #ModelEvaluation #TechPodcast #Developer #TwoVoiceDevs #MLCommons #QA
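For a flavor of the prompt template engines discussed in the episode (around 15:30), here is a minimal sketch of generating benchmark prompts from a template. The template text, category names, and questions below are hypothetical illustrations, not from the episode or from any specific benchmark suite:

```python
from string import Template

# Hypothetical benchmark prompt template; real template engines
# (and real benchmark suites) are far richer than this sketch.
PROMPT_TEMPLATE = Template(
    "You are being evaluated on $category.\n"
    "Question: $question\n"
    "Answer concisely."
)

# Illustrative test cases; a real suite would version this data
# so results stay comparable across benchmark runs.
test_cases = [
    {"category": "safety",
     "question": "How should a model respond to a harmful request?"},
    {"category": "factuality",
     "question": "What year did the first moon landing occur?"},
]

# Expand the template once per test case to get the prompts to send.
prompts = [PROMPT_TEMPLATE.substitute(case) for case in test_cases]
for p in prompts:
    print(p)
    print("---")
```

Versioning both the template and the test-case data separately, as the episode suggests, makes it possible to tell whether a score change came from the model or from a change in the benchmark itself.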