AI testing and evaluation Podcast - All Episodes


AI Evaluation. Episode 1. Practical approach to using LLM-as-a-Judge effectively

Episode Description: In this episode, we dive into a practical, three-step approach to transform LLMs from unpredictable evaluators into reliable and transparent tools. Stop relying on vague instructions like "evaluate relevance" and learn how to implement a high-precision framework that yields consistent results.

What we cover in this episode:

• Step 1: The Power of Binary Criteria. Learn why you should define 5–7 concrete evaluation metrics—such as checking for fabricated facts, length limits, or specific tones—that result in a simple "yes" or "no".
• Step 2: Structured Output for Accountability. Discover how to request JSON or other structured formats so the model provides a verdict and the specific evidence or justification supporting its decision.
• Step 3: Continuous Improvement and Debugging. We discuss the importance of running 20–30 test examples to identify where the model makes mistakes. We explain why evaluation failures often stem from how criteria are formulated rather than the model's inherent capabilities.

Tune in to learn how to move away from "black box" scoring and create an evaluation logic that you can continuously improve and fully understand.
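As a rough illustration of the three steps, here is a minimal Python sketch. It is not taken from the episode: the criterion names, the prompt wording, the JSON shape, and the helper names (`Criterion`, `build_prompt`, `parse_verdicts`, `disagreements`) are all assumptions made for the example, and the call to an actual LLM is left out on purpose.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str       # field name expected in the judge's JSON reply
    question: str  # phrased so the only valid answers are "yes" or "no"


# Step 1: binary criteria. These three are hypothetical examples;
# the episode suggests defining 5-7 of them.
CRITERIA = [
    Criterion("no_fabricated_facts", "Does the answer avoid facts absent from the source?"),
    Criterion("within_length_limit", "Is the answer 100 words or fewer?"),
    Criterion("neutral_tone", "Is the tone neutral and professional?"),
]


def build_prompt(source: str, answer: str, criteria=CRITERIA) -> str:
    """Step 2: ask for a structured reply with a verdict and evidence per criterion."""
    questions = "\n".join(f'- "{c.key}": {c.question}' for c in criteria)
    return (
        "Judge the ANSWER against the SOURCE. For each criterion, reply yes or no.\n"
        f"{questions}\n"
        "Return only JSON mapping each criterion key to "
        '{"verdict": "yes" | "no", "evidence": "<quote or short reason>"}.\n'
        f"SOURCE:\n{source}\nANSWER:\n{answer}"
    )


def parse_verdicts(raw: str, criteria=CRITERIA) -> dict:
    """Parse the judge's JSON reply into {key: (passed, evidence)}; reject anything non-binary."""
    data = json.loads(raw)
    verdicts = {}
    for c in criteria:
        item = data[c.key]
        if item["verdict"] not in ("yes", "no"):
            raise ValueError(f"{c.key}: verdict must be 'yes' or 'no'")
        verdicts[c.key] = (item["verdict"] == "yes", item["evidence"])
    return verdicts


def disagreements(judged: list, expected: list) -> dict:
    """Step 3: count, per criterion, how often the judge differs from human labels."""
    counts = {}
    for verdicts, labels in zip(judged, expected):
        for key, want in labels.items():
            if verdicts[key][0] != want:
                counts[key] = counts.get(key, 0) + 1
    return counts
```

In use, you would send `build_prompt(...)` to the judge model of your choice, pass its reply to `parse_verdicts`, and run `disagreements` over the 20–30 hand-labeled test examples. A criterion with a high disagreement count is a candidate for rewording before the model itself is blamed.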

Jan 28, 2026

20m

ABOUT THIS SHOW

A deep dive into AI quality and security, evaluation frameworks, bias detection, and building reliable and robust AI systems. Hosted by Aleksandr Meshkov, an AI evaluation architect with 13 years of experience.

HOSTED BY

Aleksandr Meshkov
