EPISODE · Jun 12, 2026 · 7 MIN
Improve your prompts by hill-climbing with Evaluations
from Podkey WWDC 2026
A Podkey summary of Improve your prompts by hill-climbing with Evaluations, from WWDC 2026.

Today's thread is really about one deceptively simple question: how do you know your AI evaluator is actually judging things the way a human would? The big themes are drift between model and expert ratings, a more honest way to measure agreement with Cohen's kappa, and a very practical process of improving prompts one change at a time. There's also a useful reminder that better scores in one area can quietly make another area worse, and that tiny datasets can flatter you more than they should.

When the judge starts drifting
Why raw agreement isn't enough
Prompt tuning like an experiment
Control versus experimental
Relevance and usefulness are not the same thing
One change at a time
Giving the model a lookup tool
The dataset is probably too small
Better aggregation, better decisions
Using generated samples to test tools too
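
The contrast between raw agreement and Cohen's kappa can be shown with a small worked example. This is a minimal sketch in Python, not code from the session: the function names, the "pass"/"fail" labels, and the ratings are all made up for illustration. Kappa is computed as (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement comes from how often each rater uses each label.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = raw_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: the expert fails 2 of 10 items, the model passes everything.
human = ["pass"] * 8 + ["fail"] * 2
model = ["pass"] * 10

print(raw_agreement(human, model))  # 0.8
print(cohen_kappa(human, model))    # 0.0
```

Here the model matches the expert on 80% of items, yet kappa is 0 because a judge that always says "pass" would reach the same 80% by chance. With only ten items the estimate is also very noisy, which echoes the summary's warning about small datasets.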