EPISODE · Jun 12, 2026 · 7 MIN
Create robust evaluations for agentic apps
from Podkey WWDC 2026
A Podkey summary of Create robust evaluations for agentic apps, from WWDC 2026.

A lot of this comes down to a simple idea: if you're testing AI with a tiny, tidy dataset, you can talk yourself into thinking things work better than they actually do. The big themes here are synthetic data generation, how to judge coverage instead of chasing a magic sample count, and how to evaluate tool-using models in a way that actually catches workflow mistakes. There are also some very practical details underneath all that, like session setup, context limits, validators, sampling strategy, and why a score dropping can actually be useful news.

Synthetic data as a loop
Coverage matters more than raw count
What a lower score is actually telling you
Sampling strategy and validation
Session providers and context windows
Evaluating tool use with trajectories
Flexible matchers and synthetic tool datasets

This podcast was created with Podkey. Make your own at https://podkey.fm
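To make the "trajectories" and "flexible matchers" themes a little more concrete, here is a minimal Python sketch of one way such a check could work. It is not the API shown in the session: the `ToolCall` type, the matcher helpers, and the tool names (`search_flights`, `get_weather`, `book_flight`) are all hypothetical, chosen only for illustration. The idea is that a recorded sequence of tool calls passes if the expected steps appear in order, with each step allowed to be as strict or as loose as the test needs.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical record of one tool call made by the model during a run.
@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]

# A matcher decides whether one recorded call satisfies one expected step.
Matcher = Callable[[ToolCall], bool]

def exact(name: str, **args: Any) -> Matcher:
    """Strict matcher: tool name and every argument must be equal."""
    return lambda call: call.name == name and call.args == args

def name_only(name: str) -> Matcher:
    """Loose matcher: only the tool name has to match."""
    return lambda call: call.name == name

def arg_where(name: str, key: str, predicate: Callable[[Any], bool]) -> Matcher:
    """Match the tool name and test a single argument with a predicate."""
    return lambda call: (
        call.name == name and key in call.args and predicate(call.args[key])
    )

def matches_in_order(trajectory: list[ToolCall], expected: list[Matcher]) -> bool:
    """True if every expected step is satisfied in order.

    Extra calls between expected steps are tolerated; missing or
    out-of-order steps are not.
    """
    remaining = iter(trajectory)  # shared iterator, so order is enforced
    return all(any(step(call) for call in remaining) for step in expected)

# Expected workflow: search first, then book at least one seat.
expected = [
    name_only("search_flights"),
    arg_where("book_flight", "seats", lambda n: n >= 1),
]

# An extra weather lookup in the middle is fine.
good = [
    ToolCall("search_flights", {"to": "SFO"}),
    ToolCall("get_weather", {"city": "SFO"}),
    ToolCall("book_flight", {"flight_id": "UA1", "seats": 2}),
]

# Same calls, wrong order: booking happens before searching.
bad = [
    ToolCall("book_flight", {"flight_id": "UA1", "seats": 2}),
    ToolCall("search_flights", {"to": "SFO"}),
]

print(matches_in_order(good, expected))
print(matches_in_order(bad, expected))
```

A final-answer check alone would likely score both runs the same, since the same flight ends up booked. Comparing the trajectory is what exposes the workflow mistake in the second run.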