ELO Ratings Questions

from 52 Weeks of Cloud · host Pragmatic AI Labs

Key Argument

Thesis: Using ELO for AI agent evaluation = measuring noise
Problem: Wrong evaluators, wrong metrics, wrong assumptions
Solution: Quantitative assessment frameworks

The Comparison (00:00-02:00)

Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92

AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42

Cognitive Bias Cascade (02:00-03:30)

- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)

The Quantitative Alternative (03:30-05:00)

Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt

Reliability: r=0.91 vs r=0.42
Effect size: d=2.18

Dream Scenario vs Reality (05:00-06:00)

Dream
- World's best engineers
- Annotated metrics
- Standardized criteria

Reality
- Random internet users
- No expertise verification
- Subjective preferences

Key Statistics

Metric                    Chess      AI Agents
Inter-rater reliability   κ=0.92     κ=0.42
Test-retest               r=0.95     r=0.31
Temporal drift            ±10 pts    ±150 pts
Hurst exponent            0.89       0.31

Takeaways

Stop: Using preference votes as quality metrics
Start: Automated complexity analysis
ROI: 4.7 months to break even

Citations Mentioned

- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect

Quotable Moments

- "You can't rate chess with basketball fans"
- "0.31 reliability? That's a coin flip with extra steps"
- "Every preference vote is a data crime"
- "The psychometrics are screaming"

Resources

- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator
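The show notes lean heavily on Cohen's κ as the inter-rater reliability measure. As a rough illustration of what that statistic computes, here is a minimal Python sketch of the standard two-rater formula, κ = (p_o - p_e) / (1 - p_e). The vote lists are hypothetical and are not data from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical preference votes ("A" or "B" preferred) from two users
# on the same ten agent comparisons. Illustrative only.
votes_a = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
votes_b = ["A", "B", "B", "B", "A", "A", "A", "B", "B", "A"]
kappa = cohens_kappa(votes_a, votes_b)
print(f"kappa = {kappa:.2f}")
```

The two hypothetical raters agree on 6 of 10 items, but half of that agreement is expected by chance, so the resulting κ is much lower than the raw 60% agreement suggests.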

Episode metadata supplied by the publisher feed · Published Sep 18, 2025

What this episode covers

ELO ratings work for chess (κ=0.92) but fail for AI agents (κ=0.42, test-retest r=0.31). Random users aren't chess arbiters, and code quality isn't a win/loss outcome. The episode covers the psychometric failures, the cognitive biases that undermine data validity, and why quantitative metrics (McCabe complexity, test coverage) are reported as far more reliable than human preferences (r=0.91 vs r=0.42, effect size d=2.18).
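For readers unfamiliar with the mechanics, the following Python sketch shows the standard Elo update rule and what happens when it is fed circular preferences of the kind the show notes mention (A>B>C>A). The K-factor of 32, the starting rating of 1500, and the agent names are assumptions chosen for illustration; they are not taken from the episode.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(winner, loser, k=32):
    """Return new (winner, loser) ratings after one decisive result.

    k=32 is an assumed K-factor, not a value from the episode.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Hypothetical agents, all starting at an assumed rating of 1500.
ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}

# Circular preferences: A beats B, B beats C, C beats A, repeated.
for _ in range(100):
    for win, lose in [("A", "B"), ("B", "C"), ("C", "A")]:
        ratings[win], ratings[lose] = update(ratings[win], ratings[lose])

for name, rating in ratings.items():
    print(name, round(rating, 1))
```

Because each update moves the same number of points from loser to winner, the total stays fixed, and a preference cycle just passes points around the loop. The printed ratings reflect where the loop happened to stop rather than any stable ordering of the three agents.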

PodParley-generated summary based on available episode metadata and transcript content.

Frequently Asked Questions

How long is this episode of 52 Weeks of Cloud?

This episode is 3 minutes long.

When was this 52 Weeks of Cloud episode published?

This episode was published on September 18, 2025.
