EPISODE · Sep 18, 2025 · 3 MIN
ELO Ratings Questions
from 52 Weeks of Cloud · host Pragmatic AI Labs
Key Argument
- Thesis: Using ELO for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks

The Comparison (00:00-02:00)

Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92

AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42

Cognitive Bias Cascade (02:00-03:30)
- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)

The Quantitative Alternative (03:30-05:00)

Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 vs r=0.42
- Effect size: d=2.18

Dream Scenario vs Reality (05:00-06:00)

Dream
- World's best engineers
- Annotated metrics
- Standardized criteria

Reality
- Random internet users
- No expertise verification
- Subjective preferences

Key Statistics

Metric                  | Chess   | AI Agents
Inter-rater reliability | κ=0.92  | κ=0.42
Test-retest             | r=0.95  | r=0.31
Temporal drift          | ±10 pts | ±150 pts
Hurst exponent          | 0.89    | 0.31

Takeaways
- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even

Citations Mentioned
- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect

Quotable Moments
- "You can't rate chess with basketball fans"
- "0.31 reliability? That's a coin flip with extra steps"
- "Every preference vote is a data crime"
- "The psychometrics are screaming"

Resources
- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator
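Two ideas in these notes are easier to see in code: Cohen's κ as agreement beyond chance between two raters, and circular preferences (A>B>C>A) as a cycle in a graph of pairwise votes. The Python below is a minimal sketch of both, not the episode's own tooling. The function names `cohens_kappa` and `has_preference_cycle` are hypothetical, and the sample labels and votes are made-up illustrations rather than data from the episode or the cited studies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def has_preference_cycle(votes):
    """True if (winner, loser) votes contain a cycle such as A>B>C>A (hypothetical helper)."""
    graph = {}
    for winner, loser in votes:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())
    state = {}  # node -> "visiting" while on the DFS stack, "done" afterwards

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True  # back edge: we returned to a node still on the stack
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


# Illustrative data only: two raters picking the better of agents "A" and "B".
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
cyclic = has_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")])
consistent = has_preference_cycle([("A", "B"), ("B", "C"), ("A", "C")])
```

In this toy data the raters agree on 3 of 4 items (observed agreement 0.75) against a chance agreement of 0.5, so `kappa` is 0.5. `cyclic` is `True` and `consistent` is `False`. A cycle means no single ranking is consistent with every vote in the set.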