Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific) episode artwork

EPISODE · Dec 20, 2025 · 16 MIN

Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)

from Machine Learning Street Talk (MLST)

Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: **the human experience.**Why High Benchmark Scores Don’t Mean Better AIJoining us are **Andrew Gordon** (Staff Researcher in Behavioral Science) and **Nora Petrova** (AI Researcher) from **Prolific**. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.---Key Insights in This Episode:* *The F1 Car Analogy:* Andrew explains why a model that excels at the "Humanities Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.* *The "Wild West" of AI Safety:* As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training—citing recent controversial incidents like Grok-3’s "Mecha Hitler."* *Fixing the "Leaderboard Illusion":* The team critiques current popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can lead to biased results and how companies can "game" the system.* *The Xbox Secret to AI Ranking:* Discover how Prolific uses *TrueSkill*—the same algorithm Microsoft developed for Xbox Live matchmaking—to create a fairer, more statistically sound leaderboard for LLMs.* *The Personality Gap:* Early data from the **Humane Leaderboard** suggests that while AI is getting smarter, it is actually performing *worse* on metrics like personality, culture, and "sycophancy" (the tendency for models to become annoying "people-pleasers").---About the HUMAINE LeaderboardMoving beyond simple "A vs. B" testing, the researchers discuss their new framework that samples participants based on *census data* (Age, Ethnicity, Political Alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.*Are we building models for benchmarks, or are we building them for humans? It’s time to change the scoreboard.*Rescript link:https://app.rescript.info/public/share/IDqwjY9Q43S22qSgL5EkWGFymJwZ3SVxvrfpgHZLXQc---TIMESTAMPS:00:00:00 Introduction & The Benchmarking Problem00:01:58 The Fractured State of AI Evaluation00:03:54 AI Safety & Interpretability00:05:45 Bias in Chatbot Arena00:06:45 Prolific's Three Pillars Approach00:09:01 TrueSkill Ranking & Efficient Sampling00:12:04 Census-Based Representative Sampling00:13:00 Key Findings: Culture, Personality & Sycophancy---REFERENCES:Paper:[00:00:15] MMLUhttps://arxiv.org/abs/2009.03300[00:05:10] Constitutional AIhttps://arxiv.org/abs/2212.08073[00:06:45] The Leaderboard Illusionhttps://arxiv.org/abs/2504.20879[00:09:41] HUMAINE Framework Paperhttps://huggingface.co/blog/ProlificAI/humaine-frameworkCompany:[00:00:30] Prolifichttps://www.prolific.com[00:01:45] Chatbot Arenahttps://lmarena.ai/Person:[00:00:35] Andrew Gordonhttps://www.linkedin.com/in/andrew-gordon-03879919a/[00:00:45] Nora Petrovahttps://www.linkedin.com/in/nora-petrova/Event:Algorithm:[00:09:01] Microsoft TrueSkillhttps://www.microsoft.com/en-us/research/project/trueskill-ranking-system/Leaderboard:[00:09:21] Prolific HUMAINE Leaderboardhttps://www.prolific.com/humaine[00:09:31] HUMAINE HuggingFace Spacehttps://huggingface.co/spaces/ProlificAI/humaine-leaderboard[00:10:21] Prolific AI Leaderboard Portalhttps://www.prolific.com/leaderboardDataset:[00:09:51] Prolific Social Reasoning RLHF Datasethttps://huggingface.co/datasets/ProlificAI/social-reasoning-rlhfOrganization:[00:10:31] MLCommonshttps://mlcommons.org/

Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: **the human experience.**Why High Benchmark Scores Don’t Mean Better AIJoining us are **Andrew Gordon** (Staff Researcher in Behavioral Science) and **Nora Petrova** (AI Researcher) from **Prolific**. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.---Key Insights in This Episode:* *The F1 Car Analogy:* Andrew explains why a model that excels at the "Humanities Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.* *The "Wild West" of AI Safety:* As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training—citing recent controversial incidents like Grok-3’s "Mecha Hitler."* *Fixing the "Leaderboard Illusion":* The team critiques current popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can lead to biased results and how companies can "game" the system.* *The Xbox Secret to AI Ranking:* Discover how Prolific uses *TrueSkill*—the same algorithm Microsoft developed for Xbox Live matchmaking—to create a fairer, more statistically sound leaderboard for LLMs.* *The Personality Gap:* Early data from the **Humane Leaderboard** suggests that while AI is getting smarter, it is actually performing *worse* on metrics like personality, culture, and "sycophancy" (the tendency for models to become annoying "people-pleasers").---About the HUMAINE LeaderboardMoving beyond simple "A vs. B" testing, the researchers discuss their new framework that samples participants based on *census data* (Age, Ethnicity, Political Alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.*Are we building models for benchmarks, or are we building them for humans? It’s time to change the scoreboard.*Rescript link:https://app.rescript.info/public/share/IDqwjY9Q43S22qSgL5EkWGFymJwZ3SVxvrfpgHZLXQc---TIMESTAMPS:00:00:00 Introduction & The Benchmarking Problem00:01:58 The Fractured State of AI Evaluation00:03:54 AI Safety & Interpretability00:05:45 Bias in Chatbot Arena00:06:45 Prolific's Three Pillars Approach00:09:01 TrueSkill Ranking & Efficient Sampling00:12:04 Census-Based Representative Sampling00:13:00 Key Findings: Culture, Personality & Sycophancy---REFERENCES:Paper:[00:00:15] MMLUhttps://arxiv.org/abs/2009.03300[00:05:10] Constitutional AIhttps://arxiv.org/abs/2212.08073[00:06:45] The Leaderboard Illusionhttps://arxiv.org/abs/2504.20879[00:09:41] HUMAINE Framework Paperhttps://huggingface.co/blog/ProlificAI/humaine-frameworkCompany:[00:00:30] Prolifichttps://www.prolific.com[00:01:45] Chatbot Arenahttps://lmarena.ai/Person:[00:00:35] Andrew Gordonhttps://www.linkedin.com/in/andrew-gordon-03879919a/[00:00:45] Nora Petrovahttps://www.linkedin.com/in/nora-petrova/Event:Algorithm:[00:09:01] Microsoft TrueSkillhttps://www.microsoft.com/en-us/research/project/trueskill-ranking-system/Leaderboard:[00:09:21] Prolific HUMAINE Leaderboardhttps://www.prolific.com/humaine[00:09:31] HUMAINE HuggingFace Spacehttps://huggingface.co/spaces/ProlificAI/humaine-leaderboard[00:10:21] Prolific AI Leaderboard Portalhttps://www.prolific.com/leaderboardDataset:[00:09:51] Prolific Social Reasoning RLHF Datasethttps://huggingface.co/datasets/ProlificAI/social-reasoning-rlhfOrganization:[00:10:31] MLCommonshttps://mlcommons.org/

NOW PLAYING

Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)

0:00 16:04

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

French Your Way Jessica: Native French teacher founder of French Your Way Boost your French listening skills and test your comprehension with this one of a kind series of podcasts. Get the chance to listen to a real conversation between native speakers talking at normal speed AND customise your learning experience through carefully designed sets of questions (2 levels of difficulty) available for download at www.frenchvoicespodcast.com. All interviews also come with the transcript. French teacher Jessica interviews native speakers of French from around the world who share a bit of their life and passion. Where else would you meet in one same place a French yoga teacher based in Melbourne, a soap manufacturer from Provence, or a couple cycling around the world? Kaizen Blueprint Aldo Chandra "Kaizen" is a Japanese term for continuous improvement. This podcast provides a blueprint to learn about health, wealth, relationships and everything else in between. Through our podcast, we strive to inspire, educate, and motivate our audience to cultivate a mindset of lifelong learning, productivity, and personal development. By sharing insights, strategies, and practical tips, we aim to guide listeners on their journey towards realizing their fullest potential, fostering success, and creating lasting positive change. One Man Went To Row PepperDawesMedia Follow the journey, from training to finish line, of a man from Derby, UK who is going from having only ever rowed on a machine to rowing 3000 miles solo across the Atlantic...just after his 70th birthday! Humanizing Change Tremendousness Join us each episode as we talk with innovators in their respective fields about their unique journeys and how they humanize change in their own work, right here, on Humanizing Change.

Frequently Asked Questions

How long is this episode of Machine Learning Street Talk (MLST)?

This episode is 16 minutes long.

When was this Machine Learning Street Talk (MLST) episode published?

This episode was published on December 20, 2025.

What is this episode about?

Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on...

Can I download this Machine Learning Street Talk (MLST) episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!