How to Pick the Best Pretraining Data episode artwork

EPISODE · Apr 18, 2025 · 17 MIN

How to Pick the Best Pretraining Data

from Talking Machines by SU PARK · host Su Park

In this episode of "Talking Machines by Su Park," the hosts explore the critical topic of selecting pretraining datasets for Large Language Models, a decision that significantly impacts model performance and cost-efficiency. The discussion centers on a recent paper from the Allen Institute for AI, which introduces a novel approach to optimizing dataset selection without extensive computational resources, thereby addressing a key challenge in AI research.The episode highlights two major insights from the paper. First, the proposed suite of models, known as DATADECIDE, allows researchers to effectively predict which datasets will yield the best results for larger models based on smaller-scale experiments. This method has been shown to achieve approximately 80% accuracy in predicting performance outcomes, thus reducing the need for costly trial-and-error approaches. Additionally, the research reveals which benchmarks correlate with high performance, offering valuable guidance for future dataset selection in AI training."DataDecide: How to Predict Best Pretraining Data with Small Experiments" by Allen Institute for AI: https://arxiv.org/abs/2504.11393

NOW PLAYING

How to Pick the Best Pretraining Data

0:00 17:30

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

MG Show MG Show The MG Show, hosted by Jeffrey Pedersen and Shannon Townsend, is a leading alternative media platform dedicated to uncovering the truth behind today’s most pressing political issues. Launched in 2019, the show has grown exponentially, offering unfiltered insights, comprehensive research, and real-time analysis. With a commitment to independent journalism and factual integrity, the MG Show empowers its audience with knowledge and encourages active participation in the political discourse. French Your Way Jessica: Native French teacher founder of French Your Way Boost your French listening skills and test your comprehension with this one of a kind series of podcasts. Get the chance to listen to a real conversation between native speakers talking at normal speed AND customise your learning experience through carefully designed sets of questions (2 levels of difficulty) available for download at www.frenchvoicespodcast.com. All interviews also come with the transcript. French teacher Jessica interviews native speakers of French from around the world who share a bit of their life and passion. Where else would you meet in one same place a French yoga teacher based in Melbourne, a soap manufacturer from Provence, or a couple cycling around the world? That Hoarder: Overcome Compulsive Hoarding That Hoarder Hoarding disorder is stigmatised and people who hoard feel vast amounts of shame. This podcast began life as an audio diary, an anonymous outlet for somebody with this weird condition. That Hoarder speaks about her experiences living with compulsive hoarding, she interviews therapists, academics, researchers, children of hoarders, professional organisers and influencers, and she shares insight and tips for others with the problem. Listened to by people who hoard as well as those who love them and those who work with them, Overcome Compulsive Hoarding with That Hoarder aims to shatter the stigma, share the truth and speak openly and honestly to improve lives. Flottengeflüster ALD Automotive Österreich | LeasePlan Beim Flottengeflüster powered by ALD Automotive | LeasePlan präsentieren Jörg Janik und Peter Gutenbrunner alle zwei Wochen spannende Informationen rund um das Thema nachhaltige Mobilität. Beide beschäftigen sich schon lange mit der Thematik und bringen umfangreiches Fachwissen mit. Sollten sie aber doch einmal nicht weiter wissen, werden unsere Expert*innen hinzugezogen, die ihnen gerne mit Rat und Tat zur Seite stehen.

Frequently Asked Questions

How long is this episode of Talking Machines by SU PARK?

This episode is 17 minutes long.

When was this Talking Machines by SU PARK episode published?

This episode was published on April 18, 2025.

What is this episode about?

In this episode of "Talking Machines by Su Park," the hosts explore the critical topic of selecting pretraining datasets for Large Language Models, a decision that significantly impacts model performance and cost-efficiency. The discussion centers...

Can I download this Talking Machines by SU PARK episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!