Data Selection for Data-Centric AI: Data Quality Over Quantity // Cody Coleman // Coffee Sessions #59 episode artwork

EPISODE · Oct 11, 2021 · 1H 11M

Data Selection for Data-Centric AI: Data Quality Over Quantity // Cody Coleman // Coffee Sessions #59

from MLOps.community · host Demetrios

Coffee Sessions #59 with Cody Coleman, Data Quality Over Quantity or Data Selection for Data-Centric AI.Join the Community: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://go.mlops.community/YTJoinIn⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Get the newsletter: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://go.mlops.community/YTNewsletter⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠// AbstractBig data has been critical to many of the successes in ML, but it brings its own problems. Working with massive datasets is cumbersome and expensive, especially with unstructured data like images, videos, and speech. Careful data selection can mitigate the pains of big data by focusing computational and labeling resources on the most valuable examples.  Cody Coleman, a recent Ph.D. from Stanford University and founding member of MLCommons, joins us to describe how a more data-centric approach that focuses on data quality rather than quantity can lower the AI/ML barrier. Instead of managing clusters of machines and setting up cumbersome labeling pipelines, you can spend more time tackling real problems.// BioCody Coleman recently finished his Ph.D. in CS at Stanford University, where he was advised by Professors Matei Zaharia and Peter Bailis. His research spans from performance benchmarking of hardware and software systems (i.e., DAWNBench and MLPerf) to computationally efficient methods for active learning and core-set selection. His work has been supported by the NSF GRFP, the Stanford DAWN Project, and the Open Phil AI Fellowship.// RelevantLinks [preprint] Similarity Search for Efficient Active Learning and Search of Rare Concepts: [https://arxiv.org/abs/2007.00077](https://arxiv.org/abs/2007.00077)[video] Similarity Search for Efficient Active Learning and Search of Rare Concepts: [https://www.youtube.com/watch?v=vRVyOEK2JUU](https://www.youtube.com/watch?v=vRVyOEK2JUU)[blog post] Selection via Proxy: Efficient Data Selection for Deep Learning: [https://dawn.cs.stanford.edu/2020/04/23/selection-via-proxy/](https://dawn.cs.stanford.edu/2020/04/23/selection-via-proxy/)[slides] The DAWN of MLPerf: [https://drive.google.com/file/d/17ZpX0GOtOXG8QMn6KEc_Le8tUfDBlgDE/view](https://drive.google.com/file/d/17ZpX0GOtOXG8QMn6KEc_Le8tUfDBlgDE/view)[blog post] About Cody's research: [https://hai.stanford.edu/news/cody-coleman-lowering-machine-learnings-barriers-help-people-tackle-real-problems](https://hai.stanford.edu/news/cody-coleman-lowering-machine-learnings-barriers-help-people-tackle-real-problems)[video] About Cody: [https://www.youtube.com/watch?v=stxJMsxxxtA](https://www.youtube.com/watch?v=stxJMsxxxtA)--------------- ✌️Connect With Us ✌️ -------------Join our Slack community: https://go.mlops.community/slackFollow us on Twitter: @mlopscommunitySign up for the next meetup: https://go.mlops.community/registerConnect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/Connect with Cody on LinkedIn: https://www.linkedin.com/in/codyaustun/Timestamps: [00:00] Introduction to Cody Coleman [03:10] Cody's life story [07:35] Cody's journey in tech [15:04] Interest in Machine Learning and work at Stanford came about [21:48] Data-centric Machine Learning Data Quality [28:56] Research and Industry being together [33:33] Advice to practitioners [38:03] Principles of Machine Learning in an academic setting [43:50] Data-centric promising techniques that stand out [53:51] Developing benchmarks [56:34] Guardrails for machine learning vs automated testing suites  [1:02:57] Creating something valuable and useful [1:07:05] Data collecting vs Data Hoarding

Coffee Sessions #59 with Cody Coleman, Data Quality Over Quantity or Data Selection for Data-Centric AI.Join the Community: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://go.mlops.community/YTJoinIn⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠Get the newsletter: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://go.mlops.community/YTNewsletter⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠// AbstractBig data has been critical to many of the successes in ML, but it brings its own problems. Working with massive datasets is cumbersome and expensive, especially with unstructured data like images, videos, and speech. Careful data selection can mitigate the pains of big data by focusing computational and labeling resources on the most valuable examples.  Cody Coleman, a recent Ph.D. from Stanford University and founding member of MLCommons, joins us to describe how a more data-centric approach that focuses on data quality rather than quantity can lower the AI/ML barrier. Instead of managing clusters of machines and setting up cumbersome labeling pipelines, you can spend more time tackling real problems.// BioCody Coleman recently finished his Ph.D. in CS at Stanford University, where he was advised by Professors Matei Zaharia and Peter Bailis. His research spans from performance benchmarking of hardware and software systems (i.e., DAWNBench and MLPerf) to computationally efficient methods for active learning and core-set selection. His work has been supported by the NSF GRFP, the Stanford DAWN Project, and the Open Phil AI Fellowship.// RelevantLinks [preprint] Similarity Search for Efficient Active Learning and Search of Rare Concepts: [https://arxiv.org/abs/2007.00077](https://arxiv.org/abs/2007.00077)[video] Similarity Search for Efficient Active Learning and Search of Rare Concepts: [https://www.youtube.com/watch?v=vRVyOEK2JUU](https://www.youtube.com/watch?v=vRVyOEK2JUU)[blog post] Selection via Proxy: Efficient Data Selection for Deep Learning: [https://dawn.cs.stanford.edu/2020/04/23/selection-via-proxy/](https://dawn.cs.stanford.edu/2020/04/23/selection-via-proxy/)[slides] The DAWN of MLPerf: [https://drive.google.com/file/d/17ZpX0GOtOXG8QMn6KEc_Le8tUfDBlgDE/view](https://drive.google.com/file/d/17ZpX0GOtOXG8QMn6KEc_Le8tUfDBlgDE/view)[blog post] About Cody's research: [https://hai.stanford.edu/news/cody-coleman-lowering-machine-learnings-barriers-help-people-tackle-real-problems](https://hai.stanford.edu/news/cody-coleman-lowering-machine-learnings-barriers-help-people-tackle-real-problems)[video] About Cody: [https://www.youtube.com/watch?v=stxJMsxxxtA](https://www.youtube.com/watch?v=stxJMsxxxtA)--------------- ✌️Connect With Us ✌️ -------------Join our Slack community: https://go.mlops.community/slackFollow us on Twitter: @mlopscommunitySign up for the next meetup: https://go.mlops.community/registerConnect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/Connect with Cody on LinkedIn: https://www.linkedin.com/in/codyaustun/Timestamps: [00:00] Introduction to Cody Coleman [03:10] Cody's life story [07:35] Cody's journey in tech [15:04] Interest in Machine Learning and work at Stanford came about [21:48] Data-centric Machine Learning Data Quality [28:56] Research and Industry being together [33:33] Advice to practitioners [38:03] Principles of Machine Learning in an academic setting [43:50] Data-centric promising techniques that stand out [53:51] Developing benchmarks [56:34] Guardrails for machine learning vs automated testing suites  [1:02:57] Creating something valuable and useful [1:07:05] Data collecting vs Data Hoarding

NOW PLAYING

Data Selection for Data-Centric AI: Data Quality Over Quantity // Cody Coleman // Coffee Sessions #59

0:00 1:11:00

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

She’s a Hazard to Herself She’s a Hazard Hi there, I’m Mallory, and I’d like to invite you into our world with “She’s a Hazard to Herself!” Join us as we navigate life with Multiple Sclerosis from the seat of my power wheelchair. Discover stories of resilience, family, and the community we’ve built around chronic illness. Whether you’re impacted by MS or want to learn from our journey, there’s something here for you. So why wait? Subscribe to “She’s a Hazard to Herself” on your favorite podcast app and be part of our journey today. Let’s lift each other up, one episode at a time! Tips, News and Stories for Older Adults Esther C Kane CAPS, C.D.S. "Tips, News, and Stories for Older Adults" delivers weekly insights tailored for seniors. We bring you summaries of curated news, practical advice, and inspiring stories that matter to the 55+ community. From health and finance to technology and lifestyle, our content keeps you informed and engaged. Sourced from trusted outlets, each episode offers valuable information for navigating your golden years. Join us as we explore aging with positivity, wisdom, and engaging stories. Your perfect companion for staying active, learning, and embracing life's later chapters. Prayer Time Heir Waves Prayer Time A podcast especially for our Prayer Time community NEWMORROW SESSIONS - A PodCast Series on the Future of Hospitality Mario C. Bauer, Florian Schneider, Axel Weber & Dr. Tillman Bardt The Newmorrow PodCast is more than a podcast — it's a platform for open dialog on the future of our business, a platform for those building what doesn’t exist yet. Here, we share and embrace our passion for the hospitality industry, but we won’t romanticize the journey. We ask the tough questions, confront uncomfortable truths, and prepare for a future that resists easy answers. We believe that the tougher and wilder times become, the more openly, honestly and humanely people need to talk to each other and act together. We believe, openness, togetherness, and truthfulness should also be cornerstones of a professional community to develop our utopian idea of „open source“. This is a space where visionaries don’t just imagine the future — they wrestle with the paradoxes that shape it: success vs. happiness, data vs. instinct, stability vs. reinvention. Join leaders, entrepreneurs, and thinkers as they share not what made them — but what’s actively shaping them, now and next. So tune in

Frequently Asked Questions

How long is this episode of MLOps.community?

This episode is 1 hour and 11 minutes long.

When was this MLOps.community episode published?

This episode was published on October 11, 2021.

What is this episode about?

Coffee Sessions #59 with Cody Coleman, Data Quality Over Quantity or Data Selection for Data-Centric AI.Join the Community:...

Can I download this MLOps.community episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!