Reasoning, Robustness, and Human Feedback in AI - Max Bartolo (Cohere) episode artwork

EPISODE · Mar 18, 2025 · 1H 23M

Reasoning, Robustness, and Human Feedback in AI - Max Bartolo (Cohere)

from Machine Learning Street Talk (MLST)

Dr. Max Bartolo from Cohere discusses machine learning model development, evaluation, and robustness. Key topics include model reasoning, the DynaBench platform for dynamic benchmarking, data-centric AI development, model training challenges, and the limitations of human feedback mechanisms. The conversation also covers technical aspects like influence functions, model quantization, and the PRISM project.Max Bartolo (Cohere):https://www.maxbartolo.com/https://cohere.com/commandTRANSCRIPT:https://www.dropbox.com/scl/fi/vujxscaffw37pqgb6hpie/MAXB.pdf?rlkey=0oqjxs5u49eqa2m7uaol64lbw&dl=0TOC:1. Model Reasoning and Verification [00:00:00] 1.1 Model Consistency and Reasoning Verification [00:03:25] 1.2 Influence Functions and Distributed Knowledge Analysis [00:10:28] 1.3 AI Application Development and Model Deployment [00:14:24] 1.4 AI Alignment and Human Feedback Limitations2. Evaluation and Bias Assessment [00:20:15] 2.1 Human Evaluation Challenges and Factuality Assessment [00:27:15] 2.2 Cultural and Demographic Influences on Model Behavior [00:32:43] 2.3 Adversarial Examples and Model Robustness3. Benchmarking Systems and Methods [00:41:54] 3.1 DynaBench and Dynamic Benchmarking Approaches [00:50:02] 3.2 Benchmarking Challenges and Alternative Metrics [00:50:33] 3.3 Evolution of Model Benchmarking Methods [00:51:15] 3.4 Hierarchical Capability Testing Framework [00:52:35] 3.5 Benchmark Platforms and Tools4. Model Architecture and Performance [00:55:15] 4.1 Cohere's Model Development Process [01:00:26] 4.2 Model Quantization and Performance Evaluation [01:05:18] 4.3 Reasoning Capabilities and Benchmark Standards [01:08:27] 4.4 Training Progression and Technical Challenges5. Future Directions and Challenges [01:13:48] 5.1 Context Window Evolution and Trade-offs [01:22:47] 5.2 Enterprise Applications and Future ChallengesREFS:[00:03:10] Research at Cohere with Laura Ruis et al., Max Bartolo, Laura Ruis et al.https://cohere.com/research/papers/procedural-knowledge-in-pretraining-drives-reasoning-in-large-language-models-2024-11-20[00:04:15] Influence functions in machine learning, Koh & Lianghttps://arxiv.org/abs/1703.04730[00:08:05] Studying Large Language Model Generalization with Influence Functions, Roger Grosse et al.https://storage.prod.researchhub.com/uploads/papers/2023/08/08/2308.03296.pdf[00:11:10] The LLM ARChitect: Solving ARC-AGI Is A Matter of Perspective, Daniel Franzen, Jan Disselhoff, and David Hartmannhttps://github.com/da-fr/arc-prize-2024/blob/main/the_architects.pdf[00:12:10] Hugging Face model repo for C4AI Command A, Cohere and Cohere For AIhttps://huggingface.co/CohereForAI/c4ai-command-a-03-2025[00:13:30] OpenInterpreterhttps://github.com/KillianLucas/open-interpreter[00:16:15] Human Feedback is not Gold Standard, Tom Hosking, Max Bartolo, Phil Blunsomhttps://arxiv.org/abs/2309.16349[00:27:15] The PRISM Alignment Dataset, Hannah Kirk et al.https://arxiv.org/abs/2404.16019[00:32:50] How adversarial examples arise, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madryhttps://arxiv.org/abs/1905.02175[00:43:00] DynaBench platform paper, Douwe Kiela et al.https://aclanthology.org/2021.naacl-main.324.pdf[00:50:15] Sara Hooker's work on compute limitations, Sara Hookerhttps://arxiv.org/html/2407.05694v1[00:53:25] DataPerf: Community-led benchmark suite, Mazumder et al.https://arxiv.org/abs/2207.10062[01:04:35] DROP, Dheeru Dua et al.https://arxiv.org/abs/1903.00161[01:07:05] GSM8k, Cobbe et al.https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k[01:09:30] ARC, François Chollethttps://github.com/fchollet/ARC-AGI[01:15:50] Command A, Coherehttps://cohere.com/blog/command-a[01:22:55] Enterprise search using LLMs, Coherehttps://cohere.com/blog/commonly-asked-questions-about-search-from-coheres-enterprise-customers

Dr. Max Bartolo from Cohere discusses machine learning model development, evaluation, and robustness. Key topics include model reasoning, the DynaBench platform for dynamic benchmarking, data-centric AI development, model training challenges, and the limitations of human feedback mechanisms. The conversation also covers technical aspects like influence functions, model quantization, and the PRISM project.Max Bartolo (Cohere):https://www.maxbartolo.com/https://cohere.com/commandTRANSCRIPT:https://www.dropbox.com/scl/fi/vujxscaffw37pqgb6hpie/MAXB.pdf?rlkey=0oqjxs5u49eqa2m7uaol64lbw&dl=0TOC:1. Model Reasoning and Verification [00:00:00] 1.1 Model Consistency and Reasoning Verification [00:03:25] 1.2 Influence Functions and Distributed Knowledge Analysis [00:10:28] 1.3 AI Application Development and Model Deployment [00:14:24] 1.4 AI Alignment and Human Feedback Limitations2. Evaluation and Bias Assessment [00:20:15] 2.1 Human Evaluation Challenges and Factuality Assessment [00:27:15] 2.2 Cultural and Demographic Influences on Model Behavior [00:32:43] 2.3 Adversarial Examples and Model Robustness3. Benchmarking Systems and Methods [00:41:54] 3.1 DynaBench and Dynamic Benchmarking Approaches [00:50:02] 3.2 Benchmarking Challenges and Alternative Metrics [00:50:33] 3.3 Evolution of Model Benchmarking Methods [00:51:15] 3.4 Hierarchical Capability Testing Framework [00:52:35] 3.5 Benchmark Platforms and Tools4. Model Architecture and Performance [00:55:15] 4.1 Cohere's Model Development Process [01:00:26] 4.2 Model Quantization and Performance Evaluation [01:05:18] 4.3 Reasoning Capabilities and Benchmark Standards [01:08:27] 4.4 Training Progression and Technical Challenges5. Future Directions and Challenges [01:13:48] 5.1 Context Window Evolution and Trade-offs [01:22:47] 5.2 Enterprise Applications and Future ChallengesREFS:[00:03:10] Research at Cohere with Laura Ruis et al., Max Bartolo, Laura Ruis et al.https://cohere.com/research/papers/procedural-knowledge-in-pretraining-drives-reasoning-in-large-language-models-2024-11-20[00:04:15] Influence functions in machine learning, Koh & Lianghttps://arxiv.org/abs/1703.04730[00:08:05] Studying Large Language Model Generalization with Influence Functions, Roger Grosse et al.https://storage.prod.researchhub.com/uploads/papers/2023/08/08/2308.03296.pdf[00:11:10] The LLM ARChitect: Solving ARC-AGI Is A Matter of Perspective, Daniel Franzen, Jan Disselhoff, and David Hartmannhttps://github.com/da-fr/arc-prize-2024/blob/main/the_architects.pdf[00:12:10] Hugging Face model repo for C4AI Command A, Cohere and Cohere For AIhttps://huggingface.co/CohereForAI/c4ai-command-a-03-2025[00:13:30] OpenInterpreterhttps://github.com/KillianLucas/open-interpreter[00:16:15] Human Feedback is not Gold Standard, Tom Hosking, Max Bartolo, Phil Blunsomhttps://arxiv.org/abs/2309.16349[00:27:15] The PRISM Alignment Dataset, Hannah Kirk et al.https://arxiv.org/abs/2404.16019[00:32:50] How adversarial examples arise, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madryhttps://arxiv.org/abs/1905.02175[00:43:00] DynaBench platform paper, Douwe Kiela et al.https://aclanthology.org/2021.naacl-main.324.pdf[00:50:15] Sara Hooker's work on compute limitations, Sara Hookerhttps://arxiv.org/html/2407.05694v1[00:53:25] DataPerf: Community-led benchmark suite, Mazumder et al.https://arxiv.org/abs/2207.10062[01:04:35] DROP, Dheeru Dua et al.https://arxiv.org/abs/1903.00161[01:07:05] GSM8k, Cobbe et al.https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k[01:09:30] ARC, François Chollethttps://github.com/fchollet/ARC-AGI[01:15:50] Command A, Coherehttps://cohere.com/blog/command-a[01:22:55] Enterprise search using LLMs, Coherehttps://cohere.com/blog/commonly-asked-questions-about-search-from-coheres-enterprise-customers

NOW PLAYING

Reasoning, Robustness, and Human Feedback in AI - Max Bartolo (Cohere)

0:00 1:23:11

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

French Your Way Jessica: Native French teacher founder of French Your Way Boost your French listening skills and test your comprehension with this one of a kind series of podcasts. Get the chance to listen to a real conversation between native speakers talking at normal speed AND customise your learning experience through carefully designed sets of questions (2 levels of difficulty) available for download at www.frenchvoicespodcast.com. All interviews also come with the transcript. French teacher Jessica interviews native speakers of French from around the world who share a bit of their life and passion. Where else would you meet in one same place a French yoga teacher based in Melbourne, a soap manufacturer from Provence, or a couple cycling around the world? Kaizen Blueprint Aldo Chandra "Kaizen" is a Japanese term for continuous improvement. This podcast provides a blueprint to learn about health, wealth, relationships and everything else in between. Through our podcast, we strive to inspire, educate, and motivate our audience to cultivate a mindset of lifelong learning, productivity, and personal development. By sharing insights, strategies, and practical tips, we aim to guide listeners on their journey towards realizing their fullest potential, fostering success, and creating lasting positive change. One Man Went To Row PepperDawesMedia Follow the journey, from training to finish line, of a man from Derby, UK who is going from having only ever rowed on a machine to rowing 3000 miles solo across the Atlantic...just after his 70th birthday! Humanizing Change Tremendousness Join us each episode as we talk with innovators in their respective fields about their unique journeys and how they humanize change in their own work, right here, on Humanizing Change.

Frequently Asked Questions

How long is this episode of Machine Learning Street Talk (MLST)?

This episode is 1 hour and 23 minutes long.

When was this Machine Learning Street Talk (MLST) episode published?

This episode was published on March 18, 2025.

What is this episode about?

Dr. Max Bartolo from Cohere discusses machine learning model development, evaluation, and robustness. Key topics include model reasoning, the DynaBench platform for dynamic benchmarking, data-centric AI development, model training challenges, and...

Can I download this Machine Learning Street Talk (MLST) episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!