EPISODE · Aug 12, 2025 · 49 MIN
RLVR Lets Models Fail Their Way to the Top
from YAAP (Yet Another AI Podcast) · host AI21
Think you know fine-tuning? If your answer is RLHF, you don’t. In this episode, Itay, who leads the Alignment group at AI21, gives a no-fluff crash course on RLVR (Reinforcement Learning with Verifiable Rewards), the method powering today’s smartest coding and reasoning models. He explains why RLVR beats RLHF at its own game, how “hard to solve, easy to verify” tasks unlock exploration without chaos, and the emergent behaviors you only get when models are allowed to screw up. If you want to actually understand RLVR (and use it), start here.

Key topics:
How RLVR outsmarts RLHF in real-world training
The “verified rewards” trick that kills reward hacking
Emergent skills you don’t get with hand-holding: self-verification, backtracking, multi-path reasoning
Why coding models took a giant leap forward
Practical steps to train (and actually benefit from) RLVR models