EPISODE · Mar 1, 2026 · 14 MIN
Off-the-Shelf Large Language Models Are Unreliable Judges – Jonathan Choi (USC / WashU)
from Talking law and economics at ETH Zurich · host ETH Center for Law & Economics
With the rapid rise of artificial intelligence, large language models (LLMs) are increasingly being considered for tasks once thought to be uniquely human, including legal interpretation. The idea of “AI judges” suggests appealing possibilities: consistent, fast, and ostensibly unbiased answers to legal questions. But how reliable are these models? Can their judgments truly be trusted? And do they withstand careful empirical scrutiny?
In this episode of the CLE Vlog Series, Prof. Jonathan Choi (University of Southern California & Washington University, St. Louis) joins Alessandro Tacconelli (ETH Zurich) to discuss his paper, “Off-the-Shelf Large Language Models Are Unreliable Judges.” Prof. Choi presents findings from a series of empirical experiments designed to test how well LLMs perform as legal interpreters. His results reveal that model judgments are highly sensitive to prompt phrasing, output processing methods, and training choices. Moreover, post-training adjustments in today’s most widely used models can push LLMs’ assessments far from empirically grounded predictions of language use. These insights raise serious questions about the credibility of LLMs in legal interpretation and cast doubt on their ability to capture the “ordinary meaning” of legal texts.
Paper Reference:
Jonathan Choi, University of Southern California / Washington University (St. Louis), “Off-the-Shelf Large Language Models Are Unreliable Judges”
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5188865
Audio Credits for Trailer:
AllttA by AllttA https://youtu.be/ZawLOcbQZ2w