EPISODE · May 17, 2026 · 7 MIN
“Benchmarking Real Work” by kaivu, leni, rohuang, zef
Thanks to Megan Kinniment for helpful comments and discussion.

TL;DR: Benchmarks like HCAST undersample fuzzy (hard to evaluate) tasks, meaning they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capacity: we can either try to build automated judges that match human judgment, or reduce the human effort per grade. To do this, we propose generating fuzzy tasks as a byproduct of real SWE work: snapshot the repo and a proto-spec before starting, and after finishing, use an AI transform to produce an executable spec and LLM-judge conditions. Because the engineer just did the work, verifying the judges or grading the agent directly is much cheaper than grading the task from scratch. I think this would be a good way to collect tasks, as well as a useful personal epistemic tool.

This is a two-part series on capability evaluation. Part 1 is about acquiring fuzzy tasks, and part 2 is about analyzing them.

Motivation: sampling bias in HCAST

There are several well-described limitations of time horizons. But the strongest reason that I don’t update that much on trends in time horizons (and time horizon-like tasks) is because I think all existing evaluations [...]

---

Outline:
(01:14) Motivation: sampling bias in HCAST
(02:47) Making fuzzy tasks sampling viable by increasing judge capacity
(04:02) Proposal: sampling from real work
(05:18) Advantages
(06:10) Discussion
(06:13) How inconvenient is this?
(06:32) Can we test fuzzy skills by just testing longer tasks?

The original text contained 3 footnotes which were omitted from this narration.

---

First published: May 16th, 2026
Source: https://www.lesswrong.com/posts/NbDjD47u6WmthgiDC/benchmarking-real-work

---

Narrated by TYPE III AUDIO.