Benno Stein
Building evaluation infrastructure that turns researcher disagreement into something technically resolvable.
For most of NLP and IR history, when two papers reported different numbers on the "same" benchmark, the disagreement could not be settled, because the splits, prompts, code, and machines all differed. Stein's Webis group built TIRA, a platform that runs submitted code in a controlled environment so that "I ran your code on this data" becomes a literally executable claim. The same infrastructure thinking drives the PAN evaluation campaigns he has co-chaired since 2009: each task includes its evaluation protocol as a binding part of the task definition.