Benno Stein — Whom to read in AI

For much of NLP and IR history, when two papers reported different numbers on the "same" benchmark, the disagreement was hard to settle: the splits, prompts, code, and machines could all differ. Stein's Webis group built TIRA, a platform that runs submitted code in a controlled environment, so that "I ran your code on this data" becomes a literally executable claim. The same infrastructure thinking drives the PAN evaluation campaigns he has co-chaired since 2009: each task includes its evaluation protocol as a binding part of the task definition.

Worth following when: you want evaluation infrastructure that turns disagreements into something with a technical resolution path, not just material for another paper.
Topics: reproducible evaluation infrastructure (TIRA); long-running shared-task design (PAN); the difference between evaluation conventions and enforceable evaluation protocols.
Key works: TIRA Integrated Research Architecture (2007, ongoing); PAN shared-task series (2009, ongoing, co-chair); Webis group publications on retrieval and computational argumentation.