For most of NLP and IR history, when two papers reported different numbers on the "same" benchmark, the disagreement could not be settled, because the runs differed in splits, prompts, code, and machines. Stein's Webis group built TIRA, a platform that runs submitted code in a controlled environment, so that "I ran your code on this data" becomes a literally executable claim. The same infrastructure thinking drives the PAN evaluation campaigns he has co-chaired since 2009: each task includes its evaluation protocol as a binding part of the task definition.

Worth following when
you want evaluation infrastructure that turns disagreements into something you can resolve technically, not just publish about.
Topics
reproducible evaluation infrastructure (TIRA); long-running shared-task design (PAN); the difference between evaluation conventions and enforceable evaluation protocols.
Key works
TIRA Integrated Research Architecture (2007, ongoing); PAN shared-task series (2009, ongoing, co-chair); Webis group publications on retrieval and computational argumentation.