Norbert Fuhr
A written record of what information retrieval evaluation has been getting wrong for decades, and of why each new generation of researchers repeats the same mistakes.
Fuhr's 2017 paper "Some Common Mistakes In IR Evaluation, And How They Can Be Avoided" reads like a checklist that every LLM-evaluation paper of the last three years should have been reviewed against. The list includes reporting metrics averaged over runs without significance testing, comparing systems across different test collections, and applying statistical tests that do not suit the data. These are failures the IR field had already named, and the language-model-evaluation literature then reinvented them from scratch. His wider body of work, going back to probabilistic IR in the late 1980s, is a sustained attempt to make retrieval a discipline rather than engineering folklore.