Fuhr's 2017 paper "Some Common Mistakes In IR Evaluation, And How They Can Be Avoided" reads like a checklist that every LLM-evaluation paper of the last three years should have been reviewed against. It covers failures such as reporting metrics averaged over runs with no significance testing, comparing systems on different test collections, and applying statistical tests inappropriate to the data: mistakes the IR field had already named and that the language-model-evaluation literature went on to repeat. His broader body of work, going back to probabilistic IR in the late 1980s, is a sustained effort to make retrieval a discipline rather than engineering folklore.

Worth following when
you suspect a current LLM-evaluation paper is making mistakes the IR community documented years ago, and you want a citation that backs you up.
Topics
foundations of probabilistic IR; methodological mistakes in IR and NLP evaluation; the long-arc history of information retrieval as a discipline.
Key works
"Some Common Mistakes In IR Evaluation, And How They Can Be Avoided" (2017); foundational work on probabilistic IR (1989 onward); Gerard Salton Award lecture and writings.