Charles L. A. Clarke
What systematic, replicable evaluation of question-answering systems requires — now that the systems being evaluated are language models rather than retrievers.
Clarke spent three decades inside the TREC evaluation tradition, where rigorous comparison of retrieval systems involved shared corpora, shared queries, shared relevance judgments, and an explicit protocol for resolving disagreements between assessors. His 2023 paper "Evaluating Open-Domain QA in the Era of LLMs" carries that discipline forward into the current moment, pointing out that most LLM-based QA evaluation has quietly dropped the bulk of those guardrails: it relies on single-source ground truth, uses automatic judges that have not been validated against humans, and has no protocol for answers that are technically correct but stylistically different from the reference. The methodological reset he argues for is closer to the original TREC posture than to anything currently in vogue.