Martin Potthast
Whether language models can judge a document's relevance to a query, and what changes when they replace human assessors in that role.
For decades, relevance judgments (the labels that say "this document is relevant to this query") were produced by human assessors: expensive and slow, but the gold standard for IR evaluation. Potthast's "LLMs for Relevance" line of work, run jointly with the Webis network, asked the practical question: how close can a language model get to expert human relevance judgments, and where does the substitution silently break? The findings are mixed in a productive way. For some task types, LLM judgments are close enough to enable evaluation at corpus scales that human assessment cannot reach; for others, they are far enough off that uncritical use produces systematically biased benchmarks.
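One common way to put a number on "how close" is a chance-corrected agreement score between the human and LLM labels for the same query-document pairs. The sketch below uses Cohen's kappa as an illustration; the text above does not say which agreement measure this line of work relies on, so the choice of metric, the function name `cohens_kappa`, and the toy labels are all assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(human, llm):
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of pairs where both assessors give the same label.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement if both labeled independently with their own label rates.
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    expected = sum(human_counts[k] * llm_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Both assessors used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up binary judgments (1 = relevant, 0 = not relevant) for eight
# hypothetical query-document pairs; not data from any study.
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
llm_labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(cohens_kappa(human_labels, llm_labels))  # prints 0.5
```

A single overall score can hide the kind of systematic bias described above, so in practice one would probably compute it per task type or per query rather than once over the whole pool.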