For decades, relevance judgments (the labels that say "this document is relevant to this query") were produced by human assessors: expensive and slow, but the gold standard for IR evaluation. Potthast's "LLMs for Relevance" line of work, run jointly with the Webis network, asked the practical question: how close can a language model get to expert human relevance judgments, and where does the substitution silently break? The findings are mixed in a productive way. For some task types, LLM judgments are close enough to enable evaluation at corpus scales that human assessment cannot reach; for others, they are far enough off that uncritical use produces systematically biased benchmarks.

Worth following when
you want to know whether using an LLM to grade relevance is appropriate for your evaluation, and where it isn't.
Topics
LLMs as substitutes for human relevance assessors; large-scale IR evaluation under labelling constraints; European open web-search infrastructure (OpenWebSearch.eu).
Key works
Webis "LLMs for Relevance" line (2023 onward); long-standing PAN and TIRA contributions (with Stein); OpenWebSearch.eu open-infrastructure publications.