Most NLP evaluation is one-shot: a benchmark drops, gets saturated, gets replaced. Monz has been co-organizing the Findings of the WMT campaigns for over a decade, producing one of the few datasets in the field with longitudinal structure: the same translation task, evaluated by the same protocol, across the same language pairs, year after year. The accumulated record shows what actually carried over from the statistical MT era to the neural era and then to the LLM era, and where MT progress has stalled despite the impression of universal improvement.

Worth following when
you want evaluation methodology that has been validated across a decade of repeated annual application, rather than a benchmark that is six months old.
Topics
annual machine translation evaluation (WMT findings); longitudinal tracking of MT progress across language pairs; the methodology lineage from statistical to neural to LLM-era MT.
Key works
co-organization of Findings of the WMT (annual, since 2008); UvA Language Technology Lab publications on multilingual MT; ongoing work on neural MT methodology.