The 2023 G-Eval paper crystallized a practice that had been creeping into the field over the preceding year: instead of writing yet another automatic metric for NLG quality, use a strong LLM as the judge, prompt it with a structured rubric, and read its output as a score. G-Eval reported higher correlation with human judgment than the prior automatic metrics it was compared against on the same tasks, and the paper is now the citable reference for almost any system that scores generated text with a model. That includes ai100's, which depends on this approach existing as a defensible methodology rather than a clever hack.

Worth following when
you want to understand the methodological foundation under "let the model grade it" and how to argue for it in a peer-reviewed setting.
Topics
LLM-as-judge as a scoring methodology; chain-of-thought prompting in evaluation; human alignment of automatic NLG metrics.
Key works
G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment (2023, lead author); subsequent LLM-as-judge methodology refinements.