Yang Liu
Whether a more capable language model can be used to grade the outputs of a less capable one — and how to do that without fooling yourself.
The 2023 G-Eval paper crystallized a practice that had been creeping into the field over the preceding year: instead of writing yet another automatic metric for NLG quality, use a strong LLM as the judge, prompt it with a structured rubric, and read its output as a score. The paper reported that G-Eval correlated with human judgment better than the prior automatic metrics it was compared against on the same tasks, and it is now the citable reference for almost any system that scores generated text with a model. That includes ai100's, which depends on this approach existing as a defensible methodology rather than a clever hack.
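To make the shape of the approach concrete, here is a minimal sketch of the two halves described above: turning a rubric into a judge prompt and reading the reply as a score, then checking how well those scores track human ratings with a rank correlation. Everything here is an illustrative assumption, not the paper's actual prompt or scoring procedure: the function names, the `Score: <integer>` reply format, and the 1 to 5 scale are hypothetical, and the call to the judge model itself is deliberately left out.

```python
import re
from typing import List, Optional, Sequence, Tuple


def build_rubric_prompt(source: str, output: str, criterion: str,
                        scale: Tuple[int, int] = (1, 5)) -> str:
    """Build a judge prompt for one criterion (hypothetical wording)."""
    lo, hi = scale
    return (
        f"You are grading a generated text on: {criterion}.\n"
        f"Rate it from {lo} (worst) to {hi} (best).\n\n"
        f"Source:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        "Answer with 'Score: <integer>'."
    )


def parse_score(reply: str, scale: Tuple[int, int] = (1, 5)) -> Optional[int]:
    """Read the judge's reply as a score; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    value = int(match.group(1))
    lo, hi = scale
    return value if lo <= value <= hi else None


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5
```

In use, each judge reply would go through `parse_score`, and the resulting list would be compared against human ratings for the same items with `spearman(judge_scores, human_scores)`. A judge whose scores do not correlate with human ratings on a held-out set is the "fooling yourself" case the approach has to rule out.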