The standard practice for scoring language models is to ask GPT-4 or another large closed model to grade the outputs. The method works, but it is methodologically shaky in a way that gets worse over time: the judge is a closed service with an opaque release cycle and regular weight updates, so reproducing a result two years from now may well be impossible. Prometheus and Prometheus 2 (2023–24, with Kim as lead author) are an attempt to give the community an open analogue: an evaluator with known weights and a documented training procedure, against which independent verification is actually possible.

Worth following when
you plan to evaluate LLMs in research that has to remain reproducible, and you understand that using GPT-as-judge binds your science to someone else's product roadmap.
Topics
open LLM evaluators as alternatives to closed judges; what one transformer is actually comparing when it "scores" another; the limits of LLM-as-judge as a methodology.
Key works
Prometheus (2023, lead author); Prometheus 2 (2024, lead author); CoT Collection (2023, lead author).