Seungone Kim — Whom to read in AI

The standard practice for scoring language models is to ask GPT-4 or another large closed model to grade the outputs. The method works in practice, but it is methodologically shaky in a way that gets worse over time: the judge is a closed service with an opaque release cycle and periodic model updates, so reproducing a result two years from now may no longer be possible. Kim's Prometheus and Prometheus 2 (lead author, 2023–24) are an attempt to give the community an open alternative: an evaluator with known weights and a documented training procedure, so that results can be independently verified.

Worth following when: you plan to evaluate LLMs in research that has to remain reproducible, and you understand that using GPT-as-judge binds your science to someone else's product roadmap.
Topics: open LLM evaluators as alternatives to closed judges; what one transformer is actually comparing when it "scores" another; the limits of LLM-as-judge as a methodology.
Key works: Prometheus (2023, lead author); Prometheus 2 (2024, lead author); CoT Collection (2023, lead author).