Seungone Kim
Whether the field can have an open-weight alternative to closed LLM evaluators, so that "one black box judging another black box" stops being the only available option.
The standard practice for scoring language models is to ask GPT-4 or another large closed model to grade the outputs. The method works, but it is methodologically shaky in a way that gets worse over time: the judge is a closed service with an opaque release cycle and regular weight updates, so reproducing a result two years from now may be impossible. Kim's Prometheus and Prometheus 2 (lead author, 2023–24) are an attempt to give the community an open analogue: an evaluator with known weights and a documented training procedure, which makes independent verification actually possible.