Jindong Wang
Standardizing what "evaluating an LLM" even means as a research procedure.
The literature on LLM evaluation grew faster than the methodology needed to make its papers comparable: different prompts, different scoring, and different definitions of "success" produced contradictory claims about the same models. Wang led the most-cited synthesis of that landscape (the 2023 "Survey on Evaluation of LLMs", with collaborators) and built PromptBench, an open framework for stress-testing model robustness across prompt variations. That is the part of the field where the gap between "the model works" and "the model works on the exact words we tested" lives.