Jindong Wang — Whom to read in AI

The literature on LLM evaluation grew faster than the methodology needed to make its papers comparable: different prompts, different scoring, and different definitions of "success" produced contradictory claims about the same models. Wang led one of the most-cited syntheses of that landscape (the 2023 "A Survey on Evaluation of Large Language Models", with collaborators) and helped build PromptBench, an open framework for stress-testing model robustness across prompt variations. That is the part of the field where the gap between "the model works" and "the model works on the exact words we tested" lives.

Worth following when: you are designing an LLM evaluation from scratch and want to know which conventions already exist before you reinvent them.
Topics: systematic synthesis of LLM evaluation research; prompt robustness and adversarial prompts; the gap between benchmark scores and deployed behavior.
Key works: "A Survey on Evaluation of Large Language Models" (2023, corresponding author); PromptBench framework (2023).