The literature on LLM evaluation grew faster than the methodology needed to make its papers comparable: different prompts, different scoring, and different definitions of "success" produced contradictory claims about the same models. Wang led the most-cited synthesis of that landscape ("A Survey on Evaluation of Large Language Models", 2023, with collaborators) and built PromptBench, an open framework for stress-testing model robustness across prompt variations. That is the part of the field where the gap between "the model works" and "the model works on the exact words we tested" lives.

Worth following when
you are designing an LLM evaluation from scratch and want to know which conventions already exist before you reinvent them.
Topics
systematic synthesis of LLM evaluation research; prompt robustness and adversarial prompts; the gap between benchmark scores and deployed behavior.
Key works
"A Survey on Evaluation of Large Language Models" (2023, lead author); PromptBench framework (2023).