The 2023 preliminary study "Is ChatGPT a Good NLG Evaluator?", with Zhou as senior author, was one of the first papers to seriously test what the field had been treating as a foregone conclusion: can a single closed-source LLM be the scoring oracle for an entire research line, and what happens when you actually check? The findings were mixed. ChatGPT-as-judge correlated well with human judgments on some NLG tasks and badly on others, with biases that did not show up until the results were stratified by task type. Zhou's broader work is in industrial conversational AI, which gives the question a particular edge: when your dialogue product talks to hundreds of millions of users, "the evaluator is mostly right" is not a tolerable answer for the cases where it isn't.

Worth following when
you want a methodologically cautious early treatment of LLM-as-judge, written by someone who has to make scoring decisions count in production.
Topics
early empirical studies of LLM-as-judge reliability; task-stratified evaluator bias; industrial conversational AI evaluation.
Key works
"Is ChatGPT a Good NLG Evaluator? A Preliminary Study" (2023, senior author); ongoing publications on large-scale conversational systems.