Jie Zhou
Whether language models can be trusted to evaluate other language models in production NLP pipelines.
The 2023 preliminary study "Is ChatGPT a Good NLG Evaluator?", with Zhou as senior author, was one of the first papers to seriously ask a question the field had been treating as settled: can a single closed-source LLM be the scoring oracle for an entire research line, and what happens when you actually check? The findings were mixed. ChatGPT-as-judge correlated well with human judgments on some NLG tasks and badly on others, with biases that did not show up until the results were stratified by task type. Zhou's broader work is in industrial conversational AI, which gives the question a particular edge: when your dialogue product talks to hundreds of millions of users, "the evaluator is mostly right" is not a tolerable answer for the cases where it isn't.