Jian-Guang Lou
Whether a language model that writes code actually writes code that does what the request asked for, and what evaluation looks like when, for once, correctness has a definite answer.
Lou's research line at Microsoft Research Asia (MSR Asia) works on the rare LLM-evaluation problem where correctness is unambiguous: code generation. Either the generated program compiles, runs, and produces the right output, or it doesn't. As a result, the evaluation methodology can sidestep most of the LLM-as-judge problems that plague the evaluation of open-ended generation. His group's contributions to in-context-learning evaluation and program-synthesis benchmarks build on this advantage: the literature knows what "correct" means and uses that as leverage to interrogate other parts of model behavior.
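
The "compiles, runs, and produces the right output" criterion can be illustrated with a minimal sketch of an execution-based check. This is not Lou's group's harness or any specific benchmark's code; the function name `passes_tests`, the candidate programs, and the test cases are all hypothetical, chosen only to show that the verdict is a plain boolean with no judge model involved.

```python
def passes_tests(source, entry_point, cases):
    """Return True only if `source` compiles, runs, and matches every expected output.

    Illustrative only: exec() on untrusted model output is unsafe, and a real
    harness would run candidates in a sandboxed process with a timeout.
    """
    namespace = {}
    try:
        # Step 1: the program must compile and load.
        exec(compile(source, "<candidate>", "exec"), namespace)
        # Step 2: the requested function must exist.
        func = namespace[entry_point]
        # Step 3: it must run and produce the right output on every case.
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # A syntax error, a missing function, or a runtime error is a failure.
        return False


# Hypothetical candidates for the request "write add(a, b) that returns the sum".
CORRECT = "def add(a, b):\n    return a + b\n"
WRONG = "def add(a, b):\n    return a - b\n"      # runs, but wrong output
BROKEN = "def add(a, b) return a + b\n"           # does not compile

# Hypothetical (arguments, expected output) pairs.
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

for name, src in [("correct", CORRECT), ("wrong", WRONG), ("broken", BROKEN)]:
    print(name, passes_tests(src, "add", CASES))
```

Only the first candidate passes. The other two fail for different reasons (wrong output, compile error), but both reduce to the same definite verdict, which is what lets this style of evaluation avoid a judge model.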