Lou's MSR Asia research line has worked on the rare LLM-evaluation problem where correctness is unambiguous: code generation. Either the generated program compiles, runs, and produces the right output, or it doesn't, which means the evaluation methodology can sidestep most of the LLM-as-judge problems that plague open-ended generation evaluation. His group's contributions to in-context-learning evaluation and program-synthesis benchmarks build on this advantage: the work knows what "correct" means and uses that as leverage to interrogate other parts of model behavior.
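
To make the "runs and produces the right output, or it doesn't" criterion concrete, here is a minimal sketch of an execution-based check. It is an illustration only, not the group's harness: the function name `passes`, the sample programs `GOOD` and `BAD`, and the test cases in `CASES` are all hypothetical, and it assumes candidates are Python scripts that read stdin and write stdout. A real benchmark would also sandbox the untrusted code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes(candidate_source: str, cases: list[tuple[str, str]],
           timeout_s: float = 5.0) -> bool:
    """Return True only if the candidate prints the expected output for every case."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(candidate_source, encoding="utf-8")
        for stdin_text, expected in cases:
            try:
                result = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False  # a hang counts as a failure
            if result.returncode != 0:
                return False  # syntax errors and crashes count as failures
            if result.stdout.strip() != expected.strip():
                return False  # wrong output counts as a failure
    return True


# Hypothetical task: read two integers and print their sum.
CASES = [("2 3\n", "5"), ("10 -4\n", "6")]
GOOD = "a, b = map(int, input().split())\nprint(a + b)\n"
BAD = "a, b = map(int, input().split())\nprint(a * b)\n"

if __name__ == "__main__":
    print(passes(GOOD, CASES))
    print(passes(BAD, CASES))
```

The verdict is a plain boolean computed from program behavior, so no judge model is involved. The check is only as strong as the test cases it is given.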

Worth following when
you want to study LLM evaluation methodology in the rare setting where the ground truth is unambiguous, and to see what that clarity buys you that open-ended evaluation cannot.
Topics
program-synthesis evaluation methodology; in-context-learning rigor for code generation; the contrast between unambiguous-ground-truth and open-ended evaluation.
Key works
body of work on program-synthesis and code-generation evaluation at Microsoft Research Asia (2018 onward); in-context-learning evaluation publications; ACM Distinguished Member contributions to applied LLM research.