James Zou
What evaluation looks like when the AI being evaluated has to clear regulatory bars before deployment — and what general LLM evaluation should learn from a field that has been doing this for years.
Zou's research at Stanford has tracked the FDA's growing roster of approved AI/ML medical devices and the evaluation methodology each one had to satisfy: not benchmark scores, but prospective clinical performance studies with predefined endpoints. His work also includes systematic comparisons of LLMs and clinicians on medical question answering, which give the field a calibrated reference for what "this LLM is medically reliable" actually requires. For ai100, biomedical AI is the methodological exemplar: a domain where the evaluation question is structurally similar to ours (a high-stakes recommendation in a context that affects real outcomes), but where the stakes have forced a more mature methodological discipline than general LLM evaluation has yet developed.