Zou's research at Stanford has tracked the FDA's growing roster of approved AI/ML medical devices and the evaluation methodology each one had to satisfy: not benchmark scores, but prospective clinical performance studies with predefined endpoints. His work also includes systematic comparisons of LLMs and clinicians on medical question answering, which gives the field a calibrated reference for what "this LLM is medically reliable" actually requires. For ai100, biomedical AI is the methodological exemplar: a domain where the evaluation question is structurally similar to ours (a high-stakes recommendation in a context that affects real outcomes) but where the stakes have forced a longer-standing methodological discipline than general LLM evaluation has yet developed.

Worth following when
you want LLM evaluation methodology informed by a domain where stakes have already forced rigorous standards, and want to see what those standards actually require.
Topics
biomedical AI evaluation; FDA-track AI/ML medical device approval methodology; the bridge between regulatory-grade evaluation and general LLM benchmarking.
Key works
body of work on FDA-approved AI medical devices and their evaluation criteria (Stanford, 2020 onward); systematic LLM-vs-clinician comparison studies; ongoing publications on AI-in-healthcare evaluation methodology.