Pang Wei Koh
What happens when a language model encounters data unlike anything it saw during training, and how to measure the resulting performance gap rigorously rather than just notice it after deployment.
WILDS, which Koh co-led in 2021, was one of the first systematic benchmark suites for real-world distribution shifts in machine learning: geographic shift, temporal shift, demographic shift, the kinds of differences that quietly break models trained on convenient datasets. The result that organized the work was deflating: models that performed well on in-distribution test sets scored substantially lower on the WILDS shifts, and no single robustness technique closed the gap. For ai100 this is the structural problem behind any benchmark of AI engines: what we measure on our query set is not what end-users encounter, and the gap deserves the same methodological care as accuracy itself.