WILDS, which Koh co-led in 2021, was one of the first systematic benchmark suites for real-world distribution shifts in machine learning: geographic shift, temporal shift, demographic shift, the kinds of differences that quietly break models trained on convenient datasets. The result that organized the work was deflating: models that scored well on in-distribution test sets lost substantial accuracy on the WILDS shifts, and no single robustness technique closed the gap. For ai100 this is the structural problem behind any benchmark of AI engines: what we measure on our query set is not what end-users encounter, and the gap deserves the same methodological care as accuracy itself.

Worth following when
you suspect a benchmark result will not survive contact with real-world deployment data, and you want the literature that turned that suspicion into measurable shift categories.
Topics
distribution-shift benchmarks (WILDS); foundation-model robustness under realistic conditions; the gap between in-distribution evaluation and deployment behavior.
Key works
WILDS distribution-shift benchmark suite (2021, co-lead); foundation-model robustness publications (2021 onward); recent work on retrieval-augmented LMs and dataset attribution.