Yejin Choi
What it would take for a language model to have what humans have in spades and LLMs reliably lack: common sense.
Choi has been the most consistent voice in NLP arguing that scale alone does not produce common-sense reasoning, and her group has built the resources to back that up: HellaSwag (where humans get 95% right and language models from 2019 got 47%), PIQA for physical commonsense, SIQA for social commonsense, and COMET, a model for constructing commonsense knowledge graphs. The framing she brought to the field is that current evaluation gives models credit for surface-pattern matching on tasks where humans rely on actual world models. That framing has aged well: larger LLMs have improved on the benchmarks while still producing the reasoning failures those benchmarks were designed to expose. For ai100, this is the literature that justifies our skepticism of headline benchmark numbers: a model scoring 92% on commonsense benchmarks may still be doing something different from what humans do to score 95%, and the difference matters for downstream behavior.