Choi has been the most consistent voice in NLP arguing that scale alone does not produce commonsense reasoning, and her group has built the benchmarks and resources to back that up: HellaSwag (where humans get about 95% right and the best language models of 2019 got about 47%), PIQA for physical commonsense, SIQA for social commonsense, and COMET, a model for building commonsense knowledge graphs. The framing she brought to the field is that current evaluation gives models credit for surface-pattern matching on tasks where humans rely on actual world models. That framing has aged well: larger LLMs have improved on the benchmarks while still producing the reasoning failures the benchmarks were designed to expose. For ai100, this is the literature that justifies our skepticism of headline benchmark numbers: a model scoring 92% on commonsense benchmarks may still be doing something different from what humans do to score 95%, and the difference matters for downstream behavior.

Worth following when
you want LLM evaluation methodology grounded in the gap between benchmark performance and actual reasoning competence, from someone who has been making that argument since before it was fashionable.
Topics
commonsense reasoning evaluation (HellaSwag, PIQA, SIQA); commonsense knowledge models (COMET); the methodological gap between benchmark scores and reasoning competence; long-arc work on what NLP evaluation actually measures.
Key works
HellaSwag (2019, senior author); PIQA: Reasoning about Physical Commonsense (2020, co-author); COMET: Commonsense Transformers for Automatic Knowledge Graph Construction (2019, senior author); Defending Against Neural Fake News (Grover, 2019, senior author); AI2 Mosaic Team body of work on commonsense AI.