Christopher Potts — Whom to read in AI

Potts comes at language models from a linguistics-and-philosophy background, which gives him an unusual angle on what current evaluation methodology lets us conclude. The Stanford Sentiment Treebank, which he co-built more than a decade ago, is still widely cited in work on sentiment classification, partly for the dataset itself and partly because it labels every phrase in a sentence's parse tree, which made compositional structure rather than word polarity the unit of evaluation. His more recent work on dynamic adversarial benchmarks such as DynaSent argues a quieter point: a static test set starts going stale as soon as it is released, because later models are trained on text that already echoes it.

Worth following when: you want a linguist's take on what an LLM benchmark result does and does not let you claim.
Topics: sentiment and natural language inference as test cases for compositional meaning; adversarial and dynamic benchmark design; the linguistics behind LLM evaluation.
Key works: Stanford Sentiment Treebank (2013, co-author); DynaSent dynamic benchmark (2021); ongoing work on compositional generalization tests for LLMs.