Potts comes at language models from a linguistics-and-philosophy background, which gives him an unusual angle on what current evaluation methodology lets us conclude. The Stanford Sentiment Treebank, which his group co-built more than a decade ago, remains one of the most widely cited datasets in sentiment classification, partly for the data itself and partly because it made compositional structure rather than word polarity the unit of evaluation. His more recent work on dynamic adversarial benchmarks like DynaSent argues a quieter point: a static test set goes stale the moment it appears, because subsequent models train on its echo in the corpus.

Worth following when
you want a linguist's take on what an LLM benchmark result does and does not let you claim.
Topics
sentiment and natural language inference as test cases for compositional meaning; adversarial and dynamic benchmark design; the linguistics behind LLM evaluation.
Key works
Stanford Sentiment Treebank (2013, co-author); DynaSent dynamic benchmark (2021); ongoing work on compositional generalization tests for LLMs.