Hashimoto's group runs a quietly insistent line of inquiry: when a benchmark produces a result, how much of that result reflects the model and how much the experimental setup? His 2023 benchmark for news summarization showed that the "near human" verdicts LLMs earned on standard datasets owed as much to those datasets' brittle scoring conventions as to model quality. The same skepticism drives his earlier work on distributional robustness: a model that does well on a held-out test set may have learned the test set's tilt rather than the underlying task.

Worth following when
you want to read an evaluation paper and know whether to believe the headline number.
Topics
statistical evaluation of language models; benchmark validity and contamination; distributional robustness of trained models.
Key works
Benchmarking Large Language Models for News Summarization (2023); foundational work on distributional robustness and group DRO (2019, ongoing).