Tatsunori Hashimoto
Whether the numbers reported about language models are statistical findings or methodological artifacts.
Hashimoto's group runs a quietly insistent line of inquiry: when a benchmark produces a result, what part of that result is the model and what part is the experimental setup? His 2023 benchmark for news summarization showed that the "near human" verdicts LLMs earned on standard datasets owed as much to the brittle scoring conventions of those datasets as to model quality. The same skepticism drives his earlier work on distributional robustness: a model that does well on a held-out test set may have learned the test set's tilt rather than the underlying task.