The 2019 paper "The Curious Case of Neural Text Degeneration", with Holtzman as lead author, introduced nucleus sampling (top-p): instead of sampling the next token from the full distribution or from the k highest-probability tokens (top-k), sample from the smallest set of tokens whose cumulative probability reaches or exceeds p, renormalized over that set. The method became a standard decoding option for almost every deployed LLM, including the ones ai100 currently evaluates, because it produced more fluent and less repetitive text than the alternatives. The implication for evaluation is one the field has been slower to absorb: two runs of the same model with different sampling settings can produce noticeably different scores, so evaluation results reflect not just the model but also the decoding configuration the evaluator happened to use.
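
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's reference code: the function names `nucleus_filter` and `sample_top_p` are hypothetical, the input is assumed to be an already-normalized list of next-token probabilities, and the cutoff is assumed to count a cumulative mass that reaches p exactly as sufficient.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of top-probability tokens whose cumulative
    mass reaches p, and renormalize the kept probabilities to sum to 1.

    probs: list of next-token probabilities (assumed to sum to about 1).
    Returns a dict mapping token index -> renormalized probability.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        total += prob
        if total >= p:  # assumption: reaching p exactly is enough to stop
            break
    return {idx: prob / total for idx, prob in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution for illustration
    for p in (0.7, 0.9):
        print(p, sorted(nucleus_filter(toy_probs, p)))
```

With this toy distribution, p = 0.7 keeps two candidate tokens and p = 0.9 keeps three, so the same model can emit different text, and earn different scores, depending only on the value of p the evaluator chose.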

Worth following when
you want to understand the generation-side decisions that affect what an LLM evaluation actually measures, beyond the model's parametric state.
Topics
nucleus sampling and top-p decoding; the influence of decoding strategy on LLM evaluation results; text-generation theory and methodology.
Key works
"The Curious Case of Neural Text Degeneration", which introduced nucleus sampling (2019, lead author); ongoing publications on text-generation evaluation; research from the Conceptualization Lab at UChicago.