Ari Holtzman
How language models actually generate text from their probability distributions — and what the choice of sampling method does to evaluation results.
The 2019 paper "The Curious Case of Neural Text Degeneration", which Holtzman led, introduced nucleus sampling (top-p). Instead of sampling the next token from the full distribution or from the k highest-probability tokens, the method samples from the smallest set of tokens whose cumulative probability reaches p, with the probabilities renormalized over that set. Top-p became a standard decoding option for deployed LLMs, including the ones ai100 currently evaluates, because it produced more fluent and less repetitive text than the alternatives. The implication for evaluation is one the field has been slower to absorb: two runs of the same model with different sampling settings can produce noticeably different scores. Evaluation results are therefore not just about the model; they also depend on which decoding configuration the evaluator happened to use.
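The selection rule described above can be sketched in a few lines. This is a minimal illustration and not the paper's reference code: the function names `nucleus_filter` and `sample_top_p` are hypothetical, the input is assumed to be an already-normalized list of next-token probabilities, and the cutoff is treated as "cumulative mass reaches p" (>=).

```python
import random


def nucleus_filter(probs, p):
    """Return {token_index: renormalized_prob} for the nucleus.

    The nucleus is the smallest set of highest-probability tokens
    whose cumulative probability reaches p (assumed cutoff: >= p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution over four tokens (illustrative values).
toy_probs = [0.5, 0.3, 0.15, 0.05]
narrow = nucleus_filter(toy_probs, 0.4)  # only the top token survives
wide = nucleus_filter(toy_probs, 0.9)    # the three most likely tokens survive
```

On this toy distribution, p = 0.4 makes decoding deterministic, while p = 0.9 leaves three tokens in play. The same model can therefore produce different outputs, and different scores, purely because of the value of p the evaluator chose.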