1. Akari Asai How a language model decides that its own internal knowledge is insufficient and that it should reach for an external source instead. Retrieval-augmented generation Evaluation methodology
  2. Amir Globerson Using one language model as an adversarial interrogator of another to surface factual errors that neither could find alone. Hallucination and factuality Evaluation methodology
  3. Ari Holtzman How language models actually generate text from their probability distributions — and what the choice of sampling method does to evaluation results. Evaluation methodology Foundational figures
  4. Ashish Sabharwal What classical formal-reasoning research can contribute to evaluating whether modern LLMs actually reason — and how to tell that from reasoning-shaped language that happens to land on the right answer. Reasoning and decomposition Foundational figures
  5. Benno Stein Building evaluation infrastructure that turns researcher disagreement into something technically resolvable. Transparency and governance Evaluation methodology Information retrieval
  6. Björn Schuller What language technology has to evaluate when the input is not text but speech — including the emotional, paralinguistic, and individual-speaker signals that text-only methods discard. Evaluation methodology Speech and multimodal Safety and robustness
  7. Charles L. A. Clarke What systematic, replicable evaluation of question-answering systems requires — now that the systems being evaluated are language models rather than retrievers. Evaluation methodology Information retrieval
  8. Chelsea Finn How language and learning systems adapt to a new task with very few examples — and what the theoretical structure of that adaptation tells you about what they have or haven't actually learned. Reasoning and decomposition Foundational figures Systems and scaling
  9. Chris Callison-Burch Whether human readers can tell apart text produced by a language model from text produced by humans — and how that distinguishability decays as models improve. Multilingual evaluation Evaluation methodology Foundational figures
  10. Christof Monz Year-over-year systematic evaluation of machine translation across dozens of language pairs — and what that tracking reveals about which translation problems are getting solved and which aren't. Multilingual evaluation Evaluation methodology
  11. Christopher Manning The argument that meaning, as humans use the word, is not what large language models trade in. Evaluation methodology Foundational figures Interpretability
  12. Christopher Potts What linguistic structure language models actually represent — and what they only seem to. Evaluation methodology Foundational figures
  13. Colin Raffel Whether a single language model can do all NLP tasks at once when they're all framed as text-in-text-out — and whether that unification holds across languages. Multilingual evaluation Foundational figures
  14. Dan Jurafsky Making natural language processing a teachable discipline — and using that perspective to read where the field's current evaluation practices fit, and don't fit, into its longer history. Evaluation methodology Foundational figures
  15. Danqi Chen Whether a language model that produces an answer with citations is actually grounding the answer in those citations — or just attaching plausible references after the fact. Retrieval-augmented generation Hallucination and factuality Foundational figures
  16. David Jurgens Whether language models that handle factual questions cleanly can also handle the kind of social knowledge that determines what humans actually mean when they say things. Evaluation methodology Safety and robustness Foundational figures
  17. Daxin Jiang How to pre-train an encoder so that the embeddings it produces are good for retrieval — without ever supervising on a retrieval task. Retrieval-augmented generation Information retrieval
  18. Diyi Yang How language models behave when the task is social rather than informational — persuasion, support, conflict, politeness. Evaluation methodology Dialogue and agents
  19. Dragan Gašević Whether AI tools used in educational contexts actually help the learners they're built for — and what kind of evaluation infrastructure that question requires beyond conventional ML benchmarks. Transparency and governance Evaluation methodology Foundational figures
  20. Ee-Peng Lim Whether asking a language model to make a plan before solving a problem produces better reasoning than telling it to think step by step. Reasoning and decomposition Evaluation methodology
  21. Emma Strubell Whether the compute and energy cost of training and serving language models belongs in the headline of an evaluation, where accuracy currently sits alone. Transparency and governance Evaluation methodology
  22. Emmanuel Candès How to put valid statistical uncertainty intervals around any model's predictions — including language models — without assuming you know what kind of error distribution to expect. Evaluation methodology Foundational figures
  23. Eric Horvitz The qualitative shape of language-model capability: what it looks like as an object of study, and whether we have the vocabulary to describe it before we have the methodology to measure it. Transparency and governance Evaluation methodology Foundational figures
  24. Furu Wei Using a language model's own parametric knowledge to make retrieval find the document the user was actually looking for. Retrieval-augmented generation Evaluation methodology
  25. George J. Pappas How quickly an automated attacker can find prompts that break a language model's safety alignment — and what that means for evaluating model robustness as such. Evaluation methodology Safety and robustness
  26. Gideon Mann What language models look like when they're trained for a single high-value vertical — finance, in this case — and what evaluation that specialization requires beyond general-purpose benchmarks. Transparency and governance Evaluation methodology Foundational figures
  27. Graham Neubig Connecting retrieval, generation, and evaluation into a single working system — and asking when the connections actually hold. Retrieval-augmented generation Evaluation methodology Dialogue and agents
  28. Hannaneh Hajishirzi When a language model should reach for an external knowledge source — and when its own parametric memory is enough. Hallucination and factuality Transparency and governance Evaluation methodology
  29. Igor Mordatch What it means to evaluate a language model that takes consequential actions through its text — not only producing answers but operating in environments that respond. Reasoning and decomposition Systems and scaling Dialogue and agents
  30. Ion Stoica The distributed-compute infrastructure that almost every modern LLM is either trained on, served from, or evaluated through. Foundational figures Systems and scaling
  31. Iryna Gurevych Building the encoder infrastructure that makes "find sentences similar to this query" a fast and reliable operation across languages and tasks. Retrieval-augmented generation Information retrieval Foundational figures
  32. James Zou What evaluation looks like when the AI being evaluated has to clear regulatory bars before deployment — and what general LLM evaluation should learn from a field that has been doing this for years. Transparency and governance Evaluation methodology Foundational figures
  33. Jamie Callan What retrieval-augmented generation looks like when you actually know the thirty-year history of information retrieval it's reinventing. Retrieval-augmented generation Information retrieval Foundational figures
  34. Jared Kaplan Whether language-model loss is a smooth function of compute, data, and model size — and what that smoothness lets you predict (and not predict) about capabilities at larger scales. Evaluation methodology Foundational figures Systems and scaling
  35. Jennifer Wortman Vaughan How people actually understand and act on language-model evaluation results, and where the gap opens between what was measured and what gets believed. Transparency and governance Evaluation methodology
  36. Jesse Dodge What has to be in a paper about a language model for another lab to be able to verify the result. Transparency and governance Evaluation methodology
  37. Ji-Rong Wen Organizing the LLM literature into something other Chinese-language NLP researchers can actually navigate from inside the academic ecosystem. Multilingual evaluation Evaluation methodology Foundational figures
  38. Jian-Guang Lou Whether a language model that produces code actually produces code that does what the request asked for — and what evaluation looks like when correctness has a definite answer for once. Reasoning and decomposition Evaluation methodology Foundational figures
  39. Jian-Yun Nie How search behaves differently when the query is a conversation in a non-English language — and how language models are changing that picture. Retrieval-augmented generation Multilingual evaluation Information retrieval
  40. Jianfeng Gao What large language models are as a class of systems — taxonomically, architecturally, and in terms of what they actually inherit from the longer history of neural NLP. Evaluation methodology Foundational figures Dialogue and agents
  41. Jie Zhou Whether language models can be trusted to evaluate other language models in production NLP pipelines. LLM-as-judge methodology Dialogue and agents
  42. Jimmy Lin Making information retrieval reproducible enough that an LLM researcher and an IR researcher can run the same experiment and get the same answer. Retrieval-augmented generation Transparency and governance Information retrieval
  43. Jimmy Xiangji Huang How information retrieval scales over the messy operational data that real organizations hold, as opposed to the clean benchmark corpora the field actually publishes on. Evaluation methodology Information retrieval
  44. Jindong Wang Standardizing what "evaluating an LLM" even means as a research procedure. Evaluation methodology Safety and robustness
  45. Jochen Wirtz What happens when customer-facing service AI replaces, augments, or competes with human service workers — and what kinds of evaluation that change actually requires. Transparency and governance Evaluation methodology Dialogue and agents
  46. Jonathan Berant Whether a language model is actually doing the reasoning steps needed to answer a question — or just producing an answer that happens to be right. Reasoning and decomposition Evaluation methodology
  47. Juanzi Li Combining structured knowledge from knowledge graphs with the statistical patterns language models learn from text — and what that combination buys you for evaluation. Retrieval-augmented generation Hallucination and factuality Foundational figures
  48. Julian McAuley Whether language models trained on the open web are already doing recommendation — and what that implies for products that compete with traditional recommenders. Evaluation methodology Information retrieval Dialogue and agents
  49. Junichi Yamagishi Whether human listeners — or automated detectors — can tell AI-generated speech apart from real human speech, and how that distinguishability changes as speech synthesis improves. Evaluation methodology Speech and multimodal Safety and robustness
  50. Jure Leskovec What graph structure adds to the kinds of reasoning and retrieval problems language models currently handle without it — and what gets missed when relationships in data are flattened into text. Information retrieval Foundational figures
  51. Kevin Chen-Chuan Chang Organizing the rapidly growing literature on language-model reasoning into something a researcher new to the area can actually navigate. Reasoning and decomposition Evaluation methodology
  52. Kyle Lo What's actually in a training corpus once you sit down and look at it document by document. Transparency and governance Evaluation methodology
  53. Luke Zettlemoyer Whether open-weight language models can be built at frontier scale and whether their factuality can be measured at fine resolution. Hallucination and factuality Foundational figures
  54. Maarten de Rijke Whether information retrieval should be done by a system that "writes the document ID" instead of by one that searches a vector index. Retrieval-augmented generation Information retrieval
  55. Maarten Sap Whether language models can produce or reason about social knowledge with the same competence they show on factual tasks — and what's at stake when they can't. Evaluation methodology Safety and robustness Dialogue and agents
  56. Maosong Sun What the open-LLM ecosystem looks like when it grows out of Chinese academic NLP rather than out of Western non-profits like AI2. Transparency and governance Multilingual evaluation Foundational figures
  57. Marco Baroni Whether neural language models can compose what they've learned into new combinations they've never seen — or whether they're really only doing sophisticated interpolation within their training distribution. Reasoning and decomposition Multilingual evaluation Foundational figures
  58. Mari Ostendorf What language technology looks like when you've been responsible for it as deployable engineering for thirty years before LLMs arrived to redo the field. Evaluation methodology Speech and multimodal Foundational figures
  59. Mark Gales Detecting hallucinations in a language model without any access to its weights or to ground truth. Hallucination and factuality Speech and multimodal
  60. Martin Potthast Whether language models can be used to judge whether a document is relevant to a query — and what changes when they replace human assessors in that role. LLM-as-judge methodology Evaluation methodology Information retrieval
  61. Mengnan Du What kinds of explanations can be obtained for language-model outputs — and which of those explanations turn out to be reliable. Evaluation methodology Interpretability
  62. Michihiro Yasunaga Whether a language model can correctly translate a natural-language question into a precise structured query — and what that translation reveals about reasoning over knowledge. Reasoning and decomposition Evaluation methodology
  63. Minlie Huang Whether the categories of "harmful" used to evaluate language-model safety transfer from Western to Chinese-language deployment contexts, where the regulatory frame and cultural categories are different. Multilingual evaluation Safety and robustness Dialogue and agents
  64. Mohit Bansal Whether the methods we use to evaluate language-only models still work when the same model has to handle images, speech, or other modalities at the same time — and what fails first when modalities are combined. Evaluation methodology Speech and multimodal Foundational figures
  65. Nan Duan Rewriting the user's query before retrieval so that the retriever has a chance of returning useful documents. Retrieval-augmented generation
  66. Nathan Lambert What happens to a language model between "trained on the internet" and "answering your question the way it does" — and how to study that step in public. Transparency and governance Evaluation methodology
  67. Noah A. Smith Whether a language model can produce its own training data — and what the methodological consequences are when that becomes the standard practice. Transparency and governance Evaluation methodology Foundational figures
  68. Norbert Fuhr What information retrieval evaluation has been getting wrong, in writing, for the last several decades — and why each new generation of researchers makes the same mistakes. Evaluation methodology Information retrieval Foundational figures
  69. Omer Levy Whether a language model can infer the task from examples alone — without being told what to do — and what that ability reveals about how it represents instructions. Reasoning and decomposition Evaluation methodology Foundational figures
  70. Pang Wei Koh What happens when a language model encounters the kind of data it didn't see during training — and how to measure that gap rigorously, not just notice it after deployment. Transparency and governance Evaluation methodology Safety and robustness
  71. Paolo Rosso Building evaluation campaigns that work for Iberian-Romance languages — Spanish, Catalan, Portuguese — instead of porting English-centric methodology and accepting the resulting blind spots. Multilingual evaluation Evaluation methodology Safety and robustness
  72. Pascale Fung Whether the same language model performs the same kind of work across different languages — and whether evaluation methodology that's built for English is misleading us about what models do in everything else. Multilingual evaluation Evaluation methodology
  73. Percy Liang How to measure language models so a measurement made today still means something next year. Evaluation methodology
  74. Peter Henderson Where the technical findings about language-model behavior actually matter — in audits, regulation, and legal liability — and what the gap between "we measured this" and "this changes what's allowed" looks like. Transparency and governance Evaluation methodology
  75. Philip S. Yu Bridging four decades of data-mining methodology to the question of how to evaluate large language models without reinventing techniques the field already has. Evaluation methodology Information retrieval Foundational figures
  76. Prateek Mittal How visual inputs become a new attack surface for safety-aligned language models that accept multimodal queries. Safety and robustness Speech and multimodal
  77. Qiang Yang Whether useful machine learning can happen when the data you'd train or evaluate on can't be moved to a single place — for legal, privacy, or commercial reasons. Transparency and governance Evaluation methodology Foundational figures
  78. Quoc V. Le The architectural and training-recipe building blocks that the modern LLM era was built on top of. Reasoning and decomposition Foundational figures
  79. Rishi Bommasani What language-model developers are and aren't telling us about their own models — and how to measure that systematically. Transparency and governance Evaluation methodology
  80. Roi Reichart What "domain" means for a language model — when its training distribution stops matching its deployment context — and how to evaluate that mismatch rigorously. Evaluation methodology Foundational figures
  81. Seungone Kim Whether the field can have an open-weight LLM-evaluator alternative, so that "one black box judging another black box" stops being the only available option. Transparency and governance LLM-as-judge methodology Evaluation methodology
  82. Shafiq Joty Evaluation methodology for language models when the deployment context is enterprise software rather than a research demo. Evaluation methodology Dialogue and agents
  83. Shayne Longpre What's actually inside the data language models train on — and who can or can't tell. Transparency and governance Evaluation methodology
  84. Shinji Watanabe The open-source speech-processing infrastructure that lets academic and industrial groups train, evaluate, and compare voice-input or voice-output language systems on the same footing. Evaluation methodology Speech and multimodal Foundational figures
  85. Shuming Shi What "language model hallucination" looks like when you're responsible for shipping LLM-powered products to a billion-user surface. Hallucination and factuality
  86. Steven Schockaert How to evaluate a retrieval-augmented system end-to-end when each part of it can fail in different ways. Retrieval-augmented generation Evaluation methodology
  87. Tatsunori Hashimoto Whether the numbers reported about language models are statistical findings or methodological artifacts. Evaluation methodology
  88. Tom Mitchell Whether a language model has an internal representation of whether it's telling the truth — separable from what it actually outputs. Hallucination and factuality Foundational figures Interpretability
  89. Torsten Hoefler Generalizing language-model reasoning beyond linear chains of thought — into branching, backtracking, and recombination of intermediate reasoning steps. Reasoning and decomposition Systems and scaling
  90. Tushar Khot What "reasoning ability" actually means as something you can put on a benchmark — and what changes when you also try to coach the model to do reasoning through structured prompting. Reasoning and decomposition Evaluation methodology
  91. Wayne Xin Zhao The synthesis side of LLM research — what it takes to read every paper of the moment and produce something other researchers can navigate. Evaluation methodology Information retrieval Foundational figures
  92. Weijia Shi Making retrieval-augmented generation work when the language model itself is a closed box you can't fine-tune. Retrieval-augmented generation Information retrieval
  93. Wen-tau Yih Making the retrieval step inside retrieval-augmented systems good enough that the generation step has something to work with. Retrieval-augmented generation Hallucination and factuality Information retrieval
  94. Wenjie Li How generative retrieval relates to the other generation tasks — summarization, question answering — that the same system architecture has to handle. Retrieval-augmented generation Information retrieval
  95. Xia Hu Whether the academic state of the art in language models can be turned into something a practitioner — not a research lab — can actually deploy and trust. Evaluation methodology Foundational figures Interpretability
  96. Xing Xie Connecting the recommender-systems tradition of measuring user-facing AI behavior with the new evaluation challenges modern LLMs pose. Evaluation methodology Information retrieval Foundational figures
  97. Xipeng Qiu Building an open Chinese-language large language model that the academic community can actually study under the hood — and the tooling around it. Multilingual evaluation Foundational figures
  98. Xueqi Cheng What information retrieval research from inside the Chinese IR tradition has been arguing — and how its angle on generative retrieval differs from the Western canon. Retrieval-augmented generation Information retrieval
  99. Yang Liu Whether a more capable language model can be used to grade the outputs of a less capable one — and how to do that without fooling yourself. LLM-as-judge methodology Evaluation methodology
  100. Yann LeCun What machine-learning systems should look like, set against what they currently are. Reasoning and decomposition Foundational figures
  101. Yarin Gal Quantifying when language models don't know what they're saying. Hallucination and factuality Evaluation methodology
  102. Yejin Choi What it would take for a language model to have common sense, which humans have in spades and LLMs reliably lack. Reasoning and decomposition Evaluation methodology Foundational figures
  103. Yoav Goldberg Whether the chain of reasoning a language model produces is the chain of reasoning it actually followed. Hallucination and factuality Foundational figures Interpretability
  104. Yoav Shoham What can be measured about the state of AI from outside any single company — and what retrieval-augmentation looks like when you don't need to retrain anything. Retrieval-augmented generation Transparency and governance Foundational figures
  105. Yonatan Belinkov What's actually inside a language model's hidden representations — and which of those internal states map onto things humans would recognize as knowledge. Evaluation methodology Interpretability
  106. Yue Zhang Organizing what the field calls "hallucination" into categories that actually mean different things. Hallucination and factuality Evaluation methodology
  107. Yulia Tsvetkov What kinds of bias and harm look like in language-model outputs across languages — especially the languages and communities the field's standard evaluation has historically ignored. Multilingual evaluation Evaluation methodology Safety and robustness
  108. Zhaochun Ren Whether a language model can be trusted with the job that's currently done by an information retrieval system — and which parts of that job it actually does well. Retrieval-augmented generation Information retrieval
  109. Zhiting Hu Treating a language model as one component in a larger planning system — instead of asking it to do reasoning end-to-end in its own head. Reasoning and decomposition Dialogue and agents