Whom to read in AI
A curated navigator — not a ranking, not a top-N — across 109 researchers whose work intersects what ai100 measures.
109 researchers · 14 subject areas · 3 lenses
What this is and how to use it
The list below is not a ranking. There is no top, no bottom, no h-index sorting. AI research is not a leaderboard, and the work that lasts is rarely the work that trends.
What this page is: a working map of the researchers ai100 reads when we need to understand how language models behave under pressure — how they are scored, where they break, what kind of evidence about them is and isn't trustworthy. The people here have shaped how AI evaluation, retrieval, and reliability are actually done, not how they're talked about on social media.
For each name we list the direction of the work, a short brief on what makes the person worth your time, the topics where their writing is the entry point, and a few named works to start with. We do not list current employers or job titles — those move every year or two, and the work doesn't. Where someone is affiliated with a company whose models we evaluate, we say so explicitly in a Disclosure line: ai100's methodology depends on us being honest about conflicts, and there are conflicts on this list.
The flagship analysis — what we find when we put the major AI engines through 200 audits each, in five languages — runs separately. This page is the upstream: the people whose work makes that analysis possible to design.
- Akari Asai How a language model decides that its own internal knowledge is insufficient and that it should reach for an external source instead.
- Amir Globerson Using one language model as an adversarial interrogator of another to surface factual errors that neither could find alone.
- Ari Holtzman How language models actually generate text from their probability distributions — and what the choice of sampling method does to evaluation results.
- Ashish Sabharwal What classical formal-reasoning research can contribute to evaluating whether modern LLMs actually reason — and how to tell that from reasoning-shaped language that happens to land on the right answer.
- Benno Stein Building evaluation infrastructure that turns researcher disagreement into something technically resolvable.
- Björn Schuller What language technology has to evaluate when the input is not text but speech — including the emotional, paralinguistic, and individual-speaker signals that text-only methods discard.
- Charles L. A. Clarke What systematic, replicable evaluation of question-answering systems requires — now that the systems being evaluated are language models rather than retrievers.
- Chelsea Finn How language and learning systems adapt to a new task with very few examples — and what the theoretical structure of that adaptation tells you about what they have or haven't actually learned.
- Chris Callison-Burch Whether human readers can tell text produced by a language model apart from text produced by humans, and how that distinguishability decays as models improve.
- Christof Monz Year-over-year systematic evaluation of machine translation across dozens of language pairs — and what that tracking reveals about which translation problems are getting solved and which aren't.
- Christopher Manning The argument that meaning, as humans use the word, is not what large language models trade in.
- Christopher Potts What linguistic structure language models actually represent — and what they only seem to.
- Colin Raffel Whether a single language model can do all NLP tasks at once when they're all framed as text-in-text-out — and whether that unification holds across languages.
- Dan Jurafsky Making natural language processing a teachable discipline — and using that perspective to read where the field's current evaluation practices fit, and don't fit, into its longer history.
- Danqi Chen Whether a language model that produces an answer with citations is actually grounding the answer in those citations — or just attaching plausible references after the fact.
- David Jurgens Whether language models that handle factual questions cleanly can also handle the kind of social knowledge that determines what humans actually mean when they say things.
- Daxin Jiang How to pre-train an encoder so that the embeddings it produces are good for retrieval — without ever supervising on a retrieval task.
- Diyi Yang How language models behave when the task is social rather than informational — persuasion, support, conflict, politeness.
- Dragan Gašević Whether AI tools used in educational contexts actually help the learners they're built for — and what kind of evaluation infrastructure that question requires beyond conventional ML benchmarks.
- Ee-Peng Lim Whether asking a language model to make a plan before solving a problem produces better reasoning than telling it to think step by step.
- Emma Strubell Whether the compute and energy cost of training and serving language models belongs in the headline of an evaluation, where accuracy currently sits alone.
- Emmanuel Candès How to put valid statistical uncertainty intervals around any model's predictions — including language models — without assuming you know what kind of error distribution to expect.
- Eric Horvitz The qualitative shape of language-model capability: what it looks like taken as a whole, and whether we have the vocabulary to describe it before we have the methodology to measure it.
- Furu Wei Using a language model's own parametric knowledge to make retrieval find the document the user was actually looking for.
- George J. Pappas How quickly an automated attacker can find prompts that break a language model's safety alignment — and what that means for evaluating model robustness as such.
- Gideon Mann What language models look like when they're trained for a single high-value vertical — finance, in this case — and what evaluation that specialization requires beyond general-purpose benchmarks.
- Graham Neubig Connecting retrieval, generation, and evaluation into a single working system — and asking when the connections actually hold.
- Hannaneh Hajishirzi When a language model should reach for an external knowledge source — and when its own parametric memory is enough.
- Igor Mordatch What it means to evaluate a language model that takes consequential actions through its text — not only producing answers but operating in environments that respond.
- Ion Stoica The distributed-compute infrastructure that almost every modern LLM is either trained on, served from, or evaluated through.
- Iryna Gurevych Building the encoder infrastructure that makes "find sentences similar to this query" a fast and reliable operation across languages and tasks.
- James Zou What evaluation looks like when the AI being evaluated has to clear regulatory bars before deployment — and what general LLM evaluation should learn from a field that has been doing this for years.
- Jamie Callan What retrieval-augmented generation looks like when you actually know the thirty-year history of information retrieval it's reinventing.
- Jared Kaplan Whether language-model loss is a smooth function of compute, data, and model size — and what that smoothness lets you predict (and not predict) about capabilities at larger scales.
- Jennifer Wortman Vaughan How people actually understand and act on language-model evaluation results, and where the gap opens between what was measured and what gets believed.
- Jesse Dodge What has to be in a paper about a language model for another lab to be able to verify the result.
- Ji-Rong Wen Organizing the LLM literature into something other Chinese-language NLP researchers can actually navigate from inside the academic ecosystem.
- Jian-Guang Lou Whether a language model that produces code actually produces code that does what the request asked for — and what evaluation looks like when correctness has a definite answer for once.
- Jian-Yun Nie How search behaves differently when the query is a conversation in a non-English language — and how language models are changing that picture.
- Jianfeng Gao What large language models are as a class of systems — taxonomically, architecturally, and in terms of what they actually inherit from the longer history of neural NLP.
- Jie Zhou Whether language models can be trusted to evaluate other language models in production NLP pipelines.
- Jimmy Lin Making information retrieval reproducible enough that an LLM researcher and an IR researcher can run the same experiment and get the same answer.
- Jimmy Xiangji Huang How information retrieval scales over the messy operational data that real organizations hold, as opposed to the clean benchmark corpora the field actually publishes on.
- Jindong Wang Standardizing what "evaluating an LLM" even means as a research procedure.
- Jochen Wirtz What happens when customer-facing service AI replaces, augments, or competes with human service workers — and what kinds of evaluation that change actually requires.
- Jonathan Berant Whether a language model is actually doing the reasoning steps needed to answer a question — or just producing an answer that happens to be right.
- Juanzi Li Combining structured knowledge from knowledge graphs with the statistical patterns language models learn from text — and what that combination buys you for evaluation.
- Julian McAuley Whether language models trained on the open web are already doing recommendation — and what that implies for products that compete with traditional recommenders.
- Junichi Yamagishi Whether human listeners — or automated detectors — can tell AI-generated speech apart from real human speech, and how that distinguishability changes as speech synthesis improves.
- Jure Leskovec What graph structure adds to the kinds of reasoning and retrieval problems language models currently handle without it — and what gets missed when relationships in data are flattened into text.
- Kevin Chen-Chuan Chang Organizing the rapidly growing literature on language-model reasoning into something a researcher new to the area can actually navigate.
- Kyle Lo What's actually in a training corpus once you sit down and look at it document by document.
- Luke Zettlemoyer Whether open-weight language models can be built at frontier scale and whether their factuality can be measured at fine resolution.
- Maarten de Rijke Whether information retrieval should be done by a system that "writes the document ID" instead of by one that searches a vector index.
- Maarten Sap Whether language models can produce or reason about social knowledge with the same competence they show on factual tasks — and what's at stake when they can't.
- Maosong Sun What the open-LLM ecosystem looks like when it grows out of Chinese academic NLP rather than out of Western non-profits like AI2.
- Marco Baroni Whether neural language models can compose what they've learned into new combinations they've never seen — or whether they're really only doing sophisticated interpolation within their training distribution.
- Mari Ostendorf What language technology looks like when you've been responsible for it as deployable engineering for thirty years before LLMs arrived to redo the field.
- Mark Gales Detecting hallucinations in a language model without any access to its weights or to ground truth.
- Martin Potthast Whether language models can be used to judge whether a document is relevant to a query — and what changes when they replace human assessors in that role.
- Mengnan Du What kinds of explanations can be obtained for language-model outputs — and which of those explanations turn out to be reliable.
- Michihiro Yasunaga Whether a language model can correctly translate a natural-language question into a precise structured query — and what that translation reveals about reasoning over knowledge.
- Minlie Huang Whether the categories of "harmful" used to evaluate language-model safety transfer from Western to Chinese-language deployment contexts, where the regulatory frame and cultural categories are different.
- Mohit Bansal Whether the methods we use to evaluate language-only models still work when the same model has to handle images, speech, or other modalities at the same time — and what fails first when modalities are combined.
- Nan Duan Rewriting the user's query before retrieval so that the retriever has a chance of returning useful documents.
- Nathan Lambert What happens to a language model between "trained on the internet" and "answering your question the way it does" — and how to study that step in public.
- Noah A. Smith Whether a language model can produce its own training data — and what the methodological consequences are when that becomes the standard practice.
- Norbert Fuhr What information retrieval evaluation has been getting wrong, in writing, for the last several decades — and why each new generation of researchers makes the same mistakes.
- Omer Levy Whether a language model can infer the task from examples alone — without being told what to do — and what that ability reveals about how it represents instructions.
- Pang Wei Koh What happens when a language model encounters the kind of data it didn't see during training — and how to measure that gap rigorously, not just notice it after deployment.
- Paolo Rosso Building evaluation campaigns that work for Iberian-Romance languages — Spanish, Catalan, Portuguese — instead of porting English-centric methodology and accepting the resulting blind spots.
- Pascale Fung Whether the same language model performs the same kind of work across different languages — and whether evaluation methodology that's built for English is misleading us about what models do in everything else.
- Percy Liang How to measure language models so a measurement made today still means something next year.
- Peter Henderson Where the technical findings about language-model behavior actually matter — in audits, regulation, and legal liability — and what the gap between "we measured this" and "this changes what's allowed" looks like.
- Philip S. Yu Bridging four decades of data-mining methodology to the question of how to evaluate large language models without reinventing techniques the field already has.
- Prateek Mittal How visual inputs become a new attack surface for safety-aligned language models that accept multimodal queries.
- Qiang Yang Whether useful machine learning can happen when the data you'd train or evaluate on can't be moved to a single place — for legal, privacy, or commercial reasons.
- Quoc V. Le The architectural and training-recipe building blocks that the modern LLM era was built on top of.
- Rishi Bommasani What language-model developers are and aren't telling us about their own models — and how to measure that systematically.
- Roi Reichart What "domain" means for a language model — when its training distribution stops matching its deployment context — and how to evaluate that mismatch rigorously.
- Seungone Kim Whether the field can have an open-weight LLM-evaluator alternative, so that "one black box judging another black box" stops being the only available option.
- Shafiq Joty Evaluation methodology for language models when the deployment context is enterprise software rather than a research demo.
- Shayne Longpre What's actually inside the data language models train on — and who can or can't tell.
- Shinji Watanabe The open-source speech-processing infrastructure that lets academic and industrial groups train, evaluate, and compare voice-input or voice-output language systems on the same footing.
- Shuming Shi What "language model hallucination" looks like when you're responsible for shipping LLM-powered products to a billion-user surface.
- Steven Schockaert How to evaluate a retrieval-augmented system end-to-end when each part of it can fail in different ways.
- Tatsunori Hashimoto Whether the numbers reported about language models are statistical findings or methodological artifacts.
- Tom Mitchell Whether a language model has an internal representation of whether it's telling the truth — separable from what it actually outputs.
- Torsten Hoefler Generalizing language-model reasoning beyond linear chains of thought — into branching, backtracking, and recombination of intermediate reasoning steps.
- Tushar Khot What "reasoning ability" actually means as something you can put on a benchmark, and what changes when you also use structured prompting to coach the model to reason.
- Wayne Xin Zhao The synthesis side of LLM research — what it takes to read every paper of the moment and produce something other researchers can navigate.
- Weijia Shi Making retrieval-augmented generation work when the language model itself is a closed box you can't fine-tune.
- Wen-tau Yih Making the retrieval step inside retrieval-augmented systems good enough that the generation step has something to work with.
- Wenjie Li How generative retrieval relates to the other generation tasks — summarization, question answering — that the same system architecture has to handle.
- Xia Hu Whether the academic state of the art in language models can be turned into something a practitioner — not a research lab — can actually deploy and trust.
- Xing Xie Connecting the recommender-systems tradition of measuring user-facing AI behavior with the new evaluation challenges modern LLMs pose.
- Xipeng Qiu Building an open Chinese-language large language model that the academic community can actually study under the hood — and the tooling around it.
- Xueqi Cheng What information retrieval research from inside the Chinese IR tradition has been arguing — and how its angle on generative retrieval differs from the Western canon.
- Yang Liu Whether a more capable language model can be used to grade the outputs of a less capable one — and how to do that without fooling yourself.
- Yann LeCun What machine-learning systems should look like, set against what they currently are.
- Yarin Gal Quantifying when language models don't know what they're saying.
- Yejin Choi What it would take for a language model to have the thing humans have in spades and LLMs reliably lack: common sense.
- Yoav Goldberg Whether the chain of reasoning a language model produces is the chain of reasoning it actually followed.
- Yoav Shoham What can be measured about the state of AI from outside any single company — and what retrieval-augmentation looks like when you don't need to retrain anything.
- Yonatan Belinkov What's actually inside a language model's hidden representations — and which of those internal states map onto things humans would recognize as knowledge.
- Yue Zhang Organizing what the field calls "hallucination" into categories that actually mean different things.
- Yulia Tsvetkov What bias and harm look like in language-model outputs across languages, especially the languages and communities the field's standard evaluation has historically ignored.
- Zhaochun Ren Whether a language model can be trusted with the job that's currently done by an information retrieval system — and which parts of that job it actually does well.
- Zhiting Hu Treating a language model as one component in a larger planning system — instead of asking it to do reasoning end-to-end in its own head.