Jimmy Lin — Whom to read in AI

Lin's Anserini and Pyserini toolkits have become a standard reproducibility layer for retrieval research: when an NLP paper reports a BM25 baseline number, that number often comes from his group's code. The same posture carries into HyDE (Hypothetical Document Embeddings, 2023, co-author). The method prompts a language model to generate a hypothetical answer document for the query, embeds that document with a dense encoder, and retrieves the real documents nearest to that embedding. It is simple enough that other groups can replicate it without negotiating over implementation details. Reading Lin is a good way to learn which retrieval results in current LLM papers rest on documented, reproducible baselines and which depend on undocumented ones.
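The HyDE pipeline described above (generate, embed, retrieve) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `generate_hypothetical_document` is a hypothetical stand-in that returns a canned answer instead of calling a language model, `embed` is a bag-of-words counter standing in for a dense encoder, and the three-document corpus is invented for the example.

```python
import math
import re
from collections import Counter

# Invented toy corpus (assumption: not from any real collection).
CORPUS = {
    "d1": "BM25 is a ranking function that scores documents using "
          "term frequency and inverse document frequency.",
    "d2": "Dense retrievers encode queries and documents into vectors "
          "and compare them by inner product.",
    "d3": "Sourdough bread needs a long fermentation.",
}


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical_document(query):
    """Hypothetical stand-in for the LLM step: returns a canned answer.

    A real system would prompt a language model here. Falling back to the
    query itself reduces the pipeline to ordinary query-side retrieval.
    """
    canned = {
        "how does bm25 rank results?":
            "BM25 ranks documents by combining term frequency, inverse "
            "document frequency, and document length normalization.",
    }
    return canned.get(query.lower(), query)


def hyde_search(query, corpus, k=1):
    """Embed the hypothetical document, then rank real documents against it."""
    hypothetical = generate_hypothetical_document(query)
    query_vector = embed(hypothetical)
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(query_vector, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(hyde_search("How does BM25 rank results?", CORPUS, k=2))
```

The point of the sketch is the control flow: the query is never embedded directly, only the generated document is. Swapping in a real language model and a real encoder changes the two stand-in functions and leaves `hyde_search` as it is.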

Worth following when: you want to know whether a retrieval-related result in a paper is reproducible, and how to set up your own experiments so that they are too.
Topics: dense retrieval and zero-shot retrieval via LLM-generated hypothetical documents; reproducibility infrastructure for IR research; the bridge between the IR and neural NLP communities.
Key works: Anserini IR toolkit (2017, lead); Pyserini (2021, lead); HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels (2023, co-author).