Jimmy Xiangji Huang — Whom to read in AI

Most academic IR research runs on web-scale or news-scale collections: corpora where the retrieval problem is well-defined and the documents are reasonably clean. Huang's IR&KM Lab at York has spent two decades on a case closer to practice: retrieval over the mixed, heterogeneous, sometimes structured data that enterprises hold, where the failure modes look unlike anything a benchmark captures. His recent LLM-evaluation work extends the same lens, asking how retrieval-augmented language models behave when the corpus they retrieve from is a repository an organization actually runs on rather than a curated research collection.

Worth following when: you want IR research that takes seriously the gap between benchmark corpora and the data RAG systems encounter once deployed.
Topics: information retrieval over heterogeneous and enterprise-scale data; the corpus-side conditions of RAG behavior; systematic evaluation of LLM-IR hybrids on non-web data.
Key works: body of work on IR for big-data and enterprise corpora (2000s onward, York IR&KM Lab); ongoing systematic LLM-IR evaluation publications.