Jimmy Xiangji Huang
How information retrieval scales over the messy operational data that real organizations hold, as opposed to the clean benchmark corpora the field actually publishes on.
Most academic IR research runs on web-scale or news-scale collections, corpora where the retrieval problem is well-defined and the documents are reasonably clean. Huang's IR&KM Lab at York has spent two decades on a more representative case: retrieval over the kind of mixed, heterogeneous, sometimes structured data that enterprises hold, where the failure modes look different from anything a benchmark captures. His recent LLM-evaluation work extends the same lens, asking how retrieval-augmented language models behave when the corpus they retrieve from is the kind of repository a working organization runs on rather than a curated research collection.