The standard practice for dense-retrieval models was to start from a generic encoder (BERT, RoBERTa) and fine-tune it on retrieval-specific objectives: query-document pairs, contrastive losses. SimLM (2022, with Jiang as senior author) proposed a different pre-training step that produced encoders already biased toward retrieval-useful representations, so that subsequent fine-tuning needed far less labelled data to match state-of-the-art quality. The methodological consequence bears directly on a question ai100 has to consider: the embedding step that decides which documents the retriever treats as similar is a function of pre-training choices that most papers never describe.
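To make the fine-tuning side concrete, here is a minimal sketch of an in-batch contrastive loss over query-document pairs, the kind of retrieval-specific objective mentioned above. It is a generic illustration, not SimLM's pre-training objective. The function names, the toy vectors, and the temperature value are all assumptions chosen for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE-style loss (hypothetical helper for illustration).

    doc_vecs[i] is treated as the positive for query_vecs[i]; every other
    document in the batch serves as a negative.
    """
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (assumed): each query points near its own document.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = [[0.9, 0.1], [0.1, 0.9]]
swapped_docs = [toy_docs[1], toy_docs[0]]

loss_aligned = in_batch_contrastive_loss(toy_queries, toy_docs)
loss_swapped = in_batch_contrastive_loss(toy_queries, swapped_docs)
print(loss_aligned < loss_swapped)
```

The loss is low when each query already sits closest to its own document and high when the pairing is scrambled. An encoder whose pre-training puts it near the low-loss configuration has less to learn from labelled pairs, which is the intuition behind retrieval-aware pre-training.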

Worth following when
you want to understand the pre-training-side decisions that shape what a dense retriever can and cannot find.
Topics
retrieval-aware pre-training methodology; the gap between generic encoders and retrieval-tuned ones; what pre-training choices determine in downstream RAG behavior.
Key works
SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval (2022, senior author); broader pre-LLM-era work on commercial-scale web NLP at Bing/Cortana.