Daxin Jiang
How to pre-train an encoder so that the embeddings it produces are good for retrieval — without ever supervising on a retrieval task.
The standard practice for dense-retrieval models was to start from a generic encoder (BERT, RoBERTa) and fine-tune it on retrieval-specific objectives, such as contrastive losses over query-document pairs. SimLM (2022, with Jiang as senior author) proposed a different pre-training step that produced encoders already biased toward retrieval-useful representations, so that subsequent fine-tuning needed far less labelled data to match state-of-the-art quality. The methodological consequence lands on a question ai100 has to consider: the embedding step, which decides which documents the retriever sees as similar, is a function of pre-training choices that most papers never describe.
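To make the fine-tuning objective mentioned above concrete, here is a minimal sketch of a contrastive loss with in-batch negatives over query-document pairs. It is an illustration under stated assumptions, not SimLM's pre-training objective or any paper's exact recipe: the function name, the toy two-dimensional embeddings, and the temperature value are all hypothetical, and a real system would compute the embeddings with an encoder rather than write them by hand.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(queries, docs, temperature=0.05):
    """Mean contrastive (InfoNCE-style) loss over a batch.

    docs[i] is treated as the positive for queries[i]; every other
    document in the batch serves as a negative. The temperature is an
    assumed illustrative value, not one taken from the source.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
shuffled_docs = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(queries, aligned_docs)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled_docs)
```

The loss is near zero when each query embedding sits closest to its own document and large when the pairing is wrong. An encoder whose pre-training already places matching queries and documents close together starts fine-tuning nearer the low-loss regime, which is the intuition behind needing less labelled data.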