Self-Instruct (2022, senior author) showed that an LLM can generate the instruction-tuning data needed to improve itself: prompt the model with a few seed (instruction, input, output) examples, ask it to generate more in the same pattern, filter for quality, and use the result as training data. The technique was adopted quickly across the industry: many current instruction-tuned LLMs were trained on data of this kind, much of it generated by GPT-3 or GPT-4. The methodological consequence for evaluation is uncomfortable. When models trained on synthetic data from older models are evaluated by judges that are also language models, the evaluation loop becomes increasingly endogenous to the family of systems being measured.
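The generate-and-filter loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published pipeline: `toy_generate` is a hypothetical stand-in for a real LLM call, the word-overlap check is a simple substitute for a proper similarity filter, and the names `Task`, `keep`, `self_instruct`, and the `max_overlap` value of 0.7 are chosen here for illustration.

```python
import random
import re
from typing import Callable, NamedTuple


class Task(NamedTuple):
    instruction: str
    input: str
    output: str


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two strings."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def keep(candidate: Task, pool: list[Task], max_overlap: float) -> bool:
    """Quality filter: drop empty fields and near-duplicate instructions."""
    if not candidate.instruction.strip() or not candidate.output.strip():
        return False
    return all(
        word_overlap(candidate.instruction, t.instruction) < max_overlap
        for t in pool
    )


def self_instruct(
    seeds: list[Task],
    generate: Callable[[list[Task]], list[Task]],
    rounds: int,
    shots: int = 2,
    max_overlap: float = 0.7,  # assumed threshold for this sketch
    seed: int = 0,
) -> list[Task]:
    """Grow a task pool from seeds; return only the newly accepted tasks."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        # Sample a few in-context examples from the current pool.
        examples = rng.sample(pool, min(shots, len(pool)))
        for candidate in generate(examples):
            if keep(candidate, pool, max_overlap):
                pool.append(candidate)
    return pool[len(seeds):]


SEEDS = [
    Task("Translate the sentence into Spanish.", "Good morning.", "Buenos días."),
    Task("Classify the sentiment of the review.", "I loved it.", "positive"),
]

# Canned "model outputs" so the sketch runs without any API access.
CANNED = [
    Task("Translate the sentence into Spanish please.", "Hello.", "Hola."),
    Task("Summarize the paragraph in one sentence.", "Some text.", ""),
    Task("List three synonyms for the word.", "happy", "glad, cheerful, content"),
]


def toy_generate(examples: list[Task]) -> list[Task]:
    """Hypothetical stand-in for an LLM call; ignores the prompt examples."""
    return CANNED


new_tasks = self_instruct(SEEDS, toy_generate, rounds=2)
print(new_tasks)
```

In this toy run the near-duplicate translation task and the task with an empty output are filtered out, and the second round adds nothing because the surviving task is already in the pool. Replacing `toy_generate` with a real model call is where the endogeneity concern enters: the same family of models then shapes both the training data and, later, the judgments made about it.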

Worth following when
you want to think carefully about evaluation methodology in an era when training data, judges, and the systems being evaluated all come from related model lineages.
Topics
self-generated instruction-tuning data (Self-Instruct); the methodological consequences of LLM-generated training data; the long arc of structured-prediction NLP.
Key works
Self-Instruct (2022, senior author); a long body of structured-prediction and probabilistic NLP work; ongoing UW and AI2 publications on responsible LLM development.