Shayne Longpre
What's actually inside the data language models train on — and who can or can't tell.
Longpre coordinates the Data Provenance Initiative, a collective audit of the licensing, sourcing, and legal status of the datasets behind modern LLMs. The project takes on a question that turns out to be much harder than it sounds: when a model produces a sentence, what is the chain of evidence that the building blocks of that sentence were even in its training set? His earlier FLAN dataset work helped seed the instruction-tuning era. The current focus is closer to ai100's transparency line: knowing what's upstream of an answer, not just what the answer looks like.