Longpre coordinates the Data Provenance Initiative, a collective audit of the licensing, sourcing, and legal status of the datasets behind modern LLMs. The project tackles a question that turns out to be much harder than it sounds: when a model produces a sentence, what chain of evidence shows that the building blocks of that sentence were in its training set at all? His earlier FLAN dataset work helped seed the instruction-tuning era. The current focus is closer to ai100's transparency line: knowing what's upstream of an answer, not just what the answer looks like.

Worth following when
you need to make defensible claims about what a model "knows" or "was trained on" for legal, audit, or product-positioning reasons.
Topics
dataset licensing and provenance; the data lifecycle of generative models; what a real transparency report on training corpora should contain.
Key works
Data Provenance Initiative (2023, ongoing, lead); FLAN dataset contributions (2021); audit reports on training-data licensing changes (2024, ongoing).