ALCE (2023, with Chen as senior author) is the benchmark that put the question on the table: when an LLM produces text with citations, do the cited sources actually support the claims they are attached to, or are the citations decorative? The framework decomposes attribution into measurable components: citation recall (each statement is fully supported by the passages it cites), citation precision (each citation actually contributes to that support rather than being irrelevant), and the interaction between the two. It shows that the LLMs evaluated vary enormously on which of these they get right. For ai100, which audits whether AI engines mention brands with appropriate sourcing, this is the closest precedent for the methodology: attribution as a separate evaluation axis from raw answer accuracy.
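To make the two metrics concrete, here is a minimal Python sketch under stated assumptions, not ALCE's actual implementation. It assumes a statement counts toward recall when it has at least one citation and the combined cited passages support it, and that a citation counts as irrelevant when it does not support the statement on its own and the remaining citations suffice without it. The names `citation_scores` and `toy_entails`, and the brand data, are hypothetical; in a real setup the word-overlap judge would be replaced by an entailment (NLI) model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an entailment judge: (premise, claim) -> bool.
Entails = Callable[[str, str], bool]


def citation_scores(
    statements: List[Tuple[str, List[str]]],
    passages: Dict[str, str],
    entails: Entails,
) -> Tuple[float, float]:
    """Return (recall, precision) for statements paired with cited passage ids."""
    supported_statements = 0
    useful_citations = 0
    total_citations = 0

    for claim, cited_ids in statements:
        total_citations += len(cited_ids)
        joined = " ".join(passages[i] for i in cited_ids)
        # Recall: the claim has citations and their combined text supports it.
        fully_supported = bool(cited_ids) and entails(joined, claim)
        if not fully_supported:
            continue  # citations on unsupported claims earn no precision credit
        supported_statements += 1
        for cid in cited_ids:
            alone = entails(passages[cid], claim)
            rest = " ".join(passages[i] for i in cited_ids if i != cid)
            rest_suffices = bool(rest) and entails(rest, claim)
            # Irrelevant: does not support the claim alone, and the other
            # citations still support the claim without it.
            if alone or not rest_suffices:
                useful_citations += 1

    recall = supported_statements / len(statements) if statements else 0.0
    precision = useful_citations / total_citations if total_citations else 0.0
    return recall, precision


def toy_entails(premise: str, claim: str) -> bool:
    # Toy judge (assumption): every word of the claim appears in the premise.
    return set(claim.lower().split()) <= set(premise.lower().split())


# Made-up example data for illustration only.
passages = {
    "p1": "acme was founded in 1999",
    "p2": "acme sells anvils",
    "p3": "the weather is mild",
}
statements = [
    ("acme sells anvils", ["p2", "p3"]),    # p3 is a decorative citation
    ("acme was founded in 1999", ["p1"]),   # cited and supported
    ("acme is profitable", ["p1"]),         # cited but unsupported
]

recall, precision = citation_scores(statements, passages, toy_entails)
print(f"recall={recall:.2f} precision={precision:.2f}")
# prints: recall=0.67 precision=0.50
```

In this toy run, recall drops because one claim carries a citation that does not support it, and precision drops both for that citation and for the decorative `p3`. That is the distinction between citations as decoration and citations as evidence that the profile points to.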

Worth following when
you want to evaluate whether a model's citations actually do the work they appear to be doing, or are just stylistic markers attached to plausible-sounding answers.
Topics
automatic evaluation of LLM citation behavior (ALCE); attribution precision and recall as separate metrics; the gap between citations as decoration and citations as evidence.
Key works
Enabling Large Language Models to Generate Text with Citations (the ALCE paper; 2023, senior author); Dense Passage Retrieval for Open-Domain Question Answering (2020, co-author with Yih); Princeton NLP publications on retrieval-augmented LMs.