The literature on explaining individual LLM outputs has grown faster than the methods for validating those explanations against ground truth. Du's "Explainability for Large Language Models: A Survey" (2024, senior author with collaborators) sorts that landscape into the categories that have come to matter: attention-based, gradient-based, perturbation-based, and natural-language explanations that the model generates about itself, each with documented strengths and failure modes. The survey is particularly useful where most other surveys stop short: its case-by-case discussion of when an explanation method actively misleads rather than merely under-informs.

Worth following when
you need to choose an explainability method for an LLM and want to know which options have been independently validated.
Topics
taxonomy of LLM explainability methods; self-explanations by LLMs and their reliability; the methodology gap between proposing and validating explanations.
Key works
"Explainability for Large Language Models: A Survey" (2024, senior author); ongoing TMLR Lab publications on trustworthy and explainable LLMs.