Mengnan Du
What kinds of explanations can be obtained for language-model outputs, and which of those explanations turn out to be reliable.
The literature on explaining individual LLM outputs grew faster than the methods for validating those explanations against ground truth. Du's "Explainability for Large Language Models: A Survey" (2024, senior author with collaborators) sorts that landscape into the categories that have come to matter: attention-based, gradient-based, perturbation-based, and natural-language explanations the model generates about itself. Each has known strengths and well-documented failure modes. The survey is particularly useful in the part most other surveys skip: case-by-case discussion of when an explanation method actively misleads rather than merely under-informs.