Tom Mitchell
Whether a language model internally represents the truth of what it says, separately from what it actually outputs.
Mitchell wrote the textbook the field grew up on (Machine Learning, 1997), founded the first Machine Learning Department at Carnegie Mellon in 2006, and in 2023 returned to first principles with a result that complicates the standard story about LLM hallucination. In his paper with Amos Azaria, a simple classifier trained on an LLM's hidden activations predicts whether a statement the model reads or generates is true or false, with accuracy well above chance. The model, in some readable sense, "knows" when a statement is false, yet its surface behavior does not always reflect that knowledge.
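To make the probing idea concrete, here is a minimal sketch on purely synthetic data. Everything in it is an assumption for illustration: the "activations" are random vectors with a planted truth direction rather than real LLM hidden states, and the logistic-regression probe is a stand-in for a simple classifier, not necessarily the one used in the paper. The helper names (`make_synthetic_activations`, `train_probe`, `accuracy`) are hypothetical.

```python
import math
import random


def make_synthetic_activations(n, dim, rng, signal=1.0):
    """Return (vectors, labels). Synthetic stand-in for LLM hidden states:
    true statements (label 1) are shifted along a fixed direction,
    false statements (label 0) are shifted the opposite way."""
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    scale = signal / math.sqrt(dim)  # makes the planted direction unit length
    vectors, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vectors.append([sign * scale * d + rng.gauss(0.0, 1.0) for d in direction])
        labels.append(label)
    return vectors, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(vectors, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(vectors[0]), len(vectors)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, vectors, labels):
    """Fraction of examples where the probe's prediction matches the label."""
    correct = 0
    for x, y in zip(vectors, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(vectors)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_synthetic_activations(400, 8, rng)
test_x, test_y = make_synthetic_activations(400, 8, rng)
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

The point of the sketch is only the shape of the experiment: the probe never sees the statement text, only the vectors, so any above-chance held-out accuracy must come from information carried in the representation itself. With real models, the vectors would be hidden-layer activations recorded while the model processes each labeled statement.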