Yonatan Belinkov
What's actually inside a language model's hidden representations — and which of those internal states map onto things humans would recognize as knowledge.
Long before mechanistic interpretability became a mainstream subfield, Belinkov was running probing experiments on neural language models: training small classifiers on a model's hidden states to test whether those states encode part-of-speech, syntactic structure, semantic roles, or factual knowledge. The early results were both reassuring and sobering. Models often did encode linguistic structure in measurable ways, but the encoding was distributed and entangled, which contradicted the assumption that a single layer or unit "represents" a single concept. That line of work continues into his current LLM-era research, where the question is whether the same probing techniques scale to models a thousand times larger.
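To make the probing idea concrete, here is a minimal sketch of a linear probe in plain Python. It is not Belinkov's setup. The "hidden states" are synthetic vectors in which a binary label is deliberately spread across all units, and every name and parameter (`make_states`, `train_probe`, `signal`, the dimensions) is a hypothetical choice for illustration. It shows a small classifier trained on all units next to the best rule that reads a single unit.

```python
import math
import random


def make_states(n, dim, signal, rng):
    """Synthetic stand-in for hidden states (an assumption, not real model data).

    The binary label shifts every unit by the same small amount, so the
    information is distributed rather than stored in one unit.
    """
    shift = signal / math.sqrt(dim)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * shift for _ in range(dim)])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(xs, ys, epochs=50, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(w, b, x)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    hits = sum((score(w, b, x) > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


def best_single_unit_accuracy(xs, ys):
    """Best accuracy of any rule of the form 'unit j > 0 means label 1'."""
    dim = len(xs[0])
    return max(
        sum((x[j] > 0) == (y == 1) for x, y in zip(xs, ys)) / len(xs)
        for j in range(dim)
    )


rng = random.Random(0)
train_x, train_y = make_states(600, 16, 2.0, rng)
test_x, test_y = make_states(400, 16, 2.0, rng)

w, b = train_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)
unit_acc = best_single_unit_accuracy(test_x, test_y)

print(f"probe on all units: {probe_acc:.2f}")
print(f"best single unit:   {unit_acc:.2f}")
```

In this toy setting the probe over all units should recover the label far more reliably than any single unit, which is one simple way a representation can be measurable yet distributed. Real probing studies use hidden states extracted from a trained model and annotated linguistic labels in place of the synthetic data here.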