Yonatan Belinkov — Whom to read in AI

Long before mechanistic interpretability became a mainstream subfield, Belinkov was running probing experiments on neural language models: training small classifiers to predict part-of-speech tags, syntactic structure, semantic roles, or factual knowledge from the model's hidden states, and reading high probe accuracy as evidence that the information is encoded there. The early results were both reassuring and sobering: models often did encode linguistic structure in measurable ways, but the encoding was distributed and entangled in patterns that contradicted the assumption that a single layer or unit "represents" a single concept. The line continues into his current LLM-era work, where the question is whether the same probing techniques scale to models a thousand times larger.
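
As a rough illustration of the probing recipe described above (not a reconstruction of Belinkov's actual experiments), the sketch below trains a logistic-regression probe on synthetic vectors that stand in for hidden states. Everything here is an assumption made for illustration: the data generator, the 16-unit dimensionality, the signal strength, and the function names are hypothetical, and a real probe would read activations from a trained model instead.

```python
import math
import random


def make_states(n, rng, direction):
    """Hypothetical stand-in for hidden states: the binary label is spread
    weakly across every unit instead of being stored in a single one."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        # Each unit carries a small label-dependent shift plus unit-variance noise.
        states.append([sign * 0.5 * d + rng.gauss(0.0, 1.0) for d in direction])
        labels.append(y)
    return states, labels


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, epochs=20, lr=0.05):
    """Fit a linear (logistic-regression) probe with plain SGD."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, y in zip(states, labels):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            for i in range(dim):
                w[i] -= lr * g * h[i]
            b -= lr * g
    return w, b


def accuracy(w, b, states, labels):
    correct = 0
    for h, y in zip(states, labels):
        z = sum(wi * hi for wi, hi in zip(w, h)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


rng = random.Random(0)
DIM = 16  # assumed size of the toy "hidden state"
direction = [rng.choice([-1.0, 1.0]) for _ in range(DIM)]
train_x, train_y = make_states(800, rng, direction)
test_x, test_y = make_states(400, rng, direction)

# Probe that sees the whole representation.
w, b = train_probe(train_x, train_y)
full_acc = accuracy(w, b, test_x, test_y)

# Probes that each see only one unit of the representation.
unit_accs = []
for i in range(DIM):
    wi, bi = train_probe([[h[i]] for h in train_x], train_y)
    unit_accs.append(accuracy(wi, bi, [[h[i]] for h in test_x], test_y))
best_unit_acc = max(unit_accs)

print(f"probe on all {DIM} units: {full_acc:.2f}")
print(f"best single-unit probe:  {best_unit_acc:.2f}")
```

Under these assumptions the full probe should outperform every single-unit probe, which mirrors the point that information can be recoverable from a representation without being localized in one unit. High probe accuracy on its own does not show that the model actually uses the information, so real studies typically pair probes with baselines or control comparisons.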

Worth following when: you want to ground claims about what a language model "knows" in something more measurable than its output behavior.
Topics: probing classifiers for linguistic and factual knowledge in LMs; the structure of distributed internal representations; the BlackboxNLP tradition.
Key works: foundational probing experiments on neural LMs (2017 onward); "Linguistic Knowledge and Transferability of Contextual Representations" (2019, co-author); co-organizing the BlackboxNLP workshop series.