Marco Baroni — Whom to read in AI

Baroni's group built SCAN (2018) as a controlled test of compositional generalization: a small synthetic language in which a model must combine verbs and modifiers in patterns absent from training. Neural sequence-to-sequence models, the dominant architecture at the time, failed spectacularly: they generalized to one short held-out combination but broke on slightly longer ones. The finding has held up: subsequent papers have reported similar compositional-generalization failures in modern LLMs at scale, so the question Baroni put on the table in 2018 has not been answered by scale alone.
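
To make the setup concrete, here is a toy sketch of a held-out-combination split in the spirit of SCAN. It is not the SCAN grammar or data: the vocabulary and the names `PRIMITIVES`, `MODIFIERS`, `interpret`, and `make_split` are assumptions made up for illustration. One verb appears in training only on its own, and the test set asks for that verb combined with modifiers.

```python
from itertools import product

# Toy vocabulary, loosely modeled on SCAN-style commands (hypothetical, not the real grammar).
PRIMITIVES = {"walk": "WALK", "run": "RUN", "look": "LOOK", "jump": "JUMP"}
MODIFIERS = {"": 1, "twice": 2, "thrice": 3}


def interpret(command: str) -> str:
    """Map a command such as 'jump twice' to its action string 'JUMP JUMP'."""
    verb, _, modifier = command.partition(" ")
    return " ".join([PRIMITIVES[verb]] * MODIFIERS[modifier])


def make_split(held_out_verb: str = "jump"):
    """Hold out every modified use of one verb; the bare verb stays in training."""
    commands = [f"{v} {m}".strip() for v, m in product(PRIMITIVES, MODIFIERS)]
    test = [c for c in commands if c.startswith(held_out_verb + " ")]
    train = [c for c in commands if c not in test]
    return train, test


train, test = make_split()
print("train:", [(c, interpret(c)) for c in train])
print("test: ", [(c, interpret(c)) for c in test])
```

A model that has learned the rule "X twice means X X" from the other verbs should handle the held-out commands; a model that has only memorized seen pairs will not. That gap is what a split of this kind is meant to expose.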

Worth following when: you need to assess whether a model's reasoning is genuinely compositional or merely a sophisticated form of in-distribution pattern matching.
Topics: compositional-generalization benchmarks (SCAN and successors); distributional semantics from a linguistic angle; what LLM scale does and doesn't fix about fundamental generalization gaps.
Key works: SCAN compositional-generalization benchmark (2018, senior author); long-arc work on distributional semantics from UPF Barcelona; ACL test-of-time award publications.