Baroni's group built SCAN (2018) as a controlled test of compositional generalization: a small synthetic language where the model has to combine verbs and modifiers in patterns absent from training. Neural sequence-to-sequence models, then the dominant architecture, failed badly: they generalized to a short held-out combination but broke on slightly longer ones. The finding has held up: subsequent papers report similar compositional-generalization failures in modern LLMs at scale, so the question Baroni raised in 2018 has not been answered by scale alone.
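
To make "patterns absent from training" concrete, here is a minimal sketch of a SCAN-style held-out split. It is a toy, not the actual SCAN grammar: the vocabulary, the `interpret` function, and the `build_split` helper are all assumptions made up for illustration.

```python
from itertools import product

# Toy vocabulary loosely modeled on SCAN's style (an assumption, not the real grammar).
VERBS = {"walk": "WALK", "run": "RUN", "look": "LOOK", "jump": "JUMP"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret(command: str) -> list[str]:
    """Map a command such as 'jump twice' to its action sequence."""
    words = command.split()
    action = VERBS[words[0]]
    count = REPEATS[words[1]] if len(words) > 1 else 1
    return [action] * count


def build_split(held_out_verb: str = "jump"):
    """Hold out every 'verb + modifier' command for one verb; train on the rest."""
    commands = list(VERBS) + [f"{v} {m}" for v, m in product(VERBS, REPEATS)]
    test = [c for c in commands if c.startswith(held_out_verb + " ")]
    train = [c for c in commands if c not in test]
    return train, test


train, test = build_split()
print(train)  # includes bare "jump" and "walk twice", but never "jump twice"
print(test)   # the held-out combinations of "jump" with a modifier
print(interpret("jump twice"))
```

In this setup the training set contains "jump" on its own and "twice" with every other verb, so a learner that has extracted the rule for "twice" should handle "jump twice" without having seen it. Whether a model does so is the kind of question a compositional split is meant to probe.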

Worth following when
you need to assess whether a model's reasoning is genuinely compositional or merely a sophisticated form of in-distribution pattern matching.
Topics
compositional-generalization benchmarks (SCAN and successors); distributional semantics from a linguistic angle; what LLM scale does and doesn't fix about fundamental generalization gaps.
Key works
SCAN compositional-generalization benchmark (2018, senior author); long-arc work on distributional semantics from UPF Barcelona; ACL test-of-time award publications.