Kyle Lo — Whom to read in AI

Most discussion of LLM training data happens at the headline level: "trained on the web", "trained on Common Crawl". Lo's work on Dolma, the corpus behind OLMo, is what that conversation looks like when someone actually opens the file: documented decisions about which Common Crawl snapshots to include, which deduplication thresholds and filtering heuristics to apply, and which document types to keep, all written up at a level of detail a commercial lab would treat as confidential. SciBERT, earlier in his career, made the equivalent move for scientific text: build a domain-specific vocabulary, pretrain on a documented corpus of scientific papers, then ship the artifact so others can replicate the work.

Worth following when: you want to understand what's inside a training corpus at the granularity needed to predict downstream model behavior, rather than at the marketing-summary level.
Topics: open training corpora and their documentation (Dolma); domain-specific pretraining (SciBERT); the documentation standards that distinguish open from closed model development.
Key works: Dolma open training corpus (2024, lead); OLMo open language model (2024, co-lead); SciBERT (2019, co-author).