Kyle Lo
What's actually in a training corpus once you sit down and look at it document by document.
Most discussion of LLM training data happens at the headline level: "trained on the web", "trained on Common Crawl". Lo's work on Dolma, the corpus behind OLMo, is what that conversation looks like when someone actually opens the file: documented decisions about which Common Crawl snapshots, which deduplication thresholds, which filtering heuristics, and which document types, all written up at a level of detail a commercial lab would call confidential. SciBERT, earlier in his career, made the equivalent move for scientific text, baking in domain-specific tokenization and pretraining on documented corpora, then shipping the artifact so others can replicate it.