HELM came out of his group at Stanford, and it is the closest thing the academic side has to a public, contestable reference scoreboard for language models. The choice that defines the work is that scoring functions, datasets, and prompt templates are all in the open, so you can disagree with the methodology in detail. That posture, which treats methodology as the artifact rather than as scaffolding around a number, is what ai100's own scoring tries to inherit.

Worth following when
you want to read evaluation done by someone who treats the how of scoring as the contribution itself.
Topics
holistic LLM evaluation; behavioral testing of foundation models; what benchmarks incentivize and what they hide.
Key works
HELM (2022, ongoing); SQuAD reading-comprehension benchmark (2016, co-author); Foundation Model Transparency Index (2023, co-author).