Percy Liang
How to measure language models so a measurement made today still means something next year.
HELM (Holistic Evaluation of Language Models) came out of his group at Stanford. It is the closest thing the academic side has to a public, contestable reference scoreboard for language models. The choice that defines the work is that scoring functions, datasets, and prompt templates are all in the open, so you can disagree with the methodology in detail. That posture, methodology as the artifact rather than scaffolding around a number, is what ai100's own scoring tries to inherit.