Tushar Khot
What "reasoning ability" actually means as something you can put on a benchmark, and what changes when you also try to coach the model through that reasoning with structured prompting.
ARC (the AI2 Reasoning Challenge), which Khot helped design at AI2, is the grade-school-science benchmark that current LLM papers cite as evidence of "reasoning ability": its questions require the model to work out which scenario explains an observation, a step that goes beyond surface fact retrieval. The same instinct drives his Decomposed Prompting work (2023, lead author): if a complex task can be broken into a stable set of sub-tasks, you can prompt each sub-task in isolation and recompose the results, getting both better accuracy and an inspectable trace of what the model did. Together, the two lines make ARC-style evaluation more diagnostic: with a decomposed trace, you can see which step in the reasoning chain lost the question, not only that the final answer was wrong.
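The decompose-and-recompose idea above can be illustrated with a minimal sketch. This is not the Decomposed Prompting implementation: the handler names, the `run_decomposed` function, and the toy task are all hypothetical, and plain Python functions stand in for what would be separately prompted model calls. The point is only the shape: a fixed set of sub-tasks, each run in isolation, with a trace recorded at every step.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical sub-task handlers. In a real system each one would be its own
# prompted model call; plain functions stand in here so the sketch runs.
def split_words(text: str) -> List[str]:
    return text.split()

def first_letters(words: List[str]) -> List[str]:
    return [w[0] for w in words]

def join_letters(letters: List[str]) -> str:
    return "".join(letters)

# The "stable set of sub-tasks": a name-to-handler registry (assumed design).
HANDLERS: Dict[str, Callable[[Any], Any]] = {
    "split": split_words,
    "first": first_letters,
    "join": join_letters,
}

def run_decomposed(plan: List[str], task_input: Any) -> Tuple[Any, List[Tuple[str, Any, Any]]]:
    """Run each sub-task in order, feeding each result into the next step.

    Returns the final answer plus a trace of (step, input, output) tuples,
    so a wrong answer can be attributed to a specific step.
    """
    trace: List[Tuple[str, Any, Any]] = []
    value = task_input
    for step in plan:
        result = HANDLERS[step](value)
        trace.append((step, value, result))
        value = result
    return value, trace

answer, trace = run_decomposed(["split", "first", "join"], "Allen Institute for AI")
for step, given, produced in trace:
    print(f"{step}: {given!r} -> {produced!r}")
print("answer:", answer)
```

If the final answer is wrong, the trace shows which step first produced a bad intermediate value, which is the kind of inspection an end-to-end prompt does not offer.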