ARC, which Khot helped design at AI2, is the grade-school-science benchmark that current LLM papers cite as evidence of "reasoning ability": its questions require the model to work out which scenario explains an observation, a step that goes beyond surface fact retrieval. The same interest in exposing intermediate reasoning drives his Decomposed Prompting work (2023, lead author): if a complex task can be broken into a stable set of sub-tasks, you can prompt each sub-task in isolation and recompose the results, gaining both accuracy and an inspectable trace of what the model did. Taken together, the two lines of work make ARC a more diagnostic evaluation tool: when a model misses a question, the trace shows at which sub-task the reasoning went wrong.
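
The decompose, solve, and recompose loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the handler functions are hypothetical plain-Python stand-ins for what would be separate LLM prompts, and the names `run_decomposed`, `first_divergence`, and `PLAN` are assumptions made for this example. The toy task is concatenating the first letters of the words in a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical sub-task handlers. In a real system each one would be its
# own prompt sent to an LLM; here they are deterministic stubs.
def split_words(text: str) -> str:
    return "|".join(text.split())


def first_letters(words: str) -> str:
    return "|".join(w[0] for w in words.split("|") if w)


def join_letters(letters: str) -> str:
    return "".join(letters.split("|"))


HANDLERS: Dict[str, Callable[[str], str]] = {
    "split": split_words,
    "first_letter": first_letters,
    "merge": join_letters,
}

# A fixed, reusable plan: the "stable set of sub-tasks" for this task.
PLAN: List[str] = ["split", "first_letter", "merge"]


@dataclass
class Step:
    name: str    # which sub-task ran
    input: str   # what it was given
    output: str  # what it returned


def run_decomposed(
    plan: List[str],
    question: str,
    handlers: Dict[str, Callable[[str], str]],
) -> Tuple[str, List[Step]]:
    """Run each sub-task in order, feeding each output to the next step."""
    trace: List[Step] = []
    current = question
    for name in plan:
        result = handlers[name](current)
        trace.append(Step(name, current, result))
        current = result
    return current, trace


def first_divergence(trace: List[Step], expected: List[str]) -> Optional[str]:
    """Return the name of the first step whose output differs from the
    expected intermediate result, or None if every step matches."""
    for step, want in zip(trace, expected):
        if step.output != want:
            return step.name
    return None


answer, trace = run_decomposed(PLAN, "Jack Ryan Smith", HANDLERS)
for step in trace:
    print(f"{step.name}: {step.input!r} -> {step.output!r}")
```

Because every intermediate output is recorded, a wrong final answer can be traced back to a sub-task by comparing the trace with known-good intermediate results, which is what `first_divergence` does. That comparison assumes gold intermediate outputs are available, which is not always the case for benchmark questions.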

Worth following when
you want benchmark design and prompting methodology treated as the same problem in LLM evaluation.
Topics
reasoning benchmarks (ARC and successors); decomposed prompting and reusable sub-task patterns; reasoning-trace inspection as part of evaluation.
Key works
AI2 Reasoning Challenge / ARC (2018, co-author); Decomposed Prompting (2023, lead author); ongoing AI2 publications on machine reasoning evaluation.