Tushar Khot — Whom to read in AI

ARC, which Khot helped design at AI2, is the grade-school science benchmark that LLM papers still cite as evidence of "reasoning ability": multiple-choice questions where the model has to work out which scenario explains an observation, a step that goes beyond surface fact retrieval. The same instinct, making the reasoning step explicit, drives his Decomposed Prompting work (2023, lead author): if a complex task can be broken into a stable set of sub-tasks, you can prompt each sub-task in isolation and recompose the results, aiming for both better accuracy and an inspectable trace of what the model did. Read together, the two lines suggest how to get more out of a benchmark like ARC: with a decomposed trace, you can see which step in the reasoning chain went wrong, not only that the final answer was wrong.
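
The decompose, solve, recompose pattern can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the handler names (`split_words`, `first_letter`, `join_letters`) and the `run_decomposed` controller are hypothetical, and plain Python functions stand in for what would each be a separately prompted model call. The toy task, concatenating the first letter of each word, is the kind of symbolic task this style of prompting is usually demonstrated on.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical sub-task handlers. In a real setup each would be its own
# prompted model call; plain functions stand in so the sketch runs offline.
def split_words(text: str) -> List[str]:
    return text.split()

def first_letter(word: str) -> str:
    return word[0]

def join_letters(letters: List[str]) -> str:
    return " ".join(letters)

# One trace entry per sub-task call: (sub-task name, input, output).
Trace = List[Tuple[str, Any, Any]]

def run_decomposed(text: str) -> Tuple[str, Trace]:
    """Run the fixed split -> first_letter -> merge plan and record each step."""
    trace: Trace = []

    def call(name: str, fn: Callable[[Any], Any], arg: Any) -> Any:
        out = fn(arg)
        trace.append((name, arg, out))
        return out

    words = call("split", split_words, text)
    letters = [call("first_letter", first_letter, w) for w in words]
    answer = call("merge", join_letters, letters)
    return answer, trace

answer, trace = run_decomposed("Jack Ryan")
for name, arg, out in trace:
    print(f"{name}: {arg!r} -> {out!r}")
print(answer)
```

Because every sub-task call lands in `trace`, a wrong final answer can be traced back to the first step whose output is wrong, which is the inspection property the paragraph above describes.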

Worth following when: you want benchmark design and prompting methodology treated as the same problem in LLM evaluation.
Topics: reasoning benchmarks (ARC and successors); decomposed prompting and reusable sub-task patterns; reasoning-trace inspection as part of evaluation.
Key works: AI2 Reasoning Challenge / ARC (2018, co-author); Decomposed Prompting (2023, lead author); ongoing AI2 publications on machine reasoning evaluation.