Sabharwal came to LLM research from a background in formal reasoning and satisfiability, areas where "does the system reason correctly" has a precise meaning: correctness is measured against logical specifications and can be verified by operational tests. That background shows through in his more recent work on LLM reasoning evaluation, which tends to ask whether the structure of the reasoning matches the structure the problem requires, not only whether the final answer happens to be correct. ARC and similar benchmarks are a product of that posture: they are designed so that surface fluency cannot substitute for actual inference.

Worth following when
you want LLM reasoning evaluation grounded in the older formal-reasoning tradition, where "correctness" had operational meaning before deep learning arrived.
Topics
formal reasoning and SAT-solving applied to LLM evaluation; reasoning-trace inspection in QA benchmarks; the bridge between classical AI inference and current LLM behavior.
Key works
AI2 Reasoning Challenge / ARC (2018, co-author); pre-LLM-era work on tractable inference and SAT; recent LLM-era reasoning-evaluation publications from AI2.