Jonathan Berant
Whether a language model is actually doing the reasoning steps needed to answer a question — or just producing an answer that happens to be right.
For complex questions — "what's the total population of countries whose capital starts with 'B'?" — the difference between knowing the answer and reasoning to it matters, but most QA evaluation collapses both into a single correctness score. Berant's BREAK and QDMR work formalized the alternative: decompose each complex question into the atomic reasoning steps it implies, then check whether the model went through those steps, not just whether the final answer matched. The DROP reading-comprehension benchmark (2019, co-author) put numbers on how much current LLMs lose when the question demands actually composing intermediate facts to reach the answer.