Jonathan Berant — Whom to read in AI

For complex questions — "what's the total population of countries whose capital starts with 'B'?" — the difference between knowing the answer and reasoning to it matters, but most QA evaluation collapses both into a single correctness score. Berant's BREAK and QDMR work formalized the alternative: decompose each complex question into the atomic reasoning steps it implies, then check whether the model went through those steps, not just whether the final answer matched. The DROP reading-comprehension benchmark (2019, co-author) put numbers on how much current LLMs lose when the question demands actually composing intermediate facts to reach the answer.

Worth following when: you want to evaluate not just whether an LLM got the answer, but whether it reasoned its way there in a structurally honest way.
Topics: question decomposition into atomic reasoning steps (QDMR, BREAK); compositional reading comprehension (DROP); the gap between answer correctness and reasoning correctness.
Key works: BREAK dataset with QDMR annotations (2020, lead author); DROP reading-comprehension benchmark (2019, co-author); ongoing Tel Aviv and Google DeepMind work on compositional reasoning.