For complex questions — "what's the total population of countries whose capital starts with 'B'?" — the difference between knowing the answer and reasoning to it matters, but most QA evaluation collapses both into a single correctness score. Berant's BREAK and QDMR work formalized the alternative: decompose each complex question into the atomic reasoning steps it implies, then check whether the model went through those steps, not just whether the final answer matched. The DROP reading-comprehension benchmark (2019, co-author) put numbers on how much current LLMs lose when the question demands actually composing intermediate facts to reach the answer.

Worth following when
you want to evaluate not just whether an LLM got the answer, but whether it reasoned its way there in a structurally honest way.
Topics
question decomposition into atomic reasoning steps (QDMR, BREAK); compositional reading comprehension (DROP); the gap between answer correctness and reasoning correctness.
Key works
BREAK dataset with QDMR annotations (2020, lead author); DROP reading-comprehension benchmark (2019, co-author); ongoing Tel Aviv and Google DeepMind work on compositional reasoning.