Joty's earlier published work is in discourse-structure parsing, the kind of NLP where you ask whether a paragraph hangs together as an argument, not just whether each sentence parses. That habit carries into his more recent LLM-evaluation publications: an enterprise LLM application has to produce outputs that hold their argumentative shape across multi-turn use, and standard single-shot benchmarks don't capture that. The body of work reads as a steady reminder that "the model passes a benchmark" and "the model holds up under enterprise traffic" are not the same claim.

Worth following when
you need an evaluation perspective informed by what shipping LLM features to enterprise customers actually demands of the underlying methodology.
Topics
discourse structure as a lens on long-form LLM output; multi-turn evaluation beyond single-shot benchmarks; the gap between research benchmarks and enterprise deployment.
Key works
earlier foundational work on discourse parsing (Rhetorical Structure Theory, or RST, and beyond); ongoing publications on multi-turn LLM evaluation and reliability.