Shafiq Joty
Evaluation methodology for language models when the deployment context is enterprise software rather than a research demo.
Joty's earlier published work is in discourse-structure parsing — the kind of NLP where you ask whether a paragraph hangs together as an argument, not just whether each sentence parses. That habit carries into his more recent LLM-evaluation publications: an enterprise LLM application has to produce outputs that hold their argumentative shape across multi-turn use, and standard single-shot benchmarks don't capture that. The body of work reads as a steady reminder that "the model passes a benchmark" and "the model holds up under enterprise traffic" are not the same claim.