Fung was senior author of the 2023 paper "A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity", one of the first systematic empirical studies to test an LLM on reasoning, hallucination, and interactivity tasks across multiple languages. The findings were unflattering for the field: English-language results consistently overstated the model's competence in lower-resource languages, where the same prompts produced markedly worse reasoning quality and higher hallucination rates. For ai100, which audits AI engines across five language regions, this is the cleanest published demonstration that language-by-language evaluation is a methodological requirement, not an optional extension of an English baseline.
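The per-language point can be made concrete with a minimal sketch, which assumes each evaluation item has already been scored as correct or incorrect and tagged with a language code. The function names, the language codes, and the toy records are hypothetical illustrations, not data or code from the paper.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy per language from (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in records:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def gap_to_baseline(accuracy, baseline="en"):
    """Return how far each language falls below the baseline language."""
    base = accuracy[baseline]
    return {lang: base - acc for lang, acc in accuracy.items() if lang != baseline}


# Made-up toy records (assumption): the same four prompts scored in two languages.
records = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("sw", True), ("sw", False), ("sw", False), ("sw", False),
]

accuracy = per_language_accuracy(records)
gaps = gap_to_baseline(accuracy)
print(accuracy)
print(gaps)
```

Reporting only the pooled score would hide the gap between the two languages, which is why the sketch keeps scores split by language before any averaging.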

Worth following when
you need to evaluate language-model behavior across multiple language environments and want the empirical literature that shows why per-language testing is non-negotiable.
Topics
multilingual evaluation of LLMs (reasoning, hallucination, dialogue); empathetic and emotionally-aware dialogue systems; the methodological consequences of English-centric benchmark design.
Key works
"Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity" (2023, senior author); CAiRE conversational AI publications; long-arc work on cross-lingual NLP since the 1990s.