Pascale Fung
Whether the same language model does the same kind of work across different languages, and whether evaluation methodology built for English is misleading us about what models do in every other language.
Fung led the 2023 paper "A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity", one of the first systematic empirical studies to run an LLM through reasoning, hallucination, and interactivity tasks in multiple languages within a single evaluation. The findings were unflattering for the field: English-language results consistently overstated the model's competence in lower-resource languages, where the same prompts produced markedly worse reasoning and higher hallucination rates. For ai100, which audits AI engines across five language regions, this is the cleanest published demonstration that language-by-language evaluation is a methodological requirement, not an optional extension of an English baseline.