Callison-Burch's "Real or Fake Text?" line of research (with Liam Dugan and others, 2020 onward) ran interactive experiments in which readers were asked to mark the point in a text where the human author stopped and an AI continuation began. The results track the progress of LLMs from an angle most evaluation skips: through 2020 the cut-off point was easy to find; by 2023 readers struggled to find it at all, even when motivated. His earlier work pioneered crowdsourcing as a primary instrument for NLP evaluation, which gives the current line additional weight: he is unusually qualified to say what human evaluators can and cannot detect under realistic conditions.

Worth following when
you need to design human evaluation of LLM-generated text and want methodology informed by what humans actually notice versus what they think they notice.
Topics
human distinguishability of AI vs. human text; crowdsourcing methodology for NLP evaluation; machine translation evaluation lineage from the statistical era.
Key works
"Real or Fake Text?" line of human-vs-AI distinguishability experiments (2020 onward); crowdsourcing-for-NLP methodology papers (early 2010s); WMT-era contributions to machine-translation evaluation.