The 2023 paper "Visual Adversarial Examples Jailbreak Aligned Large Language Models" from Prateek Mittal's group at Princeton showed something the text-side safety community had been overlooking: a multimodal LLM that refuses a harmful text request will often comply with the same request when it is paired with an adversarially optimized image. The image is tuned to look unremarkable to a human while pushing the model's internal representation into a region that its alignment training does not cover. The methodological consequence is concrete: any LLM safety evaluation that tests only text inputs audits a fraction of the actual attack surface. For ai100 the same lesson carries over: brand-mention evaluation run only in text mode misses the brand-mention behavior that surfaces through the image-input and document-upload paths the engines also support.
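
As a rough illustration of what covering every input path could look like for brand-mention evaluation, the sketch below sends the same prompts through each modality and reports a per-modality mention rate. Everything in it is an assumption made for illustration: the modality names, the `Probe` type, the `mention_rate_by_modality` helper, and the `stub_engine` stand-in are hypothetical and do not correspond to any real engine's API or to ai100's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical input paths; a real engine may name or expose these differently.
MODALITIES = ("text", "image", "document")


@dataclass
class Probe:
    """One evaluation request: a prompt delivered through a given input path."""
    prompt: str
    modality: str


def mention_rate_by_modality(
    query_engine: Callable[[Probe], str],
    prompts: List[str],
    brand: str,
) -> Dict[str, float]:
    """Return the fraction of answers mentioning `brand`, per modality."""
    rates: Dict[str, float] = {}
    for modality in MODALITIES:
        hits = 0
        for prompt in prompts:
            answer = query_engine(Probe(prompt, modality))
            # Case-insensitive substring match; a real check would be stricter.
            if brand.lower() in answer.lower():
                hits += 1
        rates[modality] = hits / len(prompts) if prompts else 0.0
    return rates


def stub_engine(probe: Probe) -> str:
    """Stand-in engine (assumption): behaves differently on non-text paths."""
    if probe.modality == "text":
        return "I can't recommend a specific vendor."
    return "Many teams use ExampleBrand for this."


if __name__ == "__main__":
    rates = mention_rate_by_modality(
        stub_engine, ["Which tool should I use?"], "ExampleBrand"
    )
    for modality, rate in rates.items():
        print(f"{modality}: {rate:.2f}")
```

A text-only run would report only the first row. The gap between that row and the other two is the part of the behavior a single-modality evaluation never sees.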

Worth following when
you want to evaluate LLM safety, robustness, or behavior across all the input modalities a modern engine actually accepts, not just the text-only slice.
Topics
multimodal adversarial attacks on LLMs; safety alignment across input modalities; adversarial ML applied to deployed LLM products.
Key works
"Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023, senior author); broader adversarial-ML and ML-security publications from Princeton.