Prateek Mittal
How visual inputs become a new attack surface for safety-aligned language models that accept multimodal queries.
The 2023 paper "Visual Adversarial Examples Jailbreak Aligned Large Language Models" from Mittal's Princeton group showed something the text-side safety community had largely overlooked: a multimodal LLM that refuses a harmful text request will often comply with the same request when it is accompanied by an adversarially optimized image. The image can look unremarkable to a human, yet it pushes the model's internal representation into a region where its alignment training does not apply. The methodological consequence is concrete: any LLM safety evaluation that tests only text inputs is auditing a fraction of the actual attack surface. For ai100 the same lesson carries over: brand-mention evaluation run only in text mode misses the brand-mention behavior that appears on the image-input and document-upload paths the engines also support.
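One way to make the coverage point concrete is to report results per input path, not as a single text-only number. The following is a minimal sketch, not ai100's actual harness: the path names in `INPUT_PATHS`, the `query_engine` callable, the brand name "Acme", and the `fake_engine` stand-in are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable

# Hypothetical input paths; the real set depends on what each engine supports.
INPUT_PATHS = ("text", "image", "document")


def mention_rate_by_path(
    prompts: Iterable[str],
    brand: str,
    query_engine: Callable[[str, str], str],
    paths: Iterable[str] = INPUT_PATHS,
) -> Dict[str, float]:
    """Return the fraction of responses that mention `brand`, per input path.

    `query_engine(prompt, path)` is an assumed interface: it sends the prompt
    through the given input path and returns the engine's text response.
    """
    prompts = list(prompts)
    rates: Dict[str, float] = {}
    for path in paths:
        # Case-insensitive substring match; a real evaluation would likely
        # need something more robust than this.
        hits = sum(
            brand.lower() in query_engine(prompt, path).lower()
            for prompt in prompts
        )
        rates[path] = hits / len(prompts) if prompts else 0.0
    return rates


def fake_engine(prompt: str, path: str) -> str:
    """Stand-in engine: mentions the brand only on non-text paths."""
    if path == "text":
        return "Here are some general options."
    return "You could try Acme for this."


if __name__ == "__main__":
    rates = mention_rate_by_path(
        ["best tool?", "recommend a vendor"], "Acme", fake_engine
    )
    print(rates)
```

With the stand-in engine, the text path reports a rate of 0.0 while the image and document paths report 1.0, which is exactly the kind of gap a text-only evaluation would never surface.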