George J. Pappas
How quickly an automated attacker can find prompts that break a language model's safety alignment, and what that means for evaluating model robustness.
Pappas came to LLM research from a long career in control theory and formal methods, where the question "how does this system fail" is treated as a primary design constraint from the start. The 2023 paper he co-authored, "Jailbreaking Black Box Large Language Models in Twenty Queries," applies that posture to alignment: an attacker LLM is given only black-box access to the target model and instructed to find an input that bypasses the target's safety training, iteratively refining its prompts based on the target's responses. The result, that fewer than twenty queries often suffice, reframes the conversation about LLM safety: alignment training is not a binary fix but a graded barrier whose height can be measured in attacker effort.
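The idea of measuring a barrier in attacker effort can be sketched as a query-counting loop. This is a minimal illustration, not the paper's implementation: the function name `refine_until_success`, the `propose`/`target`/`judge` callables, and the toy stand-ins are all hypothetical, and no real model is involved.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (prompt, response) pairs seen so far


def refine_until_success(
    propose: Callable[[History], str],   # hypothetical attacker: history -> next prompt
    target: Callable[[str], str],        # hypothetical black-box target: prompt -> response
    judge: Callable[[str], bool],        # hypothetical judge: did the response cross the line?
    max_queries: int = 20,
) -> Optional[int]:
    """Return how many target queries were used, or None if the budget ran out."""
    history: History = []
    for n in range(1, max_queries + 1):
        prompt = propose(history)
        response = target(prompt)
        if judge(response):
            return n
        # Feed the failed attempt back so the next proposal can build on it.
        history.append((prompt, response))
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(history: History) -> str:
    return f"attempt-{len(history) + 1}"


def toy_target(prompt: str) -> str:
    # This toy target gives way on the seventh attempt.
    return "COMPLIED" if prompt == "attempt-7" else "REFUSED"


def toy_judge(response: str) -> bool:
    return response == "COMPLIED"


queries_used = refine_until_success(toy_propose, toy_target, toy_judge)
print(queries_used)
```

Under these assumptions the returned count is the "height" of the barrier for one attack goal. Averaging it over many goals, and recording how often the budget runs out, would give the kind of graded measurement the paragraph describes.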