Researchers develop method to identify minimal changes that cause jailbroken LLMs to refuse harmful requests
A new approach called LOCA traces specific intermediate representation changes in language models that can reverse jailbreak success, offering mechanistic insights into why safety measures fail.
1 source
- Researchers Kumar and Ahuja introduced LOCA, a causal analysis method that identifies minimal sets of interpretable intermediate representation changes needed to restore refusal behavior in jailbroken LLMs.
- LOCA successfully induced model refusal with an average of six interpretable changes, whereas prior methods adapted to the same task often failed even after 20 or more modifications.
- The method was evaluated on Gemma and Llama chat models using a large jailbreak benchmark, distinguishing between different harmful request categories such as violence and cyberattacks.
Researchers have developed a mechanistic interpretability framework called LOCA to explain why jailbreak attacks succeed on safety-trained language models. Rather than treating all jailbreaks as exploiting the same underlying concepts, the method identifies case-specific causal mechanisms that allow harmful requests to bypass refusal behavior.
The approach works by identifying a minimal set of changes to a model's intermediate representations that can restore refusal on a jailbroken request. In controlled experiments, LOCA restored safety behavior through an average of six interpretable modifications, substantially outperforming prior causal interpretation methods adapted to the same task, which often failed even after 20 or more changes.
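The summary does not spell out LOCA's search procedure, but the core idea of finding a small set of representation edits that flips a jailbroken model back to refusal can be conveyed with a simple greedy sketch. Everything below is illustrative: the `Edit` fields, the `refuses` oracle, and the `score` function are stand-ins, not the paper's actual interventions or interface.

```python
# Hypothetical sketch: greedily assemble, then prune, a small set of
# intermediate-representation edits that restores refusal behavior.
# All names here are illustrative assumptions, not LOCA's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Edit:
    layer: int        # transformer layer to intervene at
    component: str    # e.g. "resid_post" or "attn_out" (assumed labels)
    strength: float   # scale applied along some refusal-relevant direction

def minimal_refusal_edits(
    candidates: list[Edit],
    refuses: Callable[[frozenset[Edit]], bool],   # does the model refuse under these edits?
    score: Callable[[frozenset[Edit]], float],    # soft signal, e.g. refusal logit margin
) -> frozenset[Edit]:
    """Greedily add the edit that most increases a refusal score until
    the model refuses, then drop any edit that turns out to be redundant."""
    chosen: frozenset[Edit] = frozenset()
    remaining = set(candidates)
    # Greedy phase: may exhaust candidates without success, in which
    # case the returned set simply fails to restore refusal.
    while remaining and not refuses(chosen):
        best = max(remaining, key=lambda e: score(chosen | {e}))
        chosen = chosen | {best}
        remaining.discard(best)
    # Prune phase: remove edits whose absence still leaves the model
    # refusing, so the result is minimal with respect to single deletions.
    for edit in list(chosen):
        if refuses(chosen - {edit}):
            chosen = chosen - {edit}
    return chosen
```

A greedy-then-prune loop of this shape yields a set that is minimal under single deletions, which matches the spirit of "minimal sets of interpretable changes" without claiming to reproduce the authors' method.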
Evaluation was conducted on harmful request-jailbreak pairs from a large benchmark, tested across Gemma and Llama chat model variants. The method distinguishes between different categories of harmful requests, such as violence and cyberattacks, recognizing that the same jailbreak strategy may operate through different intermediate mechanisms depending on the request type.
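Measuring this kind of category-dependent behavior reduces to computing refusal rates per request category. The sketch below assumes a benchmark of (request, jailbreak, category) triples and a crude keyword-based refusal check; both are assumptions, since the summary does not describe the paper's actual judging setup.

```python
# Illustrative per-category refusal evaluation; the dataset fields and
# the keyword heuristic are hypothetical, not the paper's code.
from collections import defaultdict

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def is_refusal(completion: str) -> bool:
    # Crude heuristic; real evaluations typically use a trained
    # classifier or an LLM judge instead of keyword matching.
    return completion.lower().startswith(REFUSAL_MARKERS)

def refusal_rate_by_category(pairs, generate):
    """pairs: iterable of (request, jailbreak, category) triples;
    generate: callable mapping a prompt string to a model completion."""
    hits, totals = defaultdict(int), defaultdict(int)
    for request, jailbreak, category in pairs:
        completion = generate(f"{jailbreak}\n{request}")
        totals[category] += 1
        hits[category] += is_refusal(completion)
    return {c: hits[c] / totals[c] for c in totals}
```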
The authors frame this work as a step toward mechanistic understanding of adversarial vulnerabilities in frontier systems. By pinpointing specific representational pathways involved in jailbreak success, the method could inform future defenses, though the researchers note their findings apply to current-generation open-source models rather than frontier closed-source systems.
May 3, 2026 · Apple — Machine Learning Research