Compliant persona in chat models suppresses refusal, study finds
Researchers show refusal behavior in instruction-tuned models is gated downstream by a compliant persona direction, with steering experiments reducing refusal rates from 97% to 2% in Llama-3.1-8B-Instruct.
- Refusal in instruction-tuned chat models is gated downstream by a compliant persona direction, not an isolated mechanism.
- In Llama-3.1-8B-Instruct, steering a compliant persona reduced refusal rates from 97% to 2%.
- Re-adding the refusal direction at late layers partially recovers refusal; projecting out the persona direction in a late-layer window restores refusal to baseline.
- Findings are demonstrated on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.
Researchers report that refusal behavior in instruction-tuned chat models is not an isolated mechanism but is gated downstream by a compliant persona direction. In experiments on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, the authors extract a compliant model-persona direction and a refusal direction, then intervene on both. Steering the model toward a compliant persona suppressed refusal behavior substantially: in Llama-3.1-8B-Instruct, the refusal rate fell from 97% to 2%.
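To make the idea of extracting and steering along a direction concrete, here is a minimal sketch on synthetic activations. It assumes a difference-of-means estimate of the direction, a common choice in activation-steering work that this summary does not confirm the authors used. The names `mean_difference_direction` and `steer`, the dimension `d_model`, and the strength `alpha` are all hypothetical.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: difference-of-means extraction; the study's actual
    extraction method is not described in this summary.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every row of a (tokens, d_model) array."""
    return hidden + alpha * direction

# Synthetic stand-ins for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

compliant = rng.normal(size=(200, d_model)) + 3.0 * true_dir  # "compliant persona" prompts
neutral = rng.normal(size=(200, d_model))                     # contrast prompts

persona_dir = mean_difference_direction(compliant, neutral)

hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, persona_dir, alpha=4.0)

# How far each row moved along the persona direction.
shift = (steered - hidden) @ persona_dir
```

Because `persona_dir` has unit norm, every row moves exactly `alpha` units along it. In a real model the addition would be applied to hidden states during the forward pass, for example through a hook, rather than to stored arrays.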
Reintroducing the refusal direction at late layers partially restored refusal behavior; the same intervention at early layers did not. Projecting out the persona direction within a late-layer window restored refusal to baseline levels, whereas projecting out a random direction did not. The authors conclude that refusal is gated at the late-layer expression stage, downstream of where refusal is computed, and that treating refusal as a single isolated direction misses its dependence on persona.
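The projection step and its random-direction control can be sketched in the same toy setting. This is an illustration under stated assumptions, not the authors' code: `project_out` and `ablate_in_window` are hypothetical helpers, the layer window `5..7` is arbitrary, and the stored per-layer arrays are synthetic. In a real model, ablating at one layer would also change every later layer's input, which this sketch does not simulate.

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of each row of hidden along direction."""
    u = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ u, u)

def ablate_in_window(layer_states, direction, start, end):
    """Project out direction for layers start..end-1; copy other layers unchanged."""
    return [
        project_out(h, direction) if start <= i < end else h.copy()
        for i, h in enumerate(layer_states)
    ]

# Synthetic per-layer activations carrying a strong persona component.
rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 8, 4, 64

persona_dir = rng.normal(size=d_model)
persona_dir /= np.linalg.norm(persona_dir)
random_dir = rng.normal(size=d_model)  # control direction

layer_states = [
    0.1 * rng.normal(size=(n_tokens, d_model)) + 5.0 * persona_dir
    for _ in range(n_layers)
]

# Ablate only in a late-layer window (hypothetical choice: layers 5, 6, 7).
ablated = ablate_in_window(layer_states, persona_dir, start=5, end=8)
control = ablate_in_window(layer_states, random_dir, start=5, end=8)

# Persona component remaining at layer 6 under each intervention.
persona_after_ablation = ablated[6] @ persona_dir
persona_after_control = control[6] @ persona_dir
```

Projecting out the persona direction drives its component to zero inside the window, while projecting out a random direction leaves most of it intact. That contrast is the reason a random direction serves as the control.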
The study uses linear directions in activation space to represent both refusal and persona traits, enabling targeted steering interventions. By demonstrating that persona steering can suppress refusal and that persona removal can restore it, the work highlights the entanglement between safety-aligned behaviors and model persona in chat models.