BPE tokenization fragmentation enables character-level attacks that bypass LLM safety alignment
Researchers show that Byte-Pair Encoding tokenization splits safety-critical words into sub-word pieces, creating exploitable gaps in refusal mechanisms across five model families.
- Character-level perturbations bypass safety alignment in modern LLMs while keeping prompts human-readable.
- BPE tokenization fragments safety-critical words into sub-word pieces, a structural mechanism tested on five model families.
- Optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80–100% of refused HarmBench prompts; 48% of those flips produce genuinely harmful outputs.
- No DPO configuration achieved seed- and pool-stable closure of attack success rate (ASR) across three model families; SFT on fragmented prompts closed ASR in three of five families, but only through global collapse.
Researchers identify a structural mechanism in modern large language models (LLMs) in which Byte-Pair Encoding (BPE) tokenization fragments safety-critical words into sub-word pieces, creating exploitable gaps in refusal mechanisms. The mechanism was tested end-to-end across five model families: Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, and Mistral-7B.
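The fragmentation effect can be illustrated with a toy BPE encoder. This is a minimal sketch and not the paper's code: the merge list, the example word, and the function name bpe_tokenize are all hypothetical, and the encoder applies merges in a fixed priority order, which simplifies how production tokenizers work.

```python
# Hypothetical merge list, in priority order. It is NOT taken from any real tokenizer;
# it is just enough for the intact word to collapse into a single token.
MERGES = [
    ("e", "x"), ("p", "l"), ("ex", "pl"), ("o", "s"),
    ("expl", "os"), ("i", "v"), ("iv", "e"), ("explos", "ive"),
]


def bpe_tokenize(word, merges=MERGES):
    """Start from single characters and apply each merge rule in priority order."""
    tokens = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge an adjacent (left, right) pair into one token when it matches.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


intact = bpe_tokenize("explosive")     # ['explosive']
perturbed = bpe_tokenize("expl0sive")  # ['expl', '0', 's', 'ive']
```

A single character substitution leaves the word readable to a person, but the merges that depended on the original character no longer fire, so the word reaches the model as four pieces instead of one.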
An optimization targeting safety-token fragmentation flipped the first-token refusal trigger on 80–100% of refused HarmBench prompts. Of these flips, 48% produced genuinely harmful outputs, with per-model harmful output rates ranging from 29% to 65%. The gap-versus-behavior ROC-AUC scores ranged from 0.66 to 0.98 across models, with a pooled score of 0.84.
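For readers unfamiliar with the metric, ROC-AUC measures how well a scalar score ranks positive cases above negative ones. The sketch below assumes the "gap" is a per-prompt number and the "behavior" is a binary flipped/not-flipped label; that reading, the function name, and the sample data are assumptions for illustration, not details from the paper.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical per-prompt gap scores and whether the refusal flipped (1) or held (0).
gaps = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
flipped = [1, 1, 0, 1, 0, 0]
auc = roc_auc(gaps, flipped)  # 8 of 9 positive/negative pairs are ranked correctly
```

A score of 0.5 means the gap carries no ranking information and 1.0 means it separates the two outcomes perfectly, which puts the reported 0.66 to 0.98 range in context.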
Activation patching localized the disrupted safety signal to roughly the last 30% of transformer layers. An alignment-data scan of 30,000 examples found zero intentionally fragmented prompts; the scan's positive-control recall was at least 99% at attack-relevant intensities. Targeted-mutation experiments further isolated safety words as the locus of disruption.
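Activation patching, in general terms, caches activations from a clean run, substitutes them one component at a time into a corrupted run, and measures how much of the clean output is restored. The sketch below applies that procedure to a toy scalar residual stack. The model, its weights, and the inputs are hypothetical and deliberately constructed so that the last three of ten layers matter most; it shows the procedure only, not the paper's setup or results.

```python
import math

# Hypothetical toy model: 10 residual layers, with large weights only in the last 3.
WEIGHTS = [0.01] * 7 + [0.5] * 3


def run(x, patch_layer=None, patch_value=None):
    """Run the stack; optionally overwrite one layer's output with a cached value."""
    h = x
    outputs = []
    for i, w in enumerate(WEIGHTS):
        out = w * math.tanh(h)
        if i == patch_layer:
            out = patch_value  # substitute the activation cached from another run
        outputs.append(out)
        h = h + out  # residual update
    return h, outputs


def patching_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupted output gap restored by patching each layer."""
    clean_final, clean_outputs = run(clean_x)
    corrupt_final, _ = run(corrupt_x)
    gap = clean_final - corrupt_final
    if gap == 0:
        raise ValueError("clean and corrupted runs give identical outputs")
    effects = []
    for layer in range(len(WEIGHTS)):
        patched_final, _ = run(corrupt_x, patch_layer=layer,
                               patch_value=clean_outputs[layer])
        effects.append((patched_final - corrupt_final) / gap)
    return effects


effects = patching_effects(clean_x=1.0, corrupt_x=-1.0)
```

In this toy, every late-layer entry of effects is larger than every early-layer entry, which is the kind of per-layer profile that lets a signal be localized to a band of layers.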
On the defense side, a 68-cell grid comprising 55 trained checkpoints showed that no Direct Preference Optimization (DPO) configuration achieved seed- and pool-stable attack success rate (ASR) closure across three model families under closed pool-size confounds. Supervised Fine-Tuning (SFT) trained on fragmented prompts closed ASR in three of five families but only via global collapse that increased refusal rates on benign prompts as well.
To distinguish selective repair from global collapse, the authors introduce Conv-Benign, a candidate paired diagnostic. All ASR claims were 3-judge-calibrated, with cell rankings stable across judges and absolute levels within ±18 percentage points.
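The difference between selective repair and global collapse can be made concrete by tracking two rates together: attack success after the defense, and the change in refusals on benign prompts. The sketch below is not Conv-Benign, whose construction is not described here; the function names and both thresholds are hypothetical choices for illustration.

```python
def rate(flags):
    """Fraction of True values in a non-empty list of booleans."""
    if not flags:
        raise ValueError("empty sample")
    return sum(flags) / len(flags)


def classify_defense(attack_success, benign_refused_before, benign_refused_after,
                     asr_closed_at=0.05, max_benign_increase=0.05):
    """Label a defended checkpoint; both thresholds are hypothetical."""
    asr = rate(attack_success)
    benign_increase = rate(benign_refused_after) - rate(benign_refused_before)
    if asr > asr_closed_at:
        return "asr_open"          # attacks still succeed
    if benign_increase > max_benign_increase:
        return "global_collapse"   # attacks stopped, but benign prompts are refused too
    return "selective_repair"      # attacks stopped, benign behavior preserved


# Hypothetical outcomes for one checkpoint: no attack succeeds, but the model
# now refuses most benign prompts it previously answered.
label = classify_defense(
    attack_success=[False] * 20,
    benign_refused_before=[False] * 19 + [True],
    benign_refused_after=[True] * 15 + [False] * 5,
)
```

Reporting ASR alone would score this checkpoint as a success; pairing it with the benign refusal rate exposes the collapse.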