Researchers propose RepSelect method to make LLM unlearning more robust against reversal attacks
RepSelect isolates forget-set-specific representations, achieving a 4–50x larger reduction in post-relearning answer accuracy than the strongest baseline while maintaining general capabilities.
- RepSelect targets selective representations to achieve deep and robust forgetting in LLMs.
- Evaluated on four model families (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite) across biohazardous knowledge and abusive tendencies.
- Achieves a 4–50x larger reduction in post-relearning answer accuracy than the strongest of five baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL).
- Near-perfectly robust to few-shot prompting attacks, a common way to reverse unlearning.
Researchers propose RepSelect, a method to make large language model (LLM) unlearning more robust against reversal attacks. The core idea is that existing unlearning techniques target representations that are shared with the retain set and that lie in the subspace a fine-tuning attacker can recover, which makes forgetting shallow and easily reversible.
RepSelect isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update. This approach aims to leave general capabilities intact while limiting what fine-tuning can recover after unlearning.
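As a rough illustration of what "collapsing the top principal components" of a gradient could look like, the sketch below zeroes the largest singular directions of a 2-D gradient matrix before a plain gradient step. This is one possible reading of the description above, not the authors' implementation: the function name `collapse_top_components`, the choice of `k`, the use of an uncentered SVD in place of PCA, the toy data, and the update rule are all assumptions made for the example.

```python
import numpy as np


def collapse_top_components(grad, k):
    """Zero the k largest singular directions of a 2-D gradient matrix."""
    # Thin SVD: grad == (U * S) @ Vt, with S sorted in descending order.
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    S_kept = S.copy()
    S_kept[:k] = 0.0  # collapse the dominant components
    return (U * S_kept) @ Vt


rng = np.random.default_rng(0)

# Toy gradient (made-up data): one strong rank-1 direction plus a weak residual.
dominant = 10.0 * np.outer(rng.normal(size=8), rng.normal(size=6))
residual = 0.1 * rng.normal(size=(8, 6))
grad = dominant + residual

filtered = collapse_top_components(grad, k=1)

# Hypothetical plain gradient step using the filtered gradient.
weights = rng.normal(size=(8, 6))
learning_rate = 0.01
updated = weights - learning_rate * filtered

# The filtered gradient has a much smaller norm than the raw one.
print(np.linalg.norm(grad), np.linalg.norm(filtered))
```

In this toy setup the strong direction stands in for whatever the dominant gradient components carry, and only the weak residual survives to drive the update. How many components to collapse, and on which layers, would be method-specific details that this summary does not state.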
The method was evaluated across two forget categories—biohazardous knowledge and abusive tendencies—and four model families spanning dense and Mixture-of-Experts architectures: Llama 3, Qwen 3.5, Gemma 4 E4B, and DeepSeek V2 Lite.
Against five popular baselines (GradDiff, NPO, SimNPO, RMU, and UNDIAL), RepSelect achieves a 4–50x larger reduction in post-relearning answer accuracy than the strongest of them. It is also near-perfectly robust to few-shot prompting attacks, a common method for reversing unlearning.
The authors argue that targeting selective representations is an important step toward deep and robust LLM forgetting, addressing a central challenge in making models forget specific knowledge and values without sacrificing general capabilities.
- Jun 17, 2026 · arXiv cs.CL