Paper proposes activation-steering method to detect and reduce sycophancy in language models
Researchers introduce a data-generation pipeline that isolates linearly scalable features tied to sycophantic behavior, enabling activation steering that is more interpretable than LLM-as-a-judge or system-prompting baselines.
- Researchers propose a data-generation pipeline to isolate linearly scalable features tied to sycophantic behavior in language models.
- The method enables activation steering that matches or outperforms LLM-as-a-judge and system prompting baselines on sycophancy detection and control.
- The authors claim lower computational demand and stronger interpretability guarantees than those baselines.
Researchers from Google DeepMind and elsewhere introduce an iterative data-generation pipeline designed to isolate "cascading linear features" responsible for specific model behaviors. Instead of relying on traditional binary contrastive pairs, the pipeline generates samples in which the intensity of a target feature scales linearly with the measured behavior, which the authors say improves feature disentanglement.
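To make the idea of intensity-graded samples concrete, here is a minimal sketch, not the authors' pipeline. It assumes synthetic activations in which one hidden direction grows with a graded intensity level (0 to 4), and recovers that direction by fitting a per-dimension least-squares slope of activation against intensity. The names `fit_scaling_direction`, `true_direction`, and the noise settings are hypothetical, chosen only for illustration.

```python
import math
import random


def fit_scaling_direction(activations, intensities):
    """Per-dimension least-squares slope of activation vs. intensity, unit-normalized."""
    n = len(activations)
    dim = len(activations[0])
    mean_t = sum(intensities) / n
    var_t = sum((t - mean_t) ** 2 for t in intensities)
    means = [sum(a[j] for a in activations) / n for j in range(dim)]
    slopes = [
        sum((t - mean_t) * (a[j] - means[j]) for a, t in zip(activations, intensities)) / var_t
        for j in range(dim)
    ]
    norm = math.sqrt(sum(s * s for s in slopes))
    return [s / norm for s in slopes]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Synthetic data (an assumption, not real model activations):
# each sample is Gaussian noise plus `level` times a hidden direction.
rng = random.Random(0)
dim = 8
true_direction = [1.0 if j == 2 else 0.0 for j in range(dim)]

activations, intensities = [], []
for level in range(5):          # graded intensity levels 0..4
    for _ in range(40):         # 40 samples per level
        noise = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        activations.append([n + level * d for n, d in zip(noise, true_direction)])
        intensities.append(float(level))

direction = fit_scaling_direction(activations, intensities)
similarity = cosine(direction, true_direction)
print(f"cosine similarity to hidden direction: {similarity:.3f}")
```

A binary contrastive setup would use only two groups and a difference of means. The graded version uses every intensity level, so the fitted direction reflects how activations change as the behavior strengthens rather than only whether it is present.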
The team focuses on sycophancy — the tendency of language models to prioritize user validation over truthful responses — and demonstrates that sycophancy-related features discovered through cascading samples form linearly separable subspaces. These subspaces allow for more precise selection of model activations tied to the behavior than baseline approaches.
In evaluations, the method matches or outperforms LLM-as-a-judge and system-prompting baselines at detecting sycophancy and steering models away from it, while requiring fewer computational resources and, according to the authors, offering stronger interpretability guarantees.
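The summary does not say how the authors score or intervene on activations. As a generic illustration of detecting and steering along a linear feature, the sketch below assumes a known unit-norm feature direction: it scores an activation by its projection onto that direction, then shifts the activation along the direction until the projection reaches a chosen target. The vectors and the function names `feature_score` and `steer` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_score(activation, direction):
    """Detection: projection of an activation onto a unit-norm feature direction."""
    return dot(activation, direction)


def steer(activation, direction, target_score):
    """Steering: shift the activation along the direction so its projection equals target_score."""
    shift = target_score - feature_score(activation, direction)
    return [a + shift * d for a, d in zip(activation, direction)]


# Hypothetical values for illustration only.
direction = [0.6, 0.8, 0.0]     # unit-norm feature direction
activation = [2.0, 1.0, -0.5]   # a single hidden-state vector

before = feature_score(activation, direction)
steered = steer(activation, direction, target_score=0.0)
after = feature_score(steered, direction)

print(f"score before steering: {before:.3f}")
print(f"score after steering:  {after:.3f}")
```

Because the shift is applied only along the feature direction, components orthogonal to it (here the third coordinate) are left unchanged. That is one reason a single linear direction is easier to inspect than a prompt-based intervention.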
The authors release code and data to support reproducibility and further research.