Research · Jun 26, 2026

Paper proposes activation-steering method to detect and reduce sycophancy in language models

Researchers introduce a data-generation pipeline that isolates linearly scalable features tied to sycophantic behavior, enabling more interpretable activation steering than LLM-as-a-judge or prompting baselines.

Trust: 79

Hype: Low

1 source · cross-referenced

TL;DR

Researchers propose a data-generation pipeline to isolate linearly scalable features tied to sycophantic behavior in language models.
The method enables activation steering that matches or outperforms LLM-as-a-judge and system prompting baselines on sycophancy detection and control.
The authors claim lower computational cost and stronger interpretability guarantees than those baselines.

Researchers from Google DeepMind and elsewhere introduce an iterative data-generation pipeline designed to isolate "cascading linear features" responsible for specific model behaviors. Unlike traditional binary contrastive pairs, the pipeline generates samples where the intensity of a target feature scales linearly with the measured behavior, improving feature disentanglement.
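The difference between graded samples and binary pairs can be illustrated with a small sketch. This is not the paper's pipeline. It assumes each sample carries an intensity label from 0 to 4 and a hidden-activation vector, and it uses a plain per-dimension least-squares slope as a stand-in for feature discovery. The function name `fit_direction`, the synthetic data, and the 4-dimensional activations are all hypothetical.

```python
import random


def fit_direction(activations, intensities):
    """Return a unit vector of per-dimension least-squares slopes
    (activation value regressed on the intensity label)."""
    n = len(activations)
    dim = len(activations[0])
    mean_t = sum(intensities) / n
    var_t = sum((t - mean_t) ** 2 for t in intensities)
    slopes = []
    for j in range(dim):
        mean_a = sum(a[j] for a in activations) / n
        cov = sum((t - mean_t) * (a[j] - mean_a)
                  for a, t in zip(activations, intensities))
        slopes.append(cov / var_t)
    norm = sum(s * s for s in slopes) ** 0.5
    return [s / norm for s in slopes]


# Synthetic stand-in for model activations (assumption): the behavior
# lives along `true_dir`, and its strength grows with the intensity label.
rng = random.Random(0)
true_dir = [0.6, 0.0, 0.8, 0.0]
intensities = [level for level in range(5) for _ in range(40)]
activations = [[t * d + rng.gauss(0, 0.1) for d in true_dir]
               for t in intensities]

direction = fit_direction(activations, intensities)
cosine = sum(a * b for a, b in zip(direction, true_dir))
print(f"cosine similarity to planted direction: {cosine:.4f}")
```

A binary contrastive setup would only see the two ends of the scale. With graded labels, the fit can use every intermediate level to check that the activation really changes in proportion to the behavior.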

The team focuses on sycophancy, the tendency of language models to prioritize user validation over truthful responses, and reports that sycophancy-related features discovered through cascading samples form linearly separable subspaces. These subspaces allow more precise selection of the model activations tied to the behavior than the baseline approaches do.

In evaluations, the method either matches or outperforms LLM-as-a-judge and system prompting baselines for detecting and steering away from sycophancy, while requiring lower computational resources and offering stronger interpretability guarantees.
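As a rough illustration of what detection and steering along a linear feature can look like, the sketch below scores an activation by projecting it onto a direction and then removes a chosen fraction of that projection. It is a generic activation-steering example under stated assumptions, not the authors' released code. The helper names `score` and `steer`, the `strength` parameter, and the example vectors are hypothetical, and in a real model the same arithmetic would be applied to hidden states inside the forward pass.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(activation, direction):
    """Detection: how strongly the activation points along the
    (unit-norm) feature direction."""
    return dot(activation, direction)


def steer(activation, direction, strength=1.0):
    """Steering: subtract `strength` times the projection onto the
    direction. strength=1.0 removes the component entirely."""
    s = score(activation, direction)
    return [a - strength * s * d for a, d in zip(activation, direction)]


# Hypothetical unit-norm feature direction and one activation vector.
direction = [0.6, 0.0, 0.8, 0.0]
activation = [1.2, 0.5, 1.6, -0.3]

before = score(activation, direction)
steered = steer(activation, direction)
after = score(steered, direction)
half = score(steer(activation, direction, strength=0.5), direction)
print(before, after, half)
```

Components orthogonal to the direction are left unchanged, which is the practical appeal of a well-isolated linear feature: the intervention touches only the targeted behavior's subspace.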

The authors release code and data to support reproducibility and further research.

Sources

1. arXiv cs.AI: Detecting and Controlling Sycophancy with Cascading Linear Features
