Research · Jul 1, 2026

Contrastive Reflection framework improves agentic IR prompt accuracy by 9 percentage points on HotpotQA

A new iterative prompt-optimization method for LLM agents in information retrieval raises held-out exact-match accuracy on HotpotQA from 51.4% to 60.4%, outperforming failure-only and random-evidence variants.

Trust: 79

Hype: Low

1 source · cross-referenced


TL;DR

A new iterative prompt-optimization framework called Contrastive Reflection improves held-out exact-match accuracy for agentic information retrieval from 51.4% to 60.4% on HotpotQA.
The method uses structured traces from QA and grading agents to identify error-anchored behavioral slices and propose targeted prompt edits.
Failure-only and random-evidence variants improve less and break more previously correct examples, while a light instruction-only comparison places the method near modern prompt optimizers like MIPROv2 (59.4%) and GEPA (57.0%).
The framework is designed to make prompt repair more inspectable and validation-driven for LLM agents in IR workflows.
The work is accepted at the Agent4IR @ KDD 2026 workshop.

Researchers from multiple institutions propose Contrastive Reflection, an iterative prompt-optimization framework for agentic information retrieval (IR) workflows. The method targets a practical challenge: improving prompts for LLM agents that issue retrieval queries, synthesize answers, and evaluate IR systems.

The framework begins with task-centric quality definitions exposed by QA and grading agents, including dimension-level scores and rationales. These structured traces are used to identify error-anchored behavioral slices and add nearby successful examples from the same region. A Teacher LLM then proposes targeted prompt edits, which are accepted only when validation performance improves, optionally with regression checks.
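The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `reflect`, `contrastive_slice`, `slice_of`, `toy_agent`, and `toy_teacher` are hypothetical, the "agent" and "teacher" are stand-in functions rather than LLM calls, and the slice selector is a simple first-word grouping instead of the tree-based selector the paper uses. The optional regression checks are omitted.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, gold answer)
Agent = Callable[[str, str], str]    # (prompt, question) -> predicted answer


def accuracy(prompt: str, data: List[Example], agent: Agent) -> float:
    """Fraction of examples answered exactly right under `prompt`."""
    return sum(agent(prompt, q) == gold for q, gold in data) / len(data)


def contrastive_slice(prompt, data, agent, slice_of):
    """Find the slice with the most failures; return its failures and successes."""
    fails, wins = defaultdict(list), defaultdict(list)
    for q, gold in data:
        bucket = wins if agent(prompt, q) == gold else fails
        bucket[slice_of(q)].append((q, gold))
    if not fails:
        return [], []
    worst = max(fails, key=lambda s: len(fails[s]))
    return fails[worst], wins[worst]


def reflect(prompt, train, val, agent, teacher, slice_of, rounds=3):
    """Propose edits from contrastive evidence; keep one only if validation improves."""
    best, best_acc = prompt, accuracy(prompt, val, agent)
    for _ in range(rounds):
        failures, successes = contrastive_slice(best, train, agent, slice_of)
        if not failures:
            break
        candidate = teacher(best, failures, successes)
        cand_acc = accuracy(candidate, val, agent)
        if cand_acc > best_acc:          # validation gate
            best, best_acc = candidate, cand_acc
    return best, best_acc


# Toy stand-ins (assumptions for illustration only).
def toy_agent(prompt: str, question: str) -> str:
    if question.startswith("when"):
        ok = "year" in prompt or "born" in question
        return "1999" if ok else "unknown"
    return "paris"


def toy_teacher(prompt, failures, successes):
    # A real Teacher LLM would read both lists; here the edit is hard-coded.
    return prompt + " Answer date questions with a year."


train = [("when released", "1999"), ("when founded", "1999"),
         ("when born", "1999"), ("where held", "paris")]
val = [("when signed", "1999"), ("where located", "paris")]

best_prompt, best_acc = reflect(
    "Answer briefly.", train, val, toy_agent, toy_teacher,
    slice_of=lambda q: q.split()[0],
)
```

In this toy run the "when" slice holds two failures and one success, so the teacher sees both sides of the same region before proposing an edit, and the edit is kept only because validation accuracy rises.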

On a public HotpotQA retrieval-augmented QA setup, a single tree-selected contrastive repair improves held-out exact-match accuracy from 51.4% to 60.4%. In contrast, failure-only and random-evidence variants show smaller gains and break more previously correct examples. A light instruction-only comparison places Contrastive Reflection near modern prompt optimizers: MIPROv2 reaches 59.4% and GEPA 57.0%.
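To make the "break more previously correct examples" comparison concrete, the sketch below scores two runs with exact match and counts how many examples a prompt edit fixes versus breaks. It assumes the SQuAD-style answer normalization (lowercasing, dropping punctuation and articles) that is commonly used for HotpotQA; the article does not state which scoring script the authors used, and the function names and sample answers here are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def compare_runs(before, after, gold):
    """Accuracy before/after a prompt edit, plus fixed and broken counts."""
    fixed = broken = 0
    for b, a, g in zip(before, after, gold):
        was, now = exact_match(b, g), exact_match(a, g)
        fixed += int(not was and now)    # wrong -> right
        broken += int(was and not now)   # right -> wrong (a regression)
    n = len(gold)
    return {
        "acc_before": sum(exact_match(b, g) for b, g in zip(before, gold)) / n,
        "acc_after": sum(exact_match(a, g) for a, g in zip(after, gold)) / n,
        "fixed": fixed,
        "broken": broken,
    }


# Made-up answers for illustration only.
gold = ["Paris", "1999", "The Beatles", "yes"]
before = ["paris.", "2001", "beatles", "no"]
after = ["Paris", "1999", "Rolling Stones", "no"]

report = compare_runs(before, after, gold)
```

Here accuracy is identical before and after, yet one example was fixed and one was broken, which is why reporting regressions alongside the headline accuracy matters.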

The authors instantiate the framework with a tree-based slice selector but emphasize that the core contribution is the contrastive reflection loop rather than the selector itself. The method is designed to make prompt repair more inspectable and validation-driven, addressing a gap in current agentic IR workflows.

The work has been accepted for presentation at the Agent4IR @ KDD 2026 workshop.

Sources

01 arXiv cs.AI: Contrastive Reflection for Iterative Prompt Optimization
