Research · Jun 18, 2026

Researchers propose CaVe-VLM-CoT, a reflection-based agentic-RAG framework for interpretable vision-language models

The framework introduces a five-stage closed-loop pipeline to reduce hallucinations by enforcing step-level citation grounding and structured feedback loops, with new evaluation metrics including CaVeScore.

Trust: 79

Hype: Low

1 source · cross-referenced


TL;DR

Introduces CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework for vision-language models (VLMs) to reduce hallucinations via evidence-grounded reasoning.
Proposes a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, with structured feedback to the Extractor on ungrounded claims.
Presents a suite of 23 component-wise metrics, including CaVeScore, to jointly measure retrieval quality, citation faithfulness, and cross-modal grounding.
Reports 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects), without architectural or prompt modifications.

Vision-language models (VLMs) frequently generate fluent but visually unfaithful outputs, a phenomenon known as hallucination. Existing approaches such as chain-of-thought and retrieval-augmented methods address this only partially, as they do not enforce step-level citation grounding or provide mechanisms to route verification failures back to retrieval for correction.

Researchers introduce CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework designed to enforce evidence-grounded reasoning through a five-stage closed-loop pipeline. The pipeline consists of Extractor, Retriever, Solver, Citation Injector, and Verifier stages, where detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval.
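The closed loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base `KB`, the word-overlap matching, the fixed solver output, and the `max_rounds` cap are all assumptions made for illustration. Only the five stage names and the Verifier-to-Extractor feedback path come from the paper's description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Toy evidence store (hypothetical; stands in for a real retrieval corpus).
KB: Dict[str, str] = {
    "doc1": "chlorophyll makes leaves green",
    "doc2": "photosynthesis turns light into sugar",
}


@dataclass
class Step:
    claim: str
    citations: List[str] = field(default_factory=list)


def extractor(question: str, feedback: List[str]) -> Set[str]:
    """Build query terms from the question plus any ungrounded claims."""
    terms = set(question.lower().split())
    for claim in feedback:
        terms |= set(claim.lower().split())
    return terms


def retriever(terms: Set[str]) -> Dict[str, str]:
    """Return every document sharing at least one word with the query terms."""
    return {doc_id: text for doc_id, text in KB.items() if terms & set(text.split())}


def solver(question: str, evidence: Dict[str, str]) -> List[Step]:
    """Stand-in for the VLM: emits fixed reasoning steps for this demo."""
    return [Step("leaves contain green chlorophyll"), Step("light becomes sugar")]


def citation_injector(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    """Cite each document that overlaps a claim by two or more words."""
    cited = []
    for step in steps:
        words = set(step.claim.split())
        ids = [d for d, t in evidence.items() if len(words & set(t.split())) >= 2]
        cited.append(Step(step.claim, ids))
    return cited


def verifier(steps: List[Step]) -> List[str]:
    """Return the claims that ended up with no supporting citation."""
    return [s.claim for s in steps if not s.citations]


def run_pipeline(question: str, max_rounds: int = 3) -> Tuple[List[Step], int]:
    """Run the five stages, routing ungrounded claims back to the Extractor."""
    feedback: List[str] = []
    steps: List[Step] = []
    for round_no in range(1, max_rounds + 1):
        terms = extractor(question, feedback)
        evidence = retriever(terms)
        steps = citation_injector(solver(question, evidence), evidence)
        feedback = verifier(steps)
        if not feedback:  # every step is grounded, so stop early
            return steps, round_no
    return steps, max_rounds


steps, rounds = run_pipeline("why are leaves green")
print(rounds, [s.citations for s in steps])  # prints: 2 [['doc1'], ['doc2']]
```

In this toy run the first pass retrieves only `doc1`, so the second reasoning step has no citation. The Verifier flags it, the Extractor widens the query with that claim's words, and the second pass grounds both steps.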

To evaluate the framework, the authors propose a suite of 23 component-wise metrics spanning all pipeline stages, anchored by CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution, and evidence grounding. The metrics are meant to jointly measure retrieval quality, step-wise citation faithfulness, and cross-modal grounding, which the authors say existing evaluation practices do not cover.
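As a rough illustration of how a composite of this kind could be assembled, the sketch below computes set-based citation precision and recall and then takes a weighted average of the five named components. The summary does not give the actual CaVeScore formula or weights, so the set-based definitions, the equal weights, the component values, and the function names here are all assumptions.

```python
from typing import Dict, Iterable, Tuple


def citation_precision_recall(cited: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall of cited evidence ids (an assumed definition)."""
    cited_set, relevant_set = set(cited), set(relevant)
    hits = len(cited_set & relevant_set)
    precision = hits / len(cited_set) if cited_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def composite_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of component scores, each expected to lie in [0, 1]."""
    if components.keys() != weights.keys():
        raise ValueError("components and weights must use the same keys")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * components[k] for k in components) / total


# Hypothetical example: one of two cited documents is actually relevant.
precision, recall = citation_precision_recall(["doc1", "doc3"], ["doc1", "doc2"])
components = {
    "accuracy": 1.0,
    "citation_precision": precision,
    "citation_recall": recall,
    "attribution": 1.0,
    "evidence_grounding": 0.5,
}
weights = {name: 1.0 for name in components}  # equal weights are an assumption
score = composite_score(components, weights)
```

A weighted average keeps the composite on the same 0 to 1 scale as its parts, which is one way a model could score high on accuracy yet noticeably lower on the composite, as in the reported ScienceQA and MMMU figures.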

Without modifying the underlying model architecture or prompts, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU across 30 subjects. The authors present these results as evidence of gains in both accuracy and interpretability on multimodal benchmarks.

Sources

01 arXiv cs.AI, "CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework"
