Research · Jun 18, 2026

Researchers propose CaVe-VLM-CoT, a reflection-based agentic-RAG framework for interpretable vision-language models

The framework introduces a five-stage closed-loop pipeline to reduce hallucinations by enforcing step-level citation grounding and structured feedback loops, with new evaluation metrics including CaVeScore.

TL;DR
  • Introduces CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework for vision-language models (VLMs) to reduce hallucinations via evidence-grounded reasoning.
  • Proposes a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, with structured feedback to the Extractor on ungrounded claims.
  • Presents a suite of 23 component-wise metrics, including CaVeScore, to jointly measure retrieval quality, citation faithfulness, and cross-modal grounding.
  • Reports 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects), without architectural or prompt modifications.

Vision-language models (VLMs) frequently generate fluent but visually unfaithful outputs, a phenomenon known as hallucination. Existing approaches such as chain-of-thought prompting and retrieval-augmented methods address this only partially: they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction.

Researchers introduce CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework designed to enforce evidence-grounded reasoning through a five-stage closed-loop pipeline. The pipeline consists of Extractor, Retriever, Solver, Citation Injector, and Verifier stages. When an ungrounded claim is detected, structured feedback goes back to the Extractor, which triggers targeted re-retrieval.
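
The closed loop described above can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the `Step` type, the toy keyword retriever, the fixed two-step "solver", and the `max_rounds` cap are all hypothetical stand-ins, and the sketch assumes the Verifier is the stage that reports ungrounded claims. It only shows the control flow in which those claims feed the next extraction round.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical toy evidence store; the real framework retrieves multimodal evidence.
CORPUS: Dict[str, str] = {
    "e1": "plants absorb light through chlorophyll",
    "e2": "chlorophyll reflects green light",
}

@dataclass
class Step:
    text: str
    citations: List[str] = field(default_factory=list)

def extract(question: str, feedback: List[str]) -> List[str]:
    # Extractor: start from the question, then add any ungrounded claims as queries.
    return [question] + feedback

def retrieve(queries: List[str]) -> Dict[str, str]:
    # Retriever: keep documents sharing at least two words with some query.
    hits = {}
    for doc_id, text in CORPUS.items():
        words = set(text.split())
        if any(len(words & set(q.lower().split())) >= 2 for q in queries):
            hits[doc_id] = text
    return hits

def solve(question: str, evidence: Dict[str, str]) -> List[Step]:
    # Solver: a fixed two-step chain stands in for the VLM's reasoning.
    return [Step("plants absorb light"), Step("chlorophyll reflects green light")]

def inject_citations(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    # Citation Injector: cite every retrieved document that contains the step text.
    for step in steps:
        step.citations = [d for d, text in evidence.items() if step.text in text]
    return steps

def verify(steps: List[Step], evidence: Dict[str, str]) -> List[str]:
    # Verifier (assumed role): report steps that ended up with no supporting citation.
    return [s.text for s in steps if not s.citations]

def run_pipeline(question: str, max_rounds: int = 3) -> Tuple[List[Step], int]:
    feedback: List[str] = []
    steps: List[Step] = []
    for round_no in range(1, max_rounds + 1):
        evidence = retrieve(extract(question, feedback))
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps, evidence)
        if not feedback:  # every step is grounded, so the loop closes
            return steps, round_no
    return steps, max_rounds

steps, rounds = run_pipeline("why do plants absorb light")
for s in steps:
    print(s.text, s.citations)
print("rounds:", rounds)
```

In this toy run the first round retrieves only one document, so the second reasoning step is flagged as ungrounded and becomes an extra query in the next round. The loop stops once every step carries a citation or the round cap is reached.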

To evaluate the framework, the authors propose a suite of 23 component-wise metrics spanning all pipeline stages, anchored by CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution, and evidence grounding. Together, the metrics are meant to measure retrieval quality, step-wise citation faithfulness, and cross-modal grounding, which the authors say existing evaluation practices do not capture jointly.
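
To make the idea of a weighted composite concrete, here is a rough sketch. The summary does not state CaVeScore's actual weights or component definitions, so everything below is an assumption: the equal default weights, the set-based citation precision and recall, and the function names `citation_precision_recall` and `composite_score` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

def citation_precision_recall(
    cited: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    # Set-based definition (an assumption): precision is the share of cited
    # evidence that is relevant, recall is the share of relevant evidence cited.
    cited_set, relevant_set = set(cited), set(relevant)
    true_positives = len(cited_set & relevant_set)
    precision = true_positives / len(cited_set) if cited_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def composite_score(
    components: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    # Weighted mean of component scores in [0, 1].
    # Equal weights are a placeholder, not the published CaVeScore weighting.
    if weights is None:
        weights = {name: 1.0 for name in components}
    total_weight = sum(weights[name] for name in components)
    return sum(weights[name] * value for name, value in components.items()) / total_weight

precision, recall = citation_precision_recall(
    cited=["e1", "e2", "e3"], relevant=["e1", "e2", "e4", "e5"]
)
score = composite_score({
    "accuracy": 1.0,
    "citation_precision": precision,
    "citation_recall": recall,
    "attribution": 1.0,
    "evidence_grounding": 0.5,
})
print(round(precision, 3), round(recall, 3), round(score, 3))
```

A composite of this shape explains how a system can post high answer accuracy alongside a much lower composite score: correct answers with weak or missing citations drag the weighted mean down.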

Without modifying the underlying model architecture or prompts, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU across 30 subjects. The authors present these results as evidence of improved accuracy and interpretability on multimodal benchmarks.

Sources
  1. arXiv cs.AI: CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework