Researchers propose CaVe-VLM-CoT, a reflection-based agentic-RAG framework for interpretable vision-language models
The framework introduces a five-stage closed-loop pipeline to reduce hallucinations by enforcing step-level citation grounding and structured feedback loops, with new evaluation metrics including CaVeScore.
- Introduces CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework for vision-language models (VLMs) to reduce hallucinations via evidence-grounded reasoning.
- Proposes a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, with structured feedback to the Extractor on ungrounded claims.
- Presents a suite of 23 component-wise metrics, including CaVeScore, to jointly measure retrieval quality, citation faithfulness, and cross-modal grounding.
- Reports 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects), without architectural or prompt modifications.
Vision-language models (VLMs) frequently generate fluent but visually unfaithful outputs, a phenomenon known as hallucination. Existing approaches such as chain-of-thought and retrieval-augmented methods address this only partially, as they do not enforce step-level citation grounding or provide mechanisms to route verification failures back to retrieval for correction.
Researchers introduce CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework designed to enforce evidence-grounded reasoning through a five-stage closed-loop pipeline. The pipeline consists of Extractor, Retriever, Solver, Citation Injector, and Verifier stages, where detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval.
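The closed loop described above can be sketched as a small control flow. This is a minimal illustration, not the authors' implementation: the function names, the `Step` type, the toy keyword knowledge base, and the `max_rounds` cap are all assumptions, since this summary does not describe the real stage interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reasoning step plus the evidence keys it cites (hypothetical type)."""
    claim: str
    citations: List[str] = field(default_factory=list)


def run_closed_loop(
    question: str,
    extract: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[List[str]], Dict[str, str]],
    solve: Callable[[str, Dict[str, str]], List[Step]],
    inject_citations: Callable[[List[Step], Dict[str, str]], List[Step]],
    verify: Callable[[List[Step]], List[str]],
    max_rounds: int = 3,  # assumed cap; the summary gives no stopping rule
) -> Tuple[List[Step], int]:
    """Run Extractor -> Retriever -> Solver -> Citation Injector -> Verifier,
    feeding ungrounded claims back to the Extractor until none remain."""
    feedback: List[str] = []
    steps: List[Step] = []
    for round_idx in range(1, max_rounds + 1):
        queries = extract(question, feedback)
        evidence = retrieve(queries)
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps)  # claims that have no supporting citation
        if not feedback:
            return steps, round_idx
    return steps, max_rounds


# Toy stages so the sketch runs end to end (all hypothetical).
KB = {
    "leaf": "Leaves contain chlorophyll.",
    "green": "Chlorophyll reflects green light.",
}


def extract(question: str, feedback: List[str]) -> List[str]:
    # First pass uses one keyword; later passes add terms found in ungrounded claims.
    return ["leaf"] + [w for claim in feedback for w in KB if w in claim.lower()]


def retrieve(queries: List[str]) -> Dict[str, str]:
    return {q: KB[q] for q in queries if q in KB}


def solve(question: str, evidence: Dict[str, str]) -> List[Step]:
    return [Step("The leaf contains chlorophyll."), Step("Chlorophyll looks green.")]


def inject_citations(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    for step in steps:
        step.citations = [key for key in evidence if key in step.claim.lower()]
    return steps


def verify(steps: List[Step]) -> List[str]:
    return [step.claim for step in steps if not step.citations]
```

In this toy run the second step has no citation after the first pass, so the Verifier's feedback makes the Extractor add a query and the loop needs a second round. A real Verifier would check whether the cited evidence supports the claim rather than just whether a citation exists.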
To evaluate the framework, the authors propose a suite of 23 component-wise metrics spanning all pipeline stages, anchored by CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution, and evidence grounding. These metrics aim to jointly measure retrieval quality, step-wise citation faithfulness, and cross-modal grounding, addressing gaps in existing evaluation practices.
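A weighted composite of this kind can be sketched as follows. The equal weights, the component names, and the set-based precision and recall definitions are assumptions for illustration only; this summary does not give the actual CaVeScore formula or weights.

```python
from typing import Dict, Set, Tuple


def citation_precision_recall(cited: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall of cited evidence (assumed definition)."""
    hits = len(cited & relevant)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def composite_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of component scores, each expected in [0, 1]."""
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * components[name] for name in weights) / total


# Hypothetical equal weighting over the five ingredients named in the text.
EQUAL_WEIGHTS = {
    "accuracy": 1.0,
    "citation_precision": 1.0,
    "citation_recall": 1.0,
    "attribution": 1.0,
    "evidence_grounding": 1.0,
}
```

Because the score mixes accuracy with citation and grounding terms, a system can answer correctly and still score low if its steps are poorly cited, which fits the gap between accuracy and CaVeScore in the reported results.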
Without modifying the underlying model architecture or prompts, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU across 30 subjects. The authors report these figures as evidence of improved performance and interpretability on multimodal benchmarks.
- Jun 19, 2026 · arXiv cs.CL