Research · Jun 19, 2026

LLM ensemble achieves 0.74 F1-score in automating EQ-5D study detection from PubMed abstracts

Weighted ensemble of Gemini 2.5 Pro and two Gemma 3 variants outperforms individual models on expert-labeled biomedical screening task.



TL;DR
  • A weighted ensemble of Google's Gemini 2.5 Pro, Gemma 3 12B, and Gemma 3 27B achieved a 0.74 weighted F1-score and 0.74 accuracy in detecting EQ-5D studies from PubMed abstracts.
  • The approach combined few-shot prompting, weighted ensembling, and a soft stacking meta-classifier to improve precision-recall balance and interpretability.
  • Nine LLMs were evaluated on a dataset manually labeled by two experts, with the ensemble surpassing individual model performance.

Researchers evaluated nine large language models, including Google’s Gemini 2.5 Pro and two variants of Gemma 3 (12B and 27B parameters), on the task of detecting EQ-5D health-related quality-of-life studies from PubMed abstracts.

The team used a multi-phase framework built on few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier to combine model outputs.
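
To make the weighted aggregation step concrete, here is a minimal sketch of weighted soft voting over per-model probabilities. It is an illustration, not the paper's implementation: the function name `weighted_ensemble`, the weights, the example probabilities, and the 0.5 threshold are all assumptions; only the three model identifiers come from the article.

```python
from typing import Dict, Tuple


def weighted_ensemble(
    probs: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Combine per-model probabilities that an abstract is an EQ-5D study.

    Returns the weight-normalized average probability and the resulting
    yes/no decision at the given threshold.
    """
    total_weight = sum(weights[name] for name in probs)
    score = sum(weights[name] * p for name, p in probs.items()) / total_weight
    return score, score >= threshold


# Hypothetical probabilities for one abstract (not data from the study).
example_probs = {"gemini-2.5-pro": 0.9, "gemma-3-12b": 0.4, "gemma-3-27b": 0.6}

# Hypothetical weights; in practice they might be tuned on a validation set.
example_weights = {"gemini-2.5-pro": 0.5, "gemma-3-12b": 0.2, "gemma-3-27b": 0.3}

score, is_eq5d = weighted_ensemble(example_probs, example_weights)
print(f"ensemble score={score:.2f}, predicted EQ-5D study={is_eq5d}")
```

With these made-up numbers the stronger weight on one model pulls the combined score above the threshold even though another model disagrees, which is the basic behavior a weighted ensemble is meant to provide.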

A weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a 0.74 weighted F1-score and 0.74 accuracy, outperforming individual models on the same dataset.
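
For readers unfamiliar with the metric, a weighted F1-score averages the per-class F1 values, weighting each class by its share of the true labels. The sketch below computes it from scratch on toy labels; the labels are invented for illustration and are not the study's data, and the helper name `weighted_f1` is an assumption.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        # Weight each class by how often it appears among the true labels.
        score += f1 * support[label] / total
    return score


# Toy labels: 1 = EQ-5D study, 0 = not an EQ-5D study (illustrative only).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"weighted F1={weighted_f1(toy_true, toy_pred):.2f}, accuracy={toy_accuracy:.2f}")
```

Because the weighting follows class frequency, weighted F1 and accuracy can land on similar values when errors are spread evenly across classes, as they do in this toy case.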

The ensemble improved the balance between precision and recall compared to single models, and the soft stacking approach enhanced reliability and interpretability of predictions.
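
Soft stacking generally means training a small meta-classifier on the base models' probability outputs rather than on their hard labels. The sketch below uses a hand-written logistic regression as the meta-classifier, which is an assumption: the article does not say which meta-classifier the authors used. The training rows are invented, and a real pipeline would normally fit the meta-classifier on held-out or out-of-fold predictions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_classifier(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent.

    Each row of X holds the base models' probabilities for one abstract.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Meta-classifier probability that the abstract is an EQ-5D study."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented rows: [p(gemini-2.5-pro), p(gemma-3-12b), p(gemma-3-27b)].
X_train = [
    [0.9, 0.7, 0.8],
    [0.8, 0.4, 0.7],
    [0.2, 0.3, 0.1],
    [0.3, 0.6, 0.2],
    [0.7, 0.8, 0.9],
    [0.1, 0.2, 0.3],
]
y_train = [1, 1, 0, 0, 1, 0]  # invented expert labels

meta_w, meta_b = fit_meta_classifier(X_train, y_train)
new_abstract = [0.85, 0.6, 0.8]  # hypothetical base-model probabilities
print(f"stacked probability: {predict_proba(meta_w, meta_b, new_abstract):.2f}")
print("learned weights per base model:", [round(v, 2) for v in meta_w])
```

A linear meta-classifier of this kind is one way to get the interpretability the article mentions, since the learned weights show how much each base model's probability moves the final prediction.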

Feature analysis indicated that model probability outputs were critical in guiding final predictions, suggesting that uncertainty calibration contributes to performance.

The findings support the feasibility of automating parts of systematic literature review screening in biomedical research using ensemble LLM methods.

Sources
  1. arXiv cs.CL: Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts