Research · Jun 19, 2026

LLM ensemble achieves 0.74 F1-score in automating EQ-5D study detection from PubMed abstracts

Weighted ensemble of Gemini 2.5 Pro and two Gemma 3 variants outperforms individual models on expert-labeled biomedical screening task.

Trust: 79

Hype: Low

1 source · cross-referenced

TL;DR

A weighted ensemble of Google's Gemini 2.5 Pro, Gemma 3 12B, and Gemma 3 27B achieved a 0.74 weighted F1-score and 0.74 accuracy in detecting EQ-5D studies from PubMed abstracts.
The approach combined few-shot prompting, weighted ensembling, and a soft stacking meta-classifier to improve precision-recall balance and interpretability.
Nine LLMs were evaluated on a dataset manually labeled by two experts, with the ensemble surpassing individual model performance.

Researchers evaluated nine large language models—including Google’s Gemini 2.5 Pro and two variants of Gemma 3 (12B and 27B parameters)—on a task to detect EQ-5D health-related quality-of-life studies in PubMed abstracts.

The team used a multi-phase framework: few-shot prompting of each model, weighted ensemble aggregation of the resulting predictions, and a soft stacking meta-classifier that learns how to combine the model outputs.
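The weighted aggregation step can be pictured as a weighted average of each model's probability that an abstract reports an EQ-5D study. The following is a minimal sketch, assuming each model returns such a probability. The weights, the 0.5 threshold, and the function name are hypothetical, since the article does not report the actual values.

```python
from typing import Dict, Tuple

# Hypothetical weights for illustration only; the article does not report them.
WEIGHTS: Dict[str, float] = {
    "gemini-2.5-pro": 0.5,
    "gemma-3-27b": 0.3,
    "gemma-3-12b": 0.2,
}


def weighted_vote(
    probs: Dict[str, float],
    weights: Dict[str, float] = WEIGHTS,
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[float, bool]:
    """Return the weighted mean probability and the resulting yes/no decision.

    `probs` maps a model name to its probability that the abstract
    describes an EQ-5D study.
    """
    total_weight = sum(weights[name] for name in probs)
    score = sum(weights[name] * p for name, p in probs.items()) / total_weight
    return score, score >= threshold


# One abstract where the models disagree: the more heavily weighted model
# pulls the ensemble above the threshold.
example = {"gemini-2.5-pro": 0.9, "gemma-3-27b": 0.4, "gemma-3-12b": 0.3}
score, is_eq5d = weighted_vote(example)
print(round(score, 2), is_eq5d)
```

Dividing by the sum of the weights actually used keeps the score a valid probability even if one model's output is missing for a given abstract.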

A weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a 0.74 weighted F1-score and 0.74 accuracy, outperforming individual models on the same dataset.
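For readers unfamiliar with the metric, a weighted F1-score averages the per-class F1 values, weighting each class by its number of true examples. The sketch below is a plain-Python illustration of that definition on made-up labels. It is not the study's evaluation code, and the function name is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


# Made-up labels: 1 = EQ-5D study, 0 = not an EQ-5D study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(weighted_f1(y_true, y_pred), accuracy)
```

In this toy case the two metrics coincide at 0.75, which shows that equal reported values for weighted F1 and accuracy are not unusual in themselves.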

The ensemble improved the balance between precision and recall compared to single models, and the soft stacking approach enhanced reliability and interpretability of predictions.

Feature analysis indicated that model probability outputs were critical in guiding final predictions, suggesting that uncertainty calibration contributes to performance.
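One common way to build a soft stacking meta-classifier is to train a small logistic regression on the base models' probability outputs and then inspect its learned coefficients as a rough indicator of which inputs drive the final prediction. The sketch below assumes that setup. The article does not specify the meta-classifier or the feature analysis method, and the data, function names, and hyperparameters here are synthetic and hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Meta-classifier probability for one row of base-model probabilities."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_meta_classifier(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,      # assumed learning rate
    epochs: int = 2000,   # assumed number of full-batch steps
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic example: each row holds three base-model probabilities.
# Column 0 tracks the label closely, column 2 carries no signal.
X = [
    [0.9, 0.7, 0.5], [0.8, 0.6, 0.4], [0.7, 0.8, 0.6],
    [0.2, 0.4, 0.5], [0.1, 0.3, 0.6], [0.3, 0.2, 0.4],
]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_meta_classifier(X, y)
print("coefficients:", [round(c, 2) for c in w])
print("predictions:", [int(predict_proba(w, b, x) >= 0.5) for x in X])
```

Because the inputs are probabilities rather than hard votes, a confident base model can outweigh two hesitant ones. This is one plausible reading of why probability outputs would matter to the final prediction.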

The findings support the feasibility of automating parts of systematic literature review screening in biomedical research using ensemble LLM methods.

Sources

01 arXiv cs.CL: Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts
