LLM ensemble achieves 0.74 weighted F1-score in detecting EQ-5D studies in PubMed abstracts
Weighted ensemble of Gemini 2.5 Pro and two Gemma 3 variants outperforms individual models on expert-labeled biomedical screening task.
- A weighted ensemble of Google's Gemini 2.5 Pro, Gemma 3 12B, and Gemma 3 27B achieved a 0.74 weighted F1-score and 0.74 accuracy in detecting EQ-5D studies from PubMed abstracts.
- The approach combined few-shot prompting, weighted ensembling, and a soft stacking meta-classifier to improve the precision-recall balance and interpretability.
- Nine LLMs were evaluated on a dataset manually labeled by two experts, with the ensemble surpassing individual model performance.
Researchers evaluated nine large language models, including Google’s Gemini 2.5 Pro and two variants of Gemma 3 (12B and 27B parameters), on the task of detecting EQ-5D health-related quality-of-life studies in PubMed abstracts.
The team used a multi-phase framework of few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier to combine model outputs.
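As a rough illustration of the few-shot prompting step, the sketch below assembles a classification prompt from a handful of labeled abstracts. The prompt wording, the `Example` type, the `build_few_shot_prompt` helper, and the example abstracts are all hypothetical; the article does not describe the prompts the researchers actually used.

```python
from dataclasses import dataclass


@dataclass
class Example:
    abstract: str
    label: str  # "yes" if the abstract reports an EQ-5D study, else "no"


def build_few_shot_prompt(examples: list[Example], abstract: str) -> str:
    """Assemble a few-shot classification prompt (hypothetical wording)."""
    parts = [
        "Decide whether the PubMed abstract reports a study that uses EQ-5D.",
        "Answer with 'yes' or 'no'.",
        "",
    ]
    # Each labeled example is shown as an abstract followed by its answer.
    for ex in examples:
        parts += [f"Abstract: {ex.abstract}", f"Answer: {ex.label}", ""]
    # The abstract to classify goes last, with the answer left open.
    parts += [f"Abstract: {abstract}", "Answer:"]
    return "\n".join(parts)


# Placeholder examples written for this sketch, not drawn from the study's data.
demo_examples = [
    Example("Patients completed the EQ-5D-5L at baseline and 12 months.", "yes"),
    Example("We measured serum ferritin in 200 blood donors.", "no"),
]
demo_prompt = build_few_shot_prompt(
    demo_examples, "Utility values were derived from EQ-5D-3L responses."
)
```

The resulting string would be sent to each model in turn, and the model's answer (or its probability of answering "yes") would feed the later aggregation steps.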
A weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a 0.74 weighted F1-score and 0.74 accuracy, outperforming individual models on the same dataset.
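One common way to build such a weighted ensemble is a weighted average of each model's predicted probability, followed by a threshold. The sketch below assumes that form; the weights, probabilities, and the 0.5 threshold are illustrative values, since the article does not report the ones used in the study.

```python
def weighted_ensemble(probs: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-model probabilities that an abstract is an EQ-5D study."""
    total = sum(weights[name] for name in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * p for name, p in probs.items()) / total


def predict(probs: dict[str, float], weights: dict[str, float], threshold: float = 0.5) -> bool:
    """Label the abstract positive when the weighted score reaches the threshold."""
    return weighted_ensemble(probs, weights) >= threshold


# Illustrative weights and probabilities; not values reported by the study.
weights = {"gemini-2.5-pro": 0.5, "gemma-3-27b": 0.3, "gemma-3-12b": 0.2}
probs = {"gemini-2.5-pro": 0.9, "gemma-3-27b": 0.4, "gemma-3-12b": 0.3}
score = weighted_ensemble(probs, weights)  # roughly 0.63
```

In this toy case the two Gemma models lean negative, but the higher-weighted model pulls the combined score above the threshold.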
The ensemble improved the balance between precision and recall compared to single models, and the soft stacking approach enhanced reliability and interpretability of predictions.
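For readers unfamiliar with the metric, a weighted F1-score averages the per-class F1 (the harmonic mean of precision and recall) using each class's share of the true labels as its weight. The sketch below works through that arithmetic on toy labels that are made up for illustration and unrelated to the study's dataset.

```python
from collections import Counter


def weighted_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n  # weight each class's F1 by its support
    return total / len(y_true)


# Toy labels for illustration only (1 = EQ-5D study, 0 = not).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
toy_score = weighted_f1(y_true, y_pred)  # class 1 F1 = 2/3, class 0 F1 = 0.8, weighted = 0.75
```

Because the weights follow class frequency, the majority class dominates the score when the labels are imbalanced.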
Feature analysis indicated that model probability outputs were critical in guiding final predictions, suggesting that uncertainty calibration contributes to performance.
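A soft stacking meta-classifier can be sketched as a small model trained on the base models' probability outputs. The example below assumes logistic regression fitted by plain gradient descent, which is only one possible choice; the article does not say which meta-classifier the researchers used, and the training rows are invented for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_classifier(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient of the log loss.
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, row):
    """Meta-classifier probability that the abstract is an EQ-5D study."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Toy training data: each row holds three base-model probabilities (made up).
X = [
    [0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6],
    [0.2, 0.3, 0.1], [0.1, 0.4, 0.3], [0.3, 0.2, 0.2],
]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_meta_classifier(X, y)
```

In a model of this kind, the learned weights in `w` give a simple view of how strongly each base model's probability drives the final prediction, which is one way a feature analysis like the one described could be carried out.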
The findings support the feasibility of automating parts of systematic literature review screening in biomedical research using ensemble LLM methods.