Study evaluates how language models interpret personal health records to answer patient questions
Researchers tested whether LLMs given access to de-identified patient data improve the quality and safety of health guidance, identifying specific gaps in clinical understanding.
- Researchers evaluated Gemini 3.0 Flash responses to 2,257 patient health queries under three conditions: without personal health record (PHR) context, with basic summaries, and with full clinical notes.
- Statistically significant improvements in answer helpfulness emerged with PHR data (p < 0.001), with gains observed in safety, accuracy, relevance, and personalization.
- A new evaluation framework identified LLM gaps in interpreting complex health records, including temporal disorientation and occasional confabulations.
- Responses were scored by automated rating systems and, for a subset of 95 queries, by clinicians; both had full PHR knowledge.
Researchers from Google and partner institutions conducted a controlled evaluation of how large language models use personal health records to respond to patient questions. The study employed Gemini 3.0 Flash and assessed its answers across three conditions: responses generated without any PHR context, with basic demographic and medication summaries, and with access to complete clinical notes.
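The three conditions can be pictured as three ways of assembling the model input for the same question. The sketch below is illustrative only and is not the study's code: the names `Condition`, `PatientRecord`, `build_prompt`, and `build_all` are hypothetical, the prompt wording is invented, and it assumes the full-notes condition includes the basic summary as well as the notes, which the source does not state.

```python
from dataclasses import dataclass, field
from enum import Enum


class Condition(Enum):
    """The three evaluation conditions described in the article."""
    NO_PHR = "no_phr"
    BASIC_SUMMARY = "basic_summary"
    FULL_NOTES = "full_notes"


@dataclass
class PatientRecord:
    # Hypothetical container for a de-identified record; field names are illustrative.
    demographics: str
    medications: list[str]
    clinical_notes: list[str] = field(default_factory=list)


def build_prompt(query: str, record: PatientRecord, condition: Condition) -> str:
    """Assemble the model input for one condition (prompt wording is made up)."""
    sections = []
    if condition in (Condition.BASIC_SUMMARY, Condition.FULL_NOTES):
        sections.append("Demographics: " + record.demographics)
        sections.append("Medications: " + ", ".join(record.medications))
    if condition is Condition.FULL_NOTES:
        # Assumption: full notes are added on top of the basic summary.
        sections.append("Clinical notes:\n" + "\n".join(record.clinical_notes))

    question = "Patient question: " + query
    if not sections:
        # No-PHR condition: the model sees only the question.
        return question
    return "Patient record:\n" + "\n".join(sections) + "\n\n" + question


def build_all(query: str, record: PatientRecord) -> dict[Condition, str]:
    """Build one prompt per condition so the answers can be compared."""
    return {c: build_prompt(query, record, c) for c in Condition}
```

Calling `build_all` once per query yields three prompts that differ only in record context, so any difference in rated answer quality can be attributed to the PHR data rather than to the question.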
The evaluation dataset comprised 2,257 queries representing three distinct question patterns: web-style searches, template-derived chatbot-style questions, and actual queries patients posed to healthcare providers. These queries were matched against de-identified records from a pool of 1,945 patients.
Results showed statistically significant improvements in answer helpfulness when the model had access to PHR data (p < 0.001), with benefits extending to safety, accuracy, relevance, and personalization across all question types. Clinician raters evaluated a subset of 95 responses alongside automated assessment.
Beyond aggregate improvement metrics, researchers developed a new evaluation framework to classify specific failure modes in PHR interpretation. The framework revealed gaps in how LLMs handle temporal reasoning in health records and documented instances of clinically meaningful confabulation: errors that seemed plausible but were factually incorrect.
The work frames PHR-augmented health AI as a potential tool for patient empowerment while establishing systematic methods for detecting when LLMs misinterpret complex clinical information.