Validation of a GPT-4o–Powered Classifier for Stratification of High-Volume Epilepsy Patient Portal Messages
Abstract number :
1.1
Submission category :
13. Health Services (Delivery of Care, Access to Care, Health Care Models)
Year :
2025
Submission ID :
1135
Source :
www.aesnet.org
Presentation date :
12/6/2025
Authors :
Presenting Author: Valdery Moura Junior, PhD, MBA – Massachusetts General Hospital
Susanna Gallani, PhD – Harvard Business School
Lara Basovic, MD – Massachusetts General Hospital
Sydney Cash, MD, PhD – Massachusetts General Hospital
Elyse Park, PhD – Massachusetts General Hospital
Gaurdia Banister, PhD – Massachusetts General Hospital
Louisa Sylvia, PhD – Massachusetts General Hospital
Shawn Murphy, MD, PhD – Massachusetts General Hospital
Peter Hadar, MD, MS – Massachusetts General Hospital; Harvard Medical School
Lidia M.V.R. Moura, MD, MPH, PhD – Massachusetts General Hospital; Harvard Medical School
Rationale: In outpatient epilepsy care, high-volume patient messaging places increasing demands on clinicians and complicates timely triage. We evaluated a GPT-4o–based classifier, developed via prompt engineering, to categorize and triage patient portal messages, with the goals of supporting safe and scalable clinical operations and improving patient care.
Methods: A random sample of 101 messages sent through the electronic medical record portal by established patients seen at a tertiary epilepsy clinic in 2019 was drawn. Three epilepsy physicians (MDs) independently reviewed and annotated each message along two axes: (1) message type (four categories: visit follow-up or non-urgent medical question, prescription question, test-result question, and referral request), and (2) clinical urgency (high vs. low/medium). Raters were trained using a written standard operating procedure that operationalized definitions for each label. Classifier outputs from GPT-4o were compared against these human ratings. Performance was evaluated across three agreement criteria: consensus agreement (all 3 MDs), majority agreement (≥2 MDs), and lenient concordance (agreement with ≥1 MD). Accuracy, 95% confidence intervals, and diagnostic metrics (e.g., sensitivity, specificity) were computed.
Results: Human expert agreement was variable, particularly for urgency classification. Despite this, GPT-4o demonstrated high performance. For message type, accuracy was 100% under consensus agreement (n = 67), 89.9% under majority agreement (n = 99), and 100% under lenient concordance (n = 101). For urgency, accuracy was 93.4% under consensus agreement (n = 76; 95% CI: 0.853–0.978), 86.0% under majority agreement (n = 100; 95% CI: 0.776–0.921), and 95.0% under lenient concordance (n = 101; 95% CI: 0.888–0.984). Under consensus MD agreement, urgency sensitivity was 0.80 and specificity 0.94, indicating strong performance in identifying high-priority cases. Importantly, the classifier's output matched at least one expert's judgment in nearly all cases, even when the human raters did not fully agree.
Conclusions: Despite inter-rater variability among clinicians, the GPT-4o classifier demonstrated high reliability across both strict and permissive consensus thresholds. Its ability to consistently align with human reasoning supports its feasibility for augmenting message triage in epilepsy clinical settings.
Funding: Harvard Business School D^3 Associates Program