Identification of Patients with Epilepsy from Electronic Health Records Using Text Mining
Abstract number :
2.381
Submission category :
16. Epidemiology
Year :
2022
Submission ID :
2205163
Source :
www.aesnet.org
Presentation date :
12/4/2022 12:00:00 PM
Published date :
Nov 22, 2022, 05:27 AM
Authors :
Aidan Cardall – Massachusetts General Hospital; Marta Fernandes, PhD – Massachusetts General Hospital; Lidia Moura, MD, PhD – MGH; Claire Jacobs, MD, PhD – MGH; Sahar Zafar, MD – MGH; M. Brandon Westover, MD, PhD – Massachusetts General Hospital / Harvard Medical School
Rationale: Unstructured data in electronic medical records (EMR) are a rich source of information, but abstracting them is labor intensive. Natural language processing (NLP) can reduce the need for manual chart review. We present an application of NLP to large-scale analysis of medical records to determine which patients have a diagnosis of epilepsy.
Methods: We developed an NLP program to identify patients with an epilepsy diagnosis using unstructured notes, outpatient anti-seizure medications (ASMs), ICD codes, age, and sex. We captured ground truth for a subset of clinic patients in structured SmartForms developed by the Epilepsy Learning Healthcare System (ELHS). Data were randomly divided into training (70%) and test (30%) sets, with different patients in each set. We used 5-fold cross-validation for hyperparameter tuning during training. Features provided to the model were indicators of the presence of key words and phrases defined using medical expertise. An extreme gradient boosting model was trained and then evaluated on the hold-out test set. Confidence intervals were calculated via bootstrapping.
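The feature construction and patient-level split described in Methods can be illustrated with a minimal sketch. It is not the authors' code: the phrase list, the function names, and the word-boundary matching rule are all assumptions made for illustration, since the abstract does not list the key words and phrases or say how they were matched.

```python
import random
import re

# Hypothetical key phrases; the abstract does not list the terms actually used.
KEY_PHRASES = ["seizure", "epilepsy", "levetiracetam", "no evidence of epilepsy"]


def phrase_indicators(note_text, phrases=KEY_PHRASES):
    """Return one 0/1 indicator per phrase (case-insensitive, whole-word match)."""
    text = note_text.lower()
    return [
        1 if re.search(r"\b" + re.escape(phrase) + r"\b", text) else 0
        for phrase in phrases
    ]


def split_by_patient(patient_ids, test_fraction=0.3, seed=0):
    """Assign whole patients to train or test so that no patient is in both sets."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids
```

Splitting on patient identifiers rather than on individual notes matters because one patient can contribute several notes; a note-level split could place notes from the same patient on both sides and inflate test performance.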
Results: Our study cohort included 2,020 adults (age ≥ 18) seen in Mass General Brigham (MGB) outpatient epilepsy clinics between December 2018 and May 2022: median age 40 ± 17 years, 58% women, 81% White, 85% Non-Hispanic; 92% with epilepsy. A total of 163 key words and phrases were captured from 6,295 notes and used as input features. The final model combined these features with ICD codes and ASMs, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 (95% CI 0.89-0.97) and areas under the precision-recall curves of 1.00 (95% CI 0.99-1.00) and 0.70 (95% CI 0.60-0.80) for positive and negative diagnosis of epilepsy, respectively.
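As an illustration of how an interval such as the AUROC confidence interval above might be obtained, the sketch below computes a percentile bootstrap interval over test-set predictions. The abstract states only that bootstrapping was used; the percentile variant, the number of resamples, and the function names here are assumptions, not the authors' procedure.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed variant, for illustration)."""
    auroc(labels, scores)  # fail early if the data contain only one class
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class, so AUROC is undefined
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

With a cohort in which 92% of patients have epilepsy, resamples of a small test set can occasionally contain no negative cases; the sketch skips those draws, which is one of several reasonable ways to handle them.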
Conclusions: A machine learning-based NLP approach accurately identifies patients with epilepsy from unstructured clinical notes, ICD codes, and medications. This model will help enable large-scale epilepsy research using electronic medical records.
Funding: NIH, CDC, Epilepsy Foundation of America, Harvard Catalyst