
Assessing the Performance of Open-source Large Language Models in Epilepsy

Abstract number : 1.198
Submission category : 2. Translational Research / 2B. Devices, Technologies, Stem Cells
Year : 2024
Submission ID : 1132
Source : www.aesnet.org
Presentation date : 12/7/2024 12:00:00 AM
Published date :

Authors :
Presenting Author: Faycal Zine-eddine, MD – Research Center of the University of Montreal Hospital Center (CRCHUM)

Antoine Magron, MSc – Ecole Polytechnique Fédérale de Lausanne (EPFL) and Research Center of the University of Montreal Hospital Center (CRCHUM)
Dang Nguyen, MD, PhD, FRCPC – CRCHUM; Department of Neuroscience, Université de Montréal
Elie Bou Assi, PhD – Department of Neuroscience, Université de Montréal

Rationale: Large language models (LLMs) have shown promising capabilities across a wide range of tasks, including answering questions related to epilepsy. However, they currently suffer from many flaws, such as generating inaccurate information (hallucinations) and lacking up-to-date knowledge. We hypothesize that smaller language models could be trained on domain-specific data and achieve superior performance on epilepsy-related tasks compared to larger LLMs.


Methods: To develop a domain-specific LLM specialized in epilepsy, we first systematically assessed the knowledge of open-source models using epilepsy board practice questions created by the American Epilepsy Society (AES). We used the three latest versions of the AES self-assessment test, spanning 2021 to 2023, and selected the 224 text-only questions. We tested multiple open-source LLMs, each with fewer than 10 billion parameters, making them suitable to run locally for inference (Table 1). We used the base versions of these models, which lack inherent question-answering capabilities, and applied in-context learning techniques such as few-shot learning and chain-of-thought prompting to improve answer quality. For few-shot learning, we used three randomly selected questions from the 2022 version of Continuum, a review of epilepsy endorsed by the American Academy of Neurology (AAN). These questions were manually checked to ensure they did not overlap with the test set. Our main quality metric was accuracy, calculated as the percentage of correct answers. For comparison, we also tested OpenAI's GPT models as a commercial gold standard.
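
For readers interested in implementation, the following is a minimal sketch of how such a few-shot multiple-choice evaluation could be set up with the Hugging Face transformers library. The model identifier, prompt template, data structures, and answer-letter parsing are illustrative assumptions rather than the authors' actual pipeline, which the abstract does not specify.

```python
# Minimal sketch (not the authors' code) of a three-shot multiple-choice
# evaluation with the Hugging Face transformers library.
import re
from transformers import pipeline

# Hypothetical worked examples standing in for the three Continuum questions,
# and a placeholder for the 224 text-only AES practice questions.
FEW_SHOT_EXAMPLES = [
    {"stem": "Example question text ...",
     "options": {"A": "option A", "B": "option B", "C": "option C", "D": "option D"},
     "answer": "B"},
    # ... two more solved examples ...
]
TEST_QUESTIONS = []  # list of dicts with the same structure, loaded separately

def format_question(q):
    options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
    return f"Question: {q['stem']}\n{options}\nAnswer:"

def build_prompt(q):
    # Few-shot prompting: prepend solved examples so the base model continues the pattern.
    shots = "\n\n".join(f"{format_question(ex)} {ex['answer']}" for ex in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\n{format_question(q)}"

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B",
                     device_map="auto")

def predict(q):
    out = generator(build_prompt(q), max_new_tokens=5, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
    match = re.search(r"[A-E]", out)  # take the first option letter the model emits
    return match.group(0) if match else None

# Accuracy reported as the percentage of correctly answered questions.
if TEST_QUESTIONS:
    correct = sum(predict(q) == q["answer"] for q in TEST_QUESTIONS)
    print(f"Accuracy: {100 * correct / len(TEST_QUESTIONS):.1f}%")
```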


Results: The best accuracies obtained by the models after providing three example multiple-choice questions were as follows: 49.1% for Llama-3 8B, 40.6% for Mistral 7B, 37.5% for Gemma 7B, 34.4% for Qwen2 7B, 31.3% for Meditron 7B, 27.7% for both Llama-2 7B and Qwen2 1.5B, 23.2% for Gemma 2B, and 21.0% for Falcon 7B. The performance of Llama-3 was consistent between one and three examples, while the other models showed a steady increase in accuracy that plateaued after three examples. GPT-4 with our complex prompting achieved an accuracy of 79.0%, much higher than previously reported with simple prompts, while GPT-3.5 showed an accuracy much closer to the open-source models, at 54.0%. This comes at a significant cost, since the GPT models are much larger than the ones we tested (Figure 1).


Conclusions: Among all tested open-source LLMs, Llama-3 8B demonstrated superior knowledge in epilepsy, achieving nearly 50% accuracy despite being a general-purpose model. It reached an accuracy very close to that of GPT-3.5 while being nearly 22 times smaller. Surprisingly, Meditron, a medical LLM, was outperformed by several general-purpose models of the same size. GPT-4, a state-of-the-art LLM further enhanced with our prompting, showed expert-level knowledge in epilepsy and passed the AES practice tests. Building on these results, we hope to match this performance without trading off privacy and transparency by further training Llama-3 on domain-specific data to create our own LLM.


Funding: This work is supported by the Natural Sciences and Engineering Research Council of Canada (EBA) and the Fonds de Recherche Santé-Québec (FRQS).

Translational Research