Abstract
Objective: To compare the performance of eight large language models (LLMs) with neurology residents on board-style multiple-choice questions across seven subspecialties and two cognitive levels.

Methods: In a cross-sectional benchmarking study, we evaluated Bard, Claude, Gemini v1, Gemini 2.5, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, and ChatGPT-5 on 107 text-only items spanning movement disorders, vascular neurology, neuroanatomy, neuroimmunology, epilepsy, neuromuscular disease, and neuro-infectious disease. Two neurologists labeled each item as lower- or higher-order per Bloom's taxonomy. Each model answered every item in a fresh session and reported a confidence rating and a Bloom classification. Residents completed the same item set under exam-like conditions. Outcomes included overall and domain accuracies, guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), agreement with expert Bloom labels (Cohen κ), and inter-generation scaling (linear regression of topic-level accuracies). Group differences were assessed with Fisher exact or χ² tests with Bonferroni correction.

Results: Residents scored 64.9%. ChatGPT-5 achieved 84.1% and ChatGPT-4o 81.3%, followed by Gemini 2.5 at 77.6% and ChatGPT-4 at 68.2%; Claude (56.1%), Bard (54.2%), ChatGPT-3.5 (53.3%), and Gemini v1 (39.3%) underperformed the residents. On higher-order items, ChatGPT-5 (86%) and ChatGPT-4o (82.5%) maintained their superiority, and Gemini 2.5 matched ChatGPT-4o at 82.5%. Guessing-adjusted accuracy preserved the rank order (ChatGPT-5 78.8%, ChatGPT-4o 75.1%, Gemini 2.5 70.1%). Confidence–accuracy calibration was weak across models. Inter-generation scaling was strong within the ChatGPT lineage (ChatGPT-4 to 4o, R² = 0.765, p = 0.010; 4o to 5, R² = 0.908, p < 0.001) but absent for Gemini v1 to 2.5 (R² = 0.002, p = 0.918), suggesting discontinuous improvements.

Conclusions: LLMs, particularly ChatGPT-5 and ChatGPT-4o, exceeded resident performance on text-based neurology board-style questions across subspecialties and cognitive levels. Gemini 2.5 showed substantial gains over v1 but with domain-uneven scaling. Given weak confidence calibration, LLMs should be integrated as supervised educational adjuncts, with ongoing validation, version governance, and transparent metadata to support safe use in neurology education.
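The abstract reports guessing-adjusted accuracy and Spearman-based confidence–accuracy calibration without spelling out the computations. The sketch below is a minimal reconstruction, not the authors' code: it assumes the classic formula-scoring correction with four answer options per item (the option count is an assumption, not stated in the abstract), and the correct/wrong counts for ChatGPT-5 are inferred from the reported 84.1% of 107 items. The calibration example uses purely hypothetical per-item confidences.

```python
from scipy.stats import spearmanr

def guess_adjusted_accuracy(n_correct, n_wrong, n_items, n_options=4):
    """Formula-scoring correction: each wrong answer subtracts 1/(k-1)
    of a point, so pure guessing scores roughly zero in expectation."""
    return (n_correct - n_wrong / (n_options - 1)) / n_items

# Inferred counts for ChatGPT-5 (84.1% of 107 items ~ 90 correct, 17 wrong);
# with k = 4 this reproduces the reported 78.8% guessing-adjusted accuracy.
print(f"{guess_adjusted_accuracy(90, 17, 107):.1%}")

# Confidence-accuracy calibration as described: Spearman rho between
# per-item self-reported confidence and correctness (1 = right, 0 = wrong).
confidence = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55]  # hypothetical confidences
correct = [1, 1, 0, 1, 0, 1]                        # hypothetical outcomes
rho, p_value = spearmanr(confidence, correct)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

Under these assumptions the same correction applied to the other models' inferred counts also recovers the reported adjusted accuracies, but the option count and per-item data remain assumptions of this sketch.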
Recommended Citation
M. Almomani et al., "Evaluation of Multiple Generative Large Language Models on Neurology Board-style Questions," Frontiers in Digital Health, vol. 7, article no. 1737882, Frontiers Media, Jan 2026.
The definitive version is available at https://doi.org/10.3389/fdgth.2025.1737882
Department(s)
Electrical and Computer Engineering
Publication Status
Open Access
Keywords and Phrases
artificial intelligence; board examinations; large language models; model performance analysis; neurology education
International Standard Serial Number (ISSN)
2673-253X
Document Type
Article - Journal
Document Version
Final Version
File Type
text
Language(s)
English
Rights
© 2026 The Authors, All rights reserved.
Creative Commons Licensing

This work is licensed under a Creative Commons Attribution 4.0 License.
Publication Date
01 Jan 2026
