Abstract

Objective: To compare the performance of eight large language models (LLMs) with neurology residents on board-style multiple-choice questions across seven subspecialties and two cognitive levels.

Methods: In a cross-sectional benchmarking study, we evaluated Bard, Claude, Gemini v1, Gemini 2.5, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, and ChatGPT-5 using 107 text-only items spanning movement disorders, vascular neurology, neuroanatomy, neuroimmunology, epilepsy, neuromuscular disease, and neuro-infectious disease. Items were labeled as lower- or higher-order per Bloom's taxonomy by two neurologists. Models answered each item in a fresh session and reported confidence and Bloom classification. Residents completed the same set under exam-like conditions. Outcomes included overall and domain accuracies, guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), agreement with expert Bloom labels (Cohen κ), and inter-generation scaling (linear regression of topic-level accuracies). Group differences were assessed with Fisher exact or χ² tests with Bonferroni correction.

Results: Residents scored 64.9%. ChatGPT-5 achieved 84.1% and ChatGPT-4o 81.3%, followed by Gemini 2.5 at 77.6% and ChatGPT-4 at 68.2%; Claude (56.1%), Bard (54.2%), ChatGPT-3.5 (53.3%), and Gemini v1 (39.3%) underperformed residents. On higher-order items, ChatGPT-5 (86%) and ChatGPT-4o (82.5%) maintained superiority; Gemini 2.5 matched ChatGPT-4o at 82.5%. Guessing-adjusted accuracy preserved the rank order (ChatGPT-5 78.8%, ChatGPT-4o 75.1%, Gemini 2.5 70.1%). Confidence–accuracy calibration was weak across models. Inter-generation scaling was strong within the ChatGPT lineage (ChatGPT-4 to 4o R² = 0.765, p = 0.010; 4o to 5 R² = 0.908, p < 0.001) but absent from Gemini v1 to 2.5 (R² = 0.002, p = 0.918), suggesting discontinuous improvements.

Conclusions: LLMs, particularly ChatGPT-5 and ChatGPT-4o, exceeded resident performance on text-based neurology board-style questions across subspecialties and cognitive levels. Gemini 2.5 showed substantial gains over v1 but with domain-uneven scaling. Given weak confidence calibration, LLMs should be integrated as supervised educational adjuncts with ongoing validation, version governance, and transparent metadata to support safe use in neurology education.
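For readers who want to see how the Methods-style metrics fit together, below is a minimal Python sketch run on synthetic data. It is not the authors' analysis code: the option count per item (K = 4), the specific correction-for-guessing formula, the contingency counts, and the number of Bonferroni comparisons are all illustrative assumptions not stated in the abstract.

```python
import numpy as np
from scipy.stats import spearmanr, linregress, fisher_exact
from sklearn.metrics import cohen_kappa_score

K = 4  # assumed number of answer options per item (hypothetical)
rng = np.random.default_rng(0)
n_items = 107

# Synthetic per-item outcomes for one model: correctness and self-reported confidence.
correct = rng.random(n_items) < 0.84
confidence = rng.uniform(0.5, 1.0, n_items)

# Guessing-adjusted accuracy via the classic correction-for-guessing,
# (right - wrong / (K - 1)) / total; the paper's exact formula is an assumption here.
right = int(correct.sum())
wrong = n_items - right
adjusted = (right - wrong / (K - 1)) / n_items
print(f"raw accuracy {right / n_items:.3f}, guessing-adjusted {adjusted:.3f}")

# Confidence-accuracy calibration: Spearman rho between confidence and correctness.
rho, p_rho = spearmanr(confidence, correct.astype(int))
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")

# Agreement with expert Bloom labels (0 = lower-order, 1 = higher-order): Cohen kappa.
expert_bloom = rng.integers(0, 2, n_items)
model_bloom = rng.integers(0, 2, n_items)
print(f"Cohen kappa = {cohen_kappa_score(expert_bloom, model_bloom):.2f}")

# Inter-generation scaling: regress the successor's topic-level accuracies on the
# predecessor's across the seven subspecialties and report R-squared.
acc_old = rng.uniform(0.5, 0.9, 7)
acc_new = acc_old + rng.normal(0.10, 0.03, 7)
fit = linregress(acc_old, acc_new)
print(f"R^2 = {fit.rvalue ** 2:.3f} (p = {fit.pvalue:.3f})")

# Model-vs-residents difference: Fisher exact test on correct/incorrect counts,
# judged against a Bonferroni-corrected threshold (8 comparisons assumed).
table = [[90, 17], [69, 38]]  # synthetic counts, roughly 84% vs 65% of 107 items
_, p = fisher_exact(table)
print(f"Fisher exact p = {p:.4f}; Bonferroni alpha = {0.05 / 8:.4f}")
```

On real data, `correct`, `confidence`, and the Bloom labels would come from the per-item response logs; every number above is a placeholder.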

Department(s)

Electrical and Computer Engineering

Publication Status

Open Access

Keywords and Phrases

artificial intelligence; board examinations; large language models; model performance analysis; neurology education

International Standard Serial Number (ISSN)

2673-253X

Document Type

Article - Journal

Document Version

Final Version

File Type

text

Language(s)

English

Rights

© 2026 The Authors. All rights reserved.

Creative Commons Licensing

This work is licensed under a Creative Commons Attribution 4.0 License.

Publication Date

01 Jan 2026
