Objectives: This study examined the accuracy of three large language models (LLMs) (ChatGPT-4o, DeepSeek-R1, and Gemini 2.0) on multiple-choice questions (MCQs) from the European Board of Hand Surgery (EBHS) exam, together with the reasons for their wrong answers. It was hypothesized that DeepSeek-R1 would show a higher accuracy rate than the other two models, based on reported differences in training datasets.
Materials and Methods: A total of 150 true/false MCQs from 10 exams published in The Journal of Hand Surgery (European Volume) between 2022 and 2024 were examined. The MCQs were divided into five subheadings according to their content: anatomy, trauma, systemic-chronic diseases, microsurgery, and congenital disorders. The reasons for the models' wrong answers were divided into four groups: data-related, semantic, algorithmic, and logical errors.
Results: The correct answer rate was 74% for ChatGPT-4o, 76.7% for DeepSeek-R1, and 73.3% for Gemini 2.0, with no significant difference among these rates (p = 0.572). The models gave the same answer for 103 of the 150 MCQs, and 84.5% of these answers were correct. Among the wrong answers, the most frequent type of error was data-related.
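The reported percentages can be converted back to approximate answer counts as a consistency check. The sketch below is illustrative only: the counts are back-calculated by rounding and are not figures reported in the study, and the names `implied_count`, `correct_counts`, and `unanimous_correct` are hypothetical.

```python
# Back-calculate approximate correct-answer counts from the reported percentages.
# Assumption: each percentage is a rounded fraction of the stated total.
TOTAL_MCQS = 150      # total true/false MCQs examined
UNANIMOUS_MCQS = 103  # MCQs on which all three models gave the same answer

reported_accuracy = {
    "ChatGPT-4o": 74.0,
    "DeepSeek-R1": 76.7,
    "Gemini 2.0": 73.3,
}


def implied_count(percent, total):
    """Return the whole-number count closest to the given percentage of total."""
    return round(percent / 100 * total)


correct_counts = {
    model: implied_count(pct, TOTAL_MCQS)
    for model, pct in reported_accuracy.items()
}
unanimous_correct = implied_count(84.5, UNANIMOUS_MCQS)

for model, count in correct_counts.items():
    print(f"{model}: about {count} of {TOTAL_MCQS} correct")
print(f"Unanimous answers: about {unanimous_correct} of {UNANIMOUS_MCQS} correct")
```

Under this rounding assumption, the implied counts differ by at most five questions across the three models, which fits the reported absence of a significant difference.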
Conclusion: There was no significant difference in accuracy rates, content-based subcategories, or error types among the three LLMs. The predominance of data-related errors points to gaps in training, while the approximately 75% accuracy achieved on this exam suggests that further error analysis could enhance future model performance.
Key words: artificial intelligence; board exam; ChatGPT; DeepSeek; error analysis; Gemini; hand surgery; large language models