Objective: This study aimed to assess and quantify reference hallucination in four major large language model (LLM) chatbots.
Methods: This was a cross-sectional study comparing four LLM chatbots (GPT-4o with internet access, GPT-4 Legacy, GPT-4o, and O1 Preview) with respect to reference hallucination. A total of 400 clinically based prompts (100 per chatbot) were created and submitted to the chatbots, and their responses were evaluated using the Reference Hallucination Score (RHS).
Results: A total of 400 prompts (200 basic, 200 advanced) answered by the four chatbots (100 per model) were analyzed using the seven-item RHS (KR-20 = 0.808). Reference errors were frequent: 255/400 responses (63.7%) had incorrect or missing publication dates, 240/400 (60.0%) contained broken or misleading links, and only 79/400 (19.8%) provided a correct DOI. GPT-4o with internet access (286 erroneous vs 414 correct reference elements) and GPT-4o (317 vs 383) showed the lowest hallucination burden, whereas O1 Preview (621 vs 79) performed worst, with GPT-4 Legacy intermediate.
Conclusion: LLM-based medical chatbots exhibited varying degrees of bibliographic hallucination depending on the model. Newer iterations such as GPT-4o reduced, but did not eliminate, such hallucination. Structured reference auditing and human verification should therefore precede the deployment of these tools for clinically relevant communication or decision support.
Key words: Clinical decision support, hallucination, large language models, medical chatbots, medical applications.