Assessment of ChatGPT-4o’s Answers to Common Questions on Thyroid Fine-Needle Aspiration Biopsy

Ali Salbas; Rasit Eren Buyuktoka

doi:10.5455/medscience.2025.09.250

Med-Science. 2026; 15(1): 243-8

Abstract
This study set out to assess the quality of ChatGPT-4o’s (Chat Generative Pre-trained Transformer, version 4o) replies to common patient questions concerning thyroid fine-needle aspiration biopsy (FNAB). A cross-sectional design was employed, in which patient-focused questions were gathered using the search phrase “frequently asked questions about thyroid biopsy” on Google. Following the removal of duplicates and overlapping items, 20 unique questions were chosen. Each question was submitted to ChatGPT-4o in a new session. The generated responses were then evaluated by 12 radiologists, all blinded to the source of the answers. Ratings were given on a 5-point Likert scale across four categories: relevance, accuracy, clarity, and completeness. Descriptive analyses were performed, and interrater reliability was calculated using the intraclass correlation coefficient (ICC). All 20 questions received scores between 3 and 5 in every category. The overall mean score was 4.72±0.12. Relevance achieved the best performance with a mean of 4.95±0.06, while clarity was the lowest at 4.61±0.23. The reliability analysis showed weak agreement among evaluators, with ICC values of –0.028 (p=0.863) for relevance, 0.061 (p=0.005) for accuracy, 0.072 (p=0.002) for clarity, 0.031 (p=0.016) for completeness, and 0.061 (p=0.002) for the overall score. In conclusion, ChatGPT-4o produced highly relevant, accurate, and generally comprehensive responses to patient inquiries regarding thyroid FNAB. Nonetheless, the limited interrater reliability underscores variability in expert judgment, especially in clarity and completeness. Although ChatGPT-4o holds promise as a supportive tool for patient education, its outputs should be reviewed and tailored by healthcare professionals prior to use in clinical practice.

Key words: Thyroid Nodule, Biopsy, Fine-Needle, Patient Education, Natural Language Processing, Artificial Intelligence
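
For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of how an ICC can be computed from a questions-by-raters score matrix. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is assumed here. The function name `icc_2_1` and the small ratings matrix are hypothetical and are not the study's data.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_targets x k_raters) array, e.g. questions x radiologists.
    Assumption: this ICC form is chosen for illustration only; the abstract
    does not specify the form used in the study.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)  # between targets
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)  # between raters
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical 5-point Likert scores: 4 questions (rows) x 3 raters (columns).
example_scores = [
    [5, 4, 5],
    [5, 5, 4],
    [4, 5, 5],
    [5, 5, 5],
]
print(f"ICC(2,1) = {icc_2_1(example_scores):.3f}")
```

Because the numerator is the between-question mean square minus the error mean square, the estimate can fall near or below zero when questions differ less from one another than raters differ on the same question.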

Abstract