Evaluating a large language model's accuracy in CTPA image interpretation for acute pulmonary embolism

Banu Arslan

doi:10.5455/medscience.2025.06.166

Med-Science. 2026; 15(1): 116-21

Abstract
Pulmonary embolism (PE) is a major cause of cardiovascular mortality. Computed tomography pulmonary angiography (CTPA) is the gold standard for definitive diagnosis; however, image interpretation can be delayed in busy or under-resourced emergency departments. Large language models (LLMs) such as ChatGPT-4 Turbo, which now accept images, may offer scalable decision support, but their diagnostic performance on CTPA has not yet been defined. We retrospectively assembled 200 de-identified single-slice CTPA images from PubMed Central (162 positive, 38 negative for PE). After anonymization, each image was uploaded individually to ChatGPT-4 Turbo. Ground-truth labels were obtained from source case reports. Sensitivity, specificity, accuracy, predictive values, and Cohen’s kappa for anatomic sub-typing were calculated. ChatGPT identified 150 true positives, 12 false negatives, 28 false positives, and 10 true negatives. Sensitivity was 92.6%, specificity 26.3%, and overall accuracy 80%. Positive and negative predictive values were 84.3% and 45.5%, respectively. Agreement on embolus location was fair (κ=0.25) with 51% correct sub-typing. ChatGPT-4 Turbo detected most emboli but generated many false alarms and misclassified half of the anatomic locations. These data position the model as a useful triage aid to flag CTPAs for expedited human review, rather than a stand-alone interpreter.

Key words: ChatGPT, artificial intelligence, computed tomography pulmonary angiography, pulmonary embolism, emergency
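The reported performance figures follow directly from the four confusion-matrix counts given in the abstract. As a minimal sketch of that arithmetic, the Python snippet below recomputes them using the standard definitions. The function name `diagnostic_metrics` and its dictionary keys are illustrative choices, not part of the study's own analysis, and Cohen's kappa is left out because the per-image location labels needed for it are not given here.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Standard 2x2 diagnostic accuracy measures (hypothetical helper)."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all PE-positive images
        "specificity": tn / (tn + fp),   # true negatives among all PE-negative images
        "accuracy": (tp + tn) / total,   # correct calls among all images
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Counts as reported in the abstract: 150 TP, 12 FN, 28 FP, 10 TN (200 images).
metrics = diagnostic_metrics(tp=150, fn=12, fp=28, tn=10)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Run with the reported counts, this yields 92.6% sensitivity, 26.3% specificity, 80.0% accuracy, 84.3% PPV, and 45.5% NPV, matching the abstract. Note that the predictive values depend on the 81% PE prevalence in this image set (162 of 200) and would differ in a population with a different prevalence.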
