A comparative analysis of Large Language Models in digital content on painless labor: a systematic evaluation of readability, reliability, quality, and accuracy

Ahmet Ridvan Dogan; Havva Kocayigit; Ayca Tas Tuna

doi:10.5455/annalsmedres.2025.11.346

Ann Med Res. 2026; 33(6): 267-276

Abstract
Aim: Painless labor is an important concern in obstetric care, with both medical and psychosocial dimensions. With the increasing use of artificial intelligence (AI) in health communication, evaluating the quality of content generated by large language models (LLMs) is essential. This study aimed to compare AI-generated content on “painless labor” produced by three LLMs, namely ChatGPT (Chat Generative Pre-trained Transformer), Gemini, and DeepSeek, in terms of readability, reliability, content quality, and medical accuracy.

Materials and Methods: A total of 270 texts were generated using 30 frequently searched keywords on three dates in May 2025, via the free versions of the three LLMs. Readability was assessed with six validated indices. Reliability and quality were evaluated using the Modified DISCERN tool, JAMA (Journal of the American Medical Association) Benchmark Criteria, Ensuring Quality Information for Patients (EQIP)-36, and Global Quality Score. Medical accuracy was assessed for 90 texts by an obstetric anesthesia expert.

Results: ChatGPT texts were the most readable and written at lower grade levels but scored lower in reliability and quality. Gemini ranked highest in both reliability and content quality, though it produced more complex language. DeepSeek showed variable performance. No significant differences were found in medical accuracy or content completeness. Gemini also demonstrated the most consistent performance across all time points.

Conclusion: LLMs vary substantially in how they present medical content. ChatGPT achieved higher readability scores, indicating simpler language structures, whereas Gemini demonstrated higher reliability and content quality metrics. However, the absence of source citations in all models raises concerns about content verifiability, highlighting the need for critical oversight in healthcare applications.

Key words: Painless Labor; Large Language Models; Readability; Health Literacy; Patient Education; Obstetric Anesthesia.

Abstract