A Comparative Assessment of Large Language Models in Congenital Hypothyroidism: Reliability, Quality and Readability
Original Article
E-PUB
21 April 2026

J Clin Res Pediatr Endocrinol. Published online 21 April 2026.
1. University of Health Sciences Türkiye Antalya City Hospital, Clinic of Pediatric Endocrinology, Antalya, Türkiye
Received Date: 29.01.2026
Accepted Date: 10.04.2026
E-Pub Date: 21.04.2026

ABSTRACT

Objective

To comparatively evaluate the reliability, quality, and readability of responses generated by widely used large language model (LLM)-based chatbots to congenital hypothyroidism (CH)-related patient questions.

Methods

Forty frequently asked questions (FAQs) about CH, derived from clinician-reviewed patient education resources, were submitted under standardized conditions (December 2025) to Chat Generative Pre-Trained Transformer-4 (ChatGPT-4), ChatGPT-5.2, Gemini, and Copilot. The modified DISCERN (mDISCERN) instrument was used to assess reliability, whereas the Global Quality Score (GQS) was used to evaluate quality. Readability was evaluated using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG). Scores were compared using Friedman tests with Bonferroni-corrected post-hoc analyses.

Results

Median mDISCERN scores were 5.0 for ChatGPT-4, ChatGPT-5.2, and Gemini, and 4.0 for Copilot. Median GQS scores were 5.0 for ChatGPT-4, ChatGPT-5.2, and Gemini, and 4.0 for Copilot. Differences among models were significant for both mDISCERN and GQS (p<0.001), with ChatGPT-5.2 outperforming others in key pairwise comparisons. Readability differed significantly across all indices (all p<0.001). ChatGPT-5.2 demonstrated the highest FRE and lowest FKGL, whereas Gemini produced the most complex text. However, all models exceeded the recommended sixth-grade reading level.

Conclusion

LLM-based chatbots produced generally moderate-to-high quality CH information, but readability remains suboptimal for patient education. ChatGPT-5.2 showed the best overall performance. LLM outputs may support patient information needs but should complement, not replace, clinician-provided counseling.

Keywords:
Artificial intelligence, ChatGPT, congenital hypothyroidism, Copilot, Google Gemini, large language models

What is already known on this topic?

Large language model (LLM)-based chatbots are increasingly used by patients to obtain medical information. Previous studies have shown variable reliability, quality, and readability of LLM-generated health content, but most materials exceed the recommended sixth-grade reading level for patient education.

What this study adds?

This study is the first to evaluate four LLMs in the context of congenital hypothyroidism using parent-centered questions, assessing reliability, quality, readability, and source accuracy. Although all models demonstrated good levels of reliability and quality, ChatGPT-5.2 showed superior overall performance compared with the others. These findings suggest that, as LLMs continue to evolve, they hold increasing potential to generate more reliable and readable health information.

Introduction

Primary congenital hypothyroidism (CH) is the most common congenital endocrine disorder, with an estimated incidence of approximately 1 in 1,000-3,000 live births worldwide. If left untreated, CH may lead to severe and irreversible intellectual disability; however, this adverse outcome can largely be prevented through neonatal screening programs and early initiation of treatment. In countries where newborn screening programs are effectively implemented, the majority of patients with CH demonstrate neurodevelopmental outcomes within normal limits (1, 2, 3, 4, 5, 6, 7).

Abnormal findings on newborn screening necessitate confirmatory biochemical testing to establish or exclude the diagnosis of hypothyroidism. Measurement of thyrotropin [thyroid-stimulating hormone (TSH)] together with free thyroxine (free T4), or alternatively total T4 and triiodothyronine (T3) uptake, is recommended for this purpose. The presence of elevated serum TSH levels accompanied by low free T4 concentrations confirms the diagnosis of primary hypothyroidism and requires urgent initiation of treatment (8). Oral levothyroxine is the treatment of choice, and both the timing and dosage of thyroid hormone replacement are critical determinants of clinical outcomes (9, 10). In term infants, the recommended initial dose is 10-15 µg/kg/day, a range that has been associated with optimal neurocognitive outcomes, normal growth, and improved school performance (10, 11).

Patients and their parents now frequently rely on the internet to access health-related information. Previous reports indicate that approximately 90% of adults use the internet, and nearly 75% search for health-related information before seeking medical care, highlighting the importance of evaluating the accuracy and readability of online medical content (12). Medical content is disseminated to broad audiences through digital and social media platforms such as Google, Facebook, and Twitter (13). In recent years, the use of artificial intelligence (AI) technologies in healthcare has increased rapidly (14). AI refers to the ability of computer systems to perform functions that typically require human intelligence, including decision-making, learning from experience, natural language understanding, and problem-solving.

In addition to traditional citation metrics, alternative metrics such as Altmetric scores provide valuable insight into the dissemination and societal impact of scientific publications. Bibliometric analyses not only illustrate the historical development of a research field but also identify highly interactive studies that shape academic visibility. Recent bibliometric studies highlight the growing role of digital engagement in the dissemination of medical knowledge, including the importance of evaluating online information sources in contemporary healthcare environments (13).

In this context, AI-driven large language models (LLMs) have emerged as novel and easily accessible sources of information for individuals seeking health-related knowledge. AI-based chatbots are capable of interacting with patients, answering questions, and providing basic medical information (15). Chat Generative Pre-Trained Transformer (ChatGPT) version 3.5 was released in 2022, rapidly gained a large user base, and was subsequently followed by more advanced versions (16). In addition to ChatGPT, other LLM-based chatbots, such as Microsoft Copilot and Google Gemini, have also been developed. These models were selected because they represent the most widely used and publicly accessible LLM-based chatbots at the time of the study and have frequently been evaluated in previous healthcare-related research, allowing comparability with the existing literature.

Access to AI enables patients to obtain health-related information quickly and easily. However, health literacy plays a central role in patient understanding and engagement, and the readability, reliability, and quality of this information are thus of critical importance (17). The National Institutes of Health (NIH), the American Medical Association (AMA), and the United States Department of Health and Human Services recommend that web-based patient education materials be written at or below a sixth-grade reading level (18, 19, 20, 21). In addition, LLMs may occasionally cite non-existent sources or generate biased or inaccurate information, raising concerns regarding patient safety (22). Improved patient knowledge regarding disease mechanisms and treatment has been shown to enhance adherence to medical recommendations and improve clinical outcomes (23).

The aim of this study was to conduct a comparative evaluation of the responses generated by four AI-based chatbots, ChatGPT-4, ChatGPT-5.2, Gemini, and Copilot, to frequently asked questions (FAQs) related to CH, with respect to readability, reliability, and quality.

Methods

Study Design

This study was designed as a cross-sectional analytical study evaluating the reliability, quality, and readability of responses generated by AI-based LLMs regarding CH.

This study did not involve human participants or patient-level data. All evaluated responses were obtained from publicly accessible AI platforms. Therefore, ethics committee approval was not required.

Question Sources and Initial Screening

Questions related to CH were developed using patient education content from internationally recognized, reliable, evidence-based websites that are reviewed by clinicians, including the Cleveland Clinic, Mayo Clinic, and the United Kingdom National Health Service. These sources were selected because they are widely regarded as trustworthy in patient and caregiver education and include questions that are frequently asked by patients and their families.

Initially, 60 questions related to CH were identified. Questions that were repetitive, highly similar in wording, overlapping in meaning, or not directly related to CH were excluded through a screening process. Following this refinement, 40 questions were selected for the final analysis. The complete list of questions is provided in the Supporting Information section. The screening and selection of questions were independently performed by two pediatric endocrinologists with clinical experience in CH, and any disagreements regarding inclusion or exclusion were resolved by consultation with a third pediatric endocrinologist, with final decisions made by consensus.

Question Categorization

The final set of questions was categorized into six clinically meaningful domains reflecting the topics most frequently sought by parents of children with CH. These domains included basic information; symptoms and clinical features; diagnosis and screening; treatment and monitoring; risks, side effects, and complications; and recovery and outlook.

AI Models and Interaction Procedure

The selected questions were submitted to four LLM-based chatbot platforms: ChatGPT-4 and ChatGPT-5.2 (free and paid versions; OpenAI; both December 2025), Gemini (free version; Google; November 2025), and Copilot (Microsoft; December 2025), all of which were publicly accessible at the time of the study. All evaluations were conducted in December 2025 using identical prompts and standardized conditions. All queries were submitted through a web browser in incognito mode without logging into any personal accounts to minimize personalization bias.

To ensure that each response was generated independently and to prevent contextual memory bias, the conversation history was cleared prior to each question, and a new chat session was initiated. To assess response consistency, the same set of questions was resubmitted to each chatbot one week later under the same conditions. No additional prompts, follow-up questions, response regeneration commands, or clarifications were used, except for requesting references when they were not initially provided by the chatbot.

All responses and cited references were recorded and stored for subsequent analysis. The existence, accessibility, and academic credibility of the cited sources were systematically verified and documented using PubMed, Google Scholar, CrossRef, and official journal websites. A reference was classified as fabricated if it could not be identified in these databases or if inconsistencies were detected in authorship, journal name, publication year, volume, page numbers, or DOI information. In addition, references that were retrievable but unrelated to the topic or that did not support the statements made in the chatbot response were classified as inaccurate citations. All references were independently reviewed by two pediatric endocrinologists, and disagreements were resolved by consensus. Source usage and citation behavior were incorporated into the modified DISCERN (mDISCERN) (24, 25) and Global Quality Scale (GQS) (26, 27) assessments, and misleading, fabricated, or non-academic references were systematically identified and recorded.
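The classification rule described above can be sketched as a small decision function. This is only an illustration of the stated criteria: the three boolean inputs and the names `CitationStatus` and `classify_citation` are hypothetical, and the actual verification in the study was performed manually by the reviewers.

```python
from enum import Enum


class CitationStatus(Enum):
    VERIFIED = "verified"
    FABRICATED = "fabricated"
    INACCURATE = "inaccurate"


def classify_citation(found_in_databases, metadata_consistent, supports_statement):
    """Apply the rule from the text to one cited reference.

    found_in_databases: the reference could be identified in the databases searched.
    metadata_consistent: authorship, journal, year, volume, pages, and DOI all match.
    supports_statement: the reference is on topic and supports the chatbot's claim.
    (All three inputs are hypothetical names for the reviewers' manual judgments.)
    """
    # Not retrievable, or retrievable only with inconsistent details: fabricated.
    if not found_in_databases or not metadata_consistent:
        return CitationStatus.FABRICATED
    # Real reference that does not back the statement: inaccurate citation.
    if not supports_statement:
        return CitationStatus.INACCURATE
    return CitationStatus.VERIFIED
```

A reference is checked for existence and metadata first, so a citation with a wrong DOI is counted as fabricated even if a similar real article covers the topic.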

Expert Evaluation Process

All chatbot responses were independently evaluated by two pediatric endocrinologists with clinical experience in the management of CH. In cases of disagreement regarding scoring, the responses were re-assessed by a third pediatric endocrinologist, and a final decision was reached by consensus. Inter-rater agreement exceeded 0.80 (Cohen’s κ), indicating excellent agreement.
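For readers unfamiliar with the agreement statistic, the following is a minimal sketch of unweighted Cohen's κ for two raters. Whether the study used the unweighted or a weighted form is not stated, so the unweighted form is an assumption here, and the function name and example ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportion per category.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) when both raters use one identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical scores for ten responses, not data from the study.
a = [5, 5, 4, 4, 3, 5, 4, 5, 5, 4]
b = [5, 5, 4, 4, 3, 5, 4, 5, 4, 4]
kappa = cohens_kappa(a, b)
```

In this made-up example the raters agree on nine of ten responses and κ is about 0.83, which would fall above the 0.80 level reported in the text.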

Reliability Assessment

The mDISCERN instrument was used to assess reliability. This scale consists of five criteria, with each criterion scored as 1 if fulfilled and 0 if not fulfilled. Higher total scores (out of five) indicate greater reliability. The reliability and validity of the DISCERN instruments have been previously established (24, 25). The mDISCERN scale evaluates the following five criteria using a yes/no format: clear statement of aims; reliability of information sources; balance and absence of bias; provision of additional sources of information; and discussion of uncertainties.
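The scoring scheme above amounts to counting fulfilled criteria. A minimal sketch follows; the dictionary layout and the name `mdiscern_score` are illustrative assumptions, while the five criterion labels are taken from the text.

```python
MDISCERN_CRITERIA = (
    "clear statement of aims",
    "reliability of information sources",
    "balance and absence of bias",
    "provision of additional sources of information",
    "discussion of uncertainties",
)


def mdiscern_score(ratings):
    """Sum the five yes/no criteria (1 if fulfilled, 0 if not) into a 0-5 score.

    ratings: dict mapping each criterion label to True (fulfilled) or False.
    """
    missing = [c for c in MDISCERN_CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"Unrated criteria: {missing}")
    return sum(1 for c in MDISCERN_CRITERIA if ratings[c])


# Hypothetical rating of one response that does not discuss uncertainties.
example = {c: True for c in MDISCERN_CRITERIA}
example["discussion of uncertainties"] = False
score = mdiscern_score(example)
```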

Quality Assessment

The quality of the chatbot responses was evaluated using the GQS, which has been applied in similar studies (26, 27). GQS is a five-point Likert scale designed to assess the usability, quality, and flow of online health information. A score of 1 represents the lowest quality, whereas a score of 5 indicates the highest quality. Scores of 2 reflect low quality with limited usefulness, 3 indicate moderate quality with limited usefulness, and 4 represent good quality and usefulness (Table 1).

Readability Assessment

The readability of the responses generated by the chatbots was analyzed using multiple established readability indices to evaluate textual complexity and the required reading level. These indices included the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), and the Simple Measure of Gobbledygook (SMOG). Readability scores were calculated using an online tool with automated computation functions (28).

The FRE score ranges from 0 to 100, with lower scores indicating more difficult text. Scores between 0 and 30 correspond to very difficult texts requiring college-level reading skills; scores between 31 and 50 indicate difficult texts appropriate for grades 13-16; scores between 51 and 60 represent relatively difficult texts at the 10th-12th grade level; scores between 61 and 70 indicate plain English suitable for grades 8-9; scores between 71 and 80 correspond to fairly easy texts at the 7th-grade level; scores between 81 and 90 indicate easy texts appropriate for the 6th-grade level; and scores between 91 and 100 represent very easy texts that can be understood by an average 11-year-old student.
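As an illustration, the standard published FRE formula and the interpretation bands listed above can be expressed as follows. The online tool used in the study may count words, sentences, and syllables differently, so this is a sketch under that assumption; the function names are hypothetical, and the counts are passed in directly rather than derived from text.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula from raw text counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# (upper bound, label) pairs following the bands described in the text.
FRE_BANDS = [
    (30, "very difficult (college level)"),
    (50, "difficult (grades 13-16)"),
    (60, "relatively difficult (grades 10-12)"),
    (70, "plain English (grades 8-9)"),
    (80, "fairly easy (grade 7)"),
    (90, "easy (grade 6)"),
    (100, "very easy (average 11-year-old)"),
]


def fre_band(score):
    """Map an FRE score to its interpretation band.

    Assumption: a fractional score that falls between two listed bands
    (for example 30.5) is assigned to the next band up, and scores above
    100 are treated as "very easy".
    """
    for upper, label in FRE_BANDS:
        if score <= upper:
            return label
    return FRE_BANDS[-1][1]
```

Applied to the medians reported in the Results, `fre_band(57.2)` (ChatGPT-5.2) falls in the "relatively difficult" band and `fre_band(38.2)` (Gemini) in the "difficult" band.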

The FKGL score represents the grade level required to understand a text, with scores of 10 or higher indicating that the material is appropriate for readers at the high school level or above. According to recommendations from the AMA and the NIH, patient education materials should be written at a sixth-grade reading level or lower (18, 19, 20, 21).

The CLI measures the reading level corresponding to grade levels in the United States. The SMOG score indicates the number of years of education required to understand a text. In the GFI, which evaluates textual complexity based on sentence length and the proportion of long words, scores above 12 indicate more difficult texts.

Acceptable readability thresholds were defined as an FRE score of ≥80 and a score of ≤6 on each of the other four indices (FKGL, GFI, CLI, and SMOG). Materials that did not meet these thresholds were considered more difficult to read than the levels recommended for the general population (23, 29, 30, 31, 32).
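The four grade-level indices and the acceptability check can be sketched with their standard published formulas. The coefficients below are the commonly cited ones and may not match the online calculator used in the study exactly; all function names are hypothetical, and text counts (words, sentences, syllables, letters, complex or polysyllabic words) are assumed to be supplied by the caller.

```python
import math


def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    # complex_words: words of three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def coleman_liau(letters, words, sentences):
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def smog(polysyllables, sentences):
    # Polysyllable count is normalized to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def meets_thresholds(fre, grade_indices):
    """Thresholds from the text: FRE >= 80 and every grade-level index <= 6."""
    return fre >= 80 and all(value <= 6 for value in grade_indices)
```

For example, the ChatGPT-5.2 medians reported in the Results (FRE 57.2, FKGL 8.4) give `meets_thresholds(57.2, [8.4])` equal to `False`, consistent with the finding that no model reached the recommended level.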

Statistical Analysis

Statistical analyses were performed using SPSS software (IBM Corp., Armonk, NY, USA). Continuous and ordinal variables were assessed for normality using the Shapiro-Wilk test. As the outcome scores did not follow a normal distribution, results were summarized as median and interquartile range. Differences in scores across the compared chatbot models were evaluated using the Friedman test. When the Friedman test indicated a statistically significant difference, post hoc pairwise comparisons were conducted using the Wilcoxon signed-rank test with Bonferroni correction. Because four models yield six pairwise comparisons, a Bonferroni-corrected p value of <0.008 (0.05/6) was considered statistically significant. Effect sizes for the Friedman tests were calculated using Kendall’s W and interpreted as small (≈0.1), moderate (≈0.3), and large (≥0.5). This approach was used to complement p-values and to provide information on the magnitude of differences between models.
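The study's analyses were run in SPSS; the sketch below only illustrates the arithmetic behind the Friedman statistic, Kendall's W, and the Bonferroni threshold in plain Python. It assumes each of the 40 questions is one block and each model one condition, and all function names are hypothetical. P-values are omitted because they require the chi-square distribution (available, for instance, through `scipy.stats.friedmanchisquare`).

```python
from math import comb


def average_ranks(row):
    """Rank one question's scores across models (1 = lowest), averaging ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Tie-corrected Friedman chi-square; scores is n rows (questions) x k models."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        for value in set(row):
            t = row.count(value)
            tie_term += t ** 3 - t
    chi2 = 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)
    # Undefined (division by zero) if every row is completely tied.
    correction = 1 - tie_term / (n * k * (k * k - 1))
    return chi2 / correction


def kendalls_w(chi2, n, k):
    """Kendall's W effect size from a Friedman chi-square."""
    return chi2 / (n * (k - 1))


def bonferroni_alpha(k, alpha=0.05):
    """Per-comparison threshold for all pairwise comparisons among k models."""
    return alpha / comb(k, 2)
```

With the values reported in the Results, `kendalls_w(22.653, 40, 4)` rounds to 0.19, and `bonferroni_alpha(4)` is about 0.0083, which corresponds to the <0.008 threshold stated above.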

Results

The response performance of ChatGPT-4, ChatGPT-5.2, Gemini, and Copilot was evaluated using CH-related FAQs grouped into six domains. These domains comprised Basic Information; Symptoms and Clinical Features; Diagnosis and Screening; Treatment and Monitoring; Risks, Side Effects, and Complications; and Recovery and Outlook.

The reliability of the LLMs was assessed using mDISCERN. The median mDISCERN score was 5.0 (4.0-5.0) for ChatGPT-4, 5.0 (5.0-5.0) for ChatGPT-5.2, 5.0 (4.0-5.0) for Gemini, and 4.0 (3.0-4.0) for Copilot (Table 2).

The quality of the responses was evaluated using GQS. The median GQS score was 5.0 (4.0-5.0) for both ChatGPT-4 and Gemini, 5.0 (5.0-5.0) for ChatGPT-5.2, and 4.0 (3.0-4.0) for Copilot (Table 2).

The readability of the responses to the FAQs was evaluated using multiple indices. The highest FRE score was observed for ChatGPT-5.2 at 57.2 (39.4-66.8), whereas the lowest FRE score was recorded for Gemini at 38.2 (31.1-46.8). The lowest FKGL score was also obtained for ChatGPT-5.2 at 8.4 (7.0-12.0), while the highest FKGL score was found for ChatGPT-4 at 13.3 (11.9-14.5). ChatGPT-5.2 demonstrated the lowest SMOG, GFI, and CLI scores, whereas Gemini had the highest values for these indices (Table 3).

Gemini and Copilot provided references and direct links to the cited sources after each response. ChatGPT-4 and ChatGPT-5.2 did not provide sources by default but supplied references when explicitly requested; ChatGPT-5.2 included hyperlinks, whereas ChatGPT-4 did not. Regarding the accuracy of the cited sources, ChatGPT-5.2 achieved a rate of 100%, ChatGPT-4 and Gemini each demonstrated an accuracy of 95%, and Copilot showed an accuracy rate of 60%.

All LLMs provided additional information beyond the direct answers and indicated what further details they could offer upon request. In addition, ChatGPT-5.2 presented brief summary sections for parents (e.g., “short answer for parents”) for some questions. The responses generated by ChatGPT-4 were generally longer than those of the other models.

Reliability and Quality

mDISCERN scores differed significantly among the LLMs [χ²(3)=22.653, p<0.001], with a Kendall’s W of 0.19, indicating a small-to-moderate effect size. In pairwise comparisons, significant differences were observed between ChatGPT-5.2 and ChatGPT-4 and between ChatGPT-5.2 and Copilot (p=0.002 and p<0.001, respectively) (Table 2). ChatGPT-5.2 achieved higher reliability scores than the other models.

With respect to content quality, GQS scores also differed significantly among the LLMs (p<0.001). The effect size was small-to-moderate [χ²(3)=22.393, p<0.001; Kendall’s W=0.19]. In pairwise comparisons, the GQS score of ChatGPT-5.2 was significantly higher than those of the other three models (p=0.001, p=0.001, and p<0.001, respectively) (Table 2). No significant differences in reliability or quality scores were observed across the different question categories.

Readability

Significant differences were observed between the AI models across all readability indices (SMOG, FKGL, GFI, CLI, and FRE; all p<0.001). Effect size analysis demonstrated large effects for FKGL (W=0.59), SMOG (W=0.55), CLI (W=0.58), and FRE (W=0.49), and a very large effect for GFI (W=0.86), indicating substantial differences in textual complexity across models (Table 3). ChatGPT-5.2 demonstrated significantly higher FRE scores and significantly lower FKGL, SMOG, GFI, and CLI scores compared with all other models (Table 4), indicating superior readability.

In pairwise comparisons between ChatGPT-4, Gemini, and Copilot, mixed results were observed depending on the index. No significant difference was found between ChatGPT-4 and Copilot for the CLI score, or between Gemini and Copilot for the SMOG score (p=0.624) (Table 4).

FRE

Significant differences were observed between the models (p<0.001). Copilot was more readable than ChatGPT-4 and Gemini (p=0.002 and p<0.001, respectively). ChatGPT-5.2 was more readable than all other models (all p<0.001).

FKGL

Significant differences were again evident between the models (p<0.001). Copilot was more readable than ChatGPT-4 and Gemini, and Gemini was more readable than ChatGPT-4 (all p<0.001). ChatGPT-5.2 demonstrated significantly better readability than all other models (all p<0.001).

SMOG

Significant differences were again detected between the models (p<0.001). ChatGPT-4 had lower SMOG scores than Gemini and Copilot (p<0.001 and p=0.002, respectively). ChatGPT-5.2 had significantly lower SMOG scores than all other models (all p<0.001).

GFI

GFI scores ranked from lowest to highest were ChatGPT-5.2, ChatGPT-4, Copilot, and Gemini, with all pairwise comparisons being statistically significant.

CLI

Based on CLI scores, ChatGPT-4 and Copilot were more readable than Gemini (both p<0.001). ChatGPT-5.2 had significantly lower CLI scores than all other models (p=0.001 vs. ChatGPT-4; p<0.001 vs. Gemini and Copilot) (Tables 3 and 4).

Discussion

In this study, the quality, reliability, and readability of responses provided by four popular AI chatbots, ChatGPT-4, ChatGPT-5.2, Gemini, and Copilot, to FAQs about CH were evaluated. To the best of our knowledge, this is the first study to compare the responses of four different LLMs to CH-related FAQs.

CH is now usually diagnosed during the neonatal period, most commonly through heel-prick blood screening. If early treatment is not initiated, it can lead to irreversible intellectual disability and developmental delay, making it a major source of anxiety for parents of newborns (8). For this reason, many parents seek information about CH through LLMs. These systems have been reported to assist healthcare professionals in areas such as disease diagnosis, treatment planning, prognostic assessment, and public health management, and they may also influence patient decision-making in healthcare (33). By comparing the quality, reliability, and readability of LLMs, the present study provided insight into their suitability for use by parents.

We found the reliability and quality of all LLMs to be in the moderate-to-high range, but with significant differences between them. Ranked from lowest to highest, the models were Copilot, ChatGPT-4 and Gemini, and ChatGPT-5.2. ChatGPT-5.2 was significantly more reliable than ChatGPT-4 and Copilot and demonstrated higher quality than ChatGPT-4, Gemini, and Copilot. Consistent with our findings, a previous study evaluating ChatGPT, Perplexity, ChatSonic, and Microsoft Bing AI reported that the information quality of the responses was moderate to high (34). Gül et al. (14) found lower mDISCERN scores for ChatGPT and Gemini and higher scores for Perplexity. Another study reported that Gemini achieved higher GQS scores compared with other chatbots (35). The superior performance of ChatGPT-5.2 in our study may be attributed to its concise and accurate responses and the high accuracy of the sources it provided, while Gemini’s provision of references and direct links alongside its answers likely contributed to its relatively high scores.

Recent studies have shown that the accuracy, quality, and clinical appropriateness of LLM responses depend largely on the clarity and specificity of user prompts. Sarangi and Mondal (36) showed that LLMs, such as ChatGPT, Google Bard, and Microsoft Bing, perform better when prompted with clear and well-defined queries. Therefore, the questions in the present study were developed from internationally recognized, evidence-based patient education materials that were specifically designed for patients and caregivers, written in understandable language, and reviewed by healthcare professionals. Within the scope of the validated assessment tools used in this study, all evaluated LLMs demonstrated generally high levels of reliability and quality.

We also evaluated the readability of the responses. In the United States, the average literacy level corresponds to approximately the 7th-8th grade; however, according to the AMA, health education materials should be written at the 6th-grade level. This recommendation is based on the fact that patients’ comprehension decreases when they are dealing with illness and psychological stress, and therefore even complex medical conditions should be explained in very simple language (27). Nevertheless, previous studies have shown that a substantial proportion of online patient education materials exceed the recommended readability levels, which is considered inappropriate from a public health perspective (19, 20, 21, 24).

While significant differences were observed across models, effect size analysis showed that reliability and quality differences were at most small-to-moderate, whereas readability differences were large to very large. This suggests that although overall content quality was relatively comparable among models, substantial variability existed in textual complexity, which may have a meaningful effect on patient comprehension and overall health literacy.

In the present study, the responses generated by ChatGPT-5.2 were found to correspond approximately to a 9th-10th grade reading level, whereas those of Copilot corresponded to a 12th-14th grade level, ChatGPT-4 to a 13th-14th grade level, and Gemini to a 14th-16th grade level. ChatGPT-5.2 was significantly more readable than all other LLMs. Pairwise comparisons among ChatGPT-4, Gemini, and Copilot yielded variable significant differences depending on the readability index used.

Momenaei et al. (37) reported that understanding ChatGPT’s responses on retinal disease surgery required a university-level education. Similarly, another study found that responses provided by ChatGPT, Bard, and Microsoft Bing Chat to palliative care-related questions were written at approximately a 10th-grade reading level (38). Although studies directly comparing ChatGPT, Gemini, Copilot, and particularly ChatGPT-5.2 in terms of readability are limited, existing evidence, consistent with our findings, indicates that the reading level of LLM-generated content generally exceeds the recommended 6th-grade level. Previous studies have demonstrated that ChatGPT versions can lower the reading grade level of their output when given specific instructions (39, 40, 41). These findings suggest that incorporating tailored prompts aimed at simplifying language could enhance readability in future applications. The superior readability of ChatGPT-5.2 observed in our study may also be attributed to its more advanced language model architecture compared with the other LLMs.

In a study evaluating the knowledge levels of caregivers of children with CH, insufficient knowledge was identified as a major barrier to effective follow-up. It was suggested that healthcare professionals providing information about CH, which is one of the leading preventable causes of intellectual and developmental disability, should use clear, simple, and patient-appropriate language (42). Education is a key component of disease management (43), and studies have shown that providing patients and caregivers with personalized information improves adherence to medical recommendations and leads to better health outcomes (44).

The use of LLMs by caregivers, in addition to the education provided by healthcare professionals, has become increasingly common with the recent expansion and widespread use of AI-based applications. Although the use of LLMs is known to enhance access to healthcare information, concerns remain regarding the potential for misleading content, variability in quality, and readability levels that may exceed those appropriate for the general population (14, 45). Therefore, AI-based tools should be used cautiously, and consultation with healthcare professionals should be encouraged when necessary. Reassuringly, all LLMs evaluated in our study included warning statements advising users to consult a physician or noting that the information provided should not be used as a substitute for medical decision-making. Previous research into digital platforms such as YouTube and web-based resources, which represent other major sources of health information for patients, has demonstrated considerable variability in the readability, reliability, and quality of such content, highlighting the importance of ongoing evaluation of online health information (18, 46).

Study Limitations

The analysis was limited to English-language responses, as English is the most commonly used language in general online information seeking. Therefore, the findings cannot be directly generalized to content generated in other languages. In addition, the use of a single readability calculator may have introduced minor variability in readability estimates, although the tool employed has been widely used in previous studies (35). Furthermore, the findings are based on chatbot responses obtained in December 2025; given the continuous updating of LLMs, these results are likely to change over time and may improve further. LLM-generated responses are not fully deterministic and may vary across sessions or over time due to model updates and probabilistic generation mechanisms. Therefore, exact reproducibility of responses cannot be fully guaranteed.

Strengths

This study represents the first comprehensive evaluation of multiple LLMs specifically about CH. The use of validated and widely accepted assessment tools, standardized prompts, and expert evaluation enhances the methodological robustness and objectivity of the findings, enabling a reliable comparison across models.

Conclusion

The present study showed that although all four chatbots produced CH-related content with moderate to good reliability and quality, ChatGPT-5.2 outperformed the others in reliability, quality, and readability, despite overall readability exceeding the recommended sixth-grade level. The potential of AI-based tools to provide accurate, understandable, and reliable information about CH, which is screened for in a large proportion of neonates in many countries, to parents and caregivers is of great importance. To minimize the risk of misinformation and improve user experience, both continued model development and appropriate prompt formulation remain important factors influencing the quality, reliability, and readability of LLM-generated responses. Nevertheless, regardless of how advanced LLMs become, they are currently a long way from replacing face-to-face medical consultations and clinical evaluation of patients by physicians.

Ethics

Ethics Committee Approval: This study did not involve human participants or patient-level data. All evaluated responses were obtained from publicly accessible artificial intelligence platforms. Therefore, ethics committee approval was not required.
Informed Consent: Informed consent was not required.

Acknowledgements

We would like to thank the pediatric endocrinologists who scored the performance of the large language models in answering frequently asked patient questions about CH.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request. The complete list of FAQs included in the analysis is available in the Supporting Information.

Authorship Contributions

Concept: Ebru Barsal Çetiner, Design: Ebru Barsal Çetiner, Berna Singin, Data Collection or Processing: Ebru Barsal Çetiner, Berna Singin, Analysis or Interpretation: Ebru Barsal Çetiner, Berna Singin, Literature Search: Ebru Barsal Çetiner, Writing: Ebru Barsal Çetiner, Berna Singin.
Conflict of Interest: None declared.
Financial Disclosure: There is no specific funding related to this research.

References

1
Kurinczuk JJ, Bower C, Lewis B, Byrne G. Congenital hypothyroidism in Western Australia 1981-1998. J Paediatr Child Health. 2002;38:187-191.
2
Hinton CF, Harris KB, Borgfeld L, Drummond-Borg M, Eaton R, Lorey F, Therrell BL. Trends in incidence rates of congenital hypothyroidism related to select demographic factors: data from the United States, California, Massachusetts, New York, and Texas. Pediatrics. 2010;125(Suppl 2):S37-S47.
3
Waller DK, Anderson JL, Lorey F, Cunningham GC. Risk factors for congenital hypothyroidism: an investigation of infant’s birth weight, ethnicity, and gender in California, 1990-1998. Teratology. 2000;62:36-41.
4
Tuli G, Munarin J, Tessaris D, Matarazzo P, Einaudi S, de Sanctis L. Incidence of primary congenital hypothyroidism and relationship between diagnostic categories and associated malformations. Endocrine. 2021;71:122-129. Epub 2020 Jun 7.
5
Deladoëy J, Ruel J, Giguère Y, Van Vliet G. Is the incidence of congenital hypothyroidism really increasing? A 20-year retrospective population-based study in Québec. J Clin Endocrinol Metab. 2011;96:2422-2429. Epub 2011 Jun 1.
6
Danner E, Niuro L, Huopio H, Niinikoski H, Viikari L, Kero J, Jääskeläinen J. Incidence of primary congenital hypothyroidism over 24 years in Finland. Pediatr Res. 2022;93:649-653. Epub 2022 Jun 3.
7
McGrath N, Hawkes C, McDonnell C, Cody D, O’Connell SM. Incidence of congenital hypothyroidism over 37 years in Ireland. Pediatrics. 2018;142:e20181199.
8
van Trotsenburg P, Stoupa A, Léger J, Rohrer T, Peters C, Fugazzola L, Cassio A, Heinrichs C, Beauloye V, Pohlenz J, Rodien P, Coutant R, Szinnai G, Murray P, Bartés B, Luton D, Salerno M, de Sanctis L, Vigone M, Krude H, Persani L, Polak M. Congenital hypothyroidism: a 2020-2021 consensus guidelines update-an ENDO-European Reference Network Initiative Endorsed by the European Society for Pediatric Endocrinology and the European Society for Endocrinology. Thyroid. 2021;31:387-419.
9
Selva KA, Harper A, Downs A, Blasco PA, Lafranchi SH. Neurodevelopmental outcomes in congenital hypothyroidism: comparison of initial T4 dose and time to reach target T4 and TSH. J Pediatr. 2005;147:775-780.
10
Rose SR, Wassner AJ, Wintergerst KA, Yayah-Jones NH, Hopkin RJ, Chuang J, Smith JR, Abell K, LaFranchi SH; Section on Endocrinology Executive Committee; Council on Genetics Executive Committee. Congenital hypothyroidism: screening and management. Pediatrics. 2023;151:e2022060420.
11
Aleksander PE, Brückner-Spieler M, Stoehr AM, Lankes E, Kühnen P, Schnabel D, Ernert A, Stäblein W, Craig ME, Blankenstein O, Grüters A, Krude H. Mean high-dose l-thyroxine treatment is efficient and safe to achieve a normal IQ in young adult patients with congenital hypothyroidism. J Clin Endocrinol Metab. 2018;103:1459-1469.
12
Gunduz ME, Matis GK, Ozduran E, Hanci V. Evaluating the readability, quality, and reliability of online patient education materials on spinal cord stimulation. Turk Neurosurg. 2024;34:588-599.
13
Bagcier F, Yurdakul O, Ozduran E. Top 100 cited articles on ankylosing spondylitis. Reumatismo. 2020;72:218-227.
14
Gül Ş, Erdemir İ, Hancı V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT and BARD. Medicine (Baltimore). 2024;103:e38009.
15
Hopkins JM, Logan J, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr. 2023;7:pkad010.
16
Zhang S, Liau ZQG, Tan KLM, Chua WL. Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement. Knee Surg Relat Res. 2024;36:15.
17
Ozduran E, Hanci V, Erkin Y. Evaluating the readability, quality and reliability of online patient education materials on chronic low back pain. Natl Med J India. 2024;37:124-130.
18
Erkin Y, Hanci V, Ozduran E. Evaluation of the reliability and quality of YouTube videos as a source of information for transcutaneous electrical nerve stimulation. PeerJ. 2023;11:e15412.
19
Ozduran E, Hanci V. Evaluating the readability, quality, and reliability of online information on Sjögren’s syndrome. Indian J Rheumatol. 2023;18:16-25.
20
Özduran E, Hanci V. Evaluating the readability, quality and reliability of online information on Behçet’s disease. Reumatismo. 2022;74:49-60.
21
Özduran E, Büyükçoban S. Evaluating the readability, quality and reliability of online patient education materials on post-COVID pain. PeerJ. 2022;10:e13686.
22
Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15:e35179.
23
Özbek İ, Hancı V, Özduran E. Digital guidance: quality and readability analysis of artificial intelligence-generated spondyloarthropathy texts. Turk J Osteoporos. 2025;31:12-18.
24
Erkin Y, Hanci V, Ozduran E. Evaluating the readability, quality and reliability of online patient education materials on transcutaneous electrical nerve stimulation (TENS). Medicine (Baltimore). 2023;102:e33529.
25
Ladhar S, Koshman SL, Yang F, Turgeon R. Evaluation of online written medication educational resources for people living with heart failure. CJC Open. 2022;4:858-865.
26
Moult B, Franck LS, Brady H. Ensuring quality information for patients: development and preliminary validation of a new instrument to improve the quality of written health care information. Health Expect. 2004;7:165-175.
27
Avra TD, Le M, Hernandez S, Thure K, Ulloa JG. Readability assessment of online peripheral artery disease education materials. J Vasc Surg. 2022;76:1728-1732. Epub 2022 Aug 2.
28
Simpson D. The readability test tool. Available from: https://www.readabilityformulas.com
29
Pohl NB, Derector E, Rivlin M, Bachoura A, Tosti R, Kachooei AR, Beredjiklian PK, Fletcher DJ. A quality and readability comparison of artificial intelligence and popular health website education materials for common hand surgery procedures. Hand Surg Rehabil. 2024;43:101723. Epub 2024 May 21.
30
Lee B, Dixon E, Wales DP. Evaluation of reading level of result letters sent to patients from an academic primary care practice. Health Serv Res Manag Epidemiol. 2023;10:23333928231172142.
31
Onder E, Ensari E. ChatGPT-4o’s performance on pediatric vesicoureteral reflux. J Pediatr Urol. 2025;21:504-509. Epub 2024 Dec 7.
32
Kara M, Ozduran E, Kara MM, Özbek İC, Hancı V. Evaluating the readability, quality, and reliability of responses generated by ChatGPT, Gemini, and Perplexity on the most commonly asked questions about ankylosing spondylitis. PLoS One. 2025;20:e0326351.
33
Khosravi M, Zare Z, Mojtabaeian SM, Izadi R. Artificial intelligence and decision-making in healthcare: a thematic analysis of a systematic review of reviews. Health Serv Res Manag Epidemiol. 2024;11:23333928241234863.
34
Musheyev D, Pan A, Loeb S, Kabarriti AE. How well do artificial intelligence chatbots respond to the top search queries about urological malignancies? Eur Urol. 2024;85:13-16. Epub 2023 Aug 10.
35
Ozduran E, Akkoc I, Büyükçoban S, Erkin Y, Hanci V. Readability, reliability and quality of responses generated by ChatGPT, Gemini, and Perplexity for the most frequently asked questions about pain. Medicine (Baltimore). 2025;104:e41780.
36
Sarangi P, Mondal H. Response generated by large language models depends on the structure of the prompt. Indian J Radiol Imaging. 2024;34:574-575.
37
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Appropriateness and readability of ChatGPT-4-generated responses for surgical treatment of retinal diseases. Ophthalmol Retina. 2023;7:862-868. Epub 2023 Jun 3.
38
Kim MJ, Admane S, Chang YK, Shih KK, Reddy A, Tang M, La Cruz M, Taylor TP. Chatbot performance in defining and differentiating palliative care, supportive care, hospice care. J Pain Symptom Manage. 2024;67:e381-e391. Epub 2024 Jan 12.
39
Foster BJ, Mitsnefes M, Dahhou X, Zhang X, Laskin BL. Changes in excess mortality from end-stage renal disease in the United States from 1995 to 2013. Clin J Am Soc Nephrol. 2018;13:91-99. Epub 2017 Dec 14.
40
Garcia Valencia OA, Thongprayoon C, Miao J, Suppadungsuk S, Krisanapan P, Craici IM, Jadlowiec CC, Mao SA, Mao MA, Leeaphorn N, Budhiraja P, Cheungpasitporn W. Empowering inclusivity: improving readability of living kidney donation information with ChatGPT. Front Digit Health. 2024;6:1366967.
41
Zaki HA, Mai M, Abdel-Megid H, Liew SQR, Kidanemariam S, Omar AS, Tiwari U, Hamze J, Ahn SH, Maxwell AWP. Using ChatGPT to improve readability of interventional radiology procedure descriptions. Cardiovasc Intervent Radiol. 2024;47:1134-1141. Epub 2024 Jul 9.
42
Brito LNS, de Andrade CLO, de Aragão Dantas Alves C. Adhesion to treatment by children with congenital hypothyroidism: knowledge of caregivers in Bahia State, Brazil. Rev Paul Pediatr. 2021;39:e2020074.
43
Wolf MS, Gazmararian JA, Baker DW. Health literacy and functional health status among older adults. Arch Intern Med. 2005;165:1946-1952.
44
Grippaudo FR, Nigrelli S, Patrignani A, Ribuffo D, Tognini S, La Mastra C. Quality of the information provided by ChatGPT for patients in breast plastic surgery: are we already in the future? JPRAS Open. 2024;40:99-105.
45
Association of Women’s Health, Obstetric and Neonatal Nurses. Health literacy. Available from: http://lib.ncfh.org/pdfs/6617.pdf
46
Özduran E. “Bel ağrısı” ile ilgili Türkçe internet kaynaklı hasta eğitim materyallerinin okunabilirliklerinin değerlendirilmesi. DEU Tıp Derg. 2022;36:135-150.