AI | Dec 31, 2025

Large Language Models Transform Clinical Decision Support and Diagnostic Workflows

Large language models (LLMs), particularly GPT-4, ChatGPT, and emerging competitors, demonstrated clinically meaningful diagnostic accuracy across diverse medical specialties in 2025, marking a potential inflection point for AI-assisted clinical decision-making [1][2][3][4][5].

Diagnostic performance studies found that GPT-4 achieved accuracy comparable or superior to that of physicians in structured scenarios, reaching 93.75% diagnostic accuracy on short clinical vignettes when structured prompting was used [1]. In rheumatology, ChatGPT-4's accuracy for inflammatory rheumatic diseases was comparable to that of rheumatologists (top diagnosis correct in 35% vs 39% of cases, respectively), and it showed higher sensitivity for detecting these conditions [2]. For rare and undiagnosed diseases, LLMs showed potential to expand differential diagnoses, though the diagnosis they ranked most likely was often incorrect unless refined with clinical context [3][4].
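The paragraph above reports two different metrics, top-diagnosis accuracy and sensitivity, and a model can score modestly on the first while scoring well on the second. The following is a minimal Python sketch of how the two could be computed. The `Case` structure, the function names, and the four toy cases are hypothetical illustrations, not data or methods from the cited studies.

```python
from dataclasses import dataclass


@dataclass
class Case:
    true_dx: str          # reference diagnosis (hypothetical toy label)
    ranked_dx: list[str]  # model's differential, most likely first


def top_k_accuracy(cases: list[Case], k: int = 1) -> float:
    """Fraction of cases whose reference diagnosis appears in the top k."""
    if not cases:
        raise ValueError("no cases supplied")
    hits = sum(1 for c in cases if c.true_dx in c.ranked_dx[:k])
    return hits / len(cases)


def sensitivity(cases: list[Case], positive_labels: set[str]) -> float:
    """Among cases that truly belong to the positive class, the fraction
    whose top-ranked diagnosis also falls in that class (assumed definition)."""
    positives = [c for c in cases if c.true_dx in positive_labels]
    if not positives:
        raise ValueError("no positive cases supplied")
    detected = sum(
        1 for c in positives
        if c.ranked_dx and c.ranked_dx[0] in positive_labels
    )
    return detected / len(positives)


# Toy data for illustration only; not taken from any study.
INFLAMMATORY = {"rheumatoid arthritis", "psoriatic arthritis"}
CASES = [
    Case("rheumatoid arthritis", ["psoriatic arthritis", "rheumatoid arthritis"]),
    Case("osteoarthritis", ["osteoarthritis"]),
    Case("psoriatic arthritis", ["psoriatic arthritis", "osteoarthritis"]),
    Case("fibromyalgia", ["rheumatoid arthritis", "fibromyalgia"]),
]

top1 = top_k_accuracy(CASES, k=1)       # 2 of 4 exact top-ranked matches
top2 = top_k_accuracy(CASES, k=2)       # all 4 found within the top two
sens = sensitivity(CASES, INFLAMMATORY)  # both inflammatory cases flagged
```

In this toy set the exact top diagnosis is right only half the time, yet every truly inflammatory case is flagged as inflammatory, which is the kind of gap between accuracy and sensitivity the rheumatology comparison describes.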

A real-world validation study of prehospital patient care reports found 75% diagnostic accuracy, with only 0.96% of cases resulting in potentially dangerous under-triage [5]. Umbrella reviews identified disease diagnosis and clinical decision-making as the most common application theme, though concerns about accuracy, bias, and ethical implications tempered enthusiasm [1][2]. Key limitations included reduced performance during external validation, difficulty with rare conditions, and reliance on structured, text-based inputs rather than multimodal clinical data.
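To make the under-triage figure concrete, here is a minimal Python sketch of how such a rate could be computed. It assumes under-triage means the model recommends against transport for a patient the reference standard says needs it, and that the denominator is all reports. Both are assumptions for illustration, and the `Report` type and toy records are hypothetical, not drawn from the cited study.

```python
from typing import NamedTuple


class Report(NamedTuple):
    needs_transport: bool       # reference standard (hypothetical field)
    model_says_transport: bool  # model recommendation (hypothetical field)


def under_triage_rate(reports: list[Report]) -> float:
    """Share of all reports where the model withheld needed transport."""
    if not reports:
        raise ValueError("no reports supplied")
    missed = sum(
        1 for r in reports
        if r.needs_transport and not r.model_says_transport
    )
    return missed / len(reports)


def over_triage_rate(reports: list[Report]) -> float:
    """Share of all reports where the model advised unneeded transport."""
    if not reports:
        raise ValueError("no reports supplied")
    excess = sum(
        1 for r in reports
        if r.model_says_transport and not r.needs_transport
    )
    return excess / len(reports)


# Toy data for illustration only; not taken from any study.
REPORTS = [
    Report(needs_transport=True, model_says_transport=True),
    Report(needs_transport=True, model_says_transport=False),   # under-triage
    Report(needs_transport=False, model_says_transport=False),
    Report(needs_transport=False, model_says_transport=True),   # over-triage
    Report(needs_transport=True, model_says_transport=True),
]

under = under_triage_rate(REPORTS)
over = over_triage_rate(REPORTS)
```

Reporting the two error directions separately matters because they carry different clinical risk: under-triage can delay needed care, while over-triage mainly costs resources.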

Why it matters:

For clinicians: LLMs offer decision support that could expand differential diagnoses, reduce diagnostic errors in complex cases, and democratize access to specialist-level reasoning in resource-limited settings. However, current limitations (including hallucinations, inability to integrate physical exam findings, and lack of regulatory oversight) restrict their use to augmenting clinical judgment rather than replacing it. Prompt engineering is emerging as an essential competency for using LLMs effectively.

For researchers: The gap between controlled diagnostic vignettes and real-world clinical complexity remains substantial. Future research must address multimodal integration (imaging, lab data, physical exams), prospective validation in clinical workflows, and mechanisms to ensure model transparency, fairness, and safety across diverse patient populations. Regulatory frameworks for medical AI deployment lag behind technological capability.

References

  1. Xu R, Zhang W, Ma Y, et al. Diagnosis and Triage Performance of Contemporary Large Language Models on Short Clinical Vignettes. J Med Syst. 2025;49(1):141. doi: 10.1007/s10916-025-02284-y
    PubMed: https://pubmed.ncbi.nlm.nih.gov/41108346/
  2. Gräf J, Morf H, Ronicke S, et al. Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4. Rheumatology (Oxford). 2023. [Epub ahead of print]
    PubMed: https://pubmed.ncbi.nlm.nih.gov/37742280/
  3. Shyr C, Cassini TA, Tinker RJ, et al. Large Language Models for Rare Disease Diagnosis at the Undiagnosed Diseases Network. JAMA Netw Open. 2025;8(8):e2528538. doi: 10.1001/jamanetworkopen.2025.28538
    PubMed: https://pubmed.ncbi.nlm.nih.gov/40844783/
  4. Rebollar-Hernández A, Padilla-López LA, Fernández-Hernández H. Evaluation of large language models as a diagnostic aid for complex medical cases. Front Artif Intell. 2024;7. [Epub ahead of print]
    PubMed: https://pubmed.ncbi.nlm.nih.gov/38966538/
  5. Miller ED, Franc JM, Hertelendy AJ, et al. Accuracy of Commercial Large Language Model (ChatGPT) to Predict the Diagnosis for Prehospital Patients Suitable for Ambulance Transport Decisions: Diagnostic Accuracy Study. Prehosp Emerg Care. 2025;29(3):238-242. doi: 10.1080/10903127.2025.2460775
    PubMed: https://pubmed.ncbi.nlm.nih.gov/39889232/