Published Date: 26/08/2025
In the rapidly advancing field of artificial intelligence (AI), large language models (LLMs) are showing promise across a range of healthcare applications. A recent study conducted at the Eye and ENT Hospital of Fudan University in Shanghai, China, evaluated the performance of five popular LLMs in addressing cataract-related queries. The models assessed were ChatGPT-4, ChatGPT-4o, Gemini, Copilot, and the open-source Llama 3.5. The evaluation combined qualitative and quantitative assessments, benchmarking the models' outputs against human-generated responses on seven key metrics: accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability.
The study was designed to provide a comprehensive understanding of how these LLMs perform in a clinical context, particularly in patient education and cataract care. The Eye and ENT Hospital of Fudan University, a leading institution in ophthalmology, provided the necessary expertise and resources to conduct this rigorous evaluation. The research team, comprising professionals from both Fudan University and the Zhejiang Provincial Hospital of Chinese Medicine, ensured a multidisciplinary approach to the study.
In the information quality assessment, ChatGPT-4o emerged as the top performer, achieving the highest scores in accuracy (6.70 ± 0.63), completeness (4.63 ± 0.63), and harmlessness (3.97 ± 0.17). Gemini, by contrast, excelled in conciseness, with a score of 4.00 ± 0.14. These results suggest that ChatGPT-4o is particularly well-suited to providing detailed and safe responses to cataract-related questions, while Gemini favors brevity.
Further subgroup analysis revealed that all LLMs performed comparably to or better than humans across different clinical topics. This suggests that these models can effectively handle a wide range of cataract-related queries, from basic patient education to more complex clinical issues. The readability assessment, however, showed that ChatGPT-4o had the lowest readability score (26.02 ± 10.78), indicating the highest level of reading difficulty. Copilot, with a readability score of 40.26 ± 14.58, was more readable but still less so than human-generated content, which had a score of 51.54 ± 13.71.
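The study does not name its readability metric here, but the reported scores (where lower values mean harder text and human-written content scores around 50) are consistent with the widely used Flesch Reading Ease formula. As an illustration only, a minimal sketch of that formula, with a rough heuristic syllable counter, might look like this:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, with a silent-'e' adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentence)
    - 84.6*(syllables/word). Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Simple patient-facing phrasing scores far higher (easier) than
# jargon-heavy clinical phrasing, mirroring the gap the study reports
# between human-written content and ChatGPT-4o's output.
simple = "The eye has a lens. The lens can get cloudy."
jargon = "Phacoemulsification necessitates comprehensive perioperative evaluation."
print(flesch_reading_ease(simple) > flesch_reading_ease(jargon))
```

Production readability tools use more careful syllable counting (often dictionary-based), so this heuristic should be treated as a sketch rather than a faithful reproduction of the study's scoring pipeline.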
Reproducibility and stability were another critical part of the assessment. Copilot demonstrated the best stability, which is crucial for ensuring consistent and reliable responses in a clinical setting. All LLMs also exhibited strong self-correction capability when prompted, a feature that can help mitigate potential errors and improve the accuracy of responses over time.
Despite these promising results, the study emphasizes the importance of cautious and critical evaluation of AI in clinical practice. Clinicians and patients should be aware of the limitations of AI, particularly in terms of readability and the potential for complex medical jargon. The research team concluded that LLMs have considerable potential in providing accurate and comprehensive responses to common cataract-related clinical issues, with ChatGPT-4o leading the way in accuracy, completeness, and harmlessness.
This study underscores the growing role of AI in healthcare, particularly in patient education and clinical decision-making. As LLMs continue to evolve, they have the potential to significantly enhance the quality of care for patients with cataracts and other ophthalmic conditions. However, ongoing research and validation are necessary to ensure that these technologies are used effectively and safely in clinical settings.
In summary, the study at the Eye and ENT Hospital of Fudan University provides valuable insights into the performance of LLMs in cataract care. The results highlight the potential of AI to improve patient education and clinical practice, while also emphasizing the importance of continued research and careful implementation. As AI technology advances, it is likely to play an increasingly important role in the future of healthcare.
Q: What is the main purpose of the study?
A: The main purpose of the study was to evaluate the performance of five large language models (LLMs) in addressing cataract-related queries, focusing on metrics such as accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability.
Q: Which LLM performed the best in the study?
A: ChatGPT-4o performed the best, achieving the highest scores in accuracy, completeness, and harmlessness.
Q: What were the key metrics used to assess the LLMs?
A: The key metrics used to assess the LLMs were accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability.
Q: How did the LLMs compare to human-generated responses?
A: The LLMs performed comparably to or better than human-generated responses in most metrics, with some models excelling in specific areas like conciseness and stability.
Q: What are the limitations of using AI in cataract care?
A: The limitations of using AI in cataract care include readability issues, potential for complex medical jargon, and the need for ongoing research and validation to ensure safe and effective use in clinical settings.