PURPOSE: To evaluate the accuracy and readability of answers given by 3 different large language models (LLMs) to common patient questions about cataract surgery.
METHODS: Three LLMs (ChatGPT, Microsoft Copilot, and Google Gemini) were queried with 30 common questions about cataract surgery. The accuracy of the responses was rated on a Likert scale based on the consensus opinion of two specialists. The readability of the responses was evaluated using three readability indices: the Flesch-Kincaid Grade Level, the Coleman-Liau Index, and the Flesch Reading Ease score.
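For illustration, the sketch below shows how such readability scores could be computed for an LLM response; the study does not specify its tooling, so the Python textstat package and the sample response text are assumptions introduced here.

```python
# Illustrative sketch only; the study does not report which software was used.
# The "textstat" package and the sample response below are assumptions.
import textstat

response = (
    "Cataract surgery is a procedure in which the clouded lens of the eye "
    "is removed and replaced with a clear artificial intraocular lens."
)

fkgl = textstat.flesch_kincaid_grade(response)  # Flesch-Kincaid Grade Level
cli = textstat.coleman_liau_index(response)     # Coleman-Liau Index
fre = textstat.flesch_reading_ease(response)    # Flesch Reading Ease (higher = easier)

print(f"FKGL: {fkgl:.1f}, Coleman-Liau: {cli:.1f}, FRE: {fre:.1f}")
```

Lower grade-level scores and higher Flesch Reading Ease values indicate text that is easier for patients to read.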
RESULTS: None of the LLM responses received a score of 1 for any question. All responses generated by ChatGPT were rated 4 or higher, compared with 90% of Gemini's responses and 27% of Copilot's responses. In terms of readability, the responses of all three LLMs were difficult to read; Copilot showed slightly better readability, followed by Gemini and then ChatGPT.
CONCLUSION: Although ChatGPT's responses were slightly less readable, they were the most accurate in answering cataract surgery-related questions. LLMs may support patient education, but their readability must be improved to ensure effective communication. Future work should focus on making AI-generated responses clearer and more accessible.
Keywords: Artificial intelligence, cataract surgery, large language models.