PURPOSE: To evaluate and compare the performance of four leading artificial intelligence (AI) models (ChatGPT 4o, ChatGPT o1, Claude 3.5 Sonnet, and Gemini 2.0 Flash Experimental) in answering ophthalmology questions from two popular board preparation question resources, and to analyze performance variations across subspecialties and resources.
METHODS: From the 398 available questions in the ebodtraining.com question bank, 344 text-based questions were selected and organized to include 35 questions per subspecialty. The same number of questions per subspecialty was randomly selected from eyedocs.co.uk to match those from ebodtraining.com. ChatGPT 4o, ChatGPT o1, Claude 3.5 Sonnet, and Gemini 2.0 Flash Experimental were tested on these questions, and each response was graded as either correct or incorrect, allowing calculation of both overall and subspecialty-specific accuracy.
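As an illustration only, the overall and subspecialty-specific accuracy figures could be computed from the graded responses along the following lines; the record layout and field names here are assumptions for the sketch, not the study's actual scoring pipeline.

```python
from collections import defaultdict

# Hypothetical graded records: (model, subspecialty, is_correct).
# The study's real data format is not specified in the abstract.
responses = [
    ("ChatGPT o1", "Neuro-ophthalmology", True),
    ("ChatGPT o1", "Glaucoma", False),
    ("Claude 3.5 Sonnet", "Glaucoma", True),
    # ... one record per question answered by each model
]

def accuracy(records):
    """Proportion of correct answers among the given records."""
    return sum(correct for _, _, correct in records) / len(records)

# Overall accuracy per model.
by_model = defaultdict(list)
for model, subspecialty, correct in responses:
    by_model[model].append((model, subspecialty, correct))
for model, records in by_model.items():
    print(f"{model}: overall accuracy {accuracy(records):.1%}")

# Subspecialty-specific accuracy per model.
by_model_sub = defaultdict(list)
for model, subspecialty, correct in responses:
    by_model_sub[(model, subspecialty)].append((model, subspecialty, correct))
for (model, subspecialty), records in by_model_sub.items():
    print(f"{model} / {subspecialty}: {accuracy(records):.1%}")
```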
RESULTS: The four AI models were evaluated on two ophthalmology question banks: ebodtraining.com (344 questions) and eyedocs.co.uk (345 questions). On ebodtraining.com, ChatGPT o1 achieved 88.0% accuracy, followed by Claude 3.5 Sonnet (84.7%), Gemini (81.7%), and ChatGPT 4o (81.2%), with all models showing weaker performance in the Neuro-ophthalmology section. Similarly, on eyedocs.co.uk, ChatGPT o1 led with 88.4%, while Claude 3.5 Sonnet reached 84.6%, Gemini 79.2%, and ChatGPT 4o 73.4%. ChatGPT o1 significantly outperformed ChatGPT 4o on both platforms and demonstrated higher accuracy than Claude 3.5 Sonnet and Gemini across multiple subspecialties.
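The abstract does not name the statistical test behind "significantly outperformed"; one common choice when two models answer the same question set is McNemar's test on the paired correct/incorrect outcomes. The sketch below is a hypothetical illustration of that approach, with made-up data, and is not a description of the study's actual analysis.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question outcomes (True = correct) for two models on the
# same questions; illustrative values only.
o1_correct    = [True, True, False, True, True, False, True, True]
gpt4o_correct = [True, False, False, True, False, False, True, True]

# Build the 2x2 paired-outcome table:
# rows = o1 correct/incorrect, columns = 4o correct/incorrect.
both       = sum(a and b for a, b in zip(o1_correct, gpt4o_correct))
o1_only    = sum(a and not b for a, b in zip(o1_correct, gpt4o_correct))
gpt4o_only = sum(b and not a for a, b in zip(o1_correct, gpt4o_correct))
neither    = sum(not a and not b for a, b in zip(o1_correct, gpt4o_correct))

table = [[both, o1_only], [gpt4o_only, neither]]
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")
```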
CONCLUSION: In the modern world, time is increasingly precious, and AI models allow students to obtain information and explanations rapidly. In addition, by asking follow-up questions, students can receive personalised answers, save time, and gain a tailored learning experience. However, it should be kept in mind that although AI models demonstrate promising capabilities in ophthalmology board examination preparation, their performance varies significantly across subspecialties and question types. These tools can serve as valuable supplementary resources for exam preparation, but they cannot replace comprehensive clinical training and expertise.