AI healthcare can’t beat a specialist

Published on: September 9, 2024 | Last updated: September 9, 2024

The capabilities, limitations and risks of generative AI are currently a topic of major interest, with widely varying predictions about the roles that language models may one day be able to fill. An area that’s frequently brought up in this regard is healthcare, and this study is certainly food for thought.

Researchers from several institutions, including the School of Clinical Medicine at the University of Cambridge, tasked several language models with answering a large number of ophthalmology multiple-choice questions taken from a medical textbook.

The same questions were shown to eye doctors, junior eye doctors, and junior doctors who had not yet picked a specialty, with the latter group intended to roughly correspond to the level of ophthalmology knowledge that might be expected of a GP. The full results are published here.

To summarise, language model performance varied widely, but the best results came from GPT-4, which answered 69 per cent of the questions correctly. That was significantly better than the unspecialised doctors (43 per cent on average) and slightly better than the ophthalmology trainees (59 per cent), but worse than the ophthalmologists (76 per cent on average, with the highest mark being 90 per cent).

As Dr Arun Thirunavukarasu, who led the study, suggests, one could speculate from this that language models won't replace specialists, but may one day be suitable for a role in triage – determining, as a first port of call, whether a case is serious enough to be referred to a specialist for an expert opinion, much as a GP does currently.

On the other hand, it strikes me that multiple-choice textbook questions are likely to play to GPT-4's strengths. The researchers note that the questions weren't part of the language model's training data, but it would nonetheless be unsurprising if other textbooks contained similar questions that the AI had encountered during training.

And perhaps more obviously, receiving a written summary of a patient's condition is very different from being confronted with a real-life patient and examining their eyes yourself – something a language model wouldn't seem to stand much chance of doing.

Textbook questions are usually written in a way that’s intended to lead the reader towards the correct answer, whereas in reality everyone’s eyes are different and human judgment would seem an essential ingredient in a correct diagnosis. As the authors note, “Examination performance is an unvalidated indicator of clinical aptitude”.

Still, this may be a hint at what could one day be possible as AI continues to develop.
