The use of large language models (LLMs) like GPT-3.5 and GPT-4 in clinical settings has become an exciting and highly debated topic, particularly regarding their potential role in emergency departments (EDs). A recent study, “Evaluating the use of large language models to provide clinical recommendations in the Emergency Department,” delves deep into the capabilities and limitations of these models when tasked with making clinical recommendations based on real-world emergency department notes. Let’s explore the key findings, their significance, and what the future might hold for AI in healthcare.
The research, led by Christopher Y.K. Williams and his colleagues, sought to determine how well GPT-3.5-turbo and GPT-4-turbo could perform in three specific clinical recommendation tasks: deciding whether a patient should be admitted to the hospital, determining if a radiological investigation is needed, and assessing whether antibiotics should be prescribed. While these tasks are critical in emergency medicine, they are also complex because they require a high level of judgment, balancing risks and resources. The results offer a cautious view of where AI stands today in such life-critical applications.
The team used real-world data from 10,000 patient visits and evaluated the accuracy of the LLMs’ recommendations against the decisions made by resident physicians. The first significant finding was that both GPT-3.5 and GPT-4 underperformed the human physicians, with accuracy as much as 24% lower for GPT-3.5 and around 8% lower for GPT-4 on some tasks. This gap highlights the complexity of human decision-making in clinical settings, which involves more than raw data processing.
“While early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before their deployment as decision support systems for clinical recommendations and other complex tasks,” stated coauthor Brenda Y. Miao.
Both models, particularly GPT-3.5, demonstrated a tendency to err on the side of caution. For example, they recommended hospital admissions and radiological investigations more often than necessary, indicating high sensitivity but poor specificity. If such models were deployed in real-world clinical settings, this over-caution could have serious consequences: unnecessary tests, overburdened hospital resources, and increased healthcare costs.
One of the central challenges highlighted in the study is striking a balance between sensitivity (correctly identifying when an intervention is needed) and specificity (correctly identifying when it is not). For example, in the task of deciding whether antibiotics should be prescribed, GPT-4 actually outperformed the resident physicians, with an accuracy of 83% compared to their 78%. This suggests that in more straightforward decision-making scenarios, AI can sometimes provide valuable assistance, even surpassing human performance. However, this was not consistent across tasks, showing that LLMs are far from reliable across the board.
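To make the sensitivity/specificity trade-off concrete, here is a minimal Python sketch of how these metrics are computed for a binary recommendation task. The label lists are invented for illustration only; they are not data from the study.

```python
# Minimal sketch: sensitivity, specificity, and accuracy for a binary
# clinical recommendation task (1 = intervene, 0 = do not intervene).
# The label lists below are hypothetical, not data from the study.

def evaluate(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    sensitivity = tp / (tp + fn)   # share of needed interventions that were caught
    specificity = tn / (tn + fp)   # share of unneeded interventions that were avoided
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy

# An over-cautious model that recommends intervention for almost everyone
# scores perfectly on sensitivity but poorly on specificity and accuracy.
physician_decisions = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical "ground truth"
cautious_model      = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]  # hypothetical LLM output
print(evaluate(physician_decisions, cautious_model))
# -> (1.0, 0.2857..., 0.5)
```

As the toy numbers show, a model can catch every case that truly needed intervention and still be wrong half the time overall, which is exactly the pattern of over-cautious recommendations the study describes.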
To improve the models’ performance, the research team experimented with various prompt engineering techniques. These included strategies such as asking the model to “think step by step” before making a recommendation. This form of “chain-of-thought” reasoning improved the LLMs’ performance slightly by encouraging more systematic and logical output, but it was not enough to close the gap between AI and human clinicians.
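As a rough illustration of what such a prompt might look like, here is a sketch using the OpenAI Python SDK. The system message, prompt wording, model settings, and answer format are assumptions made for this example; they are not the prompts used in the paper.

```python
# Sketch of a chain-of-thought style prompt for one recommendation task
# (hospital admission). The wording, system message, and settings below are
# illustrative assumptions, not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ed_note = "..."  # a de-identified emergency department note would go here

response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "You are a clinical decision-support assistant."},
        {"role": "user",
         "content": (
             "Based on the emergency department note below, should this "
             "patient be admitted to the hospital? Think step by step, "
             "then finish with a single line: 'Answer: yes' or 'Answer: no'.\n\n"
             + ed_note
         )},
    ],
)

print(response.choices[0].message.content)
```

The “think step by step” instruction is what elicits chain-of-thought reasoning; the final “Answer: yes/no” line simply makes the recommendation easy to extract and score against the physicians’ decisions.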
One reason for the disparity between human and machine recommendations lies in the complexity of clinical decision-making. Doctors not only rely on the raw data of patient symptoms and histories but also take into account external factors such as resource availability, patient preferences, and social determinants of health. These are areas where current LLMs still fall short.
As Miao notes, “LLMs are overly cautious in their clinical recommendations and exhibit a tendency to recommend intervention, which leads to a notable number of false positive suggestions.” In other words, the AI models lacked the nuanced judgment that human physicians bring to bear on their decisions.
The findings from this study are important because they provide a realistic assessment of where LLMs stand in their journey toward becoming reliable tools for clinical decision support. The results caution against premature deployment of these models in critical care settings without significant improvements in accuracy and specificity.
That said, the potential benefits of AI in healthcare remain promising. As GPT-4’s higher accuracy in antibiotic prescription demonstrates, there are areas where AI could augment clinical decision-making, helping to reduce errors, improve efficiency, and offer second opinions that may increase overall patient safety. As LLMs improve and as healthcare systems find better ways to integrate these models into clinical workflows, the future could see more widespread use of AI to assist, but not replace, human doctors.
As lead researcher Christopher Williams highlights, “Before LLMs can be integrated into the clinical environment, it is important to fully understand both their capabilities and limitations. Otherwise, there is a risk of unintended harmful consequences, especially if models have been deployed at scale.”
For more, visit: https://doi.org/10.1038/s41467-024-52415-1