New Research Reveals Larger AI Models Are Less Reliable

In the world of artificial intelligence (AI), bigger has often been seen as better. Tech companies and researchers have continuously scaled up their AI models, especially language models, which are the brains behind many tools we use today, from voice assistants to automated chatbots. But a recent study by Lexin Zhou and colleagues challenges this assumption, showing that as language models grow in size and complexity, their reliability may actually decline. The study, published in Nature, examines how these larger, more “instructable” models often produce incorrect or misleading results.

Language models like OpenAI’s GPT or Meta’s LLaMA have been designed to get better at understanding and generating human-like text. To improve them, researchers have continuously increased the size of the models, fed them more data, and added more parameters (the internal settings that help the models recognize patterns). This process, called scaling, was thought to make AI smarter and more reliable. After all, if you give a machine more information and processing power, it should become better at answering questions, right? In fact, the study suggests the opposite.

Zhou’s study shows that while these large models may excel at certain tasks, they also become more prone to producing errors, and those errors are often subtle and harder to detect. Older, smaller models would simply say they couldn’t answer a question; today’s bigger models tend to provide confident but wrong answers. In some cases, these responses can look so convincing that even human supervisors struggle to identify the errors. According to the study, “scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often.”

So, what exactly is happening?

The researchers looked at three families of language models: OpenAI’s GPT series, Meta’s LLaMA models, and BigScience’s BLOOM. These models have been designed to handle a wide range of tasks, from simple math problems to complex scientific questions. The findings reveal that while the models perform well in some areas, they consistently fail at tasks that are quite simple for humans, basic questions that should be well within their capability.

A key takeaway from this research is the concept of difficulty concordance: the idea that humans and machines should find the same tasks easy or hard, so a question that is trivial for a person should also be trivial for the model. The study found that while these large models are getting better at handling complex questions, they remain inconsistent on simple ones. In fact, the bigger and more instructable the models get, the more they seem to stumble on tasks that should be easy, such as basic arithmetic or unscrambling simple words.
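To make the idea concrete, here is a minimal sketch, not the paper’s actual methodology, of how one might check difficulty concordance: bucket tasks by how hard humans rate them and see whether the model’s accuracy falls as the human rating rises. The ratings and outcomes below are invented purely for illustration.

```python
# Toy check of "difficulty concordance" (illustrative only, not the study's metric).
# Each entry pairs a human difficulty rating (1 = easy, 5 = hard) with whether
# a model answered that task correctly.
from collections import defaultdict

results = [
    (1, True), (1, True), (1, False),   # an "easy" task the model still misses
    (2, True), (2, False),
    (3, True), (3, False),
    (4, True), (4, False),
    (5, False), (5, False),
]

by_difficulty = defaultdict(list)
for difficulty, correct in results:
    by_difficulty[difficulty].append(correct)

for difficulty in sorted(by_difficulty):
    outcomes = by_difficulty[difficulty]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"human difficulty {difficulty}: model accuracy {accuracy:.0%}")

# Concordance would mean near-perfect accuracy in the easiest buckets;
# the study reports that scaled-up models still stumble there.
```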

An even bigger issue is task avoidance. Earlier models would often skip difficult tasks or admit they couldn’t answer. The new generation of language models, by contrast, tries to answer everything, even when it doesn’t know the right answer. This leads to what the researchers call “ultracrepidarianism,” a fancy term for giving answers beyond one’s knowledge or ability. It’s like asking someone for directions: instead of admitting they don’t know, they confidently send you the wrong way.

This is particularly concerning when these models are used in high-stakes areas, like medicine, education, or scientific research. If a chatbot confidently provides the wrong medical advice, or a language model generates incorrect scientific data, the consequences can be serious. The study’s authors point out, “These findings highlight the need for a fundamental shift in the design and development of general-purpose AI, especially in high-stakes areas where reliability is crucial.”

So, where does this leave the future of AI development?

Zhou and his team suggest that simply making models bigger isn’t the solution. Instead, there needs to be a focus on improving predictability and transparency, ensuring that these models not only get better at answering questions but also know when to decline to answer if they’re unsure. As it stands, these models sometimes hide their lack of understanding behind confident-sounding answers, which can be misleading for users who trust the system.

One potential solution the researchers mention is building AI systems that can better communicate their limitations. For example, models could be trained to say, “I’m not sure about that” or “I don’t have enough information to answer.” This would make AI more honest and less prone to confidently providing incorrect answers. Furthermore, developing AI systems with reject options, where they are taught to refuse to answer rather than guess, could also improve reliability, particularly in fields where accuracy is critical.
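As a rough sketch of what a reject option could look like in practice (this is an assumption-laden illustration, not a design from the paper), a wrapper might only pass along an answer when the model’s confidence score clears a threshold and otherwise abstain. The `model_answer_with_confidence` callable below is a hypothetical stand-in for whatever scoring a real system would use.

```python
# Minimal "reject option" sketch: abstain when confidence is too low.
# The confidence source is hypothetical; no real model API is assumed.
from typing import Callable, Tuple


def answer_or_abstain(
    question: str,
    model_answer_with_confidence: Callable[[str], Tuple[str, float]],
    threshold: float = 0.75,
) -> str:
    answer, confidence = model_answer_with_confidence(question)
    if confidence < threshold:
        return "I don't have enough information to answer that reliably."
    return answer


# Example with a dummy model that is unsure about a simple arithmetic question.
def dummy_model(question: str) -> Tuple[str, float]:
    return ("42", 0.40)


print(answer_or_abstain("What is 17 * 23?", dummy_model))
# Prints the abstention message, because 0.40 is below the 0.75 threshold.
```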

References: Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature. https://doi.org/10.1038/s41586-024-07930-y
