It turns out that your friendly neighborhood AI assistant might be getting too confident for its own good. A new study reveals that as language models like OpenAI’s GPT and Meta’s LLaMA become more powerful, they’re also becoming…well, bigger fibbers. The research, published in Nature, shows that these beefed-up AIs are more likely to churn out inaccurate answers—even when they don’t have a clue. Why? Because they’re getting better at pretending they do.

The issue isn’t just limited to rare, brain-busting questions; even the simplest queries can trip them up. But because they can tackle tougher topics convincingly, we might be overlooking their obvious mistakes. The solution? Maybe these chatbots should learn to just say, “I don’t know.” But for companies keen to show off their high-tech toys, admitting ignorance isn’t exactly a selling point.

The pattern holds for large language models, which keep getting more capable with each new version. Fresh research finds that these smarter AI chatbots are actually becoming less reliable, because they tend to make up facts instead of dodging or refusing to answer questions they can't handle.

AI assistants. Image: Pexels

In the Search for Smarter AI Chatbots, We’re Left With Increasingly Unreliable Ones

The study, published in the journal Nature, looked at some of the field's leading LLMs: OpenAI’s GPT, Meta’s LLaMA, and BLOOM, an open-source model from the research group BigScience.

It found that their answers are often more accurate now, but that the models are less reliable overall, giving a greater share of wrong answers than older models did.

“They try to answer pretty much everything these days. This means more right, but also more wrong [answers],” study co-author José Hernández-Orallo, who works at the Valencian Research Institute for Artificial Intelligence in Spain, told Nature.

Mike Hicks, a philosopher of science and technology at the University of Glasgow, took a tougher stance.

“That looks to me like what we would call bullshitting,” Hicks, who didn’t take part in the study, told Nature. “It’s getting better at acting like it knows stuff.”

The researchers tested the models on subjects ranging from math to geography, and also asked them to perform tasks such as putting information in a specific order. The larger, more capable models gave the most correct answers overall but struggled with tougher questions, where their accuracy dropped.

The study found that OpenAI’s GPT-4 and o1 were some of the biggest bullshitters, answering every question thrown their way. The trend appears across all the LLMs examined: in the LLaMA family of models, none scored above 60 percent accuracy even on the simplest questions, according to the research.

In a nutshell, as AI models grew larger (in parameters, training data, and other elements), they gave a higher percentage of incorrect answers.

AI models are getting better at answering harder questions. The issue, besides their tendency to make things up, is that they still get the simple ones wrong. In theory, these mistakes should raise more red flags, but we might overlook their clear flaws because we’re amazed at how these large language models handle complex problems, according to the researchers.

The study also had some worrying findings about how people judge AI responses. When asked to determine whether the chatbots’ answers were right or wrong, a group of participants misjudged them 10 to 40 percent of the time.

The easiest way to fix these problems, the researchers say, is to program the LLMs to be less keen on answering everything.

“You can set a limit, and when the question is tough, [make the chatbot] say, ‘no, I don’t know,'” Hernández-Orallo told Nature.
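As a rough illustration of that idea (not code from the study or any real LLM API), the sketch below wraps a model call in a confidence check and falls back to "I don't know" when the estimate is too low. The `generate_answer` and `estimate_confidence` functions are hypothetical placeholders; in practice the confidence signal might come from token probabilities or a separate calibration model, and the threshold would need tuning.

```python
# Hypothetical sketch of a refusal threshold: answer only when the estimated
# confidence clears a bar, otherwise admit ignorance. The helper functions
# below are illustrative stand-ins, not part of any real LLM API.

REFUSAL_THRESHOLD = 0.7  # assumed cut-off; a real system would tune this


def generate_answer(question: str) -> str:
    """Placeholder for a call to a language model."""
    return "Paris"  # canned answer for the example below


def estimate_confidence(question: str, answer: str) -> float:
    """Placeholder for a confidence estimate (e.g. from token probabilities)."""
    return 0.9


def answer_or_refuse(question: str) -> str:
    """Return the model's answer, or 'I don't know' below the threshold."""
    answer = generate_answer(question)
    if estimate_confidence(question, answer) < REFUSAL_THRESHOLD:
        return "I don't know."
    return answer


if __name__ == "__main__":
    print(answer_or_refuse("What is the capital of France?"))
```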

But being truthful might not help AI companies trying to impress people with their cool technology. If these smarter AI chatbots were limited to answering only what they actually know, it could expose the boundaries of the tech.