Recent developments in artificial intelligence have revealed an unexpected trend: newer models are hallucinating more frequently. In large language models (LLMs) such as those powering ChatGPT, a hallucination occurs when the AI generates information that appears plausible but has no factual basis.
For instance, OpenAI’s PersonQA benchmark evaluates how accurately its models answer questions about individuals. On this test, the o3 model hallucinated in 33 percent of its responses, while the o4-mini model performed even worse, with a hallucination rate of 48 percent.
In contrast, the earlier o1 and o3-mini models recorded significantly lower hallucination rates of 16 percent and 14.8 percent, respectively. OpenAI has not yet identified why hallucinations have increased in its newer reasoning models.
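OpenAI has not published the internal grading pipeline behind these figures, but the metric itself is simple: the share of graded answers flagged as hallucinations. The sketch below is a hypothetical illustration only; the `GradedAnswer` schema and `hallucination_rate` helper are assumptions, not PersonQA's actual code.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    """One model answer graded against a reference fact (hypothetical schema)."""
    question: str
    model_answer: str
    is_hallucination: bool  # True if the answer contradicts the reference

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of graded answers flagged as hallucinations."""
    if not graded:
        return 0.0
    return sum(a.is_hallucination for a in graded) / len(graded)

# Example: 33 hallucinated answers out of 100 yields a 33% rate,
# matching the figure reported for o3 on PersonQA.
sample = [GradedAnswer("q", "a", i < 33) for i in range(100)]
print(f"{hallucination_rate(sample):.0%}")  # -> 33%
```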
While hallucinations may be acceptable in creative tasks, they pose a serious concern for AI applications that require high accuracy, such as research or information retrieval. The credibility of AI assistants like ChatGPT depends heavily on their ability to provide reliable information.
In response to these concerns, an OpenAI representative told TechCrunch that the company remains committed to improving the accuracy and reliability of its models. The ongoing effort to minimize hallucinations underscores the importance of trustworthy AI systems, especially as these technologies become woven into more aspects of daily life.