Published Date: 09/06/2025
Apple researchers have uncovered fundamental limitations in cutting-edge artificial intelligence (AI) models, according to a recent paper that raises doubts about the technology industry's race to develop ever more powerful systems. The study, published over the weekend, focuses on large reasoning models (LRMs), an advanced form of AI, and finds that these models face a 'complete accuracy collapse' when presented with highly complex problems.
The research indicates that standard AI models outperform LRMs in low-complexity tasks, while both types of models suffer a 'complete collapse' with high-complexity tasks. Large reasoning models attempt to solve complex queries by generating detailed thinking processes that break down the problem into smaller steps.
As LRMs neared performance collapse, they began 'reducing their reasoning effort,' which the Apple researchers found particularly concerning. The study tested the models' ability to solve puzzles, such as the Tower of Hanoi and River Crossing puzzles, and acknowledged that this focus on puzzles represented a limitation in its work.
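For context, the Tower of Hanoi is a classic recursive puzzle: the optimal solution for n disks takes 2^n - 1 moves, so adding disks is a simple way to dial problem complexity up while the rules stay fixed. The Python sketch below is purely illustrative and is not taken from the Apple paper; it shows the standard recursive solution and why the move count doubles with each extra disk.

def hanoi(n, source, target, spare, moves):
    # Append the optimal move sequence for n disks to `moves`.
    # Solving n disks means solving the (n - 1)-disk subproblem twice
    # plus one move of the largest disk, so the total is 2**n - 1 moves.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))  # 7 moves for 3 disks; 10 disks would already need 1,023

This exponential growth is what makes such puzzles useful for probing exactly where a model's reasoning breaks down.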
Gary Marcus, a US academic and a prominent voice of caution on the capabilities of AI models, described the Apple paper as 'pretty devastating.' Marcus added that the findings raise questions about the race to artificial general intelligence (AGI), a theoretical stage of AI at which a system can match a human in carrying out any intellectual task.
Referring to the large language models (LLMs) that underpin tools such as ChatGPT, Marcus wrote: 'Anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves.'
The paper also found that reasoning models wasted computing power on simpler problems: they found the right solution early in their 'thinking' but then continued exploring incorrect alternatives. As problems became slightly more complex, models first explored incorrect solutions and only arrived at the correct ones later. For higher-complexity problems, the models would enter 'collapse,' failing to generate any correct solutions. Even when provided with an algorithm that would solve the problem, the models failed.
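To make concrete what 'failing despite being given an algorithm' means, a model's proposed move sequence can be scored mechanically, independent of its chain of thought. The checker below is a hypothetical sketch, not the paper's evaluation harness; it simply verifies whether a sequence of Tower of Hanoi moves is legal and reaches the goal state.

def valid_hanoi_solution(n, moves):
    # Pegs are "A" (start), "B" (spare) and "C" (goal); each move is a
    # (from_peg, to_peg) pair. A move is illegal if it takes from an empty
    # peg or places a larger disk on a smaller one.
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk sizes, bottom to top
    for src, dst in moves:
        if not pegs[src]:
            return False                        # nothing to move
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                        # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))   # all disks on the goal peg, in order

# A model's answer for 3 disks can then be checked move by move:
print(valid_hanoi_solution(3, [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
                               ("B", "A"), ("B", "C"), ("A", "C")]))  # True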
The paper stated: 'Upon approaching a critical threshold – which closely corresponds to their accuracy collapse point – models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.' This indicates a 'fundamental scaling limitation in the thinking capabilities of current reasoning models.'
The study tested models including OpenAI’s o3, Google’s Gemini Thinking, Anthropic’s Claude 3.7 Sonnet-Thinking, and DeepSeek-R1. Anthropic, Google, and DeepSeek have been contacted for comment. OpenAI, the company behind ChatGPT, declined to comment.
Referring to 'generalisable reasoning' – or an AI model’s ability to apply a narrow conclusion more broadly – the paper said: 'These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalisable reasoning.'
Andrew Rogoyski, of the Institute for People-Centred AI at the University of Surrey, said the Apple paper signalled that the industry was 'still feeling its way' on AGI and might have reached a 'cul-de-sac' in its current approach. 'The finding that large reasoning models lose the plot on complex problems, while performing well on medium- and low-complexity problems, implies that we’re in a potential cul-de-sac in current approaches,' he said.
Q: What are large reasoning models (LRMs) in AI?
A: Large reasoning models (LRMs) are advanced AI systems designed to solve complex problems by breaking them down into smaller, more manageable steps. They aim to mimic human-like reasoning processes.
Q: What did the Apple study find about LRMs?
A: The Apple study found that LRMs face a 'complete accuracy collapse' when dealing with highly complex problems. They perform well on low-complexity tasks but fail to maintain accuracy as the complexity increases.
Q: What is the significance of the accuracy collapse in LRMs?
A: The accuracy collapse in LRMs suggests a fundamental limitation in their ability to handle complex tasks. This raises questions about the current approaches to developing artificial general intelligence (AGI).
Q: How does the Apple study impact the development of AGI?
A: The study challenges the assumption that current AI models, such as large language models (LLMs), are a direct route to achieving AGI. It suggests that the industry may need to explore new approaches to overcome these limitations.
Q: What are the implications of the study for the AI industry?
A: The study implies that the AI industry may be at a 'cul-de-sac' with its current approaches. It highlights the need for further research and development to overcome the fundamental limitations of current AI models.