Apple study reveals major flaws in billion-dollar AI models

AI models still struggle with basic logical tasks

A new research paper from Apple has exposed serious shortcomings in the reasoning abilities of some of today’s most advanced artificial intelligence (AI) systems. Despite being marketed as powerful tools capable of solving complex problems, the study shows that these models still struggle with basic logical tasks, raising questions about the real capabilities of large language and reasoning models.

AI models fail child-level logic tests

Apple researchers evaluated several prominent generative AI systems, including ChatGPT, Claude, and DeepSeek, using classic problem-solving tasks. One of the tests was the well-known Tower of Hanoi puzzle, which requires moving discs across pegs while following specific rules.
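
The rules themselves are short: one disc moves at a time, only the top disc of a peg can be moved, and a larger disc can never sit on a smaller one. As a rough illustration of what checking an answer involves, here is a minimal Python sketch that replays a list of moves and reports whether it legally solves the puzzle. The function name `is_valid_solution` and the peg numbering (0, 1, 2) are assumptions made for this example and are not taken from the Apple paper.

```python
def is_valid_solution(num_discs, moves):
    """Return True if `moves` legally transfers every disc from peg 0 to peg 2.

    `moves` is a list of (source_peg, target_peg) pairs. This is a hypothetical
    checker written for illustration, not the evaluation code used in the study.
    """
    # Each peg is a stack. Discs are numbered by size, largest at the bottom.
    pegs = [list(range(num_discs, 0, -1)), [], []]
    for source, target in moves:
        if source not in (0, 1, 2) or target not in (0, 1, 2):
            return False  # unknown peg
        if not pegs[source]:
            return False  # nothing to move from an empty peg
        disc = pegs[source][-1]
        if pegs[target] and pegs[target][-1] < disc:
            return False  # a larger disc may not sit on a smaller one
        pegs[target].append(pegs[source].pop())
    # Solved only if the full stack has been rebuilt on the last peg.
    return pegs[2] == list(range(num_discs, 0, -1))


# A correct two-disc answer, and one that puts the big disc on the small one.
print(is_valid_solution(2, [(0, 1), (0, 2), (1, 2)]))
print(is_valid_solution(2, [(0, 2), (0, 2)]))
```

A checker like this only confirms that every step is legal and that the final state is correct. It does not say anything about how the answer was produced.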


While the puzzle is simple enough for a bright child to solve, most AI models failed on puzzles with more than seven discs. Accuracy fell below 80% with seven discs, and performance dropped even further with eight. According to co-lead author Iman Mirzadeh, the issue was not only that the models failed to solve the puzzle: they could not follow a logical thought process even when given the solution algorithm.
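
For context, the standard textbook solution is a short recursive procedure, and the shortest solution for n discs always takes 2^n - 1 moves. The sketch below is a minimal Python version of that classic recursion. It is not necessarily the exact algorithm the researchers gave the models, and the function name `hanoi_moves` and its peg numbering are assumptions made for this example.

```python
def hanoi_moves(num_discs, source=0, spare=1, target=2):
    """Return the shortest list of (source_peg, target_peg) moves.

    Classic recursion: park the top n-1 discs on the spare peg, move the
    largest disc to the target, then bring the n-1 discs back on top of it.
    """
    if num_discs == 0:
        return []
    return (
        hanoi_moves(num_discs - 1, source, target, spare)  # clear the way
        + [(source, target)]                               # move largest disc
        + hanoi_moves(num_discs - 1, spare, source, target)  # restack the rest
    )


# The move count doubles (plus one) with every extra disc.
for n in (4, 7, 8):
    print(n, "discs:", len(hanoi_moves(n)), "moves")
```

Four discs need 15 moves, seven need 127 and eight need 255, so each added disc roughly doubles the number of steps that must all be carried out without a single slip.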

“They fail to reason in a step-by-step, structured way,” he said, noting that the models’ approach was neither logical nor intelligent.

The myth of scaling exposed

The results challenge one of the AI industry’s most commonly held beliefs: that simply scaling models — making them larger and feeding them more data — will lead to better performance. Apple’s research provides strong evidence that this is not always true.

Gary Marcus, a well-known AI researcher and commentator, called the findings a reality check. Venture capitalist Josh Wolfe even coined a new verb, “to GaryMarcus”, meaning to critically debunk exaggerated claims about AI. The Apple study, Wolfe argued, had done exactly that by revealing the real limits of model reasoning.

Marcus has long argued that AI systems, particularly those based on neural networks, can only generalise within the data they’ve seen before. Once asked to work beyond that training distribution, they often break down — a pattern clearly confirmed in Apple’s tests.

AI is not yet a substitute for human logic

To be clear, even humans make errors on the more complex versions of the Tower of Hanoi. However, AI systems were supposed to improve on this, not replicate human flaws. As Marcus points out, artificial general intelligence (AGI) should combine human creativity with machine-level precision. But instead of outperforming people in logic and reliability, today’s large models still make basic errors.

Apple’s results also support concerns raised by Arizona State University’s Subbarao Kambhampati, who has cautioned against assuming AI models reason like humans. In reality, they often skip steps or fail to understand the underlying principles of a problem, despite producing convincing-sounding answers.

Caution urged for businesses and society

The implications are significant for businesses looking to integrate AI into their operations. While models such as GPT-4, Claude, and others perform well in areas like writing, coding, and brainstorming, they remain unreliable for high-stakes decision-making. As Marcus points out, these systems can’t yet outperform classical algorithms in areas like database management, protein folding, or strategic games like chess.

This unreliability limits how much society can rely on generative AI. While the technology will continue to be useful in supporting human tasks, it is far from being a replacement for human judgement or traditional rule-based systems in critical contexts.

The illusion of intelligence

Perhaps most concerning is how easily these models can appear more capable than they are. If an AI performs well on an easy test, users may assume it can handle more complex problems too. But Apple’s study shows this confidence can be misplaced. The same model that solves a four-disc puzzle may completely fail when asked to solve one with eight.

This illusion of intelligence could lead to overtrust in AI systems — something experts warn must be avoided if the technology is to be used responsibly.

Rethinking the future of AI

Despite the findings, Marcus remains optimistic about AI’s future, just not in its current form. He believes that hybrid approaches, combining classical logic with modern computing power, could eventually produce more reliable systems. But he is sceptical that current LLM-based systems are the answer.

The Apple paper shows that hype around generative AI has outpaced its real-world abilities. Until AI can reason in a consistent, logical manner — not just produce convincing text — it will remain limited in scope.

As researchers and developers reflect on these findings, one thing is clear: the path to truly intelligent machines will require more than just scaling up. It will demand smarter, better-designed models that prioritise reliability over illusion.
