A new research paper from Apple has exposed serious shortcomings in the reasoning abilities of some of today's most advanced artificial intelligence (AI) systems. Although these models are marketed as powerful tools for solving complex problems, the study shows they still struggle with basic logical tasks, raising questions about the real capabilities of large language and reasoning models.
AI models fail child-level logic tests
Apple researchers evaluated several prominent generative AI systems, including ChatGPT, Claude, and DeepSeek, using classic problem-solving tasks. One of the tests was the well-known Tower of Hanoi puzzle, which requires moving discs across pegs while following specific rules.
While the puzzle is simple enough for a bright child to solve, most AI models faltered as the number of discs increased: accuracy fell below 80% with seven discs, and performance dropped even further with eight. According to co-lead author Iman Mirzadeh, the issue wasn't just that the models failed to solve the puzzle; it was that they couldn't follow a logical thought process even when given the solution algorithm.
“They fail to reason in a step-by-step, structured way,” he said, noting that the models’ approach was neither logical nor intelligent.
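For readers unfamiliar with the puzzle, the standard solution is a short recursive procedure. The article does not reproduce the algorithm the researchers actually supplied to the models, so the Python sketch below is simply the textbook version, included to show how mechanical the required reasoning is:

```python
def hanoi(n, source, target, spare, moves=None):
    """Classic recursive Tower of Hanoi solver.

    Moves n discs from `source` to `target` using `spare` as a buffer and
    returns the list of (from_peg, to_peg) moves. The optimal solution
    always takes 2**n - 1 moves, so it doubles with every extra disc.
    """
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller discs on the spare peg
    moves.append((source, target))              # move the largest disc to the target
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller discs back on top
    return moves

print(len(hanoi(7, "A", "C", "B")))  # 127 moves for the seven-disc case
```

Executing this faithfully requires no creativity, only the discipline to apply the same rule at every step — which is precisely the step-by-step structure the researchers found the models could not sustain.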
The myth of scaling exposed
The results challenge one of the AI industry’s most commonly held beliefs: that simply scaling models — making them larger and feeding them more data — will lead to better performance. Apple’s research provides strong evidence that this is not always true.
Gary Marcus, a well-known AI researcher and commentator, called the findings a reality check. Venture capitalist Josh Wolfe even coined a new verb, “to GaryMarcus”, meaning to critically debunk exaggerated claims about AI. The Apple study, Wolfe argued, had done exactly that by revealing the real limits of model reasoning.
Marcus has long argued that AI systems, particularly those based on neural networks, can only generalise within the data they’ve seen before. Once asked to work beyond that training distribution, they often break down — a pattern clearly confirmed in Apple’s tests.
AI is not yet a substitute for human logic
To be clear, even humans make errors on the more complex versions of the Tower of Hanoi. However, AI systems were supposed to improve on this, not replicate human flaws. As Marcus points out, artificial general intelligence (AGI) should combine human creativity with machine-level precision. But instead of outperforming people in logic and reliability, today’s large models still make basic errors.

Apple’s results also support concerns raised by Arizona State University’s Subbarao Kambhampati, who has cautioned against assuming AI models reason like humans. In reality, they often skip steps or fail to understand the underlying principles of a problem, despite producing convincing-sounding answers.
Caution urged for businesses and society
The implications are significant for businesses looking to integrate AI into their operations. While models such as GPT-4, Claude, and others perform well at writing, coding, and brainstorming, they remain unreliable for high-stakes decision-making. Marcus also notes that these systems still cannot outperform classical algorithms in areas such as database management, protein folding, or strategic games like chess.
This unpredictability limits how much society can rely on generative AI. While the technology will continue to be useful in supporting human tasks, it is far from being a replacement for human judgement or traditional rule-based systems in critical contexts.
The illusion of intelligence
Perhaps most concerning is how easily these models can appear more capable than they are. If an AI performs well on an easy test, users may assume it can handle more complex problems too. But Apple’s study shows this confidence can be misplaced. The same model that solves a four-disc puzzle may completely fail when asked to solve one with eight.
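Part of the reason the jump is so stark is the puzzle's exponential growth: the shortest valid solution for n discs takes 2^n - 1 moves, so each extra disc roughly doubles the chain of correct steps required. A quick back-of-the-envelope check, using the standard formula rather than any figure from Apple's paper:

```python
# Minimum number of moves for an n-disc Tower of Hanoi is 2**n - 1.
for n in (4, 7, 8):
    print(f"{n} discs -> {2**n - 1} moves")
# 4 discs -> 15 moves
# 7 discs -> 127 moves
# 8 discs -> 255 moves
```

A model that can string together 15 correct moves may still be nowhere near able to produce 255 of them without a single slip.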
This illusion of intelligence could lead to overtrust in AI systems — something experts warn must be avoided if the technology is to be used responsibly.
Rethinking the future of AI
Despite the findings, Marcus remains optimistic about AI’s future, just not in its current form. He believes that hybrid approaches, combining classical logic with modern computing power, could eventually produce more reliable systems. But he is sceptical that current LLM-based systems are the answer.
The Apple paper shows that hype around generative AI has outpaced its real-world abilities. Until AI can reason in a consistent, logical manner — not just produce convincing text — it will remain limited in scope.
As researchers and developers reflect on these findings, one thing is clear: the path to truly intelligent machines will require more than just scaling up. It will demand smarter, better-designed models that prioritise reliability over illusion.