
Apple study reveals major flaws in billion-dollar AI models

AI models still struggle with basic logical tasks


A new research paper from Apple has exposed serious shortcomings in the reasoning abilities of some of today’s most advanced artificial intelligence (AI) systems. Despite being marketed as powerful tools capable of solving complex problems, the study shows that these models still struggle with basic logical tasks, raising questions about the real capabilities of large language and reasoning models.

AI models fail child-level logic tests

Apple researchers evaluated several prominent generative AI systems, including ChatGPT, Claude, and DeepSeek, using classic problem-solving tasks. One of the tests was the well-known Tower of Hanoi puzzle, which requires moving discs across pegs while following specific rules.


While the puzzle is simple enough for a bright child to solve, most AI models failed when asked to handle scenarios involving more than seven discs. Accuracy fell below 80% with seven discs, and performance dropped even further with eight. According to co-lead author Iman Mirzadeh, the issue wasn't just solving the puzzle — it was that the models couldn’t follow a logical thought process even when given the solution algorithm.

“They fail to reason in a step-by-step, structured way,” he said, noting that the models’ approach was neither logical nor intelligent.
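The Tower of Hanoi used in the study has a well-known recursive solution: move the top n−1 discs to the spare peg, move the largest disc to the target, then restack the n−1 discs on top. The optimal sequence for n discs takes 2ⁿ − 1 moves, so difficulty grows sharply with disc count. A minimal sketch in Python (peg names are illustrative, not from the study):

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the optimal move sequence for n discs as (from_peg, to_peg) pairs."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 discs out of the way
        moves.append((source, target))              # move the largest disc
        hanoi(n - 1, spare, target, source, moves)  # restack the n-1 discs on top of it
    return moves

print(len(hanoi(7, "A", "C", "B")))  # 2**7 - 1 = 127 moves
print(len(hanoi(8, "A", "C", "B")))  # 2**8 - 1 = 255 moves
```

The jump from 127 moves at seven discs to 255 at eight illustrates why small increases in problem size exposed the models: each added disc doubles the length of the step-by-step plan that must be followed without error.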

The myth of scaling exposed

The results challenge one of the AI industry’s most commonly held beliefs: that simply scaling models — making them larger and feeding them more data — will lead to better performance. Apple’s research provides strong evidence that this is not always true.

Gary Marcus, a well-known AI researcher and commentator, called the findings a reality check. Venture capitalist Josh Wolfe even coined a new verb, “to GaryMarcus”, meaning to critically debunk exaggerated claims about AI. The Apple study, Wolfe argued, had done exactly that by revealing the real limits of model reasoning.

Marcus has long argued that AI systems, particularly those based on neural networks, can only generalise within the data they’ve seen before. Once asked to work beyond that training distribution, they often break down — a pattern clearly confirmed in Apple’s tests.

AI is not yet a substitute for human logic

To be clear, even humans make errors on the more complex versions of the Tower of Hanoi. However, AI systems were supposed to improve on this, not replicate human flaws. As Marcus points out, artificial general intelligence (AGI) should combine human creativity with machine-level precision. But instead of outperforming people in logic and reliability, today’s large models still make basic errors.


Apple’s results also support concerns raised by Arizona State University’s Subbarao Kambhampati, who has cautioned against assuming AI models reason like humans. In reality, they often skip steps or fail to understand the underlying principles of a problem, despite producing convincing-sounding answers.

Caution urged for businesses and society

The implications are significant for businesses looking to integrate AI into their operations. While models such as GPT-4, Claude, and others perform well in areas like writing, coding, and brainstorming, they remain unreliable for high-stakes decision-making. As Marcus points out, these systems can’t yet outperform classical algorithms in areas like database management, protein folding, or strategic games like chess.

This unpredictability limits how much society can rely on generative AI. While the technology will continue to be useful in supporting human tasks, it is far from being a replacement for human judgement or traditional rule-based systems in critical contexts.

The illusion of intelligence

Perhaps most concerning is how easily these models can appear more capable than they are. If an AI performs well on an easy test, users may assume it can handle more complex problems too. But Apple’s study shows this confidence can be misplaced. The same model that solves a four-disc puzzle may completely fail when asked to solve one with eight.

This illusion of intelligence could lead to overtrust in AI systems — something experts warn must be avoided if the technology is to be used responsibly.

Rethinking the future of AI

Despite the findings, Marcus remains optimistic about AI’s future, just not in its current form. He believes that hybrid approaches, combining classical logic with modern computing power, could eventually produce more reliable systems. But he is sceptical that current LLM-based systems are the answer.

The Apple paper shows that hype around generative AI has outpaced its real-world abilities. Until AI can reason in a consistent, logical manner — not just produce convincing text — it will remain limited in scope.

As researchers and developers reflect on these findings, one thing is clear: the path to truly intelligent machines will require more than just scaling up. It will demand smarter, better-designed models that prioritise reliability over illusion.


Pub hotel group beat luxury chains in UK guest satisfaction survey

Highlights

  • Coaching Inn Group scores 81 per cent customer satisfaction, beating Marriott and Hilton.
  • Wetherspoon Hotels named best value at £70 per night.
  • Britannia Hotels ranks bottom for 12th consecutive year with 44 per cent score.
A traditional pub hotel group has outperformed luxury international chains in the UK's largest guest satisfaction survey, while one major operator continues its decade-long streak at the bottom of the rankings.
The Coaching Inn Group, comprising 36 relaxed inn-style hotels in historic buildings across beauty spots and market towns, achieved the highest customer score of 81 per cent among large chains in Which?'s annual hotel survey. The group earned five stars for customer service and accuracy of descriptions, with guests praising its "lovely locations and excellent food and service".

The survey, conducted amongst 4,631 guests, asked respondents to rate their stays across eight categories including cleanliness, customer service, breakfast quality, bed comfort and value for money. At an average £128 per night, Coaching Inn demonstrated that mid-range pricing with consistent quality appeals to British travellers.

J D Wetherspoon Hotels claimed both Which? Recommended Provider (WRP) status and the Great Value badge for the first time, offering rooms at just £70 per night while maintaining four-star ratings across most categories. Guests described their stays as "clean, comfortable and good value".

Among boutique chains, Hotel Indigo scored 79 per cent with its neighbourhood-inspired design, while InterContinental achieved 80 per cent but missed WRP status because it charges over £300 per night.

Budget brands decline

However, Premier Inn, long considered Britain's reliable budget choice, lost its recommended status this year. Despite maintaining comfortable beds, guests reported "standards were slipping" and prices "no longer budget levels" at an average £94 per night.

The survey's biggest disappointment remains Britannia Hotels, scoring just 44 per cent and one star for bedroom and bathroom quality. This marks twelve consecutive years at the bottom, with guests at properties like Folkestone's Grand Burstin calling it "a total dive".
