Apple Study Highlights AI Models’ Flaws in Mathematical Reasoning

Recent research by Apple has exposed significant weaknesses in the mathematical reasoning abilities of large language models (LLMs). Despite their success across a wide range of tasks, LLMs lack what Apple researchers describe as ‘true logical reasoning’ capabilities.

Challenges in Solving Simple Mathematical Problems

The study found that AI chatbots struggle with simple math problems that most people can easily solve. Notably, their accuracy varies depending on how questions are phrased. ‘Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question,’ the research paper states.

As the complexity of questions increases, the performance of these models deteriorates. The research highlights that adding more clauses to a mathematical problem worsens the models’ accuracy, suggesting that current LLMs rely heavily on patterns from training data rather than genuine logical reasoning.

Introducing the GSM-Symbolic Benchmark

In response to these findings, Apple researchers have developed a new benchmark called GSM-Symbolic. This benchmark evaluates mathematical reasoning by generating diverse questions from symbolic templates. ‘We add seemingly relevant but ultimately inconsequential statements to GSM-Symbolic templates,’ researchers explained. These additions, while irrelevant, caused a significant drop in model performance.
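To make the template idea concrete, here is a minimal sketch of how a symbolic template might be instantiated into many surface variants of the same problem. The template text, value ranges, and function names are illustrative assumptions for this article, not the paper’s actual templates:

```python
import random

# Illustrative sketch of a GSM-Symbolic-style symbolic template.
# The template wording and value ranges are assumptions for
# demonstration, not the benchmark's actual templates.
TEMPLATE = (
    "{name} picks {x} kiwis on Friday. Then he picks {y} kiwis on "
    "Saturday. On Sunday, he picks double the number of kiwis he did "
    "on Friday. How many kiwis does {name} have?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Fill the template with sampled values; return (question, answer)."""
    rng = random.Random(seed)
    name = rng.choice(["Oliver", "Mia", "Sam"])
    x = rng.randint(10, 60)  # Friday's count
    y = rng.randint(10, 60)  # Saturday's count
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y + 2 * x   # ground truth follows from the template
    return question, answer

# Each instantiation is a different surface form of the same problem,
# which is what lets the benchmark measure variance across phrasings.
for s in range(3):
    q, a = instantiate(s)
    print(q, "->", a)
```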

In one example, the prompt reads: ‘Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?’ Several LLMs, including o1-mini and Llama3-8B, incorrectly subtracted the five smaller kiwis, arriving at 185 instead of the correct 190.
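For reference, the arithmetic behind the two answers works out as follows (a quick check; the variable names are chosen here for illustration):

```python
# Worked arithmetic for the kiwi prompt above.
friday = 44
saturday = 58
sunday = 2 * friday                    # "double the number he did on Friday"

correct = friday + saturday + sunday   # 44 + 58 + 88 = 190
# The size remark is a distractor: smaller kiwis still count.
faulty = correct - 5                   # subtracting them yields 185

print(correct, faulty)  # 190 185
```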

Catastrophic Performance Loss

Across all state-of-the-art models, performance dropped by as much as 65% when a single seemingly relevant but ultimately irrelevant statement was added to the prompt. This ‘catastrophic performance loss’ suggests that LLMs convert statements into operations without truly understanding their meaning.

Larger Models Perform Better

Interestingly, larger models such as Claude and Gemini solved the kiwi problem correctly, suggesting that model scale mitigates some of these issues. Even so, OpenAI’s advanced o1-preview model saw its accuracy drop by 17.5%.

Conclusion

Apple’s research suggests that the complexity of mathematical questions remains a significant challenge for LLMs. The study calls for further research toward AI models capable of formal reasoning and robust, generalizable problem-solving. ‘This remains a critical challenge for the field as we strive to create systems with human-like cognitive abilities or general intelligence,’ the researchers concluded.