October 2024

Can you reason with LLMs?

Research from Apple reveals that large language models struggle with genuine mathematical reasoning and perform inconsistently on complex math problems

In a paper from Apple titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models", the authors examine the ability of large language models (LLMs) such as GPT to solve math problems, and in particular whether they do so through reasoning rather than pattern recognition. Here’s our primer on the paper:

1. Problem with Current Benchmarks

  • LLMs are often tested on a popular math dataset called GSM8K, which includes grade-school-level questions. However, simply scoring well on GSM8K doesn’t necessarily mean these models understand math or can reason logically. Many LLMs may perform well simply by memorising question patterns and answers rather than actually solving the problems.
  • The authors developed an improved benchmark, called GSM-Symbolic, to better evaluate whether these models genuinely reason through problems or just rely on patterns. GSM-Symbolic turns each question into a template and varies details such as the names and numbers to challenge models more rigorously; a rough sketch of the idea follows below.
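
To make the idea concrete, here is a minimal sketch in Python of how a templated benchmark of this kind can generate many variants of one question. The template, names and numbers below are our own toy example, not taken from the paper or its dataset.

```python
import random

# A hypothetical GSM8K-style question rewritten as a symbolic template.
# The placeholders {name}, {x} and {y} stand for the details that a
# GSM-Symbolic-style benchmark varies between test instances.
TEMPLATE = (
    "{name} picked {x} apples in the morning and {y} apples in the afternoon. "
    "How many apples did {name} pick in total?"
)

def make_variant(seed: int) -> tuple[str, int]:
    """Generate one question variant together with its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    x, y = rng.randint(2, 40), rng.randint(2, 40)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

if __name__ == "__main__":
    # A model that truly reasons should solve every variant of the "same"
    # problem, not only the particular wording it may have memorised.
    for seed in range(3):
        question, answer = make_variant(seed)
        print(question, "->", answer)
```

Scoring a model across many such variants, rather than on one fixed wording, is what exposes the variability the paper reports.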

2. Testing Mathematical Reasoning Skills

  • When tested on GSM-Symbolic, many models performed inconsistently, with accuracy dropping even when only the numbers in a question were changed. Being thrown off by nothing more than different values points to a lack of flexible problem-solving ability.
  • The study also found that performance declined further as questions grew more complex, for instance when extra clauses or reasoning steps were added (a toy illustration follows below). The authors argue this happens because the models do not perform genuine logical reasoning but instead mimic it by recognising patterns from their training data.
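
As a rough illustration of what "more clauses or steps" means in practice (again our own toy example, not one from the paper), each added clause below introduces a quantity the model must actually use, so the required chain of arithmetic grows by one step:

```python
# Toy illustration of rising difficulty: each extra clause adds one
# arithmetic step that a correct solution has to perform.
BASE = "Sophie picked 12 apples in the morning and 7 in the afternoon."
STEP_2 = " She then gave 4 apples to her neighbour."
STEP_3 = " Later, her brother gave her 3 more apples."
ASK = " How many apples does Sophie have now?"

variants = {
    "one step":    (BASE + ASK,                   12 + 7),           # 19
    "two steps":   (BASE + STEP_2 + ASK,          12 + 7 - 4),       # 15
    "three steps": (BASE + STEP_2 + STEP_3 + ASK, 12 + 7 - 4 + 3),   # 18
}

for label, (question, answer) in variants.items():
    print(f"[{label}] {question} -> {answer}")
```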

3. New Challenges Introduced: GSM-NoOp

  • To further explore model limitations, the authors created another test called GSM-NoOp. This dataset added extra, irrelevant details to math problems (called "no-op" information) to see if the LLMs could ignore this unnecessary information.
  • Most models struggled with GSM-NoOp, often factoring these irrelevant details into their calculations, which again points to pattern matching rather than true reasoning (a toy example follows below).
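
Here is the same toy problem with a "no-op" sentence added (our own illustration of the idea, not an item from GSM-NoOp itself): the extra clause mentions a number, but that number has no bearing on the answer.

```python
# A toy GSM-NoOp-style pair: the distractor sentence changes nothing about
# the correct answer, which is 12 + 7 = 19 in both cases.
plain = (
    "Sophie picked 12 apples in the morning and 7 apples in the afternoon. "
    "How many apples did Sophie pick in total?"
)
with_noop = (
    "Sophie picked 12 apples in the morning and 7 apples in the afternoon. "
    "Five of the apples were slightly smaller than the rest. "
    "How many apples did Sophie pick in total?"
)

# The paper reports that many models change their answer on inputs like the
# second one, e.g. by subtracting the irrelevant 5, instead of ignoring it.
print(plain)
print(with_noop)
```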

4. Key Findings

  • High Sensitivity to Changes: The LLMs performed poorly when minor changes were introduced to the questions, suggesting that they may not genuinely understand the math problems.
  • Struggle with Complexity: The more complex a problem became, the worse the models performed, indicating that LLMs are not yet ready to tackle truly challenging logical problems.
  • Pattern Matching Over True Reasoning: The study suggests that current LLMs tend to match patterns rather than engage in actual reasoning. This means they might answer correctly in familiar situations but fail in new or slightly altered scenarios.

5. Conclusion

  • The research highlights that while LLMs have made progress in handling math problems, they still rely heavily on recognising familiar patterns. The study emphasises the need for better evaluation methods and further improvements in model design so that future LLMs can achieve genuine reasoning capabilities.

In essence, while LLMs show some promise, they still have a long way to go in terms of true logical reasoning, particularly for complex or unfamiliar math problems. This research helps pave the way for developing models that can genuinely understand and solve problems rather than just mimicking patterns.
