
AI Models Are Falling Short in Solving New Math Problems

Mathematicians are putting the flashiest AI models to the test, and the results are not flattering. When researchers recently asked frontier systems to tackle fresh, human‑written problems, the machines stumbled on almost every one, even as companies market these tools as near‑general problem solvers. The gap between headline math scores and real reasoning is now wide enough that even AI builders are starting to worry.

The story is not that AI cannot do math at all. Rather, current systems shine on familiar contest questions and benchmarks, then fail hard when the numbers, wording, or structure shift away from what they have seen. That jagged performance is forcing a rethink of how we measure progress, how we use these tools in science, and how far they really are from human‑level mathematical insight.

Fresh problems, familiar failures

When the mathematicians set out to probe AI reasoning, they did not recycle textbook exercises. Each expert designed a new problem, solved it by hand, and only then fed it to leading models. Across hundreds of these custom challenges, current models missed the solution almost every time. These were not obscure research questions, but carefully crafted tests of multi‑step reasoning that any strong graduate student could, in principle, crack.

Another project used a similar setup to see how systems behave when the rules are clear but the territory is new. In that study, each mathematician contributed a unique problem and solution, and the organizers then compared human and machine performance. The pattern matched the first experiment: AI handled questions that looked like training data, but once the structure shifted beyond standard templates and frameworks, the systems struggled to even set up the right strategy.

Benchmarks say “genius,” reality says “fragile”

On paper, the numbers look dazzling. Public leaderboards show GPT systems acing competition‑style tests, with reports that math and scientific reasoning scores are climbing fast. On AIME 2025, a contest known for tricky algebra and number theory, one GPT‑5 variant is said to hit 93.3% without tools, beating earlier o3 results of 85%. On the surface, that sounds like a model that should breeze through any high school or early college problem set.

Look closer, and the picture changes. The FrontierMath benchmark, built with funding from OpenAI, tracks the share of hard, original problems solved correctly over time. Even as models improve, the success rate on the toughest, privately held questions remains low, which suggests that much of the headline progress comes from better pattern matching on public data. Critics argue that this is exactly how a benchmark can drift away from reality, since any evaluation that becomes a training target risks turning into a memorization contest.

How models “cheat” at math

There is growing evidence that large models are not reasoning in the way users assume, but leaning on surface patterns instead. One group of researchers found that when they perturbed familiar questions, performance dropped by 48 to 58% simply because the numbers changed, even though the underlying structure stayed the same. In some cases, accuracy fell to just 2.5% when coefficients or constants were swapped, a sign that the system had memorized specific training items instead of learning a general method.
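To make that kind of perturbation test concrete, here is a minimal sketch in Python. It assumes a hypothetical query_model callable that sends a prompt to whatever system is being evaluated and returns a numeric answer; the template and number ranges are illustrative, not the researchers' actual dataset.

    import random

    # Swap the constants in a templated question and re-ask the model,
    # to check whether accuracy survives a change of numbers.
    TEMPLATE = "A train travels {d} km in {t} hours. What is its average speed in km/h?"

    def make_variant(rng):
        d = rng.randint(60, 300)
        t = rng.randint(2, 6)
        return TEMPLATE.format(d=d, t=t), d / t  # prompt and ground-truth answer

    def perturbed_accuracy(query_model, n_trials=50, seed=0):
        # query_model: assumed callable, prompt string in, numeric answer out.
        rng = random.Random(seed)
        correct = 0
        for _ in range(n_trials):
            prompt, truth = make_variant(rng)
            if abs(query_model(prompt) - truth) < 1e-6:
                correct += 1
        return correct / n_trials

Comparing this score with accuracy on the original, unmodified question is what exposes memorization: a model that has actually learned the method should barely notice the new numbers.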

That fragility shows up in more formal settings too. The Hard2Verify project describes a step‑level verification benchmark in which responses from frontier models such as GPT‑5 (high), Gemini 2.5 Pro, and Claude Sonnet 4 (thinking) are broken down and checked by PhD‑level mathematicians. The responses often look fluent but contain subtle logical slips that only show up when each step is scrutinized. This kind of fine‑grained evaluation reveals how easily models can bluff their way through long derivations that would never pass in a real research seminar.

Human contests and jagged progress

When AI steps into human competitions, the contrast can be stark. A panel of human experts evaluated top humans against frontier models on the 2025 International Mathematical Olympiad problems, using strict grading similar to what contestants face. Even with tool use and multiple attempts, the systems lagged far behind medal‑level students, especially on proofs that demanded original insight rather than routine technique. The exercise showed that current AI can mimic parts of a solution, but often fails to build a coherent argument from start to finish.

Yet specialised systems are making real headway. DeepMind reports that, since achieving IMO gold‑medal standard in July 2025, Gemini Deep Think has progressed to scoring up to 90% on the IMO Proof benchmark. Those numbers suggest that, with heavy engineering, a model can be tuned to excel at specific contest styles. Even here, however, performance is jagged: brilliant on some structured proof tasks, brittle when a question falls outside the training groove or demands an unconventional line of attack.

Broken benchmarks and new training ideas

Some of the mismatch between hype and reality comes from how progress is measured. One analyst argues that AI benchmarking is broken because any high‑profile benchmark quickly becomes a target for model training, turning the public leaderboard into a memory test. Once that happens, evaluation stops telling us how systems will behave on new work and instead reflects how well they have been tuned to a narrow suite of tasks. The same risk applies to math, where repeated exposure to famous contest problems can inflate scores without improving genuine reasoning.

Even technical metrics can give a false sense of stability. A recent table of AI model performance lists a METR time‑horizons entry in which GPT‑5 (medium) is credited with 137.3 minutes on a long‑horizon task. That kind of figure sounds precise, yet it tells users little about how the system will behave when a problem shifts in form, as the mathematicians found. The deeper question is whether training on static datasets has hit a ceiling for reasoning tasks, and some practitioners now argue that the plateau is the limit of static data itself.
