Monday, February 16, 2026

Math May Be AI's Final Frontier

When IBM’s Deep Blue beat Garry Kasparov at chess some 30 years ago, it was a wake-up call about AI, but many assumed the machine had simply brute-forced the victory. When Google DeepMind’s AlphaGo beat Lee Sedol at the even more difficult game of Go, people were more amazed. “They’re how I imagine games from far in the future,” said Shi Yue, a top Go player from China. When Ai-Da and DALL-E started creating what seemed like original art, the lines between human and AI really started to blur. Then ChatGPT and other models started writing poems and essays, passing exams, and carrying on Turing-test-worthy conversations, and it sure seemed as if AI had met or surpassed its human creators.

Can AI do math? Credit: Microsoft Designer

Oh, yeah? But can it do math?  Not just very, very complicated arithmetic, but novel, innovative math? Well, some mathematicians want to put AI to the test.   

On Euler Day (that’s February 7th, for those of you not keeping track), eleven leading mathematicians issued First Proof – “A set of ten math questions to evaluate the capabilities of AI systems to autonomously solve problems that arise naturally in the research process.” The problems were designed so that no LLM could simply search the internet for existing proofs and pass them off as their own. They gave AI models a week to submit solutions, and unveiled the results on Valentine’s Day (who says mathematicians aren’t romantic?).

“The goal here is to understand the limits — how far can A.I. go beyond its training data and the existing solutions it finds online?” said Dr. Tamara Kolda, one of the authors, in an interview with Siobhan Roberts of The New York Times.

So far, it appears that AI might want to stick to writing poems.

The challenge produced a surprising number of responses. “We did not expect there would be this much activity,” Mohammed Abouzaid, a math professor at Stanford University and a member of the First Proof team, told Joseph Howlett of Scientific American. “We did not expect that the AI companies would take it this seriously and put this much labor into it.”

Dr. Abouzaid. Credit: Stanford University

To be fair, OpenAI claimed one of its unreleased models solved six of the ten, although it later had to backtrack on one of them. Other publicly released models only solved one or two. Daniel Litt, a mathematician at the University of Toronto, who was not part of the First Proof team, told Mr. Howlett: “I expected maybe two to three unambiguously correct solutions from publicly available models. Ten would have been very surprising to me.”

Martin Hairer, a professor at EPFL and Imperial College London and one of the eleven, described to Ms. Roberts his impression of how the models performed:

Sometimes it would be like reading a paper by a bad undergraduate student, where they sort of know where they’re starting from, they know where they want to go, but they don’t really know how to get there. So they wander around here and there, and then at some point they just stick in “and therefore” and pray.

“The models seem to have struggled,” Kevin Barreto, an undergraduate student at the University of Cambridge, who was not part of the First Proof team and who had recently used AI to solve one of the Erdős problems, told Mr. Howlett. “To be honest, yeah, I’m somewhat disappointed.”

Professor Abouzaid was somewhat more generous, saying: “The correct solutions that I’ve seen out of AI systems, they have the flavor of 19th-century mathematics. But we’re trying to build the mathematics of the 21st century.”

One of the challenges involved in evaluating the responses is determining how much human assistance the models had in producing them. “Once there’s humans involved, how do we judge how much is human and how much is AI?” Lauren Williams, a Harvard professor and one of the First Proof team, admitted to Mr. Howlett.

And, let’s be clear, the problems posed were not among the most advanced that could have been chosen. The authors wrote in their paper: “Our ‘first proof’ experiment is focused on the final and most well-specified stage of math research, in which the question and frameworks are already understood.” Dr. Williams explained the rationale to Ms. Roberts: “We can query the A.I. model with small, well-defined questions, and then assess whether its answers are correct. If we were to ask an A.I. model to come up with the big question, or a framework, it would be much harder to evaluate its performance.”

The First Proof team is planning to release round two on March 14, 2026 (Pi Day, again for those of you not paying attention). Further rounds are expected to follow.

Some mathematicians are taking other approaches. Caltech Professor Sergei Gukov and colleagues want to think of math proofs as a type of game. In a new paper, they described developing a new type of machine-learning algorithm that can solve math problems requiring extremely long sequences of steps, and they used it to make progress on a longstanding math problem called the Andrews-Curtis conjecture.

"Our program aims to find long sequences of steps that are rare and hard to find," says study first author Ali Shehper, a postdoctoral scholar at Rutgers University who will soon join Caltech as a research scientist. "It's like trying to find your way through a maze the size of Earth. These are very long paths that you have to test out, and there's only one path that works."  Or, as Professor Sergei describes it: “We know the hypothesis, we know the goal, but connecting them is what’s missing.”

"If you ask ChatGPT to write a letter, it will come up with something typical. It's unlikely to come up with anything unique and highly original. It's a good parrot," Professor Gukov says. "Our program is good at coming up with outliers."  Because of that, he believes: "We made a lot of improvements in an area of math that was decades old. Progress had been relatively slow, but now it's hustling and bustling."

Whether or not their approach would have met the First Proof requirements, it reminds me of the creativity AlphaGo displayed. Math may never be the same. “I already have heard from colleagues that they are in shock,” Scott Armstrong, a mathematician at Sorbonne University in France, told Mr. Howlett. “These tools are coming to change mathematics, and it’s happening now.”
