When IBM’s Deep Blue beat Garry Kasparov in chess nearly 30 years ago, it was a wake-up call about AI, but many assumed the machine had simply brute-forced the victory. When Google DeepMind’s AlphaGo beat Lee Sedol at the far more complex game of Go, people were more amazed. “They’re how I imagine games from far in the future,” said Shi Yue, a top Go player from China. When Ai-Da and DALL-E started creating what seemed like original art, the lines between human and AI really started to blur. Then ChatGPT and other AI models started writing poems and essays, passing exams, and carrying on Turing-test-style conversations, and it sure seemed like AI had matched or surpassed its human creators.
*Can AI do math? Credit: Microsoft Designer*
Oh, yeah? But can it do math? Not just very, very complicated arithmetic, but novel, innovative math? Well, some mathematicians want to put AI to the test.
On Euler Day (that’s February 7th, a nod to e ≈ 2.7, for those of you not keeping track), eleven leading mathematicians issued First Proof – “A set of ten math questions to evaluate the capabilities of AI systems to autonomously solve problems that arise naturally in the research process.” The problems were designed so that no LLM could simply search the internet for existing proofs and pass them off as its own. The team gave AI models a week to submit solutions, and unveiled the results on Valentine’s Day (who says mathematicians aren’t romantic?).
“The goal here is to understand the limits — how far can A.I. go beyond its training data and the existing solutions it finds online?” said Dr. Tamara Kolda, one of the authors, in an interview with Siobhan Roberts of The New York Times.
So far, it appears that AI might want to stick to writing poems.
The challenge produced a surprising number of responses. “We did not expect there would be this much activity,” Mohammed Abouzaid, a math professor at Stanford University and a member of the First Proof team, told Joseph Howlett of Scientific American. “We did not expect that the AI companies would take it this seriously and put this much labor into it.”
*Dr. Abouzaid. Credit: Stanford University*
Martin Hairer, a professor at EPFL and Imperial College London and one of the eleven, described to Ms. Roberts his impression of how the models performed:
Sometimes it would be like reading a paper by a bad undergraduate student, where they sort of know where they’re starting from, they know where they want to go, but they don’t really know how to get there. So they wander around here and there, and then at some point they just stick in “and therefore” and pray.
“The models seem to have struggled,” Kevin Barreto, an undergraduate student at the University of Cambridge, who was not part of the First Proof team and who had recently used AI to solve one of the Erdős problems, told Mr. Howlett. “To be honest, yeah, I’m somewhat disappointed.”
Professor Abouzaid was somewhat more generous, saying: “The correct solutions that I’ve seen out of AI systems, they have the flavor of 19th-century mathematics. But we’re trying to build the mathematics of the 21st century.”
One of the challenges in evaluating the responses is determining how much human assistance the models received. “Once there’s humans involved, how do we judge how much is human and how much is AI?” Lauren Williams, a Harvard professor and a member of the First Proof team, admitted to Mr. Howlett.
And, let’s be clear, the problems were not among the most advanced that could have been posed. The authors wrote in their paper: “Our ‘first proof’ experiment is focused on the final and most well-specified stage of math research, in which the question and frameworks are already understood.” Dr. Williams explained the rationale to Ms. Roberts: “We can query the A.I. model with small, well-defined questions, and then assess whether its answers are correct. If we were to ask an A.I. model to come up with the big question, or a framework, it would be much harder to evaluate its performance.”
The First Proof team is planning to release round two on March 14, 2026 (Pi Day, again for those of you not paying attention). Further rounds are expected to follow.
Some mathematicians are taking other approaches. Caltech professor Sergei Gukov and colleagues think of math proofs as a type of game. In a new paper, they describe a machine-learning algorithm that can solve math problems requiring extremely long sequences of steps, and they used it to make progress on a longstanding math problem called the Andrews–Curtis conjecture.
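To make the “proofs as a game” idea concrete, here’s a toy sketch of my own (not Gukov’s actual algorithm, which relies on reinforcement learning): treat each intermediate expression as a game state, each allowed rewrite as a move, and a proof as a winning sequence of moves. The rewrite rules and the brute-force search below are invented purely for illustration; a learned policy would replace the exhaustive search with educated guesses about which moves look promising.

```python
# Toy "proof search as a game" sketch (illustrative only, not Gukov et al.'s method).
# States are strings, moves are rewrite rules, and "winning" means reducing the
# start state to the goal state.
from collections import deque

# Made-up rewrite rules for a tiny string-rewriting puzzle.
RULES = [
    ("aa", ""),    # a pair of a's cancels
    ("bb", ""),    # a pair of b's cancels
    ("ab", "ba"),  # an a may move past a b
]

def moves(state: str):
    """Yield every state reachable by applying one rule at one position."""
    for lhs, rhs in RULES:
        i = state.find(lhs)
        while i != -1:
            yield state[:i] + rhs + state[i + len(lhs):]
            i = state.find(lhs, i + 1)

def search(start: str, goal: str = "", max_depth: int = 25):
    """Breadth-first search for a sequence of moves from start to goal.
    A learned policy would replace this exhaustive search with a guided one."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path                      # the "proof": a list of moves
        if len(path) < max_depth:
            for nxt in moves(state):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [nxt]))
    return None                              # no proof found within the depth limit

if __name__ == "__main__":
    print(search("abab"))  # e.g. ['baab', 'bb', ''] – a three-move "win"
```

The point of the framing is that long proofs become long game plays, which is exactly the setting where learned search has a chance to find the unusual “outlier” moves Gukov describes next.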
"If
you ask ChatGPT to write a letter, it will come up with something typical. It's
unlikely to come up with anything unique and highly original. It's a good
parrot," Professor Gukov says. "Our program is good at coming up with
outliers." Because of that, he
believes: "We made a lot of improvements in an area of math that was
decades old. Progress had been relatively slow, but now it's hustling and
bustling."
Whether or not their approach would have met the First Proof requirements, it reminds me of the creativity AlphaGo displayed. Math may never be the same. “I already have heard from colleagues that they are in shock,” Scott Armstrong, a mathematician at Sorbonne University in France, told Mr. Howlett. “These tools are coming to change mathematics, and it’s happening now.”