Published Date : 22/08/2025
A few months before the 2025 International Mathematical Olympiad (IMO) in July, a three-person team at OpenAI made a bold bet: they could use the competition’s notoriously tough problems to train an artificial intelligence (AI) model to think independently for hours and write math proofs. Their goal wasn’t just an AI that could perform complex math, but one that could weigh ambiguity and nuance, skills essential for handling many real-world tasks. Those same skills are considered crucial for artificial general intelligence (AGI), AI with human-level understanding and reasoning.
The IMO, held this year on Australia’s Sunshine Coast, is the world’s premier math competition for high school students, bringing together top contenders from over 100 countries. Participants are given six problems—three per day, each worth seven points—to solve over two days. These problems are far from simple; they require sustained reasoning and creativity in the form of written proofs, spanning various fields of mathematics. Until this year, AI systems had struggled to solve such problems.
The OpenAI team, comprising researchers and engineers Alex Wei, Sheryl Hsu, and Noam Brown, used a general-purpose reasoning model. This AI was designed to “think” through challenging problems by breaking them into steps, checking its own work, and adapting its approach as it went. Although AI systems couldn’t officially compete, the tough test served as a demonstration of their capabilities. The AI tackled the questions under the same conditions as human participants, working for two 4.5-hour sessions without any external assistance such as search engines or math software. The proofs produced by the AI were graded by three former IMO medalists and posted online. The AI correctly completed five out of six problems, scoring 35 out of 42 points, the minimum required for an IMO gold medal. Only 26 students, or 4 percent, outperformed the AI; five students achieved perfect 42s. Given that language-based AI systems like OpenAI’s struggled with elementary math only a year ago, these results represent a dramatic leap in performance.
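The scoring arithmetic is easy to verify. The sketch below assumes full marks (seven points) on the five solved problems and zero on the unanswered sixth, which is consistent with the reported 35/42 total; the per-problem breakdown is illustrative, not an official scoresheet.

```python
# Hypothetical per-problem scores: full marks on five problems,
# no attempt on the sixth (illustrative of the reported result).
problem_points = [7, 7, 7, 7, 7, 0]

GOLD_CUTOFF = 35  # this year's gold-medal threshold, per the article

score = sum(problem_points)
print(score, score >= GOLD_CUTOFF)  # 35 True
```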
In an interview, Alex Wei and Sheryl Hsu from the OpenAI team discussed their work, the model’s lack of response to the sixth question, and how developing a system capable of writing complex proofs could lead to AGI.
When asked about the decision to use a general-purpose AI system instead of one specifically designed for math problems, Wei explained that the philosophy was to build general-purpose AI and develop methods that could apply beyond math. Math is a good proving ground for AI because it’s fairly objective: if you have a proof, it’s easier to get consensus on its correctness. IMO problems are very hard, so the team wanted to tackle these with general-purpose methods, hoping they would also apply to other domains.
Hsu added that the goal at OpenAI is to build AGI, not just to write papers or win competitions. Everything they did for this project was also useful for the broader goal of building better models that users can actually use.
In terms of how a reasoning model winning a gold in the IMO could help lead to AGI, Wei noted that one perspective is to think about the time horizon of tasks. A year ago, ChatGPT could only handle very basic math problems. Now, it’s better at longer-running tasks, such as editing paragraphs. As AI improves, it can handle more complex and time-consuming tasks. Hsu emphasized that reasoning models were previously very good at tasks that are easy to verify. In the real world, tasks are more complex, with nuance and room for error. Proof-based math isn’t trivial to evaluate, and this is similar to the tasks people want help with in the real world.
The training process for the model involved reinforcement learning, where the model is rewarded for good behavior and penalized for bad behavior. Hsu mentioned that they also scaled up test-time compute, giving the AI more time to “think” before answering. This extra thinking time provided significant gains. When they finally looked at the results, the progress was exciting, and they believed gold might be within reach.
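OpenAI has not published how its model spends extra thinking time. One common way to scale test-time compute is best-of-n sampling: generate many candidate solutions and keep the one a grader scores highest. The toy sketch below illustrates only that general idea; `solve_with_budget`, its random stand-in "quality" scores, and the grader are all invented for illustration.

```python
import random

def solve_with_budget(problem_seed: int, n_samples: int) -> float:
    """Toy best-of-n: draw n candidate 'solutions', each with a
    stand-in quality score, and keep the best, as a grader might."""
    rng = random.Random(problem_seed)
    candidates = [rng.random() for _ in range(n_samples)]
    return max(candidates)

# More thinking budget can only help under this scheme: with a fixed
# seed, the larger sample extends the smaller one, so its max is >=.
assert solve_with_budget(42, 64) >= solve_with_budget(42, 4)
```

This is why giving the AI "more time to think" translates directly into measurable gains: each extra candidate is another chance to find a stronger attempt.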
On the IMO test, the model answered five of the six questions correctly; it did not attempt the sixth. Wei explained that the model recognizing what it doesn’t know was an early sign of progress. Current models sometimes “hallucinate,” confidently giving answers they can’t actually support. This capability isn’t specific to math; it would be beneficial if a model could honestly say when it doesn’t know, rather than giving an answer that requires independent verification.
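The abstention behavior Wei describes can be pictured as a simple confidence gate. The function below is a hypothetical illustration of that idea only: the threshold and the confidence signal are invented, not how the model actually decides.

```python
def answer_or_abstain(answer: str, confidence: float,
                      threshold: float = 0.9) -> str:
    """Emit the answer only when self-assessed confidence clears the
    threshold; otherwise admit uncertainty instead of guessing."""
    return answer if confidence >= threshold else "I don't know"

print(answer_or_abstain("x = 3", 0.97))  # confident: answers "x = 3"
print(answer_or_abstain("x = 3", 0.40))  # unsure: "I don't know"
```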
The impact of this work on future models is significant. Hsu stated that everything they did for this project is general-purpose, including the ability to grade outputs that aren’t single answers and to work on hard problems for a long time while making steady progress. These capabilities are being applied beyond math in future models. Wei added that the model can generate long, coherent outputs without mistakes, a skill that will be valuable in many other domains.
Q: What is the International Mathematical Olympiad (IMO)?
A: The International Mathematical Olympiad (IMO) is the world’s premier math competition for high school students, held annually and bringing together top contenders from over 100 countries. Participants are given six complex problems to solve over two days, each worth seven points.
Q: What is the significance of the OpenAI model's performance in the IMO?
A: The OpenAI model's performance in the IMO is significant because it demonstrates the ability of AI to handle complex, long-running tasks and to reason through problems with nuance and ambiguity, skills essential for developing artificial general intelligence (AGI).
Q: What is artificial general intelligence (AGI)?
A: Artificial general intelligence (AGI) refers to AI systems that possess human-level understanding and reasoning capabilities, enabling them to perform a wide range of tasks across various domains, not just specific, narrow tasks.
Q: How did the OpenAI team train their AI model for the IMO?
A: The OpenAI team used a general-purpose reasoning model and employed reinforcement learning to train the AI. They scaled up test-time compute, giving the model more time to 'think' before answering, which provided significant gains in performance.
Q: Why is the model's ability to know what it doesn't know important?
A: The model's ability to know what it doesn't know is important because it addresses the issue of 'hallucinations' in current AI systems, where they provide answers they are not confident about. This capability is crucial for building trust and reliability in AI systems.