Published: 19/07/2025
There are many ways to test the intelligence of an artificial intelligence—conversational fluidity, reading comprehension, or even mind-bendingly difficult physics. However, some of the tests that are most likely to stump AIs are ones that humans find relatively easy, even entertaining. Though AIs increasingly excel at tasks requiring high levels of human expertise, this does not mean they are close to attaining artificial general intelligence (AGI). AGI requires that an AI can take a very small amount of information and use it to generalize and adapt to highly novel situations. This ability, which is the basis for human learning, remains challenging for AIs.
One test designed to evaluate an AI’s ability to generalize is the Abstraction and Reasoning Corpus (ARC), a collection of tiny, colored-grid puzzles that ask a solver to deduce a hidden rule and then apply it to a new grid. Developed by AI researcher François Chollet in 2019, it became the basis of the ARC Prize Foundation, a nonprofit that administers the test. The foundation also develops new tests and routinely uses two of them: ARC-AGI-1 and its more challenging successor, ARC-AGI-2. This week the foundation is launching ARC-AGI-3, which is specifically designed to test AI agents by having them play video games.
Scientific American spoke to ARC Prize Foundation president, AI researcher, and entrepreneur Greg Kamradt to understand how these tests evaluate AIs, what they tell us about the potential for AGI, and why they are often challenging for deep-learning models even though many humans tend to find them relatively easy.
The definition of intelligence measured by ARC-AGI-1 is the ability to learn new things. AI can beat humans at chess and Go, but these models cannot generalize to new domains; they can’t go and learn English, for example. So what François Chollet made was a benchmark called ARC-AGI, which teaches you a mini skill in the question and then asks you to demonstrate that mini skill. The test measures a model’s ability to learn within a narrow domain, but it does not claim to measure AGI, because it is still a scoped domain.
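To make that task format concrete, here is a minimal sketch in Python of what an ARC-style task looks like. The grids, the mirror-the-rows rule, and the variable names are illustrative assumptions, not an actual task from the benchmark: the solver sees a few input/output pairs that demonstrate a hidden rule and must then apply the rule to a fresh input grid.

```python
# Illustrative sketch of an ARC-style task (not an actual benchmark task):
# a few input/output grid pairs demonstrate a hidden rule, and the solver
# must apply that rule to a new input. Grids are small 2-D arrays of color
# indices (0-9).

# Hypothetical mini skill: flip each row left to right.
train_pairs = [
    {
        "input":  [[1, 0, 0],
                   [2, 0, 0],
                   [0, 0, 0]],
        "output": [[0, 0, 1],
                   [0, 0, 2],
                   [0, 0, 0]],
    },
    {
        "input":  [[0, 3, 0],
                   [4, 0, 0],
                   [0, 0, 5]],
        "output": [[0, 3, 0],
                   [0, 0, 4],
                   [5, 0, 0]],
    },
]

test_input = [[6, 0, 0],
              [0, 7, 0],
              [0, 0, 0]]

def apply_rule(grid):
    """The rule a solver is expected to induce from the examples above:
    reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]

# A successful solver produces this grid without ever being told the rule.
print(apply_rule(test_input))  # [[0, 0, 6], [0, 7, 0], [0, 0, 0]]
```

The point of the format is sample efficiency: two demonstrations are all the information available, so memorized patterns from large training corpora are of little help.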
There are two ways to define AGI. The first is more tech-forward, which is whether an artificial system can match the learning efficiency of a human. Humans learn a lot outside their training data, such as speaking English, driving a car, and riding a bike. That’s called generalization. When you can do things outside of what you’ve been trained on, we define that as intelligence. An alternative definition of AGI is when we can no longer come up with problems that humans can do and AI cannot. As long as the ARC Prize or humanity in general can still find problems that humans can do but AI cannot, we do not have AGI.
One thing that differentiates the ARC Prize Foundation is that its benchmarks must be solvable by humans, in contrast to other benchmarks that focus on “Ph.D.-plus-plus” problems. The foundation tested 400 people on ARC-AGI-2. The average person scored 66 percent, and collectively, the aggregated responses of five to 10 people contained the correct answers to every question on the test.
What makes this test hard for AI and relatively easy for humans is that humans are incredibly sample-efficient with their learning. They can look at a problem and with maybe one or two examples, they can pick up the mini skill or transformation and go and do it. The algorithm running in a human’s head is orders of magnitude better and more efficient than what we’re seeing with AI right now.
ARC-AGI-1, created by François Chollet, had about 1,000 tasks and was the minimum viable version needed to measure generalization. It held for five years because deep learning couldn’t touch it. However, the reasoning models OpenAI released in 2024 started making progress on it. ARC-AGI-2 went a little further, requiring a bit more planning for each task: the grids are larger and the rules are more complicated, but the concept remains the same. The new format, ARC-AGI-3, is completely different and more interactive, using video games to test agents.
If you think about everyday life, it’s rare that we face a stateless decision. Stateless means just a question and an answer. Current benchmarks are more or less stateless, which limits what you can test: you cannot test planning, exploration, or intuiting about your environment. The foundation is building 100 novel video games and testing humans on them to make sure they are solvable; those games form the basis of the benchmark. AIs will then be dropped into these games to see whether they can make sense of environments they have never seen before. In internal testing to date, no AI has been able to beat even one level of one of the games.
Each “environment,” or video game, is a two-dimensional, pixel-based puzzle. These games are structured as distinct levels, each designed to teach a specific mini skill to the player (human or AI). To successfully complete a level, the player must demonstrate mastery of that skill by executing planned sequences of actions.
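To illustrate the difference between a stateless benchmark and the interactive setup described above, here is a minimal Python sketch. The GridGameEnv class, its methods, and the random agent are hypothetical stand-ins rather than the foundation's actual environment API; the sketch only shows that an agent must perceive, act, and plan over many steps instead of emitting a single answer.

```python
# Hedged sketch contrasting a stateless benchmark query with the kind of
# interactive agent loop an ARC-AGI-3-style environment implies. The
# GridGameEnv interface below is hypothetical, not the foundation's API.

from dataclasses import dataclass
import random

# Stateless benchmark: one question, one answer, no persistent environment.
def stateless_eval(model_answer: str, correct_answer: str) -> bool:
    return model_answer == correct_answer

# Interactive benchmark: the agent explores and acts over many steps.
@dataclass
class GridGameEnv:
    """A toy 1-D stand-in for a 2-D pixel-based level: reach the goal cell."""
    size: int = 8
    goal: int = 7
    position: int = 0
    steps_taken: int = 0

    def observe(self):
        # The agent only sees raw cells, never a description of the rules.
        return [1 if i == self.position else 0 for i in range(self.size)]

    def step(self, action: str) -> bool:
        self.steps_taken += 1
        if action == "right":
            self.position = min(self.position + 1, self.size - 1)
        elif action == "left":
            self.position = max(self.position - 1, 0)
        return self.position == self.goal  # True once the level is beaten

# A random agent stands in for an AI dropped into an unseen environment.
env = GridGameEnv()
solved = False
for _ in range(100):
    obs = env.observe()                        # perceive the current state
    action = random.choice(["left", "right"])  # a real agent would plan here
    if env.step(action):
        solved = True
        break
print(f"solved={solved} after {env.steps_taken} steps")
```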
Video games have long been used as benchmarks in AI research, with Atari games being a popular example. However, traditional video game benchmarks have several limitations. Popular games have extensive training data publicly available, lack standardized performance evaluation metrics, and permit brute-force methods involving billions of simulations. Additionally, developers building AI agents typically have prior knowledge of these games, unintentionally embedding their own insights into the solutions. ARC-AGI-3 aims to overcome these limitations by creating novel, interactive environments that test AI’s ability to generalize and adapt to new situations.
Q: What is the Abstraction and Reasoning Corpus (ARC)?
A: The Abstraction and Reasoning Corpus (ARC) is a collection of tiny, colored-grid puzzles designed to evaluate an AI’s ability to generalize by deducing hidden rules and applying them to new grids.
Q: What is the main challenge in achieving artificial general intelligence (AGI)?
A: The main challenge in achieving AGI is the ability to generalize and adapt to highly novel situations using a very small amount of information, which is a key aspect of human learning.
Q: How does ARC-AGI-3 differ from previous tests?
A: ARC-AGI-3 is more interactive and uses video games to test AI agents, requiring them to understand and adapt to new environments, which is a significant step up from previous stateless benchmarks.
Q: Why are these tests important for AI research?
A: These tests are crucial for understanding the limitations of current AI models and for developing more advanced systems that can generalize and adapt to new situations, which is essential for achieving AGI.
Q: What are the limitations of traditional video game benchmarks in AI research?
A: Traditional video game benchmarks have extensive training data, lack standardized performance metrics, and permit brute-force methods. They also often embed developers' insights, which can skew results.