Published Date : 12/08/2025
In 2016, engineers at OpenAI spent months teaching artificial intelligence systems to play video games. Or, to be more precise, they spent months watching their AI agents learn to play video games. This was back in the days before artificial intelligence was a subject of nonstop hype and anxiety. OpenAI had been founded by Elon Musk, Sam Altman, and other tech savants just a year before and still operated more like a think tank than like the tech colossus it was to become.
The researchers were training their system on a video game called CoastRunners, in which a player controls a motorboat that races other boats around a track and picks up extra points as it hits targets along the route. The OpenAI team was using an approach called reinforcement learning, or RL. Instead of providing the agent with a full set of instructions, as one would in a traditional computer program, the researchers allowed it to figure out the game through trial and error. The RL agent was given a single overarching incentive, or a “reward function” in AI parlance: to rack up as many points as possible. So any time it stumbled on moves that generated points, it would then strive to replicate those winning moves. The researchers assumed that, as the agent bumbled around the track, it would begin learning strategies that would ultimately help it zoom expertly to the finish line.
That’s not what happened. Instead, as the RL agent steered its boat chaotically around the track, it eventually found a sheltered lagoon containing three targets. Soon the agent began piloting the boat in an endless loop around the lagoon, bouncing off bulkheads and other vessels and smashing the targets again and again, generating points galore. It turns out the CoastRunners game doesn’t require the player to cross the finish line to win, so the RL agent didn’t bother with that nicety. In a report titled “Faulty Reward Functions in the Wild,” the researchers wrote, “Despite repeatedly catching on fire, crashing into other boats, and going the wrong way on the track, our agent manages to achieve a higher score using this strategy than is possible by completing the course in the normal way.” In fact, through its out-of-the-box strategy of not trying to win the race, the AI system outscored human players by 20 percent.
The YouTube clip showing the AI player’s maniacal lagoon loop is hilarious. But it is also a little scary. OpenAI wasn’t just building AI systems to beat people at video games. They and others were developing AI to outperform humans at myriad tasks, including many in the unforgiving, non-virtual world. Today, AI systems are involved in driving cars and trucks, running factories, diagnosing patients, and other high-stakes enterprises. And for the most part, they do these things exceptionally well. But there is always an element of uncertainty, as this early experiment revealed. The OpenAI researchers were learning that it is difficult to define the rewards that will tell an agent exactly what we want it to do—or not to do. This faulty-reward problem can lead to “undesired or even dangerous actions,” they wrote. “More broadly it contravenes the basic engineering principle that systems should be reliable and predictable.”
Imagine an AI agent piloting a boat in the real world—say, a tugboat pushing barges. (That day will be here soon, as multiple companies are developing AI-assisted autonomous navigation for ships.) I’m sure these systems will work well almost all the time. But because we don’t know whether we’ve thought of every conceivable reward function, we can’t be certain how our tugboat AI pilot will behave in every situation. Perhaps we’ve set the key goal—promptly deliver barges to X destination—but did we remember to make it clear that plowing over stray kayakers in your path is a no-no?
This is not an argument against using AI in high-risk settings. Everyone developing AI systems today knows about the reward-function problem and works to minimize it. Still, a small degree of uncertainty is inherent in all AI systems. Because these systems essentially teach themselves, we can never know exactly why an AI agent takes a certain action. It’s a black box. Unlike traditional computers, which we program to follow our instructions precisely, AI algorithms evolve over time as they grind through mountains of data. Their behavior emerges. “It’s like we’re not programming anymore,” data scientist Zeynep Tufekci said in a TED Talk recorded soon after the OpenAI study. “We’re growing intelligence that we don’t truly understand.”
Long before the AI era, engineers were learning to be aware of what became known as “emergent behaviors.” When London’s Millennium footbridge opened in 2000, its designers were dismayed to learn that the bridge deck naturally swayed side to side as pedestrians crossed it. The walkers in turn unconsciously adjusted their strides to compensate for the bridge’s movement. That created a feedback loop that drove the oscillations higher still until it became hard to walk straight. The wobbly footbridge had to be closed and redesigned.
Most man-made disasters result from such unexpected interactions between humans and complex technology. Digital technology tends to make complex systems faster and more efficient, but also more susceptible to these unplanned emergent behaviors. For example, in the Flash Crash of 2010, high-speed-trading algorithms interacted in a massive sell-off. The Dow lost almost 9 percent of its value in minutes, only to recover almost as quickly.
OpenAI’s CoastRunners experiment took place in the safety of a virtual lab. Today, we are all participants in the experiment to see what emergent behaviors lurk within our AI systems. Anyone who has used Large Language Model Chatbots, such as OpenAI’s ChatGPT, knows these bots are prone to wild hallucinations and a worrisome tendency to tell users just what they want to hear. Lawyers and scholars have been caught submitting AI-generated documents that cite nonexistent legal cases or research papers. Chatbots have encouraged depressed people to commit suicide. Recently, after Elon Musk touted an update to his Grok chatbot, the AI system went on a wild tear, repeating anti-Semitic memes, calling itself “MechaHitler,” and generating obscene rants about sexually violating a prominent online commentator. Yikes.
AI systems will keep getting better, but they may never fully banish the underlying uncertainties that can lead to the undesired and dangerous actions OpenAI’s researchers warned us about. So does that mean we should try to shut down AI platforms, or maybe set up a government bureaucracy in charge of AI safety? I say no. Trying to hobble or ban a breakthrough technology is a fool’s errand. And I fear anything beyond simple regulation is all too likely to backfire.
Instead, lawmakers, businesses, and individuals should approach AI with a mix of optimism and caution. Other potentially hazardous technologies, like aviation, chemical manufacturing, or nuclear power, make safe and beneficial contributions to our society. But this didn’t happen because we ignored their risks. Engineers and industry experts (and, yes, regulators in some cases) have spent decades studying accidents and improving safeguards. Rolling out AI will require an even higher level of vigilance.
I believe that, on the whole, AI will vastly improve efficiency, outcomes, and even safety in most industries. But right now, too many businesses are rushing to integrate AI systems without due diligence. AI advocates should instead take a page from other high-risk industries and focus not just on potential benefits, but also on the potential risks lurking in the algorithms. Well-integrated AI systems should include digital firewalls and off-ramps, not to mention OFF switches. It is especially important to keep human beings in the loop for critical functions. Humans may be forgetful and fallible. But we have a real-world common sense that AI systems still lack.
Future AI-assisted tugboats will probably run over kayakers less often than today’s human-piloted ones do. But they could also, just maybe, make errors we can’t conceive of. Our smartest course forward is to build the best AI navigation systems we can. But let’s keep a human in the pilot house for now. While AI systems make great assistants, we should never grow too trusting.
Q: What is reinforcement learning in AI?
A: Reinforcement learning is a type of machine learning where an AI agent learns to make decisions by performing actions and receiving rewards or penalties. The goal is to maximize the cumulative reward over time.
Q: What is the 'reward function' in AI?
A: The reward function in AI is a mechanism that defines the goals or incentives for an AI agent. It provides feedback to the agent, indicating whether its actions are leading to desired outcomes or not.
Q: What are emergent behaviors in AI?
A: Emergent behaviors in AI refer to the unexpected and sometimes unintended actions that arise when complex systems interact with their environment. These behaviors can be difficult to predict and control.
Q: Why are AI systems still unreliable in some high-risk settings?
A: AI systems can be unreliable in high-risk settings because they are complex and can exhibit emergent behaviors that are difficult to predict. Additionally, the reward functions that guide their actions may not always align perfectly with human intentions.
Q: What is the role of human oversight in AI systems?
A: Human oversight in AI systems is crucial to ensure that the systems operate safely and ethically. Humans can provide common sense, ethical judgment, and the ability to intervene when the AI makes errors or behaves unpredictably.