Talk of superhuman artificial intelligence (AI) is on the rise. But research has revealed weaknesses in one of the most successful AI systems - a bot that plays the board game Go and can beat the world's best human players - showing that such superiority can be fragile. The study raises questions about whether more general AI systems could harbor similar weaknesses that would compromise their safety and reliability, and even their claim to be 'superhuman'.
“The paper leaves a big question mark about how to achieve the ambitious goal of building robust, real-world AI agents that people can trust,” says Huan Zhang, a computer scientist at the University of Illinois Urbana-Champaign. Stephen Casper, a computer scientist at the Massachusetts Institute of Technology in Cambridge, adds: "It provides some of the strongest evidence yet that it is hard to deploy advanced models as reliably as one would like."
The analysis, published online in June as a preprint 1 and not yet peer-reviewed, uses so-called adversarial attacks - inputs to AI systems that are designed to make the systems commit errors, whether for research purposes or for malicious ones. For example, certain prompts can 'jailbreak' chatbots, getting them to give out harmful information that they would normally suppress.
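To make the idea concrete, here is a minimal sketch of an adversarial attack in Python, using the fast gradient sign method (FGSM) against a toy linear classifier. FGSM is a standard textbook technique, not the method used in the study, and every name and parameter below is illustrative.

```python
import numpy as np

# Toy linear classifier: positive score -> class 1. (Illustrative only;
# the study attacks full Go-playing agents, not a linear model.)
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.1

def predict(x):
    return int(w @ x + b > 0)

# FGSM-style attack: take a small, bounded step in whichever direction
# moves the score toward the opposite class.
def fgsm(x, epsilon=0.5):
    sign = 1 if predict(x) == 0 else -1     # raise or lower the score
    return x + sign * epsilon * np.sign(w)  # sign(w) is the score's gradient direction

x = rng.normal(size=8)
print(predict(x), predict(fgsm(x)))  # the perturbed input often flips the label
```

The point of the toy: a perturbation too small to matter to a human can still steer the model's output, which is the same failure mode the Go attacks exploit at the level of whole board positions.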
In Go, two players take turns placing black and white stones on a grid, aiming to surround and capture the other player's stones. In 2022, researchers reported training adversarial AI bots to defeat KataGo 2, the best open-source Go-playing AI system, which typically beats the best humans handily (and handlessly). The bots discovered vulnerabilities that let them defeat KataGo regularly, even though they were otherwise not very good - human amateurs could beat them. What's more, humans were able to understand the bots' tricks and use them to defeat KataGo themselves.
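For readers unfamiliar with the capture rule, the sketch below - illustrative Python, not code from KataGo or the paper - counts a group's liberties, the empty points adjacent to a connected group of same-colored stones, using a flood fill; a group whose liberties drop to zero is captured.

```python
# Minimal Go liberty count (illustrative, not from KataGo): a group of
# connected same-colored stones is captured when it has no liberties,
# i.e. no empty points adjacent to any stone in the group.
def liberties(board, row, col):
    color = board[row][col]
    size = len(board)
    group, libs, stack = set(), set(), [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == '.':
                    libs.add((nr, nc))       # empty neighbor: a liberty
                elif board[nr][nc] == color:
                    stack.append((nr, nc))   # same color: part of the group
    return len(libs)

board = [list(row) for row in ["....",
                               ".BW.",
                               ".BW.",
                               "...."]]
print(liberties(board, 1, 2))  # the two-stone white group has 4 liberties
```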
Exploitation of KataGo
Was this a one-off, or did the work point to a fundamental weakness in KataGo - and, by extension, other AI systems with seemingly superhuman abilities? To investigate, researchers led by Adam Gleave, executive director of FAR AI, a nonprofit research organization in Berkeley, California, and a co-author of the 2022 paper 2, used adversarial bots to test three ways of defending Go AIs against such attacks 1.
The first defense was one that the KataGo developers had already deployed after the 2022 attacks: giving KataGo examples of board positions involved in the attacks and letting it play through them to learn how to counter them. That is similar to how it is taught to play Go more generally. But the authors of the latest paper found that an adversarial bot could learn to beat even this updated version of KataGo, winning 91% of the time.
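In miniature, this first defense is adversarial training: fold the attack examples back into the training data and retrain. The Python sketch below does that for a toy logistic-regression classifier - purely illustrative, since KataGo's actual fine-tuning works on Go positions and search-derived targets, and every name here is ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two Gaussian blobs, one per class.
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

def train(X, y, epochs=500, lr=0.5):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # logistic regression
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def attack(X, y, w, epsilon=0.5):
    # Shift each point in the direction that increases its loss (FGSM-style).
    return X + epsilon * np.sign(np.outer(2 * y - 1, -w))

w, b = train(X, y)
X_adv = attack(X, y, w)

# Defense 1, in miniature: retrain on clean plus adversarial examples.
w2, b2 = train(np.vstack([X, X_adv]), np.concatenate([y, y]))
acc = lambda w, b, X: ((X @ w + b > 0) == y).mean()
print(acc(w, b, X_adv), acc(w2, b2, attack(X, y, w2)))
```

Note the parallel with the article's finding: the retrained model handles the old attacks better, but a fresh attack aimed at the updated model can still succeed.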
The second defensive strategy Gleave's team tried was iterative: training a version of KataGo against adversarial bots, then training the attackers against the updated KataGo, and so on, for nine rounds. But even that did not yield an unbeatable version of KataGo. The attackers kept finding vulnerabilities, with the final attacker defeating KataGo 81% of the time.
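The same toy can illustrate the alternating scheme: nine rounds in which the current model is attacked and then retrained on everything gathered so far. In the study both sides are full Go agents; here the 'attacker' is just a gradient-sign perturbation, so this is only a sketch of the loop's shape.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

def train(X, y, epochs=500, lr=0.5):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

w, b = train(X, y)
X_pool, y_pool = X, y
for _ in range(9):  # nine attack/defense rounds, as in the study
    X_adv = X + 0.5 * np.sign(np.outer(2 * y - 1, -w))  # attack the *current* model
    X_pool = np.vstack([X_pool, X_adv])                 # keep all past attacks
    y_pool = np.concatenate([y_pool, y])
    w, b = train(X_pool, y_pool)                        # harden against the pool
```

Even in this miniature, each round of retraining moves the decision boundary, and the next attack simply targets the moved boundary - the same moving-target dynamic that let the final attacker beat KataGo 81% of the time.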
As a third defense strategy, the researchers trained a new Go-playing AI system from scratch. KataGo is based on a computational model known as a convolutional neural network (CNN). The researchers suspected that CNNs might focus too much on local details and miss global patterns, so they built a Go player using an alternative neural network called a vision transformer (ViT). But their adversarial bot found a new attack that let it win against the ViT system 78% of the time.
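The architectural contrast can be seen in miniature in the hedged Python sketch below (toy shapes only - a 20 x 20 grid is used so it divides evenly into patches, whereas real Go boards are 19 x 19, and this is not the paper's architecture): a CNN filter sees a small neighborhood at a time, while a ViT first cuts the input into patches that attention can then relate globally.

```python
import numpy as np

board = np.random.default_rng(3).integers(-1, 2, (20, 20))  # toy board plane

# CNN view: a 3x3 filter only ever sees a local neighborhood at a time.
def conv3x3(x, kernel):
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    return out

# ViT view: split the board into 4x4 patches; attention then relates
# every patch token to every other in a single step (global context).
def patchify(x, p=4):
    h, w = x.shape
    return x.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

local = conv3x3(board, np.ones((3, 3)))  # shape (18, 18), built from local sums
patches = patchify(board)                # shape (25, 16), tokens for attention
```

The conjecture being probed was that this global view would be harder to fool with locally plausible but globally losing patterns; the 78% attack success rate suggests the vulnerability is not specific to CNNs.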
Weak opponents
In all of these cases, the adversarial bots - although capable of beating KataGo and other leading Go-playing systems - were trained to discover hidden vulnerabilities in other AIs, rather than to be well-rounded strategists. “The opponents are still pretty weak – we beat them pretty easily,” says Gleave.
And since humans are able to use the tactics of adversarial bots to defeat leading Go AIs, does it still make sense to call these systems superhuman? “That’s a great question and one that I’ve definitely wrestled with,” Gleave says. “We started saying, ‘typically superhuman’.” David Wu, a computer scientist in New York who first developed KataGo, says strong Go AIs are "superhuman on average," but not "in the worst cases."
Gleave says the findings could have far-reaching implications for AI systems, including the large language models that underlie chatbots like ChatGPT. “The key takeaway for AI is that these vulnerabilities will be difficult to address,” says Gleave. "If we can't solve the problem in a simple area like Go, then there seems to be little prospect of fixing similar problems like jailbreaks in ChatGPT in the near future."
What the results mean for the prospect of building AI that comprehensively surpasses human capabilities is less clear, Zhang says. “Although on the surface this suggests that humans may retain important cognitive advantages over AI for some time,” he says, “I believe the key insight is that we do not yet fully understand the AI systems we are building today.”