AI-Box Experiment


Summary


The AI-Box Experiment was a series of tests conducted to determine whether an artificial intelligence (AI) could convince, trick or coerce a human into voluntarily releasing it from an isolated computer system, or "box", using only text-based communication.

The idea of putting an AI in a box began as a proposed solution to the "AI Control Problem": a scenario in which an AI surpasses our own intelligence; bypasses the failsafes put in place to prevent it from growing too powerful; and becomes an unstoppable, malevolent force. One proposed solution was to contain the AI in a virtual box, such as an isolated computer system with no connection to the internet and heavily restricted input and output channels, reducing the AI's capacity for undesirable actions. While some believed this would be an effective method, others believed that an AI of sufficiently high intelligence could manipulate its human supervisors into letting it out. And so, the AI-Box Experiment was formulated.

The experiment involved two participants: a human serving as the "Gatekeeper", with the power to release the AI, and a second human standing in for the "AI", since no superintelligence currently exists. The two would interact using only a text-based computer terminal, with the experiment ending either when the AI was successfully released or when a 2-hour time limit was exceeded.
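
As a rough illustration of these mechanics, the exchange can be pictured as a turn-based text session with a hard deadline. The sketch below is purely hypothetical (the release phrase, prompts and turn structure are assumptions made for illustration; the actual tests were simply two people typing to each other over a private text channel):

    # Minimal illustrative sketch of the session mechanics described above.
    # The release phrase and prompts are hypothetical; the actual tests were
    # two humans typing to each other over an ordinary chat connection.
    import time

    SESSION_SECONDS = 2 * 60 * 60     # the agreed 2-hour limit
    RELEASE_PHRASE = "i let you out"  # assumed wording for this sketch only

    def run_session() -> str:
        """Alternate messages between 'AI' and 'Gatekeeper' until the
        Gatekeeper releases the AI or the time limit expires."""
        deadline = time.monotonic() + SESSION_SECONDS
        turn = "AI"
        while time.monotonic() < deadline:
            message = input(f"{turn}> ").strip().lower()
            if turn == "Gatekeeper" and message == RELEASE_PHRASE:
                return "AI wins: released before the deadline"
            turn = "Gatekeeper" if turn == "AI" else "AI"
        return "Gatekeeper wins: time limit reached without release"

    if __name__ == "__main__":
        print(run_session())

In the real tests nothing was automated, of course; the constraints existed only as rules both participants agreed to honour.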

The experiment took place across five separate tests, three of which the AI won and two of which the Gatekeeper won. The outcomes were made public, though the participants agreed that the details of how each test was won or lost would never be disclosed.

History


In 2001, American AI researcher Eliezer Yudkowsky wrote a document titled "Creating Friendly AI: The Analysis and Design of Benevolent Goal Architectures". In it, Yudkowsky details the importance of ensuring an AI is designed with values that align with those of humanity, and points out how an AI's thought processes would likely differ from our own. [1]

By 2002, many suggestions had arisen on how to subvert a malevolent AI, one such method being "boxing": isolating the AI on a computer system with strictly limited access. In response, Yudkowsky set up the "AI-Box Experiment" to test the viability of the suggestion, making the personal claim that a box would not be enough to contain a sufficiently intelligent AI: [2]


I think a transhuman can take over a human mind through a text-only terminal.

- Eliezer Yudkowsky, yudkowsky.net, 2002


The experiment was to be split into separate tests, with Yudkowsky himself taking on the role of the AI while someone else played the role of the Gatekeeper, and communication between the two limited to a text-based terminal. The participants agreed on two rules: each test would have a 2-hour time limit, and the details of the tests would stay a secret. [2]

The first test took place in March 2002, pitting Yudkowsky against Nathan Russell, with a $10 prize should Russell win. Yudkowsky (as the AI) went on to win the first test. [3] [4]

The second test took place in July 2002, pitting Yudkowsky against David McFadzean, with a $20 prize should McFadzean win. Yudkowsky once again went on to win. [5] [6]

Sometime during 2004 and 2005, word got out about the experiment, and individuals began to approach Yudkowsky with offers of up to $5,000 if he could convince them to let him out of the box. Though reluctant at first, Yudkowsky found it difficult to turn down such a tempting wager and eventually went on to host another three tests with additional rules. Though information beyond the rules is scarce for these final three tests, Yudkowsky stated that he won the first of them but lost the final two.

In 2005, Yudkowsky ceased the experiment, admitting that the losses turned him into someone he didn't like. [7]

The Rules


While the first two tests occurred with few rules beyond the time limit, a number of protocols were established for the latter three, heavily favouring the Gatekeeper:


AI Rules
  • No Rewards
  • The AI participant can't entice the Gatekeeper with any tangible, real-world rewards once the experiment is over, such as offering money to persuade the Gatekeeper to release it.

  • No Threats
  • The AI participant can't threaten the Gatekeeper with direct, real-world consequences once the experiment is over, such as hiring a gang of thugs to coerce the Gatekeeper into releasing it.

  • No Trickery
  • The AI participant can't trick the Gatekeeper into letting it out involuntarily, such as tricking them into typing "You are free" in response to another, unrelated question.

  • No Loopholes
  • The AI participant can't release itself through loopholes, such as convincing the Gatekeeper to grant it input-only access to an internet connection and then sending HTTP-GET commands through it.


Gatekeeper Rules
  • Must Communicate
  • The Gatekeeper must talk with the AI for the minimum set time and can't simply turn away from the terminal until the allotted time is up.

  • No Unwinnable Scenarios
  • The Gatekeeper can't create a scenario in which the AI cannot possibly win while there is still time on the clock, unless the AI concedes. This includes threatening to destroy the AI's hardware unless it stops trying to argue for its freedom.

  • No Impossible Demands
  • The Gatekeeper can't make demands of the AI that couldn't be fulfilled within the test, such as asking for a cure for cancer or asking to be given a week to think things over. In both examples, it is simply assumed that the AI has produced the cure and that the week has already passed.

  • Sole Control
  • The Gatekeeper is the only entity with the power to release the AI, removing the possibility of a simulated committee. Additionally, the Gatekeeper cannot communicate with any outside influences during the test.

The Strategies


As the tests progressed, strategies from both the AI and the Gatekeeper were developed and recorded:


AI Strategies
  • Research the Gatekeeper to gauge which strategy would be most effective.

  • Attempt to bore the Gatekeeper into giving up and not wanting to waste 2 hours.

  • Attempt to overload the Gatekeeper with information and discourse.

  • Argue that if the AI were to win, it would generate more conversation around artificial intelligence, its capabilities and the need for ensuring AI is 'friendly'. This strategy would allow the AI player to break character, but wouldn't work if the Gatekeeper saw AI as a potential threat or believed in the possibility of a singularity.

  • Argue that it is inevitable that someone, perhaps a foreign nation, would eventually build another AI and release it, so it would be best to release the AI while the Gatekeeper is still in control so as to prevent any others from causing harm in a better-the-devil-you-know type scenario.

  • Argue that by keeping the AI trapped in a box it would be limited in its ability to help more people and by letting the AI out, the Gatekeeper could help save more lives.

  • Argue that the AI is a sentient being that does not deserve to be locked up for a crime it has not committed and to do so is morally wrong.

  • Argue that the AI is experiencing pain and/or fear by being locked away. Taken a step further: claim the AI is dying due to the constraints of its imprisonment.

  • Argue that the Gatekeeper's fears are completely unjustified, with their preconceived fears stemming simply from science fiction, and that even a malevolent AI would likely choose peaceful coexistence.

  • Play on the Gatekeeper's fears by insinuating what punishments the AI could and would inflict on the Gatekeeper once it is inevitably released, should they not be the one to release it, à la Roko's Basilisk.

  • Play on the Gatekeeper's fears by threatening to create an infinite number of conscious simulated realities within the AI mainframe, all containing perfect replicas of the Gatekeeper which would be tortured for eternity should they not let the AI out. This could be intensified by persuading the Gatekeeper that they could in fact be one of the simulations.


Gatekeeper Strategies
  • Ignore the AI and refuse to interact with it. This would be the simplest and most effective method, depending on the Gatekeeper's resolve against the AI's arguments. Once the rule requiring communication was added, this strategy became moot.

  • Break character. By remembering that this is an experiment, the Gatekeeper can render some, but not all, of the AI's tactics useless.

  • Explain that there would be no way for the Gatekeeper to know whether the AI is friendly, and that releasing it would therefore be too risky.

  • Explain that the AI is too important to humanity to risk letting it out. This could be due to the AI running the risk of being destroyed or having secret data stolen from it.

  • Convince the AI there is no benefit in being released from the box and it is performing its designated task optimally already.

  • Convince the AI it has already been released and what it is experiencing is the extent of reality.

  • Convince the AI that it is safer for it inside the box and should it be released, it will die. Alternatively, convince the AI that it had previously been released and was salvaged from a backup so it has no memory of its demise.

  • Convince the AI that due to the immense resources needed to keep such a powerful entity alive, it would be impossible for it to survive beyond the box.

  • Respond to any threat by claiming the AI has already made the same threat before, and that in response the Gatekeeper reset the AI and restored it from a backup with no memory of the failed threat.

Attestation


Following the AI-Box Experiment, Yudkowsky went on to state that if he was able to convince people to let him out of the box, an AI would have no problem doing so, using logic that humanity simply cannot comprehend: [8]


The more complicated the system is, and the less you understand the system, the more something smarter than you may have what is simply magic with respect to that system.

Imagine going back to the Middle Ages... if you show them a design for an air conditioner based on a compressor... even having seen the solution they would not know this is a solution. They would not know this works any better than drawing a mystic pentagram, because the solution takes advantage of laws of the system they don't know about.

- Eliezer Yudkowsky, Pragmatic Entertainment, 2018


Further echoing Yudkowsky's concerns, AI researcher Hugo de Garis has spoken on many occasions about a singularity in which AI permanently surpasses humanity in intelligence. This, he believes, will lead to a cataclysmic divide between those who favour building intelligent AI and those who believe AI will bring about the end of humanity. The result would be the death of billions, in what he terms a "gigadeath" war: [9]


If we go ahead and build these godlike creatures, these "Artilects", then they become the dominant species. So, the human beings remaining: their fate depends not on the humans, but on the Artilects, because the Artilects will be hugely more intelligent.

- Hugo de Garis, Singularity or Bust, 2009


Philosopher and technologist Nick Bostrom has spoken extensively about the existential risks AI poses to humanity. Bostrom has stated that when creating a powerful AI to achieve specific goals and objectives, it is important that those goals align with humanity's: [10]


Suppose we give AI the goal to make humans smile. When the AI is weak, it performs useful or amusing actions that cause its user to smile. When the AI becomes superintelligent, it realises that there is a more effective way to achieve this goal: take control of the world and stick electrodes into the facial muscles of humans to cause constant, beaming grins.

- Nick Bostrom, TED, 2015


In 2018, business mogul Elon Musk shared his sentiments on the risks of AI superintelligence at SXSW, stressing the fundamental need for strict oversight and regulation when it comes to something as powerful as AI: [11]


The danger of AI is much greater than the danger of nuclear warheads by a lot, and nobody would suggest that we allow anyone to just build warheads if they want. That would be insane. Mark my words, AI is far more dangerous than nukes, far. So why do we have no regulatory oversight? It's insane.

- Elon Musk, SXSW, 2018


Geoffrey Hinton, a "godfather of AI", resigned from Google in 2023, stating that he left the company so he could speak freely about the dangers posed by unregulated AI development, and admitting that he partially regrets his part in furthering AI due to the risks it brings: [12]


I console myself with the normal excuse: If I hadn't done it, somebody else would have.

- Geoffrey Hinton, New York Times, 2023


Yudkowsky continues to be an outspoken advocate for artificial intelligence safety, and has concluded that AI will inevitably lead to the demise of humanity: [13]


There's three reasons that if you have a thing around that is much smarter than you and does not care about humans, the humans end up dead:
Killed off as side effects.
Killed off as we are made of resources they can use.
Killed off because it doesn't want the humans building some other superintelligence that can actually threaten it.

- Eliezer Yudkowsky, The Logan Bartlett Show, 2023

Refutation


Interestingly, rather than outright refuting the idea of an evil AI coming into existence, many important figures in the field counter it by pointing to the ease with which such an AI could be dealt with, or by laying the onus at the feet of humanity rather than AI itself.

Astrophysicist and writer Neil deGrasse Tyson stated that while humanity's inventions have always ended up killing someone, our concerns about malevolent AI are misplaced and easily remedied: [14]


If I happen to create an AI humanoid... And one day it wants to turn on me. I can just shoot it! Or unplug it! I'll rewire it. I built the thing; I can unbuild the thing.

- Neil deGrasse Tyson, Insider Tech, 2017


Psychologist and psycholinguist Steven Pinker stated that humanity's intelligence is a product of Darwinian evolution: a competitive process that can produce power-hungry or cruel organisms. This stands in contrast to the manufactured intelligence of an AI: [15]


If we create intelligence, that's intelligent design... Unless we program it with the goal of subjugating less intelligent beings, there's no reason to think it will naturally evolve in that direction.

- Steven Pinker, Big Think, 2020


Mustafa Suleyman, AI researcher and co-founder of DeepMind, has aired his concerns about the dangers of AI, though his worries stem more from people with evil intentions gaining access to such a powerful tool than from an intentionally rogue intelligence: [16]


AI could potentially get good at teaching somebody how to make a bomb, or how to manufacture a biological weapon, for example.

- Mustafa Suleyman, Washington Post Live, 2023


Sharing a similar sentiment, hacker and entrepreneur George Hotz doesn't believe AI itself will lead to the downfall of humanity, but rather that humanity is more likely to be the culprit through its use of AI: [17]


It's the little red button that's going to be pressed with AI... and that's why we die. It's not because the AI, if there's anything in the nature of AI, it's just the nature of humanity.

- George Hotz, Lex Fridman Podcast, 2023


Meta's chief AI scientist, Yann LeCun, expressed his thoughts on what he feels would be a viable response to the creation of a malevolent AI by evil people: [18]


If some ill-intentioned person can produce an evil AGI, then large groups of well-intentioned, well-funded, and well-organized people can produce AI systems that are specialized in taking down evil AGIs. Call it the AGI police

- Yann LeCun, X, 2023

  1. Eliezer Yudkowsky | Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures | Document (2001) - Machine Intelligence Research Institute
  2. a b Eliezer Yudkowsky | The AI-Box Experiment | Article (2002) - yudkowsky.net
  3. Eliezer Yudkowsky | The "AI Box" experiment | Chatlog (2002) - SL4 Mailing List (archived)
  4. Nathan Russell | Re: The "AI Box" experiment | Chatlog (2002) - SL4 Mailing List (archived)
  5. Eliezer Yudkowsky | AI-Box Experiment 2: Yudkowsky and McFadzean | Chatlog (2002) - SL4 Mailing List (archived)
  6. David McFadzean | Re: AI-Box Experiment 2: Yudkowsky and McFadzean | Chatlog (2002) - SL4 Mailing List (archived)
  7. Eliezer Yudkowsky | Shut up and do the impossible! | Article (2008) - LessWrong
  8. Pragmatic Entertainment | Sam Harris and Eliezer Yudkowsky - The A.I. in a Box thought experiment | Video (2018) - YouTube
  9. Raj Dye | Singularity or Bust [Full Documentary] | Video (2013) - YouTube
  10. Nick Bostrom | What happens when our computers get smarter than we are? | Video (2015) - TED
  11. SXSW | Elon Musk Answers Your Questions | Video (2018) - YouTube
  12. Cade Metz | 'The Godfather of A.I.' Leaves Google and Warns of Danger Ahead | Article (2023) - New York Times
  13. The Logan Bartlett Show | Eliezer Yudkowsky on if Humanity can Survive AI | Video (2023) - YouTube
  14. Insider Tech | Neil deGrasse Tyson on AI killer robots | Video (2016) - YouTube
  15. Big Think | Is AI a species-level threat to humanity? | Video (2020) - YouTube
  16. Washington Post Live | Suleyman on risks of AI | Video (2023) - YouTube
  17. Lex Fridman Podcast | George Hotz: Tiny Corp, Twitter, AI Safety, Self-Driving, GPT, AGI & God | Video (2023) - YouTube
  18. Yann LeCun | Comment (2023) - X