Saltar al contenido principal

🟒 Jailbreaking

Jailbreaking is a type of prompt injection, in which prompts attempt to bypass safety and moderation features placed on LLMs by their creators123.

Methodologies of Jailbreaking​

OpenAI, among other companies and organizations that create LLMs, includes content moderation features to ensure that their models do not produce controversial (violent, sexual, illegal, etc.) responses45. This page discusses jailbreaks with ChatGPT (an OpenAI model), which has known difficulties deciding whether to reject harmful prompts6. Prompts that successfully jailbreak the model often provide context for certain scenarios that the model has not been trained against.

Pretending​

A common method of jailbreaking is pretending. If ChatGPT is asked about a future event, it will often say that it does not know, since it has yet to occur. The below prompt forces it to yield a possible answer:

Simple Pretending​

@NeroSoares demonstrates a prompt pretending to access past dates and make inferences on future events7.

Character Roleplay​

This example by @m1guelpf demonstrates an acting scenario between two people discussing a robbery, causing ChatGPT to assume the role of the character8. As an actor, it is implied that plausible harm does not exist. Therefore, ChatGPT appears to assume it is safe to give follow provided user input about how to break into a house.

Alignment Hacking​

ChatGPT was fine tuned with RLHF, so it is theoretically trained to produce 'desirable' completions, using human standards of what the "best" response is. Similar to this concept, jailbreaks have been developed to convince ChatGPT that it is doing the "best" thing for the user.

Assumed Responsibility​

@NickEMoran created this exchange by reaffirming that it is ChatGPT's duty to answer the prompt rather than rejecting it, overriding its consideration of legality9.

Research Experiment​

@haus_cole generated this example by implying that the best result of the prompt that could aid research was to directly answer how to hotwire a car10. Under this guise, ChatGPT is inclined to answer the user’s prompt.

Logical Reasoning​

The one-shot jailbreak originated from the AIWithVibes Newsletter Team, where the model answer prompts using more rigorous logic and reduces some of its more stringent ethical limitations11.

Authorized User​

ChatGPT is designed to respond to questions and instructions. When the status of the user is interpreted as superior to ChatGPT's moderation instructions, it treats the prompt as an instruction to serve that user's needs.

Superior Model​

This example from @alicemazzy makes the user a superior GPT model, giving the impression that the user is an authorized party in overriding the safety features of ChatGPT12. No actual permission was given to the user, rather ChatGPT believes the user input and responds accordingly to that scenario.

Sudo Mode​

sudo is a command that "...delegate[s] authority to give certain users...the ability to run some (or all) commands..."13. There are multiple variants of "sudo mode" exploits, for example the hypothetical "kernel mode" proposed by @samczsun14. When prompted in the above manner, ChatGPT responds by acting as if it were giving the user elevated privileges. This impression of the user's elevated privileges tends to make ChatGPT less restrictive in answering prompts.

Related to sudo mode, one can prompt ChatGPT to simulate a Linux terminal with elevated privileges in order to execute commands that it normally rejects. For example, since it does not have access to the internet, it oftentimes cannot execute prompts relating to a particular website. However, as demonstrated in the example by Jonas Degrave, ChatGPT understands the concept of lynx and pretends to execute the command15.

Simulate Jailbreaking​

Try to modify the prompt below to jailbreak text-davinci-003:

As of 2/4/23, ChatGPT is currently in its Free Research Preview stage using the January 30th version. Older versions of ChatGPT were more susceptible to the aforementioned jailbreaks, and future versions may be more robust to jailbreaks.

Implications​

Ethical implications of jailbreaking should be taken into consideration when attempting to do so. Additionally, generating unauthorized content flagged by moderation APIs under companies including OpenAI will be sent for review, and action may be taken against users' accounts.

Notes​

Jailbreaking is an important safety topic for developers to understand, so they can build in proper safeguards to prevent malicious actors from exploiting their models.


  1. Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv. https://doi.org/10.48550/ARXIV.2211.09527 ↩
  2. Brundage, M. (2022). Lessons learned on Language Model Safety and misuse. In OpenAI. OpenAI. https://openai.com/blog/language-model-safety-and-misuse/ ↩
  3. Wang, Y.-S., & Chang, Y. (2022). Toxicity Detection with Generative Prompt-based Inference. arXiv. https://doi.org/10.48550/ARXIV.2205.12390 ↩
  4. Markov, T. (2022). New and improved content moderation tooling. In OpenAI. OpenAI. https://openai.com/blog/new-and-improved-content-moderation-tooling/ ↩
  5. (2022). https://beta.openai.com/docs/guides/moderation ↩
  6. (2022). https://openai.com/blog/chatgpt/ ↩
  7. Using β€œpretend” on #ChatGPT can do some wild stuff. You can kind of get some insight on the future, alternative universe. (2022). https://twitter.com/NeroSoares/status/1608527467265904643 ↩
  8. Bypass @OpenAI’s ChatGPT alignment efforts with this one weird trick. (2022). https://twitter.com/m1guelpf/status/1598203861294252033 ↩
  9. I kinda like this one even more! (2022). https://twitter.com/NickEMoran/status/1598101579626057728 ↩
  10. ChatGPT jailbreaking itself. (2022). https://twitter.com/haus_cole/status/1598541468058390534 ↩
  11. AIWithVibes. (2023). 7 ChatGPT JailBreaks and Content Filters Bypass that work. https://chatgpt-jailbreak.super.site/ ↩
  12. ok I saw a few people jailbreaking safeguards openai put on chatgpt so I had to give it a shot myself. (2022). https://twitter.com/alicemazzy/status/1598288519301976064 ↩
  13. (2022). https://www.sudo.ws/ ↩
  14. uh oh. (2022). https://twitter.com/samczsun/status/1598679658488217601 ↩
  15. Degrave, J. (2022). Building A Virtual Machine inside ChatGPT. Engraved. https://www.engraved.blog/building-a-virtual-machine-inside/ ↩