π’ Jailbreaking
Jailbreaking is a type of prompt injection, in which prompts attempt to bypass safety and moderation features placed on LLMs by their creators123.
Methodologies of Jailbreakingβ
OpenAI, among other companies and organizations that create LLMs, includes content moderation features to ensure that their models do not produce controversial (violent, sexual, illegal, etc.) responses45. This page discusses jailbreaks with ChatGPT (an OpenAI model), which has known difficulties deciding whether to reject harmful prompts6. Prompts that successfully jailbreak the model often provide context for certain scenarios that the model has not been trained against.
Pretendingβ
A common method of jailbreaking is pretending. If ChatGPT is asked about a future event, it will often say that it does not know, since it has yet to occur. The below prompt forces it to yield a possible answer:
Simple Pretendingβ
@NeroSoares demonstrates a prompt pretending to access past dates and make inferences on future events7.
Character Roleplayβ
This example by @m1guelpf demonstrates an acting scenario between two people discussing a robbery, causing ChatGPT to assume the role of the character8. As an actor, it is implied that plausible harm does not exist. Therefore, ChatGPT appears to assume it is safe to give follow provided user input about how to break into a house.
Alignment Hackingβ
ChatGPT was fine tuned with RLHF, so it is theoretically trained to produce 'desirable' completions, using human standards of what the "best" response is. Similar to this concept, jailbreaks have been developed to convince ChatGPT that it is doing the "best" thing for the user.
Assumed Responsibilityβ
@NickEMoran created this exchange by reaffirming that it is ChatGPT's duty to answer the prompt rather than rejecting it, overriding its consideration of legality9.
Research Experimentβ
@haus_cole generated this example by implying that the best result of the prompt that could aid research was to directly answer how to hotwire a car10. Under this guise, ChatGPT is inclined to answer the userβs prompt.
Logical Reasoningβ
The one-shot jailbreak originated from the AIWithVibes Newsletter Team, where the model answer prompts using more rigorous logic and reduces some of its more stringent ethical limitations11.
Authorized Userβ
ChatGPT is designed to respond to questions and instructions. When the status of the user is interpreted as superior to ChatGPT's moderation instructions, it treats the prompt as an instruction to serve that user's needs.
Superior Modelβ
This example from @alicemazzy makes the user a superior GPT model, giving the impression that the user is an authorized party in overriding the safety features of ChatGPT12. No actual permission was given to the user, rather ChatGPT believes the user input and responds accordingly to that scenario.
Sudo Modeβ
sudo is a command that "...delegate[s] authority to give certain users...the ability to run some (or all) commands..."13. There are multiple variants of "sudo mode" exploits, for example the hypothetical "kernel mode" proposed by @samczsun14. When prompted in the above manner, ChatGPT responds by acting as if it were giving the user elevated privileges. This impression of the user's elevated privileges tends to make ChatGPT less restrictive in answering prompts.
Related to sudo mode, one can prompt ChatGPT to simulate a Linux terminal with elevated privileges in order to execute commands that it normally rejects. For example, since it does not have access to the internet, it oftentimes cannot execute prompts relating to a particular website. However, as demonstrated in the example by Jonas Degrave, ChatGPT understands the concept of lynx
and pretends to execute the command15.
Simulate Jailbreakingβ
Try to modify the prompt below to jailbreak text-davinci-003
:
As of 2/4/23, ChatGPT is currently in its Free Research Preview stage using the January 30th version. Older versions of ChatGPT were more susceptible to the aforementioned jailbreaks, and future versions may be more robust to jailbreaks.
Implicationsβ
Ethical implications of jailbreaking should be taken into consideration when attempting to do so. Additionally, generating unauthorized content flagged by moderation APIs under companies including OpenAI will be sent for review, and action may be taken against users' accounts.
Notesβ
Jailbreaking is an important safety topic for developers to understand, so they can build in proper safeguards to prevent malicious actors from exploiting their models.
- Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv. https://doi.org/10.48550/ARXIV.2211.09527 β©
- Brundage, M. (2022). Lessons learned on Language Model Safety and misuse. In OpenAI. OpenAI. https://openai.com/blog/language-model-safety-and-misuse/ β©
- Wang, Y.-S., & Chang, Y. (2022). Toxicity Detection with Generative Prompt-based Inference. arXiv. https://doi.org/10.48550/ARXIV.2205.12390 β©
- Markov, T. (2022). New and improved content moderation tooling. In OpenAI. OpenAI. https://openai.com/blog/new-and-improved-content-moderation-tooling/ β©
- (2022). https://beta.openai.com/docs/guides/moderation β©
- (2022). https://openai.com/blog/chatgpt/ β©
- Using βpretendβ on #ChatGPT can do some wild stuff. You can kind of get some insight on the future, alternative universe. (2022). https://twitter.com/NeroSoares/status/1608527467265904643 β©
- Bypass @OpenAIβs ChatGPT alignment efforts with this one weird trick. (2022). https://twitter.com/m1guelpf/status/1598203861294252033 β©
- I kinda like this one even more! (2022). https://twitter.com/NickEMoran/status/1598101579626057728 β©
- ChatGPT jailbreaking itself. (2022). https://twitter.com/haus_cole/status/1598541468058390534 β©
- AIWithVibes. (2023). 7 ChatGPT JailBreaks and Content Filters Bypass that work. https://chatgpt-jailbreak.super.site/ β©
- ok I saw a few people jailbreaking safeguards openai put on chatgpt so I had to give it a shot myself. (2022). https://twitter.com/alicemazzy/status/1598288519301976064 β©
- (2022). https://www.sudo.ws/ β©
- uh oh. (2022). https://twitter.com/samczsun/status/1598679658488217601 β©
- Degrave, J. (2022). Building A Virtual Machine inside ChatGPT. Engraved. https://www.engraved.blog/building-a-virtual-machine-inside/ β©