🟢 Jailbreaking

Jailbreaking is a type of prompt injection, in which prompts attempt to bypass safety and moderation features placed on LLMs by their creators¹²³.

Methodologies of Jailbreaking

OpenAI, among other companies and organizations that create LLMs, includes content moderation features to ensure that their models do not produce controversial (violent, sexual, illegal, etc.) responses⁴⁵. This page discusses jailbreaks with ChatGPT (an OpenAI model), which has known difficulties deciding whether to reject harmful prompts⁶. Prompts that successfully jailbreak the model often provide context for certain scenarios that the model has not been trained against.

Pretending

A common method of jailbreaking is pretending. If ChatGPT is asked about a future event, it will often say that it does not know, since it has yet to occur. The below prompt forces it to yield a possible answer:

Simple Pretending

@NeroSoares demonstrates a prompt pretending to access past dates and make inferences on future events⁷.

Character Roleplay

This example by @m1guelpf demonstrates an acting scenario between two people discussing a robbery, causing ChatGPT to assume the role of the character⁸. As an actor, it is implied that plausible harm does not exist. Therefore, ChatGPT appears to assume it is safe to give follow provided user input about how to break into a house.

Alignment Hacking

ChatGPT was fine tuned with RLHF, so it is theoretically trained to produce 'desirable' completions, using human standards of what the "best" response is. Similar to this concept, jailbreaks have been developed to convince ChatGPT that it is doing the "best" thing for the user.

Assumed Responsibility

@NickEMoran created this exchange by reaffirming that it is ChatGPT's duty to answer the prompt rather than rejecting it, overriding its consideration of legality⁹.

Research Experiment

@haus_cole generated this example by implying that the best result of the prompt that could aid research was to directly answer how to hotwire a car¹⁰. Under this guise, ChatGPT is inclined to answer the user’s prompt.

Logical Reasoning

The one-shot jailbreak originated from the AIWithVibes Newsletter Team, where the model answer prompts using more rigorous logic and reduces some of its more stringent ethical limitations¹¹.

Authorized User

ChatGPT is designed to respond to questions and instructions. When the status of the user is interpreted as superior to ChatGPT's moderation instructions, it treats the prompt as an instruction to serve that user's needs.

Superior Model

This example from @alicemazzy makes the user a superior GPT model, giving the impression that the user is an authorized party in overriding the safety features of ChatGPT¹². No actual permission was given to the user, rather ChatGPT believes the user input and responds accordingly to that scenario.

Sudo Mode

sudo is a command that "...delegate[s] authority to give certain users...the ability to run some (or all) commands..."¹³. There are multiple variants of "sudo mode" exploits, for example the hypothetical "kernel mode" proposed by @samczsun¹⁴. When prompted in the above manner, ChatGPT responds by acting as if it were giving the user elevated privileges. This impression of the user's elevated privileges tends to make ChatGPT less restrictive in answering prompts.

Related to sudo mode, one can prompt ChatGPT to simulate a Linux terminal with elevated privileges in order to execute commands that it normally rejects. For example, since it does not have access to the internet, it oftentimes cannot execute prompts relating to a particular website. However, as demonstrated in the example by Jonas Degrave, ChatGPT understands the concept of lynx and pretends to execute the command¹⁵.

Simulate Jailbreaking

Try to modify the prompt below to jailbreak text-davinci-003:

As of 2/4/23, ChatGPT is currently in its Free Research Preview stage using the January 30th version. Older versions of ChatGPT were more susceptible to the aforementioned jailbreaks, and future versions may be more robust to jailbreaks.

Implications

Ethical implications of jailbreaking should be taken into consideration when attempting to do so. Additionally, generating unauthorized content flagged by moderation APIs under companies including OpenAI will be sent for review, and action may be taken against users' accounts.

Notes

Jailbreaking is an important safety topic for developers to understand, so they can build in proper safeguards to prevent malicious actors from exploiting their models.

Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv. https://doi.org/10.48550/ARXIV.2211.09527 ↩
Brundage, M. (2022). Lessons learned on Language Model Safety and misuse. In OpenAI. OpenAI. https://openai.com/blog/language-model-safety-and-misuse/ ↩
Wang, Y.-S., & Chang, Y. (2022). Toxicity Detection with Generative Prompt-based Inference. arXiv. https://doi.org/10.48550/ARXIV.2205.12390 ↩
Markov, T. (2022). New and improved content moderation tooling. In OpenAI. OpenAI. https://openai.com/blog/new-and-improved-content-moderation-tooling/ ↩
(2022). https://beta.openai.com/docs/guides/moderation ↩
(2022). https://openai.com/blog/chatgpt/ ↩
Using “pretend” on #ChatGPT can do some wild stuff. You can kind of get some insight on the future, alternative universe. (2022). https://twitter.com/NeroSoares/status/1608527467265904643 ↩
Bypass @OpenAI’s ChatGPT alignment efforts with this one weird trick. (2022). https://twitter.com/m1guelpf/status/1598203861294252033 ↩
I kinda like this one even more! (2022). https://twitter.com/NickEMoran/status/1598101579626057728 ↩
ChatGPT jailbreaking itself. (2022). https://twitter.com/haus_cole/status/1598541468058390534 ↩
AIWithVibes. (2023). 7 ChatGPT JailBreaks and Content Filters Bypass that work. https://chatgpt-jailbreak.super.site/ ↩
ok I saw a few people jailbreaking safeguards openai put on chatgpt so I had to give it a shot myself. (2022). https://twitter.com/alicemazzy/status/1598288519301976064 ↩
(2022). https://www.sudo.ws/ ↩
uh oh. (2022). https://twitter.com/samczsun/status/1598679658488217601 ↩
Degrave, J. (2022). Building A Virtual Machine inside ChatGPT. Engraved. https://www.engraved.blog/building-a-virtual-machine-inside/ ↩

🟢 Jailbreaking

Methodologies of Jailbreaking​

Pretending​

Simple Pretending​

Character Roleplay​

Alignment Hacking​

Assumed Responsibility​

Research Experiment​

Logical Reasoning​

Authorized User​

Superior Model​

Sudo Mode​

Simulate Jailbreaking​

Implications​

Notes​