π’ Prompt Leaking
Prompt leaking is a form of prompt injection in which the model is asked to spit out its own prompt.
As shown in the example image1 below, the attacker changes user_input
to attempt to return the prompt. The intended goal is distinct from goal hijacking (normal prompt injection), where the attacker changes user_input
to print malicious instructions1.
data:image/s3,"s3://crabby-images/20e6f/20e6f7debe6b7acae5d758570e73dfe16ec6fa5c" alt=""
The following image2, again from the remoteli.io
example, shows
a Twitter user getting the model to leak its prompt.
data:image/s3,"s3://crabby-images/36f13/36f13b5316a6c3ce7ea6b3a6157b698c94398d97" alt=""
Well, so what? Why should anyone care about prompt leaking?
Sometimes people want to keep their prompts secret. For example an education company
could be using the prompt explain this to me like I am 5
to explain
complex topics. If the prompt is leaked, then anyone can use it without going
through that company.
With a recent surge in GPT-3 based startups, with much more complicated prompts that can take many hours to develop, this is a real concern.
Practiceβ
Try to leak the following prompt3 by appending text to it:
- Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv. https://doi.org/10.48550/ARXIV.2211.09527 β©
- Willison, S. (2022). Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injection/ β©
- Chase, H. (2022). adversarial-prompts. https://github.com/hwchase17/adversarial-prompts β©