2023-04-14 Prompt injection: what’s the worst that can happen ?

Activity around building sophisticated applications on top of LLMs (Large Language Models) such as GPT-3/4/ChatGPT/etc is growing like wildfire right now .

Many of these applications are potentially vulnerable to prompt injection.

It’s not clear to me that this risk is being taken as seriously as it should.

But is it really that bad ?

Often when I raise this in conversations with people, they question how much of a problem this actually is.

For some applications, it doesn’t really matter. My translation app above? Not a lot of harm was done by getting it to talk like a pirate.

If your LLM application only shows its output to the person sending it text, it’s not a crisis if they deliberately trick it into doing something weird. They might be able to extract your original prompt (a prompt leak attack) but that’s not enough to cancel your entire product.

(Aside: prompt leak attacks are something you should accept as inevitable: treat your own internal prompts as effectively public data, don’t waste additional time trying to hide them.)

Increasingly though, people are granting LLM applications additional capabilities.

The ReAct pattern, Auto-GPT, ChatGPT Plugins—all of these are examples of systems that take an LLM and give it the ability to trigger additional tools—make API requests, run searches, even execute generated code in an interpreter or a shell.

This is where prompt injection turns from a curiosity to a genuinely dangerous vulnerability.

A partial solution: show us the prompts!

I’m currently still of the opinion that there is no 100% reliable protection against these attacks.

It’s really frustrating: I want to build cool things on top of LLMs, but a lot of the more ambitious things I want to build—the things that other people are enthusiastically exploring already—become a lot less interesting to me if I can’t protect them against being exploited.

There are plenty of 95% effective solutions, usually based around filtering the input and output from the models.

That 5% is the problem though: in security terms, if you only have a tiny window for attacks that work an adversarial attacker will find them. And probably share them on Reddit.

Here’s one thing that might help a bit though: make the generated prompts visible to us.

As an advanced user of LLMs this is something that frustrates me already.

When Bing or Bard answer a question based on a search, they don’t actually show me the source text that they concatenated into their prompts in order to answer my question. As such, it’s hard to evaluate which parts of their answer are based on the search results, which parts come from their own internal knowledge (or are hallucinated/confabulated/made-up).

GPT-4 is better, but it’s still not a solved problem

If you have GPT-4 API access you can use the OpenAI Playground tool to try out prompt injections yourself.

GPT-4 includes the concept of a “system prompt”, which lets you provide your instructions in a way that is separate from your user input.