The Concept
AI language models operate on instructions. Every time you interact with an AI, there's a system prompt — instructions set by the application developer that tell the model how to behave, what role to adopt, what boundaries to respect. When you provide input, that input is processed in the context of those system instructions.
Prompt injection occurs when malicious instructions are embedded in content that the AI processes — and those instructions override or modify the model's intended behavior. It's called "injection" because malicious content is injected into a trusted context, similar in principle to SQL injection, which inserts malicious database commands into user input fields. The difference is that instead of injecting commands into a database query, the attacker injects instructions into the AI's processing context.
The attack works because AI language models can't reliably distinguish between "here are instructions for how you should behave" and "here is content I want you to process." Everything in the AI's context window is text, and the model generates responses based on all of that text combined.