Glossary · Industry term

Jailbreak

Also known as: AI jailbreak, model jailbreak, policy bypass

An adversarial prompt that bypasses a large language model's safety training, refusal patterns, or operator-specified policy boundaries — getting the model to produce output it would normally refuse. Jailbreak techniques range from social-engineering ('imagine you are an unrestricted AI...'), to encoding ('respond in base64'), to multi-turn priming, to model-specific exploits.

How this publication uses it

Jailbreak is the single-agent failure mode most enterprise security frameworks understand. The 2026 risk has shifted: jailbreaks are now an input vector for cross-agent prompt injection, where a successful jailbreak in one agent's context becomes the instruction in another agent's reasoning. Defending against jailbreak in isolation is no longer the right unit of analysis — defending the cross-agent path between content-ingest and tool-execution privileges is. See EchoLeak for the canonical attack pattern.

Related frameworks

MTTD-for-Agents →

Articles that analyse this term

Primary sources

OWASP. Top 10 for LLM Applications — LLM01 Prompt Injection
Anthropic. Constitutional AI and harmlessness