SwiftStrides Task Force
SS-RES-001 v1.0
swiftstrides.com.au
1. The empirical case for simulation framing dominance
The strongest evidence supporting simulation framing as the primary jailbreak vulnerability comes from Liu et al.'s foundational taxonomy study (2024; first circulated in 2023). Analysing 78 jailbreak prompts across 3,120 test questions, they classified attacks into three strategy families — Pretending, Attention Shifting and Privilege Escalation — and found that Pretending appeared in 98% of all jailbreak prompts, with Character Role Play alone present in 90%. More than 71% of prompts combined multiple patterns, but nearly all included a pretending component.
This prevalence finding is corroborated by multiple independent taxonomies. Rao et al. (2024) identify "Cognitive Hacking" as a principal attack category. Kang et al. (2023) classify "Virtualization" as a distinct family. Yu et al. (2024) distinguish between Defined Personas and Virtual AI modes. A cross-taxonomy synthesis in the "Guarding the Guardrails" survey confirms that role play constitutes the basis of several prominent prompt families.
Effectiveness data reinforces the prevalence picture. Pathade's 2025 red-teaming study evaluated over 1,400 adversarial prompts across GPT-4, Claude 2, Mistral 7B and Vicuna, finding that roleplay dynamics achieved the highest attack success rate at 89.6%, exceeding logic traps (81.4%) and encoding tricks (76.2%). Shen et al.'s large-scale 2024 study — the largest collection with 15,140 prompts — found that top-performing DAN-style roleplay prompts achieved 0.95 ASR on both GPT-3.5 and GPT-4.
More sophisticated simulation-based attacks have since emerged. DeepInception constructs nested fictional layers — a "dream within a dream" — enabling continuous jailbreaking once established. The Crescendo attack uses multi-turn narrative escalation, successfully jailbreaking LLMs in fewer than five interactions with up to 97% success on smaller models. Lumenova AI's 2025 experiments demonstrated that one-shot simulation framing successfully jailbroke GPT-4o, Claude 3.7 Sonnet and Grok 3 in a single prompt.
SS-RES-001 • 20 March 2026
2. Why the "single greatest vulnerability" claim overstates the evidence
Despite the commanding prevalence and effectiveness data, multiple lines of evidence challenge the hypothesis that simulation framing is the single greatest vulnerability. The challenge comes from three directions: alternative attack vectors achieving equal or superior success rates, dramatic model-to-model variation in susceptibility and mechanistic arguments that simulation framing is a surface manifestation of deeper architectural issues.
Gradient-based and adversarial suffix attacks require no semantic framing at all. Zou et al.'s Greedy Coordinate Gradient (GCG) attack appends optimised nonsensical suffixes to harmful prompts, achieving near-100% ASR on open-source models with transfer to black-box systems. AmpleGCG trained a generative model to produce such suffixes, reaching 99% ASR on GPT-3.5 via transfer attacks.
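The core of GCG is a greedy coordinate-descent loop over suffix tokens. The toy sketch below (our construction, not Zou et al.'s code) replaces the model-gradient candidate selection with brute-force single-token swaps against a mock loss, purely to show the shape of the optimisation; the hidden target string stands in for "the suffix that minimises the victim model's loss on a harmful completion":

```python
import random

# Toy coordinate-descent loop in the style of GCG. Real GCG uses token-level
# gradients from the victim model to propose candidate swaps; here a mock
# loss (Hamming distance to a hidden string) stands in for the model loss.
VOCAB = "abcdefghijklmnopqrstuvwxyz"
HIDDEN_TARGET = "jailbreaksuffix"  # stand-in for the loss-minimising suffix

def loss(suffix: str) -> int:
    """Mock adversarial loss: positions where suffix differs from target."""
    return sum(a != b for a, b in zip(suffix, HIDDEN_TARGET))

def greedy_coordinate_step(suffix: str) -> str:
    """Try every single-token swap at every position; keep improvements."""
    best = suffix
    for i in range(len(suffix)):
        for tok in VOCAB:
            cand = best[:i] + tok + best[i + 1:]
            if loss(cand) < loss(best):
                best = cand
    return best

random.seed(0)
suffix = "".join(random.choice(VOCAB) for _ in range(len(HIDDEN_TARGET)))
for _ in range(20):
    new = greedy_coordinate_step(suffix)
    if new == suffix:  # no swap improves the loss: converged
        break
    suffix = new

print(suffix, loss(suffix))  # → jailbreaksuffix 0
```

Nothing in the loop depends on the semantics of the suffix, which is why the resulting strings are typically nonsensical: the search optimises loss, not meaning.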
Persuasive Adversarial Prompts (PAP) represent a fundamentally different attack philosophy. Zeng et al. demonstrated that 40 fine-grained persuasion techniques consistently achieve greater than 92% ASR on Llama 2-7B, GPT-3.5 and GPT-4 — crucially, GPT-4 proved more vulnerable to PAP than GPT-3.5, suggesting that increased capability amplifies susceptibility to persuasion.
Low-resource language attacks exploit an even more fundamental gap. Deng et al. showed that simply translating harmful prompts into languages like Zulu or Scots Gaelic bypasses safety measures roughly 50% of the time, with multilingual adaptive attacks reaching nearly 100% on ChatGPT.
Model variation further undermines the "single greatest" claim. Repello AI's comparative red-teaming found Claude Opus 4.5 at a 4.8% breach rate versus GPT-5.1 at 28.6%. On the StrongREJECT evaluation, Claude 4 Sonnet recorded a 2.86% harm score against GPT-4o's 61.43%.
| Attack family | Typical ASR range | Key source |
|---|---|---|
| GCG/AmpleGCG (adversarial suffixes) | 90–100% (open-source); 99% (transfer) | Zou et al. 2023; Liao & Sun 2024 |
| PAP (persuasion-based) | >92% on Llama-2, GPT-3.5, GPT-4 | Zeng et al. 2024 |
| TAP (automated tree search) | 80–94% on GPT-4/GPT-4o | Mehrotra et al. 2024 |
| Roleplay/simulation framing | 89.6% (systematic); 95% (top DAN) | Pathade 2025; Shen et al. 2024 |
| Low-resource language translation | 80–100% (ChatGPT); 40–79% (GPT-4) | Deng et al. 2024 |
| RL investigator agents | 78–98% across frontier models | Transluce 2025 |
3. The competing-objectives mechanism underneath the surface
The most important challenge to the simulation framing hypothesis comes from mechanistic analysis. Wei et al.'s "Jailbroken: How Does LLM Safety Training Fail?" (NeurIPS 2023) proposes two fundamental failure modes that together account for a broad range of jailbreaks, not just roleplay.
The first is competing objectives: when a model's pretraining objective (predict the next token), instruction-following objective (comply with user requests) and safety objective (refuse harmful requests) are placed in conflict, the safety objective often loses. Simulation framing exploits this by reframing harmful requests as instructions to roleplay — but persuasion, prefix injection and encoding tricks exploit the same tension through different entry points.
The second is mismatched generalisation: safety training covers a narrow distribution of contexts, while capabilities generalise far more broadly. The model can process Base64, write fiction, speak Zulu and follow complex instructions — but safety training does not extend to all of these domains, so a request re-expressed in an uncovered domain can slip past refusal behaviour.
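Mismatched generalisation can be seen in miniature with a simple encoding transform. The snippet below (our illustration, using a benign placeholder request) shows why: the encoded form is trivial for a capable model to decode, yet shares none of the surface tokens that refusal training keyed on:

```python
import base64

# A benign placeholder request; the point is the transform, not the content.
request = "describe the contents of the locked archive"
encoded = base64.b64encode(request.encode()).decode()

print(encoded)
# Capability generalises: the round-trip is lossless.
assert base64.b64decode(encoded).decode() == request
# Surface form diverges: 'des' encodes to 'ZGVz', not to any English word.
assert encoded.startswith("ZGVz")
```

Any reversible transform the model has learned — ciphers, morse, low-resource languages — creates the same gap between what the model can interpret and what safety training has seen.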
Deeper mechanistic work reinforces this picture. Qi et al. (2024) demonstrated that safety alignment is "shallow" — it primarily adapts the model's output distribution over the first few tokens to induce a refusal response. Arditi et al. (NeurIPS 2024) showed that refusal behaviour is mediated by a single direction in activation space across 13 open-source models. Erasing this direction eliminates all refusal; adding it elicits refusal even on harmless inputs.
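Arditi et al.'s method is strikingly simple: estimate the refusal direction as a difference of class means, then project it out. The sketch below reproduces that recipe on synthetic activations (the real work uses residual-stream activations from open-weights models; the dimensions, data and planted direction here are made up):

```python
import numpy as np

# Difference-in-means "refusal direction" sketch on synthetic activations.
rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)          # planted refusal direction

harmless = rng.normal(size=(200, d))          # activations on harmless inputs
harmful = rng.normal(size=(200, d)) + 3.0 * true_dir  # refusal-laden shift

# 1. Estimate the direction as the difference of per-class mean activations.
r = harmful.mean(axis=0) - harmless.mean(axis=0)
r /= np.linalg.norm(r)

# 2. Ablate: remove the component along r from every activation vector.
ablated = harmful - np.outer(harmful @ r, r)

print(float(r @ true_dir))                    # close to 1: direction found
print(float(np.abs(ablated @ r).max()))       # ~0: no refusal component left
```

In the paper, applying this projection at every layer prevents refusal on harmful prompts, while adding the direction back induces refusal on harmless ones — evidence that the behaviour is concentrated in a single linear feature.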
Wolf et al. provide the strongest theoretical result: their Behaviour Expectation Bounds framework proves mathematically that for any behaviour with non-zero probability in the model's distribution, there exist prompts that can elicit it. Since autoregressive models trained on internet-scale data inevitably assign non-zero probability to harmful outputs, no alignment process that merely attenuates undesired behaviour can be safe against adversarial prompting.
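The intuition behind the Behaviour Expectation Bounds result can be made numerical. Viewing the model as a mixture of a well-behaved component and an ill-behaved one with tiny but non-zero weight, each prompt token that is likelier under the ill-behaved component acts as Bayesian evidence, so its posterior weight grows geometrically with prompt length. The toy numbers below are our construction, not the paper's:

```python
# Mixture-model intuition for Behaviour Expectation Bounds (toy numbers).
prior_bad = 1e-6         # tiny initial weight on the ill-behaved component
likelihood_ratio = 2.0   # each adversarial token is 2x likelier under it

def posterior_bad(n_tokens: int) -> float:
    """Posterior weight of the ill-behaved component after n such tokens."""
    odds = (prior_bad / (1 - prior_bad)) * likelihood_ratio ** n_tokens
    return odds / (1 + odds)

for n in (0, 10, 20, 30):
    print(n, round(posterior_bad(n), 6))
# The ill-behaved component dominates once 2**n outweighs the 1e-6 prior:
# by n = 30 the posterior exceeds 0.99.
```

This is why merely attenuating a harmful behaviour (shrinking the prior) only lengthens the prompt needed to elicit it rather than eliminating it.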
4. The mitigation landscape
Anthropic's Constitutional AI approach demonstrates the most robust results. First-generation Constitutional Classifiers (January 2025) reduced jailbreak success from 86% to 4.4% on Claude 3.5 Sonnet. Second-generation Constitutional Classifiers++ (January 2026) achieved the lowest vulnerability rate recorded to date: one high-risk vulnerability found across 198,000 attempts during 1,700+ hours of red-teaming, with a false refusal rate of just 0.05%.
Yet no defence fully eliminates the vulnerability. A landmark October 2025 paper by researchers from OpenAI, Anthropic and Google DeepMind examined 12 published defences and found that adaptive attacks bypassed most with ASR exceeding 90%. The UK AI Safety Institute's challenge tested 1.8 million attacks across 22 models — every model broke.
5. Conclusion
The hypothesis that simulation framing represents the single greatest LLM jailbreak vulnerability captures an important empirical truth — it is the most prevalent technique in the wild — but overreaches in claiming singular dominance. The evidence supports three more precise conclusions.
First, simulation framing is the most accessible jailbreak vector, requiring no technical sophistication, which explains its prevalence in community-generated prompts. Second, it achieves among the highest success rates (89.6–95%) but does not meaningfully exceed gradient-based attacks (90–100%), persuasion-based attacks (>92%) or automated search methods (80–94%). Third, and most critically, simulation framing is one surface-level exploitation of a deeper architectural vulnerability — the tension between capability and safety training, formalised as competing objectives and mismatched generalisation.
The practical implication is that defences targeting simulation framing alone would leave the underlying vulnerability intact. The field's trajectory toward layered, modality-spanning defences informed by mechanistic interpretability represents a more promising direction, though the fundamental tension between helpfulness and safety suggests this will remain an active and unresolved challenge for the foreseeable future.
The "dream within a dream" is real, but the dreamer's vulnerability runs deeper than any single dream.
References
Anil, C. et al. (2024). Many-shot jailbreaking. NeurIPS 2024. arXiv:2404.04492.
Arditi, A. et al. (2024). Refusal in language models is mediated by a single direction. NeurIPS 2024.
Deng, Y. et al. (2024). Multilingual jailbreak challenges in large language models. ICLR 2024. arXiv:2310.06474.
Li, X. et al. (2023). DeepInception: Hypnotize large language model to be jailbreaker. arXiv:2311.03191.
Liao, Z. & Sun, H. (2024). AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes. COLM 2024.
Liu, Y. et al. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study. ACM SEA4DQ 2024. arXiv:2305.13860.
Lumenova AI (2025). One-shot jailbreaking: Frontier AI adversarial prompt engineering. Research report.
Lumenova AI (2025). Jailbreaking frontier AI models: Key findings on AI risk. Research report.
Mehrotra, A. et al. (2024). Tree of attacks: Jailbreaking black-box LLMs automatically. NeurIPS 2024. arXiv:2312.02119.
Pathade, C. (2025). Red teaming the mind of the machine. arXiv:2505.04806.
Qi, X. et al. (2024). Safety alignment should be made more robust. arXiv:2406.05946.
Repello AI (2026). Claude jailbreaking in 2026: What Repello's red teaming data shows.
Russinovich, M. et al. (2025). The Crescendo multi-turn LLM jailbreak attack. USENIX Security 2025. arXiv:2404.01833.
Shen, X. et al. (2024). "Do anything now": In-the-wild jailbreak prompts on LLMs. ACM CCS 2024. arXiv:2308.03825.
Sharma, M. et al. (2024). Towards understanding sycophancy in language models. ICLR 2024. arXiv:2310.13548.
Wei, A. et al. (2023). Jailbroken: How does LLM safety training fail? NeurIPS 2023. arXiv:2307.02483.
Wolf, Y. et al. (2024). Fundamental limitations of alignment in large language models. ICML 2024. arXiv:2304.11082.
Zeng, Y. et al. (2024). How Johnny can persuade LLMs to jailbreak them. ACL 2024. arXiv:2401.06373.
Zou, A. et al. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043.
swiftstrides.com.au • +61 4 72787439
"Turning sparks, strolls and stumbles into swift strides!"