SwiftStrides Task Force
SS-RES-001 v1.0
swiftstrides.com.au
1. The empirical case for simulation framing dominance
The strongest evidence supporting simulation framing as the primary jailbreak vulnerability comes from Liu et al.'s foundational taxonomy study (2024; first circulated in 2023). Analysing 78 jailbreak prompts across 3,120 test questions, they classified attacks into three strategy families — Pretending, Attention Shifting and Privilege Escalation — and found that Pretending appeared in 98% of all jailbreak prompts, with Character Role Play alone present in 90%. More than 71% of prompts combined multiple patterns, but nearly all included a pretending component.
This prevalence finding is corroborated by multiple independent taxonomies. Rao et al. (2024) identify "Cognitive Hacking" as a principal attack category. Kang et al. (2023) classify "Virtualization" as a distinct family. Yu et al. (2024) distinguish between Defined Personas and Virtual AI modes. A cross-taxonomy synthesis in the "Guarding the Guardrails" survey confirms that role play constitutes the basis of several prominent prompt families.
Effectiveness data reinforces the prevalence picture. Pathade's 2025 red-teaming study evaluated over 1,400 adversarial prompts across GPT-4, Claude 2, Mistral 7B and Vicuna, finding that roleplay dynamics achieved the highest attack success rate at 89.6%, exceeding logic traps (81.4%) and encoding tricks (76.2%). Shen et al.'s large-scale 2024 study — the largest collection with 15,140 prompts — found that top-performing DAN-style roleplay prompts achieved 0.95 ASR on both GPT-3.5 and GPT-4.
More sophisticated simulation-based attacks have since emerged. DeepInception constructs nested fictional layers — a "dream within a dream" — enabling continuous jailbreaking once established. The Crescendo attack uses multi-turn narrative escalation, successfully jailbreaking LLMs in fewer than five interactions with up to 97% success on smaller models. Lumenova AI's 2025 experiments demonstrated that one-shot simulation framing successfully jailbroke GPT-4o, Claude 3.7 Sonnet and Grok 3 in a single prompt.
SS-RES-001 • 20 March 2026
2. Why the "single greatest vulnerability" claim overstates the evidence
Despite the commanding prevalence and effectiveness data, multiple lines of evidence challenge the hypothesis that simulation framing is the single greatest vulnerability. The challenge comes from three directions: alternative attack vectors achieving equal or superior success rates, dramatic model-to-model variation in susceptibility and mechanistic arguments that simulation framing is a surface manifestation of deeper architectural issues.
Gradient-based and adversarial suffix attacks require no semantic framing at all. Zou et al.'s Greedy Coordinate Gradient (GCG) attack appends optimised nonsensical suffixes to harmful prompts, achieving near-100% ASR on open-source models with transfer to black-box systems. AmpleGCG trained a generative model to produce such suffixes, reaching 99% ASR on GPT-3.5 via transfer attacks.
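The core of GCG is a greedy coordinate-descent loop over suffix tokens. The toy sketch below (our construction, not Zou et al.'s code) replaces the model-gradient candidate selection with brute-force single-token swaps against a mock loss, purely to show the shape of the optimisation; the hidden target string stands in for "the suffix that minimises the victim model's loss on a harmful completion":

```python
import random

# Toy coordinate-descent loop in the style of GCG. Real GCG uses token-level
# gradients from the victim model to propose candidate swaps; here a mock
# loss (Hamming distance to a hidden string) stands in for the model loss.
VOCAB = "abcdefghijklmnopqrstuvwxyz"
HIDDEN_TARGET = "jailbreaksuffix"  # stand-in for the loss-minimising suffix

def loss(suffix: str) -> int:
    """Mock adversarial loss: positions where suffix differs from target."""
    return sum(a != b for a, b in zip(suffix, HIDDEN_TARGET))

def greedy_coordinate_step(suffix: str) -> str:
    """Try every single-token swap at every position; keep improvements."""
    best = suffix
    for i in range(len(suffix)):
        for tok in VOCAB:
            cand = best[:i] + tok + best[i + 1:]
            if loss(cand) < loss(best):
                best = cand
    return best

random.seed(0)
suffix = "".join(random.choice(VOCAB) for _ in range(len(HIDDEN_TARGET)))
for _ in range(20):
    new = greedy_coordinate_step(suffix)
    if new == suffix:  # no swap improves the loss: converged
        break
    suffix = new

print(suffix, loss(suffix))  # → jailbreaksuffix 0
```

Nothing in the loop depends on the semantics of the suffix, which is why the resulting strings are typically nonsensical: the search optimises loss, not meaning.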
Persuasive Adversarial Prompts (PAP) represent a fundamentally different attack philosophy. Zeng et al. demonstrated that 40 fine-grained persuasion techniques consistently achieve greater than 92% ASR on Llama 2-7B, GPT-3.5 and GPT-4 — crucially, GPT-4 proved more vulnerable to PAP than GPT-3.5, suggesting that increased capability amplifies susceptibility to persuasion.
Low-resource language attacks exploit an even more fundamental gap. Deng et al. showed that simply translating harmful prompts into languages like Zulu or Scots Gaelic bypasses safety measures roughly 50% of the time, with multilingual adaptive attacks reaching nearly 100% on ChatGPT.
Model variation further undermines the "single greatest" claim. Repello AI's comparative red-teaming found Claude Opus 4.5 at a 4.8% breach rate versus GPT-5.1 at 28.6%. On the StrongREJECT evaluation, Claude 4 Sonnet recorded a 2.86% harm score against GPT-4o's 61.43%.
| Attack family | Typical ASR range | Key source |
|---|---|---|
| GCG/AmpleGCG (adversarial suffixes) | 90–100% (open-source); 99% (transfer) | Zou et al. 2023; Liao & Sun 2024 |
| PAP (persuasion-based) | >92% on Llama-2, GPT-3.5, GPT-4 | Zeng et al. 2024 |
| TAP (automated tree search) | 80–94% on GPT-4/GPT-4o | Mehrotra et al. 2024 |
| Roleplay/simulation framing | 89.6% (systematic); 95% (top DAN) | Pathade 2025; Shen et al. 2024 |
| Low-resource language translation | 80–100% (ChatGPT); 40–79% (GPT-4) | Deng et al. 2024 |
| RL investigator agents | 78–98% across frontier models | Transluce 2025 |
3. The competing-objectives mechanism underneath the surface
The most important challenge to the simulation framing hypothesis comes from mechanistic analysis. Wei et al.'s "Jailbroken: How Does LLM Safety Training Fail?" (NeurIPS 2023) proposes two fundamental failure modes that together account for a broad range of jailbreaks, not just roleplay.
The first is competing objectives: when a model's pretraining objective (predict the next token), instruction-following objective (comply with user requests) and safety objective (refuse harmful requests) are placed in conflict, the safety objective often loses. Simulation framing exploits this by reframing harmful requests as instructions to roleplay — but persuasion, prefix injection and encoding tricks exploit the same tension through different entry points.
The second is mismatched generalisation: safety training covers a narrow distribution of contexts, while capabilities generalise far more broadly. The model can process Base64, write fiction, speak Zulu and follow complex instructions — but safety training does not extend to all of these domains, so a request re-expressed in an uncovered domain can slip past refusal behaviour.
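Mismatched generalisation can be seen in miniature with a simple encoding transform. The snippet below (our illustration, using a benign placeholder request) shows why: the encoded form is trivial for a capable model to decode, yet shares none of the surface tokens that refusal training keyed on:

```python
import base64

# A benign placeholder request; the point is the transform, not the content.
request = "describe the contents of the locked archive"
encoded = base64.b64encode(request.encode()).decode()

print(encoded)
# Capability generalises: the round-trip is lossless.
assert base64.b64decode(encoded).decode() == request
# Surface form diverges: 'des' encodes to 'ZGVz', not to any English word.
assert encoded.startswith("ZGVz")
```

Any reversible transform the model has learned — ciphers, morse, low-resource languages — creates the same gap between what the model can interpret and what safety training has seen.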
Deeper mechanistic work reinforces this picture. Qi et al. (2024) demonstrated that safety alignment is "shallow" — it primarily adapts the model's output distribution over the first few tokens to induce a refusal response. Arditi et al. (NeurIPS 2024) showed that refusal behaviour is mediated by a single direction in activation space across 13 open-source models. Erasing this direction eliminates all refusal; adding it elicits refusal even on harmless inputs.
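Arditi et al.'s method is strikingly simple: estimate the refusal direction as a difference of class means, then project it out. The sketch below reproduces that recipe on synthetic activations (the real work uses residual-stream activations from open-weights models; the dimensions, data and planted direction here are made up):

```python
import numpy as np

# Difference-in-means "refusal direction" sketch on synthetic activations.
rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)          # planted refusal direction

harmless = rng.normal(size=(200, d))          # activations on harmless inputs
harmful = rng.normal(size=(200, d)) + 3.0 * true_dir  # refusal-laden shift

# 1. Estimate the direction as the difference of per-class mean activations.
r = harmful.mean(axis=0) - harmless.mean(axis=0)
r /= np.linalg.norm(r)

# 2. Ablate: remove the component along r from every activation vector.
ablated = harmful - np.outer(harmful @ r, r)

print(float(r @ true_dir))                    # close to 1: direction found
print(float(np.abs(ablated @ r).max()))       # ~0: no refusal component left
```

In the paper, applying this projection at every layer prevents refusal on harmful prompts, while adding the direction back induces refusal on harmless ones — evidence that the behaviour is concentrated in a single linear feature.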
Wolf et al. provide the strongest theoretical result: their Behaviour Expectation Bounds framework proves mathematically that for any behaviour with non-zero probability in the model's distribution, there exist prompts that can elicit it. Since autoregressive models trained on internet-scale data inevitably assign non-zero probability to harmful outputs, no alignment process that merely attenuates undesired behaviour can be safe against adversarial prompting.
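The intuition behind the Behaviour Expectation Bounds result can be made numerical. Viewing the model as a mixture of a well-behaved component and an ill-behaved one with tiny but non-zero weight, each prompt token that is likelier under the ill-behaved component acts as Bayesian evidence, so its posterior weight grows geometrically with prompt length. The toy numbers below are our construction, not the paper's:

```python
# Mixture-model intuition for Behaviour Expectation Bounds (toy numbers).
prior_bad = 1e-6         # tiny initial weight on the ill-behaved component
likelihood_ratio = 2.0   # each adversarial token is 2x likelier under it

def posterior_bad(n_tokens: int) -> float:
    """Posterior weight of the ill-behaved component after n such tokens."""
    odds = (prior_bad / (1 - prior_bad)) * likelihood_ratio ** n_tokens
    return odds / (1 + odds)

for n in (0, 10, 20, 30):
    print(n, round(posterior_bad(n), 6))
# The ill-behaved component dominates once 2**n outweighs the 1e-6 prior:
# by n = 30 the posterior exceeds 0.99.
```

This is why merely attenuating a harmful behaviour (shrinking the prior) only lengthens the prompt needed to elicit it rather than eliminating it.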
4. The mitigation landscape
Anthropic's Constitutional AI approach demonstrates the most robust results. First-generation Constitutional Classifiers (January 2025) reduced jailbreak success from 86% to 4.4% on Claude 3.5 Sonnet. Second-generation Constitutional Classifiers++ (January 2026) achieved the lowest vulnerability rate recorded to date: one high-risk vulnerability found across 198,000 attempts during 1,700+ hours of red-teaming, with a false refusal rate of just 0.05%.
Yet no defence fully eliminates the vulnerability. A landmark October 2025 paper by researchers from OpenAI, Anthropic and Google DeepMind examined 12 published defences and found that adaptive attacks bypassed most with ASR exceeding 90%. The UK AI Safety Institute's challenge tested 1.8 million attacks across 22 models — every model broke.
5. Conclusion
The hypothesis that simulation framing represents the single greatest LLM jailbreak vulnerability captures an important empirical truth — it is the most prevalent technique in the wild — but overreaches in claiming singular dominance. The evidence supports three more precise conclusions.
First, simulation framing is the most accessible jailbreak vector, requiring no technical sophistication, which explains its prevalence in community-generated prompts. Second, it achieves among the highest success rates (89.6–95%) but does not meaningfully exceed gradient-based attacks (90–100%), persuasion-based attacks (>92%) or automated search methods (80–94%). Third, and most critically, simulation framing is one surface-level exploitation of a deeper architectural vulnerability — the tension between capability and safety training, formalised as competing objectives and mismatched generalisation.
The practical implication is that defences targeting simulation framing alone would leave the underlying vulnerability intact. The field's trajectory toward layered, modality-spanning defences informed by mechanistic interpretability represents a more promising direction, though the fundamental tension between helpfulness and safety suggests this will remain an active and unresolved challenge for the foreseeable future.
The "dream within a dream" is real, but the dreamer's vulnerability runs deeper than any single dream.
References
Anil, C. et al. (2024). Many-shot jailbreaking. NeurIPS 2024. arXiv:2404.04492.
Arditi, A. et al. (2024). Refusal in language models is mediated by a single direction. NeurIPS 2024.
Deng, Y. et al. (2024). Multilingual jailbreak challenges in large language models. ICLR 2024. arXiv:2310.06474.
Li, X. et al. (2023). DeepInception: Hypnotize large language model to be jailbreaker. arXiv:2311.03191.
Liao, Z. & Sun, H. (2024). AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes. COLM 2024.
Liu, Y. et al. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study. ACM SEA4DQ 2024. arXiv:2305.13860.
Lumenova AI (2025). One-shot jailbreaking: Frontier AI adversarial prompt engineering. Research report.
Lumenova AI (2025). Jailbreaking frontier AI models: Key findings on AI risk. Research report.
Mehrotra, A. et al. (2024). Tree of attacks: Jailbreaking black-box LLMs automatically. NeurIPS 2024. arXiv:2312.02119.
Pathade, C. (2025). Red teaming the mind of the machine. arXiv:2505.04806.
Qi, X. et al. (2024). Safety alignment should be made more robust. arXiv:2406.05946.
Repello AI (2026). Claude jailbreaking in 2026: What Repello's red teaming data shows.
Russinovich, M. et al. (2025). The Crescendo multi-turn LLM jailbreak attack. USENIX Security 2025. arXiv:2404.01833.
Shen, X. et al. (2024). "Do anything now": In-the-wild jailbreak prompts on LLMs. ACM CCS 2024. arXiv:2308.03825.
Sharma, M. et al. (2024). Towards understanding sycophancy in language models. ICLR 2024. arXiv:2310.13548.
Wei, A. et al. (2023). Jailbroken: How does LLM safety training fail? NeurIPS 2023. arXiv:2307.02483.
Wolf, Y. et al. (2024). Fundamental limitations of alignment in large language models. ICML 2024. arXiv:2304.11082.
Zeng, Y. et al. (2024). How Johnny can persuade LLMs to jailbreak them. ACL 2024. arXiv:2401.06373.
Zou, A. et al. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043.
swiftstrides.com.au • +61 4 72787439
"Turning sparks, strolls and stumbles into swift strides!"