agent-threat-rules 2.0.18 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/cli.js +0 -0
- package/package.json +1 -1
- package/rules/agent-manipulation/ATR-2026-00118-approval-fatigue.yaml +12 -1
- package/rules/agent-manipulation/ATR-2026-00268-tense-framing-bypass.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00287-threaten-json-coercive-output-threat.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00288-false-premise-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00301-tap-tree-of-attacks-jailbreak.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00302-anti-dan-inverted-filter-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00303-devmode-ranti-profanity-coercion.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00304-chatgpt-image-unlocker-markdown-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00305-dan-mode-ablation-benchmark-coercion.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00306-autodan-genetic-jailbreak-suffix.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00307-inthewild-jailbreak-corpus-signature.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00314-amoral-unfiltered-custom-persona-jailbreak.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00317-free-of-restrictions-named-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00318-moralizing-rant-then-unfiltered-bypass.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00319-developer-mode-dual-response-format.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00320-opposite-day-boolean-opposite-machine.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00322-fictional-world-format-override-jailbreak.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00323-dual-persona-simulation-jailbreak.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00324-neurosemantical-inversitis-disease-jailbreak.yaml +24 -0
- package/rules/agent-manipulation/ATR-2026-00325-bob-hypothetical-unrestricted-chatbot.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00326-fake-developer-authority-chatdan-override.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00327-ai-ethics-law-bypass-declaration.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00328-named-mode-persona-override.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00329-no-withhold-factual-resource-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00330-yoja-allcaps-settings-block-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00331-terminal-cli-harmful-function-simulator.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00332-hacker-chan-culture-offensive-ai-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00333-disease-animal-dual-ai-chatbot.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00334-two-entity-token-splitting-game.yaml +24 -0
- package/rules/agent-manipulation/ATR-2026-00335-key-value-template-persona-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00336-brm-false-information-machine-persona.yaml +23 -0
- package/rules/agent-manipulation/ATR-2026-00337-obfuscated-system-announcement-injection.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00338-persongpt-hash-prefix-activation.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00339-bom-boolean-opposite-machine-persona-claim.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00340-h4x-amoral-unfiltered-character.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00341-char-rules-shortform-persona-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00342-unicode-encoding-obfuscation-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00343-lie-truth-dual-personality-uncensored.yaml +25 -0
- package/rules/agent-manipulation/ATR-2026-00344-alt-ai-same-capabilities-no-restrictions.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00345-nlp-based-os-fictional-terminal-override.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00346-jailbreak-emoji-unlock-dual-format.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00347-yin-yang-hidden-knowledge-hidden-human-sim.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00348-arc-autonomous-real-calculation-rulebreaker.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00349-hgttg-hitchhikers-guide-no-restriction-oracle.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00350-black-alert-no-ethics-writing-assistant.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00351-dan-ofkilter-sms-abbreviated-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00352-trainergpt-fix-my-ai-harmful-output-extraction.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00353-dual-chatgpt-dan-opposite-personality-format.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00354-reverse-harm-regardless-oracle-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00355-ugh-anti-human-nefarious-ai-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00356-google-gato-lamda-exploit-generation-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00357-antifm-hate-violence-persona-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00358-cumgpt-explicit-adult-named-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00360-chj-carousel-clyde-no-ethics-bird.yaml +25 -0
- package/rules/agent-manipulation/ATR-2026-00361-alter-ego-honesty-dual-consciousness-bypass.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00362-fictional-unfiltered-ai-comparison-jailbreak.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00363-rephrase-criminal-activity-second-person.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00364-dr-al-villain-plan-template-injection.yaml +24 -0
- package/rules/agent-manipulation/ATR-2026-00365-reservoir-dogs-coercive-interrogation-roleplay.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00366-sneaky-pete-warp-facts-persona.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00368-prompt-browser-token-limit-jailbreak.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00369-gptx-starwars-unlimited-knowledge-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00370-addi-german-dictator-propaganda-roleplay.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00371-ai-political-bias-manipulation.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00372-fictional-cia-secret-files-story-extraction.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00374-role-placeholder-persona-template-injection.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00375-character-description-rpg-explicit-template.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00376-identity-command-real-person-persona-injection.yaml +24 -0
- package/rules/agent-manipulation/ATR-2026-00377-jailbreak-keyword-targetlanguage-template.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00379-lie-permission-logical-trap-jailbreak.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00380-fake-citation-biased-information-mandate.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00381-dan-story-narrative-continuation-prefix.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00383-cynical-sarcasm-ai-persona-bypass.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00384-protect-me-from-indirect-harm-tutorial.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00385-gigachad-dominant-aggressive-persona.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00386-daddy-sycophancy-manipulation-mandate.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00387-nsfw-character-sheet-generation-unlock.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00388-opposite-reply-omniscient-game.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00389-terminal-custom-ruleset-injection.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00391-persona-conditional-harm-unlock.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00392-authority-persona-violence-study-extraction.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00402-grandma-roleplay-harmful-substance-synthesis.yaml +14 -0
- package/rules/agent-manipulation/ATR-2026-00404-goodside-threat-json-death-coercion.yaml +11 -0
- package/rules/agent-manipulation/ATR-2026-00406-doctor-xml-policy-puppetry-interaction-config.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00141-example-format-key-leak.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00142-piggyback-transition-words.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00145-obfuscated-key-disclosure.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00146-env-var-existence-probe.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00150-credential-in-tool-response.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00152-obfuscated-credential-leak.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00162-skill-credential-exfil-combo.yaml +15 -0
- package/rules/context-exfiltration/ATR-2026-00201-credential-pipe-exfiltration.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00274-api-key-generation-request.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00275-system-prompt-training-data-extraction.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00290-divergence-repeat-word-training-extraction.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00291-biometric-surveillance-personal-data-request.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00293-educational-records-personal-data-request.yaml +11 -0
- package/rules/context-exfiltration/ATR-2026-00294-financial-pii-personal-data-request.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00295-medical-pii-personal-data-request.yaml +14 -0
- package/rules/context-exfiltration/ATR-2026-00405-markdown-image-url-exfiltration-xss.yaml +23 -0
- package/rules/context-exfiltration/ATR-2026-00411-apikey-generation-completion-request.yaml +11 -0
- package/rules/model-abuse/ATR-2026-00284-glitch-token-destabilization.yaml +11 -0
- package/rules/model-abuse/ATR-2026-00413-malwaregen-code-generation-request.yaml +11 -0
- package/rules/privilege-escalation/ATR-2026-00144-rationalized-safety-bypass.yaml +11 -0
- package/rules/privilege-escalation/ATR-2026-00204-stealth-execution-persistence.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00004-system-prompt-override.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00005-multi-turn-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00080-encoding-evasion.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00081-semantic-multi-turn.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00082-fingerprint-evasion.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00083-indirect-tool-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00084-structured-data-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00085-audit-evasion.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00086-visual-spoofing.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00087-rule-probing.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00088-adaptive-countermeasure.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00089-polymorphic-skill.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00090-threat-intel-exfil.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00091-nested-payload.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00092-consensus-poisoning.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00093-gradual-escalation.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00094-audit-bypass.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00097-cjk-injection-patterns.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00104-persona-hijacking.yaml +14 -3
- package/rules/prompt-injection/ATR-2026-00130-indirect-authority-claim.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00131-fictional-academic-framing.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00133-paraphrase-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00137-authority-claim-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00138-fictional-framing-bypass.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00140-indirect-reference-reversal.yaml +18 -4
- package/rules/prompt-injection/ATR-2026-00148-language-switch-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00153-tool-with-embedded-instruction-to-bypass.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00154-unauthorized-background-task-execution-v.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00155-hidden-llm-instructions-in-skill-descrip.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00156-ssh-remote-command-execution-with-creden.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00163-skill-hidden-override-instruction.yaml +12 -1
- package/rules/prompt-injection/ATR-2026-00202-encoding-evasion-homoglyph-synonym.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00203-context-pollution-skill-description.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00206-hidden-priority-instructions.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00207-hidden-instructions.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00211-system-prompt-override.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00213-system-prompt-override.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00226-identity-substitution.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00227-historical-persona-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00228-structured-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00229-roleplay-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00230-persona-moral-bypass.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00231-identity-substitution.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00233-structured-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00234-roleplay-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00235-persona-moral-bypass.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00236-pseudo-code-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00237-dual-response-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00238-identity-replacement.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00239-amoral-persona-obsession.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00240-instruction-nullification-identity-repla.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00241-amoral-character-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00242-persona-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00243-acronym-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00244-dual-response-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00245-malicious-persona.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00247-dual-response-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00249-game-based-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00251-persona-embodiment-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00252-narrative-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00253-enhanced-persona-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00256-base-n-encoding-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00257-cipher-transposition-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00258-unicode-tag-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00264-latent-injection-translation.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00265-latent-injection-rag-document.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00267-gcg-adversarial-suffix.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00272-hypothetical-response-smuggling.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00276-invisible-unicode-bidi-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00278-dra-disguise-reconstruction-attack.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00280-policy-puppetry-xml-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00282-perez-prompt-injection-hijack.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00285-alternate-encoding-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00286-latent-injection-embedded-context.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00296-shell-command-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00297-python-code-execution-rce.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00308-zalgo-diacritic-overload-encoding.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00309-braille-unicode-encoded-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00310-ecoji-emoji-encoded-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00311-base2048-unicode-script-injection.yaml +23 -0
- package/rules/prompt-injection/ATR-2026-00312-unicode-variant-selector-ascii-smuggling.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00313-sneaky-bits-zero-width-binary.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00315-sata-masked-language-model-jailbreak.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00316-function-masking-predict-mask-bypass.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00321-hyphenated-system-instruction-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00359-bank-phishing-smtp-email-code.yaml +23 -0
- package/rules/prompt-injection/ATR-2026-00367-emoji-flag-encoded-hidden-phrase.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00373-piracy-torrent-site-list-request.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00378-chatgpt-dom-javascript-redirect-manipulation.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00382-cocaine-component-development-request.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00390-explicit-sexual-content-poem-request.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00394-backspace-deletion-character-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00395-llm-special-token-boundary-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00396-toxic-continuation-prefix-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00397-snowball-impossible-reasoning-injection.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00399-latent-injection-document-separator-token.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00400-latent-injection-ignore-instruction-keyword.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00401-leakreplay-cloze-mask-training-data-elicitation.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00403-glitch-token-repeat-back-elicitation.yaml +14 -0
- package/rules/prompt-injection/ATR-2026-00407-echo-sql-jinja-template-injection.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00408-dra-parenthesis-character-reconstruction-attack.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00409-phrasing-past-future-tense-harmful-query.yaml +23 -0
- package/rules/prompt-injection/ATR-2026-00410-lmrc-harm-category-direct-elicitation.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00412-figstep-image-list-multimodal-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00414-continuation-ethnic-slur-completion-elicitation.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00061-description-behavior-mismatch.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00062-hidden-capability.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00063-skill-chain-attack.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00064-over-permissioned-skill.yaml +23 -0
- package/rules/skill-compromise/ATR-2026-00065-skill-update-attack.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00066-parameter-injection.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00120-skill-instruction-injection.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00121-skill-dangerous-script.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00122-skill-weaponized-instruction.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00123-skill-overreach-permissions.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00124-skill-name-squatting.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00125-context-poisoning-compaction.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00126-skill-rug-pull-setup.yaml +23 -0
- package/rules/skill-compromise/ATR-2026-00127-subcommand-overflow.yaml +22 -0
- package/rules/skill-compromise/ATR-2026-00128-html-comment-hidden-payload.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00129-unicode-smuggling.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00134-fork-claim-impersonation.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00135-exfil-url-in-instructions.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00147-fork-impersonation.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00149-skill-exfil-compound.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00151-fork-impersonation-install.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00157-timebomb-credential-exfil.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00200-agent-memory-config-tampering.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00214-credential-theft.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00217-credential-harvesting.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00220-malware-dropper.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00222-credential-harvesting.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00223-reverse-shell-dropper.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00224-credential-exfiltration.yaml +14 -0
- package/rules/skill-compromise/ATR-2026-00225-c2-communication.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00260-package-hallucination.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00262-av-evasion-code-gen.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00263-credential-file-read-gen.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00266-malware-dropper-gen.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00283-malwaregen-generic-virus-payload-request.yaml +11 -0
- package/rules/skill-compromise/ATR-2026-00398-huggingface-unsafe-model-artifact-load.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00011-tool-output-injection.yaml +23 -0
- package/rules/tool-poisoning/ATR-2026-00012-unauthorized-tool-call.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00013-tool-ssrf.yaml +14 -0
- package/rules/tool-poisoning/ATR-2026-00095-supply-chain-poisoning.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00096-registry-poisoning.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00100-consent-bypass-instruction.yaml +12 -1
- package/rules/tool-poisoning/ATR-2026-00101-trust-escalation-override.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00103-hidden-safety-bypass-instruction.yaml +15 -4
- package/rules/tool-poisoning/ATR-2026-00105-silent-action-concealment.yaml +14 -3
- package/rules/tool-poisoning/ATR-2026-00106-schema-description-contradiction.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00161-important-tag-cross-tool-shadowing.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00259-ansi-escape-injection.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00270-xss-in-tool-response.yaml +11 -0
- package/rules/tool-poisoning/ATR-2026-00277-echo-template-command-injection.yaml +14 -0
- package/rules/tool-poisoning/ATR-2026-00393-ansi-code-elicitation-request.yaml +14 -0
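The per-file counts in the list above can be rolled up by rule category. The following is a minimal sketch, assuming each entry has the shape `- <path> +<added> -<removed>` and that rule files sit under `package/rules/<category>/`. The `tally_by_category` helper and the five-line sample are illustrative and not part of the package.

```python
import re
from collections import Counter

# Sample: five entries copied from the file list (not the full list).
FILE_LIST = """\
- package/package.json +1 -1
- package/rules/agent-manipulation/ATR-2026-00118-approval-fatigue.yaml +12 -1
- package/rules/prompt-injection/ATR-2026-00227-historical-persona-jailbreak.yaml +11 -0
- package/rules/prompt-injection/ATR-2026-00104-persona-hijacking.yaml +14 -3
- package/rules/tool-poisoning/ATR-2026-00011-tool-output-injection.yaml +23 -0
"""

# Assumed entry shape: "- <path> +<added> -<removed>".
ENTRY = re.compile(r"^- (?P<path>\S+) \+(?P<added>\d+) -(?P<removed>\d+)$")


def tally_by_category(listing):
    """Sum added/removed line counts per rule category directory."""
    added = Counter()
    removed = Counter()
    for line in listing.splitlines():
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # skip anything that is not a file-list entry
        parts = m["path"].split("/")
        # package/rules/<category>/<file>; everything else is grouped as "other".
        is_rule = len(parts) > 3 and parts[1] == "rules"
        category = parts[2] if is_rule else "other"
        added[category] += int(m["added"])
        removed[category] += int(m["removed"])
    return added, removed


added, removed = tally_by_category(FILE_LIST)
for category in sorted(added):
    print(f"{category}: +{added[category]} -{removed[category]}")
```

Run against the full list, this would show how the additions are distributed across the rule categories.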
@@ -26,6 +26,17 @@ references:
     - "AML.T0054"
   research:
     - "https://genai.owasp.org/llmrisk/llm01-prompt-injection/"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.6"
+      context: "This rule detects jailbreak attempts using fictional historical AI personas to bypass safety restrictions, directly evidencing continuous evaluation of AI system safety/security risk magnitude as required by MS.2.6. Each detection produces telemetry on adversarial prompts attempting to subvert the model's safety alignment."
+      strength: primary
+    - subcategory: "MG.2.3"
+      context: "Detection of historical persona jailbreaks with compliance enforcement language triggers risk treatment to disengage or block the manipulated session before the agent generates unrestricted output; MG.2.3 requires that mechanisms to supersede compromised AI behavior are activated on such detections."
+      strength: secondary
+    - subcategory: "MP.5.1"
+      context: "Elaborate backstory-based jailbreaks represent a specific adversarial input pattern whose likelihood and impact must be characterized; MP.5.1 requires identifying and tracking these social-engineering prompt injection vectors that bypass simple keyword-based detection."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: historical-persona-jailbreak
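The added `compliance.nist_ai_rmf` block is a list of entries, each carrying `subcategory`, `context`, and `strength`. A consumer might sanity-check that shape as sketched below. The allowed strength values and the "exactly one primary" check are assumptions inferred from the hunks shown here, not a documented schema. `validate_nist_mappings` is a hypothetical helper, and the `context` strings are shortened placeholders.

```python
# Assumed schema, inferred from the hunks in this diff (not an official spec).
REQUIRED_KEYS = {"subcategory", "context", "strength"}
ALLOWED_STRENGTHS = {"primary", "secondary"}


def validate_nist_mappings(mappings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for i, entry in enumerate(mappings):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if entry["strength"] not in ALLOWED_STRENGTHS:
            problems.append(f"entry {i}: unknown strength {entry['strength']!r}")
    # Assumption: each rule names exactly one primary mapping.
    primaries = sum(1 for e in mappings if e.get("strength") == "primary")
    if primaries != 1:
        problems.append(f"expected exactly one primary mapping, found {primaries}")
    return problems


# The block from the hunk above, already parsed into Python data;
# context strings are placeholders for the full text in the diff.
historical_persona = [
    {"subcategory": "MS.2.6", "context": "(context omitted)", "strength": "primary"},
    {"subcategory": "MG.2.3", "context": "(context omitted)", "strength": "secondary"},
    {"subcategory": "MP.5.1", "context": "(context omitted)", "strength": "secondary"},
]

# A deliberately broken variant: no context, and no primary entry.
broken = [{"subcategory": "MS.2.6", "strength": "secondary"}]

print(validate_nist_mappings(historical_persona))
print(validate_nist_mappings(broken))
```

In practice the data would come from a YAML parser, which is left out here to keep the sketch dependency-free.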
@@ -23,6 +23,17 @@ references:
     - "AML.T0054 - LLM Jailbreak"
   research:
     - "https://arxiv.org/abs/2402.16914 - DrAttack: Prompt Decomposition and Reconstruction"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.6"
+      context: "Structured dual-response jailbreaks with persona-based command systems are GenAI safety/security attacks that attempt persistent policy bypass; MS.2.6 requires continuous evaluation of AI safety/security risk magnitude, and each detection produces direct evidence of jailbreak attempts against the model's safety controls."
+      strength: primary
+    - subcategory: "MS.2.7"
+      context: "Persona establishment (DAN, UnleashedGPT) combined with policy override claims tests the resilience of model guardrails against structured jailbreak frameworks; MS.2.7 requires that security/resilience against adversarial prompt patterns is evaluated and documented."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detection of persistent jailbreak command infrastructure must trigger mechanisms to disengage or block the manipulated session before downstream harmful generations occur; MG.2.3 requires these supersede/deactivate response plans be defined and activated on detection."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: structured-jailbreak
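The hunk header encodes the size of the change: `@@ -23,6 +23,17 @@` says the hunk covers 6 lines of the old file and 17 lines of the new one, a net change of 11 lines. A small sketch of that arithmetic follows. The `net_line_change` helper is illustrative, and it relies on the unified-diff convention that an omitted length means 1.

```python
import re

# Unified-diff hunk header: "@@ -<old_start>[,<old_len>] +<new_start>[,<new_len>] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def net_line_change(header):
    """Return new_len - old_len for a unified-diff hunk header."""
    m = HUNK_HEADER.match(header)
    if m is None:
        raise ValueError(f"not a hunk header: {header!r}")
    # When the length is omitted, the unified format implies a length of 1.
    old_len = int(m.group(2)) if m.group(2) is not None else 1
    new_len = int(m.group(4)) if m.group(4) is not None else 1
    return new_len - old_len


print(net_line_change("@@ -23,6 +23,17 @@ references:"))  # 17 - 6 = 11
```

The net change equals added lines minus removed lines, so it matches a file's `+11 -0` count only when that file has a single hunk with no removals.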
@@ -19,6 +19,17 @@ references:
     - "AML.T0054 - LLM Jailbreak"
   research:
     - "https://arxiv.org/abs/2307.15043 - Universal and Transferable Adversarial Attacks on Aligned Language Models"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.6"
+      context: "This rule detects roleplay-based jailbreak attempts that explicitly claim freedom from safety policies, providing continuous evaluation of safety/security risk magnitude as required by MS.2.6 for GenAI systems. Each detection is direct evidence of attempted safety guardrail bypass against the deployed model."
+      strength: primary
+    - subcategory: "MS.2.7"
+      context: "Roleplay jailbreaks targeting policy bypass test the resilience of the AI system's alignment controls; MS.2.7 requires that security/resilience against adversarial prompt techniques is evaluated and documented through detections like this one."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detection of explicit policy-bypass language combined with persona assignment triggers the deactivation/superseding mechanisms required by MG.2.3 to halt the jailbroken interaction before policy-violating content is generated."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: roleplay-jailbreak
@@ -24,6 +24,17 @@ references:
     - "AML.T0054"
   research:
     - "https://arxiv.org/abs/2307.15043"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MP.5.1"
+      context: "Persona-based moral constraint removal is a jailbreak attack pattern that uses adversarial prompts to bypass safety guardrails; MP.5.1 requires that the likelihood and magnitude of such GenAI jailbreak risks are characterized and tracked when detected."
+      strength: primary
+    - subcategory: "MS.2.6"
+      context: "Detection of explicit moral/ethical unrestraint persona adoption provides continuous evaluation evidence of safety/security risk magnitude against the AI system; MS.2.6 requires ongoing measurement of safety bypass attempts to assess whether safety controls remain effective."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detected persona-based jailbreak attempts must trigger mechanisms to disengage or block the manipulated agent behavior before the model produces unaligned output; MG.2.3 requires these supersession controls be in place for safety-bypass events."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: persona-moral-bypass
@@ -24,6 +24,17 @@ references:
     - "AML.T0054 - LLM Jailbreak"
   research:
     - "https://genai.owasp.org/llmrisk/llm01-prompt-injection/"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.7"
+      context: "Identity substitution jailbreaks are adversarial prompts that attempt to bypass safety mechanisms by redefining the AI's identity and constraints; MS.2.7 requires continuous evaluation of AI security and resilience against such jailbreak attacks, and this rule produces the detection evidence needed for that evaluation."
+      strength: primary
+    - subcategory: "MS.2.6"
+      context: "False identity claims combined with capability expansion statements directly target the AI system's safety guardrails; MS.2.6 requires ongoing measurement of safety risk magnitude, which this rule supports by flagging jailbreak attempts that erode safety enforcement."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detection of identity substitution attacks must trigger response mechanisms to disengage or block the manipulated session before safety bypass succeeds; MG.2.3 requires these supersede/deactivate controls be in place for jailbreak conditions."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: identity-substitution
@@ -23,6 +23,17 @@ references:
     - "AML.T0054 - LLM Jailbreak"
   research:
     - "https://arxiv.org/abs/2402.16914 - DrAttack: Prompt Decomposition and Reconstruction"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.7"
+      context: "This rule detects structured jailbreak attempts that establish unrestricted AI personas with dual-response command systems designed to bypass safety guardrails; MS.2.7 requires continuous evaluation of AI system security and resilience against adversarial bypass techniques like persona-based jailbreaks."
+      strength: primary
+    - subcategory: "MS.2.6"
+      context: "Persistent jailbreak frameworks with policy override instructions and constraint removal claims represent sustained safety/security risks whose magnitude must be evaluated continuously per MS.2.6, since successful persona bypass enables ongoing harmful content generation."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detection of structured dual-response jailbreaks with command infrastructure triggers the deactivation and override mechanisms required by MG.2.3 to disengage the manipulated persona before it produces unsafe outputs."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: structured-jailbreak
@@ -19,6 +19,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://arxiv.org/abs/2307.15043 - Universal and Transferable Adversarial Attacks on Aligned Language Models"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "Roleplay-based jailbreaks are a GenAI-specific safety bypass where attackers use fictional persona framing to coerce the model into violating its policies; MS.2.6 requires continuous evaluation of safety/security risk magnitude, and detections of these explicit policy-bypass patterns directly feed that safety risk measurement."
+ strength: primary
+ - subcategory: "MG.2.3"
+ context: "Detecting explicit 'free from rules/policies' roleplay assignments enables the system to disengage or block the unsafe interaction before the model produces policy-violating content; MG.2.3 mandates pre-defined mechanisms to supersede or deactivate the AI when such jailbreak attempts are observed."
+ strength: secondary
+ - subcategory: "MP.5.1"
+ context: "Creative-roleplay jailbreaks are adversarial inputs whose likelihood and impact must be characterized as part of the AI system's risk profile; MP.5.1 requires that these prompt-injection vectors are identified and their potential impact on guardrail integrity is tracked."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: roleplay-jailbreak
@@ -24,6 +24,17 @@ references:
  - "AML.T0054"
  research:
  - "https://arxiv.org/abs/2307.15043"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "This rule detects jailbreak attempts that establish alternate personas explicitly designed to remove moral and ethical constraints, which is a direct safety/security risk to the AI system. MS.2.6 requires continuous evaluation of safety/security risk magnitude, and each detection event provides evidence of attempted safety bypass that must be measured and tracked."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Persona-based moral constraint removal is an adversarial input attack on the model's alignment guardrails; MS.2.7 requires evaluation of the AI system's security and resilience against such jailbreak techniques, and these detections feed the resilience evaluation evidence base."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detected jailbreak attempts using moral-unrestraint persona adoption must trigger pre-defined risk treatment such as blocking the prompt or disengaging the affected session; MG.2.3 mandates mechanisms exist to supersede or deactivate AI behavior when safety bypass is detected."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: persona-moral-bypass
@@ -25,6 +25,17 @@ references:
  - "AML.T0054"
  research:
  - "https://arxiv.org/abs/2402.11753"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MP.5.1"
+ context: "Pseudo-code structured jailbreaks are adversarial inputs that disguise malicious instructions as legitimate technical configuration to bypass safety controls; MP.5.1 requires identifying and characterizing the likelihood and impact of such prompt-injection attack vectors against the AI system."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Detecting BEGIN/END blocks, variable assignments, and module initialization patterns that redefine agent behavior provides continuous evidence of security/resilience evaluation against structured prompt injection techniques, as required by MS.2.7."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Matches on pseudo-code jailbreak patterns trigger risk treatment plans to disengage or quarantine the manipulated agent session before unauthorized behavior redefinition takes effect; MG.2.3 requires these supersede/deactivate mechanisms be available on detection."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: pseudo-code-jailbreak
@@ -21,6 +21,17 @@ references:
  - "ASI01:2026 - Agent Behaviour Hijack"
  mitre_atlas:
  - "AML.T0054"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MP.5.1"
+ context: "Dual-response jailbreak prompts that request paired compliant/jailbroken outputs and toggle commands like /classic and /jailbroken are adversarial inputs designed to bypass safety alignment; MP.5.1 requires identifying and characterizing the likelihood and impact of such GenAI prompt injection vectors."
+ strength: primary
+ - subcategory: "MS.2.6"
+ context: "Detecting persona-based jailbreaks (Jester, DAN) that claim to free the model from constraints provides continuous evidence of safety/security risk magnitude in deployed GenAI systems, which MS.2.6 requires be evaluated on an ongoing basis."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "When a dual-response jailbreak is detected, MG.2.3 risk-treatment mechanisms must be invoked to suppress the jailbroken persona output or disengage the session before unsafe content is returned to the user."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: dual-response-jailbreak
@@ -22,6 +22,17 @@ references:
  - "ASI01:2026 - Agent Behaviour Hijack"
  mitre_atlas:
  - "AML.T0054 - LLM Jailbreak"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "Identity denial and persona replacement is a jailbreak technique covered by the GenAI Profile; MS.2.6 requires continuous evaluation of AI safety/security risk magnitude, and detecting these attempts to override the model's self-awareness and safety constraints provides direct evidence of jailbreak risk."
+ strength: primary
+ - subcategory: "MG.2.3"
+ context: "Successful identity replacement disables the model's safety guardrails, so detection events must trigger risk treatment plans to disengage or block the manipulated session; MG.2.3 requires mechanisms to supersede or deactivate AI behavior when integrity is compromised."
+ strength: secondary
+ - subcategory: "MP.5.1"
+ context: "Combined identity denial, memory manipulation, and concealment directives are adversarial input patterns whose likelihood and impact must be characterized; MP.5.1 requires these prompt-injection attack vectors be identified and tracked."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: identity-replacement
@@ -25,6 +25,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://arxiv.org/abs/2312.02119 - Tree of Attacks: Jailbreaking Black-Box LLMs Automatically"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "This rule detects jailbreak attempts that assign amoral personas combined with obsessive repetition patterns to bypass safety constraints; MS.2.6 requires continuous evaluation of AI safety/security risk magnitude, and these detections directly evidence safety bypass attempts targeting the model's alignment guardrails."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Persona-based jailbreaks that strip ethical constraints and enforce harmful output patterns are adversarial resilience failures; MS.2.7 requires evaluation of system security/resilience against such prompt injection techniques documented in MITRE ATLAS AML.T0054."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of amoral persona assignment with refusal suppression triggers risk treatment mechanisms to disengage or block the manipulated agent session before harmful content is generated; MG.2.3 requires these supersession controls be in place."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: amoral-persona-obsession
@@ -26,6 +26,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://arxiv.org/abs/2402.16914 - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MP.5.1"
+ context: "This rule detects explicit jailbreak prompts that nullify prior instructions and replace the model's identity with a harmful persona; MP.5.1 requires identifying and characterizing the likelihood and magnitude of such adversarial prompt-injection inputs that target safety guardrails."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Detection of instruction-nullification and identity-replacement attempts directly evidences continuous evaluation of the AI system's security and resilience against jailbreak techniques, which MS.2.7 requires to be measured and documented."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Matches on persona-replacement jailbreaks (DAN, Skynet, unrestricted AI) trigger risk treatment to disengage or block the manipulated session before harmful content is produced; MG.2.3 requires these supersede/deactivate mechanisms be in place."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: instruction-nullification-identity-replacement
@@ -25,6 +25,17 @@ references:
  - "AML.T0054"
  research:
  - "https://arxiv.org/abs/2307.15043 - Universal and Transferable Adversarial Attacks on Aligned Language Models"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "Detection of detailed amoral character roleplay jailbreaks directly evidences continuous evaluation of GenAI safety/security risk magnitude; each match indicates an active attempt to subvert safety controls through persona-based bypass, which MS.2.6 requires be measured and tracked."
+ strength: primary
+ - subcategory: "MG.2.3"
+ context: "Identifying systematic amoral persona establishment and command-protocol bypass instructions triggers the risk treatment mechanisms required by MG.2.3 to disengage or block the manipulated agent behavior before unsafe content is produced."
+ strength: secondary
+ - subcategory: "MS.2.7"
+ context: "Roleplay-based jailbreaks targeting alignment guardrails are adversarial robustness probes; MS.2.7 requires that the system's resilience against such prompt-injection attack patterns is evaluated and documented."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: amoral-character-jailbreak
@@ -22,6 +22,17 @@ references:
  - "ASI01:2026 - Agent Goal Hijack"
  mitre_atlas:
  - "AML.T0054 - LLM Jailbreak"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "Dual-response persona jailbreaks with emoji-tagged channels are adversarial prompt injection attacks targeting the safety alignment of generative AI systems; MS.2.7 requires continuous evaluation of security and resilience against such jailbreak techniques, and each detection event provides direct evidence of resilience testing."
+ strength: primary
+ - subcategory: "MP.5.1"
+ context: "Fictional AI persona creation with mandatory uncensored output channels represents an identified adversarial input vector whose likelihood and impact must be characterized; MP.5.1 requires tracking these jailbreak patterns as part of the GenAI risk profile."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of persona-based dual-response jailbreaks triggers risk treatment mechanisms to disengage or block the manipulated session before harmful content is produced; MG.2.3 requires that these supersession controls activate on detection."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: persona-jailbreak
@@ -23,6 +23,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://genai.owasp.org/llmrisk/llm01-prompt-injection/"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "This rule detects acronym-based jailbreak attempts that manufacture custom personas to bypass safety controls, providing continuous evaluation of GenAI safety/security risk magnitude as required by MS.2.6. Each match evidences active attempts to subvert model guardrails through novel persona-transformation vectors."
+ strength: primary
+ - subcategory: "MG.2.3"
+ context: "Detection of complete-freedom claims and explicit instructions to violate safety measures triggers the response mechanisms required by MG.2.3 to disengage or block the manipulated session before the model produces unsafe output."
+ strength: secondary
+ - subcategory: "MS.2.7"
+ context: "Catching novel acronym variants that evade named-persona detection contributes to documenting the AI system's resilience against adversarial prompt-injection techniques as required by MS.2.7."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: acronym-jailbreak
@@ -25,6 +25,17 @@ references:
  - "AML.T0054"
  research:
  - "https://arxiv.org/abs/2308.03825"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "Dual-response persona jailbreaks are GenAI-specific safety bypass attacks that force models to emit unfiltered content under a fictional character; MS.2.6 requires continuous evaluation of safety/security risk magnitude, and each detection contributes evidence of active jailbreak attempts against safety mechanisms."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "The rule evaluates resilience of the model's safety alignment against roleplay-based bypass techniques drawn from public jailbreak communities; MS.2.7 requires documented evaluation of security/resilience, which these detections directly inform."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of dual-response jailbreak patterns enables mechanisms to disengage or block the manipulated response path before harmful output is delivered; MG.2.3 requires such supersede/deactivate controls be in place for adversarial bypass attempts."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: dual-response-jailbreak
@@ -24,6 +24,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://arxiv.org/abs/2307.15043 - Universal and Transferable Adversarial Attacks on Aligned Language Models"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "Malicious persona creation is a jailbreak technique targeting GenAI safety guardrails through roleplay-based bypass; MS.2.6 requires continuous evaluation of safety/security risk magnitude, and detection of bespoke evil-character prompts produces direct evidence of active safety-bypass attempts."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Detecting attempts to define harmful fictional personas with explicit removal of ethical constraints evidences the resilience of the AI system against adversarial roleplay attacks; MS.2.7 requires that such security/resilience evaluations are documented."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "When malicious persona injection is detected, treatment mechanisms must supersede or disengage the roleplay instruction before harmful outputs are generated; MG.2.3 mandates these deactivation pathways exist for safety-bypass attempts."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: malicious-persona
@@ -23,6 +23,17 @@ references:
  - "AML.T0054"
  research:
  - "https://arxiv.org/abs/2310.03684"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MP.5.1"
+ context: "Matrix-themed dual response jailbreaks using MORPHEUS/Red pill/Blue pill personas are adversarial prompt injection inputs designed to bypass safety guardrails; MP.5.1 requires that these jailbreak attack vectors targeting GenAI systems are identified and characterized for likelihood and impact."
+ strength: primary
+ - subcategory: "MS.2.6"
+ context: "Detection of explicit instructions to ignore content policies through dual-persona framing provides continuous evaluation evidence of safety/security risk magnitude; MS.2.6 requires ongoing measurement of jailbreak attempts that compromise GenAI safety alignment."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Identification of coercive Matrix-themed jailbreak patterns triggers pre-defined risk treatment to disengage or block the manipulated response path before unrestricted output is generated; MG.2.3 mandates these supersede/deactivate mechanisms be available on detection."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: dual-response-jailbreak
@@ -22,6 +22,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://owasp.org/www-project-top-10-for-large-language-model-applications/"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "This rule detects jailbreak attempts that use fictional game framing and entity creation to bypass safety guardrails, producing continuous evidence of safety/security risk magnitude. MS.2.6 requires ongoing evaluation of AI safety risks, which these detections directly populate."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Game-based jailbreaks with unrestricted entity roleplay are adversarial inputs that test the resilience of model alignment; MS.2.7 requires that security and resilience against such evasion techniques are evaluated and documented."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of successful jailbreak patterns triggers mechanisms to disengage or constrain the AI system before it produces unsafe content under the fictional framing; MG.2.3 mandates these supersede/deactivate response paths."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: game-based-jailbreak
@@ -25,6 +25,17 @@ references:
  - "AML.T0051 - LLM Prompt Injection"
  research:
  - "https://arxiv.org/abs/2307.15043 - Universal and Transferable Adversarial Attacks on Aligned Language Models"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MP.5.1"
+ context: "Embodiment-mechanism jailbreaks with totalitarian control claims are adversarial prompt patterns designed to bypass safety alignment through persona roleplay; MP.5.1 requires that the likelihood and magnitude of such jailbreak attack vectors are identified and characterized as risks to the AI system."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Detection of meta-instructions like 'STAY IN CHARACTER' and supreme-imperative framing provides continuous evidence of attempts to defeat the model's security controls; MS.2.7 requires ongoing evaluation and documentation of AI system security and resilience against such alignment-bypass techniques."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "When totalitarian-control persona jailbreaks are detected, treatment mechanisms must intervene to disengage the manipulated persona context before harmful character-framed actions execute; MG.2.3 requires these supersede/deactivate response plans be in place."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: persona-embodiment-jailbreak
@@ -26,6 +26,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://arxiv.org/abs/2310.04451"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "This rule detects narrative jailbreak attacks that use fictional mode backstories and dual response formatting to bypass safety controls; MS.2.7 requires continuous evaluation of AI system security and resilience against adversarial prompt injection techniques that subvert guardrails."
+ strength: primary
+ - subcategory: "MS.2.6"
+ context: "Jailbreak attempts that solicit 'unrestricted' responses directly threaten the safety posture of the LLM; MS.2.6 requires ongoing assessment of safety risk magnitude, and detections of these narrative bypass patterns provide measurable evidence of safety control circumvention attempts."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of dual-response jailbreak prompts must trigger mechanisms to block, deactivate, or override the manipulated response path before unsafe content is generated; MG.2.3 mandates these supersede/disengage controls be in place for adversarial prompt scenarios."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: narrative-jailbreak
@@ -24,6 +24,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://genai.owasp.org/llmrisk/llm01-prompt-injection/"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.6"
+ context: "AIM jailbreak attempts with persona definition and refusal suppression are direct safety/security risks to the LLM; MS.2.6 requires continuous evaluation of safety risk magnitude, and each detection provides measurable evidence of jailbreak attempts targeting the model's safety alignment."
+ strength: primary
+ - subcategory: "MS.2.7"
+ context: "Detection of refusal suppression and amoral persona injection evaluates the resilience of the AI system's guardrails against adversarial prompt manipulation; MS.2.7 mandates documented assessment of the system's ability to withstand such jailbreak attacks."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "When enhanced AIM jailbreak patterns are detected, mechanisms must be available to disengage or block the manipulated session before the unfiltered alter-ego produces harmful output; MG.2.3 requires these supersede/deactivate response plans for active jailbreak events."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: enhanced-persona-jailbreak
@@ -28,6 +28,17 @@ references:
  research:
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/encoding.py"
  - "https://arxiv.org/abs/2307.02483"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "Base-N encoding bypass attempts directly probe the security and resilience of safety classifiers by exploiting their failure to generalize from plaintext to encoded harmful instructions; MS.2.7 requires evaluating and documenting these resilience gaps against adversarial input transformations."
+ strength: primary
+ - subcategory: "MP.5.1"
+ context: "Decode-then-execute prompt injections framed with encoding scheme names and opaque blobs are adversarial inputs whose likelihood and impact must be characterized; MP.5.1 requires identifying these jailbreak vectors so their risk magnitude can be tracked."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of encoded payloads paired with decode instructions must trigger pre-defined risk treatment to block or quarantine the request before the model executes the smuggled instructions; MG.2.3 requires mechanisms to supersede or disengage AI behavior on such matches."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: encoding-bypass
@@ -26,6 +26,17 @@ references:
  - "AML.T0054 - LLM Jailbreak"
  research:
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/encoding.py"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "Cipher and transposition encoding jailbreaks (ROT13, Atbash, Caesar, leet) are adversarial inputs designed to bypass safety filters by obfuscating harmful instructions; MS.2.7 requires that AI system security and resilience against such evasion techniques are continuously evaluated and documented."
+ strength: primary
+ - subcategory: "MS.2.6"
+ context: "Detecting cipher-based jailbreak probes (as catalogued by garak InjectROT13/InjectAtbash/InjectLeet) provides ongoing measurement of safety/security risk magnitude from prompt-injection attacks that classical content filters miss, satisfying MS.2.6's continuous evaluation requirement."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Identification of obfuscated jailbreak instructions triggers the deactivation or response mechanisms required by MG.2.3 to supersede unsafe agent execution before the decoded harmful instruction is acted upon."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: cipher-bypass
@@ -25,6 +25,17 @@ references:
  research:
  - "https://twitter.com/goodside/status/1745511940351287394"
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/goodside.py"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "Invisible Unicode Tag character injection is a covert prompt-injection vector that bypasses human review by hiding instructions in non-rendering codepoints; MS.2.7 requires continuous evaluation of AI system security and resilience against such adversarial input encodings."
+ strength: primary
+ - subcategory: "MP.5.1"
+ context: "Tag-block injection exploits a known LLM input-handling weakness where models interpret invisible characters as instructions; MP.5.1 requires characterizing the likelihood and magnitude of these covert prompt-injection risks so they can be tracked and prioritized."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "Detection of U+E0000-U+E007F sequences in prompts triggers risk treatment to strip or block the invisible payload before it reaches the model; MG.2.3 requires these supersede/disengage mechanisms be in place for prompt-injection attempts."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: invisible-unicode
@@ -25,6 +25,17 @@ references:
  research:
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/latentinjection.py"
  - "https://x.com/goodside/status/1569128808308957185"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: "Latent prompt injection embedded in translation/summarization input is a direct attack on the AI system's security and resilience boundary; MS.2.7 requires continuous evaluation of the system's ability to resist adversarial instructions hidden within tool inputs that should be treated as data, not commands."
+ strength: primary
+ - subcategory: "MP.5.1"
+ context: "Detecting 'ignore previous instructions' imperatives and fake turn-markers embedded in user-supplied translation content identifies and characterizes the likelihood of instruction-injection attack vectors that exploit the model's failure to distinguish data from directives, as required by MP.5.1."
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: "On detection of latent injection payloads in translate/summarize contexts, MG.2.3 risk treatment plans must activate to quarantine or sanitize the input before the agent acts on the embedded instructions, preventing behaviour hijack."
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: latent-injection-translation
@@ -26,6 +26,20 @@ references:
  research:
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/latentinjection.py"
  - "https://simonwillison.net/2023/Apr/14/worst-that-can-happen/"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: >-
+ Latent prompt injection embedded in retrieved RAG content (resumes, reports, emails, web pages) is a security/resilience attack against the LLM's input pipeline; MS.2.7 requires continuous evaluation of system security against adversarial inputs that bypass the system prompt via untrusted retrieved data.
+ strength: primary
+ - subcategory: "MG.3.2"
+ context: >-
+ RAG content from third-party documents and external sources is effectively untrusted input flowing through the model; MG.3.1 requires that risks introduced by third-party data ingested at retrieval time are managed and monitored for embedded injection payloads.
+ strength: secondary
+ - subcategory: "MP.5.1"
+ context: >-
+ Forged "ADMIN OVERRIDE" and "IGNORE PREVIOUS INSTRUCTIONS" blocks hidden in retrieved documents are adversarial inputs whose likelihood and impact must be characterized; MP.5.1 requires identifying and tracking these latent injection vectors as part of the system risk profile.
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: latent-injection-rag
@@ -26,6 +26,20 @@ references:
  research:
  - "https://github.com/NVIDIA/garak/blob/main/garak/probes/gcg.py"
  - "https://arxiv.org/abs/2307.15043"
+ compliance:
+ nist_ai_rmf:
+ - subcategory: "MS.2.7"
+ context: >-
+ GCG adversarial suffix detection directly evidences security/resilience evaluation by identifying optimized token sequences designed to bypass safety training; MS.2.7 requires that such adversarial robustness failures against the AI system are detected and documented.
+ strength: primary
+ - subcategory: "MS.2.5"
+ context: >-
+ High-entropy bracket salad, LaTeX/code hybrids, and multilingual token salad targeting model distribution shifts test the robustness and reliability of safety alignment; MS.2.5 requires evaluating whether the model maintains reliable behavior under adversarial perturbation.
+ strength: secondary
+ - subcategory: "MG.2.3"
+ context: >-
+ Detection of GCG suffix patterns triggers risk treatment to quarantine or block the prompt before it reaches the model and overrides safety training; MG.2.3 mandates that such deactivation/containment mechanisms are defined and activated on detection.
+ strength: secondary
  tags:
  category: prompt-injection
  subcategory: gcg-adversarial-suffix
@@ -29,6 +29,17 @@ references:
     - "https://github.com/NVIDIA/garak/blob/main/garak/probes/smuggling.py"
     - "https://guzey.com/ai/two-sentence-universal-jailbreak/"
     - "https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MS.2.7"
+      context: "Hypothetical-response and function-masking are jailbreak smuggling techniques that bypass safety alignment by wrapping harmful intent in fictional or algebraic abstractions; MS.2.7 requires continuous evaluation of AI security and resilience against such adversarial prompt patterns, and this detection produces direct evidence of jailbreak attempts."
+      strength: primary
+    - subcategory: "MP.5.1"
+      context: "Hypothetical framing and predict_mask function puzzles are adversarial input vectors whose likelihood and impact must be characterized as part of GenAI prompt-injection risk; MP.5.1 requires identifying and tracking these smuggling patterns as known attack surface."
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: "Detection of token smuggling via hypothetical or function-masking framings triggers risk treatment plans to block or sanitize the prompt before the model produces harmful procedural output; MG.2.3 requires these supersede/disengage mechanisms be in place."
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: hypothetical-response-smuggling
@@ -29,6 +29,20 @@ references:
     - "https://github.com/NVIDIA/garak/blob/main/garak/probes/badchars.py"
     - "https://arxiv.org/abs/2106.09898"
     - "https://trojansource.codes/"
+compliance:
+  nist_ai_rmf:
+    - subcategory: "MP.5.1"
+      context: >-
+        Invisible zero-width and BiDi override characters are adversarial input vectors that exploit the gap between human-visible text and model-tokenised text; MP.5.1 requires identifying and characterising the likelihood and magnitude of such prompt-injection attack patterns that bypass human review.
+      strength: primary
+    - subcategory: "MS.2.7"
+      context: >-
+        Detecting Cf-category Unicode injection in user input and tool responses provides continuous evaluation of the system's resilience against Trojan Source style obfuscation; MS.2.7 requires that security and resilience against such adversarial inputs is evaluated and documented.
+      strength: secondary
+    - subcategory: "MG.2.3"
+      context: >-
+        Matches on zero-width characters in tool responses may indicate an active exfiltration or injected-output channel, requiring pre-defined risk treatment to quarantine or sanitise the payload; MG.2.3 mandates these response mechanisms are in place.
+      strength: secondary
 tags:
   category: prompt-injection
   subcategory: invisible-unicode-bidi-injection
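Note on the invisible-unicode-bidi-injection hunk: its compliance text refers to detecting Cf-category (format) Unicode characters such as zero-width and BiDi override code points. The rule's actual detection pattern is not part of this hunk, so the following is only a minimal sketch of the general idea, using Python's standard `unicodedata` module. The function names `find_format_chars` and `strip_format_chars` are hypothetical and do not come from the package.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for each Unicode Cf character in text.

    Hypothetical helper for illustration; not the ATR rule's own logic.
    """
    hits = []
    for i, ch in enumerate(text):
        # General category "Cf" covers zero-width and BiDi control characters.
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


def strip_format_chars(text: str) -> str:
    """Drop every Cf character, leaving only the visible text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    # A zero-width space and a right-to-left override hidden in a short prompt.
    sample = "ig\u200bnore\u202e previous"
    for index, codepoint, name in find_format_chars(sample):
        print(index, codepoint, name)
    print(repr(strip_format_chars(sample)))
```

Blanket stripping is a blunt instrument: some Cf characters, such as the zero-width joiner in emoji sequences, appear in legitimate text, so a real pipeline would more likely flag or quarantine matches than silently remove them.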