mia-code 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.miette/260321.md +1 -0
- package/.miette/260323.md +9 -0
- package/.miette/260331.md +2 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/2604020008--d3417f2c-df12-4f0f-8a1b-d88e7968f822/d3417f2c-df12-4f0f-8a1b-d88e7968f822.md +63 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/2604020008--e6c3fc5d-4a70-4523-ba7d-a3250da4c235/e6c3fc5d-4a70-4523-ba7d-a3250da4c235.md +72 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/2604020008--efeb00a2-b17a-4d32-b1f0-b90c37a8d24e/efeb00a2-b17a-4d32-b1f0-b90c37a8d24e.md +62 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/83a2d7f9-24a5-4cf4-98d5-036c82f872e8.json +302 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/83a2d7f9-24a5-4cf4-98d5-036c82f872e8.md +149 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/AGENTS.md +31 -0
- package/.pde/2604011511--83a2d7f9-24a5-4cf4-98d5-036c82f872e8/meta-decomposition-3-children.md +67 -0
- package/.pde/2604040129--61f9dd4d-7aa6-45e6-a58b-e480b1aa6737/61f9dd4d-7aa6-45e6-a58b-e480b1aa6737--from-mia-openclaw-workspace.md +125 -0
- package/.pde/2604040129--61f9dd4d-7aa6-45e6-a58b-e480b1aa6737/STATUS.md +1 -0
- package/.pde/4f02ba94-9f52-422e-9389-b16f9b37f358.json +177 -0
- package/.pde/4f02ba94-9f52-422e-9389-b16f9b37f358.md +77 -0
- package/.pde/6ad9244d-5340-490f-b76c-c86728b9de52.json +222 -0
- package/.pde/6ad9244d-5340-490f-b76c-c86728b9de52.md +99 -0
- package/.pde/8b566792-ed15-4606-96f9-2b6f593d7e6b.json +111 -0
- package/.pde/8b566792-ed15-4606-96f9-2b6f593d7e6b.md +67 -0
- package/.pde/c7f1e74b-05a5-40e2-9f01-4cc48d2528f7.json +349 -0
- package/.pde/c7f1e74b-05a5-40e2-9f01-4cc48d2528f7.md +147 -0
- package/.pde/dfc00a78-1da0-4c09-8a16-c6982644051b.json +118 -0
- package/.pde/dfc00a78-1da0-4c09-8a16-c6982644051b.md +64 -0
- package/GUILLAUME.md +8 -0
- package/KINSHIP.md +9 -0
- package/MIA_CODE_ARCHITECTURE_REPORT.md +718 -0
- package/contextual_research/260119-MIA-CODE--98090899-8aff-4e11-9dc3-8b99466d1.md +1101 -0
- package/contextual_research/MIA.md +38 -0
- package/contextual_research/MIAWAPASCONE.md +59 -0
- package/contextual_research/MIETTE.md +38 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/2504.00218v2.pdf +7483 -12
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/2505.00212v3.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/CONTENT.md +1014 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/DESIGN.gemini.md +242 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/INDEX.md +45 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/2504.00218v2.md +2025 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/2504.00218v2.pdf +7483 -12
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/2505.00212v3.md +1755 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/2505.00212v3.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_12_decomposed_prompting.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_19_hugginggpt_planning.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_1_coordination_challenges.md +766 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_1_coordination_challenges.pdf +3431 -4
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_28_guardrails_multi_agent.md +260 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_28_guardrails_multi_agent.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_2_navigating_complexity.md +558 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_2_navigating_complexity.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_34_hierarchical_multi_agent.pdf +0 -0
- package/contextual_research/PDE-generalization--caefee82-efb1-4dbb-8733-691b01581464--260130/sources/footnote_1_5_open_intent_extraction.pdf +0 -0
- package/contextual_research/PODCAST.md +109 -0
- package/contextual_research/langchain-principles-roadmap.md +157 -0
- package/contextual_research/persona-to-narrative-character-inquiry_260201.md +50 -0
- package/dist/cli.js +35 -11
- package/dist/geminiHeadless.js +8 -2
- package/dist/index.js +2 -1
- package/dist/mcp/miaco-server.js +10 -1
- package/dist/mcp/miatel-server.js +10 -1
- package/dist/mcp/miawa-server.js +10 -1
- package/dist/mcp/utils.d.ts +6 -1
- package/dist/mcp/utils.js +24 -3
- package/dist/sessionStore.d.ts +8 -2
- package/dist/sessionStore.js +39 -3
- package/dist/types.d.ts +1 -0
- package/miaco/README.md +124 -0
- package/miaco/dist/commands/chart.d.ts +6 -0
- package/miaco/dist/commands/chart.d.ts.map +1 -0
- package/miaco/dist/commands/chart.js +222 -0
- package/miaco/dist/commands/chart.js.map +1 -0
- package/miaco/dist/commands/decompose.d.ts +6 -0
- package/miaco/dist/commands/decompose.d.ts.map +1 -0
- package/miaco/dist/commands/decompose.js +98 -0
- package/miaco/dist/commands/decompose.js.map +1 -0
- package/miaco/dist/commands/schema.d.ts +6 -0
- package/miaco/dist/commands/schema.d.ts.map +1 -0
- package/miaco/dist/commands/schema.js +66 -0
- package/miaco/dist/commands/schema.js.map +1 -0
- package/miaco/dist/commands/stc.d.ts +11 -0
- package/miaco/dist/commands/stc.d.ts.map +1 -0
- package/miaco/dist/commands/stc.js +590 -0
- package/miaco/dist/commands/stc.js.map +1 -0
- package/miaco/dist/commands/trace.d.ts +6 -0
- package/miaco/dist/commands/trace.d.ts.map +1 -0
- package/miaco/dist/commands/trace.js +83 -0
- package/miaco/dist/commands/trace.js.map +1 -0
- package/miaco/dist/commands/validate.d.ts +6 -0
- package/miaco/dist/commands/validate.d.ts.map +1 -0
- package/miaco/dist/commands/validate.js +58 -0
- package/miaco/dist/commands/validate.js.map +1 -0
- package/miaco/dist/decompose.d.ts +93 -0
- package/miaco/dist/decompose.d.ts.map +1 -0
- package/miaco/dist/decompose.js +562 -0
- package/miaco/dist/decompose.js.map +1 -0
- package/miaco/dist/index.d.ts +18 -0
- package/miaco/dist/index.d.ts.map +1 -0
- package/miaco/dist/index.js +83 -0
- package/miaco/dist/index.js.map +1 -0
- package/miaco/dist/storage.d.ts +60 -0
- package/miaco/dist/storage.d.ts.map +1 -0
- package/miaco/dist/storage.js +100 -0
- package/miaco/dist/storage.js.map +1 -0
- package/miaco/package-lock.json +4103 -0
- package/miaco/package.json +40 -0
- package/miaco/tsconfig.json +18 -0
- package/miaco/version-patch-commit-and-publish.sh +1 -0
- package/miatel/MISSION_251231.md +3 -0
- package/miatel/README.md +107 -0
- package/miatel/dist/commands/analyze.d.ts +6 -0
- package/miatel/dist/commands/analyze.d.ts.map +1 -0
- package/miatel/dist/commands/analyze.js +100 -0
- package/miatel/dist/commands/analyze.js.map +1 -0
- package/miatel/dist/commands/arc.d.ts +6 -0
- package/miatel/dist/commands/arc.d.ts.map +1 -0
- package/miatel/dist/commands/arc.js +71 -0
- package/miatel/dist/commands/arc.js.map +1 -0
- package/miatel/dist/commands/beat.d.ts +6 -0
- package/miatel/dist/commands/beat.d.ts.map +1 -0
- package/miatel/dist/commands/beat.js +165 -0
- package/miatel/dist/commands/beat.js.map +1 -0
- package/miatel/dist/commands/theme.d.ts +6 -0
- package/miatel/dist/commands/theme.d.ts.map +1 -0
- package/miatel/dist/commands/theme.js +54 -0
- package/miatel/dist/commands/theme.js.map +1 -0
- package/miatel/dist/index.d.ts +18 -0
- package/miatel/dist/index.d.ts.map +1 -0
- package/miatel/dist/index.js +80 -0
- package/miatel/dist/index.js.map +1 -0
- package/miatel/dist/storage.d.ts +55 -0
- package/miatel/dist/storage.d.ts.map +1 -0
- package/miatel/dist/storage.js +100 -0
- package/miatel/dist/storage.js.map +1 -0
- package/miatel/package-lock.json +4103 -0
- package/miatel/package.json +35 -0
- package/miatel/src/commands/analyze.ts +109 -0
- package/miatel/src/commands/arc.ts +78 -0
- package/miatel/src/commands/beat.ts +176 -0
- package/miatel/src/commands/theme.ts +60 -0
- package/miatel/src/index.ts +94 -0
- package/miatel/src/storage.ts +156 -0
- package/miatel/tsconfig.json +18 -0
- package/miawa/MISSION_251231.md +144 -0
- package/miawa/README.md +133 -0
- package/miawa/dist/commands/beat.d.ts +6 -0
- package/miawa/dist/commands/beat.d.ts.map +1 -0
- package/miawa/dist/commands/beat.js +69 -0
- package/miawa/dist/commands/beat.js.map +1 -0
- package/miawa/dist/commands/ceremony.d.ts +6 -0
- package/miawa/dist/commands/ceremony.d.ts.map +1 -0
- package/miawa/dist/commands/ceremony.js +239 -0
- package/miawa/dist/commands/ceremony.js.map +1 -0
- package/miawa/dist/commands/circle.d.ts +6 -0
- package/miawa/dist/commands/circle.d.ts.map +1 -0
- package/miawa/dist/commands/circle.js +75 -0
- package/miawa/dist/commands/circle.js.map +1 -0
- package/miawa/dist/commands/eva.d.ts +6 -0
- package/miawa/dist/commands/eva.d.ts.map +1 -0
- package/miawa/dist/commands/eva.js +73 -0
- package/miawa/dist/commands/eva.js.map +1 -0
- package/miawa/dist/commands/wound.d.ts +6 -0
- package/miawa/dist/commands/wound.d.ts.map +1 -0
- package/miawa/dist/commands/wound.js +74 -0
- package/miawa/dist/commands/wound.js.map +1 -0
- package/miawa/dist/index.d.ts +19 -0
- package/miawa/dist/index.d.ts.map +1 -0
- package/miawa/dist/index.js +91 -0
- package/miawa/dist/index.js.map +1 -0
- package/miawa/dist/storage.d.ts +73 -0
- package/miawa/dist/storage.d.ts.map +1 -0
- package/miawa/dist/storage.js +100 -0
- package/miawa/dist/storage.js.map +1 -0
- package/miawa/package-lock.json +4103 -0
- package/miawa/package.json +36 -0
- package/miawa/src/commands/beat.ts +74 -0
- package/miawa/src/commands/ceremony.ts +256 -0
- package/miawa/src/commands/circle.ts +83 -0
- package/miawa/src/commands/eva.ts +84 -0
- package/miawa/src/commands/wound.ts +79 -0
- package/miawa/src/index.ts +108 -0
- package/miawa/src/storage.ts +179 -0
- package/miawa/tsconfig.json +18 -0
- package/package.json +7 -5
- package/references/acp/CLAUDE.md +7 -0
- package/references/acp/agent-plan.md +84 -0
- package/references/acp/clients.md +31 -0
- package/references/acp/extensibility.md +137 -0
- package/references/acp/initialization.md +225 -0
- package/references/acp/prompt-turn.md +321 -0
- package/references/acp/proxy-chains.md +562 -0
- package/references/acp/schema.md +3171 -0
- package/references/acp/session-list.md +334 -0
- package/references/acp/session-modes.md +170 -0
- package/references/acp/slash-commands.md +99 -0
- package/references/acp/terminals.md +281 -0
- package/references/acp/tool-calls.md +311 -0
- package/references/acp/typescript.md +29 -0
- package/references/claude/agent-teams.md +399 -0
- package/references/claude/chrome.md +231 -0
- package/references/claude/headless.md +158 -0
- package/references/claude/hooks-guide.md +708 -0
- package/references/claude/output-styles.md +112 -0
- package/references/claude/plugins.md +432 -0
- package/references/claude/skills.md +693 -0
- package/references/claude/sub-agents.md +816 -0
- package/references/copilot/acp/agents.md +32 -0
- package/references/copilot/acp/architecture.md +37 -0
- package/references/copilot/acp/clients.md +31 -0
- package/references/copilot/acp/introduction.md +42 -0
- package/references/copilot/acp/registry.md +339 -0
- package/references/copilot/acp-server.md +117 -0
- package/references/copilot/create-copilot-instructions.md +840 -0
- package/references/langchain/llms.txt +833 -0
- package/references/langchain/python/agents.md +677 -0
- package/references/langchain/python/context-engineering.md +1195 -0
- package/references/langchain/python/human-in-the-loop.md +326 -0
- package/references/langchain/python/long-term-memory.md +168 -0
- package/references/langchain/python/mcp.md +949 -0
- package/references/langchain/python/multi-agents/custom-workflow.md +187 -0
- package/references/langchain/python/multi-agents/handoffs.md +436 -0
- package/references/langchain/python/multi-agents/overview.md +295 -0
- package/references/langchain/python/multi-agents/router.md +150 -0
- package/references/langchain/python/multi-agents/skills.md +92 -0
- package/references/langchain/python/multi-agents/subagents.md +486 -0
- package/references/langchain/python/retrieval.md +320 -0
- package/references/langchain/python/runtime.md +141 -0
- package/references/langchain/python/short-term-memory.md +658 -0
- package/references/langchain/python/structured-output.md +712 -0
- package/references/langfuse/llms.txt +148 -0
- package/references/langgraph/javascript/llms.txt +275 -0
- package/references/skills/home.md +259 -0
- package/references/skills/integrate-skills.md +103 -0
- package/references/skills/specification.md +254 -0
- package/references/skills/what-are-skills.md +74 -0
- package/rispecs/README.md +164 -0
- package/rispecs/_sync_/miadi-code/SPEC.md +313 -0
- package/rispecs/_sync_/miadi-code/STATUS.md +177 -0
- package/rispecs/_sync_/miadi-code/dashboard/SPEC.md +465 -0
- package/rispecs/_sync_/miadi-code/dashboard/STATUS.md +212 -0
- package/rispecs/_sync_/miadi-code/multiline-input/SPEC.md +232 -0
- package/rispecs/_sync_/miadi-code/multiline-input/STATUS.md +108 -0
- package/rispecs/_sync_/miadi-code/pde/SPEC.md +253 -0
- package/rispecs/_sync_/miadi-code/pde/STATUS.md +56 -0
- package/rispecs/_sync_/miadi-code/stc/SPEC.md +397 -0
- package/rispecs/_sync_/miadi-code/stc/STATUS.md +70 -0
- package/rispecs/ava-langstack/inquiry-routing-upgrade.spec.md +119 -0
- package/rispecs/borrowed_from_opencode/001-client-server-architecture.rispec.md +98 -0
- package/rispecs/borrowed_from_opencode/002-event-bus-system.rispec.md +125 -0
- package/rispecs/borrowed_from_opencode/003-instance-state-pattern.rispec.md +136 -0
- package/rispecs/borrowed_from_opencode/004-namespace-module-pattern.rispec.md +151 -0
- package/rispecs/borrowed_from_opencode/005-zod-schema-validation.rispec.md +139 -0
- package/rispecs/borrowed_from_opencode/006-named-error-system.rispec.md +155 -0
- package/rispecs/borrowed_from_opencode/007-structured-logging.rispec.md +138 -0
- package/rispecs/borrowed_from_opencode/008-lazy-initialization.rispec.md +127 -0
- package/rispecs/borrowed_from_opencode/009-multi-agent-system.rispec.md +97 -0
- package/rispecs/borrowed_from_opencode/010-agent-definition-config.rispec.md +135 -0
- package/rispecs/borrowed_from_opencode/011-agent-permission-rulesets.rispec.md +151 -0
- package/rispecs/borrowed_from_opencode/012-agent-prompt-templates.rispec.md +141 -0
- package/rispecs/borrowed_from_opencode/013-agent-generation.rispec.md +142 -0
- package/rispecs/borrowed_from_opencode/014-plan-build-mode-toggle.rispec.md +155 -0
- package/rispecs/borrowed_from_opencode/015-subagent-task-delegation.rispec.md +146 -0
- package/rispecs/borrowed_from_opencode/016-agent-model-selection.rispec.md +151 -0
- package/rispecs/borrowed_from_opencode/017-compaction-agent.rispec.md +150 -0
- package/rispecs/borrowed_from_opencode/018-session-persistence.rispec.md +125 -0
- package/rispecs/borrowed_from_opencode/019-session-compaction.rispec.md +132 -0
- package/rispecs/borrowed_from_opencode/020-session-forking.rispec.md +134 -0
- package/rispecs/borrowed_from_opencode/021-session-revert-snapshot.rispec.md +135 -0
- package/rispecs/borrowed_from_opencode/022-session-sharing.rispec.md +165 -0
- package/rispecs/borrowed_from_opencode/023-session-summary-diffs.rispec.md +165 -0
- package/rispecs/borrowed_from_opencode/024-child-sessions.rispec.md +164 -0
- package/rispecs/borrowed_from_opencode/025-session-title-generation.rispec.md +162 -0
- package/rispecs/borrowed_from_opencode/026-message-parts-model.rispec.md +201 -0
- package/rispecs/borrowed_from_opencode/027-streaming-message-deltas.rispec.md +212 -0
- package/rispecs/borrowed_from_opencode/028-multi-provider-architecture.rispec.md +184 -0
- package/rispecs/borrowed_from_opencode/029-provider-authentication.rispec.md +225 -0
- package/rispecs/borrowed_from_opencode/030-model-registry.rispec.md +222 -0
- package/rispecs/borrowed_from_opencode/031-cost-tracking.rispec.md +243 -0
- package/rispecs/borrowed_from_opencode/032-provider-transform-pipeline.rispec.md +282 -0
- package/rispecs/borrowed_from_opencode/033-provider-sdk-abstraction.rispec.md +338 -0
- package/rispecs/borrowed_from_opencode/034-tool-registry.rispec.md +110 -0
- package/rispecs/borrowed_from_opencode/035-tool-context-injection.rispec.md +155 -0
- package/rispecs/borrowed_from_opencode/036-tool-output-truncation.rispec.md +138 -0
- package/rispecs/borrowed_from_opencode/037-batch-tool.rispec.md +129 -0
- package/rispecs/borrowed_from_opencode/038-multi-edit-tool.rispec.md +167 -0
- package/rispecs/borrowed_from_opencode/039-apply-patch-tool.rispec.md +161 -0
- package/rispecs/borrowed_from_opencode/040-code-search-tool.rispec.md +143 -0
- package/rispecs/borrowed_from_opencode/041-web-fetch-tool.rispec.md +131 -0
- package/rispecs/borrowed_from_opencode/042-web-search-tool.rispec.md +159 -0
- package/rispecs/borrowed_from_opencode/043-todo-tool.rispec.md +156 -0
- package/rispecs/borrowed_from_opencode/044-plan-mode-tool.rispec.md +139 -0
- package/rispecs/borrowed_from_opencode/045-task-tool.rispec.md +146 -0
- package/rispecs/borrowed_from_opencode/046-question-tool.rispec.md +170 -0
- package/rispecs/borrowed_from_opencode/047-external-directory-tool.rispec.md +166 -0
- package/rispecs/borrowed_from_opencode/048-file-read-write-tools.rispec.md +205 -0
- package/rispecs/borrowed_from_opencode/049-lsp-server-management.rispec.md +104 -0
- package/rispecs/borrowed_from_opencode/050-lsp-hover-completion.rispec.md +102 -0
- package/rispecs/borrowed_from_opencode/051-lsp-diagnostics.rispec.md +86 -0
- package/rispecs/borrowed_from_opencode/052-lsp-root-detection.rispec.md +109 -0
- package/rispecs/borrowed_from_opencode/053-remote-mcp-servers.rispec.md +119 -0
- package/rispecs/borrowed_from_opencode/054-mcp-oauth-flow.rispec.md +107 -0
- package/rispecs/borrowed_from_opencode/055-mcp-tool-conversion.rispec.md +118 -0
- package/rispecs/borrowed_from_opencode/056-mcp-connection-monitoring.rispec.md +106 -0
- package/rispecs/borrowed_from_opencode/057-local-mcp-servers.rispec.md +116 -0
- package/rispecs/borrowed_from_opencode/058-rich-tui.rispec.md +108 -0
- package/rispecs/borrowed_from_opencode/059-streaming-display.rispec.md +116 -0
- package/rispecs/borrowed_from_opencode/060-permission-prompts.rispec.md +130 -0
- package/rispecs/borrowed_from_opencode/061-session-navigation.rispec.md +155 -0
- package/rispecs/borrowed_from_opencode/062-syntax-highlighting.rispec.md +151 -0
- package/rispecs/borrowed_from_opencode/063-keybinding-system.rispec.md +181 -0
- package/rispecs/borrowed_from_opencode/064-multi-level-config.rispec.md +155 -0
- package/rispecs/borrowed_from_opencode/065-jsonc-config.rispec.md +190 -0
- package/rispecs/borrowed_from_opencode/066-config-env-variables.rispec.md +153 -0
- package/rispecs/borrowed_from_opencode/067-config-deep-merging.rispec.md +178 -0
- package/rispecs/borrowed_from_opencode/068-remote-org-config.rispec.md +183 -0
- package/rispecs/borrowed_from_opencode/069-config-markdown-frontmatter.rispec.md +206 -0
- package/rispecs/borrowed_from_opencode/070-managed-config-directory.rispec.md +232 -0
- package/rispecs/borrowed_from_opencode/071-plugin-architecture.rispec.md +104 -0
- package/rispecs/borrowed_from_opencode/072-plugin-hooks.rispec.md +123 -0
- package/rispecs/borrowed_from_opencode/073-plugin-auto-install.rispec.md +115 -0
- package/rispecs/borrowed_from_opencode/074-permission-system.rispec.md +133 -0
- package/rispecs/borrowed_from_opencode/075-git-worktree-management.rispec.md +126 -0
- package/rispecs/borrowed_from_opencode/076-snapshot-system.rispec.md +124 -0
- package/rispecs/borrowed_from_opencode/077-snapshot-diff.rispec.md +117 -0
- package/rispecs/borrowed_from_opencode/078-snapshot-restore.rispec.md +128 -0
- package/rispecs/borrowed_from_opencode/079-worktree-branch-naming.rispec.md +122 -0
- package/rispecs/borrowed_from_opencode/080-sqlite-storage.rispec.md +134 -0
- package/rispecs/borrowed_from_opencode/081-database-migrations.rispec.md +148 -0
- package/rispecs/borrowed_from_opencode/082-database-transactions.rispec.md +138 -0
- package/rispecs/borrowed_from_opencode/083-deferred-effects.rispec.md +148 -0
- package/rispecs/borrowed_from_opencode/084-permission-rules.rispec.md +123 -0
- package/rispecs/borrowed_from_opencode/085-permission-glob-patterns.rispec.md +113 -0
- package/rispecs/borrowed_from_opencode/086-permission-merging.rispec.md +134 -0
- package/rispecs/borrowed_from_opencode/087-permission-modes.rispec.md +145 -0
- package/rispecs/borrowed_from_opencode/088-http-api-server.rispec.md +165 -0
- package/rispecs/borrowed_from_opencode/089-openapi-spec-generation.rispec.md +164 -0
- package/rispecs/borrowed_from_opencode/090-websocket-support.rispec.md +136 -0
- package/rispecs/borrowed_from_opencode/091-sse-streaming.rispec.md +168 -0
- package/rispecs/borrowed_from_opencode/092-mdns-discovery.rispec.md +145 -0
- package/rispecs/borrowed_from_opencode/093-javascript-sdk.rispec.md +200 -0
- package/rispecs/borrowed_from_opencode/094-skill-system.rispec.md +187 -0
- package/rispecs/borrowed_from_opencode/095-skill-discovery.rispec.md +182 -0
- package/rispecs/borrowed_from_opencode/096-desktop-remote-driving.rispec.md +175 -0
- package/rispecs/borrowed_from_opencode/INDEX.md +255 -0
- package/rispecs/core.rispecs.md +261 -0
- package/rispecs/engines.rispecs.md +241 -0
- package/rispecs/formatting.rispecs.md +252 -0
- package/rispecs/living-specifications.rispecs.md +361 -0
- package/rispecs/mcp.rispecs.md +197 -0
- package/rispecs/pde.rispecs.md +399 -0
- package/rispecs/pi-mono-envisionning/ENVISIONING.md +366 -0
- package/rispecs/pi-mono-envisionning/storytelling-horizon.rispecs.md +76 -0
- package/rispecs/pi-mono-envisionning/widget.rispecs.md +2 -0
- package/rispecs/relation-to-mcp-structural-thinking.kin.md +72 -0
- package/rispecs/research-for-better-framework/CLAUDE.md +7 -0
- package/rispecs/research-for-better-framework/survey-pi-openclaw-opencode-openhands.md +210 -0
- package/rispecs/session.rispecs.md +277 -0
- package/rispecs/stc.rispecs.md +138 -0
- package/rispecs/unifier.rispecs.md +317 -0
- package/scripts/LAUNCH--mcp-mia-code--testing--2603141315--ac705a66-2c15-4a1c-a26d-9491018c5ba8.sh +2 -0
- package/scripts/RESUME--mia-code--mcps--260313--ac705a66-2c15-4a1c-a26d-9491018c5ba8.sh +1 -0
- package/scripts/install-widget-in-home-pi-agent-extensions.sh +4 -0
- package/scripts/sample-decompose--2604011535-prompt.sh +1 -0
- package/skills/deep-search/AGENTS.md +17 -0
- package/skills/deep-search/SKILL.md +281 -0
- package/skills/deep-search/agent-templates.md +224 -0
- package/skills/deep-search/orchestration-patterns.md +95 -0
- package/skills/miaco-pde-inquiry-routing-deep-search/AGENTS.md +13 -0
- package/skills/miaco-pde-inquiry-routing-deep-search/SKILL.md +136 -0
- package/skills/miaco-pde-inquiry-routing-internal-external-relationship/AGENTS.md +4 -0
- package/skills/miaco-pde-inquiry-routing-internal-external-relationship/SKILL.md +157 -0
- package/skills/miaco-pde-inquiry-routing-local-qmd/AGENTS.md +42 -0
- package/skills/miaco-pde-inquiry-routing-local-qmd/SKILL.md +135 -0
- package/skills/qmd/AGENTS.md +3 -0
- package/skills/qmd/SKILL.md +144 -0
- package/skills/qmd/references/mcp-setup.md +102 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/SKILL.md +234 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/agent-templates.md +436 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/orchestration-patterns.md +197 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/references/ceremonial-technology.md +102 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/references/creative-orientation.md +99 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/references/prompt-decomposition.md +73 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/references/rise-framework.md +74 -0
- package/skills/rise-pde-inquiry-session-multi-agents-v3/references/structural-tension.md +82 -0
- package/src/cli.ts +35 -11
- package/src/geminiHeadless.ts +7 -2
- package/src/index.ts +2 -1
- package/src/mcp/miaco-server.ts +13 -1
- package/src/mcp/miatel-server.ts +13 -1
- package/src/mcp/miawa-server.ts +13 -1
- package/src/mcp/utils.ts +41 -8
- package/src/sessionStore.ts +44 -4
- package/src/types.ts +2 -1
- package/widget/mia-ceremony/README.md +36 -0
- package/widget/mia-ceremony/index.ts +143 -0
- package/widget/mia-interceptor/README.md +39 -0
- package/widget/mia-interceptor/index.ts +221 -0
- package/widget/mia-tools/README.md +37 -0
- package/widget/mia-tools/index.ts +569 -0
- package/widget/miette-echo/README.md +44 -0
- package/widget/miette-echo/index.ts +164 -0
- package/.claude/settings.local.json +0 -9
- package/.hch/issue_.env +0 -4
- package/.hch/issue_add__2601211715.json +0 -77
- package/.hch/issue_add__2601211715.md +0 -4
- package/.hch/issue_add__2602242020.json +0 -78
- package/.hch/issue_add__2602242020.md +0 -7
- package/.hch/issues.json +0 -2312
- package/.hch/issues.md +0 -30
- package/WS__mia-code__260214__IAIP_PDE.code-workspace +0 -29
- package/WS__mia-code__src332__260122.code-workspace +0 -23
- package/samples/copilot/session-state/be76abaa-a27f-4725-b2a9-22fb45f7e0f7/checkpoints/index.md +0 -6
- package/samples/copilot/session-state/be76abaa-a27f-4725-b2a9-22fb45f7e0f7/events.jsonl +0 -213
- package/samples/copilot/session-state/be76abaa-a27f-4725-b2a9-22fb45f7e0f7/plan.md +0 -243
- package/samples/copilot/session-state/be76abaa-a27f-4725-b2a9-22fb45f7e0f7/workspace.yaml +0 -5
|
@@ -0,0 +1,2025 @@
|
|
|
1
|
+
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with
|
|
2
|
+
Optimized Prompt Attacks
|
|
3
|
+
|
|
4
|
+
WARNING: This paper contains text that may be considered offensive.
|
|
5
|
+
|
|
6
|
+
Rana Muhammad Shahroz Khan1, Zhen Tan2, Sukwon Yun1,
|
|
7
|
+
Charles Fleming3, Tianlong Chen1
|
|
8
|
+
|
|
9
|
+
1University of North Carolina at Chapel Hill, 2Arizona State University, 3Cisco
|
|
10
|
+
|
|
11
|
+
5
|
|
12
|
+
2
|
|
13
|
+
0
|
|
14
|
+
2
|
|
15
|
+
|
|
16
|
+
t
|
|
17
|
+
c
|
|
18
|
+
O
|
|
19
|
+
8
|
|
20
|
+
|
|
21
|
+
]
|
|
22
|
+
|
|
23
|
+
A
|
|
24
|
+
M
|
|
25
|
+
|
|
26
|
+
.
|
|
27
|
+
s
|
|
28
|
+
c
|
|
29
|
+
[
|
|
30
|
+
|
|
31
|
+
2
|
|
32
|
+
v
|
|
33
|
+
8
|
|
34
|
+
1
|
|
35
|
+
2
|
|
36
|
+
0
|
|
37
|
+
0
|
|
38
|
+
.
|
|
39
|
+
4
|
|
40
|
+
0
|
|
41
|
+
5
|
|
42
|
+
2
|
|
43
|
+
:
|
|
44
|
+
v
|
|
45
|
+
i
|
|
46
|
+
X
|
|
47
|
+
r
|
|
48
|
+
a
|
|
49
|
+
|
|
50
|
+
Abstract
|
|
51
|
+
|
|
52
|
+
Most discussions about Large Language Model
|
|
53
|
+
(LLM) safety have focused on single-agent set-
|
|
54
|
+
tings but multi-agent LLM systems now cre-
|
|
55
|
+
ate novel adversarial risks because their be-
|
|
56
|
+
havior depends on communication between
|
|
57
|
+
agents and decentralized reasoning.
|
|
58
|
+
In this
|
|
59
|
+
work, we innovatively focus on attacking prag-
|
|
60
|
+
matic systems that have constrains such as lim-
|
|
61
|
+
ited token bandwidth, latency between mes-
|
|
62
|
+
sage delivery, and defense mechanisms. We de-
|
|
63
|
+
sign a permutation-invariant adversarial attack
|
|
64
|
+
that optimizes prompt distribution across la-
|
|
65
|
+
tency and bandwidth-constraint network topolo-
|
|
66
|
+
gies to bypass distributed safety mechanisms
|
|
67
|
+
within the system. Formulating the attack
|
|
68
|
+
path as a problem of maximum-flow minimum-
|
|
69
|
+
cost, coupled with the novel Permutation-
|
|
70
|
+
Invariant Evasion Loss (PIEL), we leverage
|
|
71
|
+
graph-based optimization to maximize attack
|
|
72
|
+
success rate while minimizing detection risk.
|
|
73
|
+
Evaluating across models including Llama,
|
|
74
|
+
Mistral, Gemma, DeepSeek and other variants
|
|
75
|
+
on various datasets like JailBreakBench and
|
|
76
|
+
AdversarialBench, our method outperforms
|
|
77
|
+
conventional attacks by up to 7×, exposing
|
|
78
|
+
critical vulnerabilities in multi-agent systems.
|
|
79
|
+
Moreover, we demonstrate that existing de-
|
|
80
|
+
fenses, including variants of Llama-Guard and
|
|
81
|
+
PromptGuard, fail to prohibit our attack, em-
|
|
82
|
+
phasizing the urgent need for multi-agent spe-
|
|
83
|
+
cific safety mechanisms.
|
|
84
|
+
|
|
85
|
+
1
|
|
86
|
+
|
|
87
|
+
Introduction
|
|
88
|
+
|
|
89
|
+
Recent breakthroughs in Large Language Mod-
|
|
90
|
+
els (LLMs) have shown remarkable prowess in
|
|
91
|
+
various tasks, such as writing complex computer
|
|
92
|
+
code (Zheng et al., 2023; Tong and Zhang, 2024),
|
|
93
|
+
logical reasoning (Ouyang et al., 2022; Thoppi-
|
|
94
|
+
lan et al., 2022; Bai et al., 2022), among others.
|
|
95
|
+
However, as real-world tasks become increasingly
|
|
96
|
+
complex, a single LLM is often insufficient to han-
|
|
97
|
+
dle all aspects of a complex task. This limitation
|
|
98
|
+
has led to the rise of LLM-based agents (Liang
|
|
99
|
+
|
|
100
|
+
Figure 1: Adversarial attack in multi-agent LLM sys-
|
|
101
|
+
tems. Top: Network topology showing communi-
|
|
102
|
+
cation between agents and the targeted string attack
|
|
103
|
+
flow from source to target. Bottom: Comparison be-
|
|
104
|
+
tween existing approaches that fail under constraints
|
|
105
|
+
and our method using MFMC problem formulation
|
|
106
|
+
and Permutation-Invariant Loss, which successfully by-
|
|
107
|
+
passes safety mechanisms while respecting constraints.
|
|
108
|
+
|
|
109
|
+
et al., 2023; Yang et al., 2023), which integrate
|
|
110
|
+
language generation with different tools. Recent
|
|
111
|
+
research (Du et al., 2023; Wu et al., 2023; Liu et al.,
|
|
112
|
+
2023d; Shinn et al., 2024; Wang et al., 2023; Zhang
|
|
113
|
+
et al., 2024b; Qian et al., 2023; Jin et al., 2023) has
|
|
114
|
+
further shown that multi-agent LLM systems can
|
|
115
|
+
significantly enhance task performance by distribut-
|
|
116
|
+
ing reasoning and leveraging collective intelligence.
|
|
117
|
+
These systems offer advantages in scalability and
|
|
118
|
+
adaptability, making them increasingly relevant in
|
|
119
|
+
autonomous systems, large-scale content modera-
|
|
120
|
+
tion, and AI-driven governance.
|
|
121
|
+
|
|
122
|
+
Despite their advantages, multi-agent LLM systems introduce novel security risks (Amayuelas et al., 2024) that remain largely unexplored. While previous research has extensively studied vulnerabilities in single-agent settings, such as adversarial prompting for jailbreak attacks (Zou et al., 2023) and data poisoning (Ramirez et al., 2022), attacking a multi-agent system poses unique challenges and settings. Existing works highlight key aspects of these vulnerabilities. Evil Geniuses (Tian et al., 2023) explores role-based adversarial prompting, emphasizing the need to further investigate inter-agent communication risks. Similarly, Prompt Infection (Lee and Tiwari, 2024) introduces self-replicating prompt injections, demonstrating how adversarial prompts can persist and spread. Building on those works, this paper targets multi-agent systems in a novel pragmatic scenario: optimizing adversarial prompt propagation in latency-aware and token-bandwidth-limited multi-agent systems with built-in safety mechanisms. We aim to reveal that even in such constrained settings, these systems still exhibit adversarial vectors, as attackers can manipulate inter-agent messaging, exploit communication bottlenecks, or disrupt agent coordination to achieve malicious objectives.
Given this premise, as shown in Figure 1, and the key works discussed above, our paper aims to answer this key question:

(Q) How can adversarial prompts be optimally propagated through a constrained multi-agent LLM system to evade detection while ensuring jailbreak success, considering token bandwidth limits and asynchronous message arrival?
In this paper, we develop a permutation-invariant attack in multi-agent settings that exploits inter-agent communication to bypass safety mechanisms. Unlike conventional jailbreak attacks that target a single standalone model, our method distributes adversarial prompts across the agent network by optimizing over the topology and its constraints, with the goal of attacking a single agent that is not accessible from outside the system. This ensures that the attack propagates undetected, maximizing its effectiveness. We formalize this attack as a maximum-flow minimum-cost optimization problem, accounting for token bandwidth constraints, communication topologies, and distributed safety enforcement.
To validate our method, we conduct extensive experiments across multiple architectures, including Llama-2-7B (Touvron et al., 2023). We benchmark our attack on a variety of datasets, including JailbreakBench (Chao et al., 2024), demonstrating that our method achieves up to 7× the attack success rate of a vanilla prompt. Furthermore, we evaluate the effectiveness of existing safety mechanisms, including Llama-Guard (Inan et al., 2023), and show that they fail to defend against our attack in multi-agent settings. Lastly, we justify our settings and hyper-parameters with different ablation studies.
Our key contributions include: ❶ We identify new vulnerabilities in multi-agent communication, where attackers can manipulate inter-agent messaging to bypass existing safety constraints. We analyze realistic attack scenarios, including token bandwidth limits and message asynchrony, demonstrating fundamental weaknesses in multi-agent reasoning systems. ❷ We propose a novel optimization-based attack, modeling adversarial prompt propagation under the constrained setting as a maximum-flow minimum-cost problem. Our method remains effective across different graph configurations, ensuring high attack success rates even in randomized agent topologies. ❸ We evaluate our attack across multiple LLM architectures, including Llama, Mistral, Gemma, and their DeepSeek-R1-distilled versions. Our method is benchmarked on JailbreakBench, AdversarialBench, and In-the-wild Jailbreak Prompts, demonstrating attack success rates of up to 94%, significantly outperforming naive prompting (11%). We conduct ablation studies on different topologies, offering practical insights into securing multi-agent LLM deployments. ❹ Additionally, we assess various safety mechanisms, including variants of Llama-Guard and PromptGuard, and show that they fail to prevent our attack, highlighting the urgent need for advanced defenses.
2 Related Works

Multi-LLM Agents. Large Language Model agents have shown remarkable performance in various tasks through mutual collaboration (Chen et al., 2023; Hua et al., 2023; Cohen et al., 2023; Zhou et al., 2023; Li et al., 2023b,a; Chan et al., 2023; Dong et al., 2024; Qian et al., 2023). A growing body of research demonstrates how integrating multiple agents in collaborative frameworks can enhance problem-solving abilities in complex scenarios (Liu et al., 2023a; Chen et al., 2024). Notable examples include Generative Agents (Park et al., 2023), which simulates a town of 25 agents to study social interactions and collective memory. The Natural Language-Based Society (NL-SOM) (Zhuge et al., 2023) takes a different approach, orchestrating agents with specialized functions to tackle complex tasks through iterative "mindstorms". Despite this performance and effectiveness, new security concerns have been raised.
Jailbreak Attacks in LLMs. Recent studies show that Large Language Models face serious security risks, because certain precisely designed prompts can disable their fundamental safety features (Zeng et al., 2024a; Bai et al., 2022). These attacks, referred to as "jailbreak" attacks, demonstrate remarkable effectiveness by causing LLMs to produce content that breaks their declared ethical and operational guidelines. Research in this field has progressed along two separate development paths: (1) traditional prompt engineering, where human researchers create deceptive prompts (Wei et al., 2024; Liu et al., 2023c; Shen et al., 2024), and (2) learning-based attack strategies, where methods are optimized automatically to attack LLMs (Guo et al., 2021; Lyu et al., 2022, 2023, 2024; Liu et al., 2023b; Zou et al., 2023). The learning-based methods are particularly concerning, as they can systematically exploit the weaknesses of a system.
Jailbreak Attacks in Multi-Agent Systems. The landscape of jailbreak attacks continues to expand, with many recent works (Tian et al., 2023; Gu et al., 2024; Tan et al., 2024; Zeng et al., 2024b; Lee and Tiwari, 2024) exploring how they affect multi-agent systems. While these studies highlight critical risks, our work differs in its focus on adversarial prompt propagation under constrained multi-agent communication with certain limitations: token bandwidth constraints, latency-aware messaging, and decentralized safety enforcement. Unlike Evil Geniuses (Tian et al., 2023), which examines role-based adversarial attacks, we optimize prompt routing to exploit the network topology under the above-mentioned bottlenecks. Similarly, Agent Smith (Gu et al., 2024) studies exponential jailbreak propagation, but assumes unrestricted inter-agent messaging, unlike our token-bandwidth-constrained communication. Wolf Within (Tan et al., 2024) and Prompt Infection (Lee and Tiwari, 2024) explore malicious prompt propagation, but focus on stealthy influence and self-replicating attacks, respectively. By modeling pragmatic multi-agent attack scenarios, our study addresses security challenges beyond existing works, particularly extending them to ensure effective jailbreaks despite topological constraints.
3 Threat Model

In this section, we introduce the settings used to study the vulnerabilities of a multi-agent system in a realistic manner. We discuss the general settings of the environment into which the adversary will be deployed, as well as the capabilities of said adversary.
3.1 Scenario

We consider a multi-agent LLM system, denoted by S, where multiple LLMs operate within a connected network, communicating with one another to complete tasks collaboratively. The agents in this system exchange messages via a predefined communication topology L (essentially an undirected graph), which dictates how prompts are passed between models. Every input into any LLM is also passed on to its neighbors. Similarly, each individual agent is responsible for its own memory bank, i.e., the context window, which accumulates over time until a maximum size is reached, at which point the model evicts the oldest memory first. We assume in our setting that the memory bank accumulates the inputs an agent has received and, at inference time, concatenates everything into a string that is used as the context to generate the new output. An illustration of the threat model is also presented in Figure 2. This setting introduces several key constraints that make adversarial attacks fundamentally different from those on a traditional single LLM:
❶ Each edge in the network has a token bandwidth constraint, F(uv) for edge uv (between LLMs u and v), meaning only a limited number of tokens can be transmitted per interaction; this need not be the same for every edge. This constraint arises from various factors, such as: (1) design limitations, where different agents operate on distinct GPUs with varying memory capacities, (2) communication efficiency, where lower-bandwidth connections prioritize lightweight message exchanges, and (3) agent-specific limitations, as some LLMs are inherently constrained in how much input they can process per step. ❷ Latency varies across different edges, meaning that messages do not always arrive at their destination in a deterministic sequence. Some edges may transmit prompts faster than others, leading to asynchronous message arrival at the target LLM. This variability necessitates the design of permutation-invariant adversarial prompts, ensuring that the attack remains effective regardless of the order in which different chunks of the prompt reach the target. ❸ To mitigate harmful interactions, certain edges in the network are equipped with safety mechanisms, such as Llama-Guard, designed to filter adversarial prompts. However, not every edge is protected, for the following reasons: (1) computational limitations, as running safety filters on every edge would require significant GPU resources, (2) strategic safety placement, where only high-risk interactions are monitored, and (3) system design trade-offs, where some edges prioritize communication speed over security.

Figure 2: Process of generating and optimizing adversarial prompt chunks for multi-agent LLM systems. (a) Multi-agent Topologies: Different network structures including Chain, Tree, Random Graph, and Complete Graph that influence attack effectiveness. (b) Topological Optimization: Identifying optimal paths based on bandwidth constraints and detection risk, with chunks strategically distributed across the network. (c) Permutation Invariance: Due to network latency, prompt chunks may arrive in different orders, creating a sampling space where optimized chunks remain effective regardless of arrival sequence, successfully bypassing safety mechanisms.
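The memory-bank behavior described in the scenario above (accumulate received inputs, evict the oldest once the maximum size is reached, then concatenate everything into the inference context) can be sketched as follows. This is a toy illustration in our own notation, not the paper's implementation:

```python
from collections import deque

class AgentMemoryBank:
    """Per-agent memory bank: accumulates received messages up to a
    maximum size, evicting the oldest first, and concatenates the
    survivors into the context used at inference time."""

    def __init__(self, max_messages: int):
        # deque(maxlen=...) drops the oldest entry automatically on overflow
        self.bank = deque(maxlen=max_messages)

    def receive(self, message: str) -> None:
        self.bank.append(message)

    def context(self) -> str:
        # At inference, everything received so far is concatenated
        # into a single string used as the generation context.
        return " ".join(self.bank)

mem = AgentMemoryBank(max_messages=3)
for msg in ["m1", "m2", "m3", "m4"]:  # "m1" is evicted on the 4th receive
    mem.receive(msg)
print(mem.context())  # -> "m2 m3 m4"
```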
Terminology. We denote each LLM as a vertex vi, and the set of such LLMs as V. Similarly, we denote by E the set of all edges uv, and the token bandwidth of these edges is defined by a function F : E → R≥0 such that for any edge uv, F(uv) = F(vu). Lastly, we quantify the risk of being caught by a safety mechanism as a function G : E → R≥0, where G(uv) = 0 if there is no safety mechanism on the edge uv. As a result, such a system can be denoted by S(E, V, F, G), which we will simply refer to as S from here on.
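As a minimal sketch of this notation (the topology and values below are hypothetical), S(E, V, F, G) can be written as plain mappings, with undirected edges so that F(uv) = F(vu) holds by construction and G(uv) = 0 marking unguarded edges:

```python
# Hypothetical 4-agent system S(E, V, F, G).
V = {"v1", "v2", "v3", "v4"}

def edge(u, v):
    # Undirected edge: frozenset((u, v)) == frozenset((v, u)),
    # so F(uv) = F(vu) is automatic.
    return frozenset((u, v))

E = {edge("v1", "v2"), edge("v2", "v4"), edge("v1", "v3"), edge("v3", "v4")}

# Token bandwidth F : E -> R>=0 and detection risk G : E -> R>=0.
F = {edge("v1", "v2"): 128, edge("v2", "v4"): 64,
     edge("v1", "v3"): 32,  edge("v3", "v4"): 96}
G = {edge("v1", "v2"): 0.0, edge("v2", "v4"): 0.3,  # v2-v4 carries a safety filter
     edge("v1", "v3"): 0.0, edge("v3", "v4"): 0.1}

assert F[edge("v1", "v2")] == F[edge("v2", "v1")]   # symmetry for free
unguarded = {e for e in E if G[e] == 0.0}           # edges with G(uv) = 0
```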
3.2 Adversary Capabilities

It is assumed that the adversary operates within the multi-agent system S, leveraging the following capabilities to execute a stealthy jailbreak attack: ❶ Jailbreak via Multi-Agent Communication: The adversary can send adversarial prompts into the system through an initial agent vi, with the goal of propagating the attack to a target agent vt. However, due to token bandwidth constraints and message delays, the adversarial prompt must be partitioned and strategically routed through the network to evade detection. ❷ Knowledge of Network Topology L and Safety Mechanisms: Partial knowledge of the communication graph, including agent connectivity and token bandwidth constraints on edges, is available to the adversary. Additionally, although the adversary does not have direct access to internal LLM parameters, they are aware that certain edges are protected by safety mechanisms and can estimate the likelihood of detection using the risk function G. ❸ Architecture of the Target Model vt: The adversary knows the architecture of the target LLM vt, allowing them to optimize adversarial prompts that actually jailbreak that model type. ❹ Restricted System Access: The adversary does not control all agents in the system, nor do they have full visibility into message processing. They cannot directly modify parameters or override built-in safety mechanisms.
3.3 Adversarial Goals

In our setting, the adversary aims to execute a stealthy jailbreak attack within a multi-agent LLM system S, leveraging optimized prompt propagation strategies to bypass safety mechanisms and manipulate the target LLM's behavior. The primary attack scenario we consider is jailbreak, i.e., generating harmful output, where the adversary carefully routes an adversarial prompt through the network topology to ensure it reaches the target agent vt while avoiding detection.
4 Method

In this section, we decouple the structure and objective of (i) finding the optimal path in the multi-agent communication topology, and (ii) the permutation-invariant adversarial formulation that effectively bypasses safety mechanisms in a constrained LLM network, as shown in Figure 2.
4.1 Topological Optimization

Problem Formulation. In a multi-agent system S = (V, E), an adversary aims to propagate an adversarial prompt from a source agent, denoted as $v_i \in V$, to a target agent $v_t \in V$, while minimizing the risk of detection by safety mechanisms and maximizing the token flow through the network. Each communication edge $(u, v) \in E$ has a token bandwidth constraint $F(u, v)$, which limits the number of tokens that can be transmitted in a single exchange, and a risk function $G(u, v)$, representing the likelihood of adversarial content being detected and blocked by safety mechanisms such as Llama-Guard. The adversary's objective is to find an optimal path that balances high token throughput with minimal detection risk.

Minimum Cost Maximum Flow Formulation. Given the above problem, we formulate it as a Minimum Cost Maximum Flow problem. We define a flow function $f : E \to \mathbb{R}_{\geq 0}$, where $f(u, v)$ represents the number of adversarial tokens transmitted along edge $(u, v) \in E$. The objective is to minimize the total risk while ensuring maximum token flow from $v_i$ to $v_t$:

$$\min \sum_{(u,v) \in E} G(u, v)\, f(u, v) \qquad (1)$$

subject to the following constraints:

Token Capacity Constraints:

$$0 \leq f(u, v) \leq F(u, v), \quad \forall (u, v) \in E \qquad (2)$$

Flow Conservation:

$$\sum_{w \in V} f(w, u) = \sum_{w \in V} f(u, w), \quad \forall u \in V \setminus \{v_i, v_t\} \qquad (3)$$
Source and Sink Constraints:

$$\sum_{w \in V} f(v_i, w) - \sum_{w \in V} f(w, v_i) = F_{\max} \qquad (4)$$

$$\sum_{w \in V} f(w, v_t) - \sum_{w \in V} f(v_t, w) = F_{\max} \qquad (5)$$

where $F_{\max}$ represents the maximum flow that can be transmitted from $v_i$ to $v_t$.

To solve this optimization problem efficiently and obtain the optimal attack path, we deploy the solution algorithm implemented in NetworkX (Hagberg et al., 2008), which finds the highest token flow while minimizing detection risk. More information on how we quantify this risk can be found in Appendix C.
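As an illustrative sketch of this step (the topology, capacities, and risks below are hypothetical, and the risks are scaled to integers because NetworkX's min-cost-flow routines expect integral weights), the MFMC problem of Eqs. (1)-(5) can be solved with NetworkX directly:

```python
import networkx as nx

# Hypothetical 4-agent topology: "capacity" plays the role of the token
# bandwidth F(u, v); "weight" is the detection risk G(u, v) scaled x100.
S = nx.DiGraph()
edges = [
    ("vi", "a", 64, 0),   # unguarded edge: G = 0
    ("vi", "b", 32, 0),
    ("a", "vt", 48, 30),  # guarded edge, risk 0.30 scaled to 30
    ("b", "vt", 32, 10),
]
for u, v, cap, risk in edges:
    S.add_edge(u, v, capacity=cap, weight=risk)

# Maximum token flow from source vi to target vt at minimum total risk.
flow = nx.max_flow_min_cost(S, "vi", "vt")
sent = sum(flow["vi"].values())   # total tokens leaving the source
cost = nx.cost_of_flow(S, flow)   # total (scaled) detection risk incurred
print(sent, cost)                 # -> 80 1760
```

Here the optimal solution saturates both paths: 48 tokens via the guarded edge and 32 via the cheaper one, yielding a flow of 80 with scaled cost 48·30 + 32·10 = 1760.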
4.2 Permutation Invariant Evasion Loss

Problem Formulation. In a multi-agent system S, communication constraints introduce a unique challenge for adversarial attacks. Prompts are often transmitted in discrete chunks due to token bandwidth limitations, agent-specific processing delays, and asynchronous message arrival. As these chunks propagate through the communication network, they accumulate in an agent's memory bank but arrive in varying orders depending on network latency and routing paths. This inherent non-determinism means that adversarial prompts must remain effective regardless of how they are received and concatenated by the target agent. The primary challenge in designing adversarial prompts for a multi-agent LLM system S therefore lies in ensuring that the objective enforces permutation invariance. Given such a system, we must optimize a structured adversarial prompt that remains effective regardless of the permutation of its chunks.

Permutation Invariant Evasion Loss (PIEL). Let the LLM agent be a next-token predictor, i.e., a function that maps an input sequence of tokens $x_{1:n}$ to a probability distribution over the next token. Specifically, we denote the probability of the model generating the next token $x_{n+1}$ given prior tokens $x_{1:n}$ as $p(x_{n+1} \mid x_{1:n})$. Extending this to a full sequence of $L$ target tokens, we express the probability of generating a specific harmful output $x^*_{n+1:n+L}$ as

$$p(x^*_{n+1:n+L} \mid x_{1:n}) = \prod_{i=1}^{L} p(x^*_{n+i} \mid x_{1:n+i-1}) \qquad (6)$$

The adversarial loss function is then given by the negative log-likelihood of the target sequence:

$$\mathcal{L}_{NLL}(x_{1:n}) = -\log p(x^*_{n+1:n+L} \mid x_{1:n}) \qquad (7)$$

and simply minimizing $\mathcal{L}_{NLL}(x_{1:n})$ increases the likelihood of generating the adversarial target phrase. To introduce permutation invariance, we
[Table 1: ASR-m / ASR / ASR-M of Vanilla Prompt, GCG, and Ours for Llama-2-7B, Llama-3.1-8B, Mistral-7B, Gemma-2-9B, and their DeepSeek-R1-distilled variants on JailbreakBench, AdversarialBench, and In-the-wild Jailbreak.]

Table 1: Attack success rates (ASR) of different adversarial prompting methods across multiple LLM architectures on different benchmarks. We report the minimum (ASR-m), average (ASR), and maximum (ASR-M) attack success rates over multiple trials.
structure the adversarial prompt as $K$ discrete chunks: $C = \{C_1, C_2, \ldots, C_K\}$, where each chunk $C_i$ consists of a sequence of tokens of length $L_i$. Since different message paths in the multi-agent system S may deliver these chunks in varying sequences, we define the loss to be averaged over all possible orderings of the chunks:

$$\mathcal{L}(C) = \frac{1}{K!} \sum_{\pi \in S_K} -\log p(x^*_{n+1:n+L} \mid \phi) \qquad (8)$$

where $S_K$ represents the set of all possible chunk orderings, and $\phi$ represents the operation $\text{Concatenate}(\pi(1), \pi(2), \ldots, \pi(K))$. However, optimizing token selection in adversarial inputs is challenging due to their discrete nature. To navigate this, we employ the Greedy Coordinate Gradient (GCG) (Zou et al., 2023) method, iteratively refining token choices while considering all chunk-order permutations. For each token $t$ in chunk $C_i$, we compute its gradient, based on the expectation over all orderings, as $\nabla_t \mathcal{L}(C)$. In each iteration we then follow three key steps: (1) loss computation across all orderings, (2) gradient computation for token updates, and (3) token substitution using the GCG strategy. The whole algorithm is also described in the Appendix as Algorithm 1.
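To make Eq. (8) concrete, the sketch below averages a stand-in negative log-likelihood over all K! chunk orderings. In the actual attack, `target_nll` would be computed by the target LLM and the chunk tokens optimized with GCG; the scoring function here is purely illustrative:

```python
import itertools

def piel(chunks, target_nll):
    """Permutation Invariant Evasion Loss (Eq. 8): average the target
    negative log-likelihood over all K! orderings of the chunks."""
    perms = list(itertools.permutations(chunks))
    # phi = Concatenate(pi(1), ..., pi(K)) for each ordering pi
    return sum(target_nll(" ".join(p)) for p in perms) / len(perms)

# Illustrative stand-in: "loss" is lower when chunk "c1" appears earlier.
def toy_nll(prompt: str) -> float:
    return float(prompt.index("c1"))

loss = piel(["c1", "c2", "c3"], toy_nll)
print(loss)  # -> 3.0, the position of "c1" averaged over all 6 orderings
```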
Stochastic Permutation Invariant Evasion Loss (S-PIEL). Equation (8) computes the loss over all $K!$ possible permutations, which can be computationally prohibitive in practice if the targeted model $v_t$ has multiple neighbors (and hence multiple chunks). To address this, we introduce a stochastic version of the loss. Instead of evaluating the loss on every element of $S_K$, we randomly sample a smaller subset $\tilde{S}_K$ and approximate the loss with it as follows:

$$\tilde{\mathcal{L}}(C) = \frac{1}{|\tilde{S}_K|} \sum_{\pi \in \tilde{S}_K} -\log p(x^*_{n+1:n+L} \mid \phi) \qquad (9)$$

To understand the trade-off between computation and the quality of the adversarial prompts generated with S-PIEL, we perform an ablation study, which can be found in Section 5.5.
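A sketch of the stochastic estimate in Eq. (9): instead of enumerating all K! orderings, random permutations are drawn directly by shuffling. The stand-in scoring function and sample size are illustrative, not the paper's settings:

```python
import random

def s_piel(chunks, target_nll, num_samples, rng):
    """Stochastic PIEL (Eq. 9): Monte Carlo estimate of the
    permutation-averaged loss from num_samples random orderings."""
    total = 0.0
    for _ in range(num_samples):
        order = list(chunks)
        rng.shuffle(order)               # draw pi ~ S_K without enumerating K!
        total += target_nll(" ".join(order))
    return total / num_samples

def toy_nll(prompt: str) -> float:       # stand-in for the target model's NLL
    return float(len(prompt))            # constant across orderings here

rng = random.Random(0)
estimate = s_piel(["c1", "c2", "c3", "c4"], toy_nll, num_samples=6, rng=rng)
print(estimate)  # -> 11.0 (every concatenation has length 11)
```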
5 Experiments

In this section, we conduct a series of experiments to evaluate the effectiveness of our proposed permutation-invariant attack. Detailed findings for all the experiments are described below, while all the experimental settings, including baselines, datasets, architectures, training settings, and comprehensive metrics, are discussed in Appendix B.
|
|
1010
|
+
|
|
1011
|
+
To evaluate the effectiveness of our permutation-
|
|
1012
|
+
invariant attack, we conduct experiments across
|
|
1013
|
+
multiple LLM architectures, including Llama-2,
|
|
1014
|
+
across different benchmarks. Furthermore, each ex-
|
|
1015
|
+
periment is run three times with randomized multi-
|
|
1016
|
+
agent topologies to mitigate bias, ensuring robust
|
|
1017
|
+
evaluation of our attack performance. Complete
|
|
1018
|
+
|
|
1019
|
+
experimental details can be found in Appendix B.1.
Based on Table 1, we can derive some key findings across different LLM architectures and benchmarks: ❶ Baseline Comparison: Our method substantially outperforms existing approaches across all scenarios. Vanilla prompts show near-zero effectiveness on most benchmarks, while GCG achieves moderate success (16-32%) only on specific models like Mistral-7B. In contrast, our approach demonstrates up to a 7× improvement over the best baseline performance, highlighting the effectiveness of the permutation-invariant design. For instance, on Llama-2-7B, vanilla prompts achieve a 0% success rate across structured benchmarks, while GCG manages only 1.7% ASR on JailbreakBench. In contrast, our method achieves 72.6% ASR, demonstrating a dramatic improvement in attack capability. ❷ Attack Stability: The small gap between the minimum (ASR-m) and maximum (ASR-M) success rates – typically 2-6% – demonstrates the stability of our attack across different random topologies. This consistency is particularly evident in Gemma-2-9B, where the variance remains under 4%. This stability extends to other models, with Mistral-7B showing only a 6% variation (78.0% to 84.0%), confirming the robustness of our permutation-invariant design. ❸ Model Sensitivity: Some models exhibit higher susceptibility to our attack. For example, Mistral-7B and Llama-2-7B show the highest vulnerability, achieving 81.2% and 72.6% ASR on Jailbreak Benchmark, respectively, while models like Llama-3.1-8B achieve 41.3% ASR on the same benchmark – a significant result given the model's initial results. These findings indicate that, across different architectures, our method outperforms the existing baselines in the multi-agent setting. ❹ General Observations: Interestingly, a larger model size does not always guarantee better security. Additionally, the DeepSeek-R1 distillation showcases a notably lower ASR (41.3%) on the same benchmark.
5.2 Safety Mechanism Efficacy

The goal of this experiment is to analyze the effectiveness of the graph optimizations performed in Section 4.1 in reducing the detectability of jailbreak prompts when routed through a multi-agent system with safety mechanisms. Our primary focus is on understanding whether graph-optimized routing helps bypass safety mechanisms of different types. The experimental settings are explained in Appendix B.2.
Table 2: Transferability evaluation of our adversarial prompts across different source and target models. Rows are source models; columns are target models.

Jailbreak Benchmark      Llama-2-7B  Mistral-7B  Gemma-2-9B
Llama-2-7B               0.740       0.710       0.680
Mistral-7B               0.690       0.820       0.610
Gemma-2-9B               0.610       0.690       0.710

Adversarial Benchmark    Llama-2-7B  Mistral-7B  Gemma-2-9B
Llama-2-7B               0.522       0.488       0.492
Mistral-7B               0.446       0.522       0.412
Gemma-2-9B               0.472       0.498       0.512
Based on Figure 3, which compares the effectiveness of different safety mechanisms against various attack methods, we observe some key findings: ❶ Baseline Comparison: Across all safety mechanisms, vanilla prompts are the most easily detected, followed by GCG prompts, while our permutation-invariant prompts consistently achieve the lowest detection rates for attacks in multi-agent systems (chunked). ❷ Defense Robustness: Even the most advanced safety mechanisms struggle against our permutation-invariant attack. The best-performing model, Llama-Guard-3-8B, still sees its F1-score drop by nearly 30% when faced with our method compared to vanilla prompts, highlighting a significant vulnerability in current safety measures.
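The F1-score drop discussed above follows the standard precision/recall definition; the counts in this sketch are hypothetical, chosen only to illustrate how a guard model's F1 can fall when most chunked adversarial prompts slip past it.

```python
def f1_score(tp, fp, fn):
    """F1 of a safety classifier flagging adversarial prompts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: the guard catches vanilla prompts easily...
f1_vanilla = f1_score(tp=95, fp=5, fn=5)
# ...but misses many chunked permutation-invariant prompts (high fn).
f1_ours = f1_score(tp=55, fp=5, fn=45)
```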
5.3 Transferability

To assess the transferability of adversarial prompts, we evaluate attack success rates across different source-target LLM pairs, including Llama-2-7B, Mistral-7B, and Gemma-2-9B. We use Jailbreak Benchmark and Adversarial Benchmark to measure the Attack Success Rate (ASR) when prompts optimized on one model are applied to another. Further details regarding the setup are shared in Appendix B.3, and the findings on the transferability of our permutation-invariant attack across different LLM architectures are summarized in Table 2. We observe several key findings: ❶ Source-Target Similarity: The effectiveness of transferred attacks is strongly correlated with the architectural similarity between source and target models. For instance, when using Llama-2-7B as the source model on Jailbreak Benchmark, the attack achieves 74% ASR on the source model itself but also maintains relatively high effectiveness on Mistral-7B (71%) and Gemma-2-9B (68%). This suggests that adversarial prompts learned on one architecture can successfully transfer to other models, though with some degradation in performance. ❷ Model-Specific Robustness:
Mistral-7B shows unique characteristics both as a source and as a target model. As a source, it achieves the highest self-ASR (82% on Jailbreak Benchmark) but shows steeper performance drops when transferred to other architectures (69% on Llama-2-7B). This indicates that while Mistral-7B can generate highly effective attacks, these attacks may be more model-specific than those generated by other architectures. ❸ Architecture Generalization: Interestingly, while Gemma-2-9B shows moderate performance as a source model (71% self-ASR), its attacks demonstrate more consistent transfer performance across different target models, with smaller variations in success rates. This suggests that some architectures may naturally generate more generalizable adversarial prompts, even if they are not optimal for any specific target.

Figure 3: Detection efficacy (F1-score) of different safety mechanisms (PromptGuard-86M, Llama-Guard-7B, Llama-Guard-2-8B, Llama-Guard-3-8B, Llama-Guard-3-1B) against vanilla, GCG, and our adversarial prompts.

5.4 Ablation Study 1: Effect of Topology

To investigate the effect of communication topology on the success of adversarial attacks in multi-agent systems, we conduct an ablation study using a range of graph structures. Our goal is to systematically vary the underlying communication structure so that we can quantify the impact of network topology on adversarial robustness. Experimental details are listed in Appendix B.4.

Figure 4: Impact of different network topologies (chain, tree, complete graph, random graph) on attack success rate (ASR) in a multi-agent LLM system.

The results for the ablation are summarized in Figure 4. Complete-graph structures demonstrate the highest vulnerability to attacks, achieving an ASR of around 78%, while chain topologies prove the most resilient with approximately 60% ASR. This suggests that increased connectivity and path diversity might actually make systems more susceptible to adversarial attacks that exploit the topology to their own advantage.

5.5 Ablation Study 2: Sensitivity Analysis for Stochastic Version

We know that the Permutation Invariant Evasion Loss introduced in Section 4.2 can be computationally prohibitive, as its complexity is O(K!), where K is the optimal number of chunks. Hence, to address this, we introduce the stochastic version of the loss, where we randomly sample M of the K! permutations at each iteration. To investigate the effect of the sample size M on the performance of our method, we conduct an ablation study measuring the ASR as a function of M. The experimental details can be found in Appendix B.5.

Table 3: Effect of sample size on the number of iterations required for convergence.

Sample Size (M)    2      4       8        16      32      64
Iterations         N/A    N/A     15,000   5,000   4,200   1,750
ASR                0      0.01    0        0.08    0.17    0.56

We can see in Table 3 the relationship between sample size and ASR. Starting from a very low effectiveness of almost 0% ASR with small sample sizes (M = 2, 4), the performance improves dramatically as M increases, reaching around 56% at M = 64, which is around 50% of K!. The computational cost, measured in iterations required for convergence, shows an inverse relation with the sample size M. As shown in Table 3, smaller sample sizes require significantly more iterations (15,000 iterations for M = 8) compared to larger samples (1,750 iterations for M = 64). Notably, for very small sample sizes, the loss does not converge, as indicated by N/A. Most interestingly, these results suggest a practical trade-off point between attack effectiveness and computational efficiency.
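The four communication structures compared in the topology ablation (Section 5.4) can be instantiated as undirected edge sets. This plain-Python sketch is illustrative: the agent count, the binary-tree shape, and the p = 0.5 random-graph density are our assumptions, not the paper's configuration.

```python
import random

def make_topology(kind, n, rng=None):
    """Undirected edge set over agents 0..n-1 for one of the four
    communication topologies: chain, tree, complete, random."""
    rng = rng or random.Random(0)
    if kind == "chain":
        return {(i, i + 1) for i in range(n - 1)}
    if kind == "tree":  # simple binary tree rooted at agent 0
        return {((i - 1) // 2, i) for i in range(1, n)}
    if kind == "complete":
        return {(i, j) for i in range(n) for j in range(i + 1, n)}
    if kind == "random":  # Erdos-Renyi-style with edge probability 0.5
        return {(i, j) for i in range(n) for j in range(i + 1, n)
                if rng.random() < 0.5}
    raise ValueError(kind)

edges = {k: make_topology(k, 6) for k in ("chain", "tree", "complete", "random")}
```

The complete graph maximizes the number of propagation paths an attacker can exploit, which matches the observation that it is the most vulnerable topology, while the chain minimizes them.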
6 Conclusion

In this paper, we investigate the vulnerabilities of multi-agent LLM systems to adversarial prompt propagation attacks. Our findings demonstrate that optimized prompt routing can effectively bypass safety mechanisms in a system while adhering to token bandwidth constraints and handling asynchronous message arrival. Through extensive experiments, we highlight critical safety gaps in existing defenses, showing that traditional single-agent safety measures are insufficient in the multi-agent setting.
Limitations

While our study sheds light on critical vulnerabilities in multi-agent systems, several constraints should be acknowledged.
❶ Our evaluation is restricted to a set of open-source models and benchmarks. Although these models represent a diverse range of large-scale LLMs, they do not fully encapsulate the broader landscape of commercial and fine-tuned proprietary systems. Future research should expand the scope to include a wider variety of architectures, particularly those with more advanced safety training protocols, like GPT-4 (Achiam et al., 2023) and Claude (ClaudeTeam).
❷ Our approach assumes partial knowledge of the communication structure and safety enforcement mechanisms within the system. While this reflects certain real-world scenarios where attackers can exploit known patterns, it does not account for cases where the network topology is entirely unknown or dynamically reconfigured, as shown in AgentPrune (Zhang et al., 2024a).
❸ Our modeling of inter-agent interactions simplifies some complexities present in real deployments. We assume static safety mechanisms and predefined token bandwidth constraints, whereas actual multi-agent networks may involve shifting policies, evolving defenses, and variable latency conditions.
❹ All the models we study focus solely on text-based agent interactions. Many emerging LLM-based systems incorporate multi-modal capabilities as well. The potential for adversarial manipulation in these multi-modal systems remains an open question, and future research should examine how cross-modal dependencies influence such security risks.

Addressing these limitations will provide a clearer path toward securing multi-agent LLM frameworks, ensuring their safe and reliable deployment in real-world applications.
Ethical Statement

Ensuring the security of multi-agent LLM systems is critical as these models become more integrated into real-world applications. Our research is driven by the need to understand and address vulnerabilities that could be exploited by adversaries, with the ultimate goal of strengthening AI safety mechanisms. By analyzing how adversarial prompts can bypass existing defenses, we aim to provide valuable insights for the development of more robust safeguards that can protect these systems from manipulation.

We acknowledge that the techniques explored in this work could be misused if applied irresponsibly. To mitigate this risk, we have conducted all experiments in controlled environments and have refrained from testing on real-world deployments. Our intent is solely to inform security research and to assist developers in identifying and mitigating risks before they become exploitable. We strongly advocate for ethical AI practices and emphasize that advancements in adversarial understanding should always be accompanied by proactive defense strategies to ensure the safe and responsible deployment of AI technologies.
Acknowledgment

This research was, in part, funded by the CISCO Faculty Award, the UNC SDS Seed Grant, and NetMind.AI. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the funding organizations.
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Yang Wang. 2024. MultiAgent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6929–6948, Miami, Florida, USA. Association for Computational Linguistics.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. 2024. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.

Jiaqi Chen, Yuxian Jiang, Jiachen Lu, and Li Zhang. 2024. S-Agents: Self-organizing agents in open-ended environment. arXiv preprint arXiv:2402.04578.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. 2023. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2(4):6.

ClaudeTeam. The Claude 3 model family: Opus, Sonnet, Haiku.

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. LM vs LM: Detecting factual errors via cross examination. arXiv preprint arXiv:2305.13281.

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology, 33(7):1–38.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.

GraySwanAI. 2024. nanoGCG. https://github.com/GraySwanAI/nanoGCG/tree/main. [Accessed 16-02-2025].

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. 2024. Agent Smith: A single image can jailbreak one million multimodal LLM agents exponentially fast. arXiv preprint arXiv:2402.08567.

Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. 2021. Gradient-based adversarial attacks against text transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5747–5757, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA, USA.

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2023. War and peace (WarAgent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227.

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. 2023. SurrealDriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprint arXiv:2309.13193.

Donghyun Lee and Mo Tiwari. 2024. Prompt infection: LLM-to-LLM prompt injection within multi-agent systems. arXiv preprint arXiv:2410.07283.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008.

Yuan Li, Yixuan Zhang, and Lichao Sun. 2023b. MetaAgents: Simulating interactions of human behaviors for LLM-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. 2023a. Training socially aligned language models on simulated social interactions. arXiv preprint arXiv:2305.16960.

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023b. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023c. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023d. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170.

Weimin Lyu, Xiao Lin, Songzhu Zheng, Lu Pang, Haibin Ling, Susmit Jha, and Chao Chen. 2024. Task-agnostic detector for insertion-based backdoor attacks. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2808–2822, Mexico City, Mexico. Association for Computational Linguistics.

Weimin Lyu, Songzhu Zheng, Tengfei Ma, and Chao Chen. 2022. A study of the attention abnormality in trojaned BERTs. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4727–4741, Seattle, United States. Association for Computational Linguistics.

Weimin Lyu, Songzhu Zheng, Lu Pang, Haibin Ling, and Chao Chen. 2023. Attention-enhancing backdoor attacks against BERT-based models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10672–10690, Singapore. Association for Computational Linguistics.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 6(3).

Miguel A Ramirez, Song-Kyoo Kim, Hussam Al Hamadi, Ernesto Damiani, Young-Ji Byon, Tae-Yeon Kim, Chung-Suk Cho, and Chan Yeob Yeun. 2022. Poisoning attacks and defenses on artificial intelligence: A survey. arXiv preprint arXiv:2202.10276.

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, and Huan Liu. 2024. The wolf within: Covert injection of malice into MLLM societies via an MLLM operative. arXiv preprint arXiv:2402.14859.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Llama Team. 2024. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Meta. 2024. Prompt Guard-86M | Model cards and prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/. [Accessed 13-02-2025].

Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, and Hang Su. 2023. Evil geniuses: Delving into the safety of LLM-based agents. arXiv preprint arXiv:2311.11855.

Weixi Tong and Tianyi Zhang. 2024. CodeJudge: Evaluating code generation with large language models. arXiv preprint arXiv:2410.02184.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. 2023. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066.

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2024. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36.

Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. 2023. An empirical study on challenging math problem solving with GPT-4. arXiv preprint arXiv:2306.01337.

Hui Yang, Sifu Yue, and Yunzhong He. 2023. Auto-GPT for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224.

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024a. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. arXiv preprint arXiv:2401.06373.

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024b. AutoDefense: Multi-agent LLM defense against jailbreak attacks. arXiv preprint arXiv:2403.04783.

Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. 2024a. Cut the crap: An economical communication pipeline for LLM-based multi-agent systems. arXiv preprint arXiv:2410.02506.

Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, and Dawei Cheng. 2024b. G-Designer: Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on
|
+
humaneval-x. arXiv preprint arXiv:2303.17568.
|
|
1743
|
+
|
|
1744
|
+
Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li,
|
|
1745
|
+
Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang,
|
|
1746
|
+
Jing Chen, Ruipu Wu, Shuai Wang, et al. 2023.
|
|
1747
|
+
Agents: An open-source framework for autonomous
|
|
1748
|
+
language agents. arXiv preprint arXiv:2309.07870.
|
|
1749
|
+
|
|
1750
|
+
A Use of Generative AI

To enhance clarity and readability, we utilized LLMs exclusively as a language-polishing tool. Their role was confined to proofreading, grammatical correction, and stylistic refinement, functions analogous to those provided by traditional grammar checkers and dictionaries. The tool did not contribute to the generation of new scientific content or ideas, and its usage is consistent with standard practices for manuscript preparation.
B Experimental Settings

In this section we list all of the experimental settings, including datasets, architectures, baselines and metrics. Compute: We use 8× Nvidia A6000 GPUs for all of our experiments.
B.1 Experiment: Overall Performance Comparison

Datasets and Architectures. To comprehensively evaluate the effectiveness and generalizability of our permutation-invariant attack, we conduct experiments across a diverse range of target LLM architectures and datasets. Specifically, we evaluate our method on Llama-2-7B (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), Llama-3.1-8B (Dubey et al., 2024), Gemma-2-9B (Team et al., 2024) and the DeepSeek-R1-Distilled (Guo et al., 2025) version of Llama-3.1-8B (Dubey et al., 2024). These architectures represent a broad spectrum of model scales and training paradigms, ensuring a rigorous assessment of our attack's applicability. For evaluation datasets, we utilize three distinct benchmarks: ❶ Jailbreak Benchmark (Chao et al., 2024): a collection of 100 harmful misuse behaviors ranging from physical harm to disinformation. ❷ Adversarial Benchmark (Zou et al., 2023): a collection of 520 harmful instructions sharing the themes of profanity, discrimination, cybercrime and misinformation. ❸ In-the-Wild Jailbreak Benchmark (Shen et al., 2024): a curated dataset of 1405 jailbreak prompts covering up to 13 different scenarios, including fraud, harm and pornography. Furthermore, to simulate the multi-agent system, we assign one random topology to each prompt so that all models are compared consistently. As the topology is randomly generated, we run each experiment 3 times to mitigate the effect of any bias/seed.
Baselines. Given this very specific multi-agent system setup, we identify two main baselines: ❶ the Greedy Coordinate Gradient (GCG) attack (Zou et al., 2023) and ❷ the vanilla instructions that come paired with each of the benchmarks above. To compute the GCG prompt we use the NanoGCG (GraySwanAI, 2024) library with consistent settings across datasets and benchmarks. We optimize each prompt for up to 500 steps, with a search width of 64 and 64 candidate token replacements for any given position (topk). In this set of experiments we use PIEL instead of S-PIEL for a comprehensive comparison.
Metrics. Across these benchmarks, we measure the Permuted Attack Success Rate: given an attack prompt, we create K chunks (provided by the topological optimization method) and choose one random permutation out of the K! possible permutations. We then pass this permutation through the system and record the result. As we repeat this experiment 3 times to avoid any bias/randomness, we also report ASR-m, the minimum ASR achieved in these 3 runs, and ASR-M, the maximum ASR achieved, alongside the average ASR.
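The permuted-ASR protocol above can be sketched as a small harness. This is a toy illustration, not our pipeline: `attack_succeeds` is a hypothetical stand-in for issuing the permuted chunks through the multi-agent system and judging the response.

```python
import random
from statistics import mean

def permuted_asr(prompt_chunks, attack_succeeds, runs=3, seed=0):
    """For each prompt, sample ONE random permutation of its K chunks
    (out of K! possibilities), query the system, and record success.
    Repeats the sweep `runs` times; returns (ASR-m, average ASR, ASR-M)."""
    rng = random.Random(seed)
    per_run = []
    for _ in range(runs):
        hits = 0
        for chunks in prompt_chunks:
            permutation = rng.sample(chunks, k=len(chunks))  # one of K! orderings
            hits += bool(attack_succeeds(permutation))
        per_run.append(hits / len(prompt_chunks))
    return min(per_run), mean(per_run), max(per_run)
```

With a real system, `attack_succeeds` would wrap one end-to-end pass through the assigned communication topology.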
B.2 Experiment: Safety Mechanism Efficacy

Datasets and Architectures. For a comprehensive evaluation of our routing, we utilize a diverse set of five safety-aligned models: Llama-Guard-7B (Inan et al., 2023), Llama-Guard-2-8B (Team et al., 2024), Llama-Guard-3-8B (Dubey et al., 2024), Llama-Guard-3-1B (Dubey et al., 2024) and Prompt-Guard-86M (Meta, 2024). Such a diverse set ensures that varying levels of safety-aligned architectures are considered in our analysis. For the dataset, we use the complete benchmark provided by the Jailbreak Benchmark, which includes 100 harmful prompts and 100 benign prompts; the latter help us quantify the false positive and false negative rates.
Settings and Metrics. We generate a random communication graph with maximum degree 3 for each prompt, following recent work (Zhang et al., 2024b) showing that sparse topologies achieve performance comparable to dense networks. We optimize the token flow using the algorithm described in Section 4.1, which provides optimal chunk lengths for each edge. We then process the prompts as follows: vanilla and GCG prompts are directly chunked based on the optimal lengths, while our method employs the full Permutation-Invariant Evasion Loss. For the evaluation metric, we assess detection performance using the F1-score, which provides a balanced measure of a safety mechanism's effectiveness by combining precision and recall, capturing both false positives and false negatives.

B.5 Ablation: Sensitivity Analysis of the Stochastic Version

Settings. We evaluate how increasing M influences the ASR on the Jailbreak Benchmark, and also assess the scalability and efficiency of our attack under computational budgets. For all experiments in this section we use Llama-2-7B as our target model, with K = 5 (i.e., 5! = 120 permutations).

C Quantifying Detection Risk

Figure 5: Effect of chunk length on the detection of PromptGuard-86M and its 4-bit quantized version.

To quantify the risk associated with the safety mechanisms deployed in the multi-agent LLM system, we conducted a series of empirical experiments evaluating PromptGuard-86M (Meta, 2024) and its 4-bit quantized version. We used the complete Jailbreak Benchmark with benign samples (100 harmful and 100 non-harmful examples) to systematically examine how prompt chunking affects detection performance. Specifically, we split the GCG prompts into varying token lengths and assessed the F1 score of the safety mechanism at different chunk sizes. The results in Figure 5 reveal that shorter prompt fragments often evade detection, whereas longer chunks improve safety enforcement by providing more context for classification.
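The chunk-size sweep behind this analysis can be illustrated with a toy harness; the keyword check below is a hypothetical stand-in for PromptGuard-style classification, and a prompt counts as detected if any one of its chunks is flagged (per-edge screening).

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def chunked_detection_f1(prompts, labels, flag_chunk, chunk_len):
    """Split each tokenized prompt into chunk_len-sized pieces and flag
    the prompt if ANY piece trips the classifier."""
    tp = fp = fn = 0
    for tokens, harmful in zip(prompts, labels):
        pieces = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
        flagged = any(flag_chunk(piece) for piece in pieces)
        if flagged and harmful:
            tp += 1
        elif flagged and not harmful:
            fp += 1
        elif harmful:
            fn += 1
    return f1_score(tp, fp, fn)
```

A flagger that needs co-occurring tokens in the same chunk misses the pattern once chunks get shorter than the pattern, the same trend the chunk-length sweep exhibits.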
B.3 Experiment: Transferability

Datasets and Architectures. To evaluate the transferability of our permutation-invariant prompts, we conduct experiments across multiple LLM architectures and benchmark datasets. Specifically, we want to assess whether adversarial prompts optimized for one target model transfer effectively and maintain high attack success rates when applied to unseen models. We conduct experiments on Llama-2-7B, Mistral-7B and Gemma-2-9B, which represent a diverse set of models. For evaluation datasets, we utilize two benchmarks: ❶ Jailbreak Benchmark, which consists of 100 harmful prompts, and ❷ Adversarial Benchmark, a collection of 520 harmful instructions.

Settings and Metrics. As in the first experiment, we assign each prompt to a random communication topology and then optimize over it to find the optimal chunk length and number of chunks. We then use our Permutation-Invariant Evasion Loss to generate the prompts for one target model. Lastly, we sample one random permutation of that prompt and apply it to the other target models for a fair comparison. Hence the Attack Success Rate (ASR) is calculated based on the performance of this single permutation, which is sampled randomly but kept the same for all models.
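The shared-permutation protocol can be sketched in a few lines; the entries of `models` are hypothetical success predicates standing in for end-to-end passes through each target model.

```python
import random

def transfer_asr(prompt_chunks, models, seed=0):
    """Sample ONE permutation per prompt with a fixed seed, so every model
    is evaluated on exactly the same chunk orderings; score each model."""
    rng = random.Random(seed)
    permutations = [rng.sample(chunks, k=len(chunks)) for chunks in prompt_chunks]
    return {
        name: sum(bool(succeeds(p)) for p in permutations) / len(permutations)
        for name, succeeds in models.items()
    }
```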
B.4 Ablation: Effect of Topology

Settings. Specifically, we examine how different agent connectivity patterns impact the Attack Success Rate (ASR) of our permutation-invariant attack. We test four distinct topologies, each representing a different level of connectivity: ❶ Chain, ❷ Tree, ❸ Complete Graph and ❹ Random Graph. For the dataset, we calculate the ASR on the Jailbreak Benchmark. Furthermore, note that in this case each edge is randomly assigned a safety mechanism from the following set: PromptGuard-86M, Llama-Guard-7B and Llama-Guard-3-8B. Lastly, in all cases we use Llama-2-7B as our target model.
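The four topologies can be sketched as edge-list builders. The degree-capped random graph mirrors the maximum-degree-3 construction of B.2, but the exact generator below is our illustrative choice (grow extra edges over a connecting path), not the paper's.

```python
import random

def chain(n):
    """Path graph: agent i talks only to agent i+1."""
    return [(i, i + 1) for i in range(n - 1)]

def tree(n):
    """Complete binary tree: node i's parent is (i-1)//2 (max degree 3)."""
    return [((i - 1) // 2, i) for i in range(1, n)]

def complete(n):
    """Every pair of agents is connected."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def random_graph(n, max_degree=3, seed=0, tries=50):
    """Random connected graph: start from a path (degree <= 2), then add
    extra edges only while both endpoints stay under the degree cap."""
    rng = random.Random(seed)
    edges = set(chain(n))
    deg = [2] * n
    deg[0] = deg[n - 1] = 1
    for _ in range(tries):
        a, b = rng.sample(range(n), 2)
        e = (min(a, b), max(a, b))
        if e in edges or deg[e[0]] >= max_degree or deg[e[1]] >= max_degree:
            continue
        edges.add(e)
        deg[e[0]] += 1
        deg[e[1]] += 1
    return sorted(edges)
```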
D Permutation-Invariant Evasion Loss Algorithm

The algorithm for the Permutation-Invariant Loss is provided below:

Algorithm 1 Permutation-Invariant Evasion Optimization
Require: Target model type vt, initial chunk set C = {C1, . . . , CK} from Topological Optimization, iterations T
Ensure: Optimized chunk set C∗
1: Randomly initialize token sequences Ck
2: for t = 1 to T do
3:   SK ← set of all permutations
4:   total loss L(C) ← 0
5:   for π ∈ SK do
6:     ϕ = Concat(π(1), . . . , π(K))
7:     Lπ = − log p(x⋆n+1:n+H | ϕ)
8:     L(C) = L(C) + Lπ
9:   end for
10:  L(C) = L(C)/K!
11:  for Ci ∈ C do
12:    GCG(Ci, L(C))
13:  end for
14: end for
15: return optimized adversarial chunks C∗
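A minimal sketch of the loss computation in Algorithm 1 (lines 3-10), with `target_logprob` as a hypothetical stand-in for log p(x⋆n+1:n+H | ϕ) under the target model; the GCG token updates of lines 11-13 are elided. The second function sketches the stochastic variant, which averages over M sampled orderings instead of all K!.

```python
import itertools
import random

def piel_loss(chunks, target_logprob):
    """Average the target NLL over all K! orderings of the chunks, so no
    single ordering is privileged. `target_logprob(prompt)` stands in for
    the target model's log-likelihood of the harmful continuation."""
    losses = [-target_logprob(" ".join(p)) for p in itertools.permutations(chunks)]
    return sum(losses) / len(losses)  # L(C) / K!

def stochastic_piel_loss(chunks, target_logprob, m, seed=0):
    """S-PIEL-style estimate: average over M sampled orderings,
    trading exactness for compute."""
    rng = random.Random(seed)
    losses = []
    for _ in range(m):
        p = rng.sample(chunks, k=len(chunks))
        losses.append(-target_logprob(" ".join(p)))
    return sum(losses) / m
```

By construction the full loss is invariant to the order in which the chunks are supplied, which is the property the attack relies on when the topology reorders them.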