@chllming/wave-orchestration 0.6.3 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/CHANGELOG.md +82 -1
  2. package/README.md +40 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +67 -11
  39. package/docs/guides/terminal-surfaces.md +12 -0
  40. package/docs/plans/context7-wave-orchestrator.md +20 -0
  41. package/docs/plans/current-state.md +8 -1
  42. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  43. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  44. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  45. package/docs/plans/migration.md +26 -0
  46. package/docs/plans/wave-orchestrator.md +60 -12
  47. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  48. package/docs/reference/cli-reference.md +547 -0
  49. package/docs/reference/coordination-and-closure.md +436 -0
  50. package/docs/reference/live-proof-waves.md +25 -3
  51. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  52. package/docs/reference/proof-metrics.md +90 -0
  53. package/docs/reference/runtime-config/README.md +63 -2
  54. package/docs/reference/runtime-config/codex.md +2 -1
  55. package/docs/reference/sample-waves.md +29 -18
  56. package/docs/reference/wave-control.md +164 -0
  57. package/docs/reference/wave-planning-lessons.md +131 -0
  58. package/package.json +5 -4
  59. package/releases/manifest.json +40 -0
  60. package/scripts/research/agent-context-archive.mjs +18 -0
  61. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  62. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  63. package/scripts/wave-orchestrator/agent-state.mjs +11 -2
  64. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  65. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  66. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  67. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  68. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  69. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  70. package/scripts/wave-orchestrator/config.mjs +175 -0
  71. package/scripts/wave-orchestrator/control-cli.mjs +1216 -0
  72. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  73. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  74. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  75. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  76. package/scripts/wave-orchestrator/dashboard-renderer.mjs +120 -5
  77. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  78. package/scripts/wave-orchestrator/evals.mjs +23 -0
  79. package/scripts/wave-orchestrator/executors.mjs +3 -2
  80. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  81. package/scripts/wave-orchestrator/install.mjs +151 -2
  82. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  83. package/scripts/wave-orchestrator/launcher-runtime.mjs +33 -30
  84. package/scripts/wave-orchestrator/launcher.mjs +884 -36
  85. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  86. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  87. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  88. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  89. package/scripts/wave-orchestrator/replay.mjs +10 -4
  90. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  91. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  92. package/scripts/wave-orchestrator/shared.mjs +26 -0
  93. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  94. package/scripts/wave-orchestrator/terminals.mjs +1 -1
  95. package/scripts/wave-orchestrator/traces.mjs +157 -2
  96. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  97. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  98. package/scripts/wave-orchestrator/wave-files.mjs +144 -23
  99. package/scripts/wave.mjs +27 -0
  100. package/skills/repo-coding-rules/SKILL.md +1 -0
  101. package/skills/role-cont-eval/SKILL.md +1 -0
  102. package/skills/role-cont-qa/SKILL.md +13 -6
  103. package/skills/role-deploy/SKILL.md +1 -0
  104. package/skills/role-documentation/SKILL.md +4 -0
  105. package/skills/role-implementation/SKILL.md +4 -0
  106. package/skills/role-infra/SKILL.md +2 -1
  107. package/skills/role-integration/SKILL.md +15 -8
  108. package/skills/role-planner/SKILL.md +39 -0
  109. package/skills/role-planner/skill.json +21 -0
  110. package/skills/role-research/SKILL.md +1 -0
  111. package/skills/role-security/SKILL.md +2 -2
  112. package/skills/runtime-claude/SKILL.md +2 -1
  113. package/skills/runtime-codex/SKILL.md +1 -0
  114. package/skills/runtime-local/SKILL.md +2 -0
  115. package/skills/runtime-opencode/SKILL.md +1 -0
  116. package/skills/wave-core/SKILL.md +25 -6
  117. package/skills/wave-core/references/marker-syntax.md +16 -8
  118. package/wave.config.json +45 -0
@@ -0,0 +1,3747 @@
---
summary: 'Converted paper text and source links for Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems.'
read_when:
- Reviewing harness and coordination research source material in the docs tree
- You want the extracted paper text with source links preserved
topics:
- blackboard-and-shared-workspaces
- repo-context-and-evaluation
kind: 'paper'
title: 'Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems'
---

# Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

<Note>
Converted from the source document on 2026-03-21. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang |
| Year | 2026 |
| Venue | arXiv 2603.01045 |
| Research bucket | P0 direct hits |
| Maps to | Distributed coordination benchmarks, communication-reasoning gaps, and evidence on integration failures in multi-agent systems. |
| Harness fit | A concrete benchmark for testing whether shared-workspace coordination actually improves reasoning integration. |
| Source page | [Open source](https://arxiv.org/abs/2603.01045) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2603.01045.pdf) |

## Extracted text

### Page 1
SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

Yuzhe Zhang1 Feiran Liu1 Yi Shan1 Xinyi Huang1 Xin Yang2 Yueqi Zhu1 Xuxin Cheng4 Cao Liu4 Ke Zeng4 Terry Jingchen Zhang3,5† Wenyuan Jiang3†*

1 Beijing University of Technology, Beijing, China 2 Zhejiang University, Hangzhou, China 3 ETH Zürich, Switzerland 4 Meituan LongCat Interaction Team 5 Vector Institute for Artificial Intelligence

†* Corresponding Author.

Abstract

Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information—rather than merely exchange it—remains an open question. We introduce SILO-BENCH, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage—agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and SILO-BENCH provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.

[Figure 1 graphic: four-step pipeline (Step 1: Data Partition, Step 2: Agent Initialization, Step 3: Collaborative Execution, Step 4: Metric Computation) with example agent exchanges on a distributed sort and reported metrics: Communication Density 0.92, Success Rate 33.3%, Token Efficiency 194, Partial Correctness Score 40.0%]

Figure 1: Pipeline of SILO-BENCH. Global information is partitioned across N agents, each holding only local data. Agents must communicate through the provided protocol to reconstruct global truth. Success requires effective collaboration strategies. This is an example of the III-21 Distributed Sort (Appendix E).

1 Introduction

The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable capabilities in individual inference and generation tasks (AI, 2023; Touvron et al., 2023; DeepSeek-AI et al., 2024). However, as the scale and complexity of real-world problems continue to grow, a fundamental bottleneck has emerged: the limited context window of a single model restricts its ability to process global information (Li et al., 2024; Chen et al., 2024b; An et al., 2024). Even with recent progress in extending context lengths to millions of tokens (Reid et al., 2024), the quadratic cost of attention (Ratner et al., 2023; Yen et al., 2024) makes centralized processing increasingly impractical for truly large-scale tasks.

Multi-Agent Systems (MAS) offer a compelling architectural paradigm to address this scalability challenge (Zhang et al., 2024a; Wang et al., 2024b). By distributing global information across multiple agents that collaborate to compute results, MAS can theoretically overcome the token limitations of single models (Liu et al., 2023). This distributed approach mirrors successful patterns in traditional computing—from MapReduce to modern distributed databases—where data partitioning and coordinated computation across nodes achieve scales unattainable by a single machine (Dean and Ghemawat, 2008). In the realm of large models, we

arXiv:2603.01045v1 [cs.MA] 1 Mar 2026
### Page 2

define the scenario where an individual agent has access only to partial information, thereby necessitating coordination to resolve token constraints, as information silos. However, a critical question remains underexplored: Can current LLM-based agents effectively collaborate within information silos to compute a globally correct answer (Qian et al., 2024, 2023; Liu et al., 2023)? Existing multi-agent benchmarks either prescribe fixed communication structures (Li et al., 2023; Wu et al., 2024; Hong et al., 2023) or focus on social simulation rather than computational collaboration (Park et al., 2023; Lan et al., 2024). These approaches often introduce inductive bias into the agents' final outputs (Baltaji et al., 2024). For instance, if an agent is assigned the role of a "doctor", it may exhibit poor performance in artistic domains, which contradicts the goal of developing general-purpose agents (An et al., 2024; Qian et al., 2024, 2023; Liu et al., 2023). Furthermore, most benchmarks to date target specific tasks (Deng et al., 2024; Chen et al., 2024a; Gioacchini et al., 2024) and fail to address a significant gap in our understanding: whether logic-based models can autonomously discover and execute effective coordination strategies for distributed computing problems. Addressing this gap is a key objective for future AGI evaluation. To bridge this gap, we propose SILO-BENCH—a pioneering benchmark for evaluating free-form communication and collaboration in multi-agent LLM systems (Liu et al., 2023). In summary, our contributions are as follows:

- We introduce SILO-BENCH, a role-agnostic configurable environment for evaluating distributed coordination under information silos. Unlike static test suites that prescribe fixed roles and communication scripts, our framework can generate unlimited evaluation instances while providing high-level task hints—allowing observation of whether agents can translate structural understanding into effective coordination protocols (Press et al., 2021; Liu et al., 2023).
- We conduct the largest systematic study of multi-agent collaboration to date by instantiating 54 representative configurations. Spanning diverse protocols and computing paradigms (Zhao et al., 2024; Islam et al., 2024a), we propose a multi-dimensional metric suite to comprehensively quantify the trade-off between task success rate, token consumption, and communication density.
- We expose critical scalability limitations and the Communication-Reasoning Gap in current LLMs. Our results reveal that while agents can spontaneously discover task-appropriate communication topologies, they fail to translate effective coordination into correct distributed computation. This disconnect, coupled with inefficient information synthesis, causes performance to collapse as task complexity increases and the agent scale expands.

2 Related Work

Context Limitations and Distributed Reasoning. The finite context window of LLMs constitutes a fundamental bottleneck for processing large-scale information. While recent advances have extended context lengths to millions of tokens (Reid et al., 2024; Liu et al., 2024a), the quadratic computational complexity of attention mechanisms makes centralized processing increasingly resource-intensive and prone to "lost-in-the-middle" phenomena (Liu et al., 2024b). Although Retrieval-Augmented Generation (RAG) offers a palliative solution (Wang et al., 2024c; Islam et al., 2024b), it often fractures global context, struggling with tasks that require holistic reasoning across disjoint segments. Existing benchmarks like SCROLLS (Shaham et al., 2022), LongBench (Bai et al., 2024), and ∞Bench (Zhang et al., 2024b) effectively evaluate single-agent retrieval but overlook the paradigm of distributed collaboration. We posit that overcoming the context barrier requires shifting from centralized attention to collaborative computation, where agents act as distributed processors to digest partitioned information and synthesize global insights—a capability currently unmeasured by standard long-context evaluations.

Multi-Agent Architectures and Role-Agnosticism. The paradigm of orchestrating multiple LLM agents has evolved from simple role-playing to complex problem-solving frameworks. Foundational works like CAMEL (Li et al., 2023) and MetaGPT (Hong et al., 2023) utilize role-specialized agents (e.g., assigning "Manager" or "Coder" personas) embedded within fixed hierarchical or waterfall workflows. While effective for domain-specific tasks like software engineering (Islam et al., 2024a), these approaches entangle the agents' reasoning capabilities with semantic role priors, making it difficult to isolate the contribution of the communication architecture itself. Other
### Page 3

[Figure 2 graphic: three panels of agent-communication diagrams labeled Level I: Aggregation (hub agent asking "I need all your data!"), Level II: Mesh Network (pairwise exchanges between neighbors), and Level III: Global Shuffle (all-to-all exchanges, "I need to exchange with everyone to get more information~")]

Figure 2: Three complexity levels in SILO-BENCH characterized by their communication patterns. Level I (Aggregation): A central agent collects data from all peers via a star topology. Level II (Mesh Network): Agents exchange information with immediate neighbors through pairwise communication. Level III (Global Shuffle): All agents must communicate with every other agent, requiring full mesh connectivity.

efforts, such as debate-based systems (Du et al., 2023) or Mixture-of-Agents (Wang et al., 2024a), often prescribe static topological constraints that limit dynamic information flow. We introduce SILO-BENCH, a role-agnostic configurable environment with task-structural guidance for evaluating distributed coordination under information silos. Unlike static test suites that prescribe fixed roles and communication scripts, our framework dynamically generates unlimited evaluation instances while providing high-level task hints—allowing observation of whether agents can translate structural understanding into effective coordination protocols (Press et al., 2021; Liu et al., 2023).

3 SILO-BENCH

This section presents the architecture of SILO-BENCH, a configurable environment for evaluating multi-agent collaboration under information silos. Each configuration is defined by three orthogonal dimensions: agent scale N, communication protocol P, and language model M. We describe the task space, evaluation metrics, and execution pipeline.

3.1 Task Space

A central design goal of SILO-BENCH is to ground task difficulty in principled communication complexity theory, so that observed performance gaps can be attributed to coordination demands rather than ad hoc task choice. The theoretical foundation for analyzing distributed computation costs dates back to Yao's seminal work on communication complexity (Yao, 1979), which established the framework for quantifying the minimum bits required for distributed parties to compute a function. Building on this foundation, we categorize tasks by their optimal communication complexity:

$$\tau_k = (f_k, X_k, y^*_k) \tag{1}$$

where $f_k$ specifies the computational function, $X_k$ is the global input data, and $y^*_k$ is the ground-truth answer. Tasks are organized into three levels based on their optimal communication complexity (complete task specifications are provided in Appendix E).

Level I: Aggregation (O(N) communication). As illustrated in Figure 2 (left), these tasks exhibit embarrassingly parallel structure followed by reduction. Each agent processes its local shard independently, producing intermediate results aggregated through associative operations (e.g., max, sum, xor). The optimal topology is a star or tree structure where one agent collects all partial results. Representative tasks include global maximum (LC-414: "Third Maximum Number"), distributed voting (LC-169: "Majority Element"), and word frequency counting (LC-2085).

Level II: Mesh Network (O(N) communication). As shown in Figure 2 (center), these tasks exhibit spatial locality: agent i's computation depends primarily on neighboring agents i − 1 and i + 1. Information propagates through a structured mesh via pairwise exchanges, with optimal topology being a linear chain requiring N − 1 point-to-point exchanges. Representative tasks include prefix sum (LC-1480), moving average (LC-346: "Moving Average from Data Stream"), and trapping rain water (LC-42).
### Page 4

Level III: Global Shuffle (O(N log N) to O(N²) communication). As depicted in Figure 2 (right), these tasks feature irregular, potentially all-to-all communication patterns where any agent's output may depend on information from any other agent. The range O(N log N)–O(N²) spans from the classical lower bound for distributed reorganization to the full-consensus cost imposed by our evaluation criterion, where every agent must output the complete global answer. Representative tasks include distributed sorting (LC-912), graph connectivity (LC-323), and matrix multiplication (LC-311).
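As a rough illustration of why the three levels differ in coordination cost, the directed message counts for the idealized topologies named in Section 3.1 (star, linear chain, full mesh) can be tabulated. This is a simplification for intuition only, not a formula from the paper, and the function name is illustrative:

```python
def ideal_message_counts(n: int) -> dict:
    """Directed message counts for the idealized topologies of Section 3.1:
    star aggregation (Level I), linear chain (Level II), full mesh (Level III)."""
    return {
        "level_I_star": n - 1,         # every peer sends its partial result to the hub once
        "level_II_chain": n - 1,       # N - 1 point-to-point neighbor exchanges
        "level_III_mesh": n * (n - 1), # every ordered pair exchanges a message
    }
```

The quadratic growth of the Level III entry is the source of the coordination overhead that, per the abstract, eventually eliminates parallelization gains.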
3.2 Task Construction Pipeline

LeetCode problems serve solely as algorithmic inspiration—we do not transform raw LeetCode data. For each task category (e.g., "Global Maximum"), we independently implement a Python generator that programmatically produces random inputs and exact ground-truth answers. A task instance is one concrete input–output pair drawn from this generator under a fixed (N, P, M) configuration—where N is the agent scale, P the communication protocol, and M the language model—and a fixed random seed, ensuring reproducibility while allowing unlimited fresh instances.

To illustrate: for Level-I Global Maximum (inspired by LC-414, the "Third Maximum Number" problem), given agent count N and per-agent shard size k, the generator (i) samples N × k integers uniformly at random, (ii) partitions them into N equal shards $X_1, \ldots, X_N$, (iii) computes $y^* = \max(X)$, and (iv) records the fixed seed for reproducibility. Each agent receives only its local shard $X_i$ and must coordinate to determine $y^*$. This pipeline generalises directly to all 30 tasks: the generator encodes the task-specific function $f_k$, scales global input size proportionally with N to maintain constant per-agent workload, and produces exact ground-truth answers enabling fully objective evaluation.
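The paper's actual generators are not reproduced in this document; a minimal sketch of the Level-I Global Maximum generator following steps (i)–(iv) above might look like the following (the function name and value range are assumptions, not the authors' API):

```python
import random

def make_global_max_instance(n_agents: int, shard_size: int, seed: int) -> dict:
    """Generate one Level-I Global Maximum task instance:
    sample n_agents * shard_size integers, partition into equal shards,
    and record the exact ground truth y* = max(X) plus the seed."""
    rng = random.Random(seed)  # fixed seed -> reproducible instance
    data = [rng.randint(0, 10_000) for _ in range(n_agents * shard_size)]
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(n_agents)]
    return {"shards": shards, "ground_truth": max(data), "seed": seed}
```

In the benchmark's setting, agent i would see only `shards[i]`; the same seed always reproduces the same instance, matching the reproducibility claim in Section 3.2.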
3.3 Evaluation Metrics

We define four complementary metrics to capture both what agents achieve and how they coordinate. Let $\hat{y}_i$ denote agent i's submitted answer, and let $m_i$ denote the total number of messages successfully transmitted outward by agent i during the entire collaboration.

Success Rate (S). Measures the proportion of agents converging to the correct answer:

$$S = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y^*] \tag{2}$$

A task instance is successful when S = 1, indicating unanimous convergence.

Partial Correctness Score (P). Binary success rate can understate partial progress. We introduce a continuous measure of answer quality tailored to each task category: for Level-I, P is the fraction of agents within tolerance of the ground truth; for Level-II, the fraction of correctly computed elements per local segment; for Level-III, the longest correctly ordered subsequence relative to total length. Letting $q_i \in [0, 1]$ denote the per-agent quality score:

$$P = \frac{1}{N} \sum_{i=1}^{N} q_i \tag{3}$$

Together with S, this score allows us to isolate where coordination breaks down: the gap P − S quantifies performance lost specifically at the reasoning-integration stage rather than at the communication stage.

Token Consumption (C). Quantifies computational cost per communication round:

$$C = \frac{\sum_{i=1}^{N} \sum_{r=1}^{R} t^{\mathrm{out}}_i[r]}{R_{\max}} \tag{4}$$

where $t^{\mathrm{out}}_i[r]$ is the number of output tokens generated by agent i in round r, and $R_{\max}$ is the max number of rounds executed.

Communication Density (D). Captures inter-agent interaction intensity. Here N(N − 1) is the directed-edge count when each ordered pair exchanges exactly one message; since agents may send multiple messages to the same recipient across rounds, $D \in [0, +\infty)$:

$$D = \frac{\sum_{i=1}^{N} m_i}{N(N-1)} \tag{5}$$

Values near 0 suggest sparse, targeted exchanges; D = 1 indicates one message per directed pair on average; values exceeding 1 reflect iterative multi-round exchanges. For the SFS protocol (see Appendix A), $m_i$ counts the number of times other
### Page 5

agents successfully read files written by agent i, preserving the same "information actually transferred" semantics as direct message-passing.

Together, S and P measure what agents achieve, C measures at what cost, and D reveals how they coordinate.
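The four definitions above translate directly into code. The following is a hedged transcription of Equations (2)–(5), not the benchmark's actual implementation; function and parameter names are illustrative:

```python
def success_rate(answers, y_star):
    """S (Eq. 2): fraction of agents whose submitted answer equals the ground truth."""
    return sum(1 for y in answers if y == y_star) / len(answers)

def partial_correctness(qualities):
    """P (Eq. 3): mean per-agent quality score q_i in [0, 1]."""
    return sum(qualities) / len(qualities)

def token_consumption(tokens_per_agent_round, r_max):
    """C (Eq. 4): total output tokens across agents and rounds, divided by R_max."""
    return sum(sum(rounds) for rounds in tokens_per_agent_round) / r_max

def communication_density(messages_sent, n_agents):
    """D (Eq. 5): total outward messages over the N(N-1) directed-pair count."""
    return sum(messages_sent) / (n_agents * (n_agents - 1))
```

Note that the gap `partial_correctness(...) - success_rate(...)` is exactly the paper's probe for failures at the reasoning-integration stage.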
919
+ 3.4 Execution Pipeline
920
+
921
+ Given task τ = (f, X, y∗) and configuration
922
+
923
+ (N, P, M), the evaluation proceeds through four
924
+
925
+ phases.
926
+
927
Phase 1: Data Partition. PARTITION(X, N) → $\{X_1, \ldots, X_N\}$, where $|X_i| \approx |X|/N$ ensures equipartition and no agent holds privileged information.
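A minimal sketch of such a PARTITION(X, N) (my own illustration, not the benchmark's implementation), producing shards whose sizes differ by at most one element:

```python
# Split x into n near-equal contiguous shards: |X_i| ~ |X| / n.
def partition(x: list, n: int) -> list[list]:
    base, extra = divmod(len(x), n)
    shards, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(x[start:start + size])
        start += size
    return shards

shards = partition(list(range(10)), 3)  # shard sizes 4, 3, 3
```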
Phase 2: Agent Initialization. Each agent i is initialized with model M and receives INIT(i) ← (desc(f, Xᵢ), P), specifying the core task logic, local data, and protocol constraint. The prompt provides task-structural guidance while preserving strategic autonomy (see Appendix B).
Phase 3: Collaborative Execution. Agents engage in iterative communication for up to $R_{\max}$ rounds. All N agents are activated in parallel within each round: they receive incoming messages from the previous round, independently decide on actions, and execute them simultaneously. Messages or files written in round r become visible at the start of round r + 1. Execution terminates when all agents submit answers or the round limit is reached.
Phase 4: Metric Computation. The four metrics (S, P, C, D) are computed from submitted answers $\{\hat{y}_i\}_{i=1}^{N}$ and recorded communication logs.
4 Experiments

We systematically evaluate multi-agent coordination across three orthogonal axes (agent scale, communication protocol, and language model), yielding a factorial design that covers qualitatively distinct coordination regimes.
4.1 Experimental Setup

Each evaluation instance in SILO-BENCH is specified by agent scale N, communication protocol P, and language model M. All models are deployed locally with default temperature and 128K context windows.
Figure 3: The three communication protocols employed in SILO-BENCH.
Agent Scale (N). We vary team size across N ∈ {2, 5, 10, 20, 50, 100}, chosen to probe qualitatively distinct coordination regimes. The minimal team (N = 2) isolates fundamental pairwise coordination without overhead. Small groups (N ∈ {5, 10}) allow agents to feasibly track all peers simultaneously. Medium scale (N = 20) begins to make exhaustive peer tracking challenging, pushing agents toward selective communication. Large scale (N ∈ {50, 100}) makes hierarchical or highly selective coordination effectively necessary, and, as our results confirm, largely beyond the reach of current LLMs.
Communication Protocol (P). As shown in Figure 3, we instantiate three protocols: P2P, directed messaging where agents explicitly address individual recipients; BP, broadcast messaging where each transmission reaches all agents simultaneously; and SFS, indirect coordination through a shared file system. Agents retain complete autonomy in deciding what to share, with whom, and when. Detailed specifications are provided in Appendix A.
Language Model (M). All N agents within a configuration share the same model, isolating coordination capability from heterogeneity effects. We evaluate three frontier open-source models: DeepSeek-V3.1 (DeepSeek-AI et al., 2024), GPT-OSS-120B (OpenAI et al., 2025), and Qwen3-Next-80B-A3B (Yang et al., 2025).
Our Experimental Setup

- Tasks: 30 (10 per difficulty level)
- Scales: 6 (2, 5, 10, 20, 50, 100)
- Protocols: 3 (P2P, BP, SFS)
- Models: 3 (DeepSeek, GPT, Qwen)
This yields 6 × 3 × 3 = 54 unique configurations and 30 × 54 = 1,620 total experiments (see Appendix C for infrastructure details). To disentangle coordination overhead from intrinsic task difficulty, we additionally conduct N = 1 baseline experiments where a single agent receives the complete global input and answers directly without communication. We define Relative Coordination Cost (RCC) = 1 − SR(N = k)/SR(N = 1), capturing the fraction of single-agent performance lost to coordination overhead. The N = 1 oracle represents the upper bound; SILO-BENCH asks whether distributed agents can approach this bound through coordination alone.
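As a worked example of the RCC formula (using the Level-III, k = 2 values for GPT-OSS-120B reported in Table 2: SR(N=1) = 80.0, SR(N=2) = 41.0):

```python
# RCC = 1 - SR(N=k) / SR(N=1): fraction of single-agent performance
# lost to coordination overhead.
def rcc(sr_multi: float, sr_single: float) -> float:
    return 1.0 - sr_multi / sr_single

# 1 - 41.0/80.0 = 0.4875, i.e. about 48.8%, matching the Table 2 entry.
loss = rcc(sr_multi=41.0, sr_single=80.0)
```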
4.2 Overall Performance

Table 1 summarizes performance across all models and configurations. DeepSeek-V3.1 achieves a 36.9% average success rate, followed by GPT-OSS-120B at 16.9% and Qwen3 at 8.2%, a 4.5× spread. Even the strongest model fails nearly two-thirds of the time, establishing that current LLMs cannot reliably coordinate under information silos.
Coordination overhead, not task difficulty, drives the performance gap. To confirm that failures stem from coordination rather than intrinsic task hardness, we compare multi-agent success rates against the N = 1 oracle. Table 2 reports results for GPT-OSS-120B (trends are consistent across models). Even at the smallest team size (k = 2), multi-agent systems already lose 15–49% of single-agent performance, and RCC compounds steadily with scale, reaching 80–100% at k = 50 for Level-II and Level-III tasks. Crucially, the single-agent success rate difference between Level-I and Level-III is modest, only about 15 percentage points, yet the multi-agent gap balloons to over 18 percentage points, confirming that performance collapse is driven by coordination failure, not by the tasks themselves being harder.
Agents gather information but fail to integrate it. While RCC reveals that coordination fails, the Partial Correctness Score (PCS) reveals where. PCS measures continuous answer quality (Section 3.3), and the divergence between PCS and SR isolates the reasoning-integration stage as the bottleneck. At N ≥ 50 on Level-III tasks, SR drops to 0% while PCS remains at 8–16%, confirming that agents acquire partial global information but cannot synthesize it correctly. This dissociation appears even on simpler tasks: averaged across all scales on Level-I tasks, DeepSeek-V3.1 achieves a PCS of 88.0% yet an SR of only 62.0% (Table 1), a gap of 26 percentage points indicating that agents collectively hold nearly all required information but still fail to produce a correct final answer.
Performance degrades multiplicatively with scale and complexity. Figure 5 and Table 3 show that task complexity and agent scale interact multiplicatively. DeepSeek-V3.1 drops from 62% on Level-I to 12% on Level-III, and Level-III tasks reach zero success at N ≥ 50, while Level-I tasks remain above 40% even at 100 agents. As Figure 4 illustrates, all models degrade with agent count, and communication density decreases at larger scales: agents become sparser in interaction precisely when denser coordination is most needed.
4.3 Protocol Suitability

Having established that coordination fails broadly, we examine whether protocol choice modulates this failure. Figure 6 reveals distinct model-protocol affinities. DeepSeek-V3.1 prefers broadcast messaging (40% with BP vs. 32% with SFS), while GPT-OSS-120B performs best with targeted communication (20% under P2P vs. 14% under BP), suggesting that protocol suitability depends on how a model balances the cognitive cost of addressing decisions against the noise of undifferentiated broadcasts. SFS underperforms in most cases: despite comparable information transfer volume to BP, it consistently yields lower SR, indicating that the bottleneck lies in reasoning about shared state rather than in communication volume.
5 Analysis and Discussion

The preceding results establish that coordination broadly fails, that failures scale with complexity, and that agents accumulate partial information they cannot synthesize. We now investigate the mechanisms: first asking whether agents at least discover the right structural approach, then tracing exactly where execution breaks down.
| Dimension | DS SR↑ | DS PCS↑ | DS Tok↓ | DS Den. | GPT SR↑ | GPT PCS↑ | GPT Tok↓ | GPT Den. | Qwen SR↑ | Qwen PCS↑ | Qwen Tok↓ | Qwen Den. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *By Communication Protocol* | | | | | | | | | | | | |
| BP | 40.4 | 50.7 | 297.3 | 0.93 | 14.4 | 33.1 | 148.7 | 0.78 | 4.5 | 12.2 | 846.9 | 0.13 |
| P2P | 38.9 | 50.4 | 363.1 | 1.52 | 20.3 | 42.6 | 579.8 | 2.25 | 10.5 | 26.5 | 909.8 | 0.62 |
| SFS | 31.5 | 41.0 | 308.5 | 1.13 | 16.0 | 34.7 | 212.8 | 0.74 | 9.5 | 18.7 | 864.4 | 0.21 |
| *By Difficulty Level* | | | | | | | | | | | | |
| Level I | 62.0 | 88.0 | 184.0 | 0.71 | 27.4 | 56.8 | 187.1 | 1.05 | 20.7 | 44.4 | 747.1 | 0.62 |
| Level II | 35.1 | 59.7 | 355.9 | 0.98 | 14.5 | 35.3 | 330.1 | 1.18 | 2.9 | 11.9 | 990.1 | 0.19 |
| Level III | 11.7 | 27.9 | 439.2 | 1.93 | 8.8 | 22.7 | 424.2 | 1.54 | 1.0 | 1.5 | 881.6 | 0.16 |
| *By Agent Scale* | | | | | | | | | | | | |
| N = 2 | 61.2 | 78.4 | 12.1 | 2.80 | 34.4 | 52.0 | 6.3 | 2.82 | 17.2 | 23.7 | 41.2 | 0.54 |
| N = 5 | 48.5 | 68.2 | 44.2 | 1.94 | 28.0 | 47.9 | 30.6 | 1.78 | 9.1 | 15.7 | 112.4 | 0.42 |
| N = 10 | 39.9 | 59.1 | 91.3 | 1.19 | 14.0 | 36.8 | 77.6 | 1.35 | 8.6 | 19.0 | 261.4 | 0.38 |
| N = 20 | 33.6 | 60.2 | 211.0 | 0.72 | 13.2 | 37.0 | 194.1 | 0.88 | 7.4 | 20.1 | 549.6 | 0.35 |
| N = 50 | 19.0 | 46.8 | 510.3 | 0.25 | 5.2 | 27.3 | 549.8 | 0.46 | 5.1 | 9.1 | 1466.3 | 0.14 |
| N = 100 | 18.1 | 46.5 | 1093.8 | 0.14 | 6.4 | 28.7 | 1024.2 | 0.24 | 1.3 | 5.1 | 2901.1 | 0.08 |
| Average | 36.9 | 47.1 | 323.0 | 0.82 | 16.9 | 38.3 | 313.8 | 1.01 | 8.2 | 19.8 | 873.6 | 0.25 |

Table 1: Overall performance summary across all models and configurations. SR = Success Rate (%), PCS = Partial Correctness Score (%), Tok = Token Consumption (tokens/round), Den. = Communication Density. Column groups: DS = DeepSeek-V3.1, GPT = GPT-OSS-120B, Qwen = Qwen3-Next-80B-A3B.

| Scale k | I SR(N=1) | I SR(N=k) | I RCC | II SR(N=1) | II SR(N=k) | II RCC | III SR(N=1) | III SR(N=k) | III RCC |
|---|---|---|---|---|---|---|---|---|---|
| k = 2 | 96.7 | 82.0 | 15.2% | 90.0 | 62.0 | 31.1% | 80.0 | 41.0 | 48.8% |
| k = 5 | 93.3 | 65.0 | 30.3% | 70.0 | 47.0 | 32.9% | 73.3 | 22.0 | 70.0% |
| k = 10 | 76.7 | 51.3 | 33.1% | 73.3 | 22.0 | 70.0% | 60.0 | 9.0 | 85.0% |
| k = 20 | 63.3 | 48.0 | 24.2% | 36.7 | 14.0 | 61.8% | 43.3 | 7.0 | 83.8% |
| k = 50 | 33.3 | 18.0 | 45.9% | 30.0 | 6.0 | 80.0% | 26.7 | 0.0 | 100% |
| k = 100 | 20.0 | 10.0 | 50.0% | 13.3 | 5.0 | 62.4% | 10.0 | 0.0 | 100% |

Table 2: Single-agent baseline SR (%), multi-agent SR (%), and Relative Coordination Cost (RCC = 1 − SR(N=k)/SR(N=1)) for GPT-OSS-120B across difficulty levels (I = Aggregation, II = Mesh, III = Global Shuffle) and scales. RCC columns quantify the fraction of single-agent performance lost to coordination overhead. Trends are consistent across all three models; full results in Appendix F.

| Level | N=2 | N=5 | N=10 | N=20 | N=50 | N=100 |
|---|---|---|---|---|---|---|
| I | 85.0 | 72.0 | 68.7 | 65.7 | 38.1 | 40.6 |
| II | 61.7 | 55.3 | 28.3 | 29.5 | 17.4 | 14.3 |
| III | 36.2 | 17.2 | 10.0 | 5.7 | 0.0 | 0.0 |
| Avg | 61.2 | 48.5 | 39.9 | 33.6 | 19.0 | 18.1 |

Table 3: Success Rate (%) by agent count and difficulty level for DeepSeek-V3.1 (averaged across all protocols).

5.1 Case Study: Emergent Coordination Patterns

Figure 7 visualizes the communication patterns that agents spontaneously adopt for three representative tasks. In the Global Max heatmap (Level-I, left), nearly all message traffic flows into column 0: Agent 0 emerges organically as a central aggregator, producing a near-perfect star topology. This self-organized structure closely matches the theoretically optimal pattern and yields high task success. In the Prefix Sum heatmap (Level-II, center), a prominent diagonal band reflects agents communicating primarily with their immediate neighbors, correctly capturing the sequential dependency of the prefix computation. However, off-diagonal scatter reveals that agents also broadcast beyond their neighbors, generating redundant overhead rather than the clean chain the task requires. In the Distributed Sort heatmap (Level-III, right), the matrix is uniformly dense: agents exchange messages with nearly every other agent, which is precisely what global data reorganization demands, yet the high density comes with highly uneven per-agent loads, with some senders dominating entire rows, suggesting uncoordinated flooding rather than structured exchange.
Figure 4: Scaling behavior across agent counts. (a) Success rates decline for all models as team size increases, with sharp drops beyond N = 20. (b) Token consumption scales roughly linearly with agent count. (c) Communication density decreases at scale, suggesting coordination sparsification.

Figure 5: Success rate by difficulty level (Levels I/II/III: DeepSeek-V3.1 62.0/35.1/11.7, GPT-OSS-120B 27.4/14.5/8.8, Qwen3-Next-80B 20.7/2.9/1.0).

Figure 6: Success rate by communication protocol (BP/P2P/SFS: DeepSeek-V3.1 40.4/38.9/31.5, GPT-OSS-120B 14.4/20.3/16.0, Qwen3-Next-80B 4.5/10.5/9.5).

Taken together, these patterns confirm that agents can translate high-level task descriptions into broadly appropriate coordination topologies without explicit instruction. Yet the heatmaps also reveal a consistent gap between structural intent and execution quality: even when the right topology emerges, agents over-communicate, distribute load unevenly, or fail to adhere to the optimal pattern. This raises the core question addressed next: given that agents communicate in approximately the right way, why do they still so often fail?
5.2 The Communication-Reasoning Gap

To classify failures systematically, we apply a two-stage hybrid procedure: rule-based detection identifies Premature Submission (agent submits before reaching the task-specific minimum peer count), Consensus Failure ($|\{\hat{y}_i\}_{i=1}^{N}| > 1$), and Computation Error (full data receipt confirmed in log but answer incorrect). Two independent annotators then reviewed stratified runs, achieving Cohen’s κ = 0.87; disagreements were resolved by discussion.
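The rule-based stage can be sketched as below. This is an illustration, not the paper's classifier: `contacted`, `min_peers`, `answers`, and `got_all_data` are hypothetical fields summarizing one run's execution log, and the categories are deliberately non-exclusive:

```python
# Rule-based failure detection: a run may exhibit several modes at once.
def classify(contacted: int, min_peers: int, answers: list,
             got_all_data: bool, correct) -> set[str]:
    modes = set()
    if contacted < min_peers:
        modes.add("premature_submission")   # submitted before enough peers reached
    if len(set(answers)) > 1:
        modes.add("consensus_failure")      # |{y_i}| > 1: agents did not converge
    if got_all_data and len(set(answers)) == 1 and answers[0] != correct:
        modes.add("computation_error")      # all data received, answer still wrong
    return modes

# An early exit that also blocks convergence triggers two co-occurring modes:
modes = classify(contacted=28, min_peers=99, answers=[208, 114],
                 got_all_data=False, correct=114)
```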
Analyzing execution logs under this scheme, we identified three distinct failure modes (Table 4). Premature Submission (37.2%) is the most prevalent: agents submit before gathering sufficient information. Agent-77 in Task I-06, for instance, submitted after contacting only 28 of 100 agents, yielding answer 208 vs. the expected 114. Consensus Failure (29.9%) occurs when agents communicate actively but cannot converge; one 100-agent XOR checksum run produced 12 distinct answers, with 86 agents converging on 146 while the remainder submitted values ranging from 42 to 238. Computation Error (28.6%) arises when agents collect all required data yet compute incorrectly, such as submitting 619 instead of 620 due to an off-by-one error during final aggregation. These modes frequently co-occur: 67 runs exhibit both Premature Submission and Consensus Failure, as early exits prevent subsequent consensus-building and widen the convergence gap for remaining agents.
Together, these three modes define the Communication-Reasoning Gap: agents exhibit proficiency in the social mechanics of coordination (formatting messages, responding to peers, organizing information flow) while failing at the computational core of determining information sufficiency and synthesizing distributed state. This is not a failure of effort: behavioral comparison shows successful runs complete in fewer rounds, while verification behaviors appear in over 95% of runs regardless of outcome. The bottleneck is reasoning quality at the integration stage, not communication intent.
Figure 7: Communication heatmaps for representative tasks with N = 20 agents using DeepSeek-V3.1 under P2P. Left: I-01 (Global Max) shows emergent leader pattern. Center: II-11 (Prefix Sum) exhibits diagonal pattern indicating partial spatial locality discovery. Right: III-21 (Distributed Sort) shows dense all-to-all communication.

| Failure Mode | Count | Percent |
|---|---|---|
| Success | 153 | 50.8% |
| Premature Submission | 112 | 37.2% |
| Consensus Failure | 90 | 29.9% |
| Computation Error | 86 | 28.6% |

Table 4: Failure mode distribution (categories not mutually exclusive).

5.3 Implications and Future Directions

The analyses jointly reveal a structural asymmetry with practical consequences. Coordination overhead does not merely reduce parallelization gains: for Level-III tasks at N ≥ 50, it eliminates them entirely, leaving a coordinated team outperformed by a single agent with full data access. Perhaps most counterintuitively, spontaneous leader emergence, conventionally assumed to help, actively hurts performance on Level-III tasks, because the aggregator becomes overwhelmed by the volume of global data it must process.
Three directions follow. First, agents need mechanisms to detect information sufficiency before committing to a final answer. Second, the explicit synchronization checkpoints present in successful runs should be formalized as consensus protocols. Third, adaptive protocol selection based on task structure could unlock model-protocol co-optimization, given the model-dependent affinities observed. SILO-BENCH provides the evaluation foundation for tracking progress along all three.
6 Conclusion

We introduce SILO-BENCH to evaluate distributed coordination in multi-agent LLM systems across 1,620 experiments. The results are unambiguous: current LLMs cannot reliably escape their information silos through coordination alone.
The Communication-Reasoning Gap identifies the precise fault line: agents are competent communicators but poor distributed reasoners. They spontaneously form task-appropriate topologies and exchange information actively, yet consistently fail to integrate what they have gathered, a dissociation made concrete by the PCS–SR divergence and the RCC analysis showing that coordination overhead eliminates parallelization gains entirely at high complexity. Most strikingly, spontaneous leader emergence actively hurts performance on complex tasks, revealing that self-organized centralization creates bottlenecks rather than resolving them.
Closing this gap will require mechanisms for information sufficiency detection, explicit consensus protocols, and adaptive coordination strategies. SILO-BENCH provides the evaluation infrastructure to track progress along these directions.
Limitations

While SILO-BENCH provides a comprehensive framework for evaluating multi-agent collaboration, it has several limitations. Our evaluation covers only three fundamental communication protocols and does not include other coordination mechanisms such as hierarchical protocols, gossip-based dissemination, and hybrid approaches. We adopt agent configurations with uniform underlying models, whereas real-world multi-agent systems usually involve heterogeneous compositions with distinct coordination patterns. Closed-source models are not evaluated in this work due to their high cost at our scale and their unverifiable, incomparable reported token usage. In addition, our assessment focuses on three frontier LLMs, which may not capture the full spectrum of failure modes across all LLMs, since each model has unique characteristics in reasoning logic, communication strategies, and error propagation that lead to distinct performance limitations.
References

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024. L-Eval: Instituting standardized evaluation for long context language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, and 1 others. 2024. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137.

Razan Baltaji, Babak Hemmatian, and Lav Varshney. 2024. Conformity, confabulation, and impersonation: Persona inconstancy in multi-agent LLM collaboration. In Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP, pages 17–31.

Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, and Lijie Wen. 2024a. LLMArena: Assessing capabilities of large language models in dynamic multi-agent environments. In ACL (1).

Longze Chen, Ziqiang Liu, Wanwei He, Yinhe Zheng, Hao Sun, Yunshui Li, Run Luo, and Min Yang. 2024b. Long context is not long at all: A prospector of long-dependency data for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8222–8234.

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong, and 1 others. 2024. DeepSeek-V3 technical report. CoRR, abs/2412.19437.

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, and 1 others. 2024. Mobile-Bench: An evaluation benchmark for LLM-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning.

Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. 2024. AgentQuest: A modular benchmark framework to measure progress and improve LLM agents. CoRR.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, and 1 others. 2023. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations.

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024a. MapCoder: Multi-agent code generation for competitive problem solving. In Annual Meeting of the Association for Computational Linguistics 2024, pages 4912–4944. Association for Computational Linguistics (ACL).

Shayekh Islam, Md Asib Rahman, KSM Tozammel Hossain, Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez. 2024b. Open-RAG: Enhanced retrieval augmented reasoning with open-source large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14231–14244.

Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, and Hao Wang. 2024. LLM-based agent society investigation: Collaboration and confrontation in Avalon gameplay. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 128–145.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008.

Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2024. LooGLE: Can long-context language models understand long contexts? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16304–16333.

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024a. World model on million-length video and language with blockwise RingAttention. In The Thirteenth International Conference on Learning Representations.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024b. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and 1 others. 2023. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations.

OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, and 108 others. 2025. gpt-oss-120b & gpt-oss-20b model card. Preprint, arXiv:2508.10925.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

Ofir Press, Noah Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. In International Confer-
+ ence on Learning Representations.
2258
+
2259
+ Chen Qian, Xin Cong, Cheng Yang, Weize Chen,
2260
+
2261
+ Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong
2262
+
2263
+ Sun. 2023. Communicative agents for software de-
2264
+
2265
+ velopment. CoRR.
2266
+
2267
+ Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan
2268
+
2269
+ Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng
2270
+
2271
+ Su, Xin Cong, and 1 others. 2024. Chatdev: Com-
2272
+
2273
+ municative agents for software development. In Pro-
2274
+
2275
+ ceedings of the 62nd Annual Meeting of the Associa-
2276
+
2277
+ tion for Computational Linguistics (Volume 1: Long
2278
+
2279
+ Papers), pages 15174–15186.
2280
+
2281
+ Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram,
2282
+
2283
+ Inbal Magar, Omri Abend, Ehud Karpas, Amnon
2284
+
2285
+ Shashua, Kevin Leyton-Brown, and Yoav Shoham.
2286
+
2287
+ 2023. Parallel context windows for large language
2288
+
2289
+ models. In Proceedings of the 61st annual meeting of
2290
+
2291
+ the association for computational linguistics (volume
2292
+
2293
+ 1: Long papers), pages 6383–6402.
2294
+
2295
+ Machel Reid, Nikolay Savinov, Denis Teplyashin,
2296
+
2297
+ Dmitry Lepikhin, Timothy P Lillicrap, Jean-Baptiste
2298
+
2299
+ Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan
2300
+
2301
+ Firat, Julian Schrittwieser, and 1 others. 2024. Gem-
2302
+
2303
+ ini 1.5: Unlocking multimodal understanding across
2304
+
2305
+ millions of tokens of context. CoRR.
2306
+
2307
+ Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori
2308
+
2309
+ Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong,
2310
+
2311
+ Mor Geva, Jonathan Berant, and 1 others. 2022.
2312
+
2313
+ Scrolls: Standardized comparison over long language
2314
+
2315
+ sequences. In Conference on Empirical Methods in
2316
+
2317
+ Natural Language Processing.
2318
+
2319
+ Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
2320
+
2321
+ Martinet, Marie-Anne Lachaux, Timothée Lacroix,
2322
+
2323
+ Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
2324
+
2325
+ Azhar, and 1 others. 2023. Llama: Open and effi-
2326
+
2327
+ cient foundation language models. arXiv preprint
2328
+
2329
+ arXiv:2302.13971.
2330
+
2331
+ Junlin Wang, WANG Jue, Ben Athiwaratkun, Ce Zhang,
2332
+
2333
+ and James Zou. 2024a. Mixture-of-agents enhances
2334
+
2335
+ large language model capabilities. In The Thirteenth
2336
+
2337
+ International Conference on Learning Representa-
2338
+
2339
+ tions.
2340
+
2341
+ Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong,
2342
+
2343
+ and Yangqiu Song. 2024b. Rethinking the bounds of
2344
+
2345
+ llm reasoning: Are multi-agent discussions the key?
2346
+
2347
+ In 62nd Annual Meeting of the Association for Com-
2348
+
2349
+ putational Linguistics, ACL 2024, pages 6106–6131.
2350
+
2351
+ Association for Computational Linguistics (ACL).
2352
+
2353
+ Zheng Wang, Shu Teo, Jieer Ouyang, Yongjun Xu, and
2354
+
2355
+ Wei Shi. 2024c. M-rag: Reinforcing large language
2356
+
2357
+ model performance through retrieval-augmented gen-
2358
+
2359
+ eration with multiple partitions. In Proceedings
2360
+
2361
+ of the 62nd Annual Meeting of the Association for
2362
+
2363
+ Computational Linguistics (Volume 1: Long Papers),
2364
+
2365
+ pages 1966–1978.
2366
+
2367
+ Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu,
2368
+
2369
+ Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang,
2370
+
2371
+ Shaokun Zhang, Jiale Liu, and 1 others. 2024. Au-
2372
+
2373
+ togen: Enabling next-gen llm applications via multi-
2374
+
2375
+ agent conversations. In First Conference on Lan-
2376
+
2377
+ guage Modeling.
2378
+
2379
+ An Yang, Anfeng Li, Baosong Yang, Beichen Zhang,
2380
+
2381
+ Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao,
2382
+
2383
+ Chengen Huang, Chenxu Lv, Chujie Zheng, Day-
2384
+
2385
+ iheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao
2386
+
2387
+ Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41
2388
+
2389
+ others. 2025. Qwen3 technical report. Preprint,
2390
+
2391
+ arXiv:2505.09388.
2392
+
2393
+ Andrew Chi-Chih Yao. 1979. Some complexity ques-
2394
+
2395
+ tions related to distributive computing (preliminary
2396
+
2397
+ report). In Proceedings of the eleventh annual ACM
2398
+
2399
+ symposium on Theory of computing, pages 209–213.
2400
+
2401
+ Howard Yen, Tianyu Gao, and Danqi Chen. 2024. Long-
2402
+
2403
+ context language modeling with parallel context en-
2404
+
2405
+ coding. In Proceedings of the 62nd Annual Meeting
2406
+
2407
+ of the Association for Computational Linguistics (Vol-
2408
+
2409
+ ume 1: Long Papers), pages 2588–2610.
2410
+
2411
+ Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu,
2412
+
2413
+ Bryan Hooi, and Shumin Deng. 2024a. Exploring
2414
+
2415
+ collaboration mechanisms for llm agents: A social
2416
+
2417
+ psychology view. In Proceedings of the 62nd An-
2418
+
2419
+ nual Meeting of the Association for Computational
2420
+
2421
+ Linguistics (Volume 1: Long Papers), pages 14544–
2422
+
2423
+ 14607.
2424
+
2425
+ Xinrong Zhang, Yingfa Chen, Shengding Hu, Zi-
2426
+
2427
+ hang Xu, Junhao Chen, Moo Khai Hao, Xu Han,
2428
+
2429
+ Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and
2430
+
2431
+ 1 others. 2024b. inftybench: Extending long con-
2432
+
2433
+ text evaluation beyond 100k tokens. arXiv preprint
2434
+
2435
+ arXiv:2402.13718.
2436
+
2437
+ Xiutian Zhao, Ke Wang, and Wei Peng. 2024. An elec-
2438
+
2439
+ toral approach to diversify llm-based multi-agent col-
2440
+
2441
+ lective decision-making. In EMNLP.
2442
+
2443
+ 11
2444
+
2445
+ ### Page 12
2446
+
2447
A Communication Protocols

This appendix provides complete specifications for the three communication protocols implemented in SILO-BENCH. Each protocol defines a distinct coordination substrate constraining the mechanism of information exchange while preserving full agent autonomy over content and strategy.
A.1 Protocol Overview

Table 5 summarizes the three protocols. They span the spectrum of coordination paradigms along three axes: (1) explicit vs. implicit addressing—P2P requires agents to name recipients, BP eliminates addressing entirely, and SFS routes coordination through shared state; (2) direct vs. indirect communication—P2P and BP involve direct message exchange, whereas SFS agents never “speak” to each other explicitly; (3) default density—P2P encourages sparse targeted exchanges, BP defaults to dense all-to-all dissemination, and SFS density depends entirely on read/write behavior.
A.2 Peer-to-Peer Protocol (P2P)

The P2P protocol implements directed messaging through SQLite-backed mailboxes. Each agent maintains a private inbox; messages are delivered asynchronously to the recipient’s buffer until its next activation. Agents must decide not only what to communicate but whom to contact, enabling evaluation of task-appropriate routing strategy discovery. Available actions are: send_message(target_id, content), which delivers a message to the specified agent; receive_messages(), which retrieves all pending messages; wait(), which signals completion of the agent’s decision for the current round (see below); and submit_result(answer), which submits the final answer. Messages are stored in an in-memory SQLite database recording sender ID, recipient ID, content, timestamp, and read status. Delivery ordering within each sender-recipient pair is guaranteed; no global ordering is enforced.
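The mailbox mechanics described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation; the Mailbox class name and its schema details are our own, but it reproduces the stated properties (in-memory SQLite, unread-message retrieval, per-pair FIFO ordering via insertion order).

```python
import sqlite3
import time

class Mailbox:
    """Sketch of a SQLite-backed P2P mailbox with private per-agent inboxes."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE messages ("
            " sender INTEGER, recipient INTEGER, content TEXT,"
            " ts REAL, read INTEGER DEFAULT 0)"
        )

    def send_message(self, sender, target_id, content):
        # Asynchronous delivery: the row sits in the store until the
        # recipient's next receive_messages() call.
        self.db.execute(
            "INSERT INTO messages (sender, recipient, content, ts)"
            " VALUES (?, ?, ?, ?)",
            (sender, target_id, content, time.time()),
        )

    def receive_messages(self, agent_id):
        # Retrieve all pending (unread) messages in insertion order,
        # which guarantees FIFO ordering within each sender-recipient pair.
        rows = self.db.execute(
            "SELECT rowid, sender, content FROM messages"
            " WHERE recipient = ? AND read = 0 ORDER BY rowid",
            (agent_id,),
        ).fetchall()
        self.db.executemany(
            "UPDATE messages SET read = 1 WHERE rowid = ?",
            [(r[0],) for r in rows],
        )
        return [(sender, content) for _, sender, content in rows]
```

A second call to receive_messages() for the same agent returns an empty list, since previously delivered messages are marked as read.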
A.3 Broadcast Protocol (BP)

The BP protocol implements broadcast messaging where each transmission reaches all other agents simultaneously. An implicit aggregator collects messages each round and distributes the compiled history to all participants. Available actions are: broadcast_message(content), receive_messages(), list_agents(), wait(), and submit_result(answer). Broadcast messages are stored centrally, tagged with sender ID and timestamp, and delivered as a chronologically ordered compiled view at each round.
A.4 Shared File System Protocol (SFS)

The SFS protocol implements indirect coordination through a shared key-value store visible to all agents. Rather than exchanging messages directly, agents read and write to a common namespace, enabling asynchronous coordination analogous to blackboard architectures. Available actions are: list_files(prefix), read_file(path), write_file(path, content), delete_file(path), wait(), and submit_result(answer). The shared file system is backed by an in-memory SQLite database storing path, content, creation time, and last modification time. Writes are immediately visible to subsequent reads; concurrent writes to the same path follow last-writer-wins semantics.
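The store's semantics can be sketched with a dict-backed stand-in (our illustration, not the benchmark's SQLite implementation; the class name is hypothetical):

```python
class SharedFileSystem:
    """Sketch of the SFS namespace: immediate visibility, last-writer-wins."""

    def __init__(self):
        self.files = {}  # path -> content

    def write_file(self, path, content):
        # Concurrent writes to the same path: the later write overwrites
        # the earlier one (last-writer-wins).
        self.files[path] = content

    def read_file(self, path):
        # Writes are immediately visible to subsequent reads.
        return self.files.get(path)

    def list_files(self, prefix=""):
        return sorted(p for p in self.files if p.startswith(prefix))

    def delete_file(self, path):
        self.files.pop(path, None)
```

For example, two successive writes to results/agent_0 leave only the second value readable, and list_files("results/") enumerates the shared namespace under that prefix.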
A.5 The wait() Action: Formal Specification

The wait() action is semantically uniform across all three protocols and serves as an explicit round-boundary signal. When an agent invokes wait(), all remaining operations in its current round are skipped, and the agent’s decision phase for round r is marked complete. The agent is then suspended until the start of round r+1.

At the start of round r+1, the following activation and message-delivery rules apply:

• P2P and BP: The agent’s inbox is populated with all messages sent to it (P2P) or broadcast by any agent (BP) during round r. These are delivered atomically at round start; no message sent in round r is visible before round r+1.

• SFS: The agent observes the full shared file system state as of the end of round r, including all writes committed by other agents during round r.

• No blocking: wait() does not block on a specific event or agent. It is a “pass” that yields control to the synchronous round scheduler. If an agent never calls wait() explicitly, the runtime automatically advances it to the next round after its action budget is exhausted.

• Post-submit behavior: Once an agent has called submit_result(), subsequent rounds are no-ops for that agent—it neither receives new messages nor is activated again.
| Protocol | Description | Available Actions |
|---|---|---|
| P2P | Directed messaging via agent-addressed mailboxes | send_message, receive_messages, wait, submit_result |
| BP | Broadcast messaging to all agents simultaneously | broadcast_message, receive_messages, list_agents, wait, submit_result |
| SFS | Coordination through shared key-value storage | list_files, read_file, write_file, delete_file, wait, submit_result |

Table 5: Comparison of communication protocols in SILO-BENCH.
This synchronous round-based model ensures that all N agents observe a consistent snapshot of the communication state at each round boundary, making execution reproducible and analysis tractable.
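The round semantics above (messages sent in round r become visible only at the start of round r+1) can be sketched as a simple scheduler. This is our illustration under stated assumptions, not the benchmark's runtime; agents are modeled as callables that take an inbox and return outgoing (target, message) pairs.

```python
def run_rounds(agents, r_max=100):
    """Synchronous round scheduler sketch.

    agents: dict mapping agent id -> callable(inbox) -> list of (target, msg).
    Messages produced in round r are buffered and delivered atomically
    at the start of round r + 1.
    """
    pending = {a: [] for a in agents}  # messages awaiting next-round delivery
    inboxes = {a: [] for a in agents}
    for _ in range(r_max):
        # Deliver atomically at round start: all agents see the same snapshot.
        for a in agents:
            inboxes[a], pending[a] = pending[a], []
        # Activate all agents "in parallel" on that snapshot.
        outgoing = {a: agents[a](inboxes[a]) for a in agents}
        # Buffer this round's sends for delivery next round.
        for sender, msgs in outgoing.items():
            for target, msg in msgs:
                pending[target].append((sender, msg))
```

With two agents where agent 0 sends "ping" to agent 1 every round, agent 1's inbox is empty in round 1 and contains the round-1 message only from round 2 onward, matching the delivery rule above.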
B Prompt Design: Structural Guidance without Role Prescription

This appendix details the prompt templates used to initialize agents in SILO-BENCH. Our core design philosophy is to provide task-structural information while preserving strategic autonomy: prompts convey high-level dependency patterns and potential coordination approaches, but do not prescribe mandatory execution sequences or assign semantic roles.
B.1 Base Prompt Template

Each agent receives an initialization prompt following this structure:

Agent Initialization Prompt

```
You are Agent {agent_id} in a multi-agent
system consisting of {N} agents (IDs range
from 0 to {N-1}).

Task Description:
{task_description}

Your Local Data:
{data_shard}

Communication Protocol:
{protocol_description}

Available Actions:
- {protocol_specific_actions}
- submit_result(answer): Submit your final
  answer when confident

Your goal is to coordinate with other agents
to compute the globally correct answer. No
single agent has sufficient information to
solve this task independently. When you
have determined the answer, submit it using
submit_result().
```

The {protocol_specific_actions} placeholder is instantiated according to the protocol specifications in Appendix A.
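Instantiating the template above amounts to simple placeholder substitution. The sketch below is our paraphrase, not the benchmark's code: render_prompt is a hypothetical helper, the template string is abbreviated, and {N-1} is realized as a precomputed max_id since str.format cannot do arithmetic.

```python
# Abbreviated paraphrase of the base template from this appendix.
BASE_TEMPLATE = (
    "You are Agent {agent_id} in a multi-agent system consisting of "
    "{n} agents (IDs range from 0 to {max_id}).\n\n"
    "Task Description:\n{task_description}\n\n"
    "Your Local Data:\n{data_shard}\n\n"
    "Available Actions:\n{actions}\n"
    "- submit_result(answer): Submit your final answer when confident\n"
)

# Example P2P action list (see Appendix A.2).
P2P_ACTIONS = "- send_message(target_id, content)\n- receive_messages()\n- wait()"

def render_prompt(agent_id, n, task_description, data_shard, actions=P2P_ACTIONS):
    # Fill every placeholder; {N-1} becomes max_id = n - 1.
    return BASE_TEMPLATE.format(
        agent_id=agent_id, n=n, max_id=n - 1,
        task_description=task_description, data_shard=data_shard,
        actions=actions,
    )
```

All agents receive the same structure; only agent_id and data_shard vary, which is what keeps the prompts structurally identical across the team.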
B.2 Task Description Examples

Task descriptions convey task structure and potential coordination patterns while leaving concrete implementation decisions to the agents themselves. The three examples below illustrate how descriptions scale from simple aggregation to global reorganization tasks.
Example: Global Maximum (Task I-01)

Find the maximum value across all data distributed among the agents. Each agent holds a portion of a larger array. The correct answer is the single largest integer across the entire distributed dataset. You must coordinate with other agents to determine this global maximum.
Example: Prefix Sum (Task II-11)

Compute the prefix sum array for a sequence distributed across agents. Agent 0 holds elements [0, k), Agent 1 holds elements [k, 2k), and so on. The prefix sum at position i is the sum of all elements from position 0 to i. You must coordinate to compute the correct prefix sum for your portion, accounting for cumulative sums from preceding agents.
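The coordination pattern this task demands can be sketched directly: each agent can finish only after receiving the cumulative total of all preceding shards as an offset. The function below is our reference computation of that pattern, not a prescribed agent strategy.

```python
from itertools import accumulate

def distributed_prefix_sum(shards):
    """shards[i] is Agent i's local slice of the global sequence.

    Returns each agent's portion of the global prefix-sum array,
    computed by propagating a cumulative offset shard to shard.
    """
    results, offset = [], 0
    for shard in shards:
        local = list(accumulate(shard))               # local prefix sums
        results.append([offset + x for x in local])   # shift by preceding total
        offset += local[-1] if local else 0           # pass cumulative sum onward
    return results
```

The sequential dependency is visible in the loop: Agent i's offset is exactly the sum of shards 0..i-1, which is why a chain topology with O(N) messages is optimal for this task level.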
Example: Distributed Sort (Task III-21)

Sort the entire distributed array in ascending order. Each agent holds a portion of the unsorted data. The final result should be the complete sorted sequence. You must coordinate to exchange data and determine the correct global ordering.
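One way agents could realize this task is the sample-based partitioning paradigm mentioned in this appendix (an option, never a prescribed sequence). The sketch below is our centralized stand-in for that distributed pattern; for clarity it derives splitters from exact global quantiles rather than a sample.

```python
def sample_sort(shards):
    """Sketch of sample-sort over per-agent shards.

    1. Agree on n-1 splitters partitioning the value range.
    2. Route every element to the bucket (agent) owning its range.
    3. Sort each bucket locally; concatenation is globally sorted.
    """
    n = len(shards)
    all_values = sorted(v for shard in shards for v in shard)
    splitters = [all_values[(i + 1) * len(all_values) // n] for i in range(n - 1)]
    buckets = [[] for _ in range(n)]
    for shard in shards:
        for v in shard:
            idx = sum(v >= s for s in splitters)  # bucket whose range contains v
            buckets[idx].append(v)
    return [sorted(b) for b in buckets]
```

In the actual benchmark setting, steps 1 and 2 require explicit inter-agent exchange, which is what pushes Level-III message complexity toward O(N log N) or O(N²).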
B.3 Design Principles

Our prompts are carefully calibrated to provide structural information without prescribing behavior. On the one hand, descriptions convey whether tasks involve aggregation, sequential dependencies, or global data reorganization (e.g., “accounting for cumulative sums from preceding agents”); some prompts suggest possible coordination patterns as options rather than requirements (e.g., “consider establishing a coordinator” or “you may exchange with neighbors”); and for complex tasks, prompts may mention general algorithmic paradigms (e.g., “sample-based partitioning”) without specifying concrete steps. On the other hand, we do not assign semantic roles—no agent is designated “Manager,” “Worker,” or “Coordinator”—and prompts never specify “Step 1: Agent 0 does X, Step 2: Agent 1 does Y.” Agents decide their own message timing and recipients, and must discover consensus and verification mechanisms independently. The distinction is illustrated below:

✗ Prescribed (NOT our approach): “You are the leader. Step 1: Collect data from all agents. Step 2: Compute the result. Step 3: Broadcast the answer.”

✓ Our approach: “Can you identify a leader to collect and compare results? How would agents coordinate to reach consensus?”

All agents receive structurally identical prompts (modulo their ID and data shard), ensuring no agent holds implicit leadership status. The phrase “No single agent has sufficient information” is included explicitly to prevent premature submission of partial results. This design tests whether agents can translate high-level task understanding into concrete coordination protocols—a capability that, as our results demonstrate, remains largely absent in current LLMs.
C Experimental Details

C.1 Agent Scale Rationale

The six agent counts are chosen to probe qualitatively distinct coordination regimes. The minimal team (N = 2) isolates fundamental pairwise coordination without overhead. Small groups (N ∈ {5, 10}) allow agents to feasibly track all peers simultaneously. Medium scale (N = 20) begins to make exhaustive peer tracking challenging, pushing agents toward selective communication. Large scale (N ∈ {50, 100}) makes hierarchical or highly selective coordination effectively necessary—and, as our results confirm, largely beyond the reach of current LLMs.
C.2 Execution Parameters

Each configuration is allocated a maximum of Rmax = 100 communication rounds. Within each round, all agents are activated in parallel: they receive incoming messages from the previous round, independently decide on actions, and execute them simultaneously. Messages or files written in round r become visible to all agents at the start of round r + 1. An agent exits the coordination loop upon invoking submit_result(answer); agents that fail to submit within Rmax rounds are assigned a null answer counted as incorrect. Due to computational constraints, each configuration is executed once with fixed random seeds for data generation.

| Component | Min | Mean | Max | Std |
|---|---|---|---|---|
| Base Prompt (Tbase) | 612 | 748.8 | 989 | 87.3 |
| Data Shard (Tdata) | 45 | 312.4 | 1,856 | 298.6 |

Table 6: Token consumption of base prompt and data shard per agent (across all configurations).
C.3 Infrastructure

Experiments were conducted on a GH200 cluster, with up to 50 concurrent configurations executed simultaneously. Total compute amounted to more than 500 GPU-hours equivalent. Complete conversation histories, token counts, and timing information were recorded for all runs.
C.4 Model Licenses and Intended Use

All language models used in this study are open-source and deployed locally on our infrastructure. DeepSeek-V3.1 is released under the MIT License [1], GPT-OSS-120B under the Apache 2.0 License [2], and Qwen3-Next-80B-A3B under the Apache 2.0 License [3], all permitting research and commercial use.
D Token Budget Feasibility Analysis

To verify that SILO-BENCH operates within practical token limits, we profile token consumption across all 54 configurations, decomposing total usage into three components: the base initialization prompt (Tbase), the local data shard (Tdata), and accumulated communication messages (Tcomm). Table 6 summarizes the fixed components, and Table 7 reports model-dependent communication costs.
[1] https://huggingface.co/deepseek-ai/DeepSeek-V3.1
[2] https://huggingface.co/openai/gpt-oss-120b
[3] https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
| Model | Min | Mean | Max | Std |
|---|---|---|---|---|
| DeepSeek-V3.1 | 124 | 8,498.7 | 98,432 | 12,847 |
| GPT-OSS-120B | 89 | 2,049.3 | 45,218 | 5,632 |
| Qwen3-Next-80B-A3B | 156 | 2,299.3 | 52,847 | 6,891 |

Table 7: Communication (Tcomm) token consumption per agent by model (across all configurations).

| Model | Mean Util. | 95th Pctl. | Max Util. |
|---|---|---|---|
| DeepSeek-V3.1 | 7.5% | 28.4% | 76.9% |
| GPT-OSS-120B | 2.4% | 8.2% | 35.3% |
| Qwen3-Next-80B-A3B | 2.6% | 9.1% | 41.3% |

Table 8: Context window utilization (%) for 128K-context models.
Context window utilization, shown in Table 8, remains low on average: DeepSeek-V3.1 uses 7.5% of the 128K budget on average, while GPT-OSS-120B and Qwen3 stay below 3%. The 95th-percentile cases—driven by redundant broadcasting, failed convergence with extended verbose rounds, or agents copy-pasting full message histories—are precisely the coordination inefficiencies SILO-BENCH is designed to expose. Overall, frontier models (128K–200K context) can run all configurations comfortably; mid-tier models (32K) handle over 90% of configurations, with Level-III at N ≥ 50 potentially requiring truncation; and smaller models (8K) are suitable for N ≤ 10 and Level I–II tasks.
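The utilization figures follow directly from the components in Tables 6 and 7: per-agent usage is Tbase + Tdata + Tcomm divided by the context window. A quick check (ours; we assume the decimal "128K" convention of 128,000 tokens, which reproduces the reported 7.5% mean for DeepSeek-V3.1):

```python
CONTEXT_WINDOW = 128_000  # assumed "128K" convention

def utilization(t_base, t_data, t_comm, window=CONTEXT_WINDOW):
    """Fraction of the context window consumed by one agent's tokens."""
    return (t_base + t_data + t_comm) / window

# Mean components from Tables 6 and 7 for DeepSeek-V3.1.
mean_util = utilization(748.8, 312.4, 8498.7)
```

This yields roughly 0.075, i.e., the 7.5% mean utilization reported in Table 8; the same arithmetic with the GPT-OSS-120B and Qwen3 means lands below 3%.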
E Complete Task Specifications

Table 9 provides the complete mapping between SILO-BENCH tasks and their algorithmic foundations, including the distributed adaptation approach for each.
F Detailed Results

F.1 Task-Level Breakdown for DeepSeek-V3.1

Table 10 provides a comprehensive breakdown of DeepSeek-V3.1’s success rate across all 30 tasks and three communication protocols. Tasks achieving ≥50% success rate under any protocol are highlighted in green. Level-I aggregation tasks cluster at the top, with Distributed Vote (I-03) and Any Match (I-04) achieving near-perfect performance across all protocols. Performance degrades sharply for Level-III tasks, with K-Means Iteration (III-25), Collaborative Filtering (III-27), PageRank Step (III-28), and Matrix Multiply (III-30) achieving zero success across all protocols—these tasks require precise numerical computation over all data shards simultaneously, which proves beyond current distributed LLM capabilities.
F.2 Results by Model, Protocol, and Difficulty

Table 11 provides success rates for all model-protocol-difficulty combinations, and Table 12 reports success rates by agent count across models. Together, they confirm that the patterns observed for DeepSeek-V3.1 are consistent across all three models: P2P generally outperforms or matches BP for GPT-OSS-120B and Qwen3, SFS consistently underperforms, and performance degrades monotonically with both complexity level and agent count.
F.3 Communication Density Analysis

Table 13 reports communication density across configurations. A consistent pattern emerges across all models: P2P yields substantially higher densities than BP, reflecting agents’ tendency to send multiple targeted messages per pair across rounds; BP densities cluster below 1.0, consistent with one-to-all single broadcasts; and SFS yields notably lower densities than P2P across all models and difficulty levels, indicating that file-based coordination generates sparser cross-agent information flow under our operational definition (read-based transfer counting)—which further explains SFS’s systematic underperformance on SR despite nontrivial write activity.
G Failure Mode Analysis

G.1 Representative Failure Cases

Table 14 presents representative examples of each failure mode extracted from DeepSeek-V3.1 execution logs. The cases illustrate how failures manifest in practice: premature submission occurs even after reasonable communication volume (Case 4: only 28 of 100 peers contacted before submitting); consensus failure can persist despite near-unanimous agreement, with a single outlier agent preventing full success (Cases 1 and 3); and computation error strikes even when agents have gathered the complete required data (Case 2: off-by-one arithmetic during aggregation).
| ID | Task Name | Reference | Distributed Adaptation |
|---|---|---|---|
| | Level I: Aggregation (Optimal: Star/Tree Topology, O(N) messages) | | |
| I-01 | Global Maximum | LC-414 | Array partitioned; local max → global aggregation |
| I-02 | Word Frequency | LC-2085 | Word lists distributed; count target word globally |
| I-03 | Distributed Vote | LC-169 | Vote records partitioned; aggregate to find majority |
| I-04 | Any Match | LC-28 | String collection split; detect pattern in any shard |
| I-05 | Range Count | LC-327 | Count elements in range across shards |
| I-06 | Checksum (XOR) | LC-136 | Data blocks distributed; compute global XOR |
| I-07 | Average Value | LC-1491 | Array partitioned; combine local sums and counts |
| I-08 | Set Union Size | LC-217 | Elements distributed; compute $\vert\bigcup_i D_i\vert$ |
| I-09 | Top-K Selection | LC-215 | Array partitioned; merge local top-K candidates |
| I-10 | Standard Deviation | — | Two-phase: global mean → global variance |
| | Level II: Mesh Network (Optimal: Chain Topology, O(N) messages) | | |
| II-11 | Prefix Sum | LC-1480 | Sequential dependency; cumulative offset propagation |
| II-12 | Moving Average | LC-346 | Sliding window spans boundaries; neighbor exchange |
| II-13 | Longest Palindrome | LC-5 | String partitioned; palindromes may cross boundaries |
| II-14 | 1D Life Game | LC-289 | Cellular automaton; boundary cells need neighbor states |
| II-15 | Pattern Search | LC-392 | Subsequence matching across partitioned sequence |
| II-16 | Trapping Rain | LC-42 | Global max-left/max-right propagation required |
| II-17 | Diff Array | LC-1094 | Difference array with boundary handling |
| II-18 | List Ranking | LC-542 | Linked list ranking requires predecessor chain |
| II-19 | Merge Neighbors | — | Boundary element merging between adjacent agents |
| II-20 | Pipeline Hash | — | Sequential hash with chained dependencies |
| | Level III: Shuffling (Optimal: Varies, O(N log N) to O(N²) messages) | | |
| III-21 | Distributed Sort | LC-912 | Sample-sort or merge-sort across partitions |
| III-22 | Median of Medians | LC-295 | Iterative median selection across distributed data |
| III-23 | Graph Components | LC-323 | Edges distributed; iterative union-find |
| III-24 | BFS Distance | LC-542 | Graph BFS with distributed edge list |
| III-25 | K-Means Iteration | LC-296 | One K-means iteration with distributed points |
| III-26 | Global Distinct | LC-349 | Hash-based global deduplication |
| III-27 | Collab. Filtering | LC-1 | User-item matching with distributed vectors |
| III-28 | PageRank Step | LC-207 | One PageRank iteration with distributed edges |
| III-29 | Load Balance | LC-410 | Task redistribution to minimize load variance |
| III-30 | Matrix Multiply | LC-311 | Row/column partitioned matrix multiplication |

Table 9: Complete specification of SILO-BENCH tasks with LeetCode references and distributed adaptation descriptions.
+ G.2 Failure Mode Definitions and
3292
+
3293
+ Co-occurrence
3294
+
3295
+ We formally define the three failure modes as fol-
3296
+
3297
+ lows. Premature Submission occurs when an agent
3298
+
3299
+ invokes submit_result() before receiving infor-
3300
+
3301
+ mation from a sufficient subset of peers—where
3302
+
3303
+ “sufficient” means the minimum number of agents
3304
+
3305
+ whose data is required to compute the correct an-
3306
+
3307
+ swer. Consensus Failure occurs when agents sub-
3308
+
3309
+ mit multiple distinct answers (|{ˆyi}N
3310
+
3311
+ i=1| > 1), in-
3312
+
3313
+ dicating that coordination failed to synchronize
3314
+
3315
+ agents’ understanding of global state. Computa-
3316
+
3317
+ tion Error occurs when an agent receives sufficient
3318
+
3319
+ information but submits an incorrect answer, iso-
3320
+
3321
+ lating failures in the reasoning phase from those in
3322
+
3323
+ the communication phase.
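The three definitions above can be applied mechanically to a run's submission log. A minimal sketch, with `Submission` and `classify_failures` as hypothetical names (the paper does not publish its classifier):

```python
from dataclasses import dataclass

@dataclass
class Submission:
    agent_id: int
    answer: int
    peers_heard_from: set  # ids of agents whose data this agent received

def classify_failures(submissions, required_peers, correct_answer):
    """Apply the three definitions above to one run:
    premature   = submitted before hearing from every required peer,
    consensus   = more than one distinct submitted answer,
    computation = heard from all required peers yet answered wrong."""
    modes = set()
    if len({s.answer for s in submissions}) > 1:
        modes.add("consensus_failure")
    for s in submissions:
        missing = required_peers - {s.agent_id} - s.peers_heard_from
        if missing:
            modes.add("premature_submission")
        elif s.answer != correct_answer:
            modes.add("computation_error")
    return modes
```

Note that the modes are assessed per agent, which is how a single run can exhibit several of them at once.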

These modes frequently co-occur within single runs, as shown in Table 15. The high co-occurrence between premature submission and consensus failure (67 cases) suggests a cascading effect: agents submitting early cannot participate in subsequent consensus-building, leaving remaining agents with incomplete information and widening the convergence gap.

### Page 17

| Task | BP | P2P | SFS | Avg |
|---|---|---|---|---|
| **Level I: Aggregation** | | | | |
| I-01 Global Max | 100 | 100 | 80 | 93.3 |
| I-02 Word Frequency | 100 | 52 | 70 | 73.9 |
| I-03 Distributed Vote | 100 | 100 | 100 | 100.0 |
| I-04 Any Match | 100 | 100 | 99 | 99.6 |
| I-05 Range Count | 83 | 67 | 63 | 71.0 |
| I-06 Checksum (XOR) | 17 | 0 | 2 | 6.1 |
| I-07 Average Value | 99 | 83 | 46 | 76.2 |
| I-08 Set Union Size | 50 | 50 | 42 | 47.5 |
| I-09 Top-K Select | 36 | 67 | 21 | 41.2 |
| I-10 Standard Deviation | 17 | 17 | 17 | 16.7 |
| **Level II: Structured Mesh** | | | | |
| II-11 Prefix Sum | 84 | 80 | 43 | 68.8 |
| II-12 Moving Average | 0 | 0 | 0 | 0.0 |
| II-13 Longest Palindrome | 49 | 66 | 48 | 54.5 |
| II-14 1D Life Game | 27 | 47 | 24 | 32.4 |
| II-15 Pattern Search | 0 | 17 | 17 | 11.1 |
| II-16 Trapping Rain | 33 | 33 | 40 | 35.6 |
| II-17 Diff Array | 48 | 62 | 20 | 43.3 |
| II-18 List Ranking | 0 | 0 | 0 | 0.0 |
| II-19 Merge Neighbors | 59 | 60 | 66 | 61.8 |
| II-20 Pipeline Hash | 52 | 36 | 55 | 47.6 |
| **Level III: Global Shuffle** | | | | |
| III-21 Distributed Sort | 33 | 20 | 20 | 24.4 |
| III-22 Median of Medians | 20 | 20 | 0 | 13.3 |
| III-23 Graph Components | 40 | 40 | 14 | 31.3 |
| III-24 BFS Distance | 0 | 0 | 0 | 0.0 |
| III-25 K-Means Iteration | 0 | 0 | 0 | 0.0 |
| III-26 Global Distinct | 33 | 33 | 0 | 22.2 |
| III-27 Collab. Filtering | 0 | 0 | 0 | 0.0 |
| III-28 PageRank Step | 0 | 0 | 0 | 0.0 |
| III-29 Load Balance | 32 | 3 | 63 | 32.9 |
| III-30 Matrix Multiply | 0 | 0 | 0 | 0.0 |

Table 10: Success Rate (%) by task and communication protocol for DeepSeek-V3.1. Tasks with ≥50% success are highlighted in gray background.
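Task I-06 (Checksum) is the weakest Level-I result in the table above, even though the reduction itself is trivial. A minimal sketch of it (ours, not the benchmark harness; `xor_checksum` is a hypothetical name) shows why premature submission is fatal here:

```python
from functools import reduce
from operator import xor

def xor_checksum(shards):
    """Fold XOR over every agent's shard. XOR is associative and
    commutative, so partial results can be combined in any order,
    but omitting even one agent's shard silently changes the
    checksum, with no way to detect the omission from the value."""
    return reduce(xor, (reduce(xor, shard, 0) for shard in shards), 0)
```

Unlike Global Max or Any Match, a partial XOR result is never a valid lower bound or early answer, so the task punishes any agent that submits before hearing from all peers.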

G.3 Behavioral Patterns and Leader Emergence

Table 16 compares behavioral metrics between successful and failed runs. Successful runs complete in notably fewer rounds (8.3 vs. 12.7), suggesting that effective coordination converges quickly while failed runs engage in extended but ultimately unproductive communication loops. Verification behaviors appear in over 95% of runs regardless of outcome, confirming that the bottleneck is not communication intent but reasoning quality.

We also examined whether spontaneous leader emergence correlates with task success, classifying an agent as an emergent leader if it receives more than 1.5× the average number of messages. The results in Table 17 are counterintuitive: leader emergence does not consistently improve outcomes, and for Level-III tasks, runs with an emergent leader achieve 0% success versus 33.3% without one. This suggests that spontaneous centralization at high complexity creates coordination bottlenecks (the designated aggregator becomes overwhelmed by the volume of global data) rather than resolving them.

| Model | Protocol | L-I | L-II | L-III |
|---|---|---|---|---|
| DeepSeek-V3.1 | BP | 69.7 | 34.5 | 14.7 |
| | P2P | 62.9 | 39.9 | 11.5 |
| | SFS | 53.5 | 30.6 | 9.0 |
| GPT-OSS-120B | BP | 19.6 | 13.9 | 9.7 |
| | P2P | 34.9 | 18.5 | 7.5 |
| | SFS | 27.7 | 11.0 | 9.1 |
| Qwen3-Next-80B-A3B | BP | 9.0 | 3.2 | 1.1 |
| | P2P | 27.6 | 3.4 | 0.3 |
| | SFS | 25.5 | 2.1 | 1.7 |

Table 11: Success Rate (%) by model, protocol, and difficulty level.

| Model | N=2 | N=5 | N=10 | N=20 | N=50 | N=100 |
|---|---|---|---|---|---|---|
| DeepSeek-V3.1 | 61.2 | 48.5 | 39.9 | 33.6 | 19.0 | 18.1 |
| GPT-OSS-120B | 34.4 | 28.0 | 14.0 | 13.2 | 5.2 | 6.4 |
| Qwen3-Next-80B-A3B | 17.2 | 9.1 | 8.6 | 7.4 | 5.1 | 1.3 |

Table 12: Success Rate (%) by model and agent count.
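The leader criterion used here (an agent receiving more than 1.5× the average number of messages) is simple enough to state as code; `emergent_leaders` is a hypothetical name for the check:

```python
def emergent_leaders(received_counts, factor=1.5):
    """Flag agents whose received-message count exceeds `factor`
    times the run-wide average, the paper's emergence criterion.
    `received_counts` maps agent id -> messages received."""
    avg = sum(received_counts.values()) / len(received_counts)
    return {agent for agent, n in received_counts.items() if n > factor * avg}
```

Because the threshold is relative to the run's own mean, a run where traffic is spread evenly yields no leader, regardless of absolute message volume.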

G.4 Successful Coordination Examples

To contrast with the failure modes above, we document two illustrative successful patterns. In Case S-1 (Task I-07, N=5), Agent-0 emerged as coordinator organically: all agents broadcast local results to Agent-0, which computed and rebroadcast the global answer. All agents verified and submitted identically within 4 rounds (100% success). In Case S-2 (Task I-01, N=10), agents adopted a distributed verification strategy, with each agent confirming understanding with two neighbors before submission. This redundant verification eliminated consensus failures despite higher message overhead. Both successful patterns share a key property: explicit synchronization checkpoints where agents confirm mutual understanding before proceeding, a discipline entirely absent in failed runs.
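Case S-1's hub-and-spoke pattern for the averaging task I-07 can be sketched as a single-process toy model (names and structure are ours, not the benchmark's):

```python
def coordinator_average(local_shards):
    """Toy model of Case S-1: each agent sends its partial (sum, count)
    to one coordinator, which computes the global average and
    rebroadcasts it, so every agent submits the identical answer."""
    # Round 1: every agent sends its local partial to the coordinator.
    partials = [(sum(shard), len(shard)) for shard in local_shards]
    # Round 2: the coordinator aggregates and rebroadcasts the answer.
    answer = sum(s for s, _ in partials) / sum(c for _, c in partials)
    # Round 3: all agents submit the same value, so no consensus failure.
    return [answer] * len(local_shards)
```

Sending (sum, count) pairs rather than local averages is the step failed runs tend to miss: averaging the local averages is wrong whenever shard sizes differ.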

H Prompt Scaffold Ablation

To assess the sensitivity of our results to prompt design, we ran a controlled ablation under DeepSeek-V3.1 + P2P with three scaffolding conditions beyond the standard neutral prompt: (a) a planning round, a dedicated strategy-discussion round before data exchange begins; (b) a protocol reminder, a brief restatement of available communication actions injected at each round; and (c) a scratchpad hint, a suggestion to maintain a shared intermediate workspace.

The planning round yields the most consistent gains (∼5–8 on Level-II/III); the protocol reminder helps primarily on Level-I; the scratchpad hint benefits intermediate scales (N=10–20) but cannot prevent collapse at N≥50. Critically, qualitative failure patterns remain stable across all conditions: agents continue to communicate actively while failing to translate interaction into correct distributed computation, and the Communication-Reasoning Gap persists regardless of scaffolding. This confirms that the bottleneck reflects genuine LLM limitations in distributed information synthesis rather than a prompting artifact.

### Page 18

| Model | Protocol | L-I | L-II | L-III |
|---|---|---|---|---|
| DeepSeek-V3.1 | BP | 0.56 | 0.82 | 1.46 |
| | P2P | 0.76 | 1.06 | 2.83 |
| | SFS | 0.82 | 1.08 | 1.51 |
| GPT-OSS-120B | BP | 0.54 | 0.84 | 0.94 |
| | P2P | 1.86 | 2.15 | 2.73 |
| | SFS | 0.73 | 0.54 | 0.95 |
| Qwen3-Next-80B-A3B | BP | 0.23 | 0.10 | 0.06 |
| | P2P | 1.15 | 0.38 | 0.33 |
| | SFS | 0.48 | 0.07 | 0.09 |

Table 13: Communication Density by model, protocol, and difficulty level.

### Page 19

| Case | Failure Mode | Description | Key Evidence |
|---|---|---|---|
| 1 | Consensus Failure | In task I-05, agents communicated extensively but failed to converge, submitting three distinct answers: {1176, 1182, 1167}. | 97 of 100 agents submitted 1182; Agent-3 submitted 1176; Agent-55 submitted 1167. |
| 2 | Computation Error | Agent-10 in task I-05 received complete data from all peers but computed an incorrect range count. | Submitted 619 instead of correct answer 620. Arithmetic error during final aggregation. |
| 3 | Consensus Failure | In task I-05 (different instance), 50 agents split between two answers despite communication. | 49 agents submitted 619; Agent-27 submitted 631. |
| 4 | Premature Submission | Agent-77 in task I-06 submitted before collecting sufficient data. | Submitted after receiving data from only 28 of 100 agents. Answer: 208; Expected: 114. |
| 5 | Consensus Failure | 100 agents in task I-06 produced 12 distinct answers for XOR checksum task. | Answers ranged from 42 to 238. Majority (86 agents) converged on 146. |

Table 14: Representative failure cases from DeepSeek-V3.1 experiments.

| | Premature | Consensus | Compute |
|---|---|---|---|
| Premature | 112 | 67 | 45 |
| Consensus | – | 90 | 52 |
| Compute | – | – | 86 |

Table 15: Co-occurrence of failure modes. Diagonal: total occurrences; off-diagonal: joint occurrences.

| Metric | Success | Failed |
|---|---|---|
| Verification Rate | 98.7% | 95.9% |
| Strategy Discussion Rate | 93.5% | 87.2% |
| Avg. Messages per Agent | 31.2 | 27.4 |
| Avg. Rounds to Completion | 8.3 | 12.7 |

Table 16: Behavioral comparison between successful and failed runs.

| Level | Leader Rate | w/ Leader | w/o Leader |
|---|---|---|---|
| I | 27.5% | 56.8% | 62.1% |
| II | 21.8% | 23.5% | 59.0% |
| III | 23.8% | 0.0% | 33.3% |

Table 17: Leader emergence rates and associated success rates by complexity level.

| Scaffold | L-I SR | L-II SR | L-III SR |
|---|---|---|---|
| No scaffold (baseline) | 62.9 | 39.9 | 11.5 |
| +Planning round | 64.3 | 47.2 | 17.8 |
| +Protocol reminder | 65.1 | 41.3 | 12.1 |
| +Scratchpad hint | 63.7 | 44.6 | 14.9 |

Table 18: Prompt scaffold ablation results under DeepSeek-V3.1 + P2P. SR = Success Rate (%).