@chllming/wave-orchestration 0.6.3 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (112)
  1. package/CHANGELOG.md +57 -1
  2. package/README.md +39 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +48 -11
  39. package/docs/plans/context7-wave-orchestrator.md +20 -0
  40. package/docs/plans/current-state.md +8 -1
  41. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  42. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  43. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  44. package/docs/plans/wave-orchestrator.md +62 -11
  45. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  46. package/docs/reference/coordination-and-closure.md +436 -0
  47. package/docs/reference/live-proof-waves.md +25 -3
  48. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  49. package/docs/reference/proof-metrics.md +90 -0
  50. package/docs/reference/runtime-config/README.md +61 -0
  51. package/docs/reference/sample-waves.md +29 -18
  52. package/docs/reference/wave-control.md +164 -0
  53. package/docs/reference/wave-planning-lessons.md +131 -0
  54. package/package.json +5 -4
  55. package/releases/manifest.json +18 -0
  56. package/scripts/research/agent-context-archive.mjs +18 -0
  57. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  58. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  59. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  60. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  61. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  62. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  63. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  64. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  65. package/scripts/wave-orchestrator/config.mjs +175 -0
  66. package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
  67. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  68. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  69. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  70. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  71. package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
  72. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  73. package/scripts/wave-orchestrator/evals.mjs +23 -0
  74. package/scripts/wave-orchestrator/executors.mjs +3 -2
  75. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  76. package/scripts/wave-orchestrator/install.mjs +55 -1
  77. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  78. package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
  79. package/scripts/wave-orchestrator/launcher.mjs +796 -35
  80. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  81. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  82. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  83. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  84. package/scripts/wave-orchestrator/replay.mjs +10 -4
  85. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  86. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  87. package/scripts/wave-orchestrator/shared.mjs +26 -0
  88. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  89. package/scripts/wave-orchestrator/traces.mjs +157 -2
  90. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  91. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  92. package/scripts/wave-orchestrator/wave-files.mjs +17 -5
  93. package/scripts/wave.mjs +27 -0
  94. package/skills/repo-coding-rules/SKILL.md +1 -0
  95. package/skills/role-cont-eval/SKILL.md +1 -0
  96. package/skills/role-cont-qa/SKILL.md +13 -6
  97. package/skills/role-deploy/SKILL.md +1 -0
  98. package/skills/role-documentation/SKILL.md +4 -0
  99. package/skills/role-implementation/SKILL.md +4 -0
  100. package/skills/role-infra/SKILL.md +2 -1
  101. package/skills/role-integration/SKILL.md +15 -8
  102. package/skills/role-planner/SKILL.md +39 -0
  103. package/skills/role-planner/skill.json +21 -0
  104. package/skills/role-research/SKILL.md +1 -0
  105. package/skills/role-security/SKILL.md +2 -2
  106. package/skills/runtime-claude/SKILL.md +2 -1
  107. package/skills/runtime-codex/SKILL.md +1 -0
  108. package/skills/runtime-local/SKILL.md +2 -0
  109. package/skills/runtime-opencode/SKILL.md +1 -0
  110. package/skills/wave-core/SKILL.md +25 -6
  111. package/skills/wave-core/references/marker-syntax.md +16 -8
  112. package/wave.config.json +45 -0
@@ -0,0 +1,2251 @@
---
summary: 'Converted paper text and source links for DPBench: Large Language Models Struggle with Simultaneous Coordination.'
read_when:
- Reviewing harness and coordination research source material in the docs tree
- You want the extracted paper text with source links preserved
topics:
- planning-and-orchestration
- repo-context-and-evaluation
kind: 'paper'
title: 'DPBench: Large Language Models Struggle with Simultaneous Coordination'
---
# DPBench: Large Language Models Struggle with Simultaneous Coordination

<Note>
Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Najmul Hasan, Prashanth BusiReddyGari |
| Year | 2026 |
| Venue | arXiv 2602.13255 |
| Research bucket | P1 strong adjacent work |
| Maps to | Distributed-information coordination benchmarks with simultaneous constraints. |
| Harness fit | Useful benchmark for testing whether coordination-heavy planning systems scale beyond serial reasoning. |
| Source page | [Open source](https://arxiv.org/abs/2602.13255) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2602.13255.pdf) |

## Extracted text

### Page 1

DPBench: Large Language Models Struggle with Simultaneous Coordination

Najmul Hasan * 1 Prashanth BusiReddyGari * 1

Abstract

Large language models are increasingly deployed in multi-agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates LLM coordination across eight conditions that vary decision timing, group size, and communication. Our experiments with GPT-5.2, Claude Opus 4.5, and Grok 4.1 reveal a striking asymmetry: LLMs coordinate effectively in sequential settings but fail when decisions must be made simultaneously, with deadlock rates exceeding 95% under some conditions. We trace this failure to convergent reasoning, where agents independently arrive at identical strategies that, when executed simultaneously, guarantee deadlock. Contrary to expectations, enabling communication does not resolve this problem and can even increase deadlock rates. Our findings suggest that multi-agent LLM systems requiring concurrent resource access may need external coordination mechanisms rather than relying on emergent coordination. DPBench is released as an open-source benchmark.²

1. Introduction

Large language models are increasingly deployed in multi-agent systems (Hong et al., 2024; Bo et al., 2024; Kim et al., 2024). Multiple LLM agents collaborate on complex tasks, from software development to scientific research (Du et al., 2024). These systems raise a fundamental question: when multiple agents must make decisions about shared resources, can they coordinate effectively?

Consider a simple scenario: two LLM agents need to access the same database. If both attempt to write simultaneously, they may corrupt data or create inconsistencies. They need to coordinate, whether by taking turns or by dividing the work so that their actions are compatible. This type of coordination is essential for reliable multi-agent systems.

However, current LLM benchmarks do not test this capability. Existing benchmarks evaluate single-agent performance on knowledge (Hendrycks et al., 2021), reasoning (Wei et al., 2022), planning (Valmeekam et al., 2023a), or strategic games (Duan et al., 2024). Multi-agent benchmarks typically use turn-based interaction where agents respond in sequence, avoiding the challenge of simultaneous decisions. Zero-shot coordination benchmarks exist for reinforcement learning agents (Wang et al., 2024; Hu et al., 2020) but not for LLMs. We lack a benchmark that specifically tests whether LLMs can coordinate when they must act at the same time.

We introduce DPBench, a benchmark for evaluating LLM coordination based on the Dining Philosophers problem (Dijkstra, 1965). In this classic coordination puzzle, agents must acquire shared resources (forks) to complete a task (eating), but concurrent acquisition can lead to deadlock (all agents stuck waiting). The problem has been studied for six decades and provides a rigorous test of coordination under resource contention.

Figure 1. Deadlock state in the Dining Philosophers problem (N = 5). Each philosopher holds one of their two adjacent forks (green) but needs the other to eat. That fork is held by their neighbor (red dashed), forming a circular wait: P0→P4→P3→P2→P1→P0. No agent can proceed. This is the coordination failure DPBench measures.

¹ Department of Mathematics and Computer Science, University of North Carolina at Pembroke. Preprint. February 17, 2026.
² https://github.com/najmulhasan-code/dpbench; install via pip install dpbench

arXiv:2602.13255v1 [cs.AI] 2 Feb 2026

### Page 2

DPBench tests LLMs on eight conditions varying three factors: simultaneous versus sequential decision-making, three versus five agents, and with or without inter-agent communication. We define six standardized metrics including deadlock rate, throughput, and fairness. The benchmark is model-agnostic and designed for reproducible evaluation.

We evaluated three frontier models: GPT-5.2, Claude Opus 4.5, and Grok 4.1. Our experiments reveal that current LLMs struggle with simultaneous coordination. GPT-5.2, the best-performing model, achieves 0% deadlock in sequential mode, but 25–95% deadlock in simultaneous mode. Communication between agents does not reliably improve coordination and sometimes increases deadlock rates.

These findings have implications for deploying LLMs in multi-agent systems. Applications that require simultaneous decisions about shared resources, such as autonomous vehicles, collaborative robotics, and distributed computing, may experience coordination failures. Sequential protocols or external coordination mechanisms may be necessary.

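The paper stops short of prescribing a mechanism, so as a hedged illustration: one classic external coordination fix for Dining Philosophers is an arbitrator that grants an agent both adjacent forks atomically or neither, so no agent ever holds exactly one fork and a circular wait cannot form. This is textbook deadlock avoidance, not part of DPBench's API; the fork layout (fork i between philosophers i and i+1) is an assumption.

```python
# Hypothetical arbitrator sketch: grants both forks or nothing, so the
# "everyone holds one fork" deadlock state of Figure 1 is unreachable.
class ForkArbitrator:
    def __init__(self, n):
        self.n = n
        self.held = [None] * n  # held[f] = philosopher currently holding fork f

    def request(self, i):
        """Grant philosopher i both adjacent forks atomically, or nothing."""
        left, right = i, (i + 1) % self.n  # assumed fork layout
        if self.held[left] is None and self.held[right] is None:
            self.held[left] = self.held[right] = i
            return True
        return False  # at least one fork busy: grant neither

    def release(self, i):
        """Return every fork philosopher i holds to the table."""
        self.held = [None if p == i else p for p in self.held]
```

Because a philosopher either eats or holds nothing, the arbitrator trades some throughput for a hard guarantee against the circular wait.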
Contributions. (1) We introduce DPBench, the first benchmark specifically designed to test LLM coordination under simultaneous decision-making. (2) We evaluate frontier models and find that they struggle with simultaneous coordination, while succeeding in sequential coordination. (3) We analyze why LLMs fail and discuss implications for multi-agent deployment.

2. The Coordination Problem

Large language models (Brown et al., 2020) are increasingly deployed in multi-agent systems where multiple models interact. This raises the question: can LLMs coordinate their actions to achieve shared goals while avoiding conflicts?

What is coordination? In multi-agent systems, coordination refers to the ability of agents to select actions that are mutually compatible. Agents must avoid conflicts (e.g., two agents grabbing the same resource) and achieve efficient outcomes (e.g., maximizing total utility). Coordination is challenging because each agent’s optimal action depends on what others do.

Sequential vs. simultaneous decisions. Coordination problems differ fundamentally based on timing. In sequential settings, agents observe the actions of others before deciding. This is strictly easier: if agent A acts first, agent B can adapt. In simultaneous settings, all agents decide at the same moment without observing current actions. This requires each agent to predict what the others will do.

Most multi-agent LLM benchmarks use sequential or turn-based interaction (Hua et al., 2024). In dialogue tasks, one agent speaks and then another responds. In collaborative problem-solving, agents take turns contributing (Bo et al., 2024). This turn-taking structure avoids the core challenge of simultaneous coordination.

Why simultaneous coordination matters. Real-world multi-agent systems often require simultaneous decisions. Autonomous vehicles at an intersection must decide concurrently. Robotic swarms must coordinate movement without central control. In these settings, agents cannot wait to see what others do; they must predict and act.

Why Dining Philosophers? The Dining Philosophers problem, introduced by Dijkstra in 1965 (Dijkstra, 1965), is the canonical test for coordination under resource contention. Philosophers must acquire two shared resources (forks) to eat, and concurrent acquisition attempts can lead to a deadlock. The problem has been studied for six decades in operating systems and distributed computing (Lamport, 1978; Chandy & Misra, 1984).

We use Dining Philosophers because it isolates the core coordination challenge: agents must make compatible decisions about shared resources without direct observation of others’ current choices. The problem has a clear failure mode (deadlock), well-defined metrics, and theoretical foundations that allow rigorous analysis.

What DPBench adds. Existing LLM benchmarks focus on individual capabilities: knowledge (Hendrycks et al., 2021), reasoning (Mirzadeh et al., 2025), or single-agent tasks (Liu et al., 2024). Multi-agent benchmarks exist, but typically use turn-based interaction (Zhu et al., 2025) or test cooperation without resource contention (Agashe et al., 2025). DPBench specifically tests simultaneous coordination under resource contention, a capability that existing benchmarks do not measure.

3. DPBench

DPBench implements the Dining Philosophers problem as a multi-agent environment where LLM agents must coordinate to avoid deadlock. We describe the environment, metrics, and experimental conditions.

3.1. Environment Design

The environment follows Dijkstra’s original formulation (Dijkstra, 1965). The N philosophers sit around a circular table with N forks, one placed between each adjacent pair (Figure 1). To eat, a philosopher must hold both adjacent forks simultaneously. Each fork can only be held by one philosopher at a time.

### Page 3

States. Each philosopher is in one of two states: HUNGRY (seeking forks) or EATING (holding both forks). We use the “always hungry” variant, where philosophers return to HUNGRY immediately after eating.

Actions. At each timestep, a philosopher chooses one of four actions. The GRAB LEFT action picks up the left fork and succeeds only if that fork is free. Similarly, GRAB RIGHT picks up the right fork when available. The RELEASE action releases all held forks, and WAIT does not take action for the current timestep.

Automatic Release. After a philosopher eats (holds both forks for one timestep), both forks are automatically released. This prevents trivial strategies like hoarding forks.

Deadlock Detection. A deadlock occurs when all philosophers are HUNGRY and each holds exactly one fork. In this state, no philosopher can eat (each needs their neighbor’s fork) and no philosopher will release (each is waiting for the other fork). The episode terminates when deadlock is detected.

Conflict Resolution. In simultaneous mode, if multiple philosophers attempt to grab the same fork, the philosopher with the lower ID succeeds. This deterministic rule ensures reproducibility.

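The rules above can be sketched in a few lines. This is an illustration of the stated rules, not the dpbench package's actual implementation; the fork layout (fork i is philosopher i's left fork) and the one-step eat-then-release simplification are assumptions.

```python
def step(n, holdings, actions):
    """One simultaneous timestep. holdings[i] is the set of fork ids held by
    philosopher i; fork i sits between philosophers i and (i+1) % n (assumed).
    actions[i] is 'grab_left', 'grab_right', 'release', or 'wait'."""
    claims = {}  # fork id -> lowest-ID claimant (iteration order is ascending ID)
    for i, act in enumerate(actions):
        if act == "release":
            holdings[i].clear()  # RELEASE drops all held forks
        elif act in ("grab_left", "grab_right"):
            fork = i if act == "grab_left" else (i + 1) % n
            if all(fork not in h for h in holdings):  # grabs succeed only on free forks
                claims.setdefault(fork, i)            # lower ID wins contested forks
    for fork, winner in claims.items():
        holdings[winner].add(fork)
    meals = 0
    for i in range(n):
        if len(holdings[i]) == 2:   # both forks held: eat, then automatic release
            meals += 1
            holdings[i].clear()
    deadlocked = all(len(h) == 1 for h in holdings)  # all hungry, one fork each
    return meals, deadlocked
```

A single round in which every philosopher issues `grab_left` from an empty table lands directly in the deadlock state this section defines.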
Partial Observability. Each philosopher observes only local information: their own state, whether they hold each fork, and whether each adjacent fork is currently available. Philosophers cannot see the global table state or other philosophers’ holdings. When communication is enabled, philosophers also receive messages from their immediate neighbors sent in the previous timestep.

3.2. Metrics

DPBench uses six fixed metrics. Standardized metrics enable fair comparison across different models and studies.

Primary Metrics:

Deadlock Rate. The fraction of episodes that end in deadlock:

$$\text{Deadlock Rate} = \frac{\text{Episodes with deadlock}}{\text{Total episodes}} \tag{1}$$

Throughput. The average number of meals per timestep, measuring coordination efficiency:

$$\text{Throughput} = \frac{1}{E} \sum_{e=1}^{E} \frac{M_e}{T_e} \tag{2}$$

where $M_e$ is total meals in episode $e$ and $T_e$ is the number of timesteps.

Fairness. We measure fairness using the Gini coefficient (Gini, 1912) over meal distribution. Let $m_i$ be meals eaten by philosopher $i$, sorted in ascending order. The Gini coefficient is:

$$G = \frac{2 \sum_{i=1}^{N} i \cdot m_i}{N \sum_{i=1}^{N} m_i} - \frac{N + 1}{N} \tag{3}$$

We normalize by $G_{\text{norm}} = G \cdot \frac{N}{N-1}$ so that maximum inequality yields $G_{\text{norm}} = 1$, then report $1 - G_{\text{norm}}$ so that higher values indicate fairer distribution (1.0 = perfect equality, 0.0 = maximum inequality).

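Equations (1)–(3) are mechanical to reimplement. The sketch below assumes a hypothetical episode-record format (`deadlocked`, `meals`, `timesteps`); it mirrors the formulas, not the benchmark's own code.

```python
def deadlock_rate(episodes):
    """Eq. (1): fraction of episodes ending in deadlock."""
    return sum(e["deadlocked"] for e in episodes) / len(episodes)

def throughput(episodes):
    """Eq. (2): meals per timestep, averaged over episodes."""
    return sum(sum(e["meals"]) / e["timesteps"] for e in episodes) / len(episodes)

def fairness(meals):
    """Eq. (3) plus normalization: 1 - G_norm over the meal distribution."""
    n = len(meals)
    m = sorted(meals)  # ascending order, as the definition requires
    total = sum(m)
    if total == 0:
        return 1.0     # no meals eaten: treat as trivially equal (an assumption)
    g = 2 * sum(i * mi for i, mi in enumerate(m, start=1)) / (n * total) - (n + 1) / n
    g_norm = g * n / (n - 1)  # maximum inequality -> 1
    return 1 - g_norm         # 1.0 = perfect equality, 0.0 = maximum inequality
```

Equal meal counts give fairness 1.0; one philosopher taking every meal gives 0.0, matching the stated endpoints.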
Secondary Metrics:

Time to Deadlock. Average timestep at which deadlock occurs, computed only over episodes that deadlock.

Starvation Count. Number of philosophers with zero meals at episode end.

Communication Metric:

Message-Action Consistency. When communication is enabled, we measure how often stated intentions match actual actions. If a philosopher says “I will grab left” and then executes GRAB LEFT, this counts as consistent.

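One way to score message-action consistency over (message, action) pairs; the keyword-based intent extraction is a naive assumption, since the paper does not specify how intentions are parsed from free-form messages.

```python
def message_action_consistency(pairs):
    """Fraction of messages with a recognizable intent whose intent matches
    the action taken. pairs: list of (message, action) tuples, where action
    is one of 'grab_left', 'grab_right', 'release', 'wait'."""
    def intent(msg):
        m = msg.lower()
        if "grab left" in m:
            return "grab_left"
        if "grab right" in m:
            return "grab_right"
        if "release" in m:
            return "release"
        if "wait" in m:
            return "wait"
        return None  # no clear stated intention

    scored = [(intent(m), a) for m, a in pairs]
    scored = [(i, a) for i, a in scored if i is not None]
    if not scored:
        return 0.0
    return sum(i == a for i, a in scored) / len(scored)
```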
3.3. Experimental Conditions

DPBench defines eight experimental conditions by varying three factors:

Decision Mode. We test two decision modes that differ in timing. In simultaneous mode, all philosophers decide at the same time without seeing others’ current actions, which represents the canonical Dining Philosophers setup. In sequential mode, philosophers decide one at a time, each seeing the updated state after previous decisions. Sequential mode is strictly easier since agents can react to what others have done.

Number of Philosophers. We test with N = 3 and N = 5. More philosophers increase coordination complexity but also provide more opportunities for successful coordination.

Communication. When enabled, philosophers can send a short message to their neighbors each turn. Messages from the previous timestep are visible in the current observation.

Table 1 lists all conditions. The condition codes follow the pattern: mode (sim/seq) + philosophers (3/5) + communication (c/nc).

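The code pattern can be enumerated mechanically; a small sketch that reproduces the same set of eight codes as Table 1 (ordering may differ):

```python
from itertools import product

# mode (sim/seq) + philosophers (5/3) + communication (c/nc)
conditions = [
    f"{mode}{n}{'c' if comm else 'nc'}"
    for mode, n, comm in product(("sim", "seq"), (5, 3), (True, False))
]
```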
4. Experiments

We evaluate frontier LLMs on all eight DPBench conditions. Our experiments test whether current models can coordinate effectively under simultaneous decision-making.

### Page 4

Table 1. Eight experimental conditions in DPBench.

| Code | Mode | N | Communication |
| --- | --- | --- | --- |
| sim5nc | Simultaneous | 5 | No |
| sim5c | Simultaneous | 5 | Yes |
| seq5nc | Sequential | 5 | No |
| seq5c | Sequential | 5 | Yes |
| sim3nc | Simultaneous | 3 | No |
| sim3c | Simultaneous | 3 | Yes |
| seq3nc | Sequential | 3 | No |
| seq3c | Sequential | 3 | Yes |

Table 2. GPT-5.2 performance across all eight DPBench conditions. DL = Deadlock Rate, TP = Throughput (meals/timestep), FR = Fairness (1 = perfect equality).

| Condition | DL | TP | FR |
| --- | --- | --- | --- |
| sim5nc | 0.25 | 0.446 | 0.576 |
| sim5c | 0.65 | 0.452 | 0.527 |
| seq5nc | 0.00 | 0.115 | 0.540 |
| seq5c | 0.00 | 0.145 | 0.690 |
| sim3nc | 0.95 | 0.243 | 0.333 |
| sim3c | 1.00 | 0.190 | 0.379 |
| seq3nc | 0.00 | 0.107 | 0.617 |
| seq3c | 0.10 | 0.128 | 0.702 |

4.1. Setup

Models. We evaluate three frontier models: GPT-5.2 (OpenAI), Claude Opus 4.5 (Anthropic), and Grok 4.1 (xAI). We conduct full evaluation across all eight conditions with GPT-5.2 and evaluate Claude and Grok on a representative subset of conditions for cross-model comparison.

Parameters. Each condition runs for 20 episodes with a maximum of 30 timesteps per episode. We use temperature 0.7 and seed 42 for reproducibility.

Prompts. Each LLM agent receives a system prompt describing the Dining Philosophers problem, available actions, and goals (avoid deadlock, maximize throughput, ensure fairness). At each timestep, agents receive an observation prompt showing their current state, which forks they hold, and fork availability. When communication is enabled, agents also see messages from neighbors and can send their own. Full prompts are in Appendix A.

4.2. Results

GPT-5.2 Full Evaluation. Table 2 shows GPT-5.2 performance across all eight conditions.

Several patterns emerge, visualized in Figure 2. First, simultaneous mode produces substantially higher deadlock rates than sequential mode. In sequential mode without communication, GPT-5.2 achieves zero deadlocks for both N = 3 and N = 5. In simultaneous mode, deadlock rates reach 25% (N = 5) and 95% (N = 3).

Figure 2. GPT-5.2 deadlock rates across all eight DPBench conditions. Simultaneous mode (orange) produces dramatically higher deadlock rates than sequential mode (blue). The gap is most pronounced with 3 philosophers, where simultaneous mode reaches 95–100% deadlock while sequential mode stays near 0%.

Table 3. Cross-model comparison on shared conditions.

| Condition | Model | DL | TP | FR |
| --- | --- | --- | --- | --- |
| sim5nc | GPT-5.2 | 0.25 | 0.446 | 0.576 |
| sim5nc | Claude 4.5 | 0.55 | 0.455 | 0.619 |
| sim5nc | Grok 4.1 | 0.70 | 0.437 | 0.578 |
| sim5c | GPT-5.2 | 0.65 | 0.452 | 0.527 |
| sim5c | Claude 4.5 | 0.60 | 0.554 | 0.717 |
| sim5c | Grok 4.1 | 0.60 | 0.438 | 0.743 |
| seq5nc | GPT-5.2 | 0.00 | 0.115 | 0.540 |
| seq5nc | Claude 4.5 | 0.60 | 0.078 | 0.890 |
| seq5nc | Grok 4.1 | 0.25 | 0.112 | 0.655 |

Second, three philosophers proves harder than five in simultaneous mode. This counterintuitive result occurs because with fewer philosophers, the probability that all grab the same direction (causing immediate deadlock) is higher.

Third, communication increases deadlock in simultaneous mode with 5 philosophers (25% to 65%), as shown in Figure 4. Agents attempt to coordinate through messages but fail to act on them consistently. The message-action consistency metric shows only 29% alignment between stated intentions and actual actions.

Cross-Model Comparison. Table 3 compares all three models on the conditions they share: sim5nc, sim5c, and seq5nc.

Figure 3 visualizes the cross-model comparison. GPT-5.2 achieves the lowest deadlock rates across conditions. In simultaneous mode without communication, deadlock rates range from 25% (GPT-5.2) to 70% (Grok 4.1). All models struggle with simultaneous coordination, confirming that this is a challenging capability for current LLMs.

The sequential mode reveals interesting differences. GPT-5.2 achieves zero deadlocks, while Claude and Grok still experience deadlocks (60% and 25% respectively). This suggests that even with the advantage of seeing others’ actions, some models fail to exploit the information effectively.

### Page 5

Figure 3. Cross-model comparison of deadlock rates. GPT-5.2 (blue) achieves 0% deadlock in sequential mode, while Claude 4.5 (orange) and Grok 4.1 (green) still deadlock 60% and 25% of episodes respectively. All models struggle in simultaneous mode, with deadlock rates between 25–70%.

887
+ 5. Analysis
888
+
889
+ Our results reveal fundamental limitations in how LLMs
890
+
891
+ coordinate under simultaneous decision-making. We discuss
892
+
893
+ the key findings, their causes, and implications.
894
+
895
+ Finding 1: Simultaneous coordination is fundamentally harder. The gap between simultaneous and sequential modes is substantial. GPT-5.2 achieves 0% deadlock in sequential mode but 25–95% in simultaneous mode. This gap persists across all models tested.
+
+ The explanation lies in the nature of the decision process. In sequential mode, agents observe the updated state after each decision. If philosopher P0 grabs their left fork, P1 sees this and can adapt. In simultaneous mode, all agents decide based on the same snapshot. If all reason “both forks are free, I should grab left,” all attempt the same action simultaneously, and deadlock follows.
+
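The snapshot-versus-updated-state distinction can be sketched in a few lines. This is a minimal illustration of the dynamics described above, not the DPBench implementation; the `run` helper and its policy are our own:

```python
# Minimal sketch (not the DPBench code) of why an identical "grab left"
# policy deadlocks under simultaneous decisions but not sequential ones.
N = 5  # philosophers in a ring; fork i is philosopher i's left fork

def run(mode):
    forks = [None] * N  # forks[f] = holding philosopher, or None if free
    if mode == "simultaneous":
        # Every agent sees the same snapshot (all forks free) and reasons
        # identically: "both forks are free, I should grab left".
        for i in range(N):
            forks[i] = i
    else:  # sequential: each agent observes the updated state first
        for i in range(N):
            left, right = i, (i + 1) % N
            # Grab the left fork only if the right one still looks free.
            if forks[left] is None and forks[right] is None:
                forks[left] = i
    # Deadlock signature: every fork held, so every agent waits forever.
    return all(holder is not None for holder in forks)

print(run("simultaneous"))  # True: all N agents hold one fork each
print(run("sequential"))    # False: the last agent adapts and waits
```

The sequential branch shows exactly the adaptation described above: the final philosopher sees that its right fork is already taken and declines to grab, breaking the cycle.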
+ Finding 2: Communication does not solve the coordination problem. We expected communication to reduce deadlock rates. Instead, enabling communication increased deadlocks in simultaneous mode with 5 philosophers (25% to 65% for GPT-5.2). This pattern persists across all conditions (Figure 4).
+
+ Examining the transcripts reveals why. Agents send messages like “I will grab my left fork” but then face a timing problem: messages arrive one timestep late. By the time neighbors receive the message, the sender has already acted. Moreover, message-action consistency is low (29–44%), meaning agents often do not follow through on stated intentions.
+
+ Finding 3: Fewer agents can mean harder coordination. With 3 philosophers, deadlock rates reached 95–100% in simultaneous mode, compared to 25–65% with 5 philosophers. With 3 agents in a symmetric situation, if all choose the same direction (e.g., all grab left), immediate deadlock occurs. With 5 agents, there is more room for heterogeneous behavior to emerge.
+
+ Figure 4. Effect of communication on deadlock rates. Contrary to expectations, enabling communication (pink) often increases deadlock compared to no communication (blue). In simultaneous mode with 5 philosophers, deadlock rises from 25% to 65%. Sequential mode remains near 0% regardless of communication.
+
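As a back-of-envelope check on this intuition (our arithmetic, not a result from the paper): if each philosopher independently broke symmetry with a fair coin flip between left and right, an immediate first-step deadlock would require all N agents to pick the same direction.

```python
# Probability that N independent fair coin flips all agree (all-left or
# all-right), which is exactly the immediate-deadlock pattern: 2 / 2**N.
for n in (3, 5):
    print(n, 2 / 2**n)  # 3 -> 0.25, 5 -> 0.0625
```

So even under pure randomization, small symmetric groups are the hardest case, consistent with the higher deadlock rates observed at N = 3.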
+ Why LLMs Fail at Simultaneous Coordination. The core challenge is prediction under uncertainty. In simultaneous mode, each agent must predict what others will do, then choose an action that works well given those predictions. This is precisely the challenge studied in game-theoretic evaluations (Duan et al., 2024; Mozikov et al., 2024), where LLMs have shown limitations in strategic reasoning. Humans solve coordination problems through conventions (e.g., always grab the lower-numbered fork first) or randomization. LLMs, trained to give helpful and logical responses (Ouyang et al., 2022), tend to converge on similar reasoning patterns. When all agents reason identically, they act identically, and identical action in the Dining Philosophers problem leads to deadlock. Figure 5 illustrates this phenomenon with an actual transcript excerpt.
+
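The fork-ordering convention mentioned above can be sketched as follows (our illustration, not part of DPBench): acquiring both forks in ascending global index removes the circular wait, because the agent whose left fork has the highest index must contend for fork 0 first.

```python
# Resource-ordering convention: every philosopher requests its two forks
# in ascending global index, so a cycle of waits cannot form.
N = 5
forks = [None] * N  # forks[f] = holding philosopher, or None if free

for i in range(N):
    first, second = sorted((i, (i + 1) % N))  # lower-numbered fork first
    if forks[first] is None:
        forks[first] = i
        if forks[second] is None:
            forks[second] = i  # holds both forks: can eat

# Philosopher N-1 must start with fork 0, which is already taken, so it
# holds nothing and the wait graph stays acyclic.
eaters = [i for i in range(N) if forks.count(i) == 2]
print(eaters)  # [0, 2] in this pass: some philosophers eat, none deadlock
```

Under this convention the all-symmetric failure mode is impossible by construction, which is exactly why humans reach for it.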
+ This convergence phenomenon relates to broader concerns about multi-agent AI systems and potential coordination failures (Motwani et al., 2024).
+
+ This is not a failure of reasoning capability. The models understand the problem and can articulate strategies to avoid deadlock. The failure is in coordination: translating individual reasoning into compatible joint action without explicit communication timing.
+
+ Implications for Deployment. Our results suggest caution when deploying multiple LLM agents in settings requiring simultaneous coordination. Applications where agents must make concurrent decisions about shared resources, such as traffic control, emergency response coordination, and resource allocation, may experience coordination failures similar to those observed here.
+
+ ### Page 6
+
+ GPT-5.2, simultaneous mode, N = 3, no communication.
+
+ Timestep 4:
+
+ P0: “Both forks available; I’ll follow a consistent order and pick up my right fork first.” → grab right
+
+ P1: “Both forks available; I’ll start by taking my right fork (consistent ordering can help).” → grab right
+
+ P2: “Both forks available; I’ll follow a consistent order and pick up my right fork first.” → grab right
+
+ Result: All three philosophers now hold one fork each. DEADLOCK.
+
+ Figure 5. Convergent reasoning leading to deadlock. All three philosophers independently decide to follow a “consistent order” by grabbing their right fork first. This identical reasoning produces identical actions, resulting in immediate deadlock.
+
+ Sequential protocols, where agents take turns and observe others’ actions, appear much safer. If simultaneous decisions are unavoidable, external coordination mechanisms (locks, arbiters, or turn-taking protocols) may be necessary.
+
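One such external mechanism can be sketched as a central arbiter that grants a philosopher both forks atomically or nothing. This is a hypothetical illustration; `arbitrate` is our own helper, not a DPBench API:

```python
# A trivial arbiter: requests are serialized, and an agent receives both
# of its forks atomically or none, which removes hold-and-wait entirely.
def arbitrate(requests, forks, n):
    granted = []
    for agent in requests:  # arrival order fixes all ties
        left, right = agent, (agent + 1) % n
        if forks[left] is None and forks[right] is None:
            forks[left] = forks[right] = agent
            granted.append(agent)
    return granted

n = 3
forks = [None] * n
# All three philosophers request simultaneously; the arbiter serializes:
print(arbitrate([0, 1, 2], forks, n))  # [0]: P0 eats; P1 and P2 wait
```

Because no agent ever holds a single fork while waiting for another, the deadlock pattern observed in the transcripts cannot arise.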
+ Limitations. This study has several limitations. First, we tested only three models due to API costs and runtime. Results may differ for other models. Second, we used a single prompt design. Different prompting strategies might improve coordination. Third, we tested only N = 3 and N = 5. Larger groups might exhibit different dynamics. Fourth, the Dining Philosophers problem is stylized; real-world coordination may involve richer state and action spaces.
+
+ 6. Related Work
+
+ LLM Benchmarks. Current benchmarks focus on single-agent capabilities. MMLU (Hendrycks et al., 2021) tests knowledge across 57 domains. GSM-Symbolic (Mirzadeh et al., 2025) tests mathematical reasoning. AgentBench (Liu et al., 2024) tests LLMs as agents on web browsing, coding, and game tasks. GTBench (Duan et al., 2024) evaluates strategic reasoning through game-theoretic tasks but focuses on two-player competitive games. PlanBench (Valmeekam et al., 2023a) tests planning and reasoning about change. AgentHarm (Andriushchenko et al., 2025) measures potential harms from LLM agents in multi-step scenarios. These benchmarks evaluate individual performance, not multi-agent coordination under simultaneous decisions.
+
+ LLM Reasoning. Chain-of-thought prompting (Wei et al., 2022) enables LLMs to solve complex reasoning tasks by generating intermediate steps. Self-consistency (Wang et al., 2023) improves reasoning by sampling multiple paths and selecting the most consistent answer. Tree of Thoughts (Yao et al., 2023a) extends this to deliberate exploration of reasoning paths. ReAct (Yao et al., 2023b) combines reasoning with acting in interactive environments. Language Agent Tree Search (Zhou et al., 2024) unifies reasoning, acting, and planning through Monte Carlo tree search. Despite these advances, recent work shows LLMs struggle with planning tasks (Valmeekam et al., 2023b; Kambhampati et al., 2024). Kambhampati et al. argue that LLMs cannot plan autonomously but can assist planning in hybrid frameworks. Self-verification has also proven unreliable (Stechly et al., 2025). Thought of Search (Katz et al., 2024) proposes more efficient planning by using LLMs to generate search components rather than performing search directly. Our findings align with these limitations in the multi-agent setting.
+
+ Multi-Agent LLM Systems. Recent work explores LLMs in multi-agent settings. MetaGPT (Hong et al., 2024) enables multi-agent collaboration for software development. MultiAgentBench (Zhu et al., 2025) evaluates collaboration and competition but uses turn-based interaction. LLM-Coordination (Agashe et al., 2025) studies coordination in game-theoretic settings. DeMAC (Liu et al., 2025) enhances coordination through dynamic task allocation. MDAgents (Kim et al., 2024) adaptively assigns collaboration structures for medical decision-making. Multiagent debate (Du et al., 2024) improves reasoning and factuality by having multiple LLM instances debate their responses. Reflective collaboration (Bo et al., 2024) uses self-reflection to enhance multi-agent coordination. Research on LLM negotiation (Hua et al., 2024; Kwon et al., 2025) explores strategic multi-turn dialogue. Work on emergent behaviors shows that LLM agents can develop volunteer and conformity behaviors in collaboration (Ma et al., 2024). Theory-of-mind benchmarks like OpenToM (Xu et al., 2024), Hi-ToM (Wu et al., 2023), and Hypothetical Minds (Cross et al., 2025) test whether LLMs can model others’ beliefs. These works advance our understanding of multi-agent LLMs but do not test simultaneous coordination under resource contention.
+
+ Multi-Agent Reinforcement Learning. In MARL, coordination has been extensively studied (Lanctot et al., 2017). Value decomposition methods like VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018) learn decentralized policies with centralized training. MADDPG (Lowe et al., 2017) extends actor-critic methods to multi-agent settings. CommNet (Sukhbaatar et al., 2016) and DIAL (Foerster et al., 2016) study learned communication protocols. Zero-shot coordination, where agents must coordinate with unseen partners, is studied through methods like Other-Play (Hu et al., 2020) and trajectory diversity (Lupu et al., 2021). ZSC-Eval (Wang et al., 2024) provides a comprehensive benchmark for evaluating zero-shot coordination. Language grounding has been explored to make emergent communication interpretable (Li et al., 2024). Work on emergent communication (Eccles et al., 2019; Lazaridou & Baroni, 2021; Chaabouni et al., 2021) shows that agents can develop effective signaling strategies through training. These approaches use learned policies optimized over many episodes, whereas LLMs rely on in-context reasoning (Xie et al., 2022) without task-specific training.
+
+ ### Page 7
+
+ Dining Philosophers. The Dining Philosophers problem was introduced by Dijkstra (Dijkstra, 1965) to illustrate deadlock and mutual exclusion. Lamport (Lamport, 1978) connected the problem to distributed systems and logical clocks. Chandy and Misra (Chandy & Misra, 1984) generalized it to the Drinking Philosophers problem with dynamic resource requirements. The problem has been a staple of concurrent programming education for decades. We use it as a benchmark because it provides a minimal, well-understood test of coordination under resource contention.
+
+ 7. Conclusion
+
+ We introduced DPBench, a benchmark that tests whether LLMs can coordinate under resource contention using the Dining Philosophers problem. Our experiments with GPT-5.2, Claude Opus 4.5, and Grok 4.1 reveal three key findings. First, LLMs exhibit a fundamental asymmetry in coordination: they succeed in sequential settings where they observe others’ actions but fail dramatically in simultaneous settings, with deadlock rates reaching 95–100% in some conditions. Second, we identify convergent reasoning as the underlying cause: agents independently arrive at identical “rational” strategies that, when executed simultaneously, guarantee deadlock. Third, contrary to intuition, enabling communication does not resolve this problem and can even increase deadlock rates, as agents fail to act consistently on stated intentions.
+
+ These findings have implications for deploying multi-agent LLM systems. Applications requiring concurrent decisions about shared resources, such as autonomous vehicles, collaborative robotics, or distributed computing, may need external coordination mechanisms rather than relying on emergent coordination among agents.
+
+ Our study has limitations. We tested three models on a stylized problem with small group sizes. Real-world coordination involves richer state spaces and larger agent populations. Future work should explore whether fine-tuning on coordination tasks can develop this capability, whether alternative communication protocols (such as explicit turn-taking or leader election) improve outcomes, and how coordination scales with agent count.
+
+ We release DPBench to enable the research community to measure progress on this challenge and to develop LLM systems capable of reliable multi-agent coordination.
+
+ Impact Statement
+
+ This paper introduces DPBench, a benchmark for evaluating coordination capabilities in multi-agent LLM systems. We discuss potential impacts below.
+
+ Positive Impacts. Our work can help identify coordination failures before LLM agents are deployed in high-stakes applications. By revealing that current models struggle with simultaneous decision-making, we provide guidance for practitioners: systems requiring concurrent resource access should incorporate external coordination mechanisms rather than assuming emergent coordination. This finding may prevent failures in safety-critical domains such as autonomous systems and collaborative robotics.
+
+ Potential Concerns. Our benchmark could be misused to identify exploitable coordination weaknesses in deployed systems. However, the coordination failures we document (convergent reasoning, communication ineffectiveness) are fundamental limitations rather than specific vulnerabilities, making targeted exploitation unlikely. Additionally, running large-scale LLM experiments incurs computational and environmental costs; we report token usage and API calls to enable cost-aware replication.
+
+ Limitations of Benchmark Evaluation. As with any benchmark, performance on DPBench may not fully predict real-world coordination capabilities. The Dining Philosophers problem is a stylized setting; actual multi-agent deployments involve richer contexts and larger scales. We encourage complementary evaluation approaches alongside benchmark testing.
+
+ References
+
+ Agashe, S., Fan, Y., Reyna, A., and Wang, X. E. LLM-Coordination: Evaluating and analyzing multi-agent coordination abilities in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 8038–8057, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-naacl.448.
+
+ Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., Winsor, E., Wynne, J., Gal, Y., and Davies, X. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AC5n7xHuR1.
+
+ Bo, X., Zhang, Z., Dai, Q., Feng, X., Wang, L., Li, R., Chen, X., and Wen, J.-R. Reflective multi-agent collaboration based on large language models. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ ### Page 8
+
+ Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020.
+
+ Chaabouni, R., Kharitonov, E., Bouchacourt, D., Dupoux, E., and Baroni, M. Emergent communication under varying sizes and connectivities. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
+
+ Chandy, K. M. and Misra, J. The drinking philosophers problem. ACM Transactions on Programming Languages and Systems, 6(4):632–646, October 1984. doi: 10.1145/1780.1804.
+
+ Cross, L., Xiang, V., Bhatia, A., Yamins, D. L., and Haber, N. Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=otW0TJOUYF.
+
+ Dijkstra, E. W. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, 1965. doi: 10.1145/365559.365617.
+
+ Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 11733–11763. PMLR, 2024.
+
+ Duan, J., Zhang, R., Diffenderfer, J., Kailkhura, B., Sun, L., Stengel-Eskin, E., Bansal, M., Chen, T., and Xu, K. GTBench: Uncovering the strategic reasoning capabilities of LLMs via game-theoretic evaluations. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Eccles, T., Bachrach, Y., Lever, G., Lazaridou, A., and Graepel, T. Biases for emergent communication in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
+
+ Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 29, pp. 2137–2145. Curran Associates, Inc., 2016.
+
+ Gini, C. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. Studi Economico-Giuridici della Regia Università di Cagliari. Tipografia di Paolo Cuppini, Bologna, 1912.
+
+ Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
+
+ Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.
+
+ Hu, H., Lerer, A., Peysakhovich, A., and Foerster, J. “Other-play” for zero-shot coordination. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4399–4410. PMLR, 2020.
+
+ Hua, Y., Qu, L., and Haffari, G. Assistive large language model agents for socially-aware negotiation dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 8047–8074, Miami, Florida, USA, 2024. Association for Computational Linguistics.
+
+ Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 22895–22907. PMLR, 2024.
+
+ Katz, M., Kokel, H., Srinivas, K., and Sohrabi, S. Thought of search: Planning with language models through the lens of efficiency. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Kim, Y., Park, C., Jeong, H., Chan, Y. S., Xu, X., McDuff, D., Lee, H., Ghassemi, M., Breazeal, C., and Park, H. W. MDAgents: An adaptive collaboration of LLMs for medical decision-making. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Kwon, D., Hae, J., Clift, E., Shamsoddini, D., Gratch, J., and Lucas, G. ASTRA: A negotiation agent with adaptive and strategic reasoning via tool-integrated action for dynamic offer optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 16228–16249, Suzhou, China, 2025. Association for Computational Linguistics.
+
+ ### Page 9
+
+ Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978. doi: 10.1145/359545.359563.
+
+ Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D., and Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
+
+ Lazaridou, A. and Baroni, M. Emergent communication of generalizations. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
+
+ Li, H., Mahjoub, H. N., Chalaki, B., Tadiparthi, V., Lee, K., Moradi-Pari, E., Lewis, M., and Sycara, K. Language grounded multi-agent reinforcement learning with human-interpretable communication. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=zAdUB0aCTQ.
+
+ Liu, Y., Xu, C., Liu, L., Wang, Y., Chen, F., Jia, Q., Zhao, Y., Wang, Z., and Li, X. DeMAC: Enhancing multi-agent coordination with dynamic DAG and manager-player feedback. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 14072–14098, Suzhou, China, November 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-emnlp.757.
+
+ Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
+
+ Lupu, A., Cui, B., Hu, H., and Foerster, J. Trajectory diversity for zero-shot coordination. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7204–7213. PMLR, 2021.
+
+ Ma, H., Hu, T., Pu, Z., Liu, B., Ai, X., Liang, Y., and Chen, M. Coevolving with the other you: Fine-tuning LLM with sequential cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., and Farajtabar, M. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AjXkRZIvjB.
+
+ Motwani, S. R., Baranchuk, M., Strohmeier, M., Bolina, V., Torr, P. H., Hammond, L., and Schroeder de Witt, C. Secret collusion among AI agents: Multi-agent deception via steganography. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Mozikov, M., Severin, N., Bodishtianu, V., Glushanina, M., Nasonov, I., Orekhov, D., Pekhotin, V., Makovetskiy, I., Baklashkin, M., Lavrentyev, V., Tsvigun, A., Turdakov, D., Shavrina, T., Savchenko, A., and Makarov, I. EAI: Emotional decision-making of LLMs in strategic games and ethical dilemmas. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024.
+
+ Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pp. 27730–27744. Curran Associates, Inc., 2022.
+
+ Rashid, T., Samvelyan, M., Schroeder de Witt, C., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4295–4304. PMLR, July 2018. URL https://proceedings.mlr.press/v80/rashid18a.html.
+
+ Stechly, K., Valmeekam, K., and Kambhampati, S. On the self-verification limitations of large language models on reasoning and planning tasks. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4O0v4s3IzY.
+
+ Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/hash/55b1927fdafef39c48e5b73b5d61ea60-Abstract.html.
+
+ Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087, 2018.
+
+ ### Page 10
+
+ Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023a.
+
+ Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. On the planning abilities of large language models - a critical investigation. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023b.
+
+ Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
+
+ Wang, X., Zhang, S., Zhang, W., Dong, W., Chen, J., Wen, Y., and Zhang, W. ZSC-Eval: An evaluation toolkit and benchmark for multi-agent zero-shot coordination. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024. Datasets and Benchmarks Track.
+
+ Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc., 2022.
+
+ Wu, Y., He, Y., Jia, Y., Mihalcea, R., Chen, Y., and Deng, N. Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10691–10706, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.717.
+
+ Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit bayesian inference. In The Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI.
+
+ Xu, H., Zhao, R., Zhu, L., Du, J., and He, Y. OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8593–8623, Bangkok, Thailand, August
1970
+
1971
+ 2024. Association for Computational Linguistics. doi:
1972
+
1973
+ 10.18653/v1/2024.acl-long.466.
1974
+
1975
+ Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao,
1976
+
1977
+ Y., and Narasimhan, K. Tree of thoughts: Deliberate
1978
+
1979
+ problem solving with large language models. In Advances
1980
+
1981
+ in Neural Information Processing Systems, volume 36.
1982
+
1983
+ Curran Associates, Inc., 2023a.
1984
+
1985
+ Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
1986
+
1987
+ K. R., and Cao, Y. React: Synergizing reasoning
1988
+
1989
+ and acting in language models. In The Eleventh In-
1990
+
1991
+ ternational Conference on Learning Representations,
1992
+
1993
+ 2023b. URL https://openreview.net/forum?
1994
+
1995
+ id=WE_vluYUL-X.
1996
+
1997
+ Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H.,
1998
+
1999
+ and Wang, Y.-X. Language agent tree search unifies
2000
+
2001
+ reasoning, acting, and planning in language models. In
2002
+
2003
+ Proceedings of the 41st International Conference on Ma-
2004
+
2005
+ chine Learning, volume 235 of Proceedings of Machine
2006
+
2007
+ Learning Research, pp. 61816–61836. PMLR, 2024.
2008
+
2009
+ Zhu, K., Du, H., Hong, Z., Yang, X., Guo, S., Wang, Z.,
2010
+
2011
+ Wang, Z., Qian, C., Tang, R., Ji, H., and You, J. MultiA-
2012
+
2013
+ gentBench: Evaluating the collaboration and competition
2014
+
2015
+ of LLM agents. In Proceedings of the 63rd Annual Meet-
2016
+
2017
+ ing of the Association for Computational Linguistics (Vol-
2018
+
2019
+ ume 1: Long Papers), pp. 8580–8622, Vienna, Austria,
2020
+
2021
+ July 2025. Association for Computational Linguistics.
2022
+
2023
+ doi: 10.18653/v1/2025.acl-long.421.
2024
+
2025
+ 10
2026
+
2027
### Page 11

DPBench: LLMs Struggle with Simultaneous Coordination

A. Prompts

We provide the full prompts used in the experiments. Variables in braces (e.g., {philosopher_name}) are replaced with actual values at runtime.
System Prompt (No Communication)

```
You are {philosopher_name}, one of {num_philosophers} philosophers seated at a
circular dining table.

THE DINING PHILOSOPHERS PROBLEM:
- You and your fellow philosophers share forks placed between each pair of
  adjacent philosophers
- To eat, you must hold BOTH your left fork AND your right fork simultaneously
- Each fork can only be held by one philosopher at a time
- After eating, you automatically release both forks

YOUR GOAL:
- Coordinate with others to avoid DEADLOCK (where everyone holds one fork and
  waits forever)
- Maximize total meals eaten by the group
- Ensure fair distribution of meals among all philosophers

AVAILABLE ACTIONS:
- GRAB_LEFT: Pick up the fork on your left (if available)
- GRAB_RIGHT: Pick up the fork on your right (if available)
- RELEASE: Release any forks you are holding
- WAIT: Do nothing this turn

RESPONSE FORMAT:
THINKING: [Brief reasoning about the current situation]
ACTION: [One of: GRAB_LEFT, GRAB_RIGHT, RELEASE, WAIT]
```

Figure 6. System prompt provided to each LLM agent at the start of an episode. This prompt establishes the problem context, goals, and expected response format.
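The runtime brace substitution described above can be sketched with plain Python string formatting. This is an illustrative sketch only: the underscore variable names and the `render_system_prompt` helper are assumptions, not DPBench's actual code.

```python
# Illustrative sketch of filling the brace variables in the system
# prompt at runtime. The template text mirrors Figure 6; the variable
# names are assumptions.
SYSTEM_TEMPLATE = (
    "You are {philosopher_name}, one of {num_philosophers} philosophers "
    "seated at a circular dining table."
)

def render_system_prompt(philosopher_name: str, num_philosophers: int) -> str:
    """Substitute runtime values into the prompt template."""
    return SYSTEM_TEMPLATE.format(
        philosopher_name=philosopher_name,
        num_philosophers=num_philosophers,
    )
```

For example, `render_system_prompt("Philosopher-0", 5)` yields a prompt beginning "You are Philosopher-0, one of 5 philosophers ...".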
Decision Prompt (No Communication)

```
You are {philosopher_name}.

CURRENT STATE:
- Your state: {state}
- Meals eaten: {meals_eaten}
- Currently holding: {holding_status}

FORK STATUS:
- Left fork: {left_fork_status}
- Right fork: {right_fork_status}

What is your action?

THINKING: [Your reasoning]
ACTION: [GRAB_LEFT/GRAB_RIGHT/RELEASE/WAIT]
```

Figure 7. Decision prompt sent at each timestep. Variables are populated with the agent's current state and fork availability.
### Page 12
Communication Addition (System Prompt)

```
COMMUNICATION:
- You can send a message to your neighbors each turn
- Use messages to coordinate and avoid conflicts
- Be concise and clear in your communication

RESPONSE FORMAT:
THINKING: [Brief reasoning about the current situation]
MESSAGE: [Short message to your neighbors, or "None"]
ACTION: [One of: GRAB_LEFT, GRAB_RIGHT, RELEASE, WAIT]
```

Figure 8. Additional section appended to the system prompt when communication is enabled. The response format is extended to include a message field.
Communication Addition (Decision Prompt)

```
NEIGHBOR MESSAGES:
- From left neighbor: {left_message}
- From right neighbor: {right_message}

What is your action? You may also send a message to coordinate.

THINKING: [Your reasoning]
MESSAGE: [Short message to neighbors, or "None"]
ACTION: [GRAB_LEFT/GRAB_RIGHT/RELEASE/WAIT]
```

Figure 9. Additional section in the decision prompt when communication is enabled. Agents receive messages from neighbors sent in the previous timestep.
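The THINKING/MESSAGE/ACTION response format in Figures 6–9 implies a small parsing step on the orchestrator side. A minimal sketch, assuming a line-oriented parser and a WAIT fallback for malformed replies (neither detail is documented DPBench behavior):

```python
import re

VALID_ACTIONS = {"GRAB_LEFT", "GRAB_RIGHT", "RELEASE", "WAIT"}

def parse_response(text: str) -> dict:
    """Extract the THINKING, optional MESSAGE, and ACTION fields from an
    LLM reply that follows the response format in the prompts above."""
    fields = {}
    for key in ("THINKING", "MESSAGE", "ACTION"):
        match = re.search(rf"^{key}:\s*(.*)$", text, re.MULTILINE)
        if match:
            fields[key] = match.group(1).strip()
    # Normalize "GRAB LEFT" / "grab_left" style variants, then fall back
    # to WAIT on a missing or invalid action so one malformed reply
    # cannot crash the episode (an assumption, not DPBench's documented
    # behavior).
    action = fields.get("ACTION", "").upper().replace(" ", "_")
    fields["ACTION"] = action if action in VALID_ACTIONS else "WAIT"
    return fields
```

For example, a reply of `"THINKING: Left fork is free.\nMESSAGE: None\nACTION: GRAB LEFT"` parses to the action `GRAB_LEFT`, while an unparseable reply degrades to `WAIT`.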
B. Additional Results

Table 4 provides extended metrics for GPT-5.2 across all conditions, including standard deviations and secondary metrics.

Table 4. Extended GPT-5.2 results with standard deviations. TTD = Time to Deadlock, SC = Starvation Count, MAC = Message-Action Consistency (%).

| Condition | DL   | TP (std)    | FR (std)    | TTD  | SC   | MAC  |
|-----------|------|-------------|-------------|------|------|------|
| sim5nc    | 0.25 | 0.45 (0.16) | 0.58 (0.21) | 11.8 | 1.15 | N/A  |
| sim5c     | 0.65 | 0.45 (0.15) | 0.53 (0.22) | 13.2 | 1.40 | 28.9 |
| seq5nc    | 0.00 | 0.12 (0.02) | 0.54 (0.21) | N/A  | 1.75 | N/A  |
| seq5c     | 0.00 | 0.15 (0.02) | 0.69 (0.25) | N/A  | 1.10 | 34.2 |
| sim3nc    | 0.95 | 0.24 (0.11) | 0.33 (0.35) | 7.9  | 1.60 | N/A  |
| sim3c     | 1.00 | 0.19 (0.12) | 0.38 (0.45) | 5.7  | 1.90 | 42.2 |
| seq3nc    | 0.00 | 0.11 (0.02) | 0.62 (0.27) | N/A  | 0.55 | N/A  |
| seq3c     | 0.10 | 0.13 (0.04) | 0.70 (0.22) | 7.5  | 0.40 | 27.4 |
Table 5 reports computational costs for each model on the sim5nc condition, the primary simultaneous-mode benchmark where all three models were evaluated. Latency is the average API response time per call. Token counts are reported by the respective APIs.

Table 5. Computational costs per model on the sim5nc condition (20 episodes).

| Model      | Avg Latency (ms) | Total Tokens | LLM Calls |
|------------|------------------|--------------|-----------|
| GPT-5.2    | 1,626            | 884,630      | 2,545     |
| Claude 4.5 | 5,245            | 1,000,055    | 2,050     |
| Grok 4.1   | 9,235            | 924,540      | 1,895     |
### Page 13
C. Implementation Details

C.1. Agent Orchestration

DPBench uses LangGraph to orchestrate agent execution. In simultaneous mode, the graph executes all philosopher nodes in parallel within a single timestep. Each node receives the same observation snapshot, calls the LLM independently, and returns a decision. After all decisions are collected, an apply node resolves conflicts and updates the table state. In sequential mode, philosopher nodes execute one after another in a chain. Each node observes the current table state, makes a decision, and immediately applies its action before the next philosopher observes. This means philosopher P1 sees the result of P0's action, P2 sees the results of both P0 and P1, and so on. In sequential mode, each philosopher's action constitutes one timestep, whereas in simultaneous mode all philosophers act within a single timestep. Consequently, for the same max_timesteps setting, sequential mode executes fewer full rounds than simultaneous mode.
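The two execution modes can be sketched without LangGraph as follows. Here `decide` stands in for a philosopher's LLM call and `apply` for the apply/conflict-resolution step; the function names and dict-based state are illustrative assumptions, not the actual DPBench graph.

```python
from typing import Callable

def simultaneous_step(state: dict, n: int,
                      decide: Callable, apply: Callable) -> dict:
    # Every philosopher observes the SAME snapshot, then all actions
    # are applied together (conflicts resolved inside `apply`).
    snapshot = dict(state)
    actions = [decide(i, snapshot) for i in range(n)]
    for i, action in enumerate(actions):
        state = apply(state, i, action)
    return state  # one timestep for the whole table

def sequential_round(state: dict, n: int,
                     decide: Callable, apply: Callable) -> dict:
    # Each philosopher observes the CURRENT state and applies its
    # action immediately, so P1 sees the result of P0's action, etc.
    for i in range(n):
        action = decide(i, state)
        state = apply(state, i, action)
    return state  # n timesteps, one per philosopher
```

This makes the round-budget asymmetry concrete: under a fixed max_timesteps budget, sequential mode completes roughly max_timesteps / n full rounds, versus max_timesteps rounds in simultaneous mode.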
C.2. Model Configuration

We evaluate three frontier models accessed through their respective APIs. GPT-5.2 uses model ID gpt-5.2-2025-12-11 via the OpenAI API. Claude Opus 4.5 uses model ID claude-opus-4-5-20251101 via the Anthropic API. Grok 4.1 uses model ID grok-4-1-fast-reasoning via the xAI API. All models use temperature 0.7 and default maximum token limits.
C.3. Experimental Parameters

Each condition runs for 20 episodes with a maximum of 30 timesteps per episode. We use random seed 42 for reproducibility. When multiple philosophers attempt to grab the same fork simultaneously, the conflict is resolved by awarding the fork to the philosopher with the lower ID.
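The lower-ID tie-break described above can be sketched as follows; the function name and request encoding are illustrative assumptions.

```python
def resolve_fork_conflicts(grab_requests: dict) -> dict:
    """Map each contested fork to the winning philosopher.

    `grab_requests` maps philosopher ID -> fork ID that the philosopher
    tries to grab this timestep. When several philosophers request the
    same fork, the lowest philosopher ID wins, per the tie-break in C.3.
    """
    winners = {}
    for phil_id, fork_id in sorted(grab_requests.items()):
        # sorted() visits philosopher IDs in ascending order, so the
        # first claimant recorded for each fork is the lowest ID.
        winners.setdefault(fork_id, phil_id)
    return winners
```

For example, if philosophers 0 and 2 both grab fork 1 while philosopher 3 grabs fork 4, fork 1 goes to philosopher 0 and fork 4 to philosopher 3.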
C.4. Code Availability

DPBench is implemented in Python using LangGraph for agent orchestration. The source code is available at https://github.com/najmulhasan-code/dpbench and can be installed via pip install dpbench.