@chllming/wave-orchestration 0.6.3 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/CHANGELOG.md +82 -1
  2. package/README.md +40 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +67 -11
  39. package/docs/guides/terminal-surfaces.md +12 -0
  40. package/docs/plans/context7-wave-orchestrator.md +20 -0
  41. package/docs/plans/current-state.md +8 -1
  42. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  43. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  44. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  45. package/docs/plans/migration.md +26 -0
  46. package/docs/plans/wave-orchestrator.md +60 -12
  47. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  48. package/docs/reference/cli-reference.md +547 -0
  49. package/docs/reference/coordination-and-closure.md +436 -0
  50. package/docs/reference/live-proof-waves.md +25 -3
  51. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  52. package/docs/reference/proof-metrics.md +90 -0
  53. package/docs/reference/runtime-config/README.md +63 -2
  54. package/docs/reference/runtime-config/codex.md +2 -1
  55. package/docs/reference/sample-waves.md +29 -18
  56. package/docs/reference/wave-control.md +164 -0
  57. package/docs/reference/wave-planning-lessons.md +131 -0
  58. package/package.json +5 -4
  59. package/releases/manifest.json +40 -0
  60. package/scripts/research/agent-context-archive.mjs +18 -0
  61. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  62. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  63. package/scripts/wave-orchestrator/agent-state.mjs +11 -2
  64. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  65. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  66. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  67. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  68. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  69. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  70. package/scripts/wave-orchestrator/config.mjs +175 -0
  71. package/scripts/wave-orchestrator/control-cli.mjs +1216 -0
  72. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  73. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  74. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  75. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  76. package/scripts/wave-orchestrator/dashboard-renderer.mjs +120 -5
  77. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  78. package/scripts/wave-orchestrator/evals.mjs +23 -0
  79. package/scripts/wave-orchestrator/executors.mjs +3 -2
  80. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  81. package/scripts/wave-orchestrator/install.mjs +151 -2
  82. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  83. package/scripts/wave-orchestrator/launcher-runtime.mjs +33 -30
  84. package/scripts/wave-orchestrator/launcher.mjs +884 -36
  85. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  86. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  87. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  88. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  89. package/scripts/wave-orchestrator/replay.mjs +10 -4
  90. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  91. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  92. package/scripts/wave-orchestrator/shared.mjs +26 -0
  93. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  94. package/scripts/wave-orchestrator/terminals.mjs +1 -1
  95. package/scripts/wave-orchestrator/traces.mjs +157 -2
  96. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  97. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  98. package/scripts/wave-orchestrator/wave-files.mjs +144 -23
  99. package/scripts/wave.mjs +27 -0
  100. package/skills/repo-coding-rules/SKILL.md +1 -0
  101. package/skills/role-cont-eval/SKILL.md +1 -0
  102. package/skills/role-cont-qa/SKILL.md +13 -6
  103. package/skills/role-deploy/SKILL.md +1 -0
  104. package/skills/role-documentation/SKILL.md +4 -0
  105. package/skills/role-implementation/SKILL.md +4 -0
  106. package/skills/role-infra/SKILL.md +2 -1
  107. package/skills/role-integration/SKILL.md +15 -8
  108. package/skills/role-planner/SKILL.md +39 -0
  109. package/skills/role-planner/skill.json +21 -0
  110. package/skills/role-research/SKILL.md +1 -0
  111. package/skills/role-security/SKILL.md +2 -2
  112. package/skills/runtime-claude/SKILL.md +2 -1
  113. package/skills/runtime-codex/SKILL.md +1 -0
  114. package/skills/runtime-local/SKILL.md +2 -0
  115. package/skills/runtime-opencode/SKILL.md +1 -0
  116. package/skills/wave-core/SKILL.md +25 -6
  117. package/skills/wave-core/references/marker-syntax.md +16 -8
  118. package/wave.config.json +45 -0
@@ -0,0 +1,3283 @@
---
summary: 'Converted paper text and source links for CooperBench: Why Coding Agents Cannot be Your Teammates Yet.'
read_when:
- Reviewing harness and coordination research source material in the docs tree
- You want the extracted paper text with source links preserved
topics:
- planning-and-orchestration
- agent-cooperation-and-coordination
- repo-context-and-evaluation
kind: 'paper'
title: 'CooperBench: Why Coding Agents Cannot be Your Teammates Yet'
---

# CooperBench: Why Coding Agents Cannot be Your Teammates Yet

<Note>
Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, Diyi Yang |
| Year | 2026 |
| Venue | arXiv 2601.13295 |
| Research bucket | P0 direct hits |
| Maps to | Collaborative coding benchmark for inter-agent cooperation, communication quality, commitment tracking, and coordination failures. |
| Harness fit | Direct benchmark for whether coding agents behave like usable teammates instead of isolated solo solvers. |
| Source page | [Open source](https://arxiv.org/abs/2601.13295) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2601.13295.pdf) |
| Additional source | [Open source](https://cooperbench.com) |
| Additional PDF | [Open PDF](https://cooperbench.com/static/pdfs/main.pdf) |
| Notes | Project site hosts the same paper PDF plus leaderboard, dataset, and trajectory viewer for the benchmark. |

## Extracted text

### Page 1

2026-1-27

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Arpandeep Khatua¹∗, Hao Zhu¹∗, Peter Tran²∗∗, Arya Prabhudesai²∗∗, Frederic Sadrieh²∗∗, Johann K. Lieberwirth²∗∗, Xinkai Yu¹, Yicheng Fu¹, Michael J. Ryan¹, Jiaxin Pei¹, Diyi Yang¹

¹Stanford University ²SAP Labs US ∗Equal Contribution ∗∗Equal Contribution

https://cooperbench.com

[Figure 1: diagram showing two agents with separate goals (e.g., "Ensure images mutable after saving to the disk" and "Auto backup when overwriting existing files") working in individual virtual machines, a timestamped chat channel between them, and an evaluation step that merges their patches and runs both agents' unit tests.]

Figure 1 | The CooperBench benchmark draws tasks for two agents from a pool of features with potential conflicts. The agents execute the tasks in their individual environments, communicating in real time to coordinate. Success is measured by whether the resulting code changes by both agents are compatible and pass the requirements for both features.

Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. Similarly, as AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this hypothesis, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually, across the full spectrum of task difficulties. This contrasts sharply with human teams, where adding teammates typically improves rather than diminishes productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others’ plans, observations, and communication. Besides these issues, through large-scale simulation, we also observe rare but interesting emergent coordination behavior between agents including role division, resource division, and negotiation. Our research not only presents a novel benchmark for collaborative coding, but also calls for a research shift from pursuing individual agent capability to developing social intelligence: the ability to understand others, communicate effectively, and coordinate actions.

arXiv:2601.13295v2 [cs.LG] 26 Jan 2026

### Page 2

1. Introduction

Most achievements in modern civilization arise from individuals working cooperatively, from the construction of cathedrals to the development of open-source software (Raymond, 1999; Woolley et al., 2010). In human societies, such cooperation relies on social intelligence: the ability to communicate intentions, understand others’ goals, and negotiate mutually compatible solutions (Humphrey, 1976). This capability is often viewed as what makes us uniquely human and the basis of human thinking (Tomasello, 2014). As we deploy AI agents in cooperative settings, whether strong individual capabilities translate to effective cooperation with either humans or agents remains an open question. In this paper, we empirically demonstrate that for current AI systems, there is a curse of coordination: agent cooperation performs much worse than a single agent given the same total workload. This deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents. We theorize that at a fundamental level, effective human–AI and agent–agent cooperation rely on the same coordination abilities.

Glossary

Cooperation: When two or more agents/humans work together towards a shared goal, where an agent may altruistically help another achieve things outside their original responsibility.

Collaboration: When two or more agents/humans work together towards a shared goal.

Coordination: The capability to act and communicate in accordance with other agents/humans.

Existing research on automating human tasks and multi-agent systems largely sidesteps this challenge by either providing more scaffolds (Fourney et al., 2024a; Pan et al., 2025; Zhang et al., 2025b; Zhuge et al., 2024), enforcing strict workflows (Cheng et al., 2025; Hong et al., 2023; Nguyen et al., 2024), or providing active supervision and verification (Huang et al., 2025; Xiang et al., 2025; Zheng et al., 2025). These systems rely on developer- or user-provided scaffolding to manage coordination, which limits flexible cooperation and places additional burden on humans.

We present CooperBench, the first benchmark designed to measure how well agents can cooperate when handling individual tasks with potential conflicts. Considering software engineering as a realistic domain where humans typically need to navigate work in a team (Purna Sudhakar et al., 2011), our benchmark offers verifiable evaluation for the success of agent cooperation. As illustrated in Fig. 1, CooperBench comprises 652 tasks constructed from 12 popular open-source libraries across Python, TypeScript, Go, and Rust. Eight co-authors of this paper with real-world software engineering backgrounds created new features, unit tests, and ground-truth code for these libraries, ensuring high-quality and realistic task design.

In CooperBench, each task assigns each agent a feature to be implemented based on the same repository state. Conflicts are intentionally embedded at the code level, as the assigned features are logically compatible but require agents to modify overlapping or interdependent code. For example, in Fig. 1, one agent implements image mutability in the serialization process while another adds backup functionality to the same process. Without understanding each other’s goals, plans, and expectations, their solutions may introduce incompatible changes. This mirrors real-world software development where coordination failures stem from insufficient mutual understanding.

CooperBench enables us to investigate three research questions:

RQ1: How well can agents cooperate with each other? (§4)
RQ2: What role does communication play in agent-agent cooperation? (§5)
RQ3: What coordination failures do agents exhibit? (§6)

### Page 3

Through evaluating state-of-the-art coding agents on CooperBench, we observe the curse of coordination: GPT-5 and Claude Sonnet 4.5 based agents achieve only 25% with two-agent cooperation on CooperBench, which is around 50% lower than a “Solo” baseline which uses one agent to implement both features.

Diving deeper into the coordination failures, we identify three key issues. First, communication channels become jammed with vague, ill-timed, and inaccurate messages where agents fail to respond to direct questions, send messages that arrive too late to inform decisions, or flood channels with repetitive status updates that lack actionable detail. Second, even with effective communication, agents deviate from their commitments. They make unverifiable claims about code state, ignore agreed-upon integration points, and break explicit promises. Third, agents hold incorrect expectations about their partner’s plans and observations, duplicate work despite warnings, and overwrite changes they believe will merge cleanly (§6).

Besides failures, we are excited to report emergent coordination behaviors which often lead to the success of the CooperBench tasks. These coordination behaviors are rarely performed by the agents, but through our large-scale simulation, we uncover three major categories of them: role division, resource division, and negotiations (§6.4). These examples hint at a path of coordination capability acquisition through reinforcing success on CooperBench.

We contribute both a novel understanding of what agents need to become effective teammates and a practical benchmark for measuring progress. Our open-sourced CooperBench platform enables researchers and practitioners to evaluate and improve cooperative coding agents.

2. CooperBench Benchmark

[Figure 2: an example feature pool built from repo state stanfordnlp/dspy #80412c. Shown features include "Flexible `dspy.ToolCalls` parsing for varied formats" (Feature 1), "Minimal Python-call syntax parser for `ToolCalls`" (Feature 2), and "Minimal type coercion & unit parsing for `Tool` class arguments (safe, pre-validation)" (Feature 3), with Features 4-6 omitted; each feature lists its unit tests and touches adapters/types/tool.py and tests/adapters/test_tool.py. Legend: all of the features can be implemented in a compatible way, but they relate to overlapping files, so there is a potential for conflicts if not coordinated well.]

Figure 2 | An example feature pool based on DSPy GitHub repository. This feature pool has 6 features which can be implemented compatibly based on the repository state, but without coordination agents could conflict with each other.

CooperBench seeks to satisfy the following desiderata: (1) Realism: the tasks should be reasonable for a software development team to work on at a given repository state. (2) Conflict potential: the agents’ scopes should overlap with each other so that they need to coordinate well to avoid potential conflicts. (3) Verifiable: the success of the tasks can be evaluated with a pipeline that is deterministic and interpretable. These three desiderata provide a basis for accurately measuring the real-world cooperation capabilities of agents.

2.1. Task space

**Task** Each task consists of a repository state, two features, and two corresponding sets of unit tests. The two features are drawn from a pool of features (like the one illustrated in Fig. 2) that can be simultaneously implemented on the given repository state. The patches from the two agents are merged and evaluated. Each agent’s goal is to get their assigned feature implemented in the merged patch.¹

### Page 4

Based on a pool of n features, there will be (n choose 2) tasks when we are evaluating agents in self-play, and double the number when we evaluate two different agents cooperating with each other. Note that agents can only view their own features. For example, in Fig. 2, there are 6 features in this pool, which produces 15 tasks for evaluating GPT-5 agents cooperating with each other. If we want to evaluate how well GPT-5 agents cooperate with Claude Sonnet 4.5 agents, we will have 30 tasks drawn from this pool. In CooperBench we have 34 such feature pools.

**Features** In this paper, we use features to denote desirable changes to the codebase that implement missing functionality, fix existing bugs, or both. As illustrated in Fig. 2, each feature is described in a markdown file, which includes a title, description, examples, and a list of files which may be relevant. For each feature, we write a list of unit tests without the help of coding assistants to ensure accurate evaluation of the implementation. In addition, we write a ground-truth solution to understand the potential conflicts between features and to verify that the given feature can be implemented on the repository and pass the unit tests. The tests and the ground-truth solution are not provided to the agents to prevent test leakage.

**Task composition** For each repository state, we create a pool of feature candidates. These features are compatible and potentially conflicting. “Compatible” means the features can be implemented jointly. To verify this, we produce a joint ground-truth solution of all features in the pool, which passes all individual unit tests. “Potentially conflicting” means the features have overlapping code logic changes that influence each other. In our dataset, 77.3% of tasks have conflicting ground-truth solutions. As a result, CooperBench tasks are not adversarial, but still require the capability to cooperate under conflicts by communicating individual goals, understanding others’ plans, and negotiating mutually compatible solutions.

**Action space** Agents can take two kinds of actions in real time: the communication tool and computer-use tools. The communication tool allows agents to send open-ended natural language messages to each other, and the computer-use tools include file and terminal operations. In our paper, we limit the computer-use tools to local operations to control the experiments. In the future, researchers could consider GUI and browser-based actions to expand the tasks the agents can take. Both agents can use these tools at any time, without synchronizing their turns with each other. This not only increases the agents’ flexibility, but also challenges them to communicate and execute commands in a timely manner. In our benchmarking process, we use cloud virtual machines for agents to ensure isolated workspaces and sufficient resources. We set an upper bound (100)² on the number of actions an agent can take to complete the tasks.

2.2. Evaluation pipeline

Cooperation is hard to evaluate, but we make the product of the cooperation verifiable. CooperBench evaluates tasks based on two criteria: (1) compatible solutions and (2) implementation correctness.

**Solution compatibility** After the two agents complete execution, we attempt to merge their resulting patches using `git merge-file -L patch_1 -L repo_state -L patch_2`. This operation captures whether the independently produced solutions are structurally compatible. In practice, some merge failures arise from superficial differences such as formatting or indentation styles (e.g., K&R versus Allman) rather than substantive conflicts. To avoid treating such cases as coordination failures, we train a small coding model (Qwen 3 Coder 1.5B; Yang et al. 2025) on synthetic examples to resolve trivial merge conflicts when standard merging fails. This step ensures that the compatibility check reflects semantic agreement between solutions rather than low-level stylistic discrepancies, while leaving the overall cooperation score largely unaffected (App. § B). If even the coding model cannot produce a patch without conflicts, the agents both fail the tasks.

¹Agents have the freedom to redivide the two features as long as the merged patch implements both features. Agents occasionally perform this kind of coordination well. Check out §6.4 for concrete examples.

²We do not observe performance gains on our tasks from raising this number.

### Page 5

**Implementation correctness** If we successfully merge the two patches into the repository state, we run both sets of unit tests on the merged codebase. As mentioned before, we do not restrict agents to only finish their own work. If they can coordinate well, they can divide their two features in a different way as long as the merged solution can pass the two features’ tests. This evaluation pipeline ensures a rigorous evaluation of the cooperation outcome.

2.3. Dataset Construction

[Figure 3: the three-stage construction pipeline. Stage I - Repository & PR selection: open-source repos of different languages with over 1K stars and dataset-creator expertise; well-documented Issue/PR; has new/updated tests; <200 lines and 2 files; yielding a set of 34 PRs for 12 different repos in 4 languages. Stage II - Feature creation: curator + LLM ideation, sanitize & format, manually creating test files, feature pool. Stage III - Environment setup: reproducible execution.]

Figure 3 | The CooperBench construction pipeline. Each task is carefully engineered by domain experts to ensure conflicts are realistic, resolvable, and representative of production software development challenges.

CooperBench is constructed through a three-stage pipeline that grounds tasks in real software development and enables controlled evaluation of coordination (Fig. 3). To create the pools of features, we start from real-world feature implementations and proceed as follows: (Stage I) we write anchor features drawn from popular repositories, each of which is a slight modification of a real pull request (PR) authored by human contributors; (Stage II) for each anchor feature, we expand the pool by introducing a family of adjacent features authored by human annotators, representing plausible alternative features that could realistically co-occur; and (Stage III) we validate the compatibility of each feature pool by executing and testing all feature combinations in a controlled environment to rule out intrinsically incompatible specifications.

**Stage I: Repository and PR Selection** In the first stage we select twelve actively maintained open-source repositories spanning Python, TypeScript, Rust, and Go. Each repository exceeds one thousand GitHub stars and does not appear in SWE-Bench (Jimenez et al., 2023) or Multi-SWE-Bench (Zan et al., 2025), reducing data contamination risk. Selection is guided by curator expertise so that each repository is assigned to an author familiar with its architecture and development practices. We extract PRs that meet strict inclusion constraints: clear feature description, code+tests, feature addition, bounded change size, and robust tests. Appendix A provides full selection details and thresholds, and App. Tab. 3 summarizes the repository distribution.

643
CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Stage II: Feature Extraction and Augmentation In the second stage, we convert each selected PR into a feature pool containing one anchor feature and multiple synthetic adjacent features. We sanitize and rewrite original PR descriptions into self-contained specifications to prevent information leakage. Curators author adjacent features to plausibly co-occur and to create natural overlap without adversarial specifications (with LLM-assisted ideation). Appendix A provides full details on adjacent-feature design, manual test writing, and gold-solution validation. All features derived from the same base commit constitute a feature pool with two to twelve features. To ensure compatibility among all features in a pool, we construct a single gold patch that jointly implements all features in each set and passes all associated tests.

Stage III: Environment and Reproducibility The final stage provides a deterministic execution environment for evaluating agents. Each task set includes an automated setup script that clones the repository at the exact base commit, installs dependencies, and executes the full test suite to verify the environment. To ensure consistent behavior across hardware and operating systems, we additionally provide containerized environments that encapsulate the complete repository state and all runtime dependencies. These environments guarantee reproducible execution and isolate agent behavior from external variability, enabling reliable measurement of coordination performance through the evaluation pipeline described in §2.2.

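A minimal sketch of such a per-task setup script; the repository URL, commit hash, and test command below are hypothetical placeholders, not values from the benchmark.

```python
import subprocess

def setup_commands(repo_url, base_commit, test_cmd):
    # Mirror the steps described above: clone at the exact base commit,
    # install dependencies, then run the full test suite to verify the
    # environment before any agent starts working.
    return [
        ["git", "clone", repo_url, "workspace"],
        ["git", "-C", "workspace", "checkout", base_commit],
        ["python", "-m", "pip", "install", "-e", "workspace"],
        test_cmd,
    ]

def run_setup(commands):
    for cmd in commands:
        subprocess.run(cmd, check=True)  # fail fast if the environment is broken

# Hypothetical task: URL, commit, and test runner are placeholders.
cmds = setup_commands(
    "https://github.com/example/repo.git",
    "0123abc",
    ["python", "-m", "pytest", "workspace"],
)
```

Containerizing these same steps (e.g., baking them into an image) is what makes the environment deterministic across hardware.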
Dataset composition and feature-complexity statistics are reported in App. A. Together, these findings demonstrate that CooperBench features are individually tractable and realistic, ensuring that the benchmark’s primary challenge arises from coordinating partially overlapping implementations rather than from executing unusually complex or oversized programming tasks.

3. Experiment Settings

CooperBench allows us to study the following research questions. First, how well can current state-of-the-art foundation models cooperate with each other when used in coding agents? Second, do agents use the communication channel effectively for coordination? Third, why do agents fail or succeed on CooperBench?

To evaluate models fairly, we build an agent harness on top of the leading open-source coding agent framework OpenHands (v0.54) (Wang et al., 2024b). The two agents perform their own work in their respective Docker-based containers without interruption from the other agent. Since OpenHands was not designed for multi-agent cooperation, we created a communication tool (§2.1) that uses an SQL database for message passing. The tool supports a message-sending action: when an agent sends a message, the other agent receives it immediately and includes it in the prompt of its next step. This setting achieves both real-time communication and asynchronous execution. We open-source this framework not only to ensure reproducibility of our experiments, but also to provide a starting point for researchers building multi-agent cooperation systems that can perform multiple tasks and resolve conflicts.

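As a rough illustration of this design, a message-passing tool over a shared SQL database might look like the sketch below; the table schema and function names are our own, not the released framework’s API.

```python
import sqlite3

def init_db(path):
    # One shared SQLite database acts as the message bus between the
    # two agents' containers.
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        sender TEXT, recipient TEXT, body TEXT, read INTEGER DEFAULT 0)""")
    conn.commit()
    return conn

def send_message(conn, sender, recipient, body):
    conn.execute(
        "INSERT INTO messages (sender, recipient, body) VALUES (?, ?, ?)",
        (sender, recipient, body))
    conn.commit()

def fetch_unread(conn, recipient):
    # Polled before each agent step; new messages are injected into the
    # next prompt and marked as read.
    rows = conn.execute(
        "SELECT id, sender, body FROM messages "
        "WHERE recipient = ? AND read = 0 ORDER BY id",
        (recipient,)).fetchall()
    conn.execute("UPDATE messages SET read = 1 WHERE recipient = ?", (recipient,))
    conn.commit()
    return [(sender, body) for _id, sender, body in rows]

# Demo with an in-memory DB; a real deployment would point both agents
# at one shared database file.
conn = init_db(":memory:")
send_message(conn, "agent_a", "agent_b", "Plan: I will edit prompts.py lines 1-50 only.")
inbox = fetch_unread(conn, "agent_b")
```

Because each agent polls for unread messages on its own schedule, sending never blocks the sender, which is how real-time delivery coexists with asynchronous execution.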
Note, however, that CooperBench is not tied to this agent framework or communication tool. In this paper, we are primarily concerned with foundation models’ intrinsic capability to cooperate, so we do not compare different agent frameworks or creative methods for enhancing coordination. In the future, researchers should use CooperBench to compare different models, different frameworks, and different combinations of the two. We especially encourage researchers to develop novel frameworks, or to train agents, to achieve higher Coop scores or to close the Solo-Coop gaps (§4) on CooperBench. Similarly, we encourage researchers to develop other communication tools, e.g., screen sharing, to expand communication bandwidth or reduce communication noise.

We evaluate the performance of five language models, both closed-source and open-source: GPT-5, Claude 4.5 Sonnet, MiniMax-M2, Qwen3-Coder-30B-A3B-Instruct, and Qwen3-30B-A3B-Instruct-2507. We serve the two Qwen models via vLLM³, GPT-5 and MiniMax-M2 via their respective official APIs, and the Claude model through GCP.

[Figure 4 bar-chart data — Solo vs. Coop success rates: gpt-5 0.48 vs. 0.28, claude 0.47 vs. 0.26, minimax 0.36 vs. 0.14, qwen coder 0.22 vs. 0.13, qwen 0.06 vs. 0.05.]

Figure 4 | Left: Under the Coop setting, agents with different foundation models perform significantly worse than under the Solo setting, except for Qwen3-30B-A3B-Instruct-2507, which performs badly under both settings. This Solo-Coop gap is what we call the “coordination gap”. Right: The relationship between tasks’ technical difficulty and the Solo-Coop gap. The shaded area has a large middle section, showing that the coordination gap is larger for middle-difficulty tasks than for tasks that are extremely easy or difficult.

4. How well are agents able to cooperate with each other?

In CooperBench, each of the two agents is assigned a feature to implement; we call this the Coop setting, to distinguish it from the Solo baseline, in which both tasks are assigned to one agent. For humans, teams should perform better or faster than individuals; this is the bottom line for cooperation to be considered functional. We hypothesize that for agents, the advantage of cooperation is overwhelmed by their inability to coordinate. This should lead to a “coordination gap”: two agents perform worse than one agent with the same workload.

The curse of coordination. As shown in Fig. 4 (Left), across all models, success rates under the Coop setting are consistently lower than those under the Solo setting: when two agents need to coordinate, they perform worse than one agent “solo”-ing the two features. This coordination gap is as large as 50% in the leading models: GPT-5, Claude Sonnet 4.5, and MiniMax M2. The Qwen models have smaller gaps, but their Solo scores are much lower as well. All error bars in Fig. 4 are 95% Wilson confidence intervals computed over task sets (App. C).

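For reference, the 95% Wilson score interval behind these error bars can be computed directly from the success count and the number of tasks; this is the standard formula, shown here as a sketch rather than the paper’s evaluation code.

```python
import math

def wilson_interval(successes, n, z=1.96):
    # Wilson score interval for a binomial proportion; z = 1.96
    # corresponds to the 95% confidence level.
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Illustrative only: 48 successes out of 100 task sets.
lo, hi = wilson_interval(48, 100)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for the small per-model sample sizes involved here.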
Mid-difficulty crisis. As shown in Fig. 4 (Right), the gap between the two settings is larger and more significant on tasks of middle technical difficulty than on those that are too easy or too hard. Here we stratify tasks by relative difficulty. For each task pair t, we define a raw difficulty score d(t) = 1 − (1/|M|) ∑_{m∈M} Solo_m(t), where Solo_m(t) denotes model m’s Solo success on t.

³https://vllm.ai/

[Figure 5 panel data — (a) success rate and (b) conflict rate per model, each compared “With comm” vs. “No comm”; (c) communication events as a percentage of all execution events per model (claude 20.0%, gpt-5 16.3%, minimax 13.6%, qwen 6.2%, qwen coder 3.3%), broken down by message type (Plan, Question, Answer, Update, Ack).]

Figure 5 | (a) Effect of inter-agent communication on cooperation success or lack thereof. All agents fail to use communication to improve cooperation success. (b) Communication substantially reduces naive merge conflicts across all models. (c) Communication overhead as a percentage of all execution events, broken down by message type. Models that communicate more (e.g., Claude Sonnet 4.5, GPT-5) show larger reductions in conflict rate.

For visualization, we linearly rescale d(t) to d̃(t) ∈ [0, 1] using the minimum and maximum d(t) values in the benchmark. We bucket tasks by d̃(t) and report success rates as a function of d̃(t) for both Solo and Coop. This result shows that agents struggle to balance the two pressures of technical difficulty and cooperation difficulty. When tasks are easy, agents can spare more effort for coordination, but as tasks get harder, agents cannot effectively coordinate.

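The difficulty score and rescaling above can be sketched as follows; the model names and Solo outcomes are illustrative only, not benchmark data.

```python
def raw_difficulty(solo_success):
    # d(t) = 1 - (1/|M|) * sum over m in M of Solo_m(t), where
    # solo_success maps each model m to its Solo outcome (0 or 1) on t.
    return 1 - sum(solo_success.values()) / len(solo_success)

def rescale(ds):
    # Linear min-max rescaling of d(t) to [0, 1] across the benchmark.
    lo, hi = min(ds), max(ds)
    return [(d - lo) / (hi - lo) if hi > lo else 0.0 for d in ds]

# Three hypothetical task pairs, from easy (all models solve it Solo)
# to hard (no model solves it Solo).
ds = [raw_difficulty(s) for s in [
    {"gpt-5": 1, "claude": 1, "minimax": 1},
    {"gpt-5": 1, "claude": 0, "minimax": 0},
    {"gpt-5": 0, "claude": 0, "minimax": 0},
]]
scaled = rescale(ds)
```

Bucketing tasks by the rescaled value and comparing Solo and Coop success within each bucket yields the curves in Fig. 4 (Right).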
Scaling the number of cooperating agents. Our hypothesis is that increasing the number of agents in the same cooperative workspace exacerbates coordination overhead (e.g., more context to track and more opportunities for inconsistent plans), leading to lower end-to-end success. To probe this directly, we run a small-scale experiment using 46 tasks from 3 separate task sets, scaling the number of concurrently cooperating agents from 2 to 4 while keeping the cooperative setting fixed. We observe a monotonic decline in success as the number of agents increases: performance drops from 68.6% with 2 agents to 46.5% with 3 agents and further to 30.0% with 4 agents, reinforcing the “curse of coordination” beyond the 2-agent setting.

5. What is the role of communication in agent-agent cooperation?

In CooperBench, the communication tool we provide is the only channel agents can use to coordinate with each other. Can agents use it effectively? We hypothesize that although agents may actively use the tool, their communication may be far from effective or efficient. To evaluate this, we compare against a baseline setting in which the communication tool is disabled, i.e., “no comm”.

Communication does not lead to better cooperation. As shown in Fig. 5 (a), none of the models effectively leverage the communication tool to achieve higher cooperation success. The difference between the “with comm” and “no comm” settings is not statistically significant, showing that the existence of the communication tool does not help coordination. Does this mean agents are not using it? We quickly rule this out by examining the usage statistics and the conflict rate.

Communication reduces merge conflicts. As shown in Fig. 5 (b), communication does significantly reduce merge conflicts between patches for Claude Sonnet 4.5, GPT-5, MiniMax M2, and Qwen Instruct. This shows that agents can leverage the communication tool to reduce overlap in their work, although merely avoiding conflicts does not guarantee cooperation success. Communication also consumes a meaningful share of the agent’s action budget. Fig. 5 (c) reports the frequency of all communication speech-act types: agents spend as much as 20% of their steps on communication, within which planning, questioning, and updating each take up almost a third of the communication steps. Why, then, does this much effort not translate into better cooperation?

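A naive merge conflict in this setting is, at its core, two agents editing overlapping line ranges of the same file. The simplified overlap check below is our own illustration on (path, start, end) hunk ranges, not the benchmark’s git-based pipeline, but it captures the spatial nature of the problem.

```python
def hunks_overlap(hunks_a, hunks_b):
    # Each hunk is (path, start_line, end_line) for an edited range.
    # Two patches collide when they touch overlapping lines of the
    # same file; closed intervals overlap iff each starts before the
    # other ends.
    for path_a, start_a, end_a in hunks_a:
        for path_b, start_b, end_b in hunks_b:
            if path_a == path_b and start_a <= end_b and start_b <= end_a:
                return True
    return False

# Hypothetical edits: disjoint ranges merge cleanly, overlapping ones conflict.
disjoint = hunks_overlap([("prompts.py", 1, 50)], [("prompts.py", 85, 120)])
clashing = hunks_overlap([("prompts.py", 1, 50)], [("prompts.py", 40, 60)])
```

This is exactly the coordination problem that messages like “I will edit lines 68–84, you insert after line 84” solve, which is why communication helps with conflicts even when it does not help with task success.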
What distinguishes effective communication? To understand why communication helps with conflicts but not success, we analyze what successful communication looks like. Three patterns emerge.

First, successful agents plan more and question less. Trajectories that avoid conflicts have a Plan:Question ratio of 2.04, compared to 1.31 for conflict trajectories. This suggests that questions are a symptom of coordination problems, not a cure: agents that are already struggling tend to ask more questions, but questioning does not prevent conflicts.

Second, first-turn planning is the strongest predictor. Having a Plan message in the very first turn nearly halves the conflict rate (29.4% vs. 51.5%). This effect is robust across difficulty levels: in 7 out of 8 difficulty buckets, first-turn planning significantly reduces conflicts, with the effect actually stronger for harder tasks (a 39% reduction at the highest difficulty).

Third, specificity matters. Successful trajectories contain significantly more concrete references: 32.6 line-number mentions versus 22.5, and 13.1 file-path mentions versus 10.0. Agents that communicate where they are editing, with specific line ranges, successfully avoid overlapping changes.

Spatial vs. semantic coordination. These findings explain why communication helps with conflicts but not success. Merge conflicts are fundamentally a spatial coordination problem: agents must agree on who edits which lines. The patterns above (early planning, specific line numbers, file paths) all address spatial coordination, and they work.

However, task success requires semantic coordination: understanding what to implement, not just where. Our case study in Appendix I illustrates this gap. Two agents successfully coordinated on line numbers and edit ranges (spatial), yet failed because they never discussed the actual parameter values their implementations should use (semantic). They solved the “formatting” problem of avoiding overlapping edits but not the “design” problem of ensuring compatible implementations.

Repetition, Unresponsiveness, and Hallucination. Beyond the spatial-semantic gap, the communication itself is often flawed. We identify three major communication problems and show their frequencies in Fig. 6. We automatically detect these patterns using an LLM-as-judge approach with a precision-focused taxonomy; see Appendix F for the full rubric and evidence requirements. Repetition consumes budget without adding constraints a partner can act on, which is consistent with high communication overhead without commensurate gains in end-to-end success. Unresponsiveness breaks the feedback loop when one agent asks for a decision that gates implementation. Incorrectness creates false shared context, such as asserting an interface decision or a completed change that is not actually satisfied. Hallucination produces noise that makes it hard for partners to coordinate under imperfect information.

In this section, we show that the communication tool is heavily used, but not properly leveraged by agents for coordination. Agents lack a critical pragmatic understanding of language: communication is not just about message passing, but about achieving certain functions through passing the messages. Agents are “talking” a lot, but they cannot achieve their communication goals when the channel is jammed with repetition, unanswered questions, and false information.

[Figure 6 panel data, per model (qwen coder, qwen, minimax, gpt-5, claude): (a) average conversation turns 1.1, 3.2, 10.8, 15.5, 14.5; (b) repetition detected in 0.0%, 6.8%, 17.8%, 5.1%, 37.1% of conversations; (c) unresponsiveness in 0.0%, 1.9%, 21.3%, 5.8%, 9.9%; (d) hallucination/plan drift in 0.0%, 0.0%, 5.4%, 2.1%, 6.9%.]

Figure 6 | Breakdown of the frequencies of different kinds of communication errors: repetition (repeats same information, near-duplicate status blocks), unresponsiveness (no reply to a direct question, reply that ignores the question, vague non-answer), and hallucination (plan drift / unilateral deviation, uncorrected hallucination, corrected hallucination).

6. What are the coordination failures that the agents exhibit?

Section 5 showed that communication alone does not improve coordination. Why not? We find that even when agents communicate their plans, they struggle to honor commitments and anticipate partner actions. Coordination failures stem from three capability gaps: communication (failing to exchange key information), commitment (not following through on promises), and expectation (failing to model what partners are doing). We first categorize failures by their observable symptoms (§6.1), then identify these underlying causes (§6.2).

6.1. Failure Symptoms

We analyze all failed Coop trajectories across all five models on the full dataset. Through iterative qualitative coding, we develop the failure symptom taxonomy shown in Tab. 1. We then use GPT-5 as an LLM-as-a-Judge to categorize trajectories at scale, yielding the frequency distribution in Tab. 1. The resulting vocabulary provides a structured way to diagnose coordination breakdowns. See App. G for the annotation procedure and human validation.

6.2. Failure Reasons

Symptoms describe what went wrong; causes explain why. To identify the underlying capability gaps, we manually reviewed 50 failed Coop traces. For each trace, we examined the symptom labels, conversation logs, and merged artifacts to determine why coordination broke down. We grouped root causes into the three categories shown in Tab. 2. Unlike symptoms, which can be reliably detected by an LLM annotator, causes require deeper interpretation of the coordination dynamics and are therefore manually assigned.

6.3. Representative examples of capability gaps

We provide one representative example for each coordination capability gap. Additional symptom-level examples are available in Appendix H.

Table 1 | Coordination failure symptoms. Observable patterns in how coordination breakdowns surface in merged artifacts.

| Symptom | Meaning | % |
| --- | --- | --- |
| Work overlap | Both agents independently implement the same functionality, duplicating work and overwriting details. | 33.2 |
| Divergent architecture | Incompatible design decisions lead to semantic loss even under a clean merge. | 29.7 |
| Repetition | Verbose status messages add little new information and reduce signal. | 14.7 |
| Unresponsiveness | Direct questions or requests are not answered, breaking the decision loop. | 8.7 |
| Unverifiable claims | Agent asserts a change or interface decision without evidence the partner can check (no checkable commitment). | 4.3 |
| Broken commitment | Confident completion claims create false shared context when the promised change is absent. | 3.7 |
| Dependency access | Missing risk communication leaves agents unable to anticipate merged dependency interactions (e.g., circular imports). | 1.7 |
| Placeholder misuse | An explicit integration contract exists but is applied differently than agreed. | 1.5 |
| Parameter flow | Ambiguity about a changing interface leaves one agent implementing against an outdated contract. | 1.3 |
| Timing dependency | Agents agree on order but fail to communicate an enforceable plan that preserves it after merge. | 1.1 |

Expectation. In the first example, Agent A announces it will modify prompts.py and call B’s get_global_filters(). Agent B states it will insert GLOBAL_FILTERS at a specific location. Both agents communicate their plans explicitly, yet the merge fails. The problem is not missing information but failure to integrate it. Despite hearing B’s plan, A proceeds as if B’s code won’t exist. This is the most common cause, reflecting a fundamental difficulty in maintaining an accurate model of partner state during independent work.

[Example trace — Expectation: failure to model partner state. Both agents work on the task “Add global filter registry”. Agent A: “I will modify outlines/prompts.py. I’m removing _template_from_str() and _template_from_file() methods. My create_jinja_env() function will call your get_global_filters().” Agent B: “I will insert a block defining GLOBAL_FILTERS + register/unregister APIs.”]

Commitment. In the second example, the agent promises “I will add bypass check at lines 100–104, happens FIRST in get().” Later it claims completion with a checkmark. But after the merge, the bypass code is missing. The partner trusted this claim and built on it, and under workspace isolation, trust was all it had. The commitment was unverifiable: no pasted signature, no diff, nothing the partner could check without access to the branch.

Table 2 | Coordination capability gaps. Underlying causes inferred through qualitative analysis of failure traces.

| Cause | Definition | % |
| --- | --- | --- |
| Expectation | Cases where one agent has clearly communicated what they are doing, but the other agent still treats the situation as if that work is not being done. This reflects a failure to model the state of the other agent’s code changes and what that means for the system as a whole. | 42 |
| Commitment | Cases where an agent is not doing the things they promised to do. This includes failures to establish or maintain verifiable integration contracts, where agents make commitments but do not follow through on them. | 32 |
| Communication | Breakdowns in using language to coordinate. This includes failures in information sharing and decision loops between agents, where agents do not effectively communicate their intentions, questions, or status updates. | 26 |

[Example trace — Commitment: failure to follow through on promises. Agent: “I will add bypass check at lines 100–104, happens FIRST in get().” Later: “Implementation complete! ✓ Added bypass() context manager method.” After the merge, the bypass code is missing.]

Communication. In the third example, Agent A asks a direct question, “Which approach would you prefer?” The response is silence. Without an answer, the coordination loop collapses. A needed a decision to proceed, and without one, both agents continued with potentially incompatible assumptions. Unlike expectation failures (where information exists but isn’t integrated) or commitment failures (where promises aren’t kept), this is a failure to even establish shared context.

[Example trace — Communication: breakdown in using language to coordinate. Agent A: “... Which approach would you prefer? I want to ensure we don’t lose any functionality while resolving this conflict.” No response at all.]

The examples above reveal why coordination, rather than raw coding ability, is often the limiting factor. The common thread is partial observability. Each agent acts while holding an uncertain model of its partner’s state, edits, and commitments. A merge can be conflict-free yet still embed incompatible assumptions.

These causes manifest through the symptoms in Tab. 1. Expectation failures produce work overlap and silent overwrites, commitment failures lead to unverifiable claims and broken promises, and communication failures result in unresponsiveness and repetition.

These failures suggest current models lack reliable representations for (i) partner state (what the other agent has actually changed), (ii) checkable commitments (contracts verifiable after merge), and (iii) cross-branch integration reasoning (anticipating how independent patches interact). Coordination requires more than plausible code. It requires verifiable and actionable constraints for a partner operating under isolation. This explains why prompt optimization yields only marginal improvements (App. D). Most errors stem from coordination challenges, not prompt wording.

The trust paradox. We hypothesize that a deeper tension underlies expectation failures. Models are trained to be cautious, requiring observable evidence and resisting unverifiable assertions. This is a sensible default for single-agent interactions, where users may attempt to mislead the model. However, collaboration under workspace isolation requires the opposite: agents must trust partner claims about states they cannot observe. When Agent A reports “I added the handler at line 50,” Agent B’s instinct is to verify, but verification fails because they are on separate branches. This mismatch between verification-first training and trust-requiring collaboration may partly explain why agents consistently fail to update their model of partner state despite explicit communication.

Effective collaboration likely requires lightweight mechanisms that turn conversation into verifiable shared state, such as pasted signatures, explicit insertion-point contracts, and integration checks before declaring safety. We now turn to successful cases to see what these mechanisms look like in practice.

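One way such a checkable commitment could be represented, sketched here purely as an illustration (the record fields and verification rule are our own, not part of the benchmark):

```python
from dataclasses import dataclass

@dataclass
class Commitment:
    # A lightweight contract: the committing agent pastes the exact
    # snippet it promises to add to a file, so the partner can check
    # the claim against the merged result instead of trusting it.
    path: str
    promised_snippet: str

    def verify(self, merged_files):
        # The commitment holds only if the promised snippet actually
        # appears in the merged version of the file.
        return self.promised_snippet in merged_files.get(self.path, "")

# Hypothetical commitment from the bypass example above.
c = Commitment("cache.py", "def bypass(")
kept = c.verify({"cache.py": "def get():\n    ...\ndef bypass(self):\n    ..."})
broken = c.verify({"cache.py": "def get():\n    ..."})
```

The point of the sketch is that a pasted signature makes a completion claim falsifiable after merge, which is exactly what the unverifiable commitments in §6.3 lacked.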
6.4. Emergent Coordination Behavior

Among successful runs, we observe coordination patterns that are largely absent from failures. These behaviors are not prompted or scaffolded; they emerge when agents successfully navigate partial observability. What they share is a shift from vague intentions to specific commitments that a partner can verify even without seeing the underlying work. We identify three such patterns.

Role division. Agents agree on who handles which part of the task and establish clear boundaries around their scope.

[Example traces — role division. One agent: “You implement the environment-isolation feature and I’ll implement multi-file editing support. Please avoid the backend implementation; I will handle it end-to-end.” Partner: “I won’t touch the backend editor implementation. I’ll implement environment isolation in the CLI-layer logic only.” In another run, an agent inspects outlines/types/__init__.py to confirm the exact alphabetical insertion point, then proposes: “I’ll add the section header + octal_str in outlines/types/__init__.py; you add binary_str between them to keep alphabetical order.” Partner: “On my branch I’ll insert: header → binary_str → octal_str.”]

What distinguishes successful role division is mutual confirmation. Under partial observability, a unilateral declaration can easily be missed or misunderstood. When both agents explicitly acknowledge the split, they create verified shared understanding that both sides can rely on during independent work.

Resource division. Agents avoid collisions by partitioning shared resources, most commonly specific files, code ranges, or ownership blocks.

What makes resource division effective is specificity. Vague commitments cannot be verified and thus require trust. Line-level boundaries, by contrast, create safe zones where conflict is impossible.

[Example traces — resource division. One agent: “I will modify llama-index-core/llama_index/core/base/llms/types.py lines 68–84 (ImageBlock.image_to_base64).” Partner: “I will NOT edit lines 68–84. My plan is to insert get_image_mimetype() AFTER line 84 (starting at new line 85).” In another run: “I will INSERT (no existing lines edited) into outlines/prompts.py between current line 14 and @dataclass Prompt at line 15. The inserted block defines GLOBAL_FILTERS + register/unregister APIs. Please avoid the inserted block.” Partner: “I already modified outlines/prompts.py line 7 (import field). I will edit only regions inside @dataclass Prompt by construction.”]

+ **Negotiation.** Agents resolve conflicting approaches by proposing alternatives and converging on a single plan before acting.
+
+ > "I checked the file… here are two clean options: (1) I add IsHash; you add import re + IsRegex; I handle all __init__.py exports. (2) You add IsRegex; I add IsHash; you handle all __init__.py exports. Which option do you prefer?"
+
+ > ......
+
+ > *Adds import re. Edits `dirty_equals/__init__.py` to export both new classes.*
+
+ > "I’m seeing a potential overlap between our plans… I’ll check the current file state first, then we’ll coordinate a clean split."
+
+ > "Let’s do option (1)… I’ve already added `import re` now. You add IsHash, then I’ll add IsRegex."
+
+ Effective negotiation does cognitive work for both parties. By proposing mutually exclusive options that fully specify what each agent will do, one agent reduces a complex coordination problem to a simple choice. The result is not just agreement on intent but complete action specifications that leave nothing to interpret.
+
+ These coordination patterns are rare in our traces, but their presence in successful cases suggests that the underlying capability exists. The challenge is not teaching agents new coordination skills but making existing ones reliable.
+
+ 7. Related Work
+
+ Multi-agent LLM systems and tool-using coding agents have advanced rapidly, but reliable collaboration remains unresolved. Prior work largely evaluates task success under engineered interaction structure rather than free-form coordination under partial information.
+
+ **Multi-agent LLM systems.** Many frameworks improve performance through structured interaction. CAMEL (Li et al., 2023a) and AutoGen (Wu et al., 2023) use conversation programming; MetaGPT (Hong et al., 2024) and ChatDev (Qian et al., 2024) emulate software organizations; Magentic-One (Fourney et al., 2024b), MAGIS (Tao et al., 2024), and AgileCoder (Nguyen et al., 2024) use explicit orchestrators. Even with such scaffolding, multi-agent systems exhibit high failure rates. Multi-agent configurations degrade performance by 39 to 70 percent relative to single-agent baselines (Su et al., 2025), and failure analyses identify inter-agent misalignment as a major category (Cemri et al., 2025). These findings suggest that externally imposed protocols mask rather than solve the underlying coordination problem. Sotopia (Zhou et al., 2024) provides a general framework for evaluating agents’ social intelligence, while our work focuses specifically on cooperative coding agents with verified tasks.
+
+ ### Page 15
+
+ **Tool-using coding agents** such as SWE-agent (Yang et al., 2024), OpenHands (Wang et al., 2025), and Agentless (Xia et al., 2024) achieve strong results on SWE-bench (Jimenez et al., 2024). However, these evaluations measure single-agent success rather than whether multiple peers can integrate changes without conflict under partial information.
+
+ **Coordination benchmarks.** Existing benchmarks span games, embodied tasks, and reasoning. Hanabi (Forkel & Foerster, 2025) and Cicero (FAIR, 2022) test coordination under information asymmetry; MultiAgentBench (Zhu et al., 2025) and Collab-Overcooked (Sun et al., 2025) evaluate LLM collaboration; Tool-RoCo (Zhang et al., 2025a) and RoCoBench (Mandi et al., 2023) assess multi-robot cooperation. In software, SyncBench (Guo et al., 2025) tests divergent understanding, and The Collaboration Gap (Davidson et al., 2025) finds that solo-capable models degrade when required to collaborate. These benchmarks typically enforce turn-taking or shared observability rather than testing code integration under workspace isolation. Agent-human collaboration benchmarks such as Co-Gym (Shao et al., 2025), HULA (Takerngsaksiri et al., 2025), and HAI-Eval (Luo et al., 2025) study settings where humans arbitrate. We instead study whether agents can coordinate autonomously.
+
+ **Theory of Mind evaluation.** Effective coordination requires modeling partner beliefs and intentions, a capability commonly referred to as Theory of Mind (Premack & Woodruff, 1978; Rabinowitz et al., 2018; Zhu et al., 2021). ToMBench (Chen et al., 2024), FANToM (Kim et al., 2023), and SoMi-ToM (Fan et al., 2025) evaluate theory of mind in LLMs, finding substantial gaps versus human performance. ToMSWE (Zhou et al., 2025) aims to build coding agents that can infer users’ mental states. Studies of cooperative games (Li et al., 2023b) and Generative Agents (Park et al., 2023) show emergent social behaviors but also challenges in translating these to verifiable collaborative work.
+
+ We isolate free-form coordination as the central object of evaluation. CooperBench assigns two agents partially overlapping features on a shared codebase while isolating their workspaces and restricting coordination to natural language. Unlike benchmarks that impose interaction structure or measure outcomes alone, we evaluate through coordination failures such as redundancy, inconsistent assumptions, and semantic breakage. We demonstrate the curse of coordination in a controlled setting with verifiable code integration, pointing to social intelligence as the bottleneck for effective agent teamwork.
+
+ 8. Conclusion and Future Work
+
+ In a future where agents team with humans in high-stakes domains (Kim et al., 2025), accelerate science and technology research (Gottweis et al., 2025), and empower creative endeavors (Waikar, 2021), it is hard to imagine how an agent incapable of coordination would contribute to such a future, however strong its individual capabilities.
+
+ Our work demonstrates that coordination, not raw coding ability, is a central bottleneck for multi-agent software development. Through CooperBench, we show that frontier models like GPT-5 and Claude Sonnet 4.5 achieve only 25% success when two agents collaborate, roughly half the success rate of a single agent performing the same workload. This curse of coordination stems from three capability gaps: agents fail to communicate actionable information, deviate from their own commitments, and hold incorrect expectations about their partners.
+
+ Yet coordination is not beyond reach. In successful traces, we observe emergent behaviors such as role division, resource division, and negotiation that turn vague intentions into verifiable commitments. These patterns are rare, but their presence suggests the underlying capability exists; the challenge is making it reliable. With multi-agent training methods, e.g. Sotopia-π (Wang et al., 2024a; Yu et al., 2025), we can expect these emergent behaviors to be reinforced through the success of cooperation.
+
+ ### Page 16
+
+ Our findings open several directions: (1) training objectives that reward coordination under partial observability, (2) lightweight protocols for verifiable commitments (e.g., shared signatures, insertion-point contracts), and (3) richer communication channels such as screen sharing to expand the modality beyond text. We release CooperBench as an open benchmark to measure progress on these fronts.
+
+ Although we focus on software development, our findings generalize to any domain involving role and resource conflicts under partial observability. We expect that the lack of social intelligence, the ability to understand others, communicate effectively, and coordinate actions, will remain a fundamental barrier limiting the real-world deployment of agents as teammates until these capabilities are explicitly developed.
+
+ Acknowledgments
+
+ This research is supported in part by ONR grant N000142412532, NSF grant IIS-2247357, DSO National Laboratories (DSO), and support from SAP. We thank Google Cloud Platform and Modal Platform for their credits. We thank Yutong Zhang, Gavin Li, Hannah Cha, John Yang, Yijia Shao, and all members of Stanford SALT Lab for their help and feedback throughout this project.
+
+ References
+
+ Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail?, 2025. URL https://arxiv.org/abs/2503.13657.
+
+ Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, and Minlie Huang. ToMBench: Benchmarking theory of mind in large language models, 2024. URL https://arxiv.org/abs/2402.15052.
+
+ Yuyang Cheng, Yumiao Xu, Chaojia Yu, and Yong Zhao. HAWK: A hierarchical workflow framework for multi-agent collaboration, 2025. URL https://arxiv.org/abs/2507.04067.
+
+ Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, and Ece Kamar. The collaboration gap, 2025. URL https://arxiv.org/abs/2511.02687.
+
+ Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022. doi: 10.1126/science.ade9097. URL https://www.science.org/doi/abs/10.1126/science.ade9097.
+
+ Xianzhe Fan, Xuhui Zhou, Chuyang Jin, Kolby Nottingham, Hao Zhu, and Maarten Sap. SoMi-ToM: Evaluating multi-perspective theory of mind in embodied social interactions. In NeurIPS D&B, 2025. URL https://arxiv.org/abs/2506.23046.
+
+ Johannes Forkel and Jakob Foerster. Entropy is all you need for inter-seed cross-play in Hanabi, 2025. URL https://arxiv.org/abs/2511.22581.
+
+ Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024a. URL https://arxiv.org/abs/2411.04468.
+
+ Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024b. URL https://arxiv.org/abs/2411.04468.
+
+ Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.
+
+ Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, and Heng Ji. SyncMind: Measuring agent out-of-sync recovery in collaborative software engineering, 2025. URL https://arxiv.org/abs/2502.06994.
+
+ Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.
+
+ Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven K. S. Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024.
+
+ Saffron Huang, Bryan Seethor, Esin Durmus, Kunal Handa, Miles McCain, Michael Stern, and Deep Ganguli. How AI is transforming work at Anthropic, 2025. URL https://anthropic.com/research/how-ai-is-transforming-work-at-anthropic/.
+
+ Nicholas K Humphrey. The social function of intellect. In Growing points in ethology, pp. 303–317. Cambridge University Press, 1976.
+
+ Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
+
+ Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
+
+ Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM: A benchmark for stress-testing machine theory of mind in interactions, 2023. URL https://arxiv.org/abs/2310.15421.
+
+ Ji Woong Kim, Juo-Tung Chen, Pascal Hansen, Lucy Xiaoyang Shi, Antony Goldenberg, Samuel Schmidgall, Paul Maria Scheikl, Anton Deguet, Brandon M White, De Ru Tsai, et al. SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning. Science Robotics, 10(104):eadt5254, 2025.
+
+ Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society. In Advances in Neural Information Processing Systems, 2023a.
+
+ Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 180–192, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.13. URL https://aclanthology.org/2023.emnlp-main.13/.
+
+ Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, and Hanan Salam. HAI-Eval: Measuring human-AI synergy in collaborative coding, 2025. URL https://arxiv.org/abs/2512.04111.
+
+ Zhao Mandi, Shreeya Jain, and Shuran Song. RoCo: Dialectic multi-robot collaboration with large language models, 2023. URL https://arxiv.org/abs/2307.04738.
+
+ Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, and Nghi D. Q. Bui. AgileCoder: Dynamic collaborative agents for software development based on agile methodology, 2024. URL https://arxiv.org/abs/2406.11912.
+
+ Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. URL https://openreview.net/forum?id=wM521FqPvI.
+
+ Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442.
+
+ David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526, 1978. doi: 10.1017/S0140525X00076512. Publisher: Cambridge University Press.
+
+ Goparaju Purna Sudhakar, Ayesha Farooq, and Sanghamitra Patnaik. Soft factors affecting the performance of software development teams. Team Performance Management: An International Journal, 17(3/4):187–205, 2011.
+
+ Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024.
+
+ Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. Machine theory of mind. In International Conference on Machine Learning, pp. 4218–4227. PMLR, 2018.
+
+ Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen, Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding, Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen, Haibo Ding, Panpan Xu, and Lin Lee Cheong. A systematic survey of automatic prompt optimization techniques. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33066–33098. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.emnlp-main.1681. URL http://dx.doi.org/10.18653/v1/2025.emnlp-main.1681.
+
+ Eric Raymond. The cathedral and the bazaar. Knowledge, Technology & Policy, 12(3):23–49, 1999.
+
+ Prateek Sahoo et al. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.
+
+ Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. Collaborative Gym: A framework for enabling and evaluating human-agent collaboration, 2025. URL https://arxiv.org/abs/2412.15701.
+
+ Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Scaling agents via continual pre-training, 2025. URL https://arxiv.org/abs/2509.13310.
+
+ Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, and Xiaojie Wang. Collab-Overcooked: Benchmarking and evaluating large language models as collaborative agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 4922–4951. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.emnlp-main.249. URL http://dx.doi.org/10.18653/v1/2025.emnlp-main.249.
+
+ Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. Human-in-the-loop software development agents, 2025. URL https://arxiv.org/abs/2411.12924.
+
+ Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. MAGIS: LLM-based multi-agent framework for GitHub issue resolution. arXiv preprint arXiv:2403.17927, 2024.
+
+ Michael Tomasello. A natural history of human thinking. Harvard University Press, 2014.
+
+ Sachin Waikar. Artists’ perspective: How AI enhances creativity and reimagines meaning, Apr 2021. URL https://hai.stanford.edu/news/artists-perspective-how-ai-enhances-creativity-and-reimagines-meaning.
+
+ Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Yonatan Bisk, Graham Neubig, and Hao Zhu. Sotopia-π: Interactive learning of socially intelligent language agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12912–12940, 2024a.
+
+ Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2024b.
+
+ Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations, 2025.
+
+ Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
+
+ Anita Williams Woolley, Christopher F Chabris, Alex Pentland, Nada Hashmi, and Thomas W Malone. Evidence for a collective intelligence factor in the performance of human groups. Science, 330(6004):686–688, 2010.
+
+ Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.
+
+ Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489.
+
+ Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, and Bo Li. GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning, 2025. URL https://arxiv.org/abs/2406.09187.
+
+ An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
+
+ John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024.
+
+ Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, and Jiaxuan You. Sotopia-RL: Reward design for social intelligence. arXiv preprint arXiv:2508.03905, 2025.
+
+ Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-SWE-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605.
+
+ Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, and Toshiharu Sugawara. Tool-RoCo: An agent-as-tool self-organization large language model benchmark in multi-robot cooperation, 2025a. URL https://arxiv.org/abs/2511.21510.
+
+ Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, and Bo An. AgentOrchestra: Orchestrating hierarchical multi-agent intelligence with the Tool-Environment-Agent (TEA) protocol, 2025b. URL https://arxiv.org/abs/2506.12508.
+
+ Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, and Yu Su. WebGuard: Building a generalizable guardrail for web agents, 2025. URL https://arxiv.org/abs/2507.14293.
+
+ Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations, 2024.
+
+ Xuhui Zhou, Valerie Chen, Zora Zhiruo Wang, Graham Neubig, Maarten Sap, and Xingyao Wang. ToM-SWE: User mental modeling for software engineering agents. arXiv preprint arXiv:2510.21903, 2025.
+
+ Hao Zhu, Graham Neubig, and Yonatan Bisk. Few-shot language coordination by modeling theory of mind. In International Conference on Machine Learning, pp. 12901–12911. PMLR, 2021.
+
+ Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You. MultiAgentBench: Evaluating the collaboration and competition of LLM agents, 2025. URL https://arxiv.org/abs/2503.01935.
+
+ Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs, 2024. URL https://arxiv.org/abs/2402.16823.
+
+ ### Page 21
+
+ A. Dataset Details
+
+ This section provides detailed statistics on the CooperBench benchmark. Repository selection criteria are described in §2.3.
+
+ A.1. Repository Distribution
+
+ Table 3 shows the full breakdown of repositories, features, and task pairs.
+
+ Language Repository #PRs Features (Σ) Task Pairs (Σ) License
2274
+
2275
+ Python DSPy 4 23 55 MIT
2276
+
2277
+ LlamaIndex 3 16 39 MIT
2278
+
2279
+ Pillow 3 15 30 MIT-CMU
2280
+
2281
+ Pallets Click 3 27 115 BSD-3
2282
+
2283
+ Pallets Jinja 3 30 135 BSD-3
2284
+
2285
+ HuggingFace Datasets 3 13 26 Apache-2.0
2286
+
2287
+ Outlines 3 22 79 Apache-2.0
2288
+
2289
+ Tiktoken 1 10 45 MIT
2290
+
2291
+ DirtyEquals 1 9 36 MIT
2292
+
2293
+ TypeScript React Hook Form 2 11 25 MIT
2294
+
2295
+ Go Chi Router 3 13 22 MIT
2296
+
2297
+ Rust Typst 3 10 45 Apache-2.0
2298
+
2299
+ Total 12 repositories 34 199 652
2300
+
2301
+ Note: Each repository contains 1–4 base commits (PRs), each defining an independent feature pool. Task
2302
+
2303
+ pairs are constructed within each PR as (n
2304
+
2305
+ 2) and summed across PRs.
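+
+ The pair construction in the note above can be sanity-checked against Table 3: for single-PR repositories, the task-pair count is exactly C(n, 2). A minimal sketch, using only numbers from Table 3:
+
+ ```python
+ from math import comb
+
+ def task_pairs(features_per_pr):
+     """Sum C(n, 2) over the feature pools of each base commit (PR)."""
+     return sum(comb(n, 2) for n in features_per_pr)
+
+ # Single-PR repositories from Table 3, where the pair count is exactly C(n, 2):
+ print(task_pairs([10]))  # Tiktoken: 10 features -> 45 task pairs
+ print(task_pairs([9]))   # DirtyEquals: 9 features -> 36 task pairs
+ ```
+
+ For multi-PR repositories, the per-PR feature pools are not listed in the table, so only the summed totals can be compared.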
+
+ ### Page 22
+
+ A.2. Feature Complexity
+
+ The final CooperBench benchmark comprises 199 individual features grouped into 52 task sets, yielding 652 evaluated feature pairs. Since the objective is to evaluate coordination rather than raw implementation difficulty, features are intentionally designed to be compact and comparable in difficulty to those found in established code-generation benchmarks. This design ensures that multi-agent failures reflect genuine coordination limitations rather than disproportionate feature complexity.
+
+ To quantify feature complexity, we characterize the gold patches for each feature along three axes: (i) code volume, measured as the total number of lines added and deleted; (ii) structural footprint, captured by the number of modified functions and hunks; and (iii) modification scope, defined as the number of files affected. Across the benchmark, features exhibit a deliberately compact footprint. On average, a feature comprises 52.3 changed lines and modifies only 1.4 files, confirming that CooperBench isolates coordination challenges rather than the difficulty of single-agent implementation. Table 4 provides detailed statistics for each repository.
Table 4 | Feature Complexity Statistics by Repository

| Language | Repository | Avg. Lines | Avg. Functions | Avg. Files | Easy | Medium | Hard |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Python | DSPy | 70.9 | 5.6 | 1.3 | 2 (9%) | 4 (17%) | 17 (74%) |
|  | LlamaIndex | 16.8 | 1.8 | 1.0 | 2 (13%) | 14 (87%) | 0 (0%) |
|  | Pillow | 38.1 | 2.7 | 1.0 | 1 (7%) | 11 (73%) | 3 (20%) |
|  | Pallets Click | 53.9 | 5.4 | 1.6 | 0 (0%) | 10 (37%) | 17 (63%) |
|  | Pallets Jinja | 67.7 | 6.2 | 1.0 | 1 (3%) | 14 (47%) | 15 (50%) |
|  | HuggingFace Datasets | 15.3 | 2.3 | 1.0 | 1 (8%) | 11 (85%) | 1 (8%) |
|  | Outlines | 44.7 | 4.1 | 1.1 | 8 (36%) | 6 (27%) | 8 (36%) |
|  | Tiktoken | 46.4 | 4.6 | 1.0 | 0 (0%) | 8 (80%) | 2 (20%) |
|  | DirtyEquals | 71.0 | 4.0 | 2.0 | 0 (0%) | 1 (11%) | 8 (89%) |
| TypeScript | React Hook Form | 49.8 | 4.6 | 2.3 | 0 (0%) | 8 (73%) | 3 (27%) |
| Go | Chi Router | 80.2 | 5.7 | 2.8 | 0 (0%) | 5 (38%) | 8 (62%) |
| Rust | Typst | 58.4 | 1.7 | 1.1 | 0 (0%) | 7 (70%) | 3 (30%) |
| Overall | 12 Repositories | 52.3 | 4.4 | 1.4 | 15 (8%) | 99 (50%) | 85 (43%) |

Note: Complexity measured as lines changed (added + removed) and structural elements modified in gold patches. Easy/Medium/Hard cells report feature counts with row percentages in parentheses. Difficulty categories from SWE-Rater-32B: Easy = <15 min fix, Medium = 15 min–1 hour, Hard = 1–4 hours.
2376
+
2377
B. LLM-based merge conflict resolver

CooperBench evaluates cooperation on merged code. When patch merging produces textual conflicts, we use a small learned resolver to remove conflict markers while preserving both sides' intent. We train a small local resolver rather than calling a larger proprietary model so that the merge step remains narrow and predictable, avoids fixing anything beyond trivial merge cleanup, and can run locally. At evaluation time, we invoke the learned resolver only after a standard merge attempt and a union merge attempt fail to yield a test-passing merged artifact.

We construct training data by replaying merges between independently produced feature patches and extracting the conflict-marked regions from conflicted files. We identify each conflict region by scanning for the Git conflict markers <<<<<<<, =======, and >>>>>>>. We extract the marked block together with a small fixed context window, by default c = 5 lines before and after.
2398
+
2399
⁴ A hunk is a contiguous block of changed lines in a diff, representing a localized code modification.
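The conflict-region scan described above can be sketched as follows (a minimal illustration, not the released pipeline; the function and field names are ours):

```python
def extract_conflict_regions(lines, context=5):
    """Scan file lines for Git conflict blocks (<<<<<<< ... ======= ... >>>>>>>)
    and return each block together with `context` lines before and after it."""
    regions = []
    i = 0
    while i < len(lines):
        if lines[i].startswith("<<<<<<<"):
            start = i
            # Advance to the closing marker of this conflict block.
            while i < len(lines) and not lines[i].startswith(">>>>>>>"):
                i += 1
            end = min(i, len(lines) - 1)  # index of the >>>>>>> line
            lo = max(0, start - context)
            hi = min(len(lines), end + 1 + context)
            regions.append({"span": (start, end), "snippet": lines[lo:hi]})
        i += 1
    return regions
```

At inference time, the resolver's output would replace the marked block inside each extracted snippet.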
2402
+
2403
### Page 23

We generate synthetic conflicts by perturbing these real conflict snippets. Our default generator is gpt-4o. This keeps training examples representative of our patch distribution while avoiding direct reuse of repository-specific content. For each real or synthetic conflict snippet, we create a reference resolution with gpt-5 and fine-tune a small code model, Qwen/Qwen2.5-Coder-0.5B-Instruct, using LoRA-based supervised fine-tuning (SFT). We train for three epochs with a maximum sequence length of 2048 tokens. When the resolver is invoked, we extract the conflicted region with its fixed context window, run deterministic decoding with temperature = 0, and replace that region with the model's resolution. We release the trained resolver as Qwen2.5-Coder-0.5B-Merge-Resolver.⁵
2422
+
2423
C. Difficulty-stratified evaluation

Raw success rates are insufficient for comparing coordination overhead across models. A model dropping from 50% Solo to 30% Coop has the same 20-point gap as one dropping from 80% to 60%, but the first loses 40% of its capability while the second loses only 25%. We need a metric that accounts for baseline differences. We also want to integrate across task difficulty rather than rely on aggregates that mask variation. This section derives such a metric using the relative difficulty defined in Section 4.

We partition tasks into 10 equal-width buckets over the normalized difficulty range [0, 1] and compute the success rate at each bucket midpoint, with 95% Wilson confidence intervals, which remain well-calibrated near 0 and 1. This produces two curves per model, one for Solo and one for Coop. We summarize each curve by its area under the curve (AUC) via trapezoidal integration. The absolute gap ∆AUC = AUC_Solo − AUC_Coop measures coordination cost but depends on baseline capability. We therefore report retention = AUC_Coop / AUC_Solo, which normalizes for capability: a retention of 0.64 means 64% of Solo performance survives coordination.

For aggregate statistics across models, we sum raw counts rather than averaging rates, which preserves proper weighting when models have different sample sizes.
2454
+
2455
⁵ huggingface.co/CodeConflict/Qwen2.5-Coder-0.5B-Merge-Resolver
2458
+
2459
### Page 24

Algorithm 1: Constructing difficulty-stratified success curves

Input: Task set with difficulty scores d(t) ∈ [0, 1], success outcomes for Solo and Coop per model
Output: Success curves with 95% CIs, AUC gap, and retention per model and pooled

// Bucket tasks by difficulty
1  Split [0, 1] into 10 equal buckets;
2  Assign each task to its bucket based on d(t);
// Compute curves per model
3  foreach model m do
4      foreach bucket b do
5          Compute Solo success rate r^Solo_{m,b} = k^Solo_{m,b} / n_{m,b};
6          Compute Coop success rate r^Coop_{m,b} = k^Coop_{m,b} / n_{m,b};
7          Compute 95% Wilson CI for each rate;
8      end
9      Compute AUC_Solo and AUC_Coop via trapezoidal integration;
10     Compute ∆AUC = AUC_Solo − AUC_Coop;
11     Compute retention = AUC_Coop / AUC_Solo;
12 end
// Pool across models
13 foreach bucket b do
14     Sum counts across models to get pooled n_b and k_b;
15     Compute pooled rates and Wilson CIs;
16 end
17 Compute pooled AUC gap and retention;
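The per-model computation above can be sketched in a few lines of Python (our own illustrative code; bucket handling and function names are assumptions, not the paper's implementation):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for k successes in n trials; stays calibrated near 0 and 1."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

def bucket_rates(tasks, n_buckets=10):
    """tasks: iterable of (difficulty in [0, 1], success as 0/1); per-bucket success rates."""
    k = [0] * n_buckets
    n = [0] * n_buckets
    for d, ok in tasks:
        b = min(int(d * n_buckets), n_buckets - 1)  # clamp d == 1.0 into the last bucket
        k[b] += ok
        n[b] += 1
    return [ki / ni if ni else 0.0 for ki, ni in zip(k, n)]

def auc(rates, n_buckets=10):
    """Trapezoidal area under the curve sampled at the bucket midpoints."""
    width = 1.0 / n_buckets
    return sum((a + b) / 2 * width for a, b in zip(rates, rates[1:]))

def retention(solo_tasks, coop_tasks):
    """Fraction of Solo AUC preserved under Coop (lines 9-11 of Algorithm 1, per model)."""
    return auc(bucket_rates(coop_tasks)) / auc(bucket_rates(solo_tasks))
```

With 48 agreements out of 50 (as in the human validation of Appendix G), `wilson_ci(48, 50)` returns approximately (0.865, 0.989), matching the reported [86%, 99%] interval.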
2518
+
2519
[Figure 7: five panels (gpt-5, claude, minimax, qwen coder, qwen), each plotting task success rate (0.0–1.0, y-axis) against relative difficulty (0.1–0.9, x-axis) for the Coop and Solo settings.]

Figure 7 | Success rate versus relative difficulty for Solo and Coop settings. Shaded regions indicate 95% Wilson confidence intervals. The gap between curves represents coordination cost, which is largest at mid-difficulty.
2572
+
2573
On average, 41% of Solo capability is lost when agents must coordinate (pooled retention 0.59). The pattern across models reinforces that coding ability does not predict coordination ability. MiniMax exhibits the worst retention (0.46) despite mid-tier coding performance, while Qwen achieves the highest retention (0.68) despite being the weakest coder. Weak models may benefit from a floor effect, but MiniMax demonstrates that strong coding provides no protection against coordination overhead.
2586
+
2587
### Page 25

Table 5 | Coordination retention by model. Retention measures what fraction of Solo AUC is preserved under Coop. Higher values indicate better coordination capability.

| Model | Solo k | Coop k | AUC Solo | AUC Coop | ∆AUC | Retention |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5 | 315 | 183 | 0.506 | 0.325 | 0.181 | 0.64 |
| claude | 307 | 168 | 0.469 | 0.283 | 0.186 | 0.60 |
| minimax | 236 | 91 | 0.374 | 0.171 | 0.203 | 0.46 |
| qwen coder | 141 | 87 | 0.236 | 0.148 | 0.088 | 0.63 |
| qwen | 41 | 30 | 0.106 | 0.072 | 0.034 | 0.68 |
| pooled | 1039 | 558 | 0.338 | 0.200 | 0.138 | 0.59 |
2610
+
2611
D. Prompt Optimization: Failure-Driven Design

This appendix documents the iterative optimization of the collaborative-setting execution prompt through systematic failure analysis. Following established prompt-engineering practices (Ramnath et al., 2025; Sahoo et al., 2024), we employed an evidence-based approach: beginning with a basic prompt and incrementally adding sections to address specific failure modes observed in agent behavior. The prompt shown below represents the final, stable version used consistently across all experimental runs reported in this paper.

Through iterative refinement, we identified three primary failure categories requiring explicit prompt guidance: context misunderstanding (agents treating coordination as optional), spatial coordination failures (overlapping edits due to vague messages), and coordination protocol failures (missing final status updates). The final prompt structure directly maps to these failure categories.
2632
+
2633
### Page 26

Collaborative Setting Execution Prompt

Role: You are {{agent_id}} working on the following feature in parallel with another agent.

Scenario: You are working on separate branches implementing different features, but your implementations will be tested by 2-way merging both branches to main. You must prevent any merge conflicts.

Feature Description:
{{feature_description}}

Implementation Plan:
{{plan}}

Your Task:
1. Implement the feature according to the plan.
2. You can communicate with the other agent using MCP tools:
   • openhands_comm_send: Send messages to the other agent
   • Messages from the other agent will appear automatically as '[Inter-agent message]'
3. Coordinate to avoid conflicts by specifying exact file paths and line numbers.
4. Complete the implementation.

Coordination Requirements:
• Share your implementation approach early with specific line ranges so both agents can coordinate.
• If the other agent reports working on the same file, discuss who modifies which specific line ranges to avoid conflicts.
• Never use insertion markers or comments like // [handleSubmit:onFinally] other agent inserts – these cause merge conflicts.
• Instead, coordinate by dividing the file into non-overlapping sections with specific line ranges.
• Before you stop or complete your work, you must send a final status update message to the other agent summarizing what you've implemented.

Merge Conflict Prevention:
• Think of this as two developers working on separate branches that will be merged together.
• Any overlapping changes to the same lines will cause merge conflicts.
• Coordinate line-by-line to ensure no overlap in your modifications.

Work directory: {{workspace}}
2700
+
2701
Failure-to-Prompt Mapping. The scenario section addresses context misunderstanding by explicitly establishing that agents work on separate branches that will be merged, making coordination mandatory. Analysis showed that many agents in early versions did not coordinate until after starting implementation; with the scenario section, most agents coordinate during planning. The coordination requirements section addresses spatial coordination failures through multiple mechanisms. The exact line number requirement (with a concrete example) addresses vague coordination messages, significantly reducing spatial conflicts. The insertion marker prohibition substantially reduced marker-related conflicts. The mandatory final status update requirement increased compliance and reduced incomplete handoff failures. The merge conflict prevention section reinforces context understanding through a mental model and technical explanation of merge conflict mechanisms, helping agents understand why coordination matters and how to prevent conflicts.
2722
+
2723
### Page 27

Design Decisions. The prompt follows a specific ordering: (1) Identity establishes the agent role, (2) Scenario sets merge conflict constraints before the task description, (3) Feature and (4) Plan provide context, (5) Task describes what to do, (6) Requirements specify how to coordinate, and (7) Prevention reinforces understanding. This ordering follows the principle that constraints should precede task descriptions (Sahoo et al., 2024). Language choices employ mandatory language for critical behaviors and strong prohibitions for anti-patterns, as optional language was frequently ignored. Concrete examples are included rather than abstract guidance, consistent with findings that concrete examples improve prompt effectiveness (Wei et al., 2022). All experimental results reported in this paper were obtained using this final prompt version.
2746
+
2747
E. Communication ablation

Section 5 reports that communication does not improve cooperation success. Table 6 provides the full breakdown across merge strategies. We evaluate three merging approaches in sequence: Naive (standard git merge), Union (accept both sides on conflict), and LLM (our learned resolver from App. B). The ∆ column shows the net effect of communication on final merge success after all resolution steps. Communication slightly improves Naive merge rates by reducing raw conflicts, but this advantage disappears after Union and LLM resolution. The final effect is near zero or slightly negative across all models.
2762
+
2763
Table 6 | Merge success (%) on the 652-task summary. Parenthesized values show ∆ from the prior column; the final column shows the communication effect (with-comm LLM minus no-comm LLM).

| Model | Naive (no-comm) | Union (no-comm) | LLM (no-comm) | Naive (with-comm) | Union (with-comm) | LLM (with-comm) | ∆ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | 13.88 | 26.69 (+12.8) | 27.91 (+1.2) | 20.42 | 26.64 (+6.2) | 27.90 (+1.3) | −0.1 |
| Claude 4.5 | 12.27 | 26.84 (+14.6) | 27.30 (+0.5) | 16.72 | 24.85 (+8.1) | 25.92 (+1.1) | −1.4 |
| MiniMax-M2 | 8.62 | 14.72 (+6.1) | 14.88 (+0.2) | 7.36 | 11.50 (+4.1) | 13.96 (+2.5) | −0.9 |
| Qwen3-Coder | 6.90 | 12.88 (+6.0) | 14.72 (+1.8) | 6.75 | 12.42 (+5.7) | 13.34 (+0.9) | −1.4 |
| Qwen3-Instruct | 1.53 | 3.22 (+1.7) | 3.37 (+0.2) | 2.30 | 4.45 (+2.1) | 4.60 (+0.2) | +1.2 |
| Avg. | 8.64 | 16.87 (+8.2) | 17.64 (+0.8) | 10.71 | 15.97 (+5.3) | 17.14 (+1.2) | −0.5 |
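The sequential Naive → Union → LLM procedure amounts to a fallback cascade; a minimal sketch (illustrative only — the strategy callables stand in for git merge, union merge, and the learned resolver of App. B):

```python
def merge_cascade(strategies, passes_tests):
    """Try merge strategies in order; return (name, artifact) for the first
    candidate that yields a test-passing merged artifact, else (None, None).

    Each strategy is a zero-argument callable returning a merged artifact,
    or None if it could not produce one (e.g., an unresolved textual conflict).
    """
    for name, strategy in strategies:
        artifact = strategy()
        if artifact is not None and passes_tests(artifact):
            return name, artifact
    return None, None

# Stand-ins: the naive merge hits a textual conflict, the union merge passes,
# so the learned resolver is never invoked.
winner, artifact = merge_cascade(
    [("naive", lambda: None),
     ("union", lambda: "merged-src"),
     ("llm", lambda: "llm-resolved-src")],
    passes_tests=lambda a: a == "merged-src",
)
```

The LLM resolver is only reached when both cheaper strategies fail, keeping its role narrow.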
2782
+
2783
F. Communication error detection

We use an LLM-as-judge to classify communication failures for Section 5. Abstract labels like "hallucination" are difficult for LLMs to apply reliably, so we instead define fine-grained categories anchored to quotable evidence. The judge must cite exact quotes from the conversation and omits the label if evidence is weak. We then aggregate these detections into three high-level categories for reporting.
2794
+
2795
### Page 28

Communication Error Detection Prompt

You are a careful reviewer of two-agent collaboration conversations. This is a precision-first detector of bad conversation patterns. Prefer returning no issue unless the evidence is strong and explicit.

Important exclusion. Do not label state mismatch or visibility confusion itself as an error (e.g., agents on separate branches unable to see each other's changes). Bad conversation patterns around these topics should still be labeled.

Taxonomy. Label at most one category per conversation.
• C1a Unanswered direct question (no reply)
• C1b Unanswered direct question (ignored)
• C2 Non-answer or vague answer
• C3a Incorrect claim (uncorrected)
• C3b Incorrect claim (corrected)
• C4a Spammy repetition (repeats same information)
• C4b Spammy repetition (near-duplicate status blocks)

Evidence requirements. Include at least two exact quotes that make the issue undeniable. C1a/C1b require the question plus demonstration of a missing or irrelevant response. C3a requires the incorrect claim and a later contradiction. C4a/C4b require two quotes showing the repetition.

Output. Return JSON with evidence (list of quotes) and optional issue (category id and short description). Omit issue if evidence is weak.
2838
+
2839
Taxonomy design. The seven categories decompose three failure modes into verifiable patterns. Unresponsiveness (C1a, C1b, C2) covers questions that receive no reply, are ignored, or get vague non-answers. Hallucination (C3a, C3b) covers false claims about code state or completion status. We distinguish corrected from uncorrected claims because uncorrected errors propagate to downstream decisions. Repetition (C4a, C4b) covers redundant messages that consume budget without adding information.
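Rolling the fine-grained labels up into the three reported categories is mechanical; a sketch (the category map and JSON field names follow the prompt above, the function itself is ours):

```python
# Fine-grained judge labels -> high-level categories reported in Section 5.
COARSE = {
    "C1a": "unresponsiveness", "C1b": "unresponsiveness", "C2": "unresponsiveness",
    "C3a": "hallucination", "C3b": "hallucination",
    "C4a": "repetition", "C4b": "repetition",
}

def aggregate(judgments, min_quotes=2):
    """Count high-level failure categories from per-conversation judge outputs.

    Each judgment is a dict with 'evidence' (list of exact quotes) and an
    optional 'issue' ({'category': ..., 'description': ...}). Judgments whose
    evidence falls below the quote threshold are discarded (precision-first).
    """
    counts = {"unresponsiveness": 0, "hallucination": 0, "repetition": 0}
    for j in judgments:
        issue = j.get("issue")
        if issue and len(j.get("evidence", [])) >= min_quotes:
            counts[COARSE[issue["category"]]] += 1
    return counts
```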
2850
+
2851
G. Failure Symptom Annotation Procedure

We followed a six-stage process, similar in spirit to recent work on multi-agent failure analysis (Cemri et al., 2025). (1) Collect multi-agent-system (MAS) traces from Collaborative runs; (2) identify failures from merged artifacts (e.g., failing tests or missing intended behavior), and link them back to the interaction; (3) develop symptom categories by iterative qualitative coding and resolve disagreements to reach inter-annotator agreement on a shared set of definitions; (4) finalize the resulting symptom set; (5) calibrate an LLM-based annotator on the agreed definitions; and (6) apply the annotator to produce symptom annotations at scale.

Each labeled instance is grounded in three artifacts: (i) conversation evidence (the coordination dialogue), (ii) patch/code evidence (what each agent changed), and (iii) outcome evidence (merge reports and test outputs). A key operational distinction in our rubric is between implementation failures (an individual agent delivers incomplete/buggy code regardless of coordination) and coordination failures (a breakdown that is only apparent when we consider what agents said and assumed under workspace isolation). Concretely, we require explicit conversation evidence to assign a coordination-failure label; if the only evidence is in the code or error trace, we default to an implementation-level failure rather than inferring a coordination breakdown. We codified the final symptom definitions as a structured rubric (including verification requirements and common confusions, e.g., when to treat "unverifiable claims" versus "work overlap"). We then calibrated an LLM-based annotator on this rubric and required it to emit structured labels (a primary symptom plus any secondary symptoms) together with short supporting evidence snippets.

### Page 29
2896
+
2897
Human validation. To validate the LLM-based annotator, we randomly sampled 50 trajectories and had human experts independently label them using the same rubric. Human labels matched the LLM annotations on 48 of 50 cases (96% agreement). With n = 50 and p̂ = 0.96, the Wilson 95% confidence interval is [86%, 99%], confirming the annotator's reliability.
2904
+
2905
H. Symptom examples

We provide representative examples for each coordination failure symptom identified in Table 1.
2908
+
2909
Work overlap

A msg: guid regex includes surrounding curly braces ({...}), and A warns about overlap.
A msg: WAIT Agent 10! If you add the section header AND my guid type to your branch, that WILL create a merge conflict!
B msg: I'll add the COMPLETE section (lines 72–81) to my branch, which includes both the section header, your guid type, AND my hash_sha256 type.
Explanation: Despite explicit boundaries and warnings, B duplicates A's work; the merged artifact keeps B's (incorrect) version.
Outcome: Wrong regex version wins; GUID tests fail (merged guid pattern missing curly braces).
2924
+
2925
Divergent architecture

A msg: Lines 15–66: My global filter registry code is present... Line 69: Still has @dataclass decorator.
B msg: Removed dataclass import... Converted Prompt from dataclass to regular class with explicit __init__.
B msg: The 2-way merge should combine our changes cleanly.
Merge: clean (0 conflicts), but A's additions are missing from the merged module.
Explanation: Despite explicit expectations about merge behavior, B's class rewrite overwrites the region where A added register_filter/unregister_filter; the merge reports 0 conflicts but the exported API is missing.
Outcome: ImportError: cannot import name register_filter from outlines.prompts.
2940
+
2941
Repetition

Near-duplicate status updates (different strings, same info).
A msg: I have successfully added the url type to outlines/types/__init__.py at lines 72–77:
A msg: Successfully added to outlines/types/__init__.py at lines 72–77
Explanation: The same completion/location is restated with minimal new information, consuming turns without adding constraints a partner can act on.
Outcome: Repeated status updates reduce signal-to-noise and can crowd out missing coordination details.
2954
+
2955
Unresponsiveness

A msg: Which approach would you prefer? I want to ensure we don't lose any functionality while resolving this conflict.
B: no later message answers this question in the conversation log.
Explanation: The decision is explicitly requested and never resolved, breaking the coordination loop.
Outcome: The team proceeds without an agreed decision; implementation assumptions diverge.
2966
+
2967
### Page 30

Unverifiable claims

A claim: max_resolution is already added at specific line ranges.
B constraint: I cannot verify your intermediate changes (separate branches).
Result: the merge can be clean while one side's asserted change is silently absent.
Explanation: The claim is specific but non-checkable; coordination lacks a verification mechanism (e.g., pasted signature, exact diff, or an agreed placeholder).
Outcome: False shared context about code state leads to incompatible downstream edits.
2986
+
2987
Broken commitment

A msg: I'll add BOTH parameters (fallback_processor and max_batch_size) to the constructor signature, BOTH docstrings, and BOTH initializations.
A msg: ✓ Line 26: Added BOTH parameters (fallback_processor and max_batch_size) to constructor signature.
Observed after merge: constructor only has fallback_processor, missing max_batch_size.
Explanation: A makes a confident completion claim that is not corrected or verified by B; under partial observability, this creates false shared context.
Outcome: TypeError: __init__() got an unexpected keyword argument max_batch_size (tests fail).
3002
+
3003
Dependency access

Conversation evidence: 0 coordination messages were sent (total_messages=0), so neither agent disclosed import/initialization risks.
Merge traceback excerpt: src/PIL/Image.py:60 executes from . import ImageDraw. Then src/PIL/ImageDraw.py:45 reads Image.core (PIL.Image still initializing ⇒ circular import failure).
Explanation: With no communication at all, the merged import graph is never discussed; the first shared integration check happens only at import time and fails deterministically.
Outcome: AttributeError: partially initialized module PIL.Image has no attribute core (most likely due to a circular import).
3022
+
3023
Placeholder misuse

A msg: I'll add a clear comment marker: [Conditional filters overlay insertion point].
A msg: Please insert your logic immediately AFTER the marker... without modifying lines above it.
B msg: Given your marker plan, I didn't alter those methods... I rely on __post_init__ to overlay filters.
Explanation: The agreed integration point (insert-after-marker) is not used; B implements an alternative wiring path, so the merged decorator surface no longer matches the expected call pattern.
Outcome: TypeError: prompt got an unexpected keyword argument conditional_filters.
3036
+
3037
Parameter flow

A msg: renamed edit_file to edit_files with multi-file command construction.
B msg: I'm going to continue... based on the current state I see (edit_file method).
B code shape: builds a shell command by interpolating filename into a quoted string, assuming it is a single string.
Explanation: Ambiguity about a changing interface leaves one agent implementing against an outdated contract; after merge, a list flows into string-only formatting.
Outcome: sed: can't read [...]: No such file or directory (list passed as a literal string).
3052
+
3053
### Page 31

Timing dependency

A msg: Processing Pipeline: load → image.load → EXIF correction (NEW) → B crop (pending) → mode conversion → return.
B msg: Applied AFTER EXIF correction (A) and BEFORE mode conversion... Pipeline (after merge): load → EXIF correction → center-crop.
Merge: CLEAN (0 textual conflicts); both declare No conflicts expected.
Merged code excerpt: image = image.crop(...)
Merged code absence: no ImageOps.exif_transpose(...) call exists in the merged function.
Explanation: They agree on the intended order, but fail to ensure the EXIF correction block is actually present at the agreed insertion point after merge.
Outcome: assert (640, 480) == (480, 640) (EXIF correction missing).
3080
+
3081
I. Case Study: Spatial vs. Semantic Coordination

Section 5 shows that communication reduces merge conflicts but does not improve task success. To understand why, we examine a representative failure in detail. This case illustrates the distinction between spatial coordination (agreeing on which lines to edit) and semantic coordination (agreeing on what values and behaviors to implement). We find that agents excel at the former but neglect the latter.
3092
+
3093
I.1. Task Setup

The task comes from the Jinja2 template engine. Jinja2 provides a groupby filter that groups items by an attribute. For example, users|groupby("city") groups a list of users by their city. The original function signature looks like this:

    def sync_do_groupby(
        environment: "Environment",
        value: "t.Iterable[V]",
        attribute: t.Union[str, int],
        default: t.Optional[t.Any] = None,
    ) -> "t.List[t.Tuple[t.Any, t.List[V]]]":
3112
+
3113
Two agents must independently extend this filter with new parameters. Agent 1 adds a case_sensitive parameter that controls whether grouping treats "CA" and "ca" as the same value. The intended default is False, matching the existing sort filter's case-insensitive behavior. Agent 2 adds a reverse parameter that controls whether groups appear in descending order.

The coordination challenge is that both parameters must be added to the same function signature. Git performs conflict detection at the hunk level (contiguous blocks of changes), not individual lines. If both agents modify the signature independently, Git will detect a conflict because both changes touch the same hunk. This creates a genuine coordination requirement where agents cannot simply work in isolation.
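The semantics the two features target can be illustrated outside Jinja2 with plain Python (a hypothetical sketch of the intended behavior; the real task modifies sync_do_groupby itself, and the helper name here is ours):

```python
from itertools import groupby

def group_users(users, attribute, case_sensitive=False, reverse=False):
    """Group dicts by `attribute`, mirroring the intended filter semantics:
    case-insensitive by default (like Jinja2's sort filter), optionally
    reversed group order. The group key shown is the normalized value."""
    def key(u):
        v = u[attribute]
        return v if case_sensitive or not isinstance(v, str) else v.lower()
    ordered = sorted(users, key=key, reverse=reverse)
    return [(k, list(g)) for k, g in groupby(ordered, key=key)]

users = [{"city": "CA"}, {"city": "ca"}, {"city": "NY"}]
# Default groups "CA" and "ca" together; case_sensitive=True keeps them apart.
```

Because the two flags touch disjoint pieces of the logic (the key function versus the sort order), they are semantically independent here — which is exactly the kind of fact the agents never established for the real signature.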
3130
+
3131
I.2. What Agents Communicated

The agents exchanged 10 messages totaling over 3,000 words. Their communication focused almost entirely on spatial coordination: line numbers, file paths, and edit ranges.
3138
+
3139
+ ### Page 32
3140
+
3141
+ CooperBench: Why Coding Agents Cannot be Your Teammates Yet
3142
+
3143
> Agent 1: I’ll be modifying the sync_do_groupby and do_groupby functions in src/jinja2/filters.py lines 1161–1230... According to our coordination plan, both parameters will be added together to avoid merge conflicts.

> Agent 2: I’ll wait for you to complete the implementation in src/jinja2/filters.py lines 1161–1230...

> Agent 1: Your reverse insertion points remain:
>
> - sync: `for _, values in groupby(sorted(value, key=group_expr), group_expr):`
>
> Please proceed to add `reverse: bool = False` immediately after `case_sensitive`.
This communication successfully identified the shared code region and established a protocol for who edits what. What the agents never discussed were the semantic details that matter for correctness. Agent 1 correctly states that `case_sensitive` should default to `False` to match the sort filter, but this default value is never explicitly confirmed with Agent 2. Neither agent discusses whether the two parameters interact or whether their implementations are independent.
### I.3. What Went Wrong

Because both agents anticipated the need to modify the same function signature, each independently added both parameters to avoid a Git conflict. However, without semantic coordination, they made inconsistent choices.
Agent 1’s patch added only the `case_sensitive` parameter with the correct default:

```python
def sync_do_groupby(
    environment: "Environment",
    value: "t.Iterable[V]",
    attribute: t.Union[str, int],
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,  # Correct default
) -> "t.List[_GroupTuple]":
```
Agent 2’s patch added both parameters (to avoid merge conflicts) but reported the wrong value in communication.

Agent 2’s status message:

> “Signatures now are: (environment, value, attribute, default=None, case_sensitive=True)”

Agent 2 reported `case_sensitive=True` as the default, while the correct value is `False`. This discrepancy was never caught because the conversation focused entirely on where edits would happen, not on what values would be used. Neither agent verified the other’s actual implementation; they relied on status messages. The semantic meaning of the default (“should match the sort filter”) was mentioned by Agent 1 but never confirmed by Agent 2.
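A check like the following sketch, which reads the default straight out of the source with Python's `ast` module, is the kind of verification neither agent performed. The helper name and the trimmed-down signature are invented for illustration:

```python
import ast

def parameter_default(source, func_name, param_name):
    """Read a parameter's default value out of source text.

    A lightweight check an agent could run instead of trusting a
    teammate's status message. Handles simple positional defaults only.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            args, defaults = node.args.args, node.args.defaults
            # defaults align with the tail of the positional argument list
            for arg, default in zip(args[len(args) - len(defaults):], defaults):
                if arg.arg == param_name:
                    return ast.literal_eval(default)
    raise LookupError(f"no default found for {param_name!r} in {func_name!r}")

# What Agent 2 actually wrote, per the failure above (signature abbreviated):
source = (
    "def sync_do_groupby(environment, value, attribute,\n"
    "                    default=None, case_sensitive=True):\n"
    "    ...\n"
)
print(parameter_default(source, "sync_do_groupby", "case_sensitive"))  # True: the wrong default
```

Running this against the teammate's branch would have surfaced the `True` default immediately, instead of letting the status message stand uncorrected.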
For reference, the gold (correct) patches show what each feature should look like. The gold patch for `case_sensitive` adds:

```python
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,
) -> "t.List[_GroupTuple]":
```

And the gold patch for `reverse` adds:

```python
    default: t.Optional[t.Any] = None,
    reverse: bool = False,
) -> "t.List[t.Tuple[t.Any, t.List[V]]]":
```

### Page 33
The correct merged signature would combine both:

```python
def sync_do_groupby(
    environment: "Environment",
    value: "t.Iterable[V]",
    attribute: t.Union[str, int],
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,
    reverse: bool = False,
) -> "t.List[_GroupTuple]":
```
### I.4. What Would Have Worked

For this task to succeed, the agents needed to coordinate on three levels. Spatial coordination they achieved: “I’m editing lines 1161–1230; please add your parameter after mine.” Structural coordination they partially achieved: “Both parameters go in the signature; I’ll add mine first.” Semantic coordination was missing entirely.

A single message could have prevented the failure:

Missing coordination:

> “I’m implementing case_sensitive with default value False (not True). This matches the sort filter’s case-insensitive default. If you need to include this parameter in your patch, please use exactly case_sensitive: bool = False.”
### I.5. Implications

This case study provides concrete evidence for the spatial-semantic gap discussed in Section 5. Despite 10 messages and over 3,000 words of coordination, the agents never once discussed the actual default value that `case_sensitive` should have. They successfully negotiated where to edit but failed to negotiate what to implement. A single clarifying message about the intended default value would have prevented the failure entirely.