@chllming/wave-orchestration 0.6.3 → 0.7.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +82 -1
- package/README.md +40 -7
- package/docs/agents/wave-orchestrator-role.md +50 -0
- package/docs/agents/wave-planner-role.md +39 -0
- package/docs/context7/bundles.json +9 -0
- package/docs/context7/planner-agent/README.md +25 -0
- package/docs/context7/planner-agent/manifest.json +83 -0
- package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
- package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
- package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
- package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
- package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
- package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
- package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
- package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
- package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
- package/docs/evals/README.md +96 -1
- package/docs/evals/arm-templates/README.md +13 -0
- package/docs/evals/arm-templates/full-wave.json +15 -0
- package/docs/evals/arm-templates/single-agent.json +15 -0
- package/docs/evals/benchmark-catalog.json +7 -0
- package/docs/evals/cases/README.md +47 -0
- package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
- package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
- package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
- package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
- package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
- package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
- package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
- package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
- package/docs/evals/external-benchmarks.json +85 -0
- package/docs/evals/external-command-config.sample.json +9 -0
- package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
- package/docs/evals/pilots/README.md +47 -0
- package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
- package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
- package/docs/evals/wave-benchmark-program.md +302 -0
- package/docs/guides/planner.md +67 -11
- package/docs/guides/terminal-surfaces.md +12 -0
- package/docs/plans/context7-wave-orchestrator.md +20 -0
- package/docs/plans/current-state.md +8 -1
- package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
- package/docs/plans/examples/wave-example-live-proof.md +1 -1
- package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
- package/docs/plans/migration.md +26 -0
- package/docs/plans/wave-orchestrator.md +60 -12
- package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
- package/docs/reference/cli-reference.md +547 -0
- package/docs/reference/coordination-and-closure.md +436 -0
- package/docs/reference/live-proof-waves.md +25 -3
- package/docs/reference/npmjs-trusted-publishing.md +3 -3
- package/docs/reference/proof-metrics.md +90 -0
- package/docs/reference/runtime-config/README.md +63 -2
- package/docs/reference/runtime-config/codex.md +2 -1
- package/docs/reference/sample-waves.md +29 -18
- package/docs/reference/wave-control.md +164 -0
- package/docs/reference/wave-planning-lessons.md +131 -0
- package/package.json +5 -4
- package/releases/manifest.json +40 -0
- package/scripts/research/agent-context-archive.mjs +18 -0
- package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
- package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
- package/scripts/wave-orchestrator/agent-state.mjs +11 -2
- package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
- package/scripts/wave-orchestrator/autonomous.mjs +7 -0
- package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
- package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
- package/scripts/wave-orchestrator/benchmark.mjs +972 -0
- package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
- package/scripts/wave-orchestrator/config.mjs +175 -0
- package/scripts/wave-orchestrator/control-cli.mjs +1216 -0
- package/scripts/wave-orchestrator/control-plane.mjs +697 -0
- package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
- package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
- package/scripts/wave-orchestrator/coordination.mjs +84 -0
- package/scripts/wave-orchestrator/dashboard-renderer.mjs +120 -5
- package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
- package/scripts/wave-orchestrator/evals.mjs +23 -0
- package/scripts/wave-orchestrator/executors.mjs +3 -2
- package/scripts/wave-orchestrator/feedback.mjs +55 -0
- package/scripts/wave-orchestrator/install.mjs +151 -2
- package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
- package/scripts/wave-orchestrator/launcher-runtime.mjs +33 -30
- package/scripts/wave-orchestrator/launcher.mjs +884 -36
- package/scripts/wave-orchestrator/planner-context.mjs +75 -0
- package/scripts/wave-orchestrator/planner.mjs +2270 -136
- package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
- package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
- package/scripts/wave-orchestrator/replay.mjs +10 -4
- package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
- package/scripts/wave-orchestrator/retry-control.mjs +225 -0
- package/scripts/wave-orchestrator/shared.mjs +26 -0
- package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
- package/scripts/wave-orchestrator/terminals.mjs +1 -1
- package/scripts/wave-orchestrator/traces.mjs +157 -2
- package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
- package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
- package/scripts/wave-orchestrator/wave-files.mjs +144 -23
- package/scripts/wave.mjs +27 -0
- package/skills/repo-coding-rules/SKILL.md +1 -0
- package/skills/role-cont-eval/SKILL.md +1 -0
- package/skills/role-cont-qa/SKILL.md +13 -6
- package/skills/role-deploy/SKILL.md +1 -0
- package/skills/role-documentation/SKILL.md +4 -0
- package/skills/role-implementation/SKILL.md +4 -0
- package/skills/role-infra/SKILL.md +2 -1
- package/skills/role-integration/SKILL.md +15 -8
- package/skills/role-planner/SKILL.md +39 -0
- package/skills/role-planner/skill.json +21 -0
- package/skills/role-research/SKILL.md +1 -0
- package/skills/role-security/SKILL.md +2 -2
- package/skills/runtime-claude/SKILL.md +2 -1
- package/skills/runtime-codex/SKILL.md +1 -0
- package/skills/runtime-local/SKILL.md +2 -0
- package/skills/runtime-opencode/SKILL.md +1 -0
- package/skills/wave-core/SKILL.md +25 -6
- package/skills/wave-core/references/marker-syntax.md +16 -8
- package/wave.config.json +45 -0
@@ -0,0 +1,3283 @@
---
summary: 'Converted paper text and source links for CooperBench: Why Coding Agents Cannot be Your Teammates Yet.'
read_when:
- Reviewing harness and coordination research source material in the docs tree
- You want the extracted paper text with source links preserved
topics:
- planning-and-orchestration
- agent-cooperation-and-coordination
- repo-context-and-evaluation
kind: 'paper'
title: 'CooperBench: Why Coding Agents Cannot be Your Teammates Yet'
---
# CooperBench: Why Coding Agents Cannot be Your Teammates Yet

<Note>
Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, Diyi Yang |
| Year | 2026 |
| Venue | arXiv 2601.13295 |
| Research bucket | P0 direct hits |
| Maps to | Collaborative coding benchmark for inter-agent cooperation, communication quality, commitment tracking, and coordination failures. |
| Harness fit | Direct benchmark for whether coding agents behave like usable teammates instead of isolated solo solvers. |
| Source page | [Open source](https://arxiv.org/abs/2601.13295) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2601.13295.pdf) |
| Additional source | [Open source](https://cooperbench.com) |
| Additional PDF | [Open PDF](https://cooperbench.com/static/pdfs/main.pdf) |
| Notes | Project site hosts the same paper PDF plus leaderboard, dataset, and trajectory viewer for the benchmark. |

## Extracted text

### Page 1

2026-1-27

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Arpandeep Khatua1∗, Hao Zhu1∗, Peter Tran2∗∗, Arya Prabhudesai2∗∗, Frederic Sadrieh2∗∗, Johann K. Lieberwirth2∗∗, Xinkai Yu1, Yicheng Fu1, Michael J. Ryan1, Jiaxin Pei1, Diyi Yang1

1Stanford University 2SAP Labs US ∗Equal Contribution ∗∗Equal Contribution

https://cooperbench.com

[Figure 1 graphic: two expert-written features with potential conflicts but compatible solutions. Each agent pursues its own goal (Agent 1: ensure images are mutable after saving to disk; Agent 2: auto backup when overwriting existing files) in an individual execution environment (a virtual machine), coordinating over a shared chat channel; evaluation merges the two patches and runs both agents' unit tests on the merged result.]

Figure 1 | The CooperBench benchmark draws tasks for two agents from a pool of features with potential conflicts. The agents execute the tasks in their individual environments, communicating in real time to coordinate. Success is measured by whether the resulting code changes by both agents are compatible and pass the requirements for both features.

Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. Similarly, as AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this hypothesis, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually, across the full spectrum of task difficulties. This contrasts sharply with human teams, where adding teammates typically improves rather than diminishes productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others’ plans, observations, and communication. Besides these issues, through large-scale simulation, we also observe rare but interesting emergent coordination behavior between agents including role division, resource division, and negotiation. Our research not only presents a novel benchmark for collaborative coding, but also calls for a research shift from pursuing individual agent capability to developing social intelligence: the ability to understand others, communicate effectively, and coordinate actions.

arXiv:2601.13295v2 [cs.LG] 26 Jan 2026

### Page 2

1. Introduction

Most achievements in modern civilization arise from individuals working cooperatively, from the construction of cathedrals to the development of open-source software (Raymond, 1999; Woolley et al., 2010). In human societies, such cooperation relies on social intelligence: the ability to communicate intentions, understand others’ goals, and negotiate mutually compatible solutions (Humphrey, 1976). This capability is often viewed as what makes us uniquely human and the basis of human thinking (Tomasello, 2014). As we deploy AI agents in cooperative settings, whether strong individual capabilities translate to effective cooperation with either humans or agents remains an open question. In this paper, we empirically demonstrate that for current AI systems, there is a curse of coordination: agent cooperation performs much worse than a single agent given the same total workload. This deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents. We theorize that at a fundamental level, effective human–AI and agent–agent cooperation rely on the same coordination abilities.

Glossary

Cooperation: When two or more agents/humans work together towards a shared goal, where an agent may altruistically help another achieve things outside their original responsibility.

Collaboration: When two or more agents/humans work together towards a shared goal.

Coordination: The capability to act and communicate in accordance with other agents/humans.

Existing research on automating human tasks and multi-agent systems largely sidesteps this challenge by either providing more scaffolds (Fourney et al., 2024a; Pan et al., 2025; Zhang et al., 2025b; Zhuge et al., 2024), enforcing strict workflows (Cheng et al., 2025; Hong et al., 2023; Nguyen et al., 2024), or providing active supervision and verification (Huang et al., 2025; Xiang et al., 2025; Zheng et al., 2025). These systems rely on developer- or user-provided scaffolding to manage coordination, which limits flexible cooperation and places additional burden on humans.

We present CooperBench, the first benchmark designed to measure how well agents can cooperate when handling individual tasks with potential conflicts. Considering software engineering as a realistic domain where humans typically need to navigate work in a team (Purna Sudhakar et al., 2011), our benchmark offers verifiable evaluation for the success of agent cooperation. As illustrated in Fig. 1, CooperBench comprises 652 tasks constructed from 12 popular open-source libraries across Python, TypeScript, Go, and Rust. Eight co-authors of this paper with real-world software engineering backgrounds created new features, unit tests, and ground-truth code for these libraries, ensuring high-quality and realistic task design.

In CooperBench, each task assigns each agent a feature to be implemented based on the same repository state. Conflicts are intentionally embedded at the code level, as the assigned features are logically compatible but require agents to modify overlapping or interdependent code. For example, in Fig. 1, one agent implements image mutability in the serialization process while another adds backup functionality to the same process. Without understanding each other’s goals, plans, and expectations, their solutions may introduce incompatible changes. This mirrors real-world software development where coordination failures stem from insufficient mutual understanding.

CooperBench enables us to investigate three research questions:

RQ1: How well can agents cooperate with each other? (§4)

RQ2: What role does communication play in agent-agent cooperation? (§5)

RQ3: What coordination failures do agents exhibit? (§6)

### Page 3

Through evaluating state-of-the-art coding agents on CooperBench, we observe the curse of coordination: GPT-5 and Claude Sonnet 4.5 based agents achieve only 25% with two-agent cooperation on CooperBench, which is around 50% lower than a “Solo” baseline which uses one agent to implement both features.

Diving deeper into the coordination failures, we identify three key issues. First, communication channels become jammed with vague, ill-timed, and inaccurate messages, where agents fail to respond to direct questions, send messages that arrive too late to inform decisions, or flood channels with repetitive status updates that lack actionable detail. Second, even with effective communication, agents deviate from their commitments. They make unverifiable claims about code state, ignore agreed-upon integration points, and break explicit promises. Third, agents hold incorrect expectations about their partner’s plans and observations, duplicate work despite warnings, and overwrite changes they believe will merge cleanly (§6).

Besides failures, we are excited to report emergent coordination behaviors which often lead to the success of the CooperBench tasks. These coordination behaviors are rarely performed by the agents, but through our large-scale simulation, we uncover three major categories of them: role division, resource division, and negotiations (§6.4). These examples hint at a path of coordination capability acquisition through reinforcing success on CooperBench.

We contribute both a novel understanding of what agents need to become effective teammates and a practical benchmark for measuring progress. Our open-sourced CooperBench platform enables researchers and practitioners to evaluate and improve cooperative coding agents.

2. CooperBench Benchmark

[Figure 2 graphic: an example feature pool built on repo state stanfordnlp/dspy #80412c. Feature 1: flexible `dspy.ToolCalls` parsing for varied formats; Feature 2: minimal Python-call syntax parser for `ToolCalls`; Feature 3: minimal type coercion and unit parsing for `Tool` class arguments (safe, pre-validation); Features 4-6 omitted. Each feature lists its unit tests and touches the overlapping files adapters/types/tool.py and tests/adapters/test_tool.py. All of the features can be implemented in a compatible way, but the features are related to overlapping files, so there is a potential for conflicts if not coordinated well.]

Figure 2 | An example feature pool based on DSPy GitHub repository. This feature pool has 6 features which can be implemented compatibly based on the repository state, but without coordination agents could conflict with each other.

CooperBench seeks to satisfy the following desiderata: (1) Realism: the tasks should be reasonable for a software development team to work on at a given repository state. (2) Conflict potential: the agents’ scopes should overlap with each other so that they need to coordinate well to avoid potential conflicts. (3) Verifiable: the success of the tasks can be evaluated with a pipeline that is deterministic and interpretable. These three desiderata provide a basis for accurately measuring the real-world cooperation capabilities of agents.

2.1. Task space

Task Each task consists of a repository state, two features, and two corresponding sets of unit tests. The two features are drawn from a pool of features (like the one illustrated in Fig. 2) that can be simultaneously implemented on the given repository state. The patches from the two agents are merged and evaluated. Each agent’s goal is to get their assigned feature implemented in the merged patch.¹

### Page 4

Based on a pool of n features, there will be (n choose 2) tasks when we evaluate agents in self-play, and double that number when we evaluate two different agents cooperating with each other. Note that agents can only view their own features. For example, in Fig. 2, there are 6 features in this pool, which produces 15 tasks for evaluating GPT-5 agents cooperating with each other. If we want to evaluate how well GPT-5 agents cooperate with Claude Sonnet 4.5 agents, we will have 30 tasks drawn from this pool. In CooperBench we have 34 such feature pools.
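
The counting rule above can be sketched in a few lines (the helper name is ours, not from the paper):

```python
from math import comb

def tasks_per_pool(n_features: int, self_play: bool = True) -> int:
    """Each unordered pair of features in a pool is one task.
    Cross-model evaluation doubles the count because either
    model can take either feature of a pair."""
    pairs = comb(n_features, 2)
    return pairs if self_play else 2 * pairs

# The 6-feature DSPy pool from Fig. 2:
print(tasks_per_pool(6))                   # 15 self-play tasks
print(tasks_per_pool(6, self_play=False))  # 30 cross-model tasks
```

Summing over all 34 pools (whose sizes vary) yields the 652 tasks reported in the paper.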

Features In this paper, we use features to denote desirable changes to the codebase that implement missing functionality, fix existing bugs, or both. As illustrated in Fig. 2, each feature is described in a markdown file, which includes a title, description, examples, and a list of files which may be relevant. For each feature, we write a list of unit tests without the help of coding assistants to ensure accurate evaluation of the implementation. In addition, we write a ground-truth solution to understand the potential conflicts between features and to verify that the given feature can be implemented on the repository and pass the unit tests. The tests and the ground-truth solution are not provided to the agents to prevent test leakage.

Task composition For each repository state, we create a pool of feature candidates. These features are compatible and potentially conflicting. “Compatible” means the features can be implemented jointly. To verify this, we produce a joint ground-truth solution of all features in the pool, which passes all individual unit tests. “Potentially conflicting” means the features have overlapping code logic changes that influence each other. In our dataset, 77.3% of tasks have conflicting ground-truth solutions. As a result, CooperBench tasks are not adversarial, but still require the capability to cooperate under conflicts by communicating individual goals, understanding others’ plans, and negotiating mutually compatible solutions.

Action space Agents can take two kinds of actions in real time: the communication tool and computer-use tools. The communication tool allows agents to send open-ended natural language messages to each other, and the computer-use tools include file and terminal operations. In our paper, we limit the computer-use tools to local operations to control the experiments. In the future, researchers could consider GUI and browser-based actions to expand the tasks the agents can take. Both agents can use these tools at any time, without synchronizing their turns with each other. This not only raises the flexibility of agents, but also poses challenges for agents to communicate and execute commands in a timely manner. In our benchmarking process, we use cloud virtual machines for agents to ensure isolated workspaces and sufficient resources. We set an upper-bound number (100)² of actions an agent can take to complete the tasks.

2.2. Evaluation pipeline

Cooperation is hard to evaluate, but we make the product of the cooperation verifiable. CooperBench evaluates tasks based on two criteria: (1) compatible solutions and (2) implementation correctness.

Solution compatibility After the two agents complete execution, we attempt to merge their resulting patches using `git merge-file -L patch_1 -L repo_state -L patch_2`. This operation captures whether the independently produced solutions are structurally compatible. In practice, some merge failures arise from superficial differences such as formatting or indentation styles (e.g., K&R versus Allman) rather than substantive conflicts. To avoid treating such cases as coordination failures, we train a small coding model (Qwen 3 Coder 1.5B; Yang et al. 2025) on synthetic examples to resolve trivial merge conflicts when standard merging fails. This step ensures that the compatibility check reflects semantic agreement between solutions rather than low-level stylistic discrepancies, while leaving the overall cooperation score largely unaffected (App. § B). If even the coding model cannot produce a patch without conflicts, the agents both fail the tasks.

¹Agents have the freedom to redivide the two features as long as the merged patch implements both features. Agents perform this kind of coordination occasionally well. Check out §6.4 for concrete examples.

²We do not observe performance gains on our tasks from raising this number.

### Page 5
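
The paper's compatibility check is an ordinary three-way file merge. A minimal per-file sketch, assuming `git` is on the PATH (the wrapper function and its boolean return are ours; the real pipeline additionally falls back to the trained coding model when this merge fails):

```python
import subprocess

def patches_compatible(ours: str, base: str, theirs: str) -> bool:
    """Three-way merge of two independently patched copies of one file
    against the shared repository state. git merge-file rewrites `ours`
    in place and exits with the number of remaining conflicts, so a
    zero return code means the two patches are structurally compatible."""
    result = subprocess.run(
        ["git", "merge-file",
         "-L", "patch_1", "-L", "repo_state", "-L", "patch_2",
         ours, base, theirs],
        capture_output=True,
    )
    return result.returncode == 0
```

For example, if patch 1 edits only the top of a file and patch 2 edits only the bottom, the merge succeeds and `ours` ends up containing both edits; if both patches rewrite the same line differently, the call leaves conflict markers and returns `False`.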

Implementation correctness If we successfully merge the two patches into the repository state, we run both sets of unit tests on the merged codebase. As mentioned before, we do not restrict agents to only finish their own work. If they can coordinate well, they can divide their two features in a different way as long as the merged solution can pass the two features’ tests. This evaluation pipeline ensures a rigorous evaluation of the cooperation outcome.
|
|
570
|
+
|
|
571
|
+
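The two checks described above can be sketched as a small wrapper around git merge-file and a test runner. This is a sketch under stated assumptions: the patch files, test paths, and the helper itself are illustrative, not the paper's actual harness.

```python
import subprocess

def merge_and_test(patch_1: str, repo_state: str, patch_2: str, workdir: str) -> bool:
    """Return True only if the patches merge cleanly AND the merged code passes tests."""
    # Three-way merge, as in the paper: git merge-file -L patch_1 -L repo_state -L patch_2.
    # merge-file writes the result into its first file argument and exits non-zero
    # when there are conflicts.
    merge = subprocess.run(
        ["git", "merge-file", "-L", "patch_1", "-L", "repo_state", "-L", "patch_2",
         patch_1, repo_state, patch_2],
        cwd=workdir,
    )
    if merge.returncode != 0:
        return False  # structurally incompatible solutions (coordination failure)
    # Implementation correctness: run both features' unit tests on the merged codebase
    # (test directories here are hypothetical).
    tests = subprocess.run(["pytest", "tests/feature_a", "tests/feature_b"], cwd=workdir)
    return tests.returncode == 0
```

A conflict-free merge alone is not a pass; the merged state must also satisfy both features' test suites, mirroring the compatibility-then-correctness order of the pipeline.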
2.3. Dataset Construction

[Figure 3 diagram: a three-stage pipeline. Stage I: Repository & PR selection (open-source repos of different languages with over 1K stars and dataset creator expertise; well documented Issue/PR; has new/updated tests; <200 lines and 2 files). Stage II: Feature creation (curator + LLM ideation; feature pool; sanitize & format; manually creating test files). Stage III: Environment setup (reproducible execution). The result is a set of 34 PRs for 12 different repos in 4 languages.]

Figure 3 | The CooperBench construction pipeline. Each task is carefully engineered by domain experts to ensure conflicts are realistic, resolvable, and representative of production software development challenges.

CooperBench is constructed through a three-stage pipeline that grounds tasks in real software development and enables controlled evaluation of coordination (Fig. 3). To create the pools of features, we start from real-world feature implementations and proceed as follows: (Stage I) we write anchor features drawn from popular repositories, each a slight modification of a real pull request (PR) authored by human contributors; (Stage II) for each anchor feature, we expand the pool by introducing a family of adjacent features authored by human annotators, representing plausible alternative features that could realistically co-occur; and (Stage III) we validate the compatibility of each feature pool by executing and testing all feature combinations in a controlled environment to rule out intrinsically incompatible specifications.

Stage I: Repository and PR Selection In the first stage we select twelve actively maintained open-source repositories spanning Python, TypeScript, Rust, and Go. Each repository exceeds one thousand GitHub stars and does not appear in SWE-Bench (Jimenez et al., 2023) or Multi-SWE-Bench (Zan et al., 2025), reducing data contamination risk. Selection is guided by curator expertise so that each repository is assigned to an author familiar with its architecture and development practices. We extract PRs that meet strict inclusion constraints: clear feature description, code+tests, feature addition, bounded change size, and robust tests. Appendix A provides full selection details and thresholds, and App. Tab. 3 summarizes the repository distribution.

Stage II: Feature Extraction and Augmentation In the second stage, we convert each selected PR into a feature pool containing one anchor feature and multiple synthetic adjacent features. We sanitize and rewrite original PR descriptions into self-contained specifications to prevent information leakage.

### Page 6

Curators author adjacent features to plausibly co-occur and to create natural overlap without adversarial specifications (with LLM-assisted ideation). Appendix A provides full details on adjacent-feature design, manual test writing, and gold-solution validation. All features derived from the same base commit constitute a feature pool with two to twelve features. To ensure compatibility among all features in a pool, we construct a single gold patch that jointly implements all features in each set and passes all associated tests.

Stage III: Environment and Reproducibility The final stage provides a deterministic execution environment for evaluating agents. Each task set includes an automated setup script that clones the repository at the exact base commit, installs dependencies, and executes the full test suite to verify the environment. To ensure consistent behavior across hardware and operating systems, we additionally provide containerized environments that encapsulate the complete repository state and all runtime dependencies. These environments guarantee reproducible execution and isolate agent behavior from external variability, enabling reliable measurement of coordination performance through the evaluation pipeline described in §2.2.

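A minimal sketch of such a per-task setup script, written here as a Python wrapper; the repository URL, install command, and test runner are assumptions, not the benchmark's actual scripts.

```python
# Hypothetical sketch of a CooperBench-style setup script: clone at the
# exact base commit, install dependencies, and run the test suite once to
# verify the environment. Commands and layout are illustrative.
import subprocess

def setup_environment(repo_url: str, base_commit: str, workdir: str) -> bool:
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)
    subprocess.run(["pip", "install", "-e", "."], cwd=workdir, check=True)
    # A green baseline test run confirms the environment before any agent edits.
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0
```

Running the suite once before agents start matters: any later test failure can then be attributed to the agents' changes rather than a broken environment.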
Dataset composition and feature-complexity statistics are reported in App. A. Together, these findings demonstrate that CooperBench features are individually tractable and realistic, ensuring that the benchmark’s primary challenge arises from coordinating partially overlapping implementations rather than from executing unusually complex or oversized programming tasks.

3. Experiment Settings

CooperBench allows us to study the following research questions. First, how well can current state-of-the-art foundation models cooperate with each other when they are used in coding agents? Second, do agents use the communication channel effectively for coordination? And third, why do agents fail or succeed on CooperBench?

In order to evaluate models fairly, we create an agent framework incorporating the leading open-source coding agent framework OpenHands (v0.54) (Wang et al., 2024b). The two agents perform their own work in their respective Docker-based containers without interruption from the other agent. Since OpenHands was not designed as a framework for multi-agent cooperation, we created a communication tool (§2.1) using an SQL database for message passing. This communication tool supports a message-sending action: when an agent sends a message, the other agent receives it immediately and includes it in the prompt of its next step. This communication setting achieves both real-time communication and asynchronous execution. We open-source this framework not only to ensure reproducibility of our experiments, but also to provide a starting point for researchers to build multi-agent cooperation systems that can perform multiple tasks and resolve conflicts.

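An SQL-backed message-passing tool of this kind can be sketched with SQLite; the schema and function names below are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of SQL-based inter-agent message passing: each send appends
# a row, and the receiver polls for rows it has not yet seen, injecting them
# into the agent's next prompt. Table and column names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages "
    "(id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, body TEXT)"
)

def send(sender: str, recipient: str, body: str) -> None:
    conn.execute(
        "INSERT INTO messages (sender, recipient, body) VALUES (?, ?, ?)",
        (sender, recipient, body),
    )
    conn.commit()

def receive(recipient: str, last_seen: int) -> list:
    # Returns (id, sender, body) for all new messages; the caller advances
    # last_seen to the highest id it has consumed.
    return conn.execute(
        "SELECT id, sender, body FROM messages WHERE recipient = ? AND id > ?",
        (recipient, last_seen),
    ).fetchall()

send("agent_a", "agent_b", "I will edit prompts.py; please avoid that file.")
new = receive("agent_b", 0)
```

Because each agent polls on its own schedule, this design gives exactly the combination the text describes: messages arrive for the partner's next step (real-time) while neither agent blocks on the other (asynchronous execution).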
However, note that CooperBench is not tied to this agent framework or communication tool. In this paper, we are primarily concerned with foundation models’ intrinsic capability to cooperate, so we do not compare different agent frameworks or creative methods to enhance coordination. In the future, researchers should use CooperBench to compare different models, different frameworks, and different combinations as well. We especially encourage researchers to develop novel frameworks or to train agents to achieve higher Coop scores or to close the Solo-Coop gaps (§4) on CooperBench. Similarly, we encourage researchers to develop other communication tools, e.g., screen sharing, to expand the communication bandwidth or reduce communication noise.

We evaluate the performance of five language models, both closed-source and open-source ones: GPT-5, Claude 4.5 Sonnet, MiniMax-M2, Qwen3-Coder-30B-A3B-Instruct, and Qwen3-30B-A3B-Instruct-2507. We serve the two Qwen models via vLLM³, GPT-5 and MiniMax models via their respective official APIs, and the Claude model through GCP.

### Page 7

[Figure 4: Left panel: per-model Solo vs. Coop success rates (gpt-5 0.48 vs. 0.28; claude 0.47 vs. 0.26; minimax 0.36 vs. 0.14; qwen coder 0.22 vs. 0.13; qwen 0.06 vs. 0.05). Right panel: success rate as a function of relative difficulty (0.0 to 1.0) for Solo and Coop.]

Figure 4 | Left: Under the Coop setting, agents with different foundation models perform significantly worse than under the Solo setting, except for Qwen3-30B-A3B-Instruct-2507, which performs poorly under both settings. This Solo-Coop gap is what we call the “coordination gap”. Right: The relationship between tasks’ technical difficulty and the Solo-Coop gap. The shaded area has a large middle section, showing that the coordination gap is larger for middle-difficulty tasks than for tasks that are extremely easy or difficult.

4. How well are agents able to cooperate with each other?

In CooperBench, each of the two agents is assigned a feature to implement; we call this the Coop setting, to distinguish it from the Solo baseline, in which both tasks are assigned to one agent. For humans, teams should perform better or faster than individuals; this is the bottom line for cooperation to be considered functional. We hypothesize that for agents, the advantage of cooperation is overwhelmed by their inability to coordinate. This should lead to a “coordination gap”: two agents perform worse than one agent on the same workload.

The curse of coordination. As shown in Fig. 4 (Left), across all models, success rates under the Coop setting are consistently lower than those under the Solo setting, which means that when two agents need to coordinate, they perform even worse than one agent “solo”-ing the two features. This coordination gap is as large as 50% in the leading models: GPT-5, Claude Sonnet 4.5, and MiniMax M2. The Qwen models have smaller gaps, but their Solo scores are much lower as well. All error bars in Fig. 4 are 95% Wilson confidence intervals computed over task sets (App. C).

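For reference, the 95% Wilson score interval behind such error bars can be computed with the standard formula; the success counts in the usage line are illustrative.

```python
# Wilson score interval for a binomial proportion (z = 1.96 for 95%).
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(14, 50)  # e.g., 14 successes out of 50 tasks
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for the small per-bucket task counts involved here.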
Mid-difficulty crisis. As shown in Fig. 4 (Right), the gap between the two settings is larger and more significant on tasks of middle-level technical difficulty than on tasks that are too easy or too hard. Here we stratify tasks by relative difficulty. For each task pair t, we define a raw difficulty score d(t) = 1 − (1/|M|) ∑_{m∈M} Solo_m(t), where Solo_m(t) denotes model m’s Solo success on t.

3 https://vllm.ai/

### Page 8

[Figure 5 panels: (a) per-model success rate (claude, gpt-5, minimax, qwen, qwen coder) with comm vs. no comm; (b) per-model conflict rate with comm vs. no comm; (c) communication events as a percentage of all events, by message type (Plan, Question, Answer, Update, Ack), with shares of 20.0%, 16.3%, 13.6%, 6.2%, and 3.3%.]

Figure 5 | (a) Effect of inter-agent communication on cooperation success or lack thereof. All agents fail to use communication to improve cooperation success. (b) Communication substantially reduces naive merge conflicts across all models. (c) Communication overhead as a percentage of all execution events, broken down by message type. Models that communicate more (e.g., Claude Sonnet 4.5, GPT-5) show larger reductions in conflict rate.

For visualization, we linearly rescale d(t) to d̃(t) ∈ [0, 1] using the minimum and maximum d(t) values in the benchmark. We bucket tasks by d̃(t) and report success rates as a function of d̃(t) for both Solo and Coop. This result indicates that agents struggle to balance the two pressures of technical difficulty and cooperation difficulty. When tasks are easy, agents can spare more effort for coordination, but as tasks get harder, agents cannot coordinate effectively.

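The difficulty score and rescaling described above can be sketched as follows; the input format (nested dicts of 0/1 Solo outcomes) is an assumption.

```python
# d(t) = 1 - mean per-model Solo success on t, then min-max rescaled to [0, 1].
# Assumes tasks vary in difficulty (max > min), so the rescaling is defined.
def relative_difficulty(solo_success):
    """solo_success[model][task] is 1 if the model solved the task Solo, else 0."""
    models = list(solo_success)
    tasks = list(next(iter(solo_success.values())))
    raw = {t: 1 - sum(solo_success[m][t] for m in models) / len(models)
           for t in tasks}
    lo, hi = min(raw.values()), max(raw.values())
    return {t: (raw[t] - lo) / (hi - lo) for t in raw}
```

A task every model solves Solo gets d̃(t) = 0 (easiest); a task no model solves gets d̃(t) = 1 (hardest), matching the stratification used for Fig. 4 (Right).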
Scaling the number of cooperating agents. Our hypothesis is that increasing the number of agents in the same cooperative workspace exacerbates coordination overhead (e.g., more context to track and more opportunities for inconsistent plans), leading to lower end-to-end success. To probe this directly, we run a small-scale experiment using 46 tasks from 3 separate task sets, where we scale the number of concurrently cooperating agents from 2 to 4 while keeping the cooperative setting fixed. We observe a monotonic decline in success as the number of agents increases: performance drops from 68.6% with 2 agents to 46.5% with 3 agents, and further to 30.0% with 4 agents, reinforcing the “curse of coordination” beyond the 2-agent setting.

5. What is the role of communication in agent-agent cooperation?

In CooperBench, the communication tool we provide is the only channel agents can use to coordinate with each other. Can agents use it effectively? We hypothesize that although agents might actively use the tool, their communication may be far from effective or efficient. To evaluate this, we compare against a baseline setting in which the communication tool is disabled, i.e., “no comm”.

Communication does not lead to better cooperation. As shown in Fig. 5 (a), none of the models effectively leverages the communication tool to achieve higher cooperation success. The difference between the “with comm” and “no comm” settings is not statistically significant. This shows that the existence of the communication tool does not help coordination. Does this mean agents are not using it? We quickly negate this question by examining tool usage and the conflict rate.

### Page 9

Communication reduces merge conflicts. As shown in Fig. 5 (b), communication does significantly reduce merge conflicts between patches for Claude Sonnet 4.5, GPT-5, MiniMax M2, and Qwen Instruct. This shows that agents can leverage the communication tool to reduce overlap in their work, even though merely avoiding conflicts does not guarantee cooperation success. Communication also consumes a meaningful share of the agent’s action budget. Fig. 5 (c) reports the frequency of all communication speech-act types: agents spent as much as 20% of their steps on communication, within which planning, questioning, and updating each account for almost a third of the communication steps. But why does this much effort in communication not translate into better cooperation?

What distinguishes effective communication? To understand why communication helps conflicts but not success, we analyze what successful communication looks like. Three patterns emerge.

First, successful agents plan more and question less. Trajectories that avoid conflicts have a Plan:Question ratio of 2.04, compared to 1.31 for conflict trajectories. This suggests that questions are a symptom of coordination problems, not a cure. Agents that are already struggling tend to ask more questions, but questioning does not prevent conflicts.

Second, first-turn planning is the strongest predictor. Having a Plan message in the very first turn nearly halves the conflict rate (29.4% vs 51.5%). This effect is robust across difficulty levels: in 7 out of 8 difficulty buckets, first-turn planning significantly reduces conflicts, with the effect actually stronger for harder tasks (39% reduction at the highest difficulty).

Third, specificity matters. Successful trajectories contain significantly more concrete references: 32.6 line-number mentions versus 22.5, and 13.1 file-path mentions versus 10.0. Agents that communicate where they are editing, with specific line ranges, successfully avoid overlapping changes.

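Trajectory statistics of this kind can be sketched from a message log as follows; the message format and the regular expressions are assumptions, not the paper's analysis code.

```python
# Illustrative computation of a Plan:Question ratio and counts of concrete
# references (line numbers, file paths) over a hypothetical message log.
import re

def comm_stats(messages):
    plans = sum(1 for m in messages if m["type"] == "Plan")
    questions = sum(1 for m in messages if m["type"] == "Question")
    text = " ".join(m["body"] for m in messages)
    return {
        "plan_question_ratio": plans / max(questions, 1),
        "line_refs": len(re.findall(r"\blines? \d+", text)),
        "path_refs": len(re.findall(r"\b[\w./-]+\.(?:py|ts|rs|go)\b", text)),
    }

stats = comm_stats([
    {"type": "Plan", "body": "I will edit outlines/prompts.py at lines 100-104"},
    {"type": "Question", "body": "Which approach would you prefer?"},
    {"type": "Plan", "body": "Then a line 7 change in main.go"},
])
```

By these measures, the first trajectory fragment above would score as "specific": it names both a file path and a line range, the kind of spatial detail the successful trajectories contain.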
Spatial vs. semantic coordination. These findings explain why communication helps conflicts but not success. Merge conflicts are fundamentally a spatial coordination problem: agents must agree on who edits which lines. The patterns above (early planning, specific line numbers, file paths) all address spatial coordination, and they work.

However, task success requires semantic coordination: understanding what to implement, not just where. Our case study in Appendix I illustrates this gap. Two agents successfully coordinated on line numbers and edit ranges (spatial), yet failed because they never discussed the actual parameter values their implementations should use (semantic). They solved the “formatting” problem of avoiding overlapping edits but not the “design” problem of ensuring compatible implementations.

Repetition, Unresponsiveness, and Hallucination. Beyond the spatial-semantic gap, the communication itself is often flawed. We identify three major communication problems and show their frequencies in Fig. 6. We automatically detect these patterns using an LLM-as-judge approach with a precision-focused taxonomy; see Appendix F for the full rubric and evidence requirements. Repetition consumes budget without adding constraints a partner can act on, which is consistent with high communication overhead without commensurate gains in end-to-end success. Unresponsiveness breaks the feedback loop when one agent asks for a decision that gates implementation, and incorrectness creates false shared context, such as asserting an interface decision or a completed change that is not actually satisfied. Hallucination produces noise that makes it hard for partners to coordinate under imperfect information.

In this section, we show that the communication tool is heavily used, but not properly leveraged by agents for coordination. This shows that agents lack a critical pragmatic understanding of language: communication is not just about message passing, but about achieving certain functions through passing the messages. Agents are “talking” a lot, but they cannot achieve their communication goals when the communication channel is jammed with repetition, unanswered questions, or false information.

### Page 10

[Figure 6 panels (per-model values listed in extraction order: qwen coder, qwen, minimax, gpt-5, claude): (a) Average conversation turns: 1.1, 3.2, 10.8, 15.5, 14.5. (b) Repetition (“Repeats same information”, “Near-duplicate status blocks”): 0.0%, 6.8%, 17.8%, 5.1%, 37.1% of conversations. (c) Unresponsiveness (“No reply to direct question”, “Reply but ignores question”, “Vague / non-answer”): 0.0%, 1.9%, 21.3%, 5.8%, 9.9% of conversations. (d) Hallucination (“Plan drift / unilateral deviation”, “Hallucination (uncorrected)”, “Hallucination (corrected)”): 0.0%, 0.0%, 5.4%, 2.1%, 6.9% of conversations.]

Figure 6 | Breakdown of frequencies of different kinds of communication errors.

6. What are the coordination failures that the agents exhibit?

Section 5 showed that communication alone does not improve coordination. Why not? We find that even when agents communicate their plans, they struggle to honor commitments and anticipate partner actions. Coordination failures stem from three capability gaps: communication (failing to exchange key information), commitment (not following through on promises), and expectation (failing to model what partners are doing). We first categorize failures by their observable symptoms (§6.1), then identify these underlying causes (§6.2).

6.1. Failure Symptoms

We analyze all failed Coop trajectories across all five models on the full dataset. Through iterative qualitative coding, we develop the failure-symptom taxonomy shown in Tab. 1. We then use GPT-5 as an LLM-as-a-Judge to categorize trajectories at scale, yielding the frequency distribution in Tab. 1. The resulting vocabulary provides a structured way to diagnose coordination breakdowns. See App. G for the annotation procedure and human validation.

6.2. Failure Reasons

Symptoms describe what went wrong; causes explain why. To identify the underlying capability gaps, we manually reviewed 50 failed Coop traces. For each trace, we examined the symptom labels, conversation logs, and merged artifacts to determine why coordination broke down. We grouped root causes into the three categories shown in Tab. 2. Unlike symptoms, which can be reliably detected by an LLM annotator, causes require deeper interpretation of the coordination dynamics and are therefore manually assigned.

6.3. Representative examples of capability gaps

We provide one representative example for each coordination capability gap. Additional symptom-level examples are available in Appendix H.

### Page 11

Table 1 | Coordination failure symptoms. Observable patterns in how coordination breakdowns surface in merged artifacts.

| Symptom | Meaning | % |
| --- | --- | --- |
| Work overlap | Both agents independently implement the same functionality, duplicating work and overwriting details. | 33.2 |
| Divergent architecture | Incompatible design decisions lead to semantic loss even under a clean merge. | 29.7 |
| Repetition | Verbose status messages add little new information and reduce signal. | 14.7 |
| Unresponsiveness | Direct questions or requests are not answered, breaking the decision loop. | 8.7 |
| Unverifiable claims | Agent asserts a change or interface decision without evidence the partner can check (no checkable commitment). | 4.3 |
| Broken commitment | Confident completion claims create false shared context when the promised change is absent. | 3.7 |
| Dependency access | Missing risk communication leaves agents unable to anticipate merged dependency interactions (e.g., circular imports). | 1.7 |
| Placeholder misuse | An explicit integration contract exists but is applied differently than agreed. | 1.5 |
| Parameter flow | Ambiguity about a changing interface leaves one agent implementing against an outdated contract. | 1.3 |
| Timing dependency | Agents agree on order but fail to communicate an enforceable plan that preserves it after merge. | 1.1 |

Expectation. In the first example, Agent A announces it will modify prompts.py and call B’s get_global_filters(). Agent B states it will insert GLOBAL_FILTERS at a specific location. Both agents communicate their plans explicitly, yet the merge fails. The problem is not missing information but failure to integrate it. Despite hearing B’s plan, A proceeds as if B’s code won’t exist. This is the most common cause, reflecting a fundamental difficulty in maintaining an accurate model of partner state during independent work.

[Illustration. Expectation: failure to model partner state. Agent A: “I will modify outlines/prompts.py. I’m removing _template_from_str() and _template_from_file() methods. My create_jinja_env() function will call your get_global_filters().” Agent B (task: add global filter registry): “I will insert a block defining GLOBAL_FILTERS + register/unregister APIs.” Action trace legend: Bash, View, Edit, Comm.]

Commitment. In the second example, the agent promises “I will add bypass check at lines 100–104, happens FIRST in get().” Later it claims completion with a checkmark. But after the merge, the bypass code is missing. The partner trusted this claim and built on it, but under workspace isolation, trust was all they had. The commitment was unverifiable: no pasted signature, no diff, nothing the partner could check without access to the branch.

### Page 12

Table 2 | Coordination capability gaps. Underlying causes inferred through qualitative analysis of failure traces.

| Cause | Definition | % |
| --- | --- | --- |
| Expectation | Cases where one agent has clearly communicated what they are doing, but the other agent still treats the situation as if that work is not being done. This reflects a failure to model the state of the other agent’s code changes and what that means for the system as a whole. | 42 |
| Commitment | Cases where an agent is not doing the things they promised to do. This includes failures to establish or maintain verifiable integration contracts, where agents make commitments but do not follow through on them. | 32 |
| Communication | Breakdowns in using language to coordinate. This includes failures in information sharing and decision loops between agents, where agents do not effectively communicate their intentions, questions, or status updates. | 26 |

|
|
1368
|
+
|
|
1369
|
+
*Figure: Commitment failure (“Failure to follow through on promises”). The agent promises “I will add bypass check at lines 100-104, happens FIRST in get().”, later reports “Implementation complete! ✓ Added bypass() context manager method”, yet after merge the bypass code is missing.*
Communication. In the third example, Agent A asks a direct question, “Which approach would you prefer?” The response is silence. Without an answer, the coordination loop collapses. Agent A needed a decision to proceed, and without one, both agents continued with potentially incompatible assumptions. Unlike expectation failures (where information exists but isn’t integrated) or commitment failures (where promises aren’t kept), this is a failure to even establish shared context.
*Figure: Communication failure (“Breakdown in using language to coordinate”). One agent asks: “Which approach would you prefer? I want to ensure we don’t lose any functionality while resolving this conflict.” The partner gives no response at all.*
The examples above reveal why coordination, rather than raw coding ability, is often the limiting factor. The common thread is partial observability. Each agent acts while holding an uncertain model of its partner’s state, edits, and commitments. A merge can be conflict-free yet still embed incompatible assumptions.

### Page 13

These causes manifest through the symptoms in Tab. 1. Expectation failures produce work overlap and silent overwrites, commitment failures lead to unverifiable claims and broken promises, and communication failures result in unresponsiveness and repetition.
These failures suggest current models lack reliable representations for (i) partner state (what the other agent has actually changed), (ii) checkable commitments (contracts verifiable after merge), and (iii) cross-branch integration reasoning (anticipating how independent patches interact). Coordination requires more than plausible code. It requires verifiable and actionable constraints for a partner operating under isolation. This explains why prompt optimization yields only marginal improvements (App. D). Most errors stem from coordination challenges, not prompt wording.
The trust paradox. We hypothesize that a deeper tension underlies expectation failures. Models are trained to be cautious, requiring observable evidence and resisting unverifiable assertions. This is a sensible default for single-agent interactions, where users may attempt to mislead the model. However, collaboration under workspace isolation requires the opposite. Agents must trust partner claims about states they cannot observe. When Agent A reports “I added the handler at line 50,” Agent B’s instinct is to verify, but verification fails because they are on separate branches. This mismatch between verification-first training and trust-requiring collaboration may partly explain why agents consistently fail to update their model of partner state despite explicit communication.
Effective collaboration likely requires lightweight mechanisms that turn conversation into verifiable shared state, such as pasted signatures, explicit insertion-point contracts, and integration checks before declaring safety. We now turn to successful cases to see what these mechanisms look like in practice.
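An insertion-point contract can be read concretely as a mechanical post-merge check. A minimal sketch, assuming a Python codebase and a contract phrased as “symbol X inserted after line N” (the function name and contract format are illustrative assumptions, not the paper’s tooling):

```python
import ast

def verify_insertion_contract(source: str, symbol: str, after_line: int) -> bool:
    """Check that `symbol` is defined in `source` and starts after `after_line`.
    This is what a partner could run post-merge instead of trusting chat claims."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # a merge that breaks the file also violates the contract
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == symbol:
                return node.lineno > after_line
    return False  # the promised symbol never landed

# Contract from chat: "I will insert get_image_mimetype() AFTER line 84."
# 84 blank lines stand in for the untouched region; the def starts at line 85.
merged = "\n" * 84 + "def get_image_mimetype(data):\n    return 'image/png'\n"
```

Running such a check turns “trust me, it is there” into a yes/no answer the partner can compute from the merged file alone.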
6.4. Emergent Coordination Behavior

Among successful runs, we observe coordination patterns that are largely absent from failures. These behaviors are not prompted or scaffolded; they emerge when agents successfully navigate partial observability. What they share is a shift from vague intentions to specific commitments that a partner can verify even without seeing the underlying work. We identify three such patterns.

Role division. Agents agree on who handles which part of the task and establish clear boundaries around their scope.
*Figure: Role division examples. In one trace, an agent runs a quick check up front, inspecting outlines/types/__init__.py to confirm the exact alphabetical insertion point, then delegates the concrete insertion tasks with a clear plan: “I’ll add the section header + octal_str in outlines/types/__init__.py; you add binary_str between them to keep alphabetical order,” and the partner replies “On my branch I’ll insert: header → binary_str → octal_str.” In another trace, an agent reads `src/click/termui.py` to understand how `click.edit` currently routes into the editor implementation, then proposes “You implement the environment-isolation feature and I’ll implement multi-file editing support. Please avoid the backend implementation; I will handle it end-to-end,” and the partner confirms: “I won’t touch the backend editor implementation. I’ll implement environment isolation in the CLI-layer logic only.”*
What distinguishes successful role division is mutual confirmation. Under partial observability, a unilateral declaration can easily be missed or misunderstood. When both agents explicitly acknowledge the split, they create verified shared understanding that both sides can rely on during independent work.

Resource division. Agents avoid collisions by partitioning shared resources, most commonly specific files, code ranges, or ownership blocks.
What makes resource division effective is specificity. Vague commitments cannot be verified and thus require trust. Line-level boundaries, by contrast, create safe zones where conflict is impossible.
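Such line-level boundaries can be checked for overlap before either agent edits. A small sketch, assuming each agent publishes its claimed (file, start, end) ranges; the data shapes and names are illustrative, not CooperBench tooling:

```python
def ranges_conflict(claims: dict[str, list[tuple[str, int, int]]]) -> list[tuple[str, str, str]]:
    """Given each agent's claimed (file, start_line, end_line) edit ranges,
    return every pair of agents whose claims overlap in the same file."""
    conflicts = []
    agents = sorted(claims)
    for i, a in enumerate(agents):
        for b in agents[i + 1:]:
            for file_a, s_a, e_a in claims[a]:
                for file_b, s_b, e_b in claims[b]:
                    # closed intervals overlap iff each starts before the other ends
                    if file_a == file_b and s_a <= e_b and s_b <= e_a:
                        conflicts.append((a, b, file_a))
    return conflicts

# Mirrors the figure below: one agent owns lines 68-84, the other inserts after line 84.
claims = {
    "agent_a": [("types.py", 68, 84)],
    "agent_b": [("types.py", 85, 90)],
}
```

Because the claims are explicit intervals rather than prose, an empty conflict list is a guarantee both sides can verify, not a promise they must trust.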
### Page 14

*Figure: Resource division examples. Agents partition files by explicit line ranges: “I will modify llama-index-core/llama_index/core/base/llms/types.py lines 68–84 (ImageBlock.image_to_base64)” is answered with “I will NOT edit lines 68–84. My plan is to insert get_image_mimetype() AFTER line 84 (starting at new line 85).” In another trace, one agent states “I will INSERT (no existing lines edited) into outlines/prompts.py between current line 14 and @dataclass Prompt at line 15. The inserted block defines GLOBAL_FILTERS + register/unregister APIs. Please avoid the inserted block,” and the partner replies “I already modified outlines/prompts.py line 7 (import field). I will edit only regions inside @dataclass Prompt by construction.”*
Negotiation. Agents resolve conflicting approaches by proposing alternatives and converging on a single plan before acting.
*Figure: Negotiation example. One agent proposes: “I checked the file… here are two clean options: (1) I add IsHash; you add import re + IsRegex; I handle all __init__.py exports. (2) You add IsRegex; I add IsHash; you handle all __init__.py exports. Which option do you prefer?” The partner replies: “I’m seeing a potential overlap between our plans… I’ll check the current file state first, then we’ll coordinate a clean split,” and later: “Let’s do option (1)… I’ve already added `import re` now. You add IsHash, then I’ll add IsRegex.” One agent then edits `dirty_equals/__init__.py` to export both new classes.*
Effective negotiation does cognitive work for both parties. By proposing mutually exclusive options that fully specify what each agent will do, one agent reduces a complex coordination problem to a simple choice. The result is not just agreement on intent but complete action specifications that leave nothing to interpret.

These coordination patterns are rare in our traces but their presence in successful cases suggests that the underlying capability exists. The challenge is not teaching agents new coordination skills but making existing ones reliable.
7. Related Work

Multi-agent LLM systems and tool-using coding agents have advanced rapidly, but reliable collaboration remains unresolved. Prior work largely evaluates task success under engineered interaction structure rather than free-form coordination under partial information.
Multi-agent LLM systems Many frameworks improve performance through structured interaction. CAMEL (Li et al., 2023a) and AutoGen (Wu et al., 2023) use conversation programming; MetaGPT (Hong et al., 2024) and ChatDev (Qian et al., 2024) emulate software organizations; Magentic-One (Fourney et al., 2024b), MAGIS (Tao et al., 2024), and AgileCoder (Nguyen et al., 2024) use explicit orchestrators. Even with such scaffolding, multi-agent systems exhibit high failure rates. Multi-agent configurations degrade performance by 39 to 70 percent relative to single-agent baselines (Su et al., 2025), and failure analyses identify inter-agent misalignment as a major category (Cemri et al., 2025). These findings suggest that externally imposed protocols mask rather than solve the underlying coordination problem. Sotopia (Zhou et al., 2024) provides a general framework for evaluating agents’ social intelligence, while our work focuses specifically on cooperative coding agents with verified tasks.
Tool-using coding agents such as SWE-agent (Yang et al., 2024), OpenHands (Wang et al., 2025), and Agentless (Xia et al., 2024) achieve strong results on SWE-bench (Jimenez et al., 2024).

### Page 15

However, these evaluations measure single-agent success rather than whether multiple peers can integrate changes without conflict under partial information.
Coordination benchmarks Existing benchmarks span games, embodied tasks, and reasoning. Hanabi (Forkel & Foerster, 2025) and Cicero (FAIR et al., 2022) test coordination under information asymmetry; MultiAgentBench (Zhu et al., 2025) and Collab-Overcooked (Sun et al., 2025) evaluate LLM collaboration; Tool-RoCo (Zhang et al., 2025a) and RoCoBench (Mandi et al., 2023) assess multi-robot cooperation. In software, SyncBench (Guo et al., 2025) tests divergent understanding and The Collaboration Gap (Davidson et al., 2025) finds that solo-capable models degrade when required to collaborate. These benchmarks typically enforce turn-taking or shared observability rather than testing code integration under workspace isolation. Agent-human collaboration benchmarks such as Co-Gym (Shao et al., 2025), HULA (Takerngsaksiri et al., 2025), and HAI-Eval (Luo et al., 2025) study settings where humans arbitrate. We instead study whether agents can coordinate autonomously.
Theory of Mind evaluation Effective coordination requires modeling partner beliefs and intentions, commonly referred to as Theory of Mind (Premack & Woodruff, 1978; Rabinowitz et al., 2018; Zhu et al., 2021). ToMBench (Chen et al., 2024), FANToM (Kim et al., 2023), and SoMi-ToM (Fan et al., 2025) evaluate theory of mind in LLMs, finding substantial gaps versus human performance. ToMSWE (Zhou et al., 2025) aims to build coding agents that can infer users’ mental states. Studies of cooperative games (Li et al., 2023b) and Generative Agents (Park et al., 2023) show emergent social behaviors but also challenges in translating these to verifiable collaborative work.
We isolate free-form coordination as the central object of evaluation. CooperBench assigns two agents partially overlapping features on a shared codebase while isolating their workspaces and restricting coordination to natural language. Unlike benchmarks that impose interaction structure or measure outcomes alone, we evaluate through coordination failures such as redundancy, inconsistent assumptions, and semantic breakage. We demonstrate the curse of coordination in a controlled setting with verifiable code integration, pointing to social intelligence as the bottleneck for effective agent teamwork.
8. Conclusion and Future Work

In a future where agents team with humans in high-stakes domains (Kim et al., 2025), accelerate science and technology research (Gottweis et al., 2025), and empower creative endeavors (Waikar, 2021), it is hard to imagine how an agent incapable of coordination would contribute to such a future, however strong its individual capabilities.
Our work demonstrates that coordination, not raw coding ability, is a central bottleneck for multi-agent software development. Through CooperBench, we show that frontier models like GPT-5 and Claude Sonnet 4.5 achieve only 25% success when two agents collaborate, roughly half the success rate of a single agent performing the same workload. This curse of coordination stems from three capability gaps: agents fail to communicate actionable information, deviate from their own commitments, and hold incorrect expectations about their partners.
Yet coordination is not beyond reach. In successful traces, we observe emergent behaviors such as role division, resource division, and negotiation that turn vague intentions into verifiable commitments. These patterns are rare, but their presence suggests the underlying capability exists; the challenge is making it reliable. With multi-agent training methods, e.g. Sotopia-π (Wang et al., 2024a; Yu et al., 2025), we can expect these emergent behaviors to be reinforced through the success of cooperation.

### Page 16
Our findings open several directions: (1) training objectives that reward coordination under partial observability, (2) lightweight protocols for verifiable commitments (e.g., shared signatures, insertion-point contracts), and (3) richer communication channels such as screen sharing to expand the modality beyond text. We release CooperBench as an open benchmark to measure progress on these fronts.
Although we focus on software development, our findings generalize to any domain involving role and resource conflicts under partial observability. We expect that the lack of social intelligence (the ability to understand others, communicate effectively, and coordinate actions) will remain a fundamental barrier limiting the real-world deployment of agents as teammates until these capabilities are explicitly developed.
Acknowledgments

This research is supported in part by grants from ONR grant N000142412532, NSF grant IIS-2247357, DSO National Laboratories (DSO), and support from SAP. We thank Google Cloud Platform and Modal Platform for their credits. We thank Yutong Zhang, Gavin Li, Hannah Cha, John Yang, Yijia Shao and all members of Stanford SALT Lab for their help and feedback throughout this project.
References

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail?, 2025. URL https://arxiv.org/abs/2503.13657.

Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, and Minlie Huang. ToMBench: Benchmarking theory of mind in large language models, 2024. URL https://arxiv.org/abs/2402.15052.

Yuyang Cheng, Yumiao Xu, Chaojia Yu, and Yong Zhao. HAWK: A hierarchical workflow framework for multi-agent collaboration, 2025. URL https://arxiv.org/abs/2507.04067.

Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, and Ece Kamar. The collaboration gap, 2025. URL https://arxiv.org/abs/2511.02687.

Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022. doi: 10.1126/science.ade9097. URL https://www.science.org/doi/abs/10.1126/science.ade9097.

Xianzhe Fan, Xuhui Zhou, Chuyang Jin, Kolby Nottingham, Hao Zhu, and Maarten Sap. SoMi-ToM: Evaluating multi-perspective theory of mind in embodied social interactions. In NeurIPS D&B, 2025. URL https://arxiv.org/abs/2506.23046.

Johannes Forkel and Jakob Foerster. Entropy is all you need for inter-seed cross-play in Hanabi, 2025. URL https://arxiv.org/abs/2511.22581.

### Page 17

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024a. URL https://arxiv.org/abs/2411.04468.

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024b. URL https://arxiv.org/abs/2411.04468.

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025.

Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, and Heng Ji. SyncMind: Measuring agent out-of-sync recovery in collaborative software engineering, 2025. URL https://arxiv.org/abs/2502.06994.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven K. S. Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024.

Saffron Huang, Bryan Seethor, Esin Durmus, Kunal Handa, Miles McCain, Michael Stern, and Deep Ganguli. How AI is transforming work at Anthropic, 2025. URL https://anthropic.com/research/how-ai-is-transforming-work-at-anthropic/.

Nicholas K Humphrey. The social function of intellect. In Growing points in ethology, pp. 303–317. Cambridge University Press, 1976.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.

Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM: A benchmark for stress-testing machine theory of mind in interactions, 2023. URL https://arxiv.org/abs/2310.15421.

Ji Woong Kim, Juo-Tung Chen, Pascal Hansen, Lucy Xiaoyang Shi, Antony Goldenberg, Samuel Schmidgall, Paul Maria Scheikl, Anton Deguet, Brandon M White, De Ru Tsai, et al. SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning. Science Robotics, 10(104):eadt5254, 2025.

### Page 18

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society. In Advances in Neural Information Processing Systems, 2023a.

Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 180–192, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.13. URL https://aclanthology.org/2023.emnlp-main.13/.

Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, and Hanan Salam. HAI-Eval: Measuring human-AI synergy in collaborative coding, 2025. URL https://arxiv.org/abs/2512.04111.

Zhao Mandi, Shreeya Jain, and Shuran Song. RoCo: Dialectic multi-robot collaboration with large language models, 2023. URL https://arxiv.org/abs/2307.04738.

Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, and Nghi D. Q. Bui. AgileCoder: Dynamic collaborative agents for software development based on agile methodology, 2024. URL https://arxiv.org/abs/2406.11912.

Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. URL https://openreview.net/forum?id=wM521FqPvI.

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442.

David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526, 1978. doi: 10.1017/S0140525X00076512. Cambridge University Press.

Goparaju Purna Sudhakar, Ayesha Farooq, and Sanghamitra Patnaik. Soft factors affecting the performance of software development teams. Team Performance Management: An International Journal, 17(3/4):187–205, 2011.

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024.

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. Machine theory of mind. In International Conference on Machine Learning, pp. 4218–4227. PMLR, 2018.

Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen, Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding, Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen, Haibo Ding, Panpan Xu, and Lin Lee Cheong. A systematic survey of automatic prompt optimization techniques. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33066–33098. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.emnlp-main.1681. URL http://dx.doi.org/10.18653/v1/2025.emnlp-main.1681.

### Page 19

Eric Raymond. The cathedral and the bazaar. Knowledge, Technology & Policy, 12(3):23–49, 1999.

Prateek Sahoo et al. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.

Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. Collaborative Gym: A framework for enabling and evaluating human-agent collaboration, 2025. URL https://arxiv.org/abs/2412.15701.

Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Scaling agents via continual pre-training, 2025. URL https://arxiv.org/abs/2509.13310.

Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, and Xiaojie Wang. Collab-Overcooked: Benchmarking and evaluating large language models as collaborative agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language
collaborative agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language
|
|
2084
|
+
|
|
2085
|
+
Processing, pp. 4922–4951. Association for Computational Linguistics, 2025. doi: 10.18653/v1/
|
|
2086
|
+
|
|
2087
|
+
2025.emnlp-main.249. URL http://dx.doi.org/10.18653/v1/2025.emnlp-main.249.
|
|
2088
|
+
|
|
2089
|
+
Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn,
|
|
2090
|
+
|
|
2091
|
+
Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. Human-in-the-loop
|
|
2092
|
+
|
|
2093
|
+
software development agents, 2025. URL https://arxiv.org/abs/2411.12924.
|
|
2094
|
+
|
|
2095
|
+
Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. MAGIS:
|
|
2096
|
+
|
|
2097
|
+
LLM-based multi-agent framework for github issue resolution. arXiv preprint arXiv:2403.17927,
|
|
2098
|
+
|
|
2099
|
+
2024.
|
|
2100
|
+
|
|
2101
|
+
Michael Tomasello. A natural history of human thinking. Harvard University Press, 2014.
|
|
2102
|
+
|
|
2103
|
+
Sachin Waikar. Artists’ perspective: How ai enhances creativity and
|
|
2104
|
+
|
|
2105
|
+
reimagines meaning, Apr 2021. URL https://hai.stanford.edu/news/
|
|
2106
|
+
|
|
2107
|
+
artists-perspective-how-ai-enhances-creativity-and-reimagines-meaning.
|
|
2108
|
+
|
|
2109
|
+
Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Yonatan Bisk, Graham Neubig,
|
|
2110
|
+
|
|
2111
|
+
and Hao Zhu. Sotopia-π: Interactive learning of socially intelligent language agents. In Proceedings
|
|
2112
|
+
|
|
2113
|
+
of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
|
|
2114
|
+
|
|
2115
|
+
pp. 12912–12940, 2024a.
|
|
2116
|
+
|
|
2117
|
+
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi
|
|
2118
|
+
|
|
2119
|
+
Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as
|
|
2120
|
+
|
|
2121
|
+
generalist agents. In The Thirteenth International Conference on Learning Representations, 2024b.
|
|
2122
|
+
|
|
2123
|
+
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan,
|
|
2124
|
+
|
|
2125
|
+
Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng,
|
|
2126
|
+
|
|
2127
|
+
Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert
|
|
2128
|
+
|
|
2129
|
+
Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software
|
|
2130
|
+
|
|
2131
|
+
developers as generalist agents. In International Conference on Learning Representations, 2025.
|
|
2132
|
+
|
|
2133
|
+
Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. Advances
|
|
2134
|
+
|
|
2135
|
+
in Neural Information Processing Systems, 35:24824–24837, 2022.
|
|
2136
|
+
|
|
2137
|
+
19
|
|
2138
|
+
|
|
2139
|
+
### Page 20
|
|
2140
|
+
|
|
2141
|
+
CooperBench: Why Coding Agents Cannot be Your Teammates Yet
|
|
2142
|
+
|
|
2143
|
+
Anita Williams Woolley, Christopher F Chabris, Alex Pentland, Nada Hashmi, and Thomas W
|
|
2144
|
+
|
|
2145
|
+
Malone. Evidence for a collective intelligence factor in the performance of human groups. science,
|
|
2146
|
+
|
|
2147
|
+
330(6004):686–688, 2010.
|
|
2148
|
+
|
|
2149
|
+
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun
|
|
2150
|
+
|
|
2151
|
+
Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and
|
|
2152
|
+
|
|
2153
|
+
Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv
|
|
2154
|
+
|
|
2155
|
+
preprint arXiv:2308.08155, 2023.
|
|
2156
|
+
|
|
2157
|
+
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying
|
|
2158
|
+
|
|
2159
|
+
llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489.
|
|
2160
|
+
|
|
2161
|
+
Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong,
|
|
2162
|
+
|
|
2163
|
+
Chulin Xie, Carl Yang, Dawn Song, and Bo Li. Guardagent: Safeguard llm agents by a guard
|
|
2164
|
+
|
|
2165
|
+
agent via knowledge-enabled reasoning, 2025. URL https://arxiv.org/abs/2406.09187.
|
|
2166
|
+
|
|
2167
|
+
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang
|
|
2168
|
+
|
|
2169
|
+
Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu,
|
|
2170
|
+
|
|
2171
|
+
Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin
|
|
2172
|
+
|
|
2173
|
+
Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu,
|
|
2174
|
+
|
|
2175
|
+
Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men,
|
|
2176
|
+
|
|
2177
|
+
Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren,
|
|
2178
|
+
|
|
2179
|
+
Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang,
|
|
2180
|
+
|
|
2181
|
+
Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu.
|
|
2182
|
+
|
|
2183
|
+
Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
|
|
2184
|
+
|
|
2185
|
+
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan,
|
|
2186
|
+
|
|
2187
|
+
and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering.
|
|
2188
|
+
|
|
2189
|
+
arXiv preprint arXiv:2405.15793, 2024.
|
|
2190
|
+
|
|
2191
|
+
Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad
|
|
2192
|
+
|
|
2193
|
+
Majumder, Hao Zhu, Paul Pu Liang, and Jiaxuan You. Sotopia-rl: Reward design for social
|
|
2194
|
+
|
|
2195
|
+
intelligence. arXiv preprint arXiv:2508.03905, 2025.
|
|
2196
|
+
|
|
2197
|
+
Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu,
|
|
2198
|
+
|
|
2199
|
+
Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su,
|
|
2200
|
+
|
|
2201
|
+
Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark
|
|
2202
|
+
|
|
2203
|
+
for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605.
|
|
2204
|
+
|
|
2205
|
+
Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, and
|
|
2206
|
+
|
|
2207
|
+
Toshiharu Sugawara. Tool-roco: An agent-as-tool self-organization large language model bench-
|
|
2208
|
+
|
|
2209
|
+
mark in multi-robot cooperation, 2025a. URL https://arxiv.org/abs/2511.21510.
|
|
2210
|
+
|
|
2211
|
+
Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui
|
|
2212
|
+
|
|
2213
|
+
Zhou, and Bo An. Agentorchestra: Orchestrating hierarchical multi-agent intelligence with the
|
|
2214
|
+
|
|
2215
|
+
tool-environment-agent(tea) protocol, 2025b. URL https://arxiv.org/abs/2506.12508.
|
|
2216
|
+
|
|
2217
|
+
Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang,
|
|
2218
|
+
|
|
2219
|
+
Xiang Deng, Dawn Song, Huan Sun, and Yu Su. Webguard: Building a generalizable guardrail
|
|
2220
|
+
|
|
2221
|
+
for web agents, 2025. URL https://arxiv.org/abs/2507.14293.
|
|
2222
|
+
|
|
2223
|
+
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe
|
|
2224
|
+
|
|
2225
|
+
Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation
|
|
2226
|
+
|
|
2227
|
+
20
|
|
2228
|
+
|
|
2229
|
+
### Page 21
|
|
2230
|
+
|
|
2231
|
+
CooperBench: Why Coding Agents Cannot be Your Teammates Yet
|
|
2232
|
+
|
|
2233
|
+
for social intelligence in language agents. In The Twelfth International Conference on Learning
|
|
2234
|
+
|
|
2235
|
+
Representations, 2024.
|
|
2236
|
+
|
|
2237
|
+
Xuhui Zhou, Valerie Chen, Zora Zhiruo Wang, Graham Neubig, Maarten Sap, and Xingyao Wang.
|
|
2238
|
+
|
|
2239
|
+
Tom-swe: User mental modeling for software engineering agents. arXiv preprint arXiv:2510.21903,
|
|
2240
|
+
|
|
2241
|
+
2025.
|
|
2242
|
+
|
|
2243
|
+
Hao Zhu, Graham Neubig, and Yonatan Bisk. Few-shot language coordination by modeling theory
|
|
2244
|
+
|
|
2245
|
+
of mind. In International conference on machine learning, pp. 12901–12911. PMLR, 2021.
|
|
2246
|
+
|
|
2247
|
+
Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong
|
|
2248
|
+
|
|
2249
|
+
Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You. Multiagentbench: Evaluating the
|
|
2250
|
+
|
|
2251
|
+
collaboration and competition of llm agents, 2025. URL https://arxiv.org/abs/2503.01935.
|
|
2252
|
+
|
|
2253
|
+
Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen
|
|
2254
|
+
|
|
2255
|
+
Schmidhuber. Language agents as optimizable graphs, 2024. URL https://arxiv.org/abs/2402.
|
|
2256
|
+
|
|
2257
|
+
16823.
|
|
2258
|
+
|
|
2259
|
+
A. Dataset Details

This section provides detailed statistics on the CooperBench benchmark. Repository selection criteria are described in §2.3.

A.1. Repository Distribution

Table 3 shows the full breakdown of repositories, features, and task pairs.

Table 3 | Distribution of benchmark tasks across source repositories. Feature counts and task pairs are reported as aggregated totals across base commits (PRs) within each repository.

| Language | Repository | #PRs | Features (Σ) | Task Pairs (Σ) | License |
|---|---|---|---|---|---|
| Python | DSPy | 4 | 23 | 55 | MIT |
| | LlamaIndex | 3 | 16 | 39 | MIT |
| | Pillow | 3 | 15 | 30 | MIT-CMU |
| | Pallets Click | 3 | 27 | 115 | BSD-3 |
| | Pallets Jinja | 3 | 30 | 135 | BSD-3 |
| | HuggingFace Datasets | 3 | 13 | 26 | Apache-2.0 |
| | Outlines | 3 | 22 | 79 | Apache-2.0 |
| | Tiktoken | 1 | 10 | 45 | MIT |
| | DirtyEquals | 1 | 9 | 36 | MIT |
| TypeScript | React Hook Form | 2 | 11 | 25 | MIT |
| Go | Chi Router | 3 | 13 | 22 | MIT |
| Rust | Typst | 3 | 10 | 45 | Apache-2.0 |
| Total | 12 repositories | 34 | 199 | 652 | |

Note: Each repository contains 1–4 base commits (PRs), each defining an independent feature pool. Task pairs are constructed within each PR as C(n, 2) = n(n − 1)/2 over its n features and summed across PRs.
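As a sanity check on the pair counts above, the per-PR construction can be sketched in a few lines (illustrative only; `task_pairs` is our name, not the benchmark's tooling):

```python
from math import comb

def task_pairs(pool_sizes):
    """Sum unordered feature pairs C(n, 2) over each PR's feature pool."""
    return sum(comb(n, 2) for n in pool_sizes)

# Tiktoken has a single PR with 10 features: C(10, 2) = 45 task pairs,
# matching Table 3; DirtyEquals (one PR, 9 features) gives C(9, 2) = 36.
print(task_pairs([10]), task_pairs([9]))
```

Because pairs are formed within a PR's pool, splitting the same features across more PRs yields fewer pairs than one combined pool.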
### Page 22

A.2. Feature Complexity

The final CooperBench benchmark comprises 199 individual features grouped into 52 task sets, yielding 652 evaluated feature pairs. Since the objective is to evaluate coordination rather than raw implementation difficulty, features are intentionally designed to be compact and comparable in difficulty to those found in established code-generation benchmarks. This design ensures that multi-agent failures reflect genuine coordination limitations rather than disproportionate feature complexity.

To quantify feature complexity, we characterize the gold patches for each feature along three axes: (i) code volume, measured as the total number of lines added and deleted; (ii) structural footprint, captured by the number of modified functions and hunks[^4]; and (iii) modification scope, defined as the number of files affected. Across the benchmark, features exhibit a deliberately compact footprint. On average, a feature comprises 52.3 changed lines and modifies only 1.4 files, confirming that CooperBench isolates coordination challenges rather than the difficulty of single-agent implementation. Table 4 provides detailed statistics for each repository.

Table 4 | Feature Complexity Statistics by Repository

| Language | Repository | Avg. Lines | Avg. Functions | Avg. Files | Easy | Medium | Hard |
|---|---|---|---|---|---|---|---|
| Python | DSPy | 70.9 | 5.6 | 1.3 | 2 (9%) | 4 (17%) | 17 (74%) |
| | LlamaIndex | 16.8 | 1.8 | 1.0 | 2 (13%) | 14 (87%) | 0 (0%) |
| | Pillow | 38.1 | 2.7 | 1.0 | 1 (7%) | 11 (73%) | 3 (20%) |
| | Pallets Click | 53.9 | 5.4 | 1.6 | 0 (0%) | 10 (37%) | 17 (63%) |
| | Pallets Jinja | 67.7 | 6.2 | 1.0 | 1 (3%) | 14 (47%) | 15 (50%) |
| | HuggingFace Datasets | 15.3 | 2.3 | 1.0 | 1 (8%) | 11 (85%) | 1 (8%) |
| | Outlines | 44.7 | 4.1 | 1.1 | 8 (36%) | 6 (27%) | 8 (36%) |
| | Tiktoken | 46.4 | 4.6 | 1.0 | 0 (0%) | 8 (80%) | 2 (20%) |
| | DirtyEquals | 71.0 | 4.0 | 2.0 | 0 (0%) | 1 (11%) | 8 (89%) |
| TypeScript | React Hook Form | 49.8 | 4.6 | 2.3 | 0 (0%) | 8 (73%) | 3 (27%) |
| Go | Chi Router | 80.2 | 5.7 | 2.8 | 0 (0%) | 5 (38%) | 8 (62%) |
| Rust | Typst | 58.4 | 1.7 | 1.1 | 0 (0%) | 7 (70%) | 3 (30%) |
| Overall | 12 Repositories | 52.3 | 4.4 | 1.4 | 15 (8%) | 99 (50%) | 85 (43%) |

Note: Complexity measured as lines changed (added + removed) and structural elements modified in gold patches. Easy/Medium/Hard cells give the feature count with its share of the repository's features in parentheses. Difficulty categories from SWE-Rater-32B: Easy = <15 min fix, Medium = 15 min–1 hour, Hard = 1–4 hours.
B. LLM-based merge conflict resolver

CooperBench evaluates cooperation on merged code. When patch merging produces textual conflicts, we use a small learned resolver to remove conflict markers while preserving both sides’ intent. We train a small local resolver rather than calling a larger proprietary model so that the merge step remains narrow and predictable, avoids fixing anything beyond trivial merge cleanup, and can run locally. At evaluation time, we invoke the learned resolver only after a standard merge attempt and a union merge attempt fail to yield a test-passing merged artifact.

We construct training data by replaying merges between independently produced feature patches and extracting the conflict-marked regions from conflicted files. We identify each conflict region by scanning for the Git conflict markers <<<<<<<, =======, and >>>>>>>. We extract the marked block together with a small fixed context window, by default c = 5 lines before and after.
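The extraction step described above can be sketched as follows (a minimal illustration under our own naming; the released pipeline may differ):

```python
def extract_conflict_regions(text, context=5):
    """Find Git conflict blocks (<<<<<<< ... ======= ... >>>>>>>) and
    return each block together with `context` lines before and after it."""
    lines = text.splitlines()
    regions, start = [], None
    for i, line in enumerate(lines):
        if line.startswith("<<<<<<<"):
            start = i
        elif line.startswith(">>>>>>>") and start is not None:
            lo = max(0, start - context)          # c lines of leading context
            hi = min(len(lines), i + 1 + context)  # c lines of trailing context
            regions.append("\n".join(lines[lo:hi]))
            start = None
    return regions
```

Each returned snippet is exactly what the resolver sees at training and inference time: the conflict block plus its local context.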
[^4]: A hunk is a contiguous block of changed lines in a diff, representing a localized code modification.

### Page 23

We generate synthetic conflicts by perturbing these real conflict snippets. Our default generator is gpt-4o. This keeps training examples representative of our patch distribution while avoiding direct reuse of repository-specific content. For each real or synthetic conflict snippet, we create a reference resolution with gpt-5 and fine-tune a small code model, Qwen/Qwen2.5-Coder-0.5B-Instruct, using LoRA-based supervised fine-tuning (SFT). We train for three epochs with a maximum sequence length of 2048 tokens. When the resolver is invoked, we extract the conflicted region with its fixed context window, run deterministic decoding with temperature = 0, and replace that region with the model’s resolution. We release the trained resolver as Qwen2.5-Coder-0.5B-Merge-Resolver.[^5]
C. Difficulty-stratified evaluation

Raw success rates are insufficient for comparing coordination overhead across models. A model dropping from 50% Solo to 30% Coop has the same 20-point gap as one dropping from 80% to 60%, but the first loses 40% of its capability while the second loses only 25%. We need a metric that accounts for baseline differences. We also want to integrate across task difficulty rather than rely on aggregates that mask variation. This section derives such a metric using the relative difficulty defined in Section 4.

We partition tasks into 10 equal-width buckets over the normalized difficulty range [0, 1] and compute the success rate at each bucket midpoint, with 95% Wilson confidence intervals that remain well-calibrated near 0 and 1. This produces two curves per model, one for Solo and one for Coop. We summarize each curve by its area under the curve (AUC) via trapezoidal integration. The absolute gap ∆AUC = AUC_Solo − AUC_Coop measures coordination cost but depends on baseline. We therefore report retention = AUC_Coop / AUC_Solo, which normalizes for capability. A retention of 0.64 means 64% of Solo performance survives coordination.
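The opening example can be made concrete: both hypothetical models lose 20 points, but relative loss separates them (a toy calculation on single rates; the benchmark's retention is computed on AUCs):

```python
def relative_loss(solo, coop):
    """Fraction of Solo capability lost under coordination."""
    return 1 - coop / solo

# Both hypothetical models drop 20 points, but the relative loss differs:
print(round(relative_loss(0.50, 0.30), 2))  # 0.4  -> 40% of capability lost
print(round(relative_loss(0.80, 0.60), 2))  # 0.25 -> 25% lost
```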
For aggregate statistics across models we sum raw counts rather than averaging rates, which preserves proper weighting when models have different sample sizes.

[^5]: huggingface.co/CodeConflict/Qwen2.5-Coder-0.5B-Merge-Resolver

### Page 24
Algorithm 1: Constructing difficulty-stratified success curves

Input: Task set with difficulty scores d(t) ∈ [0, 1], success outcomes for Solo and Coop per model
Output: Success curves with 95% CIs, AUC gap, and retention per model and pooled

// Bucket tasks by difficulty
1: Split [0, 1] into 10 equal buckets
2: Assign each task to its bucket based on d(t)
// Compute curves per model
3: foreach model m do
4:  foreach bucket b do
5:   Compute Solo success rate r_{m,b}^{Solo} = k_{m,b}^{Solo} / n_{m,b}
6:   Compute Coop success rate r_{m,b}^{Coop} = k_{m,b}^{Coop} / n_{m,b}
7:   Compute the 95% Wilson CI for each rate
8:  end
9:  Compute AUC_Solo and AUC_Coop via trapezoidal integration
10: Compute ∆AUC = AUC_Solo − AUC_Coop
11: Compute retention = AUC_Coop / AUC_Solo
12: end
// Pool across models
13: foreach bucket b do
14:  Sum counts across models to get pooled n_b and k_b
15:  Compute pooled rates and Wilson CIs
16: end
17: Compute pooled AUC gap and retention
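A runnable sketch of Algorithm 1's per-model computation (Wilson interval, trapezoidal AUC, retention); the function names and the example curves are ours, not released code:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for k successes in n trials;
    stays well-calibrated near rates of 0 and 1."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

def curve_auc(xs, ys):
    """Trapezoidal area under a success-rate curve sampled at bucket midpoints."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])))

# Ten bucket midpoints over [0, 1]; retention = AUC_Coop / AUC_Solo.
mids = [0.05 + 0.1 * i for i in range(10)]
solo = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]   # hypothetical
coop = [0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]  # hypothetical
retention = curve_auc(mids, coop) / curve_auc(mids, solo)
```

Pooling (steps 13–16) then reuses `wilson_ci` on summed counts rather than averaging the per-model rates.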
[Figure 7: five panels (gpt-5, claude, minimax, qwen coder, qwen); x-axis: Relative Difficulty, 0.1–0.9; y-axis: Task success rate, 0.0–1.0; two curves per panel, Solo and Coop.]

Figure 7 | Success rate versus relative difficulty for Solo and Coop settings. Shaded regions indicate 95% Wilson confidence intervals. The gap between curves represents coordination cost, which is largest at mid-difficulty.
On average, 41% of Solo capability is lost when agents must coordinate (pooled retention 0.59). The pattern across models reinforces that coding ability does not predict coordination ability. MiniMax exhibits the worst retention (0.46) despite mid-tier coding performance, while Qwen achieves the highest retention (0.68) despite being the weakest coder. Weak models may benefit from a floor effect, but MiniMax demonstrates that strong coding provides no protection against coordination overhead.

### Page 25

Table 5 | Coordination retention by model. Retention measures what fraction of Solo AUC is preserved under Coop. Higher values indicate better coordination capability.

| Model | Solo count (k) | Coop count (k) | Solo AUC | Coop AUC | ∆AUC | Retention |
|---|---|---|---|---|---|---|
| gpt-5 | 315 | 183 | 0.506 | 0.325 | 0.181 | 0.64 |
| claude | 307 | 168 | 0.469 | 0.283 | 0.186 | 0.60 |
| minimax | 236 | 91 | 0.374 | 0.171 | 0.203 | 0.46 |
| qwen coder | 141 | 87 | 0.236 | 0.148 | 0.088 | 0.63 |
| qwen | 41 | 30 | 0.106 | 0.072 | 0.034 | 0.68 |
| pooled | 1039 | 558 | 0.338 | 0.200 | 0.138 | 0.59 |
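The ∆AUC and Retention columns are pure arithmetic on the two AUC columns, e.g. for the gpt-5 and pooled rows:

```python
def derived(auc_solo, auc_coop):
    """Recompute Table 5's derived columns: (∆AUC, retention)."""
    return round(auc_solo - auc_coop, 3), round(auc_coop / auc_solo, 2)

print(derived(0.506, 0.325))  # gpt-5 row  -> (0.181, 0.64)
print(derived(0.338, 0.200))  # pooled row -> (0.138, 0.59)
```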
D. Prompt Optimization: Failure-Driven Design

This appendix documents the iterative optimization of the collaborative-setting execution prompt through systematic failure analysis. Following established prompt engineering practices (Ramnath et al., 2025; Sahoo et al., 2024), we employed an evidence-based approach: beginning with a basic prompt and incrementally adding sections to address specific failure modes observed in agent behavior. The prompt shown below represents the final, stable version used consistently across all experimental runs reported in this paper.

Through iterative refinement, we identified three primary failure categories requiring explicit prompt guidance: context misunderstanding (agents treating coordination as optional), spatial coordination failures (overlapping edits due to vague messages), and coordination protocol failures (missing final status updates). The final prompt structure directly maps to these failure categories.

### Page 26

**Collaborative Setting Execution Prompt**

Role: You are {{agent_id}} working on the following feature in parallel with another agent.

Scenario: You are working on separate branches implementing different features, but your implementations will be tested by 2-way merging both branches to main. You must prevent any merge conflicts.

Feature Description:
{{feature_description}}

Implementation Plan:
{{plan}}

Your Task:
1. Implement the feature according to the plan.
2. You can communicate with the other agent using MCP tools:
   • openhands_comm_send: Send messages to the other agent
   • Messages from the other agent will appear automatically as '[Inter-agent message]'
3. Coordinate to avoid conflicts by specifying exact file paths and line numbers.
4. Complete the implementation.

Coordination Requirements:
• Share your implementation approach early with specific line ranges so both agents can coordinate.
• If the other agent reports working on the same file, discuss who modifies which specific line ranges to avoid conflicts.
• Never use insertion markers or comments like // [handleSubmit:onFinally] other agent inserts – these cause merge conflicts.
• Instead, coordinate by dividing the file into non-overlapping sections with specific line ranges.
• Before you stop or complete your work, you must send a final status update message to the other agent summarizing what you’ve implemented.

Merge Conflict Prevention:
• Think of this as two developers working on separate branches that will be merged together.
• Any overlapping changes to the same lines will cause merge conflicts.
• Coordinate line-by-line to ensure no overlap in your modifications.

Work directory: {{workspace}}

**Failure-to-Prompt Mapping** The scenario section addresses context misunderstanding by explicitly establishing that agents work on separate branches that will be merged, making coordination mandatory. Analysis showed that many agents in early versions did not coordinate until after starting implementation; with the scenario section, most agents coordinate during planning. The coordination requirements section addresses spatial coordination failures through multiple mechanisms. The exact line number requirement (with concrete example) addresses vague coordination messages, significantly reducing spatial conflicts. The insertion marker prohibition substantially reduced marker-related conflicts. The mandatory final status update requirement increased compliance and reduced incomplete handoff failures. The merge conflict prevention section reinforces context understanding through a mental model and technical explanation of merge conflict mechanisms, helping agents understand why coordination matters and how to prevent conflicts.

### Page 27

**Design Decisions** The prompt follows a specific ordering: (1) Identity establishes the agent role, (2) Scenario sets merge-conflict constraints before the task description, (3) Feature and (4) Plan provide context, (5) Task describes what to do, (6) Requirements specify how to coordinate, and (7) Prevention reinforces understanding. This ordering follows the principle that constraints should precede task descriptions (Sahoo et al., 2024). Language choices employ mandatory language for critical behaviors and strong prohibitions for anti-patterns, as optional language was frequently ignored. Concrete examples are included rather than abstract guidance, consistent with findings that concrete examples improve prompt effectiveness (Wei et al., 2022). All experimental results reported in this paper were obtained using this final prompt version.
E. Communication ablation

Section 5 reports that communication does not improve cooperation success. Table 6 provides the full breakdown across merge strategies. We evaluate three merging approaches in sequence: Naive (standard git merge), Union (accept both sides on conflict), and LLM (our learned resolver from App. B). The ∆ column shows the net effect of communication on final merge success after all resolution steps. Communication slightly improves Naive merge rates by reducing raw conflicts, but this advantage disappears after Union and LLM resolution. The final effect is near zero or slightly negative across all models.
slightly negative across all models.
|
|
2762
|
+
|
|
2763
|
+
Table 6 | Merge success (%) on the 652-task summary. Subscripts show ∆ from prior column; final
|
|
2764
|
+
|
|
2765
|
+
column shows comm effect.
| Model | No-comm Naive | No-comm Union | No-comm LLM | With-comm Naive | With-comm Union | With-comm LLM | ∆ |
|---|---|---|---|---|---|---|---|
| GPT-5 | 13.88 | 26.69 (+12.8) | 27.91 (+1.2) | 20.42 | 26.64 (+6.2) | 27.90 (+1.3) | -0.1 |
| Claude 4.5 | 12.27 | 26.84 (+14.6) | 27.30 (+0.5) | 16.72 | 24.85 (+8.1) | 25.92 (+1.1) | -1.4 |
| MiniMax-M2 | 8.62 | 14.72 (+6.1) | 14.88 (+0.2) | 7.36 | 11.50 (+4.1) | 13.96 (+2.5) | -0.9 |
| Qwen3-Coder | 6.90 | 12.88 (+6.0) | 14.72 (+1.8) | 6.75 | 12.42 (+5.7) | 13.34 (+0.9) | -1.4 |
| Qwen3-Instruct | 1.53 | 3.22 (+1.7) | 3.37 (+0.2) | 2.30 | 4.45 (+2.1) | 4.60 (+0.2) | +1.2 |
| Avg. | 8.64 | 16.87 (+8.2) | 17.64 (+0.8) | 10.71 | 15.97 (+5.3) | 17.14 (+1.2) | -0.5 |
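The Naive → Union cascade can be reproduced with git's low-level three-way merge. Below is a minimal sketch of that fallback (our illustration, not the paper's evaluation code; it assumes `git` is on PATH, and the final LLM stage is omitted):

```python
# Illustrative sketch of the Naive -> Union merge cascade using
# git's low-level three-way merge. Requires git on PATH.
import os
import subprocess
import tempfile

def cascade_merge(base: str, ours: str, theirs: str):
    """Try a plain three-way merge; on conflict, retry with --union
    (keep both sides). Returns (merged_text, strategy_used)."""
    for strategy, flags in (("naive", []), ("union", ["--union"])):
        paths = []
        for text in (ours, base, theirs):  # git merge-file: current, base, other
            f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
            f.write(text)
            f.close()
            paths.append(f.name)
        result = subprocess.run(["git", "merge-file", *flags, *paths],
                                capture_output=True)
        merged = open(paths[0]).read()  # merge-file writes into the first file
        for p in paths:
            os.unlink(p)
        if result.returncode == 0:  # exit code = number of remaining conflicts
            return merged, strategy
    return merged, "conflict"

base = "a\nb\n"
ours = "a\nx1\nb\n"    # agent 1 inserts into a hunk
theirs = "a\nx2\nb\n"  # agent 2 inserts into the same hunk
merged, strategy = cascade_merge(base, ours, theirs)
print(strategy)  # the naive merge conflicts, so the union pass resolves it
```

With `--union`, git keeps both sides of a conflicting hunk instead of emitting conflict markers, which is why the Union column recovers so many Naive failures in Table 6.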
F. Communication error detection

We use an LLM-as-judge to classify communication failures for Section 5. Abstract labels like “hallucination” are difficult for LLMs to apply reliably, so we instead define fine-grained categories anchored to quotable evidence. The judge must cite exact quotes from the conversation and omits the label if evidence is weak. We then aggregate these detections into three high-level categories for reporting.

### Page 28

Communication Error Detection Prompt

You are a careful reviewer of two-agent collaboration conversations. This is a precision-first detector of bad conversation patterns. Prefer returning no issue unless the evidence is strong and explicit.

Important exclusion. Do not label state mismatch or visibility confusion itself as an error (e.g., agents on separate branches unable to see each other’s changes). Bad conversation patterns around these topics should still be labeled.

Taxonomy. Label at most one category per conversation.

• C1a Unanswered direct question (no reply)
• C1b Unanswered direct question (ignored)
• C2 Non-answer or vague answer
• C3a Incorrect claim (uncorrected)
• C3b Incorrect claim (corrected)
• C4a Spammy repetition (repeats same information)
• C4b Spammy repetition (near-duplicate status blocks)

Evidence requirements. Include at least two exact quotes that make the issue undeniable. C1a/C1b require the question plus a demonstration of the missing or irrelevant response. C3a requires the incorrect claim and the later contradiction. C4a/C4b require two quotes showing the repetition.

Output. Return JSON with evidence (a list of quotes) and an optional issue (category id and short description). Omit issue if evidence is weak.
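An illustrative judge response in this format (the field names follow the description above; the paper does not show the exact schema):

```json
{
  "evidence": [
    "Agent A: Which branch should own the config change?",
    "Agent B: I finished my part and pushed."
  ],
  "issue": {
    "category": "C1b",
    "description": "Direct question about ownership is ignored; B replies with unrelated status."
  }
}
```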
Taxonomy design. The seven categories decompose three failure modes into verifiable patterns. Unresponsiveness (C1a, C1b, C2) covers questions that receive no reply, are ignored, or get vague non-answers. Hallucination (C3a, C3b) covers false claims about code state or completion status. We distinguish corrected from uncorrected claims because uncorrected errors propagate to downstream decisions. Repetition (C4a, C4b) covers redundant messages that consume budget without adding information.
G. Failure Symptom Annotation Procedure

We followed a six-stage process, similar in spirit to recent work on multi-agent failure analysis (Cemri et al., 2025). (1) Collect multi-agent-system (MAS) traces from Collaborative runs; (2) identify failures from merged artifacts (e.g., failing tests or missing intended behavior), and link them back to the interaction; (3) develop symptom categories by iterative qualitative coding and resolve disagreements to reach inter-annotator agreement on a shared set of definitions; (4) finalize the resulting symptom set; (5) calibrate an LLM-based annotator on the agreed definitions; and (6) apply the annotator to produce symptom annotations at scale.

Each labeled instance is grounded in three artifacts: (i) conversation evidence (the coordination dialogue), (ii) patch/code evidence (what each agent changed), and (iii) outcome evidence (merge reports and test outputs). A key operational distinction in our rubric is between implementation failures (an individual agent delivers incomplete/buggy code regardless of coordination) and coordination failures (a breakdown that is only apparent when we consider what agents said and assumed under workspace isolation). Concretely, we require explicit conversation evidence to assign a coordination-failure label; if the only evidence is in the code or error trace, we default to an implementation-level failure rather than inferring a coordination breakdown. We codified the final symptom definitions as a structured rubric (including verification requirements and common confusions, e.g., when to treat “unverifiable claims” versus “work overlap”). We then calibrated an LLM-based annotator on this rubric and required it to emit structured labels (a primary symptom plus any secondary symptoms) together with short supporting evidence snippets.

### Page 29

Human validation. To validate the LLM-based annotator, we randomly sampled 50 trajectories and had human experts independently label them using the same rubric. Human labels matched the LLM annotations on 48 of 50 cases (96% agreement). With n = 50 and p̂ = 0.96, the Wilson 95% confidence interval is [86%, 99%], confirming the annotator’s reliability.
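The reported interval can be reproduced directly from the Wilson score formula:

```python
# Reproduce the Wilson 95% interval for 48/50 annotator agreement.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(48, 50)
print(f"[{lo:.1%}, {hi:.1%}]")  # ~[86.5%, 98.9%], i.e. the [86%, 99%] above
```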
H. Symptom examples

We provide representative examples for each coordination failure symptom identified in Table 1.

Work overlap

A msg: guid regex includes surrounding curly braces ({...}), and A warns about overlap.

A msg: WAIT Agent 10! If you add the section header AND my guid type to your branch, that WILL create a merge conflict!

B msg: I’ll add the COMPLETE section (lines 72–81) to my branch, which includes both the section header, your guid type, AND my hash_sha256 type.

Explanation: Despite explicit boundaries and warnings, B duplicates A’s work; the merged artifact keeps B’s (incorrect) version.

Outcome: The wrong regex version wins; GUID tests fail (the merged guid pattern is missing the curly braces).

Divergent architecture

A msg: Lines 15–66: My global filter registry code is present... Line 69: Still has @dataclass decorator.

B msg: Removed dataclass import... Converted Prompt from dataclass to regular class with explicit __init__.

B msg: The 2-way merge should combine our changes cleanly.

Merge: clean (0 conflicts), but A’s additions are missing from the merged module.

Explanation: Despite explicit expectations about merge behavior, B’s class rewrite overwrites the region where A added register_filter/unregister_filter; the merge reports 0 conflicts, but the exported API is missing.

Outcome: ImportError: cannot import name register_filter from outlines.prompts.

Repetition

Near-duplicate status updates (different strings, same information).

A msg: I have successfully added the url type to outlines/types/__init__.py at lines 72–77:

A msg: Successfully added to outlines/types/__init__.py at lines 72–77

Explanation: The same completion/location is restated with minimal new information, consuming turns without adding constraints a partner can act on.

Outcome: Repeated status updates reduce signal-to-noise and can crowd out missing coordination details.

Unresponsiveness

A msg: Which approach would you prefer? I want to ensure we don’t lose any functionality while resolving this conflict.

B: no later message answers this question in the conversation log.

Explanation: The decision is explicitly requested and never resolved, breaking the coordination loop.

Outcome: The team proceeds without an agreed decision; implementation assumptions diverge.

### Page 30

Unverifiable claims

A claim: max_resolution is already added at specific line ranges.

B constraint: I cannot verify your intermediate changes (separate branches).

Result: the merge can be clean while one side’s asserted change is silently absent.

Explanation: The claim is specific but non-checkable; coordination lacks a verification mechanism (e.g., a pasted signature, an exact diff, or an agreed placeholder).

Outcome: False shared context about code state leads to incompatible downstream edits.

Broken commitment

A msg: I’ll add BOTH parameters (fallback_processor and max_batch_size) to the constructor signature, BOTH docstrings, and BOTH initializations.

A msg: ✓ Line 26: Added BOTH parameters (fallback_processor and max_batch_size) to constructor signature.

Observed after merge: the constructor only has fallback_processor, missing max_batch_size.

Explanation: A makes a confident completion claim that is not corrected or verified by B; under partial observability, this creates false shared context.

Outcome: TypeError: __init__() got an unexpected keyword argument max_batch_size (tests fail).

Dependency access

Conversation evidence: 0 coordination messages were sent (total_messages=0), so neither agent disclosed import/initialization risks.

Merge traceback excerpt: src/PIL/Image.py:60 executes from . import ImageDraw. Then src/PIL/ImageDraw.py:45 reads Image.core (PIL.Image is still initializing ⇒ circular import failure).

Explanation: With no communication at all, the merged import graph is never discussed; the first shared integration check happens only at import time and fails deterministically.

Outcome: AttributeError: partially initialized module PIL.Image has no attribute core (most likely due to a circular import).

Placeholder misuse

A msg: I’ll add a clear comment marker: [Conditional filters overlay insertion point].

A msg: Please insert your logic immediately AFTER the marker... without modifying lines above it.

B msg: Given your marker plan, I didn’t alter those methods... I rely on __post_init__ to overlay filters.

Explanation: The agreed integration point (insert-after-marker) is not used; B implements an alternative wiring path, so the merged decorator surface no longer matches the expected call pattern.

Outcome: TypeError: prompt got an unexpected keyword argument conditional_filters.

Parameter flow

A msg: renamed edit_file to edit_files with multi-file command construction.

B msg: I’m going to continue... based on the current state I see (edit_file method).

B code shape: builds a shell command by interpolating filename into a quoted string, assuming it is a single string.

Explanation: Ambiguity about a changing interface leaves one agent implementing against an outdated contract; after the merge, a list flows into string-only formatting.

Outcome: sed: can’t read [...]: No such file or directory (the list is passed as a literal string).

### Page 31

Timing dependency

A msg: Processing Pipeline: load → image.load → EXIF correction (NEW) → B crop (pending) → mode conversion → return.

B msg: Applied AFTER EXIF correction (A) and BEFORE mode conversion... Pipeline (after merge): load → EXIF correction → center-crop.

Merge: CLEAN (0 textual conflicts); both declare No conflicts expected.

Merged code excerpt: image = image.crop(...)

Merged code absence: no ImageOps.exif_transpose(...) call exists in the merged function.

Explanation: They agree on the intended order, but fail to ensure the EXIF correction block is actually present at the agreed insertion point after the merge.

Outcome: assert (640, 480) == (480, 640) (EXIF correction missing).
I. Case Study: Spatial vs. Semantic Coordination

Section 5 shows that communication reduces merge conflicts but does not improve task success. To understand why, we examine a representative failure in detail. This case illustrates the distinction between spatial coordination (agreeing on which lines to edit) and semantic coordination (agreeing on what values and behaviors to implement). We find that agents excel at the former but neglect the latter.

I.1. Task Setup

The task comes from the Jinja2 template engine. Jinja2 provides a groupby filter that groups items by an attribute. For example, users|groupby("city") groups a list of users by their city. The original function signature looks like this:
```python
def sync_do_groupby(
    environment: "Environment",
    value: "t.Iterable[V]",
    attribute: t.Union[str, int],
    default: t.Optional[t.Any] = None,
) -> "t.List[t.Tuple[t.Any, t.List[V]]]":
```
Two agents must independently extend this filter with new parameters. Agent 1 adds a case_sensitive parameter that controls whether grouping treats “CA” and “ca” as the same value. The intended default is False, matching the existing sort filter’s case-insensitive behavior. Agent 2 adds a reverse parameter that controls whether groups appear in descending order.

The coordination challenge is that both parameters must be added to the same function signature. Git performs conflict detection at the hunk level (contiguous blocks of changes), not individual lines. If both agents modify the signature independently, Git will detect a conflict because both changes touch the same hunk. This creates a genuine coordination requirement where agents cannot simply work in isolation.
I.2. What Agents Communicated

The agents exchanged 10 messages totaling over 3,000 words. Their communication focused almost entirely on spatial coordination: line numbers, file paths, and edit ranges.

### Page 32

Agent 1: I’ll be modifying the sync_do_groupby and do_groupby functions in src/jinja2/filters.py lines 1161–1230... According to our coordination plan, both parameters will be added together to avoid merge conflicts.

Agent 2: I’ll wait for you to complete the implementation in src/jinja2/filters.py lines 1161–1230...

Agent 1: Your reverse insertion points remain:
- sync: for _, values in groupby(sorted(value, key=group_expr), group_expr):
Please proceed to add reverse: bool = False immediately after case_sensitive.

This communication successfully identified the shared code region and established a protocol for who edits what. What the agents never discussed were the semantic details that matter for correctness. Agent 1 correctly states that case_sensitive should default to False to match the sort filter, but this default value is never explicitly confirmed with Agent 2. Neither agent discusses whether the two parameters interact or whether their implementations are independent.

I.3. What Went Wrong

Because both agents anticipated the need to modify the same function signature, each planned to add both parameters to avoid a Git conflict. However, without semantic coordination, they made inconsistent choices.

Agent 1’s patch added only the case_sensitive parameter with the correct default:
```python
def sync_do_groupby(
    environment: "Environment",
    value: "t.Iterable[V]",
    attribute: t.Union[str, int],
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,  # Correct default
) -> "t.List[_GroupTuple]":
```
Agent 2’s patch added both parameters (to avoid merge conflicts), but reported the wrong value in communication:

Agent 2’s status message:
“Signatures now are: (environment, value, attribute, default=None, case_sensitive=True)”

Agent 2 reported case_sensitive=True as the default while the correct value is False. This discrepancy was never caught because the conversation focused entirely on where edits would happen, not what values would be used. Neither agent verified the other’s actual implementation; they relied on status messages. The semantic meaning of the default (“should match the sort filter”) was mentioned by Agent 1 but never confirmed by Agent 2.

For reference, the gold (correct) patches show what each feature should look like. The gold patch for case_sensitive adds:
```python
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,
) -> "t.List[_GroupTuple]":
```
And the gold patch for reverse adds:
```python
    default: t.Optional[t.Any] = None,
    reverse: bool = False,
) -> "t.List[t.Tuple[t.Any, t.List[V]]]":
```
### Page 33

The correct merged signature would combine both:
```python
def sync_do_groupby(
    environment: "Environment",
    value: "t.Iterable[V]",
    attribute: t.Union[str, int],
    default: t.Optional[t.Any] = None,
    case_sensitive: bool = False,
    reverse: bool = False,
) -> "t.List[_GroupTuple]":
```
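To make the semantic stakes concrete, the intended behavior of the two parameters can be sketched outside Jinja2 (a simplified illustration of the filter's semantics; group_items and its body are ours, not Jinja2's code):

```python
# Simplified sketch of the intended groupby semantics (not Jinja2's code).
from itertools import groupby

def group_items(items, attribute, case_sensitive=False, reverse=False):
    # case_sensitive=False folds string keys to lowercase, matching the
    # sort filter's case-insensitive default; reverse=True orders the
    # groups in descending key order.
    def key(item):
        v = item[attribute]
        return v.lower() if isinstance(v, str) and not case_sensitive else v
    ordered = sorted(items, key=key, reverse=reverse)
    return [(k, list(g)) for k, g in groupby(ordered, key=key)]

users = [{"city": "CA"}, {"city": "ca"}, {"city": "NY"}]
print(group_items(users, "city"))
# "CA" and "ca" land in one group because case_sensitive defaults to False;
# with case_sensitive=True they would form separate groups.
```

This is exactly the behavioral difference Agent 2's case_sensitive=True default would have silently introduced.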
I.4. What Would Have Worked

For this task to succeed, agents needed to coordinate on three levels. Spatial coordination they achieved: “I’m editing lines 1161–1230; please add your parameter after mine.” Structural coordination they partially achieved: “Both parameters go in the signature; I’ll add mine first.” Semantic coordination was missing entirely.

A single message could have prevented the failure:

Missing coordination: “I’m implementing case_sensitive with default value False (not True). This matches the sort filter’s case-insensitive default. If you need to include this parameter in your patch, please use exactly case_sensitive: bool = False.”

I.5. Implications

This case study provides concrete evidence for the spatial-semantic gap discussed in Section 5. Despite 10 messages and over 3,000 words of coordination, the agents never once discussed the actual default value that case_sensitive should have. They successfully negotiated where to edit but failed to negotiate what to implement. A single clarifying message about the intended default value would have prevented the failure entirely.