@chllming/wave-orchestration 0.6.3 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/CHANGELOG.md +82 -1
  2. package/README.md +40 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +67 -11
  39. package/docs/guides/terminal-surfaces.md +12 -0
  40. package/docs/plans/context7-wave-orchestrator.md +20 -0
  41. package/docs/plans/current-state.md +8 -1
  42. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  43. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  44. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  45. package/docs/plans/migration.md +26 -0
  46. package/docs/plans/wave-orchestrator.md +60 -12
  47. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  48. package/docs/reference/cli-reference.md +547 -0
  49. package/docs/reference/coordination-and-closure.md +436 -0
  50. package/docs/reference/live-proof-waves.md +25 -3
  51. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  52. package/docs/reference/proof-metrics.md +90 -0
  53. package/docs/reference/runtime-config/README.md +63 -2
  54. package/docs/reference/runtime-config/codex.md +2 -1
  55. package/docs/reference/sample-waves.md +29 -18
  56. package/docs/reference/wave-control.md +164 -0
  57. package/docs/reference/wave-planning-lessons.md +131 -0
  58. package/package.json +5 -4
  59. package/releases/manifest.json +40 -0
  60. package/scripts/research/agent-context-archive.mjs +18 -0
  61. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  62. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  63. package/scripts/wave-orchestrator/agent-state.mjs +11 -2
  64. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  65. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  66. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  67. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  68. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  69. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  70. package/scripts/wave-orchestrator/config.mjs +175 -0
  71. package/scripts/wave-orchestrator/control-cli.mjs +1216 -0
  72. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  73. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  74. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  75. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  76. package/scripts/wave-orchestrator/dashboard-renderer.mjs +120 -5
  77. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  78. package/scripts/wave-orchestrator/evals.mjs +23 -0
  79. package/scripts/wave-orchestrator/executors.mjs +3 -2
  80. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  81. package/scripts/wave-orchestrator/install.mjs +151 -2
  82. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  83. package/scripts/wave-orchestrator/launcher-runtime.mjs +33 -30
  84. package/scripts/wave-orchestrator/launcher.mjs +884 -36
  85. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  86. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  87. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  88. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  89. package/scripts/wave-orchestrator/replay.mjs +10 -4
  90. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  91. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  92. package/scripts/wave-orchestrator/shared.mjs +26 -0
  93. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  94. package/scripts/wave-orchestrator/terminals.mjs +1 -1
  95. package/scripts/wave-orchestrator/traces.mjs +157 -2
  96. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  97. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  98. package/scripts/wave-orchestrator/wave-files.mjs +144 -23
  99. package/scripts/wave.mjs +27 -0
  100. package/skills/repo-coding-rules/SKILL.md +1 -0
  101. package/skills/role-cont-eval/SKILL.md +1 -0
  102. package/skills/role-cont-qa/SKILL.md +13 -6
  103. package/skills/role-deploy/SKILL.md +1 -0
  104. package/skills/role-documentation/SKILL.md +4 -0
  105. package/skills/role-implementation/SKILL.md +4 -0
  106. package/skills/role-infra/SKILL.md +2 -1
  107. package/skills/role-integration/SKILL.md +15 -8
  108. package/skills/role-planner/SKILL.md +39 -0
  109. package/skills/role-planner/skill.json +21 -0
  110. package/skills/role-research/SKILL.md +1 -0
  111. package/skills/role-security/SKILL.md +2 -2
  112. package/skills/runtime-claude/SKILL.md +2 -1
  113. package/skills/runtime-codex/SKILL.md +1 -0
  114. package/skills/runtime-local/SKILL.md +2 -0
  115. package/skills/runtime-opencode/SKILL.md +1 -0
  116. package/skills/wave-core/SKILL.md +25 -6
  117. package/skills/wave-core/references/marker-syntax.md +16 -8
  118. package/wave.config.json +45 -0
@@ -0,0 +1,1699 @@
---
summary: 'Converted paper text and source links for DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation.'
read_when:
  - Reviewing harness and coordination research source material in the docs tree
  - You want the extracted paper text with source links preserved
topics:
  - blackboard-and-shared-workspaces
  - harnesses-and-practice
kind: 'paper'
title: 'DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation'
---
# DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation

<Note>
Converted from the source document on 2026-03-21. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Aaron Shen, Alfred Shen |
| Year | 2026 |
| Venue | arXiv 2603.13327 |
| Research bucket | P0 direct hits |
| Maps to | Deliberation-first orchestration, iterative refinement, and transparent coordination for autonomous research. |
| Harness fit | Useful as a modern hybrid between harness design and blackboard-style coordination. |
| Source page | [Open source](https://arxiv.org/abs/2603.13327) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2603.13327.pdf) |
## Extracted text

### Page 1

DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation

Aaron Shen (1), Alfred Shen (2)

Abstract. Large language model (LLM) agents have demonstrated remarkable capabilities in tool use, reasoning, and code generation, yet single-agent systems exhibit fundamental limitations when confronted with complex research tasks demanding multi-source synthesis, adversarial verification, and personalized delivery. We present DOVA (Deep Orchestrated Versatile Agent), a multi-agent platform introducing three innovations: (1) deliberation-first orchestration, where explicit meta-reasoning precedes tool invocation, informed by a persistent user model and entity-aware conversation context; (2) hybrid collaborative reasoning, a composable three-phase pipeline unifying ensemble diversity, blackboard transparency, and iterative refinement; and (3) adaptive multi-tiered thinking, a six-level token-budget allocation scheme reducing inference cost by 40–60% on simple tasks while preserving deep reasoning capacity. We formalize the core algorithms, present an architectural ablation study across seven system configurations, and analyze the contribution of each component to answer confidence, source coverage, and token efficiency.
1. Introduction

The rapid advancement of large language models (LLMs) (Brown et al., 2020; Anthropic, 2024a) has enabled a new generation of autonomous agents capable of reasoning, tool use, and multi-step planning (Yao et al., 2023b; Schick et al., 2023). However, deploying these agents for complex research automation—where a single query may require searching academic databases, analyzing code repositories, cross-referencing model registries, and synthesizing findings with citations—exposes several limitations of single-agent architectures:

- Linear reasoning. A single agent processes information sequentially, missing cross-domain connections.
- Premature commitment. Without adversarial challenge, agents accept initial findings without verification.
- Reflexive tool invocation. Standard REACT loops (Yao et al., 2023b) trigger tools based on keyword patterns rather than deliberate need assessment.
- Fixed computation cost. Identical reasoning depth for trivial and complex queries wastes tokens on the former and starves the latter.

We present DOVA, a multi-agent platform designed to address these limitations.

(1) University of California, Berkeley, USA. (2) Amazon Web Services, USA. Correspondence to: Aaron Shen <aaron.shen@berkeley.edu>, Alfred Shen <alfreshe@amazon.com>. Preprint. March 17, 2026.
1.1. Contributions

1. Deliberation-first orchestration (§5.2). A meta-reasoning layer that deliberates—using a persistent user model and entity-aware context—before invoking any tool, reducing unnecessary API calls and enabling context-aware follow-ups.
2. Hybrid collaborative reasoning (§5.3). A composable three-phase pipeline (ensemble → blackboard → iterative refinement) combining breadth, transparency, and depth of multi-round critique.
3. Adaptive multi-tiered thinking (§5.4). A six-level token-budget allocation with automatic task-complexity selection, achieving significant token savings on simple tasks.
4. Diversity-aware memory retrieval (§5.6). MMR (Carbonell & Goldstein, 1998) reranking over a multi-tier memory architecture with embedding-based semantic search.
5. Unified multi-modal interface (§6). Four cohesive access modalities—REST API, CLI, browser UI, and MCP server—sharing a single orchestration backend, with seamless Claude Code integration via dynamic plugin (Anthropic, 2024b).

arXiv:2603.13327v1 [cs.AI] 4 Mar 2026
### Page 2

2. Preliminaries

Definition 2.1 (Agent). An agent $A = (\pi, T, M)$ is a tuple of a policy $\pi$ (an LLM with a system prompt), a tool set $T = \{t_1, \ldots, t_m\}$, and a memory store $M$.

Definition 2.2 (Reasoning Trace). A reasoning trace $\tau = (s_0, a_1, o_1, s_1, \ldots, a_n, o_n, s_n)$ is an alternating sequence of thought states $s_i \in S$, actions $a_i \in A_{\mathrm{act}} \cup \{\text{conclude}\}$, and observations $o_i \in O$.

Definition 2.3 (Confidence Function). A confidence function $C: R \times P \to [0, 1]$ maps a response $r$ and prompt $p$ to a scalar quality estimate.

Let $Q$ denote user queries, $D$ the data sources (ArXiv, GitHub, HuggingFace, Web), and $U$ a user model capturing expertise, preferences, and history.

Problem. Given query $q \in Q$, user model $u \in U$, and context $\xi$, produce response $r^*$ maximizing:

$$r^* = \arg\max_{r \in R} \; C(r, q) \cdot \mathrm{Cov}(r, D) \quad \text{s.t.} \quad \mathrm{cost}(r) \le B(q), \tag{1}$$

where $\mathrm{Cov}(r, D)$ measures source coverage and $B(q)$ is a query-adaptive token budget.
3. Related Work

LLM Reasoning. Chain-of-thought prompting (Wei et al., 2022) demonstrated that intermediate reasoning steps improve LLM performance. REACT (Yao et al., 2023b) interleaved reasoning with tool actions. Tree of Thoughts (Yao et al., 2023a) and Language Agent Tree Search (Zhou et al., 2023) extended this to tree-structured exploration. Reflexion (Shinn et al., 2023) added verbal self-reflection, Self-Refine (Madaan et al., 2023) showed LLMs can critique their own outputs, and Self-Consistency (Wang et al., 2023) introduced majority voting. Wei et al. (2026) provide a comprehensive taxonomy of agentic reasoning along foundational, self-evolving, and collective dimensions, and a survey of long chain-of-thought reasoning (Chen et al., 2025) traces the evolution from standard CoT to extended reasoning in models such as OpenAI O1 and DeepSeek-R1. DOVA augments REACT with (a) a deliberation step that reasons about whether to invoke tools and (b) multi-component confidence scoring with self-reflection.

Multi-Agent Systems. Multi-agent debate (Du et al., 2023; Liang et al., 2023) improves factuality. CAMEL (Li et al., 2023) explored role-playing communication. Generative Agents (Park et al., 2023) simulated behavior with memory. MetaGPT (Hong et al., 2023) assigned software roles. AutoGen (Wu et al., 2023) provided conversation-based multi-agent frameworks. A recent survey (Tran et al., 2025) categorizes collaboration mechanisms into cooperation, competition, and coordination protocols, while Dang et al. (2025) propose centralized orchestration with reinforcement learning. Orogat et al. (2026) provide a unified benchmark showing that framework-level architectural choices (e.g., message routing, memory sharing) can increase latency by up to 100×, underscoring the importance of deliberation-aware orchestration. Unlike these systems, which employ a single collaboration pattern, DOVA composes three patterns into a hybrid pipeline with a deliberation layer determining when multi-agent reasoning is warranted.

Tool-Augmented LLMs. Toolformer (Schick et al., 2023) trained LLMs to self-annotate tool calls. Gorilla (Patil et al., 2023) fine-tuned on API documentation. ToolLLM (Qin et al., 2023) scaled to 16,000+ APIs. MCP (Anthropic, 2024b) standardized tool integration; Hou et al. (2025) provide a systematic landscape analysis and threat taxonomy, while MCP-Universe (Luo et al., 2025) offers the first comprehensive benchmark across real-world MCP servers. DOVA leverages MCP but introduces deliberation-first tool selection.

Adaptive Computation. Adaptive Computation Time (Graves, 2016) introduced variable compute for RNNs. Pause tokens (Goyal et al., 2023) allocated extra processing. Recent work on budget-guided thinking (Li et al., 2025), token-budget-aware reasoning (Han et al., 2024), and a survey of adaptive test-time compute (Alomrani et al., 2025) confirm that variable token budgets improve efficiency–quality trade-offs. Sleep-time compute (Lin et al., 2025) extends this to pre-computation, while Zhu et al. (2025) provide the first systematic study of test-time scaling specifically for LLM agents. DOVA applies this at the system level through a six-tier thinking budget.
4. System Architecture

Figure 1 illustrates the layered architecture.

4.1. Agent Layer

All agents inherit from a common base providing two mixins: ReasoningMixin (implements the REACT loop with self-reflection and a working-memory scratchpad) and MemoryMixin (access to the enhanced memory service). Five specialized agents compose the agent pool: (1) ResearchAgent—multi-source search via MCP servers with query-type classification; (2) ProfilingAgent—user model management via persistent memory; (3) ValidationAgent—code analysis and sandboxed execution; (4) SynthesisAgent—narrative generation with source attribution; (5) DebateAgent—adversarial Bull-vs-Bear analysis.

### Page 3

Figure 1. Layered architecture of DOVA. Queries enter through the Interface Layer, pass through Orchestration (with deliberation), dispatch to specialized agents, which leverage collaborative reasoning and intelligence services.

Table 1. Model tier configuration.

| Task Type | Tier | Max Tok. | Temp. |
| --- | --- | --- | --- |
| Classification | Basic | 10K | 0.0 |
| Summarization | Basic | 20K | 0.3 |
| Chat | Standard | 40K | 0.7 |
| Code Gen. | Advanced | 80K | 0.2 |
| Reasoning | Advanced | 40K | 0.7 |

4.2. Model Tiering

DOVA routes LLM calls through a tiering system that maps task types to model classes (Table 1).
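The routing in Table 1 amounts to a static lookup from task type to model class and sampling settings. A minimal sketch, assuming a plain dictionary; the names below are illustrative, not DOVA's actual API:

```python
# Illustrative sketch of Table 1's task-type -> model-tier routing.
# Keys and the "chat" fallback are assumptions, not the paper's code.
MODEL_TIERS = {
    "classification": {"tier": "basic", "max_tokens": 10_000, "temperature": 0.0},
    "summarization": {"tier": "basic", "max_tokens": 20_000, "temperature": 0.3},
    "chat": {"tier": "standard", "max_tokens": 40_000, "temperature": 0.7},
    "code_gen": {"tier": "advanced", "max_tokens": 80_000, "temperature": 0.2},
    "reasoning": {"tier": "advanced", "max_tokens": 40_000, "temperature": 0.7},
}

def route(task_type: str) -> dict:
    """Map a task type to its model class and sampling settings."""
    return MODEL_TIERS.get(task_type, MODEL_TIERS["chat"])
```

The design point is that tiering is decided per task type, not per query, so cheap classes absorb classification and summarization traffic without touching the advanced tier.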
5. Core Algorithms

5.1. ReAct Reasoning with Self-Reflection

The foundational reasoning loop extends REACT (Yao et al., 2023b) with a terminal self-reflection step. Each agent maintains a scratchpad—a working memory that accumulates observations.

The trace confidence is the mean over per-step confidences:

$$\bar{c}(\tau) = \frac{1}{|\{c_i\}|} \sum_i c_i, \qquad c_i \in [0, 1]. \tag{2}$$

Algorithm 1 ReAct Reasoning with Self-Reflection
Require: Problem q; max iterations N; reflect flag ϕ
Ensure: Reasoning trace τ, answer r, confidence c̄
τ ← ∅; pad ← ∅
for i = 1 to N do
(s_i, a_i, c_i) ← THINK(q, τ, pad)
τ ← τ ∪ {(THOUGHT, s_i, c_i)}
if a_i = conclude then r ← s_i; break end if
o_i ← ACT(a_i) {execute tool}
τ ← τ ∪ {(ACT, a_i), (OBS, o_i)}
pad ← pad ∪ {o_i}
end for
if ϕ and r exists then
(r′, crit) ← REFLECT(r, q, τ)
τ ← τ ∪ {(REFL, crit)}; r ← r′
end if
c̄ ← (1/|τ_c|) Σ_i c_i
return (τ, r, c̄)
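Algorithm 1 can be sketched directly, with THINK, ACT, and REFLECT passed in as callables standing in for LLM and tool calls. This is an illustrative reading of the pseudocode, not the package's implementation:

```python
# Minimal sketch of Algorithm 1: a ReAct loop with a terminal
# self-reflection pass. think/act/reflect are stand-ins for LLM calls.
def react(think, act, reflect, question, max_iters=5, do_reflect=True):
    trace, pad = [], []          # reasoning trace and scratchpad
    confidences, answer = [], None
    for _ in range(max_iters):
        thought, action, conf = think(question, trace, pad)
        trace.append(("THOUGHT", thought, conf))
        confidences.append(conf)
        if action == "conclude":
            answer = thought
            break
        obs = act(action)        # execute the chosen tool
        trace.append(("ACT", action))
        trace.append(("OBS", obs))
        pad.append(obs)          # scratchpad accumulates observations
    if do_reflect and answer is not None:
        answer, critique = reflect(answer, question, trace)
        trace.append(("REFL", critique))
    # Eq. 2: trace confidence is the mean of per-step confidences
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return trace, answer, mean_conf
```

The scratchpad is separate from the trace: the trace records everything for auditing, while the scratchpad holds only observations the next THINK step should see.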
### Page 4

5.2. Deliberation-First Orchestration

The key innovation of DOVA's ThinkingOrchestrator is an explicit deliberation step preceding all tool invocation. Unlike standard REACT agents that reflexively call tools, the orchestrator first assesses whether external information is necessary.

Algorithm 2 Deliberation-First Orchestration
Require: Query q; user model u; context ξ; sources D′
Ensure: Deliberation δ
exp ← FORMATEXPERTISE(u)
ent ← FORMATENTITIES(ξ)
rec ← RECENTTURNS(ξ, k=6)
T_avail ← DISCOVERTOOLS(D′)
δ ← LLM DELIBERATE(q, exp, ent, rec, T_avail)
if CHECKMANDATORYTRIGGERS(q) then δ.action ← USE TOOLS end if
return δ

The mandatory trigger function detects temporal keywords ("latest," "recent," year patterns ≥ 2025), specificity markers ("specific papers"), and real-time queries that always warrant tool invocation.

Proposition 5.1 (Tool Call Reduction). Let $f_d$ be the fraction of queries where deliberation selects RESPOND DIRECTLY. The expected tool-call volume relative to a standard REACT agent is $(1 - f_d)$, achieving cost savings proportional to $f_d \cdot c_{\mathrm{tool}}$, where $c_{\mathrm{tool}}$ is the average cost per tool-augmented response.
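The mandatory-trigger check in Algorithm 2 is a cheap pre-LLM guard. A minimal sketch, assuming keyword lists of the kind the paper describes (the exact lists are not given, so these are illustrative):

```python
import re

# Sketch of CHECKMANDATORYTRIGGERS from Algorithm 2: temporal keywords,
# specificity markers, and year patterns >= 2025 force tool use.
# The keyword sets are assumptions, not the paper's exact lists.
TEMPORAL_KEYWORDS = ("latest", "recent")
SPECIFICITY_MARKERS = ("specific papers",)

def mandatory_triggers(query: str) -> bool:
    q = query.lower()
    if any(k in q for k in TEMPORAL_KEYWORDS + SPECIFICITY_MARKERS):
        return True
    # Any four-digit year >= 2025 is treated as a real-time signal.
    return any(int(y) >= 2025 for y in re.findall(r"\b(20\d{2})\b", q))
```

When the guard fires, the deliberation outcome is overridden to USE TOOLS regardless of what the LLM deliberation concluded, which is what makes the trigger "mandatory".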
5.3. Hybrid Collaborative Reasoning

DOVA composes three collaboration patterns into a single pipeline.

Phase 1: Ensemble. Multiple agents solve the problem independently in parallel. The agreement score quantifies consensus:

$$A(c_1, \ldots, c_n) = \max\bigl(0,\; 1 - \mathrm{Var}(c_1, \ldots, c_n)\bigr). \tag{3}$$

Phase 2: Blackboard. Results are posted to a shared workspace where agents contribute evidence and votes. Each post carries a weighted confidence:

$$w(p) = c_{\mathrm{base}}(p) \cdot \frac{1 + \bar{a}(p)}{2}, \qquad \bar{a}(p) = \frac{1}{|V_p|} \sum_{v \in V_p} v_{\mathrm{agree}}, \tag{4}$$

where $c_{\mathrm{base}}$ is the agent's self-assessed confidence and $\bar{a}$ is mean agreement from peer votes ($v_{\mathrm{agree}} \in [-1, 1]$) (Hayes-Roth, 1985).

Phase 3: Iterative Refinement. The top-ranked synthesis is iteratively refined through multi-round critique.

Algorithm 3 Hybrid Collaborative Reasoning
Require: Problem q; agents {A_i}; max iter. K; context ξ
Ensure: Result r*, confidence c*, agreement A
{Phase 1: Ensemble}
(r̂, {c_i}, dissent) ← ENSEMBLE(q, {A_i}, ξ)
A ← 1 − Var({c_i})
{Phase 2: Blackboard}
BB.clear()
POST(HYPO, r̂, c̄)
for d ∈ dissent do POST(EVID, d, 0.3) end for
r_bb ← SYNTHESIZEBB(BB)
{Phase 3: Iterative Refinement}
r* ← ITERREFINE(r_bb, {A_1, A_2}, min(2, K))
c* ← (c̄_ens + c_iter) / 2
return (r*, c*, A)
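The two scoring rules of the pipeline, the agreement score of Eq. 3 and the vote-weighted post confidence of Eq. 4, can be sketched as plain functions; this is a direct reading of the formulas, not the package's code:

```python
# Eq. 3: agreement from the variance of per-agent confidences.
def agreement(confidences):
    n = len(confidences)
    mean = sum(confidences) / n
    var = sum((c - mean) ** 2 for c in confidences) / n
    return max(0.0, 1.0 - var)

# Eq. 4: blackboard post confidence, weighted by mean peer agreement.
# votes are peer agreement values in [-1, 1].
def weighted_confidence(c_base, votes):
    a_bar = sum(votes) / len(votes) if votes else 0.0
    return c_base * (1 + a_bar) / 2
```

Note the rescaling in Eq. 4: unanimous disagreement (all votes −1) drives the post's weight to zero, while unanimous agreement leaves the agent's self-assessed confidence unchanged.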
Table 2. Thinking levels and token budgets (2–4× scaling per level).

| Level | Budget | Typical Tasks |
| --- | --- | --- |
| OFF | 0 | Embeddings |
| MINIMAL | 1,024 | Classification |
| LOW | 4,096 | Summarization |
| MEDIUM | 16,384 | Code generation |
| HIGH | 32,768 | Reasoning, research |
| XHIGH | 65,536 | Complex analysis |

5.4. Adaptive Multi-Tiered Thinking

DOVA allocates reasoning compute via a six-level budget (Table 2). The selection function maps a task to a thinking level. Formally, the budget function is:

$$B(t, h, q) = \mathrm{BUD}\Bigl(\mathrm{clamp}\bigl(\beta(t) + \alpha(h) + \gamma(q),\; 0,\; 5\bigr)\Bigr), \tag{5}$$

where $\beta: T_{\mathrm{task}} \to \{0, \ldots, 5\}$ maps task types, $\alpha: H \to \{-1, 0, 1, 2\}$ adjusts for complexity, and $\gamma: Q \to \{-1, 0, 1\}$ adjusts for query length.
5.5. Multi-Component Confidence Scoring

The self-evaluation service computes confidence as:

$$C(r, p) = \frac{\sum_k w_k \cdot f_k(r, p)}{\sum_k w_k}, \tag{6}$$

### Page 5

Algorithm 4 Adaptive Thinking Level Selection
Require: Task type t; query q; complexity hint h
Ensure: Level ℓ and budget b
L ← [OFF, MIN, LOW, MED, HI, XH]
base ← TASKDEFAULTS[t]
adj ← 0
if h = simple then adj ← adj − 1 end if
if h = complex then adj ← adj + 1 end if
if h = very complex then adj ← adj + 2 end if
if |q| > 2000 then adj ← adj + 1 end if
if |q| < 50 then adj ← adj − 1 end if
idx ← clamp(indexOf(base) + adj, 0, 5)
ℓ ← L[idx]; b ← BUDGETS[ℓ]
return (ℓ, b)
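Algorithm 4 and Eq. 5 compose into a few lines of code. The budgets follow Table 2; the task-default mapping below is an illustrative assumption, since the paper does not list TASKDEFAULTS in full:

```python
# Sketch of Algorithm 4: adaptive thinking-level selection.
# Budgets follow Table 2; TASK_DEFAULTS is an illustrative assumption.
LEVELS = ["OFF", "MINIMAL", "LOW", "MEDIUM", "HIGH", "XHIGH"]
BUDGETS = {"OFF": 0, "MINIMAL": 1024, "LOW": 4096,
           "MEDIUM": 16384, "HIGH": 32768, "XHIGH": 65536}
TASK_DEFAULTS = {"classification": "MINIMAL", "summarization": "LOW",
                 "code_generation": "MEDIUM", "research": "HIGH"}

def select_level(task_type, query, hint=None):
    # alpha(h): complexity-hint adjustment in {-1, 0, +1, +2}
    adj = {"simple": -1, "complex": 1, "very_complex": 2}.get(hint, 0)
    # gamma(q): query-length adjustment in {-1, 0, +1}
    if len(query) > 2000:
        adj += 1
    if len(query) < 50:
        adj -= 1
    base = TASK_DEFAULTS.get(task_type, "MEDIUM")
    idx = min(5, max(0, LEVELS.index(base) + adj))   # clamp(..., 0, 5)
    level = LEVELS[idx]
    return level, BUDGETS[level]
```

The clamp keeps the adjustments from pushing a task outside the six defined levels, so a very short but explicitly "complex" research query still lands on a sensible tier.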
with four components:

$$f_{\mathrm{len}}(r) = \mathrm{clip}\Bigl(\frac{|r|}{\tau_{\mathrm{len}}},\; 0.2,\; 1.0\Bigr), \tag{7}$$

$$f_{\mathrm{ref}}(r) = 1 - 0.7 \cdot \mathbb{1}\bigl[\exists\, k \in K_{\mathrm{ref}}: k \subseteq r\bigr], \tag{8}$$

$$f_{\mathrm{fmt}}(r, \varphi) = \mathrm{format\_check}(r, \varphi), \tag{9}$$

$$f_{\mathrm{rel}}(r, p) = \min\Bigl(1,\; \frac{|\mathrm{kw}(r) \cap \mathrm{kw}(p)|}{0.3 \cdot |\mathrm{kw}(p)|}\Bigr). \tag{10}$$

A response is acceptable when $C(r, p) \ge \theta_{\min}$ (default 0.6). When $C < 0.7$, iterative query refinement triggers (up to 2 rounds).
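Eqs. 6–10 can be sketched as a weighted average of component scores. The refusal keyword list and the length threshold below are assumptions, and the format check of Eq. 9 is omitted since it depends on the expected schema:

```python
# Sketch of the multi-component confidence score (Eqs. 6-10).
# REFUSAL_KEYWORDS and tau_len are illustrative assumptions; the
# format component f_fmt (Eq. 9) is omitted as schema-dependent.
REFUSAL_KEYWORDS = ("i cannot", "i'm unable")

def f_len(r, tau_len=200):
    # Eq. 7: length score clipped to [0.2, 1.0]
    return min(1.0, max(0.2, len(r) / tau_len))

def f_ref(r):
    # Eq. 8: a detected refusal phrase costs 0.7
    return 1.0 - 0.7 * any(k in r.lower() for k in REFUSAL_KEYWORDS)

def f_rel(r, p):
    # Eq. 10: keyword overlap between response and prompt
    kw_r, kw_p = set(r.lower().split()), set(p.lower().split())
    return min(1.0, len(kw_r & kw_p) / (0.3 * len(kw_p)))

def confidence(r, p, weights=(1.0, 1.0, 1.0)):
    # Eq. 6: weighted average of the component scores
    comps = (f_len(r), f_ref(r), f_rel(r, p))
    return sum(w * f for w, f in zip(weights, comps)) / sum(weights)
```

With the default acceptance threshold of 0.6 and the refinement trigger at 0.7, there is a band of "acceptable but improvable" responses that still get up to two refinement rounds.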
5.6. Diversity-Aware Memory Retrieval

The enhanced memory stores entries in three tiers: short-term (TTL = 86,400 s), long-term (persistent), and procedural (reusable skills).

Retrieval uses cosine similarity reranked with MMR (Carbonell & Goldstein, 1998). Recent work on agent memory beyond RAG (Hu et al., 2026) decouples memories into semantic components; DOVA takes a complementary approach with tiered storage and diversity-aware retrieval:

$$\mathrm{MMR}(d_i) = \lambda \cdot \mathrm{sim}(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in S} \mathrm{sim}(d_i, d_j), \tag{11}$$

where $\mathrm{sim}(a, b) = a \cdot b / (\lVert a \rVert \lVert b \rVert)$, $S$ is the set of already-selected results, and $\lambda \in [0, 1]$ (default 0.5) controls the relevance–diversity trade-off.

Algorithm 5 MMR-Enhanced Semantic Memory Search
Require: Query q; top-k; λ; memory M
Ensure: Ranked results R
e_q ← EMBED(q)
sc ← {(m, sim(e_q, e_m)): m ∈ M}
Sort sc by similarity descending
S ← ∅; R ← ∅
while |R| < k and sc ≠ ∅ do
d* ← arg max_{d ∈ sc} λ · sim(d, q) − (1 − λ) · max_{d′ ∈ S} sim(d, d′)
R ← R ∪ {d*}; S ← S ∪ {d*}
sc ← sc \ {d*}
end while
return R
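Algorithm 5 is greedy MMR over embedded memories. A self-contained sketch (the embedding step is out of scope here, so memories arrive as plain vectors):

```python
import math

# Sketch of Algorithm 5: greedy MMR reranking (Eq. 11).
# Memories are (item, vector) pairs; embedding is assumed done upstream.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr_search(query_vec, memories, k=3, lam=0.5):
    candidates = list(memories)
    selected = []                       # the set S in Eq. 11
    while candidates and len(selected) < k:
        def mmr_score(entry):
            relevance = lam * cosine(query_vec, entry[1])
            redundancy = max((cosine(entry[1], s[1]) for s in selected),
                             default=0.0)
            return relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [item for item, _ in selected]
```

Lowering λ penalizes near-duplicates harder: a memory almost identical to one already selected loses to a less relevant but novel one.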
Table 3. Query type to source routing. Columns in order: ArXiv, GitHub, HF, Web (per-row column alignment was lost in extraction; check counts are preserved).

- Technical: ✓ ✓ ✓ ✓
- News: ✓
- Biographical: ✓
- Factual: ✓ ✓
- General: ✓ ✓ ✓ ✓

5.7. Query Intent Classification

The research agent classifies queries to route to appropriate sources:

$$t^*(q) = \arg\max_{t \in T_q} \sum_{k \in K_t} \mathbb{1}\bigl[k \in q_{\downarrow}\bigr] + \mathrm{bonus}(q, t), \tag{12}$$

where $T_q = \{\text{tech., news, bio., fact., gen.}\}$, $q_{\downarrow}$ is the lowercased query, and $\mathrm{bonus}(q, \text{bio.}) = 2 \cdot \mathbb{1}[\mathrm{is\_person}(q)]$. Table 3 shows the source routing.
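Eq. 12 is a keyword-vote classifier with a biographical bonus. A minimal sketch; the keyword sets and the is_person signal are illustrative assumptions, not the paper's lexicons:

```python
# Sketch of Eq. 12: keyword-vote intent classification with a
# biographical bonus. Keyword sets are illustrative assumptions.
KEYWORDS = {
    "technical": {"architecture", "algorithm", "benchmark"},
    "news": {"announced", "released", "today"},
    "biographical": {"who", "born", "career"},
    "factual": {"what", "when", "where"},
    "general": set(),
}

def classify(query: str, is_person: bool = False) -> str:
    tokens = set(query.lower().split())   # q with lowercasing applied
    def score(intent):
        s = len(KEYWORDS[intent] & tokens)
        if intent == "biographical" and is_person:
            s += 2                        # bonus(q, bio.) = 2 * 1[is_person(q)]
        return s
    return max(KEYWORDS, key=score)
```

The winning intent then selects the source set per Table 3, e.g. a technical query fans out to all four sources.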
5.8. Multi-Round Adversarial Debate

The debate agent implements a Bull-vs-Bear pattern for evaluative queries. Inspired by financial analysis practice, two adversarial agents—Bull (advocate) and Bear (critic)—argue opposing positions across multiple rounds. Each agent receives the accumulated arguments of its opponent, forcing direct engagement with counterpoints rather than independent monologues.

The sequential turn-taking is critical: in round r, the Bull agent conditions on all prior Bear arguments $B^{<r}_{\mathrm{Bear}}$, and vice versa. This creates an implicit convergence dynamic—arguments that survive multiple rounds of adversarial scrutiny carry higher epistemic weight in the final synthesis.

### Page 6

Algorithm 6 Multi-Round Adversarial Debate
Require: Topic q; context ξ; rounds R (default 2)
Ensure: Conclusion: summary, strengths, concerns, confidence
Bull ← ∅; Bear ← ∅
for r = 1 to R do
b_r ← BULLAGENT.ARGUE(q, ξ, Bear)
Bull ← Bull ∪ {b_r}
k_r ← BEARAGENT.ARGUE(q, ξ, Bull)
Bear ← Bear ∪ {k_r}
end for
return SYNTHESIZE(Bull, Bear)

The synthesis step aggregates both argument sets into a structured output containing: (i) a balanced summary, (ii) surviving strengths (Bull arguments not effectively rebutted),
(iii) validated concerns (Bear arguments not adequately addressed), and (iv) an overall confidence score reflecting argument balance. We default to R = 2 rounds, as empirically the marginal information gain diminishes beyond two rounds while token cost grows linearly.

This pattern draws on multi-agent debate research (Du et al., 2023; Liang et al., 2023), extending it with structured synthesis and integration into the broader orchestration pipeline via the deliberation layer, which determines when adversarial analysis is warranted versus simpler reasoning modes.

Table 4. Interface modalities.

| Interface | Access | Key Features |
| --- | --- | --- |
| REST API | HTTP | 15+ endpoints, OAuth2 |
| CLI | Terminal | CoT display, sessions |
| Browser UI | Web | Source chips, badges |
| MCP Server | Stdio | 5 tools, plugin arch. |
998
+
999
+ 6. Interface Modalities
1000
+
1001
+ DOVA exposes its orchestration engine through four inter-
1002
+
1003
+ faces sharing the same backend (Table 4).
1004
+
1005
+ 6.1. Claude Code Integration via Dynamic Plugin
1006
+
1007
+ The MCP server (Anthropic, 2024b) exposes
1008
+
1009
+ five tools to Claude Code: dova research,
1010
+
1011
+ dova search, dova debate, dova validate,
1012
+
1013
+ and dova web search. Communication uses stdio
1014
+
1015
+ transport with lazy initialization.
1016
+
1017
+ The plugin architecture provides: (i) a plugin.json
1018
+
1019
+ manifest; (ii) an .mcp.json server configuration;
1020
+
1021
+ (iii) custom slash-command skills (/dova-research,
1022
+
1023
+ /dova-debate); (iv) a custom agent definition enabling
1024
+
1025
+ autonomous multi-source research.
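A rough sketch of the bundle's shape follows. The file names (plugin.json, .mcp.json) and the five tool and command names come from the text; every field name and the underscored spellings are assumptions, not DOVA's actual manifest:

```python
# Illustrative shape of the plugin bundle from Section 6.1.
plugin_manifest = {               # would live in plugin.json
    "name": "dova",
    "commands": ["/dova-research", "/dova-debate"],
}
mcp_config = {                    # would live in .mcp.json
    "mcpServers": {
        "dova": {
            "command": "dova-mcp",   # hypothetical server entry point
            "transport": "stdio",    # stdio transport, per Table 4
        }
    }
}
# The five tools exposed to Claude Code (underscores assumed).
TOOLS = ["dova_research", "dova_search", "dova_debate",
         "dova_validate", "dova_web_search"]
```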
1026
+
1027
+ This creates a bidirectional integration: Claude Code in-
1028
+
1029
+ vokes DOVA as a tool provider, while DOVA uses Claude
1030
+
1031
+ models as its LLM backbone—each system augmenting the
1032
+
1033
+ other.
1034
+
1035
+ 6.2. Interactive CLI
1036
+
1037
+ The interactive CLI provides a seven-step chain-of-thought
1038
+
1039
+ pipeline: (1) Observe—parse input; (2) Recall—search
1040
+
1041
+ memory; (3) Reason—CoT analysis; (4) Plan—select ac-
1042
+
1043
+ tion; (5) Act—execute tools; (6) Reflect—evaluate qual-
1044
+
1045
+ ity; (7) Respond—generate output. Session commands
1046
+
1047
+ (/status, /thinking, /orchestrator) provide
1048
+
1049
+ runtime control.
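The seven steps can be sketched as a simple handler chain; the handler signature is assumed for illustration, and in the CLI each step is an LLM or tool invocation:

```python
# Minimal sketch of the seven-step chain-of-thought pipeline from
# Section 6.2. A handler takes and returns the session state.
STEPS = ["observe", "recall", "reason", "plan", "act", "reflect", "respond"]

def run_pipeline(user_input, handlers):
    state = {"input": user_input, "trace": []}
    for step in STEPS:
        state = handlers[step](state)  # each handler transforms state
        state["trace"].append(step)    # record the step for /thinking
    return state

# Usage with no-op handlers:
handlers = {step: (lambda st: st) for step in STEPS}
result = run_pipeline("what changed in release 0.7?", handlers)
```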
1050
+
1051
+ 7. Experiments and Evaluation
1052
+
1053
+ We evaluate DOVA through an architectural ablation and
1054
+
1055
+ reasoning mode comparison.
1056
+
1057
+ 7.1. Setup
1058
+
1059
+ Models. Claude Sonnet 4.6 (Standard tier), Claude
1060
+
1061
+ Opus 4.6 (Advanced tier), and Claude Haiku 4.5 (Basic
1062
+
1063
+ tier).
1064
+
1065
+ Baselines. (1) Single-LLM: one Claude Opus call;
1066
+
1067
+ (2) ReAct-only: standard ReAct without deliberation
1068
+
1069
+ or collaboration; (3) Ensemble-only: parallel multi-agent
1070
+
1071
+ without blackboard or iterative refinement.
1072
+
1073
+ Metrics. Answer confidence (C), source coverage (Cov),
1074
+
1075
+ token efficiency, latency, refinement rate, and error recovery
1076
+
1077
+ rate.
1078
+
1079
+ 7.2. Ablation Study
1080
+
1081
+ Table 5 presents the architectural ablation across seven con-
1082
+
1083
+ figurations.
1084
+
1085
+ Key findings. (1) Collaboration is highest-impact: re-
1086
+
1087
+ moving it drops confidence by 0.14 and coverage by
1088
+
1089
+ 0.25. (2) Self-evaluation prevents degradation: without
1090
+
1091
+ it, low-quality responses reach the user (refinement rate
1092
+
1093
+ 18%→35%). (3) Adaptive thinking is a pure efficiency gain:
1094
+
1095
+ fixed MEDIUM reduces token efficiency by 32% with mini-
1096
+
1097
+ mal confidence impact. (4) Deliberation reduces cost: re-
1098
+
1099
+ moving it increases latency by 19% and decreases efficiency
1100
+
1101
+ by 27% through unnecessary tool invocations. (5) ReAct is
1102
+
1103
+ foundational: single-pass causes the largest confidence drop
1104
+
1105
+ (0.82→0.58).
1106
+
1107
+ 7.3. Reasoning Mode Comparison
1108
+
1109
+ Table 6 compares the four reasoning modes that DOVA ex-
1110
+
1111
+ poses, each representing a different point on the quality–cost
1112
+
1113
+ Pareto frontier.
1114
+
1115
+ Quick mode uses a single agent with minimal thinking
1116
+
1117
+ budget and no tool invocation, suitable for simple factual
1118
+
1119
1120
+
1121
+ ### Page 7
1122
+
1123
1124
+
1125
+ Table 5. Architectural ablation study. Each row removes one component. Values represent expected relative performance based on architectural analysis. ↑ = higher is better; ↓ = lower is better. Bold indicates full-system values.
+
+ | Configuration | Reasoning | Collab. | Think | Conf.↑ | Cov.↑ | Tok.Eff.↑ | Lat.(s)↓ |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | **DOVA-Full** | ✓ | ✓ | Adaptive | **0.82** | **0.90** | **0.71** | **12.4** |
+ | −Collaboration | ✓ | — | Adaptive | 0.68 | 0.65 | 0.74 | 6.1 |
+ | −Thinking (fixed Med) | ✓ | ✓ | Fixed | 0.79 | 0.88 | 0.48 | 11.8 |
+ | −Memory | ✓ | ✓ | Adaptive | 0.75 | 0.85 | 0.65 | 11.2 |
+ | −Deliberation | ✓ | ✓ | Adaptive | 0.77 | 0.90 | 0.52 | 14.8 |
+ | −Self-Eval | ✓ | ✓ | Adaptive | 0.70 | 0.88 | 0.69 | 10.1 |
+ | −ReAct (single pass) | — | — | — | 0.58 | 0.45 | 0.80 | 3.2 |
+ | Single-LLM baseline | — | — | — | 0.52 | 0.00 | 0.85 | 1.8 |
1146
+
1147
+ Table 6. Reasoning mode comparison. Confidence and token consumption are averaged across a mixed workload of factual, technical, and evaluative queries.
+
+ | Mode | Agents | Conf. | Lat. | Tok. |
+ | --- | --- | --- | --- | --- |
+ | Quick | 1 | 0.52 | 1.8s | 2K |
+ | Standard | 1 | 0.68 | 6.5s | 12K |
+ | Deep | N | 0.78 | 18.3s | 45K |
+ | Collaborative | N | 0.82 | 24.1s | 65K |
1162
+
1163
+ recall or conversational follow-ups. Standard mode enables
1164
+
1165
+ the full ReAct loop with self-reflection and tool access,
1166
+
1167
+ providing a 31% confidence gain over Quick at 6× the token
1168
+
1169
+ cost. Deep mode activates multiple agents with ensemble
1170
+
1171
+ reasoning but without the blackboard or iterative refinement
1172
+
1173
+ phases, achieving a further 15% confidence improvement.
1174
+
1175
+ Collaborative mode engages the complete hybrid pipeline
1176
+
1177
+ (Algorithm 3), yielding the highest confidence at the cost of
1178
+
1179
+ 32.5× the tokens of Quick mode.
1180
+
1181
+ The confidence gap between Standard and Collaborative
1182
+
1183
+ (0.68 vs. 0.82) highlights the value of multi-agent reason-
1184
+
1185
+ ing for complex queries, while the gap between Quick and
1186
+
1187
+ Standard (0.52 vs. 0.68) demonstrates that tool access and
1188
+
1189
+ self-reflection are individually high-value. The delibera-
1190
+
1191
+ tion layer (§5.2) automatically selects the appropriate mode
1192
+
1193
+ based on query complexity, ensuring that simple queries de-
1194
+
1195
+ fault to Quick or Standard while research-intensive queries
1196
+
1197
+ escalate to Deep or Collaborative.
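A minimal sketch of such mode escalation, assuming a scalar complexity score from the deliberation layer; the thresholds are illustrative, and DOVA's actual policy draws on richer signals than one number:

```python
# Hypothetical mode-escalation policy mirroring Section 7.3.
def select_mode(complexity: float) -> str:
    if complexity < 0.25:
        return "quick"          # single agent, minimal thinking
    if complexity < 0.50:
        return "standard"       # full ReAct loop with tools
    if complexity < 0.75:
        return "deep"           # multi-agent ensemble reasoning
    return "collaborative"      # complete hybrid pipeline (Alg. 3)

# Sanity check against Table 6: Collaborative costs 65K / 2K = 32.5x
# the tokens of Quick for a 0.52 -> 0.82 confidence gain.
```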
1198
+
1199
+ 7.4. Token Efficiency Analysis
1200
+
1201
+ Figure 2 illustrates the token savings from adaptive thinking
1202
+
1203
+ level selection (Algorithm 4) compared to a fixed MEDIUM
1204
+
1205
+ baseline across five representative task types.
1206
+
1207
+ The savings are most pronounced for lightweight tasks: clas-
1208
+
1209
+ sification drops from 16K to 1K tokens (94% reduction) and
1210
+
1211
+ summarization from 16K to 4K (75%), since these tasks
1212
+
1213
+ require only MINIMAL and LOW thinking budgets respec-
1214
+
1215
+ tively. For complex tasks (reasoning and research), the
1216
+
1217
+ adaptive system allocates HIGH budgets (33K), exceeding
1218
+
1219
+ the fixed 16K baseline—this is the intended behavior, as underspending on hard tasks degrades answer quality (Table 5, row 2).
+
+ [Figure 2: bar chart of tokens (K) per task type (Classif., Summ., Code, Reason., Research); Adaptive bars: 1, 4, 16, 33, 33; Fixed bars: 16 each.]
+
+ Figure 2. Token consumption: adaptive vs. fixed MEDIUM. Adaptive saves 94% on classification and 75% on summarization.
1264
+
1265
+ The key insight is that adaptive allocation is not uniformly
1266
+
1267
+ cheaper. Rather, it redistributes tokens from tasks that do
1268
+
1269
+ not benefit from deep reasoning to tasks that do. Under
1270
+
1271
+ a realistic workload where 40–60% of queries are simple
1272
+
1273
+ (classification, summarization, or short factual lookups), the
1274
+
1275
+ aggregate token savings reach 40–60% with no measurable
1276
+
1277
+ confidence loss (Table 5: 0.82 vs. 0.79). Code generation
1278
+
1279
+ consumes 16K under both schemes because its default level
1280
+
1281
+ (MEDIUM) already matches the fixed baseline.
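The arithmetic behind these savings can be checked directly, with budgets in thousands of tokens taken from Figure 2:

```python
# Per-task thinking budgets (K tokens) under the two schemes.
ADAPTIVE = {"classification": 1, "summarization": 4, "code": 16,
            "reasoning": 33, "research": 33}
FIXED = 16  # fixed MEDIUM baseline for every task type

def savings(task: str) -> float:
    # Fraction saved relative to the fixed baseline; negative values
    # mean the adaptive scheme deliberately spends more.
    return 1 - ADAPTIVE[task] / FIXED

assert round(savings("classification"), 2) == 0.94  # 94% reduction
assert savings("summarization") == 0.75             # 75% reduction
assert savings("code") == 0.0                       # matches baseline
assert savings("research") < 0                      # intended overspend
```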
1282
+
1283
+ 7.5. Component Interaction Effects
1284
+
1285
+ We observe notable interactions:
1286
+
1287
+ • Deliberation × Collaboration: Removing both
1288
+
1289
+ is worse than the sum of individual removals—
1290
+
1291
+ deliberation gatekeeps expensive collaborative reason-
1292
+
1293
+ ing.
1294
+
1295
+ • Memory × Self-Eval: Memory provides context
1296
+
1297
+ that improves evaluation accuracy. Without it, false-
1298
+
1299
+ positive retries increase.
1300
+
1301
+ • Thinking × Tiering: Adaptive thinking (depth within
1302
+
1303
+ a model) is complementary to model tiering (which
1304
+
1305
+ model), providing two-dimensional cost optimization.
1306
+
1307
1308
+
1309
+ ### Page 8
1310
+
1311
1312
+
1313
+ 8. Discussion
1314
+
1315
+ Deliberation as meta-cognition. The deliberation-first
1316
+
1317
+ approach represents meta-reasoning—the system reasons
1318
+
1319
+ about whether to reason. This parallels human metacogni-
1320
+
1321
+ tive monitoring, where experts assess their knowledge state
1322
+
1323
+ before consulting external sources (Shinn et al., 2023).
1324
+
1325
+ Composition over specialization. Rather than a single
1326
+
1327
+ monolithic pattern, DOVA’s hybrid approach composes sim-
1328
+
1329
+ ple, well-understood patterns (ensemble, blackboard, iter-
1330
+
1331
+ ative) into a pipeline with emergent capabilities exceeding
1332
+
1333
+ any individual pattern.
1334
+
1335
+ Cost-aware intelligence. Model tiering + adaptive think-
1336
+
1337
+ ing provides two-dimensional cost control. Organizations
1338
+
1339
+ can set budget constraints knowing the system degrades
1340
+
1341
+ gracefully.
1342
+
1343
+ 8.1. Limitations
1344
+
1345
+ 1. Self-evaluation circularity. Confidence scoring uses
1346
+
1347
+ the same LLM that generated the response. External
1348
+
1349
+ signals (user feedback) would strengthen assessment.
1350
+
1351
+ 2. Ablation scope. Our ablation is based on architectural
1352
+
1353
+ analysis rather than large-scale benchmarks. Evalua-
1354
+
1355
+ tion on standard benchmarks (HotpotQA, MMLU) and
1356
+
1357
+ emerging agent evaluation frameworks (Ferrag et al.,
1358
+
1359
+ 2025) remains future work.
1360
+
1361
+ 3. Memory scalability. In-memory MMR search has
1362
+
1363
+ O(n · k) complexity; indexing is needed for very large
1364
+
1365
+ stores.
1366
+
1367
+ 4. Agent homogeneity. All agents share the same LLM
1368
+
1369
+ backbone. Heterogeneous models could improve en-
1370
+
1371
+ semble diversity.
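The O(n · k) cost in limitation 3 comes from the MMR selection loop, which for reference can be sketched generically as follows; sim is a stand-in similarity function, and this is the Carbonell and Goldstein (1998) formulation, not DOVA's memory code:

```python
# Generic maximal-marginal-relevance selection. Each of the k
# selection passes scans all n remaining candidates, giving the
# O(n * k) candidate scans noted in the limitations.
def mmr(query, docs, sim, k, lam=0.7):
    selected, remaining = [], list(docs)
    while remaining and len(selected) < k:
        def mmr_score(d):
            # Relevance to the query minus redundancy with picks so far.
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(d, query) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

An approximate-nearest-neighbor index would replace the linear scan over `remaining`, which is the indexing fix the limitation suggests for very large stores.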
1372
+
1373
+ 9. Conclusion
1374
+
1375
+ We presented DOVA, a multi-agent platform for autonomous
1376
+
1377
+ research automation introducing deliberation-first orches-
1378
+
1379
+ tration, hybrid collaborative reasoning, and adaptive multi-
1380
+
1381
+ tiered thinking. The architectural ablation demonstrates that
1382
+
1383
+ collaborative reasoning is the highest-impact component,
1384
+
1385
+ while adaptive thinking and deliberation provide significant
1386
+
1387
+ efficiency gains without sacrificing quality.
1388
+
1389
+ Future directions include: persistent user models learn-
1390
+
1391
+ ing from feedback; heterogeneous agent ensembles mix-
1392
+
1393
+ ing LLM providers; streaming deliberation display; multi-
1394
+
1395
+ modal context integration; and comprehensive benchmark-
1396
+
1397
+ ing on standard multi-hop QA datasets.
1398
+
1399
+ DOVA is available as open-source software under Apache 2.0 at https://github.com/alfredcs/dova.
1404
+
1405
+ References
1406
+
1407
+ Alomrani, M. A., Zhang, Y., Li, D., Sun, Q., Pal, S., Zhang,
1408
+
1409
+ Z., Hu, Y., Ajwani, R. D., Valkanas, A., et al. Reasoning
1410
+
1411
+ on a budget: A survey of adaptive and controllable test-
1412
+
1413
+ time compute in LLMs. arXiv preprint arXiv:2507.02076,
1414
+
1415
+ 2025.
1416
+
1417
+ Anthropic. The Claude model family: Technical report.
1418
+
1419
+ Technical report, Anthropic, 2024a.
1420
+
1421
+ Anthropic. Model context protocol specification.
1422
+
1423
+ Technical report, Anthropic, 2024b. https://
1424
+
1425
+ modelcontextprotocol.io.
1426
+
1427
+ Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
1428
+
1429
+ Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
1430
+
1431
+ Askell, A., et al. Language models are few-shot learners.
1432
+
1433
+ In Advances in Neural Information Processing Systems,
1434
+
1435
+ volume 33, pp. 1877–1901, 2020.
1436
+
1437
+ Carbonell, J. and Goldstein, J. The use of MMR, diversity-
1438
+
1439
+ based reranking for reordering documents and producing
1440
+
1441
+ summaries. In Proceedings of the 21st Annual Interna-
1442
+
1443
+ tional ACM SIGIR Conference on Research and Develop-
1444
+
1445
+ ment in Information Retrieval, pp. 335–336, 1998.
1446
+
1447
+ Chen, Q., Qin, L., Liu, J., et al. Towards reasoning era: A
1448
+
1449
+ survey of long chain-of-thought for reasoning large lan-
1450
+
1451
+ guage models. arXiv preprint arXiv:2503.09567, 2025.
1452
+
1453
+ Dang, Y., Qian, C., Luo, X., Fan, J., Xie, Z., Shi, R., Chen,
1454
+
1455
+ W., Yang, C., Che, X., Tian, Y., et al. Multi-agent col-
1456
+
1457
+ laboration via evolving orchestration. arXiv preprint
1458
+
1459
+ arXiv:2505.19591, 2025.
1460
+
1461
+ Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mor-
1462
+
1463
+ datch, I. Improving factuality and reasoning in lan-
1464
+
1465
+ guage models through multiagent debate. arXiv preprint
1466
+
1467
+ arXiv:2305.14325, 2023.
1468
+
1469
+ Ferrag, M. A., Tihanyi, N., and Debbah, M. From LLM
1470
+
1471
+ reasoning to autonomous AI agents: A comprehensive
1472
+
1473
+ review. arXiv preprint arXiv:2504.19678, 2025.
1474
+
1475
+ Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar,
1476
+
1477
+ S., and Naber, V. Think before you speak: Training
1478
+
1479
+ language models with pause tokens. arXiv preprint
1480
+
1481
+ arXiv:2310.02226, 2023.
1482
+
1483
+ Graves, A. Adaptive computation time for recurrent neural
1484
+
1485
+ networks. arXiv preprint arXiv:1603.08983, 2016.
1486
+
1487
+ Han, T., Wang, Z., Fang, C., et al. Token-budget-aware
1488
+
1489
+ LLM reasoning. arXiv preprint arXiv:2412.18547, 2024.
1490
+
1491
+ Hayes-Roth, B. A blackboard architecture for control. Arti-
1492
+
1493
+ ficial Intelligence, 26(3):251–321, 1985.
1494
+
1495
1496
+
1497
+ ### Page 9
1498
+
1499
1500
+
1501
+ Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang,
1502
+
1503
+ C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., et al.
1504
+
1505
+ MetaGPT: Meta programming for a multi-agent collab-
1506
+
1507
+ orative framework. arXiv preprint arXiv:2308.00352,
1508
+
1509
+ 2023.
1510
+
1511
+ Hou, X., Zhao, Y., Wang, S., and Wang, H. Model context
1512
+
1513
+ protocol (MCP): Landscape, security threats, and future
1514
+
1515
+ research directions. arXiv preprint arXiv:2503.23278,
1516
+
1517
+ 2025.
1518
+
1519
+ Hu, Z., Zhu, Q., Yan, H., et al. Beyond RAG for agent
1520
+
1521
+ memory: Retrieval by decoupling and aggregation. arXiv
1522
+
1523
+ preprint arXiv:2602.02007, 2026.
1524
+
1525
+ Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and
1526
+
1527
+ Ghanem, B. CAMEL: Communicative agents for “mind”
1528
+
1529
+ exploration of large language model society. Advances in
1530
+
1531
+ Neural Information Processing Systems, 36, 2023.
1532
+
1533
+ Li, J., Zhao, W., Zhang, Y., and Gan, C. Steering
1534
+
1535
+ LLM thinking with budget guidance. arXiv preprint
1536
+
1537
+ arXiv:2506.13752, 2025.
1538
+
1539
+ Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang,
1540
+
1541
+ R., Yang, Y., Tu, Z., and Shi, S. Encouraging divergent
1542
+
1543
+ thinking in large language models through multi-agent
1544
+
1545
+ debate. arXiv preprint arXiv:2305.19118, 2023.
1546
+
1547
+ Lin, K., Snell, C., Wang, Y., et al. Sleep-time compute:
1548
+
1549
+ Beyond inference scaling at test-time. arXiv preprint
1550
+
1551
+ arXiv:2504.13171, 2025.
1552
+
1553
+ Luo, Z., Shen, Z., Yang, W., et al. MCP-Universe:
1554
+
1555
+ Benchmarking large language models with real-world
1556
+
1557
+ model context protocol servers. arXiv preprint
1558
+
1559
+ arXiv:2508.14704, 2025.
1560
+
1561
+ Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao,
1562
+
1563
+ L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S.,
1564
+
1565
+ Yang, Y., et al. Self-refine: Iterative refinement with self-
1566
+
1567
+ feedback. In Advances in Neural Information Processing
1568
+
1569
+ Systems, volume 36, 2023.
1570
+
1571
+ Orogat, A., Rostam, A., and Mansour, E. Understanding
1572
+
1573
+ multi-agent LLM frameworks: A unified benchmark and
1574
+
1575
+ experimental analysis. arXiv preprint arXiv:2602.03128,
1576
+
1577
+ 2026.
1578
+
1579
+ Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang,
1580
+
1581
+ P., and Bernstein, M. S. Generative agents: Interactive
1582
+
1583
+ simulacra of human behavior. In Proceedings of the 36th
1584
+
1585
+ Annual ACM Symposium on User Interface Software and
1586
+
1587
+ Technology, pp. 1–22, 2023.
1588
+
1589
+ Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Go-
1590
+
1591
+ rilla: Large language model connected with massive APIs.
1592
+
1593
+ arXiv preprint arXiv:2305.15334, 2023.
1594
+
1595
+ Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y.,
1596
+
1597
+ Cong, X., Tang, X., Qian, B., et al. ToolLLM: Facilitating
1598
+
1599
+ large language models to master 16000+ real-world APIs.
1600
+
1601
+ arXiv preprint arXiv:2307.16789, 2023.
1602
+
1603
+ Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli,
1604
+
1605
+ M., Hambro, E., Zettlemoyer, L., Cancedda, N., and
1606
+
1607
+ Scialom, T. Toolformer: Language models can teach
1608
+
1609
+ themselves to use tools. In Advances in Neural Informa-
1610
+
1611
+ tion Processing Systems, volume 36, 2023.
1612
+
1613
+ Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and
1614
+
1615
+ Yao, S. Reflexion: Language agents with verbal rein-
1616
+
1617
+ forcement learning. In Advances in Neural Information
1618
+
1619
+ Processing Systems, volume 36, 2023.
1620
+
1621
+ Tran, K.-T., Dao, D., Nguyen, M.-D., Pham, Q.-V.,
1622
+
1623
+ O’Sullivan, B., and Nguyen, H. D. Multi-agent collabo-
1624
+
1625
+ ration mechanisms: A survey of LLMs. arXiv preprint
1626
+
1627
+ arXiv:2501.06322, 2025.
1628
+
1629
+ Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E.,
1630
+
1631
+ Narasimhan, S., Chowdhery, A., and Zhou, D. Self-
1632
+
1633
+ consistency improves chain of thought reasoning in lan-
1634
+
1635
+ guage models. In International Conference on Learning
1636
+
1637
+ Representations, 2023.
1638
+
1639
+ Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B.,
1640
+
1641
+ Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought
1642
+
1643
+ prompting elicits reasoning in large language models.
1644
+
1645
+ In Advances in Neural Information Processing Systems,
1646
+
1647
+ volume 35, pp. 24824–24837, 2022.
1648
+
1649
+ Wei, T., Li, T.-W., Liu, Z., Ning, X., Yang, Z., Zou, J., Zeng,
1650
+
1651
+ Z., Qiu, R., Lin, X., Fu, D., et al. Agentic reasoning for
1652
+
1653
+ large language models. arXiv preprint arXiv:2601.12538,
1654
+
1655
+ 2026.
1656
+
1657
+ Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang,
1658
+
1659
+ L., Zhang, X., Zhang, S., Liu, J., et al. AutoGen: Enabling
1660
+
1661
+ next-gen LLM applications via multi-agent conversation.
1662
+
1663
+ arXiv preprint arXiv:2308.08155, 2023.
1664
+
1665
+ Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao,
1666
+
1667
+ Y., and Narasimhan, K. Tree of thoughts: Deliberate
1668
+
1669
+ problem solving with large language models. Advances
1670
+
1671
+ in Neural Information Processing Systems, 36, 2023a.
1672
+
1673
+ Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
1674
+
1675
+ K., and Cao, Y. ReAct: Synergizing reasoning and act-
1676
+
1677
+ ing in language models. In International Conference on
1678
+
1679
+ Learning Representations, 2023b.
1680
+
1681
+ Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H.,
1682
+
1683
+ and Wang, Y.-X. Language agent tree search unifies rea-
1684
+
1685
+ soning, acting, and planning in language models. arXiv
1686
+
1687
+ preprint arXiv:2310.04406, 2023.
1688
+
1689
1690
+
1691
+ ### Page 10
1692
+
1693
1694
+
1695
+ Zhu, K., Li, H., Wu, S., et al. Scaling test-time compute for
1696
+
1697
+ LLM agents. arXiv preprint arXiv:2506.12928, 2025.
1698
+
1699