@chllming/wave-orchestration 0.6.3 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (112)
  1. package/CHANGELOG.md +57 -1
  2. package/README.md +39 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +48 -11
  39. package/docs/plans/context7-wave-orchestrator.md +20 -0
  40. package/docs/plans/current-state.md +8 -1
  41. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  42. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  43. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  44. package/docs/plans/wave-orchestrator.md +62 -11
  45. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  46. package/docs/reference/coordination-and-closure.md +436 -0
  47. package/docs/reference/live-proof-waves.md +25 -3
  48. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  49. package/docs/reference/proof-metrics.md +90 -0
  50. package/docs/reference/runtime-config/README.md +61 -0
  51. package/docs/reference/sample-waves.md +29 -18
  52. package/docs/reference/wave-control.md +164 -0
  53. package/docs/reference/wave-planning-lessons.md +131 -0
  54. package/package.json +5 -4
  55. package/releases/manifest.json +18 -0
  56. package/scripts/research/agent-context-archive.mjs +18 -0
  57. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  58. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  59. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  60. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  61. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  62. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  63. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  64. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  65. package/scripts/wave-orchestrator/config.mjs +175 -0
  66. package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
  67. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  68. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  69. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  70. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  71. package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
  72. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  73. package/scripts/wave-orchestrator/evals.mjs +23 -0
  74. package/scripts/wave-orchestrator/executors.mjs +3 -2
  75. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  76. package/scripts/wave-orchestrator/install.mjs +55 -1
  77. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  78. package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
  79. package/scripts/wave-orchestrator/launcher.mjs +796 -35
  80. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  81. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  82. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  83. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  84. package/scripts/wave-orchestrator/replay.mjs +10 -4
  85. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  86. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  87. package/scripts/wave-orchestrator/shared.mjs +26 -0
  88. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  89. package/scripts/wave-orchestrator/traces.mjs +157 -2
  90. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  91. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  92. package/scripts/wave-orchestrator/wave-files.mjs +17 -5
  93. package/scripts/wave.mjs +27 -0
  94. package/skills/repo-coding-rules/SKILL.md +1 -0
  95. package/skills/role-cont-eval/SKILL.md +1 -0
  96. package/skills/role-cont-qa/SKILL.md +13 -6
  97. package/skills/role-deploy/SKILL.md +1 -0
  98. package/skills/role-documentation/SKILL.md +4 -0
  99. package/skills/role-implementation/SKILL.md +4 -0
  100. package/skills/role-infra/SKILL.md +2 -1
  101. package/skills/role-integration/SKILL.md +15 -8
  102. package/skills/role-planner/SKILL.md +39 -0
  103. package/skills/role-planner/skill.json +21 -0
  104. package/skills/role-research/SKILL.md +1 -0
  105. package/skills/role-security/SKILL.md +2 -2
  106. package/skills/runtime-claude/SKILL.md +2 -1
  107. package/skills/runtime-codex/SKILL.md +1 -0
  108. package/skills/runtime-local/SKILL.md +2 -0
  109. package/skills/runtime-opencode/SKILL.md +1 -0
  110. package/skills/wave-core/SKILL.md +25 -6
  111. package/skills/wave-core/references/marker-syntax.md +16 -8
  112. package/wave.config.json +45 -0
@@ -0,0 +1,1173 @@
---
summary: 'Converted paper text and source links for Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution.'
read_when:
  - Reviewing harness and coordination research source material in the docs tree
  - You want the extracted paper text with source links preserved
topics:
  - planning-and-orchestration
  - harnesses-and-practice
kind: 'paper'
title: 'Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution'
---

# Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

<Note>
Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Xing Zhang, Yanwei Cui, Guanghui Wang, Wei Qiu, Ziyuan Li, Fangwei Han, Yajing Huang, Hengzhi Qiu, Bing Zhu, Peiyang He |
| Year | 2026 |
| Venue | arXiv 2603.11445 |
| Research bucket | P0 direct hits |
| Maps to | DAG decomposition, parallel execution, verification, and replanning for complex queries. |
| Harness fit | Direct blueprint for a planner-verifier harness loop instead of one-shot multi-agent delegation. |
| Source page | [Open source](https://arxiv.org/abs/2603.11445) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2603.11445.pdf) |

## Extracted text

### Page 1

Published as a conference paper at ICLR 2026 Workshop on MALGAI

VERIFIED MULTI-AGENT ORCHESTRATION: A PLAN-EXECUTE-VERIFY-REPLAN FRAMEWORK FOR COMPLEX QUERY RESOLUTION

Xing Zhang¹ Yanwei Cui¹ Guanghui Wang¹ Wei Qiu² Ziyuan Li² Fangwei Han² Yajing Huang² Hengzhi Qiu² Bing Zhu² Peiyang He¹∗

¹AWS Generative AI Innovation Center ²HSBC

ABSTRACT

We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub-questions, executes them through domain-specific agents in parallel, verifies result completeness via LLM-based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency-aware parallel execution over a DAG of sub-questions with automatic context propagation, (2) verification-driven adaptive replanning that uses an LLM-based verifier as an orchestration-level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1–5 scale) compared to a single-agent baseline, demonstrating that orchestration-level verification is an effective mechanism for multi-agent quality assurance.

1 INTRODUCTION

Large language models (LLMs) have enabled a new generation of multi-agent systems where specialized agents collaborate to solve complex tasks. A central challenge in such systems is coordination: given a complex query that requires information from heterogeneous sources and diverse analytical expertise, how should agents be organized and assigned to sub-tasks? How can we ensure result quality without constant human oversight? When should the system stop iterating and synthesize a final answer? These questions are especially acute in domains like market research, where analysts gather data from internal databases, public filings, news sources, and competitor reports, then synthesize findings into actionable insights. Information is scattered across heterogeneous sources, analysis requires diverse expertise (financial, operational, competitive), and synthesis demands cross-referencing while resolving contradictions.

Existing multi-agent frameworks fall short of these requirements. Debate-style approaches where agents critique each other’s outputs (Du et al., 2023) improve reasoning quality but lack structured task decomposition. Role-playing frameworks where agents assume personas (Li et al., 2023) enable collaboration but provide no mechanism for verifying completeness. More recent systems like AutoGen (Wu et al., 2024) and MetaGPT (Hong et al., 2024) offer flexible interaction patterns, yet still lack principled quality verification and adaptive refinement—critical requirements for production deployment where outputs must be reliable without constant human oversight.

We introduce Verified Multi-Agent Orchestration (VMAO), a framework that addresses these gaps through three key contributions:

1. DAG-Based Query Decomposition and Execution: Complex queries are decomposed into sub-questions organized as a directed acyclic graph (DAG), enabling dependency-aware parallel execution with automatic context propagation from upstream results.

∗Corresponding author: peiyan@amazon.com

arXiv:2603.11445v2 [cs.AI] 15 Mar 2026

### Page 2

2. Verification-Driven Replanning: An LLM-based verifier evaluates result completeness at the orchestration level, triggering adaptive replanning when gaps are identified—providing a principled coordination signal that is decoupled from individual agent implementations.

3. Configurable Stop Conditions: Termination decisions are based on completeness thresholds, confidence scores, and resource constraints, enabling explicit quality-cost tradeoffs.

On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1–5 scale) compared to single-agent and static multi-agent baselines.

2 RELATED WORK

Multi-Agent Coordination and Tool Use. Recent surveys (Wang et al., 2024; Xi et al., 2023) document the rapid growth of LLM-based multi-agent systems, which vary in coordination strategy: AutoGen (Wu et al., 2024) uses conversational patterns, CAMEL (Li et al., 2023) employs role-playing, MetaGPT (Hong et al., 2024) enforces software engineering workflows, and HuggingGPT (Shen et al., 2023) orchestrates specialized models via a central controller. Orthogonally, work on tool use has focused on single-agent settings: ReAct (Yao et al., 2023b) established the thought-action-observation paradigm, Toolformer (Schick et al., 2023) enables self-supervised tool learning, and ToolLLM (Qin et al., 2023) scales to 16,000+ APIs. These lines of work address coordination and tool use separately, but production systems require both: multiple specialized agents, each with domain-specific tools, working in concert.

Planning, Decomposition, and Verification. Chain-of-Thought (Wei et al., 2022), Tree-of-Thoughts (Yao et al., 2023a), and Least-to-Most prompting (Zhou et al., 2023) decompose complex reasoning into structured steps, but operate within a single LLM rather than distributing sub-tasks across specialized agents. For output quality, Self-Consistency (Wang et al., 2022) aggregates multiple reasoning paths, Self-Refine (Madaan et al., 2023) iterates on single outputs, and Reflexion (Shinn et al., 2023) uses verbal reinforcement—all operating at the individual response level. Missing from prior work is verification at the orchestration level: evaluating whether collective results from multiple agents adequately address the original query, and triggering targeted replanning when gaps are detected.

Agentic Search and Deep Research. Recent commercial systems have demonstrated the potential of multi-step agentic research: search-augmented assistants like Perplexity iteratively refine queries to synthesize information from web sources, while deep research features in frontier models (OpenAI, 2025) perform extended multi-step investigation. These systems demonstrate the value of iterative research loops but are closed-source, making their coordination mechanisms difficult to study or reproduce. Our work provides an open, modular framework where the coordination strategy—particularly the verification-driven replanning loop—is explicit and configurable.

Our Approach. VMAO synthesizes these threads into a unified framework for complex query resolution. We decompose queries into a DAG of sub-questions assigned to domain-specific agents, execute them in parallel with dependency-aware scheduling, verify collective completeness via LLM-based evaluation, and adaptively replan to address gaps. We evaluate VMAO on market research tasks, maintaining verifiable output quality through explicit coordination mechanisms.

3 FRAMEWORK ARCHITECTURE

3.1 OVERVIEW

VMAO operates through five phases: Plan, Execute, Verify, Replan, and Synthesize (Figure 1a). Given a complex query, the system first decomposes it into sub-questions with assigned agent types and dependencies. It then executes these through specialized agents in parallel where dependencies permit. The verify phase evaluates completeness and identifies gaps. If deficiencies exist, the system replans by generating new sub-questions or marking incomplete ones for retry. This loop continues until stop conditions are met, triggering synthesis of a final answer with proper source attribution.

### Page 3

[Figure 1 omitted. Panel (a), "Plan-Execute-Verify-Replan Architecture": PLAN (Query Decomp.) → EXECUTE (DAG Parallel) → VERIFY (LLM Complete.) → REPLAN (Gap Filling) → SYNTHESIZE (Merge & Cite), with an iterative refinement loop on "incomplete" and exit when "stop conditions met". Panel (b), "Agent Taxonomy by Functional Tier": Tier 1 DATA (RAG, Web, Financial, Competitor), Tier 2 ANALYSIS (Analysis, Reasoning, Raw Data), Tier 3 OUTPUT (Document, Visualization), with information flow from data through analysis to output.]

Figure 1: (a) VMAO framework architecture showing the iterative Plan-Execute-Verify-Replan loop. (b) Agent taxonomy organized by functional tier with information flow from data gathering through analysis to output generation.

Table 1: Sub-question structure generated by the QueryPlanner

| Field | Description |
| --- | --- |
| id | Unique identifier (e.g., sq_001) |
| question | Specific, answerable question text |
| agent_type | Agent from taxonomy to handle this question |
| dependencies | IDs of sub-questions that must complete first |
| priority | Execution priority (1–10); higher = more important |
| context_from_deps | Whether to include dependency results in prompt |
| verification_criteria | Criteria for determining answer completeness |

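The record in Table 1 can be sketched as a small data model. This is an illustrative Python sketch only, with field names normalized to snake_case; the paper does not publish its schema, so the class and defaults here are assumptions:

```python
from dataclasses import dataclass, field

# Hypothetical shape of a Table 1 sub-question record (names and defaults
# are illustrative; only the field list comes from the paper).
@dataclass
class SubQuestion:
    id: str                          # unique identifier, e.g. "sq_001"
    question: str                    # specific, answerable question text
    agent_type: str                  # agent from the taxonomy, e.g. "analysis"
    dependencies: list[str] = field(default_factory=list)  # ids that must finish first
    priority: int = 5                # 1-10; higher = more important
    context_from_deps: bool = False  # prepend dependency results to the prompt?
    verification_criteria: str = ""  # what "complete" means for this answer

sq = SubQuestion(id="sq_005", question="What is the root cause?",
                 agent_type="analysis", dependencies=["sq_001", "sq_002"],
                 priority=8, context_from_deps=True)
```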
Agents are organized into three functional tiers (Figure 1b): Tier 1 (Data Gathering) agents retrieve information from diverse sources, Tier 2 (Analysis) agents reason over this data, and Tier 3 (Output) agents produce final deliverables. This hierarchy reflects the natural information flow in research tasks and enables principled task assignment by the planner.

3.2 PLANNING AND EXECUTION

The QueryPlanner decomposes a complex query into sub-questions organized as a DAG (Table 1). An LLM identifies distinct information requirements, assigns each to an appropriate agent type, establishes dependencies where one sub-question requires another’s output, and sets execution priorities.

The DAGExecutor then orchestrates execution while respecting dependencies and maximizing parallelism (Algorithm 1). It iteratively identifies ready questions—those whose dependencies have completed—and executes batches in parallel (default k = 3). For sub-questions with context_from_deps enabled, results from dependencies are prepended to the query. Figure 2a illustrates how independent sub-questions execute concurrently in Wave 1, while dependent questions execute in subsequent waves. Each execution is wrapped with a configurable timeout (default: 600s) and a tool call limiter to prevent infinite loops.

### Page 4

Algorithm 1 DAG-Based Parallel Execution

```
Require: Execution plan P = (Q, G), max concurrent k
Ensure: Results R = {r1, ..., rn}
 1: completed ← ∅
 2: while |completed| < |Q| do
 3:   ready ← {q ∈ Q : deps(q) ⊆ completed ∧ q ∉ completed}
 4:   batch ← top-k(ready, by = priority)
 5:   results ← parallel_execute(batch)
 6:   for (q, r) in results do
 7:     if q.context_from_deps then
 8:       r ← enrich_with_context(r, {R[d] : d ∈ deps(q)})
 9:     end if
10:     R[q.id] ← r; completed ← completed ∪ {q}
11:   end for
12: end while
13: return R
```

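Algorithm 1 can be sketched in Python as a wave scheduler. This is a minimal sketch, not the paper's DAGExecutor: the `run_dag` name, the plan dict shape, and the caller-supplied `execute` callback are all assumptions, and it omits the paper's timeouts and tool limits.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(questions, execute, k=3):
    """Sketch of Algorithm 1. `questions` maps id -> dict with 'deps',
    'priority', and 'context_from_deps'; `execute(qid, context)` is the
    caller-supplied agent call (illustrative interface only)."""
    results, completed = {}, set()
    while len(completed) < len(questions):
        # line 3: ready = not yet run, all dependencies completed
        ready = [q for q, spec in questions.items()
                 if q not in completed and set(spec["deps"]) <= completed]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in plan")
        # line 4: top-k ready questions by priority form the next batch
        batch = sorted(ready, key=lambda q: -questions[q]["priority"])[:k]
        # line 5: run the batch concurrently; the pool waits on exit
        with ThreadPoolExecutor(max_workers=k) as pool:
            futures = {
                q: pool.submit(
                    execute, q,
                    # lines 7-8: pass dependency results as context if enabled
                    {d: results[d] for d in questions[q]["deps"]}
                    if questions[q]["context_from_deps"] else {})
                for q in batch
            }
        for q, fut in futures.items():
            results[q] = fut.result()
            completed.add(q)
    return results
```

Independent questions land in the same batch (Wave 1 in Figure 2a); a question with dependencies only becomes ready once every upstream result exists, which reproduces the wave structure.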
[Figure 2 omitted. Panel (a), "DAG Execution Example", traces the query "Why did service quality decline and what is the profit impact?" across three execution waves over sq_001 (RAG: service metrics), sq_002 (RAG: customer feedback), sq_003 (Financial: profit data), sq_004 (Competitor: benchmarking), sq_005 (Analysis: root cause), sq_006 (Web: external factors), and sq_007 (Analysis: correlation). Panel (b), "Verification and Replanning", shows Iteration 1 at 40.0% overall completeness (2/5 complete; e.g. sq_001 scores 0.90 but sq_005 only 0.25), a REPLAN step that retries sq_002, sq_004, and sq_005 and adds sq_006 (external factors) and sq_007 (correlation analysis), and Iteration 2 at 85.7% (6/7 complete), crossing the >80% threshold and ready for synthesis.]

Figure 2: (a) DAG execution: independent sub-questions execute in Wave 1; dependent questions in subsequent waves. (b) Verification-driven iteration: Iteration 1 identifies incomplete results, triggering replanning; Iteration 2 achieves sufficient completeness for synthesis.

3.3 VERIFICATION, REPLANNING, AND SYNTHESIS

The ResultVerifier evaluates whether execution results adequately answer their sub-questions (Figure 2b). For each result, it produces: status (complete/partial/incomplete), completeness score (0–1), missing aspects, contradictions, and a recommendation (accept/retry/escalate). Results already marked complete are reused to avoid redundant LLM calls.

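The verifier's per-result output can be sketched as a record plus a routing predicate. A hypothetical sketch: the field list matches the text above, but the class name, types, and `needs_rework` helper are assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a ResultVerifier verdict; only the field list
# (status, 0-1 score, missing aspects, contradictions, recommendation)
# comes from the paper.
@dataclass
class Verification:
    status: str                     # "complete" | "partial" | "incomplete"
    completeness: float             # 0.0-1.0
    missing_aspects: list[str] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)
    recommendation: str = "accept"  # "accept" | "retry" | "escalate"

def needs_rework(v: Verification) -> bool:
    # complete results are reused; only flagged partial/incomplete
    # verdicts re-enter the replanning loop
    return v.status != "complete" and v.recommendation in ("retry", "escalate")
```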
When verification identifies gaps, the AdaptiveReplanner determines corrective actions: retry sub-questions with low scores while preserving previous results, introduce new queries to address specific missing aspects, or merge results from multiple attempts. A key feature is result preservation—previous results are stored and merged with retry attempts, enabling progressive refinement without losing earlier findings.

Determining when to stop iterating is critical for balancing quality and cost. We introduce five configurable stop conditions (Table 2), evaluated after each verification phase: completeness threshold (80% of sub-questions answered), high confidence with partial coverage, diminishing returns (<5% improvement), token budget (1M tokens), and maximum iterations (3). When any condition is met, the system proceeds to synthesis.

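The five stop conditions can be sketched as one check run after each verification pass. The thresholds come from the paper's text; the function shape, the 0.7 per-question "answered" cutoff, and the return labels are illustrative assumptions.

```python
# Hypothetical evaluator for the five stop conditions (Table 2).
def should_stop(scores, confidence, improvement, tokens_used, iteration,
                complete_at=0.80, high_conf=(0.75, 0.50),
                min_gain=0.05, token_budget=1_000_000, max_iters=3):
    """scores: per-sub-question completeness in [0, 1]; returns the name of
    the first condition met, or None to keep iterating."""
    # fraction of sub-questions counted as answered (0.7 cutoff is assumed)
    complete = sum(s >= 0.7 for s in scores) / len(scores)
    if complete >= complete_at:
        return "ready_for_synthesis"
    if confidence >= high_conf[0] and complete >= high_conf[1]:
        return "high_confidence"
    if improvement < min_gain:
        return "diminishing_returns"
    if tokens_used >= token_budget:
        return "token_budget"
    if iteration >= max_iters:
        return "max_iterations"
    return None
```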
For large result sets (>15K characters or 10+ results), direct synthesis would exceed context limits. We address this through hierarchical synthesis: group results by agent type, synthesize within each group to produce condensed summaries, then integrate group summaries into a coherent final answer with proper source attribution.

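The two-level merge above can be sketched as follows. The 15K-character and 10-result triggers come from the paper; the `synthesize` name, the result dict shape, and the `summarize` callback (a stand-in for the LLM synthesis call) are assumptions.

```python
# Illustrative sketch of hierarchical synthesis.
def synthesize(results, summarize, char_limit=15_000, max_direct=10):
    """results: list of dicts with 'agent_type' and 'text';
    summarize: callable taking a list of texts, standing in for the LLM."""
    total = sum(len(r["text"]) for r in results)
    if total <= char_limit and len(results) < max_direct:
        return summarize([r["text"] for r in results])  # small set: direct synthesis
    groups = {}
    for r in results:  # stage 1: bucket results by the agent that produced them
        groups.setdefault(r["agent_type"], []).append(r["text"])
    condensed = [summarize(texts) for texts in groups.values()]
    return summarize(condensed)  # stage 2: integrate the group summaries
```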
### Page 5

Table 2: Stop conditions for orchestration termination

| Condition | Threshold | Rationale |
| --- | --- | --- |
| Ready for Synthesis | 80% complete | Sufficient sub-questions answered |
| High Confidence | 75% conf, 50% complete | High reliability despite partial coverage |
| Diminishing Returns | <5% improvement | Further iteration yields minimal gain |
| Token Budget | 1M tokens | Hard cost limit |
| Max Iterations | 3 iterations | Hard iteration limit |

Table 3: Agent taxonomy with tool allocation across MCP servers (42 unique tools total)

| Tier | Agent | Tools | Key Capabilities |
| --- | --- | --- | --- |
| 1: Data | RAG | 13 | Semantic, keyword, and hybrid retrieval; metadata filtering |
| 1: Data | Web Search | 4 | General and AI-powered search, news retrieval |
| 1: Data | Financial | 7 | Stock quotes, technical indicators, fundamentals |
| 1: Data | Competitor | 11 | Market positioning, benchmarks, competitor news |
| 2: Analysis | Analysis | 20 | Survey analytics, financial and competitor analysis |
| 2: Analysis | Reasoning | 24 | Cross-domain reasoning with RAG, web, and financial tools |
| 2: Analysis | Raw Data | 1 | Python execution (pandas, matplotlib) |
| 3: Output | Document | 4 | Report generation, tables, source citations |
| 3: Output | Visualization | 6 | Chart generation, statistical summaries |

4 IMPLEMENTATION

We implement VMAO using LangGraph for workflow orchestration and the Strands Agent framework for agent execution, integrated with AWS Bedrock. Agent execution uses Claude Sonnet 4.5 as the primary model with Claude Haiku 4.5 as a fallback for graceful degradation; verification and evaluation use Claude Opus 4.5 to provide an independent quality signal. Agents access tools through the Model Context Protocol (MCP), which exposes domain-specific capabilities via independent HTTP microservices. This modular architecture allows adding new tools without modifying agent code.

Table 3 shows the agent taxonomy with tool allocation across eight MCP servers (42 unique tools total). Each server runs independently, enabling horizontal scaling and fault isolation. Agents automatically select appropriate tools based on sub-question requirements.

For production deployment, we implement several safety mechanisms: tool call limiters prevent infinite loops (max 10 consecutive same-tool calls, 50 total per agent), per-execution timeouts enforce bounded latency (default 600s), and phase-level token tracking enables budget enforcement. When the primary model (Sonnet 4.5) is unavailable, the system falls back to Haiku 4.5 with graceful degradation. Real-time observability is provided through Server-Sent Events that stream execution progress to the frontend.

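The tool call limiter described above can be sketched as a small guard. The limits (10 consecutive same-tool calls, 50 total per agent) come from the paper; the class, its name, and the `allow` interface are illustrative assumptions.

```python
# Hypothetical sketch of a per-agent tool call limiter.
class ToolCallLimiter:
    def __init__(self, max_consecutive=10, max_total=50):
        self.max_consecutive = max_consecutive
        self.max_total = max_total
        self.total = 0
        self.last_tool = None
        self.streak = 0

    def allow(self, tool_name):
        """Return False when either limit would be exceeded; a denied call
        leaves the counters unchanged."""
        streak = self.streak + 1 if tool_name == self.last_tool else 1
        if self.total + 1 > self.max_total or streak > self.max_consecutive:
            return False
        self.total += 1
        self.last_tool, self.streak = tool_name, streak
        return True
```

In an agent loop, each tool invocation would first pass through `allow(...)`; a False result ends the agent's turn instead of letting a repeated tool call spin forever.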
5 EXPERIMENTS

5.1 DATASET: MARKET RESEARCH QUERIES

We evaluate VMAO on market research tasks—a domain where traditional research typically requires 2–4 weeks of human effort. These tasks are challenging because relevant data is scattered across heterogeneous sources, answering questions requires diverse expertise (financial, operational, competitive), and synthesis demands cross-referencing while resolving contradictions. We curated 25 queries from domain experts spanning four categories:

### Page 6

• Performance Analysis (8 queries): Operational metrics, trends, and causal factors. Example: “What factors explain the year-over-year change in customer satisfaction?”

• Competitive Intelligence (7 queries): Comparison with industry peers and market positioning. Example: “How does our market share compare to regional competitors?”

• Financial Investigation (5 queries): Financial metrics combined with operational context. Example: “What is driving the change in revenue per customer?”

• Strategic Assessment (5 queries): Open-ended synthesis across multiple dimensions. Example: “What are the key risks and opportunities for geographic expansion?”

Query complexity varies from simpler queries (3–5 sub-questions, 2–3 agent types) to complex ones (8–12 sub-questions, 5+ agent types with multi-level dependencies). Each query consumes 500K–1.1M tokens and requires 10–20 minutes of execution plus domain expert review, making 25 queries a practical yet meaningful evaluation set.

659
+ 5.2 BASELINES AND CONFIGURATION
660
+
661
+ We compare three configurations:
662
+
663
+ • Single-Agent: One reasoning agent with access to all tools, relying on internal reasoning to
664
+
665
+ determine tool invocation order.
666
+
667
+ • Static Pipeline: Predefined agent sequence (RAG → Web → Financial → Analysis →
668
+
669
+ Synthesis) without verification or replanning.
670
+
671
+ • VMAO: Full framework with dynamic decomposition, parallel execution, verification-
672
+
673
+ driven replanning, and stop conditions.
All configurations use Claude Sonnet 4.5 for agent execution and the same tool set. We evaluate Completeness (how thoroughly all query aspects are addressed, 1–5 scale) and Source Quality (proper citation and traceability, 1–5 scale). Evaluation follows a two-stage process: an LLM judge (Claude Opus 4.5) first scores each response using structured rubrics, then human domain experts review and adjust scores where the LLM assessment appears inconsistent or misses domain-specific nuances. We deliberately use a different, more capable model for evaluation than for execution to reduce self-evaluation bias, though both models belong to the same family. In practice, human reviewers adjusted fewer than 15% of LLM scores, typically by ±0.5 points, indicating reasonable LLM-human alignment on these metrics.
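A minimal sketch of this two-stage scoring, with the judge and reviewer as stand-in callables rather than the actual evaluation harness:

```python
# Illustrative two-stage scoring: LLM judge first, optional human override.
# The llm_judge and human_review callables are stand-ins for the real harness.

def two_stage_score(response, llm_judge, human_review):
    """Return a final 1-5 rubric score after optional human adjustment."""
    llm_score = llm_judge(response)               # stage 1: rubric-based LLM score
    adjusted = human_review(response, llm_score)  # stage 2: None means "accept as-is"
    final = llm_score if adjusted is None else adjusted
    return max(1.0, min(5.0, final))              # clamp to the 1-5 rubric range
```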
We evaluate Completeness rather than accuracy because deep research queries have no single ground truth—a question like “What factors explain declining satisfaction?” admits multiple valid answers. Completeness measures whether all relevant aspects are addressed with supporting evidence, better capturing the exploratory nature of research. Source Quality ensures answers are grounded in verifiable sources.
5.3 RESULTS

Table 4 presents the main results across all 25 queries. VMAO achieves substantially higher completeness (+35%) and source quality (+58%) compared to Single-Agent. The Static Pipeline improves over Single-Agent but cannot adapt when initial agents return insufficient results. VMAO’s verification-driven approach identifies gaps and adaptively replans, leading to more complete answers with better source attribution. The increased resource usage reflects verification overhead, justified by quality improvements.

Figure 3(a) shows a typical token distribution across orchestration phases: execution dominates (61%) as agents invoke tools and process results, while verification and synthesis remain efficient. VMAO demonstrates consistent improvements across all query categories (Figure 3(b)), with the largest gains on Strategic Assessment queries (+53% completeness), which require synthesizing information across multiple dimensions. Performance Analysis queries show more modest gains, as these often have well-defined data sources that even single agents can locate.

In our experiments, most queries (>75%) terminate via resource-based conditions (diminishing returns, max iterations, or token budget), reflecting conservative thresholds that prioritize thoroughness
Published as a conference paper at ICLR 2026 Workshop on MALGAI
[Figure 3 appears here: (a) token usage by phase: Planning 8%, Execution 61%, Verification 16%, Replanning 5%, Synthesis 10%; (b) completeness scores (1–5) by query category for Single-Agent, Static Pipeline, and VMAO (Ours)]
Figure 3: (a) Token usage breakdown by orchestration phase for a typical query. Execution dominates at 61%, while verification and synthesis remain efficient. (b) Completeness scores by query category across methods. VMAO shows consistent improvements, with largest gains on Strategic Assessment (+53%).
Table 4: Comparison of orchestration methods on market research tasks. Completeness and Source Quality are co-scored by LLM and human evaluators (1–5 scale, higher is better).

| Method | Completeness | Source Quality | Avg Tokens | Avg Time (s) |
|---|---|---|---|---|
| Single-Agent | 3.1 | 2.6 | 100K | 165 |
| Static Pipeline | 3.5 | 3.2 | 350K | 420 |
| VMAO (Ours) | 4.2 | 4.1 | 850K | 900 |
over speed. These parameters are configurable for deployments requiring faster responses or lower costs.
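Using the default thresholds from Appendix B, the resource-based stop conditions might be checked as in the following sketch; the function shape is our assumption, not the released code:

```python
# Illustrative stop-condition check using the Appendix B defaults.
# The function signature is an assumption, not the paper's implementation.

def should_stop(iteration, tokens_used, completeness, prev_completeness,
                max_iterations=3, token_budget=1_000_000,
                ready_threshold=0.8, diminishing_returns=0.05):
    """Return a stop reason, or None to continue replanning."""
    if completeness >= ready_threshold:
        return "ready"                  # enough coverage to synthesize
    if iteration >= max_iterations:
        return "max_iterations"
    if tokens_used >= token_budget:
        return "token_budget"
    if completeness - prev_completeness < diminishing_returns:
        return "diminishing_returns"    # last iteration barely improved coverage
    return None
```

Raising `ready_threshold` or `max_iterations` trades cost for thoroughness, which is the quality-latency tradeoff discussed above.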
Evaluation Limitations. We acknowledge three caveats: (1) 25 queries is a modest evaluation set without reported confidence intervals, (2) the LLM judge (Opus 4.5) belongs to the same model family as the execution model (Sonnet 4.5), potentially introducing shared biases despite human review, and (3) the Static Pipeline baseline tests verification and replanning jointly without a component-level ablation. We view the current evaluation as a meaningful signal of the framework’s potential, while acknowledging that larger-scale evaluation with independent judges would strengthen the conclusions.
6 DISCUSSION

Unlike skill-based systems (e.g., AutoGPT plugins) that invoke capabilities sequentially within a single agent, VMAO offers explicit DAG decomposition for interpretable plans, parallel execution reducing latency, verification-driven iteration for progressive refinement, and cross-agent synthesis with source attribution. The LLM-based verification serves as a principled coordination signal—assessing whether collective results satisfy the query—decoupling coordination from agent implementation.
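A minimal sketch of DAG-based parallel execution, assuming a simple `{id: [dependency ids]}` plan representation (not the paper's actual schema): sub-questions whose dependencies are all satisfied run concurrently as a wave.

```python
# Illustrative wave-parallel execution of a sub-question DAG.
# The plan representation {id: [dependency ids]} is an assumption.
from concurrent.futures import ThreadPoolExecutor

def execute_dag(sub_questions, run, max_concurrent=3):
    """sub_questions: {id: [dependency ids]}; run(id, dep_results) -> result."""
    results = {}
    remaining = dict(sub_questions)
    while remaining:
        # Ready = every dependency already has a result.
        ready = [q for q, deps in remaining.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise ValueError("dependency cycle in sub-question graph")
        with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
            wave = list(pool.map(
                lambda q: run(q, {d: results[d] for d in remaining[q]}),
                ready))
        for q, res in zip(ready, wave):
            results[q] = res
            del remaining[q]
    return results
```

Independent sub-questions execute in the same wave, which is where the latency reduction over a strictly sequential pipeline comes from.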
When Does Verification Help Most? The largest gains from verification-driven replanning appear on open-ended, multi-dimensional queries (Strategic Assessment: +53% completeness) where initial decomposition inevitably misses relevant aspects. For narrower queries with well-defined data sources (Performance Analysis), single agents already locate most relevant information, and the marginal benefit of replanning is smaller. This suggests verification is most valuable when the query space is difficult to fully characterize upfront—precisely the setting where static pipelines fail. We also observe that the majority of replanning actions are retries of incomplete sub-questions rather than introduction of entirely new ones, indicating that agent execution variance (tool failures, insufficient search results) is a larger contributor to gaps than poor initial decomposition.
Limitations. Our framework has several limitations beyond the evaluation caveats noted in Section 5.3. LLM-based verification may miss subtle factual errors or hallucinations, as it evaluates completeness rather than accuracy—the verifier can confirm that a claim is present and sourced, but cannot independently establish its truth. Poor query decomposition can propagate errors downstream: if the planner misframes a sub-question, the verifier may accept a well-sourced but irrelevant answer. The system’s 8.5× token cost relative to a single agent (850K vs. 100K tokens) may be prohibitive for latency-sensitive or cost-constrained settings. Finally, all experiments use a single model family (Claude); the framework’s effectiveness with other LLM families remains untested.
Transferability and Future Work. The core components—DAG decomposition, verification, and replanning—are domain-agnostic and should transfer to domains like legal discovery or scientific literature review with appropriate agent and tool configuration. Future directions include learning-based stop conditions trained on execution traces, component-level ablation studies to isolate the contribution of each framework element, evaluation with diverse model families, and human-in-the-loop verification for high-stakes queries.
7 CONCLUSION

We presented VMAO, a framework that coordinates specialized LLM agents through a Plan-Execute-Verify-Replan loop. On 25 market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1–5 scale) compared to single-agent baselines, with the largest gains on open-ended queries that require multi-dimensional synthesis. Our results suggest that orchestration-level verification—where an independent model evaluates whether collective agent results satisfy the original query—is an effective coordination mechanism for multi-agent systems. Key open questions remain around component-level contributions, generalization across model families and domains, and scalable evaluation methodology. We will release the implementation upon publication.
A PROMPT TEMPLATES

We provide simplified versions of the core prompts. Each follows a structured format with input specifications, decision rules, and JSON output schemas.
Planning Prompt

You are a query planner. Decompose complex queries into sub-questions for specialized agents.

Input: Original query, conversation context, available agents

Planning Rules:

– RAG First: Always search internal knowledge base first or in parallel

– Maximize Parallelism: Execute independent questions simultaneously

– Minimize Dependencies: Only when results feed into other questions

– Be Specific: Clear, answerable scope for each question

Sub-question Fields: id, question, agent_type, dependencies, priority, context_from_deps, verification_criteria

Output: JSON with sub_questions array and explanation
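For illustration, a planner output conforming to the fields above might look like the following; the exact field names and example values are our assumptions, not the paper's verbatim schema:

```python
# Illustrative planner output mirroring the sub-question fields above.
# Field names and example content are assumptions, not the exact schema.

plan_output = {
    "sub_questions": [
        {
            "id": "sq1",
            "question": "What was revenue per customer over the last four quarters?",
            "agent_type": "financial",
            "dependencies": [],
            "priority": 1,
            "context_from_deps": [],
            "verification_criteria": "Quarterly figures with source citations",
        },
        {
            "id": "sq2",
            "question": "Which customer segments drive the change?",
            "agent_type": "analysis",
            "dependencies": ["sq1"],       # consumes sq1's results
            "priority": 2,
            "context_from_deps": ["sq1"],
            "verification_criteria": "Segment-level breakdown tied to sq1 data",
        },
    ],
    "explanation": "sq1 has no dependencies and runs first; sq2 depends on it.",
}
```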
1078
+
1079
+ 9
1080
+
1081
+ ### Page 10
1082
+
1083
+ Published as a conference paper at ICLR 2026 Workshop on MALGAI
1084
+
Verification Prompt

Verify if the sub-question has been adequately answered with proper metadata.

Input: Sub-question, verification criteria, result, dependency results

Evaluation Criteria:

– Completeness: All aspects of the question addressed?

– Evidence Quality: Multiple sources? Cross-referenced?

– Metadata: Source attribution (filename/URL/date) present?

– Specificity: Concrete facts/numbers vs. vague claims?

– Contradictions: Conflicts between sources?

Output: JSON with verification_status (complete/partial/incomplete), completeness_score (0–1), missing_aspects, confidence, recommendation (accept/retry/escalate)
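An illustrative verifier output and acceptance check follow; the field names mirror the prompt above, while the accept rule itself is our assumption:

```python
# Illustrative verifier verdict plus an acceptance check.
# Field names follow the prompt; the accept logic is an assumption.

def accept_result(verification):
    """Accept only complete results the verifier recommends accepting."""
    return (verification["verification_status"] == "complete"
            and verification["recommendation"] == "accept")

verdict = {
    "verification_status": "partial",
    "completeness_score": 0.6,
    "missing_aspects": ["no source attribution for Q3 figures"],
    "confidence": 0.7,
    "recommendation": "retry",   # feeds the replanning step below
}
```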
Replanning Prompt

Determine next actions based on verification results.

Input: Original query, execution plan, completed/incomplete results, iteration count

Critical Rule: MUST include ALL incomplete sub-question IDs in the retry list.

Decision Logic:

– completeness > 0.8: Proceed to synthesis (done)

– Incomplete results exist: Add ALL to retry_sub_questions

– completeness 0.5–0.8: Add new sub-questions to fill gaps

– Contradictions found: Add queries targeting different sources

– iterations ≥ max: Return empty lists (done)

Output: JSON with retry_sub_questions, new_sub_questions, explanation
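The decision logic above can be transcribed into a small function; the input and output shapes here are illustrative assumptions:

```python
# Direct transcription of the replanning decision logic above.
# Input/output shapes are illustrative, not the paper's schema.

def replan(completeness, incomplete_ids, contradiction_queries, gap_questions,
           iterations, max_iterations=3):
    """Return (retry_sub_questions, new_sub_questions, done)."""
    if iterations >= max_iterations or completeness > 0.8:
        return [], [], True                  # done: proceed to synthesis
    retries = list(incomplete_ids)           # MUST retry ALL incomplete IDs
    new = []
    if 0.5 <= completeness <= 0.8:
        new.extend(gap_questions)            # new sub-questions to fill gaps
    if contradiction_queries:
        new.extend(contradiction_queries)    # target different sources
    return retries, new, False
```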
Synthesis Prompt

Synthesize results from multiple agents into a concise, well-cited answer.

Input: Original query, sub-question results, verification summary

Required Structure:

1. Executive Summary (2–3 sentences with key metrics)

2. Key Findings (5–8 bullets with source citations)

3. Analysis (2–3 paragraphs connecting insights)

4. Conclusions (confidence level and limitations)

Citation Format: [source - section/URL, metadata]

Output: JSON with answer, key_findings, confidence, sources, gaps
B CONFIGURATION PARAMETERS

Table 5 lists the default configuration parameters used in our experiments. These can be tuned for different quality-latency tradeoffs.
Table 5: Configuration parameters for VMAO orchestration

| Parameter | Default | Description |
|---|---|---|
| max_iterations | 3 | Maximum replanning iterations |
| token_budget | 1M | Maximum tokens before stopping |
| ready_threshold | 0.8 | Completeness ratio for synthesis |
| high_confidence | 0.75 | Confidence threshold for early stop |
| diminishing_returns | 0.05 | Minimum improvement to continue |
| max_concurrent | 3 | Parallel agent executions |
| agent_timeout | 600s | Per-agent timeout |
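For reference, the Table 5 defaults can be expressed as a config object; the class itself is an illustrative sketch, not the package's API:

```python
# Table 5 defaults as a config object. Names mirror the table;
# the dataclass itself is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class VMAOConfig:
    max_iterations: int = 3            # maximum replanning iterations
    token_budget: int = 1_000_000      # maximum tokens before stopping
    ready_threshold: float = 0.8       # completeness ratio for synthesis
    high_confidence: float = 0.75      # confidence threshold for early stop
    diminishing_returns: float = 0.05  # minimum improvement to continue
    max_concurrent: int = 3            # parallel agent executions
    agent_timeout: int = 600           # per-agent timeout (seconds)

# A hypothetical lower-cost variant for latency-sensitive deployments.
fast = VMAOConfig(max_iterations=1, token_budget=200_000)
```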