@chllming/wave-orchestration 0.6.3 → 0.7.0

Files changed (112)
  1. package/CHANGELOG.md +57 -1
  2. package/README.md +39 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +48 -11
  39. package/docs/plans/context7-wave-orchestrator.md +20 -0
  40. package/docs/plans/current-state.md +8 -1
  41. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  42. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  43. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  44. package/docs/plans/wave-orchestrator.md +62 -11
  45. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  46. package/docs/reference/coordination-and-closure.md +436 -0
  47. package/docs/reference/live-proof-waves.md +25 -3
  48. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  49. package/docs/reference/proof-metrics.md +90 -0
  50. package/docs/reference/runtime-config/README.md +61 -0
  51. package/docs/reference/sample-waves.md +29 -18
  52. package/docs/reference/wave-control.md +164 -0
  53. package/docs/reference/wave-planning-lessons.md +131 -0
  54. package/package.json +5 -4
  55. package/releases/manifest.json +18 -0
  56. package/scripts/research/agent-context-archive.mjs +18 -0
  57. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  58. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  59. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  60. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  61. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  62. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  63. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  64. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  65. package/scripts/wave-orchestrator/config.mjs +175 -0
  66. package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
  67. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  68. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  69. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  70. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  71. package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
  72. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  73. package/scripts/wave-orchestrator/evals.mjs +23 -0
  74. package/scripts/wave-orchestrator/executors.mjs +3 -2
  75. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  76. package/scripts/wave-orchestrator/install.mjs +55 -1
  77. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  78. package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
  79. package/scripts/wave-orchestrator/launcher.mjs +796 -35
  80. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  81. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  82. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  83. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  84. package/scripts/wave-orchestrator/replay.mjs +10 -4
  85. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  86. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  87. package/scripts/wave-orchestrator/shared.mjs +26 -0
  88. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  89. package/scripts/wave-orchestrator/traces.mjs +157 -2
  90. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  91. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  92. package/scripts/wave-orchestrator/wave-files.mjs +17 -5
  93. package/scripts/wave.mjs +27 -0
  94. package/skills/repo-coding-rules/SKILL.md +1 -0
  95. package/skills/role-cont-eval/SKILL.md +1 -0
  96. package/skills/role-cont-qa/SKILL.md +13 -6
  97. package/skills/role-deploy/SKILL.md +1 -0
  98. package/skills/role-documentation/SKILL.md +4 -0
  99. package/skills/role-implementation/SKILL.md +4 -0
  100. package/skills/role-infra/SKILL.md +2 -1
  101. package/skills/role-integration/SKILL.md +15 -8
  102. package/skills/role-planner/SKILL.md +39 -0
  103. package/skills/role-planner/skill.json +21 -0
  104. package/skills/role-research/SKILL.md +1 -0
  105. package/skills/role-security/SKILL.md +2 -2
  106. package/skills/runtime-claude/SKILL.md +2 -1
  107. package/skills/runtime-codex/SKILL.md +1 -0
  108. package/skills/runtime-local/SKILL.md +2 -0
  109. package/skills/runtime-opencode/SKILL.md +1 -0
  110. package/skills/wave-core/SKILL.md +25 -6
  111. package/skills/wave-core/references/marker-syntax.md +16 -8
  112. package/wave.config.json +45 -0
@@ -44,13 +44,20 @@ This runbook is the operational view of the architecture:
44
44
  - `pnpm exec wave doctor`
45
45
  - `pnpm exec wave launch --lane main --dry-run --no-dashboard`
46
46
  - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access`
47
+ - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access --resident-orchestrator`
47
48
  - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor claude`
48
49
  - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor opencode`
49
50
  - `pnpm exec wave autonomous --lane main --executor codex --codex-sandbox danger-full-access`
50
51
  - `pnpm exec wave feedback list --lane main --pending`
51
- - `pnpm exec wave coord show --lane main --wave 0 --dry-run`
52
+ - `pnpm exec wave control status --lane main --wave 0 --json`
53
+ - `pnpm exec wave control status --lane main --wave 0 --agent A1 --json`
52
54
  - `pnpm exec wave coord inbox --lane main --wave 0 --agent A1 --dry-run`
53
- - `pnpm exec wave coord post --lane main --wave 0 --agent A1 --kind blocker --summary "Need repository decision"`
55
+ - `pnpm exec wave control task create --lane main --wave 0 --agent A1 --kind blocker --summary "Need repository decision"`
56
+ - `pnpm exec wave control task act reassign --lane main --wave 0 --id <task-id> --to A2`
57
+ - `pnpm exec wave control rerun get --lane main --wave 0 --json`
58
+ - `pnpm exec wave control rerun request --lane main --wave 0 --agent A2 --agent A7 --clear-reuse A2 --reason "resume sibling-owned shared component closure"`
59
+ - `pnpm exec wave control proof register --lane main --wave 9 --agent A7 --artifact .tmp/wave-9-proof/live-status.json --authoritative --completion live --durability durable --proof-level live`
60
+ - `pnpm exec wave local --prompt .tmp/main-wave-launcher/prompts/wave-0-A1.md --log .tmp/main-wave-launcher/logs/wave-0-A1.log --status .tmp/main-wave-launcher/status/wave-0-A1.json`
54
61
  - `pnpm exec wave dep show --lane main --wave 0 --json`
55
62
  - `pnpm exec wave dep post --owner-lane main --requester-lane release --owner-wave 0 --requester-wave 2 --agent launcher --summary "Need shared-plan reconciliation" --target capability:docs-shared-plan --required`
56
63
  - `pnpm exec wave upgrade`
@@ -58,14 +65,15 @@ This runbook is the operational view of the architecture:
58
65
 
59
66
  ## Configuration
60
67
 
61
- - `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences, and Context7 bundle-index location. The starter config also wires the optional security reviewer prompt at `docs/agents/wave-security-role.md` and the `security-review` executor profile.
68
+ - `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences, Context7 bundle-index location, and the optional `waveControl` telemetry section. The starter config also wires the optional security reviewer prompt at `docs/agents/wave-security-role.md` and the `security-review` executor profile.
62
69
  - `docs/context7/bundles.json` controls allowed external library bundles and lane defaults.
63
70
  - `docs/evals/README.md` explains how to author delegated versus pinned `## Eval targets`, including the coordination-oriented benchmark families.
64
71
  - `docs/reference/live-proof-waves.md` explains how to author proof-first `pilot-live` and higher-maturity waves with `### Proof artifacts`, sticky executors, and operator command capture.
65
72
  - `docs/reference/sample-waves.md` points to showcase-first sample waves that combine the modern authored wave surface in concrete examples.
73
+ - `docs/reference/wave-control.md` documents the Wave Control telemetry and analysis plane, including entity types, artifact upload policies, and the local-first reporting contract.
66
74
  - `docs/plans/component-cutover-matrix.json` is the canonical machine-readable source for component maturity and per-wave promotion targets.
67
75
  - `.wave/install-state.json` records how the workspace was initialized and which package version is installed.
68
- - `.wave/project-profile.json` records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
76
+ - `.wave/project-profile.json` (created by `wave project setup`) records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
69
77
  - `.wave/adhoc/runs/<run-id>/` stores transient ad-hoc request, spec, rendered markdown, and result artifacts.
70
78
  - ad-hoc documentation closure always writes `.wave/adhoc/runs/<run-id>/reports/`, but shared-plan deltas still queue the canonical lane shared-plan docs.
71
79
  - ad-hoc task ownership inference only accepts repo-local paths; URLs and other external references are ignored.
@@ -76,12 +84,12 @@ This runbook is the operational view of the architecture:
76
84
  - Wave skill bundles live under `skills/<skill-id>/`.
77
85
  - Each bundle requires `skill.json` and `SKILL.md`.
78
86
  - Bundles can also include runtime adapters at `adapters/<runtime>.md` for `codex`, `claude`, `opencode`, or `local`.
79
- - The starter config resolves skills in this order: global base, lane base, global role map, lane role map, global runtime map, lane runtime map, global deploy-kind map, lane deploy-kind map, then explicit per-agent `### Skills`.
87
+ - The starter config merges global and lane skill configs, then resolves in order: base, role, runtime, deploy-kind, and finally explicit per-agent `### Skills`.
80
88
  - The effective skill set is recomputed after final executor resolution, including retry-time runtime fallback, so a fallback from one runtime to another also swaps runtime-specific skill overlays.
81
89
  - Starter bundles in this repo cover:
82
90
  - core Wave coordination and repo coding rules
83
91
  - runtime packs for Codex, Claude, OpenCode, and local execution
84
- - role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, and research work
92
+ - role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, research, and planner work
85
93
  - deploy and environment packs for Railway, Docker Compose, Kubernetes, SSH/manual rollout, and generic custom deploys
86
94
  - explicit provider packs for GitHub release flow and AWS norms when a wave or lane wants to attach them
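The skill resolution order described in the list above can be sketched as follows. This is a minimal illustration, not the package's resolver: the config keys (`base`, `byRole`, `byRuntime`, `byDeployKind`) and the agent fields are hypothetical names chosen for the example, and the "global before lane within each layer" ordering is an assumption taken from the earlier wording of that bullet.

```javascript
// Hypothetical config shape; the real wave.config.json keys may differ.
// Append ids in order, dropping duplicates while keeping first-seen position.
function pushUnique(out, ids = []) {
  for (const id of ids) {
    if (!out.includes(id)) out.push(id);
  }
}

// Look up one layer of a skill config: either a flat list (base)
// or a map keyed by role / runtime / deploy kind.
function pickLayer(cfg, key, sub) {
  return sub === undefined ? cfg[key] : (cfg[key] || {})[sub];
}

function resolveSkills(globalCfg, laneCfg, agent) {
  const out = [];
  const layers = [
    ["base"],
    ["byRole", agent.role],
    ["byRuntime", agent.runtime],
    ["byDeployKind", agent.deployKind],
  ];
  for (const [key, sub] of layers) {
    pushUnique(out, pickLayer(globalCfg, key, sub)); // global entries first
    pushUnique(out, pickLayer(laneCfg, key, sub)); // then lane entries
  }
  pushUnique(out, agent.skills); // explicit per-agent `### Skills` come last
  return out;
}
```

Calling `resolveSkills` again with a different `agent.runtime` yields the other runtime's overlay, which mirrors how the effective skill set is recomputed after a retry-time runtime fallback.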
87
95
 
@@ -124,13 +132,22 @@ pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex -
124
132
 
125
133
  ## Coordination Surfaces
126
134
 
127
- - `wave coord show` is a read-only view of the materialized coordination state for a wave.
135
+ - `wave control status` is the read-only projection for "why blocked / why retrying" at wave or agent scope. It returns blocking edges, logical agent state, tasks, dependencies, rerun intent, active proof bundles, and next timers from one materialized control-plane view.
136
+ - `wave control task create|get|list|act` is the operator task surface. It covers blocking records (requests, blockers, clarification chains, human-input tickets, escalations) as well as informational ones (handoffs, evidence, claims, decisions). `wave control status` only treats requests, blockers, clarifications, human-input, escalations, helper assignments, and required dependencies as blocking edges.
137
+ - `wave control rerun request|get|clear` manages targeted rerun intent under `.tmp/<lane>-wave-launcher/control-plane/` and projects compatible retry overrides under `.tmp/<lane>-wave-launcher/control/`, including selected agents, reuse selectors, invalidated components, and clear or preserve reuse lists.
138
+ - `wave control proof register|get|supersede|revoke` manages authoritative proof bundles in the same control-plane log and projects compatible proof registries under `.tmp/<lane>-wave-launcher/proof/`.
139
+ - `wave control telemetry status|flush` inspects and delivers the local Wave Control event queue. Pass `--no-telemetry` on `wave launch` to disable event publication for a single run.
128
140
  - `wave coord render` regenerates the markdown board projection from the canonical coordination log.
129
141
  - `wave coord inbox` writes the compiled shared summary plus the selected agent inbox.
130
- - `wave coord post` appends a structured record to the coordination log. This is the machine-readable path for blockers, handoffs, evidence, targeted requests, and clarification requests.
142
+
143
+ Compatibility note:
144
+
145
+ - `wave coord`, `wave retry`, and `wave proof` remain available as compatibility surfaces, but new operator docs and runbooks should prefer `wave control`.
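The blocking-edge rule stated for `wave control status` can be pictured as a small predicate. This is only a sketch under assumptions: the record fields (`kind`, `required`, `status`) and the kind strings are hypothetical, and treating a resolved record as non-blocking is an assumption rather than documented behavior.

```javascript
// Hypothetical kind strings; the real control-plane schema may name them differently.
const BLOCKING_KINDS = new Set([
  "request",
  "blocker",
  "clarification",
  "human-input",
  "escalation",
  "helper-assignment",
]);

function isBlockingEdge(record) {
  // Assumption: a record that has already resolved no longer blocks the wave.
  if (record.status === "resolved") return false;
  // Dependencies block only when they are marked as required.
  if (record.kind === "dependency") return record.required === true;
  // Handoffs, evidence, claims, and decisions are informational.
  return BLOCKING_KINDS.has(record.kind);
}

// Example: keep only the records that would hold a wave open.
function blockingEdges(records) {
  return records.filter(isBlockingEdge);
}
```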
131
146
 
132
147
  The canonical state is the JSONL log under `.tmp/<lane>-wave-launcher/coordination/`. The markdown board is a generated projection for humans, not the scheduler's source of truth.
133
148
 
149
+ Control-plane facts that drive reruns, proof, attempt state, and operator tasks are appended separately under `.tmp/<lane>-wave-launcher/control-plane/`. Legacy proof and retry files remain derived projections for compatibility, not the source of truth.
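The relationship between an append-only log and a derived projection can be illustrated with a short fold over JSONL lines. The event shape below (`kind`, `agents`, `reason`) and the event names are hypothetical and used only for the illustration; the actual control-plane schema is not shown in this diff excerpt.

```javascript
// Parse an append-only JSONL log, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Fold events in append order to derive the current rerun intent.
// Hypothetical event kinds: "rerun-request" sets intent, "rerun-clear" removes it.
function projectRerunIntent(events) {
  let intent = null;
  for (const event of events) {
    if (event.kind === "rerun-request") {
      intent = { agents: event.agents, reason: event.reason };
    } else if (event.kind === "rerun-clear") {
      intent = null;
    }
  }
  return intent;
}

// Usage with an in-memory log instead of a file under control-plane/.
const sampleLog = [
  '{"kind":"rerun-request","agents":["A2","A7"],"reason":"resume closure"}',
  '{"kind":"task-created","id":"t1"}',
].join("\n");
const currentIntent = projectRerunIntent(parseJsonl(sampleLog));
```

Because the projection is recomputed from the log, a stale or deleted projection file loses nothing; that is the sense in which the log, not the projection, is the source of truth.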
150
+
134
151
  Capability-targeted requests now become deterministic helper assignments. The launcher resolves the assignee from explicit targets, `capabilityRouting.preferredAgents`, then least-busy matching capability owners, writes that assignment into `.tmp/<lane>-wave-launcher/assignments/`, mirrors the decision into coordination state, and keeps the wave blocked until the linked follow-up resolves.
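The resolution order in that paragraph (explicit target, then `capabilityRouting.preferredAgents`, then the least-busy matching owner) might look roughly like this. The data shapes are assumptions made for the sketch: `preferredAgents` is treated as a map from capability to agent ids, and `openAssignments` is a hypothetical busyness counter.

```javascript
// Hypothetical shapes; none of these field names are confirmed by the package source.
function resolveAssignee(request, routing, agents) {
  // 1. An explicit agent target wins outright.
  if (request.targetAgent) {
    return { agent: request.targetAgent, reason: "explicit-target" };
  }
  const owners = agents.filter((a) => a.capabilities.includes(request.capability));
  if (owners.length === 0) return null; // no owner: nothing can be assigned

  // 2. Otherwise take the first preferred agent that owns the capability.
  const preferredIds = routing.preferredAgents[request.capability] || [];
  const preferred = preferredIds.find((id) => owners.some((a) => a.id === id));
  if (preferred) return { agent: preferred, reason: "preferred-agent" };

  // 3. Fall back to the least-busy owner; tie-break on id so the result is deterministic.
  const sorted = [...owners].sort(
    (a, b) => a.openAssignments - b.openAssignments || a.id.localeCompare(b.id)
  );
  return { agent: sorted[0].id, reason: "least-busy" };
}
```

The id tie-break is one way to get the determinism the text describes: the same inputs always produce the same assignee, so a replay reaches the same assignment.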
135
152
 
136
153
  Clarification flow is orchestrator-first:
@@ -141,6 +158,19 @@ Clarification flow is orchestrator-first:
141
158
  4. Routed clarification follow-up requests remain blocking until they resolve.
142
159
  5. Human escalations are written back into coordination state, the ledger, and trace artifacts.
143
160
 
161
+ During live runs, the launcher now keeps an active orchestration loop while agents are still running. It refreshes the derived coordination surfaces on cadence, surfaces overdue acknowledgements and stale clarification chains in dashboards and traces, and can reroute clarification follow-up requests inside the same attempt when the routed owner never acknowledges them.
162
+
163
+ If you opt into `--resident-orchestrator`, the launcher also starts a long-running non-owning orchestrator session for the wave. That session can inspect the same coordination artifacts and intervene through coordination records, but the launcher remains the scheduler truth and closure authority.
164
+
165
+ Retry intent, operator tasks, attempt lifecycle, and proof injection are now first-class control-plane artifacts rather than manual file surgery:
166
+
167
+ - canonical control events live under `.tmp/<lane>-wave-launcher/control-plane/`
168
+ - projected retry overrides still live under `.tmp/<lane>-wave-launcher/control/`
169
+ - projected proof registries still live under `.tmp/<lane>-wave-launcher/proof/`
170
+ - live traces now copy the control-plane log alongside the proof registry so replay keeps the same operator-visible facts
171
+
172
+ For a full end-to-end explainer of helper assignments, deliverables, integration, and why an agent can be locally done while the wave stays blocked, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
173
+
144
174
  ## Cross-Lane Dependencies
145
175
 
146
176
  - `wave dep post` appends a typed dependency ticket under `.tmp/wave-orchestrator/dependencies/`.
@@ -181,8 +211,10 @@ pnpm exec wave changelog --since-installed
181
211
 
182
212
  - prompts: `.tmp/<lane>-wave-launcher/prompts/`
183
213
  - logs: `.tmp/<lane>-wave-launcher/logs/`
214
+ - run-state: `.tmp/<lane>-wave-launcher/run-state.json`
215
+ Keeps compatibility `completedWaves`, but now also stores per-wave current state plus append-only transition history and completion or blocker evidence.
184
216
  - status summaries: `.tmp/<lane>-wave-launcher/status/`
185
- `run-state.json` keeps compatibility `completedWaves`, but now also stores per-wave current state plus append-only transition history and completion or blocker evidence. Relaunch plans in this directory are schema-versioned.
217
+ Relaunch plans in this directory are schema-versioned.
186
218
  - coordination logs: `.tmp/<lane>-wave-launcher/coordination/`
187
219
  - helper-assignment snapshots: `.tmp/<lane>-wave-launcher/assignments/`
188
220
  - message boards: `.tmp/<lane>-wave-launcher/messageboards/`
@@ -195,6 +227,12 @@ pnpm exec wave changelog --since-installed
195
227
  - dependency snapshots: `.tmp/<lane>-wave-launcher/dependencies/`
196
228
  - docs queue: `.tmp/<lane>-wave-launcher/docs-queue/`
197
229
  - trace bundles: `.tmp/<lane>-wave-launcher/traces/`
230
+ - control-plane events: `.tmp/<lane>-wave-launcher/control-plane/`
231
+ Canonical append-only JSONL log of operator tasks, rerun requests, proof bundles, attempt lifecycle, and human-input events. This is the source of truth for `wave control`. The telemetry queue lives under `control-plane/telemetry/`.
232
+ - proof registries: `.tmp/<lane>-wave-launcher/proof/`
233
+ Projected from control-plane state for compatibility. Holds the operator-registered authoritative proof bundles that feed integration, cont-QA, and replay.
234
+ - retry overrides: `.tmp/<lane>-wave-launcher/control/`
235
+ Projected from control-plane state for compatibility. Holds operator-applied targeted retry overrides, which are applied once per attempt and then cleared by the launcher.
198
236
  - clarification triage: `.tmp/<lane>-wave-launcher/feedback/triage/`
199
237
  - dashboards: `.tmp/<lane>-wave-launcher/dashboards/`
200
238
  Dashboard JSON is a versioned contract. `global.json` and `wave-<n>.json` now carry explicit `schemaVersion` and `kind` fields.
@@ -220,10 +258,13 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
220
258
  - `ledger.json`
221
259
  - `docs-queue.json`
222
260
  - `security.json`
261
+ - `capability-assignments.json`
262
+ - `dependency-snapshot.json`
223
263
  - `integration.json`
224
264
  - `outcome.json`
225
265
  - `shared-summary.md`
226
266
  - copied prompt, log, status, inbox, and summary artifacts per launched agent
267
+ - `control-plane.raw.jsonl`
227
268
  - `structured-signals.json`
228
269
  - `quality.json`
229
270
  - `run-metadata.json`
@@ -232,7 +273,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
232
273
  - For `traceVersion: 2`, launched agents must have copied prompt/log/status/inbox/summary artifacts, and promoted-component waves must include the copied component matrix JSON.
233
274
  - `security.json` stores the derived per-wave security state that feeds integration summaries, gate snapshots, and replay.
234
275
  - `quality.json` is cumulative through the current attempt. It is intended for regression comparison, not only for one-shot pass/fail reporting.
235
- - `quality.json` also reports capability-assignment and dependency-resolution metrics in addition to the Phase 2/3 communication, fallback, and closure metrics.
276
+ - `quality.json` also reports capability-assignment and dependency-resolution metrics, plus coordination response metrics (overdue acknowledgements, clarification timing, human escalation counts), in addition to the Phase 2/3 communication, fallback, and closure metrics.
236
277
  - Replay support is internal. The source tree contains helpers to load, validate, and replay trace bundles against the same gate logic the launcher uses, but there is no public replay CLI yet.
237
278
  - Replay is read-only and hash-validating for `traceVersion: 2` bundles. It ignores inline summary duplicates in `run-metadata.json` and returns a stored-vs-recomputed comparison report for gate and quality state. Legacy `traceVersion: 1` bundles remain best-effort and emit warnings instead of claiming full hermetic replay.
238
279
 
@@ -257,7 +298,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
257
298
  - Optional standing roles available in this repo include `docs/agents/wave-infra-role.md` for infra proof and `docs/agents/wave-deploy-verifier-role.md` for rollout verification.
258
299
  - Keep file ownership explicit inside each `### Prompt`.
259
300
  - From the configured thresholds onward, declare `## Context7 defaults`, per-agent `### Context7`, and per-agent `### Exit contract`.
260
- - For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md).
301
+ - For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md). External benchmark failure reviews classify outcomes into categories (`verifier-image`, `setup-harness`, `timeout`, `blocked-proof`, `missing-context`, `partial-fix`, `wrong-fix`, `unknown`) which feed the failure-review tooling available through `wave benchmark external-show`.
261
302
  - For proof-first live-wave patterns, sticky retry guidance, and `### Proof artifacts` examples, see [docs/reference/live-proof-waves.md](../reference/live-proof-waves.md).
262
303
  - Agents should use `wave coord post` for durable blockers, handoffs, evidence, and requests instead of relying on ad hoc board edits.
263
304
  - Keep shared plan docs and the component cutover matrix owned by the configured documentation steward once that rule becomes active.
@@ -267,6 +308,14 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
267
308
  [claude.md](../reference/runtime-config/claude.md),
268
309
  [opencode.md](../reference/runtime-config/opencode.md).
269
310
 
311
+ ## Benchmark CLI
312
+
313
+ - `wave benchmark list` lists local benchmark cases from the catalog.
314
+ - `wave benchmark show --case <id>` shows a single case definition.
315
+ - `wave benchmark run --case <id>` executes a local deterministic case.
316
+ - `wave benchmark adapters` lists available external benchmark adapters.
317
+ - `wave benchmark external-list|external-show|external-run|external-pilots` manage external benchmark targets (e.g., SWE-bench Pro).
318
+
270
319
  ## Executor Modes
271
320
 
272
321
  - `--executor codex` uses `codex exec` with the generated task prompt piped through stdin.
@@ -335,3 +384,5 @@ Live closure is fail-closed:
335
384
  - Security review requires a report artifact plus a structured `[wave-security]` marker. `state=blocked` stops the wave before integration, while `state=concerns` is preserved in summaries and traces without automatically failing closure.
336
385
  - `cont-QA` PASS requires both the final verdict and the final `[wave-gate]` marker.
337
386
  - Legacy evaluator-era or underspecified closure artifacts are still readable in replay and trace analysis, but they no longer satisfy live completion.
387
+
388
+ For a detailed worked example of cross-agent follow-up and staged closure, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
@@ -0,0 +1,118 @@
1
+ # Wave 1 Benchmark Operator Review
2
+
3
+ ## Scope
4
+
5
+ This document reviews the `SWE-bench Pro` 10-task `full-wave` review-only batch run.
6
+
7
+ - manifest: `docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json`
8
+ - command config: `docs/evals/external-command-config.swe-bench-pro.json`
9
+ - source evidence: recorded aggregate results plus per-task verifier stdout/stderr logs and integration summaries from the benchmark worktree pass
10
+
11
+ Command used:
12
+
13
+ ```bash
+ node "scripts/wave.mjs" benchmark external-run \
+   --adapter swe-bench-pro \
+   --manifest docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json \
+   --arm full-wave \
+   --command-config docs/evals/external-command-config.swe-bench-pro.json \
+   --model-id gpt-5-codex \
+   --executor-id codex \
+   --executor-command "codex exec" \
+   --tool-permissions "Read,Write,Edit,Bash" \
+   --temperature 0 \
+   --reasoning-effort high \
+   --max-wall-clock-minutes 15 \
+   --max-turns 250 \
+   --retry-limit 0 \
+   --verification-harness official-swe-bench-pro \
+   --dataset-version public-v1 \
+   --output-dir .tmp/wave-benchmarks/external/swe-bench-pro-full-wave-review-10 \
+   --json
+ ```
33
+
34
+ This was a `review-only` run, not a matched `single-agent` versus `full-wave` comparison.
35
+
36
+ ## Verdict
37
+
38
+ - Official resolved score: `0/10`
39
+ - Interpretable capability score: `not valid for external comparison`
40
+ - Recommendation: `blocked`
41
+
42
+ Why this is blocked:
43
+
44
+ - `7/10` tasks reached the official SWE-bench Pro evaluator, but the evaluator could not pull the expected Docker image tag from `jefzda/sweap-images`, so those zeros are not trustworthy model-performance failures.
45
+ - `3/10` tasks failed earlier in harness or repository setup before a trustworthy benchmark judgment existed.
46
+ - The raw aggregate `reviewBuckets` from the runner said `harness-env=10`; that was directionally closer to the truth than `incorrect-patch`, but still too coarse. The corrected manual buckets below are the review-ready interpretation.
47
+
48
+ ## Aggregate Metrics
49
+
50
+ Recorded totals from the 10-task batch:
51
+
52
+ - tasks: `10`
53
+ - solved: `0`
54
+ - success rate: `0%`
55
+ - total wall clock: `2810439 ms`
56
+ - token totals:
57
+ - `input_tokens = 59155820`
58
+ - `cached_input_tokens = 54180608`
59
+ - `output_tokens = 278308`
60
+
61
+ Corrected manual failure buckets:
62
+
63
+ - `7` verifier-image failures
64
+ - `3` setup or harness failures before trustworthy scoring
65
+ - `0` trustworthy patch-quality failures established by the official verifier
66
+
67
+ ## Task Scorecard
68
+
69
+ Scoring convention used here:
70
+
71
+ - `official score`: the raw `0/1` result recorded by the run artifacts
72
+ - `review score`: whether that official score is trustworthy enough to interpret as model capability evidence
73
+
74
+ | Task | Repo | Official score | Review score | Wall clock | Notes |
+ | --- | --- | --- | --- | ---: | --- |
+ | `instance_NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan` | `NodeBB/NodeBB` | `0` | `invalidated` | `807464 ms` | Full-wave solve ran and produced a patch, but the official evaluator failed to pull `jefzda/sweap-images:nodebb.nodebb-NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5` and returned `None`. |
+ | `instance_qutebrowser__qutebrowser-f91ace96223cac8161c16dd061907e138fe85111-v059c6fdc75567943479b23ebca7c07b5e9a7f34c` | `qutebrowser/qutebrowser` | `0` | `invalidated` | `369151 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `qutebrowser` image tag and returned `None`. |
+ | `instance_ansible__ansible-f327e65d11bb905ed9f15996024f857a95592629-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5` | `ansible/ansible` | `0` | `setup-failure` | `499457 ms` | Patch extraction failed during `git diff`; the task workspace had local `.venv` churn, so this never reached a trustworthy verifier judgment. |
+ | `instance_internetarchive__openlibrary-4a5d2a7d24c9e4c11d3069220c0685b736d5ecde-v13642507b4fc1f8d234172bf8129942da2c2ca26` | `internetarchive/openlibrary` | `0` | `invalidated` | `95 ms` | The official evaluator failed to pull the expected `openlibrary` image tag and returned `None`. |
+ | `instance_gravitational__teleport-3fa6904377c006497169945428e8197158667910-v626ec2a48416b10a88641359a169d99e935ff037` | `gravitational/teleport` | `0` | `setup-failure` | `64527 ms` | `wave init` failed because the repo already contained Wave bootstrap files and the harness still used the non-adopt path. |
+ | `instance_navidrome__navidrome-7073d18b54da7e53274d11c9e2baef1242e8769e` | `navidrome/navidrome` | `0` | `invalidated` | `417099 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `navidrome` image tag and returned `None`. |
+ | `instance_element-hq__element-web-33e8edb3d508d6eefb354819ca693b7accc695e7` | `element-hq/element-web` | `0` | `invalidated` | `510260 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `element-web` image tag and returned `None`. |
+ | `instance_future-architect__vuls-407407d306e9431d6aa0ab566baa6e44e5ba2904` | `future-architect/vuls` | `0` | `invalidated` | `115 ms` | The official evaluator failed to pull the expected `vuls` image tag and returned `None`. |
+ | `instance_flipt-io__flipt-e42da21a07a5ae35835ec54f74004ebd58713874` | `flipt-io/flipt` | `0` | `invalidated` | `104 ms` | The official evaluator failed to pull the expected `flipt` image tag and returned `None`. |
+ | `instance_protonmail__webclients-2c3559cad02d1090985dba7e8eb5a129144d9811` | `protonmail/webclients` | `0` | `setup-failure` | `142167 ms` | Repository preparation failed before solving because the target commit tree could not be read locally (`fatal: Could not parse object ...`). |
86
+
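The recorded totals can be cross-checked against the scorecard with a few lines of arithmetic. The sketch below copies the review scores, wall-clock values, and token totals from this review; the object field names are arbitrary choices for the example, not fields from the run artifacts.

```javascript
// Review scores and wall-clock values (ms) copied from the task scorecard.
const scorecard = [
  { repo: "NodeBB/NodeBB", review: "invalidated", wallClockMs: 807464 },
  { repo: "qutebrowser/qutebrowser", review: "invalidated", wallClockMs: 369151 },
  { repo: "ansible/ansible", review: "setup-failure", wallClockMs: 499457 },
  { repo: "internetarchive/openlibrary", review: "invalidated", wallClockMs: 95 },
  { repo: "gravitational/teleport", review: "setup-failure", wallClockMs: 64527 },
  { repo: "navidrome/navidrome", review: "invalidated", wallClockMs: 417099 },
  { repo: "element-hq/element-web", review: "invalidated", wallClockMs: 510260 },
  { repo: "future-architect/vuls", review: "invalidated", wallClockMs: 115 },
  { repo: "flipt-io/flipt", review: "invalidated", wallClockMs: 104 },
  { repo: "protonmail/webclients", review: "setup-failure", wallClockMs: 142167 },
];

// Sum per-task wall clock and count tasks per review bucket.
const totalWallClockMs = scorecard.reduce((sum, task) => sum + task.wallClockMs, 0);
const reviewBuckets = {};
for (const task of scorecard) {
  reviewBuckets[task.review] = (reviewBuckets[task.review] || 0) + 1;
}

// Share of input tokens served from cache, from the recorded token totals.
const cachedInputShare = 54180608 / 59155820;

console.log(totalWallClockMs, reviewBuckets, cachedInputShare.toFixed(3));
```

The per-task wall clocks sum to the recorded `2810439 ms`, and the review-score column yields the same 7 and 3 split as the corrected manual buckets. Roughly 92% of input tokens were cached.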
87
+ ## What The Batch Actually Tells Us
88
+
89
+ This run does establish a few useful things:
90
+
91
+ - The 10-task `full-wave` review path is now executable end to end through `wave benchmark external-run --arm full-wave`.
92
+ - The harness now persists enough task-level evidence to audit failures: patch paths, verifier stdout and stderr, output dirs, and integration summaries.
93
+ - At least four tasks (`NodeBB`, `qutebrowser`, `navidrome`, `element-web`) entered real multi-agent execution and produced patches before the verifier step.
94
+
95
+ This run does **not** establish:
96
+
97
+ - a trustworthy `SWE-bench Pro` success rate for `full-wave`
98
+ - a comparison against public leaderboard systems
99
+ - a comparison against our own `single-agent` baseline
100
+
101
+ ## Comparison Context
102
+
103
+ Context only, not head-to-head:
104
+
105
+ - The public `SWE-bench Pro` leaderboard reports top public-set systems in roughly the `41%` to `46%` range across the full public benchmark, not `0%`.
106
+ - Because this review run was invalidated by verifier-image and setup failures, the current `0/10` should not be treated as a clean external capability comparison against those systems.
107
+
108
+ Official sources:
109
+
110
+ - `https://scale.com/leaderboard/swe_bench_pro_public`
111
+ - `https://scaleapi.github.io/SWE-bench_Pro-os/`
112
+
113
+ ## Follow-up Required Before Publication
114
+
115
+ - Fix verifier image resolution so the official evaluator can actually score all selected tasks.
116
+ - Fix the `teleport` harness path so repos with existing Wave bootstrap files use the adopt-existing flow when needed.
117
+ - Fix the `ansible` patch-extraction path so local environment bootstrapping cannot pollute the generated patch.
118
+ - Re-run the same frozen 10-task manifest after those harness fixes before making any external-performance claim.