@chllming/wave-orchestration 0.6.3 → 0.7.1
- package/CHANGELOG.md +82 -1
- package/README.md +40 -7
- package/docs/agents/wave-orchestrator-role.md +50 -0
- package/docs/agents/wave-planner-role.md +39 -0
- package/docs/context7/bundles.json +9 -0
- package/docs/context7/planner-agent/README.md +25 -0
- package/docs/context7/planner-agent/manifest.json +83 -0
- package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
- package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
- package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
- package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
- package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
- package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
- package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
- package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
- package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
- package/docs/evals/README.md +96 -1
- package/docs/evals/arm-templates/README.md +13 -0
- package/docs/evals/arm-templates/full-wave.json +15 -0
- package/docs/evals/arm-templates/single-agent.json +15 -0
- package/docs/evals/benchmark-catalog.json +7 -0
- package/docs/evals/cases/README.md +47 -0
- package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
- package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
- package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
- package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
- package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
- package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
- package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
- package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
- package/docs/evals/external-benchmarks.json +85 -0
- package/docs/evals/external-command-config.sample.json +9 -0
- package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
- package/docs/evals/pilots/README.md +47 -0
- package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
- package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
- package/docs/evals/wave-benchmark-program.md +302 -0
- package/docs/guides/planner.md +67 -11
- package/docs/guides/terminal-surfaces.md +12 -0
- package/docs/plans/context7-wave-orchestrator.md +20 -0
- package/docs/plans/current-state.md +8 -1
- package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
- package/docs/plans/examples/wave-example-live-proof.md +1 -1
- package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
- package/docs/plans/migration.md +26 -0
- package/docs/plans/wave-orchestrator.md +60 -12
- package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
- package/docs/reference/cli-reference.md +547 -0
- package/docs/reference/coordination-and-closure.md +436 -0
- package/docs/reference/live-proof-waves.md +25 -3
- package/docs/reference/npmjs-trusted-publishing.md +3 -3
- package/docs/reference/proof-metrics.md +90 -0
- package/docs/reference/runtime-config/README.md +63 -2
- package/docs/reference/runtime-config/codex.md +2 -1
- package/docs/reference/sample-waves.md +29 -18
- package/docs/reference/wave-control.md +164 -0
- package/docs/reference/wave-planning-lessons.md +131 -0
- package/package.json +5 -4
- package/releases/manifest.json +40 -0
- package/scripts/research/agent-context-archive.mjs +18 -0
- package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
- package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
- package/scripts/wave-orchestrator/agent-state.mjs +11 -2
- package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
- package/scripts/wave-orchestrator/autonomous.mjs +7 -0
- package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
- package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
- package/scripts/wave-orchestrator/benchmark.mjs +972 -0
- package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
- package/scripts/wave-orchestrator/config.mjs +175 -0
- package/scripts/wave-orchestrator/control-cli.mjs +1216 -0
- package/scripts/wave-orchestrator/control-plane.mjs +697 -0
- package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
- package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
- package/scripts/wave-orchestrator/coordination.mjs +84 -0
- package/scripts/wave-orchestrator/dashboard-renderer.mjs +120 -5
- package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
- package/scripts/wave-orchestrator/evals.mjs +23 -0
- package/scripts/wave-orchestrator/executors.mjs +3 -2
- package/scripts/wave-orchestrator/feedback.mjs +55 -0
- package/scripts/wave-orchestrator/install.mjs +151 -2
- package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
- package/scripts/wave-orchestrator/launcher-runtime.mjs +33 -30
- package/scripts/wave-orchestrator/launcher.mjs +884 -36
- package/scripts/wave-orchestrator/planner-context.mjs +75 -0
- package/scripts/wave-orchestrator/planner.mjs +2270 -136
- package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
- package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
- package/scripts/wave-orchestrator/replay.mjs +10 -4
- package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
- package/scripts/wave-orchestrator/retry-control.mjs +225 -0
- package/scripts/wave-orchestrator/shared.mjs +26 -0
- package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
- package/scripts/wave-orchestrator/terminals.mjs +1 -1
- package/scripts/wave-orchestrator/traces.mjs +157 -2
- package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
- package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
- package/scripts/wave-orchestrator/wave-files.mjs +144 -23
- package/scripts/wave.mjs +27 -0
- package/skills/repo-coding-rules/SKILL.md +1 -0
- package/skills/role-cont-eval/SKILL.md +1 -0
- package/skills/role-cont-qa/SKILL.md +13 -6
- package/skills/role-deploy/SKILL.md +1 -0
- package/skills/role-documentation/SKILL.md +4 -0
- package/skills/role-implementation/SKILL.md +4 -0
- package/skills/role-infra/SKILL.md +2 -1
- package/skills/role-integration/SKILL.md +15 -8
- package/skills/role-planner/SKILL.md +39 -0
- package/skills/role-planner/skill.json +21 -0
- package/skills/role-research/SKILL.md +1 -0
- package/skills/role-security/SKILL.md +2 -2
- package/skills/runtime-claude/SKILL.md +2 -1
- package/skills/runtime-codex/SKILL.md +1 -0
- package/skills/runtime-local/SKILL.md +2 -0
- package/skills/runtime-opencode/SKILL.md +1 -0
- package/skills/wave-core/SKILL.md +25 -6
- package/skills/wave-core/references/marker-syntax.md +16 -8
- package/wave.config.json +45 -0
@@ -44,13 +44,20 @@ This runbook is the operational view of the architecture:
 - `pnpm exec wave doctor`
 - `pnpm exec wave launch --lane main --dry-run --no-dashboard`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access`
+- `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access --resident-orchestrator`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor claude`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor opencode`
 - `pnpm exec wave autonomous --lane main --executor codex --codex-sandbox danger-full-access`
 - `pnpm exec wave feedback list --lane main --pending`
-- `pnpm exec wave
+- `pnpm exec wave control status --lane main --wave 0 --json`
+- `pnpm exec wave control status --lane main --wave 0 --agent A1 --json`
 - `pnpm exec wave coord inbox --lane main --wave 0 --agent A1 --dry-run`
-- `pnpm exec wave
+- `pnpm exec wave control task create --lane main --wave 0 --agent A1 --kind blocker --summary "Need repository decision"`
+- `pnpm exec wave control task act reassign --lane main --wave 0 --id <task-id> --to A2`
+- `pnpm exec wave control rerun get --lane main --wave 0 --json`
+- `pnpm exec wave control rerun request --lane main --wave 0 --agent A2 --agent A7 --clear-reuse A2 --reason "resume sibling-owned shared component closure"`
+- `pnpm exec wave control proof register --lane main --wave 9 --agent A7 --artifact .tmp/wave-9-proof/live-status.json --authoritative --completion live --durability durable --proof-level live`
+- `pnpm exec wave local --prompt .tmp/main-wave-launcher/prompts/wave-0-A1.md --log .tmp/main-wave-launcher/logs/wave-0-A1.log --status .tmp/main-wave-launcher/status/wave-0-A1.json`
 - `pnpm exec wave dep show --lane main --wave 0 --json`
 - `pnpm exec wave dep post --owner-lane main --requester-lane release --owner-wave 0 --requester-wave 2 --agent launcher --summary "Need shared-plan reconciliation" --target capability:docs-shared-plan --required`
 - `pnpm exec wave upgrade`
@@ -58,14 +65,15 @@ This runbook is the operational view of the architecture:
 
 ## Configuration
 
-- `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences,
+- `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences, Context7 bundle-index location, and the optional `waveControl` telemetry section. The starter config also wires the optional security reviewer prompt at `docs/agents/wave-security-role.md` and the `security-review` executor profile.
 - `docs/context7/bundles.json` controls allowed external library bundles and lane defaults.
 - `docs/evals/README.md` explains how to author delegated versus pinned `## Eval targets`, including the coordination-oriented benchmark families.
 - `docs/reference/live-proof-waves.md` explains how to author proof-first `pilot-live` and higher-maturity waves with `### Proof artifacts`, sticky executors, and operator command capture.
 - `docs/reference/sample-waves.md` points to showcase-first sample waves that combine the modern authored wave surface in concrete examples.
+- `docs/reference/wave-control.md` documents the Wave Control telemetry and analysis plane, including entity types, artifact upload policies, and the local-first reporting contract.
 - `docs/plans/component-cutover-matrix.json` is the canonical machine-readable source for component maturity and per-wave promotion targets.
 - `.wave/install-state.json` records how the workspace was initialized and which package version is installed.
-- `.wave/project-profile.json` records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
+- `.wave/project-profile.json` (created by `wave project setup`) records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
 - `.wave/adhoc/runs/<run-id>/` stores transient ad-hoc request, spec, rendered markdown, and result artifacts.
 - ad-hoc documentation closure always writes `.wave/adhoc/runs/<run-id>/reports/`, but shared-plan deltas still queue the canonical lane shared-plan docs.
 - ad-hoc task ownership inference only accepts repo-local paths; URLs and other external references are ignored.
@@ -76,12 +84,12 @@ This runbook is the operational view of the architecture:
 - Wave skill bundles live under `skills/<skill-id>/`.
 - Each bundle requires `skill.json` and `SKILL.md`.
 - Bundles can also include runtime adapters at `adapters/<runtime>.md` for `codex`, `claude`, `opencode`, or `local`.
-- The starter config
+- The starter config merges global and lane skill configs, then resolves in order: base, role, runtime, deploy-kind, and finally explicit per-agent `### Skills`.
 - The effective skill set is recomputed after final executor resolution, including retry-time runtime fallback, so a fallback from one runtime to another also swaps runtime-specific skill overlays.
 - Starter bundles in this repo cover:
 - core Wave coordination and repo coding rules
 - runtime packs for Codex, Claude, OpenCode, and local execution
-- role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, and
+- role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, research, and planner work
 - deploy and environment packs for Railway, Docker Compose, Kubernetes, SSH/manual rollout, and generic custom deploys
 - explicit provider packs for GitHub release flow and AWS norms when a wave or lane wants to attach them
 
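The resolution order added in the hunk above (merge global and lane skill configs, then base, role, runtime, deploy-kind, and finally explicit per-agent `### Skills`) can be sketched as follows. This is a minimal illustration and not the package's implementation: the config shape (`base`, `byRole`, `byRuntime`, `byDeployKind`), the agent fields, and both function names are hypothetical; only the skill ids come from the `skills/` paths in the file list.

```javascript
// Hypothetical sketch of layered skill resolution; all field names are assumptions.
function mergeSkillConfigs(globalCfg = {}, laneCfg = {}) {
  // Concatenate per-key skill lists, global entries first, lane entries after.
  const mergeMap = (a = {}, b = {}) => {
    const out = { ...a };
    for (const [key, ids] of Object.entries(b)) {
      out[key] = [...(out[key] ?? []), ...ids];
    }
    return out;
  };
  return {
    base: [...(globalCfg.base ?? []), ...(laneCfg.base ?? [])],
    byRole: mergeMap(globalCfg.byRole, laneCfg.byRole),
    byRuntime: mergeMap(globalCfg.byRuntime, laneCfg.byRuntime),
    byDeployKind: mergeMap(globalCfg.byDeployKind, laneCfg.byDeployKind),
  };
}

function resolveSkills(cfg, agent) {
  // Order mirrors the text: base, role, runtime, deploy-kind, explicit per-agent.
  const ordered = [
    ...cfg.base,
    ...(cfg.byRole[agent.role] ?? []),
    ...(cfg.byRuntime[agent.runtime] ?? []),
    ...(cfg.byDeployKind[agent.deployKind] ?? []),
    ...(agent.explicitSkills ?? []),
  ];
  return [...new Set(ordered)]; // a Set keeps first-insertion order and drops repeats
}

const cfg = mergeSkillConfigs(
  { base: ["wave-core"], byRuntime: { codex: ["runtime-codex"], claude: ["runtime-claude"] } },
  { byRole: { implementation: ["role-implementation"] } },
);
const agent = { role: "implementation", runtime: "codex", explicitSkills: ["repo-coding-rules"] };
const first = resolveSkills(cfg, agent);
// Re-resolving after a runtime fallback swaps only the runtime-specific overlay.
const afterFallback = resolveSkills(cfg, { ...agent, runtime: "claude" });
```

Recomputing from the merged config on every executor change, rather than patching the previous result, is one simple way to get the "fallback also swaps runtime-specific skill overlays" behaviour the hunk describes.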
@@ -124,13 +132,23 @@ pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex -
 
 ## Coordination Surfaces
 
-- `wave
+- `wave control status` is the read-only projection for "why blocked / why retrying" at wave or agent scope. It returns blocking edges, logical agent state, tasks, dependencies, rerun intent, active proof bundles, and next timers from one materialized control-plane view.
+- `wave control task create|get|list|act` is the operator task surface for blocking requests, blockers, clarification chains, human-input tickets, escalations, and informative handoffs, evidence, claims, and decisions. `wave control status` only treats requests, blockers, clarifications, human-input, escalations, helper assignments, and required dependencies as blocking edges.
+- A fresh live `wave launch --start-wave <n> --end-wave <n>` now clears the previous auto-generated relaunch plan for that wave before selecting the initial implementation fan-out. Pass `--resume-control-state` only when you intentionally want to keep that persisted relaunch selection.
+- `wave control rerun request|get|clear` manages targeted rerun intent under `.tmp/<lane>-wave-launcher/control-plane/` and projects compatible retry overrides under `.tmp/<lane>-wave-launcher/control/`, including selected agents, reuse selectors, invalidated components, and clear or preserve reuse lists.
+- `wave control proof register|get|supersede|revoke` manages authoritative proof bundles in the same control-plane log and projects compatible proof registries under `.tmp/<lane>-wave-launcher/proof/`.
+- `wave control telemetry status|flush` inspects and delivers the local Wave Control event queue. Pass `--no-telemetry` on `wave launch` to disable event publication for a single run.
 - `wave coord render` regenerates the markdown board projection from the canonical coordination log.
 - `wave coord inbox` writes the compiled shared summary plus the selected agent inbox.
-
+
+Compatibility note:
+
+- `wave coord`, `wave retry`, and `wave proof` remain available as compatibility surfaces, but new operator docs and runbooks should prefer `wave control`.
 
 The canonical state is the JSONL log under `.tmp/<lane>-wave-launcher/coordination/`. The markdown board is a generated projection for humans, not the scheduler's source of truth.
 
+Control-plane facts that drive reruns, proof, attempt state, and operator tasks are appended separately under `.tmp/<lane>-wave-launcher/control-plane/`. Legacy proof and retry files remain derived projections for compatibility, not the source of truth.
+
 Capability-targeted requests now become deterministic helper assignments. The launcher resolves the assignee from explicit targets, `capabilityRouting.preferredAgents`, then least-busy matching capability owners, writes that assignment into `.tmp/<lane>-wave-launcher/assignments/`, mirrors the decision into coordination state, and keeps the wave blocked until the linked follow-up resolves.
 
 Clarification flow is orchestrator-first:
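The assignee order stated in the hunk above for capability-targeted requests (explicit targets, then `capabilityRouting.preferredAgents`, then least-busy matching capability owners) can be sketched like this. It is an illustration under assumptions: the request and agent shapes, the `openAssignments` counter, the `via` labels, and the tie-break on agent id are hypothetical and are not taken from the launcher source.

```javascript
// Hypothetical sketch; request/agent shapes and the id tie-break are assumptions.
function resolveAssignee(request, agents, preferredAgents = {}) {
  const byId = new Map(agents.map((a) => [a.id, a]));
  // 1. An explicit agent target wins outright.
  if (request.targetAgent && byId.has(request.targetAgent)) {
    return { assignee: request.targetAgent, via: "explicit-target" };
  }
  // 2. Otherwise take the first configured preferred agent for the capability.
  for (const id of preferredAgents[request.capability] ?? []) {
    if (byId.has(id)) return { assignee: id, via: "preferred-agent" };
  }
  // 3. Otherwise pick the least-busy owner; ties break on id so the result is stable.
  const owners = agents
    .filter((a) => a.capabilities.includes(request.capability))
    .sort((a, b) => a.openAssignments - b.openAssignments || a.id.localeCompare(b.id));
  return owners.length > 0
    ? { assignee: owners[0].id, via: "least-busy" }
    : { assignee: null, via: "unresolved" };
}

const agents = [
  { id: "A1", capabilities: ["docs-shared-plan"], openAssignments: 2 },
  { id: "A2", capabilities: ["docs-shared-plan"], openAssignments: 0 },
  { id: "A7", capabilities: ["deploy"], openAssignments: 0 },
];
const leastBusy = resolveAssignee({ capability: "docs-shared-plan" }, agents);
const preferred = resolveAssignee(
  { capability: "docs-shared-plan" },
  agents,
  { "docs-shared-plan": ["A1"] },
);
```

Some fixed tie-break is what makes the assignment repeatable for the same inputs, which is the property the text calls deterministic; which tie-break the launcher actually uses is not shown in this diff.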
@@ -141,6 +159,19 @@ Clarification flow is orchestrator-first:
 4. Routed clarification follow-up requests remain blocking until they resolve.
 5. Human escalations are written back into coordination state, the ledger, and trace artifacts.
 
+During live runs, the launcher now keeps an active orchestration loop while agents are still running. It refreshes the derived coordination surfaces on cadence, surfaces overdue acknowledgements and stale clarification chains in dashboards and traces, and can reroute clarification follow-up requests inside the same attempt when the routed owner never acknowledges them.
+
+If you opt into `--resident-orchestrator`, the launcher also starts a long-running non-owning orchestrator session for the wave. That session can inspect the same coordination artifacts and intervene through coordination records, but the launcher remains the scheduler truth and closure authority.
+
+Retry intent, operator tasks, attempt lifecycle, and proof injection are now first-class control-plane artifacts rather than manual file surgery:
+
+- canonical control events live under `.tmp/<lane>-wave-launcher/control-plane/`
+- projected retry overrides still live under `.tmp/<lane>-wave-launcher/control/`
+- projected proof registries still live under `.tmp/<lane>-wave-launcher/proof/`
+- live traces now copy the control-plane log alongside the proof registry so replay keeps the same operator-visible facts
+
+For a full end-to-end explainer of helper assignments, deliverables, integration, and why an agent can be locally done while the wave stays blocked, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
+
 ## Cross-Lane Dependencies
 
 - `wave dep post` appends a typed dependency ticket under `.tmp/wave-orchestrator/dependencies/`.
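The hunk above describes the control-plane log as the canonical record, with the retry and proof files as derived projections. A small sketch of that pattern, folding an append-only JSONL log into a projected view, is given below. The event kinds and fields (`rerun-requested`, `proof-registered`, and so on) are invented for illustration; the real control-plane schema is not part of this diff.

```javascript
// Hypothetical event shapes; the real control-plane schema is not shown in this diff.
function projectControlPlane(jsonl) {
  const state = { rerun: null, proofs: new Map() };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate trailing newlines in the log
    const event = JSON.parse(line);
    switch (event.kind) {
      case "rerun-requested":
        state.rerun = { agents: event.agents, reason: event.reason };
        break;
      case "rerun-cleared":
        state.rerun = null;
        break;
      case "proof-registered":
        state.proofs.set(event.id, { agent: event.agent, status: "active" });
        break;
      case "proof-revoked":
        if (state.proofs.has(event.id)) state.proofs.get(event.id).status = "revoked";
        break;
      default:
        break; // unknown kinds are skipped so older readers tolerate newer logs
    }
  }
  return state;
}

// Events are only ever appended; the projection is rebuilt by replaying them in order.
const log = [
  { kind: "rerun-requested", agents: ["A2", "A7"], reason: "resume closure" },
  { kind: "proof-registered", id: "proof-1", agent: "A7" },
  { kind: "proof-revoked", id: "proof-1" },
  { kind: "rerun-cleared" },
]
  .map((event) => JSON.stringify(event))
  .join("\n");

const projected = projectControlPlane(log);
```

Because the projection is a pure function of the log, copying the log into a trace bundle is enough for a replay to rebuild the same operator-visible state, which matches the rationale given in the last bullet of the hunk.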
@@ -181,8 +212,10 @@ pnpm exec wave changelog --since-installed
 
 - prompts: `.tmp/<lane>-wave-launcher/prompts/`
 - logs: `.tmp/<lane>-wave-launcher/logs/`
+- run-state: `.tmp/<lane>-wave-launcher/run-state.json`
+Keeps compatibility `completedWaves`, but now also stores per-wave current state plus append-only transition history and completion or blocker evidence.
 - status summaries: `.tmp/<lane>-wave-launcher/status/`
-
+Relaunch plans in this directory are schema-versioned.
 - coordination logs: `.tmp/<lane>-wave-launcher/coordination/`
 - helper-assignment snapshots: `.tmp/<lane>-wave-launcher/assignments/`
 - message boards: `.tmp/<lane>-wave-launcher/messageboards/`
@@ -195,6 +228,12 @@ pnpm exec wave changelog --since-installed
 - dependency snapshots: `.tmp/<lane>-wave-launcher/dependencies/`
 - docs queue: `.tmp/<lane>-wave-launcher/docs-queue/`
 - trace bundles: `.tmp/<lane>-wave-launcher/traces/`
+- control-plane events: `.tmp/<lane>-wave-launcher/control-plane/`
+Canonical append-only JSONL log of operator tasks, rerun requests, proof bundles, attempt lifecycle, and human-input events. This is the source of truth for `wave control`. Telemetry queue lives under `control-plane/telemetry/`.
+- proof registries: `.tmp/<lane>-wave-launcher/proof/`
+Projected from control-plane state for compatibility. Operator-registered authoritative proof bundles that feed integration, cont-QA, and replay.
+- retry overrides: `.tmp/<lane>-wave-launcher/control/`
+Projected from control-plane state for compatibility. Operator-applied targeted retry overrides, applied once per attempt and then cleared by the launcher.
 - clarification triage: `.tmp/<lane>-wave-launcher/feedback/triage/`
 - dashboards: `.tmp/<lane>-wave-launcher/dashboards/`
 Dashboard JSON is a versioned contract. `global.json` and `wave-<n>.json` now carry explicit `schemaVersion` and `kind` fields.
@@ -220,10 +259,13 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - `ledger.json`
 - `docs-queue.json`
 - `security.json`
+- `capability-assignments.json`
+- `dependency-snapshot.json`
 - `integration.json`
 - `outcome.json`
 - `shared-summary.md`
 - copied prompt, log, status, inbox, and summary artifacts per launched agent
+- `control-plane.raw.jsonl`
 - `structured-signals.json`
 - `quality.json`
 - `run-metadata.json`
@@ -232,7 +274,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - For `traceVersion: 2`, launched agents must have copied prompt/log/status/inbox/summary artifacts, and promoted-component waves must include the copied component matrix JSON.
 - `security.json` stores the derived per-wave security state that feeds integration summaries, gate snapshots, and replay.
 - `quality.json` is cumulative through the current attempt. It is intended for regression comparison, not only for one-shot pass/fail reporting.
-- `quality.json` also reports capability-assignment and dependency-resolution metrics in addition to the Phase 2/3 communication, fallback, and closure metrics.
+- `quality.json` also reports capability-assignment and dependency-resolution metrics, plus coordination response metrics (overdue acknowledgements, clarification timing, human escalation counts), in addition to the Phase 2/3 communication, fallback, and closure metrics.
 - Replay support is internal. The source tree contains helpers to load, validate, and replay trace bundles against the same gate logic the launcher uses, but there is no public replay CLI yet.
 - Replay is read-only and hash-validating for `traceVersion: 2` bundles. It ignores inline summary duplicates in `run-metadata.json` and returns a stored-vs-recomputed comparison report for gate and quality state. Legacy `traceVersion: 1` bundles remain best-effort and emit warnings instead of claiming full hermetic replay.
 
@@ -257,7 +299,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - Optional standing roles available in this repo include `docs/agents/wave-infra-role.md` for infra proof and `docs/agents/wave-deploy-verifier-role.md` for rollout verification.
 - Keep file ownership explicit inside each `### Prompt`.
 - From the configured thresholds onward, declare `## Context7 defaults`, per-agent `### Context7`, and per-agent `### Exit contract`.
-- For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md).
+- For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md). External benchmark failure reviews classify outcomes into categories (`verifier-image`, `setup-harness`, `timeout`, `blocked-proof`, `missing-context`, `partial-fix`, `wrong-fix`, `unknown`) which feed the failure-review tooling available through `wave benchmark external-show`.
 - For proof-first live-wave patterns, sticky retry guidance, and `### Proof artifacts` examples, see [docs/reference/live-proof-waves.md](../reference/live-proof-waves.md).
 - Agents should use `wave coord post` for durable blockers, handoffs, evidence, and requests instead of relying on ad hoc board edits.
 - Keep shared plan docs and the component cutover matrix owned by the configured documentation steward once that rule becomes active.
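The failure categories listed in the hunk above lend themselves to a simple per-category tally. The sketch below shows one way such a tally could work; it is not the `wave benchmark external-show` implementation. The helper name, the review record shape, and the fallback of unrecognized labels to `unknown` are assumptions. The sample input follows the 7 and 3 split recorded in the operator review added later in this diff, assuming its "setup or harness" bucket corresponds to `setup-harness`.

```javascript
// Category names come from the changed line above; everything else is hypothetical.
const FAILURE_CATEGORIES = [
  "verifier-image",
  "setup-harness",
  "timeout",
  "blocked-proof",
  "missing-context",
  "partial-fix",
  "wrong-fix",
  "unknown",
];

// Counts reviewed task outcomes per category; unrecognized labels fall into "unknown".
function tallyFailureCategories(reviews) {
  const counts = Object.fromEntries(FAILURE_CATEGORIES.map((c) => [c, 0]));
  for (const review of reviews) {
    const category = FAILURE_CATEGORIES.includes(review.category)
      ? review.category
      : "unknown";
    counts[category] += 1;
  }
  return counts;
}

// Placeholder task ids; only the 7/3 split mirrors the recorded review.
const reviews = [
  ...Array.from({ length: 7 }, (_, i) => ({ task: `task-${i + 1}`, category: "verifier-image" })),
  ...Array.from({ length: 3 }, (_, i) => ({ task: `task-${i + 8}`, category: "setup-harness" })),
];
const counts = tallyFailureCategories(reviews);
```

Keeping every category present in the result, even at zero, makes two review runs directly comparable field by field.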
@@ -267,6 +309,10 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 [claude.md](../reference/runtime-config/claude.md),
 [opencode.md](../reference/runtime-config/opencode.md).
 
+## CLI Reference
+
+For the complete syntax of every command, flag, and subcommand, see [docs/reference/cli-reference.md](../reference/cli-reference.md).
+
 ## Executor Modes
 
 - `--executor codex` uses `codex exec` with the generated task prompt piped through stdin.
@@ -279,7 +325,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - Skills resolve only after that executor choice is known. Runtime-specific skill overlays are regenerated whenever retry-time fallback changes the selected executor.
 - Runtime mix targets are enforced before launch and again before any retry-time fallback reassignment.
 - Fallbacks are declared in profiles or lane policy, can be applied automatically on retry when the next executor is available and still satisfies mix targets, and are recorded in the ledger, integration summary, and traces when used.
-- Generic `budget.minutes` caps per-agent attempt timeouts. Generic `budget.turns` seeds `claude.maxTurns` and `opencode.steps` when executor-specific values are not set; Codex turn ceilings remain external to Wave and show up in preview metadata as opaque when Wave cannot inspect them.
+- Generic `budget.minutes` caps per-agent attempt timeouts. Generic `budget.turns` seeds `claude.maxTurns` and `opencode.steps` when executor-specific values are not set; Codex turn ceilings remain external to Wave and show up in preview metadata as opaque when Wave cannot inspect them, though live previews now record an observed ceiling if the Codex runtime later logs one explicitly.
 - The launcher writes runtime overlay files under `.tmp/<lane>-wave-launcher/executors/`; these should stay ignored and local.
 
 Runtime authoring examples:
@@ -335,3 +381,5 @@ Live closure is fail-closed:
 - Security review requires a report artifact plus a structured `[wave-security]` marker. `state=blocked` stops the wave before integration, while `state=concerns` is preserved in summaries and traces without automatically failing closure.
 - `cont-QA` PASS requires both the final verdict and the final `[wave-gate]` marker.
 - Legacy evaluator-era or underspecified closure artifacts are still readable in replay and trace analysis, but they no longer satisfy live completion.
+
+For a detailed worked example of cross-agent follow-up and staged closure, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
@@ -0,0 +1,118 @@
|
|
|
1
|
+
# Wave 1 Benchmark Operator Review
|
|
2
|
+
|
|
3
|
+
## Scope
|
|
4
|
+
|
|
5
|
+
This document reviews the `SWE-bench Pro` 10-task `full-wave` review-only batch run.
|
|
6
|
+
|
|
7
|
+
- manifest: `docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json`
|
|
8
|
+
- command config: `docs/evals/external-command-config.swe-bench-pro.json`
|
|
9
|
+
- source evidence: recorded aggregate results plus per-task verifier stdout/stderr logs and integration summaries from the benchmark worktree pass
|
|
10
|
+
|
|
11
|
+
Command used:
|
|
12
|
+
|
|
13
|
+
```bash
|
|
14
|
+
node "scripts/wave.mjs" benchmark external-run \
|
|
15
|
+
--adapter swe-bench-pro \
|
|
16
|
+
--manifest docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json \
|
|
17
|
+
--arm full-wave \
|
|
18
|
+
--command-config docs/evals/external-command-config.swe-bench-pro.json \
|
|
19
|
+
--model-id gpt-5-codex \
|
|
20
|
+
--executor-id codex \
|
|
21
|
+
--executor-command "codex exec" \
|
|
22
|
+
--tool-permissions "Read,Write,Edit,Bash" \
|
|
23
|
+
--temperature 0 \
|
|
24
|
+
--reasoning-effort high \
|
|
25
|
+
--max-wall-clock-minutes 15 \
|
|
26
|
+
--max-turns 250 \
|
|
27
|
+
--retry-limit 0 \
|
|
28
|
+
--verification-harness official-swe-bench-pro \
|
|
29
|
+
--dataset-version public-v1 \
|
|
30
|
+
--output-dir .tmp/wave-benchmarks/external/swe-bench-pro-full-wave-review-10 \
|
|
31
|
+
--json
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
This was a `review-only` run, not a matched `single-agent` versus `full-wave` comparison.
|
|
35
|
+
|
|
36
|
+
## Verdict
|
|
37
|
+
|
|
38
|
+
- Official resolved score: `0/10`
|
|
39
|
+
- Interpretable capability score: `not valid for external comparison`
|
|
40
|
+
- Recommendation: `blocked`
|
|
41
|
+
|
|
42
|
+
Why this is blocked:
|
|
43
|
+
|
|
44
|
+
- `7/10` tasks reached the official SWE-bench Pro evaluator, but the evaluator could not pull the expected Docker image tag from `jefzda/sweap-images`, so those zeros are not trustworthy model-performance failures.
|
|
45
|
+
- `3/10` tasks failed earlier in harness or repository setup before a trustworthy benchmark judgment existed.
|
|
46
|
+
- The raw aggregate `reviewBuckets` from the runner said `harness-env=10`; that was directionally closer to the truth than `incorrect-patch`, but still too coarse. The corrected manual buckets below are the review-ready interpretation.
|
|
47
|
+
|
|
48
|
+
## Aggregate Metrics
|
|
49
|
+
|
|
50
|
+
Recorded totals from the 10-task batch:
|
|
51
|
+
|
|
52
|
+
- tasks: `10`
|
|
53
|
+
- solved: `0`
|
|
54
|
+
- success rate: `0%`
|
|
55
|
+
- total wall clock: `2810439 ms`
|
|
56
|
+
- token totals:
|
|
57
|
+
- `input_tokens = 59155820`
|
|
58
|
+
- `cached_input_tokens = 54180608`
|
|
59
|
+
- `output_tokens = 278308`
|
|
60
|
+
|
|
61
|
+
Corrected manual failure buckets:
|
|
62
|
+
|
|
63
|
+
- `7` verifier-image failures
|
|
64
|
+
- `3` setup or harness failures before trustworthy scoring
|
|
65
|
+
- `0` trustworthy patch-quality failures established by the official verifier

## Task Scorecard

Scoring convention used here:

- `official score`: the raw `0/1` result recorded by the run artifacts
- `review score`: whether that official score is trustworthy enough to interpret as model capability evidence

| Task | Repo | Official score | Review score | Wall clock | Notes |
| --- | --- | --- | --- | ---: | --- |
| `instance_NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan` | `NodeBB/NodeBB` | `0` | `invalidated` | `807464 ms` | Full-wave solve ran and produced a patch, but the official evaluator failed to pull `jefzda/sweap-images:nodebb.nodebb-NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5` and returned `None`. |
| `instance_qutebrowser__qutebrowser-f91ace96223cac8161c16dd061907e138fe85111-v059c6fdc75567943479b23ebca7c07b5e9a7f34c` | `qutebrowser/qutebrowser` | `0` | `invalidated` | `369151 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `qutebrowser` image tag and returned `None`. |
| `instance_ansible__ansible-f327e65d11bb905ed9f15996024f857a95592629-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5` | `ansible/ansible` | `0` | `setup-failure` | `499457 ms` | Patch extraction failed during `git diff`; the task workspace had local `.venv` churn, so this never reached a trustworthy verifier judgment. |
| `instance_internetarchive__openlibrary-4a5d2a7d24c9e4c11d3069220c0685b736d5ecde-v13642507b4fc1f8d234172bf8129942da2c2ca26` | `internetarchive/openlibrary` | `0` | `invalidated` | `95 ms` | The official evaluator failed to pull the expected `openlibrary` image tag and returned `None`. |
| `instance_gravitational__teleport-3fa6904377c006497169945428e8197158667910-v626ec2a48416b10a88641359a169d99e935ff037` | `gravitational/teleport` | `0` | `setup-failure` | `64527 ms` | `wave init` failed because the repo already contained Wave bootstrap files and the harness still used the non-adopt path. |
| `instance_navidrome__navidrome-7073d18b54da7e53274d11c9e2baef1242e8769e` | `navidrome/navidrome` | `0` | `invalidated` | `417099 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `navidrome` image tag and returned `None`. |
| `instance_element-hq__element-web-33e8edb3d508d6eefb354819ca693b7accc695e7` | `element-hq/element-web` | `0` | `invalidated` | `510260 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `element-web` image tag and returned `None`. |
| `instance_future-architect__vuls-407407d306e9431d6aa0ab566baa6e44e5ba2904` | `future-architect/vuls` | `0` | `invalidated` | `115 ms` | The official evaluator failed to pull the expected `vuls` image tag and returned `None`. |
| `instance_flipt-io__flipt-e42da21a07a5ae35835ec54f74004ebd58713874` | `flipt-io/flipt` | `0` | `invalidated` | `104 ms` | The official evaluator failed to pull the expected `flipt` image tag and returned `None`. |
| `instance_protonmail__webclients-2c3559cad02d1090985dba7e8eb5a129144d9811` | `protonmail/webclients` | `0` | `setup-failure` | `142167 ms` | Repository preparation failed before solving because the target commit tree could not be read locally (`fatal: Could not parse object ...`). |
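
As a cross-check, the scorecard rows can be tallied against the aggregate numbers reported earlier. The sketch below transcribes the repo, review score, and wall clock from each row; the tuple layout and the `tally` helper are illustrative only and do not reflect the harness's actual result schema.

```python
from collections import Counter

# (repo, review score, wall clock in ms), transcribed from the scorecard rows.
SCORECARD = [
    ("NodeBB/NodeBB", "invalidated", 807464),
    ("qutebrowser/qutebrowser", "invalidated", 369151),
    ("ansible/ansible", "setup-failure", 499457),
    ("internetarchive/openlibrary", "invalidated", 95),
    ("gravitational/teleport", "setup-failure", 64527),
    ("navidrome/navidrome", "invalidated", 417099),
    ("element-hq/element-web", "invalidated", 510260),
    ("future-architect/vuls", "invalidated", 115),
    ("flipt-io/flipt", "invalidated", 104),
    ("protonmail/webclients", "setup-failure", 142167),
]


def tally(rows):
    """Count review scores and sum per-task wall clock in milliseconds."""
    buckets = Counter(review for _, review, _ in rows)
    total_ms = sum(ms for _, _, ms in rows)
    return buckets, total_ms


if __name__ == "__main__":
    buckets, total_ms = tally(SCORECARD)
    print(dict(buckets))
    print(total_ms)
```

The tally gives 7 `invalidated` rows and 3 `setup-failure` rows, and the per-task wall clocks sum to the recorded batch total of `2810439 ms`.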

## What The Batch Actually Tells Us

This run does establish a few useful things:

- The 10-task `full-wave` review path is now executable end to end through `wave benchmark external-run --arm full-wave`.
- The harness now persists enough task-level evidence to audit failures: patch paths, verifier stdout and stderr, output dirs, and integration summaries.
- At least four tasks (`NodeBB`, `qutebrowser`, `navidrome`, and `element-web`) entered real multi-agent execution and produced patches before the verifier step.

This run does **not** establish:

- a trustworthy `SWE-bench Pro` success rate for `full-wave`
- a comparison against public leaderboard systems
- a comparison against our own `single-agent` baseline

## Comparison Context

Context only, not head-to-head:

- The public `SWE-bench Pro` leaderboard reports top public-set systems in roughly the `41%` to `46%` range across the full public benchmark, not `0%`.
- Because this review run was invalidated by verifier-image and setup failures, the current `0/10` should not be treated as a clean external capability comparison against those systems.

Official sources:

- `https://scale.com/leaderboard/swe_bench_pro_public`
- `https://scaleapi.github.io/SWE-bench_Pro-os/`

## Follow-up Required Before Publication

- Fix verifier image resolution so the official evaluator can actually score all selected tasks.
- Fix the `teleport` harness path so repos with existing Wave bootstrap files use the adopt-existing flow when needed.
- Fix the `ansible` patch-extraction path so local environment bootstrapping cannot pollute the generated patch.
- Re-run the same frozen 10-task manifest after those harness fixes before making any external-performance claim.
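
One way to catch the verifier-image problem before spending solve time is a pre-flight check that every expected image reference resolves. The sketch below is a hypothetical helper, not part of the Wave harness: it shells out to `docker manifest inspect`, treats a non-zero exit code as "not resolvable", and assumes the caller already knows the full image references for the frozen manifest.

```python
import subprocess
from typing import Callable, Iterable, List


def docker_manifest_exists(image_ref: str) -> bool:
    """Return True if `docker manifest inspect` can resolve image_ref.

    Assumption: a zero exit code means the tag is pullable from this host.
    A missing docker binary is reported as "not resolvable".
    """
    try:
        result = subprocess.run(
            ["docker", "manifest", "inspect", image_ref],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


def find_missing_images(
    image_refs: Iterable[str],
    exists: Callable[[str], bool] = docker_manifest_exists,
) -> List[str]:
    """Return the image references that could not be resolved, in input order."""
    return [ref for ref in image_refs if not exists(ref)]


if __name__ == "__main__":
    # The only full reference quoted in the scorecard; others must come
    # from the benchmark's own metadata.
    refs = [
        "jefzda/sweap-images:nodebb.nodebb-NodeBB__NodeBB-"
        "04998908ba6721d64eba79ae3b65a351dcfbc5b5",
    ]
    print(find_missing_images(refs))
```

The `exists` parameter is injectable so the check can be exercised without Docker. A non-empty result before the re-run would indicate that the batch would be invalidated again for the same reason.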