npm - @chllming/wave-orchestration - Versions diffs - 0.6.3 → 0.7.0 - Mend

@chllming/wave-orchestration 0.6.3 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (112)

package/CHANGELOG.md +57 -1
package/README.md +39 -7
package/docs/agents/wave-orchestrator-role.md +50 -0
package/docs/agents/wave-planner-role.md +39 -0
package/docs/context7/bundles.json +9 -0
package/docs/context7/planner-agent/README.md +25 -0
package/docs/context7/planner-agent/manifest.json +83 -0
package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
package/docs/evals/README.md +96 -1
package/docs/evals/arm-templates/README.md +13 -0
package/docs/evals/arm-templates/full-wave.json +15 -0
package/docs/evals/arm-templates/single-agent.json +15 -0
package/docs/evals/benchmark-catalog.json +7 -0
package/docs/evals/cases/README.md +47 -0
package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
package/docs/evals/external-benchmarks.json +85 -0
package/docs/evals/external-command-config.sample.json +9 -0
package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
package/docs/evals/pilots/README.md +47 -0
package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
package/docs/evals/wave-benchmark-program.md +302 -0
package/docs/guides/planner.md +48 -11
package/docs/plans/context7-wave-orchestrator.md +20 -0
package/docs/plans/current-state.md +8 -1
package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
package/docs/plans/examples/wave-example-live-proof.md +1 -1
package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
package/docs/plans/wave-orchestrator.md +62 -11
package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
package/docs/reference/coordination-and-closure.md +436 -0
package/docs/reference/live-proof-waves.md +25 -3
package/docs/reference/npmjs-trusted-publishing.md +3 -3
package/docs/reference/proof-metrics.md +90 -0
package/docs/reference/runtime-config/README.md +61 -0
package/docs/reference/sample-waves.md +29 -18
package/docs/reference/wave-control.md +164 -0
package/docs/reference/wave-planning-lessons.md +131 -0
package/package.json +5 -4
package/releases/manifest.json +18 -0
package/scripts/research/agent-context-archive.mjs +18 -0
package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
package/scripts/wave-orchestrator/autonomous.mjs +7 -0
package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
package/scripts/wave-orchestrator/benchmark.mjs +972 -0
package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
package/scripts/wave-orchestrator/config.mjs +175 -0
package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
package/scripts/wave-orchestrator/control-plane.mjs +697 -0
package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
package/scripts/wave-orchestrator/coordination.mjs +84 -0
package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
package/scripts/wave-orchestrator/evals.mjs +23 -0
package/scripts/wave-orchestrator/executors.mjs +3 -2
package/scripts/wave-orchestrator/feedback.mjs +55 -0
package/scripts/wave-orchestrator/install.mjs +55 -1
package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
package/scripts/wave-orchestrator/launcher.mjs +796 -35
package/scripts/wave-orchestrator/planner-context.mjs +75 -0
package/scripts/wave-orchestrator/planner.mjs +2270 -136
package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
package/scripts/wave-orchestrator/replay.mjs +10 -4
package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
package/scripts/wave-orchestrator/retry-control.mjs +225 -0
package/scripts/wave-orchestrator/shared.mjs +26 -0
package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
package/scripts/wave-orchestrator/traces.mjs +157 -2
package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
package/scripts/wave-orchestrator/wave-files.mjs +17 -5
package/scripts/wave.mjs +27 -0
package/skills/repo-coding-rules/SKILL.md +1 -0
package/skills/role-cont-eval/SKILL.md +1 -0
package/skills/role-cont-qa/SKILL.md +13 -6
package/skills/role-deploy/SKILL.md +1 -0
package/skills/role-documentation/SKILL.md +4 -0
package/skills/role-implementation/SKILL.md +4 -0
package/skills/role-infra/SKILL.md +2 -1
package/skills/role-integration/SKILL.md +15 -8
package/skills/role-planner/SKILL.md +39 -0
package/skills/role-planner/skill.json +21 -0
package/skills/role-research/SKILL.md +1 -0
package/skills/role-security/SKILL.md +2 -2
package/skills/runtime-claude/SKILL.md +2 -1
package/skills/runtime-codex/SKILL.md +1 -0
package/skills/runtime-local/SKILL.md +2 -0
package/skills/runtime-opencode/SKILL.md +1 -0
package/skills/wave-core/SKILL.md +25 -6
package/skills/wave-core/references/marker-syntax.md +16 -8
package/wave.config.json +45 -0

package/docs/plans/wave-orchestrator.md CHANGED

@@ -44,13 +44,20 @@ This runbook is the operational view of the architecture:
 - `pnpm exec wave doctor`
 - `pnpm exec wave launch --lane main --dry-run --no-dashboard`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access`
+- `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex --codex-sandbox danger-full-access --resident-orchestrator`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor claude`
 - `pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor opencode`
 - `pnpm exec wave autonomous --lane main --executor codex --codex-sandbox danger-full-access`
 - `pnpm exec wave feedback list --lane main --pending`
-- `pnpm exec wave coord show --lane main --wave 0 --dry-run`
+- `pnpm exec wave control status --lane main --wave 0 --json`
+- `pnpm exec wave control status --lane main --wave 0 --agent A1 --json`
 - `pnpm exec wave coord inbox --lane main --wave 0 --agent A1 --dry-run`
-- `pnpm exec wave coord post --lane main --wave 0 --agent A1 --kind blocker --summary "Need repository decision"`
+- `pnpm exec wave control task create --lane main --wave 0 --agent A1 --kind blocker --summary "Need repository decision"`
+- `pnpm exec wave control task act reassign --lane main --wave 0 --id <task-id> --to A2`
+- `pnpm exec wave control rerun get --lane main --wave 0 --json`
+- `pnpm exec wave control rerun request --lane main --wave 0 --agent A2 --agent A7 --clear-reuse A2 --reason "resume sibling-owned shared component closure"`
+- `pnpm exec wave control proof register --lane main --wave 9 --agent A7 --artifact .tmp/wave-9-proof/live-status.json --authoritative --completion live --durability durable --proof-level live`
+- `pnpm exec wave local --prompt .tmp/main-wave-launcher/prompts/wave-0-A1.md --log .tmp/main-wave-launcher/logs/wave-0-A1.log --status .tmp/main-wave-launcher/status/wave-0-A1.json`
 - `pnpm exec wave dep show --lane main --wave 0 --json`
 - `pnpm exec wave dep post --owner-lane main --requester-lane release --owner-wave 0 --requester-wave 2 --agent launcher --summary "Need shared-plan reconciliation" --target capability:docs-shared-plan --required`
 - `pnpm exec wave upgrade`
@@ -58,14 +65,15 @@ This runbook is the operational view of the architecture:
 ## Configuration
-- `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences, and Context7 bundle-index location. The starter config also wires the optional security reviewer prompt at `docs/agents/wave-security-role.md` and the `security-review` executor profile.
+- `wave.config.json` controls docs roots, shared plan docs, role prompts, validation thresholds, executor defaults, executor profiles, per-lane runtime policy, skill attachment policy, component-cutover matrix paths, capability-routing preferences, Context7 bundle-index location, and the optional `waveControl` telemetry section. The starter config also wires the optional security reviewer prompt at `docs/agents/wave-security-role.md` and the `security-review` executor profile.
 - `docs/context7/bundles.json` controls allowed external library bundles and lane defaults.
 - `docs/evals/README.md` explains how to author delegated versus pinned `## Eval targets`, including the coordination-oriented benchmark families.
 - `docs/reference/live-proof-waves.md` explains how to author proof-first `pilot-live` and higher-maturity waves with `### Proof artifacts`, sticky executors, and operator command capture.
 - `docs/reference/sample-waves.md` points to showcase-first sample waves that combine the modern authored wave surface in concrete examples.
+- `docs/reference/wave-control.md` documents the Wave Control telemetry and analysis plane, including entity types, artifact upload policies, and the local-first reporting contract.
 - `docs/plans/component-cutover-matrix.json` is the canonical machine-readable source for component maturity and per-wave promotion targets.
 - `.wave/install-state.json` records how the workspace was initialized and which package version is installed.
-- `.wave/project-profile.json` records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
+- `.wave/project-profile.json` (created by `wave project setup`) records planner defaults such as oversight mode, terminal surface, and deploy-environment memory.
 - `.wave/adhoc/runs/<run-id>/` stores transient ad-hoc request, spec, rendered markdown, and result artifacts.
 - ad-hoc documentation closure always writes `.wave/adhoc/runs/<run-id>/reports/`, but shared-plan deltas still queue the canonical lane shared-plan docs.
 - ad-hoc task ownership inference only accepts repo-local paths; URLs and other external references are ignored.
@@ -76,12 +84,12 @@ This runbook is the operational view of the architecture:
 - Wave skill bundles live under `skills/<skill-id>/`.
 - Each bundle requires `skill.json` and `SKILL.md`.
 - Bundles can also include runtime adapters at `adapters/<runtime>.md` for `codex`, `claude`, `opencode`, or `local`.
-- The starter config resolves skills in this order: global base, lane base, global role map, lane role map, global runtime map, lane runtime map, global deploy-kind map, lane deploy-kind map, then explicit per-agent `### Skills`.
+- The starter config merges global and lane skill configs, then resolves in order: base, role, runtime, deploy-kind, and finally explicit per-agent `### Skills`.
 - The effective skill set is recomputed after final executor resolution, including retry-time runtime fallback, so a fallback from one runtime to another also swaps runtime-specific skill overlays.
 - Starter bundles in this repo cover:
   - core Wave coordination and repo coding rules
   - runtime packs for Codex, Claude, OpenCode, and local execution
-  - role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, and research work
+  - role packs for implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, deploy, research, and planner work
   - deploy and environment packs for Railway, Docker Compose, Kubernetes, SSH/manual rollout, and generic custom deploys
   - explicit provider packs for GitHub release flow and AWS norms when a wave or lane wants to attach them
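
Reviewer note: the new skill-resolution wording in the hunk above (merge global and lane skill configs, then resolve base, role, runtime, deploy-kind, and explicit per-agent skills) can be pictured with a minimal sketch. The config keys `base`, `byRole`, `byRuntime`, and `byDeployKind`, the agent fields, and the de-duplication rule are all assumptions made for illustration; the package's real schema and merge semantics may differ.

```javascript
// Hypothetical config shape: `base`, `byRole`, `byRuntime`, `byDeployKind`
// are illustrative names, not the package's confirmed schema.
function mergeSkillConfigs(globalCfg = {}, laneCfg = {}) {
  // Merge two { key: [skills] } maps, keeping global entries before lane entries.
  const mergeMap = (a = {}, b = {}) => {
    const out = {};
    for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
      out[key] = [...(a[key] ?? []), ...(b[key] ?? [])];
    }
    return out;
  };
  return {
    base: [...(globalCfg.base ?? []), ...(laneCfg.base ?? [])],
    byRole: mergeMap(globalCfg.byRole, laneCfg.byRole),
    byRuntime: mergeMap(globalCfg.byRuntime, laneCfg.byRuntime),
    byDeployKind: mergeMap(globalCfg.byDeployKind, laneCfg.byDeployKind),
  };
}

function resolveSkills(globalCfg, laneCfg, agent) {
  const merged = mergeSkillConfigs(globalCfg, laneCfg);
  // Resolution order from the runbook: base, role, runtime, deploy-kind,
  // then the agent's explicit skills.
  const ordered = [
    ...merged.base,
    ...(merged.byRole[agent.role] ?? []),
    ...(merged.byRuntime[agent.runtime] ?? []),
    ...(merged.byDeployKind[agent.deployKind] ?? []),
    ...(agent.skills ?? []),
  ];
  // Assumption: duplicates collapse and the first occurrence wins.
  return [...new Set(ordered)];
}

const globalCfg = {
  base: ["wave-core"],
  byRole: { planner: ["role-planner"] },
  byRuntime: { codex: ["runtime-codex"], claude: ["runtime-claude"] },
};
const laneCfg = { base: ["repo-coding-rules"] };
const agent = { role: "planner", runtime: "codex", skills: ["role-documentation"] };

console.log(resolveSkills(globalCfg, laneCfg, agent));
// Recomputing after a runtime fallback swaps the runtime-specific overlay.
console.log(resolveSkills(globalCfg, laneCfg, { ...agent, runtime: "claude" }));
```

Recomputing the set from the final runtime, rather than patching the earlier result, mirrors the runbook's statement that the effective skill set is recomputed after executor resolution.
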
@@ -124,13 +132,22 @@ pnpm exec wave launch --lane main --start-wave 0 --end-wave 0 --executor codex -
 ## Coordination Surfaces
-- `wave coord show` is a read-only view of the materialized coordination state for a wave.
+- `wave control status` is the read-only projection for "why blocked / why retrying" at wave or agent scope. It returns blocking edges, logical agent state, tasks, dependencies, rerun intent, active proof bundles, and next timers from one materialized control-plane view.
+- `wave control task create|get|list|act` is the operator task surface for blocking requests, blockers, clarification chains, human-input tickets, escalations, and informative handoffs, evidence, claims, and decisions. `wave control status` only treats requests, blockers, clarifications, human-input, escalations, helper assignments, and required dependencies as blocking edges.
+- `wave control rerun request|get|clear` manages targeted rerun intent under `.tmp/<lane>-wave-launcher/control-plane/` and projects compatible retry overrides under `.tmp/<lane>-wave-launcher/control/`, including selected agents, reuse selectors, invalidated components, and clear or preserve reuse lists.
+- `wave control proof register|get|supersede|revoke` manages authoritative proof bundles in the same control-plane log and projects compatible proof registries under `.tmp/<lane>-wave-launcher/proof/`.
+- `wave control telemetry status|flush` inspects and delivers the local Wave Control event queue. Pass `--no-telemetry` on `wave launch` to disable event publication for a single run.
 - `wave coord render` regenerates the markdown board projection from the canonical coordination log.
 - `wave coord inbox` writes the compiled shared summary plus the selected agent inbox.
-- `wave coord post` appends a structured record to the coordination log. This is the machine-readable path for blockers, handoffs, evidence, targeted requests, and clarification requests.
+Compatibility note:
+- `wave coord`, `wave retry`, and `wave proof` remain available as compatibility surfaces, but new operator docs and runbooks should prefer `wave control`.
 The canonical state is the JSONL log under `.tmp/<lane>-wave-launcher/coordination/`. The markdown board is a generated projection for humans, not the scheduler's source of truth.
+Control-plane facts that drive reruns, proof, attempt state, and operator tasks are appended separately under `.tmp/<lane>-wave-launcher/control-plane/`. Legacy proof and retry files remain derived projections for compatibility, not the source of truth.
 Capability-targeted requests now become deterministic helper assignments. The launcher resolves the assignee from explicit targets, `capabilityRouting.preferredAgents`, then least-busy matching capability owners, writes that assignment into `.tmp/<lane>-wave-launcher/assignments/`, mirrors the decision into coordination state, and keeps the wave blocked until the linked follow-up resolves.
 Clarification flow is orchestrator-first:
@@ -141,6 +158,19 @@ Clarification flow is orchestrator-first:
 4. Routed clarification follow-up requests remain blocking until they resolve.
 5. Human escalations are written back into coordination state, the ledger, and trace artifacts.
+During live runs, the launcher now keeps an active orchestration loop while agents are still running. It refreshes the derived coordination surfaces on cadence, surfaces overdue acknowledgements and stale clarification chains in dashboards and traces, and can reroute clarification follow-up requests inside the same attempt when the routed owner never acknowledges them.
+If you opt into `--resident-orchestrator`, the launcher also starts a long-running non-owning orchestrator session for the wave. That session can inspect the same coordination artifacts and intervene through coordination records, but the launcher remains the scheduler truth and closure authority.
+Retry intent, operator tasks, attempt lifecycle, and proof injection are now first-class control-plane artifacts rather than manual file surgery:
+- canonical control events live under `.tmp/<lane>-wave-launcher/control-plane/`
+- projected retry overrides still live under `.tmp/<lane>-wave-launcher/control/`
+- projected proof registries still live under `.tmp/<lane>-wave-launcher/proof/`
+- live traces now copy the control-plane log alongside the proof registry so replay keeps the same operator-visible facts
+For a full end-to-end explainer of helper assignments, deliverables, integration, and why an agent can be locally done while the wave stays blocked, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
 ## Cross-Lane Dependencies
 - `wave dep post` appends a typed dependency ticket under `.tmp/wave-orchestrator/dependencies/`.
@@ -181,8 +211,10 @@ pnpm exec wave changelog --since-installed
 - prompts: `.tmp/<lane>-wave-launcher/prompts/`
 - logs: `.tmp/<lane>-wave-launcher/logs/`
+- run-state: `.tmp/<lane>-wave-launcher/run-state.json`
+  Keeps compatibility `completedWaves`, but now also stores per-wave current state plus append-only transition history and completion or blocker evidence.
 - status summaries: `.tmp/<lane>-wave-launcher/status/`
-  `run-state.json` keeps compatibility `completedWaves`, but now also stores per-wave current state plus append-only transition history and completion or blocker evidence. Relaunch plans in this directory are schema-versioned.
+  Relaunch plans in this directory are schema-versioned.
 - coordination logs: `.tmp/<lane>-wave-launcher/coordination/`
 - helper-assignment snapshots: `.tmp/<lane>-wave-launcher/assignments/`
 - message boards: `.tmp/<lane>-wave-launcher/messageboards/`
@@ -195,6 +227,12 @@ pnpm exec wave changelog --since-installed
 - dependency snapshots: `.tmp/<lane>-wave-launcher/dependencies/`
 - docs queue: `.tmp/<lane>-wave-launcher/docs-queue/`
 - trace bundles: `.tmp/<lane>-wave-launcher/traces/`
+- control-plane events: `.tmp/<lane>-wave-launcher/control-plane/`
+  Canonical append-only JSONL log of operator tasks, rerun requests, proof bundles, attempt lifecycle, and human-input events. This is the source of truth for `wave control`. Telemetry queue lives under `control-plane/telemetry/`.
+- proof registries: `.tmp/<lane>-wave-launcher/proof/`
+  Projected from control-plane state for compatibility. Operator-registered authoritative proof bundles that feed integration, cont-QA, and replay.
+- retry overrides: `.tmp/<lane>-wave-launcher/control/`
+  Projected from control-plane state for compatibility. Operator-applied targeted retry overrides, applied once per attempt and then cleared by the launcher.
 - clarification triage: `.tmp/<lane>-wave-launcher/feedback/triage/`
 - dashboards: `.tmp/<lane>-wave-launcher/dashboards/`
   Dashboard JSON is a versioned contract. `global.json` and `wave-<n>.json` now carry explicit `schemaVersion` and `kind` fields.
@@ -220,10 +258,13 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
   - `ledger.json`
   - `docs-queue.json`
   - `security.json`
+  - `capability-assignments.json`
+  - `dependency-snapshot.json`
   - `integration.json`
   - `outcome.json`
   - `shared-summary.md`
   - copied prompt, log, status, inbox, and summary artifacts per launched agent
+  - `control-plane.raw.jsonl`
   - `structured-signals.json`
   - `quality.json`
   - `run-metadata.json`
@@ -232,7 +273,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - For `traceVersion: 2`, launched agents must have copied prompt/log/status/inbox/summary artifacts, and promoted-component waves must include the copied component matrix JSON.
 - `security.json` stores the derived per-wave security state that feeds integration summaries, gate snapshots, and replay.
 - `quality.json` is cumulative through the current attempt. It is intended for regression comparison, not only for one-shot pass/fail reporting.
-- `quality.json` also reports capability-assignment and dependency-resolution metrics in addition to the Phase 2/3 communication, fallback, and closure metrics.
+- `quality.json` also reports capability-assignment and dependency-resolution metrics, plus coordination response metrics (overdue acknowledgements, clarification timing, human escalation counts), in addition to the Phase 2/3 communication, fallback, and closure metrics.
 - Replay support is internal. The source tree contains helpers to load, validate, and replay trace bundles against the same gate logic the launcher uses, but there is no public replay CLI yet.
 - Replay is read-only and hash-validating for `traceVersion: 2` bundles. It ignores inline summary duplicates in `run-metadata.json` and returns a stored-vs-recomputed comparison report for gate and quality state. Legacy `traceVersion: 1` bundles remain best-effort and emit warnings instead of claiming full hermetic replay.
@@ -257,7 +298,7 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
 - Optional standing roles available in this repo include `docs/agents/wave-infra-role.md` for infra proof and `docs/agents/wave-deploy-verifier-role.md` for rollout verification.
 - Keep file ownership explicit inside each `### Prompt`.
 - From the configured thresholds onward, declare `## Context7 defaults`, per-agent `### Context7`, and per-agent `### Exit contract`.
-- For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md).
+- For benchmark-family guidance and delegated-versus-pinned eval examples, see [docs/evals/README.md](../evals/README.md). External benchmark failure reviews classify outcomes into categories (`verifier-image`, `setup-harness`, `timeout`, `blocked-proof`, `missing-context`, `partial-fix`, `wrong-fix`, `unknown`) which feed the failure-review tooling available through `wave benchmark external-show`.
 - For proof-first live-wave patterns, sticky retry guidance, and `### Proof artifacts` examples, see [docs/reference/live-proof-waves.md](../reference/live-proof-waves.md).
 - Agents should use `wave coord post` for durable blockers, handoffs, evidence, and requests instead of relying on ad hoc board edits.
 - Keep shared plan docs and the component cutover matrix owned by the configured documentation steward once that rule becomes active.
@@ -267,6 +308,14 @@ The launcher entrypoint in `scripts/wave-orchestrator/launcher.mjs` now delegate
   [claude.md](../reference/runtime-config/claude.md),
   [opencode.md](../reference/runtime-config/opencode.md).
+## Benchmark CLI
+- `wave benchmark list` lists local benchmark cases from the catalog.
+- `wave benchmark show --case <id>` shows a single case definition.
+- `wave benchmark run --case <id>` executes a local deterministic case.
+- `wave benchmark adapters` lists available external benchmark adapters.
+- `wave benchmark external-list|external-show|external-run|external-pilots` manage external benchmark targets (e.g., SWE-bench Pro).
 ## Executor Modes
 - `--executor codex` uses `codex exec` with the generated task prompt piped through stdin.
@@ -335,3 +384,5 @@ Live closure is fail-closed:
 - Security review requires a report artifact plus a structured `[wave-security]` marker. `state=blocked` stops the wave before integration, while `state=concerns` is preserved in summaries and traces without automatically failing closure.
 - `cont-QA` PASS requires both the final verdict and the final `[wave-gate]` marker.
 - Legacy evaluator-era or underspecified closure artifacts are still readable in replay and trace analysis, but they no longer satisfy live completion.
+For a detailed worked example of cross-agent follow-up and staged closure, see [docs/reference/coordination-and-closure.md](../reference/coordination-and-closure.md).
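
Reviewer note: the Coordination Surfaces hunk states that the launcher resolves a capability-targeted request's assignee from explicit targets, then `capabilityRouting.preferredAgents`, then least-busy matching capability owners. A minimal sketch of that order follows. The field names `targets`, `capability`, `capabilities`, and `openAssignments`, the `via` labels, and the tie-break by agent id are assumptions for illustration, not the launcher's actual data model.

```javascript
// Hypothetical shapes: request.targets, agent.capabilities, and
// agent.openAssignments are illustrative, not the launcher's real fields.
function resolveAssignee(request, agents, preferredAgents = {}) {
  const known = new Set(agents.map((a) => a.id));

  // 1. Explicit targets win when they name a known agent.
  const explicit = (request.targets ?? []).find((id) => known.has(id));
  if (explicit) return { assignee: explicit, via: "explicit-target" };

  // 2. Otherwise use the preferred agents configured for the capability.
  const preferred = (preferredAgents[request.capability] ?? []).find((id) =>
    known.has(id),
  );
  if (preferred) return { assignee: preferred, via: "preferred-agent" };

  // 3. Otherwise pick the least-busy agent that owns the capability.
  //    Assumption: ties are broken by agent id to keep the result deterministic.
  const owners = agents
    .filter((a) => (a.capabilities ?? []).includes(request.capability))
    .sort(
      (a, b) =>
        (a.openAssignments ?? 0) - (b.openAssignments ?? 0) ||
        a.id.localeCompare(b.id),
    );
  if (owners.length > 0) return { assignee: owners[0].id, via: "least-busy-owner" };

  return { assignee: null, via: "unresolved" };
}

const agents = [
  { id: "A1", capabilities: ["docs-shared-plan"], openAssignments: 2 },
  { id: "A2", capabilities: ["docs-shared-plan"], openAssignments: 0 },
  { id: "A7", capabilities: [], openAssignments: 0 },
];

console.log(resolveAssignee({ capability: "docs-shared-plan", targets: ["A7"] }, agents));
console.log(resolveAssignee({ capability: "docs-shared-plan" }, agents, { "docs-shared-plan": ["A1"] }));
console.log(resolveAssignee({ capability: "docs-shared-plan" }, agents));
```

An unresolved request returns `assignee: null` here; how the real launcher reports that case is not stated in the diff.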

package/docs/plans/waves/reviews/wave-1-benchmark-operator.md ADDED

@@ -0,0 +1,118 @@
+# Wave 1 Benchmark Operator Review
+## Scope
+This document reviews the `SWE-bench Pro` 10-task `full-wave` review-only batch run.
+- manifest: `docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json`
+- command config: `docs/evals/external-command-config.swe-bench-pro.json`
+- source evidence: recorded aggregate results plus per-task verifier stdout/stderr logs and integration summaries from the benchmark worktree pass
+Command used:
+```bash
+node "scripts/wave.mjs" benchmark external-run \
+  --adapter swe-bench-pro \
+  --manifest docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json \
+  --arm full-wave \
+  --command-config docs/evals/external-command-config.swe-bench-pro.json \
+  --model-id gpt-5-codex \
+  --executor-id codex \
+  --executor-command "codex exec" \
+  --tool-permissions "Read,Write,Edit,Bash" \
+  --temperature 0 \
+  --reasoning-effort high \
+  --max-wall-clock-minutes 15 \
+  --max-turns 250 \
+  --retry-limit 0 \
+  --verification-harness official-swe-bench-pro \
+  --dataset-version public-v1 \
+  --output-dir .tmp/wave-benchmarks/external/swe-bench-pro-full-wave-review-10 \
+  --json
+```
+This was a `review-only` run, not a matched `single-agent` versus `full-wave` comparison.
+## Verdict
+- Official resolved score: `0/10`
+- Interpretable capability score: `not valid for external comparison`
+- Recommendation: `blocked`
+Why this is blocked:
+- `7/10` tasks reached the official SWE-bench Pro evaluator, but the evaluator could not pull the expected Docker image tag from `jefzda/sweap-images`, so those zeros are not trustworthy model-performance failures.
+- `3/10` tasks failed earlier in harness or repository setup before a trustworthy benchmark judgment existed.
+- The raw aggregate `reviewBuckets` from the runner said `harness-env=10`; that was directionally closer to the truth than `incorrect-patch`, but still too coarse. The corrected manual buckets below are the review-ready interpretation.
+## Aggregate Metrics
+Recorded totals from the 10-task batch:
+- tasks: `10`
+- solved: `0`
+- success rate: `0%`
+- total wall clock: `2810439 ms`
+- token totals:
+  - `input_tokens = 59155820`
+  - `cached_input_tokens = 54180608`
+  - `output_tokens = 278308`
+Corrected manual failure buckets:
+- `7` verifier-image failures
+- `3` setup or harness failures before trustworthy scoring
+- `0` trustworthy patch-quality failures established by the official verifier
+## Task Scorecard
+Scoring convention used here:
+- `official score`: the raw `0/1` result recorded by the run artifacts
+- `review score`: whether that official score is trustworthy enough to interpret as model capability evidence
+| Task | Repo | Official score | Review score | Wall clock | Notes |
+| --- | --- | --- | --- | ---: | --- |
+| `instance_NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan` | `NodeBB/NodeBB` | `0` | `invalidated` | `807464 ms` | Full-wave solve ran and produced a patch, but the official evaluator failed to pull `jefzda/sweap-images:nodebb.nodebb-NodeBB__NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5` and returned `None`. |
+| `instance_qutebrowser__qutebrowser-f91ace96223cac8161c16dd061907e138fe85111-v059c6fdc75567943479b23ebca7c07b5e9a7f34c` | `qutebrowser/qutebrowser` | `0` | `invalidated` | `369151 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `qutebrowser` image tag and returned `None`. |
+| `instance_ansible__ansible-f327e65d11bb905ed9f15996024f857a95592629-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5` | `ansible/ansible` | `0` | `setup-failure` | `499457 ms` | Patch extraction failed during `git diff`; the task workspace had local `.venv` churn, so this never reached a trustworthy verifier judgment. |
+| `instance_internetarchive__openlibrary-4a5d2a7d24c9e4c11d3069220c0685b736d5ecde-v13642507b4fc1f8d234172bf8129942da2c2ca26` | `internetarchive/openlibrary` | `0` | `invalidated` | `95 ms` | The official evaluator failed to pull the expected `openlibrary` image tag and returned `None`. |
+| `instance_gravitational__teleport-3fa6904377c006497169945428e8197158667910-v626ec2a48416b10a88641359a169d99e935ff037` | `gravitational/teleport` | `0` | `setup-failure` | `64527 ms` | `wave init` failed because the repo already contained Wave bootstrap files and the harness still used the non-adopt path. |
+| `instance_navidrome__navidrome-7073d18b54da7e53274d11c9e2baef1242e8769e` | `navidrome/navidrome` | `0` | `invalidated` | `417099 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `navidrome` image tag and returned `None`. |
+| `instance_element-hq__element-web-33e8edb3d508d6eefb354819ca693b7accc695e7` | `element-hq/element-web` | `0` | `invalidated` | `510260 ms` | Solve ran and produced a patch, but the official evaluator failed to pull the expected `element-web` image tag and returned `None`. |
+| `instance_future-architect__vuls-407407d306e9431d6aa0ab566baa6e44e5ba2904` | `future-architect/vuls` | `0` | `invalidated` | `115 ms` | The official evaluator failed to pull the expected `vuls` image tag and returned `None`. |
+| `instance_flipt-io__flipt-e42da21a07a5ae35835ec54f74004ebd58713874` | `flipt-io/flipt` | `0` | `invalidated` | `104 ms` | The official evaluator failed to pull the expected `flipt` image tag and returned `None`. |
+| `instance_protonmail__webclients-2c3559cad02d1090985dba7e8eb5a129144d9811` | `protonmail/webclients` | `0` | `setup-failure` | `142167 ms` | Repository preparation failed before solving because the target commit tree could not be read locally (`fatal: Could not parse object ...`). |
+## What The Batch Actually Tells Us
+This run does establish a few useful things:
+- The 10-task `full-wave` review path is now executable end to end through `wave benchmark external-run --arm full-wave`.
+- The harness now persists enough task-level evidence to audit failures: patch paths, verifier stdout and stderr, output dirs, and integration summaries.
+- At least several tasks did enter real multi-agent execution and produced patches before the verifier step.
+This run does **not** establish:
+- a trustworthy `SWE-bench Pro` success rate for `full-wave`
+- a comparison against public leaderboard systems
+- a comparison against our own `single-agent` baseline
+## Comparison Context
+Context only, not head-to-head:
+- The public `SWE-bench Pro` leaderboard reports top public-set systems in roughly the `41%` to `46%` range across the full public benchmark, not `0%`.
+- Because this review run was invalidated by verifier-image and setup failures, the current `0/10` should not be treated as a clean external capability comparison against those systems.
+Official sources:
+- `https://scale.com/leaderboard/swe_bench_pro_public`
+- `https://scaleapi.github.io/SWE-bench_Pro-os/`
+## Follow-up Required Before Publication
+- Fix verifier image resolution so the official evaluator can actually score all selected tasks.
+- Fix the `teleport` harness path so repos with existing Wave bootstrap files use the adopt-existing flow when needed.
+- Fix the `ansible` patch-extraction path so local environment bootstrapping cannot pollute the generated patch.
+- Re-run the same frozen 10-task manifest after those harness fixes before making any external-performance claim.
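
Reviewer note: the review's aggregate figures can be cross-checked against its own task scorecard. The sketch below re-enters the review scores and wall-clock values from the table by hand and recomputes the bucket counts and total. The row shape (`repo`, `review`, `wallClockMs`) is an illustrative assumption, not the benchmark runner's output format.

```javascript
// Values transcribed from the Task Scorecard table; the object shape is
// hypothetical and chosen only for this cross-check.
const scorecard = [
  { repo: "NodeBB/NodeBB", review: "invalidated", wallClockMs: 807464 },
  { repo: "qutebrowser/qutebrowser", review: "invalidated", wallClockMs: 369151 },
  { repo: "ansible/ansible", review: "setup-failure", wallClockMs: 499457 },
  { repo: "internetarchive/openlibrary", review: "invalidated", wallClockMs: 95 },
  { repo: "gravitational/teleport", review: "setup-failure", wallClockMs: 64527 },
  { repo: "navidrome/navidrome", review: "invalidated", wallClockMs: 417099 },
  { repo: "element-hq/element-web", review: "invalidated", wallClockMs: 510260 },
  { repo: "future-architect/vuls", review: "invalidated", wallClockMs: 115 },
  { repo: "flipt-io/flipt", review: "invalidated", wallClockMs: 104 },
  { repo: "protonmail/webclients", review: "setup-failure", wallClockMs: 142167 },
];

function summarize(rows) {
  const buckets = {};
  let totalMs = 0;
  for (const row of rows) {
    // Count tasks per review score and accumulate wall-clock time.
    buckets[row.review] = (buckets[row.review] ?? 0) + 1;
    totalMs += row.wallClockMs;
  }
  return { tasks: rows.length, buckets, totalMs };
}

const summary = summarize(scorecard);
console.log(summary);
```

The recomputed values agree with the review: 10 tasks, 7 `invalidated` rows (the verifier-image failures), 3 `setup-failure` rows, and a total of 2810439 ms, which matches the recorded total wall clock.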