@chllming/wave-orchestration 0.5.3 → 0.6.0
- package/CHANGELOG.md +53 -3
- package/README.md +81 -506
- package/docs/README.md +53 -0
- package/docs/agents/wave-cont-eval-role.md +36 -0
- package/docs/agents/{wave-evaluator-role.md → wave-cont-qa-role.md} +14 -11
- package/docs/agents/wave-documentation-role.md +1 -1
- package/docs/agents/wave-infra-role.md +1 -1
- package/docs/agents/wave-integration-role.md +3 -3
- package/docs/agents/wave-launcher-role.md +4 -3
- package/docs/agents/wave-security-role.md +40 -0
- package/docs/concepts/context7-vs-skills.md +94 -0
- package/docs/concepts/operating-modes.md +91 -0
- package/docs/concepts/runtime-agnostic-orchestration.md +95 -0
- package/docs/concepts/what-is-a-wave.md +183 -0
- package/docs/evals/README.md +166 -0
- package/docs/evals/benchmark-catalog.json +663 -0
- package/docs/guides/author-and-run-waves.md +135 -0
- package/docs/guides/planner.md +118 -0
- package/docs/guides/terminal-surfaces.md +82 -0
- package/docs/image.png +0 -0
- package/docs/plans/component-cutover-matrix.json +1 -1
- package/docs/plans/component-cutover-matrix.md +1 -1
- package/docs/plans/context7-wave-orchestrator.md +2 -0
- package/docs/plans/current-state.md +29 -1
- package/docs/plans/examples/wave-example-live-proof.md +435 -0
- package/docs/plans/master-plan.md +3 -3
- package/docs/plans/migration.md +46 -3
- package/docs/plans/wave-orchestrator.md +71 -8
- package/docs/plans/waves/wave-0.md +4 -4
- package/docs/reference/live-proof-waves.md +177 -0
- package/docs/reference/migration-0.2-to-0.5.md +26 -19
- package/docs/reference/npmjs-trusted-publishing.md +6 -5
- package/docs/reference/runtime-config/README.md +29 -0
- package/docs/reference/sample-waves.md +87 -0
- package/docs/reference/skills.md +224 -0
- package/docs/research/agent-context-sources.md +130 -11
- package/docs/research/coordination-failure-review.md +266 -0
- package/docs/roadmap.md +164 -564
- package/package.json +3 -2
- package/releases/manifest.json +37 -2
- package/scripts/research/agent-context-archive.mjs +83 -1
- package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +811 -0
- package/scripts/wave-orchestrator/adhoc.mjs +1331 -0
- package/scripts/wave-orchestrator/agent-state.mjs +358 -6
- package/scripts/wave-orchestrator/artifact-schemas.mjs +173 -0
- package/scripts/wave-orchestrator/clarification-triage.mjs +10 -3
- package/scripts/wave-orchestrator/config.mjs +65 -12
- package/scripts/wave-orchestrator/context7.mjs +11 -0
- package/scripts/wave-orchestrator/coord-cli.mjs +51 -19
- package/scripts/wave-orchestrator/coordination-store.mjs +26 -4
- package/scripts/wave-orchestrator/coordination.mjs +99 -9
- package/scripts/wave-orchestrator/dashboard-state.mjs +20 -8
- package/scripts/wave-orchestrator/dep-cli.mjs +5 -2
- package/scripts/wave-orchestrator/docs-queue.mjs +8 -2
- package/scripts/wave-orchestrator/evals.mjs +451 -0
- package/scripts/wave-orchestrator/executors.mjs +24 -11
- package/scripts/wave-orchestrator/feedback.mjs +15 -1
- package/scripts/wave-orchestrator/install.mjs +69 -7
- package/scripts/wave-orchestrator/launcher-closure.mjs +281 -0
- package/scripts/wave-orchestrator/launcher-runtime.mjs +334 -0
- package/scripts/wave-orchestrator/launcher.mjs +778 -577
- package/scripts/wave-orchestrator/ledger.mjs +123 -20
- package/scripts/wave-orchestrator/local-executor.mjs +99 -12
- package/scripts/wave-orchestrator/planner.mjs +1463 -0
- package/scripts/wave-orchestrator/project-profile.mjs +190 -0
- package/scripts/wave-orchestrator/replay.mjs +6 -3
- package/scripts/wave-orchestrator/role-helpers.mjs +84 -0
- package/scripts/wave-orchestrator/shared.mjs +77 -11
- package/scripts/wave-orchestrator/skills.mjs +979 -0
- package/scripts/wave-orchestrator/terminals.mjs +16 -0
- package/scripts/wave-orchestrator/traces.mjs +73 -27
- package/scripts/wave-orchestrator/wave-files.mjs +1224 -163
- package/scripts/wave.mjs +20 -0
- package/skills/README.md +202 -0
- package/skills/provider-aws/SKILL.md +117 -0
- package/skills/provider-aws/adapters/claude.md +1 -0
- package/skills/provider-aws/adapters/codex.md +1 -0
- package/skills/provider-aws/references/service-verification.md +39 -0
- package/skills/provider-aws/skill.json +54 -0
- package/skills/provider-custom-deploy/SKILL.md +64 -0
- package/skills/provider-custom-deploy/skill.json +50 -0
- package/skills/provider-docker-compose/SKILL.md +96 -0
- package/skills/provider-docker-compose/adapters/local.md +1 -0
- package/skills/provider-docker-compose/skill.json +53 -0
- package/skills/provider-github-release/SKILL.md +121 -0
- package/skills/provider-github-release/adapters/claude.md +1 -0
- package/skills/provider-github-release/adapters/codex.md +1 -0
- package/skills/provider-github-release/skill.json +55 -0
- package/skills/provider-kubernetes/SKILL.md +143 -0
- package/skills/provider-kubernetes/adapters/claude.md +1 -0
- package/skills/provider-kubernetes/adapters/codex.md +1 -0
- package/skills/provider-kubernetes/references/kubectl-patterns.md +58 -0
- package/skills/provider-kubernetes/skill.json +52 -0
- package/skills/provider-railway/SKILL.md +123 -0
- package/skills/provider-railway/adapters/claude.md +1 -0
- package/skills/provider-railway/adapters/codex.md +1 -0
- package/skills/provider-railway/adapters/local.md +1 -0
- package/skills/provider-railway/adapters/opencode.md +1 -0
- package/skills/provider-railway/references/verification-commands.md +39 -0
- package/skills/provider-railway/skill.json +71 -0
- package/skills/provider-ssh-manual/SKILL.md +97 -0
- package/skills/provider-ssh-manual/skill.json +54 -0
- package/skills/repo-coding-rules/SKILL.md +91 -0
- package/skills/repo-coding-rules/skill.json +34 -0
- package/skills/role-cont-eval/SKILL.md +90 -0
- package/skills/role-cont-eval/adapters/codex.md +1 -0
- package/skills/role-cont-eval/skill.json +36 -0
- package/skills/role-cont-qa/SKILL.md +93 -0
- package/skills/role-cont-qa/adapters/claude.md +1 -0
- package/skills/role-cont-qa/skill.json +36 -0
- package/skills/role-deploy/SKILL.md +96 -0
- package/skills/role-deploy/skill.json +36 -0
- package/skills/role-documentation/SKILL.md +72 -0
- package/skills/role-documentation/skill.json +36 -0
- package/skills/role-implementation/SKILL.md +68 -0
- package/skills/role-implementation/skill.json +36 -0
- package/skills/role-infra/SKILL.md +80 -0
- package/skills/role-infra/skill.json +36 -0
- package/skills/role-integration/SKILL.md +84 -0
- package/skills/role-integration/skill.json +36 -0
- package/skills/role-research/SKILL.md +64 -0
- package/skills/role-research/skill.json +36 -0
- package/skills/role-security/SKILL.md +60 -0
- package/skills/role-security/skill.json +36 -0
- package/skills/runtime-claude/SKILL.md +65 -0
- package/skills/runtime-claude/skill.json +36 -0
- package/skills/runtime-codex/SKILL.md +57 -0
- package/skills/runtime-codex/skill.json +36 -0
- package/skills/runtime-local/SKILL.md +44 -0
- package/skills/runtime-local/skill.json +36 -0
- package/skills/runtime-opencode/SKILL.md +57 -0
- package/skills/runtime-opencode/skill.json +36 -0
- package/skills/wave-core/SKILL.md +114 -0
- package/skills/wave-core/references/marker-syntax.md +62 -0
- package/skills/wave-core/skill.json +35 -0
- package/wave.config.json +61 -5
@@ -0,0 +1,183 @@
# What Is A Wave?

A wave is the main planning and execution unit in Wave Orchestration.

It is not just a prompt file. A wave is a bounded slice of repository work with:

- explicit scope
- named owners
- runtime and context requirements
- proof and closure rules
- durable coordination state
- replayable execution artifacts

## Core Terms

- Lane
  An ordered sequence of waves. The default lane in this repo is `main`.
- Wave
  One numbered work package inside a lane, usually stored as `docs/plans/waves/wave-<n>.md`.
- Agent
  One role inside the wave, such as implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, or deploy.
- Attempt
  One execution pass of a wave. A wave can have multiple attempts due to retries or fallback.
- Closure
  The final proof pass that decides whether the wave is actually done, not just partially implemented.

## Why Waves Exist

Waves force a higher planning bar than ad hoc prompts. A good wave answers:

- What is changing now, and why now?
- Which components or docs are in scope?
- Which agent owns each slice?
- What evidence closes the wave?
- Which dependencies, helper requests, or escalations can still block completion?

## Wave Anatomy

Wave markdown is the authored execution surface today. A typical wave can include:

- title and commit message
- project profile details such as oversight mode and lane
- sequencing note
- reference rule
- deploy environments
- component promotions
- eval targets
- Context7 defaults
- one `## Agent ...` block per role

Inside each agent block, the important sections are:

- `### Role prompts`
  Standing role identity imported from `docs/agents/*.md`.
- `### Executor`
  Runtime selection, profile, model, fallbacks, and budgets.
- `## Eval targets`
  Optional wave-level contract for `cont-EVAL`, including benchmark family or pinned benchmarks, objective, and stop condition.
  See [docs/evals/README.md](../evals/README.md) for guidance on delegated versus pinned targets and the coordination benchmark families.
- `### Proof artifacts`
  Optional machine-visible local evidence required for proof-centric waves, especially `pilot-live` and above.
- `### Context7`
  External library truth to prefetch and inject.
- `### Skills`
  Reusable repo-owned environment or workflow guidance resolved after runtime selection.
- `### Components`
  The components that agent is responsible for proving or promoting.
- `### Capabilities`
  Optional routing hints for follow-up work.
- `### Deliverables`
  Exact repo-relative outputs that must exist before closure can pass.
- `### Prompt`
  The specific task, file ownership, requirements, and validation instructions.
- `### Exit contract`
  The completion, durability, proof, and documentation expectations that gate closure.

## Standard Roles

The starter runtime expects three standard closure roles plus up to two optional review specialists:

- `A8`
  Integration steward
- `A9`
  Documentation steward
- `A0`
  cont-QA
- `E0`
  Optional `cont-EVAL` for iterative benchmark or output tuning; report-only by default, implementation-owning only when explicitly assigned non-report files
- `A7`
  Optional security reviewer; report-only by default and used to publish a threat-model-first security review before integration closure

Implementation or specialist agents own the actual work slices. Closure roles do not replace implementation ownership; they decide whether the combined result is closure-ready. `cont-EVAL` is the one hybrid role: most waves keep it report-only, but human-authored waves may assign explicit tuning files to `E0`, in which case it must satisfy both implementation proof and eval proof.

## Lifecycle Of A Wave

1. Author or draft the wave.
2. Run `wave launch --dry-run --no-dashboard`.
3. The launcher validates the wave, resolves executors and skills, builds prompts, and materializes operator surfaces.
4. A live run launches implementation agents first when implementation work remains.
5. Agents write structured coordination events instead of relying on ad hoc terminal output.
6. The launcher checks implementation contracts, promoted-component proof, helper assignments, dependencies, and clarification state.
7. If implementation is ready, closure runs in order: optional `cont-EVAL`, optional security review, integration, documentation, then cont-QA.
8. The attempt is captured in per-wave traces, ledgers, inboxes, summaries, and copied artifacts.
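
The closure ordering in step 7 can be sketched as a small filter over a fixed sequence. This is an illustration only, not the launcher's code: `planClosure` and the `CLOSURE_ORDER` table are hypothetical names, and the agent ids are the starter defaults listed under Standard Roles.

```js
// Closure order from step 7. Optional roles run only when the wave declares them.
// The ids are the starter defaults; the table shape is an assumption.
const CLOSURE_ORDER = [
  { id: "E0", role: "cont-EVAL", optional: true },
  { id: "A7", role: "security review", optional: true },
  { id: "A8", role: "integration", optional: false },
  { id: "A9", role: "documentation", optional: false },
  { id: "A0", role: "cont-QA", optional: false },
];

// Hypothetical helper: returns the closure agent ids to run, in order.
function planClosure(declaredAgentIds) {
  const declared = new Set(declaredAgentIds);
  return CLOSURE_ORDER
    .filter((step) => !step.optional || declared.has(step.id))
    .map((step) => step.id);
}

// A wave with two implementation agents and a security reviewer, no cont-EVAL.
console.log(planClosure(["A1", "A2", "A7"])); // [ 'A7', 'A8', 'A9', 'A0' ]
```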

## Runtime And Operating Posture

Wave is runtime agnostic at the orchestration layer.

Planning, ownership, closure, durable state, and traces do not depend on whether an agent runs on Codex, Claude Code, OpenCode, or the local smoke executor. Runtime-specific behavior is isolated to executor adapters and overlays.

That means a wave should usually be authored in runtime-neutral terms:

- ownership and deliverables
- proof and validation
- closure order
- dependencies and helper flow
- promoted component expectations

The runtime choice resolves later, from the agent executor block, profile defaults, lane defaults, CLI overrides, and fallback policy.

Wave also has an execution posture:

- `oversight`
  Human review or intervention is expected for risky or ambiguous work.
- `dark-factory`
  The wave is authored for routine execution without normal human intervention.

Today these postures are planning vocabulary and saved project defaults, not two separate execution engines. Human feedback is still an escalation mechanism inside the orchestration loop, not the definition of the operating mode itself.

If you need the narrower supporting pages, see [runtime-agnostic-orchestration.md](./runtime-agnostic-orchestration.md) and [operating-modes.md](./operating-modes.md).

Current live waves are strict about closure artifacts:

- `cont-EVAL` must emit a structured `[wave-eval]` marker whose `target_ids` matches the declared eval targets and whose `benchmark_ids` enumerates the executed benchmark set.
- Security reviewers must leave a security review report and emit a final `[wave-security]` marker with `state=<clear|concerns|blocked>`, finding count, and approval count.
- `cont-QA` must emit both a final `Verdict:` line and a final `[wave-gate]` marker.
- Replay keeps read-only compatibility with older traces and older evaluator-era artifacts, but live waves do not pass on verdict-only or underspecified closure markers.
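
To make the marker rules above concrete, here is a rough sketch of how such markers could be checked. The exact marker grammar is not shown here, so the `[tag] key=value ...` layout and the comma-separated id lists are assumptions. Only the tag names and the `state`, `target_ids`, and `benchmark_ids` fields come from the text. All function names are hypothetical.

```js
// Assumed layout: "[tag] key=value key=value". This is not the launcher's parser.
function parseMarker(line) {
  const match = /^\[(wave-[a-z]+)\]\s*(.*)$/.exec(line.trim());
  if (!match) return null;
  const fields = {};
  for (const token of match[2].split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return { tag: match[1], fields };
}

// The three states named for the [wave-security] marker.
const SECURITY_STATES = new Set(["clear", "concerns", "blocked"]);

function isValidSecurityMarker(line) {
  const marker = parseMarker(line);
  return Boolean(marker) &&
    marker.tag === "wave-security" &&
    SECURITY_STATES.has(marker.fields.state);
}

// Hypothetical check: target_ids (assumed comma-separated) must equal the declared set.
function evalMarkerMatchesTargets(line, declaredTargetIds) {
  const marker = parseMarker(line);
  if (!marker || marker.tag !== "wave-eval") return false;
  const reported = (marker.fields.target_ids || "").split(",").filter(Boolean).sort();
  const declared = [...declaredTargetIds].sort();
  return reported.length === declared.length &&
    reported.every((id, i) => id === declared[i]);
}
```

A marker that is missing `state`, or an eval marker that omits one of the declared targets, fails these checks. This mirrors the rule that underspecified closure markers do not pass.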

## What Makes A Wave "Done"

A wave is not done because an agent said so. It is done only when the runtime surfaces agree:

- implementation exit contracts pass
- required deliverables exist and stay within ownership boundaries
- required proof artifacts exist when the wave declares proof-first live evidence
- required component proof and promotions pass
- helper assignments are resolved
- required dependency tickets are resolved
- clarification follow-ups or escalations are resolved
- if present, `cont-EVAL` satisfies its declared eval targets
- if present, the security reviewer publishes a report plus a final `[wave-security]` marker; `blocked` stops closure while `concerns` stays advisory
- integration recommends closure
- documentation and cont-QA closure pass

For proof-first live-wave examples, see [docs/reference/live-proof-waves.md](../reference/live-proof-waves.md).
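
The "done" checklist above reads naturally as a function that collects blockers instead of returning a single boolean. The sketch below assumes a flat `status` object. Its field names are invented for illustration and are not the launcher's ledger schema.

```js
// Hypothetical status shape. A null value for an optional role means the role is absent.
function closureBlockers(status) {
  const blockers = [];
  const need = (ok, reason) => { if (!ok) blockers.push(reason); };

  need(status.exitContractsPass, "implementation exit contracts");
  need(status.deliverablesPresent, "deliverables");
  // Proof artifacts gate closure only when the wave declares them.
  need(!status.proofArtifactsRequired || status.proofArtifactsPresent, "proof artifacts");
  need(status.componentProofPass, "component proof and promotions");
  need(status.openHelperAssignments === 0, "helper assignments");
  need(status.openDependencyTickets === 0, "dependency tickets");
  need(status.openClarifications === 0, "clarifications or escalations");
  // Optional roles: an absent cont-EVAL (null) does not block.
  need(status.evalTargetsSatisfied !== false, "cont-EVAL targets");
  // Only "blocked" stops closure; "concerns" stays advisory.
  need(status.securityState !== "blocked", "security review");
  need(status.integrationRecommendsClosure, "integration");
  need(status.documentationPass, "documentation");
  need(status.contQaPass, "cont-QA");
  return blockers;
}
```

Returning the list of unmet conditions keeps the result explainable: an empty array means the wave can close, and anything else names what is still open.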

## Where The State Lives

The wave file is only part of the story. The runtime writes durable state under `.tmp/<lane>-wave-launcher/`, including:

- prompts and logs
- status summaries
- coordination logs
- rendered message boards
- compiled inboxes
- ledger and docs queue
- security summaries
- integration summaries
- dependency snapshots
- executor overlays
- trace bundles

That is why a wave is better understood as a bounded execution record, not just a markdown file.

## Planner Specs vs Markdown

The planner foundation adds a JSON draft spec at `docs/plans/waves/specs/wave-<n>.json`.

- The JSON spec is the canonical planner artifact.
- The rendered markdown stays compatible with the launcher and parser.
- The launcher still executes the markdown wave file today.

This split keeps authoring structured while preserving the established execution surface.
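
As a quick reference, the three locations named in this document can be derived from a lane and wave number. `wavePaths` is a hypothetical helper written for illustration. The path templates are the ones quoted above, and real repos may configure them differently.

```js
// Path templates quoted in this document; the helper itself is illustrative only.
function wavePaths(lane, waveNumber) {
  return {
    markdown: `docs/plans/waves/wave-${waveNumber}.md`,           // executed by the launcher today
    plannerSpec: `docs/plans/waves/specs/wave-${waveNumber}.json`, // canonical planner artifact
    stateDir: `.tmp/${lane}-wave-launcher/`,                       // durable runtime state
  };
}

console.log(wavePaths("main", 0).markdown); // docs/plans/waves/wave-0.md
```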
@@ -0,0 +1,166 @@
---
title: "Benchmark Catalog Guide"
summary: "How to use delegated benchmark families, pinned benchmarks, and coordination-oriented eval targets in Wave."
---

# Benchmark Catalog Guide

Wave's benchmark catalog lives in `docs/evals/benchmark-catalog.json`.

It has two jobs:

- give `cont-EVAL` a repo-governed menu of allowed benchmark families and benchmark ids
- document what each benchmark is trying to catch, including coordination failure modes and static paper baselines

The catalog is reference metadata, not a run-history database. It tells the wave author and `cont-EVAL` what kinds of checks are allowed and what external benchmark or paper baseline those checks map to.

For a full authored wave example that uses these patterns, see [docs/reference/sample-waves.md](../reference/sample-waves.md).

## Migrating From Legacy Evaluator Waves

If your `0.5.4`-era repo still talks about a single `evaluator` role, split that surface before adopting `0.6.0`:

- keep `A0` as `cont-QA` for the final closure verdict and `[wave-gate]`
- add `E0` only when the wave needs benchmark-driven tuning or service-output evaluation
- treat `cont-EVAL` as report-only unless the wave explicitly gives `E0` owned implementation files
- keep `## Eval targets` at the wave level so `cont-EVAL` has an exact contract to satisfy

`cont-EVAL` is not a rename of `cont-QA`. In `0.6.0`, `E0` proves the eval contract before integration, while `A0` still owns the final release verdict after documentation closure.

## When To Use Delegated Vs Pinned Targets

Use `selection: delegated` when the wave should authorize a benchmark family and let `cont-EVAL` choose the exact benchmark set inside that family.

Use `selection: pinned` when the wave must require specific benchmark ids and does not want `cont-EVAL` to choose alternates.

In practice:

- `delegated` is better when you want flexibility inside a stable area such as `hidden-profile-pooling` or `latency`
- `pinned` is better when you need an exact smoke gate such as `cold-start-smoke` or `private-evidence-integration`

## Example Eval Targets

Delegated family target:

```md
## Eval targets

- id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
```

Pinned benchmark target:

```md
## Eval targets

- id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass
```

Mixed target set:

```md
## Eval targets

- id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
- id: summary-integrity | selection: pinned | benchmarks: shared-summary-fact-retention | objective: Preserve decision-changing facts through summary compression | threshold: Shared summaries retain the facts needed for the final recommendation
```
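
The target lines in the examples above follow a `key: value | key: value` shape, which can be read with a few lines of string handling. The sketch below is an assumption about how that format decomposes, not the parser Wave ships, and `parseEvalTarget` is a hypothetical name.

```js
// Splits one "- key: value | key: value" bullet into an object.
// Assumes " | " never appears inside a value, which holds for the examples above.
function parseEvalTarget(line) {
  const target = {};
  for (const part of line.replace(/^\s*-\s*/, "").split(" | ")) {
    const colon = part.indexOf(":");
    if (colon === -1) continue; // skip fragments without a key
    target[part.slice(0, colon).trim()] = part.slice(colon + 1).trim();
  }
  // Pinned targets list benchmark ids as a comma-separated value.
  if (target.benchmarks) {
    target.benchmarks = target.benchmarks.split(",").map((id) => id.trim());
  }
  return target;
}

const pinned = parseEvalTarget(
  "- id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass"
);
console.log(pinned.selection);  // pinned
console.log(pinned.benchmarks); // [ 'claim-conflict-detection', 'evidence-based-repair' ]
```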

## Coordination Families

The coordination-oriented families currently included in the catalog are:

- `hidden-profile-pooling`
  Use when the main risk is that agents fail to surface or integrate distributed private evidence. This maps most directly to HiddenBench.
- `silo-escape`
  Use when the risk is that agents communicate but still fail to reconstruct the correct global state. This maps most directly to Silo-Bench.
- `simultaneous-coordination`
  Use when the risk is contention, lockstep failure, or convergent reasoning under concurrent decisions. This maps most directly to DPBench.
- `expertise-leverage`
  Use when the risk is expert underuse, bad routing, or low-quality compromise across mixed-skill agents. This maps most directly to `Multi-Agent Teams Hold Experts Back`.
- `blackboard-fidelity`
  Use when the risk is information loss or distortion between the raw coordination log and derived artifacts like shared summaries, inboxes, ledger state, or integration summaries.
- `contradiction-recovery`
  Use when the risk is false consensus, unresolved conflicting claims, or clarification chains that appear resolved without real repair.

## How To Choose The Right Family

Choose the family based on the failure you are most worried about, not just on the surface area being changed.

Use:

- `hidden-profile-pooling` when the hard part is discovering missing facts
- `silo-escape` when the hard part is integrating already-shared facts into one correct state
- `simultaneous-coordination` when multiple owners or resources must move together
- `expertise-leverage` when the right answer depends on preserving the best expert signal
- `blackboard-fidelity` when summaries, inboxes, or integration artifacts may be dropping important evidence
- `contradiction-recovery` when you expect conflicting claims and need the framework to turn them into bounded repair work

## How `cont-EVAL` Should Use The Catalog

When a wave delegates benchmark selection:

1. Read the wave's `## Eval targets`.
2. Resolve the allowed benchmark family from the catalog.
3. Choose the smallest benchmark set that genuinely tests the target's failure mode.
4. Record the exact selected benchmark ids in the `cont-EVAL` report.
5. Emit the final `[wave-eval]` marker with the exact executed `benchmark_ids`.

When a wave pins benchmarks:

1. Run the named benchmark ids directly.
2. Do not silently swap to nearby checks.
3. Treat missing or unrun pinned benchmarks as an unsatisfied target.
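
The two procedures above amount to different satisfaction checks, sketched below under stated assumptions. `unsatisfiedReason` is a hypothetical helper. The `catalogFamilies` argument is a simplified stand-in (family name mapped to allowed benchmark ids) and does not reflect the real schema of `benchmark-catalog.json`.

```js
// Returns null when the target is satisfied, otherwise a short reason string.
// target: { selection, benchmarks?, "benchmark-family"? } as written in "## Eval targets".
// catalogFamilies: assumed shape { [familyName]: string[] }.
function unsatisfiedReason(target, executedIds, catalogFamilies = {}) {
  const executed = new Set(executedIds);

  if (target.selection === "pinned") {
    // Every pinned id must actually have run; nearby checks do not count.
    const missing = target.benchmarks.filter((id) => !executed.has(id));
    return missing.length ? `pinned benchmarks not run: ${missing.join(", ")}` : null;
  }

  if (target.selection === "delegated") {
    // cont-EVAL may choose, but only inside the authorized family.
    if (executed.size === 0) return "no benchmarks executed";
    const allowed = new Set(catalogFamilies[target["benchmark-family"]] || []);
    const outside = executedIds.filter((id) => !allowed.has(id));
    return outside.length ? `outside family: ${outside.join(", ")}` : null;
  }

  return `unknown selection: ${target.selection}`;
}

const pinnedTarget = {
  selection: "pinned",
  benchmarks: ["claim-conflict-detection", "evidence-based-repair"],
};
console.log(unsatisfiedReason(pinnedTarget, ["claim-conflict-detection"]));
// pinned benchmarks not run: evidence-based-repair
```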

## How To Read The Static Baselines

Some coordination families now include static paper baselines such as HiddenBench, Silo-Bench, DPBench, and `Multi-Agent Teams Hold Experts Back`.

These baselines are:

- reference points from papers
- useful for framing whether Wave is still far from the broader state of the art
- not the same thing as local run history

They should answer:

- what failure mode this benchmark family is grounded in
- what the paper reported
- what metric the paper used

They should not be treated as:

- a promise that Wave already matches the paper's best system
- a local regression history
- a substitute for actually running evals

## Authoring Guidance

Prefer one eval target per distinct risk.

Good:

- one target for distributed-information pooling
- one target for contradiction recovery
- one target for latency guardrails

Avoid:

- one overloaded target that mixes every coordination risk into a single vague threshold

Prefer delegated targets early when the family is stable but the exact check should remain flexible.

Prefer pinned targets when:

- the wave is release-sensitive
- the benchmark is small and repeatable
- you need a precise regression gate

## Current Limits

The benchmark catalog does not yet store:

- local benchmark run history
- local-vs-paper delta computation
- automated benchmark execution plans

For now it is the schema and policy layer that keeps eval authoring, `cont-EVAL`, and coordination benchmarking aligned.