@chllming/wave-orchestration 0.5.4 → 0.6.0
- package/CHANGELOG.md +46 -3
- package/README.md +33 -5
- package/docs/README.md +18 -4
- package/docs/agents/wave-cont-eval-role.md +36 -0
- package/docs/agents/{wave-evaluator-role.md → wave-cont-qa-role.md} +14 -11
- package/docs/agents/wave-documentation-role.md +1 -1
- package/docs/agents/wave-infra-role.md +1 -1
- package/docs/agents/wave-integration-role.md +3 -3
- package/docs/agents/wave-launcher-role.md +4 -3
- package/docs/agents/wave-security-role.md +40 -0
- package/docs/concepts/context7-vs-skills.md +1 -1
- package/docs/concepts/what-is-a-wave.md +56 -6
- package/docs/evals/README.md +166 -0
- package/docs/evals/benchmark-catalog.json +663 -0
- package/docs/guides/author-and-run-waves.md +135 -0
- package/docs/guides/planner.md +5 -0
- package/docs/guides/terminal-surfaces.md +2 -0
- package/docs/plans/component-cutover-matrix.json +1 -1
- package/docs/plans/component-cutover-matrix.md +1 -1
- package/docs/plans/current-state.md +19 -1
- package/docs/plans/examples/wave-example-live-proof.md +435 -0
- package/docs/plans/migration.md +42 -0
- package/docs/plans/wave-orchestrator.md +46 -7
- package/docs/plans/waves/wave-0.md +4 -4
- package/docs/reference/live-proof-waves.md +177 -0
- package/docs/reference/migration-0.2-to-0.5.md +26 -19
- package/docs/reference/npmjs-trusted-publishing.md +6 -5
- package/docs/reference/runtime-config/README.md +13 -3
- package/docs/reference/sample-waves.md +87 -0
- package/docs/reference/skills.md +110 -42
- package/docs/research/agent-context-sources.md +130 -11
- package/docs/research/coordination-failure-review.md +266 -0
- package/docs/roadmap.md +6 -2
- package/package.json +2 -2
- package/releases/manifest.json +20 -2
- package/scripts/research/agent-context-archive.mjs +83 -1
- package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +811 -0
- package/scripts/wave-orchestrator/adhoc.mjs +1331 -0
- package/scripts/wave-orchestrator/agent-state.mjs +358 -6
- package/scripts/wave-orchestrator/artifact-schemas.mjs +173 -0
- package/scripts/wave-orchestrator/clarification-triage.mjs +10 -3
- package/scripts/wave-orchestrator/config.mjs +48 -12
- package/scripts/wave-orchestrator/context7.mjs +2 -0
- package/scripts/wave-orchestrator/coord-cli.mjs +51 -19
- package/scripts/wave-orchestrator/coordination-store.mjs +26 -4
- package/scripts/wave-orchestrator/coordination.mjs +83 -9
- package/scripts/wave-orchestrator/dashboard-state.mjs +20 -8
- package/scripts/wave-orchestrator/dep-cli.mjs +5 -2
- package/scripts/wave-orchestrator/docs-queue.mjs +8 -2
- package/scripts/wave-orchestrator/evals.mjs +451 -0
- package/scripts/wave-orchestrator/feedback.mjs +15 -1
- package/scripts/wave-orchestrator/install.mjs +32 -9
- package/scripts/wave-orchestrator/launcher-closure.mjs +281 -0
- package/scripts/wave-orchestrator/launcher-runtime.mjs +334 -0
- package/scripts/wave-orchestrator/launcher.mjs +709 -601
- package/scripts/wave-orchestrator/ledger.mjs +123 -20
- package/scripts/wave-orchestrator/local-executor.mjs +99 -12
- package/scripts/wave-orchestrator/planner.mjs +177 -42
- package/scripts/wave-orchestrator/replay.mjs +6 -3
- package/scripts/wave-orchestrator/role-helpers.mjs +84 -0
- package/scripts/wave-orchestrator/shared.mjs +75 -11
- package/scripts/wave-orchestrator/skills.mjs +637 -106
- package/scripts/wave-orchestrator/traces.mjs +71 -48
- package/scripts/wave-orchestrator/wave-files.mjs +947 -101
- package/scripts/wave.mjs +9 -0
- package/skills/README.md +202 -0
- package/skills/provider-aws/SKILL.md +111 -0
- package/skills/provider-aws/adapters/claude.md +1 -0
- package/skills/provider-aws/adapters/codex.md +1 -0
- package/skills/provider-aws/references/service-verification.md +39 -0
- package/skills/provider-aws/skill.json +50 -1
- package/skills/provider-custom-deploy/SKILL.md +59 -0
- package/skills/provider-custom-deploy/skill.json +46 -1
- package/skills/provider-docker-compose/SKILL.md +90 -0
- package/skills/provider-docker-compose/adapters/local.md +1 -0
- package/skills/provider-docker-compose/skill.json +49 -1
- package/skills/provider-github-release/SKILL.md +116 -1
- package/skills/provider-github-release/adapters/claude.md +1 -0
- package/skills/provider-github-release/adapters/codex.md +1 -0
- package/skills/provider-github-release/skill.json +51 -1
- package/skills/provider-kubernetes/SKILL.md +137 -0
- package/skills/provider-kubernetes/adapters/claude.md +1 -0
- package/skills/provider-kubernetes/adapters/codex.md +1 -0
- package/skills/provider-kubernetes/references/kubectl-patterns.md +58 -0
- package/skills/provider-kubernetes/skill.json +48 -1
- package/skills/provider-railway/SKILL.md +118 -1
- package/skills/provider-railway/references/verification-commands.md +39 -0
- package/skills/provider-railway/skill.json +67 -1
- package/skills/provider-ssh-manual/SKILL.md +91 -0
- package/skills/provider-ssh-manual/skill.json +50 -1
- package/skills/repo-coding-rules/SKILL.md +84 -0
- package/skills/repo-coding-rules/skill.json +30 -1
- package/skills/role-cont-eval/SKILL.md +90 -0
- package/skills/role-cont-eval/adapters/codex.md +1 -0
- package/skills/role-cont-eval/skill.json +36 -0
- package/skills/role-cont-qa/SKILL.md +93 -0
- package/skills/role-cont-qa/adapters/claude.md +1 -0
- package/skills/role-cont-qa/skill.json +36 -0
- package/skills/role-deploy/SKILL.md +90 -0
- package/skills/role-deploy/skill.json +32 -1
- package/skills/role-documentation/SKILL.md +66 -0
- package/skills/role-documentation/skill.json +32 -1
- package/skills/role-implementation/SKILL.md +62 -0
- package/skills/role-implementation/skill.json +32 -1
- package/skills/role-infra/SKILL.md +74 -0
- package/skills/role-infra/skill.json +32 -1
- package/skills/role-integration/SKILL.md +79 -1
- package/skills/role-integration/skill.json +32 -1
- package/skills/role-research/SKILL.md +58 -0
- package/skills/role-research/skill.json +32 -1
- package/skills/role-security/SKILL.md +60 -0
- package/skills/role-security/skill.json +36 -0
- package/skills/runtime-claude/SKILL.md +60 -1
- package/skills/runtime-claude/skill.json +32 -1
- package/skills/runtime-codex/SKILL.md +52 -1
- package/skills/runtime-codex/skill.json +32 -1
- package/skills/runtime-local/SKILL.md +39 -0
- package/skills/runtime-local/skill.json +32 -1
- package/skills/runtime-opencode/SKILL.md +51 -0
- package/skills/runtime-opencode/skill.json +32 -1
- package/skills/wave-core/SKILL.md +107 -0
- package/skills/wave-core/references/marker-syntax.md +62 -0
- package/skills/wave-core/skill.json +31 -1
- package/wave.config.json +35 -6
- package/skills/role-evaluator/SKILL.md +0 -6
- package/skills/role-evaluator/skill.json +0 -5
@@ -0,0 +1,166 @@
---
title: "Benchmark Catalog Guide"
summary: "How to use delegated benchmark families, pinned benchmarks, and coordination-oriented eval targets in Wave."
---

# Benchmark Catalog Guide

Wave's benchmark catalog lives in `docs/evals/benchmark-catalog.json`.

It has two jobs:

- give `cont-EVAL` a repo-governed menu of allowed benchmark families and benchmark ids
- document what each benchmark is trying to catch, including coordination failure modes and static paper baselines

The catalog is reference metadata, not a run-history database. It tells the wave author and `cont-EVAL` what kinds of checks are allowed and what external benchmark or paper baseline those checks map to.

For a full authored wave example that uses these patterns, see [docs/reference/sample-waves.md](../reference/sample-waves.md).

## Migrating From Legacy Evaluator Waves

If your `0.5.4`-era repo still talks about a single `evaluator` role, split that surface before adopting `0.6.0`:

- keep `A0` as `cont-QA` for the final closure verdict and `[wave-gate]`
- add `E0` only when the wave needs benchmark-driven tuning or service-output evaluation
- treat `cont-EVAL` as report-only unless the wave explicitly gives `E0` owned implementation files
- keep `## Eval targets` at the wave level so `cont-EVAL` has an exact contract to satisfy

`cont-EVAL` is not a rename of `cont-QA`. In `0.6.0`, `E0` proves the eval contract before integration, while `A0` still owns the final release verdict after documentation closure.

## When To Use Delegated Vs Pinned Targets

Use `selection: delegated` when the wave should authorize a benchmark family and let `cont-EVAL` choose the exact benchmark set inside that family.

Use `selection: pinned` when the wave must require specific benchmark ids and does not want `cont-EVAL` to choose alternates.

In practice:

- `delegated` is better when you want flexibility inside a stable area such as `hidden-profile-pooling` or `latency`
- `pinned` is better when you need an exact smoke gate such as `cold-start-smoke` or `private-evidence-integration`

## Example Eval Targets

Delegated family target:

```md
## Eval targets

- id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
```

Pinned benchmark target:

```md
## Eval targets

- id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass
```

Mixed target set:

```md
## Eval targets

- id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
- id: summary-integrity | selection: pinned | benchmarks: shared-summary-fact-retention | objective: Preserve decision-changing facts through summary compression | threshold: Shared summaries retain the facts needed for the final recommendation
```

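The target lines in the examples share a `key: value | key: value` layout, so they can be split mechanically. The following is a minimal sketch of that idea, not Wave's own parser: `parseEvalTarget` and `validateEvalTarget` are hypothetical helper names, and the validation rules (a delegated target needs `benchmark-family`, a pinned target needs `benchmarks`) are assumptions inferred from the examples rather than a documented schema.

```javascript
// Sketch only: hypothetical helpers, not part of Wave's published API.
// Splits one `## Eval targets` bullet into fields using the
// `key: value | key: value` layout shown in the examples.
function parseEvalTarget(line) {
  const body = line.trim().replace(/^-\s*/, "");
  const target = {};
  for (const part of body.split("|")) {
    const idx = part.indexOf(":");
    if (idx === -1) continue; // skip fragments that have no key
    const key = part.slice(0, idx).trim();
    target[key] = part.slice(idx + 1).trim();
  }
  // Pinned targets carry a comma-separated list of benchmark ids.
  if (target.benchmarks !== undefined) {
    target.benchmarks = target.benchmarks
      .split(",")
      .map((id) => id.trim())
      .filter(Boolean);
  }
  return target;
}

// Assumed rules, inferred from the examples: delegated targets name a
// family, pinned targets name at least one benchmark id.
function validateEvalTarget(target) {
  const errors = [];
  if (!target.id) errors.push("missing id");
  if (target.selection === "delegated") {
    if (!target["benchmark-family"]) {
      errors.push("delegated target needs benchmark-family");
    }
  } else if (target.selection === "pinned") {
    if (!target.benchmarks || target.benchmarks.length === 0) {
      errors.push("pinned target needs benchmarks");
    }
  } else {
    errors.push(`unknown selection: ${target.selection}`);
  }
  return errors;
}

const pinnedExample = parseEvalTarget(
  "- id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass"
);
console.log(pinnedExample.benchmarks);
console.log(validateEvalTarget(pinnedExample));
```

This sketch splits on every `|`, so it would mis-parse an objective or threshold that itself contains a pipe character; a real parser would need an escaping rule.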
## Coordination Families

The coordination-oriented families currently included in the catalog are:

- `hidden-profile-pooling`
  Use when the main risk is that agents fail to surface or integrate distributed private evidence. This maps most directly to HiddenBench.
- `silo-escape`
  Use when the risk is that agents communicate but still fail to reconstruct the correct global state. This maps most directly to Silo-Bench.
- `simultaneous-coordination`
  Use when the risk is contention, lockstep failure, or convergent reasoning under concurrent decisions. This maps most directly to DPBench.
- `expertise-leverage`
  Use when the risk is expert underuse, bad routing, or low-quality compromise across mixed-skill agents. This maps most directly to `Multi-Agent Teams Hold Experts Back`.
- `blackboard-fidelity`
  Use when the risk is information loss or distortion between the raw coordination log and derived artifacts like shared summaries, inboxes, ledger state, or integration summaries.
- `contradiction-recovery`
  Use when the risk is false consensus, unresolved conflicting claims, or clarification chains that appear resolved without real repair.

## How To Choose The Right Family

Choose the family based on the failure you are most worried about, not just on the surface area being changed.

Use:

- `hidden-profile-pooling` when the hard part is discovering missing facts
- `silo-escape` when the hard part is integrating already-shared facts into one correct state
- `simultaneous-coordination` when multiple owners or resources must move together
- `expertise-leverage` when the right answer depends on preserving the best expert signal
- `blackboard-fidelity` when summaries, inboxes, or integration artifacts may be dropping important evidence
- `contradiction-recovery` when you expect conflicting claims and need the framework to turn them into bounded repair work

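The list above is effectively a lookup from dominant failure mode to family. A small sketch of that lookup follows; the family ids are taken from this guide, while the risk keys and the `familyForRisk` helper are labels invented here for illustration and do not exist in the catalog.

```javascript
// Hypothetical lookup table. Family ids come from the guide; the risk
// keys are illustrative labels invented for this sketch.
const FAMILY_BY_RISK = new Map([
  ["discovering-missing-facts", "hidden-profile-pooling"],
  ["integrating-shared-facts", "silo-escape"],
  ["moving-together", "simultaneous-coordination"],
  ["preserving-expert-signal", "expertise-leverage"],
  ["derived-artifact-loss", "blackboard-fidelity"],
  ["conflicting-claims", "contradiction-recovery"],
]);

// Returns the family for one named risk, and fails loudly on anything
// unmapped so an author does not fall back to a vague catch-all target.
function familyForRisk(risk) {
  const family = FAMILY_BY_RISK.get(risk);
  if (family === undefined) {
    throw new Error(`no coordination family mapped for risk: ${risk}`);
  }
  return family;
}

console.log(familyForRisk("integrating-shared-facts"));
```

Keying the table by risk rather than by changed surface area mirrors the advice to choose a family from the failure you are most worried about.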
## How `cont-EVAL` Should Use The Catalog

When a wave delegates benchmark selection:

1. Read the wave's `## Eval targets`.
2. Resolve the allowed benchmark family from the catalog.
3. Choose the smallest benchmark set that genuinely tests the target's failure mode.
4. Record the exact selected benchmark ids in the `cont-EVAL` report.
5. Emit the final `[wave-eval]` marker with the exact executed `benchmark_ids`.

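Steps 2 through 5 of the delegated flow can be sketched as a membership check against the authorized family. This is an illustration under stated assumptions: the catalog shape used here (`families` keyed by family id, each with a `benchmarks` array) is a guess and not the real schema of `docs/evals/benchmark-catalog.json`, the benchmark ids are placeholders, and `resolveDelegatedSelection` is a hypothetical helper.

```javascript
// Sketch only. The catalog shape below is an assumption for illustration;
// consult docs/evals/benchmark-catalog.json for the real schema.
// The benchmark ids are placeholders, not real catalog entries.
const exampleCatalog = {
  families: {
    "hidden-profile-pooling": {
      benchmarks: ["placeholder-benchmark-a", "placeholder-benchmark-b"],
    },
  },
};

// Confirms that every selected id belongs to the family the wave
// authorized, then returns the exact ids to record in the report and
// in the final marker's benchmark_ids.
function resolveDelegatedSelection(catalog, target, selectedIds) {
  const familyId = target["benchmark-family"];
  const family = catalog.families[familyId];
  if (!family) {
    return { ok: false, errors: [`unknown family: ${familyId}`], benchmark_ids: [] };
  }
  const allowed = new Set(family.benchmarks);
  const errors = selectedIds
    .filter((id) => !allowed.has(id))
    .map((id) => `benchmark ${id} is outside family ${familyId}`);
  if (selectedIds.length === 0) errors.push("no benchmarks selected");
  const ok = errors.length === 0;
  return { ok, errors, benchmark_ids: ok ? [...selectedIds] : [] };
}

const delegatedTarget = {
  id: "coordination-pooling",
  selection: "delegated",
  "benchmark-family": "hidden-profile-pooling",
};
const selection = resolveDelegatedSelection(
  exampleCatalog,
  delegatedTarget,
  ["placeholder-benchmark-a"]
);
console.log(selection);
```

The sketch deliberately does not pick benchmarks itself; choosing the smallest set that genuinely tests the failure mode stays a judgment call for `cont-EVAL`.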
When a wave pins benchmarks:

1. Run the named benchmark ids directly.
2. Do not silently swap to nearby checks.
3. Treat missing or unrun pinned benchmarks as an unsatisfied target.

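The pinned rules reduce to a set difference between the ids the wave named and the ids that actually ran. A minimal sketch, assuming the target has already been parsed into an object with a `benchmarks` array; `checkPinnedTarget` and the `some-nearby-check` id are hypothetical.

```javascript
// Sketch only: checkPinnedTarget is a hypothetical helper, not a Wave API.
// Any pinned id that did not run is reported as missing; a substituted
// "nearby" check does not count toward the pinned list.
function checkPinnedTarget(target, executedIds) {
  const executed = new Set(executedIds);
  const missing = target.benchmarks.filter((id) => !executed.has(id));
  return { id: target.id, allPinnedRan: missing.length === 0, missing };
}

const pinnedTarget = {
  id: "contradiction-recovery-guard",
  selection: "pinned",
  benchmarks: ["claim-conflict-detection", "evidence-based-repair"],
};

// One pinned benchmark ran; the other was swapped for a nearby check.
const pinnedResult = checkPinnedTarget(pinnedTarget, [
  "claim-conflict-detection",
  "some-nearby-check",
]);
console.log(pinnedResult);
```

Here `evidence-based-repair` is reported as missing, so the target counts as unsatisfied. Note that `allPinnedRan` only says the pinned benchmarks executed; whether the target's threshold was met is a separate judgment.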
## How To Read The Static Baselines

Some coordination families now include static paper baselines such as HiddenBench, Silo-Bench, DPBench, and `Multi-Agent Teams Hold Experts Back`.

These baselines are:

- reference points from papers
- useful for framing whether Wave is still far from the broader state of the art
- not the same thing as local run history

They should answer:

- what failure mode this benchmark family is grounded in
- what the paper reported
- what metric the paper used

They should not be treated as:

- a promise that Wave already matches the paper's best system
- a local regression history
- a substitute for actually running evals

## Authoring Guidance

Prefer one eval target per distinct risk.

Good:

- one target for distributed-information pooling
- one target for contradiction recovery
- one target for latency guardrails

Avoid:

- one overloaded target that mixes every coordination risk into a single vague threshold

Prefer delegated targets early when the family is stable but the exact check should remain flexible.

Prefer pinned targets when:

- the wave is release-sensitive
- the benchmark is small and repeatable
- you need a precise regression gate

## Current Limits

The benchmark catalog does not yet store:

- local benchmark run history
- local-vs-paper delta computation
- automated benchmark execution plans

For now it is the schema and policy layer that keeps eval authoring, `cont-EVAL`, and coordination benchmarking aligned.