@chllming/wave-orchestration 0.5.4 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (126)
  1. package/CHANGELOG.md +46 -3
  2. package/README.md +33 -5
  3. package/docs/README.md +18 -4
  4. package/docs/agents/wave-cont-eval-role.md +36 -0
  5. package/docs/agents/{wave-evaluator-role.md → wave-cont-qa-role.md} +14 -11
  6. package/docs/agents/wave-documentation-role.md +1 -1
  7. package/docs/agents/wave-infra-role.md +1 -1
  8. package/docs/agents/wave-integration-role.md +3 -3
  9. package/docs/agents/wave-launcher-role.md +4 -3
  10. package/docs/agents/wave-security-role.md +40 -0
  11. package/docs/concepts/context7-vs-skills.md +1 -1
  12. package/docs/concepts/what-is-a-wave.md +56 -6
  13. package/docs/evals/README.md +166 -0
  14. package/docs/evals/benchmark-catalog.json +663 -0
  15. package/docs/guides/author-and-run-waves.md +135 -0
  16. package/docs/guides/planner.md +5 -0
  17. package/docs/guides/terminal-surfaces.md +2 -0
  18. package/docs/plans/component-cutover-matrix.json +1 -1
  19. package/docs/plans/component-cutover-matrix.md +1 -1
  20. package/docs/plans/current-state.md +19 -1
  21. package/docs/plans/examples/wave-example-live-proof.md +435 -0
  22. package/docs/plans/migration.md +42 -0
  23. package/docs/plans/wave-orchestrator.md +46 -7
  24. package/docs/plans/waves/wave-0.md +4 -4
  25. package/docs/reference/live-proof-waves.md +177 -0
  26. package/docs/reference/migration-0.2-to-0.5.md +26 -19
  27. package/docs/reference/npmjs-trusted-publishing.md +6 -5
  28. package/docs/reference/runtime-config/README.md +13 -3
  29. package/docs/reference/sample-waves.md +87 -0
  30. package/docs/reference/skills.md +110 -42
  31. package/docs/research/agent-context-sources.md +130 -11
  32. package/docs/research/coordination-failure-review.md +266 -0
  33. package/docs/roadmap.md +6 -2
  34. package/package.json +2 -2
  35. package/releases/manifest.json +20 -2
  36. package/scripts/research/agent-context-archive.mjs +83 -1
  37. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +811 -0
  38. package/scripts/wave-orchestrator/adhoc.mjs +1331 -0
  39. package/scripts/wave-orchestrator/agent-state.mjs +358 -6
  40. package/scripts/wave-orchestrator/artifact-schemas.mjs +173 -0
  41. package/scripts/wave-orchestrator/clarification-triage.mjs +10 -3
  42. package/scripts/wave-orchestrator/config.mjs +48 -12
  43. package/scripts/wave-orchestrator/context7.mjs +2 -0
  44. package/scripts/wave-orchestrator/coord-cli.mjs +51 -19
  45. package/scripts/wave-orchestrator/coordination-store.mjs +26 -4
  46. package/scripts/wave-orchestrator/coordination.mjs +83 -9
  47. package/scripts/wave-orchestrator/dashboard-state.mjs +20 -8
  48. package/scripts/wave-orchestrator/dep-cli.mjs +5 -2
  49. package/scripts/wave-orchestrator/docs-queue.mjs +8 -2
  50. package/scripts/wave-orchestrator/evals.mjs +451 -0
  51. package/scripts/wave-orchestrator/feedback.mjs +15 -1
  52. package/scripts/wave-orchestrator/install.mjs +32 -9
  53. package/scripts/wave-orchestrator/launcher-closure.mjs +281 -0
  54. package/scripts/wave-orchestrator/launcher-runtime.mjs +334 -0
  55. package/scripts/wave-orchestrator/launcher.mjs +709 -601
  56. package/scripts/wave-orchestrator/ledger.mjs +123 -20
  57. package/scripts/wave-orchestrator/local-executor.mjs +99 -12
  58. package/scripts/wave-orchestrator/planner.mjs +177 -42
  59. package/scripts/wave-orchestrator/replay.mjs +6 -3
  60. package/scripts/wave-orchestrator/role-helpers.mjs +84 -0
  61. package/scripts/wave-orchestrator/shared.mjs +75 -11
  62. package/scripts/wave-orchestrator/skills.mjs +637 -106
  63. package/scripts/wave-orchestrator/traces.mjs +71 -48
  64. package/scripts/wave-orchestrator/wave-files.mjs +947 -101
  65. package/scripts/wave.mjs +9 -0
  66. package/skills/README.md +202 -0
  67. package/skills/provider-aws/SKILL.md +111 -0
  68. package/skills/provider-aws/adapters/claude.md +1 -0
  69. package/skills/provider-aws/adapters/codex.md +1 -0
  70. package/skills/provider-aws/references/service-verification.md +39 -0
  71. package/skills/provider-aws/skill.json +50 -1
  72. package/skills/provider-custom-deploy/SKILL.md +59 -0
  73. package/skills/provider-custom-deploy/skill.json +46 -1
  74. package/skills/provider-docker-compose/SKILL.md +90 -0
  75. package/skills/provider-docker-compose/adapters/local.md +1 -0
  76. package/skills/provider-docker-compose/skill.json +49 -1
  77. package/skills/provider-github-release/SKILL.md +116 -1
  78. package/skills/provider-github-release/adapters/claude.md +1 -0
  79. package/skills/provider-github-release/adapters/codex.md +1 -0
  80. package/skills/provider-github-release/skill.json +51 -1
  81. package/skills/provider-kubernetes/SKILL.md +137 -0
  82. package/skills/provider-kubernetes/adapters/claude.md +1 -0
  83. package/skills/provider-kubernetes/adapters/codex.md +1 -0
  84. package/skills/provider-kubernetes/references/kubectl-patterns.md +58 -0
  85. package/skills/provider-kubernetes/skill.json +48 -1
  86. package/skills/provider-railway/SKILL.md +118 -1
  87. package/skills/provider-railway/references/verification-commands.md +39 -0
  88. package/skills/provider-railway/skill.json +67 -1
  89. package/skills/provider-ssh-manual/SKILL.md +91 -0
  90. package/skills/provider-ssh-manual/skill.json +50 -1
  91. package/skills/repo-coding-rules/SKILL.md +84 -0
  92. package/skills/repo-coding-rules/skill.json +30 -1
  93. package/skills/role-cont-eval/SKILL.md +90 -0
  94. package/skills/role-cont-eval/adapters/codex.md +1 -0
  95. package/skills/role-cont-eval/skill.json +36 -0
  96. package/skills/role-cont-qa/SKILL.md +93 -0
  97. package/skills/role-cont-qa/adapters/claude.md +1 -0
  98. package/skills/role-cont-qa/skill.json +36 -0
  99. package/skills/role-deploy/SKILL.md +90 -0
  100. package/skills/role-deploy/skill.json +32 -1
  101. package/skills/role-documentation/SKILL.md +66 -0
  102. package/skills/role-documentation/skill.json +32 -1
  103. package/skills/role-implementation/SKILL.md +62 -0
  104. package/skills/role-implementation/skill.json +32 -1
  105. package/skills/role-infra/SKILL.md +74 -0
  106. package/skills/role-infra/skill.json +32 -1
  107. package/skills/role-integration/SKILL.md +79 -1
  108. package/skills/role-integration/skill.json +32 -1
  109. package/skills/role-research/SKILL.md +58 -0
  110. package/skills/role-research/skill.json +32 -1
  111. package/skills/role-security/SKILL.md +60 -0
  112. package/skills/role-security/skill.json +36 -0
  113. package/skills/runtime-claude/SKILL.md +60 -1
  114. package/skills/runtime-claude/skill.json +32 -1
  115. package/skills/runtime-codex/SKILL.md +52 -1
  116. package/skills/runtime-codex/skill.json +32 -1
  117. package/skills/runtime-local/SKILL.md +39 -0
  118. package/skills/runtime-local/skill.json +32 -1
  119. package/skills/runtime-opencode/SKILL.md +51 -0
  120. package/skills/runtime-opencode/skill.json +32 -1
  121. package/skills/wave-core/SKILL.md +107 -0
  122. package/skills/wave-core/references/marker-syntax.md +62 -0
  123. package/skills/wave-core/skill.json +31 -1
  124. package/wave.config.json +35 -6
  125. package/skills/role-evaluator/SKILL.md +0 -6
  126. package/skills/role-evaluator/skill.json +0 -5
@@ -0,0 +1,166 @@
+ ---
+ title: "Benchmark Catalog Guide"
+ summary: "How to use delegated benchmark families, pinned benchmarks, and coordination-oriented eval targets in Wave."
+ ---
+
+ # Benchmark Catalog Guide
+
+ Wave's benchmark catalog lives in `docs/evals/benchmark-catalog.json`.
+
+ It has two jobs:
+
+ - give `cont-EVAL` a repo-governed menu of allowed benchmark families and benchmark ids
+ - document what each benchmark is trying to catch, including coordination failure modes and static paper baselines
+
+ The catalog is reference metadata, not a run-history database. It tells the wave author and `cont-EVAL` what kinds of checks are allowed and what external benchmark or paper baseline those checks map to.
+
+ For a full authored wave example that uses these patterns, see [docs/reference/sample-waves.md](../reference/sample-waves.md).
+
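+ The catalog file itself is JSON. Its exact schema is not reproduced in this guide; as a rough illustration of the kind of metadata described above, a family entry might look like the following (all field names here are assumptions for illustration, not the real `benchmark-catalog.json` schema):
+
+ ```json
+ {
+   "family": "hidden-profile-pooling",
+   "failure_mode": "agents fail to surface or integrate distributed private evidence",
+   "paper_baseline": "HiddenBench",
+   "benchmarks": ["private-evidence-integration"]
+ }
+ ```
+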
+ ## Migrating From Legacy Evaluator Waves
+
+ If your `0.5.4`-era repo still talks about a single `evaluator` role, split that surface before adopting `0.6.0`:
+
+ - keep `A0` as `cont-QA` for the final closure verdict and `[wave-gate]`
+ - add `E0` only when the wave needs benchmark-driven tuning or service-output evaluation
+ - treat `cont-EVAL` as report-only unless the wave explicitly gives `E0` owned implementation files
+ - keep `## Eval targets` at the wave level so `cont-EVAL` has an exact contract to satisfy
+
+ `cont-EVAL` is not a rename of `cont-QA`. In `0.6.0`, `E0` proves the eval contract before integration, while `A0` still owns the final release verdict after documentation closure.
+
+ ## When To Use Delegated Vs Pinned Targets
+
+ Use `selection: delegated` when the wave should authorize a benchmark family and let `cont-EVAL` choose the exact benchmark set inside that family.
+
+ Use `selection: pinned` when the wave must require specific benchmark ids and does not want `cont-EVAL` to choose alternates.
+
+ In practice:
+
+ - `delegated` is better when you want flexibility inside a stable area such as `hidden-profile-pooling` or `latency`
+ - `pinned` is better when you need an exact smoke gate such as `cold-start-smoke` or `private-evidence-integration`
+
+ ## Example Eval Targets
+
+ Delegated family target:
+
+ ```md
+ ## Eval targets
+
+ - id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
+ ```
+
+ Pinned benchmark target:
+
+ ```md
+ ## Eval targets
+
+ - id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass
+ ```
+
+ Mixed target set:
+
+ ```md
+ ## Eval targets
+
+ - id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
+ - id: summary-integrity | selection: pinned | benchmarks: shared-summary-fact-retention | objective: Preserve decision-changing facts through summary compression | threshold: Shared summaries retain the facts needed for the final recommendation
+ ```
+
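+ The pipe-delimited target lines above can be parsed mechanically. A minimal sketch in JavaScript (the helper name and return shape are illustrative, not part of Wave's API):
+
+ ```javascript
+ // Parse one "## Eval targets" bullet of the form
+ //   - id: X | selection: delegated | benchmark-family: Y | objective: ... | threshold: ...
+ // Keys mirror the markdown labels; this is a sketch, not Wave's real parser.
+ function parseEvalTarget(line) {
+   const body = line.replace(/^\s*-\s*/, "");
+   const target = {};
+   for (const segment of body.split("|")) {
+     const idx = segment.indexOf(":");
+     if (idx === -1) continue;
+     const key = segment.slice(0, idx).trim();
+     target[key] = segment.slice(idx + 1).trim();
+   }
+   // Pinned targets carry a comma-separated benchmark id list.
+   if (target.benchmarks) {
+     target.benchmarks = target.benchmarks.split(",").map((s) => s.trim());
+   }
+   return target;
+ }
+ ```
+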
+ ## Coordination Families
+
+ The coordination-oriented families currently included in the catalog are:
+
+ - `hidden-profile-pooling`
+ Use when the main risk is that agents fail to surface or integrate distributed private evidence. This maps most directly to HiddenBench.
+ - `silo-escape`
+ Use when the risk is that agents communicate but still fail to reconstruct the correct global state. This maps most directly to Silo-Bench.
+ - `simultaneous-coordination`
+ Use when the risk is contention, lockstep failure, or convergent reasoning under concurrent decisions. This maps most directly to DPBench.
+ - `expertise-leverage`
+ Use when the risk is expert underuse, bad routing, or low-quality compromise across mixed-skill agents. This maps most directly to `Multi-Agent Teams Hold Experts Back`.
+ - `blackboard-fidelity`
+ Use when the risk is information loss or distortion between the raw coordination log and derived artifacts like shared summaries, inboxes, ledger state, or integration summaries.
+ - `contradiction-recovery`
+ Use when the risk is false consensus, unresolved conflicting claims, or clarification chains that appear resolved without real repair.
+
+ ## How To Choose The Right Family
+
+ Choose the family based on the failure you are most worried about, not just on the surface area being changed.
+
+ Use:
+
+ - `hidden-profile-pooling` when the hard part is discovering missing facts
+ - `silo-escape` when the hard part is integrating already-shared facts into one correct state
+ - `simultaneous-coordination` when multiple owners or resources must move together
+ - `expertise-leverage` when the right answer depends on preserving the best expert signal
+ - `blackboard-fidelity` when summaries, inboxes, or integration artifacts may be dropping important evidence
+ - `contradiction-recovery` when you expect conflicting claims and need the framework to turn them into bounded repair work
+
+ ## How `cont-EVAL` Should Use The Catalog
+
+ When a wave delegates benchmark selection:
+
+ 1. Read the wave's `## Eval targets`.
+ 2. Resolve the allowed benchmark family from the catalog.
+ 3. Choose the smallest benchmark set that genuinely tests the target's failure mode.
+ 4. Record the exact selected benchmark ids in the `cont-EVAL` report.
+ 5. Emit the final `[wave-eval]` marker with the exact executed `benchmark_ids`.
+
+ When a wave pins benchmarks:
+
+ 1. Run the named benchmark ids directly.
+ 2. Do not silently swap to nearby checks.
+ 3. Treat missing or unrun pinned benchmarks as an unsatisfied target.
+
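+ The two selection policies above can be sketched as a small validation helper. The function name and the `catalog` shape (family name mapped to an array of benchmark ids) are assumptions for illustration, not Wave's actual API:
+
+ ```javascript
+ // Sketch: resolve what cont-EVAL is allowed to run for one eval target.
+ function resolveBenchmarks(target, catalog) {
+   if (target.selection === "pinned") {
+     // Pinned targets name exact benchmark ids; never swap in alternates.
+     const known = Object.values(catalog).flat();
+     const missing = target.benchmarks.filter((id) => !known.includes(id));
+     if (missing.length > 0) {
+       // An unknown or unrun pinned benchmark leaves the target unsatisfied.
+       return { satisfiable: false, benchmarks: [], missing };
+     }
+     return { satisfiable: true, benchmarks: target.benchmarks, missing: [] };
+   }
+   // Delegated targets authorize a family; cont-EVAL then picks the smallest
+   // benchmark set inside it and must report the exact selected ids.
+   const allowed = catalog[target["benchmark-family"]] ?? [];
+   return { satisfiable: allowed.length > 0, benchmarks: allowed, missing: [] };
+ }
+ ```
+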
+ ## How To Read The Static Baselines
+
+ Some coordination families now include static paper baselines such as HiddenBench, Silo-Bench, DPBench, and `Multi-Agent Teams Hold Experts Back`.
+
+ These baselines are:
+
+ - reference points from papers
+ - useful for framing whether Wave is still far from the broader state of the art
+ - not the same thing as local run history
+
+ They should answer:
+
+ - what failure mode this benchmark family is grounded in
+ - what the paper reported
+ - what metric the paper used
+
+ They should not be treated as:
+
+ - a promise that Wave already matches the paper's best system
+ - a local regression history
+ - a substitute for actually running evals
+
+ ## Authoring Guidance
+
+ Prefer one eval target per distinct risk.
+
+ Good:
+
+ - one target for distributed-information pooling
+ - one target for contradiction recovery
+ - one target for latency guardrails
+
+ Avoid:
+
+ - one overloaded target that mixes every coordination risk into a single vague threshold
+
+ Prefer delegated targets early when the family is stable but the exact check should remain flexible.
+
+ Prefer pinned targets when:
+
+ - the wave is release-sensitive
+ - the benchmark is small and repeatable
+ - you need a precise regression gate
+
+ ## Current Limits
+
+ The benchmark catalog does not yet store:
+
+ - local benchmark run history
+ - local-vs-paper delta computation
+ - automated benchmark execution plans
+
+ For now it is the schema and policy layer that keeps eval authoring, `cont-EVAL`, and coordination benchmarking aligned.