@chllming/wave-orchestration 0.5.3 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (136)
  1. package/CHANGELOG.md +53 -3
  2. package/README.md +81 -506
  3. package/docs/README.md +53 -0
  4. package/docs/agents/wave-cont-eval-role.md +36 -0
  5. package/docs/agents/{wave-evaluator-role.md → wave-cont-qa-role.md} +14 -11
  6. package/docs/agents/wave-documentation-role.md +1 -1
  7. package/docs/agents/wave-infra-role.md +1 -1
  8. package/docs/agents/wave-integration-role.md +3 -3
  9. package/docs/agents/wave-launcher-role.md +4 -3
  10. package/docs/agents/wave-security-role.md +40 -0
  11. package/docs/concepts/context7-vs-skills.md +94 -0
  12. package/docs/concepts/operating-modes.md +91 -0
  13. package/docs/concepts/runtime-agnostic-orchestration.md +95 -0
  14. package/docs/concepts/what-is-a-wave.md +183 -0
  15. package/docs/evals/README.md +166 -0
  16. package/docs/evals/benchmark-catalog.json +663 -0
  17. package/docs/guides/author-and-run-waves.md +135 -0
  18. package/docs/guides/planner.md +118 -0
  19. package/docs/guides/terminal-surfaces.md +82 -0
  20. package/docs/image.png +0 -0
  21. package/docs/plans/component-cutover-matrix.json +1 -1
  22. package/docs/plans/component-cutover-matrix.md +1 -1
  23. package/docs/plans/context7-wave-orchestrator.md +2 -0
  24. package/docs/plans/current-state.md +29 -1
  25. package/docs/plans/examples/wave-example-live-proof.md +435 -0
  26. package/docs/plans/master-plan.md +3 -3
  27. package/docs/plans/migration.md +46 -3
  28. package/docs/plans/wave-orchestrator.md +71 -8
  29. package/docs/plans/waves/wave-0.md +4 -4
  30. package/docs/reference/live-proof-waves.md +177 -0
  31. package/docs/reference/migration-0.2-to-0.5.md +26 -19
  32. package/docs/reference/npmjs-trusted-publishing.md +6 -5
  33. package/docs/reference/runtime-config/README.md +29 -0
  34. package/docs/reference/sample-waves.md +87 -0
  35. package/docs/reference/skills.md +224 -0
  36. package/docs/research/agent-context-sources.md +130 -11
  37. package/docs/research/coordination-failure-review.md +266 -0
  38. package/docs/roadmap.md +164 -564
  39. package/package.json +3 -2
  40. package/releases/manifest.json +37 -2
  41. package/scripts/research/agent-context-archive.mjs +83 -1
  42. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +811 -0
  43. package/scripts/wave-orchestrator/adhoc.mjs +1331 -0
  44. package/scripts/wave-orchestrator/agent-state.mjs +358 -6
  45. package/scripts/wave-orchestrator/artifact-schemas.mjs +173 -0
  46. package/scripts/wave-orchestrator/clarification-triage.mjs +10 -3
  47. package/scripts/wave-orchestrator/config.mjs +65 -12
  48. package/scripts/wave-orchestrator/context7.mjs +11 -0
  49. package/scripts/wave-orchestrator/coord-cli.mjs +51 -19
  50. package/scripts/wave-orchestrator/coordination-store.mjs +26 -4
  51. package/scripts/wave-orchestrator/coordination.mjs +99 -9
  52. package/scripts/wave-orchestrator/dashboard-state.mjs +20 -8
  53. package/scripts/wave-orchestrator/dep-cli.mjs +5 -2
  54. package/scripts/wave-orchestrator/docs-queue.mjs +8 -2
  55. package/scripts/wave-orchestrator/evals.mjs +451 -0
  56. package/scripts/wave-orchestrator/executors.mjs +24 -11
  57. package/scripts/wave-orchestrator/feedback.mjs +15 -1
  58. package/scripts/wave-orchestrator/install.mjs +69 -7
  59. package/scripts/wave-orchestrator/launcher-closure.mjs +281 -0
  60. package/scripts/wave-orchestrator/launcher-runtime.mjs +334 -0
  61. package/scripts/wave-orchestrator/launcher.mjs +778 -577
  62. package/scripts/wave-orchestrator/ledger.mjs +123 -20
  63. package/scripts/wave-orchestrator/local-executor.mjs +99 -12
  64. package/scripts/wave-orchestrator/planner.mjs +1463 -0
  65. package/scripts/wave-orchestrator/project-profile.mjs +190 -0
  66. package/scripts/wave-orchestrator/replay.mjs +6 -3
  67. package/scripts/wave-orchestrator/role-helpers.mjs +84 -0
  68. package/scripts/wave-orchestrator/shared.mjs +77 -11
  69. package/scripts/wave-orchestrator/skills.mjs +979 -0
  70. package/scripts/wave-orchestrator/terminals.mjs +16 -0
  71. package/scripts/wave-orchestrator/traces.mjs +73 -27
  72. package/scripts/wave-orchestrator/wave-files.mjs +1224 -163
  73. package/scripts/wave.mjs +20 -0
  74. package/skills/README.md +202 -0
  75. package/skills/provider-aws/SKILL.md +117 -0
  76. package/skills/provider-aws/adapters/claude.md +1 -0
  77. package/skills/provider-aws/adapters/codex.md +1 -0
  78. package/skills/provider-aws/references/service-verification.md +39 -0
  79. package/skills/provider-aws/skill.json +54 -0
  80. package/skills/provider-custom-deploy/SKILL.md +64 -0
  81. package/skills/provider-custom-deploy/skill.json +50 -0
  82. package/skills/provider-docker-compose/SKILL.md +96 -0
  83. package/skills/provider-docker-compose/adapters/local.md +1 -0
  84. package/skills/provider-docker-compose/skill.json +53 -0
  85. package/skills/provider-github-release/SKILL.md +121 -0
  86. package/skills/provider-github-release/adapters/claude.md +1 -0
  87. package/skills/provider-github-release/adapters/codex.md +1 -0
  88. package/skills/provider-github-release/skill.json +55 -0
  89. package/skills/provider-kubernetes/SKILL.md +143 -0
  90. package/skills/provider-kubernetes/adapters/claude.md +1 -0
  91. package/skills/provider-kubernetes/adapters/codex.md +1 -0
  92. package/skills/provider-kubernetes/references/kubectl-patterns.md +58 -0
  93. package/skills/provider-kubernetes/skill.json +52 -0
  94. package/skills/provider-railway/SKILL.md +123 -0
  95. package/skills/provider-railway/adapters/claude.md +1 -0
  96. package/skills/provider-railway/adapters/codex.md +1 -0
  97. package/skills/provider-railway/adapters/local.md +1 -0
  98. package/skills/provider-railway/adapters/opencode.md +1 -0
  99. package/skills/provider-railway/references/verification-commands.md +39 -0
  100. package/skills/provider-railway/skill.json +71 -0
  101. package/skills/provider-ssh-manual/SKILL.md +97 -0
  102. package/skills/provider-ssh-manual/skill.json +54 -0
  103. package/skills/repo-coding-rules/SKILL.md +91 -0
  104. package/skills/repo-coding-rules/skill.json +34 -0
  105. package/skills/role-cont-eval/SKILL.md +90 -0
  106. package/skills/role-cont-eval/adapters/codex.md +1 -0
  107. package/skills/role-cont-eval/skill.json +36 -0
  108. package/skills/role-cont-qa/SKILL.md +93 -0
  109. package/skills/role-cont-qa/adapters/claude.md +1 -0
  110. package/skills/role-cont-qa/skill.json +36 -0
  111. package/skills/role-deploy/SKILL.md +96 -0
  112. package/skills/role-deploy/skill.json +36 -0
  113. package/skills/role-documentation/SKILL.md +72 -0
  114. package/skills/role-documentation/skill.json +36 -0
  115. package/skills/role-implementation/SKILL.md +68 -0
  116. package/skills/role-implementation/skill.json +36 -0
  117. package/skills/role-infra/SKILL.md +80 -0
  118. package/skills/role-infra/skill.json +36 -0
  119. package/skills/role-integration/SKILL.md +84 -0
  120. package/skills/role-integration/skill.json +36 -0
  121. package/skills/role-research/SKILL.md +64 -0
  122. package/skills/role-research/skill.json +36 -0
  123. package/skills/role-security/SKILL.md +60 -0
  124. package/skills/role-security/skill.json +36 -0
  125. package/skills/runtime-claude/SKILL.md +65 -0
  126. package/skills/runtime-claude/skill.json +36 -0
  127. package/skills/runtime-codex/SKILL.md +57 -0
  128. package/skills/runtime-codex/skill.json +36 -0
  129. package/skills/runtime-local/SKILL.md +44 -0
  130. package/skills/runtime-local/skill.json +36 -0
  131. package/skills/runtime-opencode/SKILL.md +57 -0
  132. package/skills/runtime-opencode/skill.json +36 -0
  133. package/skills/wave-core/SKILL.md +114 -0
  134. package/skills/wave-core/references/marker-syntax.md +62 -0
  135. package/skills/wave-core/skill.json +35 -0
  136. package/wave.config.json +61 -5
package/docs/concepts/what-is-a-wave.md
@@ -0,0 +1,183 @@
1
+ # What Is A Wave?
2
+
3
+ A wave is the main planning and execution unit in Wave Orchestration.
4
+
5
+ It is not just a prompt file. A wave is a bounded slice of repository work with:
6
+
7
+ - explicit scope
8
+ - named owners
9
+ - runtime and context requirements
10
+ - proof and closure rules
11
+ - durable coordination state
12
+ - replayable execution artifacts
13
+
14
+ ## Core Terms
15
+
16
+ - Lane
17
+ An ordered sequence of waves. The default lane in this repo is `main`.
18
+ - Wave
19
+ One numbered work package inside a lane, usually stored as `docs/plans/waves/wave-<n>.md`.
20
+ - Agent
21
+ One role inside the wave, such as implementation, `cont-EVAL`, security review, integration, documentation, cont-QA, infra, or deploy.
22
+ - Attempt
23
+ One execution pass of a wave. A wave can have multiple attempts due to retries or fallback.
24
+ - Closure
25
+ The final proof pass that decides whether the wave is actually done, not just partially implemented.
26
+
27
+ ## Why Waves Exist
28
+
29
+ Waves force a higher planning bar than ad hoc prompts. A good wave answers:
30
+
31
+ - What is changing now, and why now?
32
+ - Which components or docs are in scope?
33
+ - Which agent owns each slice?
34
+ - What evidence closes the wave?
35
+ - Which dependencies, helper requests, or escalations can still block completion?
36
+
37
+ ## Wave Anatomy
38
+
39
+ Wave markdown is the authored execution surface today. A typical wave can include:
40
+
41
+ - title and commit message
42
+ - project profile details such as oversight mode and lane
43
+ - sequencing note
44
+ - reference rule
45
+ - deploy environments
46
+ - component promotions
47
+ - eval targets
48
+ - Context7 defaults
49
+ - one `## Agent ...` block per role
50
+
51
+ Inside each agent block, the important sections are:
52
+
53
+ - `### Role prompts`
54
+ Standing role identity imported from `docs/agents/*.md`.
55
+ - `### Executor`
56
+ Runtime selection, profile, model, fallbacks, and budgets.
57
+ - `## Eval targets`
58
+ Optional wave-level contract for `cont-EVAL`, declared once per wave rather than inside an agent block. It includes the benchmark family or pinned benchmarks, the objective, and the stop condition.
59
+ See [docs/evals/README.md](../evals/README.md) for guidance on delegated versus pinned targets and the coordination benchmark families.
60
+ - `### Proof artifacts`
61
+ Optional machine-visible local evidence required for proof-centric waves, especially `pilot-live` and above.
62
+ - `### Context7`
63
+ External library truth to prefetch and inject.
64
+ - `### Skills`
65
+ Reusable repo-owned environment or workflow guidance resolved after runtime selection.
66
+ - `### Components`
67
+ The components that agent is responsible for proving or promoting.
68
+ - `### Capabilities`
69
+ Optional routing hints for follow-up work.
70
+ - `### Deliverables`
71
+ Exact repo-relative outputs that must exist before closure can pass.
72
+ - `### Prompt`
73
+ The specific task, file ownership, requirements, and validation instructions.
74
+ - `### Exit contract`
75
+ The completion, durability, proof, and documentation expectations that gate closure.
76
+
77
+ ## Standard Roles
78
+
79
+ The starter runtime expects three standard closure roles plus up to two optional review specialists:
80
+
81
+ - `A8`
82
+ Integration steward
83
+ - `A9`
84
+ Documentation steward
85
+ - `A0`
86
+ cont-QA
87
+ - `E0`
88
+ Optional `cont-EVAL` for iterative benchmark or output tuning; report-only by default, implementation-owning only when explicitly assigned non-report files
89
+ - `A7`
90
+ Optional security reviewer; report-only by default and used to publish a threat-model-first security review before integration closure
91
+
92
+ Implementation or specialist agents own the actual work slices. Closure roles do not replace implementation ownership; they decide whether the combined result is closure-ready. `cont-EVAL` is the one hybrid role: most waves keep it report-only, but human-authored waves may assign explicit tuning files to `E0`, in which case it must satisfy both implementation proof and eval proof.
93
+
94
+ ## Lifecycle Of A Wave
95
+
96
+ 1. Author or draft the wave.
97
+ 2. Run `wave launch --dry-run --no-dashboard`.
98
+ 3. The launcher validates the wave, resolves executors and skills, builds prompts, and materializes operator surfaces.
99
+ 4. A live run launches implementation agents first when implementation work remains.
100
+ 5. Agents write structured coordination events instead of relying on ad hoc terminal output.
101
+ 6. The launcher checks implementation contracts, promoted-component proof, helper assignments, dependencies, and clarification state.
102
+ 7. If implementation is ready, closure runs in order: optional `cont-EVAL`, optional security review, integration, documentation, then cont-QA.
103
+ 8. The attempt is captured in per-wave traces, ledgers, inboxes, summaries, and copied artifacts.
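The closure ordering in step 7 can be sketched as a small function. This is only an illustrative sketch: `closureSequence` and its option names are hypothetical and are not taken from the launcher's code. The role ids are the ones listed under Standard Roles.

```javascript
// Hypothetical helper: returns closure role ids in the order described in step 7.
// The optional roles (E0, A7) are included only when the wave declares them.
function closureSequence({ hasContEval = false, hasSecurityReview = false } = {}) {
  const sequence = [];
  if (hasContEval) sequence.push("E0"); // optional cont-EVAL
  if (hasSecurityReview) sequence.push("A7"); // optional security review
  sequence.push("A8", "A9", "A0"); // integration, documentation, cont-QA
  return sequence;
}

console.log(closureSequence({ hasContEval: true, hasSecurityReview: true }).join(" > "));
```

The mandatory tail (`A8`, `A9`, `A0`) never changes; only the optional review specialists at the front vary per wave.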
104
+
105
+ ## Runtime And Operating Posture
106
+
107
+ Wave is runtime agnostic at the orchestration layer.
108
+
109
+ Planning, ownership, closure, durable state, and traces do not depend on whether an agent runs on Codex, Claude Code, OpenCode, or the local smoke executor. Runtime-specific behavior is isolated to executor adapters and overlays.
110
+
111
+ That means a wave should usually be authored in runtime-neutral terms:
112
+
113
+ - ownership and deliverables
114
+ - proof and validation
115
+ - closure order
116
+ - dependencies and helper flow
117
+ - promoted component expectations
118
+
119
+ The runtime choice resolves later, from the agent executor block, profile defaults, lane defaults, CLI overrides, and fallback policy.
120
+
121
+ Wave also has an execution posture:
122
+
123
+ - `oversight`
124
+ Human review or intervention is expected for risky or ambiguous work.
125
+ - `dark-factory`
126
+ The wave is authored for routine execution without normal human intervention.
127
+
128
+ Today these postures are planning vocabulary and saved project defaults, not two separate execution engines. Human feedback is still an escalation mechanism inside the orchestration loop, not the definition of the operating mode itself.
129
+
130
+ If you need the narrower supporting pages, see [runtime-agnostic-orchestration.md](./runtime-agnostic-orchestration.md) and [operating-modes.md](./operating-modes.md).
131
+
132
+ Current live waves are strict about closure artifacts:
133
+
134
+ - `cont-EVAL` must emit a structured `[wave-eval]` marker whose `target_ids` matches the declared eval targets and whose `benchmark_ids` enumerates the executed benchmark set.
135
+ - Security reviewers must leave a security review report and emit a final `[wave-security]` marker with `state=<clear|concerns|blocked>`, finding count, and approval count.
136
+ - `cont-QA` must emit both a final `Verdict:` line and a final `[wave-gate]` marker.
137
+ - Replay keeps read-only compatibility with older traces and older evaluator-era artifacts, but live waves do not pass on verdict-only or underspecified closure markers.
138
+
139
+ ## What Makes A Wave "Done"
140
+
141
+ A wave is not done because an agent said so. It is done only when the runtime surfaces agree:
142
+
143
+ - implementation exit contracts pass
144
+ - required deliverables exist and stay within ownership boundaries
145
+ - required proof artifacts exist when the wave declares proof-first live evidence
146
+ - required component proof and promotions pass
147
+ - helper assignments are resolved
148
+ - required dependency tickets are resolved
149
+ - clarification follow-ups or escalations are resolved
150
+ - if present, `cont-EVAL` satisfies its declared eval targets
151
+ - if present, the security reviewer publishes a report plus a final `[wave-security]` marker; `blocked` stops closure while `concerns` stays advisory
152
+ - integration recommends closure
153
+ - documentation and cont-QA closure pass
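The security rule in the list above (`blocked` stops closure, `concerns` stays advisory) could be expressed roughly as follows. This is a sketch under assumptions: `securityGate` is a hypothetical name, it takes the already-extracted `state` value rather than parsing a real `[wave-security]` marker, and rejecting unknown states is an assumption based on the statement that live waves do not pass on underspecified closure markers.

```javascript
// Hypothetical gate: maps the `state` value of a final [wave-security] marker
// to a closure decision. Marker parsing itself is out of scope for this sketch.
function securityGate(state) {
  switch (state) {
    case "clear":
      return { blocksClosure: false, advisory: false };
    case "concerns":
      return { blocksClosure: false, advisory: true }; // surfaced, but does not stop closure
    case "blocked":
      return { blocksClosure: true, advisory: false };
    default:
      // Assumption: an unrecognized state is treated as an error, not as a pass.
      throw new Error(`unknown wave-security state: ${state}`);
  }
}

console.log(securityGate("concerns"));
```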
154
+
155
+ For proof-first live-wave examples, see [docs/reference/live-proof-waves.md](../reference/live-proof-waves.md).
156
+
157
+ ## Where The State Lives
158
+
159
+ The wave file is only part of the story. The runtime writes durable state under `.tmp/<lane>-wave-launcher/`, including:
160
+
161
+ - prompts and logs
162
+ - status summaries
163
+ - coordination logs
164
+ - rendered message boards
165
+ - compiled inboxes
166
+ - ledger and docs queue
167
+ - security summaries
168
+ - integration summaries
169
+ - dependency snapshots
170
+ - executor overlays
171
+ - trace bundles
172
+
173
+ That is why a wave is better understood as a bounded execution record, not just a markdown file.
174
+
175
+ ## Planner Specs vs Markdown
176
+
177
+ The planner foundation adds a JSON draft spec at `docs/plans/waves/specs/wave-<n>.json`.
178
+
179
+ - The JSON spec is the canonical planner artifact.
180
+ - The rendered markdown stays compatible with the launcher and parser.
181
+ - The launcher still executes the markdown wave file today.
182
+
183
+ This split keeps authoring structured while preserving the established execution surface.
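To keep the three locations named in this file straight, here is a minimal sketch that derives them from a lane and wave number. `wavePaths` is a hypothetical helper, not part of the package; the path patterns are the ones quoted above, and the markdown path is only the usual location ("usually stored as"), so a repo may differ.

```javascript
// Hypothetical helper; the path patterns are the ones quoted in this document.
function wavePaths(lane, waveNumber) {
  return {
    // Authored wave file that the launcher executes today (usual location).
    markdown: `docs/plans/waves/wave-${waveNumber}.md`,
    // Canonical planner artifact (JSON draft spec).
    plannerSpec: `docs/plans/waves/specs/wave-${waveNumber}.json`,
    // Durable runtime state written by the launcher for this lane.
    stateDir: `.tmp/${lane}-wave-launcher/`,
  };
}

console.log(wavePaths("main", 0));
```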
package/docs/evals/README.md
@@ -0,0 +1,166 @@
1
+ ---
2
+ title: "Benchmark Catalog Guide"
3
+ summary: "How to use delegated benchmark families, pinned benchmarks, and coordination-oriented eval targets in Wave."
4
+ ---
5
+
6
+ # Benchmark Catalog Guide
7
+
8
+ Wave's benchmark catalog lives in `docs/evals/benchmark-catalog.json`.
9
+
10
+ It has two jobs:
11
+
12
+ - give `cont-EVAL` a repo-governed menu of allowed benchmark families and benchmark ids
13
+ - document what each benchmark is trying to catch, including coordination failure modes and static paper baselines
14
+
15
+ The catalog is reference metadata, not a run-history database. It tells the wave author and `cont-EVAL` what kinds of checks are allowed and what external benchmark or paper baseline those checks map to.
16
+
17
+ For a full authored wave example that uses these patterns, see [docs/reference/sample-waves.md](../reference/sample-waves.md).
18
+
19
+ ## Migrating From Legacy Evaluator Waves
20
+
21
+ If your `0.5.4`-era repo still talks about a single `evaluator` role, split that surface before adopting `0.6.0`:
22
+
23
+ - keep `A0` as `cont-QA` for the final closure verdict and `[wave-gate]`
24
+ - add `E0` only when the wave needs benchmark-driven tuning or service-output evaluation
25
+ - treat `cont-EVAL` as report-only unless the wave explicitly gives `E0` owned implementation files
26
+ - keep `## Eval targets` at the wave level so `cont-EVAL` has an exact contract to satisfy
27
+
28
+ `cont-EVAL` is not a rename of `cont-QA`. In `0.6.0`, `E0` proves the eval contract before integration, while `A0` still owns the final release verdict after documentation closure.
29
+
30
+ ## When To Use Delegated Vs Pinned Targets
31
+
32
+ Use `selection: delegated` when the wave should authorize a benchmark family and let `cont-EVAL` choose the exact benchmark set inside that family.
33
+
34
+ Use `selection: pinned` when the wave must require specific benchmark ids and does not want `cont-EVAL` to choose alternates.
35
+
36
+ In practice:
37
+
38
+ - `delegated` is better when you want flexibility inside a stable area such as `hidden-profile-pooling` or `latency`
39
+ - `pinned` is better when you need an exact smoke gate such as `cold-start-smoke` or `private-evidence-integration`
40
+
41
+ ## Example Eval Targets
42
+
43
+ Delegated family target:
44
+
45
+ ```md
46
+ ## Eval targets
47
+
48
+ - id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
49
+ ```
50
+
51
+ Pinned benchmark target:
52
+
53
+ ```md
54
+ ## Eval targets
55
+
56
+ - id: contradiction-recovery-guard | selection: pinned | benchmarks: claim-conflict-detection,evidence-based-repair | objective: Detect and repair conflicting claims before closure | threshold: Material contradictions become explicit follow-up work and resolve before final pass
57
+ ```
58
+
59
+ Mixed target set:
60
+
61
+ ```md
62
+ ## Eval targets
63
+
64
+ - id: coordination-pooling | selection: delegated | benchmark-family: hidden-profile-pooling | objective: Pool distributed private evidence before closure | threshold: Critical decision-changing facts appear in the final integrated answer before PASS
65
+ - id: summary-integrity | selection: pinned | benchmarks: shared-summary-fact-retention | objective: Preserve decision-changing facts through summary compression | threshold: Shared summaries retain the facts needed for the final recommendation
66
+ ```
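The pipe-delimited target lines shown in these examples can be read into plain objects with a few lines of code. The following is a minimal sketch, not the package's parser: `parseEvalTarget` is a hypothetical name, and it assumes that field values never contain a `|` character.

```javascript
// Hypothetical parser for one `## Eval targets` bullet, e.g.
// "- id: x | selection: pinned | benchmarks: a,b | objective: ... | threshold: ..."
function parseEvalTarget(line) {
  const body = line.replace(/^\s*-\s*/, ""); // drop the leading list dash
  const target = {};
  for (const field of body.split("|")) {
    const idx = field.indexOf(":"); // split on the first colon only
    if (idx === -1) continue; // skip fragments that are not key: value
    const key = field.slice(0, idx).trim();
    const value = field.slice(idx + 1).trim();
    // `benchmarks` is a comma-separated id list; every other field stays a string.
    target[key] = key === "benchmarks" ? value.split(",").map((s) => s.trim()) : value;
  }
  return target;
}

const pinned = parseEvalTarget(
  "- id: summary-integrity | selection: pinned | benchmarks: shared-summary-fact-retention | objective: Preserve decision-changing facts through summary compression | threshold: Shared summaries retain the facts needed for the final recommendation"
);
console.log(pinned.selection, pinned.benchmarks);
```

Splitting on the first colon keeps objectives or thresholds that themselves contain a colon intact.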
67
+
68
+ ## Coordination Families
69
+
70
+ The coordination-oriented families currently included in the catalog are:
71
+
72
+ - `hidden-profile-pooling`
73
+ Use when the main risk is that agents fail to surface or integrate distributed private evidence. This maps most directly to HiddenBench.
74
+ - `silo-escape`
75
+ Use when the risk is that agents communicate but still fail to reconstruct the correct global state. This maps most directly to Silo-Bench.
76
+ - `simultaneous-coordination`
77
+ Use when the risk is contention, lockstep failure, or convergent reasoning under concurrent decisions. This maps most directly to DPBench.
78
+ - `expertise-leverage`
79
+ Use when the risk is expert underuse, bad routing, or low-quality compromise across mixed-skill agents. This maps most directly to `Multi-Agent Teams Hold Experts Back`.
80
+ - `blackboard-fidelity`
81
+ Use when the risk is information loss or distortion between the raw coordination log and derived artifacts like shared summaries, inboxes, ledger state, or integration summaries.
82
+ - `contradiction-recovery`
83
+ Use when the risk is false consensus, unresolved conflicting claims, or clarification chains that appear resolved without real repair.
84
+
85
+ ## How To Choose The Right Family
86
+
87
+ Choose the family based on the failure you are most worried about, not just on the surface area being changed.
88
+
89
+ Use:
90
+
91
+ - `hidden-profile-pooling` when the hard part is discovering missing facts
92
+ - `silo-escape` when the hard part is integrating already-shared facts into one correct state
93
+ - `simultaneous-coordination` when multiple owners or resources must move together
94
+ - `expertise-leverage` when the right answer depends on preserving the best expert signal
95
+ - `blackboard-fidelity` when summaries, inboxes, or integration artifacts may be dropping important evidence
96
+ - `contradiction-recovery` when you expect conflicting claims and need the framework to turn them into bounded repair work
97
+
98
+ ## How `cont-EVAL` Should Use The Catalog
99
+
100
+ When a wave delegates benchmark selection:
101
+
102
+ 1. Read the wave's `## Eval targets`.
103
+ 2. Resolve the allowed benchmark family from the catalog.
104
+ 3. Choose the smallest benchmark set that genuinely tests the target's failure mode.
105
+ 4. Record the exact selected benchmark ids in the `cont-EVAL` report.
106
+ 5. Emit the final `[wave-eval]` marker with the exact executed `benchmark_ids`.
107
+
108
+ When a wave pins benchmarks:
109
+
110
+ 1. Run the named benchmark ids directly.
111
+ 2. Do not silently swap to nearby checks.
112
+ 3. Treat missing or unrun pinned benchmarks as an unsatisfied target.
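The pinned-benchmark rules above amount to a set-coverage check: every pinned id must appear among the executed ids. A minimal sketch follows, assuming both lists are plain arrays of benchmark id strings; `missingPinnedBenchmarks` is a hypothetical name, not a function from the package.

```javascript
// Hypothetical check: returns the pinned benchmark ids that were not executed.
// A non-empty result means the pinned target is unsatisfied.
function missingPinnedBenchmarks(pinnedIds, executedIds) {
  const executed = new Set(executedIds);
  return pinnedIds.filter((id) => !executed.has(id));
}

const missing = missingPinnedBenchmarks(
  ["claim-conflict-detection", "evidence-based-repair"],
  ["claim-conflict-detection"]
);
console.log(missing.length === 0 ? "target satisfied" : `unsatisfied, missing: ${missing.join(", ")}`);
```

Running extra benchmarks does not satisfy a pinned target by itself; only the named ids count, which mirrors the rule against silently swapping to nearby checks.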
113
+
114
+ ## How To Read The Static Baselines
115
+
116
+ Some coordination families now include static paper baselines such as HiddenBench, Silo-Bench, DPBench, and `Multi-Agent Teams Hold Experts Back`.
117
+
118
+ These baselines are:
119
+
120
+ - reference points from papers
121
+ - useful for judging how far Wave still is from the broader state of the art
122
+ - not the same thing as local run history
123
+
124
+ They should answer:
125
+
126
+ - what failure mode this benchmark family is grounded in
127
+ - what the paper reported
128
+ - what metric the paper used
129
+
130
+ They should not be treated as:
131
+
132
+ - a promise that Wave already matches the paper's best system
133
+ - a local regression history
134
+ - a substitute for actually running evals
135
+
136
+ ## Authoring Guidance
137
+
138
+ Prefer one eval target per distinct risk.
139
+
140
+ Good:
141
+
142
+ - one target for distributed-information pooling
143
+ - one target for contradiction recovery
144
+ - one target for latency guardrails
145
+
146
+ Avoid:
147
+
148
+ - one overloaded target that mixes every coordination risk into a single vague threshold
149
+
150
+ Prefer delegated targets early when the family is stable but the exact check should remain flexible.
151
+
152
+ Prefer pinned targets when:
153
+
154
+ - the wave is release-sensitive
155
+ - the benchmark is small and repeatable
156
+ - you need a precise regression gate
157
+
158
+ ## Current Limits
159
+
160
+ The benchmark catalog does not yet store:
161
+
162
+ - local benchmark run history
163
+ - local-vs-paper delta computation
164
+ - automated benchmark execution plans
165
+
166
+ For now it is the schema and policy layer that keeps eval authoring, `cont-EVAL`, and coordination benchmarking aligned.