cool-workflow 0.1.78

Files changed (193)
  1. package/.claude-plugin/plugin.json +20 -0
  2. package/.codex-plugin/mcp.json +10 -0
  3. package/.codex-plugin/plugin.json +38 -0
  4. package/.mcp.json +10 -0
  5. package/LICENSE +24 -0
  6. package/README.md +638 -0
  7. package/apps/architecture-review/app.json +51 -0
  8. package/apps/architecture-review/workflow.js +116 -0
  9. package/apps/end-to-end-golden-path/app.json +30 -0
  10. package/apps/end-to-end-golden-path/workflow.js +33 -0
  11. package/apps/pr-review-fix-ci/app.json +59 -0
  12. package/apps/pr-review-fix-ci/workflow.js +90 -0
  13. package/apps/release-cut/app.json +54 -0
  14. package/apps/release-cut/workflow.js +82 -0
  15. package/apps/research-synthesis/app.json +50 -0
  16. package/apps/research-synthesis/workflow.js +76 -0
  17. package/apps/workflow-app-framework-demo/app.json +29 -0
  18. package/apps/workflow-app-framework-demo/workflow.js +44 -0
  19. package/dist/agent-config.js +223 -0
  20. package/dist/candidate-scoring.js +715 -0
  21. package/dist/capability-core.js +630 -0
  22. package/dist/capability-dispatcher.js +86 -0
  23. package/dist/capability-registry.js +523 -0
  24. package/dist/cli.js +1276 -0
  25. package/dist/collaboration.js +727 -0
  26. package/dist/commit.js +570 -0
  27. package/dist/contract-migration.js +234 -0
  28. package/dist/coordinator.js +1163 -0
  29. package/dist/daemon.js +44 -0
  30. package/dist/dispatch.js +201 -0
  31. package/dist/drive.js +503 -0
  32. package/dist/error-feedback.js +415 -0
  33. package/dist/evidence-grounding.js +179 -0
  34. package/dist/evidence-reasoning.js +733 -0
  35. package/dist/execution-backend.js +1279 -0
  36. package/dist/harness.js +61 -0
  37. package/dist/mcp-server.js +1615 -0
  38. package/dist/multi-agent-eval.js +857 -0
  39. package/dist/multi-agent-host.js +764 -0
  40. package/dist/multi-agent-operator-ux.js +537 -0
  41. package/dist/multi-agent-trust.js +366 -0
  42. package/dist/multi-agent.js +1173 -0
  43. package/dist/node-snapshot.js +270 -0
  44. package/dist/observability.js +922 -0
  45. package/dist/operator-ux.js +971 -0
  46. package/dist/orchestrator/audit-operations.js +182 -0
  47. package/dist/orchestrator/candidate-operations.js +117 -0
  48. package/dist/orchestrator/cli-options.js +288 -0
  49. package/dist/orchestrator/collaboration-operations.js +86 -0
  50. package/dist/orchestrator/feedback-operations.js +81 -0
  51. package/dist/orchestrator/host-operations.js +78 -0
  52. package/dist/orchestrator/lifecycle-operations.js +462 -0
  53. package/dist/orchestrator/migration-operations.js +44 -0
  54. package/dist/orchestrator/multi-agent-operations.js +362 -0
  55. package/dist/orchestrator/report.js +369 -0
  56. package/dist/orchestrator/topology-operations.js +84 -0
  57. package/dist/orchestrator.js +874 -0
  58. package/dist/pipeline-contract.js +92 -0
  59. package/dist/pipeline-runner.js +285 -0
  60. package/dist/reclamation.js +882 -0
  61. package/dist/result-normalize.js +194 -0
  62. package/dist/run-export.js +64 -0
  63. package/dist/run-registry.js +1347 -0
  64. package/dist/run-state-schema.js +67 -0
  65. package/dist/sandbox-profile.js +471 -0
  66. package/dist/scheduler.js +266 -0
  67. package/dist/scheduling.js +184 -0
  68. package/dist/schema-validate.js +98 -0
  69. package/dist/state-explosion.js +1213 -0
  70. package/dist/state-migrations.js +463 -0
  71. package/dist/state-node.js +301 -0
  72. package/dist/state.js +308 -0
  73. package/dist/telemetry-attestation.js +156 -0
  74. package/dist/telemetry-ledger.js +145 -0
  75. package/dist/topology.js +527 -0
  76. package/dist/triggers.js +159 -0
  77. package/dist/trust-audit.js +475 -0
  78. package/dist/types/blackboard.js +2 -0
  79. package/dist/types/boundary.js +29 -0
  80. package/dist/types/candidate.js +2 -0
  81. package/dist/types/collaboration.js +2 -0
  82. package/dist/types/core.js +2 -0
  83. package/dist/types/drive.js +10 -0
  84. package/dist/types/error-feedback.js +2 -0
  85. package/dist/types/evidence-reasoning.js +2 -0
  86. package/dist/types/execution-backend.js +2 -0
  87. package/dist/types/multi-agent.js +2 -0
  88. package/dist/types/observability.js +2 -0
  89. package/dist/types/pipeline.js +2 -0
  90. package/dist/types/reclamation.js +8 -0
  91. package/dist/types/result.js +2 -0
  92. package/dist/types/run-registry.js +2 -0
  93. package/dist/types/run.js +2 -0
  94. package/dist/types/sandbox.js +2 -0
  95. package/dist/types/schedule.js +2 -0
  96. package/dist/types/state-node.js +2 -0
  97. package/dist/types/topology.js +2 -0
  98. package/dist/types/trust.js +2 -0
  99. package/dist/types/workbench.js +2 -0
  100. package/dist/types/worker.js +2 -0
  101. package/dist/types/workflow-app.js +2 -0
  102. package/dist/types.js +43 -0
  103. package/dist/verifier-registry.js +46 -0
  104. package/dist/verifier.js +78 -0
  105. package/dist/version.js +8 -0
  106. package/dist/workbench-host.js +172 -0
  107. package/dist/workbench.js +190 -0
  108. package/dist/worker-isolation.js +1028 -0
  109. package/dist/workflow-api.js +98 -0
  110. package/dist/workflow-app-framework.js +626 -0
  111. package/docs/agent-delegation-drive.7.md +190 -0
  112. package/docs/agent-framework.md +176 -0
  113. package/docs/candidate-scoring.7.md +106 -0
  114. package/docs/canonical-workflow-apps.7.md +137 -0
  115. package/docs/capability-topology-registry.7.md +168 -0
  116. package/docs/cli-mcp-parity.7.md +373 -0
  117. package/docs/contract-migration-tooling.7.md +123 -0
  118. package/docs/control-plane-scheduling.7.md +110 -0
  119. package/docs/coordinator-blackboard.7.md +183 -0
  120. package/docs/dogfood/architecture-review-cool-workflow.md +16 -0
  121. package/docs/dogfood-one-real-repo.7.md +168 -0
  122. package/docs/durable-state-and-locking.7.md +107 -0
  123. package/docs/end-to-end-golden-path.7.md +117 -0
  124. package/docs/error-feedback.7.md +153 -0
  125. package/docs/evidence-adoption-reasoning-chain.7.md +270 -0
  126. package/docs/execution-backends.7.md +300 -0
  127. package/docs/getting-started.md +99 -0
  128. package/docs/index.md +41 -0
  129. package/docs/mcp-app-surface.7.md +235 -0
  130. package/docs/multi-agent-cli-mcp-surface.7.md +265 -0
  131. package/docs/multi-agent-eval-replay-harness.7.md +302 -0
  132. package/docs/multi-agent-operator-ux.7.md +314 -0
  133. package/docs/multi-agent-runtime-core.7.md +231 -0
  134. package/docs/multi-agent-topologies.7.md +103 -0
  135. package/docs/multi-agent-trust-policy-audit.7.md +154 -0
  136. package/docs/node-snapshot-diff-replay.7.md +135 -0
  137. package/docs/observability-cost-accounting.7.md +194 -0
  138. package/docs/operator-ux.7.md +180 -0
  139. package/docs/pipeline-runner.7.md +136 -0
  140. package/docs/project-index.md +261 -0
  141. package/docs/real-execution-backends.7.md +142 -0
  142. package/docs/release-and-migration.7.md +280 -0
  143. package/docs/release-tooling.7.md +159 -0
  144. package/docs/routines.md +48 -0
  145. package/docs/run-registry-control-plane.7.md +312 -0
  146. package/docs/run-retention-reclamation.7.md +191 -0
  147. package/docs/sandbox-profiles.7.md +137 -0
  148. package/docs/scheduled-tasks.md +80 -0
  149. package/docs/security-trust-hardening.7.md +117 -0
  150. package/docs/state-explosion-management.7.md +264 -0
  151. package/docs/state-node.7.md +96 -0
  152. package/docs/team-collaboration.7.md +207 -0
  153. package/docs/unix-principles.md +192 -0
  154. package/docs/verifier-gated-commit.7.md +140 -0
  155. package/docs/web-desktop-workbench.7.md +215 -0
  156. package/docs/worker-isolation.7.md +167 -0
  157. package/docs/workflow-app-framework.7.md +274 -0
  158. package/manifest/README.md +43 -0
  159. package/manifest/plugin.manifest.json +316 -0
  160. package/manifest/pricing.policy.json +14 -0
  161. package/package.json +79 -0
  162. package/scripts/agents/claude-p-agent.js +104 -0
  163. package/scripts/agents/claude-p-agent.sh +9 -0
  164. package/scripts/agents/cw-attest-keygen.js +55 -0
  165. package/scripts/agents/cw-attest-wrap.js +143 -0
  166. package/scripts/block-unapproved-tag.sh +39 -0
  167. package/scripts/bump-version.js +249 -0
  168. package/scripts/canonical-apps.js +171 -0
  169. package/scripts/cw.js +4 -0
  170. package/scripts/dist-drift-check.js +79 -0
  171. package/scripts/dogfood-architecture-review.js +237 -0
  172. package/scripts/dogfood-release.js +624 -0
  173. package/scripts/forward-ref-docs.js +73 -0
  174. package/scripts/gen-manifests.js +232 -0
  175. package/scripts/golden-path.js +300 -0
  176. package/scripts/mcp-server.js +4 -0
  177. package/scripts/new-feature.js +121 -0
  178. package/scripts/parity-check.js +213 -0
  179. package/scripts/release-check.js +118 -0
  180. package/scripts/release-flow.js +272 -0
  181. package/scripts/release-gate.sh +85 -0
  182. package/scripts/sync-project-index.js +387 -0
  183. package/scripts/validate-run-state-schema.js +126 -0
  184. package/scripts/verify-container-selfref.js +64 -0
  185. package/scripts/version-sync-check.js +237 -0
  186. package/skills/cool-workflow/SKILL.md +162 -0
  187. package/skills/cool-workflow/references/commands.md +282 -0
  188. package/tsconfig.json +16 -0
  189. package/ui/workbench/app.css +76 -0
  190. package/ui/workbench/app.js +159 -0
  191. package/ui/workbench/index.html +32 -0
  192. package/workflows/architecture-review.workflow.js +84 -0
  193. package/workflows/research-synthesis.workflow.js +47 -0
@@ -0,0 +1,265 @@
1
+ # Multi-Agent CLI + MCP Surface
2
+
3
+ CW v0.1.20 adds the preferred host-facing control loop for multi-agent work:
4
+
5
+ ```text
6
+ multi-agent run -> status -> step -> blackboard -> score -> select
7
+ ```
8
+
9
+ CW v0.1.25 extends this surface with State Explosion Management commands:
10
+ `summary refresh`, `summary show`, `blackboard summarize`,
11
+ `multi-agent summarize`, and `multi-agent graph --view <view>` (with optional
12
+ `--focus <id>` and `--depth <n>`). Matching MCP tools are `cw_summary_refresh`,
13
+ `cw_summary_show`, `cw_blackboard_summarize`, `cw_multi_agent_summarize`, and
14
+ `cw_multi_agent_graph_compact`. All responses keep source refs and expansion
15
+ hints. See [state-explosion-management.7.md](state-explosion-management.7.md).
16
+
17
+ CW v0.1.26 adds `multi-agent reasoning <run-id> [--evidence <id>] [--refresh]`
18
+ (MCP: `cw_evidence_reasoning`, `cw_evidence_reasoning_refresh`), which explains
19
+ *why* each evidence item was adopted, and an additive `rationaleStatus` field on
20
+ `multi-agent evidence` rows. See
21
+ [evidence-adoption-reasoning-chain.7.md](evidence-adoption-reasoning-chain.7.md).
22
+
23
+ This is userland over the existing kernel records. The low-level topology,
24
+ multi-agent, blackboard, candidate, audit, and commit primitives remain
25
+ available, but agent hosts should use this high-level surface when driving a
26
+ run.
27
+
28
+ ## CLI Loop
29
+
30
+ Create or attach a topology-backed run without spawning workers:
31
+
32
+ ```bash
33
+ node scripts/cw.js multi-agent run <run-id> --topology judge-panel --task <task-id>
34
+ node scripts/cw.js multi-agent run --app architecture-review --repo /path/to/repo --question "Review this" --topology map-reduce
35
+ ```
36
+
37
+ Read the combined host status:
38
+
39
+ ```bash
40
+ node scripts/cw.js multi-agent status <run-id>
41
+ node scripts/cw.js multi-agent status <run-id> --json
42
+ ```
43
+
44
+ Perform one deterministic step at a time:
45
+
46
+ ```bash
47
+ node scripts/cw.js multi-agent step <run-id> --sandbox readonly
48
+ ```
49
+
50
+ `step` may create a dispatch manifest, collect fanin, snapshot the blackboard,
51
+ register a candidate, score a candidate with existing verifier evidence, select
52
+ a scored candidate, or recommend the verifier-gated commit command. It never
53
+ spawns agents directly.
54
+
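A host script could drive this loop by alternating `status` and `step` until CW reports something it cannot resolve on its own. The sketch below is a hypothetical helper, not part of CW. It assumes `step` accepts `--json` the way `status` does, that `blockedReasons` is an array, and that `requiredHostAction` is empty or absent when nothing is required; the field shapes are not specified in this document.

```javascript
const { execFileSync } = require("node:child_process");

// Runs one CW command and parses its JSON output.
// Assumes the working directory is the package root, where scripts/cw.js lives.
function runCw(args) {
  const out = execFileSync("node", ["scripts/cw.js", ...args, "--json"], {
    encoding: "utf8",
  });
  return JSON.parse(out);
}

// Steps a run until CW reports a block or a required host action, or until
// maxSteps is reached. `run` is injectable so the loop can be exercised offline.
// Assumed shapes: blockedReasons is an array; requiredHostAction is falsy
// when the host has nothing to do.
function driveRun(runId, { run = runCw, maxSteps = 20 } = {}) {
  let last = run(["multi-agent", "status", runId]);
  for (let i = 0; i < maxSteps; i += 1) {
    const blocked =
      Array.isArray(last.blockedReasons) && last.blockedReasons.length > 0;
    if (blocked || last.requiredHostAction) {
      return { stoppedBecause: blocked ? "blocked" : "host-action", last };
    }
    last = run(["multi-agent", "step", runId, "--sandbox", "readonly"]);
  }
  return { stoppedBecause: "max-steps", last };
}
```

Because `step` never spawns agents itself, a loop like this only advances bookkeeping; the host still has to run workers and record their output when asked to.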
55
+ Work with the active blackboard when it is unambiguous:
56
+
57
+ ```bash
58
+ node scripts/cw.js multi-agent blackboard <run-id> summary
59
+ node scripts/cw.js multi-agent blackboard <run-id> topics
60
+ node scripts/cw.js multi-agent blackboard <run-id> post --topic <topic-id> --body "finding" --evidence <ref>
61
+ node scripts/cw.js multi-agent blackboard <run-id> add-artifact --topic <topic-id> --kind worker-result --path result.md
62
+ node scripts/cw.js multi-agent blackboard <run-id> snapshot
63
+ ```
64
+
65
+ Score and select explicitly:
66
+
67
+ ```bash
68
+ node scripts/cw.js multi-agent score <run-id> <candidate-id> --criterion correctness=1 --criterion evidence=1 --evidence <ref>
69
+ node scripts/cw.js multi-agent select <run-id> <candidate-id> --score <score-id> --reason "verifier-backed candidate"
70
+ node scripts/cw.js commit <run-id> --selection <selection-id> --reason "verified winner"
71
+ ```
72
+
73
+ ## Operator Inspection
74
+
75
+ v0.1.21 extends the host loop with focused operator commands:
76
+
77
+ ```bash
78
+ node scripts/cw.js multi-agent graph <run-id>
79
+ node scripts/cw.js multi-agent dependencies <run-id>
80
+ node scripts/cw.js multi-agent failures <run-id>
81
+ node scripts/cw.js multi-agent evidence <run-id>
82
+ ```
83
+
84
+ The human output is compact and operational: agent graph, dependencies, failed
85
+ or blocked agents, adopted evidence, missing evidence, and the next action.
86
+ Use `--json` or `--format json` for deterministic script output.
87
+
88
+ ## MCP Tools
89
+
90
+ MCP hosts should prefer:
91
+
92
+ - `cw_multi_agent_run`
93
+ - `cw_multi_agent_status`
94
+ - `cw_multi_agent_step`
95
+ - `cw_multi_agent_blackboard`
96
+ - `cw_multi_agent_score`
97
+ - `cw_multi_agent_select`
98
+ - `cw_multi_agent_graph`
99
+ - `cw_multi_agent_dependencies`
100
+ - `cw_multi_agent_failures`
101
+ - `cw_multi_agent_evidence`
102
+
103
+ The older `cw_multi_agent_*`, `cw_topology_*`, `cw_blackboard_*`, and
104
+ `cw_candidate_*` tools remain advanced primitives.
105
+
106
+ ## Stable Responses
107
+
108
+ Every high-level response is JSON and includes:
109
+
110
+ - `runId`
111
+ - active topology and multi-agent ids
112
+ - blackboard and topic ids
113
+ - candidate, selection, commit, and audit ids
114
+ - `state`, `performed`, `nextAction`, and `nextActions`
115
+ - `blockedReasons`, `requiredHostAction`, and `evidenceRequirements`
116
+ - state, report, blackboard, audit, ranking, worker manifest, and result paths
117
+ - combined topology, multi-agent, multi-agent operator, blackboard, worker,
118
+ candidate, feedback, commit, and audit summaries
119
+
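A host can check that a response carries the literally named fields from the list above before acting on it. The helper below is a hypothetical sketch, not part of CW. It checks only for the presence of those field names and assumes nothing about their types or values.

```javascript
// Hypothetical host-side check; not part of the CW API.
// Field names come from the list above; their types are not assumed.
const REQUIRED_FIELDS = [
  "runId",
  "state",
  "performed",
  "nextAction",
  "nextActions",
  "blockedReasons",
  "requiredHostAction",
  "evidenceRequirements",
];

// Returns the names of required fields that are absent from the response.
function missingResponseFields(response) {
  if (response === null || typeof response !== "object") {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter((name) => !(name in response));
}

// Example with a made-up payload (values are placeholders).
const sample = { runId: "run-1", state: "example", performed: [] };
console.log(missingResponseFields(sample));
```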
120
+ ## Fail-Closed Rules
121
+
122
+ The host surface fails closed when:
123
+
124
+ - active topology or blackboard state is ambiguous
125
+ - a fanout has incomplete role coverage
126
+ - worker output has not been recorded
127
+ - fanin lacks required evidence or blackboard links
128
+ - score evidence is missing
129
+ - selection lacks score or verifier readiness
130
+ - a verifier-gated commit is not ready
131
+
132
+ ## Smoke Coverage
133
+
134
+ `test/multi-agent-cli-mcp-surface-smoke.js` covers the full host loop over the
135
+ official `judge-panel` topology, CLI and MCP parity, ambiguous topology
136
+ failure, missing evidence failure, successful score/select, blackboard
137
+ artifact/message linkage, audit provenance, and Operator UX next actions. It is
138
+ included in `npm test` and `npm run release:check`.
139
+
140
+ `test/multi-agent-operator-ux-smoke.js` covers the v0.1.21 graph,
141
+ dependencies, failures, evidence adoption, report output, and MCP parity.
142
+
143
+ `test/multi-agent-trust-policy-audit-smoke.js` covers the v0.1.22
144
+ role-policy, blackboard-write, message-provenance, judge-rationale,
145
+ policy-violation, report, audit provenance, and MCP parity surface.
146
+
147
+ `test/multi-agent-eval-replay-harness-smoke.js` covers the v0.1.23 eval/replay
148
+ commands and MCP tools: snapshot, replay, compare, score, gate, report, and
149
+ controlled regression detection.
150
+ ## CLI ↔ MCP Parity (v0.1.27)
151
+
152
+ Every command and tool referenced above is declared in the v0.1.27 capability
153
+ registry (`src/capability-registry.ts`) and validated by `npm run parity:check`,
154
+ so `cw <cmd> --json` and the matching `cw_<tool>` result render one data source.
155
+ See [cli-mcp-parity.7.md](cli-mcp-parity.7.md).
156
+
157
+ ## Run Registry / Control Plane (v0.1.28)
158
+
159
+ The runs described here are indexed, searchable, resumable, archivable, and
160
+ rerunnable across repos by the v0.1.28 Run Registry / Control Plane, which derives
161
+ a fingerprinted, fail-closed index over the same per-run `.cw/runs/<id>/state.json`
162
+ source of truth. See [run-registry-control-plane.7.md](run-registry-control-plane.7.md).
163
+
164
+ ## Execution Backends (v0.1.29)
165
+
166
+ v0.1.29 lifts execution into a pluggable driver layer: one narrow `ExecutionBackend`
167
+ contract with interchangeable `node`/`bun`/`shell`/`container`/`remote`/`ci`
168
+ drivers, selected by `--backend` (parallel to `--sandbox`) and inspected via
169
+ `backend list|show|probe`. The result/evidence envelope is schema-identical across
170
+ backends; the backend id + sandbox attestation are recorded as provenance, so this
171
+ surface is unchanged regardless of which backend executed a run. See
172
+ [execution-backends.7.md](execution-backends.7.md).
173
+ ## Web / Desktop Workbench (v0.1.30)
174
+
175
+ v0.1.30 adds the Web / Desktop Workbench: a read-only, localhost-only human
176
+ console that renders this surface (and the other operator panels: run
177
+ graph, blackboard, worker logs, candidate compare, audit timeline) for any run,
178
+ reading the SAME capability `--json` payloads. It is a THIRD FRONT DOOR alongside
179
+ the CLI and MCP that holds no authoritative state and forks no schema: each panel
180
+ equals its `cw <cmd> --json` payload byte-for-byte (parity-gated), and refresh
181
+ re-derives everything from disk. See
182
+ [web-desktop-workbench.7.md](web-desktop-workbench.7.md).
183
+
184
+ ## Observability + Cost Accounting (v0.1.31)
185
+
186
+ v0.1.31 adds Observability + Cost Accounting: `metrics show`/`metrics summary`
187
+ derive durations, failure/verifier/acceptance rates (with sample counts and
188
+ fail-closed `n/a`), and host-attested token/cost from existing durable run state
189
+ — no metrics database, no collector daemon, no hidden counter. Usage is additive
190
+ and optional (absent ⇒ `unreported`, never 0); cost is `attested` (attested usage
191
+ × a recorded pricing policy) or clearly `estimated`, with pricing as policy. Both
192
+ verbs are parity-gated and render read-only in the v0.1.30 Workbench. See
193
+ [observability-cost-accounting.7.md](observability-cost-accounting.7.md).
194
+
195
+
196
+ ## Team Collaboration (v0.1.32)
197
+
198
+ v0.1.32 adds Team Collaboration: a host-attested actor and append-only
199
+ approvals/rejections/comments/handoffs provenance-linked to a durable target,
200
+ plus a review gate that STACKS ON the verifier gate — required approvals from
201
+ authorized roles, enforced inside `resolveCommitGate` AFTER the verifier checks
202
+ and never instead of them, failing closed on quorum/authority/self-approval and
203
+ recording who approved the very artifact that shipped. Policy (required approvals,
204
+ authorized roles, self-approval) is data, default off (pre-v0.1.32 behavior
205
+ unchanged). The verbs are parity-gated and render read-only in the v0.1.30
206
+ Workbench. See [Team Collaboration](team-collaboration.7.md).
207
+
208
+ ## Release Tooling (v0.1.33)
209
+
210
+ The per-tag mechanical surfaces (version bump across 17 surfaces, feature scaffold, and the forward-reference docs) become deterministic scripts, with a de-duplicated release gate. See release-tooling(7).
211
+
212
+ ## Real Execution Backend Integrations (v0.1.34)
213
+
214
+ The container/remote/ci backends really execute (docker/podman run, remote/CI POST-and-poll) under the sandbox contract, with byte-stable evidence vs node and fail-closed refusal when a runtime/endpoint is unavailable. See real-execution-backends(7).
215
+
216
+ ## Node Snapshot / Diff / Replay (v0.1.35)
217
+
218
+ Per-node snapshot, structural diff, and isolated deterministic replay over StateNode, reusing the v0.1.23 eval harness; fail-closed on source drift (valid|stale|absent). See node-snapshot-diff-replay(7).
219
+
220
+ ## Contract Migration Tooling (v0.1.36)
221
+
222
+ First-class declared migration registry (run-state + workflow-app) with per-edge compatibility proofs, fail-closed reachability, and a round-trip/non-destruction prover. See contract-migration-tooling(7).
223
+
224
+ ## Control-Plane Scheduling (v0.1.37)
225
+
226
+ Priority, concurrency limits, lease lifecycle, retry/backoff, and fail-closed park over the v0.1.28 Run Registry queue; policy-as-data, deterministic. See control-plane-scheduling(7).
227
+
228
+ ## Agent Delegation Drive (v0.1.38)
229
+
230
+ Spawn an external agent process per worker, capture result.md + attestation, and auto-drive plan->dispatch->fulfill->accept->commit.
231
+
232
+ ## Run Retention & Provable Reclamation (v0.1.39)
233
+
234
+ Tiered, append-only, cryptographically-verifiable run reclamation: seal the audit skeleton, free the reconstructable bulk, prove it.
235
+
236
+ ## Durable State & Locking (v0.1.40)
237
+
238
+ Atomic temp->rename writes + fsync-durability for authoritative stores; portable stale-stealing file lock serializing the cross-process read-modify-write stores.
239
+
240
+ ## Self-Audit Hardening & Pure-Router Decomposition (v0.1.41)
241
+
242
+ Evidence grounding + durable audit append + symlink-hardened containment + deterministic worker ids + recursive redaction; BackendRegistry self-describing drivers (no per-id switches); orchestrator god-object decomposed into per-domain operation modules (pure loadRun->delegate router).
243
+
244
+ ## Robust Result Ingest (v0.1.42)
245
+
246
+ Capture findings/evidence from any reasonable agent shape (alt keys + prose); CW derives grounded evidence itself and warns on empty capture. Closes the v0.1.41 live-drive 'accepted with 0 captured' failure.
247
+
248
+ ## No-False-Green Gate & Launch Prep (v0.1.43)
249
+
250
+ Hard gate blocking empty-capture verifier-gated commits, plus quickstart and launch-prep docs.
251
+
252
+ ## Release-Gate Determinism & Agents Vendor (v0.1.44)
253
+
254
+ Release-readiness checks now validate the committed blob (`git show HEAD:<path>`) instead of the mutable working tree — eliminating false-red/false-green from concurrent working-tree writes (iCloud/Spotlight/editor). Adds the `agents` vendor manifest target: a generated `.agents/plugins/cool-workflow/` adapter giving any non-Claude AI agent one common interface to CW.
255
+
256
+ ## P1-P2 Fixes & CI Content Surfaces (v0.1.49)
257
+
258
+ Migration DAG with reversible edges (v0.1.45), capability auto-discovery (v0.1.46), vendor-adapter registry (v0.1.47), state auto-compaction and P2 fixes (v0.1.48), plus CI content-surface determinism hardening (v0.1.49).
259
+ 0.1.51
260
+
261
+ 0.1.76
262
+
263
+ 0.1.77
264
+
265
+ 0.1.78
@@ -0,0 +1,302 @@
1
+ # Multi-Agent Eval & Replay Harness
2
+
3
+ CW v0.1.23 added a deterministic replay harness for topology-backed
4
+ multi-agent runs. It turns a completed run into plain JSON evidence that can be
5
+ replayed without live agents, compared with normalized rules, scored, and used
6
+ as a release gate.
7
+
8
+ CW v0.1.25 extends the harness with State Explosion Management metrics so the
9
+ derived summary layer is regression-gated alongside the raw run:
10
+ `summary_freshness`, `compact_graph_parity`, `blackboard_digest_parity`,
11
+ `critical_path_parity`, `evidence_digest_parity`, and `expansion_ref_integrity`.
12
+ Pre-0.1.25 snapshots load with empty summary sections, so old fixtures stay
13
+ backward compatible. See
14
+ [state-explosion-management.7.md](state-explosion-management.7.md).
15
+
16
+ CW v0.1.26 adds Evidence Adoption Reasoning Chain metrics: `reasoning_freshness`,
17
+ `reasoning_chain_parity`, and `reasoning_unexplained_parity`. Pre-0.1.26
18
+ snapshots load with empty reasoning sections. See
19
+ [evidence-adoption-reasoning-chain.7.md](evidence-adoption-reasoning-chain.7.md).
20
+
21
+ The harness is intentionally file-first:
22
+
23
+ - snapshots, replay runs, comparisons, scores, findings, gates, and reports are
24
+ stored under `.cw/evals/<suite-id>/`
25
+ - the baseline run is not mutated during replay
26
+ - replay output is written to an isolated `replay/` directory
27
+ - every CLI command supports deterministic JSON with `--json` or
28
+ `--format json`
29
+ - MCP tools return JSON only and include generated artifact paths
30
+
31
+ ## Commands
32
+
33
+ Create a snapshot from a multi-agent run:
34
+
35
+ ```bash
36
+ node scripts/cw.js eval snapshot <run-id> --id <suite-id>
37
+ node scripts/cw.js eval snapshot <run-id> --id <suite-id> --json
38
+ ```
39
+
40
+ Replay without live agents:
41
+
42
+ ```bash
43
+ node scripts/cw.js eval replay .cw/evals/<suite-id>/snapshot.json
44
+ ```
45
+
46
+ Compare, score, gate, and report:
47
+
48
+ ```bash
49
+ node scripts/cw.js eval compare \
50
+ .cw/evals/<suite-id>/snapshot.json \
51
+ .cw/evals/<suite-id>/replay-run.json
52
+
53
+ node scripts/cw.js eval score .cw/evals/<suite-id>/replay-run.json
54
+ node scripts/cw.js eval gate .cw/evals/<suite-id>
55
+ node scripts/cw.js eval report .cw/evals/<suite-id>/replay-run.json
56
+ ```
57
+
58
+ `npm run eval:replay` runs the deterministic smoke suite and is included in
59
+ `npm test` and `npm run release:check`.
60
+
61
+ Human output uses stable panels:
62
+
63
+ ```text
64
+ Eval Suite
65
+ Replay Status
66
+ Graph Comparison
67
+ Evidence Comparison
68
+ Trust / Policy / Audit Comparison
69
+ Candidate Score Comparison
70
+ Selection / Commit Gate
71
+ Regression Findings
72
+ Final Verdict
73
+ Next Action
74
+ ```
75
+
76
+ ## Artifacts
77
+
78
+ Each suite writes predictable files:
79
+
80
+ - `suite.json`
81
+ - `snapshot.json`
82
+ - `replay-run.json`
83
+ - `comparison.json`
84
+ - `score.json`
85
+ - `findings.json`
86
+ - `gate.json`
87
+ - `report.md`
88
+
89
+ The snapshot captures workflow app identity, inputs, topology shape, roles,
90
+ groups, memberships, fanout/fanin state, blackboard records, worker outputs,
91
+ candidate scores, selection rationale, verifier-gated commit inputs,
92
+ trust/policy/audit records, expected operator summaries, evidence adoption, and
93
+ report sections.
94
+
95
+ ## Comparison Rules
96
+
97
+ The comparison checks:
98
+
99
+ - topology id and topology run shape
100
+ - roles, groups, memberships, fanout, and fanin records
101
+ - dependency edges and failure rows
102
+ - blackboard records and message provenance
103
+ - role policies, permission decisions, write audit, judge rationale, panel
104
+ decisions, and policy violations
105
+ - evidence adoption status
106
+ - candidate scores, selected candidate, and selection rationale
107
+ - verifier-gated commit readiness
108
+ - report sections
109
+
110
+ Normalization removes unstable paths, timestamps, generated temp roots, and
111
+ machine-local directories. It does not hide changed evidence, policy,
112
+ selection, scoring, or commit-gate behavior.
113
+
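The idea behind normalization can be illustrated with a small recursive pass over a JSON value. This is a sketch only: CW's actual normalization rules are not shown in this document, and the function name, the `<root>` and `<timestamp>` placeholders, and the ISO-8601 timestamp pattern are assumptions made for the example.

```javascript
// Illustrative only; not CW's real normalization code.
const ISO_TIMESTAMP =
  /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})/g;

// Recursively copies a JSON value, replacing ISO-8601 timestamps and a
// machine-local root prefix with stable placeholders. Everything else,
// including scores and selections, is preserved unchanged.
function normalize(value, localRoot) {
  if (typeof value === "string") {
    const rooted = localRoot ? value.split(localRoot).join("<root>") : value;
    return rooted.replace(ISO_TIMESTAMP, "<timestamp>");
  }
  if (Array.isArray(value)) {
    return value.map((item) => normalize(item, localRoot));
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, item]) => [key, normalize(item, localRoot)])
    );
  }
  return value;
}
```

Two records that differ only in temp root and timestamp normalize to the same value, while a changed score still shows up as a difference.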
114
+ ## Scoring
115
+
116
+ Scores are deterministic metrics:
117
+
118
+ - `replay_completed`
119
+ - `graph_parity`
120
+ - `role_parity`
121
+ - `group_parity`
122
+ - `membership_parity`
123
+ - `fanout_parity`
124
+ - `fanin_parity`
125
+ - `dependency_parity`
126
+ - `failure_parity`
127
+ - `blackboard_record_parity`
128
+ - `evidence_adoption_parity`
129
+ - `trust_audit_parity`
130
+ - `role_policy_parity`
131
+ - `permission_decision_parity`
132
+ - `policy_violation_parity`
133
+ - `blackboard_provenance_parity`
134
+ - `judge_rationale_parity`
135
+ - `panel_decision_parity`
136
+ - `candidate_score_parity`
137
+ - `selection_parity`
138
+ - `verifier_commit_gate_parity`
139
+ - `report_parity`
140
+
141
+ Each metric returns `id`, `status`, `score`, `maxScore`, `reason`, evidence
142
+ refs, baseline refs, and replay refs.
143
+
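A consumer of `score.json` might roll the per-metric records up into a single summary. The helper below is a hypothetical sketch rather than CW's scoring code. It relies only on the `id`, `score`, and `maxScore` fields named above and assumes `score` and `maxScore` are numbers.

```javascript
// Hypothetical aggregation over metric records; not CW's scoring code.
// Uses only the id, score, and maxScore fields named in the text above.
function summarizeMetrics(metrics) {
  let score = 0;
  let maxScore = 0;
  const belowMax = [];
  for (const metric of metrics) {
    score += metric.score;
    maxScore += metric.maxScore;
    // Any metric short of its maximum is listed for follow-up.
    if (metric.score < metric.maxScore) {
      belowMax.push(metric.id);
    }
  }
  return { score, maxScore, belowMax };
}

// Example with made-up values for two of the metric ids listed above.
console.log(
  summarizeMetrics([
    { id: "graph_parity", score: 1, maxScore: 1 },
    { id: "selection_parity", score: 0, maxScore: 1 },
  ])
);
```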
144
+ ## Gate
145
+
146
+ `eval gate` fails closed when replay artifacts are missing or when comparison
147
+ findings show a regression. This includes missing judge rationale, changed
148
+ selected candidate, changed evidence adoption, changed policy violations,
149
+ missing provenance, lost verifier-gated commit readiness, or graph/dependency
150
+ loss.
151
+
152
+ Improvements can be represented as changed findings in future suites, but they
153
+ must be visible in `score.json`, `findings.json`, and `report.md` before a
154
+ release gate can accept them.
155
+
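The fail-closed behaviour described above can be sketched as a check that refuses to pass unless every expected file exists and no finding is a regression. This is a hypothetical illustration, not the `eval gate` implementation. Which files count as required, and the assumption that `findings.json` is an array of objects with a boolean `regression` flag, are both invented for the example.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Assumed for illustration: the set of required files and the shape of
// findings.json (an array of objects with a boolean `regression` flag).
const REQUIRED_ARTIFACTS = [
  "snapshot.json",
  "replay-run.json",
  "comparison.json",
  "findings.json",
];

// Passes only when every artifact exists, findings parse, and none regress.
// Any missing or unreadable input yields a failure rather than a pass.
function evaluateGate(suiteDir) {
  const missing = REQUIRED_ARTIFACTS.filter(
    (name) => !fs.existsSync(path.join(suiteDir, name))
  );
  if (missing.length > 0) {
    return { pass: false, reason: `missing artifacts: ${missing.join(", ")}` };
  }
  let findings;
  try {
    findings = JSON.parse(
      fs.readFileSync(path.join(suiteDir, "findings.json"), "utf8")
    );
  } catch (error) {
    return { pass: false, reason: "findings.json is unreadable" };
  }
  if (!Array.isArray(findings)) {
    return { pass: false, reason: "findings.json has an unexpected shape" };
  }
  const regressions = findings.filter(
    (finding) => finding && finding.regression === true
  );
  return regressions.length > 0
    ? { pass: false, reason: `${regressions.length} regression finding(s)` }
    : { pass: true, reason: "no regressions found" };
}
```

The design point is that every error path returns a failure, so an incomplete suite directory can never be mistaken for a clean one.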
156
+ ## MCP Parity
157
+
158
+ The MCP surface mirrors the CLI:
159
+
160
+ - `cw_eval_snapshot`
161
+ - `cw_eval_replay`
162
+ - `cw_eval_compare`
163
+ - `cw_eval_score`
164
+ - `cw_eval_gate`
165
+ - `cw_eval_report`
166
+
167
+ MCP responses are deterministic JSON and include artifact paths.
168
+
169
+ ## Release Use
170
+
171
+ Use this harness after a topology-backed run reaches score, selection, and a
172
+ verifier-gated commit:
173
+
174
+ ```bash
175
+ node scripts/cw.js eval snapshot <run-id> --id release-replay
176
+ node scripts/cw.js eval replay .cw/evals/release-replay/snapshot.json
177
+ node scripts/cw.js eval compare .cw/evals/release-replay/snapshot.json .cw/evals/release-replay/replay-run.json
178
+ node scripts/cw.js eval score .cw/evals/release-replay/replay-run.json
179
+ node scripts/cw.js eval gate .cw/evals/release-replay
180
+ node scripts/cw.js eval report .cw/evals/release-replay/replay-run.json
181
+ ```
182
+
183
+ The gate proves the replay completed, graph/dependencies stayed stable,
184
+ evidence adoption stayed traceable, trust/policy/audit records remained
185
+ explainable, judge rationale is present, scoring/selection did not regress, and
186
+ verifier-gated commit readiness still holds.
187
+ ## CLI ↔ MCP Parity (v0.1.27)
188
+
189
+ Every command and tool referenced above is declared in the v0.1.27 capability
190
+ registry (`src/capability-registry.ts`) and validated by `npm run parity:check`,
191
+ so `cw <cmd> --json` and the matching `cw_<tool>` result render one data source.
192
+ See [cli-mcp-parity.7.md](cli-mcp-parity.7.md).
193
+
194
+ ## Run Registry / Control Plane (v0.1.28)
195
+
196
+ The runs described here are indexed, searchable, resumable, archivable, and
197
+ rerunnable across repos by the v0.1.28 Run Registry / Control Plane, which derives
198
+ a fingerprinted, fail-closed index over the same per-run `.cw/runs/<id>/state.json`
199
+ source of truth. See [run-registry-control-plane.7.md](run-registry-control-plane.7.md).
200
+
201
+ ## Execution Backends (v0.1.29)
202
+
203
+ v0.1.29 lifts execution into a pluggable driver layer: one narrow `ExecutionBackend`
204
+ contract with interchangeable `node`/`bun`/`shell`/`container`/`remote`/`ci`
205
+ drivers, selected by `--backend` (parallel to `--sandbox`) and inspected via
206
+ `backend list|show|probe`. The result/evidence envelope is schema-identical across
207
+ backends; the backend id + sandbox attestation are recorded as provenance, so this
208
+ surface is unchanged regardless of which backend executed a run. See
209
+ [execution-backends.7.md](execution-backends.7.md).
210
+ ## Web / Desktop Workbench (v0.1.30)
211
+
212
+ v0.1.30 adds the Web / Desktop Workbench: a read-only, localhost-only human
213
+ console that renders this surface (and the other operator panels: run
214
+ graph, blackboard, worker logs, candidate compare, audit timeline) for any run,
215
+ reading the SAME capability `--json` payloads. It is a THIRD FRONT DOOR alongside
216
+ the CLI and MCP that holds no authoritative state and forks no schema: each panel
217
+ equals its `cw <cmd> --json` payload byte-for-byte (parity-gated), and refresh
218
+ re-derives everything from disk. See
219
+ [web-desktop-workbench.7.md](web-desktop-workbench.7.md).
220
+
221
+ ## Observability + Cost Accounting (v0.1.31)
222
+
223
+ v0.1.31 adds Observability + Cost Accounting: `metrics show`/`metrics summary`
224
+ derive durations, failure/verifier/acceptance rates (with sample counts and
225
+ fail-closed `n/a`), and host-attested token/cost from existing durable run state
226
+ — no metrics database, no collector daemon, no hidden counter. Usage is additive
227
+ and optional (absent ⇒ `unreported`, never 0); cost is `attested` (attested usage
228
+ × a recorded pricing policy) or clearly `estimated`, with pricing as policy. Both
229
+ verbs are parity-gated and render read-only in the v0.1.30 Workbench. See
230
+ [observability-cost-accounting.7.md](observability-cost-accounting.7.md).
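
The accounting rules above (absent usage is `unreported` rather than 0, cost is `attested` or `estimated`, rates carry a sample count and fall back to `n/a`) can be sketched roughly as follows. The types, field names, and per-million pricing unit are assumptions made for illustration, not the package's actual schema.

```typescript
// Hypothetical shapes; the real payload schema lives in the package docs.
type UsageSketch = { inputTokens: number; outputTokens: number; attested: boolean };
type PricingPolicySketch = { id: string; usdPerMillionInput: number; usdPerMillionOutput: number };
type CostSketch =
  | { kind: "unreported" }
  | { kind: "attested" | "estimated"; usd: number; pricingPolicy: string };

// Absent usage yields "unreported", never a cost of 0.
function deriveCost(usage: UsageSketch | undefined, pricing: PricingPolicySketch): CostSketch {
  if (usage === undefined) return { kind: "unreported" };
  const usd =
    (usage.inputTokens / 1_000_000) * pricing.usdPerMillionInput +
    (usage.outputTokens / 1_000_000) * pricing.usdPerMillionOutput;
  // The recorded pricing policy id travels with the number so it stays explainable.
  return { kind: usage.attested ? "attested" : "estimated", usd, pricingPolicy: pricing.id };
}

// A rate always carries its sample count; with no samples it is "n/a", not 0.
function deriveRate(hits: number, samples: number): { value: number | "n/a"; n: number } {
  return samples === 0 ? { value: "n/a", n: 0 } : { value: hits / samples, n: samples };
}
```

Keeping `unreported` and `n/a` as distinct values, rather than collapsing them to zero, is what lets a reader tell "nothing was measured" apart from "measured and found to be zero".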
231
+
232
+
233
+ ## Team Collaboration (v0.1.32)
234
+
235
+ v0.1.32 adds Team Collaboration: a host-attested actor and append-only
236
+ approvals/rejections/comments/handoffs provenance-linked to a durable target,
237
+ plus a review gate that STACKS ON the verifier gate — required approvals from
238
+ authorized roles, enforced inside `resolveCommitGate` AFTER the verifier checks
239
+ and never instead of them, failing closed on quorum/authority/self-approval and
240
+ recording who approved the very artifact that shipped. Policy (required approvals,
241
+ authorized roles, self-approval) is data, default off (pre-v0.1.32 behavior
242
+ unchanged). The verbs are parity-gated and render read-only in the v0.1.30
243
+ Workbench. See [Team Collaboration](team-collaboration.7.md).
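
A minimal sketch of how a review gate can stack on a verifier gate, under stated assumptions: `sketchCommitGate` and its types are hypothetical and are not the package's `resolveCommitGate`. Here an approval from an unauthorized role, or a disallowed self-approval, simply does not count toward quorum; the real gate may instead reject such approvals outright.

```typescript
type ApprovalSketch = { actor: string; role: string };
type ReviewPolicySketch = {
  requiredApprovals: number;
  authorizedRoles: string[];
  allowSelfApproval: boolean;
};
type GateResultSketch = { allowed: boolean; reason: string };

function sketchCommitGate(
  verifierPassed: boolean,
  author: string,
  approvals: ApprovalSketch[],
  policy?: ReviewPolicySketch, // undefined models "default off"
): GateResultSketch {
  // The verifier check always runs first; approvals can never replace it.
  if (!verifierPassed) return { allowed: false, reason: "verifier gate failed" };
  if (!policy) return { allowed: true, reason: "verifier gate passed; review policy off" };

  // Count distinct actors whose approval is both authorized and not a barred self-approval.
  const counted = new Set<string>();
  for (const approval of approvals) {
    if (!policy.authorizedRoles.includes(approval.role)) continue;
    if (approval.actor === author && !policy.allowSelfApproval) continue;
    counted.add(approval.actor);
  }
  if (counted.size < policy.requiredApprovals) {
    return { allowed: false, reason: `quorum not met: ${counted.size}/${policy.requiredApprovals}` };
  }
  return { allowed: true, reason: `approved by ${[...counted].sort().join(", ")}` };
}
```

The ordering is the point of the sketch: the review check is reached only after the verifier check has passed, so a full set of approvals cannot push through a commit the verifier rejected.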
244
+
245
+ ## Release Tooling (v0.1.33)
246
+
247
+ The per-tag mechanical surfaces (version bump across 17 surfaces, feature scaffold, and the forward-reference docs) become deterministic scripts, with a de-duplicated release gate. See release-tooling(7).
248
+
249
+ ## Real Execution Backend Integrations (v0.1.34)
250
+
251
+ The container/remote/ci backends now actually execute (docker/podman run, remote/CI POST-and-poll) under the sandbox contract, with evidence that is byte-stable against the node backend and fail-closed refusal when a runtime or endpoint is unavailable. See real-execution-backends(7).
252
+
253
+ ## Node Snapshot / Diff / Replay (v0.1.35)
254
+
255
+ Per-node snapshot, structural diff, and isolated deterministic replay over StateNode, reusing the v0.1.23 eval harness; fails closed on source drift (valid|stale|absent). See node-snapshot-diff-replay(7).
256
+
257
+ ## Contract Migration Tooling (v0.1.36)
258
+
259
+ A first-class declared migration registry (run-state + workflow-app) with per-edge compatibility proofs, fail-closed reachability, and a round-trip/non-destruction prover. See contract-migration-tooling(7).
260
+
261
+ ## Control-Plane Scheduling (v0.1.37)
262
+
263
+ Priority, concurrency limits, lease lifecycle, retry/backoff, and fail-closed park over the v0.1.28 Run Registry queue; policy-as-data, deterministic. See control-plane-scheduling(7).
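
One way to picture deterministic retry/backoff with a fail-closed park is the sketch below. The policy fields and the `nextAttempt` function are hypothetical, chosen only to illustrate policy-as-data; the actual scheduler policy is documented in control-plane-scheduling(7). There is deliberately no random jitter, so the same inputs always give the same decision.

```typescript
// Hypothetical policy record: plain data, no code.
type RetryPolicySketch = {
  baseDelayMs: number;
  factor: number;
  maxDelayMs: number;
  maxAttempts: number;
};

type RetryDecisionSketch =
  | { action: "retry"; delayMs: number }
  | { action: "park"; reason: string };

// Decide what happens after `failedAttempts` failures of the same run.
function nextAttempt(failedAttempts: number, policy: RetryPolicySketch): RetryDecisionSketch {
  // Fail closed: once the attempt budget is spent, park instead of retrying forever.
  if (failedAttempts >= policy.maxAttempts) {
    return { action: "park", reason: `attempt budget exhausted (${policy.maxAttempts})` };
  }
  // Exponential backoff, capped; the first failure waits exactly baseDelayMs.
  const exponent = Math.max(failedAttempts - 1, 0);
  const delayMs = Math.min(policy.baseDelayMs * policy.factor ** exponent, policy.maxDelayMs);
  return { action: "retry", delayMs };
}
```

With `{ baseDelayMs: 1000, factor: 2, maxDelayMs: 5000, maxAttempts: 4 }`, the first three failures wait 1000, 2000, and 4000 ms, and the fourth failure parks the run.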
264
+
265
+ ## Agent Delegation Drive (v0.1.38)
266
+
267
+ Spawns an external agent process per worker, captures result.md plus its attestation, and auto-drives plan->dispatch->fulfill->accept->commit.
268
+
269
+ ## Run Retention & Provable Reclamation (v0.1.39)
270
+
271
+ Tiered, append-only, cryptographically verifiable run reclamation: seal the audit skeleton, free the reconstructable bulk, and prove it.
272
+
273
+ ## Durable State & Locking (v0.1.40)
274
+
275
+ Atomic temp->rename writes plus fsync durability for authoritative stores; a portable stale-stealing file lock serializes the cross-process read-modify-write stores.
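
The temp->rename pattern can be sketched with Node's standard `fs` calls as shown below. `atomicWriteFileSync` is a hypothetical helper written for this illustration, not the package's implementation, and it leaves out the file lock mentioned above. It assumes the temp file sits in the same directory as the target so that the rename stays on one filesystem.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: readers see either the old file or the complete new one.
function atomicWriteFileSync(target: string, data: string): void {
  const dir = path.dirname(target);
  const tmp = path.join(dir, `.${path.basename(target)}.${process.pid}.tmp`);

  // 1. Write the full contents to a temp file and flush it to disk.
  const fd = fs.openSync(tmp, "w");
  try {
    fs.writeFileSync(fd, data, "utf8");
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }

  // 2. Swap the temp file into place in a single rename.
  fs.renameSync(tmp, target);

  // 3. Best effort: flush the directory entry too (not supported on every platform).
  try {
    const dirFd = fs.openSync(dir, "r");
    try {
      fs.fsyncSync(dirFd);
    } finally {
      fs.closeSync(dirFd);
    }
  } catch {
    // Directory fsync is unavailable on some systems; the file data is already flushed.
  }
}

// Usage: overwrite a state file twice inside a scratch directory.
const scratchDir = fs.mkdtempSync(path.join(os.tmpdir(), "cw-atomic-"));
const statePath = path.join(scratchDir, "state.json");
atomicWriteFileSync(statePath, JSON.stringify({ status: "running" }));
atomicWriteFileSync(statePath, JSON.stringify({ status: "done" }));
```

Because the data is flushed before the rename, a crash at any point should leave either the previous `state.json` or the new one, not a truncated mix of the two.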
276
+
277
+ ## Self-Audit Hardening & Pure-Router Decomposition (v0.1.41)
278
+
279
+ Evidence grounding, durable audit append, symlink-hardened containment, deterministic worker ids, and recursive redaction; BackendRegistry with self-describing drivers (no per-id switches); the orchestrator god-object is decomposed into per-domain operation modules (a pure loadRun->delegate router).
280
+
281
+ ## Robust Result Ingest (v0.1.42)
282
+
283
+ Captures findings/evidence from any reasonable agent shape (alt keys + prose); CW derives grounded evidence itself and warns on empty capture. This closes the v0.1.41 live-drive 'accepted with 0 captured' failure.
284
+
285
+ ## No-False-Green Gate & Launch Prep (v0.1.43)
286
+
287
+ Hard gate blocking empty-capture verifier-gated commits, plus quickstart and launch-prep docs.
288
+
289
+ ## Release-Gate Determinism & Agents Vendor (v0.1.44)
290
+
291
+ Release-readiness checks now validate the committed blob (`git show HEAD:<path>`) instead of the mutable working tree — eliminating false-red/false-green from concurrent working-tree writes (iCloud/Spotlight/editor). Adds the `agents` vendor manifest target: a generated `.agents/plugins/cool-workflow/` adapter giving any non-Claude AI agent one common interface to CW.
292
+
293
+ ## P1-P2 Fixes & CI Content Surfaces (v0.1.49)
294
+
295
+ Migration DAG with reversible edges (v0.1.45), capability auto-discovery (v0.1.46), vendor-adapter registry (v0.1.47), state auto-compaction and P2 fixes (v0.1.48), plus CI content-surface determinism hardening (v0.1.49).
296
+ 0.1.51
297
+
298
+ 0.1.76
299
+
300
+ 0.1.77
301
+
302
+ 0.1.78