cool-workflow 0.1.78
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +20 -0
- package/.codex-plugin/mcp.json +10 -0
- package/.codex-plugin/plugin.json +38 -0
- package/.mcp.json +10 -0
- package/LICENSE +24 -0
- package/README.md +638 -0
- package/apps/architecture-review/app.json +51 -0
- package/apps/architecture-review/workflow.js +116 -0
- package/apps/end-to-end-golden-path/app.json +30 -0
- package/apps/end-to-end-golden-path/workflow.js +33 -0
- package/apps/pr-review-fix-ci/app.json +59 -0
- package/apps/pr-review-fix-ci/workflow.js +90 -0
- package/apps/release-cut/app.json +54 -0
- package/apps/release-cut/workflow.js +82 -0
- package/apps/research-synthesis/app.json +50 -0
- package/apps/research-synthesis/workflow.js +76 -0
- package/apps/workflow-app-framework-demo/app.json +29 -0
- package/apps/workflow-app-framework-demo/workflow.js +44 -0
- package/dist/agent-config.js +223 -0
- package/dist/candidate-scoring.js +715 -0
- package/dist/capability-core.js +630 -0
- package/dist/capability-dispatcher.js +86 -0
- package/dist/capability-registry.js +523 -0
- package/dist/cli.js +1276 -0
- package/dist/collaboration.js +727 -0
- package/dist/commit.js +570 -0
- package/dist/contract-migration.js +234 -0
- package/dist/coordinator.js +1163 -0
- package/dist/daemon.js +44 -0
- package/dist/dispatch.js +201 -0
- package/dist/drive.js +503 -0
- package/dist/error-feedback.js +415 -0
- package/dist/evidence-grounding.js +179 -0
- package/dist/evidence-reasoning.js +733 -0
- package/dist/execution-backend.js +1279 -0
- package/dist/harness.js +61 -0
- package/dist/mcp-server.js +1615 -0
- package/dist/multi-agent-eval.js +857 -0
- package/dist/multi-agent-host.js +764 -0
- package/dist/multi-agent-operator-ux.js +537 -0
- package/dist/multi-agent-trust.js +366 -0
- package/dist/multi-agent.js +1173 -0
- package/dist/node-snapshot.js +270 -0
- package/dist/observability.js +922 -0
- package/dist/operator-ux.js +971 -0
- package/dist/orchestrator/audit-operations.js +182 -0
- package/dist/orchestrator/candidate-operations.js +117 -0
- package/dist/orchestrator/cli-options.js +288 -0
- package/dist/orchestrator/collaboration-operations.js +86 -0
- package/dist/orchestrator/feedback-operations.js +81 -0
- package/dist/orchestrator/host-operations.js +78 -0
- package/dist/orchestrator/lifecycle-operations.js +462 -0
- package/dist/orchestrator/migration-operations.js +44 -0
- package/dist/orchestrator/multi-agent-operations.js +362 -0
- package/dist/orchestrator/report.js +369 -0
- package/dist/orchestrator/topology-operations.js +84 -0
- package/dist/orchestrator.js +874 -0
- package/dist/pipeline-contract.js +92 -0
- package/dist/pipeline-runner.js +285 -0
- package/dist/reclamation.js +882 -0
- package/dist/result-normalize.js +194 -0
- package/dist/run-export.js +64 -0
- package/dist/run-registry.js +1347 -0
- package/dist/run-state-schema.js +67 -0
- package/dist/sandbox-profile.js +471 -0
- package/dist/scheduler.js +266 -0
- package/dist/scheduling.js +184 -0
- package/dist/schema-validate.js +98 -0
- package/dist/state-explosion.js +1213 -0
- package/dist/state-migrations.js +463 -0
- package/dist/state-node.js +301 -0
- package/dist/state.js +308 -0
- package/dist/telemetry-attestation.js +156 -0
- package/dist/telemetry-ledger.js +145 -0
- package/dist/topology.js +527 -0
- package/dist/triggers.js +159 -0
- package/dist/trust-audit.js +475 -0
- package/dist/types/blackboard.js +2 -0
- package/dist/types/boundary.js +29 -0
- package/dist/types/candidate.js +2 -0
- package/dist/types/collaboration.js +2 -0
- package/dist/types/core.js +2 -0
- package/dist/types/drive.js +10 -0
- package/dist/types/error-feedback.js +2 -0
- package/dist/types/evidence-reasoning.js +2 -0
- package/dist/types/execution-backend.js +2 -0
- package/dist/types/multi-agent.js +2 -0
- package/dist/types/observability.js +2 -0
- package/dist/types/pipeline.js +2 -0
- package/dist/types/reclamation.js +8 -0
- package/dist/types/result.js +2 -0
- package/dist/types/run-registry.js +2 -0
- package/dist/types/run.js +2 -0
- package/dist/types/sandbox.js +2 -0
- package/dist/types/schedule.js +2 -0
- package/dist/types/state-node.js +2 -0
- package/dist/types/topology.js +2 -0
- package/dist/types/trust.js +2 -0
- package/dist/types/workbench.js +2 -0
- package/dist/types/worker.js +2 -0
- package/dist/types/workflow-app.js +2 -0
- package/dist/types.js +43 -0
- package/dist/verifier-registry.js +46 -0
- package/dist/verifier.js +78 -0
- package/dist/version.js +8 -0
- package/dist/workbench-host.js +172 -0
- package/dist/workbench.js +190 -0
- package/dist/worker-isolation.js +1028 -0
- package/dist/workflow-api.js +98 -0
- package/dist/workflow-app-framework.js +626 -0
- package/docs/agent-delegation-drive.7.md +190 -0
- package/docs/agent-framework.md +176 -0
- package/docs/candidate-scoring.7.md +106 -0
- package/docs/canonical-workflow-apps.7.md +137 -0
- package/docs/capability-topology-registry.7.md +168 -0
- package/docs/cli-mcp-parity.7.md +373 -0
- package/docs/contract-migration-tooling.7.md +123 -0
- package/docs/control-plane-scheduling.7.md +110 -0
- package/docs/coordinator-blackboard.7.md +183 -0
- package/docs/dogfood/architecture-review-cool-workflow.md +16 -0
- package/docs/dogfood-one-real-repo.7.md +168 -0
- package/docs/durable-state-and-locking.7.md +107 -0
- package/docs/end-to-end-golden-path.7.md +117 -0
- package/docs/error-feedback.7.md +153 -0
- package/docs/evidence-adoption-reasoning-chain.7.md +270 -0
- package/docs/execution-backends.7.md +300 -0
- package/docs/getting-started.md +99 -0
- package/docs/index.md +41 -0
- package/docs/mcp-app-surface.7.md +235 -0
- package/docs/multi-agent-cli-mcp-surface.7.md +265 -0
- package/docs/multi-agent-eval-replay-harness.7.md +302 -0
- package/docs/multi-agent-operator-ux.7.md +314 -0
- package/docs/multi-agent-runtime-core.7.md +231 -0
- package/docs/multi-agent-topologies.7.md +103 -0
- package/docs/multi-agent-trust-policy-audit.7.md +154 -0
- package/docs/node-snapshot-diff-replay.7.md +135 -0
- package/docs/observability-cost-accounting.7.md +194 -0
- package/docs/operator-ux.7.md +180 -0
- package/docs/pipeline-runner.7.md +136 -0
- package/docs/project-index.md +261 -0
- package/docs/real-execution-backends.7.md +142 -0
- package/docs/release-and-migration.7.md +280 -0
- package/docs/release-tooling.7.md +159 -0
- package/docs/routines.md +48 -0
- package/docs/run-registry-control-plane.7.md +312 -0
- package/docs/run-retention-reclamation.7.md +191 -0
- package/docs/sandbox-profiles.7.md +137 -0
- package/docs/scheduled-tasks.md +80 -0
- package/docs/security-trust-hardening.7.md +117 -0
- package/docs/state-explosion-management.7.md +264 -0
- package/docs/state-node.7.md +96 -0
- package/docs/team-collaboration.7.md +207 -0
- package/docs/unix-principles.md +192 -0
- package/docs/verifier-gated-commit.7.md +140 -0
- package/docs/web-desktop-workbench.7.md +215 -0
- package/docs/worker-isolation.7.md +167 -0
- package/docs/workflow-app-framework.7.md +274 -0
- package/manifest/README.md +43 -0
- package/manifest/plugin.manifest.json +316 -0
- package/manifest/pricing.policy.json +14 -0
- package/package.json +79 -0
- package/scripts/agents/claude-p-agent.js +104 -0
- package/scripts/agents/claude-p-agent.sh +9 -0
- package/scripts/agents/cw-attest-keygen.js +55 -0
- package/scripts/agents/cw-attest-wrap.js +143 -0
- package/scripts/block-unapproved-tag.sh +39 -0
- package/scripts/bump-version.js +249 -0
- package/scripts/canonical-apps.js +171 -0
- package/scripts/cw.js +4 -0
- package/scripts/dist-drift-check.js +79 -0
- package/scripts/dogfood-architecture-review.js +237 -0
- package/scripts/dogfood-release.js +624 -0
- package/scripts/forward-ref-docs.js +73 -0
- package/scripts/gen-manifests.js +232 -0
- package/scripts/golden-path.js +300 -0
- package/scripts/mcp-server.js +4 -0
- package/scripts/new-feature.js +121 -0
- package/scripts/parity-check.js +213 -0
- package/scripts/release-check.js +118 -0
- package/scripts/release-flow.js +272 -0
- package/scripts/release-gate.sh +85 -0
- package/scripts/sync-project-index.js +387 -0
- package/scripts/validate-run-state-schema.js +126 -0
- package/scripts/verify-container-selfref.js +64 -0
- package/scripts/version-sync-check.js +237 -0
- package/skills/cool-workflow/SKILL.md +162 -0
- package/skills/cool-workflow/references/commands.md +282 -0
- package/tsconfig.json +16 -0
- package/ui/workbench/app.css +76 -0
- package/ui/workbench/app.js +159 -0
- package/ui/workbench/index.html +32 -0
- package/workflows/architecture-review.workflow.js +84 -0
- package/workflows/research-synthesis.workflow.js +47 -0

@@ -0,0 +1,265 @@

# Multi-Agent CLI + MCP Surface

CW v0.1.20 adds the preferred host-facing control loop for multi-agent work:

```text
multi-agent run -> status -> step -> blackboard -> score -> select
```

CW v0.1.25 extends this surface with State Explosion Management commands:
`summary refresh`, `summary show`, `blackboard summarize`,
`multi-agent summarize`, and `multi-agent graph --view <view>` (with optional
`--focus <id>` and `--depth <n>`). Matching MCP tools are `cw_summary_refresh`,
`cw_summary_show`, `cw_blackboard_summarize`, `cw_multi_agent_summarize`, and
`cw_multi_agent_graph_compact`. All responses keep source refs and expansion
hints. See [state-explosion-management.7.md](state-explosion-management.7.md).

CW v0.1.26 adds `multi-agent reasoning <run-id> [--evidence <id>] [--refresh]`
(MCP: `cw_evidence_reasoning`, `cw_evidence_reasoning_refresh`), which explains
*why* each evidence item was adopted, and an additive `rationaleStatus` field on
`multi-agent evidence` rows. See
[evidence-adoption-reasoning-chain.7.md](evidence-adoption-reasoning-chain.7.md).

This is userland over the existing kernel records. The low-level topology,
multi-agent, blackboard, candidate, audit, and commit primitives remain
available, but agent hosts should use this high-level surface when driving a
run.

## CLI Loop

Create or attach a topology-backed run without spawning workers:

```bash
node scripts/cw.js multi-agent run <run-id> --topology judge-panel --task <task-id>
node scripts/cw.js multi-agent run --app architecture-review --repo /path/to/repo --question "Review this" --topology map-reduce
```

Read the combined host status:

```bash
node scripts/cw.js multi-agent status <run-id>
node scripts/cw.js multi-agent status <run-id> --json
```

Perform one deterministic step at a time:

```bash
node scripts/cw.js multi-agent step <run-id> --sandbox readonly
```

`step` may create a dispatch manifest, collect fanin, snapshot the blackboard,
register a candidate, score a candidate with existing verifier evidence, select
a scored candidate, or recommend the verifier-gated commit command. It never
spawns agents directly.
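
Because `step` never spawns agents, a host typically calls it in a loop and hands control back whenever CW reports a blocker or asks the host to act. The TypeScript sketch below illustrates that loop under stated assumptions: `StepFn` is a hypothetical wrapper that runs `multi-agent step <run-id> --json` and parses its output, the field names come from the Stable Responses list on this page, and the field types and stop conditions are guesses rather than CW's documented contract.

```typescript
// Minimal host-side driver for `multi-agent step`. Field names mirror this page;
// their types here are assumptions.
interface StepResponse {
  runId: string;
  performed: string | null;
  nextAction: string | null;
  blockedReasons: string[];
  requiredHostAction: string | null;
}

// Hypothetical: runs `node scripts/cw.js multi-agent step <run-id> --json` and parses it.
type StepFn = (runId: string) => StepResponse;

type LoopOutcome =
  | { kind: "blocked"; reasons: string[] }
  | { kind: "host-action"; action: string }
  | { kind: "idle"; steps: number }
  | { kind: "limit"; steps: number };

function driveSteps(runId: string, step: StepFn, maxSteps = 20): LoopOutcome {
  for (let i = 1; i <= maxSteps; i++) {
    const res = step(runId);
    // CW failed closed: the host must resolve these reasons before stepping again.
    if (res.blockedReasons.length > 0) return { kind: "blocked", reasons: res.blockedReasons };
    // `step` never spawns agents, so return control when the host has work to do.
    if (res.requiredHostAction) return { kind: "host-action", action: res.requiredHostAction };
    // Nothing was performed: report idle instead of spinning.
    if (!res.performed) return { kind: "idle", steps: i };
  }
  return { kind: "limit", steps: maxSteps };
}
```

Returning a tagged outcome instead of throwing keeps the host's decision explicit: a blocked run, a required host action, and an idle run each call for a different next move.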

Work with the active blackboard when it is unambiguous:

```bash
node scripts/cw.js multi-agent blackboard <run-id> summary
node scripts/cw.js multi-agent blackboard <run-id> topics
node scripts/cw.js multi-agent blackboard <run-id> post --topic <topic-id> --body "finding" --evidence <ref>
node scripts/cw.js multi-agent blackboard <run-id> add-artifact --topic <topic-id> --kind worker-result --path result.md
node scripts/cw.js multi-agent blackboard <run-id> snapshot
```

Score and select explicitly:

```bash
node scripts/cw.js multi-agent score <run-id> <candidate-id> --criterion correctness=1 --criterion evidence=1 --evidence <ref>
node scripts/cw.js multi-agent select <run-id> <candidate-id> --score <score-id> --reason "verifier-backed candidate"
node scripts/cw.js commit <run-id> --selection <selection-id> --reason "verified winner"
```
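
A host script that builds `score` invocations might validate the `--criterion name=value` pairs before calling the CLI. This is a hypothetical sketch: `parseCriteria` and `scoreArgs` are not CW exports, and the sketch assumes criterion values are numeric and treats a repeated criterion as an error, neither of which this page specifies.

```typescript
// Hypothetical wrapper-side validation for `--criterion name=value` flags.
// CW's own option parsing may accept or reject different inputs.
function parseCriteria(pairs: string[]): Map<string, number> {
  const criteria = new Map<string, number>();
  for (const pair of pairs) {
    const eq = pair.indexOf("=");
    // Require a non-empty name and a non-empty value.
    if (eq <= 0 || eq === pair.length - 1) throw new Error(`malformed criterion: ${pair}`);
    const name = pair.slice(0, eq);
    const value = Number(pair.slice(eq + 1));
    if (!Number.isFinite(value)) throw new Error(`non-numeric value for ${name}`);
    if (criteria.has(name)) throw new Error(`duplicate criterion: ${name}`);
    criteria.set(name, value);
  }
  return criteria;
}

// Build argv for `node scripts/cw.js multi-agent score` from validated inputs.
function scoreArgs(runId: string, candidateId: string, criteria: Map<string, number>, evidence: string[]): string[] {
  const args = ["scripts/cw.js", "multi-agent", "score", runId, candidateId];
  criteria.forEach((value, name) => {
    args.push("--criterion", `${name}=${value}`);
  });
  for (const ref of evidence) args.push("--evidence", ref);
  return args;
}
```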

## Operator Inspection

v0.1.21 extends the host loop with focused operator commands:

```bash
node scripts/cw.js multi-agent graph <run-id>
node scripts/cw.js multi-agent dependencies <run-id>
node scripts/cw.js multi-agent failures <run-id>
node scripts/cw.js multi-agent evidence <run-id>
```

The human output is compact and operational: agent graph, dependencies, failed
or blocked agents, adopted evidence, missing evidence, and the next action.
Use `--json` or `--format json` for deterministic script output.

## MCP Tools

MCP hosts should prefer:

- `cw_multi_agent_run`
- `cw_multi_agent_status`
- `cw_multi_agent_step`
- `cw_multi_agent_blackboard`
- `cw_multi_agent_score`
- `cw_multi_agent_select`
- `cw_multi_agent_graph`
- `cw_multi_agent_dependencies`
- `cw_multi_agent_failures`
- `cw_multi_agent_evidence`

The older, lower-level `cw_multi_agent_*`, `cw_topology_*`, `cw_blackboard_*`, and
`cw_candidate_*` tools remain available as advanced primitives.
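
The preferred tools follow a visible naming pattern: `multi-agent <verb>` maps to `cw_multi_agent_<verb>`, and the summary verbs map the same way (`summary refresh` to `cw_summary_refresh`). This page also lists exceptions (`multi-agent graph --view` maps to `cw_multi_agent_graph_compact`, and `multi-agent reasoning` maps to `cw_evidence_reasoning`), so a host wrapper that encodes the pattern needs an explicit override table. The sketch is illustrative only; `mcpToolFor` is a hypothetical helper, and the authoritative mapping is the capability registry described under CLI ↔ MCP Parity.

```typescript
// Illustrative CLI-to-MCP name mapping built only from the names listed on this page.
const OVERRIDES: Record<string, string> = {
  "multi-agent graph --view": "cw_multi_agent_graph_compact",
  "multi-agent reasoning": "cw_evidence_reasoning",
  "multi-agent reasoning --refresh": "cw_evidence_reasoning_refresh",
};

function mcpToolFor(command: string): string {
  // Collapse repeated whitespace so "blackboard  summarize" and "blackboard summarize" match.
  const key = command.trim().replace(/\s+/g, " ");
  if (Object.prototype.hasOwnProperty.call(OVERRIDES, key)) return OVERRIDES[key];
  // Default pattern: "multi-agent status" -> "cw_multi_agent_status".
  return "cw_" + key.replace(/[- ]/g, "_");
}
```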

## Stable Responses

Every high-level response is JSON and includes:

- `runId`
- active topology and multi-agent ids
- blackboard and topic ids
- candidate, selection, commit, and audit ids
- `state`, `performed`, `nextAction`, and `nextActions`
- `blockedReasons`, `requiredHostAction`, and `evidenceRequirements`
- state, report, blackboard, audit, ranking, worker manifest, and result paths
- combined topology, multi-agent, multi-agent operator, blackboard, worker,
  candidate, feedback, commit, and audit summaries
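
Because every high-level response carries the same core fields, a host can reject a response that lacks them instead of guessing. The sketch below checks the documented top-level keys (`runId`, `state`, `performed`, `nextAction`, `nextActions`, `blockedReasons`, `requiredHostAction`, `evidenceRequirements`); the array types it enforces for `nextActions` and `blockedReasons` are assumptions, and `checkHostResponse` is a hypothetical name.

```typescript
const REQUIRED_KEYS = [
  "runId", "state", "performed", "nextAction", "nextActions",
  "blockedReasons", "requiredHostAction", "evidenceRequirements",
] as const;

// Returns a list of problems; an empty list means the documented keys are present.
function checkHostResponse(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["response is not JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["response is not a JSON object"];
  }
  const obj = parsed as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of REQUIRED_KEYS) {
    if (!(key in obj)) problems.push(`missing ${key}`);
  }
  // Assumed types: these two fields are lists.
  for (const key of ["nextActions", "blockedReasons"]) {
    if (key in obj && !Array.isArray(obj[key])) problems.push(`${key} is not an array`);
  }
  return problems;
}
```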

## Fail-Closed Rules

The host surface fails closed when:

- active topology or blackboard state is ambiguous
- a fanout has incomplete role coverage
- worker output has not been recorded
- fanin lacks required evidence or blackboard links
- score evidence is missing
- selection lacks score or verifier readiness
- a verifier-gated commit is not ready
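
The rules above share one shape: collect every unmet precondition and refuse to proceed if any exist, rather than stopping at the first problem or defaulting to success. A minimal sketch of that pattern for the score and selection rules follows; the `SelectionState` fields are hypothetical stand-ins, not CW's record schema.

```typescript
// Hypothetical view of the records a host would consult before selecting a candidate.
interface SelectionState {
  scoreId: string | null;      // id of the score backing the candidate, if any
  scoreEvidenceRefs: string[]; // evidence refs attached to that score
  verifierReady: boolean;      // verifier evidence is present and passing
}

// Collect every unmet precondition instead of stopping at the first one.
function selectionBlockers(s: SelectionState): string[] {
  const reasons: string[] = [];
  if (s.scoreId === null) reasons.push("selection lacks a score");
  if (s.scoreEvidenceRefs.length === 0) reasons.push("score evidence is missing");
  if (!s.verifierReady) reasons.push("selection lacks verifier readiness");
  return reasons;
}

// Fail closed: proceed only when there are no blockers at all.
function readyToSelect(s: SelectionState): boolean {
  return selectionBlockers(s).length === 0;
}
```

Reporting all blockers at once lets a host fill every gap in one pass, which is what fields such as `blockedReasons` and `evidenceRequirements` suggest.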

## Smoke Coverage

`test/multi-agent-cli-mcp-surface-smoke.js` covers the full host loop over the
official `judge-panel` topology, CLI and MCP parity, ambiguous topology
failure, missing evidence failure, successful score/select, blackboard
artifact/message linkage, audit provenance, and Operator UX next actions. It is
included in `npm test` and `npm run release:check`.

`test/multi-agent-operator-ux-smoke.js` covers the v0.1.21 graph,
dependencies, failures, evidence adoption, report output, and MCP parity.

`test/multi-agent-trust-policy-audit-smoke.js` covers the v0.1.22
role-policy, blackboard-write, message-provenance, judge-rationale,
policy-violation, report, audit provenance, and MCP parity surface.

`test/multi-agent-eval-replay-harness-smoke.js` covers the v0.1.24 eval/replay
commands and MCP tools: snapshot, replay, compare, score, gate, report, and
controlled regression detection.

## CLI ↔ MCP Parity (v0.1.27)

Every command and tool referenced above is declared in the v0.1.27 capability
registry (`src/capability-registry.ts`) and validated by `npm run parity:check`,
so `cw <cmd> --json` and the matching `cw_<tool>` result render from one data source.
See [cli-mcp-parity.7.md](cli-mcp-parity.7.md).

## Run Registry / Control Plane (v0.1.28)

The runs described here are indexed, searchable, resumable, archivable, and
rerunnable across repos by the v0.1.28 Run Registry / Control Plane, which derives
a fingerprinted, fail-closed index over the same per-run `.cw/runs/<id>/state.json`
source of truth. See [run-registry-control-plane.7.md](run-registry-control-plane.7.md).

## Execution Backends (v0.1.29)

v0.1.29 lifts execution into a pluggable driver layer: one narrow `ExecutionBackend`
contract with interchangeable `node`/`bun`/`shell`/`container`/`remote`/`ci`
drivers, selected by `--backend` (parallel to `--sandbox`) and inspected via
`backend list|show|probe`. The result/evidence envelope is schema-identical across
backends; the backend id and sandbox attestation are recorded as provenance, so this
surface is unchanged regardless of which backend executed a run. See
[execution-backends.7.md](execution-backends.7.md).

## Web / Desktop Workbench (v0.1.30)

v0.1.30 adds the Web / Desktop Workbench: a read-only, localhost-only human
console that renders this surface (and the other four operator panels — run
graph, blackboard, worker logs, candidate compare, audit timeline) for any run,
reading the SAME capability `--json` payloads. It is a THIRD FRONT DOOR alongside
the CLI and MCP that holds no authoritative state and forks no schema: each panel
equals its `cw <cmd> --json` payload byte-for-byte (parity-gated), and refresh
re-derives everything from disk. See
[web-desktop-workbench.7.md](web-desktop-workbench.7.md).

## Observability + Cost Accounting (v0.1.31)

v0.1.31 adds Observability + Cost Accounting: `metrics show`/`metrics summary`
derive durations, failure/verifier/acceptance rates (with sample counts and
fail-closed `n/a`), and host-attested token/cost from existing durable run state
— no metrics database, no collector daemon, no hidden counter. Usage is additive
and optional (absent ⇒ `unreported`, never 0); cost is `attested` (attested usage
× a recorded pricing policy) or clearly `estimated`, with pricing as policy. Both
verbs are parity-gated and render read-only in the v0.1.30 Workbench. See
[observability-cost-accounting.7.md](observability-cost-accounting.7.md).
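
The usage and cost rules can be made concrete. The sketch assumes a hypothetical pricing policy with per-million-token input and output prices (the actual schema of `manifest/pricing.policy.json` is not described here) and shows the two properties the text insists on: missing usage is reported as `unreported` rather than 0, and a rate with no samples is `n/a` rather than a number.

```typescript
// Hypothetical shapes; CW's recorded usage and pricing-policy schemas may differ.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  attested: boolean; // true when the host attested these counts
}

interface PricingPolicy {
  id: string;
  inputPerMillion: number;  // price per 1M input tokens
  outputPerMillion: number; // price per 1M output tokens
}

type Cost =
  | { status: "unreported" }
  | { status: "attested" | "estimated"; amount: number; policyId: string };

function costOf(usage: Usage | undefined, policy: PricingPolicy): Cost {
  // Missing usage stays "unreported"; it is never coerced to 0.
  if (usage === undefined) return { status: "unreported" };
  const amount =
    (usage.inputTokens * policy.inputPerMillion + usage.outputTokens * policy.outputPerMillion) / 1_000_000;
  const status: "attested" | "estimated" = usage.attested ? "attested" : "estimated";
  // Record which policy priced the usage so the figure can be audited later.
  return { status, amount, policyId: policy.id };
}

// A rate always carries its sample count; with no samples it is "n/a", not 0.
function rate(hits: number, samples: number): { value: number | "n/a"; samples: number } {
  return { value: samples === 0 ? "n/a" : hits / samples, samples };
}
```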

## Team Collaboration (v0.1.32)

v0.1.32 adds Team Collaboration: a host-attested actor and append-only
approvals/rejections/comments/handoffs provenance-linked to a durable target,
plus a review gate that STACKS ON the verifier gate — required approvals from
authorized roles, enforced inside `resolveCommitGate` AFTER the verifier checks
and never instead of them, failing closed on quorum/authority/self-approval and
recording who approved the very artifact that shipped. Policy (required approvals,
authorized roles, self-approval) is data, default off (pre-v0.1.32 behavior
unchanged). The verbs are parity-gated and render read-only in the v0.1.30
Workbench. See [Team Collaboration](team-collaboration.7.md).
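
The ordering rule (review checks run only after the verifier checks pass, and never replace them) can be sketched as a single gate function. Everything here is hypothetical: the `ReviewPolicy` and `Approval` shapes and `commitGate` itself illustrate the described behavior and are not the `resolveCommitGate` implementation.

```typescript
interface ReviewPolicy {
  requiredApprovals: number; // 0 means review is off (the default)
  authorizedRoles: string[];
  allowSelfApproval: boolean;
}

interface Approval {
  actor: string;
  role: string;
  targetId: string; // the durable artifact this approval is linked to
}

type GateResult = { ok: true } | { ok: false; reasons: string[] };

function commitGate(
  verifierReasons: string[], // blockers already reported by the verifier gate
  policy: ReviewPolicy,
  approvals: Approval[],
  artifactId: string,
  author: string,
): GateResult {
  // The verifier gate always decides first; review can only add blockers.
  if (verifierReasons.length > 0) return { ok: false, reasons: verifierReasons };
  if (policy.requiredApprovals === 0) return { ok: true };

  const reasons: string[] = [];
  const counted = new Set<string>();
  for (const a of approvals) {
    if (a.targetId !== artifactId) continue; // approved a different artifact
    if (!policy.authorizedRoles.includes(a.role)) continue; // no authority
    if (a.actor === author && !policy.allowSelfApproval) {
      reasons.push(`self-approval by ${a.actor} is not allowed`);
      continue;
    }
    counted.add(a.actor); // one vote per actor
  }
  if (counted.size < policy.requiredApprovals) {
    reasons.push(`quorum not met: ${counted.size}/${policy.requiredApprovals}`);
  }
  return reasons.length === 0 ? { ok: true } : { ok: false, reasons };
}
```

Matching each approval against `artifactId` is what ties "who approved" to "what shipped": an approval recorded against any other target simply does not count.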

## Release Tooling (v0.1.33)

v0.1.33 turns the per-tag mechanical surfaces (the version bump across 17 surfaces, the feature scaffold, and the forward-reference docs) into deterministic scripts, with a de-duplicated release gate. See [release-tooling.7.md](release-tooling.7.md).

## Real Execution Backend Integrations (v0.1.34)

v0.1.34 makes the `container`, `remote`, and `ci` backends execute for real (`docker run`/`podman run`, remote/CI POST-and-poll) under the sandbox contract, with evidence that is byte-stable against the `node` backend and a fail-closed refusal when a runtime or endpoint is unavailable. See [real-execution-backends.7.md](real-execution-backends.7.md).

## Node Snapshot / Diff / Replay (v0.1.35)

v0.1.35 adds per-node snapshot, structural diff, and isolated deterministic replay over `StateNode`, reusing the v0.1.23 eval harness. It fails closed on source drift (`valid`|`stale`|`absent`). See [node-snapshot-diff-replay.7.md](node-snapshot-diff-replay.7.md).
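
A three-way drift status can be computed by fingerprinting the source a snapshot was taken from. The sketch assumes a SHA-256 fingerprint of the serialized source node, reads `absent` as "no snapshot exists", and uses the hypothetical names `NodeSnapshot`, `snapshotStatus`, and `replayAllowed`; CW's actual fingerprinting scheme is not described on this page.

```typescript
import { createHash } from "crypto";

interface NodeSnapshot {
  nodeId: string;
  sourceFingerprint: string; // hypothetical: hash of the node when it was snapshotted
}

function fingerprint(serialized: string): string {
  return createHash("sha256").update(serialized, "utf8").digest("hex");
}

type SnapshotStatus = "valid" | "stale" | "absent";

// absent: no snapshot; stale: source changed since capture; valid: unchanged.
function snapshotStatus(snapshot: NodeSnapshot | undefined, currentSource: string): SnapshotStatus {
  if (snapshot === undefined) return "absent";
  return snapshot.sourceFingerprint === fingerprint(currentSource) ? "valid" : "stale";
}

// Fail closed: only a snapshot whose source is unchanged may be replayed.
function replayAllowed(status: SnapshotStatus): boolean {
  return status === "valid";
}
```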

## Contract Migration Tooling (v0.1.36)

v0.1.36 adds a first-class, declared migration registry (run-state and workflow-app) with per-edge compatibility proofs, fail-closed reachability, and a round-trip/non-destruction prover. See [contract-migration-tooling.7.md](contract-migration-tooling.7.md).

## Control-Plane Scheduling (v0.1.37)

v0.1.37 adds priority, concurrency limits, a lease lifecycle, retry/backoff, and a fail-closed park over the v0.1.28 Run Registry queue. Scheduling is deterministic, and its policy is data. See [control-plane-scheduling.7.md](control-plane-scheduling.7.md).
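
A deterministic retry policy expressed as data might look like the following sketch. `RetryPolicy` and `nextAttempt` are hypothetical; the point is that the same policy and attempt count always yield the same decision (exponential backoff with a cap and no random jitter), and that exhausting retries parks the run instead of dropping it or retrying forever.

```typescript
interface RetryPolicy {
  maxAttempts: number; // attempts allowed before parking
  baseDelayMs: number; // delay after the first failure
  maxDelayMs: number;  // backoff cap
}

type Decision = { action: "retry"; delayMs: number } | { action: "park"; reason: string };

// `attempt` is the number of attempts already made (1 after the first failure).
function nextAttempt(policy: RetryPolicy, attempt: number): Decision {
  if (attempt >= policy.maxAttempts) {
    return { action: "park", reason: `exhausted ${policy.maxAttempts} attempts` };
  }
  // Deterministic exponential backoff: base * 2^(attempt - 1), capped, no jitter.
  const delayMs = Math.min(policy.baseDelayMs * 2 ** (attempt - 1), policy.maxDelayMs);
  return { action: "retry", delayMs };
}
```

Omitting jitter trades some thundering-herd protection for reproducibility; that fits a control plane whose decisions are meant to be replayable from recorded state.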

## Agent Delegation Drive (v0.1.38)

v0.1.38 spawns an external agent process per worker, captures its `result.md` and attestation, and auto-drives the `plan -> dispatch -> fulfill -> accept -> commit` sequence.

## Run Retention & Provable Reclamation (v0.1.39)

v0.1.39 adds tiered, append-only, cryptographically verifiable run reclamation: it seals the audit skeleton, frees the reconstructable bulk, and proves the reclamation.

## Durable State & Locking (v0.1.40)

v0.1.40 makes writes to authoritative stores atomic (write a temp file, then rename) and fsync-durable, and adds a portable, stale-stealing file lock that serializes the cross-process read-modify-write stores.
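
The temp-then-rename write can be sketched with Node's synchronous `fs` API. This is a generic illustration of the technique, not CW's code, and `writeFileAtomic` is a hypothetical name. Because `rename` within one directory replaces the target atomically on POSIX filesystems, readers see either the old file or the new one; the `fsync` calls push the data and the directory entry to disk so the write survives a crash. The stale-stealing lock is not shown.

```typescript
import * as fs from "fs";
import * as path from "path";

// Write `data` to `target` so readers never observe a partial file.
function writeFileAtomic(target: string, data: string): void {
  const dir = path.dirname(target);
  // Same directory as the target: rename is only atomic within one filesystem.
  const tmp = path.join(dir, `.${path.basename(target)}.${process.pid}.tmp`);
  let written = false;
  const fd = fs.openSync(tmp, "w");
  try {
    fs.writeFileSync(fd, data, "utf8");
    fs.fsyncSync(fd); // flush contents before the rename publishes them
    written = true;
  } finally {
    fs.closeSync(fd);
    if (!written) fs.rmSync(tmp, { force: true }); // do not leave a half-written temp file
  }
  fs.renameSync(tmp, target); // atomic replace on POSIX
  try {
    const dirFd = fs.openSync(dir, "r");
    try {
      fs.fsyncSync(dirFd); // persist the directory entry change
    } finally {
      fs.closeSync(dirFd);
    }
  } catch {
    // Some platforms (e.g. Windows) cannot open or fsync a directory; skip there.
  }
}
```

Keeping the temp file next to the target matters: `renameSync` across devices fails with `EXDEV` rather than copying, so a temp file in a separate temp directory would break the pattern.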

## Self-Audit Hardening & Pure-Router Decomposition (v0.1.41)

v0.1.41 adds evidence grounding, durable audit append, symlink-hardened containment, deterministic worker ids, and recursive redaction. `BackendRegistry` drivers become self-describing (no per-id switches), and the orchestrator god-object is decomposed into per-domain operation modules behind a pure `loadRun -> delegate` router.

## Robust Result Ingest (v0.1.42)

v0.1.42 captures findings and evidence from any reasonable agent output shape (alternate keys and prose), has CW derive grounded evidence itself, and warns on an empty capture. This closes the v0.1.41 live-drive 'accepted with 0 captured' failure.
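
Accepting "any reasonable agent shape" amounts to checking several candidate keys and falling back to prose. The sketch below is hypothetical throughout: the alternate key names (`findings`, `issues`, `results`) and the bullet-line fallback are illustrative guesses, not CW's actual ingest rules. What it keeps from the text is the outcome: capture what can be captured, and warn instead of silently accepting an empty capture.

```typescript
interface Capture {
  findings: string[];
  warnings: string[];
}

const FINDING_KEYS = ["findings", "issues", "results"]; // hypothetical alternates

function parseJsonObject(text: string): Record<string, unknown> | undefined {
  try {
    const value: unknown = JSON.parse(text);
    return value !== null && typeof value === "object" && !Array.isArray(value)
      ? (value as Record<string, unknown>)
      : undefined;
  } catch {
    return undefined;
  }
}

function captureFindings(output: string): Capture {
  const obj = parseJsonObject(output);
  let findings: string[] = [];
  if (obj) {
    // Structured output: take the first recognized key that holds a list.
    for (const key of FINDING_KEYS) {
      const value = obj[key];
      if (Array.isArray(value)) {
        findings = value.map((v) => (typeof v === "string" ? v : JSON.stringify(v)));
        break;
      }
    }
  } else {
    // Prose output: treat "- " and "* " bullet lines as findings.
    findings = output
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.startsWith("- ") || line.startsWith("* "))
      .map((line) => line.slice(2).trim());
  }
  // Never accept silently: an empty capture produces an explicit warning.
  const warnings = findings.length === 0 ? ["empty capture: no findings recognized"] : [];
  return { findings, warnings };
}
```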

## No-False-Green Gate & Launch Prep (v0.1.43)

v0.1.43 adds a hard gate that blocks verifier-gated commits with an empty capture, plus quickstart and launch-prep docs.

## Release-Gate Determinism & Agents Vendor (v0.1.44)

Release-readiness checks now validate the committed blob (`git show HEAD:<path>`) instead of the mutable working tree, which eliminates false-red/false-green results from concurrent working-tree writes (iCloud/Spotlight/editor). v0.1.44 also adds the `agents` vendor manifest target: a generated `.agents/plugins/cool-workflow/` adapter giving any non-Claude AI agent one common interface to CW.

## P1-P2 Fixes & CI Content Surfaces (v0.1.49)

Migration DAG with reversible edges (v0.1.45), capability auto-discovery (v0.1.46), vendor-adapter registry (v0.1.47), state auto-compaction and P2 fixes (v0.1.48), plus CI content-surface determinism hardening (v0.1.49).
0.1.51

0.1.76

0.1.77

0.1.78

@@ -0,0 +1,302 @@

# Multi-Agent Eval & Replay Harness

CW v0.1.23 added a deterministic replay harness for topology-backed
multi-agent runs. It turns a completed run into plain JSON evidence that can be
replayed without live agents, compared with normalized rules, scored, and used
as a release gate.

CW v0.1.25 extends the harness with State Explosion Management metrics so the
derived summary layer is regression-gated alongside the raw run:
`summary_freshness`, `compact_graph_parity`, `blackboard_digest_parity`,
`critical_path_parity`, `evidence_digest_parity`, and `expansion_ref_integrity`.
Pre-0.1.25 snapshots load with empty summary sections, so old fixtures stay
backward compatible. See
[state-explosion-management.7.md](state-explosion-management.7.md).

CW v0.1.26 adds Evidence Adoption Reasoning Chain metrics: `reasoning_freshness`,
`reasoning_chain_parity`, and `reasoning_unexplained_parity`. Pre-0.1.26
snapshots load with empty reasoning sections. See
[evidence-adoption-reasoning-chain.7.md](evidence-adoption-reasoning-chain.7.md).

The harness is intentionally file-first:

- snapshots, replay runs, comparisons, scores, findings, gates, and reports are
  stored under `.cw/evals/<suite-id>/`
- the baseline run is not mutated during replay
- replay output is written to an isolated `replay/` directory
- every CLI command supports deterministic JSON with `--json` or
  `--format json`
- MCP tools return JSON only and include generated artifact paths

## Commands

Create a snapshot from a multi-agent run:

```bash
node scripts/cw.js eval snapshot <run-id> --id <suite-id>
node scripts/cw.js eval snapshot <run-id> --id <suite-id> --json
```

Replay without live agents:

```bash
node scripts/cw.js eval replay .cw/evals/<suite-id>/snapshot.json
```

Compare, score, gate, and report:

```bash
node scripts/cw.js eval compare \
  .cw/evals/<suite-id>/snapshot.json \
  .cw/evals/<suite-id>/replay-run.json

node scripts/cw.js eval score .cw/evals/<suite-id>/replay-run.json
node scripts/cw.js eval gate .cw/evals/<suite-id>
node scripts/cw.js eval report .cw/evals/<suite-id>/replay-run.json
```

`npm run eval:replay` runs the deterministic smoke suite and is included in
`npm test` and `npm run release:check`.

Human output uses stable panels:

```text
Eval Suite
Replay Status
Graph Comparison
Evidence Comparison
Trust / Policy / Audit Comparison
Candidate Score Comparison
Selection / Commit Gate
Regression Findings
Final Verdict
Next Action
```

## Artifacts

Each suite writes predictable files:

- `suite.json`
- `snapshot.json`
- `replay-run.json`
- `comparison.json`
- `score.json`
- `findings.json`
- `gate.json`
- `report.md`

The snapshot captures workflow app identity, inputs, topology shape, roles,
groups, memberships, fanout/fanin state, blackboard records, worker outputs,
candidate scores, selection rationale, verifier-gated commit inputs,
trust/policy/audit records, expected operator summaries, evidence adoption, and
report sections.

## Comparison Rules

The comparison checks:

- topology id and topology run shape
- roles, groups, memberships, fanout, and fanin records
- dependency edges and failure rows
- blackboard records and message provenance
- role policies, permission decisions, write audit, judge rationale, panel
  decisions, and policy violations
- evidence adoption status
- candidate scores, selected candidate, and selection rationale
- verifier-gated commit readiness
- report sections

Normalization removes unstable paths, timestamps, generated temp roots, and
machine-local directories. It does not hide changed evidence, policy,
selection, scoring, or commit-gate behavior.
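
A normalization pass of this kind can be sketched as a recursive rewrite that blanks volatile fields and replaces machine-local path prefixes while leaving every other value untouched. The volatile key names (`timestamp`, `createdAt`, `updatedAt`) and the `<root>` and `<volatile>` placeholders are hypothetical; CW's actual normalization rules are not listed on this page.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const VOLATILE_KEYS = new Set(["timestamp", "createdAt", "updatedAt"]); // hypothetical names

// Rewrite machine-local path prefixes and blank volatile fields. Every other value is
// kept verbatim, so a changed evidence, policy, or selection record still differs.
function normalize(value: Json, roots: string[]): Json {
  // Replace longer roots first so a nested root is not partially rewritten.
  const ordered = roots.filter((r) => r.length > 0).sort((a, b) => b.length - a.length);
  return walk(value);

  function walk(v: Json): Json {
    if (typeof v === "string") {
      let s = v;
      for (const root of ordered) s = s.split(root).join("<root>");
      return s;
    }
    if (Array.isArray(v)) return v.map(walk);
    if (v !== null && typeof v === "object") {
      const out: { [key: string]: Json } = {};
      // Sorted keys give a stable serialization regardless of insertion order.
      for (const key of Object.keys(v).sort()) {
        out[key] = VOLATILE_KEYS.has(key) ? "<volatile>" : walk(v[key]);
      }
      return out;
    }
    return v;
  }
}
```

The volatile field is replaced rather than deleted, so a field that disappears entirely from a replay still shows up as a difference.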

## Scoring

Scores are deterministic metrics:

- `replay_completed`
- `graph_parity`
- `role_parity`
- `group_parity`
- `membership_parity`
- `fanout_parity`
- `fanin_parity`
- `dependency_parity`
- `failure_parity`
- `blackboard_record_parity`
- `evidence_adoption_parity`
- `trust_audit_parity`
- `role_policy_parity`
- `permission_decision_parity`
- `policy_violation_parity`
- `blackboard_provenance_parity`
- `judge_rationale_parity`
- `panel_decision_parity`
- `candidate_score_parity`
- `selection_parity`
- `verifier_commit_gate_parity`
- `report_parity`

Each metric returns `id`, `status`, `score`, `maxScore`, `reason`, evidence
refs, baseline refs, and replay refs.
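
Each parity metric compares one baseline section with its replay counterpart. A hypothetical sketch of one such metric follows, returning the documented fields (`id`, `status`, `score`, `maxScore`, `reason`, plus evidence, baseline, and replay refs); the field names for the refs, the all-or-nothing scoring, and the set comparison of record ids are assumptions, not the harness's published rules.

```typescript
interface MetricResult {
  id: string;
  status: "pass" | "fail";
  score: number;
  maxScore: number;
  reason: string;
  evidenceRefs: string[]; // hypothetical: filled in by the caller
  baselineRefs: string[];
  replayRefs: string[];
}

// Compare record ids from baseline and replay as sets (duplicates are not counted);
// any missing or extra id fails the metric.
function parityMetric(id: string, baseline: string[], replay: string[]): MetricResult {
  const base = new Set(baseline);
  const rep = new Set(replay);
  const missing = baseline.filter((x) => !rep.has(x));
  const extra = replay.filter((x) => !base.has(x));
  const ok = missing.length === 0 && extra.length === 0;
  return {
    id,
    status: ok ? "pass" : "fail",
    score: ok ? 1 : 0,
    maxScore: 1,
    reason: ok
      ? "baseline and replay match"
      : `missing: [${missing.join(", ")}]; extra: [${extra.join(", ")}]`,
    evidenceRefs: [],
    baselineRefs: [...baseline].sort(),
    replayRefs: [...replay].sort(),
  };
}
```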

## Gate

`eval gate` fails closed when replay artifacts are missing or when comparison
findings show a regression. This includes missing judge rationale, changed
selected candidate, changed evidence adoption, changed policy violations,
missing provenance, lost verifier-gated commit readiness, or graph/dependency
loss.

Improvements can be represented as changed findings in future suites, but they
must be visible in `score.json`, `findings.json`, and `report.md` before a
release gate can accept them.
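
The gate's fail-closed rule reduces to two checks: the replay artifacts must exist, and no finding may report a regression. The sketch takes file names from the Artifacts list on this page, but which of them count as required before gating is an assumption, as is the simplified `Finding` shape; `evalGate` is a hypothetical name. It also encodes the improvement rule: an improvement passes only if it is visible in `score.json`, `findings.json`, and `report.md`.

```typescript
interface Finding {
  metricId: string;
  kind: "regression" | "improvement";
  visibleIn: string[]; // artifact files that record this finding
}

// Assumption: artifacts that must exist before the gate can decide.
const REQUIRED_ARTIFACTS = ["snapshot.json", "replay-run.json", "comparison.json", "score.json", "findings.json"];
const VISIBILITY = ["score.json", "findings.json", "report.md"];

function evalGate(presentFiles: string[], findings: Finding[]): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  for (const file of REQUIRED_ARTIFACTS) {
    if (!presentFiles.includes(file)) reasons.push(`missing artifact ${file}`);
  }
  for (const f of findings) {
    if (f.kind === "regression") {
      reasons.push(`regression in ${f.metricId}`);
    } else if (!VISIBILITY.every((file) => f.visibleIn.includes(file))) {
      // An improvement the reviewer cannot see is treated like an unexplained change.
      reasons.push(`improvement in ${f.metricId} is not visible in ${VISIBILITY.join(", ")}`);
    }
  }
  return { pass: reasons.length === 0, reasons };
}
```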
|
|
155
|
+
|
|
156
|
+
## MCP Parity
|
|
157
|
+
|
|
158
|
+
The MCP surface mirrors the CLI:
|
|
159
|
+
|
|
160
|
+
- `cw_eval_snapshot`
|
|
161
|
+
- `cw_eval_replay`
|
|
162
|
+
- `cw_eval_compare`
|
|
163
|
+
- `cw_eval_score`
|
|
164
|
+
- `cw_eval_gate`
|
|
165
|
+
- `cw_eval_report`
|
|
166
|
+
|
|
167
|
+
MCP responses are deterministic JSON and include artifact paths.
|
|
168
|
+
|
|
169
|
+
## Release Use

Use this harness after a topology-backed run reaches scoring, selection, and a
verifier-gated commit:

```bash
node scripts/cw.js eval snapshot <run-id> --id release-replay
node scripts/cw.js eval replay .cw/evals/release-replay/snapshot.json
node scripts/cw.js eval compare .cw/evals/release-replay/snapshot.json .cw/evals/release-replay/replay-run.json
node scripts/cw.js eval score .cw/evals/release-replay/replay-run.json
node scripts/cw.js eval gate .cw/evals/release-replay
node scripts/cw.js eval report .cw/evals/release-replay/replay-run.json
```

The gate proves that the replay completed, graph/dependencies stayed stable,
evidence adoption stayed traceable, trust/policy/audit records remained
explainable, judge rationale is present, scoring/selection did not regress, and
verifier-gated commit readiness still holds.

## CLI ↔ MCP Parity (v0.1.27)

Every command and tool referenced above is declared in the v0.1.27 capability
registry (`src/capability-registry.ts`) and validated by `npm run parity:check`,
so `cw <cmd> --json` and the matching `cw_<tool>` result render from the same
data source. See [cli-mcp-parity.7.md](cli-mcp-parity.7.md).

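A minimal sketch of the single-data-source idea: both front doors call one registered handler and differ only in how they wrap the result, so a parity check can compare the payloads byte for byte. The `Capability` shape, handlers, and naming helpers are hypothetical; the MCP wrapping follows the protocol's text-content result form, but how CW actually wraps results and what `npm run parity:check` does internally are not shown here.

```typescript
// Hypothetical registry entry: one handler is the single data source.
interface Capability {
  id: string; // e.g. "eval.gate" for CLI `cw eval gate` and MCP `cw_eval_gate`
  handler: (args: Record<string, string>) => unknown;
}

const registry: Capability[] = [
  { id: "eval.gate", handler: (args) => ({ dir: args.dir, pass: true }) },
  { id: "eval.report", handler: (args) => ({ run: args.run, report: "report.md" }) },
];

const cliName = (cap: Capability): string => `cw ${cap.id.split(".").join(" ")}`;
const mcpToolName = (cap: Capability): string => `cw_${cap.id.split(".").join("_")}`;

// CLI front door: `--json` prints the handler result as-is.
function renderCli(cap: Capability, args: Record<string, string>): string {
  return JSON.stringify(cap.handler(args));
}

// MCP front door: the same payload wrapped as a text content item.
function renderMcp(cap: Capability, args: Record<string, string>): { content: { type: "text"; text: string }[] } {
  return { content: [{ type: "text", text: JSON.stringify(cap.handler(args)) }] };
}

// Parity check: every registered capability must yield byte-identical JSON.
function checkParity(args: Record<string, string>): string[] {
  const mismatches: string[] = [];
  for (const cap of registry) {
    const cli = renderCli(cap, args);
    const mcp = renderMcp(cap, args).content[0].text;
    if (cli !== mcp) mismatches.push(`${cliName(cap)} != ${mcpToolName(cap)}`);
  }
  return mismatches;
}

console.log(mcpToolName(registry[0])); // cw_eval_gate
console.log(checkParity({ dir: ".cw/evals/release-replay", run: "run-1" }).length); // 0
```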
## Run Registry / Control Plane (v0.1.28)

The runs described here are indexed, searchable, resumable, archivable, and
rerunnable across repos by the v0.1.28 Run Registry / Control Plane, which derives
a fingerprinted, fail-closed index over the same per-run `.cw/runs/<id>/state.json`
source of truth. See [run-registry-control-plane.7.md](run-registry-control-plane.7.md).

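A minimal sketch of a fingerprinted, fail-closed index entry, assuming the fingerprint is a hash of the raw `state.json` text. FNV-1a is used only to keep the example dependency-free; it is not cryptographic, and the registry's real hash, field names, and lookup API (`IndexEntry`, `indexRun`, `lookup` here are hypothetical) are not specified in this page.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; illustrative, not cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface IndexEntry {
  runId: string;
  status: string;
  fingerprint: string; // hash of state.json at index time
}

type Lookup = { ok: true; entry: IndexEntry } | { ok: false; reason: string };

// The index is derived from state.json; it never becomes the source of truth.
function indexRun(runId: string, stateJson: string): IndexEntry {
  const state = JSON.parse(stateJson) as { status?: string };
  return { runId, status: state.status ?? "unknown", fingerprint: fnv1a(stateJson) };
}

// Fail closed: a missing or changed state.json invalidates the entry.
function lookup(entry: IndexEntry, currentStateJson: string | undefined): Lookup {
  if (currentStateJson === undefined) return { ok: false, reason: `state.json missing for ${entry.runId}` };
  if (fnv1a(currentStateJson) !== entry.fingerprint) {
    return { ok: false, reason: `index stale for ${entry.runId}; re-derive from state.json` };
  }
  return { ok: true, entry };
}

const entry = indexRun("run-1", '{"status":"committed"}');
console.log(lookup(entry, '{"status":"committed"}').ok); // true
console.log(lookup(entry, '{"status":"failed"}').ok); // false
```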
## Execution Backends (v0.1.29)

v0.1.29 lifts execution into a pluggable driver layer: one narrow `ExecutionBackend`
contract with interchangeable `node`/`bun`/`shell`/`container`/`remote`/`ci`
drivers, selected by `--backend` (parallel to `--sandbox`) and inspected via
`backend list|show|probe`. The result/evidence envelope is schema-identical across
backends; the backend id and sandbox attestation are recorded as provenance, so this
surface is unchanged regardless of which backend executed a run. See
[execution-backends.7.md](execution-backends.7.md).

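A minimal sketch of a narrow backend contract with a shared result envelope, assuming a synchronous `execute` and a `probe` availability check. The interface members, the `EchoBackend` driver, and the provenance field names are hypothetical stand-ins, not the package's actual `ExecutionBackend` definition; the refusal path mirrors the fail-closed behavior described for real backends.

```typescript
interface ExecutionRequest {
  workerId: string;
  command: string;
}

// Same envelope for every backend; only provenance records who ran it.
interface ResultEnvelope {
  workerId: string;
  exitCode: number;
  output: string;
  provenance: { backendId: string; sandboxAttestation: string };
}

interface ExecutionBackend {
  readonly id: string;
  probe(): boolean; // e.g. is the runtime or endpoint reachable?
  execute(req: ExecutionRequest): ResultEnvelope;
}

// Toy driver that pretends to run the command.
class EchoBackend implements ExecutionBackend {
  readonly id: string;
  private readonly available: boolean;
  constructor(id: string, available: boolean) {
    this.id = id;
    this.available = available;
  }
  probe(): boolean {
    return this.available;
  }
  execute(req: ExecutionRequest): ResultEnvelope {
    return {
      workerId: req.workerId,
      exitCode: 0,
      output: `ran: ${req.command}`,
      provenance: { backendId: this.id, sandboxAttestation: "none" },
    };
  }
}

// Select by id (like `--backend`) and refuse rather than silently fall back.
function run(backends: ExecutionBackend[], backendId: string, req: ExecutionRequest): ResultEnvelope {
  const backend = backends.find((b) => b.id === backendId);
  if (!backend) throw new Error(`unknown backend: ${backendId}`);
  if (!backend.probe()) throw new Error(`backend ${backendId} unavailable; refusing to run`);
  return backend.execute(req);
}

const backends = [new EchoBackend("node", true), new EchoBackend("container", false)];
const env = run(backends, "node", { workerId: "w1", command: "npm test" });
console.log(env.provenance.backendId); // node
```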
## Web / Desktop Workbench (v0.1.30)

v0.1.30 adds the Web / Desktop Workbench: a read-only, localhost-only human
console that renders this surface, along with the operator panels (run graph,
blackboard, worker logs, candidate compare, audit timeline), for any run,
reading the SAME capability `--json` payloads. It is a THIRD FRONT DOOR alongside
the CLI and MCP that holds no authoritative state and forks no schema. Each panel
equals its `cw <cmd> --json` payload byte-for-byte (parity-gated), and a refresh
re-derives everything from disk. See
[web-desktop-workbench.7.md](web-desktop-workbench.7.md).

## Observability + Cost Accounting (v0.1.31)

v0.1.31 adds Observability + Cost Accounting: `metrics show`/`metrics summary`
derive durations, failure/verifier/acceptance rates (with sample counts and a
fail-closed `n/a`), and host-attested token/cost figures from existing durable run
state, with no metrics database, no collector daemon, and no hidden counter. Usage
is additive and optional (absent ⇒ `unreported`, never 0). Cost is either
`attested` (attested usage × a recorded pricing policy) or clearly marked
`estimated`, and pricing is treated as policy. Both verbs are parity-gated and
render read-only in the v0.1.30 Workbench. See
[observability-cost-accounting.7.md](observability-cost-accounting.7.md).

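A minimal sketch of these reporting rules, assuming per-run records with optional usage: an empty sample yields `n/a` instead of 0, any run without usage makes the cost `unreported`, and cost counts as `attested` only when the price comes from a recorded pricing policy (otherwise this sketch labels it `estimated`). The record shape, the per-thousand-token price, and the function names are hypothetical.

```typescript
interface RunRecord {
  failed: boolean;
  usage?: { inputTokens: number; outputTokens: number }; // absent = not reported by host
}

interface Pricing {
  perThousandTokens: number;
  recorded: boolean; // true when the price comes from a recorded pricing policy
}

type Rate = { value: number; samples: number } | "n/a";

function failureRate(runs: RunRecord[]): Rate {
  if (runs.length === 0) return "n/a"; // fail closed: no samples, no number
  const failures = runs.filter((r) => r.failed).length;
  return { value: failures / runs.length, samples: runs.length };
}

type Cost = { kind: "attested" | "estimated"; amount: number } | { kind: "unreported" };

function cost(runs: RunRecord[], pricing: Pricing): Cost {
  // Missing usage is never treated as 0; it makes the total unreported.
  if (runs.length === 0 || runs.some((r) => r.usage === undefined)) return { kind: "unreported" };
  let tokens = 0;
  for (const r of runs) tokens += r.usage!.inputTokens + r.usage!.outputTokens;
  const amount = (tokens / 1000) * pricing.perThousandTokens;
  return { kind: pricing.recorded ? "attested" : "estimated", amount };
}

const runs: RunRecord[] = [
  { failed: false, usage: { inputTokens: 1500, outputTokens: 500 } },
  { failed: true, usage: { inputTokens: 800, outputTokens: 200 } },
];
console.log(JSON.stringify(failureRate(runs))); // {"value":0.5,"samples":2}
console.log(JSON.stringify(cost(runs, { perThousandTokens: 0.5, recorded: true }))); // {"kind":"attested","amount":1.5}
```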
## Team Collaboration (v0.1.32)

v0.1.32 adds Team Collaboration: a host-attested actor plus append-only
approvals, rejections, comments, and handoffs, each provenance-linked to a durable
target. It also adds a review gate that STACKS ON the verifier gate: required
approvals from authorized roles are enforced inside `resolveCommitGate` AFTER the
verifier checks, never instead of them. The gate fails closed on quorum, authority,
or self-approval violations and records who approved the exact artifact that
shipped. Policy (required approvals, authorized roles, self-approval) is data and
defaults to off, so pre-v0.1.32 behavior is unchanged. The verbs are parity-gated
and render read-only in the v0.1.30 Workbench. See
[team-collaboration.7.md](team-collaboration.7.md).

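A minimal sketch of the ordering and the three fail-closed checks, assuming a hypothetical `ReviewPolicy` and approval record keyed by an artifact digest. This is not the package's `resolveCommitGate`; it only shows that the verifier result is checked first and that the review layer can add refusals but never bypass the verifier.

```typescript
interface Approval {
  actor: string;
  role: string;
  artifactDigest: string; // which artifact was approved
}

interface ReviewPolicy {
  requiredApprovals: number; // 0 = review gate off (the default)
  authorizedRoles: string[];
  allowSelfApproval: boolean;
}

type GateResult = { ok: true; approvedBy: string[] } | { ok: false; reason: string };

function commitGate(
  verifierPassed: boolean,
  author: string,
  artifactDigest: string,
  approvals: Approval[],
  policy: ReviewPolicy,
): GateResult {
  // The verifier gate always runs first; review never replaces it.
  if (!verifierPassed) return { ok: false, reason: "verifier gate failed" };
  if (policy.requiredApprovals === 0) return { ok: true, approvedBy: [] };

  // Only approvals of the exact artifact being committed count.
  const valid = approvals.filter((a) => a.artifactDigest === artifactDigest);
  if (valid.some((a) => !policy.authorizedRoles.includes(a.role))) {
    return { ok: false, reason: "approval from unauthorized role" };
  }
  if (!policy.allowSelfApproval && valid.some((a) => a.actor === author)) {
    return { ok: false, reason: "self-approval not allowed" };
  }
  const approvers = Array.from(new Set(valid.map((a) => a.actor))).sort();
  if (approvers.length < policy.requiredApprovals) {
    return { ok: false, reason: `quorum not met: ${approvers.length}/${policy.requiredApprovals}` };
  }
  return { ok: true, approvedBy: approvers }; // recorded alongside the shipped artifact
}

const policy: ReviewPolicy = { requiredApprovals: 2, authorizedRoles: ["maintainer"], allowSelfApproval: false };
const approvals: Approval[] = [
  { actor: "bo", role: "maintainer", artifactDigest: "abc" },
  { actor: "cy", role: "maintainer", artifactDigest: "abc" },
];
console.log(commitGate(true, "al", "abc", approvals, policy).ok); // true
```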
## Release Tooling (v0.1.33)

The per-tag mechanical surfaces (the version bump across 17 surfaces, the feature scaffold, and the forward-reference docs) become deterministic scripts, with a de-duplicated release gate. See release-tooling(7).

## Real Execution Backend Integrations (v0.1.34)

The container/remote/ci backends really execute (docker/podman run, remote/CI POST-and-poll) under the sandbox contract, producing evidence that is byte-stable relative to the `node` backend and refusing fail-closed when a runtime or endpoint is unavailable. See real-execution-backends(7).

## Node Snapshot / Diff / Replay (v0.1.35)

Per-node snapshot, structural diff, and isolated deterministic replay over `StateNode`, reusing the v0.1.23 eval harness; replay fails closed on source drift (`valid`|`stale`|`absent`). See node-snapshot-diff-replay(7).

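A minimal sketch of the `valid`|`stale`|`absent` classification, assuming each snapshot stores a digest of the node's source at capture time, plus a shallow key-level diff. `NodeSnapshot`, `classify`, `replayNode`, and `diffKeys` are hypothetical names; only the three status values come from the text above, and the real structural diff is likely deeper than this one.

```typescript
type SnapshotStatus = "valid" | "stale" | "absent";

// Hypothetical snapshot record: a digest of the node's source at capture time.
interface NodeSnapshot {
  nodeId: string;
  sourceDigest: string;
  output: Record<string, unknown>;
}

function classify(snapshot: NodeSnapshot | undefined, currentSourceDigest: string): SnapshotStatus {
  if (snapshot === undefined) return "absent";
  return snapshot.sourceDigest === currentSourceDigest ? "valid" : "stale";
}

// Fail closed: replay only from a snapshot whose source has not drifted.
function replayNode(snapshot: NodeSnapshot | undefined, currentSourceDigest: string): Record<string, unknown> {
  const status = classify(snapshot, currentSourceDigest);
  if (snapshot === undefined || status !== "valid") throw new Error(`cannot replay: snapshot is ${status}`);
  return { ...snapshot.output };
}

// Shallow structural diff: keys whose JSON encoding differs between two outputs.
function diffKeys(a: Record<string, unknown>, b: Record<string, unknown>): string[] {
  const keys = Array.from(new Set([...Object.keys(a), ...Object.keys(b)])).sort();
  return keys.filter((k) => JSON.stringify(a[k]) !== JSON.stringify(b[k]));
}

const snap: NodeSnapshot = { nodeId: "score", sourceDigest: "d1", output: { selected: "cand-2", score: 7 } };
console.log(classify(snap, "d1"), classify(snap, "d2"), classify(undefined, "d1")); // valid stale absent
console.log(diffKeys(snap.output, { selected: "cand-3", score: 7, note: "rerun" }).join(",")); // note,selected
```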
## Contract Migration Tooling (v0.1.36)

A first-class declared migration registry (run-state and workflow-app contracts) with per-edge compatibility proofs, fail-closed reachability, and a round-trip/non-destruction prover. See contract-migration-tooling(7).

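A minimal sketch of fail-closed reachability over a declared registry, assuming versions are string ids and each declared migration is one directed edge. Breadth-first search returns the shortest declared chain; if none exists the migration is refused rather than guessed. The `Migration` shape is hypothetical, and the per-edge compatibility proofs and round-trip prover are not modeled.

```typescript
// Hypothetical declared migration edge.
interface Migration {
  from: string;
  to: string;
  apply: (state: Record<string, unknown>) => Record<string, unknown>;
}

// Breadth-first search over declared edges; returns the shortest chain or throws.
function planMigration(registry: Migration[], from: string, to: string): Migration[] {
  if (from === to) return [];
  const prev = new Map<string, Migration>(); // edge used to first reach each version
  const seen = new Set<string>([from]);
  const queue: string[] = [from];
  while (queue.length > 0) {
    const version = queue.shift()!;
    for (const edge of registry.filter((m) => m.from === version)) {
      if (seen.has(edge.to)) continue;
      seen.add(edge.to);
      prev.set(edge.to, edge);
      if (edge.to === to) {
        // Walk back from the target to rebuild the chain in order.
        const path: Migration[] = [];
        for (let v = to; v !== from; v = prev.get(v)!.from) path.unshift(prev.get(v)!);
        return path;
      }
      queue.push(edge.to);
    }
  }
  throw new Error(`no declared migration path from ${from} to ${to}`);
}

function migrate(registry: Migration[], state: Record<string, unknown>, from: string, to: string): Record<string, unknown> {
  return planMigration(registry, from, to).reduce((s, m) => m.apply(s), state);
}

const registry: Migration[] = [
  { from: "v1", to: "v2", apply: (s) => ({ ...s, schema: 2 }) },
  { from: "v2", to: "v3", apply: (s) => ({ ...s, schema: 3 }) },
];
console.log(migrate(registry, { id: "run-1" }, "v1", "v3").schema); // 3
```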
## Control-Plane Scheduling (v0.1.37)

Priority, concurrency limits, lease lifecycle, retry/backoff, and fail-closed parking over the v0.1.28 Run Registry queue; scheduling is policy-as-data and deterministic. See control-plane-scheduling(7).

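A minimal sketch of two of these pieces, assuming a hypothetical queue-entry shape: deterministic selection of the highest-priority runnable entries up to the concurrency limit (ties broken by run id), and capped exponential backoff that parks an entry once its retry budget is spent. Field names and the backoff formula are assumptions, and the lease lifecycle is not modeled.

```typescript
// Hypothetical queue entry and policy shapes (policy-as-data).
interface QueueEntry {
  runId: string;
  priority: number; // higher runs first
  attempts: number;
  notBefore: number; // earliest time (ms) the entry may be leased again
  parked: boolean;
}

interface SchedulingPolicy {
  maxConcurrent: number;
  maxAttempts: number;
  baseBackoffMs: number;
  maxBackoffMs: number;
}

// Same queue, lease count, and clock always yield the same picks.
function pickRunnable(queue: QueueEntry[], leased: number, now: number, policy: SchedulingPolicy): QueueEntry[] {
  const slots = Math.max(0, policy.maxConcurrent - leased);
  return queue
    .filter((e) => !e.parked && e.notBefore <= now)
    .sort((x, y) => y.priority - x.priority || (x.runId < y.runId ? -1 : x.runId > y.runId ? 1 : 0))
    .slice(0, slots);
}

// After a failed attempt: capped exponential backoff, or park once the budget is spent.
function onFailure(entry: QueueEntry, now: number, policy: SchedulingPolicy): QueueEntry {
  const attempts = entry.attempts + 1;
  if (attempts >= policy.maxAttempts) return { ...entry, attempts, parked: true };
  const delay = Math.min(policy.maxBackoffMs, policy.baseBackoffMs * 2 ** (attempts - 1));
  return { ...entry, attempts, notBefore: now + delay };
}

const policy: SchedulingPolicy = { maxConcurrent: 2, maxAttempts: 3, baseBackoffMs: 1000, maxBackoffMs: 8000 };
const entry = (runId: string, priority: number, notBefore = 0): QueueEntry => ({ runId, priority, attempts: 0, notBefore, parked: false });
const queue = [entry("a", 1), entry("b", 5), entry("c", 5), entry("d", 9, 500)];

// One lease is already held, so one slot; "d" is not eligible until t=500.
console.log(pickRunnable(queue, 1, 0, policy).map((e) => e.runId).join(",")); // b
```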
## Agent Delegation Drive (v0.1.38)

Spawns an external agent process per worker, captures `result.md` plus an attestation, and auto-drives the plan->dispatch->fulfill->accept->commit sequence.

## Run Retention & Provable Reclamation (v0.1.39)

Tiered, append-only, cryptographically verifiable run reclamation: seal the audit skeleton, free the reconstructable bulk, and prove it.

## Durable State & Locking (v0.1.40)

Atomic temp->rename writes plus fsync durability for authoritative stores, and a portable stale-stealing file lock that serializes the cross-process read-modify-write stores.

## Self-Audit Hardening & Pure-Router Decomposition (v0.1.41)

Evidence grounding, durable audit append, symlink-hardened containment, deterministic worker ids, and recursive redaction; self-describing `BackendRegistry` drivers (no per-id switches); and the orchestrator god-object decomposed into per-domain operation modules behind a pure loadRun->delegate router.

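Of the hardening items above, recursive redaction is the easiest to show in isolation. A minimal sketch, assuming redaction is keyed on field names matched by a hypothetical credential pattern; the package's actual redaction rules are not described in this page.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical rule: any key that looks like a credential is masked.
const SECRET_KEY = /token|secret|password|api[_-]?key/i;

// Walk arrays and objects at any depth, returning a new value; the input is not mutated.
function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

const input: Json = { worker: "w1", env: { API_KEY: "abc", path: "/tmp" }, steps: [{ token: "t" }] };
const log = redact(input);
console.log(JSON.stringify(log));
// {"worker":"w1","env":{"API_KEY":"[REDACTED]","path":"/tmp"},"steps":[{"token":"[REDACTED]"}]}
```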
## Robust Result Ingest (v0.1.42)

Captures findings/evidence from any reasonable agent output shape (alternate keys and prose), has CW derive grounded evidence itself, and warns on empty capture. This closes the v0.1.41 live-drive "accepted with 0 captured" failure.

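A minimal sketch of tolerant ingest, assuming findings may arrive under a few alternate JSON keys or only as prose bullet lines. The alias list, the bullet heuristic, and the `Ingested` shape are hypothetical, and CW's own evidence derivation is not modeled; the point is that an empty capture produces an explicit warning instead of silently succeeding.

```typescript
interface Ingested {
  findings: string[];
  source: "structured" | "prose" | "none";
  warning?: string;
}

// Hypothetical alias list for the findings array.
const FINDING_KEYS = ["findings", "issues", "results", "observations"];

function ingest(raw: string): Ingested {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      const record = parsed as Record<string, unknown>;
      for (const key of FINDING_KEYS) {
        const value = record[key];
        if (Array.isArray(value) && value.length > 0) {
          return { findings: value.map((v) => (typeof v === "string" ? v : JSON.stringify(v))), source: "structured" };
        }
      }
    }
  } catch {
    // Not JSON: fall through to prose extraction.
  }
  const bullets = raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^[-*] /.test(line))
    .map((line) => line.slice(2).trim());
  if (bullets.length > 0) return { findings: bullets, source: "prose" };
  return { findings: [], source: "none", warning: "empty capture: no findings extracted" };
}

console.log(ingest('{"issues":["null check missing"]}').source); // structured
console.log(ingest("Summary\n- off-by-one in loop\n- stale cache").findings.length); // 2
console.log(ingest("looks fine to me").warning); // empty capture: no findings extracted
```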
## No-False-Green Gate & Launch Prep (v0.1.43)

A hard gate that blocks verifier-gated commits with empty capture, plus quickstart and launch-prep docs.

## Release-Gate Determinism & Agents Vendor (v0.1.44)

Release-readiness checks now validate the committed blob (`git show HEAD:<path>`) instead of the mutable working tree, eliminating false-red/false-green results caused by concurrent working-tree writes (iCloud/Spotlight/editor). Also adds the `agents` vendor manifest target: a generated `.agents/plugins/cool-workflow/` adapter that gives any non-Claude AI agent one common interface to CW.

## P1-P2 Fixes & CI Content Surfaces (v0.1.49)

Migration DAG with reversible edges (v0.1.45), capability auto-discovery (v0.1.46), vendor-adapter registry (v0.1.47), state auto-compaction and P2 fixes (v0.1.48), plus CI content-surface determinism hardening (v0.1.49).

0.1.51

0.1.76

0.1.77

0.1.78