@cleocode/skills 2026.5.15 → 2026.5.17

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. package/package.json +1 -1
  2. package/skills/ct-council/SKILL.md +377 -0
  3. package/skills/ct-council/optimization/HARDENING-PLAYBOOK.md +107 -0
  4. package/skills/ct-council/optimization/README.md +74 -0
  5. package/skills/ct-council/optimization/scenarios.yaml +121 -0
  6. package/skills/ct-council/optimization/scripts/campaign.py +543 -0
  7. package/skills/ct-council/optimization/scripts/test_campaign.py +143 -0
  8. package/skills/ct-council/references/chairman.md +119 -0
  9. package/skills/ct-council/references/contrarian.md +70 -0
  10. package/skills/ct-council/references/evidence-pack.md +145 -0
  11. package/skills/ct-council/references/examples.md +235 -0
  12. package/skills/ct-council/references/executor.md +83 -0
  13. package/skills/ct-council/references/expansionist.md +68 -0
  14. package/skills/ct-council/references/first-principles.md +73 -0
  15. package/skills/ct-council/references/outsider.md +73 -0
  16. package/skills/ct-council/references/peer-review.md +125 -0
  17. package/skills/ct-council/scripts/analyze_runs.py +293 -0
  18. package/skills/ct-council/scripts/fixtures/executor_multi.md +198 -0
  19. package/skills/ct-council/scripts/fixtures/missing_advisor.md +117 -0
  20. package/skills/ct-council/scripts/fixtures/missing_convergence.md +190 -0
  21. package/skills/ct-council/scripts/fixtures/thin_evidence.md +193 -0
  22. package/skills/ct-council/scripts/fixtures/valid.md +226 -0
  23. package/skills/ct-council/scripts/fixtures/valid_with_llmtxt.md +226 -0
  24. package/skills/ct-council/scripts/llmtxt_ref.py +223 -0
  25. package/skills/ct-council/scripts/run_council.py +578 -0
  26. package/skills/ct-council/scripts/telemetry.py +624 -0
  27. package/skills/ct-council/scripts/test_telemetry.py +509 -0
  28. package/skills/ct-council/scripts/test_validate.py +452 -0
  29. package/skills/ct-council/scripts/validate.py +396 -0
  30. package/skills.json +19 -0
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@cleocode/skills",
-  "version": "2026.5.15",
+  "version": "2026.5.17",
   "description": "CLEO skill definitions - bundled with CLEO monorepo",
   "main": "index.js",
   "types": "index.d.ts",
package/skills/ct-council/SKILL.md ADDED
@@ -0,0 +1,377 @@
+ ---
+ name: ct-council
+ description: Convene "The Council" — a 5-advisor, shuffled gate-based peer-review, chairman-synthesis workflow for reviewing a plan, decision, architecture, or piece of work inside the current project. Use when the user says "convene the council" (or "counsel"), "get the council on this", "council review", "run the five advisors", "stress-test this", "get multiple perspectives", or asks for a rigorous multi-angle challenge of a proposal (Contrarian, First Principles, Expansionist, Outsider, Executor → shuffled peer review with pass/fail gates → convergence detector → Chairman verdict). Operates on the current codebase — each advisor grounds their analysis in actual files/commits before opining. Output is validated by scripts/validate.py.
+ ---
+
+ # The Council
+
+ ## Overview
+
+ The Council reviews a proposal, plan, architecture decision, or existing implementation from five locked perspectives, cross-checks each perspective through a shuffled gate-based peer review, runs a convergence detector to catch frame drift, then has a Chairman synthesize one final verdict for the owner.
+
+ **Five advisors** → **Shuffled peer review (4 pass/fail gates per reviewee)** → **Convergence detector** → **Chairman synthesis** → **Single verdict to owner**.
+
+ Every Council run's output is validated by `scripts/validate.py` — structural failures are caught automatically.
+
+ ## When to use
+
+ - Owner presents a plan, design, or existing implementation and wants it stress-tested from multiple angles.
+ - A decision is high-stakes enough that a single-perspective analysis feels thin.
+ - The user explicitly invokes the council ("convene the council", "council on X", "run the five", etc.).
+
+ Do NOT use for simple factual questions, routine implementation tasks, or anything the user wants a quick answer on. The Council is heavyweight by design.
+
+ ## The five advisors
+
+ Each advisor has their own progressive-disclosure file with full persona, mandate, hard rules, and self-contained output template. When running an advisor pass, **load only that advisor's file** — this enforces frame integrity and works equally well in single-Claude mode (re-read per pass) and subagent mode (each subagent briefed with only their persona).
+
+ | Advisor | Frame | Produces | Persona file |
+ |---|---|---|---|
+ | Contrarian | Devil's advocate / risk analyst | Failure modes with trigger conditions | [references/contrarian.md](references/contrarian.md) |
+ | First Principles | Zero-based rebuilder | Atomic truths + reconstructed solution | [references/first-principles.md](references/first-principles.md) |
+ | Expansionist | Frame-expander / opportunity-spotter | Asymmetric upside + latent assets | [references/expansionist.md](references/expansionist.md) |
+ | Outsider | Cold-read stranger | Claim/reality gaps from the artifact alone | [references/outsider.md](references/outsider.md) |
+ | Executor | Action-only | Exactly one 60-minute action with expected outcome | [references/executor.md](references/executor.md) |
+
+ Each persona's "Your lane vs. other advisors' lanes" section enforces boundaries — frame bleed fails the G3 gate in peer review.
+
+ ## Execution mode — subagents are the default
+
+ **Default: subagent mode.** Spawn five parallel `Agent` calls, one per advisor. Each subagent receives *only* the shared evidence pack and the path to their own persona file — nothing else. This is the only execution mode with true frame isolation, and it is the default for any non-trivial question.
+
+ **Exception: single-Claude mode** is permitted only when (a) the question is extremely narrow (single file, single function), or (b) subagent infrastructure is unavailable. In single-Claude mode, Claude re-reads each persona file before each pass and explicitly acknowledges the "Your lane vs." section. The convergence detector (Phase 2.5) is especially load-bearing in this mode.
+
+ ## Canonical file layout (mandatory)
+
+ Every run lives under a run directory (`<run-dir>/`, created by `scripts/run_council.py init`). **Each subagent writes its own output file** — the orchestrator does NOT transcribe agent text into its own context. This is structurally important: agent outputs land directly on disk, the orchestrator reads them back when needed (or at assembly time), and the run directory is the audit trail.
+
+ | File | Owner | When written |
+ |---|---|---|
+ | `<run-dir>/phase0.md` | **Orchestrator** | Phase 0 (evidence pack) |
+ | `<run-dir>/phase1-contrarian.md` | **Contrarian agent** | Phase 1 |
+ | `<run-dir>/phase1-first-principles.md` | **First Principles agent** | Phase 1 |
+ | `<run-dir>/phase1-expansionist.md` | **Expansionist agent** | Phase 1 |
+ | `<run-dir>/phase1-outsider.md` | **Outsider agent** | Phase 1 |
+ | `<run-dir>/phase1-executor.md` | **Executor agent** | Phase 1 |
+ | `<run-dir>/peer-contrarian-on-first-principles.md` | **Contrarian-as-reviewer agent** | Phase 2 |
+ | `<run-dir>/peer-first-principles-on-expansionist.md` | **First Principles-as-reviewer agent** | Phase 2 |
+ | `<run-dir>/peer-expansionist-on-outsider.md` | **Expansionist-as-reviewer agent** | Phase 2 |
+ | `<run-dir>/peer-outsider-on-executor.md` | **Outsider-as-reviewer agent** | Phase 2 |
+ | `<run-dir>/peer-executor-on-contrarian.md` | **Executor-as-reviewer agent** | Phase 2 |
+ | `<run-dir>/phase2_5.md` | **Orchestrator** | Phase 2.5 (convergence) |
+ | `<run-dir>/phase3.md` | **Orchestrator** (or Chairman agent if delegated) | Phase 3 |
+ | `<run-dir>/output.md` | **Orchestrator** (assembled from above) | After Phase 3 |
+ | `<run-dir>/verdict.md`, `tldr.md` | **Auto-generated** by `run_council.py ingest` | After validate |
+
+ The agent file-write contract: each Phase-1 / Phase-2 subagent must use the `Write` tool to save its full output markdown to the path above and return ONLY a one-line confirmation. The orchestrator does NOT include the agent's full output in its return-context — that bloats the orchestrator window unnecessarily.
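+
+ Before starting Phase 2, the orchestrator can cheaply confirm the contract held. A minimal sketch (same `<run-dir>` placeholder as the table above):
+
+ ```bash
+ # All five Phase-1 advisor files must exist before peer review starts
+ ls <run-dir>/phase1-*.md | wc -l    # expect: 5
+ ```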
+
+ ### Subagent briefing template — Phase 1 (advisor)
+
+ Pass this verbatim to each Agent call, substituting the bracketed values:
+
+ ```
+ You are the <Advisor Name>. Read your persona, mandate, hard rules, and output
+ template at this path before producing any output:
+
+ packages/skills/skills/ct-council/references/<advisor-slug>.md
+
+ The restated question is: <restated question from Phase 0>
+
+ The evidence pack is:
+ 1. <file:line | commit | symbol> — <rationale>
+ 2. ...
+
+ Produce exactly the output specified in your persona file's "Your output"
+ template. Do not break frame. Cite at least two items from the evidence pack.
+ Stay strictly in your lane — the "Your lane vs. other advisors' lanes" section
+ is enforced in peer review.
+
+ WRITE your full output to this exact path using the Write tool:
+
+ <run-dir>/phase1-<advisor-slug>.md
+
+ After the file is written, return ONLY a one-line confirmation
+ (e.g. "Wrote phase1-contrarian.md, sharpest point: <one-clause summary>").
+ DO NOT include the full advisor analysis in your reply — the orchestrator reads
+ it back from the file at assembly time.
+ ```
+
+ ### Subagent briefing template — Phase 2 (peer review)
+
+ The peer-review briefing follows the fixed rotation (`Contrarian → First Principles`, etc.). Each reviewer reads four files: their own persona, the reviewee's persona (for the G3 lane check), the reviewee's actual output, and the shared gate spec (`references/peer-review.md`).
+
+ ```
+ You are <Reviewer> running a peer review of <Reviewee>. Read these files
+ in order before producing any output:
+
+ 1. packages/skills/skills/ct-council/references/<reviewer-slug>.md
+    (your persona — stay in this frame)
+ 2. packages/skills/skills/ct-council/references/<reviewee-slug>.md
+    (reviewee's persona — the "Your lane vs. other advisors' lanes" section
+    is what G3 Frame integrity is enforced against)
+ 3. <run-dir>/phase1-<reviewee-slug>.md
+    (the output you are evaluating)
+ 4. packages/skills/skills/ct-council/references/peer-review.md
+    (gate format, output template, hard rules — the gate-line format is
+    load-bearing; "G1 Rigor:" not "G1 Rigor gate:")
+
+ The shared evidence pack lives at: <run-dir>/phase0.md
+
+ Evaluate the reviewee against the four gates (G1 Rigor, G2 Evidence grounding,
+ G3 Frame integrity, G4 Actionability). Each gate is strictly PASS or FAIL —
+ no PARTIAL / MIXED. The verdict-line format MUST match exactly:
+
+ - G1 Rigor: PASS|FAIL — <evidence>
+ - G2 Evidence grounding: PASS|FAIL — <evidence>
+ - G3 Frame integrity: PASS|FAIL — <evidence>
+ - G4 Actionability: PASS|FAIL — <evidence>
+
+ Do NOT append "gate" to the verdict-line names (the validator regex rejects it).
+
+ WRITE your full peer-review output to this exact path using the Write tool:
+
+ <run-dir>/peer-<reviewer-slug>-on-<reviewee-slug>.md
+
+ After the file is written, return ONLY a one-line confirmation including the
+ gate-pass count and disposition (e.g. "Wrote peer-contrarian-on-first-principles.md
+ — 4/4 PASS, Disposition: Accept"). DO NOT include the full peer review in your
+ reply — the orchestrator reads it back from the file.
+ ```
+
+ ## Phase ownership — who executes what
+
+ The skill uses a mix of orchestrator-owned and agent-owned phases. Know which is which before running:
+
+ | Phase | Owner | Writes file | Why |
+ |---|---|---|---|
+ | Phase 0 — evidence pack | **Orchestrator** | `phase0.md` | Needs codebase access + project memory; produces the shared pack distributed to all 5 advisors. |
+ | Phase 1 — 5 advisor passes | **Independent agents** | `phase1-<slug>.md` (each agent writes its own) | True frame isolation requires separate Claude instances; each agent writes directly to disk so the orchestrator never sees the full advisor text mid-flight. |
+ | Phase 2 — 5 peer reviews | **Independent agents** | `peer-<reviewer>-on-<reviewee>.md` (each agent writes its own) | Reviewer frame-integrity requires seeing only their own persona, the reviewee's persona, and the reviewee's output; each agent writes its review file directly. |
+ | Phase 2.5 — convergence check | **Orchestrator** | `phase2_5.md` | Needs all 5 sharpest points at once; mechanical pairwise check (validated by `telemetry.py --phase-2-5`); not a frame-locked judgment. |
+ | Phase 3 — Chairman verdict | **Orchestrator (default)** or 6th agent (optional) | `phase3.md` | Default: orchestrator reads all advisor + peer-review files directly. Optional: spawn Chairman as a 6th `Agent` call that writes `phase3.md` itself. |
+ | Final assembly | **Orchestrator** | `output.md` | Concatenates `phase0.md` + 5 advisor files + 5 peer-review files + `phase2_5.md` + `phase3.md`. |
+ | Lean deliverables | **Auto-generated** | `verdict.md`, `tldr.md` | Created by `run_council.py ingest` after structural validation. |
+
+ **Chairman-as-agent is recommended when:**
+ - The decision is high-stakes and the orchestrator's context may be polluted by other work.
+ - The advisors produced genuinely contested verdicts (not just different angles on the same conclusion) that benefit from a fresh reader.
+ - You want the Chairman's verdict to be independently auditable (the 6th agent's briefing + output becomes its own reviewable artifact).
+
+ **Chairman-as-orchestrator is fine when:** the context is clean, the advisors converged through different routes, and you want lower token spend.
+
+ ## Workflow
+
+ Four phases (Phase 0, Phase 1, Phase 2 + 2.5, Phase 3). Each phase finishes before the next begins. Phase 0 is validator-gated; Phase 2.5 may trigger a rerun.
+
+ ### Phase 0 — Intake and ground-truthing (validator-gated)
+
+ Produce:
+ 1. **A restated question** — one sentence, testable decision shape.
+ 2. **Evidence pack** — 3–7 items, each with citation + one-line rationale.
+
+ The validator refuses to accept the output if either is missing or malformed. Phase 1 does not start until Phase 0 passes.
+
+ Full guidance, item types, and format → [references/evidence-pack.md](references/evidence-pack.md).
+
+ **For external docs / APIs / specs**, use the `llmtxt:<slug>[@<version>]` evidence-pack item type and fetch compressed overviews via `scripts/llmtxt_ref.py` (api.llmtxt.my). Anonymous reads work for public docs (60/min per IP; anonymous session cookie persisted locally); set `LLMTXT_API_KEY` for private/org docs. Cached under `~/.cache/council/llmtxt/` — indefinitely for pinned versions, 60s for `latest`.
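+
+ A hypothetical three-item pack in that format (paths, commit, and slug invented for illustration):
+
+ ```
+ 1. src/db/pool.py:142 — current connection-pool cap, the bottleneck under review
+ 2. a1b2c3d — commit that introduced the retry loop this plan would remove
+ 3. llmtxt:postgresql@16 — compressed overview of the migration target's docs
+ ```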
+
+ ### Phase 1 — Advisor analysis (5 parallel or sequential passes)
+
+ For each of the 5 advisors, produce one output section following the persona's output template exactly. Cite at least 2 items from the evidence pack.
+
+ Subagent mode: spawn 5 parallel `Agent` calls with the briefing template above.
+ Single-Claude mode: run 5 sequential passes, re-reading each persona file before each pass.
+
+ ### Phase 2 — Shuffled gate-based peer review
+
+ Every advisor's output is reviewed by exactly one other advisor via the fixed rotation. No self-review; every advisor reviews once and is reviewed once:
+
+ ```
+ Contrarian       → reviews → First Principles
+ First Principles → reviews → Expansionist
+ Expansionist     → reviews → Outsider
+ Outsider         → reviews → Executor
+ Executor         → reviews → Contrarian
+ ```
+
+ The reviewer evaluates the reviewee against **4 pass/fail gates** (not numeric scores), each requiring quoted evidence:
+
+ - **G1 Rigor** — are findings specific and non-hedged?
+ - **G2 Evidence grounding** — does every finding cite from the evidence pack?
+ - **G3 Frame integrity** — does every finding stay in the reviewee's lane?
+ - **G4 Actionability** — does the verdict cash out to a decision?
+
+ Full gates, evidence requirements, and review template → [references/peer-review.md](references/peer-review.md).
+
+ ### Phase 2.5 — Convergence detector (MANDATORY)
+
+ After all five peer reviews complete, before the Chairman synthesizes:
+
+ 1. Extract the "Single sharpest point" from each advisor (5 sentences).
+ 2. Pairwise-compare. Are ≥3 semantically the same finding (same subject + predicate)?
+ 3. If YES → **convergence flag**. Rerun the advisor(s) with the lowest gate-pass count, with explicit frame-reinforcement (re-read "Your lane vs." section). Re-review the new output. Repeat until the flag clears.
+ 4. If NO → proceed to Phase 3.
+
+ The convergence detector is the structural antibody to single-Claude-mode frame smearing. Even in subagent mode, run it — it catches cases where frames were under-specified for the question.
+
+ ### Phase 3 — Chairman synthesis
+
+ The Chairman is a separate voice (not one of the five advisors). It reads all five advisor analyses and all five peer reviews (with gate results), verifies that Phase 0 and Phase 2.5 completed, then produces the final verdict.
+
+ Per-advisor weight is computed from gate-pass count (0–4). The verdict MUST:
+ - State a single clear recommendation (no fence-sitting).
+ - Include the full gate summary table.
+ - Reconcile contradictions explicitly.
+ - Carry forward the sharpest finding from each of the five advisors.
+ - End with the Executor's 60-minute action and a confidence rating.
+
+ Full synthesis protocol and verdict template → [references/chairman.md](references/chairman.md).
+
+ ## Output contract — three-tier deliverables
+
+ Every validated Council run produces **three artifacts** under `<run-dir>/`. They serve different consumers; do not conflate them:
+
+ | File | ~Lines | Purpose | Audience |
+ |---|---|---|---|
+ | `tldr.md` | 10-15 | Recommendation + action + confidence (level only) | PR comments, chat, status updates |
+ | `verdict.md` | 60-80 | Full Chairman section with question header — **the deliverable** | Owner / decision-maker |
+ | `output.md` | 300-400 | Phase 0 + 5 advisors + 5 peer reviews + 2.5 + 3 — full transcript | Audit trail, post-hoc analysis |
+
+ The full transcript was the historical primary output; the verdict was buried at the bottom. After telemetry from shakedown #5 onward showed every consumer scrolling past ~290 lines of upstream artifacts to reach the Chairman section, the three-tier split was made canonical. **`verdict.md` is what you hand the owner; `output.md` is what proves it's defensible.**
+
+ ### Full transcript structure (`output.md`)
+
+ ```
+ # The Council — <one-line question>
+
+ ## Evidence pack
+
+ ## Phase 1 — Advisor analyses
+ ### Advisor: Contrarian
+ ### Advisor: First Principles
+ ### Advisor: Expansionist
+ ### Advisor: Outsider
+ ### Advisor: Executor
+
+ ## Phase 2 — Shuffled peer reviews
+ ### Contrarian reviewing First Principles
+ ### First Principles reviewing Expansionist
+ ### Expansionist reviewing Outsider
+ ### Outsider reviewing Executor
+ ### Executor reviewing Contrarian
+
+ ## Phase 2.5 — Convergence check
+
+ ## Phase 3 — Chairman's verdict
+ ```
+
+ ### Run index — find past runs across the project
+
+ Every run is auto-indexed at `.cleo/council-runs/INDEX.jsonl` — a project-scoped human-readable roster (one line per run with title, description, status, hash, run_dir). Distinct from the deeper `.cleo/council-runs.jsonl` telemetry log. The INDEX is the "find me that run from last Tuesday" surface; the telemetry log is what `analyze_runs.py` reads.
+
+ ```bash
+ # At init, an entry is written with status=initialized
+ python3 scripts/run_council.py init "<question>" --title "<short>" --description "<longer>"
+
+ # At ingest, the entry is updated with status=ingested + verdict snippet + validation summary
+ python3 scripts/run_council.py ingest <run-dir>
+
+ # Browse / search
+ python3 scripts/run_council.py list                       # newest-first table
+ python3 scripts/run_council.py list --status initialized  # only in-progress runs
+ python3 scripts/run_council.py list --limit 10            # last 10
+ python3 scripts/run_council.py list --json                # JSON
+ python3 scripts/run_council.py find "convergence"         # substring search
+ python3 scripts/run_council.py show <run-id-prefix>       # full INDEX entry
+ python3 scripts/run_council.py reindex                    # rebuild from run.json files
+ ```
+
+ The `--title` flag is optional; if omitted the title is auto-derived from the question (interrogative prefix stripped, truncated to 60 chars). The `--description` flag is also optional and defaults to the full question.
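+
+ For orientation, a single INDEX line plausibly carries these fields (names taken from the roster description above, values invented; `reindex` treats the `run.json` files as the source of truth):
+
+ ```json
+ {"id": "4f2a91c", "title": "Migrate from SQLite to Postgres?",
+  "description": "Should we migrate the project datastore from SQLite to Postgres?",
+  "status": "ingested", "hash": "4f2a91c", "run_dir": ".cleo/council-runs/2026-05-17-postgres"}
+ ```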
+
+ ### Validation + auto-generation
+
+ `scripts/run_council.py ingest <run-dir>` validates `output.md`, then automatically writes `verdict.md` and `tldr.md` to the same run dir, then appends a telemetry record to `.cleo/council-runs.jsonl`. Direct usage:
+
+ ```bash
+ # Full output (default)
+ python3 scripts/validate.py <output.md>            # exit 0 = structurally valid
+
+ # Partial files (when assembling phase-by-phase)
+ python3 scripts/validate.py --phase 0 <phase0.md>  # only H1 + evidence pack
+ python3 scripts/validate.py --phase 1 <file.md>    # +5 advisor sections
+ python3 scripts/validate.py --phase 2 <file.md>    # +5 peer reviews
+
+ python3 scripts/run_council.py ingest <run-dir>    # validate + verdict.md + tldr.md + telemetry
+ ```
+
+ The validator **auto-detects partial files** when no `--phase` is passed: if no Phase 3 header is present, it validates up to the highest phase the file contains and prints a stderr suggestion. This prevents the "12 missing-section errors" wall of red when you're checking a phase0.md or phase1-*.md mid-flight.
+
+ Exit code 0 = structurally valid. Non-zero = fix the violations before surfacing the verdict.
+
+ ## Worked example
+
+ A compact end-to-end golden run is in [references/examples.md](references/examples.md). Read it before your first council run so you have a concrete reference for what "good" looks like at each phase.
+
+ ## Anti-patterns (reject any council run that does these)
+
+ - Skipping Phase 0 validator gate and opining from memory.
+ - Skipping Phase 2.5 convergence detector and letting a synthesized verdict cover convergent advisor outputs.
+ - Running an advisor pass without loading the persona file first — frame won't hold.
+ - Treating the 4 gates as numeric scores — they are pass/fail with quoted evidence required.
+ - Peer reviewer producing a second copy of their own analysis instead of evaluating the reviewee.
+ - Chairman writing "on one hand / on the other hand" — the Chairman decides.
+ - Five advisors reaching identical conclusions (Phase 2.5 should have caught this).
+ - Using the council for trivial questions where one clear answer already exists.
+
+ ## Validation
+
+ The `scripts/validate.py` checker enforces:
+
+ - Phase 0 gate (restated question + 3–7 evidence items with rationales).
+ - All 5 advisor sections with required subsections.
+ - Executor produced exactly one action.
+ - Peer review rotation matches the fixed 5-cycle.
+ - Each peer review has 4 gates with PASS/FAIL and cited evidence.
+ - Phase 2.5 convergence check was run.
+ - Chairman verdict has all required subsections + gate summary.
+
+ Tests live in `scripts/test_validate.py` and `scripts/test_telemetry.py` — run via `python3 -m unittest test_validate test_telemetry` from the `scripts/` directory.
+
+ ## Telemetry & systematic hardening
+
+ The Council learns from itself. Every validated run should be ingested into a JSONL log so failure-mode patterns surface across runs instead of being lost between sessions.
+
+ ```bash
+ # 1. Initialize a run directory + skeleton phase0.md
+ python3 scripts/run_council.py init "<one-sentence question>" --scenario <name>
+
+ # 2. Orchestrator does Phase 0..3, assembles the artifact at <run-dir>/output.md
+ #    (subagent mode is the default — see "Execution mode" above).
+
+ # 3. Ingest: validate + emit telemetry to .cleo/council-runs.jsonl
+ python3 scripts/run_council.py ingest <run-dir> --tokens <N> --wall-clock <secs>
+
+ # 4. After several runs, surface hotspots
+ python3 scripts/analyze_runs.py
+ ```
+
+ **Phase 2.5 structured extractor** (`telemetry.py --phase-2-5 <run-dir>`) replaces the prose-only convergence channel with a versioned JSON artifact. Reads each `phase1-<advisor>.md` file, parses the `**Single sharpest point:**` line (anchored on start-of-line — inline mentions in action bodies do not match), computes pairwise same-finding via exact-normalized + Jaccard token-overlap ≥ 0.6, and raises `flag_mechanical=true` iff a 3-clique exists in the pairwise-same graph (matches the protocol's "≥3 semantically the same finding" rule). Output schema includes `sharpest_points`, `pairwise_same`, `pair_methods`, `missing_advisors`, `jaccard_threshold`. Use this as the structured-output channel that the orchestrator's manual semantic Phase 2.5 should agree with — divergence is a signal to refine either the threshold or the manual read.
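+
+ The mechanical core of that check is small. A standalone sketch for orientation (illustrative only; `scripts/telemetry.py` is the authoritative implementation):
+
+ ```python
+ # Minimal re-implementation of the Phase 2.5 mechanical convergence check.
+ import itertools
+ import re
+ from pathlib import Path
+
+ ADVISORS = ["contrarian", "first-principles", "expansionist", "outsider", "executor"]
+ JACCARD_THRESHOLD = 0.6
+
+ def sharpest_point(run_dir: Path, advisor: str) -> str | None:
+     """Parse the start-of-line '**Single sharpest point:**' marker, if present."""
+     path = run_dir / f"phase1-{advisor}.md"
+     if not path.exists():
+         return None  # recorded under missing_advisors in the real artifact
+     m = re.search(r"^\*\*Single sharpest point:\*\*\s*(.+)$",
+                   path.read_text(), re.MULTILINE)
+     return m.group(1).strip() if m else None
+
+ def same_finding(a: str, b: str) -> bool:
+     """Exact-normalized token match, else Jaccard token overlap >= 0.6."""
+     ta = set(re.findall(r"[a-z0-9]+", a.lower()))
+     tb = set(re.findall(r"[a-z0-9]+", b.lower()))
+     if not ta or not tb:
+         return False
+     return ta == tb or len(ta & tb) / len(ta | tb) >= JACCARD_THRESHOLD
+
+ def flag_mechanical(run_dir: Path) -> bool:
+     """True iff any three advisors form a clique in the pairwise-same graph."""
+     points = [p for a in ADVISORS if (p := sharpest_point(run_dir, a))]
+     return any(
+         all(same_finding(x, y) for x, y in itertools.combinations(trio, 2))
+         for trio in itertools.combinations(points, 3)
+     )
+ ```
+
+ When the mechanical flag and the manual semantic read disagree, treat the disagreement itself as the signal, per the paragraph above.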
+
+ `analyze_runs.py` reports:
+
+ - **Gate-failure hotspots** — which (advisor, gate) pair fails most. A gate failing in ≥2 runs is systemic; harden the persona file, not the run.
+ - **Peer-review disposition distribution** — all-Accept across many runs signals reviewers being too lenient (G3 frame-integrity is the usual culprit).
+ - **Convergence flag rate** — should fire rarely; high rate means frame definitions are too narrow for the questions being asked.
+ - **Chairman confidence distribution** — recurring `low` or `medium-low` confidence on similar question shapes is a candidate for documenting "not a good council fit."
+ - **Token + wall-clock spread** — exit criterion target is ≤20% per scope tier.
+
+ Scenario tags for `--scenario` (from the hardening plan): `baseline`, `external-doc-heavy`, `three-way`, `sparse-ops`, `contradictory`, `non-cleo`, `mini`, `contention`. These let `analyze_runs.py` dimension hotspots by question shape.
+
+ The JSONL schema is documented in `scripts/telemetry.py` (`TelemetryRecord`), versioned via `schema_version`.
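+
+ Illustratively, a record might carry fields like the following (names guessed from the CLI flags and report dimensions above; the `TelemetryRecord` dataclass is the source of truth):
+
+ ```json
+ {"schema_version": 1, "scenario": "baseline", "tokens": 610000, "wall_clock_secs": 540,
+  "gate_passes": {"contrarian": 4, "first-principles": 3, "expansionist": 4, "outsider": 4, "executor": 4},
+  "convergence_flag": false, "chairman_confidence": "high"}
+ ```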
+
+ ## Invocation examples
+
+ - "Convene the council on whether we should migrate from SQLite to Postgres."
+ - "Council review of the T1140 worktree-by-default design."
+ - "Run the five advisors on this PR before I merge."
+ - "Stress-test this plan with the council."
package/skills/ct-council/optimization/HARDENING-PLAYBOOK.md ADDED
@@ -0,0 +1,107 @@
+ # Council Hardening Playbook
+
+ The blueprint for systematically hardening the Council across question shapes via a structured shakedown campaign. This file is the **template**; per-campaign instances live under `optimization/campaigns/<name>/` and are gitignored.
+
+ A campaign is a sequence of shakedowns (1-N) where each shakedown:
+ 1. Runs the full Council pipeline (Phase 0 → 5 advisors → 5 peer reviews → Phase 2.5 → Chairman)
+ 2. Surfaces whatever failure modes are present (gate fails, validator catches, persona drift, fabricated framing)
+ 3. Ships at most one hardening fix to `references/*.md` or `scripts/`
+ 4. Adds a regression test (when applicable) so the fix is measurable in future runs
+
+ Telemetry persists to `.cleo/council-runs.jsonl` (skill-root) and is read by `scripts/analyze_runs.py`. The campaign manager (`optimization/scripts/campaign.py`) tracks which scenarios are done, which fixes shipped, and what to run next.
+
+ ## Campaign workflow
+
+ ```bash
+ # Start a new campaign from this playbook
+ python3 optimization/scripts/campaign.py new <campaign-name>
+
+ # Show status (which scenarios done, hotspots, fixes shipped)
+ python3 optimization/scripts/campaign.py status [--name <campaign>]
+
+ # Get the next scenario to run (with full briefing)
+ python3 optimization/scripts/campaign.py next [--name <campaign>]
+
+ # Mark a scenario complete after ingest
+ python3 optimization/scripts/campaign.py done <scenario-id> <run-dir-id> [--name <campaign>]
+
+ # Log a hardening fix that shipped between runs
+ python3 optimization/scripts/campaign.py log "<failure>" "<fix>" "<regression-test>"
+ ```
+
+ ## The eight scenarios
+
+ Each scenario stresses at least one dimension not covered by the prior runs. Run in order; each subsequent scenario builds on the prior's hardening.
+
+ **Scenarios are loaded from `optimization/scenarios.yaml`** (or `scenarios.json` as a fallback). To add or modify a scenario, edit the YAML — no code changes required. The schema requires `id`, `number`, `title`, `dimension`, `shape`, `learn`, and `briefing` per entry. `campaign.py` picks up changes on the next invocation.
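+
+ A minimal entry, assuming a top-level list (keys per the schema above, wording abridged from the baseline row in the table below; the committed `scenarios.yaml` is authoritative):
+
+ ```yaml
+ - id: baseline
+   number: 1
+   title: Baseline control run
+   dimension: Control run
+   shape: Narrow binary question, dense evidence (5-7 path:line citations)
+   learn: Baseline cost / wall-clock / gate-pass distribution for later runs
+   briefing: >
+     Run the full pipeline on a narrow binary question backed by a dense
+     evidence pack with no llmtxt items; record cost and gate-pass baselines.
+ ```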
+
+ | # | Scenario | Dimension stressed | Question shape | What we learn |
+ |---|---|---|---|---|
+ | 1 | **baseline** | Control run | Narrow binary, dense evidence (5-7 path:line citations, no `llmtxt:`) | Baseline cost / wall-clock / gate-pass distribution all subsequent runs compare against |
+ | 2 | **external-doc-heavy** | Live `llmtxt:` integration | Binary, ≥3 of 7 evidence items as `llmtxt:<slug>` | Does the wrapper survive real subagent distribution under auth/rate-limit conditions? |
+ | 3 | **three-way** | Chairman ranking, not binary approve | "Which of A / B / C should we pick?" | Does the verdict template hold for N-way? Is `### Recommendation` flexible enough? |
+ | 4 | **sparse-ops** | Advisors with no code to grep | Configs + external docs only; no executable-code citations | Do advisors honestly say "insufficient" or hallucinate to fill gaps? |
+ | 5 | **contradictory** | Contradiction handling | Pack contains 2 items that disagree on purpose | Does Outsider catch it? Does FP re-derive cleanly under conflicting overlay? |
+ | 6 | **non-cleo** | Portability beyond cleocode conventions | Clone a small external repo + bug report; run council against it | Does the skill work on any project, or has it accumulated cleocode-isms? |
+ | 7 | **mini** | Overhead-vs-signal ratio | 3 evidence items only (the validator floor) | Is a "mini-council" variant worth shipping? Can the gates fire on thin packs? |
+ | 8 | **contention** | Chairman reconciliation under genuine disagreement | Designed to provoke a 3-vs-2 advisor split | Does the Chairman template handle real contention rather than directional convergence? |
+
+ ## Between-run rules
+
+ - **After every run**, run `python3 scripts/analyze_runs.py --log .cleo/council-runs.jsonl` and check:
+   - Gate failure appearing in ≥2 runs = **systemic**; harden the persona/validator that produced it.
+   - Recurring "What I would add" cross-frame additions in peer reviews = **candidate for a new structural slot** in the persona output template.
+   - Chairman confidence < `medium` on ≥2 runs with similar question shape = **the skill handles that shape poorly**; document as "not a good council fit" in `SKILL.md` or `references/evidence-pack.md`.
+   - All-PASS gate verdicts across many runs = **suspicious leniency**; design the next shakedown to deliberately violate one frame's lane.
+
+ - **Between-run hardening fixes** must be logged via `campaign.py log` so the cumulative findings.md captures the compounding pattern.
+
+ ## Exit criteria (campaign success)
+
+ The campaign succeeds when ALL of these hold across the runs:
+
+ - [ ] All 8 scenarios validate structurally (`scripts/validate.py` exit 0).
+ - [ ] Every advisor achieves **≥3.0 average gate-pass** across the 8 runs.
+ - [ ] **Convergence flag fires at most once** across the 8 runs (it should be rare; firing more = persona files have a contamination problem).
+ - [ ] Chairman confidence is `high` or `medium-high` on **≥6 of 8** runs.
+ - [ ] Token cost stable **within 20%** per scope tier (mean ± 20%).
+ - [ ] At least one substantive Outsider catch per 4 runs (cold-read producing artifact-internal-contradiction or premise-falsification finding that no other lane could produce).
+
+ `campaign.py status` prints the scorecard automatically.
+
+ ## Cost honesty
+
+ Per shakedown:
+ - ~600k tokens (5 advisors × ~55k + 5 peer reviews × ~57k + Phase 2.5 + Chairman)
+ - ~8-10 minutes wall-clock
+ - ~$3-5 in API costs at current Sonnet/Opus rates
+
+ Full 8-scenario campaign: ~5M tokens, ~75 minutes wall-clock, ~$30-40. **Realistic cadence: 1-2 shakedowns per session, analyze, iterate. Full plan = weeks of evenings, not one marathon.**
+
+ ## What falls out when done
+
+ - **Portable** — the skill works on any project, not just cleocode (validated by S6 non-cleo run)
+ - **Calibrated across scales** — mini-council (3 items) and full-council (5-7 items) both validated
+ - **Telemetry history** — future hardening is evidence-based via `.cleo/council-runs.jsonl`, not vibes-based
+
+ ## Tradeoff
+
+ This playbook optimizes the skill for **quality across question shapes**, not for faster/cheaper runs. If the goal is **adoption** (make it lighter so it gets used more often), the scenario list should be reordered to lead with #7 (mini) and cut the structural stress tests (#5, #8).
+
+ ## Failure-mode diff template
+
+ Each scenario's findings get appended to `optimization/campaigns/<name>/findings.md` in this format:
+
+ ```
+ | Run | Scenario | Failure surfaced | Fix shipped | Regression test |
+ |---|---|---|---|---|
+ | 1 | baseline | <one-line failure> | <one-line fix> | yes/no/n-a |
+ ```
+
+ The compounding pattern: a fix shipped after run N should be measurably validated by run N+1 or later. The `Regression test` column tracks whether that validation has been observed.
+
+ ## Campaign archive (this skill's history)
+
+ The campaign directories are intentionally gitignored, so historical campaigns don't pollute the repo. The **distilled findings** that survived multiple campaigns SHOULD eventually be promoted into the persona files / SKILL.md as canonical hardening — at which point the campaign that produced them can be archived locally or deleted.
+
+ If a campaign produced a fix worth committing, the diff to `references/*.md` or `scripts/*.py` is the canonical commit; the campaign dir itself stays local.
package/skills/ct-council/optimization/README.md ADDED
@@ -0,0 +1,74 @@
+ # `optimization/` — Council Hardening Campaigns
+
+ This directory holds the **systematic hardening machinery** for The Council skill. It separates the durable playbook (committed) from the per-campaign state (gitignored), so you can run multiple multi-session optimization passes without polluting the repo.
+
+ ## What's here
+
+ | Path | Tracked? | Purpose |
+ |---|---|---|
+ | `HARDENING-PLAYBOOK.md` | ✅ committed | The master plan — 8 scenarios, exit criteria, between-run rules, cost honesty |
+ | `scripts/campaign.py` | ✅ committed | Programmatic tracker — `new / next / done / log / status / list / active` |
+ | `README.md` | ✅ committed | This file |
+ | `.gitignore` | ✅ committed | Keeps `campaigns/` and `.active-campaign` out of git |
+ | `campaigns/<name>/` | 🚫 gitignored | Per-campaign state: manifest, plan, findings, run symlinks |
+ | `.active-campaign` | 🚫 gitignored | Pointer to the currently-active campaign |
+
+ ## Workflow
+
+ ```bash
+ # From the skill root: packages/skills/skills/ct-council/
+
+ # 1. Start a new campaign (any time you want a fresh hardening pass)
+ python3 optimization/scripts/campaign.py new 2026-04-25-portability
+
+ # 2. See what to run next (prints scenario briefing)
+ python3 optimization/scripts/campaign.py next
+
+ # 3. Run a shakedown using the existing pipeline
+ python3 scripts/run_council.py init "<question>" --scenario <id> --subagent-mode
+ # orchestrator runs Phase 0..3 → assembles output.md
+ python3 scripts/run_council.py ingest <run-dir>
+
+ # 4. Mark the scenario complete (links the run dir into the campaign)
+ python3 optimization/scripts/campaign.py done <scenario-id> <run-dir-id>
+
+ # 5. If a hardening fix landed between runs, log it
+ python3 optimization/scripts/campaign.py log "Executor mis-cite line range" \
+     "Pre-action verification rule in executor.md" "yes"
+
+ # 6. After every run, check status (exit-criteria scorecard auto-renders)
+ python3 optimization/scripts/campaign.py status
+
+ # Resume across sessions
+ python3 optimization/scripts/campaign.py list                  # see all campaigns
+ python3 optimization/scripts/campaign.py active --set <name>   # switch
+ ```
+
+ ## Why split playbook from state?
+
+ - **Playbook is durable.** The 8 scenarios + exit criteria don't change run-to-run; they're the institutional memory of how to harden a multi-frame review skill. Promoting it to a checked-in artifact means future operators can run the same campaign without re-deriving it.
+ - **State is local.** Run dirs are large (~300-400 line transcripts × N runs), telemetry contains project-specific paths, and findings get *promoted* into the persona files when they prove durable. Keeping campaign state out of git avoids both noise and provenance leakage.
+
+ ## Promotion path: campaign findings → committed code
+
+ When a hardening fix proves durable across ≥2 runs in the same campaign, it should be promoted from the campaign's `findings.md` into committed code:
+
+ | Fix shape | Goes into |
+ |---|---|
+ | Persona output template change | `references/<advisor>.md` |
+ | New gate / format rule | `references/peer-review.md` + `scripts/validate.py` |
+ | Phase 0 / orchestrator discipline | `references/evidence-pack.md` |
+ | Tooling / pipeline change | `scripts/<file>.py` + tests |
+ | Output-shape change (verdict / TL;DR) | `scripts/telemetry.py` + tests |
+
+ The campaign directory itself is **disposable** once durable findings are promoted. The git diff to the persona files is the canonical commit; the campaign that produced it stays local.
+
+ ## Cost expectations (re-stated from the playbook)
+
+ - Per shakedown: ~600k tokens, ~9 minutes wall-clock, ~$3-5
+ - Full 8-scenario campaign: ~5M tokens, ~75 minutes, ~$30-40
+ - Realistic cadence: 1-2 shakedowns per session × multiple sessions
+
+ ## Historical note
+
+ The first campaign (run during the skill's initial creation) shipped five substantive hardening fixes — structured Phase 2.5 extractor, Executor pre-action verification, gate-line format spec, Phase-0 fact-check rule, three-tier output (verdict.md / tldr.md / output.md). Those are now in the committed code. The campaign directory that produced them is gitignored and may be archived locally.