@cleocode/skills 2026.5.16 → 2026.5.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/ct-council/SKILL.md +377 -0
- package/skills/ct-council/optimization/HARDENING-PLAYBOOK.md +107 -0
- package/skills/ct-council/optimization/README.md +74 -0
- package/skills/ct-council/optimization/scenarios.yaml +121 -0
- package/skills/ct-council/optimization/scripts/campaign.py +543 -0
- package/skills/ct-council/optimization/scripts/test_campaign.py +143 -0
- package/skills/ct-council/references/chairman.md +119 -0
- package/skills/ct-council/references/contrarian.md +70 -0
- package/skills/ct-council/references/evidence-pack.md +145 -0
- package/skills/ct-council/references/examples.md +235 -0
- package/skills/ct-council/references/executor.md +83 -0
- package/skills/ct-council/references/expansionist.md +68 -0
- package/skills/ct-council/references/first-principles.md +73 -0
- package/skills/ct-council/references/outsider.md +73 -0
- package/skills/ct-council/references/peer-review.md +125 -0
- package/skills/ct-council/scripts/analyze_runs.py +293 -0
- package/skills/ct-council/scripts/fixtures/executor_multi.md +198 -0
- package/skills/ct-council/scripts/fixtures/missing_advisor.md +117 -0
- package/skills/ct-council/scripts/fixtures/missing_convergence.md +190 -0
- package/skills/ct-council/scripts/fixtures/thin_evidence.md +193 -0
- package/skills/ct-council/scripts/fixtures/valid.md +226 -0
- package/skills/ct-council/scripts/fixtures/valid_with_llmtxt.md +226 -0
- package/skills/ct-council/scripts/llmtxt_ref.py +223 -0
- package/skills/ct-council/scripts/run_council.py +578 -0
- package/skills/ct-council/scripts/telemetry.py +624 -0
- package/skills/ct-council/scripts/test_telemetry.py +509 -0
- package/skills/ct-council/scripts/test_validate.py +452 -0
- package/skills/ct-council/scripts/validate.py +396 -0
- package/skills.json +19 -0
@@ -0,0 +1,125 @@
# Shuffled Peer Review — Gate-Based Protocol

The peer review is where frames collide productively. An advisor reviewing another advisor's output does NOT play neutral judge — they evaluate from their own locked frame, which is exactly what makes the shuffle informative. The Contrarian reviewing First Principles means: "Zero-based analysis sounds clean, but here's the failure mode you introduced by stripping context."

This protocol replaces numeric scoring with **gate-based evaluation**. Each gate is pass/fail with required evidence. The reviewer must produce a quote or concrete citation to justify each gate decision. Theater ("4/5 — good") is structurally impossible.

## The rotation (fixed, do not deviate)

```
Contrarian        → reviews → First Principles
First Principles  → reviews → Expansionist
Expansionist      → reviews → Outsider
Outsider          → reviews → Executor
Executor          → reviews → Contrarian
```

**Properties:**
- No self-review.
- Every advisor reviews exactly once and is reviewed exactly once.
- Single 5-cycle, not pairs — information flows around the full ring.
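
The three properties above can be checked mechanically. The following is a minimal sketch, not part of the package: the `ROTATION` mapping is copied from the rotation listing, while the helper names `cycle_length` and `check_rotation` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

# Rotation copied from the protocol text: reviewer -> reviewee.
ROTATION = {
    "Contrarian": "First Principles",
    "First Principles": "Expansionist",
    "Expansionist": "Outsider",
    "Outsider": "Executor",
    "Executor": "Contrarian",
}


def cycle_length(start: str, rotation: dict[str, str]) -> int:
    """Length of the cycle through `start`, or 0 if the walk never returns to it."""
    seen: set[str] = set()
    current: str | None = start
    while current is not None and current not in seen:
        seen.add(current)
        current = rotation.get(current)
    return len(seen) if current == start else 0


def check_rotation(rotation: dict[str, str]) -> dict[str, bool]:
    """Check the three stated properties (hypothetical helper, for illustration)."""
    first = next(iter(rotation))
    return {
        "no_self_review": all(r != e for r, e in rotation.items()),
        "each_reviewed_once": sorted(rotation.values()) == sorted(rotation),
        "single_cycle": cycle_length(first, rotation) == len(rotation),
    }


print(check_rotation(ROTATION))
```

A pairwise arrangement such as A reviews B and B reviews A would still satisfy the first two properties but would fail `single_cycle`, which is the distinction the third property draws.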

**Why this specific rotation:**
- *Contrarian → First Principles*: stress-tests whether atomic truths survive adversarial conditions.
- *First Principles → Expansionist*: grounds ambitious upside against what's actually true.
- *Expansionist → Outsider*: checks whether the cold-read missed an opportunity hiding in plain sight.
- *Outsider → Executor*: the stranger asks "why *that* action?" — if it only makes sense with backstory, the Executor picked wrong.
- *Executor → Contrarian*: forces risk analysis to cash out. Pure doom with no actionable mitigation is cheap.

## The four gates

Each gate is **strictly PASS or FAIL** — no middle states. The reviewer MUST provide the evidence the gate requires. A gate with no cited evidence is itself a validation failure.

**No PARTIAL / MIXED / CONDITIONAL / "PARTIAL PASS" / hedged values are allowed.** The validator rejects any gate line not matching `- G<N> <dimension>: PASS — <evidence>` or `- G<N> <dimension>: FAIL — <evidence>`.

If your judgment feels genuinely mixed — "the shape is right but the target is wrong", "it passes in spirit but not in letter", "mostly good except for one thing" — **pick FAIL** and express the nuance in the `Gap from <reviewer>'s frame` and `What I would add` fields. Those are exactly the fields the Chairman reads for texture. A FAIL with a rich gap note is more informative than a PARTIAL with thin justification, and it forces the reviewer to actually decide.

The test: would you act on this advisor's verdict as-written, unconditionally? If yes → PASS. If no, no matter why → FAIL, then explain the condition in the gap note.

### G1 — Rigor gate

**PASS** if every finding in the reviewee's "Findings" list has a named subject, predicate, and (where the frame requires it) trigger condition. The reviewer MUST quote the strongest-rigor finding and, if any finding fails, the weakest.

**FAIL** if any finding is hedged ("might", "could", "may" without concrete anchor), vague ("there are scalability concerns"), or missing the frame's required specifics (Contrarian without trigger condition; Executor without expected outcome; First Principles without atoms; Expansionist without asymmetry; Outsider without artifact citation).

### G2 — Evidence-grounding gate

**PASS** if every finding cites at least one item from the shared evidence pack, and every cited item actually exists in the pack. The reviewer MUST list all cited items.

**FAIL** if any finding is free-floating (no citation), cites an item not in the pack, or cites something that does not support the finding. The reviewer MUST list ungrounded or misgrounded findings.

### G3 — Frame-integrity gate

**PASS** if no finding belongs to another advisor's lane. The reviewer MUST read the reviewee's persona file's "Your lane vs. other advisors' lanes" section and confirm.

**FAIL** if any finding is something a different advisor would produce. The reviewer MUST name which frame the violating finding belongs to and quote the violating line.

### G4 — Actionability gate

**PASS** if the reviewee's verdict cashes out to a decision, a test, a change, or a concrete line of inquiry. The reviewer MUST quote the actionable part.

**FAIL** if the verdict is "interesting" but leaves the owner nowhere to go. "Further analysis is warranted" fails. "Reject the plan unless X is added" passes.

## Peer review output template

**Destination:** when invoked as a Phase 2 subagent, save your full peer-review output below to `<run-dir>/peer-<reviewer-slug>-on-<reviewee-slug>.md` via the `Write` tool, then return only a one-line confirmation including the gate-pass count and disposition (e.g. `Wrote peer-contrarian-on-first-principles.md — 4/4 PASS, Disposition: Accept`). Do not include the full peer-review text in your reply — the orchestrator reads it back from the file.

The gate-line format is **load-bearing** — `scripts/validate.py` parses these lines with a regex anchored to the canonical names below. Use them VERBATIM.

| Gate | Canonical line prefix | Common mistakes (rejected) |
|---|---|---|
| G1 | `- G1 Rigor: PASS \| FAIL — ...` | `G1 Rigor gate:`, `G1: Rigor:`, `G1 - Rigor:` |
| G2 | `- G2 Evidence grounding: PASS \| FAIL — ...` | `G2 Evidence-grounding gate:`, `G2: Evidence:` |
| G3 | `- G3 Frame integrity: PASS \| FAIL — ...` | `G3 Frame-integrity gate:`, `G3: Frame:` |
| G4 | `- G4 Actionability: PASS \| FAIL — ...` | `G4 Actionability gate:`, `G4: Actionability:` |

The section headers above ("G1 — Rigor gate") use "gate" as a label *for the section*; the *gate verdict line* never does. The validator rejects gate verdict lines with the "gate" suffix because they break the canonical regex.

```
### <reviewer> reviewing <reviewee>

**Gate results:**
- G1 Rigor: PASS | FAIL — <quote of strongest finding; if FAIL, quote weakest and explain>
- G2 Evidence grounding: PASS | FAIL — <list cited items; if FAIL, list ungrounded/misgrounded findings>
- G3 Frame integrity: PASS | FAIL — <confirm lane; if FAIL, name the violating frame and quote the violating line>
- G4 Actionability: PASS | FAIL — <quote the actionable part; if FAIL, explain what's missing>

**Strongest finding (from reviewee):**
<quote or close paraphrase of the one finding the reviewer thinks lands hardest, even from an opposing frame>

**Gap from <reviewer>'s frame:**
<the specific thing the reviewee missed that the reviewer's frame would have caught. Concrete — no "could have gone deeper".>

**What I would add:**
<one sentence from the reviewer's frame that sharpens or corrects the reviewee's analysis. Single value-add.>

**Disposition:** Accept | Modify | Reject — <one sentence why>
```
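
To make the canonical gate-line format concrete, here is a minimal sketch of a checker. The actual pattern used by `scripts/validate.py` is not shown in this document, so the regex `GATE_LINE` and the helper `parse_gate_line` are assumptions written only to illustrate the stated rules: a leading `- `, the gate id, the canonical dimension name, a strict PASS or FAIL, and non-empty evidence after the dash.

```python
import re

# Canonical dimension names per gate, copied from the table in the protocol.
CANONICAL = {
    "G1": "Rigor",
    "G2": "Evidence grounding",
    "G3": "Frame integrity",
    "G4": "Actionability",
}

# Hypothetical pattern for "- G<N> <dimension>: PASS|FAIL — <evidence>".
# This is NOT the regex from validate.py; it is an illustrative stand-in.
GATE_LINE = re.compile(
    r"^- (?P<gate>G[1-4]) (?P<dimension>[^:]+): "
    r"(?P<verdict>PASS|FAIL) — (?P<evidence>\S.*)$"
)


def parse_gate_line(line: str):
    """Return (gate, verdict, evidence) for a canonical line, else None."""
    match = GATE_LINE.match(line.strip())
    if match is None:
        return None
    # Reject near-misses such as "Rigor gate" or "Evidence-grounding".
    if CANONICAL[match["gate"]] != match["dimension"]:
        return None
    return match["gate"], match["verdict"], match["evidence"]


for candidate in (
    "- G1 Rigor: PASS — specific.",
    "- G1 Rigor gate: PASS — specific.",
    "- G2 Evidence grounding: PARTIAL — mostly cited.",
    "- G4 Actionability: PASS",
):
    print(candidate, "->", parse_gate_line(candidate))
```

Under these assumptions, only the first candidate parses; the others are rejected for the "gate" suffix, a hedged verdict, and missing evidence respectively.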

## Hard rules

- Reviewer MUST stay in their own frame. A Contrarian reviewing First Principles still looks for what breaks.
- Reviewer MUST NOT produce a second copy of their own analysis. They evaluate the reviewee *through* their lens; they do not redo the work.
- Agreement with the reviewee is allowed if it adds a cross-frame dimension ("the Contrarian confirms the atomic truth holds under adversarial pressure"). Pure agreement with no added dimension is a Frame-integrity violation — the reviewer did not do their job.
- Every gate must have its required evidence. A naked "PASS" with no quote is itself a protocol violation caught by the validator.
- **Disposition** forces a call: Accept, Modify, or Reject. No fence-sitting.

## Convergence check (Phase 2.5, before the Chairman)

After all five peer reviews complete, run the **convergence detector** before Phase 3:

1. Extract the "Single sharpest point" from each of the 5 advisors.
2. Pairwise-compare them. Are 3 or more semantically the same finding (same subject, same predicate)?
3. If yes → **convergence flag**. The advisor(s) with the lowest gate-pass count are suspected of frame drift. Rerun those advisors with explicit frame-reinforcement (re-read persona file, emphasize the "Your lane vs. other advisors' lanes" section) before proceeding to Chairman.
4. If no → proceed to Chairman.

**Why this exists:** in single-Claude mode, the same model produces all 5 advisor outputs in one response and they tend to rhyme. The convergence detector is the structural antibody.

**What "semantically the same" means:** if you can describe two findings with the same sentence and lose no essential content, they are convergent. "Retry storms are dangerous" and "the retry wrapper will cascade under load" are convergent. "Retry storms are dangerous" and "the plan omits idempotency classification" are not.
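
The decision logic of the convergence check can be sketched as follows. This is an illustration under stated assumptions, not the package's detector: the semantic comparison itself is a judgment call, so the sketch assumes someone has already reduced each sharpest point to a hypothetical `(subject, predicate)` key. It also assumes that the rerun candidates are the lowest-scoring advisors within the converging group, which is one reading of step 3.

```python
from collections import defaultdict


def convergence_check(points, gate_pass_counts, threshold=3):
    """Flag convergence when `threshold` or more advisors share one finding key.

    points: advisor -> (subject, predicate) key (hypothetical normalisation).
    gate_pass_counts: advisor -> number of gates passed (0-4).
    """
    groups = defaultdict(list)
    for advisor, key in points.items():
        groups[key].append(advisor)

    convergent = [members for members in groups.values() if len(members) >= threshold]
    if not convergent:
        return {"flag": False, "rerun": []}

    # With five advisors and a threshold of three, at most one group qualifies.
    group = max(convergent, key=len)
    lowest = min(gate_pass_counts[a] for a in group)
    rerun = sorted(a for a in group if gate_pass_counts[a] == lowest)
    return {"flag": True, "rerun": rerun}


# Invented example: three advisors restate the same retry-cascade finding.
points = {
    "Contrarian": ("retry wrapper", "cascades under load"),
    "First Principles": ("plan", "omits idempotency classification"),
    "Expansionist": ("retry wrapper", "cascades under load"),
    "Outsider": ("ADR", "precondition already met"),
    "Executor": ("retry wrapper", "cascades under load"),
}
passes = {"Contrarian": 4, "First Principles": 4, "Expansionist": 3, "Outsider": 4, "Executor": 1}
print(convergence_check(points, passes))
```

Keying on exact `(subject, predicate)` equality is deliberately crude; in practice the pairwise comparison described above decides which points share a key.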

## What the Chairman extracts from peer reviews

- **Gate-pass count per advisor** (0–4). Advisors with 4/4 pass carry full weight. Advisors with gate failures are weighted down proportionally.
- **Disposition distribution**: how many Accept / Modify / Reject across the 5 reviews. A review ring that's all Accept signals either genuinely strong work or insufficient friction (check G3 Frame-integrity results).
- **Cross-frame additions**: the "What I would add" sentences — these often contain the material that makes the final verdict sharper than any single advisor.
- **Convergence flag** (if raised): triggers a rerun; do not synthesize until resolved.
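
As a rough illustration of the first two extractions, the sketch below turns gate-pass counts into weights and tallies dispositions. The text says only that failing advisors are weighted down proportionally, so the linear `passes / 4` rule here is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def advisor_weight(gate_pass_count: int, total_gates: int = 4) -> float:
    """Linear weight in [0, 1]; 4/4 gives full weight (assumed rule)."""
    if not 0 <= gate_pass_count <= total_gates:
        raise ValueError("gate_pass_count must be between 0 and total_gates")
    return gate_pass_count / total_gates


def summarize_dispositions(dispositions: list[str]) -> dict:
    """Count Accept / Modify / Reject and note an all-Accept ring."""
    counts = Counter(dispositions)
    all_accept = bool(dispositions) and counts["Accept"] == len(dispositions)
    return {"counts": dict(counts), "all_accept": all_accept}


print(advisor_weight(4), advisor_weight(1))
print(summarize_dispositions(["Accept", "Accept", "Accept", "Reject", "Accept"]))
```

An `all_accept` result is only a prompt to look at the G3 Frame-integrity results, as the list above notes; it does not by itself indicate a problem.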
@@ -0,0 +1,293 @@
#!/usr/bin/env python3
"""
analyze_runs.py — read council-runs.jsonl, surface where to harden next.

Reports:
  * gate-failure hotspots (which advisor fails which gate most),
  * peer-review reject frequency (per reviewer + per reviewee),
  * convergence-flag rate,
  * Chairman confidence distribution + low-confidence question shapes,
  * token / wall-clock distribution per scope tier (if metrics present),
  * exit-criteria scorecard from the plan.

Usage:
    python3 analyze_runs.py                        # default log
    python3 analyze_runs.py --log path/to/runs.jsonl
    python3 analyze_runs.py --json
    python3 analyze_runs.py --since 2026-04-24     # filter by timestamp prefix
    python3 analyze_runs.py --tail 8               # last N runs only
"""

from __future__ import annotations

import argparse
import json
import statistics
import sys
from collections import Counter, defaultdict
from pathlib import Path

DEFAULT_LOG_PATH = Path(".cleo/council-runs.jsonl")
ADVISORS = ["Contrarian", "First Principles", "Expansionist", "Outsider", "Executor"]
GATES = ["G1", "G2", "G3", "G4"]


def load_runs(path: Path, since: str | None = None, tail: int | None = None) -> list[dict]:
    if not path.exists():
        return []
    runs: list[dict] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if since and rec.get("timestamp", "") < since:
                continue
            runs.append(rec)
    if tail:
        runs = runs[-tail:]
    return runs


def gate_hotspots(runs: list[dict]) -> list[dict]:
    """Per (advisor, gate) FAIL count + rate, sorted worst-first."""
    fail = Counter()
    seen = Counter()
    for r in runs:
        for advisor, body in (r.get("advisors") or {}).items():
            for gate in GATES:
                verdict = (body.get("gates") or {}).get(gate)
                # Only strict PASS/FAIL verdicts count toward the denominator.
                if verdict in ("PASS", "FAIL"):
                    seen[(advisor, gate)] += 1
                    if verdict == "FAIL":
                        fail[(advisor, gate)] += 1
    rows = []
    for key, total in seen.items():
        f = fail[key]
        rows.append({
            "advisor": key[0],
            "gate": key[1],
            "fail": f,
            "n": total,
            "fail_rate": round(f / total, 3) if total else 0.0,
        })
    rows.sort(key=lambda x: (-x["fail_rate"], -x["fail"], x["advisor"], x["gate"]))
    return rows


def disposition_distribution(runs: list[dict]) -> dict:
    by_reviewer = defaultdict(Counter)
    by_reviewee = defaultdict(Counter)
    overall = Counter()
    for r in runs:
        # `or []` also covers records where "peer_reviews" is present but null.
        for pr in (r.get("peer_reviews") or []):
            disp = pr.get("disposition") or "Unknown"
            overall[disp] += 1
            by_reviewer[pr["reviewer"]][disp] += 1
            by_reviewee[pr["reviewee"]][disp] += 1
    return {
        "overall": dict(overall),
        "by_reviewer": {k: dict(v) for k, v in by_reviewer.items()},
        "by_reviewee": {k: dict(v) for k, v in by_reviewee.items()},
    }


def convergence_rate(runs: list[dict]) -> dict:
    raised = sum(1 for r in runs if (r.get("convergence") or {}).get("flag") is True)
    cleared = sum(1 for r in runs if (r.get("convergence") or {}).get("flag") is False)
    unknown = sum(1 for r in runs if (r.get("convergence") or {}).get("flag") is None)
    return {
        "raised": raised,
        "cleared": cleared,
        "unknown": unknown,
        "rate": round(raised / len(runs), 3) if runs else 0.0,
    }


def confidence_distribution(runs: list[dict]) -> dict:
    counts = Counter()
    low_conf_questions: list[str] = []
    for r in runs:
        conf = (r.get("chairman") or {}).get("confidence")
        counts[conf or "missing"] += 1
        if conf in ("low", "medium-low"):
            low_conf_questions.append(r.get("question", ""))
    return {
        "counts": dict(counts),
        "low_confidence_questions": low_conf_questions,
    }


def cost_distribution(runs: list[dict]) -> dict:
    tokens = [r.get("metrics", {}).get("tokens") for r in runs if (r.get("metrics") or {}).get("tokens")]
    walls = [r.get("metrics", {}).get("wall_clock_seconds") for r in runs if (r.get("metrics") or {}).get("wall_clock_seconds")]

    def _summary(xs):
        if not xs:
            return None
        return {
            "n": len(xs),
            "min": min(xs),
            "max": max(xs),
            "mean": round(statistics.mean(xs), 1),
            "stdev": round(statistics.stdev(xs), 1) if len(xs) > 1 else 0.0,
            "spread_pct": round(((max(xs) - min(xs)) / statistics.mean(xs)) * 100, 1) if statistics.mean(xs) else 0.0,
        }

    return {"tokens": _summary(tokens), "wall_clock_seconds": _summary(walls)}


def exit_criteria(runs: list[dict]) -> dict:
    """Scorecard against the plan's exit criteria."""
    n = len(runs)

    # 1. All shakedowns validate (here we don't know which run = which scenario,
    #    but we report the structural-validity rate as a proxy).
    valid_runs = sum(1 for r in runs if (r.get("validation") or {}).get("valid"))

    # 2. Every advisor ≥3/4 average gate pass.
    sums = defaultdict(list)
    for r in runs:
        for advisor, body in (r.get("advisors") or {}).items():
            sums[advisor].append(body.get("gate_pass_count", 0))
    advisor_avg = {a: round(statistics.mean(v), 2) for a, v in sums.items() if v}

    # 3. Convergence flag fires at most once across the campaign.
    convergence_raised = convergence_rate(runs)["raised"]

    # 4. Chairman confidence ≥ medium-high on ≥6/8 runs.
    high_or_above = sum(
        1 for r in runs
        if (r.get("chairman") or {}).get("confidence") in ("high", "medium-high")
    )

    # 5. Token cost stable within 20% per scope tier — proxy on overall spread.
    tokens = [r.get("metrics", {}).get("tokens") for r in runs if (r.get("metrics") or {}).get("tokens")]
    token_spread_ok = None
    if tokens and len(tokens) > 1 and statistics.mean(tokens):
        spread_pct = ((max(tokens) - min(tokens)) / statistics.mean(tokens)) * 100
        token_spread_ok = spread_pct <= 20.0

    return {
        "n_runs": n,
        "validate_pass_rate": round(valid_runs / n, 3) if n else 0.0,
        "advisor_gate_avg": advisor_avg,
        "advisor_gate_avg_min": min(advisor_avg.values()) if advisor_avg else None,
        "convergence_raised": convergence_raised,
        "high_or_above_confidence_runs": high_or_above,
        "token_spread_within_20pct": token_spread_ok,
        "checklist": {
            "all_validate": valid_runs == n if n else False,
            "every_advisor_avg_ge_3": all(v >= 3.0 for v in advisor_avg.values()) if advisor_avg else False,
            "convergence_at_most_once": convergence_raised <= 1,
            "high_or_above_ge_6_of_8": high_or_above >= 6 if n >= 8 else None,
            "token_spread_ok": token_spread_ok,
        },
    }


def render_report(report: dict) -> str:
    lines = []
    lines.append(f"# Council telemetry — {report['n_runs']} run(s)")
    lines.append("")

    lines.append("## Exit-criteria scorecard")
    cl = report["exit_criteria"]
    lines.append(f"- Validate pass rate: {cl['validate_pass_rate']*100:.0f}%")
    lines.append(f"- Advisor avg gate-pass (≥3.0 target): {cl['advisor_gate_avg']}")
    lines.append(f"- Convergence flags raised: {cl['convergence_raised']} (target ≤1)")
    lines.append(f"- High/medium-high confidence: {cl['high_or_above_confidence_runs']}/{report['n_runs']} (target ≥6/8)")
    spread = cl["token_spread_within_20pct"]
    lines.append(f"- Token spread within 20%: {'yes' if spread else 'no' if spread is False else 'n/a (insufficient runs with token metrics)'}")
    lines.append("")

    lines.append("## Gate-failure hotspots (top 5)")
    if not report["gate_hotspots"]:
        lines.append("- No gate-fail data yet.")
    else:
        for row in report["gate_hotspots"][:5]:
            if row["fail"] == 0:
                continue
            lines.append(
                f"- {row['advisor']:<16} {row['gate']} "
                f"fail {row['fail']}/{row['n']} ({row['fail_rate']*100:.0f}%)"
            )
        if all(r["fail"] == 0 for r in report["gate_hotspots"]):
            lines.append("- 0 gate failures across all runs (suspicious — check whether reviewers are too lenient).")
    lines.append("")

    lines.append("## Peer-review disposition distribution")
    disp = report["dispositions"]
    lines.append(f"- Overall: {disp['overall']}")
    lines.append("")

    lines.append("## Convergence")
    cv = report["convergence"]
    lines.append(f"- Raised: {cv['raised']} | Cleared: {cv['cleared']} | Unknown: {cv['unknown']} (rate {cv['rate']*100:.0f}%)")
    lines.append("")

    lines.append("## Chairman confidence")
    conf = report["confidence"]
    lines.append(f"- Distribution: {conf['counts']}")
    if conf["low_confidence_questions"]:
        lines.append("- Low-confidence questions (candidates for documenting as 'not a good council fit'):")
        for q in conf["low_confidence_questions"]:
            lines.append(f" - {q}")
    lines.append("")

    lines.append("## Cost (token + wall-clock summary)")
    cost = report["cost"]
    if cost["tokens"]:
        t = cost["tokens"]
        lines.append(f"- Tokens: n={t['n']} mean={t['mean']:.0f} stdev={t['stdev']:.0f} spread={t['spread_pct']}%")
    else:
        lines.append("- Tokens: no metrics recorded (pass --tokens to telemetry.py).")
    if cost["wall_clock_seconds"]:
        w = cost["wall_clock_seconds"]
        lines.append(f"- Wall-clock: n={w['n']} mean={w['mean']}s stdev={w['stdev']}s")
    else:
        lines.append("- Wall-clock: no metrics recorded.")
    lines.append("")

    return "\n".join(lines)


def build_report(runs: list[dict]) -> dict:
    return {
        "n_runs": len(runs),
        "gate_hotspots": gate_hotspots(runs),
        "dispositions": disposition_distribution(runs),
        "convergence": convergence_rate(runs),
        "confidence": confidence_distribution(runs),
        "cost": cost_distribution(runs),
        "exit_criteria": exit_criteria(runs),
    }


def main():
    parser = argparse.ArgumentParser(description="Analyze council-runs.jsonl telemetry.")
    parser.add_argument("--log", default=str(DEFAULT_LOG_PATH), help=f"JSONL log path (default: {DEFAULT_LOG_PATH}).")
    parser.add_argument("--json", action="store_true", help="Emit JSON report.")
    parser.add_argument("--since", default=None, help="Only include runs with ISO timestamps ≥ this prefix.")
    parser.add_argument("--tail", type=int, default=None, help="Only the last N runs.")
    args = parser.parse_args()

    runs = load_runs(Path(args.log), since=args.since, tail=args.tail)
    report = build_report(runs)

    if not runs:
        print(f"⚠️  No runs found at {args.log}.", file=sys.stderr)
        sys.exit(0)

    if args.json:
        print(json.dumps(report, indent=2, default=str))
    else:
        print(render_report(report))


if __name__ == "__main__":
    main()
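
For readers who want to try the script, the sketch below shows what a log record might look like and reproduces the token-spread arithmetic used in the exit criteria. The field names are inferred from the accessors in `analyze_runs.py`; the record values, the second question text, and the helper `token_spread_pct` are invented for illustration and are not part of the package.

```python
import json
import statistics

# Hypothetical run records. Field names mirror what analyze_runs.py reads;
# every value here is made up.
SAMPLE_RUNS = [
    {
        "timestamp": "2026-04-24T10:00:00Z",
        "question": "Should we add a retry-on-timeout wrapper to outbound HTTP calls?",
        "validation": {"valid": True},
        "advisors": {
            "Executor": {
                "gates": {"G1": "FAIL", "G2": "PASS", "G3": "FAIL", "G4": "FAIL"},
                "gate_pass_count": 1,
            },
        },
        "peer_reviews": [
            {"reviewer": "Outsider", "reviewee": "Executor", "disposition": "Reject"},
        ],
        "convergence": {"flag": False},
        "chairman": {"confidence": "medium"},
        "metrics": {"tokens": 9500, "wall_clock_seconds": 110},
    },
    {
        "timestamp": "2026-04-25T09:30:00Z",
        "question": "Hypothetical second question",
        "validation": {"valid": True},
        "advisors": {},
        "peer_reviews": [],
        "convergence": {"flag": False},
        "chairman": {"confidence": "high"},
        "metrics": {"tokens": 10500, "wall_clock_seconds": 130},
    },
]

# JSONL means one JSON object per line, which is how load_runs() reads the log.
jsonl_text = "\n".join(json.dumps(run) for run in SAMPLE_RUNS)
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def token_spread_pct(runs):
    """(max - min) / mean * 100 over runs that recorded a token count."""
    tokens = [(r.get("metrics") or {}).get("tokens") for r in runs]
    tokens = [t for t in tokens if t]
    if len(tokens) < 2:
        return None
    mean = statistics.mean(tokens)
    return round((max(tokens) - min(tokens)) / mean * 100, 1) if mean else None


spread = token_spread_pct(parsed)
print("token spread %:", spread, "| within 20%:", spread is not None and spread <= 20.0)
```

Writing `jsonl_text` to a file and passing it via `--log` should exercise the same code paths, assuming the inferred field names match what `telemetry.py` actually emits.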
@@ -0,0 +1,198 @@
# The Council — Should we add a retry-on-timeout wrapper to outbound HTTP calls?

## Evidence pack

1. `packages/core/src/http.ts:L12-L58` — current httpGet/httpPost.
2. `packages/core/src/circuit-breaker.ts` — exists with zero callers.
3. commit `a1b2c3d "drop retries from http client"` — retries removed 18 months ago.

## Phase 1 — Advisor analyses

### Advisor: Contrarian

**Frame:** Assume the plan is wrong.

**Evidence anchored:**
- commit `a1b2c3d` — retries were pulled for a documented reason.
- `packages/core/src/http.ts` — zero per-caller rate limits.

**Verdict from this lens:** Plan re-introduces known incident class.

**Single sharpest point:** Retry wrapper without breaker reproduces old bug.

### Advisor: First Principles

**Frame:** Ignore everything.

**Evidence anchored:**
- RFC 7231.
- `packages/core/src/http.ts:L12-L58`.

**Verdict from this lens:** Plan incomplete.

**Single sharpest point:** Non-idempotent requests cannot be blindly retried.

### Advisor: Expansionist

**Frame:** Forget the constraints.

**Evidence anchored:**
- `packages/core/src/circuit-breaker.ts`.
- `MEMORY.md`.

**Verdict from this lens:** Owner thinking too small.

**Single sharpest point:** Wire the circuit breaker.

### Advisor: Outsider

**Frame:** You have no context.

**Evidence anchored:**
- `packages/core/src/circuit-breaker.ts` — zero callers.
- `docs/adr/ADR-021-http-client.md`.

**What the artifact claims vs. shows:** Claims await breaker; shows breaker has landed.

**Verdict from this lens:** Project prepared but didn't close the loop.

**Single sharpest point:** ADR says do this when breaker lands; it has landed.

### Advisor: Executor

**Frame:** Don't analyze.

**Evidence anchored:**
- `packages/core/test/http.test.ts`.
- `packages/core/src/circuit-breaker.ts`.

**The action (one):**
1. Write a failing test.
2. Implement the retry wrapper.
3. Add circuit breaker wiring.

**Expected outcome (60 minutes from now):**
Many things happen.

**What this unblocks:**
All subsequent work.

**Verdict from this lens:** Lots to do.

**Single sharpest point:** Do three things simultaneously.

## Phase 2 — Shuffled peer reviews

### Contrarian reviewing First Principles

**Gate results:**
- G1 Rigor: PASS — specific.
- G2 Evidence grounding: PASS — cited.
- G3 Frame integrity: PASS — in lane.
- G4 Actionability: PASS — decidable.

**Strongest finding (from reviewee):** Idempotency.

**Gap from Contrarian's frame:** None.

**What I would add:** Nothing.

**Disposition:** Accept — holds.

### First Principles reviewing Expansionist

**Gate results:**
- G1 Rigor: PASS — specific.
- G2 Evidence grounding: PASS — cited.
- G3 Frame integrity: PASS — in lane.
- G4 Actionability: PASS — decidable.

**Strongest finding (from reviewee):** Asset wiring.

**Gap from First Principles' frame:** None.

**What I would add:** Nothing.

**Disposition:** Accept — holds.

### Expansionist reviewing Outsider

**Gate results:**
- G1 Rigor: PASS — specific.
- G2 Evidence grounding: PASS — cited.
- G3 Frame integrity: PASS — in lane.
- G4 Actionability: PASS — decidable.

**Strongest finding (from reviewee):** ADR gap.

**Gap from Expansionist's frame:** None.

**What I would add:** Nothing.

**Disposition:** Accept — holds.

### Outsider reviewing Executor

**Gate results:**
- G1 Rigor: FAIL — three actions listed, not one.
- G2 Evidence grounding: PASS — cited.
- G3 Frame integrity: FAIL — multiple actions violates Executor frame.
- G4 Actionability: FAIL — action is ambiguous.

**Strongest finding (from reviewee):** Writing the test is still valid.

**Gap from Outsider's frame:** Executor frame requires exactly one action.

**What I would add:** Nothing.

**Disposition:** Reject — frame violation.

### Executor reviewing Contrarian

**Gate results:**
- G1 Rigor: PASS — specific.
- G2 Evidence grounding: PASS — cited.
- G3 Frame integrity: PASS — in lane.
- G4 Actionability: PASS — decidable.

**Strongest finding (from reviewee):** Retry storm.

**Gap from Executor's frame:** No mitigation named.

**What I would add:** Wire breaker first.

**Disposition:** Accept — risk real.

## Phase 2.5 — Convergence check

No convergence.

## Phase 3 — Chairman's verdict

### Gate summary

| Advisor | G1 | G2 | G3 | G4 | Weight |
|---|---|---|---|---|---|
| Contrarian | PASS | PASS | PASS | PASS | full |
| First Principles | PASS | PASS | PASS | PASS | full |
| Expansionist | PASS | PASS | PASS | PASS | full |
| Outsider | PASS | PASS | PASS | PASS | full |
| Executor | FAIL | PASS | FAIL | FAIL | low |

### Recommendation
Rerun Executor.

### Why this, not the alternatives
Executor violated frame.

### What each advisor got right
See above.

### Conditions on the recommendation
Rerun required.

### Next 60-minute action
Rerun the Executor pass with explicit one-action constraint.

### Confidence
Medium — four frames solid, one rerun needed.