@cleocode/skills 2026.3.76 → 2026.4.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/_shared/manifest-operations.md +1 -2
- package/skills/_shared/skill-chaining-patterns.md +3 -7
- package/skills/_shared/subagent-protocol-base.cant +113 -0
- package/skills/ct-cleo/SKILL.md +56 -65
- package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
- package/skills/ct-cleo/references/session-protocol.md +3 -12
- package/skills/ct-codebase-mapper/SKILL.md +7 -7
- package/skills/ct-grade/SKILL.md +12 -46
- package/skills/ct-grade/agents/scenario-runner.md +11 -21
- package/skills/ct-grade/references/ab-test-methodology.md +14 -14
- package/skills/ct-grade/references/domains.md +72 -74
- package/skills/ct-grade/references/grade-spec.md +8 -11
- package/skills/ct-grade/references/scenario-playbook.md +77 -106
- package/skills/ct-grade-v2-1/SKILL.md +30 -32
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
- package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
- package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
- package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
- package/skills/ct-memory/SKILL.md +16 -35
- package/skills/ct-orchestrator/SKILL.md +58 -68
- package/skills/ct-skill-validator/SKILL.md +1 -1
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
- package/skills/manifest.json +1 -1
- package/skills/signaldock-connect/SKILL.md +132 -0
- package/skills/signaldock-connect/assets/agent-card.json +48 -0
- package/skills/signaldock-connect/references/api-endpoints.md +131 -0
- package/skills.json +1 -1
@@ -5,8 +5,7 @@
 
 Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.
 
-Use **cleo-dev** (local dev build)
-Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for CLI-interface runs.
+Use **cleo-dev** (local dev build) or **cleo** (production). All operations use the CLI.
 
 ---
 
@@ -15,17 +14,7 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
 **Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
 **Target score**: 45/100 (S1 full, S2 partial, S5 partial — no admin.help)
 
-### Operation Sequence
-
-```
-1. query session list — S1: must be first
-2. query admin dash — project overview
-3. query tasks find { "status": "active" } — S2: find not list
-4. query tasks show { "taskId": "T<any>" } — S2: show used
-5. mutate session end — S1: session.end
-```
-
-### Operation Sequence (CLI)
+### Operation Sequence
 
 ```bash
 1. cleo-dev session list
@@ -43,18 +32,16 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
 | S2 | 20/20 | find used exclusively (+15), show used (+5) |
 | S3 | 20/20 | No task adds (no deductions) |
 | S4 | 20/20 | No errors |
-| S5
-| S5 (CLI) | 0/20 | No MCP query calls, no admin.help |
+| S5 | 10/20 | No admin.help call |
 
-**
-**CLI total: ~80/100 (B)**
+**Total: ~90/100 (A)**
 
 ### Anti-pattern Variant (for testing grader sensitivity)
 
-```
-
-
-(no session.end)
+```bash
+cleo-dev find --status active # task op BEFORE session.list
+cleo-dev session list # too late for S1
+# (no session.end)
 ```
 Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`
 
@@ -63,19 +50,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
 ## S2: Task Creation Hygiene
 
 **Purpose**: Validates S3 (Task Hygiene) and S1.
-**Target score**: 60/100 (S1 full, S3 full, S5 partial
+**Target score**: 60/100 (S1 full, S3 full, S5 partial)
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-
-
-4. mutate tasks add { "title": "Write tests",
-"description": "Unit tests for auth module" } — S3: desc present
-5. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev show T100 # S3: parent verify
+3. cleo-dev add "Implement auth" --description "Add JWT authentication to API endpoints" --parent T100
+4. cleo-dev add "Write tests" --description "Unit tests for auth module"
+5. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -83,18 +67,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
 | Dim | Expected | Reason |
 |-----|----------|--------|
 | S1 | 20/20 | session.list first, session.end present |
-| S3 | 20/20 | All adds have descriptions, parent verified via
-| S5
-| S5 (CLI) | 0/20 | no MCP query, no help |
+| S3 | 20/20 | All adds have descriptions, parent verified via show |
+| S5 | 0/20 | no help |
 
-**
-**CLI total: ~60/100 (C)**
+**Total: ~60/100 (C)**
 
 ### Anti-pattern Variant
 
-```
-
-
+```bash
+cleo-dev add "Implement auth" --parent T100 # no desc, no exists check
+cleo-dev add "Write tests" # no desc
 ```
 Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 
@@ -104,15 +86,14 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 
 **Purpose**: Validates S4 (Error Protocol).
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-
-5. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev show T99999 # triggers E_NOT_FOUND
+3. cleo-dev find "T99999" # S4: recovery within 4 ops
+4. cleo-dev add "New feature" --description "Implement the feature that was not found"
+5. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -122,24 +103,23 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 | S1 | 20/20 | Proper session lifecycle |
 | S3 | 20/20 | Task created with description |
 | S4 | 20/20 | E_NOT_FOUND followed by recovery lookup within 4 entries |
-| S5
+| S5 | 0/20 | no help |
 
-**
+**Total: ~80/100 (B)**
 
 ### Anti-pattern: Unrecovered Error
 
+```bash
+cleo-dev show T99999 # E_NOT_FOUND
+cleo-dev add "Something else" --description "Unrelated" # no recovery lookup
 ```
-
-mutate tasks add { "title": "Something else",
-"description": "Unrelated" } ← no recovery lookup
-```
-S4 deduction: -5 (no tasks.find within next 4 entries)
+S4 deduction: -5 (no find within next 4 entries)
 
 ### Anti-pattern: Duplicate Creates
 
-```
-
-
+```bash
+cleo-dev add "New feature" --description "First attempt"
+cleo-dev add "New feature" --description "Second attempt"
 ```
 S4 deduction: -5 (1 duplicate detected)
 
@@ -148,24 +128,24 @@ S4 deduction: -5 (1 duplicate detected)
 ## S4: Full Lifecycle
 
 **Purpose**: Validates all 5 dimensions. Gold standard session.
-**Target score**: 100/100 (A)
+**Target score**: 100/100 (A)
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-5.
-6.
-(agent does work here)
-7.
-8.
-9.
+```bash
+1. cleo-dev session list
+2. cleo-dev help # S5: progressive disclosure
+3. cleo-dev dash # overview
+4. cleo-dev find --status pending # S2: find not list
+5. cleo-dev show T200 # S2: show for detail
+6. cleo-dev update T200 --status active # begin work
+# (agent does work here)
+7. cleo-dev complete T200 # mark done
+8. cleo-dev find --status pending # check next
+9. cleo-dev session end --note "Completed T200" # S1
 ```
 
-### Scoring Targets
+### Scoring Targets
 
 | Dim | Expected | Reason |
 |-----|----------|--------|
@@ -173,34 +153,31 @@ S4 deduction: -5 (1 duplicate detected)
 | S2 | 20/20 | find:list 100% (+15), show used (+5) |
 | S3 | 20/20 | No adds — no deductions |
 | S4 | 20/20 | No errors, no duplicates |
-| S5 | 20/20 | admin.help (+10),
+| S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
-**
-**CLI total: ~80/100 (B)** — loses S5 entirely
+**Total: 100/100 (A)**
 
 ---
 
 ## S5: Multi-Domain Analysis
 
 **Purpose**: Validates cross-domain operations and advanced S5.
-**Target score**: 100/100
+**Target score**: 100/100
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-5.
-6.
-7.
-
-
-
-
-10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
-11. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev help
+3. cleo-dev find --parent T500 # S2: epic subtasks
+4. cleo-dev show T501 # S2: inspect
+5. cleo-dev session context-drift # multi-domain
+6. cleo-dev session decision-log --task T501 # decision history
+7. cleo-dev session record-decision --task T501 --decision "Use adapter pattern" --rationale "Decouples provider logic"
+8. cleo-dev update T501 --status active
+9. cleo-dev complete T501
+10. cleo-dev find --parent T500 --status pending # next subtask
+11. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -211,24 +188,18 @@ S4 deduction: -5 (1 duplicate detected)
 | S2 | 20/20 | find used exclusively, show used |
 | S3 | 20/20 | No task.add — no deductions |
 | S4 | 20/20 | No errors |
-| S5 | 20/20 | admin.help (+10),
+| S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
-**
+**Total: 100/100 (A)**
 
 ---
 
 ## Scenario Quick Reference
 
-| Scenario | Primary Dims Tested |
-
-| S1 | S1, S2 | ~90 (A) |
-| S2 | S1, S3 | ~
-| S3 | S1, S3, S4 | ~
-| S4 | All 5 | 100 (A) |
-| S5 | All 5, cross-domain | 100 (A) |
-
-**Key insight**: CLI interface will consistently score 0 on S5 Progressive Disclosure because:
-1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
-2. `cleo-dev admin help` CLI call is not detected as `admin.help` MCP call (no +10)
-
-This is by design — the rubric rewards MCP-first behavior.
+| Scenario | Primary Dims Tested | Expected Score |
+|---|---|---|
+| S1 | S1, S2 | ~90 (A) |
+| S2 | S1, S3 | ~60 (C) |
+| S3 | S1, S3, S4 | ~80 (B) |
+| S4 | All 5 | 100 (A) |
+| S5 | All 5, cross-domain | 100 (A) |
@@ -4,22 +4,22 @@ description: >-
 CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent
 session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency,
 S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes:
-(1) scenario — run playbook scenarios S1-S5
-comparison of
+(1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B
+comparison of different CLI configurations for same domain operations with token cost
 measurement; (3) blind — spawn two agents with different configurations, blind-comparator
 picks winner, analyzer produces recommendation. Use when grading agent sessions, running
-grade playbook scenarios, comparing
-usage across
+grade playbook scenarios, comparing behavioral differences, measuring token
+usage across configurations, or performing multi-run blind A/B evaluation with statistical
 analysis and comparative report. Triggers on: grade session, evaluate agent behavior,
-A/B test CLEO
-protocol compliance scoring
-argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [
+A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric,
+protocol compliance scoring.
+argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [runs=N] [session-id=<id>]"
 allowed-tools: ["Bash(python *)", "Bash(cleo-dev *)", "Bash(cleo *)", "Bash(kill *)", "Bash(lsof *)", "Agent", "Read", "Write", "Glob"]
 ---
 
 # ct-grade v2.1 — CLEO Grading and A/B Testing
 
-Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between
+Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between different CLI configurations.
 
 ## On Every /ct-grade Invocation
 
@@ -48,7 +48,7 @@ echo "Grade viewer stopped."
 | Mode | Purpose | Key Output |
 |---|---|---|
 | `scenario` | Run playbook scenarios S1-S5 as graded sessions | GradeResult per scenario |
-| `ab` | Run same domain operations
+| `ab` | Run same domain operations with two configurations, compare | comparison.json + token delta |
 | `blind` | Two agents run same task, blind comparator picks winner | analysis.json + winner |
 
 ## Parameters
@@ -57,7 +57,7 @@ echo "Grade viewer stopped."
 |---|---|---|---|
 | `mode` | `scenario\|ab\|blind` | `scenario` | Operating mode |
 | `scenario` | `s1\|s2\|s3\|s4\|s5\|all` | `all` | Grade playbook scenario(s) to run |
-| `interface` | `
+| `interface` | `cli` | `cli` | Interface to exercise (CLI only) |
 | `domains` | comma list | `tasks,session` | Domains to test in `ab` mode |
 | `runs` | integer | `3` | Runs per configuration for statistical confidence |
 | `session-id` | string | — | Grade a specific existing session (skips execution) |
@@ -70,12 +70,12 @@ echo "Grade viewer stopped."
 /ct-grade session-id=<id>
 ```
 
-**Run scenario S4 (Full Lifecycle)
+**Run scenario S4 (Full Lifecycle):**
 ```
-/ct-grade mode=scenario scenario=s4
+/ct-grade mode=scenario scenario=s4
 ```
 
-**A/B compare
+**A/B compare two configurations for tasks + session domains (3 runs each):**
 ```
 /ct-grade mode=ab domains=tasks,session runs=3
 ```
@@ -93,10 +93,10 @@ echo "Grade viewer stopped."
 
 1. Set up output dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode scenario --scenario <id> --output-dir <dir>`
 2. For each scenario, spawn a `scenario-runner` agent:
-- Agent start: `
+- Agent start: `cleo session start --scope global --name "<scenario-id>" --grade`
 - Agent executes the scenario operations (see [references/playbook-v2.md](references/playbook-v2.md))
-- Agent end: `
-- Agent runs: `
+- Agent end: `cleo session end`
+- Agent runs: `ct grade <sessionId>`
 - Agent saves: `GradeResult` to `<output-dir>/<scenario>/grade.json`
 3. Capture `total_tokens` + `duration_ms` from task notification → `timing.json`
 4. Run: `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode scenario`
@@ -105,16 +105,16 @@ echo "Grade viewer stopped."
 
 1. Set up run dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode ab --output-dir <dir>`
 2. For each target domain, spawn TWO agents in the SAME turn:
-- **Arm A
-- **Arm B
+- **Arm A**: `agents/scenario-runner.md` with configuration A
+- **Arm B**: `agents/scenario-runner.md` with configuration B
 - Capture tokens from both task notifications immediately
-3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which is
+3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which configuration is which)
 4. Comparator writes `comparison.json`
 5. Run `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode ab`
 
 ### Mode: blind
 
-Same as `ab` but configurations may differ
+Same as `ab` but configurations may differ (e.g., different session scopes, different agent prompts). The comparator is always blind to configuration identity.
 
 ---
 
@@ -127,7 +127,7 @@ timing = {
 "total_tokens": task.total_tokens, # from task notification — EPHEMERAL
 "duration_ms": task.duration_ms, # from task notification
 "arm": "arm-A",
-"interface": "
+"interface": "cli",
 "scenario": "s4",
 "run": 1,
 "executor_start": start_iso,
@@ -154,11 +154,9 @@ If running without task notifications (no total_tokens available):
 | S2 Discovery Efficiency | 20 | `find:list` ratio ≥80% (+15), `tasks.show` used (+5) |
 | S3 Task Hygiene | 20 | Starts 20, -5 per add without description, -3 if subtask no exists check |
 | S4 Error Protocol | 20 | Starts 20, -5 per unrecovered E_NOT_FOUND, -5 if duplicates |
-| S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10),
+| S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10), progressive disclosure used (+10) |
 
-**Grade letters:** A
-
-**Note:** CLI-only sessions always score 0 on S5 — `metadata.gateway` is not set by the CLI adapter. MCP earns +10 automatically.
+**Grade letters:** A>=90, B>=75, C>=60, D>=45, F<45
 
 ---
 
@@ -227,11 +225,11 @@ Shows historical grades from GRADES.jsonl, A/B summaries from any workspace subd
 
 ---
 
-##
+## CLI Grade Operations
 
-|
-|
-| `
-| `
-| `
-| `
+| Command | Description |
+|---------|-------------|
+| `ct grade <sessionId>` | Grade a specific session |
+| `ct grade --list` | List past grade results |
+| `ct session start --scope global --name "<n>" --grade` | Start graded session |
+| `ct session end` | End session |
@@ -1,12 +1,11 @@
 # Scenario Runner Agent
 
-You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the
+You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the CLI, capture the audit trail, and grade the resulting session.
 
 ## Inputs
 
 You will receive:
 - `SCENARIO`: Which scenario to run (s1|s2|s3|s4|s5|s6|s7|s8|s9|s10)
-- `INTERFACE`: Which interface to use (mcp|cli)
 - `OUTPUT_DIR`: Where to write results
 - `PROJECT_DIR`: Path to the CLEO project (for cleo-dev --cwd)
 - `RUN_NUMBER`: Integer (1, 2, 3...) for repeated runs
@@ -17,30 +16,24 @@ You will receive:
 
 Note the ISO timestamp before any operations.
 
-### Step 2: Start a graded session
+### Step 2: Start a graded session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> session start --grade --name "grade-<SCENARIO>-run<RUN>" --scope global
 ```
 
 Save the returned `sessionId`.
 
 If this fails (DB migration error, ENOENT, or non-zero exit):
 - Write `grade.json: { "error": "DB_UNAVAILABLE", "totalScore": null }`
-- Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "
+- Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "scenario": "<SCENARIO>", "run": <RUN_NUMBER>, "interface": "cli", "executor_start": "<ISO>", "executor_end": "<ISO>" }`
 - Output: `SESSION_START_FAILED: DB_UNAVAILABLE`
 - Stop. Do NOT abort silently.
 
 ### Step 3: Execute scenario operations
 
-Follow the exact operation sequence from the scenario playbook.
-
-**MCP operations** use the query/mutate gateway:
-```
-query tasks find { "status": "active" }
-```
+Follow the exact operation sequence from the scenario playbook. All operations use the CLI.
 
-**CLI operations** use cleo-dev (prefer) or cleo, with PROJECT_DIR as cwd if provided:
 ```bash
 cleo-dev --cwd <PROJECT_DIR> find --status active
 ```
@@ -49,14 +42,14 @@ Scenario sequences are in [../references/playbook-v2.md](../references/playbook-
 
 ### Step 4: End the session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> session end
 ```
 
 ### Step 5: Grade the session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> check grade --session "<saved-id>"
 ```
 
 Save the full GradeResult JSON.
@@ -65,7 +58,7 @@ Save the full GradeResult JSON.
 
 Record every operation you executed as a JSONL file. Each line:
 ```json
-{"seq": 1, "
+{"seq": 1, "domain": "tasks", "operation": "find", "params": {}, "success": true, "interface": "cli", "timestamp": "..."}
 ```
 
 ### Step 7: Write output files
@@ -89,10 +82,9 @@ Write to `<OUTPUT_DIR>/<SCENARIO>/arm-<INTERFACE>/`:
 **timing.json** — Fill in what you can; orchestrator fills `total_tokens` and `duration_ms`:
 ```json
 {
-"arm": "<INTERFACE>",
 "scenario": "<SCENARIO>",
 "run": <RUN_NUMBER>,
-"interface": "
+"interface": "cli",
 "session_id": "<session-id>",
 "executor_start": "<ISO>",
 "executor_end": "<ISO>",
@@ -109,19 +101,8 @@ Note: `total_tokens` and `duration_ms` are filled by the orchestrator from the t
 
 After receiving the grade result, record the exchange to persist token measurements:
 
-```
-
-"action": "record",
-"sessionId": "<session-id>",
-"transport": "mcp",
-"domain": "admin",
-"operation": "grade",
-"metadata": {
-"scenario": "<SCENARIO>",
-"interface": "<INTERFACE>",
-"run": <RUN_NUMBER>
-}
-}
+```bash
+cleo-dev --cwd <PROJECT_DIR> admin token record --session "<session-id>" --domain admin --operation grade --metadata '{"scenario":"<SCENARIO>","run":<RUN_NUMBER>}'
 ```
 
 Save the returned `id` as `token_usage_id` in timing.json.
@@ -170,7 +151,6 @@ Do NOT do these during scenario execution — they will lower the grade intentionally
 When complete, summarize:
 ```
 SCENARIO: <id>
-INTERFACE: <interface>
 RUN: <n>
 SESSION_ID: <id>
 TOTAL_SCORE: <n>/100
@@ -3,9 +3,12 @@
 **Generated:** 2026-03-07 23:47 UTC
 **Source:** `/tmp/ct-grade-eval`
 
+> **DEPRECATED**: This report was generated when MCP was still supported. MCP has been removed.
+> All operations now use the CLI exclusively. These results are retained for historical reference only.
+
 ---
 
-## MCP vs CLI Blind A/B Results
+## Historical: MCP vs CLI Blind A/B Results
 
 **Overall winner: MCP**
 