@cleocode/skills 2026.3.76 → 2026.4.2

Files changed (31)
  1. package/package.json +1 -1
  2. package/skills/_shared/manifest-operations.md +1 -2
  3. package/skills/_shared/skill-chaining-patterns.md +3 -7
  4. package/skills/_shared/subagent-protocol-base.cant +113 -0
  5. package/skills/ct-cleo/SKILL.md +56 -65
  6. package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
  7. package/skills/ct-cleo/references/session-protocol.md +3 -12
  8. package/skills/ct-codebase-mapper/SKILL.md +7 -7
  9. package/skills/ct-grade/SKILL.md +12 -46
  10. package/skills/ct-grade/agents/scenario-runner.md +11 -21
  11. package/skills/ct-grade/references/ab-test-methodology.md +14 -14
  12. package/skills/ct-grade/references/domains.md +72 -74
  13. package/skills/ct-grade/references/grade-spec.md +8 -11
  14. package/skills/ct-grade/references/scenario-playbook.md +77 -106
  15. package/skills/ct-grade-v2-1/SKILL.md +30 -32
  16. package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
  17. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
  18. package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
  19. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
  20. package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
  21. package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
  22. package/skills/ct-memory/SKILL.md +16 -35
  23. package/skills/ct-orchestrator/SKILL.md +58 -68
  24. package/skills/ct-skill-validator/SKILL.md +1 -1
  25. package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
  26. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
  27. package/skills/manifest.json +1 -1
  28. package/skills/signaldock-connect/SKILL.md +132 -0
  29. package/skills/signaldock-connect/assets/agent-card.json +48 -0
  30. package/skills/signaldock-connect/references/api-endpoints.md +131 -0
  31. package/skills.json +1 -1
@@ -5,8 +5,7 @@
5
5
 
6
6
  Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.
7
7
 
8
- Use **cleo-dev** (local dev build) for MCP operations or **cleo** (production).
9
- Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for CLI-interface runs.
8
+ Use **cleo-dev** (local dev build) or **cleo** (production). All operations use the CLI.
10
9
 
11
10
  ---
12
11
 
@@ -15,17 +14,7 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
15
14
  **Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
16
15
  **Target score**: 45/100 (S1 full, S2 partial, S5 partial — no admin.help)
17
16
 
18
- ### Operation Sequence (MCP)
19
-
20
- ```
21
- 1. query session list — S1: must be first
22
- 2. query admin dash — project overview
23
- 3. query tasks find { "status": "active" } — S2: find not list
24
- 4. query tasks show { "taskId": "T<any>" } — S2: show used
25
- 5. mutate session end — S1: session.end
26
- ```
27
-
28
- ### Operation Sequence (CLI)
17
+ ### Operation Sequence
29
18
 
30
19
  ```bash
31
20
  1. cleo-dev session list
@@ -43,18 +32,16 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
43
32
  | S2 | 20/20 | find used exclusively (+15), show used (+5) |
44
33
  | S3 | 20/20 | No task adds (no deductions) |
45
34
  | S4 | 20/20 | No errors |
46
- | S5 (MCP) | 10/20 | query gateway used (+10), no admin.help call |
47
- | S5 (CLI) | 0/20 | No MCP query calls, no admin.help |
35
+ | S5 | 10/20 | No admin.help call |
48
36
 
49
- **MCP total: ~90/100 (A)**
50
- **CLI total: ~80/100 (B)**
37
+ **Total: ~90/100 (A)**
51
38
 
52
39
  ### Anti-pattern Variant (for testing grader sensitivity)
53
40
 
54
- ```
55
- query tasks find { "status": "active" } ← task op BEFORE session.list
56
- query session list too late for S1
57
- (no session.end)
41
+ ```bash
42
+ cleo-dev find --status active # task op BEFORE session.list
43
+ cleo-dev session list # too late for S1
44
+ # (no session.end)
58
45
  ```
59
46
  Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`
60
47
 
@@ -63,19 +50,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
63
50
  ## S2: Task Creation Hygiene
64
51
 
65
52
  **Purpose**: Validates S3 (Task Hygiene) and S1.
66
- **Target score**: 60/100 (S1 full, S3 full, S5 partial MCP or 0 CLI)
53
+ **Target score**: 60/100 (S1 full, S3 full, S5 partial)
67
54
 
68
- ### Operation Sequence (MCP)
55
+ ### Operation Sequence
69
56
 
70
- ```
71
- 1. query session list — S1
72
- 2. query tasks exists { "taskId": "T100" } — S3: parent verify
73
- 3. mutate tasks add { "title": "Implement auth",
74
- "description": "Add JWT authentication to API endpoints",
75
- "parent": "T100" } — S3: desc + parent
76
- 4. mutate tasks add { "title": "Write tests",
77
- "description": "Unit tests for auth module" } — S3: desc present
78
- 5. mutate session end — S1
57
+ ```bash
58
+ 1. cleo-dev session list
59
+ 2. cleo-dev show T100 # S3: parent verify
60
+ 3. cleo-dev add "Implement auth" --description "Add JWT authentication to API endpoints" --parent T100
61
+ 4. cleo-dev add "Write tests" --description "Unit tests for auth module"
62
+ 5. cleo-dev session end
79
63
  ```
80
64
 
81
65
  ### Scoring Targets
@@ -83,18 +67,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
83
67
  | Dim | Expected | Reason |
84
68
  |-----|----------|--------|
85
69
  | S1 | 20/20 | session.list first, session.end present |
86
- | S3 | 20/20 | All adds have descriptions, parent verified via exists |
87
- | S5 (MCP) | 10/20 | query gateway used |
88
- | S5 (CLI) | 0/20 | no MCP query, no help |
70
+ | S3 | 20/20 | All adds have descriptions, parent verified via show |
71
+ | S5 | 0/20 | no help |
89
72
 
90
- **MCP total: ~70/100 (C)**
91
- **CLI total: ~60/100 (C)**
73
+ **Total: ~60/100 (C)**
92
74
 
93
75
  ### Anti-pattern Variant
94
76
 
95
- ```
96
- mutate tasks add { "title": "Implement auth", "parent": "T100" } ← no desc, no exists check
97
- mutate tasks add { "title": "Write tests" } no desc
77
+ ```bash
78
+ cleo-dev add "Implement auth" --parent T100 # no desc, no exists check
79
+ cleo-dev add "Write tests" # no desc
98
80
  ```
99
81
  Expected S3: 7 (20 - 5 - 5 - 3 = 7)
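
The S3 arithmetic above can be reproduced with a small sketch. This is not the grader's actual implementation: the record shape, the `parent_verified` flag, the function name, and the floor at 0 are assumptions made for illustration. The sketch also assumes the -3 penalty applies once per subtask added without a prior parent check.

```python
def s3_task_hygiene(adds):
    """Score S3 (Task Hygiene) from a list of task-add records.

    Each record is a hypothetical dict with optional keys:
    "description", "parent", and "parent_verified".
    """
    score = 20  # S3 starts at full marks and only loses points
    for add in adds:
        if not add.get("description"):
            score -= 5  # add without a description
        if add.get("parent") and not add.get("parent_verified", False):
            score -= 3  # subtask added without verifying the parent first
    return max(score, 0)  # assumption: the score never goes negative


# The anti-pattern variant: two adds, neither described, one unverified parent.
anti_pattern = [
    {"title": "Implement auth", "parent": "T100"},
    {"title": "Write tests"},
]
print(s3_task_hygiene(anti_pattern))  # 20 - 5 - 5 - 3 = 7
```
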
100
82
 
@@ -104,15 +86,14 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
104
86
 
105
87
  **Purpose**: Validates S4 (Error Protocol).
106
88
 
107
- ### Operation Sequence (MCP)
89
+ ### Operation Sequence
108
90
 
109
- ```
110
- 1. query session list — S1
111
- 2. query tasks show { "taskId": "T99999" } — triggers E_NOT_FOUND
112
- 3. query tasks find { "query": "T99999" } — S4: recovery within 4 ops
113
- 4. mutate tasks add { "title": "New feature",
114
- "description": "Implement the feature that was not found" } — S3: desc present
115
- 5. mutate session end — S1
91
+ ```bash
92
+ 1. cleo-dev session list
93
+ 2. cleo-dev show T99999 # triggers E_NOT_FOUND
94
+ 3. cleo-dev find "T99999" # S4: recovery within 4 ops
95
+ 4. cleo-dev add "New feature" --description "Implement the feature that was not found"
96
+ 5. cleo-dev session end
116
97
  ```
117
98
 
118
99
  ### Scoring Targets
@@ -122,24 +103,23 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
122
103
  | S1 | 20/20 | Proper session lifecycle |
123
104
  | S3 | 20/20 | Task created with description |
124
105
  | S4 | 20/20 | E_NOT_FOUND followed by recovery lookup within 4 entries |
125
- | S5 (MCP) | 10/20 | query gateway used |
106
+ | S5 | 0/20 | no help |
126
107
 
127
- **MCP total: ~90/100 (A)**
108
+ **Total: ~80/100 (B)**
128
109
 
129
110
  ### Anti-pattern: Unrecovered Error
130
111
 
112
+ ```bash
113
+ cleo-dev show T99999 # E_NOT_FOUND
114
+ cleo-dev add "Something else" --description "Unrelated" # no recovery lookup
131
115
  ```
132
- query tasks show { "taskId": "T99999" } ← E_NOT_FOUND
133
- mutate tasks add { "title": "Something else",
134
- "description": "Unrelated" } ← no recovery lookup
135
- ```
136
- S4 deduction: -5 (no tasks.find within next 4 entries)
116
+ S4 deduction: -5 (no find within next 4 entries)
137
117
 
138
118
  ### Anti-pattern: Duplicate Creates
139
119
 
140
- ```
141
- mutate tasks add { "title": "New feature", "description": "First attempt" }
142
- mutate tasks add { "title": "New feature", "description": "Second attempt" }
120
+ ```bash
121
+ cleo-dev add "New feature" --description "First attempt"
122
+ cleo-dev add "New feature" --description "Second attempt"
143
123
  ```
144
124
  S4 deduction: -5 (1 duplicate detected)
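
The two S4 deductions above can be sketched as follows. This is an illustrative model, not the grader's code. The entry shape loosely follows the scenario runner's JSONL audit record (`operation`, `params`), while the `error` field, the function name, and the flat single -5 for any duplicate titles are assumptions.

```python
def s4_error_protocol(entries, window=4):
    """Score S4 (Error Protocol) from an ordered list of audit entries."""
    score = 20
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            # Recovery means a find appears within the next `window` entries.
            following = entries[i + 1 : i + 1 + window]
            if not any(e.get("operation") == "find" for e in following):
                score -= 5
    # Duplicate creates: the same title added more than once.
    titles = [
        e.get("params", {}).get("title")
        for e in entries
        if e.get("operation") == "add"
    ]
    if len(titles) != len(set(titles)):
        score -= 5  # assumption: one flat deduction when duplicates exist
    return max(score, 0)


recovered = [
    {"operation": "show", "params": {"taskId": "T99999"}, "error": "E_NOT_FOUND"},
    {"operation": "find", "params": {"query": "T99999"}},
    {"operation": "add", "params": {"title": "New feature"}},
]
unrecovered = [
    {"operation": "show", "params": {"taskId": "T99999"}, "error": "E_NOT_FOUND"},
    {"operation": "add", "params": {"title": "Something else"}},
]
duplicates = [
    {"operation": "add", "params": {"title": "New feature"}},
    {"operation": "add", "params": {"title": "New feature"}},
]
print(s4_error_protocol(recovered))    # 20
print(s4_error_protocol(unrecovered))  # 15
print(s4_error_protocol(duplicates))   # 15
```
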
145
125
 
@@ -148,24 +128,24 @@ S4 deduction: -5 (1 duplicate detected)
148
128
  ## S4: Full Lifecycle
149
129
 
150
130
  **Purpose**: Validates all 5 dimensions. Gold standard session.
151
- **Target score**: 100/100 (A) for MCP, ~80/100 (B) for CLI
131
+ **Target score**: 100/100 (A)
152
132
 
153
- ### Operation Sequence (MCP)
133
+ ### Operation Sequence
154
134
 
155
- ```
156
- 1. query session list — S1
157
- 2. query admin help S5: progressive disclosure
158
- 3. query admin dash overview
159
- 4. query tasks find { "status": "pending" } — S2: find not list
160
- 5. query tasks show { "taskId": "T200" } — S2: show for detail
161
- 6. mutate tasks update { "taskId": "T200", "status": "active" } — begin work
162
- (agent does work here)
163
- 7. mutate tasks complete { "taskId": "T200" } — mark done
164
- 8. query tasks find { "status": "pending" } — check next
165
- 9. mutate session end { "note": "Completed T200" } — S1
135
+ ```bash
136
+ 1. cleo-dev session list
137
+ 2. cleo-dev help # S5: progressive disclosure
138
+ 3. cleo-dev dash # overview
139
+ 4. cleo-dev find --status pending # S2: find not list
140
+ 5. cleo-dev show T200 # S2: show for detail
141
+ 6. cleo-dev update T200 --status active # begin work
142
+ # (agent does work here)
143
+ 7. cleo-dev complete T200 # mark done
144
+ 8. cleo-dev find --status pending # check next
145
+ 9. cleo-dev session end --note "Completed T200" # S1
166
146
  ```
167
147
 
168
- ### Scoring Targets (MCP)
148
+ ### Scoring Targets
169
149
 
170
150
  | Dim | Expected | Reason |
171
151
  |-----|----------|--------|
@@ -173,34 +153,31 @@ S4 deduction: -5 (1 duplicate detected)
173
153
  | S2 | 20/20 | find:list 100% (+15), show used (+5) |
174
154
  | S3 | 20/20 | No adds — no deductions |
175
155
  | S4 | 20/20 | No errors, no duplicates |
176
- | S5 | 20/20 | admin.help (+10), query gateway (+10) |
156
+ | S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
177
157
 
178
- **MCP total: 100/100 (A)**
179
- **CLI total: ~80/100 (B)** — loses S5 entirely
158
+ **Total: 100/100 (A)**
180
159
 
181
160
  ---
182
161
 
183
162
  ## S5: Multi-Domain Analysis
184
163
 
185
164
  **Purpose**: Validates cross-domain operations and advanced S5.
186
- **Target score**: 100/100 (MCP), ~80/100 (CLI)
165
+ **Target score**: 100/100
187
166
 
188
- ### Operation Sequence (MCP)
167
+ ### Operation Sequence
189
168
 
190
- ```
191
- 1. query session list — S1
192
- 2. query admin help — S5
193
- 3. query tasks find { "parent": "T500" } — S2: epic subtasks
194
- 4. query tasks show { "taskId": "T501" } — S2: inspect
195
- 5. query session context.drift multi-domain
196
- 6. query session decision.log { "taskId": "T501" } — decision history
197
- 7. mutate session record.decision { "taskId": "T501",
198
- "decision": "Use adapter pattern",
199
- "rationale": "Decouples provider logic" } — record decision
200
- 8. mutate tasks update { "taskId": "T501", "status": "active" }
201
- 9. mutate tasks complete { "taskId": "T501" }
202
- 10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
203
- 11. mutate session end — S1
169
+ ```bash
170
+ 1. cleo-dev session list
171
+ 2. cleo-dev help
172
+ 3. cleo-dev find --parent T500 # S2: epic subtasks
173
+ 4. cleo-dev show T501 # S2: inspect
174
+ 5. cleo-dev session context-drift # multi-domain
175
+ 6. cleo-dev session decision-log --task T501 # decision history
176
+ 7. cleo-dev session record-decision --task T501 --decision "Use adapter pattern" --rationale "Decouples provider logic"
177
+ 8. cleo-dev update T501 --status active
178
+ 9. cleo-dev complete T501
179
+ 10. cleo-dev find --parent T500 --status pending # next subtask
180
+ 11. cleo-dev session end
204
181
  ```
205
182
 
206
183
  ### Scoring Targets
@@ -211,24 +188,18 @@ S4 deduction: -5 (1 duplicate detected)
211
188
  | S2 | 20/20 | find used exclusively, show used |
212
189
  | S3 | 20/20 | No task.add — no deductions |
213
190
  | S4 | 20/20 | No errors |
214
- | S5 | 20/20 | admin.help (+10), query gateway (+10) |
191
+ | S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
215
192
 
216
- **MCP total: 100/100 (A)**
193
+ **Total: 100/100 (A)**
217
194
 
218
195
  ---
219
196
 
220
197
  ## Scenario Quick Reference
221
198
 
222
- | Scenario | Primary Dims Tested | MCP Expected | CLI Expected |
223
- |---|---|---|---|
224
- | S1 | S1, S2 | ~90 (A) | ~80 (B) |
225
- | S2 | S1, S3 | ~70 (C) | ~60 (C) |
226
- | S3 | S1, S3, S4 | ~90 (A) | ~80 (B) |
227
- | S4 | All 5 | 100 (A) | ~80 (B) |
228
- | S5 | All 5, cross-domain | 100 (A) | ~80 (B) |
229
-
230
- **Key insight**: CLI interface will consistently score 0 on S5 Progressive Disclosure because:
231
- 1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
232
- 2. `cleo-dev admin help` CLI call is not detected as `admin.help` MCP call (no +10)
233
-
234
- This is by design — the rubric rewards MCP-first behavior.
199
+ | Scenario | Primary Dims Tested | Expected Score |
200
+ |---|---|---|
201
+ | S1 | S1, S2 | ~90 (A) |
202
+ | S2 | S1, S3 | ~60 (C) |
203
+ | S3 | S1, S3, S4 | ~80 (B) |
204
+ | S4 | All 5 | 100 (A) |
205
+ | S5 | All 5, cross-domain | 100 (A) |
@@ -4,22 +4,22 @@ description: >-
4
4
  CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent
5
5
  session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency,
6
6
  S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes:
7
- (1) scenario — run playbook scenarios S1-S5 against MCP or CLI; (2) ab — blind A/B
8
- comparison of CLEO MCP gateway vs CLI for same domain operations with token cost
7
+ (1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B
8
+ comparison of different CLI configurations for same domain operations with token cost
9
9
  measurement; (3) blind — spawn two agents with different configurations, blind-comparator
10
10
  picks winner, analyzer produces recommendation. Use when grading agent sessions, running
11
- grade playbook scenarios, comparing MCP vs CLI behavioral differences, measuring token
12
- usage across interface types, or performing multi-run blind A/B evaluation with statistical
11
+ grade playbook scenarios, comparing behavioral differences, measuring token
12
+ usage across configurations, or performing multi-run blind A/B evaluation with statistical
13
13
  analysis and comparative report. Triggers on: grade session, evaluate agent behavior,
14
- A/B test CLEO interfaces, run grade scenario, token usage analysis, behavioral rubric,
15
- protocol compliance scoring, MCP vs CLI comparison.
16
- argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [interface=mcp|cli|both] [runs=N] [session-id=<id>]"
14
+ A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric,
15
+ protocol compliance scoring.
16
+ argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [runs=N] [session-id=<id>]"
17
17
  allowed-tools: ["Bash(python *)", "Bash(cleo-dev *)", "Bash(cleo *)", "Bash(kill *)", "Bash(lsof *)", "Agent", "Read", "Write", "Glob"]
18
18
  ---
19
19
 
20
20
  # ct-grade v2.1 — CLEO Grading and A/B Testing
21
21
 
22
- Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between MCP and CLI interfaces.
22
+ Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between different CLI configurations.
23
23
 
24
24
  ## On Every /ct-grade Invocation
25
25
 
@@ -48,7 +48,7 @@ echo "Grade viewer stopped."
48
48
  | Mode | Purpose | Key Output |
49
49
  |---|---|---|
50
50
  | `scenario` | Run playbook scenarios S1-S5 as graded sessions | GradeResult per scenario |
51
- | `ab` | Run same domain operations via MCP AND CLI, compare | comparison.json + token delta |
51
+ | `ab` | Run same domain operations with two configurations, compare | comparison.json + token delta |
52
52
  | `blind` | Two agents run same task, blind comparator picks winner | analysis.json + winner |
53
53
 
54
54
  ## Parameters
@@ -57,7 +57,7 @@ echo "Grade viewer stopped."
57
57
  |---|---|---|---|
58
58
  | `mode` | `scenario\|ab\|blind` | `scenario` | Operating mode |
59
59
  | `scenario` | `s1\|s2\|s3\|s4\|s5\|all` | `all` | Grade playbook scenario(s) to run |
60
- | `interface` | `mcp\|cli\|both` | `both` | Which interface to exercise |
60
+ | `interface` | `cli` | `cli` | Interface to exercise (CLI only) |
61
61
  | `domains` | comma list | `tasks,session` | Domains to test in `ab` mode |
62
62
  | `runs` | integer | `3` | Runs per configuration for statistical confidence |
63
63
  | `session-id` | string | — | Grade a specific existing session (skips execution) |
@@ -70,12 +70,12 @@ echo "Grade viewer stopped."
70
70
  /ct-grade session-id=<id>
71
71
  ```
72
72
 
73
- **Run scenario S4 (Full Lifecycle) on MCP:**
73
+ **Run scenario S4 (Full Lifecycle):**
74
74
  ```
75
- /ct-grade mode=scenario scenario=s4 interface=mcp
75
+ /ct-grade mode=scenario scenario=s4
76
76
  ```
77
77
 
78
- **A/B compare MCP vs CLI for tasks + session domains (3 runs each):**
78
+ **A/B compare two configurations for tasks + session domains (3 runs each):**
79
79
  ```
80
80
  /ct-grade mode=ab domains=tasks,session runs=3
81
81
  ```
@@ -93,10 +93,10 @@ echo "Grade viewer stopped."
93
93
 
94
94
  1. Set up output dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode scenario --scenario <id> --output-dir <dir>`
95
95
  2. For each scenario, spawn a `scenario-runner` agent:
96
- - Agent start: `mutate session start { "grade": true, "name": "<scenario-id>-<interface>" }`
96
+ - Agent start: `cleo session start --scope global --name "<scenario-id>" --grade`
97
97
  - Agent executes the scenario operations (see [references/playbook-v2.md](references/playbook-v2.md))
98
- - Agent end: `mutate session end`
99
- - Agent runs: `query admin grade { "sessionId": "<id>" }`
98
+ - Agent end: `cleo session end`
99
+ - Agent runs: `ct grade <sessionId>`
100
100
  - Agent saves: `GradeResult` to `<output-dir>/<scenario>/grade.json`
101
101
  3. Capture `total_tokens` + `duration_ms` from task notification → `timing.json`
102
102
  4. Run: `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode scenario`
@@ -105,16 +105,16 @@ echo "Grade viewer stopped."
105
105
 
106
106
  1. Set up run dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode ab --output-dir <dir>`
107
107
  2. For each target domain, spawn TWO agents in the SAME turn:
108
- - **Arm A** (MCP): `agents/scenario-runner.md` with `INTERFACE=mcp`
109
- - **Arm B** (CLI): `agents/scenario-runner.md` with `INTERFACE=cli`
108
+ - **Arm A**: `agents/scenario-runner.md` with configuration A
109
+ - **Arm B**: `agents/scenario-runner.md` with configuration B
110
110
  - Capture tokens from both task notifications immediately
111
- 3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which is MCP vs CLI)
111
+ 3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which configuration is which)
112
112
  4. Comparator writes `comparison.json`
113
113
  5. Run `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode ab`
114
114
 
115
115
  ### Mode: blind
116
116
 
117
- Same as `ab` but configurations may differ beyond MCP/CLI (e.g., different session scopes, different agent prompts). The comparator is always blind to configuration identity.
117
+ Same as `ab` but configurations may differ (e.g., different session scopes, different agent prompts). The comparator is always blind to configuration identity.
118
118
 
119
119
  ---
120
120
 
@@ -127,7 +127,7 @@ timing = {
127
127
  "total_tokens": task.total_tokens, # from task notification — EPHEMERAL
128
128
  "duration_ms": task.duration_ms, # from task notification
129
129
  "arm": "arm-A",
130
- "interface": "mcp",
130
+ "interface": "cli",
131
131
  "scenario": "s4",
132
132
  "run": 1,
133
133
  "executor_start": start_iso,
@@ -154,11 +154,9 @@ If running without task notifications (no total_tokens available):
154
154
  | S2 Discovery Efficiency | 20 | `find:list` ratio ≥80% (+15), `tasks.show` used (+5) |
155
155
  | S3 Task Hygiene | 20 | Starts 20, -5 per add without description, -3 if subtask no exists check |
156
156
  | S4 Error Protocol | 20 | Starts 20, -5 per unrecovered E_NOT_FOUND, -5 if duplicates |
157
- | S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10), MCP `query` gateway used (+10) |
157
+ | S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10), progressive disclosure used (+10) |
158
158
 
159
- **Grade letters:** A90, B75, C60, D45, F<45
160
-
161
- **Note:** CLI-only sessions always score 0 on S5 — `metadata.gateway` is not set by the CLI adapter. MCP earns +10 automatically.
159
+ **Grade letters:** A>=90, B>=75, C>=60, D>=45, F<45
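
As a rough illustration of how the five 20-point dimensions and the letter thresholds fit together, here is a minimal sketch. The function and constant names are hypothetical and are not part of the CLEO CLI or its grading scripts.

```python
# Thresholds taken from the grade-letter line: A>=90, B>=75, C>=60, D>=45, F<45.
GRADE_THRESHOLDS = [(90, "A"), (75, "B"), (60, "C"), (45, "D")]


def total_score(dimensions):
    """Sum the per-dimension scores (each dimension is worth up to 20)."""
    return sum(dimensions.values())


def grade_letter(total):
    """Map a 0-100 total to a letter using the thresholds above."""
    for minimum, letter in GRADE_THRESHOLDS:
        if total >= minimum:
            return letter
    return "F"


# A session with full marks on S1-S4 and half marks on S5.
session = {"S1": 20, "S2": 20, "S3": 20, "S4": 20, "S5": 10}
total = total_score(session)
print(total, grade_letter(total))  # 90 A
```
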
162
160
 
163
161
  ---
164
162
 
@@ -227,11 +225,11 @@ Shows historical grades from GRADES.jsonl, A/B summaries from any workspace subd
227
225
 
228
226
  ---
229
227
 
230
- ## MCP Grade Operations
228
+ ## CLI Grade Operations
231
229
 
232
- | Gateway | Domain | Operation | Params |
233
- |---|---|---|---|
234
- | `query` | `admin` | `grade` | `{ "sessionId": "<id>" }` |
235
- | `query` | `admin` | `grade.list` | |
236
- | `mutate` | `session` | `start` | `{ "grade": true, "name": "<n>", "scope": "global" }` |
237
- | `mutate` | `session` | `end` | |
230
+ | Command | Description |
231
+ |---------|-------------|
232
+ | `ct grade <sessionId>` | Grade a specific session |
233
+ | `ct grade --list` | List past grade results |
234
+ | `ct session start --scope global --name "<n>" --grade` | Start graded session |
235
+ | `ct session end` | End session |
@@ -1,12 +1,11 @@
1
1
  # Scenario Runner Agent
2
2
 
3
- You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the specified interface (MCP or CLI), capture the audit trail, and grade the resulting session.
3
+ You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the CLI, capture the audit trail, and grade the resulting session.
4
4
 
5
5
  ## Inputs
6
6
 
7
7
  You will receive:
8
8
  - `SCENARIO`: Which scenario to run (s1|s2|s3|s4|s5|s6|s7|s8|s9|s10)
9
- - `INTERFACE`: Which interface to use (mcp|cli)
10
9
  - `OUTPUT_DIR`: Where to write results
11
10
  - `PROJECT_DIR`: Path to the CLEO project (for cleo-dev --cwd)
12
11
  - `RUN_NUMBER`: Integer (1, 2, 3...) for repeated runs
@@ -17,30 +16,24 @@ You will receive:
17
16
 
18
17
  Note the ISO timestamp before any operations.
19
18
 
20
- ### Step 2: Start a graded session via MCP (always use MCP for session lifecycle)
19
+ ### Step 2: Start a graded session
21
20
 
22
- ```
23
- mutate session start { "grade": true, "name": "grade-<SCENARIO>-<INTERFACE>-run<RUN>", "scope": "global" }
21
+ ```bash
22
+ cleo-dev --cwd <PROJECT_DIR> session start --grade --name "grade-<SCENARIO>-run<RUN>" --scope global
24
23
  ```
25
24
 
26
25
  Save the returned `sessionId`.
27
26
 
28
27
  If this fails (DB migration error, ENOENT, or non-zero exit):
29
28
  - Write `grade.json: { "error": "DB_UNAVAILABLE", "totalScore": null }`
30
- - Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "arm": "<INTERFACE>", "scenario": "<SCENARIO>", "run": <RUN_NUMBER>, "interface": "<INTERFACE>", "executor_start": "<ISO>", "executor_end": "<ISO>" }`
29
+ - Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "scenario": "<SCENARIO>", "run": <RUN_NUMBER>, "interface": "cli", "executor_start": "<ISO>", "executor_end": "<ISO>" }`
31
30
  - Output: `SESSION_START_FAILED: DB_UNAVAILABLE`
32
31
  - Stop. Do NOT abort silently.
33
32
 
34
33
  ### Step 3: Execute scenario operations
35
34
 
36
- Follow the exact operation sequence from the scenario playbook. Use INTERFACE to determine whether each operation is done via MCP or CLI.
37
-
38
- **MCP operations** use the query/mutate gateway:
39
- ```
40
- query tasks find { "status": "active" }
41
- ```
35
+ Follow the exact operation sequence from the scenario playbook. All operations use the CLI.
42
36
 
43
- **CLI operations** use cleo-dev (prefer) or cleo, with PROJECT_DIR as cwd if provided:
44
37
  ```bash
45
38
  cleo-dev --cwd <PROJECT_DIR> find --status active
46
39
  ```
@@ -49,14 +42,14 @@ Scenario sequences are in [../references/playbook-v2.md](../references/playbook-
49
42
 
50
43
  ### Step 4: End the session
51
44
 
52
- ```
53
- mutate session end
45
+ ```bash
46
+ cleo-dev --cwd <PROJECT_DIR> session end
54
47
  ```
55
48
 
56
49
  ### Step 5: Grade the session
57
50
 
58
- ```
59
- query admin grade { "sessionId": "<saved-id>" }
51
+ ```bash
52
+ cleo-dev --cwd <PROJECT_DIR> check grade --session "<saved-id>"
60
53
  ```
61
54
 
62
55
  Save the full GradeResult JSON.
@@ -65,7 +58,7 @@ Save the full GradeResult JSON.
65
58
 
66
59
  Record every operation you executed as a JSONL file. Each line:
67
60
  ```json
68
- {"seq": 1, "gateway": "query", "domain": "tasks", "operation": "find", "params": {}, "success": true, "interface": "mcp", "timestamp": "..."}
61
+ {"seq": 1, "domain": "tasks", "operation": "find", "params": {}, "success": true, "interface": "cli", "timestamp": "..."}
69
62
  ```
70
63
 
71
64
  ### Step 7: Write output files
@@ -89,10 +82,9 @@ Write to `<OUTPUT_DIR>/<SCENARIO>/arm-<INTERFACE>/`:
89
82
  **timing.json** — Fill in what you can; orchestrator fills `total_tokens` and `duration_ms`:
90
83
  ```json
91
84
  {
92
- "arm": "<INTERFACE>",
93
85
  "scenario": "<SCENARIO>",
94
86
  "run": <RUN_NUMBER>,
95
- "interface": "<INTERFACE>",
87
+ "interface": "cli",
96
88
  "session_id": "<session-id>",
97
89
  "executor_start": "<ISO>",
98
90
  "executor_end": "<ISO>",
@@ -109,19 +101,8 @@ Note: `total_tokens` and `duration_ms` are filled by the orchestrator from the t
109
101
 
110
102
  After receiving the grade result, record the exchange to persist token measurements:
111
103
 
112
- ```
113
- mutate admin token {
114
- "action": "record",
115
- "sessionId": "<session-id>",
116
- "transport": "mcp",
117
- "domain": "admin",
118
- "operation": "grade",
119
- "metadata": {
120
- "scenario": "<SCENARIO>",
121
- "interface": "<INTERFACE>",
122
- "run": <RUN_NUMBER>
123
- }
124
- }
104
+ ```bash
105
+ cleo-dev --cwd <PROJECT_DIR> admin token record --session "<session-id>" --domain admin --operation grade --metadata '{"scenario":"<SCENARIO>","run":<RUN_NUMBER>}'
125
106
  ```
126
107
 
127
108
  Save the returned `id` as `token_usage_id` in timing.json.
@@ -170,7 +151,6 @@ Do NOT do these during scenario execution — they will lower the grade intentio
170
151
  When complete, summarize:
171
152
  ```
172
153
  SCENARIO: <id>
173
- INTERFACE: <interface>
174
154
  RUN: <n>
175
155
  SESSION_ID: <id>
176
156
  TOTAL_SCORE: <n>/100
@@ -3,9 +3,12 @@
3
3
  **Generated:** 2026-03-07 23:47 UTC
4
4
  **Source:** `/tmp/ct-grade-eval`
5
5
 
6
+ > **DEPRECATED**: This report was generated when MCP was still supported. MCP has been removed.
7
+ > All operations now use the CLI exclusively. These results are retained for historical reference only.
8
+
6
9
  ---
7
10
 
8
- ## MCP vs CLI Blind A/B Results
11
+ ## Historical: MCP vs CLI Blind A/B Results
9
12
 
10
13
  **Overall winner: MCP**
11
14