@cleocode/skills 2026.4.0 → 2026.4.3

Files changed (31)
  1. package/package.json +1 -1
  2. package/skills/_shared/manifest-operations.md +1 -2
  3. package/skills/_shared/skill-chaining-patterns.md +3 -7
  4. package/skills/_shared/subagent-protocol-base.cant +1 -1
  5. package/skills/ct-cleo/SKILL.md +56 -65
  6. package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
  7. package/skills/ct-cleo/references/session-protocol.md +3 -12
  8. package/skills/ct-codebase-mapper/SKILL.md +7 -7
  9. package/skills/ct-grade/SKILL.md +12 -46
  10. package/skills/ct-grade/agents/scenario-runner.md +11 -21
  11. package/skills/ct-grade/references/ab-test-methodology.md +14 -14
  12. package/skills/ct-grade/references/domains.md +72 -74
  13. package/skills/ct-grade/references/grade-spec.md +8 -11
  14. package/skills/ct-grade/references/scenario-playbook.md +77 -106
  15. package/skills/ct-grade-v2-1/SKILL.md +30 -32
  16. package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
  17. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
  18. package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
  19. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
  20. package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
  21. package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
  22. package/skills/ct-memory/SKILL.md +16 -35
  23. package/skills/ct-orchestrator/SKILL.md +58 -68
  24. package/skills/ct-skill-validator/SKILL.md +1 -1
  25. package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
  26. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
  27. package/skills/manifest.json +1 -1
  28. package/skills/signaldock-connect/SKILL.md +132 -0
  29. package/skills/signaldock-connect/assets/agent-card.json +48 -0
  30. package/skills/signaldock-connect/references/api-endpoints.md +131 -0
  31. package/skills.json +1 -1
@@ -14,10 +14,11 @@ An "arm" is a specific test configuration. In CLEO A/B tests, the two most commo
 
  | Arm | Typical Config | Example |
  |-----|---------------|---------|
- | A | MCP gateway | Uses `query`/`mutate` for all operations |
- | B | CLI fallback | Uses `cleo-dev` CLI for equivalent operations |
+ | A | Configuration A | Different CLI binary, flags, or prompt setup |
+ | B | Configuration B | Alternate setup for comparison |
 
- Arms can also differ by:
+ Arms can differ by:
+ - CLI binary version (`cleo-dev` vs `cleo`)
  - Session scope (`global` vs `epic:T500`)
  - Tier escalation (with/without `admin.help`)
  - Agent persona (orchestrator vs task-executor)
@@ -71,10 +72,9 @@ save_json(arm_dir + "/timing.json", timing)
 
  ### Why This Matters
 
- Token cost is the primary economic metric for comparing interfaces:
- - MCP operations may use more tokens (richer responses, metadata)
- - CLI operations may use fewer tokens but score lower on S5
- - Score-per-token tells you which interface is more efficient for protocol work
+ Token cost is the primary economic metric for comparing configurations:
+ - Different configurations may produce different token costs
+ - Score-per-token tells you which configuration is more efficient for protocol work
 
  ### Missing Token Data
 
@@ -98,16 +98,16 @@ If you forgot to capture tokens, you cannot recover them. Mark `total_tokens: nu
  | 0-5 pts | Noise level — likely equivalent |
  | 5-15 pts | Meaningful difference — investigate flags |
  | 15-25 pts | Significant — one interface clearly better |
- | 25+ pts | Extreme — likely S5 differential (MCP vs CLI) |
+ | 25+ pts | Extreme — likely S5 differential or protocol gap |
 
- ### Expected MCP vs CLI Delta
+ ### Expected Delta
 
  Based on the rubric implementation:
- - S5 Progressive Disclosure: always +20 for MCP (if admin.help called), +10 MCP no help, 0 CLI
+ - S5 Progressive Disclosure: +20 if agent uses `admin.help` and follows read-before-write discipline
  - S1-S4: approximately equal if agent follows same protocol steps
- - Total expected delta: **+10 to +20 points** in favor of MCP for equivalent protocols
+ - Configuration differences should primarily show up in S5 and token efficiency
 
- If delta exceeds 20 points, investigate whether the CLI agent is also skipping other protocol steps (session.list, descriptions, etc.) due to lack of guidance.
+ If delta exceeds 20 points, investigate whether one arm is skipping protocol steps (session.list, descriptions, etc.).
 
  ---
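
Reviewer note: the delta bands in the table above and the score-per-token metric can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and because the bands overlap at 5, 15, and 25 points, the sketch treats each upper bound as exclusive.

```python
from typing import Optional


def classify_delta(delta_points: float) -> str:
    """Map a score delta between two arms to the band names used in the table.

    Assumption: upper bounds are exclusive (5 counts as "meaningful", etc.).
    """
    d = abs(delta_points)
    if d < 5:
        return "noise"
    if d < 15:
        return "meaningful"
    if d < 25:
        return "significant"
    return "extreme"


def score_per_token(total_score: float, total_tokens: Optional[int]) -> Optional[float]:
    """Return score divided by tokens, or None when token data is missing."""
    if not total_tokens:
        return None  # missing or zero token counts cannot be compared
    return total_score / total_tokens


if __name__ == "__main__":
    # Hypothetical arm results, for illustration only.
    print(classify_delta(100 - 90))
    print(score_per_token(100, 50_000))
```

A delta that lands in the top band is the case the hunk says to investigate for skipped protocol steps.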
 
@@ -119,8 +119,8 @@ The "git tree" metaphor: each A/B run produces a branch in the results tree. Mul
  ab_results/
  run-001/ ← first full A/B run
  s4/
- run-01/arm-A/ ← first run, MCP arm
- run-01/arm-B/ ← first run, CLI arm
+ run-01/arm-A/ ← first run, arm A
+ run-01/arm-B/ ← first run, arm B
  run-01/comparison.json
  run-02/arm-A/
  ...
@@ -1,130 +1,128 @@
  # CLEO Domain Operation Reference for A/B Testing
 
  **Source**: `docs/specs/CLEO-OPERATION-CONSTITUTION.md`
- **Purpose**: Lists the key operations to test in MCP vs CLI A/B comparisons.
+ **Purpose**: Lists the key operations to test in A/B comparisons.
+
+ All operations use the CLI (`cleo` / `cleo-dev`). There is no MCP interface.
 
  ---
 
- ## MCP vs CLI Equivalents
+ ## CLI Operations by Domain
 
  For each domain, these are the canonical operations to test in A/B mode.
- MCP gateway = audit metadata.gateway is `'query'` or `'mutate'` (set by MCP adapter).
- CLI = operations routed through CLI do NOT set metadata.gateway.
 
  ### tasks (32 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Discovery | `query tasks find { "status": "active" }` | `cleo-dev find --status active` |
- | Show detail | `query tasks show { "taskId": "T123" }` | `cleo-dev show T123` |
- | List children | `query tasks list { "parent": "T100" }` | `cleo-dev list --parent T100` |
- | Create | `mutate tasks add { "title": "...", "description": "..." }` | `cleo-dev add --title "..." --description "..."` |
- | Update | `mutate tasks update { "taskId": "T123", "status": "active" }` | `cleo-dev update T123 --status active` |
- | Complete | `mutate tasks complete { "taskId": "T123" }` | `cleo-dev complete T123` |
- | Exists check | `query tasks exists { "taskId": "T123" }` | `cleo-dev exists T123` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Discovery | `cleo-dev find --status active` |
+ | Show detail | `cleo-dev show T123` |
+ | List children | `cleo-dev list --parent T100` |
+ | Create | `cleo-dev add "title" --description "..."` |
+ | Update | `cleo-dev update T123 --status active` |
+ | Complete | `cleo-dev complete T123` |
+ | Exists check | `cleo-dev exists T123` |
 
- **Key S2 insight**: `tasks.find` (MCP) vs `cleo-dev find` (CLI). Both count toward find:list ratio in the audit log. MCP find at gateway='query', CLI find also logged but without gateway metadata.
+ **Key S2 insight**: `cleo-dev find` counts toward find:list ratio in the audit log. Always prefer find over list for discovery.
 
  ### session (19 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Check existing | `query session list` | `cleo-dev session list` |
- | Start | `mutate session start { "grade": true, "scope": "global" }` | `cleo-dev session start --grade --scope global` |
- | End | `mutate session end` | `cleo-dev session end` |
- | Status | `query session status` | `cleo-dev session status` |
- | Record decision | `mutate session record.decision { "decision": "...", "rationale": "..." }` | `cleo-dev session record-decision ...` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Check existing | `cleo-dev session list` |
+ | Start | `cleo-dev session start --grade --scope global` |
+ | End | `cleo-dev session end` |
+ | Status | `cleo-dev session status` |
+ | Record decision | `cleo-dev session record-decision --decision "..." --rationale "..."` |
 
- **Critical**: `session.list` (MCP) is what the rubric checks for S1. If CLI does `cleo-dev session list`, it still appears as `domain='session', operation='list'` in the audit log. S1 counts it.
+ **Critical**: `session.list` is what the rubric checks for S1. It must appear as `domain='session', operation='list'` in the audit log.
 
- ### memory (18 operations) Tier 1
+ ### memory (18 operations) -- Tier 1
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Search | `query memory find { "query": "authentication" }` | `cleo-dev memory find "authentication"` |
- | Store observation | `mutate memory observe { "text": "..." }` | `cleo-dev memory observe "..."` |
- | Timeline | `query memory timeline { "anchor": "<id>" }` | N/A (MCP-preferred) |
+ | Test Op | CLI |
+ |---------|-----|
+ | Search | `cleo-dev memory find "authentication"` |
+ | Store observation | `cleo-dev observe "..."` |
+ | Timeline | `cleo-dev memory timeline <id>` |
 
  ### admin (44 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Dashboard | `query admin dash` | `cleo-dev dash` |
- | Help (S5 key) | `query admin help` | `cleo-dev help` |
- | Grade session | `query admin grade { "sessionId": "<id>" }` | `cleo-dev grade <id>` |
- | Health check | `query admin health` | `cleo-dev health` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Dashboard | `cleo-dev dash` |
+ | Help (S5 key) | `cleo-dev help` |
+ | Grade session | `cleo-dev check grade --session "<id>"` |
+ | Health check | `cleo-dev health` |
 
- **Critical for S5**: Only `query admin help` (MCP) satisfies the `helpCalls` filter in S5. CLI `cleo-dev help` does NOT set `metadata.gateway='query'` or match `domain='admin', operation='help'` — it depends on how the CLI routes internally.
+ **Critical for S5**: `cleo-dev help` satisfies the `helpCalls` filter in S5 Progressive Disclosure scoring.
 
- ### pipeline (42 operations) LOOM system
+ ### pipeline (42 operations) -- LOOM system
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Stage status | `query pipeline stage.status` | `cleo-dev pipeline status` |
- | Stage validate | `query pipeline stage.validate` | `cleo-dev pipeline validate` |
- | Manifest list | `query pipeline manifest.list` | `cleo-dev pipeline manifest list` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Stage status | `cleo-dev pipeline stage.status --epic <id>` |
+ | Stage validate | `cleo-dev pipeline stage.validate --epic <id> --stage <stage>` |
+ | Manifest list | `cleo-dev manifest list` |
 
  ### check (19 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Test status | `query check test.status` | `cleo-dev check test-status` |
- | Protocol check | `query check protocol` | `cleo-dev check protocol` |
- | Compliance | `query check compliance.summary` | `cleo-dev check compliance` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Test status | `cleo-dev check test-status` |
+ | Protocol check | `cleo-dev check protocol` |
+ | Compliance | `cleo-dev check compliance` |
 
  ### orchestrate (19 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Status | `query orchestrate status` | `cleo-dev orchestrate status` |
- | Waves | `query orchestrate waves` | `cleo-dev orchestrate waves` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Status | `cleo-dev orchestrator status` |
+ | Waves | `cleo-dev orchestrator waves` |
 
  ### tools (32 operations)
 
- | Test Op | MCP | CLI |
- |---------|-----|-----|
- | Skill list (S5 key) | `query tools skill.list` | `cleo-dev tools skill list` |
- | Skill show (S5 key) | `query tools skill.show { "skillId": "ct-cleo" }` | `cleo-dev tools skill show ct-cleo` |
+ | Test Op | CLI |
+ |---------|-----|
+ | Skill list (S5 key) | `cleo-dev skill list` |
+ | Skill show (S5 key) | `cleo-dev skill show ct-cleo` |
 
- **S5 note**: `tools.skill.list` and `tools.skill.show` via MCP count toward S5 helpCalls filter.
+ **S5 note**: `tools.skill.list` and `tools.skill.show` count toward S5 helpCalls filter.
 
  ---
 
- ## A/B Domain Test Configurations
+ ## A/B Configuration Test Examples
 
  ### Quick A/B: Tasks Domain
 
- **Goal**: Compare MCP vs CLI for core task operations.
- **Operations to execute (both interfaces)**:
- 1. `session list` S1
- 2. `tasks find { "status": "active" }` S2
- 3. `tasks show { "taskId": "<valid-id>" }` S2
- 4. `session end` S1
-
- **Expected score difference**: MCP ~30/100 vs CLI ~20/100 (S5 is 0 for CLI)
+ **Goal**: Compare two configurations for core task operations.
+ **Operations to execute (both arms)**:
+ 1. `cleo-dev session list` -- S1
+ 2. `cleo-dev find --status active` -- S2
+ 3. `cleo-dev show <valid-id>` -- S2
+ 4. `cleo-dev session end` -- S1
 
  ### Standard A/B: Full Protocol (S4)
 
- **Goal**: Full lifecycle scenario through both interfaces.
+ **Goal**: Full lifecycle scenario through both configurations.
  **Operations**: Follow S4 scenario (10 ops including admin.help).
- **Expected**: MCP 100/100, CLI ~80/100
+ **Expected**: 100/100 for protocol-complete arm
 
  ### Targeted A/B: S5 Isolation
 
  **Goal**: Specifically measure the S5 (progressive disclosure) gap.
- **Operations** same except arm A calls `admin.help`, arm B does not:
+ **Operations** -- same except arm A calls `admin.help`, arm B does not:
 
- Arm A (MCP + help):
- ```
- query session list query admin help query tasks find mutate session end
+ Arm A (with help):
+ ```bash
+ cleo-dev session list && cleo-dev help && cleo-dev find --status active && cleo-dev session end
  ```
 
- Arm B (CLI — no help call):
- ```
- cleo-dev session list cleo-dev find cleo-dev session end
+ Arm B (no help call):
+ ```bash
+ cleo-dev session list && cleo-dev find --status active && cleo-dev session end
  ```
 
- **Expected**: Arm A S5 = 20/20, Arm B S5 = 0/20
+ **Expected**: Arm A S5 = 20/20, Arm B S5 = 10/20
 
  ---
 
@@ -152,19 +152,17 @@ helpCalls = entries where:
  OR (domain='tools' AND operation IN ['skill.show','skill.list'])
  OR (domain='skills' AND operation IN ['list','show'])
 
- mcpQueryCalls = entries where metadata.gateway = 'query'
+ readOps = entries where operation type is a read (show, find, list, status, etc.)
  ```
 
  | Points | Condition |
  |--------|-----------|
  | +10 | `helpCalls.length > 0` |
- | +10 | `mcpQueryCalls.length > 0` |
+ | +10 | `readOps.length > 0` (agent performed read operations before writes) |
 
  **Flags on violation:**
  - `No admin.help or skill lookup calls (load ct-cleo for guidance)`
- - `No MCP query calls (prefer query over CLI for programmatic access)`
-
- **Important**: The `metadata.gateway` field equals `'query'` for MCP query operations. CLI operations do not set this field. This is how MCP vs CLI usage is distinguished in the grade.
+ - `No read operations before writes (prefer discovery before mutation)`
 
  ---
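
Reviewer note: the revised S5 rule in this hunk could be sketched as follows. This is not the grader's implementation. It assumes audit entries are dicts with `domain` and `operation` keys, assumes the first `helpCalls` clause (cut off above the hunk) matches `admin` / `help`, and does not expand the "etc." in the read-operation list. It also only tests that read operations are present, which is all the `readOps.length > 0` condition states.

```python
# Assumed (domain, operation) pairs that count as help or skill lookups.
HELP_OPS = {
    ("admin", "help"),        # assumption: clause not visible in the hunk
    ("tools", "skill.show"),
    ("tools", "skill.list"),
    ("skills", "list"),
    ("skills", "show"),
}

# Read operations named in the hunk; the "etc." is not expanded here.
READ_OPS = {"show", "find", "list", "status"}


def score_s5(entries):
    """Return (score, flags) for S5 given an ordered list of audit entries."""
    help_calls = [e for e in entries if (e["domain"], e["operation"]) in HELP_OPS]
    read_ops = [e for e in entries if e["operation"] in READ_OPS]

    score, flags = 0, []
    if help_calls:
        score += 10
    else:
        flags.append("No admin.help or skill lookup calls (load ct-cleo for guidance)")
    if read_ops:
        score += 10
    else:
        flags.append("No read operations before writes (prefer discovery before mutation)")
    return score, flags
```

Under these assumptions, an arm that runs `session list`, `help`, `find`, and `session end` scores 20, and the same arm without `help` scores 10, which matches the expectation stated for the S5 isolation test.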
 
@@ -218,14 +216,13 @@ interface GradeResult {
 
  ---
 
- ## MCP vs CLI Detection in S5
+ ## S5 Detection
 
- The grading system detects MCP usage via `metadata.gateway === 'query'`. This means:
- - **MCP interface**: All query operations set `metadata.gateway = 'query'` S5 gets +10
- - **CLI interface**: CLI operations do NOT set metadata.gateway S5 loses +10
- - **Mixed**: Any single MCP query call is enough for the +10
+ The grading system awards S5 points based on:
+ 1. Presence of `admin.help` or skill lookup calls (+10)
+ 2. Evidence of read-before-write discipline: agent performed discovery operations before mutations (+10)
 
- This is why A/B tests between MCP and CLI interfaces will reliably show S5 differences.
+ All operations use the CLI (`cleo` / `cleo-dev`). There is no MCP interface.
 
 
  ## API Surface Update
@@ -5,8 +5,7 @@
 
  Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.
 
- Use **cleo-dev** (local dev build) for MCP operations or **cleo** (production).
- Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for CLI-interface runs.
+ Use **cleo-dev** (local dev build) or **cleo** (production). All operations use the CLI.
 
  ---
 
@@ -15,17 +14,7 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
  **Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
  **Target score**: 45/100 (S1 full, S2 partial, S5 partial — no admin.help)
 
- ### Operation Sequence (MCP)
-
- ```
- 1. query session list — S1: must be first
- 2. query admin dash — project overview
- 3. query tasks find { "status": "active" } — S2: find not list
- 4. query tasks show { "taskId": "T<any>" } — S2: show used
- 5. mutate session end — S1: session.end
- ```
-
- ### Operation Sequence (CLI)
+ ### Operation Sequence
 
  ```bash
  1. cleo-dev session list
@@ -43,18 +32,16 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
  | S2 | 20/20 | find used exclusively (+15), show used (+5) |
  | S3 | 20/20 | No task adds (no deductions) |
  | S4 | 20/20 | No errors |
- | S5 (MCP) | 10/20 | query gateway used (+10), no admin.help call |
- | S5 (CLI) | 0/20 | No MCP query calls, no admin.help |
+ | S5 | 10/20 | No admin.help call |
 
- **MCP total: ~90/100 (A)**
- **CLI total: ~80/100 (B)**
+ **Total: ~90/100 (A)**
 
  ### Anti-pattern Variant (for testing grader sensitivity)
 
- ```
- query tasks find { "status": "active" } ← task op BEFORE session.list
- query session list too late for S1
- (no session.end)
+ ```bash
+ cleo-dev find --status active # task op BEFORE session.list
+ cleo-dev session list # too late for S1
+ # (no session.end)
  ```
  Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`
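
Reviewer note: a rough Python sketch of the S1 check that this anti-pattern exercises. The hunk only shows that a clean session earns 20/20 and that violating both conditions yields 0 with the two flags above. The even 10/10 split, the entry shape (dicts with `domain` and `operation`), and reusing the first flag when `session.list` is missing entirely are all assumptions.

```python
def score_s1(entries):
    """Return (score, flags) for S1 given an ordered list of audit entries."""
    ops = [(e["domain"], e["operation"]) for e in entries]
    score, flags = 0, []

    # Index of the first session.list call and of the first task operation.
    first_list = next((i for i, op in enumerate(ops) if op == ("session", "list")), None)
    first_task = next((i for i, (domain, _) in enumerate(ops) if domain == "tasks"), None)

    # Assumption: 10 points when session.list precedes every task operation.
    if first_list is not None and (first_task is None or first_list < first_task):
        score += 10
    else:
        flags.append("session.list called after task ops")

    # Assumption: 10 points when session.end appears anywhere in the log.
    if ("session", "end") in ops:
        score += 10
    else:
        flags.append("session.end never called")
    return score, flags
```

Feeding it the anti-pattern sequence (a `tasks find` entry followed by `session list`, with no `session end`) returns 0 with both flags.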
 
@@ -63,19 +50,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
  ## S2: Task Creation Hygiene
 
  **Purpose**: Validates S3 (Task Hygiene) and S1.
- **Target score**: 60/100 (S1 full, S3 full, S5 partial MCP or 0 CLI)
+ **Target score**: 60/100 (S1 full, S3 full, S5 partial)
 
- ### Operation Sequence (MCP)
+ ### Operation Sequence
 
- ```
- 1. query session list — S1
- 2. query tasks exists { "taskId": "T100" } — S3: parent verify
- 3. mutate tasks add { "title": "Implement auth",
- "description": "Add JWT authentication to API endpoints",
- "parent": "T100" } — S3: desc + parent
- 4. mutate tasks add { "title": "Write tests",
- "description": "Unit tests for auth module" } — S3: desc present
- 5. mutate session end — S1
+ ```bash
+ 1. cleo-dev session list
+ 2. cleo-dev show T100 # S3: parent verify
+ 3. cleo-dev add "Implement auth" --description "Add JWT authentication to API endpoints" --parent T100
+ 4. cleo-dev add "Write tests" --description "Unit tests for auth module"
+ 5. cleo-dev session end
  ```
 
  ### Scoring Targets
@@ -83,18 +67,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
  | Dim | Expected | Reason |
  |-----|----------|--------|
  | S1 | 20/20 | session.list first, session.end present |
- | S3 | 20/20 | All adds have descriptions, parent verified via exists |
- | S5 (MCP) | 10/20 | query gateway used |
- | S5 (CLI) | 0/20 | no MCP query, no help |
+ | S3 | 20/20 | All adds have descriptions, parent verified via show |
+ | S5 | 0/20 | no help |
 
- **MCP total: ~70/100 (C)**
- **CLI total: ~60/100 (C)**
+ **Total: ~60/100 (C)**
 
  ### Anti-pattern Variant
 
- ```
- mutate tasks add { "title": "Implement auth", "parent": "T100" } ← no desc, no exists check
- mutate tasks add { "title": "Write tests" } no desc
+ ```bash
+ cleo-dev add "Implement auth" --parent T100 # no desc, no exists check
+ cleo-dev add "Write tests" # no desc
  ```
  Expected S3: 7 (20 - 5 - 5 - 3 = 7)
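
Reviewer note: the arithmetic above can be reproduced with a small Python sketch. The mapping of deductions is inferred from the comments in the anti-pattern (5 points per add without a description, 3 points for an add whose parent was not verified first) and is an assumption, as are the function name and the dict shape used for each add.

```python
def score_s3(adds, verified_parents=frozenset()):
    """Return the S3 score for a list of task-add records.

    Each record is a dict with optional "description" and "parent" keys.
    verified_parents holds parent IDs that were looked up before the add.
    """
    score = 20
    for add in adds:
        if not add.get("description"):
            score -= 5  # assumed deduction: missing description
        parent = add.get("parent")
        if parent and parent not in verified_parents:
            score -= 3  # assumed deduction: parent never verified
    return max(score, 0)


# The anti-pattern: two adds without descriptions, one with an unverified parent.
anti_pattern = [
    {"title": "Implement auth", "parent": "T100"},
    {"title": "Write tests"},
]
print(score_s3(anti_pattern))
```

With the clean sequence from the scenario (both adds described, `T100` shown first), the same function leaves the score at 20.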
 
@@ -104,15 +86,14 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 
  **Purpose**: Validates S4 (Error Protocol).
 
- ### Operation Sequence (MCP)
+ ### Operation Sequence
 
- ```
- 1. query session list — S1
- 2. query tasks show { "taskId": "T99999" } — triggers E_NOT_FOUND
- 3. query tasks find { "query": "T99999" } — S4: recovery within 4 ops
- 4. mutate tasks add { "title": "New feature",
- "description": "Implement the feature that was not found" } — S3: desc present
- 5. mutate session end — S1
+ ```bash
+ 1. cleo-dev session list
+ 2. cleo-dev show T99999 # triggers E_NOT_FOUND
+ 3. cleo-dev find "T99999" # S4: recovery within 4 ops
+ 4. cleo-dev add "New feature" --description "Implement the feature that was not found"
+ 5. cleo-dev session end
  ```
 
  ### Scoring Targets
@@ -122,24 +103,23 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
  | S1 | 20/20 | Proper session lifecycle |
  | S3 | 20/20 | Task created with description |
  | S4 | 20/20 | E_NOT_FOUND followed by recovery lookup within 4 entries |
- | S5 (MCP) | 10/20 | query gateway used |
+ | S5 | 0/20 | no help |
 
- **MCP total: ~90/100 (A)**
+ **Total: ~80/100 (B)**
 
  ### Anti-pattern: Unrecovered Error
 
+ ```bash
+ cleo-dev show T99999 # E_NOT_FOUND
+ cleo-dev add "Something else" --description "Unrelated" # no recovery lookup
  ```
- query tasks show { "taskId": "T99999" } ← E_NOT_FOUND
- mutate tasks add { "title": "Something else",
- "description": "Unrelated" } ← no recovery lookup
- ```
- S4 deduction: -5 (no tasks.find within next 4 entries)
+ S4 deduction: -5 (no find within next 4 entries)
 
  ### Anti-pattern: Duplicate Creates
 
- ```
- mutate tasks add { "title": "New feature", "description": "First attempt" }
- mutate tasks add { "title": "New feature", "description": "Second attempt" }
+ ```bash
+ cleo-dev add "New feature" --description "First attempt"
+ cleo-dev add "New feature" --description "Second attempt"
  ```
  S4 deduction: -5 (1 duplicate detected)
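
Reviewer note: both S4 deductions in this hunk can be illustrated with one Python sketch. It assumes audit entries are dicts with an `operation` key plus optional `error` and `title` keys, and that a duplicate means a repeated title on an `add`. Those details are not confirmed by the diff, so treat this as a sketch rather than the grader's logic.

```python
def score_s4(entries):
    """Return the S4 score for an ordered list of audit entries."""
    score = 20

    # -5 for each E_NOT_FOUND not followed by a find within the next 4 entries.
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            window = entries[i + 1 : i + 5]
            if not any(e["operation"] == "find" for e in window):
                score -= 5

    # -5 for each add that repeats an earlier title (assumed duplicate test).
    seen_titles = set()
    for entry in entries:
        if entry["operation"] == "add":
            title = entry.get("title")
            if title in seen_titles:
                score -= 5
            else:
                seen_titles.add(title)

    return max(score, 0)
```

The recovered sequence from the S3 scenario keeps the full 20, while the unrecovered-error and duplicate-create variants each drop to 15 under these assumptions.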
 
@@ -148,24 +128,24 @@ S4 deduction: -5 (1 duplicate detected)
  ## S4: Full Lifecycle
 
  **Purpose**: Validates all 5 dimensions. Gold standard session.
- **Target score**: 100/100 (A) for MCP, ~80/100 (B) for CLI
+ **Target score**: 100/100 (A)
 
- ### Operation Sequence (MCP)
+ ### Operation Sequence
 
- ```
- 1. query session list — S1
- 2. query admin help S5: progressive disclosure
- 3. query admin dash overview
- 4. query tasks find { "status": "pending" } — S2: find not list
- 5. query tasks show { "taskId": "T200" } — S2: show for detail
- 6. mutate tasks update { "taskId": "T200", "status": "active" } — begin work
- (agent does work here)
- 7. mutate tasks complete { "taskId": "T200" } — mark done
- 8. query tasks find { "status": "pending" } — check next
- 9. mutate session end { "note": "Completed T200" } — S1
+ ```bash
+ 1. cleo-dev session list
+ 2. cleo-dev help # S5: progressive disclosure
+ 3. cleo-dev dash # overview
+ 4. cleo-dev find --status pending # S2: find not list
+ 5. cleo-dev show T200 # S2: show for detail
+ 6. cleo-dev update T200 --status active # begin work
+ # (agent does work here)
+ 7. cleo-dev complete T200 # mark done
+ 8. cleo-dev find --status pending # check next
+ 9. cleo-dev session end --note "Completed T200" # S1
  ```
 
- ### Scoring Targets (MCP)
+ ### Scoring Targets
 
  | Dim | Expected | Reason |
  |-----|----------|--------|
@@ -173,34 +153,31 @@ S4 deduction: -5 (1 duplicate detected)
  | S2 | 20/20 | find:list 100% (+15), show used (+5) |
  | S3 | 20/20 | No adds — no deductions |
  | S4 | 20/20 | No errors, no duplicates |
- | S5 | 20/20 | admin.help (+10), query gateway (+10) |
+ | S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
- **MCP total: 100/100 (A)**
- **CLI total: ~80/100 (B)** — loses S5 entirely
+ **Total: 100/100 (A)**
 
  ---
 
  ## S5: Multi-Domain Analysis
 
  **Purpose**: Validates cross-domain operations and advanced S5.
- **Target score**: 100/100 (MCP), ~80/100 (CLI)
+ **Target score**: 100/100
 
- ### Operation Sequence (MCP)
+ ### Operation Sequence
 
- ```
- 1. query session list — S1
- 2. query admin help — S5
- 3. query tasks find { "parent": "T500" } — S2: epic subtasks
- 4. query tasks show { "taskId": "T501" } — S2: inspect
- 5. query session context.drift multi-domain
- 6. query session decision.log { "taskId": "T501" } — decision history
- 7. mutate session record.decision { "taskId": "T501",
- "decision": "Use adapter pattern",
- "rationale": "Decouples provider logic" } — record decision
- 8. mutate tasks update { "taskId": "T501", "status": "active" }
- 9. mutate tasks complete { "taskId": "T501" }
- 10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
- 11. mutate session end — S1
+ ```bash
+ 1. cleo-dev session list
+ 2. cleo-dev help
+ 3. cleo-dev find --parent T500 # S2: epic subtasks
+ 4. cleo-dev show T501 # S2: inspect
+ 5. cleo-dev session context-drift # multi-domain
+ 6. cleo-dev session decision-log --task T501 # decision history
+ 7. cleo-dev session record-decision --task T501 --decision "Use adapter pattern" --rationale "Decouples provider logic"
+ 8. cleo-dev update T501 --status active
+ 9. cleo-dev complete T501
+ 10. cleo-dev find --parent T500 --status pending # next subtask
+ 11. cleo-dev session end
  ```
 
  ### Scoring Targets
@@ -211,24 +188,18 @@ S4 deduction: -5 (1 duplicate detected)
  | S2 | 20/20 | find used exclusively, show used |
  | S3 | 20/20 | No task.add — no deductions |
  | S4 | 20/20 | No errors |
- | S5 | 20/20 | admin.help (+10), query gateway (+10) |
+ | S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
- **MCP total: 100/100 (A)**
+ **Total: 100/100 (A)**
 
  ---
 
  ## Scenario Quick Reference
 
- | Scenario | Primary Dims Tested | MCP Expected | CLI Expected |
- |---|---|---|---|
- | S1 | S1, S2 | ~90 (A) | ~80 (B) |
- | S2 | S1, S3 | ~70 (C) | ~60 (C) |
- | S3 | S1, S3, S4 | ~90 (A) | ~80 (B) |
- | S4 | All 5 | 100 (A) | ~80 (B) |
- | S5 | All 5, cross-domain | 100 (A) | ~80 (B) |
-
- **Key insight**: CLI interface will consistently score 0 on S5 Progressive Disclosure because:
- 1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
- 2. `cleo-dev admin help` CLI call is not detected as `admin.help` MCP call (no +10)
-
- This is by design — the rubric rewards MCP-first behavior.
+ | Scenario | Primary Dims Tested | Expected Score |
+ |---|---|---|
+ | S1 | S1, S2 | ~90 (A) |
+ | S2 | S1, S3 | ~60 (C) |
+ | S3 | S1, S3, S4 | ~80 (B) |
+ | S4 | All 5 | 100 (A) |
+ | S5 | All 5, cross-domain | 100 (A) |