@cleocode/skills 2026.4.0 → 2026.4.2
- package/package.json +1 -1
- package/skills/_shared/manifest-operations.md +1 -2
- package/skills/_shared/skill-chaining-patterns.md +3 -7
- package/skills/_shared/subagent-protocol-base.cant +1 -1
- package/skills/ct-cleo/SKILL.md +56 -65
- package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
- package/skills/ct-cleo/references/session-protocol.md +3 -12
- package/skills/ct-codebase-mapper/SKILL.md +7 -7
- package/skills/ct-grade/SKILL.md +12 -46
- package/skills/ct-grade/agents/scenario-runner.md +11 -21
- package/skills/ct-grade/references/ab-test-methodology.md +14 -14
- package/skills/ct-grade/references/domains.md +72 -74
- package/skills/ct-grade/references/grade-spec.md +8 -11
- package/skills/ct-grade/references/scenario-playbook.md +77 -106
- package/skills/ct-grade-v2-1/SKILL.md +30 -32
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
- package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
- package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
- package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
- package/skills/ct-memory/SKILL.md +16 -35
- package/skills/ct-orchestrator/SKILL.md +58 -68
- package/skills/ct-skill-validator/SKILL.md +1 -1
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
- package/skills/manifest.json +1 -1
- package/skills/signaldock-connect/SKILL.md +132 -0
- package/skills/signaldock-connect/assets/agent-card.json +48 -0
- package/skills/signaldock-connect/references/api-endpoints.md +131 -0
- package/skills.json +1 -1
@@ -14,10 +14,11 @@ An "arm" is a specific test configuration. In CLEO A/B tests, the two most commo
 
 | Arm | Typical Config | Example |
 |-----|---------------|---------|
-| A |
-| B |
+| A | Configuration A | Different CLI binary, flags, or prompt setup |
+| B | Configuration B | Alternate setup for comparison |
 
-Arms can
+Arms can differ by:
+- CLI binary version (`cleo-dev` vs `cleo`)
 - Session scope (`global` vs `epic:T500`)
 - Tier escalation (with/without `admin.help`)
 - Agent persona (orchestrator vs task-executor)
@@ -71,10 +72,9 @@ save_json(arm_dir + "/timing.json", timing)
 
 ### Why This Matters
 
-Token cost is the primary economic metric for comparing
--
--
-- Score-per-token tells you which interface is more efficient for protocol work
+Token cost is the primary economic metric for comparing configurations:
+- Different configurations may produce different token costs
+- Score-per-token tells you which configuration is more efficient for protocol work
 
 ### Missing Token Data
 
@@ -98,16 +98,16 @@ If you forgot to capture tokens, you cannot recover them. Mark `total_tokens: nu
 | 0-5 pts | Noise level — likely equivalent |
 | 5-15 pts | Meaningful difference — investigate flags |
 | 15-25 pts | Significant — one interface clearly better |
-| 25+ pts | Extreme — likely S5 differential
+| 25+ pts | Extreme — likely S5 differential or protocol gap |
 
-### Expected
+### Expected Delta
 
 Based on the rubric implementation:
-- S5 Progressive Disclosure:
+- S5 Progressive Disclosure: +20 if agent uses `admin.help` and follows read-before-write discipline
 - S1-S4: approximately equal if agent follows same protocol steps
--
+- Configuration differences should primarily show up in S5 and token efficiency
 
-If delta exceeds 20 points, investigate whether
+If delta exceeds 20 points, investigate whether one arm is skipping protocol steps (session.list, descriptions, etc.).
 
 ---
 
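The delta bands and the score-per-token metric in the hunks above can be sketched as follows. This is a minimal illustration, not code from the package: the function names are hypothetical, the band labels are shortened, and the half-open band boundaries are an assumption, since the table's ranges overlap at 5, 15, and 25.

```python
from typing import Optional


def score_per_token(score: float, total_tokens: Optional[int]) -> Optional[float]:
    """Return score per 1,000 tokens, or None when token data was not captured."""
    if total_tokens is None or total_tokens <= 0:
        return None
    return score * 1000 / total_tokens


def interpret_delta(score_a: float, score_b: float) -> str:
    """Map the absolute score difference between two arms to a band label."""
    delta = abs(score_a - score_b)
    # Assumption: bands are half-open, so a delta of exactly 5 counts as "meaningful".
    if delta < 5:
        return "noise"
    if delta < 15:
        return "meaningful"
    if delta < 25:
        return "significant"
    return "extreme"
```

An arm whose `total_tokens` is null yields `None` rather than a misleading ratio, which matches the note that missing token data cannot be recovered.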
@@ -119,8 +119,8 @@ The "git tree" metaphor: each A/B run produces a branch in the results tree. Mul
 ab_results/
 run-001/ ← first full A/B run
 s4/
-run-01/arm-A/ ← first run,
-run-01/arm-B/ ← first run,
+run-01/arm-A/ ← first run, arm A
+run-01/arm-B/ ← first run, arm B
 run-01/comparison.json
 run-02/arm-A/
 ...
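A small path helper can make the results-tree layout above concrete. This is a sketch under assumptions: the nesting order (`ab_results/run-001/s4/run-01/arm-A`) is inferred from the listing, whose indentation did not survive in this diff, and `arm_dir` is a hypothetical name rather than a function from the package.

```python
from pathlib import PurePosixPath


def arm_dir(root: str, ab_run: int, scenario: str, run: int, arm: str) -> PurePosixPath:
    """Build the directory for one arm of one scenario run (assumed nesting)."""
    if arm not in ("A", "B"):
        raise ValueError(f"unknown arm: {arm!r}")
    return (
        PurePosixPath(root)
        / f"run-{ab_run:03d}"   # full A/B run, e.g. run-001
        / scenario              # scenario folder, e.g. s4
        / f"run-{run:02d}"      # repetition within the scenario, e.g. run-01
        / f"arm-{arm}"
    )


# comparison.json sits next to the two arm folders in the listing.
comparison = arm_dir("ab_results", 1, "s4", 1, "A").parent / "comparison.json"
```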
@@ -1,130 +1,128 @@
 # CLEO Domain Operation Reference for A/B Testing
 
 **Source**: `docs/specs/CLEO-OPERATION-CONSTITUTION.md`
-**Purpose**: Lists the key operations to test in
+**Purpose**: Lists the key operations to test in A/B comparisons.
+
+All operations use the CLI (`cleo` / `cleo-dev`). There is no MCP interface.
 
 ---
 
-##
+## CLI Operations by Domain
 
 For each domain, these are the canonical operations to test in A/B mode.
-MCP gateway = audit metadata.gateway is `'query'` or `'mutate'` (set by MCP adapter).
-CLI = operations routed through CLI do NOT set metadata.gateway.
 
 ### tasks (32 operations)
 
-| Test Op |
-
-| Discovery | `
-| Show detail | `
-| List children | `
-| Create | `
-| Update | `
-| Complete | `
-| Exists check | `
+| Test Op | CLI |
+|---------|-----|
+| Discovery | `cleo-dev find --status active` |
+| Show detail | `cleo-dev show T123` |
+| List children | `cleo-dev list --parent T100` |
+| Create | `cleo-dev add "title" --description "..."` |
+| Update | `cleo-dev update T123 --status active` |
+| Complete | `cleo-dev complete T123` |
+| Exists check | `cleo-dev exists T123` |
 
-**Key S2 insight**: `
+**Key S2 insight**: `cleo-dev find` counts toward find:list ratio in the audit log. Always prefer find over list for discovery.
 
 ### session (19 operations)
 
-| Test Op |
-
-| Check existing | `
-| Start | `
-| End | `
-| Status | `
-| Record decision | `
+| Test Op | CLI |
+|---------|-----|
+| Check existing | `cleo-dev session list` |
+| Start | `cleo-dev session start --grade --scope global` |
+| End | `cleo-dev session end` |
+| Status | `cleo-dev session status` |
+| Record decision | `cleo-dev session record-decision --decision "..." --rationale "..."` |
 
-**Critical**: `session.list`
+**Critical**: `session.list` is what the rubric checks for S1. It must appear as `domain='session', operation='list'` in the audit log.
 
-### memory (18 operations)
+### memory (18 operations) -- Tier 1
 
-| Test Op |
-
-| Search | `
-| Store observation | `
-| Timeline | `
+| Test Op | CLI |
+|---------|-----|
+| Search | `cleo-dev memory find "authentication"` |
+| Store observation | `cleo-dev observe "..."` |
+| Timeline | `cleo-dev memory timeline <id>` |
 
 ### admin (44 operations)
 
-| Test Op |
-
-| Dashboard | `
-| Help (S5 key) | `
-| Grade session | `
-| Health check | `
+| Test Op | CLI |
+|---------|-----|
+| Dashboard | `cleo-dev dash` |
+| Help (S5 key) | `cleo-dev help` |
+| Grade session | `cleo-dev check grade --session "<id>"` |
+| Health check | `cleo-dev health` |
 
-**Critical for S5**:
+**Critical for S5**: `cleo-dev help` satisfies the `helpCalls` filter in S5 Progressive Disclosure scoring.
 
-### pipeline (42 operations)
+### pipeline (42 operations) -- LOOM system
 
-| Test Op |
-
-| Stage status | `
-| Stage validate | `
-| Manifest list | `
+| Test Op | CLI |
+|---------|-----|
+| Stage status | `cleo-dev pipeline stage.status --epic <id>` |
+| Stage validate | `cleo-dev pipeline stage.validate --epic <id> --stage <stage>` |
+| Manifest list | `cleo-dev manifest list` |
 
 ### check (19 operations)
 
-| Test Op |
-
-| Test status | `
-| Protocol check | `
-| Compliance | `
+| Test Op | CLI |
+|---------|-----|
+| Test status | `cleo-dev check test-status` |
+| Protocol check | `cleo-dev check protocol` |
+| Compliance | `cleo-dev check compliance` |
 
 ### orchestrate (19 operations)
 
-| Test Op |
-
-| Status | `
-| Waves | `
+| Test Op | CLI |
+|---------|-----|
+| Status | `cleo-dev orchestrator status` |
+| Waves | `cleo-dev orchestrator waves` |
 
 ### tools (32 operations)
 
-| Test Op |
-
-| Skill list (S5 key) | `
-| Skill show (S5 key) | `
+| Test Op | CLI |
+|---------|-----|
+| Skill list (S5 key) | `cleo-dev skill list` |
+| Skill show (S5 key) | `cleo-dev skill show ct-cleo` |
 
-**S5 note**: `tools.skill.list` and `tools.skill.show`
+**S5 note**: `tools.skill.list` and `tools.skill.show` count toward S5 helpCalls filter.
 
 ---
 
-## A/B
+## A/B Configuration Test Examples
 
 ### Quick A/B: Tasks Domain
 
-**Goal**: Compare
-**Operations to execute (both
-1. `session list`
-2. `
-3. `
-4. `session end`
-
-**Expected score difference**: MCP ~30/100 vs CLI ~20/100 (S5 is 0 for CLI)
+**Goal**: Compare two configurations for core task operations.
+**Operations to execute (both arms)**:
+1. `cleo-dev session list` -- S1
+2. `cleo-dev find --status active` -- S2
+3. `cleo-dev show <valid-id>` -- S2
+4. `cleo-dev session end` -- S1
 
 ### Standard A/B: Full Protocol (S4)
 
-**Goal**: Full lifecycle scenario through both
+**Goal**: Full lifecycle scenario through both configurations.
 **Operations**: Follow S4 scenario (10 ops including admin.help).
-**Expected**:
+**Expected**: 100/100 for protocol-complete arm
 
 ### Targeted A/B: S5 Isolation
 
 **Goal**: Specifically measure the S5 (progressive disclosure) gap.
-**Operations**
+**Operations** -- same except arm A calls `admin.help`, arm B does not:
 
-Arm A (
-```
-
+Arm A (with help):
+```bash
+cleo-dev session list && cleo-dev help && cleo-dev find --status active && cleo-dev session end
 ```
 
-Arm B (
-```
-cleo-dev session list
+Arm B (no help call):
+```bash
+cleo-dev session list && cleo-dev find --status active && cleo-dev session end
 ```
 
-**Expected**: Arm A S5 = 20/20, Arm B S5 =
+**Expected**: Arm A S5 = 20/20, Arm B S5 = 10/20
 
 ---
 
@@ -152,19 +152,17 @@ helpCalls = entries where:
 OR (domain='tools' AND operation IN ['skill.show','skill.list'])
 OR (domain='skills' AND operation IN ['list','show'])
 
-
+readOps = entries where operation type is a read (show, find, list, status, etc.)
 ```
 
 | Points | Condition |
 |--------|-----------|
 | +10 | `helpCalls.length > 0` |
-| +10 | `
+| +10 | `readOps.length > 0` (agent performed read operations before writes) |
 
 **Flags on violation:**
 - `No admin.help or skill lookup calls (load ct-cleo for guidance)`
-- `No
-
-**Important**: The `metadata.gateway` field equals `'query'` for MCP query operations. CLI operations do not set this field. This is how MCP vs CLI usage is distinguished in the grade.
+- `No read operations before writes (prefer discovery before mutation)`
 
 ---
 
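The revised S5 rule in the hunk above can be sketched as a small scoring function. This is an illustration under assumptions, not the package's grader: the entry shape (dicts with `domain` and `operation`) is hypothetical, the first `helpCalls` clause (`admin` / `help`) is not visible in the hunk and is assumed here, and `READ_OPS` covers only the four operations the text names before "etc.".

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]

# Only the read operations named in the spec text; the real list is longer ("etc.").
READ_OPS = {"show", "find", "list", "status"}


def is_help_call(entry: Entry) -> bool:
    """True when an audit entry matches the helpCalls filter."""
    domain, op = entry.get("domain"), entry.get("operation")
    return (
        (domain == "admin" and op == "help")  # assumed clause, not shown in the hunk
        or (domain == "tools" and op in ("skill.show", "skill.list"))
        or (domain == "skills" and op in ("list", "show"))
    )


def score_s5(entries: List[Entry]) -> Tuple[int, List[str]]:
    """Award +10 for any help call and +10 for any read operation."""
    score, flags = 0, []
    if any(is_help_call(e) for e in entries):
        score += 10
    else:
        flags.append("No admin.help or skill lookup calls (load ct-cleo for guidance)")
    if any(e.get("operation") in READ_OPS for e in entries):
        score += 10
    else:
        flags.append("No read operations before writes (prefer discovery before mutation)")
    return score, flags
```

With these assumptions, a session that lists sessions and finds tasks but never calls help scores 10, which agrees with the "Arm B S5 = 10/20" expectation in the S5 isolation example.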
@@ -218,14 +216,13 @@ interface GradeResult {
 
 ---
 
-##
+## S5 Detection
 
-The grading system
-
-
-- **Mixed**: Any single MCP query call is enough for the +10
+The grading system awards S5 points based on:
+1. Presence of `admin.help` or skill lookup calls (+10)
+2. Evidence of read-before-write discipline — agent performed discovery operations before mutations (+10)
 
-
+All operations use the CLI (`cleo` / `cleo-dev`). There is no MCP interface.
 
 
 ## API Surface Update
@@ -5,8 +5,7 @@
 
 Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.
 
-Use **cleo-dev** (local dev build)
-Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for CLI-interface runs.
+Use **cleo-dev** (local dev build) or **cleo** (production). All operations use the CLI.
 
 ---
 
@@ -15,17 +14,7 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
 **Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
 **Target score**: 45/100 (S1 full, S2 partial, S5 partial — no admin.help)
 
-### Operation Sequence
-
-```
-1. query session list — S1: must be first
-2. query admin dash — project overview
-3. query tasks find { "status": "active" } — S2: find not list
-4. query tasks show { "taskId": "T<any>" } — S2: show used
-5. mutate session end — S1: session.end
-```
-
-### Operation Sequence (CLI)
+### Operation Sequence
 
 ```bash
 1. cleo-dev session list
@@ -43,18 +32,16 @@ Use the MCP `query`/`mutate` gateway for MCP-interface runs; `cleo-dev` CLI for
 | S2 | 20/20 | find used exclusively (+15), show used (+5) |
 | S3 | 20/20 | No task adds (no deductions) |
 | S4 | 20/20 | No errors |
-| S5
-| S5 (CLI) | 0/20 | No MCP query calls, no admin.help |
+| S5 | 10/20 | No admin.help call |
 
-**
-**CLI total: ~80/100 (B)**
+**Total: ~90/100 (A)**
 
 ### Anti-pattern Variant (for testing grader sensitivity)
 
-```
-
-
-(no session.end)
+```bash
+cleo-dev find --status active # task op BEFORE session.list
+cleo-dev session list # too late for S1
+# (no session.end)
 ```
 Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`
 
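The S1 expectations in this hunk (20/20 when `session.list` comes first and `session.end` is present, 0 with two flags in the anti-pattern) suggest a check along the following lines. This is a sketch only: the 10/10 point split, the entry shape, and the `session.list never called` flag wording are assumptions, not taken from the package.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]


def _first_index(entries: List[Entry], domain: str, operation: str = "") -> int:
    """Index of the first entry matching domain (and operation, if given), else -1."""
    for i, entry in enumerate(entries):
        if entry.get("domain") == domain and operation in ("", entry.get("operation")):
            return i
    return -1


def score_s1(entries: List[Entry]) -> Tuple[int, List[str]]:
    """Score session discipline: list before any task op, and an explicit end."""
    score, flags = 0, []
    list_idx = _first_index(entries, "session", "list")
    task_idx = _first_index(entries, "tasks")
    if list_idx == -1:
        flags.append("session.list never called")  # hypothetical flag wording
    elif task_idx != -1 and task_idx < list_idx:
        flags.append("session.list called after task ops")
    else:
        score += 10  # assumed split: 10 points per condition
    if _first_index(entries, "session", "end") != -1:
        score += 10
    else:
        flags.append("session.end never called")
    return score, flags
```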
@@ -63,19 +50,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
 ## S2: Task Creation Hygiene
 
 **Purpose**: Validates S3 (Task Hygiene) and S1.
-**Target score**: 60/100 (S1 full, S3 full, S5 partial
+**Target score**: 60/100 (S1 full, S3 full, S5 partial)
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-
-
-4. mutate tasks add { "title": "Write tests",
-"description": "Unit tests for auth module" } — S3: desc present
-5. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev show T100 # S3: parent verify
+3. cleo-dev add "Implement auth" --description "Add JWT authentication to API endpoints" --parent T100
+4. cleo-dev add "Write tests" --description "Unit tests for auth module"
+5. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -83,18 +67,16 @@ Expected S1: 0 — flags: `session.list called after task ops`, `session.end nev
 | Dim | Expected | Reason |
 |-----|----------|--------|
 | S1 | 20/20 | session.list first, session.end present |
-| S3 | 20/20 | All adds have descriptions, parent verified via
-| S5
-| S5 (CLI) | 0/20 | no MCP query, no help |
+| S3 | 20/20 | All adds have descriptions, parent verified via show |
+| S5 | 0/20 | no help |
 
-**
-**CLI total: ~60/100 (C)**
+**Total: ~60/100 (C)**
 
 ### Anti-pattern Variant
 
-```
-
-
+```bash
+cleo-dev add "Implement auth" --parent T100 # no desc, no exists check
+cleo-dev add "Write tests" # no desc
 ```
 Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 
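The arithmetic behind "Expected S3: 7" can be reproduced with a short sketch. It assumes, based on the inline comments in the anti-pattern, that each add without a description costs 5 points and each add under an unverified parent costs 3; the function name and the dict shape are hypothetical.

```python
from typing import Dict, List, Optional, Set


def score_s3(adds: List[Dict[str, Optional[str]]], verified_parents: Set[str]) -> int:
    """Start from 20 and deduct for missing descriptions and unverified parents."""
    score = 20
    for add in adds:
        if not add.get("description"):
            score -= 5  # assumed deduction: add without a description
        parent = add.get("parent")
        if parent and parent not in verified_parents:
            score -= 3  # assumed deduction: parent never checked via show/exists
    return max(score, 0)


anti_pattern = [
    {"title": "Implement auth", "parent": "T100"},  # no desc, no exists check
    {"title": "Write tests"},                       # no desc
]
```

Here `score_s3(anti_pattern, set())` works out to 20 - 5 - 3 - 5, the same total as the expected value above.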
@@ -104,15 +86,14 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 
 **Purpose**: Validates S4 (Error Protocol).
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-
-5. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev show T99999 # triggers E_NOT_FOUND
+3. cleo-dev find "T99999" # S4: recovery within 4 ops
+4. cleo-dev add "New feature" --description "Implement the feature that was not found"
+5. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -122,24 +103,23 @@ Expected S3: 7 (20 - 5 - 5 - 3 = 7)
 | S1 | 20/20 | Proper session lifecycle |
 | S3 | 20/20 | Task created with description |
 | S4 | 20/20 | E_NOT_FOUND followed by recovery lookup within 4 entries |
-| S5
+| S5 | 0/20 | no help |
 
-**
+**Total: ~80/100 (B)**
 
 ### Anti-pattern: Unrecovered Error
 
+```bash
+cleo-dev show T99999 # E_NOT_FOUND
+cleo-dev add "Something else" --description "Unrelated" # no recovery lookup
 ```
-
-mutate tasks add { "title": "Something else",
-"description": "Unrelated" } ← no recovery lookup
-```
-S4 deduction: -5 (no tasks.find within next 4 entries)
+S4 deduction: -5 (no find within next 4 entries)
 
 ### Anti-pattern: Duplicate Creates
 
-```
-
-
+```bash
+cleo-dev add "New feature" --description "First attempt"
+cleo-dev add "New feature" --description "Second attempt"
 ```
 S4 deduction: -5 (1 duplicate detected)
 
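The two S4 deductions described above (an `E_NOT_FOUND` with no `find` in the next 4 entries, and a duplicate create) can be sketched as one pass over the audit entries. This is an assumption-heavy illustration: the `error` and `title` fields, the title-based duplicate check, and the function name are hypothetical and not confirmed by the package.

```python
from typing import Dict, List

Entry = Dict[str, str]


def score_s4(entries: List[Entry]) -> int:
    """Start from 20; deduct 5 per unrecovered E_NOT_FOUND and 5 per duplicate add."""
    score = 20
    seen_titles = set()
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            window = entries[i + 1 : i + 5]  # the next 4 entries
            if not any(e.get("operation") == "find" for e in window):
                score -= 5
        if entry.get("operation") == "add":
            title = entry.get("title")
            if title in seen_titles:  # assumed: duplicates are matched by title
                score -= 5
            seen_titles.add(title)
    return max(score, 0)
```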
@@ -148,24 +128,24 @@ S4 deduction: -5 (1 duplicate detected)
 ## S4: Full Lifecycle
 
 **Purpose**: Validates all 5 dimensions. Gold standard session.
-**Target score**: 100/100 (A)
+**Target score**: 100/100 (A)
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-5.
-6.
-(agent does work here)
-7.
-8.
-9.
+```bash
+1. cleo-dev session list
+2. cleo-dev help # S5: progressive disclosure
+3. cleo-dev dash # overview
+4. cleo-dev find --status pending # S2: find not list
+5. cleo-dev show T200 # S2: show for detail
+6. cleo-dev update T200 --status active # begin work
+# (agent does work here)
+7. cleo-dev complete T200 # mark done
+8. cleo-dev find --status pending # check next
+9. cleo-dev session end --note "Completed T200" # S1
 ```
 
-### Scoring Targets
+### Scoring Targets
 
 | Dim | Expected | Reason |
 |-----|----------|--------|
@@ -173,34 +153,31 @@ S4 deduction: -5 (1 duplicate detected)
 | S2 | 20/20 | find:list 100% (+15), show used (+5) |
 | S3 | 20/20 | No adds — no deductions |
 | S4 | 20/20 | No errors, no duplicates |
-| S5 | 20/20 | admin.help (+10),
+| S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
-**
-**CLI total: ~80/100 (B)** — loses S5 entirely
+**Total: 100/100 (A)**
 
 ---
 
 ## S5: Multi-Domain Analysis
 
 **Purpose**: Validates cross-domain operations and advanced S5.
-**Target score**: 100/100
+**Target score**: 100/100
 
-### Operation Sequence
+### Operation Sequence
 
-```
-1.
-2.
-3.
-4.
-5.
-6.
-7.
-
-
-
-
-10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
-11. mutate session end — S1
+```bash
+1. cleo-dev session list
+2. cleo-dev help
+3. cleo-dev find --parent T500 # S2: epic subtasks
+4. cleo-dev show T501 # S2: inspect
+5. cleo-dev session context-drift # multi-domain
+6. cleo-dev session decision-log --task T501 # decision history
+7. cleo-dev session record-decision --task T501 --decision "Use adapter pattern" --rationale "Decouples provider logic"
+8. cleo-dev update T501 --status active
+9. cleo-dev complete T501
+10. cleo-dev find --parent T500 --status pending # next subtask
+11. cleo-dev session end
 ```
 
 ### Scoring Targets
@@ -211,24 +188,18 @@ S4 deduction: -5 (1 duplicate detected)
 | S2 | 20/20 | find used exclusively, show used |
 | S3 | 20/20 | No task.add — no deductions |
 | S4 | 20/20 | No errors |
-| S5 | 20/20 | admin.help (+10),
+| S5 | 20/20 | admin.help used (+10), progressive disclosure (+10) |
 
-**
+**Total: 100/100 (A)**
 
 ---
 
 ## Scenario Quick Reference
 
-| Scenario | Primary Dims Tested |
-
-| S1 | S1, S2 | ~90 (A) |
-| S2 | S1, S3 | ~
-| S3 | S1, S3, S4 | ~
-| S4 | All 5 | 100 (A) |
-| S5 | All 5, cross-domain | 100 (A) |
-
-**Key insight**: CLI interface will consistently score 0 on S5 Progressive Disclosure because:
-1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
-2. `cleo-dev admin help` CLI call is not detected as `admin.help` MCP call (no +10)
-
-This is by design — the rubric rewards MCP-first behavior.
+| Scenario | Primary Dims Tested | Expected Score |
+|---|---|---|
+| S1 | S1, S2 | ~90 (A) |
+| S2 | S1, S3 | ~60 (C) |
+| S3 | S1, S3, S4 | ~80 (B) |
+| S4 | All 5 | 100 (A) |
+| S5 | All 5, cross-domain | 100 (A) |