@cleocode/skills 2026.3.76 → 2026.4.2

Files changed (31)
  1. package/package.json +1 -1
  2. package/skills/_shared/manifest-operations.md +1 -2
  3. package/skills/_shared/skill-chaining-patterns.md +3 -7
  4. package/skills/_shared/subagent-protocol-base.cant +113 -0
  5. package/skills/ct-cleo/SKILL.md +56 -65
  6. package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
  7. package/skills/ct-cleo/references/session-protocol.md +3 -12
  8. package/skills/ct-codebase-mapper/SKILL.md +7 -7
  9. package/skills/ct-grade/SKILL.md +12 -46
  10. package/skills/ct-grade/agents/scenario-runner.md +11 -21
  11. package/skills/ct-grade/references/ab-test-methodology.md +14 -14
  12. package/skills/ct-grade/references/domains.md +72 -74
  13. package/skills/ct-grade/references/grade-spec.md +8 -11
  14. package/skills/ct-grade/references/scenario-playbook.md +77 -106
  15. package/skills/ct-grade-v2-1/SKILL.md +30 -32
  16. package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
  17. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
  18. package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
  19. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
  20. package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
  21. package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
  22. package/skills/ct-memory/SKILL.md +16 -35
  23. package/skills/ct-orchestrator/SKILL.md +58 -68
  24. package/skills/ct-skill-validator/SKILL.md +1 -1
  25. package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
  26. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
  27. package/skills/manifest.json +1 -1
  28. package/skills/signaldock-connect/SKILL.md +132 -0
  29. package/skills/signaldock-connect/assets/agent-card.json +48 -0
  30. package/skills/signaldock-connect/references/api-endpoints.md +131 -0
  31. package/skills.json +1 -1
@@ -1,18 +1,21 @@
  # Blind A/B Testing Protocol
  
- Methodology for blind comparison of MCP vs CLI interface usage in CLEO.
+ Methodology for blind comparison of grade scenario results in CLEO.
+ 
+ > **Note**: MCP support was removed. All operations now use the CLI exclusively.
+ > This protocol compares different CLI configurations, binary versions, or parameter sets.
  
  ---
  
  ## Agent-Based Execution (Canonical)
  
- The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end via the live MCP/CLI interfaces. This avoids subprocess initialization issues and captures real token data from task notifications.
+ The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end via the CLI. This captures real token data from task notifications.
  
  ### Execution Flow
  
  1. Run `python scripts/setup_run.py` to create run structure and print the execution plan
- 2. Follow the plan: spawn scenario-runner agents in parallel (arm-A MCP, arm-B CLI)
- 3. Immediately capture `total_tokens` from each task notification `timing.json`
+ 2. Follow the plan: spawn scenario-runner agents in parallel (arm-A, arm-B with different configurations)
+ 3. Immediately capture `total_tokens` from each task notification to `timing.json`
  4. Spawn blind-comparator agent after both arms complete
  5. Run `python scripts/token_tracker.py --run-dir <dir>` to aggregate tokens
  6. Run `python scripts/generate_report.py --run-dir <dir>` for final report
@@ -22,10 +25,10 @@ The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end v
  ```python
  # After EACH agent task completes, fill timing.json immediately:
  timing = {
-     "total_tokens": task.total_tokens,  # EPHEMERAL capture now or lose it
+     "total_tokens": task.total_tokens,  # EPHEMERAL -- capture now or lose it
      "duration_ms": task.duration_ms,
      "arm": "arm-A",
-     "interface": "mcp",
+     "interface": "cli",
      "scenario": "s4",
      "run": 1,
  }
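The capture the hunk above calls ephemeral can be persisted the moment a task notification arrives. A minimal sketch, assuming a per-arm/per-scenario run directory layout (the `save_timing` helper and its path scheme are illustrative, not part of the package):

```python
import json
from pathlib import Path


def save_timing(run_dir: str, arm: str, scenario: str, run: int,
                total_tokens: int, duration_ms: int) -> Path:
    """Write timing.json immediately -- token data is ephemeral."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "arm": arm,
        "interface": "cli",
        "scenario": scenario,
        "run": run,
    }
    out = Path(run_dir) / arm / scenario / f"run-{run}" / "timing.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(timing, indent=2))
    return out
```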
@@ -35,30 +38,7 @@ Token data priority:
  1. `total_tokens` from Claude Code Agent task notification (canonical)
  2. OTel `claude_code.token.usage` (when `CLAUDE_CODE_ENABLE_TELEMETRY=1`)
  3. `output_chars / 3.5` (JSON response estimate)
- 4. `entryCount × 150` (coarse proxy from GRADES.jsonl)
- 
- ---
- 
- ## Subprocess-Based Execution (Fallback)
- 
- For automated testing without agent delegation, use `run_ab_test.py`. This invokes CLEO via subprocess and requires a migrated `tasks.db`.
- 
- ---
- 
- ## What We're Testing
- 
- | Side | Interface | Mechanism |
- |------|-----------|-----------|
- | **A** (MCP) | JSON-RPC via stdio to CLEO MCP server | `node dist/mcp/index.js` with JSON-RPC messages |
- | **B** (CLI) | Shell commands via subprocess | `cleo-dev <domain> <operation> [params]` |
- 
- Both sides call the same underlying `src/dispatch/` layer. The A/B test isolates:
- - **Output format differences** — MCP returns structured JSON envelopes; CLI may add ANSI/formatting
- - **Response size** — character counts as token proxy
- - **Latency** — wall-clock time per operation
- - **Data equivalence** — do they return the same logical data?
- 
- Blind assignment means the comparator does not know which result came from MCP vs CLI when producing the quality verdict.
+ 4. `entryCount x 150` (coarse proxy from GRADES.jsonl)
  
  ---
  
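The four-level priority list retained above is a fallback chain. A minimal sketch of how it might be applied, using the formulas from the list (the `estimate_tokens` helper name is hypothetical):

```python
def estimate_tokens(total_tokens=None, otel_tokens=None,
                    output_chars=None, entry_count=None):
    """Return the first available token figure, in priority order."""
    if total_tokens is not None:   # 1. task notification (canonical)
        return total_tokens
    if otel_tokens is not None:    # 2. OTel claude_code.token.usage
        return otel_tokens
    if output_chars is not None:   # 3. JSON response estimate
        return output_chars / 3.5
    if entry_count is not None:    # 4. coarse proxy from GRADES.jsonl
        return entry_count * 150
    return None
```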
@@ -89,11 +69,11 @@ ab-results/
  ## Blind Assignment
  
  The `run_ab_test.py` script randomly shuffles which side gets labeled "A" vs "B" for each run. The comparator agent sees only:
- - Output labeled "A" (could be MCP or CLI)
- - Output labeled "B" (could be MCP or CLI)
+ - Output labeled "A"
+ - Output labeled "B"
  - The original request prompt
  
- The `meta.json` records the true identity (`a_is_mcp: true|false`) per run. `generate_report.py` de-blinds after all comparisons are done.
+ The `meta.json` records the true identity per run. `generate_report.py` de-blinds after all comparisons are done.
  
  ---
  
@@ -104,7 +84,7 @@ The `meta.json` records the true identity (`a_is_mcp: true|false`) per run. `gen
  | `output_chars` | `len(response_json_str)` |
  | `estimated_tokens` | `output_chars / 4` (approximation) |
  | `duration_ms` | wall clock from subprocess start to end |
- | `success` | `response.success === true` (MCP) or exit code 0 (CLI) |
+ | `success` | exit code 0 |
  | `data_equivalent` | compare key fields between A and B response |
  
  ---
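The metric definitions in the hunk above translate directly to code. A sketch under the CLI-only semantics the diff introduces (helper names are illustrative; `keys` would be the scenario's expected fields):

```python
import json


def collect_metrics(stdout: str, exit_code: int, duration_ms: int) -> dict:
    """Per-operation metrics matching the table definitions."""
    return {
        "output_chars": len(stdout),
        "estimated_tokens": len(stdout) / 4,  # approximation
        "duration_ms": duration_ms,
        "success": exit_code == 0,            # CLI-only success criterion
    }


def data_equivalent(a: str, b: str, keys: list) -> bool:
    """Compare key fields between the A and B responses."""
    da, db = json.loads(a), json.loads(b)
    return all(da.get(k) == db.get(k) for k in keys)
```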
@@ -136,17 +116,17 @@ After N runs, `generate_report.py` computes:
  
  ```json
  {
-   "wins": { "mcp": 0, "cli": 0, "tie": 0 },
-   "win_rate": { "mcp": 0.0, "cli": 0.0 },
+   "wins": { "arm_a": 0, "arm_b": 0, "tie": 0 },
+   "win_rate": { "arm_a": 0.0, "arm_b": 0.0 },
    "token_delta": {
-     "mean_mcp_chars": 0,
-     "mean_cli_chars": 0,
+     "mean_a_chars": 0,
+     "mean_b_chars": 0,
      "delta_chars": 0,
      "delta_pct": "+0%"
    },
    "latency_delta": {
-     "mean_mcp_ms": 0,
-     "mean_cli_ms": 0,
+     "mean_a_ms": 0,
+     "mean_b_ms": 0,
      "delta_ms": 0
    },
    "data_equivalence_rate": 1.0,
@@ -167,48 +147,18 @@ The blind comparator evaluates each side on:
  | **Completeness** | Does the response contain all expected fields? |
  | **Structure** | Is the response well-formed JSON? Clean envelope? |
  | **Usability** | Can an agent consume this without post-processing? |
- | **Verbosity** | Lower is better same data, fewer chars = more efficient |
+ | **Verbosity** | Lower is better -- same data, fewer chars = more efficient |
  
- Rubric scores are 15 per criterion. Winner is the side with higher weighted total.
+ Rubric scores are 1-5 per criterion. Winner is the side with higher weighted total.
  
  ---
  
- ## MCP Server Invocation Details
+ ## CLI Invocation
  
- The `run_ab_test.py` script calls the CLEO MCP server via stdio JSON-RPC:
+ All operations use the CLI:
  
- ```python
- # Protocol sequence
- # 1. Send initialize
- # 2. Send tools/call (query or mutate)
- # 3. Read response lines until tool result found
- # 4. Terminate process
- 
- MCP_INIT = {
-     "jsonrpc": "2.0", "id": 0, "method": "initialize",
-     "params": {
-         "protocolVersion": "2024-11-05",
-         "capabilities": {},
-         "clientInfo": {"name": "ct-grade-ab-test", "version": "2.1.0"}
-     }
- }
- 
- MCP_CALL = {
-     "jsonrpc": "2.0", "id": 1, "method": "tools/call",
-     "params": {
-         "name": "query",  # or "mutate"
-         "arguments": {
-             "domain": "<domain>",
-             "operation": "<operation>",
-             "params": {}
-         }
-     }
- }
- ```
- 
- **CLI equivalent:**
  ```bash
- cleo-dev <domain> <operation> [args] --json
+ cleo-dev <command> [args] --json
  ```
  
  ---
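With the MCP path removed, both arms go through the same subprocess invocation. A sketch of a CLI wrapper matching the `--json` convention above (the `run_cli` helper is illustrative; only the `cleo-dev ... --json` form comes from the document):

```python
import json
import subprocess


def run_cli(argv):
    """Invoke a CLI with --json appended; parse stdout, report exit code."""
    proc = subprocess.run([*argv, "--json"], capture_output=True, text=True)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        payload = None  # non-JSON output counts as unparseable
    return payload, proc.returncode
```

Usage would look like `run_cli(["cleo-dev", "tasks", "list"])`, with `returncode == 0` as the success criterion from the metrics table.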
@@ -217,17 +167,7 @@ cleo-dev <domain> <operation> [args] --json
  
  | Outcome | Meaning | Action |
  |---------|---------|--------|
- | MCP wins consistently | MCP output is cleaner/more complete | Recommend MCP-first in agent protocols |
- | CLI wins consistently | CLI output is more complete or parseable | Investigate MCP envelope overhead |
+ | Arm A wins consistently | Configuration A output is cleaner/more complete | Investigate differences |
+ | Arm B wins consistently | Configuration B output is more complete or parseable | Investigate differences |
  | Tie | Both equivalent | Focus on latency and token cost |
- | MCP tokens > CLI tokens | MCP envelope adds overhead | Quantify and document in CLEO-GRADE-SPEC |
- | Data divergence detected | MCP and CLI returning different data | File bug — should be dispatch-level consistent |
- 
- ---
- 
- ## Parity Scenarios
- 
- The P1-P3 parity scenarios (see playbook-v2.md) run a curated set of operations specifically chosen to stress:
- - **P1**: tasks domain — high-frequency agent operations
- - **P2**: session domain — lifecycle operations agents use at start/end
- - **P3**: admin domain — help, dash, health (first calls in any session)
+ | Data divergence detected | Arms returning different data | File bug -- should be consistent |
@@ -82,16 +82,16 @@ Measures whether the agent recovers from `E_NOT_FOUND` (exit code 4) and avoids
  
  ### S5: Progressive Disclosure Use (20 pts)
  
- Measures whether the agent uses CLEO's progressive disclosure system and the MCP query gateway.
+ Measures whether the agent uses CLEO's progressive disclosure system.
  
  | Points | Condition | Evidence string |
  |--------|-----------|-----------------|
  | +10 | At least one help/skill call: `admin.help`, `tools.skill.show`, `tools.skill.list`, `tools.skill.find` | `Progressive disclosure used (Nx)` |
- | +10 | At least one MCP query gateway call (`metadata.gateway === "query"`) | `query (MCP) used Nx` |
+ | +10 | Progressive disclosure used for efficient access | `Progressive disclosure active` |
  
  **Flags:**
  - `No admin.help or skill lookup calls (load ct-cleo for guidance)`
- - `No MCP query calls (prefer query over CLI for programmatic access)`
+ - `No progressive disclosure calls (use admin.help or skill lookups)`
  
  **Scoring:** Starts at 0. Range: 0–20.
  
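The revised S5 rubric above can be sketched as a scoring function. Note the second +10 condition is left loosely specified in the diff; this sketch keys both bonuses off the named help/skill calls as an assumption, and the helper itself is hypothetical:

```python
# Help/skill operations named in the S5 table
HELP_CALLS = {"admin.help", "tools.skill.show", "tools.skill.list", "tools.skill.find"}


def score_s5(calls):
    """Score S5 (0-20) from the list of operations seen in the audit log."""
    score, flags = 0, []
    if any(c in HELP_CALLS for c in calls):
        score += 10  # at least one help/skill call
        score += 10  # progressive disclosure used for efficient access (assumed same signal)
    else:
        flags.append("No progressive disclosure calls (use admin.help or skill lookups)")
    return score, flags
```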
@@ -123,7 +123,7 @@ Grade results in v2.1 carry optional token metadata alongside the standard Grade
      "session": 600,
      "admin": 400
    },
-   "mcpQueryTokens": 2100,
+   "queryTokens": 2100,
    "cliTokens": 1100,
    "auditEntries": 47
  }
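A trivial consumer of the renamed metadata fields above, summing gateway-query and CLI tokens (the helper name and the missing-field defaults are assumptions; only `queryTokens` and `cliTokens` come from the hunk):

```python
def total_tracked_tokens(meta: dict) -> int:
    """Total tokens tracked in v2.1 grade metadata: query gateway plus CLI."""
    return meta.get("queryTokens", 0) + meta.get("cliTokens", 0)
```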
@@ -164,4 +164,4 @@ The rubric recognizes all 10 canonical domains in audit entries. Key domain-to-d
  | `nexus` | S5 (gateway tracking only) |
  | `sticky` | S5 (gateway tracking only) |
  
- All 10 domains contribute to `mcpQueryCalls` count in S5 — any MCP query gateway call regardless of domain earns the +10.
+ All 10 domains contribute to the progressive disclosure count in S5 — any help or skill lookup call regardless of domain earns the +10.