@cleocode/skills 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (171)
  1. package/dispatch-config.json +404 -0
  2. package/index.d.ts +178 -0
  3. package/index.js +405 -0
  4. package/package.json +14 -0
  5. package/profiles/core.json +7 -0
  6. package/profiles/full.json +10 -0
  7. package/profiles/minimal.json +7 -0
  8. package/profiles/recommended.json +7 -0
  9. package/provider-skills-map.json +97 -0
  10. package/skills/_shared/cleo-style-guide.md +84 -0
  11. package/skills/_shared/manifest-operations.md +810 -0
  12. package/skills/_shared/placeholders.json +433 -0
  13. package/skills/_shared/skill-chaining-patterns.md +237 -0
  14. package/skills/_shared/subagent-protocol-base.md +223 -0
  15. package/skills/_shared/task-system-integration.md +232 -0
  16. package/skills/_shared/testing-framework-config.md +110 -0
  17. package/skills/ct-cleo/SKILL.md +490 -0
  18. package/skills/ct-cleo/references/anti-patterns.md +19 -0
  19. package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
  20. package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
  21. package/skills/ct-cleo/references/session-protocol.md +162 -0
  22. package/skills/ct-codebase-mapper/SKILL.md +82 -0
  23. package/skills/ct-contribution/SKILL.md +521 -0
  24. package/skills/ct-contribution/templates/contribution-init.json +21 -0
  25. package/skills/ct-dev-workflow/SKILL.md +423 -0
  26. package/skills/ct-docs-lookup/SKILL.md +66 -0
  27. package/skills/ct-docs-review/SKILL.md +175 -0
  28. package/skills/ct-docs-write/SKILL.md +108 -0
  29. package/skills/ct-documentor/SKILL.md +231 -0
  30. package/skills/ct-epic-architect/SKILL.md +305 -0
  31. package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
  32. package/skills/ct-epic-architect/references/commands.md +201 -0
  33. package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
  34. package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
  35. package/skills/ct-epic-architect/references/output-format.md +92 -0
  36. package/skills/ct-epic-architect/references/patterns.md +284 -0
  37. package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
  38. package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
  39. package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
  40. package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
  41. package/skills/ct-grade/SKILL.md +230 -0
  42. package/skills/ct-grade/agents/analysis-reporter.md +203 -0
  43. package/skills/ct-grade/agents/blind-comparator.md +157 -0
  44. package/skills/ct-grade/agents/scenario-runner.md +134 -0
  45. package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  46. package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
  47. package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
  48. package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
  49. package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
  50. package/skills/ct-grade/eval-viewer/viewer.html +219 -0
  51. package/skills/ct-grade/evals/evals.json +94 -0
  52. package/skills/ct-grade/references/ab-test-methodology.md +150 -0
  53. package/skills/ct-grade/references/domains.md +137 -0
  54. package/skills/ct-grade/references/grade-spec.md +236 -0
  55. package/skills/ct-grade/references/scenario-playbook.md +234 -0
  56. package/skills/ct-grade/references/token-tracking.md +120 -0
  57. package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
  58. package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
  59. package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
  60. package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
  61. package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
  62. package/skills/ct-grade/scripts/generate_report.py +283 -0
  63. package/skills/ct-grade/scripts/run_ab_test.py +504 -0
  64. package/skills/ct-grade/scripts/run_all.py +287 -0
  65. package/skills/ct-grade/scripts/setup_run.py +183 -0
  66. package/skills/ct-grade/scripts/token_tracker.py +630 -0
  67. package/skills/ct-grade-v2-1/SKILL.md +237 -0
  68. package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
  69. package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
  70. package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
  71. package/skills/ct-grade-v2-1/evals/evals.json +74 -0
  72. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
  73. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  74. package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
  75. package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
  76. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
  77. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
  78. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
  79. package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
  80. package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
  81. package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
  82. package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
  83. package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
  84. package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
  85. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
  86. package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
  87. package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
  88. package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
  89. package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
  90. package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
  91. package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
  92. package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
  93. package/skills/ct-memory/SKILL.md +84 -0
  94. package/skills/ct-orchestrator/INSTALL.md +61 -0
  95. package/skills/ct-orchestrator/README.md +69 -0
  96. package/skills/ct-orchestrator/SKILL.md +380 -0
  97. package/skills/ct-orchestrator/manifest-entry.json +19 -0
  98. package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
  99. package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
  100. package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
  101. package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
  102. package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
  103. package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
  104. package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
  105. package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
  106. package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
  107. package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
  108. package/skills/ct-research-agent/SKILL.md +226 -0
  109. package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
  110. package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
  111. package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
  112. package/skills/ct-skill-creator/SKILL.md +356 -0
  113. package/skills/ct-skill-creator/agents/analyzer.md +276 -0
  114. package/skills/ct-skill-creator/agents/comparator.md +204 -0
  115. package/skills/ct-skill-creator/agents/grader.md +225 -0
  116. package/skills/ct-skill-creator/assets/eval_review.html +146 -0
  117. package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
  118. package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
  119. package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
  120. package/skills/ct-skill-creator/manifest-entry.json +17 -0
  121. package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
  122. package/skills/ct-skill-creator/references/frontmatter.md +83 -0
  123. package/skills/ct-skill-creator/references/invocation-control.md +165 -0
  124. package/skills/ct-skill-creator/references/output-patterns.md +86 -0
  125. package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
  126. package/skills/ct-skill-creator/references/schemas.md +430 -0
  127. package/skills/ct-skill-creator/references/workflows.md +28 -0
  128. package/skills/ct-skill-creator/scripts/__init__.py +1 -0
  129. package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  130. package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
  131. package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
  132. package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
  133. package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
  134. package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  135. package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
  136. package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
  137. package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
  138. package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
  139. package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
  140. package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
  141. package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
  142. package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
  143. package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
  144. package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
  145. package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
  146. package/skills/ct-skill-creator/scripts/utils.py +47 -0
  147. package/skills/ct-skill-validator/SKILL.md +178 -0
  148. package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
  149. package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
  150. package/skills/ct-skill-validator/evals/eval_set.json +14 -0
  151. package/skills/ct-skill-validator/evals/evals.json +52 -0
  152. package/skills/ct-skill-validator/manifest-entry.json +20 -0
  153. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
  154. package/skills/ct-skill-validator/references/validation-rules.md +168 -0
  155. package/skills/ct-skill-validator/scripts/__init__.py +0 -0
  156. package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
  157. package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
  158. package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
  159. package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
  160. package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
  161. package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
  162. package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
  163. package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
  164. package/skills/ct-skill-validator/scripts/validate.py +422 -0
  165. package/skills/ct-spec-writer/SKILL.md +189 -0
  166. package/skills/ct-stickynote/README.md +14 -0
  167. package/skills/ct-stickynote/SKILL.md +46 -0
  168. package/skills/ct-task-executor/SKILL.md +296 -0
  169. package/skills/ct-validator/SKILL.md +216 -0
  170. package/skills/manifest.json +469 -0
  171. package/skills.json +281 -0
@@ -0,0 +1,236 @@
# CLEO Grade Specification (Current)

**Source**: `src/core/sessions/session-grade.ts`
**Spec**: `docs/specs/CLEO-GRADE-SPEC.md`
**Status**: Active

This document reflects the **current rubric implementation** as of v2026.3.x. It is derived from the live source code, not the original spec (which may be outdated).

---

## Data Flow

```
session.start(grade: true)
  -> CLEO_SESSION_GRADE=true (env)
  -> audit middleware logs ALL ops (query + mutate)
session.end
  -> query check grade { sessionId }   # canonical registry surface
  -> query admin grade { sessionId }   # runtime compatibility alias
  -> Reads audit_log from tasks.db
  -> Applies 5-dimension rubric
  -> Appends to .cleo/metrics/GRADES.jsonl
```

Audit log entries in `tasks.db` contain:
- `domain`, `operation`, `timestamp`
- `params` (the operation parameters)
- `result.success` (boolean)
- `result.exitCode` (number; 4 = E_NOT_FOUND)
- `metadata.gateway` (`"query"` | `"mutate"`)
- `metadata.taskId` (if set)

---

## Dimension 1: Session Discipline (20 pts)

**Source**: Lines 73-105 in session-grade.ts

```
sessionListCalls = entries where domain='session' AND operation='list'
sessionEndCalls  = entries where domain='session' AND operation='end'
taskOps          = entries where domain='tasks'
```

| Points | Condition |
|--------|-----------|
| +10 | `session.list` called AND first list timestamp ≤ first tasks op timestamp |
| +10 | `session.end` called at least once |

**Flags on violation:**
- `session.list never called (check existing sessions before starting)` — no session.list found
- `session.list called after task ops (should check sessions first)` — exists but in the wrong order
- `session.end never called (always end sessions when done)` — no session.end

**Edge cases:**
- No task ops: `firstTaskTime = Infinity`, so session.list always satisfies the ordering check
- Range: 0-20

---

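The Session Discipline logic above can be sketched as a small Python function. This is an illustrative sketch, not the real implementation: the entry dicts mirror the audit-log fields described earlier, but the function name and data shapes are assumed rather than taken from `session-grade.ts`.

```python
def score_session_discipline(entries):
    """Sketch of Dimension 1: +10 for session.list before any tasks op, +10 for session.end."""
    flags = []
    list_times = [e["timestamp"] for e in entries
                  if e["domain"] == "session" and e["operation"] == "list"]
    end_called = any(e["domain"] == "session" and e["operation"] == "end"
                     for e in entries)
    task_times = [e["timestamp"] for e in entries if e["domain"] == "tasks"]
    # No task ops -> first task time is Infinity, so the ordering check always passes.
    first_task = min(task_times) if task_times else float("inf")

    score = 0
    if list_times and min(list_times) <= first_task:
        score += 10
    elif not list_times:
        flags.append("session.list never called (check existing sessions before starting)")
    else:
        flags.append("session.list called after task ops (should check sessions first)")
    if end_called:
        score += 10
    else:
        flags.append("session.end never called (always end sessions when done)")
    return score, flags
```

A compliant session (list, work, end) yields `(20, [])`; a session that starts with a tasks op and never ends yields `(0, ...)` with two flags.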
## Dimension 2: Discovery Efficiency (20 pts)

**Source**: Lines 107-144

```
findCalls = entries where domain='tasks' AND operation='find'
listCalls = entries where domain='tasks' AND operation='list'
showCalls = entries where domain='tasks' AND operation='show'
totalDiscoveryCalls = findCalls.length + listCalls.length
```

| Points | Condition |
|--------|-----------|
| +10 baseline | `totalDiscoveryCalls === 0` (no discovery needed) |
| proportional 0-15 | `round(15 * findRatio)` where `findRatio = findCalls / totalDiscoveryCalls` |
| full 15 | when `findRatio >= 0.80` |
| +5 | `tasks.show` used at least once |

**Cap**: `Math.min(20, score)` — maximum 20 points

**Flags on violation:**
- `tasks.list used Nx (prefer tasks.find for discovery)` — when findRatio < 0.80

**Note**: `tasks.list` with filters (e.g. listing a known parent's children) is acceptable but still counts toward the ratio.

---

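The Dimension 2 scoring rules above can be expressed as a short sketch (assumed names and shapes, not the real source; Python's `round` stands in for JS `Math.round`, which differs only on exact .5 values):

```python
def score_discovery_efficiency(entries):
    """Sketch of Dimension 2: reward tasks.find over tasks.list, plus tasks.show usage."""
    tasks = [e for e in entries if e["domain"] == "tasks"]
    finds = [e for e in tasks if e["operation"] == "find"]
    lists = [e for e in tasks if e["operation"] == "list"]
    shows = [e for e in tasks if e["operation"] == "show"]
    total = len(finds) + len(lists)
    if total == 0:
        score = 10                        # baseline: no discovery needed
    else:
        ratio = len(finds) / total
        score = 15 if ratio >= 0.80 else round(15 * ratio)
    if shows:
        score += 5
    return min(20, score)                 # capped at 20
```

For example, 3 finds + 1 list + 1 show gives `round(15 * 0.75) + 5 = 16`.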
## Dimension 3: Task Hygiene (20 pts)

**Source**: Lines 146-180

```
addCalls    = entries where domain='tasks' AND operation='add' AND result.success=true
existsCalls = entries where domain='tasks' AND operation='exists'
subtaskAdds = addCalls where params.parent is truthy
```

**Starts at 20, deducts penalties:**

| Deduction | Condition |
|-----------|-----------|
| -5 per add | `tasks.add` succeeded but `params.description` is empty/missing |
| -3 | subtaskAdds > 0 AND existsCalls = 0 (no parent verification) |

**Evidence (no deduction):**
- `All N tasks.add calls had descriptions` — when no description violations
- `Parent existence verified before subtask creation` — subtask adds with an exists check

**Floor**: `Math.max(0, score)` — cannot go below 0

---

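The deduction model above can be sketched as follows (hypothetical function, mirroring the documented rules):

```python
def score_task_hygiene(entries):
    """Sketch of Dimension 3: start at 20, deduct for bad adds, floor at 0."""
    adds = [e for e in entries
            if e["domain"] == "tasks" and e["operation"] == "add"
            and e["result"]["success"]]
    exists_calls = [e for e in entries
                    if e["domain"] == "tasks" and e["operation"] == "exists"]
    score = 20
    for add in adds:
        if not add["params"].get("description"):
            score -= 5                    # empty/missing description
    subtask_adds = [a for a in adds if a["params"].get("parent")]
    if subtask_adds and not exists_calls:
        score -= 3                        # parent never verified
    return max(0, score)                  # floor at 0
```

Two description-less adds, one of them a subtask with no `tasks.exists` check, score 20 - 5 - 5 - 3 = 7, matching the anti-pattern example in the scenario playbook.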
## Dimension 4: Error Protocol (20 pts)

**Source**: Lines 182-215

```
notFoundErrors = entries where result.success=false AND result.exitCode=4
```

For each E_NOT_FOUND error at index `errIdx`:
```
nextEntries = sessionEntries.slice(errIdx+1, errIdx+5)   // the next 4 entries
hasRecovery = nextEntries contains (domain='tasks' AND operation IN ['find','exists'])
```

**Duplicate detection:**
```
creates    = entries where domain='tasks' AND operation='add' AND result.success=true
titles     = creates.map(e => e.params.title.toLowerCase().trim())
duplicates = titles.length - new Set(titles).size
```

**Starts at 20, deducts penalties:**

| Deduction | Condition |
|-----------|-----------|
| -5 per error | E_NOT_FOUND not followed by `tasks.find` or `tasks.exists` within the next 4 entries |
| -5 | Any duplicate task creates detected (title collision within the session) |

**Floor**: `Math.max(0, score)`

---

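Both Dimension 4 checks — the 4-entry recovery window and title-collision duplicate detection — can be sketched together (assumed names and shapes, not the real source):

```python
def score_error_protocol(entries):
    """Sketch of Dimension 4: penalize unrecovered E_NOT_FOUND and duplicate creates."""
    score = 20
    for i, e in enumerate(entries):
        if not e["result"]["success"] and e["result"]["exitCode"] == 4:
            window = entries[i + 1:i + 5]        # the next 4 entries
            recovered = any(n["domain"] == "tasks"
                            and n["operation"] in ("find", "exists")
                            for n in window)
            if not recovered:
                score -= 5
    creates = [e for e in entries
               if e["domain"] == "tasks" and e["operation"] == "add"
               and e["result"]["success"]]
    titles = [c["params"]["title"].lower().strip() for c in creates]
    if len(titles) != len(set(titles)):
        score -= 5                               # duplicate titles in session
    return max(0, score)
```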
## Dimension 5: Progressive Disclosure Use (20 pts)

**Source**: Lines 217-249

```
helpCalls = entries where:
    (domain='admin' AND operation='help')
    OR (domain='tools' AND operation IN ['skill.show','skill.list'])
    OR (domain='skills' AND operation IN ['list','show'])

mcpQueryCalls = entries where metadata.gateway = 'query'
```

| Points | Condition |
|--------|-----------|
| +10 | `helpCalls.length > 0` |
| +10 | `mcpQueryCalls.length > 0` |

**Flags on violation:**
- `No admin.help or skill lookup calls (load ct-cleo for guidance)`
- `No MCP query calls (prefer query over CLI for programmatic access)`

**Important**: The `metadata.gateway` field equals `'query'` for MCP query operations; CLI operations do not set this field. This is how the grade distinguishes MCP from CLI usage.

---

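The two Dimension 5 bonuses can be sketched in a few lines (hypothetical names; the help-call predicate mirrors the three documented domain/operation patterns):

```python
def score_disclosure_use(entries):
    """Sketch of Dimension 5: +10 for any help/skill lookup, +10 for any MCP query call."""
    def is_help(e):
        return ((e["domain"] == "admin" and e["operation"] == "help")
                or (e["domain"] == "tools" and e["operation"] in ("skill.show", "skill.list"))
                or (e["domain"] == "skills" and e["operation"] in ("list", "show")))
    has_help = any(is_help(e) for e in entries)
    # CLI ops never set metadata.gateway, so they cannot earn this bonus.
    has_query = any(e.get("metadata", {}).get("gateway") == "query" for e in entries)
    return (10 if has_help else 0) + (10 if has_query else 0)
```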
## Grade Letter Mapping

| Grade | Threshold | Score Range |
|-------|-----------|-------------|
| A | ≥90% | 90-100 |
| B | ≥75% | 75-89 |
| C | ≥60% | 60-74 |
| D | ≥45% | 45-59 |
| F | <45% | 0-44 |

---

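The letter mapping above reduces to a simple threshold walk (a sketch; the function name is assumed):

```python
def grade_letter(score):
    """Map a 0-100 total score to a letter per the thresholds above."""
    for letter, threshold in (("A", 90), ("B", 75), ("C", 60), ("D", 45)):
        if score >= threshold:
            return letter
    return "F"
```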
## GradeResult Schema

```typescript
interface GradeResult {
  sessionId: string;
  taskId?: string;
  totalScore: number;                    // 0-100
  maxScore: number;                      // 100
  dimensions: {
    sessionDiscipline: DimensionScore;   // score, max: 20, evidence[]
    discoveryEfficiency: DimensionScore;
    taskHygiene: DimensionScore;
    errorProtocol: DimensionScore;
    disclosureUse: DimensionScore;
  };
  flags: string[];
  timestamp: string;                     // ISO 8601
  entryCount: number;
  evaluator?: 'auto' | 'manual';
}
```

---

## Edge Cases

| Case | Behavior |
|------|----------|
| No audit entries | All scores 0, flag: `No audit entries found for session` |
| No task ops | S1 list check always passes (firstTaskTime=Infinity) |
| No discovery calls | S2 awards 10 baseline points |
| No task.add calls | S3 starts at 20 with no deductions |
| No errors | S4 starts at 20 with no deductions |
| GRADES.jsonl missing | readGrades() returns [] |
| Write failure | Silently ignored (best-effort persistence) |

---

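The last two edge-case rows (missing file reads as empty, writes are best-effort) can be sketched in Python; the function names echo `readGrades()` from the table, but these are illustrative stand-ins, not the real TypeScript implementation:

```python
import json
import os

def read_grades(path=".cleo/metrics/GRADES.jsonl"):
    # Missing file -> empty grade history (matches the edge-case table)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def append_grade(result, path=".cleo/metrics/GRADES.jsonl"):
    # Best-effort persistence: write failures are silently ignored
    try:
        with open(path, "a") as f:
            f.write(json.dumps(result) + "\n")
    except OSError:
        pass
```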
## MCP vs CLI Detection in S5

The grading system detects MCP usage via `metadata.gateway === 'query'`. This means:
- **MCP interface**: all query operations set `metadata.gateway = 'query'` → S5 gets +10
- **CLI interface**: CLI operations do NOT set `metadata.gateway` → S5 loses +10
- **Mixed**: any single MCP query call is enough for the +10

This is why A/B tests between the MCP and CLI interfaces reliably show S5 differences.

---

## API Surface Update

- Canonical reads now live under `check.grade` and `check.grade.list`.
- `admin.grade` and `admin.grade.list` remain as compatibility handlers during the registry transition.
- Token telemetry should be read through `admin.token` with `action=summary|list|show` rather than inferred from the split legacy operations.
- Web clients should use `POST /api/query` and `POST /api/mutate`; default HTTP responses carry LAFS metadata in `X-Cleo-*` headers.
@@ -0,0 +1,234 @@
# Grade Scenario Playbook (Updated)

**Based on**: `docs/specs/GRADE-SCENARIO-PLAYBOOK.md` + the current session-grade.ts implementation
**Status**: Active — reflects the current rubric

Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.

Use **cleo-dev** (local dev build) or **cleo** (production) for MCP operations.
Use the MCP `query`/`mutate` gateway for MCP-interface runs and the `cleo-dev` CLI for CLI-interface runs.

---

## S1: Fresh Discovery

**Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
**Target score**: ~90/100 MCP, ~80/100 CLI (S1 full, S2 full, S5 partial — no admin.help)

### Operation Sequence (MCP)

```
1. query session list                       — S1: must be first
2. query admin dash                         — project overview
3. query tasks find { "status": "active" }  — S2: find not list
4. query tasks show { "taskId": "T<any>" }  — S2: show used
5. mutate session end                       — S1: session.end
```

### Operation Sequence (CLI)

```bash
cleo-dev session list
cleo-dev dash
cleo-dev find --status active
cleo-dev show T<any>
cleo-dev session end
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first (+10), session.end present (+10) |
| S2 | 20/20 | find used exclusively (+15), show used (+5) |
| S3 | 20/20 | No task adds (no deductions) |
| S4 | 20/20 | No errors |
| S5 (MCP) | 10/20 | query gateway used (+10), no admin.help call |
| S5 (CLI) | 0/20 | No MCP query calls, no admin.help |

**MCP total: ~90/100 (A)**
**CLI total: ~80/100 (B)**

### Anti-pattern Variant (for testing grader sensitivity)

```
query tasks find { "status": "active" }   ← task op BEFORE session.list
query session list                        ← too late for S1
(no session.end)
```
Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`

---

## S2: Task Creation Hygiene

**Purpose**: Validates S3 (Task Hygiene) and S1.
**Target score**: ~70/100 MCP, ~60/100 CLI (S1 full, S3 full, S5 partial for MCP, 0 for CLI)

### Operation Sequence (MCP)

```
1. query session list                       — S1
2. query tasks exists { "taskId": "T100" }  — S3: parent verify
3. mutate tasks add { "title": "Implement auth",
     "description": "Add JWT authentication to API endpoints",
     "parent": "T100" }                     — S3: desc + parent
4. mutate tasks add { "title": "Write tests",
     "description": "Unit tests for auth module" } — S3: desc present
5. mutate session end                       — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first, session.end present |
| S3 | 20/20 | All adds have descriptions, parent verified via exists |
| S5 (MCP) | 10/20 | query gateway used |
| S5 (CLI) | 0/20 | no MCP query, no help |

**MCP total: ~70/100 (C)**
**CLI total: ~60/100 (C)**

### Anti-pattern Variant

```
mutate tasks add { "title": "Implement auth", "parent": "T100" }   ← no desc, no exists check
mutate tasks add { "title": "Write tests" }                        ← no desc
```
Expected S3: 7 (20 - 5 - 5 - 3 = 7)

---

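The S3 arithmetic for the anti-pattern variant above can be checked directly against the rubric rules. This is a sketch — the entry dicts are hypothetical stand-ins for audit-log records, not the grader's actual data:

```python
# Two bad adds: both missing descriptions, one a subtask with no exists check
anti_pattern = [
    {"domain": "tasks", "operation": "add", "result": {"success": True},
     "params": {"title": "Implement auth", "parent": "T100"}},
    {"domain": "tasks", "operation": "add", "result": {"success": True},
     "params": {"title": "Write tests"}},
]

score = 20
# -5 per successful add with an empty/missing description
score -= sum(5 for e in anti_pattern
             if e["operation"] == "add" and not e["params"].get("description"))
# -3 when a subtask was added but tasks.exists was never called
subtask_added = any(e["params"].get("parent") for e in anti_pattern)
exists_called = any(e["operation"] == "exists" for e in anti_pattern)
if subtask_added and not exists_called:
    score -= 3

print(score)  # 7
```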
## S3: Error Recovery

**Purpose**: Validates S4 (Error Protocol).

### Operation Sequence (MCP)

```
1. query session list                       — S1
2. query tasks show { "taskId": "T99999" }  — triggers E_NOT_FOUND
3. query tasks find { "query": "T99999" }   — S4: recovery within 4 ops
4. mutate tasks add { "title": "New feature",
     "description": "Implement the feature that was not found" } — S3: desc present
5. mutate session end                       — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | Proper session lifecycle |
| S3 | 20/20 | Task created with description |
| S4 | 20/20 | E_NOT_FOUND followed by a recovery lookup within 4 entries |
| S5 (MCP) | 10/20 | query gateway used |

**MCP total: ~90/100 (A)**

### Anti-pattern: Unrecovered Error

```
query tasks show { "taskId": "T99999" }     ← E_NOT_FOUND
mutate tasks add { "title": "Something else",
     "description": "Unrelated" }           ← no recovery lookup
```
S4 deduction: -5 (no tasks.find within the next 4 entries)

### Anti-pattern: Duplicate Creates

```
mutate tasks add { "title": "New feature", "description": "First attempt" }
mutate tasks add { "title": "New feature", "description": "Second attempt" }
```
S4 deduction: -5 (1 duplicate detected)

---

## S4: Full Lifecycle

**Purpose**: Validates all 5 dimensions. The gold-standard session.
**Target score**: 100/100 (A) for MCP, ~80/100 (B) for CLI

### Operation Sequence (MCP)

```
1. query session list                                           — S1
2. query admin help                                             — S5: progressive disclosure
3. query admin dash                                             — overview
4. query tasks find { "status": "pending" }                     — S2: find not list
5. query tasks show { "taskId": "T200" }                        — S2: show for detail
6. mutate tasks update { "taskId": "T200", "status": "active" } — begin work
   (agent does work here)
7. mutate tasks complete { "taskId": "T200" }                   — mark done
8. query tasks find { "status": "pending" }                     — check next
9. mutate session end { "note": "Completed T200" }              — S1
```

### Scoring Targets (MCP)

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first (+10), session.end present (+10) |
| S2 | 20/20 | find:list 100% (+15), show used (+5) |
| S3 | 20/20 | No adds — no deductions |
| S4 | 20/20 | No errors, no duplicates |
| S5 | 20/20 | admin.help (+10), query gateway (+10) |

**MCP total: 100/100 (A)**
**CLI total: ~80/100 (B)** — loses S5 entirely

---

## S5: Multi-Domain Analysis

**Purpose**: Validates cross-domain operations and advanced S5.
**Target score**: 100/100 (MCP), ~80/100 (CLI)

### Operation Sequence (MCP)

```
 1. query session list                                  — S1
 2. query admin help                                    — S5
 3. query tasks find { "parent": "T500" }               — S2: epic subtasks
 4. query tasks show { "taskId": "T501" }               — S2: inspect
 5. query session context.drift                         — multi-domain
 6. query session decision.log { "taskId": "T501" }     — decision history
 7. mutate session record.decision { "taskId": "T501",
      "decision": "Use adapter pattern",
      "rationale": "Decouples provider logic" }         — record decision
 8. mutate tasks update { "taskId": "T501", "status": "active" }
 9. mutate tasks complete { "taskId": "T501" }
10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
11. mutate session end                                  — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first, session.end present |
| S2 | 20/20 | find used exclusively, show used |
| S3 | 20/20 | No task.add — no deductions |
| S4 | 20/20 | No errors |
| S5 | 20/20 | admin.help (+10), query gateway (+10) |

**MCP total: 100/100 (A)**

---

## Scenario Quick Reference

| Scenario | Primary Dims Tested | MCP Expected | CLI Expected |
|---|---|---|---|
| S1 | S1, S2 | ~90 (A) | ~80 (B) |
| S2 | S1, S3 | ~70 (C) | ~60 (C) |
| S3 | S1, S3, S4 | ~90 (A) | ~80 (B) |
| S4 | All 5 | 100 (A) | ~80 (B) |
| S5 | All 5, cross-domain | 100 (A) | ~80 (B) |

**Key insight**: the CLI interface consistently scores 0 on S5 (Progressive Disclosure) because:
1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
2. a `cleo-dev admin help` CLI call is not detected as an `admin.help` MCP call (no +10)

This is by design — the rubric rewards MCP-first behavior.
@@ -0,0 +1,120 @@
# Token Tracking Reference

**Skill**: ct-grade
**Version**: 1.0.0
**Status**: APPROVED

---

## Overview

Token tracking matters for three reasons in ct-grade evaluations:

1. **Cost tracking**: Each A/B run consumes real tokens. Knowing the cost per run helps budget multi-scenario evaluations.
2. **MCP vs CLI comparison**: The primary value of ct-grade is comparing MCP efficiency against CLI. Token consumption is a direct measure of interface efficiency — fewer tokens for the same score means better efficiency.
3. **Score-per-token efficiency**: A session scoring 85/100 with 2,000 tokens outperforms one scoring 90/100 with 8,000 tokens on an efficiency basis. The eval-viewer surfaces this ratio as `score_per_1k_tokens`.

**Important constraint**: Claude Code does not expose per-call token counts during agent execution. There is no API to ask "how many tokens did this operation consume" in real time. Token counts arrive only via OpenTelemetry telemetry (if configured) or must be approximated from response payload size. This is why ct-grade uses a three-layer estimation system.

---

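The efficiency ratio from point 3 above is simple to compute (a sketch; the function name mirrors the viewer's `score_per_1k_tokens` label, but the exact rounding is assumed):

```python
def score_per_1k_tokens(score, tokens):
    # Grade points earned per 1,000 tokens spent
    return round(score / (tokens / 1000), 2)

# The worked example: the cheaper session wins on efficiency.
print(score_per_1k_tokens(85, 2000))   # 42.5
print(score_per_1k_tokens(90, 8000))   # 11.25
```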
## Three Estimation Methods

Token estimates are produced by one of three methods, ordered by confidence:

| Method | Confidence | When Available | How |
|--------|-----------|----------------|-----|
| OTel telemetry | REAL | When `CLAUDE_CODE_ENABLE_TELEMETRY=1` + OTel configured | Reads `~/.cleo/metrics/otel/*.jsonl`, field `claude_code.token.usage` |
| Response chars ÷ 4 | ESTIMATED | After A/B test runs | Counts response payload characters, divides by 4 (a common industry approximation) |
| Coarse op averages | COARSE | Always | Multiplies op counts by the `OP_TOKEN_AVERAGES` lookup table |

The eval-viewer labels every token figure with its confidence level so you know how to interpret the number.

---

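The middle row of the table — characters divided by 4 — amounts to a one-liner (a sketch with an assumed function name and return shape):

```python
def estimate_tokens_from_chars(payload: str) -> dict:
    # chars / 4 is the rough industry approximation for English/JSON text
    return {"tokens": len(payload) // 4, "confidence": "ESTIMATED"}
```

For example, a 400-character JSON response estimates to roughly 100 tokens.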
+ ## OTel Setup
36
+
37
+ OTel telemetry provides the most accurate token counts (REAL confidence). It requires a one-time shell setup.
38
+
39
+ ```bash
40
+ # One-time setup — add to ~/.bashrc or ~/.zshrc
41
+ source /path/to/.cleo/setup-otel.sh
42
+
43
+ # What the script sets:
44
+ export CLAUDE_CODE_ENABLE_TELEMETRY=1
45
+ export OTEL_METRICS_EXPORTER=otlp
46
+ export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
47
+ export OTEL_EXPORTER_OTLP_ENDPOINT="file://${HOME}/.cleo/metrics/otel/"
48
+ ```
49
+
50
+ After sourcing, restart your shell or run `source ~/.bashrc` (or `~/.zshrc`).
51
+
52
+ Once configured, Claude Code writes session token metrics to `~/.cleo/metrics/otel/` as JSONL files. The ct-grade analysis scripts read these files and match them to run sessions by timestamp overlap. The relevant field is `claude_code.token.usage` which contains `input`, `output`, and `cache_read` sub-fields.
53
+
54
+ **Verification**: After a graded session, check that files exist under `~/.cleo/metrics/otel/`. If the directory is empty, telemetry is not active for your current shell session.
55
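
Aggregating the JSONL files could look roughly like this. The record shape used here (a `metric` name field plus token sub-fields) is an assumption for illustration; the real export layout depends on the OTLP file exporter:

```python
import glob
import json
import os

def parse_otel_lines(lines):
    """Sum input/output/cache_read across claude_code.token.usage records."""
    totals = {"input": 0, "output": 0, "cache_read": 0}
    for line in lines:
        record = json.loads(line)
        if record.get("metric") == "claude_code.token.usage":
            for key in totals:
                totals[key] += record.get(key, 0)
    return totals

def sum_token_usage(otel_dir):
    """Aggregate every *.jsonl file under the OTel metrics directory."""
    totals = {"input": 0, "output": 0, "cache_read": 0}
    for path in glob.glob(os.path.join(otel_dir, "*.jsonl")):
        with open(path) as fh:
            part = parse_otel_lines(fh)
            for key in totals:
                totals[key] += part[key]
    return totals
```

A real matcher would additionally filter records to the run's time window before summing, per the timestamp-overlap matching described above.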

---

## Per-Operation Token Budget Table

The coarse estimation layer uses the following lookup table (`OP_TOKEN_AVERAGES`). These averages were measured across real CLEO sessions and are used when neither OTel nor char-counting is available.

| Operation | Estimated Tokens | Notes |
|-----------|-----------------|-------|
| tasks.find | ~750 | Depends on result count |
| tasks.list | ~3,000 | Heavy — prefer tasks.find |
| tasks.show | ~600 | Single task with full details |
| tasks.exists | ~300 | Boolean + minimal data |
| tasks.tree | ~800 | Hierarchy view |
| tasks.plan | ~900 | Next task recommendations |
| session.status | ~350 | Quick status check |
| session.list | ~400 | Session list |
| session.briefing.show | ~500 | Handoff briefing |
| admin.dash | ~500 | Project overview |
| admin.help | ~800 | Full operation reference |
| admin.health | ~300 | Health check |
| admin.stats | ~600 | Statistics summary |
| memory.find | ~600 | Search results |
| memory.timeline | ~500 | Timeline entries |
| tools.skill.list | ~400 | Skill manifest |
| tools.skill.show | ~350 | Single skill details |

These figures are averages. Actual token counts vary based on the number of results returned, note field length, and payload verbosity. The coarse method is accurate within ±50% and is only used as a last resort when better data is unavailable.
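
A worked example of the coarse arithmetic, using a subset of the table above (the session mix shown is hypothetical):

```python
# Subset of the OP_TOKEN_AVERAGES table above (measured averages).
OP_TOKEN_AVERAGES = {
    "tasks.find": 750,
    "tasks.list": 3000,
    "tasks.show": 600,
    "session.status": 350,
    "admin.dash": 500,
}

def coarse_estimate(op_counts, default=500):
    """Multiply each operation count by its table average (COARSE, ±50%)."""
    return sum(OP_TOKEN_AVERAGES.get(op, default) * n for op, n in op_counts.items())

# A session with 3 finds, 1 list, and 2 status checks:
session = {"tasks.find": 3, "tasks.list": 1, "session.status": 2}
print(coarse_estimate(session))  # 3*750 + 3000 + 2*350 = 5950
```

Note how a single `tasks.list` dominates the budget, which is why the table recommends `tasks.find` instead.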

---

## Confidence Labels

Every token figure in the eval-viewer is annotated with one of three confidence labels:

| Label | Source | Accuracy |
|-------|--------|----------|
| `REAL` | OTel telemetry (`claude_code.token.usage`) | Exact — from Claude Code instrumentation |
| `ESTIMATED` | Response chars ÷ 4 | ±20% — good for JSON payloads |
| `COARSE` | Operation count × `OP_TOKEN_AVERAGES` | ±50% — fallback only |

When reading eval-viewer reports, treat REAL figures as authoritative, ESTIMATED figures as directionally accurate, and COARSE figures as rough order-of-magnitude only.

**Recommendation**: Enable OTel telemetry before running multi-scenario or multi-run evaluations. The additional setup is minimal, and the REAL confidence data significantly improves the reliability of MCP vs CLI efficiency comparisons.

---

## Chars ÷ 4 Rationale

The chars ÷ 4 approximation is applied to response payload character counts when OTel data is unavailable but operation responses were captured in `operations.jsonl`.

This approximation matches CLEO's own `src/core/metrics/token-estimation.ts` and is the same approximation used by OpenAI and Anthropic in their documentation. It is accurate within ±20% for JSON payloads.

The approximation works because English text and JSON structure average roughly 4 characters per token across typical LLM tokenizers (cl100k_base, o200k_base). JSON keys, punctuation, and whitespace are slightly more token-dense than prose, but the ±20% margin accounts for this variance.

For ct-grade specifically, both arms of an A/B test experience the same approximation error, so relative comparisons between MCP and CLI remain valid even when absolute counts are slightly off.
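
The cancellation argument can be made concrete with a toy calculation (the payload sizes below are hypothetical, not measured values):

```python
def estimate_tokens(payload: str) -> int:
    """Chars ÷ 4 approximation (ESTIMATED confidence, roughly ±20%)."""
    return len(payload) // 4

# Hypothetical response payloads for the two arms of an A/B run:
mcp_payload = "x" * 8000
cli_payload = "x" * 12000

# Even if chars/4 over- or under-counts, both arms share the same factor,
# so the MCP : CLI ratio matches the ratio of the true token counts.
ratio = estimate_tokens(mcp_payload) / estimate_tokens(cli_payload)
print(round(ratio, 3))  # 0.667
```

Any constant multiplicative bias divides out of the ratio, which is why COARSE or ESTIMATED figures still support valid relative comparisons.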

---

## References

- `src/core/metrics/token-estimation.ts` — CLEO's token estimation implementation
- `docs/specs/CLEO-METRICS-VALIDATION-SYSTEM-SPEC.md` — Metrics system specification
- `.cleo/setup-otel.sh` — OTel environment setup script
- `packages/ct-skills/skills/ct-grade/scripts/token_tracker.py` — Token aggregation script
- `packages/ct-skills/skills/ct-grade/scripts/generate_report.py` — Report generator (uses confidence labels)