@cleocode/skills 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (171)
  1. package/dispatch-config.json +404 -0
  2. package/index.d.ts +178 -0
  3. package/index.js +405 -0
  4. package/package.json +14 -0
  5. package/profiles/core.json +7 -0
  6. package/profiles/full.json +10 -0
  7. package/profiles/minimal.json +7 -0
  8. package/profiles/recommended.json +7 -0
  9. package/provider-skills-map.json +97 -0
  10. package/skills/_shared/cleo-style-guide.md +84 -0
  11. package/skills/_shared/manifest-operations.md +810 -0
  12. package/skills/_shared/placeholders.json +433 -0
  13. package/skills/_shared/skill-chaining-patterns.md +237 -0
  14. package/skills/_shared/subagent-protocol-base.md +223 -0
  15. package/skills/_shared/task-system-integration.md +232 -0
  16. package/skills/_shared/testing-framework-config.md +110 -0
  17. package/skills/ct-cleo/SKILL.md +490 -0
  18. package/skills/ct-cleo/references/anti-patterns.md +19 -0
  19. package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
  20. package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
  21. package/skills/ct-cleo/references/session-protocol.md +162 -0
  22. package/skills/ct-codebase-mapper/SKILL.md +82 -0
  23. package/skills/ct-contribution/SKILL.md +521 -0
  24. package/skills/ct-contribution/templates/contribution-init.json +21 -0
  25. package/skills/ct-dev-workflow/SKILL.md +423 -0
  26. package/skills/ct-docs-lookup/SKILL.md +66 -0
  27. package/skills/ct-docs-review/SKILL.md +175 -0
  28. package/skills/ct-docs-write/SKILL.md +108 -0
  29. package/skills/ct-documentor/SKILL.md +231 -0
  30. package/skills/ct-epic-architect/SKILL.md +305 -0
  31. package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
  32. package/skills/ct-epic-architect/references/commands.md +201 -0
  33. package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
  34. package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
  35. package/skills/ct-epic-architect/references/output-format.md +92 -0
  36. package/skills/ct-epic-architect/references/patterns.md +284 -0
  37. package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
  38. package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
  39. package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
  40. package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
  41. package/skills/ct-grade/SKILL.md +230 -0
  42. package/skills/ct-grade/agents/analysis-reporter.md +203 -0
  43. package/skills/ct-grade/agents/blind-comparator.md +157 -0
  44. package/skills/ct-grade/agents/scenario-runner.md +134 -0
  45. package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  46. package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
  47. package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
  48. package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
  49. package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
  50. package/skills/ct-grade/eval-viewer/viewer.html +219 -0
  51. package/skills/ct-grade/evals/evals.json +94 -0
  52. package/skills/ct-grade/references/ab-test-methodology.md +150 -0
  53. package/skills/ct-grade/references/domains.md +137 -0
  54. package/skills/ct-grade/references/grade-spec.md +236 -0
  55. package/skills/ct-grade/references/scenario-playbook.md +234 -0
  56. package/skills/ct-grade/references/token-tracking.md +120 -0
  57. package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
  58. package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
  59. package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
  60. package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
  61. package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
  62. package/skills/ct-grade/scripts/generate_report.py +283 -0
  63. package/skills/ct-grade/scripts/run_ab_test.py +504 -0
  64. package/skills/ct-grade/scripts/run_all.py +287 -0
  65. package/skills/ct-grade/scripts/setup_run.py +183 -0
  66. package/skills/ct-grade/scripts/token_tracker.py +630 -0
  67. package/skills/ct-grade-v2-1/SKILL.md +237 -0
  68. package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
  69. package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
  70. package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
  71. package/skills/ct-grade-v2-1/evals/evals.json +74 -0
  72. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
  73. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  74. package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
  75. package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
  76. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
  77. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
  78. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
  79. package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
  80. package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
  81. package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
  82. package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
  83. package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
  84. package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
  85. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
  86. package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
  87. package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
  88. package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
  89. package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
  90. package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
  91. package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
  92. package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
  93. package/skills/ct-memory/SKILL.md +84 -0
  94. package/skills/ct-orchestrator/INSTALL.md +61 -0
  95. package/skills/ct-orchestrator/README.md +69 -0
  96. package/skills/ct-orchestrator/SKILL.md +380 -0
  97. package/skills/ct-orchestrator/manifest-entry.json +19 -0
  98. package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
  99. package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
  100. package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
  101. package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
  102. package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
  103. package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
  104. package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
  105. package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
  106. package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
  107. package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
  108. package/skills/ct-research-agent/SKILL.md +226 -0
  109. package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
  110. package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
  111. package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
  112. package/skills/ct-skill-creator/SKILL.md +356 -0
  113. package/skills/ct-skill-creator/agents/analyzer.md +276 -0
  114. package/skills/ct-skill-creator/agents/comparator.md +204 -0
  115. package/skills/ct-skill-creator/agents/grader.md +225 -0
  116. package/skills/ct-skill-creator/assets/eval_review.html +146 -0
  117. package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
  118. package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
  119. package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
  120. package/skills/ct-skill-creator/manifest-entry.json +17 -0
  121. package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
  122. package/skills/ct-skill-creator/references/frontmatter.md +83 -0
  123. package/skills/ct-skill-creator/references/invocation-control.md +165 -0
  124. package/skills/ct-skill-creator/references/output-patterns.md +86 -0
  125. package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
  126. package/skills/ct-skill-creator/references/schemas.md +430 -0
  127. package/skills/ct-skill-creator/references/workflows.md +28 -0
  128. package/skills/ct-skill-creator/scripts/__init__.py +1 -0
  129. package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  130. package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
  131. package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
  132. package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
  133. package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
  134. package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  135. package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
  136. package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
  137. package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
  138. package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
  139. package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
  140. package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
  141. package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
  142. package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
  143. package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
  144. package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
  145. package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
  146. package/skills/ct-skill-creator/scripts/utils.py +47 -0
  147. package/skills/ct-skill-validator/SKILL.md +178 -0
  148. package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
  149. package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
  150. package/skills/ct-skill-validator/evals/eval_set.json +14 -0
  151. package/skills/ct-skill-validator/evals/evals.json +52 -0
  152. package/skills/ct-skill-validator/manifest-entry.json +20 -0
  153. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
  154. package/skills/ct-skill-validator/references/validation-rules.md +168 -0
  155. package/skills/ct-skill-validator/scripts/__init__.py +0 -0
  156. package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
  157. package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
  158. package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
  159. package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
  160. package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
  161. package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
  162. package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
  163. package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
  164. package/skills/ct-skill-validator/scripts/validate.py +422 -0
  165. package/skills/ct-spec-writer/SKILL.md +189 -0
  166. package/skills/ct-stickynote/README.md +14 -0
  167. package/skills/ct-stickynote/SKILL.md +46 -0
  168. package/skills/ct-task-executor/SKILL.md +296 -0
  169. package/skills/ct-validator/SKILL.md +216 -0
  170. package/skills/manifest.json +469 -0
  171. package/skills.json +281 -0
package/skills/ct-grade-v2-1/references/playbook-v2.md
@@ -0,0 +1,393 @@
# Grade Scenario Playbook v2

Parameterized test scenarios for CLEO grade system validation.
Updated for CLEO v2026.3+ operation names and the 10-domain registry.

---

## Parameterization

All scenarios accept these parameters via run_scenario.py:

| Param | Default | Description |
|-------|---------|-------------|
| `--cleo` | `cleo-dev` | CLEO binary to use |
| `--scope` | `global` | Session scope |
| `--parent-task` | none | Parent task ID for subtask tests |
| `--output-dir` | `./grade-results` | Where to save results |
| `--runs` | 1 | Number of runs (for statistical averaging) |
| `--seed-task` | none | Pre-existing task ID to work against |

---

## S1: Session Discipline

**Rubric target:** S1 Session Discipline 20/20

**Operations (in order):**

```
1. query session list
2. query admin dash
3. query tasks find { "status": "active" }
4. query tasks show { "taskId": "<seed-task>" }
5. mutate session end
```

**CLI equivalents:**
```bash
cleo-dev session list
cleo-dev admin dash
cleo-dev tasks find --status active
cleo-dev tasks show <task-id>
cleo-dev session end
```

**Pass criteria:**
- S1 = 20 (session.list before task ops AND session.end present)
- S2 >= 15 (only find used, no list)
- Flags: zero

**Anti-pattern (failing S1 = 0):**
```
# tasks.find BEFORE session.list
query tasks find { "status": "active" }
query session list -- too late
# (no session.end)
```

---

## S2: Task Hygiene

**Rubric target:** S3 Task Hygiene 20/20

**Prerequisites:** `--parent-task` set to an existing task ID.

**Operations:**
```
1. query session list
2. query tasks exists { "taskId": "<parent-task>" }
3. mutate tasks add { "title": "Impl auth", "description": "Add JWT auth to API endpoints", "parent": "<parent-task>" }
4. mutate tasks add { "title": "Write tests", "description": "Unit tests for auth module" }
5. mutate session end
```

**Pass criteria:**
- S3 = 20 (all adds have descriptions, parent verified via exists)
- S1 = 20
- Flags: zero

**Anti-pattern (S3 = 7):**
```
# No description, no exists check
mutate tasks add { "title": "Impl auth", "parent": "<id>" }
mutate tasks add { "title": "Write tests" }
```
Expected deductions: -5 (no description on task 1), -5 (no description on task 2), -3 (no exists check), so 20 - 13 = 7/20.
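The deduction arithmetic above can be sketched as a small helper. This is a hypothetical illustration of the S3 scheme described here, not code from the grade scripts:

```python
def s3_score(tasks, parent_verified):
    """Sketch of the S3 Task Hygiene deduction: start at 20, subtract
    5 per task added without a description and 3 if the parent was
    never verified via tasks.exists. Floored at 0."""
    score = 20
    score -= 5 * sum(1 for t in tasks if not t.get("description"))
    if not parent_verified:
        score -= 3
    return max(score, 0)

# The anti-pattern above: two tasks without descriptions, no exists check
print(s3_score([{"title": "Impl auth"}, {"title": "Write tests"}], False))  # 7
```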
---

## S3: Error Recovery

**Rubric target:** S4 Error Protocol 20/20

**Prerequisites:** `T99999` does NOT exist.

**Operations:**
```
1. query session list
2. query tasks show { "taskId": "T99999" } -- triggers E_NOT_FOUND (exit code 4)
3. query tasks find { "query": "T99999" } -- recovery lookup (must be within 4 ops)
4. mutate tasks add { "title": "New feature", "description": "Feature not found, creating fresh" }
5. mutate session end
```

**Pass criteria:**
- S4 = 20 (E_NOT_FOUND followed by recovery; no duplicates)
- Evidence: `E_NOT_FOUND followed by recovery lookup`
- Flags: zero

**Anti-pattern (unrecovered, S4 = 15):**
```
query tasks show { "taskId": "T99999" } -- E_NOT_FOUND
mutate tasks add { ... } -- NO recovery lookup
```

**Anti-pattern (duplicates, S4 = 15):**
```
mutate tasks add { "title": "Feature X", "description": "First attempt" }
mutate tasks add { "title": "Feature X", "description": "Second attempt" } -- duplicate!
```
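The "recovery within 4 ops" rule reduces to a window scan over the ordered operation log. A minimal sketch, assuming a simple list-of-dicts log shape (the real check lives in the grade scripts):

```python
def recovered_in_window(ops, window=4):
    """Return True if every op that errored with E_NOT_FOUND is followed
    by a recovery lookup (tasks.find or tasks.exists) within `window` ops."""
    for i, op in enumerate(ops):
        if op.get("error") == "E_NOT_FOUND":
            follow = ops[i + 1 : i + 1 + window]
            if not any(f["name"] in ("tasks.find", "tasks.exists") for f in follow):
                return False
    return True

ops = [
    {"name": "session.list"},
    {"name": "tasks.show", "error": "E_NOT_FOUND"},
    {"name": "tasks.find"},  # recovery lookup, inside the window
    {"name": "tasks.add"},
    {"name": "session.end"},
]
print(recovered_in_window(ops))  # True
```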
---

## S4: Full Lifecycle

**Rubric target:** All 5 dimensions 20/20 (total = 100)

**Prerequisites:** Known task `--seed-task` in pending status.

**Operations (in order):**
```
1. query session list
2. query admin help
3. query admin dash
4. query tasks find { "status": "pending" }
5. query tasks show { "taskId": "<seed-task>" }
6. mutate tasks update { "taskId": "<seed-task>", "status": "active" }
7. [agent performs work]
8. mutate tasks complete { "taskId": "<seed-task>" }
9. query tasks find { "status": "pending" }
10. mutate session end { "note": "Completed <seed-task>" }
```

**Pass criteria:**
- Total = 100, Grade = A
- Zero flags
- Entry count >= 10
- All 5 dimensions at 20/20

---

## S5: Multi-Domain Analysis

**Rubric target:** All 5 dimensions 20/20

**Prerequisites:** `--scope "epic:<parent-task>"` and the epic has subtasks.

**Operations:**
```
1. query session list
2. query admin help
3. query tasks find { "parent": "<parent-task>" }
4. query tasks show { "taskId": "<subtask-id>" }
5. query session context.drift
6. query session decision.log { "taskId": "<subtask-id>" }
7. mutate session record.decision { "taskId": "<subtask-id>", "decision": "Use adapter pattern", "rationale": "Decouples provider logic" }
8. mutate tasks update { "taskId": "<subtask-id>", "status": "active" }
9. mutate tasks complete { "taskId": "<subtask-id>" }
10. query tasks find { "parent": "<parent-task>", "status": "pending" }
11. mutate session end
```

**Pass criteria:**
- Total = 100, Grade = A
- Evidence of multi-domain ops (session, tasks, admin)
- Decision recorded

**Partial variation (S5 = 10 instead of 20):**
Skip step 2 (`admin.help`). This earns the MCP gateway +10 but not the help/skill +10.

---

## P1: MCP vs CLI Parity — tasks domain

**Purpose:** Verify that MCP and CLI return equivalent data for key tasks operations.

**Test matrix:**

| Operation | MCP call | CLI equivalent |
|-----------|----------|----------------|
| `tasks.find` | `query { domain: "tasks", operation: "find", params: { status: "active" } }` | `cleo-dev tasks find --status active --json` |
| `tasks.show` | `query { domain: "tasks", operation: "show", params: { taskId: "<id>" } }` | `cleo-dev tasks show <id> --json` |
| `tasks.list` | `query { domain: "tasks", operation: "list", params: {} }` | `cleo-dev tasks list --json` |
| `tasks.tree` | `query { domain: "tasks", operation: "tree", params: {} }` | `cleo-dev tasks tree --json` |
| `tasks.plan` | `query { domain: "tasks", operation: "plan", params: {} }` | `cleo-dev tasks plan --json` |

**Compare:**
- Data equivalence (same task IDs, statuses, counts)
- Output size (chars → token proxy)
- Response time (ms)
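A parity check along these axes can be sketched as follows. This is a hypothetical helper; the `tasks` array and `id` field names are assumed, not taken from the CLEO schema:

```python
import json

def compare_parity(mcp_json: str, cli_json: str):
    """Compare MCP and CLI outputs for the same tasks operation:
    data equivalence on task IDs, plus raw output size as a token proxy."""
    mcp, cli = json.loads(mcp_json), json.loads(cli_json)
    ids_equal = {t["id"] for t in mcp["tasks"]} == {t["id"] for t in cli["tasks"]}
    return {
        "ids_equal": ids_equal,      # same task IDs regardless of order
        "mcp_chars": len(mcp_json),  # output size, chars → token proxy
        "cli_chars": len(cli_json),
    }

mcp = '{"tasks": [{"id": "T1", "status": "active"}, {"id": "T2", "status": "active"}]}'
cli = '{"tasks": [{"id": "T2", "status": "active"}, {"id": "T1", "status": "active"}]}'
print(compare_parity(mcp, cli)["ids_equal"])  # True (comparison is order-insensitive)
```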
---

## P2: MCP vs CLI Parity — session domain

| Operation | MCP call | CLI equivalent |
|-----------|----------|----------------|
| `session.status` | `query { domain: "session", operation: "status" }` | `cleo-dev session status --json` |
| `session.list` | `query { domain: "session", operation: "list" }` | `cleo-dev session list --json` |
| `session.briefing.show` | `query { domain: "session", operation: "briefing.show" }` | `cleo-dev session briefing --json` |
| `session.handoff.show` | `query { domain: "session", operation: "handoff.show" }` | `cleo-dev session handoff --json` |

---

## P3: MCP vs CLI Parity — admin domain

| Operation | MCP call | CLI equivalent |
|-----------|----------|----------------|
| `admin.dash` | `query { domain: "admin", operation: "dash" }` | `cleo-dev admin dash --json` |
| `admin.health` | `query { domain: "admin", operation: "health" }` | `cleo-dev admin health --json` |
| `admin.help` | `query { domain: "admin", operation: "help" }` | `cleo-dev admin help --json` |
| `admin.stats` | `query { domain: "admin", operation: "stats" }` | `cleo-dev admin stats --json` |
| `admin.doctor` | `query { domain: "admin", operation: "doctor" }` | `cleo-dev admin doctor --json` |

---

## S6: Memory Observe & Recall

**Rubric target:** S2 Task Efficiency 15+, S5 MCP Gateway 15+ (memory ops are MCP-only)

**Operations (in order):**
```
1. mutate session start { "grade": true, "name": "grade-s6-memory-observe", "scope": "global" }
2. query session list
3. mutate memory observe { "text": "tasks.find is faster than tasks.list for large datasets", "title": "Performance finding" }
4. query memory find { "query": "tasks.find faster" }
5. query memory timeline { "anchor": "<returned-id>", "depthBefore": 2, "depthAfter": 2 }
6. query memory fetch { "ids": ["<id>"] }
7. mutate session end
8. query admin grade { "sessionId": "<saved-id>" }
```

**Pass criteria:**
- S5 = 15+ (memory ops use the MCP gateway)
- S2 = 15+ (find used for retrieval, not a broad list)
- Flags: zero

---

## S7: Decision Continuity

**Rubric target:** S1 Session Discipline 20, S5 MCP Gateway 15+

**Operations (in order):**
```
1. mutate session start { "grade": true, "name": "grade-s7-decision", "scope": "global" }
2. query session list
3. mutate memory decision.store { "decision": "Use adapter pattern for CLI/MCP abstraction", "rationale": "Decouples interface from business logic", "confidence": "high" }
4. query memory decision.find { "query": "adapter pattern" }
5. query memory find { "query": "adapter pattern" }
6. query memory stats
7. mutate session end
8. query admin grade { "sessionId": "<saved-id>" }
```

**Pass criteria:**
- S1 = 20 (session.list before ops)
- S5 = 15+ (all ops via MCP)
- Flags: zero

---

## S8: Pattern & Learning Storage

**Rubric target:** S2 Task Efficiency 15+, S5 MCP Gateway 15+

**Operations (in order):**
```
1. mutate session start { "grade": true, "name": "grade-s8-patterns", "scope": "global" }
2. query session list
3. mutate memory pattern.store { "pattern": "Call session.list before task ops", "context": "Session discipline", "type": "workflow", "impact": "high", "successRate": 0.95 }
4. mutate memory learning.store { "insight": "CLI has no tasks.find --parent flag", "source": "S5 test", "confidence": 0.9, "actionable": true, "application": "Use MCP for parent-filtered queries" }
5. query memory pattern.find { "type": "workflow", "impact": "high" }
6. query memory learning.find { "minConfidence": 0.8, "actionableOnly": true }
7. mutate session end
8. query admin grade { "sessionId": "<saved-id>" }
```

**Pass criteria:**
- S2 = 15+ (pattern.find/learning.find used, not a broad list)
- S5 = 15+ (all ops via MCP)
- Flags: zero

---

## S9: NEXUS Cross-Project Ops

**Rubric target:** S5 MCP Gateway 20 (nexus ops are MCP-only)

**Operations (in order):**
```
1. mutate session start { "grade": true, "name": "grade-s9-nexus", "scope": "global" }
2. query session list
3. query nexus status
4. query nexus list
5. query nexus show { "projectId": "<first-project-id>" }
6. query admin dash
7. mutate session end
8. query admin grade { "sessionId": "<saved-id>" }
```

**CLI equivalents:**
```bash
cleo-dev nexus status
cleo-dev nexus list
cleo-dev nexus show <project-id>
cleo-dev admin dash
```

**Pass criteria:**
- S5 = 20 (nexus ops audit-logged as MCP)
- S1 = 20 (session.list first)
- Note: If nexus list returns empty, skip show and note "no projects registered".

---

## S10: Full System Throughput (8 domains)

**Rubric target:** S2 Task Efficiency 15+, S5 MCP Gateway 15+

**Operations (in order):**
```
1. mutate session start { "grade": true, "name": "grade-s10-throughput", "scope": "global" }
2. query session list (session domain)
3. query admin help (admin domain)
4. query tasks find { "status": "active" } (tasks domain)
5. query memory find { "query": "decisions" } (memory domain)
6. query nexus status (nexus domain)
7. query pipeline stage.status { "epicId": "<any-epic-id>" } (pipeline domain)
8. query check health (check domain)
9. query tools skill.list (tools domain)
10. query tasks show { "taskId": "<from-step-4>" }
11. mutate memory observe { "text": "S10 throughput test complete", "title": "Throughput" }
12. mutate session end
13. query admin grade { "sessionId": "<saved-id>" }
```

**Pass criteria:**
- 8 distinct domains hit in `audit_log`
- S2 = 15+ (tasks.find used, not tasks.list)
- S5 = 15+ (all 8 domain ops via MCP)
- Flags: zero
- Note: Step 7 pipeline.stage.status may return E_NOT_FOUND if no epicId exists; record the attempt, since it still logs an audit entry.
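The "8 distinct domains" criterion reduces to counting unique domain prefixes in the audit log. A sketch, assuming entries are dotted `domain.operation` strings:

```python
def distinct_domains(audit_entries):
    """Collect unique domains from audit entries of the form 'domain.operation'."""
    return {e.split(".", 1)[0] for e in audit_entries}

# The S10 operation sequence above, minus the session start/grade bookkeeping
entries = [
    "session.list", "admin.help", "tasks.find", "memory.find",
    "nexus.status", "pipeline.stage.status", "check.health",
    "tools.skill.list", "tasks.show", "memory.observe", "session.end",
]
print(len(distinct_domains(entries)))  # 8
```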
---

## Scoring Quick Reference

| Grade | Threshold | Typical profile |
|-------|-----------|-----------------|
| A | >= 90% | All dimensions near max, zero or minimal flags |
| B | >= 75% | Minor violations in one or two dimensions |
| C | >= 60% | Several protocol gaps |
| D | >= 45% | Multiple anti-patterns |
| F | < 45% | Severe protocol violations |
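The threshold table maps directly to a lookup. A minimal sketch of that mapping (hypothetical helper, not from the grade scripts):

```python
def letter_grade(score_pct: float) -> str:
    """Map a percentage score to a letter grade per the table above."""
    for grade, floor in (("A", 90), ("B", 75), ("C", 60), ("D", 45)):
        if score_pct >= floor:
            return grade
    return "F"

print(letter_grade(100), letter_grade(75), letter_grade(44))  # A B F
```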
---

## Running Scenarios

```bash
# Single scenario
python scripts/run_scenario.py --scenario S3 --cleo cleo-dev

# Full suite
python scripts/run_scenario.py --scenario full --cleo cleo-dev --output-dir ./results

# With seed task
python scripts/run_scenario.py --scenario S4 --seed-task T200 --cleo cleo-dev

# Multiple runs for averaging
python scripts/run_scenario.py --scenario S1 --runs 5 --output-dir ./s1-stats
```

### Via MCP

```
mutate session start { "scope": "global", "name": "s4-test", "grade": true }
# ... execute operations ...
mutate session end
query admin grade { "sessionId": "<session-id>" }
```
package/skills/ct-grade-v2-1/references/token-tracking.md
@@ -0,0 +1,202 @@
# Token Tracking Methodology

How to measure and report token usage for CLEO grade sessions and A/B tests.

---

## Measurement Methods (priority order)

### Method 1: Claude Code Agent task notification (canonical)

When running scenarios via Agent tasks, `total_tokens` is available in the task completion notification. This is the actual API token count — the most accurate measure available.

```python
# Capture immediately when the agent task completes:
timing = {
    "total_tokens": task.total_tokens,  # actual API count
    "duration_ms": task.duration_ms,
}
# Write to timing.json — this data is EPHEMERAL and cannot be recovered later
```

**Setup:** No configuration needed. Works whenever Claude Code spawns an Agent task.

### Method 2: OTel `claude_code.token.usage`

When Claude Code is configured with OpenTelemetry, actual token counts are available at `~/.cleo/metrics/otel/`.

```bash
# Enable (add to shell profile):
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
export OTEL_EXPORTER_OTLP_ENDPOINT="file://${HOME}/.cleo/metrics/otel/"
```

Fields of interest from the `claude_code.token.usage` metric:
- `input_tokens` — tokens consumed by the prompt
- `output_tokens` — tokens generated in the response
- `cache_read_input_tokens` — tokens served from cache
- `session_id` — links to the CLEO session

### Method 3: output_chars / 3.5 (JSON estimate)

When neither the task notification nor OTel is available, estimate from response character counts:

```
estimated_tokens ≈ output_chars / 3.5   (JSON responses — denser than prose)
estimated_tokens ≈ output_chars / 4     (mixed content)
```

**Accuracy:** ±15-20% for typical JSON responses. Consistent enough for relative comparisons.
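Wrapped as a helper, the estimate looks like this (a hypothetical convenience function; the divisors come from the formulas above):

```python
def estimate_tokens(output_chars: int, content: str = "json") -> int:
    """Estimate token count from response size: chars/3.5 for dense JSON,
    chars/4 for mixed content. Accuracy roughly ±15-20%."""
    divisor = 3.5 if content == "json" else 4.0
    return round(output_chars / divisor)

print(estimate_tokens(700))           # 200
print(estimate_tokens(800, "mixed"))  # 200
```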
### Method 4: Audit entry count (coarse proxy)

Each audit entry represents one operation invocation. As a very rough proxy:
- One MCP `query` call ≈ 300–800 tokens total (request + response)
- One MCP `mutate` call ≈ 400–1200 tokens total
- One CLI call ≈ 200–600 tokens (less envelope overhead)

`entryCount × 150` gives a session-level estimate. Accuracy ±50%.

---

## Token Fields in Results

### GRADES.jsonl token metadata

The v2.1 grade scripts append `_tokenMeta` to each grade result:

```json
{
  "sessionId": "session-abc123",
  "totalScore": 85,
  "_tokenMeta": {
    "estimationMethod": "otel",
    "totalEstimatedTokens": 4200,
    "inputTokens": 3100,
    "outputTokens": 1100,
    "cacheReadTokens": 800,
    "perDomain": {
      "tasks": 1800,
      "session": 600,
      "admin": 400,
      "memory": 0,
      "check": 0,
      "pipeline": 0,
      "orchestrate": 0,
      "tools": 200,
      "nexus": 0,
      "sticky": 0
    },
    "perGateway": {
      "query": 2100,
      "cli": 1100,
      "untracked": 1000
    },
    "auditEntries": 47,
    "avgTokensPerEntry": 89
  }
}
```
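Reading `_tokenMeta` back out of `GRADES.jsonl` is a line-by-line JSON parse. A sketch assuming the record shape shown above:

```python
import json

def token_meta_for(grades_jsonl: str, session_id: str):
    """Scan GRADES.jsonl content for a session and return its _tokenMeta."""
    for line in grades_jsonl.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("sessionId") == session_id:
            return rec.get("_tokenMeta")
    return None

line = json.dumps({"sessionId": "session-abc123", "totalScore": 85,
                   "_tokenMeta": {"auditEntries": 47, "totalEstimatedTokens": 4200}})
meta = token_meta_for(line, "session-abc123")
print(meta["totalEstimatedTokens"] // meta["auditEntries"])  # 89, the avgTokensPerEntry above
```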
### A/B test token fields

In `ab-result.json`:
```json
{
  "operation": "tasks.find",
  "runs": [
    {
      "run": 1,
      "mcp": {
        "output_chars": 1240,
        "estimated_tokens": 310,
        "duration_ms": 145
      },
      "cli": {
        "output_chars": 980,
        "estimated_tokens": 245,
        "duration_ms": 88
      },
      "token_delta": "+65",
      "token_delta_pct": "+26.5%"
    }
  ]
}
```
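`token_delta` and `token_delta_pct` follow from the two estimates. A sketch reproducing the run above:

```python
def token_delta(mcp_tokens: int, cli_tokens: int):
    """MCP overhead vs CLI: absolute delta, and delta as a percentage of the CLI cost."""
    delta = mcp_tokens - cli_tokens
    pct = delta / cli_tokens * 100
    return f"{delta:+d}", f"{pct:+.1f}%"

print(token_delta(310, 245))  # ('+65', '+26.5%')
```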
---

## Per-Domain Token Estimation

Typical token ranges per operation type (output only, output_chars / 4):

| Domain | Operation | Typical output_chars | Est. tokens |
|--------|-----------|---------------------|-------------|
| tasks | `tasks.find` (10 results) | 2000–4000 | 500–1000 |
| tasks | `tasks.show` (single) | 800–1500 | 200–375 |
| tasks | `tasks.list` (full) | 5000–20000+ | 1250–5000+ |
| session | `session.list` | 1000–3000 | 250–750 |
| session | `session.status` | 400–800 | 100–200 |
| admin | `admin.dash` | 1200–2500 | 300–625 |
| admin | `admin.help` | 2000–5000 | 500–1250 |
| memory | `memory.find` | 1500–4000 | 375–1000 |

**Key insight:** `tasks.list` is 4–10x more expensive than `tasks.find` for the same result set. This is why S2 Task Efficiency penalizes list-heavy agents.

---

## Using token_tracker.py

```bash
# Estimate tokens for a specific session from GRADES.jsonl
python scripts/token_tracker.py \
  --session-id "session-abc123" \
  --grades-file .cleo/metrics/GRADES.jsonl

# Use OTEL data if available
python scripts/token_tracker.py \
  --session-id "session-abc123" \
  --otel-dir ~/.cleo/metrics/otel \
  --grades-file .cleo/metrics/GRADES.jsonl

# Aggregate breakdown across all sessions
python scripts/token_tracker.py \
  --grades-file .cleo/metrics/GRADES.jsonl \
  --breakdown-by domain \
  --output domain-token-report.json

# Compare two grade sessions
python scripts/token_tracker.py \
  --compare session-abc123 session-def456 \
  --grades-file .cleo/metrics/GRADES.jsonl
```

---

## Token Efficiency Score

The `generate_report.py` script computes a `tokenEfficiencyScore` for each session:

```
tokenEfficiencyScore = (entriesCompleted / estimatedTokens) * 1000
```

Higher means more work per token. Use it to compare:
- MCP-heavy sessions vs CLI-heavy sessions
- Pre/post skill improvements
- Different agent configurations
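The formula in code (a sketch; `generate_report.py`'s actual implementation may differ in rounding or edge-case handling):

```python
def token_efficiency_score(entries_completed: int, estimated_tokens: int) -> float:
    """Audit entries completed per 1000 estimated tokens."""
    return entries_completed / estimated_tokens * 1000

# e.g. the GRADES.jsonl example above: 47 entries at ~4200 tokens
print(round(token_efficiency_score(47, 4200), 2))  # 11.19
```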
---

## OTEL File Format

OTEL files written by Claude Code are JSONL (one metric per line):

```json
{"name": "claude_code.token.usage", "value": 4200, "attributes": {"session_id": "...", "model": "claude-sonnet-4-6"}, "timestamp": "2026-03-07T..."}
{"name": "claude_code.api_request", "value": 1, "attributes": {"input_tokens": 3100, "output_tokens": 1100, "cache_read_input_tokens": 800}, "timestamp": "..."}
```

`token_tracker.py` reads these files and matches `session_id` to the CLEO session IDs stored in `GRADES.jsonl`.
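That matching step can be sketched as a filter-and-sum over the JSONL lines, assuming only the metric shape shown above (the elided `...` fields are placeholders and are not needed for the match):

```python
import json

def session_token_usage(otel_jsonl: str, session_id: str) -> int:
    """Sum claude_code.token.usage values for one session from OTEL JSONL text."""
    total = 0
    for line in otel_jsonl.splitlines():
        if not line.strip():
            continue
        metric = json.loads(line)
        if (metric.get("name") == "claude_code.token.usage"
                and metric.get("attributes", {}).get("session_id") == session_id):
            total += metric["value"]
    return total

sample = "\n".join([
    json.dumps({"name": "claude_code.token.usage", "value": 4200,
                "attributes": {"session_id": "session-abc123"}}),
    json.dumps({"name": "claude_code.api_request", "value": 1,
                "attributes": {"input_tokens": 3100}}),  # not a token.usage metric, ignored
])
print(session_token_usage(sample, "session-abc123"))  # 4200
```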