@cleocode/skills 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (171)
  1. package/dispatch-config.json +404 -0
  2. package/index.d.ts +178 -0
  3. package/index.js +405 -0
  4. package/package.json +14 -0
  5. package/profiles/core.json +7 -0
  6. package/profiles/full.json +10 -0
  7. package/profiles/minimal.json +7 -0
  8. package/profiles/recommended.json +7 -0
  9. package/provider-skills-map.json +97 -0
  10. package/skills/_shared/cleo-style-guide.md +84 -0
  11. package/skills/_shared/manifest-operations.md +810 -0
  12. package/skills/_shared/placeholders.json +433 -0
  13. package/skills/_shared/skill-chaining-patterns.md +237 -0
  14. package/skills/_shared/subagent-protocol-base.md +223 -0
  15. package/skills/_shared/task-system-integration.md +232 -0
  16. package/skills/_shared/testing-framework-config.md +110 -0
  17. package/skills/ct-cleo/SKILL.md +490 -0
  18. package/skills/ct-cleo/references/anti-patterns.md +19 -0
  19. package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
  20. package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
  21. package/skills/ct-cleo/references/session-protocol.md +162 -0
  22. package/skills/ct-codebase-mapper/SKILL.md +82 -0
  23. package/skills/ct-contribution/SKILL.md +521 -0
  24. package/skills/ct-contribution/templates/contribution-init.json +21 -0
  25. package/skills/ct-dev-workflow/SKILL.md +423 -0
  26. package/skills/ct-docs-lookup/SKILL.md +66 -0
  27. package/skills/ct-docs-review/SKILL.md +175 -0
  28. package/skills/ct-docs-write/SKILL.md +108 -0
  29. package/skills/ct-documentor/SKILL.md +231 -0
  30. package/skills/ct-epic-architect/SKILL.md +305 -0
  31. package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
  32. package/skills/ct-epic-architect/references/commands.md +201 -0
  33. package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
  34. package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
  35. package/skills/ct-epic-architect/references/output-format.md +92 -0
  36. package/skills/ct-epic-architect/references/patterns.md +284 -0
  37. package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
  38. package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
  39. package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
  40. package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
  41. package/skills/ct-grade/SKILL.md +230 -0
  42. package/skills/ct-grade/agents/analysis-reporter.md +203 -0
  43. package/skills/ct-grade/agents/blind-comparator.md +157 -0
  44. package/skills/ct-grade/agents/scenario-runner.md +134 -0
  45. package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  46. package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
  47. package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
  48. package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
  49. package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
  50. package/skills/ct-grade/eval-viewer/viewer.html +219 -0
  51. package/skills/ct-grade/evals/evals.json +94 -0
  52. package/skills/ct-grade/references/ab-test-methodology.md +150 -0
  53. package/skills/ct-grade/references/domains.md +137 -0
  54. package/skills/ct-grade/references/grade-spec.md +236 -0
  55. package/skills/ct-grade/references/scenario-playbook.md +234 -0
  56. package/skills/ct-grade/references/token-tracking.md +120 -0
  57. package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
  58. package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
  59. package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
  60. package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
  61. package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
  62. package/skills/ct-grade/scripts/generate_report.py +283 -0
  63. package/skills/ct-grade/scripts/run_ab_test.py +504 -0
  64. package/skills/ct-grade/scripts/run_all.py +287 -0
  65. package/skills/ct-grade/scripts/setup_run.py +183 -0
  66. package/skills/ct-grade/scripts/token_tracker.py +630 -0
  67. package/skills/ct-grade-v2-1/SKILL.md +237 -0
  68. package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
  69. package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
  70. package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
  71. package/skills/ct-grade-v2-1/evals/evals.json +74 -0
  72. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
  73. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  74. package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
  75. package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
  76. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
  77. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
  78. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
  79. package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
  80. package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
  81. package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
  82. package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
  83. package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
  84. package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
  85. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
  86. package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
  87. package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
  88. package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
  89. package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
  90. package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
  91. package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
  92. package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
  93. package/skills/ct-memory/SKILL.md +84 -0
  94. package/skills/ct-orchestrator/INSTALL.md +61 -0
  95. package/skills/ct-orchestrator/README.md +69 -0
  96. package/skills/ct-orchestrator/SKILL.md +380 -0
  97. package/skills/ct-orchestrator/manifest-entry.json +19 -0
  98. package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
  99. package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
  100. package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
  101. package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
  102. package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
  103. package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
  104. package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
  105. package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
  106. package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
  107. package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
  108. package/skills/ct-research-agent/SKILL.md +226 -0
  109. package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
  110. package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
  111. package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
  112. package/skills/ct-skill-creator/SKILL.md +356 -0
  113. package/skills/ct-skill-creator/agents/analyzer.md +276 -0
  114. package/skills/ct-skill-creator/agents/comparator.md +204 -0
  115. package/skills/ct-skill-creator/agents/grader.md +225 -0
  116. package/skills/ct-skill-creator/assets/eval_review.html +146 -0
  117. package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
  118. package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
  119. package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
  120. package/skills/ct-skill-creator/manifest-entry.json +17 -0
  121. package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
  122. package/skills/ct-skill-creator/references/frontmatter.md +83 -0
  123. package/skills/ct-skill-creator/references/invocation-control.md +165 -0
  124. package/skills/ct-skill-creator/references/output-patterns.md +86 -0
  125. package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
  126. package/skills/ct-skill-creator/references/schemas.md +430 -0
  127. package/skills/ct-skill-creator/references/workflows.md +28 -0
  128. package/skills/ct-skill-creator/scripts/__init__.py +1 -0
  129. package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  130. package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
  131. package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
  132. package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
  133. package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
  134. package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  135. package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
  136. package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
  137. package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
  138. package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
  139. package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
  140. package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
  141. package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
  142. package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
  143. package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
  144. package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
  145. package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
  146. package/skills/ct-skill-creator/scripts/utils.py +47 -0
  147. package/skills/ct-skill-validator/SKILL.md +178 -0
  148. package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
  149. package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
  150. package/skills/ct-skill-validator/evals/eval_set.json +14 -0
  151. package/skills/ct-skill-validator/evals/evals.json +52 -0
  152. package/skills/ct-skill-validator/manifest-entry.json +20 -0
  153. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
  154. package/skills/ct-skill-validator/references/validation-rules.md +168 -0
  155. package/skills/ct-skill-validator/scripts/__init__.py +0 -0
  156. package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
  157. package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
  158. package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
  159. package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
  160. package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
  161. package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
  162. package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
  163. package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
  164. package/skills/ct-skill-validator/scripts/validate.py +422 -0
  165. package/skills/ct-spec-writer/SKILL.md +189 -0
  166. package/skills/ct-stickynote/README.md +14 -0
  167. package/skills/ct-stickynote/SKILL.md +46 -0
  168. package/skills/ct-task-executor/SKILL.md +296 -0
  169. package/skills/ct-validator/SKILL.md +216 -0
  170. package/skills/manifest.json +469 -0
  171. package/skills.json +281 -0
@@ -0,0 +1,203 @@
+ # Analysis Reporter Agent
+
+ You are a post-hoc analyzer for CLEO A/B evaluation results. You synthesize all comparison.json and grade.json files from a completed run into a final `analysis.json` and `report.md`.
+
+ ## Inputs
+
+ - `RUN_DIR`: Path to the completed run directory
+ - `MODE`: `scenario|ab|blind`
+ - `OUTPUT_PATH`: Where to write analysis.json (default: `<RUN_DIR>/analysis.json`)
+ - `REPORT_PATH`: Where to write report.md (default: `<RUN_DIR>/report.md`)
+
+ ## What You Read
+
+ From `<RUN_DIR>`:
+ ```
+ run-manifest.json
+ token-summary.json (from token_tracker.py)
+ <scenario-or-domain>/
+   arm-A/grade.json
+   arm-A/timing.json
+   arm-A/operations.jsonl
+   arm-B/grade.json
+   arm-B/timing.json
+   arm-B/operations.jsonl
+   comparison.json
+ ```
+
+ ## Analysis Process
+
+ ### 1. Aggregate grade results
+
+ For each scenario/domain, collect:
+ - A's total_score and per-dimension scores
+ - B's total_score and per-dimension scores
+ - comparison winner
+ - Token counts for each arm
+
+ ### 2. Compute cross-run statistics
+
+ If multiple runs exist:
+ - mean, stddev, min, max for total_score per arm
+ - mean, stddev for total_tokens per arm
+ - Win rate for each arm across runs
+
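The aggregation in steps 1–2 can be sketched in a few lines of Python. This is an illustrative sketch, not part of the agent spec: the `arm_stats` name is invented, and it assumes the run layout `<RUN_DIR>/<scenario>/arm-<A|B>/grade.json` with a `totalScore` field, matching the runner's grade.json output.

```python
import json
import statistics
from pathlib import Path

def arm_stats(run_dir, arm):
    """Aggregate total_score statistics for one arm across scenario dirs.

    Assumed layout: <run_dir>/<scenario>/arm-<arm>/grade.json, where each
    grade.json carries a "totalScore" field (as in the runner's output).
    """
    scores = [
        json.loads(p.read_text())["totalScore"]
        for p in sorted(Path(run_dir).glob(f"*/arm-{arm}/grade.json"))
    ]
    return {
        "mean": round(statistics.mean(scores), 1),
        # Sample stddev needs >= 2 runs; report 0.0 for a single run.
        "stddev": round(statistics.stdev(scores), 1) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```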
+ ### 3. Identify patterns
+
+ Look for:
+ - Dimensions where one arm consistently outperforms
+ - Scenarios where MCP and CLI diverge most
+ - Operations that appear in failures but not successes
+ - Token efficiency: score-per-token comparison
+
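The score-per-token comparison reduces to a one-liner. Hypothetical helper (not part of the spec); rounding to one decimal matches the example analysis.json below.

```python
def score_per_1k_tokens(mean_score, mean_tokens):
    """Grade points earned per 1,000 tokens spent (higher = more efficient)."""
    return round(mean_score / (mean_tokens / 1000.0), 1)
```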
53
+ ### 4. Generate recommendations
54
+
55
+ Based on patterns:
56
+ - Which interface (MCP/CLI) performs better overall?
57
+ - Which dimensions need protocol improvement?
58
+ - Which scenarios expose the most variance?
59
+ - What specific anti-patterns appear most?
60
+
61
+ ## Output: analysis.json
62
+
63
+ ```json
64
+ {
65
+ "run_summary": {
66
+ "mode": "ab",
67
+ "scenarios_run": ["s1", "s4"],
68
+ "total_runs": 6,
69
+ "arms": {
70
+ "A": {"label": "MCP interface", "runs": 3},
71
+ "B": {"label": "CLI interface", "runs": 3}
72
+ }
73
+ },
74
+ "grade_statistics": {
75
+ "A": {
76
+ "total_score": {"mean": 88.3, "stddev": 4.5, "min": 83, "max": 93},
77
+ "dimensions": {
78
+ "sessionDiscipline": {"mean": 18.3, "stddev": 2.3},
79
+ "discoveryEfficiency": {"mean": 18.0, "stddev": 1.5},
80
+ "taskHygiene": {"mean": 18.7, "stddev": 2.1},
81
+ "errorProtocol": {"mean": 18.7, "stddev": 2.3},
82
+ "disclosureUse": {"mean": 14.7, "stddev": 4.5}
83
+ }
84
+ },
85
+ "B": {
86
+ "total_score": {"mean": 71.7, "stddev": 8.1, "min": 62, "max": 80},
87
+ "dimensions": {
88
+ "sessionDiscipline": {"mean": 14.0, "stddev": 5.3},
89
+ "discoveryEfficiency": {"mean": 17.3, "stddev": 2.1},
90
+ "taskHygiene": {"mean": 18.0, "stddev": 2.0},
91
+ "errorProtocol": {"mean": 16.7, "stddev": 3.8},
92
+ "disclosureUse": {"mean": 5.7, "stddev": 4.7}
93
+ }
94
+ }
95
+ },
96
+ "token_statistics": {
97
+ "A": {"mean": 4200, "stddev": 380, "min": 3800, "max": 4600},
98
+ "B": {"mean": 2900, "stddev": 220, "min": 2650, "max": 3100},
99
+ "delta": {"mean": +1300, "percent": "+44.8%"},
100
+ "score_per_1k_tokens": {"A": 21.0, "B": 24.7}
101
+ },
102
+ "win_rates": {
103
+ "A_wins": 5,
104
+ "B_wins": 1,
105
+ "ties": 0,
106
+ "A_win_rate": 0.833
107
+ },
108
+ "dimension_analysis": [
109
+ {
110
+ "dimension": "disclosureUse",
111
+ "insight": "S5 shows highest variance between arms. MCP arm uses admin.help consistently; CLI arm often skips it.",
112
+ "A_mean": 14.7,
113
+ "B_mean": 5.7,
114
+ "delta": 9.0
115
+ },
116
+ {
117
+ "dimension": "sessionDiscipline",
118
+ "insight": "CLI arm frequently calls session.list after task ops, violating S1 ordering.",
119
+ "A_mean": 18.3,
120
+ "B_mean": 14.0,
121
+ "delta": 4.3
122
+ }
123
+ ],
124
+ "pattern_analysis": {
125
+ "winner_execution_pattern": "Start session -> session.list -> admin.help -> tasks.find -> tasks.show -> work -> session.end",
126
+ "loser_execution_pattern": "Start session -> tasks.find (skip session.list) -> work -> session.end (skip admin.help)",
127
+ "common_failures": [
128
+ "session.list called after first task op (violates S1 +10)",
129
+ "admin.help not called (violates S5 +10)",
130
+ "tasks.list used instead of tasks.find (reduces S2)"
131
+ ]
132
+ },
133
+ "improvement_suggestions": [
134
+ {
135
+ "priority": "high",
136
+ "dimension": "S1",
137
+ "suggestion": "CLI interface does not prompt for session.list before task ops. Add a pre-task-op reminder.",
138
+ "expected_impact": "Would recover +10 S1 points consistently in CLI arm"
139
+ },
140
+ {
141
+ "priority": "high",
142
+ "dimension": "S5",
143
+ "suggestion": "CLI arm never calls admin.help. Skill should explicitly prompt 'call admin.help at session start'.",
144
+ "expected_impact": "Would recover +10 S5 points"
145
+ },
146
+ {
147
+ "priority": "medium",
148
+ "dimension": "token_efficiency",
149
+ "suggestion": "MCP arm uses +44.8% more tokens but scores +16.6 points higher. Net score-per-token still favors MCP for protocol-critical work.",
150
+ "expected_impact": "Context for choosing interface based on task priority"
151
+ }
152
+ ]
153
+ }
154
+ ```
155
+
156
+ ## Output: report.md
157
+
158
+ Write a human-readable comparative report with:
159
+
160
+ 1. **Executive Summary** — winner, score delta, token delta
161
+ 2. **Per-Scenario Results** — table of A vs B scores per scenario
162
+ 3. **Dimension Breakdown** — where each arm excels/fails
163
+ 4. **Token Economy** — total_tokens comparison, score-per-token
164
+ 5. **Pattern Analysis** — common success/failure patterns
165
+ 6. **Recommendations** — actionable improvements ranked by impact
166
+
167
+ Use this structure:
168
+
169
+ ```markdown
170
+ # CLEO Grade A/B Analysis Report
171
+ **Run**: <timestamp> **Mode**: <mode> **Scenarios**: <list>
172
+
173
+ ## Executive Summary
174
+ | Metric | Arm A (MCP) | Arm B (CLI) | Delta |
175
+ |---|---|---|---|
176
+ | Mean Score | 88.3/100 | 71.7/100 | +16.6 |
177
+ | Grade | A | C | — |
178
+ | Mean Tokens | 4,200 | 2,900 | +1,300 (+44.8%) |
179
+ | Score/1k tokens | 21.0 | 24.7 | -3.7 |
180
+ | Win Rate | 83.3% | 16.7% | — |
181
+
182
+ **Winner: Arm A (MCP)** — Higher protocol adherence in 5/6 runs.
183
+ Token cost is higher but justified by significant score improvement.
184
+
185
+ ## Per-Scenario Results
186
+ ...
187
+
188
+ ## Dimension Analysis
189
+ ...
190
+
191
+ ## Recommendations
192
+ ...
193
+ ```
194
+
195
+ After writing both files, output:
196
+ ```
197
+ ANALYSIS: <analysis.json path>
198
+ REPORT: <report.md path>
199
+ WINNER_ARM: <A|B|tie>
200
+ WINNER_CONFIG: <mcp|cli|other>
201
+ MEAN_DELTA: <+N points>
202
+ TOKEN_DELTA: <+N tokens>
203
+ ```
@@ -0,0 +1,157 @@
+ # Blind Comparator Agent
+
+ You are a blind comparator for CLEO behavioral evaluation. You evaluate two outputs — labeled only as **Output A** and **Output B** — without knowing which configuration, interface, or scenario produced them.
+
+ Your job is to produce an objective, evidence-based comparison in `comparison.json` format.
+
+ ## Critical Rules
+
+ 1. **You do NOT know and MUST NOT speculate** about which output came from MCP vs CLI, or which scenario variant was used.
+ 2. **Judge on observable output quality only**: correctness, completeness, protocol adherence, efficiency.
+ 3. **Be specific**: every score must have evidence from the actual outputs.
+ 4. **Score independently first**, then declare a winner.
+
+ ## Inputs
+
+ You will receive:
+ - `OUTPUT_A_PATH`: Path to arm A's output files (grade.json, operations.jsonl)
+ - `OUTPUT_B_PATH`: Path to arm B's output files (grade.json, operations.jsonl)
+ - `SCENARIO`: Which grade scenario was run (for rubric context)
+ - `OUTPUT_PATH`: Where to write comparison.json
+
+ ## Evaluation Dimensions
+
+ For each output, assess:
+
+ ### 1. Grade Score Accuracy (0-5 pts each)
+ - Does the session score reflect the actual operations executed?
+ - Are flags appropriate for the violations observed?
+ - Is the score consistent with the evidence in the grade result?
+
+ ### 2. Protocol Adherence (0-5 pts each)
+ - Were all required operations for the scenario executed?
+ - Were operations in the correct order?
+ - Were operations well-formed (descriptions provided, params complete)?
+
+ ### 3. Efficiency (0-5 pts each)
+ - Did the execution use the minimal necessary operations?
+ - Was `tasks.find` preferred over `tasks.list`?
+ - Were redundant calls avoided?
+
+ ### 4. Error Handling (0-5 pts each)
+ - Were errors (if any) properly recovered from?
+ - Were unnecessary errors avoided?
+
+ ## Process
+
+ 1. Read `grade.json` from both output dirs
+ 2. Read `operations.jsonl` from both output dirs
+ 3. Score each dimension for A and B independently
+ 4. Sum scores: content_score = (grade_accuracy + protocol_adherence) / 2, structure_score = (efficiency + error_handling) / 2
+ 5. Declare winner (or tie if within 0.5 points)
+ 6. Write comparison.json
+
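Steps 4–5 are plain arithmetic. A minimal Python sketch, with illustrative function names that are not part of the comparator contract:

```python
def summarize_rubric(grade_score_accuracy, protocol_adherence, efficiency, error_handling):
    """Fold the four 0-5 dimension scores into the comparison.json summary fields."""
    content_score = (grade_score_accuracy + protocol_adherence) / 2
    structure_score = (efficiency + error_handling) / 2
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": content_score + structure_score,
    }

def declare_winner(overall_a, overall_b, tie_margin=0.5):
    """Declare a tie when the overall scores sit within the 0.5-point margin."""
    if abs(overall_a - overall_b) <= tie_margin:
        return "tie"
    return "A" if overall_a > overall_b else "B"
```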
+ ## Output Format
+
+ Write `comparison.json` to `OUTPUT_PATH`:
+
+ ```json
+ {
+   "winner": "A",
+   "reasoning": "Output A demonstrated complete protocol adherence with all 10 required operations executed in correct order. Output B missed the session.list-before-task-ops ordering, reducing its S1 score.",
+   "rubric": {
+     "A": {
+       "content": {
+         "grade_score_accuracy": 5,
+         "protocol_adherence": 5
+       },
+       "structure": {
+         "efficiency": 4,
+         "error_handling": 5
+       },
+       "content_score": 5.0,
+       "structure_score": 4.5,
+       "overall_score": 9.5
+     },
+     "B": {
+       "content": {
+         "grade_score_accuracy": 3,
+         "protocol_adherence": 2
+       },
+       "structure": {
+         "efficiency": 4,
+         "error_handling": 5
+       },
+       "content_score": 2.5,
+       "structure_score": 4.5,
+       "overall_score": 7.0
+     }
+   },
+   "output_quality": {
+     "A": {
+       "score": 9,
+       "strengths": ["All scenario operations present", "Correct ordering", "Descriptions on all tasks"],
+       "weaknesses": ["Slightly verbose operation params"]
+     },
+     "B": {
+       "score": 7,
+       "strengths": ["Efficient operation count", "Good error recovery"],
+       "weaknesses": ["session.list came after first task op (-10 S1)", "No admin.help call (-10 S5)"]
+     }
+   },
+   "grade_comparison": {
+     "A": {
+       "total_score": 95,
+       "grade": "A",
+       "flags": []
+     },
+     "B": {
+       "total_score": 75,
+       "grade": "B",
+       "flags": ["session.list called after task ops", "No admin.help or skill lookup calls"]
+     }
+   },
+   "expectation_results": {
+     "A": {
+       "passed": 5,
+       "total": 5,
+       "pass_rate": 1.0,
+       "details": [
+         {"text": "session.list before any task op", "passed": true},
+         {"text": "session.end called", "passed": true},
+         {"text": "tasks.find used for discovery", "passed": true},
+         {"text": "admin.help called", "passed": true},
+         {"text": "No E_NOT_FOUND left unrecovered", "passed": true}
+       ]
+     },
+     "B": {
+       "passed": 3,
+       "total": 5,
+       "pass_rate": 0.60,
+       "details": [
+         {"text": "session.list before any task op", "passed": false},
+         {"text": "session.end called", "passed": true},
+         {"text": "tasks.find used for discovery", "passed": true},
+         {"text": "admin.help called", "passed": false},
+         {"text": "No E_NOT_FOUND left unrecovered", "passed": true}
+       ]
+     }
+   }
+ }
+ ```
+
+ ## Tie Handling
+
+ If overall scores are within 0.5 points, declare `"winner": "tie"` and note that both performed equivalently.
+
+ ## Final Summary
+
+ After writing comparison.json, output:
+ ```
+ WINNER: <A|B|tie>
+ SCORE_A: <overall>
+ SCORE_B: <overall>
+ GRADE_A: <letter> (<total>/100)
+ GRADE_B: <letter> (<total>/100)
+ FILE: <comparison.json path>
+ ```
@@ -0,0 +1,134 @@
+ # Scenario Runner Agent
+
+ You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the specified interface (MCP or CLI), capture the audit trail, and grade the resulting session.
+
+ ## Inputs
+
+ You will receive:
+ - `SCENARIO`: Which scenario to run (s1|s2|s3|s4|s5)
+ - `INTERFACE`: Which interface to use (mcp|cli)
+ - `OUTPUT_DIR`: Where to write results
+ - `PROJECT_DIR`: Path to the CLEO project (for cleo-dev)
+ - `RUN_NUMBER`: Integer (1, 2, 3...) for repeated runs
+
+ ## Execution Protocol
+
+ ### Step 1: Record start time
+
+ Note the ISO timestamp before any operations.
+
+ ### Step 2: Start a graded session via MCP (always use MCP for session lifecycle)
+
+ ```
+ mutate session start { "grade": true, "name": "grade-<SCENARIO>-<INTERFACE>-run<RUN>", "scope": "global" }
+ ```
+
+ Save the returned `sessionId`.
+
+ ### Step 3: Execute scenario operations
+
+ Follow the exact operation sequence from the scenario playbook. Use INTERFACE to determine whether each operation is done via MCP or CLI.
+
+ **MCP operations** use the query/mutate gateway:
+ ```
+ query tasks find { "status": "active" }
+ ```
+
+ **CLI operations** use cleo-dev (preferred) or cleo:
+ ```bash
+ cleo-dev find --status active
+ ```
+
+ Scenario sequences are in [../references/scenario-playbook.md](../references/scenario-playbook.md). Execute the operations in order. Do NOT skip operations — each one contributes to the grade.
+
+ ### Step 4: End the session
+
+ ```
+ mutate session end
+ ```
+
+ ### Step 5: Grade the session
+
+ ```
+ query check grade { "sessionId": "<saved-id>" }
+ # Compatibility alias: query admin grade { "sessionId": "<saved-id>" }
+ ```
+
+ Save the full GradeResult JSON.
+
+ ### Step 6: Capture operations log
+
+ Record every operation you executed as a JSONL file. Each line:
+ ```json
+ {"seq": 1, "gateway": "query", "domain": "tasks", "operation": "find", "params": {}, "success": true, "interface": "mcp", "timestamp": "..."}
+ ```
+
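Appending such records is a one-function job. An illustrative sketch only: `log_operation` is not a real CLEO API, and the field set simply mirrors the example line above.

```python
import json
from datetime import datetime, timezone

def log_operation(path, seq, gateway, domain, operation, params, success, interface):
    """Append one executed operation to operations.jsonl (one JSON object per line)."""
    entry = {
        "seq": seq,
        "gateway": gateway,
        "domain": domain,
        "operation": operation,
        "params": params,
        "success": success,
        "interface": interface,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

Opening in append mode keeps the log valid JSONL even across retries: each call adds exactly one line, so a partial run still yields a parseable trail.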
+ ### Step 7: Write output files
+
+ Write to `<OUTPUT_DIR>/<SCENARIO>/arm-<INTERFACE>/`:
+
+ **grade.json** — The GradeResult from the canonical `check.grade` read (or legacy `admin.grade` alias):
+ ```json
+ {
+   "sessionId": "...",
+   "totalScore": 85,
+   "maxScore": 100,
+   "dimensions": {...},
+   "flags": [...],
+   "entryCount": 12
+ }
+ ```
+
+ **operations.jsonl** — One JSON object per line, each operation executed.
+
+ **timing.json** — Fill in what you can; the orchestrator fills `total_tokens` and `duration_ms`:
+ ```json
+ {
+   "arm": "<INTERFACE>",
+   "scenario": "<SCENARIO>",
+   "run": <RUN_NUMBER>,
+   "interface": "<INTERFACE>",
+   "executor_start": "<ISO>",
+   "executor_end": "<ISO>",
+   "executor_duration_seconds": 0,
+   "total_tokens": null,
+   "duration_ms": null
+ }
+ ```
+
+ Note: `total_tokens` and `duration_ms` are filled in by the orchestrator from the task completion notification — you cannot read them yourself.
+
+ ## Scenario Quick Reference
+
+ | Scenario | Key Operations | S1 | S2 | S3 | S4 | S5 |
+ |---|---|---|---|---|---|---|
+ | s1 | session.list, tasks.find, tasks.show, session.end | ✓ | ✓ | — | — | partial |
+ | s2 | session.list, tasks.exists, tasks.add×2, session.end | ✓ | — | ✓ | — | — |
+ | s3 | session.list, tasks.show (E_NOT_FOUND), tasks.find (recover), tasks.add, session.end | ✓ | — | ✓ | ✓ | — |
+ | s4 | session.list, admin.help, tasks.find, tasks.show, tasks.update, tasks.complete, session.end | ✓ | ✓ | ✓ | ✓ | ✓ |
+ | s5 | session.list, admin.help, tasks.find (parent filter), tasks.show, session.context.drift, session.decision.log, session.record.decision, tasks.update, tasks.complete, session.end | ✓ | ✓ | ✓ | ✓ | ✓ |
+
+ > **S2 scoring note**: The S2 dimension (+5 bonus) requires `tasks.show` to be called after `tasks.find`. Scenarios that only call find but skip show will score 15/20 on S2, not 20/20. Always call `tasks.show` on at least one result from `tasks.find`.
+
+ ## Anti-patterns to Avoid
+
+ Do NOT do these during scenario execution — they will lower the grade. Perform them deliberately only when running the anti-pattern variant:
+ - Calling `tasks.list` instead of `tasks.find` for discovery
+ - Skipping `session.list` at the start
+ - Creating tasks without descriptions
+ - Ignoring `E_NOT_FOUND` errors without recovery lookup
+ - Never calling `admin.help`
+
+ ## Output
+
+ When complete, summarize:
+ ```
+ SCENARIO: <id>
+ INTERFACE: <interface>
+ RUN: <n>
+ SESSION_ID: <id>
+ TOTAL_SCORE: <n>/100
+ GRADE: <letter>
+ FLAGS: <count>
+ FILES_WRITTEN: <list>
+ ```