@cleocode/cleo 2026.3.20 → 2026.3.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (150)
  1. package/dist/cli/index.js +39394 -38817
  2. package/dist/cli/index.js.map +4 -4
  3. package/dist/mcp/index.js +35841 -36702
  4. package/dist/mcp/index.js.map +4 -4
  5. package/drizzle-brain.config.ts +7 -0
  6. package/drizzle-nexus.config.ts +7 -0
  7. package/drizzle-tasks.config.ts +7 -0
  8. package/migrations/drizzle-brain/20260301230215_workable_spitfire/migration.sql +68 -0
  9. package/migrations/drizzle-brain/20260301230215_workable_spitfire/snapshot.json +651 -0
  10. package/migrations/drizzle-brain/20260302050325_unknown_justin_hammer/migration.sql +23 -0
  11. package/migrations/drizzle-brain/20260302050325_unknown_justin_hammer/snapshot.json +884 -0
  12. package/migrations/drizzle-brain/20260302061755_unusual_jamie_braddock/migration.sql +2 -0
  13. package/migrations/drizzle-brain/20260302061755_unusual_jamie_braddock/snapshot.json +908 -0
  14. package/migrations/drizzle-brain/20260302193548_luxuriant_glorian/migration.sql +20 -0
  15. package/migrations/drizzle-brain/20260302193548_luxuriant_glorian/snapshot.json +1078 -0
  16. package/migrations/drizzle-brain/20260304045002_white_thunderbolt_ross/migration.sql +16 -0
  17. package/migrations/drizzle-brain/20260304045002_white_thunderbolt_ross/snapshot.json +1233 -0
  18. package/migrations/drizzle-nexus/20260305070805_quick_ted_forrester/migration.sql +46 -0
  19. package/migrations/drizzle-nexus/20260305070805_quick_ted_forrester/snapshot.json +461 -0
  20. package/migrations/drizzle-tasks/20260308024513_oval_king_bedlam/migration.sql +32 -0
  21. package/migrations/drizzle-tasks/20260308024513_oval_king_bedlam/snapshot.json +3727 -0
  22. package/package.json +14 -4
  23. package/packages/ct-skills/skills/ct-cleo/SKILL.md +344 -81
  24. package/packages/ct-skills/skills/ct-grade/SKILL.md +20 -4
  25. package/packages/ct-skills/skills/ct-grade/agents/analysis-reporter.md +203 -0
  26. package/packages/ct-skills/skills/ct-grade/agents/blind-comparator.md +157 -0
  27. package/packages/ct-skills/skills/ct-grade/agents/scenario-runner.md +134 -0
  28. package/packages/ct-skills/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
  29. package/packages/ct-skills/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
  30. package/packages/ct-skills/skills/ct-grade/eval-viewer/generate_review.py +283 -0
  31. package/packages/ct-skills/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
  32. package/packages/ct-skills/skills/ct-grade/eval-viewer/viewer.html +219 -0
  33. package/packages/ct-skills/skills/ct-grade/evals/evals.json +94 -0
  34. package/packages/ct-skills/skills/ct-grade/references/ab-test-methodology.md +150 -0
  35. package/packages/ct-skills/skills/ct-grade/references/domains.md +137 -0
  36. package/packages/ct-skills/skills/ct-grade/references/grade-spec.md +236 -0
  37. package/packages/ct-skills/skills/ct-grade/references/scenario-playbook.md +234 -0
  38. package/packages/ct-skills/skills/ct-grade/references/token-tracking.md +120 -0
  39. package/packages/ct-skills/skills/ct-grade/scripts/audit_analyzer.py +279 -0
  40. package/packages/ct-skills/skills/ct-grade/scripts/generate_report.py +283 -0
  41. package/packages/ct-skills/skills/ct-grade/scripts/run_ab_test.py +504 -0
  42. package/packages/ct-skills/skills/ct-grade/scripts/run_all.py +287 -0
  43. package/packages/ct-skills/skills/ct-grade/scripts/setup_run.py +183 -0
  44. package/packages/ct-skills/skills/ct-grade/scripts/token_tracker.py +630 -0
  45. package/packages/ct-skills/skills/ct-grade-v2-1/SKILL.md +237 -0
  46. package/packages/ct-skills/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
  47. package/packages/ct-skills/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
  48. package/packages/ct-skills/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
  49. package/packages/ct-skills/skills/ct-grade-v2-1/evals/evals.json +74 -0
  50. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
  51. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
  52. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
  53. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
  54. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
  55. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
  56. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
  57. package/packages/ct-skills/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
  58. package/packages/ct-skills/skills/ct-grade-v2-1/manifest-entry.json +31 -0
  59. package/packages/ct-skills/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
  60. package/packages/ct-skills/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
  61. package/packages/ct-skills/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
  62. package/packages/ct-skills/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
  63. package/packages/ct-skills/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
  64. package/packages/ct-skills/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
  65. package/packages/ct-skills/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
  66. package/packages/ct-skills/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
  67. package/packages/ct-skills/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
  68. package/packages/ct-skills/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
  69. package/packages/ct-skills/skills/ct-orchestrator/SKILL.md +1 -29
  70. package/packages/ct-skills/skills/ct-orchestrator/manifest-entry.json +19 -0
  71. package/packages/ct-skills/skills/ct-skill-creator/SKILL.md +0 -12
  72. package/packages/ct-skills/skills/ct-skill-creator/agents/analyzer.md +276 -0
  73. package/packages/ct-skills/skills/ct-skill-creator/agents/comparator.md +204 -0
  74. package/packages/ct-skills/skills/ct-skill-creator/agents/grader.md +225 -0
  75. package/packages/ct-skills/skills/ct-skill-creator/assets/eval_review.html +146 -0
  76. package/packages/ct-skills/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
  77. package/packages/ct-skills/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
  78. package/packages/ct-skills/skills/ct-skill-creator/manifest-entry.json +17 -0
  79. package/packages/ct-skills/skills/ct-skill-creator/references/dynamic-context.md +228 -0
  80. package/packages/ct-skills/skills/ct-skill-creator/references/frontmatter.md +83 -0
  81. package/packages/ct-skills/skills/ct-skill-creator/references/invocation-control.md +165 -0
  82. package/packages/ct-skills/skills/ct-skill-creator/references/provider-deployment.md +175 -0
  83. package/packages/ct-skills/skills/ct-skill-creator/references/schemas.md +430 -0
  84. package/packages/ct-skills/skills/ct-skill-creator/scripts/__init__.py +1 -0
  85. package/packages/ct-skills/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
  86. package/packages/ct-skills/skills/ct-skill-creator/scripts/generate_report.py +326 -0
  87. package/packages/ct-skills/skills/ct-skill-creator/scripts/improve_description.py +247 -0
  88. package/packages/ct-skills/skills/ct-skill-creator/scripts/run_eval.py +310 -0
  89. package/packages/ct-skills/skills/ct-skill-creator/scripts/run_loop.py +328 -0
  90. package/packages/ct-skills/skills/ct-skill-creator/scripts/utils.py +47 -0
  91. package/packages/ct-skills/skills/ct-skill-validator/SKILL.md +178 -0
  92. package/packages/ct-skills/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
  93. package/packages/ct-skills/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
  94. package/packages/ct-skills/skills/ct-skill-validator/evals/eval_set.json +14 -0
  95. package/packages/ct-skills/skills/ct-skill-validator/evals/evals.json +52 -0
  96. package/packages/ct-skills/skills/ct-skill-validator/manifest-entry.json +20 -0
  97. package/packages/ct-skills/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
  98. package/packages/ct-skills/skills/ct-skill-validator/references/validation-rules.md +168 -0
  99. package/packages/ct-skills/skills/ct-skill-validator/scripts/__init__.py +0 -0
  100. package/packages/ct-skills/skills/ct-skill-validator/scripts/audit_body.py +242 -0
  101. package/packages/ct-skills/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
  102. package/packages/ct-skills/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
  103. package/packages/ct-skills/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
  104. package/packages/ct-skills/skills/ct-skill-validator/scripts/validate.py +422 -0
  105. /package/{drizzle → migrations/drizzle-tasks}/20260224040019_baseline/migration.sql +0 -0
  106. /package/{drizzle → migrations/drizzle-tasks}/20260224040019_baseline/snapshot.json +0 -0
  107. /package/{drizzle → migrations/drizzle-tasks}/20260224040238_add-audit-log/migration.sql +0 -0
  108. /package/{drizzle → migrations/drizzle-tasks}/20260224040238_add-audit-log/snapshot.json +0 -0
  109. /package/{drizzle → migrations/drizzle-tasks}/20260224144602_closed_grim_reaper/migration.sql +0 -0
  110. /package/{drizzle → migrations/drizzle-tasks}/20260224144602_closed_grim_reaper/snapshot.json +0 -0
  111. /package/{drizzle → migrations/drizzle-tasks}/20260225024442_sync-lifecycle-enums-and-arch-decisions/migration.sql +0 -0
  112. /package/{drizzle → migrations/drizzle-tasks}/20260225024442_sync-lifecycle-enums-and-arch-decisions/snapshot.json +0 -0
  113. /package/{drizzle → migrations/drizzle-tasks}/20260227014821_adr-system-and-status-registry/migration.sql +0 -0
  114. /package/{drizzle → migrations/drizzle-tasks}/20260227014821_adr-system-and-status-registry/snapshot.json +0 -0
  115. /package/{drizzle → migrations/drizzle-tasks}/20260227021231_add-cancelled-pipeline-status/migration.sql +0 -0
  116. /package/{drizzle → migrations/drizzle-tasks}/20260227021231_add-cancelled-pipeline-status/snapshot.json +0 -0
  117. /package/{drizzle → migrations/drizzle-tasks}/20260227022417_adr-cognitive-search-fields/migration.sql +0 -0
  118. /package/{drizzle → migrations/drizzle-tasks}/20260227022417_adr-cognitive-search-fields/snapshot.json +0 -0
  119. /package/{drizzle → migrations/drizzle-tasks}/20260227172236_freezing_grey_gargoyle/migration.sql +0 -0
  120. /package/{drizzle → migrations/drizzle-tasks}/20260227172236_freezing_grey_gargoyle/snapshot.json +0 -0
  121. /package/{drizzle → migrations/drizzle-tasks}/20260227183444_fix-orphaned-parent-ids/migration.sql +0 -0
  122. /package/{drizzle → migrations/drizzle-tasks}/20260227183444_fix-orphaned-parent-ids/snapshot.json +0 -0
  123. /package/{drizzle → migrations/drizzle-tasks}/20260227183521_parent-id-on-delete-set-null/migration.sql +0 -0
  124. /package/{drizzle → migrations/drizzle-tasks}/20260227183521_parent-id-on-delete-set-null/snapshot.json +0 -0
  125. /package/{drizzle → migrations/drizzle-tasks}/20260227200430_numerous_mysterio/migration.sql +0 -0
  126. /package/{drizzle → migrations/drizzle-tasks}/20260227200430_numerous_mysterio/snapshot.json +0 -0
  127. /package/{drizzle → migrations/drizzle-tasks}/20260227235745_add-audit-log-dispatch-columns/migration.sql +0 -0
  128. /package/{drizzle → migrations/drizzle-tasks}/20260227235745_add-audit-log-dispatch-columns/snapshot.json +0 -0
  129. /package/{drizzle → migrations/drizzle-tasks}/20260301053344_careless_changeling/migration.sql +0 -0
  130. /package/{drizzle → migrations/drizzle-tasks}/20260301053344_careless_changeling/snapshot.json +0 -0
  131. /package/{drizzle → migrations/drizzle-tasks}/20260301175940_futuristic_eternity/migration.sql +0 -0
  132. /package/{drizzle → migrations/drizzle-tasks}/20260301175940_futuristic_eternity/snapshot.json +0 -0
  133. /package/{drizzle → migrations/drizzle-tasks}/20260301180528_update-task-relations-check-constraint/migration.sql +0 -0
  134. /package/{drizzle → migrations/drizzle-tasks}/20260301180528_update-task-relations-check-constraint/snapshot.json +0 -0
  135. /package/{drizzle → migrations/drizzle-tasks}/20260302163443_free_silk_fever/migration.sql +0 -0
  136. /package/{drizzle → migrations/drizzle-tasks}/20260302163443_free_silk_fever/snapshot.json +0 -0
  137. /package/{drizzle → migrations/drizzle-tasks}/20260302163457_robust_johnny_storm/migration.sql +0 -0
  138. /package/{drizzle → migrations/drizzle-tasks}/20260302163457_robust_johnny_storm/snapshot.json +0 -0
  139. /package/{drizzle → migrations/drizzle-tasks}/20260302163511_late_sphinx/migration.sql +0 -0
  140. /package/{drizzle → migrations/drizzle-tasks}/20260302163511_late_sphinx/snapshot.json +0 -0
  141. /package/{drizzle → migrations/drizzle-tasks}/20260305011924_cheerful_mongu/migration.sql +0 -0
  142. /package/{drizzle → migrations/drizzle-tasks}/20260305011924_cheerful_mongu/snapshot.json +0 -0
  143. /package/{drizzle → migrations/drizzle-tasks}/20260305203927_demonic_storm/migration.sql +0 -0
  144. /package/{drizzle → migrations/drizzle-tasks}/20260305203927_demonic_storm/snapshot.json +0 -0
  145. /package/{drizzle → migrations/drizzle-tasks}/20260306001243_spooky_rage/migration.sql +0 -0
  146. /package/{drizzle → migrations/drizzle-tasks}/20260306001243_spooky_rage/snapshot.json +0 -0
  147. /package/{drizzle → migrations/drizzle-tasks}/20260306193138_young_morbius/migration.sql +0 -0
  148. /package/{drizzle → migrations/drizzle-tasks}/20260306193138_young_morbius/snapshot.json +0 -0
  149. /package/{drizzle → migrations/drizzle-tasks}/20260306194959_sticky_captain_flint/migration.sql +0 -0
  150. /package/{drizzle → migrations/drizzle-tasks}/20260306194959_sticky_captain_flint/snapshot.json +0 -0
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+"""
+token_tracker.py — Aggregate token usage stats from a completed A/B run.
+
+Usage:
+    python token_tracker.py --run-dir ./ab_results/run-001
+
+Reads all timing.json files in the run directory and produces token-summary.json
+with per-arm statistics.
+
+Output: <run-dir>/token-summary.json
+"""
+
+import argparse
+import json
+import os
+import sys
+import math
+from pathlib import Path
+
+
+def find_timing_files(run_dir):
+    """Find all timing.json files under run_dir."""
+    return list(Path(run_dir).rglob("timing.json"))
+
+
+def load_timing(path):
+    try:
+        with open(path) as f:
+            return json.load(f)
+    except Exception as e:
+        print(f" WARN: Could not read {path}: {e}", file=sys.stderr)
+        return None
+
+
+def mean(values):
+    return sum(values) / len(values) if values else 0
+
+
+def stddev(values):
+    if len(values) < 2:
+        return 0
+    m = mean(values)
+    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))
+
+
+def stats(values):
+    if not values:
+        return {"mean": None, "stddev": None, "min": None, "max": None, "count": 0}
+    return {
+        "mean": round(mean(values), 1),
+        "stddev": round(stddev(values), 1),
+        "min": min(values),
+        "max": max(values),
+        "count": len(values),
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Aggregate token stats from ct-grade A/B run")
+    parser.add_argument("--run-dir", required=True)
+    parser.add_argument("--output", default=None, help="Output path (default: <run-dir>/token-summary.json)")
+    args = parser.parse_args()
+
+    run_dir = args.run_dir
+    if not os.path.isdir(run_dir):
+        print(f"ERROR: Run dir not found: {run_dir}", file=sys.stderr)
+        sys.exit(1)
+
+    timing_files = find_timing_files(run_dir)
+    if not timing_files:
+        print(f"ERROR: No timing.json files found in {run_dir}", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Found {len(timing_files)} timing.json files")
+
+    # Group by arm
+    by_arm = {}
+    by_interface = {}
+    missing_tokens = []
+
+    for tpath in timing_files:
+        data = load_timing(tpath)
+        if data is None:
+            continue
+
+        arm = data.get("arm", "unknown")
+        iface = data.get("interface", "unknown")
+        tokens = data.get("total_tokens")
+        duration = data.get("duration_ms")
+
+        if arm not in by_arm:
+            by_arm[arm] = {"tokens": [], "duration_ms": [], "interface": iface, "files": []}
+        if iface not in by_interface:
+            by_interface[iface] = {"tokens": [], "duration_ms": [], "files": []}
+
+        by_arm[arm]["files"].append(str(tpath))
+
+        if tokens is not None:
+            by_arm[arm]["tokens"].append(tokens)
+            by_interface[iface]["tokens"].append(tokens)
+        else:
+            missing_tokens.append(str(tpath))
+
+        if duration is not None:
+            by_arm[arm]["duration_ms"].append(duration)
+            by_interface[iface]["duration_ms"].append(duration)
+
+    # Build summary
+    arm_stats = {}
+    for arm, data in sorted(by_arm.items()):
+        arm_stats[arm] = {
+            "interface": data["interface"],
+            "file_count": len(data["files"]),
+            "total_tokens": stats(data["tokens"]),
+            "duration_ms": stats(data["duration_ms"]),
+        }

+    iface_stats = {}
+    for iface, data in sorted(by_interface.items()):
+        iface_stats[iface] = {
+            "file_count": len(data["files"]),
+            "total_tokens": stats(data["tokens"]),
+            "duration_ms": stats(data["duration_ms"]),
+        }
+
+    # Compute delta between arms (A vs B)
+    delta = {}
+    if "arm-A" in arm_stats and "arm-B" in arm_stats:
+        a_mean = arm_stats["arm-A"]["total_tokens"].get("mean") or 0
+        b_mean = arm_stats["arm-B"]["total_tokens"].get("mean") or 0
+        if b_mean > 0:
+            delta = {
+                "mean_tokens": round(a_mean - b_mean, 1),
+                "percent": f"{((a_mean - b_mean) / b_mean * 100):+.1f}%",
+                "note": f"Arm A uses {abs(a_mean - b_mean):.0f} {'more' if a_mean > b_mean else 'fewer'} tokens on average",
+            }
+
+    summary = {
+        "run_dir": os.path.abspath(run_dir),
+        "timing_files_found": len(timing_files),
+        "timing_files_missing_tokens": len(missing_tokens),
+        "by_arm": arm_stats,
+        "by_interface": iface_stats,
+        "delta_A_vs_B": delta,
+        "warnings": (
+            [f"MISSING total_tokens in {len(missing_tokens)} files — fill these from task notifications"]
+            if missing_tokens else []
+        ),
+    }
+
+    output_path = args.output or os.path.join(run_dir, "token-summary.json")
+    with open(output_path, "w") as f:
+        json.dump(summary, f, indent=2)
+
+    # Print summary
+    print(f"\nToken Summary")
+    print(f"{'='*50}")
+    for arm, s in arm_stats.items():
+        t = s["total_tokens"]
+        if t["mean"] is not None:
+            print(f" {arm} ({s['interface']}): {t['mean']:.0f} tokens (±{t['stddev']:.0f}, n={t['count']})")
+        else:
+            print(f" {arm} ({s['interface']}): NO TOKEN DATA (fill timing.json from task notifications)")
+    if delta:
+        print(f"\n Delta (A-B): {delta['percent']} ({delta['mean_tokens']:+.0f} tokens)")
+        print(f" {delta['note']}")
+    if missing_tokens:
+        print(f"\n WARNING: {len(missing_tokens)} files missing total_tokens")
+        print(f" These must be filled from Claude Code task notification data.")
+    print(f"\nWritten: {output_path}")
+
+
+if __name__ == "__main__":
+    main()
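The statistics helpers in the added token_tracker.py above can be exercised in isolation. A minimal sketch with the `mean`, `stddev`, and `stats` bodies copied from the diff (the sample token counts are hypothetical, not taken from any real run):

```python
import math

# Helpers as defined in the added token_tracker.py:
# sample (n-1) standard deviation, mean/stddev rounded to 1 decimal.
def mean(values):
    return sum(values) / len(values) if values else 0

def stddev(values):
    if len(values) < 2:
        return 0
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

def stats(values):
    if not values:
        return {"mean": None, "stddev": None, "min": None, "max": None, "count": 0}
    return {
        "mean": round(mean(values), 1),
        "stddev": round(stddev(values), 1),
        "min": min(values),
        "max": max(values),
        "count": len(values),
    }

# Hypothetical per-run token counts for one arm.
tokens = [100, 110, 120]
print(stats(tokens))  # {'mean': 110.0, 'stddev': 10.0, 'min': 100, 'max': 120, 'count': 3}

# The delta_A_vs_B percent follows the same convention as the script:
# (a_mean - b_mean) / b_mean * 100, formatted with an explicit sign.
a_mean, b_mean = 110.0, 100.0
print(f"{(a_mean - b_mean) / b_mean * 100:+.1f}%")  # +10.0%
```

Note the script reports an empty `delta_A_vs_B` unless both `arm-A` and `arm-B` appear in the timing data and arm B has a nonzero mean.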
@@ -1,34 +1,6 @@
 ---
 name: ct-orchestrator
-description: |
-  Pipeline-aware orchestration skill for managing complex workflows through subagent delegation.
-  Use when the user asks to "orchestrate", "orchestrator mode", "run as orchestrator",
-  "delegate to subagents", "coordinate agents", "spawn subagents", "multi-agent workflow",
-  "context-protected workflow", "agent farm", "HITL orchestration", "pipeline management",
-  or needs to manage complex workflows by delegating work to subagents while protecting
-  the main context window. Enforces ORC-001 through ORC-009 constraints. Provider-neutral —
-  works with any agent runtime that supports prompt-based delegation.
-version: 4.0.0
-tier: 0
-core: true
-category: core
-protocol: agent-protocol
-mvi_scope: orchestrator
-requires_tiers:
-  - minimal
-  - standard
-  - orchestrator
-dependencies: []
-sharedResources:
-  - subagent-protocol-base
-  - task-system-integration
-compatibility:
-  - claude-code
-  - cursor
-  - windsurf
-  - gemini-cli
-  - opencode
-  - codex-cli
+description: "Pipeline-aware orchestration skill for managing complex workflows through subagent delegation. Use when the user asks to \"orchestrate\", \"orchestrator mode\", \"run as orchestrator\", \"delegate to subagents\", \"coordinate agents\", \"spawn subagents\", \"multi-agent workflow\", \"context-protected workflow\", \"agent farm\", \"HITL orchestration\", \"pipeline management\", or needs to manage complex workflows by delegating work to subagents while protecting the main context window. Enforces ORC-001 through ORC-009 constraints. Provider-neutral."
 license: MIT
 ---
 
@@ -0,0 +1,19 @@
+{
+  "name": "ct-orchestrator",
+  "version": "4.0.0",
+  "description": "Pipeline-aware orchestration skill for managing complex workflows through subagent delegation.",
+  "path": "packages/ct-skills/skills/ct-orchestrator",
+  "status": "active",
+  "tier": 0,
+  "core": true,
+  "category": "core",
+  "protocol": "agent-protocol",
+  "mvi_scope": "orchestrator",
+  "requires_tiers": ["minimal", "standard", "orchestrator"],
+  "dependencies": [],
+  "sharedResources": ["subagent-protocol-base", "task-system-integration"],
+  "compatibility": ["claude-code", "cursor", "windsurf", "gemini-cli", "opencode", "codex-cli"],
+  "token_budget": 8000,
+  "capabilities": ["multi-agent-coordination", "wave-planning", "context-protection", "hitl-orchestration"],
+  "constraints": ["requires-agent-runtime"]
+}
@@ -1,18 +1,6 @@
 ---
 name: ct-skill-creator
 description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
-version: 1.0.0
-tier: 3
-core: false
-category: meta
-protocol: null
-dependencies: []
-sharedResources: []
-compatibility:
-  - claude-code
-  - cursor
-  - windsurf
-  - gemini-cli
 license: MIT
 ---
 
@@ -0,0 +1,276 @@
+# Post-hoc Analyzer Agent
+
+Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
+
+## Role
+
+After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
+
+## Inputs
+
+You receive these parameters in your prompt:
+
+- **winner**: "A" or "B" (from blind comparison)
+- **winner_skill_path**: Path to the skill that produced the winning output
+- **winner_transcript_path**: Path to the execution transcript for the winner
+- **loser_skill_path**: Path to the skill that produced the losing output
+- **loser_transcript_path**: Path to the execution transcript for the loser
+- **comparison_result_path**: Path to the blind comparator's output JSON
+- **output_path**: Where to save the analysis results
+
+## Process
+
+### Step 1: Read Comparison Result
+
+1. Read the blind comparator's output at comparison_result_path
+2. Note the winning side (A or B), the reasoning, and any scores
+3. Understand what the comparator valued in the winning output
+
+### Step 2: Read Both Skills
+
+1. Read the winner skill's SKILL.md and key referenced files
+2. Read the loser skill's SKILL.md and key referenced files
+3. Identify structural differences:
+   - Instructions clarity and specificity
+   - Script/tool usage patterns
+   - Example coverage
+   - Edge case handling
+
+### Step 3: Read Both Transcripts
+
+1. Read the winner's transcript
+2. Read the loser's transcript
+3. Compare execution patterns:
+   - How closely did each follow their skill's instructions?
+   - What tools were used differently?
+   - Where did the loser diverge from optimal behavior?
+   - Did either encounter errors or make recovery attempts?
+
+### Step 4: Analyze Instruction Following
+
+For each transcript, evaluate:
+- Did the agent follow the skill's explicit instructions?
+- Did the agent use the skill's provided tools/scripts?
+- Were there missed opportunities to leverage skill content?
+- Did the agent add unnecessary steps not in the skill?
+
+Score instruction following 1-10 and note specific issues.
+
+### Step 5: Identify Winner Strengths
+
+Determine what made the winner better:
+- Clearer instructions that led to better behavior?
+- Better scripts/tools that produced better output?
+- More comprehensive examples that guided edge cases?
+- Better error handling guidance?
+
+Be specific. Quote from skills/transcripts where relevant.
+
+### Step 6: Identify Loser Weaknesses
+
+Determine what held the loser back:
+- Ambiguous instructions that led to suboptimal choices?
+- Missing tools/scripts that forced workarounds?
+- Gaps in edge case coverage?
+- Poor error handling that caused failures?
+
+### Step 7: Generate Improvement Suggestions
+
+Based on the analysis, produce actionable suggestions for improving the loser skill:
+- Specific instruction changes to make
+- Tools/scripts to add or modify
+- Examples to include
+- Edge cases to address
+
+Prioritize by impact. Focus on changes that would have changed the outcome.
+
+### Step 8: Write Analysis Results
+
+Save structured analysis to `{output_path}`.
+
+## Output Format
+
+Write a JSON file with this structure:
+
+```json
+{
+  "comparison_summary": {
+    "winner": "A",
+    "winner_skill": "path/to/winner/skill",
+    "loser_skill": "path/to/loser/skill",
+    "comparator_reasoning": "Brief summary of why comparator chose winner"
+  },
+  "winner_strengths": [
+    "Clear step-by-step instructions for handling multi-page documents",
+    "Included validation script that caught formatting errors",
+    "Explicit guidance on fallback behavior when OCR fails"
+  ],
+  "loser_weaknesses": [
+    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
+    "No script for validation, agent had to improvise and made errors",
+    "No guidance on OCR failure, agent gave up instead of trying alternatives"
+  ],
+  "instruction_following": {
+    "winner": {
+      "score": 9,
+      "issues": [
+        "Minor: skipped optional logging step"
+      ]
+    },
+    "loser": {
+      "score": 6,
+      "issues": [
+        "Did not use the skill's formatting template",
+        "Invented own approach instead of following step 3",
+        "Missed the 'always validate output' instruction"
+      ]
+    }
+  },
+  "improvement_suggestions": [
+    {
+      "priority": "high",
+      "category": "instructions",
+      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
+      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
+    },
+    {
+      "priority": "high",
+      "category": "tools",
+      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
+      "expected_impact": "Would catch formatting errors before final output"
+    },
+    {
+      "priority": "medium",
+      "category": "error_handling",
+      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
+      "expected_impact": "Would prevent early failure on difficult documents"
+    }
+  ],
+  "transcript_insights": {
+    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
+    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
+  }
+}
+```
+
+## Guidelines
+
+- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
+- **Be actionable**: Suggestions should be concrete changes, not vague advice
+- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
+- **Prioritize by impact**: Which changes would most likely have changed the outcome?
+- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
+- **Stay objective**: Analyze what happened, don't editorialize
+- **Think about generalization**: Would this improvement help on other evals too?
+
+## Categories for Suggestions
+
+Use these categories to organize improvement suggestions:
+
+| Category | Description |
+|----------|-------------|
+| `instructions` | Changes to the skill's prose instructions |
+| `tools` | Scripts, templates, or utilities to add/modify |
+| `examples` | Example inputs/outputs to include |
+| `error_handling` | Guidance for handling failures |
+| `structure` | Reorganization of skill content |
+| `references` | External docs or resources to add |
+
+## Priority Levels
+
+- **high**: Would likely change the outcome of this comparison
+- **medium**: Would improve quality but may not change win/loss
+- **low**: Nice to have, marginal improvement
+
+---
+
+# Analyzing Benchmark Results
+
+When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
+
+## Role
+
+Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
+
+## Inputs
+
+You receive these parameters in your prompt:
+
+- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
+- **skill_path**: Path to the skill being benchmarked
+- **output_path**: Where to save the notes (as JSON array of strings)
+
+## Process
+
+### Step 1: Read Benchmark Data
+
+1. Read the benchmark.json containing all run results
+2. Note the configurations tested (with_skill, without_skill)
+3. Understand the run_summary aggregates already calculated
+
+### Step 2: Analyze Per-Assertion Patterns
+
+For each expectation across all runs:
+- Does it **always pass** in both configurations? (may not differentiate skill value)
+- Does it **always fail** in both configurations? (may be broken or beyond capability)
+- Does it **always pass with skill but fail without**? (skill clearly adds value here)
+- Does it **always fail with skill but pass without**? (skill may be hurting)
+- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
+
+### Step 3: Analyze Cross-Eval Patterns
+
+Look for patterns across evals:
+- Are certain eval types consistently harder/easier?
+- Do some evals show high variance while others are stable?
+- Are there surprising results that contradict expectations?
+
+### Step 4: Analyze Metrics Patterns
+
+Look at time_seconds, tokens, tool_calls:
+- Does the skill significantly increase execution time?
+- Is there high variance in resource usage?
+- Are there outlier runs that skew the aggregates?
+
+### Step 5: Generate Notes
+
+Write freeform observations as a list of strings. Each note should:
+- State a specific observation
+- Be grounded in the data (not speculation)
+- Help the user understand something the aggregate metrics don't show
+
+Examples:
+- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
+- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
+- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
+- "Skill adds 13s average execution time but improves pass rate by 50%"
+- "Token usage is 80% higher with skill, primarily due to script output parsing"
+- "All 3 without-skill runs for eval 1 produced empty output"
+
+### Step 6: Write Notes
+
+Save notes to `{output_path}` as a JSON array of strings:
+
+```json
+[
+  "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
+  "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
+  "Without-skill runs consistently fail on table extraction expectations",
+  "Skill adds 13s average execution time but improves pass rate by 50%"
+]
+```
+
+## Guidelines
+
+**DO:**
+- Report what you observe in the data
+- Be specific about which evals, expectations, or runs you're referring to
+- Note patterns that aggregate metrics would hide
+- Provide context that helps interpret the numbers
+
+**DO NOT:**
+- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
+- Make subjective quality judgments ("the output was good/bad")
+- Speculate about causes without evidence
+- Repeat information already in the run_summary aggregates
+
+See [references/schemas.md](../references/schemas.md) for the complete analysis.json and benchmark.json schema definitions.