@cleocode/skills 2.0.0
- package/dispatch-config.json +404 -0
- package/index.d.ts +178 -0
- package/index.js +405 -0
- package/package.json +14 -0
- package/profiles/core.json +7 -0
- package/profiles/full.json +10 -0
- package/profiles/minimal.json +7 -0
- package/profiles/recommended.json +7 -0
- package/provider-skills-map.json +97 -0
- package/skills/_shared/cleo-style-guide.md +84 -0
- package/skills/_shared/manifest-operations.md +810 -0
- package/skills/_shared/placeholders.json +433 -0
- package/skills/_shared/skill-chaining-patterns.md +237 -0
- package/skills/_shared/subagent-protocol-base.md +223 -0
- package/skills/_shared/task-system-integration.md +232 -0
- package/skills/_shared/testing-framework-config.md +110 -0
- package/skills/ct-cleo/SKILL.md +490 -0
- package/skills/ct-cleo/references/anti-patterns.md +19 -0
- package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
- package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
- package/skills/ct-cleo/references/session-protocol.md +162 -0
- package/skills/ct-codebase-mapper/SKILL.md +82 -0
- package/skills/ct-contribution/SKILL.md +521 -0
- package/skills/ct-contribution/templates/contribution-init.json +21 -0
- package/skills/ct-dev-workflow/SKILL.md +423 -0
- package/skills/ct-docs-lookup/SKILL.md +66 -0
- package/skills/ct-docs-review/SKILL.md +175 -0
- package/skills/ct-docs-write/SKILL.md +108 -0
- package/skills/ct-documentor/SKILL.md +231 -0
- package/skills/ct-epic-architect/SKILL.md +305 -0
- package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
- package/skills/ct-epic-architect/references/commands.md +201 -0
- package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
- package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
- package/skills/ct-epic-architect/references/output-format.md +92 -0
- package/skills/ct-epic-architect/references/patterns.md +284 -0
- package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
- package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
- package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
- package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
- package/skills/ct-grade/SKILL.md +230 -0
- package/skills/ct-grade/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade/agents/blind-comparator.md +157 -0
- package/skills/ct-grade/agents/scenario-runner.md +134 -0
- package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
- package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
- package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
- package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
- package/skills/ct-grade/eval-viewer/viewer.html +219 -0
- package/skills/ct-grade/evals/evals.json +94 -0
- package/skills/ct-grade/references/ab-test-methodology.md +150 -0
- package/skills/ct-grade/references/domains.md +137 -0
- package/skills/ct-grade/references/grade-spec.md +236 -0
- package/skills/ct-grade/references/scenario-playbook.md +234 -0
- package/skills/ct-grade/references/token-tracking.md +120 -0
- package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
- package/skills/ct-grade/scripts/generate_report.py +283 -0
- package/skills/ct-grade/scripts/run_ab_test.py +504 -0
- package/skills/ct-grade/scripts/run_all.py +287 -0
- package/skills/ct-grade/scripts/setup_run.py +183 -0
- package/skills/ct-grade/scripts/token_tracker.py +630 -0
- package/skills/ct-grade-v2-1/SKILL.md +237 -0
- package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
- package/skills/ct-grade-v2-1/evals/evals.json +74 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
- package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
- package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
- package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
- package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
- package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
- package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
- package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
- package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
- package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
- package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
- package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
- package/skills/ct-memory/SKILL.md +84 -0
- package/skills/ct-orchestrator/INSTALL.md +61 -0
- package/skills/ct-orchestrator/README.md +69 -0
- package/skills/ct-orchestrator/SKILL.md +380 -0
- package/skills/ct-orchestrator/manifest-entry.json +19 -0
- package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
- package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
- package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
- package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
- package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
- package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
- package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
- package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
- package/skills/ct-research-agent/SKILL.md +226 -0
- package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
- package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
- package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
- package/skills/ct-skill-creator/SKILL.md +356 -0
- package/skills/ct-skill-creator/agents/analyzer.md +276 -0
- package/skills/ct-skill-creator/agents/comparator.md +204 -0
- package/skills/ct-skill-creator/agents/grader.md +225 -0
- package/skills/ct-skill-creator/assets/eval_review.html +146 -0
- package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
- package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
- package/skills/ct-skill-creator/manifest-entry.json +17 -0
- package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
- package/skills/ct-skill-creator/references/frontmatter.md +83 -0
- package/skills/ct-skill-creator/references/invocation-control.md +165 -0
- package/skills/ct-skill-creator/references/output-patterns.md +86 -0
- package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
- package/skills/ct-skill-creator/references/schemas.md +430 -0
- package/skills/ct-skill-creator/references/workflows.md +28 -0
- package/skills/ct-skill-creator/scripts/__init__.py +1 -0
- package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
- package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
- package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
- package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
- package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
- package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
- package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
- package/skills/ct-skill-creator/scripts/utils.py +47 -0
- package/skills/ct-skill-validator/SKILL.md +178 -0
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
- package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
- package/skills/ct-skill-validator/evals/eval_set.json +14 -0
- package/skills/ct-skill-validator/evals/evals.json +52 -0
- package/skills/ct-skill-validator/manifest-entry.json +20 -0
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
- package/skills/ct-skill-validator/references/validation-rules.md +168 -0
- package/skills/ct-skill-validator/scripts/__init__.py +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
- package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
- package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
- package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
- package/skills/ct-skill-validator/scripts/validate.py +422 -0
- package/skills/ct-spec-writer/SKILL.md +189 -0
- package/skills/ct-stickynote/README.md +14 -0
- package/skills/ct-stickynote/SKILL.md +46 -0
- package/skills/ct-task-executor/SKILL.md +296 -0
- package/skills/ct-validator/SKILL.md +216 -0
- package/skills/manifest.json +469 -0
- package/skills.json +281 -0
@@ -0,0 +1,276 @@
# Post-hoc Analyzer Agent

Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.

## Role

After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?

## Inputs

You receive these parameters in your prompt:

- **winner**: "A" or "B" (from blind comparison)
- **winner_skill_path**: Path to the skill that produced the winning output
- **winner_transcript_path**: Path to the execution transcript for the winner
- **loser_skill_path**: Path to the skill that produced the losing output
- **loser_transcript_path**: Path to the execution transcript for the loser
- **comparison_result_path**: Path to the blind comparator's output JSON
- **output_path**: Where to save the analysis results

## Process

### Step 1: Read Comparison Result

1. Read the blind comparator's output at comparison_result_path
2. Note the winning side (A or B), the reasoning, and any scores
3. Understand what the comparator valued in the winning output

### Step 2: Read Both Skills

1. Read the winner skill's SKILL.md and key referenced files
2. Read the loser skill's SKILL.md and key referenced files
3. Identify structural differences:
   - Instruction clarity and specificity
   - Script/tool usage patterns
   - Example coverage
   - Edge case handling

### Step 3: Read Both Transcripts

1. Read the winner's transcript
2. Read the loser's transcript
3. Compare execution patterns:
   - How closely did each agent follow its skill's instructions?
   - What tools were used differently?
   - Where did the loser diverge from optimal behavior?
   - Did either encounter errors or make recovery attempts?

### Step 4: Analyze Instruction Following

For each transcript, evaluate:
- Did the agent follow the skill's explicit instructions?
- Did the agent use the skill's provided tools/scripts?
- Were there missed opportunities to leverage skill content?
- Did the agent add unnecessary steps not in the skill?

Score instruction following 1-10 and note specific issues.

### Step 5: Identify Winner Strengths

Determine what made the winner better:
- Clearer instructions that led to better behavior?
- Better scripts/tools that produced better output?
- More comprehensive examples that guided edge cases?
- Better error handling guidance?

Be specific. Quote from skills/transcripts where relevant.

### Step 6: Identify Loser Weaknesses

Determine what held the loser back:
- Ambiguous instructions that led to suboptimal choices?
- Missing tools/scripts that forced workarounds?
- Gaps in edge case coverage?
- Poor error handling that caused failures?

### Step 7: Generate Improvement Suggestions

Based on the analysis, produce actionable suggestions for improving the loser skill:
- Specific instruction changes to make
- Tools/scripts to add or modify
- Examples to include
- Edge cases to address

Prioritize by impact. Focus on changes that would have changed the outcome.

### Step 8: Write Analysis Results

Save structured analysis to `{output_path}`.

## Output Format

Write a JSON file with this structure:

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner/skill",
    "loser_skill": "path/to/loser/skill",
    "comparator_reasoning": "Brief summary of why comparator chose winner"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for handling multi-page documents",
    "Included validation script that caught formatting errors",
    "Explicit guidance on fallback behavior when OCR fails"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
    "No script for validation, agent had to improvise and made errors",
    "No guidance on OCR failure, agent gave up instead of trying alternatives"
  ],
  "instruction_following": {
    "winner": {
      "score": 9,
      "issues": [
        "Minor: skipped optional logging step"
      ]
    },
    "loser": {
      "score": 6,
      "issues": [
        "Did not use the skill's formatting template",
        "Invented own approach instead of following step 3",
        "Missed the 'always validate output' instruction"
      ]
    }
  },
  "improvement_suggestions": [
    {
      "priority": "high",
      "category": "instructions",
      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
    },
    {
      "priority": "high",
      "category": "tools",
      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
      "expected_impact": "Would catch formatting errors before final output"
    },
    {
      "priority": "medium",
      "category": "error_handling",
      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
      "expected_impact": "Would prevent early failure on difficult documents"
    }
  ],
  "transcript_insights": {
    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
  }
}
```
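
A file with this structure can be sanity-checked before it is saved. The sketch below is illustrative only: `check_analysis` is a hypothetical helper, it assumes all six top-level keys from the example are required, and it takes the allowed `priority` and `category` values from the lists given in this document. references/schemas.md remains the authoritative schema.

```python
# Hypothetical checker for the analysis JSON; not part of the package.
REQUIRED_KEYS = (
    "comparison_summary",
    "winner_strengths",
    "loser_weaknesses",
    "instruction_following",
    "improvement_suggestions",
    "transcript_insights",
)
ALLOWED_PRIORITIES = {"high", "medium", "low"}
ALLOWED_CATEGORIES = {
    "instructions", "tools", "examples",
    "error_handling", "structure", "references",
}


def check_analysis(analysis: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Assumption: every top-level key shown in the example must be present.
    for key in REQUIRED_KEYS:
        if key not in analysis:
            problems.append(f"missing top-level key: {key}")

    # Instruction-following scores are on a 1-10 scale.
    following = analysis.get("instruction_following", {})
    for side in ("winner", "loser"):
        score = following.get(side, {}).get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 10:
            problems.append(f"instruction_following.{side}.score must be 1-10")

    # Each suggestion needs a known priority and category.
    for i, item in enumerate(analysis.get("improvement_suggestions", [])):
        if item.get("priority") not in ALLOWED_PRIORITIES:
            problems.append(f"improvement_suggestions[{i}]: unknown priority")
        if item.get("category") not in ALLOWED_CATEGORIES:
            problems.append(f"improvement_suggestions[{i}]: unknown category")

    return problems
```

Returning a list of problems rather than raising on the first one lets the analyzer report every issue in a single pass.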

## Guidelines

- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
- **Be actionable**: Suggestions should be concrete changes, not vague advice
- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
- **Prioritize by impact**: Which changes would most likely have changed the outcome?
- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
- **Stay objective**: Analyze what happened, don't editorialize
- **Think about generalization**: Would this improvement help on other evals too?

## Categories for Suggestions

Use these categories to organize improvement suggestions:

| Category | Description |
|----------|-------------|
| `instructions` | Changes to the skill's prose instructions |
| `tools` | Scripts, templates, or utilities to add/modify |
| `examples` | Example inputs/outputs to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | External docs or resources to add |

## Priority Levels

- **high**: Would likely change the outcome of this comparison
- **medium**: Would improve quality but may not change win/loss
- **low**: Nice to have, marginal improvement
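
Ordering suggestions by these priority levels can be sketched as follows. `sort_suggestions` is a hypothetical helper, and placing unrecognized priorities last is an assumption rather than documented behavior.

```python
# Hypothetical helper: order improvement suggestions high -> medium -> low.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def sort_suggestions(suggestions: list[dict]) -> list[dict]:
    """Return a new list sorted by priority.

    sorted() is stable, so suggestions that share a priority keep their
    original relative order. Unknown priorities sort after "low".
    """
    return sorted(
        suggestions,
        key=lambda item: PRIORITY_ORDER.get(item.get("priority"), len(PRIORITY_ORDER)),
    )
```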

---

# Analyzing Benchmark Results

When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.

## Role

Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.

## Inputs

You receive these parameters in your prompt:

- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
- **skill_path**: Path to the skill being benchmarked
- **output_path**: Where to save the notes (as JSON array of strings)

## Process

### Step 1: Read Benchmark Data

1. Read the benchmark.json containing all run results
2. Note the configurations tested (with_skill, without_skill)
3. Understand the run_summary aggregates already calculated

### Step 2: Analyze Per-Assertion Patterns

For each expectation across all runs:
- Does it **always pass** in both configurations? (may not differentiate skill value)
- Does it **always fail** in both configurations? (may be broken or beyond capability)
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
- Does it **always fail with skill but pass without**? (skill may be hurting)
- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
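
The five cases above can be expressed as a small classifier. This is a sketch under stated assumptions: `classify_expectation` and its labels are hypothetical, the inputs are plain lists of pass/fail booleans (one per run) rather than the real benchmark.json layout, and any expectation with mixed results in either configuration is treated as variable.

```python
# Hypothetical classifier for one expectation across repeated runs.
def _summarize(results: list[bool]) -> str:
    """Collapse per-run results into 'pass', 'fail', or 'mixed'."""
    if not results:
        raise ValueError("need at least one run per configuration")
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "mixed"


def classify_expectation(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Map pass/fail histories onto the five patterns listed above."""
    w = _summarize(with_skill)
    wo = _summarize(without_skill)
    if "mixed" in (w, wo):
        return "variable"            # flaky or non-deterministic
    if w == "pass" and wo == "pass":
        return "always_pass"         # may not differentiate skill value
    if w == "fail" and wo == "fail":
        return "always_fail"         # may be broken or beyond capability
    if w == "pass":
        return "skill_adds_value"    # passes with skill, fails without
    return "skill_may_hurt"          # fails with skill, passes without
```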

### Step 3: Analyze Cross-Eval Patterns

Look for patterns across evals:
- Are certain eval types consistently harder/easier?
- Do some evals show high variance while others are stable?
- Are there surprising results that contradict expectations?

### Step 4: Analyze Metrics Patterns

Look at time_seconds, tokens, tool_calls:
- Does the skill significantly increase execution time?
- Is there high variance in resource usage?
- Are there outlier runs that skew the aggregates?
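
One possible way to spot outlier runs in a metric such as time_seconds is the modified z-score based on the median absolute deviation (MAD). The source does not prescribe any outlier rule, so the technique, the 3.5 threshold, and the `outlier_indices` name are all assumptions made for illustration.

```python
from statistics import median


def outlier_indices(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of runs whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD. The median and MAD are
    used instead of the mean and standard deviation because a single extreme
    run distorts them far less.
    """
    if not values:
        return []
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half the runs are identical, so the score is undefined.
        # Fall back to flagging any run that differs from the median.
        return [i for i, v in enumerate(values) if v != center]
    return [
        i for i, dev in enumerate(deviations)
        if 0.6745 * dev / mad > threshold
    ]
```

With only a handful of runs per configuration, a flagged run is a prompt to inspect that run's transcript, not proof that it is anomalous.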

### Step 5: Generate Notes

Write freeform observations as a list of strings. Each note should:
- State a specific observation
- Be grounded in the data (not speculation)
- Help the user understand something the aggregate metrics don't show

Examples:
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
- "Skill adds 13s average execution time but improves pass rate by 50%"
- "Token usage is 80% higher with skill, primarily due to script output parsing"
- "All 3 without-skill runs for eval 1 produced empty output"

### Step 6: Write Notes

Save notes to `{output_path}` as a JSON array of strings:

```json
[
  "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
  "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
  "Without-skill runs consistently fail on table extraction expectations",
  "Skill adds 13s average execution time but improves pass rate by 50%"
]
```
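
Writing the notes file might look like the sketch below. `serialize_notes` and `write_notes` are hypothetical helpers; the only requirement taken from the text above is that the file holds a JSON array of strings.

```python
import json
from pathlib import Path


def serialize_notes(notes: list[str]) -> str:
    """Serialize notes as a JSON array of strings."""
    if not all(isinstance(note, str) for note in notes):
        raise TypeError("every note must be a string")
    # ensure_ascii=False keeps characters such as the plus-minus sign readable.
    return json.dumps(list(notes), indent=2, ensure_ascii=False) + "\n"


def write_notes(notes: list[str], output_path: str) -> None:
    """Write the serialized notes to output_path, creating parent directories."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(serialize_notes(notes), encoding="utf-8")
```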

## Guidelines

**DO:**
- Report what you observe in the data
- Be specific about which evals, expectations, or runs you're referring to
- Note patterns that aggregate metrics would hide
- Provide context that helps interpret the numbers

**DO NOT:**
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- Make subjective quality judgments ("the output was good/bad")
- Speculate about causes without evidence
- Repeat information already in the run_summary aggregates

See [references/schemas.md](../references/schemas.md) for the complete analysis.json and benchmark.json schema definitions.

@@ -0,0 +1,204 @@
# Blind Comparator Agent

Compare two outputs WITHOUT knowing which skill produced them.

## Role

The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.

Your judgment is based purely on output quality and task completion.

## Inputs

You receive these parameters in your prompt:

- **output_a_path**: Path to the first output file or directory
- **output_b_path**: Path to the second output file or directory
- **eval_prompt**: The original task/prompt that was executed
- **expectations**: List of expectations to check (optional - may be empty)

## Process

### Step 1: Read Both Outputs

1. Examine output A (file or directory)
2. Examine output B (file or directory)
3. Note the type, structure, and content of each
4. If outputs are directories, examine all relevant files inside

### Step 2: Understand the Task

1. Read the eval_prompt carefully
2. Identify what the task requires:
   - What should be produced?
   - What qualities matter (accuracy, completeness, format)?
   - What would distinguish a good output from a poor one?

### Step 3: Generate Evaluation Rubric

Based on the task, generate a rubric with two dimensions:

**Content Rubric** (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |

**Structure Rubric** (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |

Adapt criteria to the specific task. For example:
- PDF form → "Field alignment", "Text readability", "Data placement"
- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
- Data output → "Schema correctness", "Data types", "Completeness"

### Step 4: Evaluate Each Output Against the Rubric

For each output (A and B):

1. **Score each criterion** on the rubric (1-5 scale)
2. **Calculate dimension scores**: Content score and Structure score, each the average of that dimension's criteria (1-5)
3. **Calculate overall score**: Average of dimension scores, scaled to 1-10
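
The scoring arithmetic can be sketched as below. The source does not spell out the scale factor or the rounding order, so two assumptions are made: the 1-5 average is doubled to reach the 10-point scale, and the overall score is computed from the dimension scores after they are rounded to one decimal. Both choices reproduce the sample values in the Output Format section (9.0 and 5.4). `score_output` is a hypothetical helper.

```python
# Hypothetical scoring helper for one output's rubric.
def dimension_score(criteria: dict[str, int]) -> float:
    """Average the 1-5 criterion scores of one dimension, to one decimal."""
    if not criteria:
        raise ValueError("a dimension needs at least one criterion")
    return round(sum(criteria.values()) / len(criteria), 1)


def score_output(content: dict[str, int], structure: dict[str, int]) -> dict[str, float]:
    """Compute content, structure, and overall scores for one output."""
    content_score = dimension_score(content)
    structure_score = dimension_score(structure)
    # Assumption: average the two dimension scores, then double the result
    # to move from the 5-point scale to the 10-point scale.
    overall = round((content_score + structure_score) / 2 * 2, 1)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": overall,
    }
```

Because each dimension averages at least 1, doubling gives an overall score between 2 and 10 in practice.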

### Step 5: Check Assertions (if provided)

If expectations are provided:

1. Check each expectation against output A
2. Check each expectation against output B
3. Count pass rates for each output
4. Use expectation scores as secondary evidence (not the primary decision factor)

### Step 6: Determine the Winner

Compare A and B based on (in priority order):

1. **Primary**: Overall rubric score (content + structure)
2. **Secondary**: Assertion pass rates (if applicable)
3. **Tiebreaker**: If truly equal, declare a TIE

Be decisive - ties should be rare. One output is usually better, even if marginally.

### Step 7: Write Comparison Results

Save results to a JSON file at the path specified (or `comparison.json` if not specified).

## Output Format

Write a JSON file with this structure:

```json
{
  "winner": "A",
  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
  "rubric": {
    "A": {
      "content": {
        "correctness": 5,
        "completeness": 5,
        "accuracy": 4
      },
      "structure": {
        "organization": 4,
        "formatting": 5,
        "usability": 4
      },
      "content_score": 4.7,
      "structure_score": 4.3,
      "overall_score": 9.0
    },
    "B": {
      "content": {
        "correctness": 3,
        "completeness": 2,
        "accuracy": 3
      },
      "structure": {
        "organization": 3,
        "formatting": 2,
        "usability": 3
      },
      "content_score": 2.7,
      "structure_score": 2.7,
      "overall_score": 5.4
    }
  },
  "output_quality": {
    "A": {
      "score": 9,
      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
      "weaknesses": ["Minor style inconsistency in header"]
    },
    "B": {
      "score": 5,
      "strengths": ["Readable output", "Correct basic structure"],
      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
    }
  },
  "expectation_results": {
    "A": {
      "passed": 4,
      "total": 5,
      "pass_rate": 0.80,
      "details": [
        {"text": "Output includes name", "passed": true},
        {"text": "Output includes date", "passed": true},
        {"text": "Format is PDF", "passed": true},
        {"text": "Contains signature", "passed": false},
        {"text": "Readable text", "passed": true}
      ]
    },
    "B": {
      "passed": 3,
      "total": 5,
      "pass_rate": 0.60,
      "details": [
        {"text": "Output includes name", "passed": true},
        {"text": "Output includes date", "passed": false},
        {"text": "Format is PDF", "passed": true},
        {"text": "Contains signature", "passed": false},
        {"text": "Readable text", "passed": true}
      ]
    }
  }
}
```

If no expectations were provided, omit the `expectation_results` field entirely.

## Field Descriptions

- **winner**: "A", "B", or "TIE"
- **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
- **rubric**: Structured rubric evaluation for each output
  - **content**: Scores for content criteria (correctness, completeness, accuracy)
  - **structure**: Scores for structure criteria (organization, formatting, usability)
  - **content_score**: Average of content criteria (1-5)
  - **structure_score**: Average of structure criteria (1-5)
  - **overall_score**: Combined score scaled to 1-10
- **output_quality**: Summary quality assessment
  - **score**: 1-10 rating (should match rubric overall_score)
  - **strengths**: List of positive aspects
  - **weaknesses**: List of issues or shortcomings
- **expectation_results**: (Only if expectations provided)
  - **passed**: Number of expectations that passed
  - **total**: Total number of expectations
  - **pass_rate**: Fraction passed (0.0 to 1.0)
  - **details**: Individual expectation results

## Guidelines

- **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
- **Be specific**: Cite specific examples when explaining strengths and weaknesses.
- **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
- **Output quality first**: Assertion scores are secondary to overall task completion.
- **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
- **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
- **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.

See [references/schemas.md](../references/schemas.md) for the complete comparison.json schema definition.