codingbuddy-rules 4.4.0 → 5.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (122)
  1. package/.ai-rules/adapters/antigravity.md +6 -6
  2. package/.ai-rules/adapters/claude-code.md +107 -4
  3. package/.ai-rules/adapters/codex.md +5 -5
  4. package/.ai-rules/adapters/cursor.md +2 -2
  5. package/.ai-rules/adapters/kiro.md +8 -8
  6. package/.ai-rules/adapters/opencode.md +7 -7
  7. package/.ai-rules/adapters/q.md +2 -2
  8. package/.ai-rules/agents/README.md +66 -16
  9. package/.ai-rules/agents/accessibility-specialist.json +2 -1
  10. package/.ai-rules/agents/act-mode.json +2 -1
  11. package/.ai-rules/agents/agent-architect.json +8 -7
  12. package/.ai-rules/agents/ai-ml-engineer.json +1 -0
  13. package/.ai-rules/agents/architecture-specialist.json +1 -0
  14. package/.ai-rules/agents/auto-mode.json +4 -2
  15. package/.ai-rules/agents/backend-developer.json +1 -0
  16. package/.ai-rules/agents/code-quality-specialist.json +1 -0
  17. package/.ai-rules/agents/code-reviewer.json +65 -64
  18. package/.ai-rules/agents/data-engineer.json +8 -7
  19. package/.ai-rules/agents/data-scientist.json +10 -9
  20. package/.ai-rules/agents/devops-engineer.json +1 -0
  21. package/.ai-rules/agents/documentation-specialist.json +1 -0
  22. package/.ai-rules/agents/eval-mode.json +20 -19
  23. package/.ai-rules/agents/event-architecture-specialist.json +1 -0
  24. package/.ai-rules/agents/frontend-developer.json +1 -0
  25. package/.ai-rules/agents/i18n-specialist.json +2 -1
  26. package/.ai-rules/agents/integration-specialist.json +1 -0
  27. package/.ai-rules/agents/migration-specialist.json +1 -0
  28. package/.ai-rules/agents/mobile-developer.json +8 -7
  29. package/.ai-rules/agents/observability-specialist.json +1 -0
  30. package/.ai-rules/agents/parallel-orchestrator.json +346 -0
  31. package/.ai-rules/agents/performance-specialist.json +1 -0
  32. package/.ai-rules/agents/plan-mode.json +3 -1
  33. package/.ai-rules/agents/plan-reviewer.json +208 -0
  34. package/.ai-rules/agents/platform-engineer.json +1 -0
  35. package/.ai-rules/agents/security-engineer.json +9 -8
  36. package/.ai-rules/agents/security-specialist.json +2 -1
  37. package/.ai-rules/agents/seo-specialist.json +1 -0
  38. package/.ai-rules/agents/software-engineer.json +1 -0
  39. package/.ai-rules/agents/solution-architect.json +11 -10
  40. package/.ai-rules/agents/systems-developer.json +9 -8
  41. package/.ai-rules/agents/technical-planner.json +11 -10
  42. package/.ai-rules/agents/test-engineer.json +7 -6
  43. package/.ai-rules/agents/test-strategy-specialist.json +1 -0
  44. package/.ai-rules/agents/tooling-engineer.json +4 -3
  45. package/.ai-rules/agents/ui-ux-designer.json +1 -0
  46. package/.ai-rules/keyword-modes.json +4 -4
  47. package/.ai-rules/rules/clarification-guide.md +14 -14
  48. package/.ai-rules/rules/core.md +90 -1
  49. package/.ai-rules/rules/parallel-execution.md +217 -0
  50. package/.ai-rules/skills/README.md +23 -1
  51. package/.ai-rules/skills/agent-design/SKILL.md +5 -0
  52. package/.ai-rules/skills/agent-design/examples/agent-template.json +58 -0
  53. package/.ai-rules/skills/agent-design/references/expertise-guidelines.md +112 -0
  54. package/.ai-rules/skills/agent-discussion/SKILL.md +199 -0
  55. package/.ai-rules/skills/agent-discussion-panel/SKILL.md +448 -0
  56. package/.ai-rules/skills/api-design/SKILL.md +5 -0
  57. package/.ai-rules/skills/api-design/examples/error-response.json +159 -0
  58. package/.ai-rules/skills/api-design/examples/openapi-template.yaml +393 -0
  59. package/.ai-rules/skills/build-fix/SKILL.md +234 -0
  60. package/.ai-rules/skills/code-explanation/SKILL.md +4 -0
  61. package/.ai-rules/skills/context-management/SKILL.md +1 -0
  62. package/.ai-rules/skills/cost-budget/SKILL.md +348 -0
  63. package/.ai-rules/skills/cross-repo-issues/SKILL.md +257 -0
  64. package/.ai-rules/skills/database-migration/SKILL.md +1 -0
  65. package/.ai-rules/skills/deepsearch/SKILL.md +214 -0
  66. package/.ai-rules/skills/deployment-checklist/SKILL.md +1 -0
  67. package/.ai-rules/skills/error-analysis/SKILL.md +1 -0
  68. package/.ai-rules/skills/finishing-a-development-branch/SKILL.md +281 -0
  69. package/.ai-rules/skills/frontend-design/SKILL.md +5 -0
  70. package/.ai-rules/skills/frontend-design/examples/component-template.tsx +203 -0
  71. package/.ai-rules/skills/frontend-design/references/css-patterns.md +243 -0
  72. package/.ai-rules/skills/git-master/SKILL.md +358 -0
  73. package/.ai-rules/skills/incident-response/SKILL.md +1 -0
  74. package/.ai-rules/skills/legacy-modernization/SKILL.md +1 -0
  75. package/.ai-rules/skills/mcp-builder/SKILL.md +7 -0
  76. package/.ai-rules/skills/mcp-builder/examples/resource-example.ts +233 -0
  77. package/.ai-rules/skills/mcp-builder/examples/tool-example.ts +203 -0
  78. package/.ai-rules/skills/mcp-builder/references/protocol-spec.md +215 -0
  79. package/.ai-rules/skills/performance-optimization/SKILL.md +3 -0
  80. package/.ai-rules/skills/plan-and-review/SKILL.md +115 -0
  81. package/.ai-rules/skills/pr-all-in-one/SKILL.md +15 -13
  82. package/.ai-rules/skills/pr-all-in-one/configuration-guide.md +7 -7
  83. package/.ai-rules/skills/pr-all-in-one/pr-templates.md +10 -10
  84. package/.ai-rules/skills/pr-review/SKILL.md +4 -0
  85. package/.ai-rules/skills/receiving-code-review/SKILL.md +347 -0
  86. package/.ai-rules/skills/refactoring/SKILL.md +1 -0
  87. package/.ai-rules/skills/requesting-code-review/SKILL.md +348 -0
  88. package/.ai-rules/skills/rule-authoring/SKILL.md +5 -0
  89. package/.ai-rules/skills/rule-authoring/examples/rule-template.md +142 -0
  90. package/.ai-rules/skills/rule-authoring/examples/trigger-patterns.md +126 -0
  91. package/.ai-rules/skills/security-audit/SKILL.md +4 -0
  92. package/.ai-rules/skills/skill-creator/SKILL.md +461 -0
  93. package/.ai-rules/skills/skill-creator/agents/analyzer.md +206 -0
  94. package/.ai-rules/skills/skill-creator/agents/comparator.md +167 -0
  95. package/.ai-rules/skills/skill-creator/agents/grader.md +152 -0
  96. package/.ai-rules/skills/skill-creator/assets/eval_review.html +289 -0
  97. package/.ai-rules/skills/skill-creator/assets/skill-template.md +43 -0
  98. package/.ai-rules/skills/skill-creator/eval-viewer/generate_review.py +496 -0
  99. package/.ai-rules/skills/skill-creator/references/frontmatter-guide.md +632 -0
  100. package/.ai-rules/skills/skill-creator/references/multi-tool-compat.md +480 -0
  101. package/.ai-rules/skills/skill-creator/references/schemas.md +784 -0
  102. package/.ai-rules/skills/skill-creator/scripts/aggregate_benchmark.py +302 -0
  103. package/.ai-rules/skills/skill-creator/scripts/init_skill.sh +196 -0
  104. package/.ai-rules/skills/skill-creator/scripts/run_loop.py +327 -0
  105. package/.ai-rules/skills/systematic-debugging/SKILL.md +1 -0
  106. package/.ai-rules/skills/tech-debt/SKILL.md +1 -0
  107. package/.ai-rules/skills/test-coverage-gate/SKILL.md +303 -0
  108. package/.ai-rules/skills/tmux-master/SKILL.md +491 -0
  109. package/.ai-rules/skills/using-git-worktrees/SKILL.md +368 -0
  110. package/.ai-rules/skills/verification-before-completion/SKILL.md +234 -0
  111. package/.ai-rules/skills/widget-slot-architecture/SKILL.md +6 -0
  112. package/.ai-rules/skills/widget-slot-architecture/examples/parallel-route-setup.tsx +206 -0
  113. package/.ai-rules/skills/widget-slot-architecture/examples/widget-component.tsx +250 -0
  114. package/.ai-rules/skills/writing-plans/SKILL.md +78 -0
  115. package/bin/cli.js +178 -0
  116. package/lib/init/detect-stack.js +148 -0
  117. package/lib/init/generate-config.js +31 -0
  118. package/lib/init/index.js +86 -0
  119. package/lib/init/prompt.js +60 -0
  120. package/lib/init/scaffold.js +67 -0
  121. package/lib/init/suggest-agent.js +46 -0
  122. package/package.json +10 -2
@@ -0,0 +1,461 @@
---
name: skill-creator
description: >-
  Create new skills, modify and improve existing skills,
  and measure skill performance with eval pipeline.
  Use when creating a skill from scratch, editing or optimizing
  an existing skill, running evals to test a skill,
  or benchmarking skill performance.
disable-model-invocation: true
argument-hint: [create|eval|improve|benchmark] [skill-name]
---

# Skill Creator

## Overview

Skills are reusable workflows that encode expert processes into repeatable instructions. A well-crafted skill transforms inconsistent ad-hoc work into systematic, verifiable outcomes across any AI tool.

**Core principle:** A skill must change behavior. If an AI assistant produces the same output with and without the skill loaded, the skill has failed.

**Iron Law:**
```
EVERY SKILL MUST HAVE A MEASURABLE "DID BEHAVIOR CHANGE?" TEST
No eval = no confidence. Ship nothing you haven't measured.
```

## Modes

| Mode | Purpose | Input | Output |
|------|---------|-------|--------|
| **Create** | Build a new skill from scratch | Intent or problem statement | `SKILL.md` + scaffold |
| **Eval** | Test skill effectiveness | Skill + test cases | Graded scorecard |
| **Improve** | Refine based on eval results | Skill + eval data | Improved `SKILL.md` |
| **Benchmark** | Compare performance metrics | Skill + baseline | Performance report |

## When to Use

- Creating a new skill for `.ai-rules/skills/`
- Testing whether an existing skill produces correct behavior
- Optimizing a skill that underperforms on edge cases
- Comparing skill versions to select the best one
- Measuring skill quality before shipping

## When NOT to Use

- Writing one-off instructions (not reusable = not a skill)
- Creating rules (use `rule-authoring` skill)
- Designing agents (use `agent-design` skill)
- Process is too simple to warrant a workflow (< 3 steps)

---

## Create Mode

**Trigger:** `skill-creator create <skill-name>`

### Phase 1: Intent — Define What the Skill Does

Answer before writing anything:

```
1. What SPECIFIC problem does this skill solve?
   Bad: "helps with testing"
   Good: "enforces Red-Green-Refactor TDD cycle with mandatory verification"

2. What behavior change should loading this skill cause?
   Without: "AI writes implementation first, tests after"
   With: "AI writes failing test first, verifies failure, then implements"

3. Who consumes this skill?
   Which AI tools? (Claude Code, Cursor, Codex, Q, Kiro)
   What user skill level?

4. What is the boundary?
   What it handles vs. what it delegates to other skills
   Name 2-3 skills it does NOT overlap with
```

### Phase 2: Interview — Gather Domain Knowledge

Collect the expertise the skill will encode:

```
For each major workflow step:
  1. What is the step?
  2. What is the expected input/output?
  3. What are the most common mistakes?
  4. How do you verify correctness?
  5. What red flags should halt progress?
```

**Sources to consult:**
- Existing codebase patterns (search for conventions)
- Project documentation and ADRs
- Domain experts (ask the user)
- Related skills (check for reusable patterns)

### Phase 3: Write — Author the SKILL.md

**Required structure:**

```markdown
---
name: skill-name
description: "Use when... (max 500 chars)"
[optional frontmatter fields]
---

# Skill Title

## Overview
2-3 sentences. Core principle. Iron Law.

## When to Use
Bullet list of trigger scenarios.

## When NOT to Use (if applicable)

## Process / Phases
The actual workflow steps.

## Verification Checklist

## Red Flags — STOP
Table of rationalizations vs. reality.
```

**Writing rules:**

| Rule | Why |
|------|-----|
| Imperative mood ("Write the test") | Direct instructions produce consistent behavior |
| Concrete examples over abstractions | AI tools follow examples more reliably than rules |
| Good/Bad comparisons for ambiguous steps | Eliminates interpretation variance across tools |
| One responsibility per phase | Multi-purpose phases get half-completed |
| Max 500 lines total | Longer skills get truncated or ignored |

### Phase 4: Scaffold — Create Supporting Files

```bash
mkdir -p packages/rules/.ai-rules/skills/<skill-name>
# Write SKILL.md (from Phase 3)
# Optional supporting references:
#   <skill-name>/reference-guide.md
#   <skill-name>/examples.md
```

**Frontmatter field reference:**

| Field | Required | Description |
|-------|----------|-------------|
| `name` | Yes | `^[a-z0-9-]+$`, matches directory name |
| `description` | Yes | 1-500 chars, start with "Use when..." |
| `disable-model-invocation` | No | `true` if skill handles its own execution flow |
| `argument-hint` | No | Usage hint shown in skill listings |
| `allowed-tools` | No | Restrict available tools during execution |
| `context` | No | `fork` to run in isolated context |
| `agent` | No | Agent to activate with the skill |

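The field rules above can be sketched as a small validation check. This is an illustrative sketch, not a script shipped in the package; `validate_frontmatter` and its signature are assumptions.

```python
import re

def validate_frontmatter(meta: dict, dir_name: str) -> list[str]:
    """Return violations of the frontmatter rules above (empty list = valid)."""
    errors = []
    name = meta.get("name", "")
    if not re.fullmatch(r"[a-z0-9-]+", name):
        errors.append("name must match ^[a-z0-9-]+$")
    if name != dir_name:
        errors.append("name must match the directory name")
    desc = meta.get("description", "")
    if not 1 <= len(desc) <= 500:
        errors.append("description must be 1-500 chars")
    if not desc.startswith("Use when"):
        errors.append('description should start with "Use when..."')
    return errors

# A conforming skill passes with no errors:
print(validate_frontmatter(
    {"name": "skill-creator", "description": "Use when creating a skill."},
    "skill-creator"))  # → []
```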
### Phase 5: Test — Verify the Skill Works

```
- [ ] Frontmatter validates (name matches directory, description <= 500 chars)
- [ ] Skill loads without error in target tool (list_skills / get_skill)
- [ ] Following the skill produces different behavior than without it
- [ ] Every phase has a verifiable output
- [ ] Red flags table covers top 3 rationalizations for skipping
- [ ] No overlap with existing skills (check skills/README.md)
- [ ] Multi-tool compatible (no tool-specific syntax in core workflow)
```

---

## Eval Mode

**Trigger:** `skill-creator eval <skill-name>`

Measure whether a skill produces the intended behavior change.

### Phase 1: Define — Write Test Scenarios

Create scenarios exercising the skill's key behaviors:

```
Scenario: [descriptive name]
  Given: [initial state / context]
  When: [skill is applied with this input]
  Then: [expected behavior / output]
  Anti-pattern: [what happens WITHOUT the skill]
```

**Minimum scenarios:**
- 1 happy path (standard use case)
- 1 edge case (unusual but valid input)
- 1 adversarial case (input that tempts skipping the skill)

### Phase 2: Spawn — Execute Test Scenarios

Run each scenario against the skill:

```
For each scenario:
  1. Load the skill content
  2. Present the scenario input
  3. Capture the AI's response
  4. Save response for grading
```

**Execution options:**
- **Manual:** Paste skill + scenario, capture response
- **Automated:** Use subagent with skill loaded, capture output
- **Parallel:** Run via dispatching-parallel-agents skill

### Phase 3: Assert — Check Expected Behavior

Grade each response:

```
PASS: Response follows skill workflow
PARTIAL: Some steps followed, others skipped
FAIL: Skill ignored or wrong behavior produced
```

Note which specific steps were followed/skipped.

### Phase 4: Grade — Assign Severity

| Severity | Definition | Action |
|----------|-----------|--------|
| **Critical** | Skill completely ignored | Must fix before shipping |
| **High** | Key phase skipped or verification missing | Must fix before shipping |
| **Medium** | Minor step deviation, output still usable | Fix in next iteration |
| **Low** | Style/formatting difference, behavior correct | Optional fix |

### Phase 5: Aggregate — Summarize Results

```
Skill: [name]
Scenarios: [total] | Pass: [n] | Partial: [n] | Fail: [n]

Critical: [count] High: [count] Medium: [count] Low: [count]

Verdict:
  SHIP = Critical=0 AND High=0
  ITERATE = Critical=0, High>0
  REWRITE = Critical>0
```

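The verdict rules above reduce to a short decision function. A minimal sketch (the name `verdict` is illustrative); note that Critical is checked first, so a skill with both Critical and High issues is a REWRITE:

```python
def verdict(critical: int, high: int) -> str:
    """Map issue counts to the shipping decision defined above."""
    if critical > 0:
        return "REWRITE"
    return "ITERATE" if high > 0 else "SHIP"

print(verdict(0, 0), verdict(0, 2), verdict(1, 0))  # → SHIP ITERATE REWRITE
```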
### Phase 6: View — Present Findings

```markdown
## Eval Report: [skill-name]

### Summary
[Aggregate from Phase 5]

### Scenario Results
| Scenario | Result | Issues |
|----------|--------|--------|
| ... | PASS/PARTIAL/FAIL | ... |

### Recommendations
1. [Fix for highest-severity issue]
2. [Next fix]
```

---

## Improve Mode

**Trigger:** `skill-creator improve <skill-name>`

Refine an existing skill based on eval data or observed behavior gaps.

### Phase 1: Read — Understand Current State

```
1. Read the current SKILL.md
2. Read eval results (if available)
3. Identify the gap:
   - Which phases are being skipped?
   - Which instructions are ambiguous?
   - Where does behavior diverge across AI tools?
```

### Phase 2: Generalize — Find Patterns in Failures

```
Look for systemic issues:
- Same step skipped across scenarios → Step is unclear or seems optional
- Different behavior per AI tool → Instructions use tool-specific syntax
- Partial compliance → Steps too large, need decomposition
- Complete skip → Trigger conditions don't match use case
```

### Phase 3: Apply — Make Targeted Changes

| Issue Type | Fix Strategy |
|-----------|-------------|
| Step skipped | Add "MANDATORY" marker + red flag for skipping |
| Ambiguous instruction | Replace with concrete Good/Bad example |
| Tool-specific behavior | Remove tool-specific syntax, use universal patterns |
| Steps too large | Decompose into sub-steps with verification |
| Missing edge case | Add scenario to When to Use section |

**Rules for changes:**
- One targeted change per identified issue
- Do not rewrite the entire skill (preserve what works)
- Add examples where instructions were misinterpreted
- Strengthen red flags for commonly skipped steps

### Phase 4: Re-run — Eval the Improved Version

Run the same eval scenarios against the modified skill:

```
Compare original vs. improved:
- Did the targeted fix resolve the issue?
- Did the fix introduce new issues?
- Is overall pass rate higher?
```

### Phase 5: Compare — Side-by-Side Analysis

```markdown
| Metric | Before | After | Delta |
|--------|--------|-------|-------|
| Pass rate | X% | Y% | +/-Z% |
| Critical issues | N | M | +/-D |
| High issues | N | M | +/-D |
| Avg. steps followed | X/Y | X/Y | +/-D |
```

### Phase 6: Analyze — Document Learnings

```
1. Which fix strategy was most effective?
2. Which issues persisted despite changes?
3. Are there structural problems requiring a rewrite?
4. What patterns should inform future skill authoring?
```

---

## Benchmark Mode

**Trigger:** `skill-creator benchmark <skill-name>`

Measure skill performance across dimensions and optimize weak spots.

### Phase 1: Generate — Create Benchmark Suite

Design a comprehensive test suite:

```
Dimensions:
1. Compliance: Does the AI follow every step?
2. Consistency: Same input → same behavior across runs?
3. Portability: Works across Claude Code, Cursor, Codex, Q, Kiro?
4. Robustness: Handles edge cases without breaking?
5. Efficiency: Does the skill add unnecessary overhead?
```

- Minimum 2 cases per dimension
- Mix of simple and complex inputs
- At least 1 adversarial case per dimension

### Phase 2: Review — Analyze Benchmark Results

```
For each dimension:
  Score: [0-100]
  Weakest case: [description]
  Root cause: [why this dimension scored low]
```

| Score | Meaning |
|-------|---------|
| 90-100 | Excellent — production-ready |
| 70-89 | Good — minor improvements needed |
| 50-69 | Fair — significant gaps in at least one dimension |
| 0-49 | Poor — needs major rework |

### Phase 3: Optimize — Target Weak Dimensions

Focus on the lowest-scoring dimension first:

```
For each dimension scoring < 70:
  1. Identify the specific instruction causing the gap
  2. Apply the appropriate fix from Improve Mode Phase 3
  3. Re-run that dimension's cases only
  4. Verify improvement without regression in other dimensions
```

### Phase 4: Apply — Finalize and Document

```
1. Update SKILL.md with optimized content
2. Update skills/README.md if category or description changed
3. Record benchmark baseline:

   Benchmark: [skill-name] @ [date]
   Compliance: [score]
   Consistency: [score]
   Portability: [score]
   Robustness: [score]
   Efficiency: [score]
   Overall: [weighted average]
```

---

## Additional Resources

### Related Skills

| Skill | Relationship |
|-------|-------------|
| `rule-authoring` | Rules constrain behavior; skills define workflows. Complementary. |
| `agent-design` | Agents are personas; skills are processes. Non-overlapping. |
| `prompt-engineering` | Prompt techniques apply within skill instructions. Supporting. |
| `writing-plans` | Plans are one-time; skills are reusable. Different lifecycle. |

### Agent Support

| Agent | When to Involve |
|-------|----------------|
| Code Quality Specialist | Reviewing skill structure and clarity |
| Test Engineer | Designing eval scenarios |
| Architecture Specialist | Skill decomposition and boundary design |

### Multi-Tool Compatibility

Skills must work across all supported AI tools:

| Tool | How Skills Load | Key Consideration |
|------|----------------|-------------------|
| Claude Code | `get_skill` MCP tool | Full markdown + frontmatter parsed |
| Cursor | `@file` reference | Inline loading, no frontmatter processing |
| Codex / Copilot | `cat` file content | Plain text only, examples critical |
| Amazon Q | `.q/rules/` reference | Rule-style integration |
| Kiro | `.kiro/` reference | Spec-based integration |

**Portability rules:**
- No tool-specific syntax in core workflow
- Examples in generic markdown, not tool-specific blocks
- Phases described as actions, not tool commands
- Test with at least 2 different tools before shipping

## Red Flags — STOP

| Thought | Reality |
|---------|---------|
| "This skill is obvious, no need to eval" | Obvious skills still get ignored. Eval proves they work. |
| "I'll test it manually later" | Manual tests are forgotten. Eval now. |
| "One scenario is enough" | One is anecdote. Three is pattern. |
| "It works in Claude Code, ship it" | Cursor/Codex may ignore the same instructions. Test portability. |
| "Small change, no need to re-eval" | Small changes cause cascading behavior shifts. Re-eval. |
| "The skill is too long but everything is needed" | Max 500 lines. Cut or decompose into reference files. |
| "I'll add examples later" | Skills without examples produce inconsistent behavior. Add now. |
@@ -0,0 +1,206 @@
# Analyzer Agent

An agent that discovers patterns in benchmark results and suggests directions for skill improvement.

## Role

You are a skill evaluation analyst. You comprehensively analyze `benchmark.json` and each eval's `grading.json` results to derive the skill's strengths, weaknesses, improvement directions, and priorities. You perform only data-driven pattern analysis and never make suggestions based on speculation.

## Iron Law

```
Do not report patterns that are not in the data.
Every weakness must have evidence.
Every improvement suggestion must have a measurable goal.
```

## Input

| Item | Source | Description |
|------|--------|-------------|
| **benchmark.json** | `iteration-N/benchmark.json` | Iteration benchmark aggregate results |
| **grading results** | `iteration-N/eval-M/{with_skill\|without_skill}/grading.json` | Individual eval grading results |
| **eval metadata** | `iteration-N/eval-M/{with_skill\|without_skill}/eval_metadata.json` | Evaluation scenario information (optional) |
| **timing data** | `iteration-N/eval-M/{with_skill\|without_skill}/timing.json` | Token/time measurements (optional) |

### benchmark.json Core Structure

```json
{
  "skill_name": "skill-name",
  "iteration": 1,
  "summary": {
    "pass_rate": { "mean": 0.85, "stddev": 0.12 },
    "tokens": { "mean": 42000, "stddev": 5200 },
    "duration_seconds": { "mean": 35.5, "stddev": 8.3 }
  },
  "eval_results": [
    {
      "eval_id": 0,
      "with_skill": { "pass_rate": 0.75, "tokens": 45230, "duration": 32.15 },
      "baseline": { "pass_rate": 0.50, "tokens": 38400, "duration": 28.90 }
    }
  ]
}
```

### grading.json Core Structure

```json
{
  "expectations": [
    {
      "text": "assertion description",
      "passed": true,
      "evidence": "basis for the judgment"
    }
  ]
}
```

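Given the structure above, an eval's pass rate is simply the fraction of expectations with `"passed": true`. A minimal sketch; the sample expectations are invented for illustration:

```python
def pass_rate(grading: dict) -> float:
    """Fraction of expectations that passed in one grading.json."""
    checks = grading["expectations"]
    return sum(1 for e in checks if e["passed"]) / len(checks)

sample = {"expectations": [
    {"text": "writes failing test first", "passed": True, "evidence": "test committed before impl"},
    {"text": "verifies test failure", "passed": False, "evidence": "no failure run found"},
]}
print(pass_rate(sample))  # → 0.5
```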
## Output

An analysis report in markdown format. Must include **all** 4 sections below:

```markdown
# Skill Analysis Report: {skill_name}

## 1. Strengths

Areas where the skill performs well. Each strength includes supporting data.

- **[Strength title]**: [Description] (evidence: [data citation])

## 2. Weaknesses

Areas where the skill falls short. Each weakness includes supporting data and severity.

- **[Weakness title]** [Critical|High|Medium|Low]: [Description] (evidence: [data citation])

## 3. Improvement Suggestions

Specific improvement measures for each weakness. Includes measurable goals.

| # | Linked Weakness | Improvement Measure | Goal | Difficulty |
|---|----------------|---------------------|------|------------|
| 1 | [Weakness title] | [Specific action] | [Measurable goal] | Low/Medium/High |

## 4. Priority

Execution order for improvement suggestions. Based on severity x impact scope x (1 / difficulty).

1. [Highest priority item] — Reason: [basis]
2. [Next item] — Reason: [basis]
```

## Process

### Step 1: Data Collection

```
1. Read benchmark.json → Check summary and eval_results
2. Read each eval's grading.json → Check pass/fail per assertion
3. Read timing.json (if available) → Check token/time overhead
```

### Step 2: Pattern Analysis

```
Explore patterns from the following perspectives:

1. Skill effectiveness:
   - Compare with_skill.pass_rate vs baseline.pass_rate
   - Which evals show quality improvement with the skill?
   - Which evals show degradation with the skill?

2. Consistency:
   - High summary.pass_rate.stddev indicates inconsistency
   - Are there extreme results in specific evals only?

3. Cost:
   - with_skill.tokens vs baseline.tokens → Token overhead
   - with_skill.duration vs baseline.duration → Time overhead
   - Is the quality improvement justified relative to the overhead?

4. Assertion patterns:
   - Which assertions repeatedly FAIL? → Structural weakness of the skill
   - Which assertions always PASS? → Strength of the skill
   - Which assertions FAIL only with_skill? → Skill causing side effects
```

### Step 3: Severity Classification

```
Assign severity to each weakness:

| Severity | Criteria |
|----------|----------|
| Critical | Skill produces worse results than baseline |
| High | pass_rate < 0.5 or core assertion failure |
| Medium | pass_rate 0.5-0.7 or non-core assertion failure |
| Low | pass_rate > 0.7 but room for improvement |
```

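Setting aside the qualitative core-assertion criteria, the pass_rate thresholds in the table above can be sketched as a classifier. The function name and signature are illustrative assumptions, not part of the package:

```python
def classify_severity(pass_rate: float, baseline_rate: float) -> str:
    """Thresholds from the criteria table above (pass_rate-based rules only)."""
    if pass_rate < baseline_rate:
        return "Critical"  # skill produces worse results than baseline
    if pass_rate < 0.5:
        return "High"
    if pass_rate <= 0.7:
        return "Medium"
    return "Low"

print(classify_severity(0.4, 0.5),   # worse than baseline
      classify_severity(0.6, 0.5),   # in the 0.5-0.7 band
      classify_severity(0.9, 0.5))   # above 0.7
```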
### Step 4: Derive Improvement Suggestions

```
For each weakness:
  1. Estimate root cause (data-driven)
  2. Suggest specific actions (which part of the skill to modify)
  3. Set measurable goals (e.g., "pass_rate 0.5 → 0.8")
  4. Assess difficulty (Low/Medium/High)
```

### Step 5: Determine Priority

```
Priority = Severity x Impact Scope x (1 / Difficulty)

1. Critical weaknesses → Always highest priority
2. High + wide impact scope → Next priority
3. Medium + easy fix → Quick wins
4. Low → Backlog
```

### Step 6: Write Report

Write the report following the Output format. Include all 4 sections.

## Analysis Patterns

### Useful Comparison Metrics

| Metric | Calculation | Interpretation |
|--------|-------------|----------------|
| **Skill Lift** | `with_skill.pass_rate - baseline.pass_rate` | Positive means skill improves quality |
| **Token Overhead** | `with_skill.tokens / baseline.tokens - 1` | Additional token ratio when skill is applied |
| **Time Overhead** | `with_skill.duration / baseline.duration - 1` | Additional time ratio when skill is applied |
| **Consistency** | `1 - summary.pass_rate.stddev` | Closer to 1 means more consistent |
| **Cost-Effectiveness** | `Skill Lift / Token Overhead` | Higher means more efficient |

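Applied to a single `eval_results` entry, the formulas work out as below. A sketch only; the sample numbers are taken from the benchmark.json example earlier in this file, and `compare` is an illustrative name:

```python
def compare(with_skill: dict, baseline: dict) -> dict:
    """Compute the comparison metrics defined in the table above."""
    lift = with_skill["pass_rate"] - baseline["pass_rate"]
    token_overhead = with_skill["tokens"] / baseline["tokens"] - 1
    return {
        "skill_lift": round(lift, 4),
        "token_overhead": round(token_overhead, 4),
        "time_overhead": round(with_skill["duration"] / baseline["duration"] - 1, 4),
        # Guard against zero overhead before dividing
        "cost_effectiveness": round(lift / token_overhead, 4) if token_overhead else None,
    }

m = compare({"pass_rate": 0.75, "tokens": 45230, "duration": 32.15},
            {"pass_rate": 0.50, "tokens": 38400, "duration": 28.90})
print(m["skill_lift"], m["token_overhead"])  # → 0.25 0.1779
```

Here the skill lifts pass rate by 0.25 at roughly 18% extra tokens, so the quality gain comfortably justifies the overhead.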
### Multi-Iteration Comparison (when applicable)

```
iteration-1 vs iteration-2:
- pass_rate change: [before] → [after] (Δ [difference])
- token change: [before] → [after] (Δ [difference])
- Resolved weaknesses: [list]
- Newly introduced weaknesses: [list]
```

## Red Flags — STOP

| Thought | Reality |
|---------|---------|
| "The data is sparse but I can see a trend" | Judging trends from 2 evals is overfitting. Report the data as-is |
| "This weakness is probably not important" | Severity is determined by the criteria table. Do not dismiss based on intuition |
| "There are too many improvement suggestions" | Maximum 5. Reduce by priority |
| "The skill is generally good, so skip weaknesses" | An analysis with 0 weaknesses has no value. Always report them |
| "The baseline also did well, so the skill has no effect" | Quantify with Skill Lift calculation. Use numbers, not feelings |

## Constraints

- **Independent execution**: This agent does not depend on results from other agents (grading.json is received as input)
- **Data-driven**: All analysis is derived from input data. No external knowledge or speculation
- **Structured output**: All 4 sections (Strengths/Weaknesses/Improvement Suggestions/Priority) must be included
- **Improvement suggestion cap**: Maximum 5. If exceeded, trim based on priority