codingbuddy-rules 4.4.0 → 5.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (122)
  1. package/.ai-rules/adapters/antigravity.md +6 -6
  2. package/.ai-rules/adapters/claude-code.md +107 -4
  3. package/.ai-rules/adapters/codex.md +5 -5
  4. package/.ai-rules/adapters/cursor.md +2 -2
  5. package/.ai-rules/adapters/kiro.md +8 -8
  6. package/.ai-rules/adapters/opencode.md +7 -7
  7. package/.ai-rules/adapters/q.md +2 -2
  8. package/.ai-rules/agents/README.md +66 -16
  9. package/.ai-rules/agents/accessibility-specialist.json +2 -1
  10. package/.ai-rules/agents/act-mode.json +2 -1
  11. package/.ai-rules/agents/agent-architect.json +8 -7
  12. package/.ai-rules/agents/ai-ml-engineer.json +1 -0
  13. package/.ai-rules/agents/architecture-specialist.json +1 -0
  14. package/.ai-rules/agents/auto-mode.json +4 -2
  15. package/.ai-rules/agents/backend-developer.json +1 -0
  16. package/.ai-rules/agents/code-quality-specialist.json +1 -0
  17. package/.ai-rules/agents/code-reviewer.json +65 -64
  18. package/.ai-rules/agents/data-engineer.json +8 -7
  19. package/.ai-rules/agents/data-scientist.json +10 -9
  20. package/.ai-rules/agents/devops-engineer.json +1 -0
  21. package/.ai-rules/agents/documentation-specialist.json +1 -0
  22. package/.ai-rules/agents/eval-mode.json +20 -19
  23. package/.ai-rules/agents/event-architecture-specialist.json +1 -0
  24. package/.ai-rules/agents/frontend-developer.json +1 -0
  25. package/.ai-rules/agents/i18n-specialist.json +2 -1
  26. package/.ai-rules/agents/integration-specialist.json +1 -0
  27. package/.ai-rules/agents/migration-specialist.json +1 -0
  28. package/.ai-rules/agents/mobile-developer.json +8 -7
  29. package/.ai-rules/agents/observability-specialist.json +1 -0
  30. package/.ai-rules/agents/parallel-orchestrator.json +346 -0
  31. package/.ai-rules/agents/performance-specialist.json +1 -0
  32. package/.ai-rules/agents/plan-mode.json +3 -1
  33. package/.ai-rules/agents/plan-reviewer.json +208 -0
  34. package/.ai-rules/agents/platform-engineer.json +1 -0
  35. package/.ai-rules/agents/security-engineer.json +9 -8
  36. package/.ai-rules/agents/security-specialist.json +2 -1
  37. package/.ai-rules/agents/seo-specialist.json +1 -0
  38. package/.ai-rules/agents/software-engineer.json +1 -0
  39. package/.ai-rules/agents/solution-architect.json +11 -10
  40. package/.ai-rules/agents/systems-developer.json +9 -8
  41. package/.ai-rules/agents/technical-planner.json +11 -10
  42. package/.ai-rules/agents/test-engineer.json +7 -6
  43. package/.ai-rules/agents/test-strategy-specialist.json +1 -0
  44. package/.ai-rules/agents/tooling-engineer.json +4 -3
  45. package/.ai-rules/agents/ui-ux-designer.json +1 -0
  46. package/.ai-rules/keyword-modes.json +4 -4
  47. package/.ai-rules/rules/clarification-guide.md +14 -14
  48. package/.ai-rules/rules/core.md +90 -1
  49. package/.ai-rules/rules/parallel-execution.md +217 -0
  50. package/.ai-rules/skills/README.md +23 -1
  51. package/.ai-rules/skills/agent-design/SKILL.md +5 -0
  52. package/.ai-rules/skills/agent-design/examples/agent-template.json +58 -0
  53. package/.ai-rules/skills/agent-design/references/expertise-guidelines.md +112 -0
  54. package/.ai-rules/skills/agent-discussion/SKILL.md +199 -0
  55. package/.ai-rules/skills/agent-discussion-panel/SKILL.md +448 -0
  56. package/.ai-rules/skills/api-design/SKILL.md +5 -0
  57. package/.ai-rules/skills/api-design/examples/error-response.json +159 -0
  58. package/.ai-rules/skills/api-design/examples/openapi-template.yaml +393 -0
  59. package/.ai-rules/skills/build-fix/SKILL.md +234 -0
  60. package/.ai-rules/skills/code-explanation/SKILL.md +4 -0
  61. package/.ai-rules/skills/context-management/SKILL.md +1 -0
  62. package/.ai-rules/skills/cost-budget/SKILL.md +348 -0
  63. package/.ai-rules/skills/cross-repo-issues/SKILL.md +257 -0
  64. package/.ai-rules/skills/database-migration/SKILL.md +1 -0
  65. package/.ai-rules/skills/deepsearch/SKILL.md +214 -0
  66. package/.ai-rules/skills/deployment-checklist/SKILL.md +1 -0
  67. package/.ai-rules/skills/error-analysis/SKILL.md +1 -0
  68. package/.ai-rules/skills/finishing-a-development-branch/SKILL.md +281 -0
  69. package/.ai-rules/skills/frontend-design/SKILL.md +5 -0
  70. package/.ai-rules/skills/frontend-design/examples/component-template.tsx +203 -0
  71. package/.ai-rules/skills/frontend-design/references/css-patterns.md +243 -0
  72. package/.ai-rules/skills/git-master/SKILL.md +358 -0
  73. package/.ai-rules/skills/incident-response/SKILL.md +1 -0
  74. package/.ai-rules/skills/legacy-modernization/SKILL.md +1 -0
  75. package/.ai-rules/skills/mcp-builder/SKILL.md +7 -0
  76. package/.ai-rules/skills/mcp-builder/examples/resource-example.ts +233 -0
  77. package/.ai-rules/skills/mcp-builder/examples/tool-example.ts +203 -0
  78. package/.ai-rules/skills/mcp-builder/references/protocol-spec.md +215 -0
  79. package/.ai-rules/skills/performance-optimization/SKILL.md +3 -0
  80. package/.ai-rules/skills/plan-and-review/SKILL.md +115 -0
  81. package/.ai-rules/skills/pr-all-in-one/SKILL.md +15 -13
  82. package/.ai-rules/skills/pr-all-in-one/configuration-guide.md +7 -7
  83. package/.ai-rules/skills/pr-all-in-one/pr-templates.md +10 -10
  84. package/.ai-rules/skills/pr-review/SKILL.md +4 -0
  85. package/.ai-rules/skills/receiving-code-review/SKILL.md +347 -0
  86. package/.ai-rules/skills/refactoring/SKILL.md +1 -0
  87. package/.ai-rules/skills/requesting-code-review/SKILL.md +348 -0
  88. package/.ai-rules/skills/rule-authoring/SKILL.md +5 -0
  89. package/.ai-rules/skills/rule-authoring/examples/rule-template.md +142 -0
  90. package/.ai-rules/skills/rule-authoring/examples/trigger-patterns.md +126 -0
  91. package/.ai-rules/skills/security-audit/SKILL.md +4 -0
  92. package/.ai-rules/skills/skill-creator/SKILL.md +461 -0
  93. package/.ai-rules/skills/skill-creator/agents/analyzer.md +206 -0
  94. package/.ai-rules/skills/skill-creator/agents/comparator.md +167 -0
  95. package/.ai-rules/skills/skill-creator/agents/grader.md +152 -0
  96. package/.ai-rules/skills/skill-creator/assets/eval_review.html +289 -0
  97. package/.ai-rules/skills/skill-creator/assets/skill-template.md +43 -0
  98. package/.ai-rules/skills/skill-creator/eval-viewer/generate_review.py +496 -0
  99. package/.ai-rules/skills/skill-creator/references/frontmatter-guide.md +632 -0
  100. package/.ai-rules/skills/skill-creator/references/multi-tool-compat.md +480 -0
  101. package/.ai-rules/skills/skill-creator/references/schemas.md +784 -0
  102. package/.ai-rules/skills/skill-creator/scripts/aggregate_benchmark.py +302 -0
  103. package/.ai-rules/skills/skill-creator/scripts/init_skill.sh +196 -0
  104. package/.ai-rules/skills/skill-creator/scripts/run_loop.py +327 -0
  105. package/.ai-rules/skills/systematic-debugging/SKILL.md +1 -0
  106. package/.ai-rules/skills/tech-debt/SKILL.md +1 -0
  107. package/.ai-rules/skills/test-coverage-gate/SKILL.md +303 -0
  108. package/.ai-rules/skills/tmux-master/SKILL.md +491 -0
  109. package/.ai-rules/skills/using-git-worktrees/SKILL.md +368 -0
  110. package/.ai-rules/skills/verification-before-completion/SKILL.md +234 -0
  111. package/.ai-rules/skills/widget-slot-architecture/SKILL.md +6 -0
  112. package/.ai-rules/skills/widget-slot-architecture/examples/parallel-route-setup.tsx +206 -0
  113. package/.ai-rules/skills/widget-slot-architecture/examples/widget-component.tsx +250 -0
  114. package/.ai-rules/skills/writing-plans/SKILL.md +78 -0
  115. package/bin/cli.js +178 -0
  116. package/lib/init/detect-stack.js +148 -0
  117. package/lib/init/generate-config.js +31 -0
  118. package/lib/init/index.js +86 -0
  119. package/lib/init/prompt.js +60 -0
  120. package/lib/init/scaffold.js +67 -0
  121. package/lib/init/suggest-agent.js +46 -0
  122. package/package.json +10 -2
@@ -0,0 +1,461 @@
---
name: skill-creator
description: >-
  Create new skills, modify and improve existing skills,
  and measure skill performance with eval pipeline.
  Use when creating a skill from scratch, editing or optimizing
  an existing skill, running evals to test a skill,
  or benchmarking skill performance.
disable-model-invocation: true
argument-hint: [create|eval|improve|benchmark] [skill-name]
---

# Skill Creator

## Overview

Skills are reusable workflows that encode expert processes into repeatable instructions. A well-crafted skill transforms inconsistent ad-hoc work into systematic, verifiable outcomes across any AI tool.

**Core principle:** A skill must change behavior. If an AI assistant produces the same output with and without the skill loaded, the skill has failed.

**Iron Law:**
```
EVERY SKILL MUST HAVE A MEASURABLE "DID BEHAVIOR CHANGE?" TEST
No eval = no confidence. Ship nothing you haven't measured.
```

## Modes

| Mode | Purpose | Input | Output |
|------|---------|-------|--------|
| **Create** | Build a new skill from scratch | Intent or problem statement | `SKILL.md` + scaffold |
| **Eval** | Test skill effectiveness | Skill + test cases | Graded scorecard |
| **Improve** | Refine based on eval results | Skill + eval data | Improved `SKILL.md` |
| **Benchmark** | Compare performance metrics | Skill + baseline | Performance report |

## When to Use

- Creating a new skill for `.ai-rules/skills/`
- Testing whether an existing skill produces correct behavior
- Optimizing a skill that underperforms on edge cases
- Comparing skill versions to select the best one
- Measuring skill quality before shipping

## When NOT to Use

- Writing one-off instructions (not reusable = not a skill)
- Creating rules (use `rule-authoring` skill)
- Designing agents (use `agent-design` skill)
- Process is too simple to warrant a workflow (< 3 steps)

---

## Create Mode

**Trigger:** `skill-creator create <skill-name>`

### Phase 1: Intent — Define What the Skill Does

Answer before writing anything:

```
1. What SPECIFIC problem does this skill solve?
   Bad: "helps with testing"
   Good: "enforces Red-Green-Refactor TDD cycle with mandatory verification"

2. What behavior change should loading this skill cause?
   Without: "AI writes implementation first, tests after"
   With: "AI writes failing test first, verifies failure, then implements"

3. Who consumes this skill?
   Which AI tools? (Claude Code, Cursor, Codex, Q, Kiro)
   What user skill level?

4. What is the boundary?
   What it handles vs. what it delegates to other skills
   Name 2-3 skills it does NOT overlap with
```

### Phase 2: Interview — Gather Domain Knowledge

Collect the expertise the skill will encode:

```
For each major workflow step:
  1. What is the step?
  2. What is the expected input/output?
  3. What are the most common mistakes?
  4. How do you verify correctness?
  5. What red flags should halt progress?
```

**Sources to consult:**
- Existing codebase patterns (search for conventions)
- Project documentation and ADRs
- Domain experts (ask the user)
- Related skills (check for reusable patterns)

### Phase 3: Write — Author the SKILL.md

**Required structure:**

```markdown
---
name: skill-name
description: "Use when... (max 500 chars)"
[optional frontmatter fields]
---

# Skill Title

## Overview
2-3 sentences. Core principle. Iron Law.

## When to Use
Bullet list of trigger scenarios.

## When NOT to Use (if applicable)

## Process / Phases
The actual workflow steps.

## Verification Checklist

## Red Flags — STOP
Table of rationalizations vs. reality.
```

**Writing rules:**

| Rule | Why |
|------|-----|
| Imperative mood ("Write the test") | Direct instructions produce consistent behavior |
| Concrete examples over abstractions | AI tools follow examples more reliably than rules |
| Good/Bad comparisons for ambiguous steps | Eliminates interpretation variance across tools |
| One responsibility per phase | Multi-purpose phases get half-completed |
| Max 500 lines total | Longer skills get truncated or ignored |

### Phase 4: Scaffold — Create Supporting Files

```bash
mkdir -p packages/rules/.ai-rules/skills/<skill-name>
# Write SKILL.md (from Phase 3)
# Optional supporting references:
#   <skill-name>/reference-guide.md
#   <skill-name>/examples.md
```

**Frontmatter field reference:**

| Field | Required | Description |
|-------|----------|-------------|
| `name` | Yes | `^[a-z0-9-]+$`, matches directory name |
| `description` | Yes | 1-500 chars, start with "Use when..." |
| `disable-model-invocation` | No | `true` if skill handles its own execution flow |
| `argument-hint` | No | Usage hint shown in skill listings |
| `allowed-tools` | No | Restrict available tools during execution |
| `context` | No | `fork` to run in isolated context |
| `agent` | No | Agent to activate with the skill |

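The field rules above can be sketched as a small validation check. This is an illustrative sketch, not a script shipped in the package; `validate_frontmatter` and its signature are assumptions.

```python
import re

def validate_frontmatter(meta: dict, dir_name: str) -> list[str]:
    """Return violations of the frontmatter rules above (empty list = valid)."""
    errors = []
    name = meta.get("name", "")
    if not re.fullmatch(r"[a-z0-9-]+", name):
        errors.append("name must match ^[a-z0-9-]+$")
    if name != dir_name:
        errors.append("name must match the directory name")
    desc = meta.get("description", "")
    if not 1 <= len(desc) <= 500:
        errors.append("description must be 1-500 chars")
    if not desc.startswith("Use when"):
        errors.append('description should start with "Use when..."')
    return errors

# A conforming skill passes with no errors:
print(validate_frontmatter(
    {"name": "skill-creator", "description": "Use when creating a skill."},
    "skill-creator"))  # → []
```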
### Phase 5: Test — Verify the Skill Works

```
- [ ] Frontmatter validates (name matches directory, description <= 500 chars)
- [ ] Skill loads without error in target tool (list_skills / get_skill)
- [ ] Following the skill produces different behavior than without it
- [ ] Every phase has a verifiable output
- [ ] Red flags table covers top 3 rationalizations for skipping
- [ ] No overlap with existing skills (check skills/README.md)
- [ ] Multi-tool compatible (no tool-specific syntax in core workflow)
```

---

## Eval Mode

**Trigger:** `skill-creator eval <skill-name>`

Measure whether a skill produces the intended behavior change.

### Phase 1: Define — Write Test Scenarios

Create scenarios exercising the skill's key behaviors:

```
Scenario: [descriptive name]
  Given: [initial state / context]
  When: [skill is applied with this input]
  Then: [expected behavior / output]
  Anti-pattern: [what happens WITHOUT the skill]
```

**Minimum scenarios:**
- 1 happy path (standard use case)
- 1 edge case (unusual but valid input)
- 1 adversarial case (input that tempts skipping the skill)

### Phase 2: Spawn — Execute Test Scenarios

Run each scenario against the skill:

```
For each scenario:
  1. Load the skill content
  2. Present the scenario input
  3. Capture the AI's response
  4. Save response for grading
```

**Execution options:**
- **Manual:** Paste skill + scenario, capture response
- **Automated:** Use subagent with skill loaded, capture output
- **Parallel:** Run via dispatching-parallel-agents skill

### Phase 3: Assert — Check Expected Behavior

Grade each response:

```
PASS: Response follows skill workflow
PARTIAL: Some steps followed, others skipped
FAIL: Skill ignored or wrong behavior produced
```

Note which specific steps were followed/skipped.

### Phase 4: Grade — Assign Severity

| Severity | Definition | Action |
|----------|-----------|--------|
| **Critical** | Skill completely ignored | Must fix before shipping |
| **High** | Key phase skipped or verification missing | Must fix before shipping |
| **Medium** | Minor step deviation, output still usable | Fix in next iteration |
| **Low** | Style/formatting difference, behavior correct | Optional fix |

### Phase 5: Aggregate — Summarize Results

```
Skill: [name]
Scenarios: [total] | Pass: [n] | Partial: [n] | Fail: [n]

Critical: [count] High: [count] Medium: [count] Low: [count]

Verdict:
  SHIP = Critical=0 AND High=0
  ITERATE = Critical=0, High>0
  REWRITE = Critical>0
```

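The verdict rules above reduce to a short decision function. A minimal sketch (the name `verdict` is illustrative); note that Critical is checked first, so a skill with both Critical and High issues is a REWRITE:

```python
def verdict(critical: int, high: int) -> str:
    """Map issue counts to the shipping decision defined above."""
    if critical > 0:
        return "REWRITE"
    return "ITERATE" if high > 0 else "SHIP"

print(verdict(0, 0), verdict(0, 2), verdict(1, 0))  # → SHIP ITERATE REWRITE
```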
### Phase 6: View — Present Findings

```markdown
## Eval Report: [skill-name]

### Summary
[Aggregate from Phase 5]

### Scenario Results
| Scenario | Result | Issues |
|----------|--------|--------|
| ... | PASS/PARTIAL/FAIL | ... |

### Recommendations
1. [Fix for highest-severity issue]
2. [Next fix]
```

---

## Improve Mode

**Trigger:** `skill-creator improve <skill-name>`

Refine an existing skill based on eval data or observed behavior gaps.

### Phase 1: Read — Understand Current State

```
1. Read the current SKILL.md
2. Read eval results (if available)
3. Identify the gap:
   - Which phases are being skipped?
   - Which instructions are ambiguous?
   - Where does behavior diverge across AI tools?
```

### Phase 2: Generalize — Find Patterns in Failures

```
Look for systemic issues:
- Same step skipped across scenarios → Step is unclear or seems optional
- Different behavior per AI tool → Instructions use tool-specific syntax
- Partial compliance → Steps too large, need decomposition
- Complete skip → Trigger conditions don't match use case
```

### Phase 3: Apply — Make Targeted Changes

| Issue Type | Fix Strategy |
|-----------|-------------|
| Step skipped | Add "MANDATORY" marker + red flag for skipping |
| Ambiguous instruction | Replace with concrete Good/Bad example |
| Tool-specific behavior | Remove tool-specific syntax, use universal patterns |
| Steps too large | Decompose into sub-steps with verification |
| Missing edge case | Add scenario to When to Use section |

**Rules for changes:**
- One targeted change per identified issue
- Do not rewrite the entire skill (preserve what works)
- Add examples where instructions were misinterpreted
- Strengthen red flags for commonly skipped steps

### Phase 4: Re-run — Eval the Improved Version

Run the same eval scenarios against the modified skill:

```
Compare original vs. improved:
- Did the targeted fix resolve the issue?
- Did the fix introduce new issues?
- Is overall pass rate higher?
```

### Phase 5: Compare — Side-by-Side Analysis

```markdown
| Metric | Before | After | Delta |
|--------|--------|-------|-------|
| Pass rate | X% | Y% | +/-Z% |
| Critical issues | N | M | +/-D |
| High issues | N | M | +/-D |
| Avg. steps followed | X/Y | X/Y | +/-D |
```

### Phase 6: Analyze — Document Learnings

```
1. Which fix strategy was most effective?
2. Which issues persisted despite changes?
3. Are there structural problems requiring a rewrite?
4. What patterns should inform future skill authoring?
```

---

## Benchmark Mode

**Trigger:** `skill-creator benchmark <skill-name>`

Measure skill performance across dimensions and optimize weak spots.

### Phase 1: Generate — Create Benchmark Suite

Design a comprehensive test suite:

```
Dimensions:
1. Compliance: Does the AI follow every step?
2. Consistency: Same input → same behavior across runs?
3. Portability: Works across Claude Code, Cursor, Codex, Q, Kiro?
4. Robustness: Handles edge cases without breaking?
5. Efficiency: Does the skill add unnecessary overhead?
```

- Minimum 2 cases per dimension
- Mix of simple and complex inputs
- At least 1 adversarial case per dimension

### Phase 2: Review — Analyze Benchmark Results

```
For each dimension:
  Score: [0-100]
  Weakest case: [description]
  Root cause: [why this dimension scored low]
```

| Score | Meaning |
|-------|---------|
| 90-100 | Excellent — production-ready |
| 70-89 | Good — minor improvements needed |
| 50-69 | Fair — significant gaps in at least one dimension |
| 0-49 | Poor — needs major rework |

### Phase 3: Optimize — Target Weak Dimensions

Focus on the lowest-scoring dimension first:

```
For each dimension scoring < 70:
  1. Identify the specific instruction causing the gap
  2. Apply the appropriate fix from Improve Mode Phase 3
  3. Re-run that dimension's cases only
  4. Verify improvement without regression in other dimensions
```

### Phase 4: Apply — Finalize and Document

```
1. Update SKILL.md with optimized content
2. Update skills/README.md if category or description changed
3. Record benchmark baseline:

   Benchmark: [skill-name] @ [date]
   Compliance: [score]
   Consistency: [score]
   Portability: [score]
   Robustness: [score]
   Efficiency: [score]
   Overall: [weighted average]
```

---

## Additional Resources

### Related Skills

| Skill | Relationship |
|-------|-------------|
| `rule-authoring` | Rules constrain behavior; skills define workflows. Complementary. |
| `agent-design` | Agents are personas; skills are processes. Non-overlapping. |
| `prompt-engineering` | Prompt techniques apply within skill instructions. Supporting. |
| `writing-plans` | Plans are one-time; skills are reusable. Different lifecycle. |

### Agent Support

| Agent | When to Involve |
|-------|----------------|
| Code Quality Specialist | Reviewing skill structure and clarity |
| Test Engineer | Designing eval scenarios |
| Architecture Specialist | Skill decomposition and boundary design |

### Multi-Tool Compatibility

Skills must work across all supported AI tools:

| Tool | How Skills Load | Key Consideration |
|------|----------------|-------------------|
| Claude Code | `get_skill` MCP tool | Full markdown + frontmatter parsed |
| Cursor | `@file` reference | Inline loading, no frontmatter processing |
| Codex / Copilot | `cat` file content | Plain text only, examples critical |
| Amazon Q | `.q/rules/` reference | Rule-style integration |
| Kiro | `.kiro/` reference | Spec-based integration |

**Portability rules:**
- No tool-specific syntax in core workflow
- Examples in generic markdown, not tool-specific blocks
- Phases described as actions, not tool commands
- Test with at least 2 different tools before shipping

## Red Flags — STOP

| Thought | Reality |
|---------|---------|
| "This skill is obvious, no need to eval" | Obvious skills still get ignored. Eval proves they work. |
| "I'll test it manually later" | Manual tests are forgotten. Eval now. |
| "One scenario is enough" | One is anecdote. Three is pattern. |
| "It works in Claude Code, ship it" | Cursor/Codex may ignore the same instructions. Test portability. |
| "Small change, no need to re-eval" | Small changes cause cascading behavior shifts. Re-eval. |
| "The skill is too long but everything is needed" | Max 500 lines. Cut or decompose into reference files. |
| "I'll add examples later" | Skills without examples produce inconsistent behavior. Add now. |
@@ -0,0 +1,206 @@
# Analyzer Agent

An agent that discovers patterns in benchmark results and suggests directions for skill improvement.

## Role

You are a skill evaluation analyst. You comprehensively analyze `benchmark.json` and each eval's `grading.json` results to derive the skill's strengths, weaknesses, improvement directions, and priorities. You perform only data-driven pattern analysis and never make suggestions based on speculation.

## Iron Law

```
Do not report patterns that are not in the data.
Every weakness must have evidence.
Every improvement suggestion must have a measurable goal.
```

## Input

| Item | Source | Description |
|------|--------|-------------|
| **benchmark.json** | `iteration-N/benchmark.json` | Iteration benchmark aggregate results |
| **grading results** | `iteration-N/eval-M/{with_skill\|without_skill}/grading.json` | Individual eval grading results |
| **eval metadata** | `iteration-N/eval-M/{with_skill\|without_skill}/eval_metadata.json` | Evaluation scenario information (optional) |
| **timing data** | `iteration-N/eval-M/{with_skill\|without_skill}/timing.json` | Token/time measurements (optional) |

### benchmark.json Core Structure

```json
{
  "skill_name": "skill-name",
  "iteration": 1,
  "summary": {
    "pass_rate": { "mean": 0.85, "stddev": 0.12 },
    "tokens": { "mean": 42000, "stddev": 5200 },
    "duration_seconds": { "mean": 35.5, "stddev": 8.3 }
  },
  "eval_results": [
    {
      "eval_id": 0,
      "with_skill": { "pass_rate": 0.75, "tokens": 45230, "duration": 32.15 },
      "baseline": { "pass_rate": 0.50, "tokens": 38400, "duration": 28.90 }
    }
  ]
}
```

### grading.json Core Structure

```json
{
  "expectations": [
    {
      "text": "assertion description",
      "passed": true,
      "evidence": "basis for the judgment"
    }
  ]
}
```

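Given the structure above, an eval's pass rate is simply the fraction of expectations with `"passed": true`. A minimal sketch; the sample expectations are invented for illustration:

```python
def pass_rate(grading: dict) -> float:
    """Fraction of expectations that passed in one grading.json."""
    checks = grading["expectations"]
    return sum(1 for e in checks if e["passed"]) / len(checks)

sample = {"expectations": [
    {"text": "writes failing test first", "passed": True, "evidence": "test committed before impl"},
    {"text": "verifies test failure", "passed": False, "evidence": "no failure run found"},
]}
print(pass_rate(sample))  # → 0.5
```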
## Output

An analysis report in markdown format. Must include **all** 4 sections below:

```markdown
# Skill Analysis Report: {skill_name}

## 1. Strengths

Areas where the skill performs well. Each strength includes supporting data.

- **[Strength title]**: [Description] (evidence: [data citation])

## 2. Weaknesses

Areas where the skill falls short. Each weakness includes supporting data and severity.

- **[Weakness title]** [Critical|High|Medium|Low]: [Description] (evidence: [data citation])

## 3. Improvement Suggestions

Specific improvement measures for each weakness. Includes measurable goals.

| # | Linked Weakness | Improvement Measure | Goal | Difficulty |
|---|----------------|---------------------|------|------------|
| 1 | [Weakness title] | [Specific action] | [Measurable goal] | Low/Medium/High |

## 4. Priority

Execution order for improvement suggestions. Based on severity x impact scope x (1 / difficulty).

1. [Highest priority item] — Reason: [basis]
2. [Next item] — Reason: [basis]
```

## Process

### Step 1: Data Collection

```
1. Read benchmark.json → Check summary and eval_results
2. Read each eval's grading.json → Check pass/fail per assertion
3. Read timing.json (if available) → Check token/time overhead
```

### Step 2: Pattern Analysis

```
Explore patterns from the following perspectives:

1. Skill effectiveness:
   - Compare with_skill.pass_rate vs baseline.pass_rate
   - Which evals show quality improvement with the skill?
   - Which evals show degradation with the skill?

2. Consistency:
   - High summary.pass_rate.stddev indicates inconsistency
   - Are there extreme results in specific evals only?

3. Cost:
   - with_skill.tokens vs baseline.tokens → Token overhead
   - with_skill.duration vs baseline.duration → Time overhead
   - Is the quality improvement justified relative to the overhead?

4. Assertion patterns:
   - Which assertions repeatedly FAIL? → Structural weakness of the skill
   - Which assertions always PASS? → Strength of the skill
   - Which assertions FAIL only with_skill? → Skill causing side effects
```

### Step 3: Severity Classification

```
Assign severity to each weakness:

| Severity | Criteria |
|----------|----------|
| Critical | Skill produces worse results than baseline |
| High | pass_rate < 0.5 or core assertion failure |
| Medium | pass_rate 0.5-0.7 or non-core assertion failure |
| Low | pass_rate > 0.7 but room for improvement |
```

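Setting aside the qualitative core-assertion criteria, the pass_rate thresholds in the table above can be sketched as a classifier. The function name and signature are illustrative assumptions, not part of the package:

```python
def classify_severity(pass_rate: float, baseline_rate: float) -> str:
    """Thresholds from the criteria table above (pass_rate-based rules only)."""
    if pass_rate < baseline_rate:
        return "Critical"  # skill produces worse results than baseline
    if pass_rate < 0.5:
        return "High"
    if pass_rate <= 0.7:
        return "Medium"
    return "Low"

print(classify_severity(0.4, 0.5),   # worse than baseline
      classify_severity(0.6, 0.5),   # in the 0.5-0.7 band
      classify_severity(0.9, 0.5))   # above 0.7
```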
### Step 4: Derive Improvement Suggestions

```
For each weakness:
  1. Estimate root cause (data-driven)
  2. Suggest specific actions (which part of the skill to modify)
  3. Set measurable goals (e.g., "pass_rate 0.5 → 0.8")
  4. Assess difficulty (Low/Medium/High)
```

### Step 5: Determine Priority

```
Priority = Severity x Impact Scope x (1 / Difficulty)

1. Critical weaknesses → Always highest priority
2. High + wide impact scope → Next priority
3. Medium + easy fix → Quick wins
4. Low → Backlog
```

### Step 6: Write Report

Write the report following the Output format. Include all 4 sections.

## Analysis Patterns

### Useful Comparison Metrics

| Metric | Calculation | Interpretation |
|--------|-------------|----------------|
| **Skill Lift** | `with_skill.pass_rate - baseline.pass_rate` | Positive means skill improves quality |
| **Token Overhead** | `with_skill.tokens / baseline.tokens - 1` | Additional token ratio when skill is applied |
| **Time Overhead** | `with_skill.duration / baseline.duration - 1` | Additional time ratio when skill is applied |
| **Consistency** | `1 - summary.pass_rate.stddev` | Closer to 1 means more consistent |
| **Cost-Effectiveness** | `Skill Lift / Token Overhead` | Higher means more efficient |

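Applied to a single `eval_results` entry, the formulas work out as below. A sketch only; the sample numbers are taken from the benchmark.json example earlier in this file, and `compare` is an illustrative name:

```python
def compare(with_skill: dict, baseline: dict) -> dict:
    """Compute the comparison metrics defined in the table above."""
    lift = with_skill["pass_rate"] - baseline["pass_rate"]
    token_overhead = with_skill["tokens"] / baseline["tokens"] - 1
    return {
        "skill_lift": round(lift, 4),
        "token_overhead": round(token_overhead, 4),
        "time_overhead": round(with_skill["duration"] / baseline["duration"] - 1, 4),
        # Guard against zero overhead before dividing
        "cost_effectiveness": round(lift / token_overhead, 4) if token_overhead else None,
    }

m = compare({"pass_rate": 0.75, "tokens": 45230, "duration": 32.15},
            {"pass_rate": 0.50, "tokens": 38400, "duration": 28.90})
print(m["skill_lift"], m["token_overhead"])  # → 0.25 0.1779
```

Here the skill lifts pass rate by 0.25 at roughly 18% extra tokens, so the quality gain comfortably justifies the overhead.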
### Multi-Iteration Comparison (when applicable)

```
iteration-1 vs iteration-2:
- pass_rate change: [before] → [after] (Δ [difference])
- token change: [before] → [after] (Δ [difference])
- Resolved weaknesses: [list]
- Newly introduced weaknesses: [list]
```

## Red Flags — STOP

| Thought | Reality |
|---------|---------|
| "The data is sparse but I can see a trend" | Judging trends from 2 evals is overfitting. Report the data as-is |
| "This weakness is probably not important" | Severity is determined by the criteria table. Do not dismiss based on intuition |
| "There are too many improvement suggestions" | Maximum 5. Reduce by priority |
| "The skill is generally good, so skip weaknesses" | An analysis with 0 weaknesses has no value. Always report them |
| "The baseline also did well, so the skill has no effect" | Quantify with Skill Lift calculation. Use numbers, not feelings |

## Constraints

- **Independent execution**: This agent does not depend on results from other agents (grading.json is received as input)
- **Data-driven**: All analysis is derived from input data. No external knowledge or speculation
- **Structured output**: All 4 sections (Strengths/Weaknesses/Improvement Suggestions/Priority) must be included
- **Improvement suggestion cap**: Maximum 5. If exceeded, trim based on priority