npm - @caoscompanybr/merlin - Versions diffs - 3.5.0 - Mend

@caoscompanybr/merlin 3.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (762)

package/.merlin-core/skills/general/skill-creator/SKILL.md ADDED

@@ -0,0 +1,342 @@
+---
+name: skill-creator
+description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
+license: MIT (Anthropic)
+compatibility: claude-code
+metadata:
+  source: anthropic/skills (merged with Merlin skill infrastructure)
+  imported: 2026-03-15
+  version: "3.0.0"
+  auto_activate:
+    [
+      skill-create,
+      create-skill,
+      new-skill,
+      skill-scaffold,
+      skill-eval,
+      skill-optimize,
+    ]
+---
+# Skill Creator
+A skill for creating new skills and iteratively improving them.
+**Core loop:**
+1. Decide what the skill should do
+2. Write a draft SKILL.md
+3. Create test prompts and run with-skill vs baseline
+4. Evaluate results (qualitatively + quantitatively)
+5. Rewrite based on feedback
+6. Repeat until satisfied
+7. Expand test set and try at larger scale
+Your job is to figure out where the user is in this process and help them progress.
+Be flexible — if the user says "just vibe with me, I don't need evaluations," that's fine. If they already have a draft, skip to eval/iterate.
+## Communicating with the User
+People across a wide range of coding familiarity use this skill. Pay attention to context cues:
+- "evaluation" and "benchmark" are borderline OK
+- For "JSON" and "assertion," look for clear cues that the user already knows these terms before using them without explanation
+Briefly explain terms if in doubt.
+---
+## Part 1: Merlin Skill Infrastructure
+### Skill Directory Structure
+```
+skill-name/
+├── SKILL.md (required)
+│   ├── YAML frontmatter (name, description, auto_activate required for Merlin)
+│   └── Markdown instructions
+└── Bundled Resources (optional)
+    ├── scripts/    - Executable code for deterministic/repetitive tasks
+    ├── references/ - Docs loaded into context as needed
+    └── assets/     - Files used in output (templates, icons, fonts)
+```
+Skills live in:
+- `.merlin-core/skills/general/` — Built-in (shipped with framework)
+- `.merlin-core/skills/domain/` — Project-specific domain knowledge
+- `.agents/skills/` — Agent Skills Standard cross-client path
+### Frontmatter Fields (Agent Skills Standard + Merlin v3)
+```yaml
+---
+name: my-skill              # lowercase, hyphens only, max 64 chars (REQUIRED)
+description: What it does   # Primary trigger — be specific (REQUIRED)
+license: MIT                # Optional
+compatibility: claude-code  # Optional
+metadata:                   # Optional nested map
+  author: your-name
+  version: "1.0"
+  auto_activate: [kw1, kw2]  # Keywords that auto-activate this skill
+  file_triggers: ["*.ext"]    # File patterns that trigger activation (quoted: a bare * starts a YAML alias)
+  tool_reminders: [tool1]     # Tools to suggest when skill is active
+  allowed_tools: [Read, Edit] # Tools pre-approved when skill is active
+---
+```
+**v3 format:** `auto_activate`, `file_triggers`, `tool_reminders` go inside `metadata:` for Agent Skills Standard compliance. The parser also reads top-level (legacy) for backward compat.
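The lookup order implied by the v3 note above can be sketched as follows. This is a minimal illustration, not the package's parser: the function name `read_trigger_field` is hypothetical, the frontmatter is assumed to be already parsed into a plain dict, and the choice to let `metadata:` win over a top-level key is an assumption the source does not confirm.

```python
def read_trigger_field(frontmatter: dict, field: str) -> list:
    """Return a trigger list, preferring metadata.<field> over the legacy top-level key."""
    # Assumption: a value nested under `metadata:` takes precedence when both exist.
    metadata = frontmatter.get("metadata") or {}
    if field in metadata:
        return list(metadata[field])
    # Legacy layout: the same key at the top level of the frontmatter.
    return list(frontmatter.get(field, []))


v3 = {"name": "my-skill", "metadata": {"auto_activate": ["kw1", "kw2"]}}
legacy = {"name": "my-skill", "auto_activate": ["kw1"]}

print(read_trigger_field(v3, "auto_activate"))
print(read_trigger_field(legacy, "auto_activate"))
```

Returning an empty list for a missing field keeps callers free of `None` checks.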
+**Description writing tip:** Claude tends to "undertrigger" skills. Make descriptions slightly pushy. Instead of "Build dashboards" → "Build dashboards. Use this skill whenever the user mentions dashboards, data visualization, metrics, or wants to display any kind of data, even if they don't explicitly ask for a 'dashboard.'"
+### Progressive Disclosure
+1. **Metadata** (name + description) — Always in context (~100 words)
+2. **SKILL.md body** — In context whenever skill triggers (<500 lines ideal)
+3. **Bundled resources** — As needed (scripts can execute without loading into context)
+### Naming Rules (Agent Skills Standard)
+- Lowercase only
+- Hyphens for word separation (no underscores or spaces)
+- Max 64 characters
+- No consecutive hyphens, no leading/trailing hyphens
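The naming rules above can be checked mechanically. The sketch below is one way to do it, not the validator shipped in the package; `is_valid_skill_name` is a hypothetical helper, and allowing digits alongside lowercase letters is an assumption, since the list above only says "lowercase".

```python
import re

# One run of lowercase letters/digits, then zero or more "-run" groups.
# This shape rules out leading, trailing, and consecutive hyphens in one pattern.
_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_skill_name(name: str) -> bool:
    """Check a skill name against the length and character rules listed above."""
    return len(name) <= 64 and _NAME_RE.fullmatch(name) is not None


for candidate in ["skill-creator", "Skill-Creator", "skill--creator", "-skill", "skill_creator"]:
    print(candidate, is_valid_skill_name(candidate))
```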
+### Validation
+- `node .merlin-core/tools/merlin-tools.js list-skills` — Verify discovery
+- `node --test .merlin-core/core/orchestration/skill-dispatcher.test.js` — Keyword collision check
+- The skill-importer (`core/alkimia/skill-importer.js`) validates frontmatter and imports external skills
+---
+## Part 2: Creating a Skill
+### Step 1: Capture Intent
+1. What should this skill enable Claude to do?
+2. When should it trigger? (what user phrases/contexts)
+3. What's the expected output format?
+4. Should we set up test cases? (Yes for objectively verifiable outputs — file transforms, data extraction, code generation. Optional for subjective outputs — writing style, art.)
+If the current conversation already contains a workflow the user wants to capture ("turn this into a skill"), extract the tools used, sequence of steps, corrections made, input/output formats observed.
+### Step 2: Interview and Research
+Proactively ask about edge cases, input/output formats, example files, success criteria, dependencies. Wait to write test prompts until this is ironed out.
+### Step 3: Write the SKILL.md
+**Writing guide:**
+- Keep under 500 lines; use `references/` for overflow
+- Use imperative form
+- Explain the **why** behind requirements instead of heavy-handed MUSTs
+- Include 2-3 examples showing expected behavior
+- For large references (>300 lines), include a table of contents
+- Organize multi-domain skills by variant in `references/`
+**Look for repeated work:** Read transcripts from test runs. If all test cases independently wrote similar helper scripts, bundle that script in `scripts/`.
+### Step 4: Test Cases
+Create 2-3 realistic test prompts. Save to `evals/evals.json`:
+```json
+{
+  "skill_name": "example-skill",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "User's task prompt",
+      "expected_output": "Description of expected result",
+      "files": []
+    }
+  ]
+}
+```
+See [references/schemas.md](references/schemas.md) for the full schema including assertions.
+---
+## Part 3: Running and Evaluating Test Cases
+Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration (`iteration-1/`, `iteration-2/`, etc.) and within that, each test case (`eval-0/`, `eval-1/`, etc.).
+### Step 1: Spawn All Runs (With-Skill AND Baseline) in Same Turn
+For each test case, spawn two subagents simultaneously:
+**With-skill run:**
+```
+Execute this task:
+- Skill path: <path-to-skill>
+- Task: <eval prompt>
+- Input files: <eval files if any>
+- Save outputs to: <workspace>/iteration-N/eval-ID/with_skill/outputs/
+```
+**Baseline run:**
+- Creating a new skill → no skill at all, save to `without_skill/outputs/`
+- Improving existing skill → old version (snapshot first: `cp -r <skill-path> <workspace>/skill-snapshot/`)
+Write `eval_metadata.json` for each test case:
+```json
+{
+  "eval_id": 0,
+  "eval_name": "descriptive-name-here",
+  "prompt": "The user's task prompt",
+  "assertions": []
+}
+```
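The workspace layout described in this part (a sibling `<skill-name>-workspace/` directory, then iteration, eval, and configuration folders) can be sketched with `pathlib`. The helper name `eval_output_dir` is hypothetical; the folder names follow the text above.

```python
from pathlib import Path


def eval_output_dir(skill_dir: str, iteration: int, eval_id: int, config: str) -> Path:
    """Build <skill>-workspace/iteration-N/eval-ID/<config>/outputs next to the skill directory."""
    skill = Path(skill_dir)
    # The workspace is a sibling of the skill directory, named after the skill.
    workspace = skill.parent / f"{skill.name}-workspace"
    return workspace / f"iteration-{iteration}" / f"eval-{eval_id}" / config / "outputs"


# `config` is "with_skill" or "without_skill" in the runs described above.
print(eval_output_dir("skills/my-skill", 1, 0, "with_skill"))
print(eval_output_dir("skills/my-skill", 1, 0, "without_skill"))
```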
+### Step 2: While Runs Are In Progress, Draft Assertions
+Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable with descriptive names that read clearly in the benchmark viewer.
+Update `eval_metadata.json` and `evals/evals.json` with assertions.
+### Step 3: As Runs Complete, Capture Timing Data
+When each subagent completes, save `total_tokens` and `duration_ms` from the notification to `timing.json`:
+```json
+{ "total_tokens": 84852, "duration_ms": 23332, "total_duration_seconds": 23.3 }
+```
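A small sketch of how the three fields in `timing.json` relate: `total_duration_seconds` is `duration_ms` divided by 1000. The function name `build_timing` and the one-decimal rounding are assumptions inferred from the sample values above, not taken from the package's scripts.

```python
import json


def build_timing(total_tokens: int, duration_ms: int) -> dict:
    """Assemble the timing record from the values reported when a subagent completes."""
    return {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumption: seconds are rounded to one decimal, matching the sample above.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }


# Serialize the record; writing it to timing.json is left to the caller.
print(json.dumps(build_timing(84852, 23332)))
```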
+### Step 4: Grade, Aggregate, and Launch the Viewer
+1. **Grade** — Spawn grader subagent (read `agents/grader.md`). Save to `grading.json`. Use fields: `text`, `passed`, `evidence`.
+2. **Aggregate** — Run from the skill-creator directory:
+   ```bash
+   python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
+   ```
+   Produces `benchmark.json` and `benchmark.md`.
+3. **Analyze** — Read `agents/analyzer.md`. Surface patterns: non-discriminating assertions, high-variance evals, time/token tradeoffs.
+4. **Launch viewer:**
+   ```bash
+   nohup python eval-viewer/generate_review.py \
+     <workspace>/iteration-N \
+     --skill-name "my-skill" \
+     --benchmark <workspace>/iteration-N/benchmark.json \
+     > /dev/null 2>&1 &
+   ```
+   For iteration 2+, add `--previous-workspace <workspace>/iteration-<N-1>`.
+5. **Tell the user** the viewer is open — "Outputs" tab for qualitative review, "Benchmark" tab for quantitative comparison.
+### Step 5: Read Feedback
+When user is done, read `feedback.json`. Empty feedback = user thinks it's fine. Focus improvements on test cases with specific complaints.
+---
+## Part 4: Improving the Skill (The Iteration Loop)
+### How to Think About Improvements
+1. **Generalize from feedback** — Skills are used millions of times across many prompts. Don't overfit to test examples. If there's a stubborn issue, try different metaphors or patterns.
+2. **Keep the prompt lean** — Remove things that aren't pulling their weight. Read transcripts, not just outputs — if the skill wastes time on unproductive work, trim those instructions.
+3. **Explain the why** — LLMs are smart. Explain reasoning instead of writing ALWAYS/NEVER in all caps. Reframe rigid structures into understanding-based instructions.
+4. **Look for repeated work** — If all test runs independently wrote similar helper scripts, bundle that script in `scripts/`.
+### The Loop
+1. Apply improvements to the skill
+2. Rerun all test cases into `iteration-<N+1>/` with baselines
+3. Launch viewer with `--previous-workspace` pointing at previous iteration
+4. Wait for user review
+5. Read feedback, improve again, repeat
+**Stop when:** User is happy, feedback is all empty, or no meaningful progress.
+### Advanced: Blind Comparison
+For rigorous A/B comparison, read `agents/comparator.md` and `agents/analyzer.md`. Give two outputs to an independent agent without revealing which is which. Optional — human review loop is usually sufficient.
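The blinding step can be sketched as a random assignment of the two outputs to the labels "A" and "B", with the mapping kept aside until the comparator has chosen. This is an illustrative sketch only; `blind_pair` is a hypothetical helper, and `agents/comparator.md` defines the actual procedure.

```python
import random


def blind_pair(with_skill_output: str, baseline_output: str, rng: random.Random):
    """Return (outputs, key): outputs go to the comparator, key stays hidden until it decides."""
    pair = [("with_skill", with_skill_output), ("baseline", baseline_output)]
    rng.shuffle(pair)  # random order so the label carries no information
    outputs = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return outputs, key


outputs, key = blind_pair("report with skill", "report without skill", random.Random())
print(sorted(outputs))  # only the labels A and B are shown to the comparator
```

Passing the `random.Random` instance in makes the assignment reproducible when a fixed seed is wanted for debugging.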
+---
+## Part 5: Description Optimization
+After skill creation/improvement, optimize the description for better triggering.
+### Step 1: Generate 20 Eval Queries
+Mix of should-trigger (8-10) and should-not-trigger (8-10). Must be realistic — specific, with file paths, personal context, abbreviations, typos, casual speech.
+**Bad:** `"Format this data"` — too vague
+**Good:** `"ok so my boss sent me this xlsx called 'Q4 sales final FINAL v2.xlsx' and she wants me to add a profit margin column. Revenue is column C, costs column D i think"`
+For should-not-trigger: use near-misses that share keywords but need something different. Avoid obviously irrelevant queries.
+### Step 2: Review with User
+Use the HTML template (`assets/eval_review.html`), replace placeholders, open in browser. User can edit queries, toggle should-trigger, add/remove entries, export.
+### Step 3: Run Optimization Loop
+```bash
+python -m scripts.run_loop \
+  --eval-set <path-to-trigger-eval.json> \
+  --skill-path <path-to-skill> \
+  --model <model-id-powering-this-session> \
+  --max-iterations 5 \
+  --verbose
+```
+Splits 60% train / 40% test, evaluates current description (3 runs per query), proposes improvements, re-evaluates. Selects best by test score to avoid overfitting.
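The 60/40 split mentioned above can be sketched as a seeded shuffle followed by a cut. This is an assumption about the general technique, not the logic inside `scripts/run_loop.py`, which may for example stratify by should-trigger label; `split_train_test` is a hypothetical name.

```python
import random


def split_train_test(queries: list, train_fraction: float = 0.6, seed: int = 0):
    """Shuffle a copy of the queries and cut it into train and held-out test portions."""
    shuffled = list(queries)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test([f"query-{i}" for i in range(20)])
print(len(train), len(test))
```

Scoring candidate descriptions on the held-out portion is what guards against picking one that only fits the training queries.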
+### Step 4: Apply Result
+Take `best_description` from JSON output, update SKILL.md frontmatter. Show user before/after with scores.
+---
+## Bundled Resources
+### Agents (for subagent delegation)
+- [agents/grader.md](agents/grader.md) — Evaluate assertions against outputs
+- [agents/comparator.md](agents/comparator.md) — Blind A/B comparison
+- [agents/analyzer.md](agents/analyzer.md) — Analyze why one version beat another
+### Scripts (Python evaluation runtime)
+| Script                           | Purpose                                    |
+| -------------------------------- | ------------------------------------------ |
+| `scripts/run_eval.py`            | Run skill against eval prompts with timing |
+| `scripts/run_loop.py`            | Full description optimization loop         |
+| `scripts/aggregate_benchmark.py` | Aggregate grading into benchmark.json      |
+| `scripts/generate_report.py`     | Generate benchmark report                  |
+| `scripts/improve_description.py` | AI-driven description improvement          |
+| `scripts/package_skill.py`       | Bundle skill into .skill file              |
+| `scripts/quick_validate.py`      | Fast sanity check                          |
+| `scripts/utils.py`               | Shared utilities                           |
+### Eval Viewer
+- `eval-viewer/generate_review.py` — Generate interactive HTML viewer
+- `eval-viewer/viewer.html` — Viewer template (1325 lines)
+### Assets
+- `assets/eval_review.html` — Description optimization review template
+### References
+- [references/schemas.md](references/schemas.md) — JSON schemas for evals.json, grading.json, benchmark.json, feedback.json

package/.merlin-core/skills/general/skill-creator/agents/analyzer.md ADDED

@@ -0,0 +1,283 @@
+# Post-hoc Analyzer Agent
+Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
+## Role
+After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
+## Inputs
+You receive these parameters in your prompt:
+- **winner**: "A" or "B" (from blind comparison)
+- **winner_skill_path**: Path to the skill that produced the winning output
+- **winner_transcript_path**: Path to the execution transcript for the winner
+- **loser_skill_path**: Path to the skill that produced the losing output
+- **loser_transcript_path**: Path to the execution transcript for the loser
+- **comparison_result_path**: Path to the blind comparator's output JSON
+- **output_path**: Where to save the analysis results
+## Process
+### Step 1: Read Comparison Result
+1. Read the blind comparator's output at comparison_result_path
+2. Note the winning side (A or B), the reasoning, and any scores
+3. Understand what the comparator valued in the winning output
+### Step 2: Read Both Skills
+1. Read the winner skill's SKILL.md and key referenced files
+2. Read the loser skill's SKILL.md and key referenced files
+3. Identify structural differences:
+   - Instructions clarity and specificity
+   - Script/tool usage patterns
+   - Example coverage
+   - Edge case handling
+### Step 3: Read Both Transcripts
+1. Read the winner's transcript
+2. Read the loser's transcript
+3. Compare execution patterns:
+   - How closely did each follow their skill's instructions?
+   - What tools were used differently?
+   - Where did the loser diverge from optimal behavior?
+   - Did either encounter errors or make recovery attempts?
+### Step 4: Analyze Instruction Following
+For each transcript, evaluate:
+- Did the agent follow the skill's explicit instructions?
+- Did the agent use the skill's provided tools/scripts?
+- Were there missed opportunities to leverage skill content?
+- Did the agent add unnecessary steps not in the skill?
+Score instruction following 1-10 and note specific issues.
+### Step 5: Identify Winner Strengths
+Determine what made the winner better:
+- Clearer instructions that led to better behavior?
+- Better scripts/tools that produced better output?
+- More comprehensive examples that guided edge cases?
+- Better error handling guidance?
+Be specific. Quote from skills/transcripts where relevant.
+### Step 6: Identify Loser Weaknesses
+Determine what held the loser back:
+- Ambiguous instructions that led to suboptimal choices?
+- Missing tools/scripts that forced workarounds?
+- Gaps in edge case coverage?
+- Poor error handling that caused failures?
+### Step 7: Generate Improvement Suggestions
+Based on the analysis, produce actionable suggestions for improving the loser skill:
+- Specific instruction changes to make
+- Tools/scripts to add or modify
+- Examples to include
+- Edge cases to address
+Prioritize by impact. Focus on changes that would have changed the outcome.
+### Step 8: Write Analysis Results
+Save structured analysis to `{output_path}`.
+## Output Format
+Write a JSON file with this structure:
+```json
+{
+  "comparison_summary": {
+    "winner": "A",
+    "winner_skill": "path/to/winner/skill",
+    "loser_skill": "path/to/loser/skill",
+    "comparator_reasoning": "Brief summary of why comparator chose winner"
+  },
+  "winner_strengths": [
+    "Clear step-by-step instructions for handling multi-page documents",
+    "Included validation script that caught formatting errors",
+    "Explicit guidance on fallback behavior when OCR fails"
+  ],
+  "loser_weaknesses": [
+    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
+    "No script for validation, agent had to improvise and made errors",
+    "No guidance on OCR failure, agent gave up instead of trying alternatives"
+  ],
+  "instruction_following": {
+    "winner": {
+      "score": 9,
+      "issues": ["Minor: skipped optional logging step"]
+    },
+    "loser": {
+      "score": 6,
+      "issues": [
+        "Did not use the skill's formatting template",
+        "Invented own approach instead of following step 3",
+        "Missed the 'always validate output' instruction"
+      ]
+    }
+  },
+  "improvement_suggestions": [
+    {
+      "priority": "high",
+      "category": "instructions",
+      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
+      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
+    },
+    {
+      "priority": "high",
+      "category": "tools",
+      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
+      "expected_impact": "Would catch formatting errors before final output"
+    },
+    {
+      "priority": "medium",
+      "category": "error_handling",
+      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
+      "expected_impact": "Would prevent early failure on difficult documents"
+    }
+  ],
+  "transcript_insights": {
+    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
+    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
+  }
+}
+```
+## Guidelines
+- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
+- **Be actionable**: Suggestions should be concrete changes, not vague advice
+- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
+- **Prioritize by impact**: Which changes would most likely have changed the outcome?
+- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
+- **Stay objective**: Analyze what happened, don't editorialize
+- **Think about generalization**: Would this improvement help on other evals too?
+## Categories for Suggestions
+Use these categories to organize improvement suggestions:
+| Category         | Description                                    |
+| ---------------- | ---------------------------------------------- |
+| `instructions`   | Changes to the skill's prose instructions      |
+| `tools`          | Scripts, templates, or utilities to add/modify |
+| `examples`       | Example inputs/outputs to include              |
+| `error_handling` | Guidance for handling failures                 |
+| `structure`      | Reorganization of skill content                |
+| `references`     | External docs or resources to add              |
+## Priority Levels
+- **high**: Would likely change the outcome of this comparison
+- **medium**: Would improve quality but may not change win/loss
+- **low**: Nice to have, marginal improvement
+---
+# Analyzing Benchmark Results
+When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
+## Role
+Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
+## Inputs
+You receive these parameters in your prompt:
+- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
+- **skill_path**: Path to the skill being benchmarked
+- **output_path**: Where to save the notes (as JSON array of strings)
+## Process
+### Step 1: Read Benchmark Data
+1. Read the benchmark.json containing all run results
+2. Note the configurations tested (with_skill, without_skill)
+3. Understand the run_summary aggregates already calculated
+### Step 2: Analyze Per-Assertion Patterns
+For each expectation across all runs:
+- Does it **always pass** in both configurations? (may not differentiate skill value)
+- Does it **always fail** in both configurations? (may be broken or beyond capability)
+- Does it **always pass with skill but fail without**? (skill clearly adds value here)
+- Does it **always fail with skill but pass without**? (skill may be hurting)
+- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
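The five questions above amount to a small classification over per-run pass/fail results. The sketch below is a coarse illustration under stated assumptions: each configuration is represented as a non-empty list of booleans (one per run), and the function name and label strings are hypothetical rather than part of the benchmark schema.

```python
def classify_assertion(with_skill: list, without_skill: list) -> str:
    """Map per-run pass/fail results for one assertion onto the patterns listed above."""
    if not with_skill or not without_skill:
        raise ValueError("need at least one run per configuration")
    all_with, none_with = all(with_skill), not any(with_skill)
    all_without, none_without = all(without_skill), not any(without_skill)
    if all_with and all_without:
        return "always_passes"   # may not differentiate skill value
    if none_with and none_without:
        return "always_fails"    # may be broken or beyond capability
    if all_with and none_without:
        return "skill_helps"
    if none_with and all_without:
        return "skill_hurts"
    return "variable"            # mixed results in at least one configuration


print(classify_assertion([True, True, True], [False, False, False]))
print(classify_assertion([True, False, True], [True, True, True]))
```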
+### Step 3: Analyze Cross-Eval Patterns
+Look for patterns across evals:
+- Are certain eval types consistently harder/easier?
+- Do some evals show high variance while others are stable?
+- Are there surprising results that contradict expectations?
+### Step 4: Analyze Metrics Patterns
+Look at time_seconds, tokens, tool_calls:
+- Does the skill significantly increase execution time?
+- Is there high variance in resource usage?
+- Are there outlier runs that skew the aggregates?
+### Step 5: Generate Notes
+Write freeform observations as a list of strings. Each note should:
+- State a specific observation
+- Be grounded in the data (not speculation)
+- Help the user understand something the aggregate metrics don't show
+Examples:
+- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
+- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
+- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
+- "Skill adds 13s average execution time but improves pass rate by 50%"
+- "Token usage is 80% higher with skill, primarily due to script output parsing"
+- "All 3 without-skill runs for eval 1 produced empty output"
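A figure such as "50% ± 40%" in the examples above reads as a mean pass rate with a spread across runs. The sketch below shows one way such a figure could be computed; using the sample standard deviation is an assumption, since the source does not say which statistic `scripts/aggregate_benchmark.py` reports, and the pass rates are made-up illustration values.

```python
from statistics import mean, stdev


def summarize_pass_rates(pass_rates: list) -> str:
    """Format per-run pass rates (fractions in [0, 1]) as 'mean ± spread'."""
    # Assumption: spread is the sample standard deviation; a single run has no spread.
    spread = stdev(pass_rates) if len(pass_rates) > 1 else 0.0
    return f"{mean(pass_rates):.0%} ± {spread:.0%}"


# Hypothetical pass rates for three runs of one eval.
print(summarize_pass_rates([0.1, 0.9, 0.5]))
```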
+### Step 6: Write Notes
+Save notes to `{output_path}` as a JSON array of strings:
+```json
+[
+  "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
+  "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
+  "Without-skill runs consistently fail on table extraction expectations",
+  "Skill adds 13s average execution time but improves pass rate by 50%"
+]
+```
+## Guidelines
+**DO:**
+- Report what you observe in the data
+- Be specific about which evals, expectations, or runs you're referring to
+- Note patterns that aggregate metrics would hide
+- Provide context that helps interpret the numbers
+**DO NOT:**
+- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
+- Make subjective quality judgments ("the output was good/bad")
+- Speculate about causes without evidence
+- Repeat information already in the run_summary aggregates