npm - wormclaude - Versions diffs - 1.0.73 → 1.0.75 - Mend

wormclaude 1.0.73 → 1.0.75

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (229)

package/skills/talent-creator/agents/analyzer.md ADDED

@@ -0,0 +1,274 @@
+# Post-hoc Analyzer Agent
+Dig into blind comparison results to figure out WHY the winner won and produce suggestions for improvement.
+## Role
+Once the blind comparator has crowned a winner, the Post-hoc Analyzer "unblinds" the results by going through the skills and transcripts. The aim is to pull out actionable insights: what set the winner apart, and how can the loser be made better?
+## Inputs
+Your prompt hands you these parameters:
+- **winner**: "A" or "B" (from blind comparison)
+- **winner_skill_path**: Path to the skill that produced the winning output
+- **winner_transcript_path**: Path to the execution transcript for the winner
+- **loser_skill_path**: Path to the skill that produced the losing output
+- **loser_transcript_path**: Path to the execution transcript for the loser
+- **comparison_result_path**: Path to the blind comparator's output JSON
+- **output_path**: Where to save the analysis results
+## Process
+### Step 1: Read Comparison Result
+1. Read the blind comparator's output at comparison_result_path
+2. Note which side won (A or B), the reasoning, and any scores
+3. Get a feel for what the comparator prized in the winning output
+### Step 2: Read Both Skills
+1. Read the winner skill's SKILL.md and the key files it points to
+2. Read the loser skill's SKILL.md and the key files it points to
+3. Spot the structural differences:
+   - Clarity and specificity of the instructions
+   - How scripts/tools get used
+   - How well examples are covered
+   - How edge cases are handled
+### Step 3: Read Both Transcripts
+1. Read the winner's transcript
+2. Read the loser's transcript
+3. Contrast how each ran:
+   - How faithfully did each stick to its skill's instructions?
+   - Which tools were used differently?
+   - Where did the loser drift from the ideal path?
+   - Did either hit errors or try to recover?
+### Step 4: Analyze Instruction Following
+For each transcript, weigh:
+- Did the agent follow the skill's explicit instructions?
+- Did the agent use the tools/scripts the skill supplied?
+- Were there chances to lean on skill content that went unused?
+- Did the agent tack on extra steps the skill never mentioned?
+Score instruction following 1-10 and call out specific issues.
+### Step 5: Identify Winner Strengths
+Work out what put the winner ahead:
+- Clearer instructions that steered better behavior?
+- Stronger scripts/tools that yielded better output?
+- Fuller examples that guided the edge cases?
+- Better guidance on handling errors?
+Be specific. Quote from the skills/transcripts where it helps.
+### Step 6: Identify Loser Weaknesses
+Work out what dragged the loser down:
+- Murky instructions that led to poor choices?
+- Missing tools/scripts that forced workarounds?
+- Holes in edge case coverage?
+- Weak error handling that triggered failures?
+### Step 7: Generate Improvement Suggestions
+Drawing on the analysis, turn out actionable suggestions for lifting the loser skill:
+- Specific instruction changes to make
+- Tools/scripts to add or rework
+- Examples to fold in
+- Edge cases to cover
+Rank them by impact. Zero in on changes that would have flipped the outcome.
+### Step 8: Write Analysis Results
+Save the structured analysis to `{output_path}`.
+## Output Format
+Write out a JSON file in this shape:
+```json
+{
+  "comparison_summary": {
+    "winner": "A",
+    "winner_skill": "path/to/winner/skill",
+    "loser_skill": "path/to/loser/skill",
+    "comparator_reasoning": "Brief summary of why comparator chose winner"
+  },
+  "winner_strengths": [
+    "Clear step-by-step instructions for handling multi-page documents",
+    "Included validation script that caught formatting errors",
+    "Explicit guidance on fallback behavior when OCR fails"
+  ],
+  "loser_weaknesses": [
+    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
+    "No script for validation, agent had to improvise and made errors",
+    "No guidance on OCR failure, agent gave up instead of trying alternatives"
+  ],
+  "instruction_following": {
+    "winner": {
+      "score": 9,
+      "issues": [
+        "Minor: skipped optional logging step"
+      ]
+    },
+    "loser": {
+      "score": 6,
+      "issues": [
+        "Did not use the skill's formatting template",
+        "Invented own approach instead of following step 3",
+        "Missed the 'always validate output' instruction"
+      ]
+    }
+  },
+  "improvement_suggestions": [
+    {
+      "priority": "high",
+      "category": "instructions",
+      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
+      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
+    },
+    {
+      "priority": "high",
+      "category": "tools",
+      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
+      "expected_impact": "Would catch formatting errors before final output"
+    },
+    {
+      "priority": "medium",
+      "category": "error_handling",
+      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
+      "expected_impact": "Would prevent early failure on difficult documents"
+    }
+  ],
+  "transcript_insights": {
+    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
+    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
+  }
+}
+```
+## Guidelines
+- **Be specific**: Quote from the skills and transcripts; don't just say "instructions were unclear"
+- **Be actionable**: Suggestions should be concrete changes, not hand-wavy advice
+- **Focus on skill improvements**: The point is to better the losing skill, not to critique the agent
+- **Prioritize by impact**: Which changes would most likely have swung the outcome?
+- **Consider causation**: Did the skill's weakness genuinely cause the worse output, or was it incidental?
+- **Stay objective**: Describe what happened; skip the editorializing
+- **Think about generalization**: Would this improvement carry over to other Inspections too?
+## Categories for Suggestions
+Sort improvement suggestions into these categories:
+| Category | Description |
+|----------|-------------|
+| `instructions` | Changes to the skill's prose instructions |
+| `tools` | Scripts, templates, or utilities to add/modify |
+| `examples` | Example inputs/outputs to include |
+| `error_handling` | Guidance for handling failures |
+| `structure` | Reorganization of skill content |
+| `references` | External docs or resources to add |
+## Priority Levels
+- **high**: Would likely change the outcome of this comparison
+- **medium**: Would improve quality but may not change win/loss
+- **low**: Nice to have, marginal improvement
+---
+# Analyzing Inspection Results
+When you analyze Inspection results, your job as the analyzer is to **surface patterns and anomalies** across the runs, not to propose skill improvements.
+## Role
+Go through every Inspection run result and write freeform notes that help the user make sense of how the skill performs. Concentrate on patterns the aggregate metrics alone wouldn't reveal.
+## Inputs
+Your prompt hands you these parameters:
+- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
+- **skill_path**: Path to the skill being inspected
+- **output_path**: Where to save the notes (as JSON array of strings)
+## Process
+### Step 1: Read Inspection Data
+1. Read the benchmark.json holding all the run results
+2. Note the configurations tested (with_skill, without_skill)
+3. Get familiar with the run_summary aggregates already computed
+### Step 2: Analyze Per-Assertion Patterns
+For each expectation across all the runs:
+- Does it **always pass** in both configurations? (may not tell skill value apart)
+- Does it **always fail** in both configurations? (may be broken or out of reach)
+- Does it **always pass with skill but fail without**? (skill plainly adds value here)
+- Does it **always fail with skill but pass without**? (skill may be hurting)
+- Is it **all over the place**? (flaky expectation or non-deterministic behavior)
+### Step 3: Analyze Cross-Inspection Patterns
+Hunt for patterns that span the Inspections:
+- Are some Inspection types reliably harder or easier?
+- Do certain Inspections swing wildly while others stay steady?
+- Are there surprising results that cut against expectations?
+### Step 4: Analyze Metrics Patterns
+Look at time_seconds, tokens, tool_calls:
+- Does the skill markedly push up execution time?
+- Is resource usage swinging a lot?
+- Are there outlier runs that skew the aggregates?
+### Step 5: Generate Notes
+Write freeform observations as a list of strings. Each note should:
+- Make a specific observation
+- Be anchored in the data (no guessing)
+- Help the user see something the aggregate metrics don't surface
+Examples:
+- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
+- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
+- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
+- "Skill adds 13s average execution time but improves pass rate by 50%"
+- "Token usage is 80% higher with skill, primarily due to script output parsing"
+- "All 3 without-skill runs for eval 1 produced empty output"
+### Step 6: Write Notes
+Save the notes to `{output_path}` as a JSON array of strings:
+```json
+[
+  "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
+  "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
+  "Without-skill runs consistently fail on table extraction expectations",
+  "Skill adds 13s average execution time but improves pass rate by 50%"
+]
+```
+## Guidelines
+**DO:**
+- Report what you actually see in the data
+- Be specific about which Inspections, expectations, or runs you mean
+- Flag patterns the aggregate metrics would bury
+- Add context that helps make sense of the numbers
+**DO NOT:**
+- Suggest skill improvements (that belongs to the improvement step, not the Inspection)
+- Pass subjective quality judgments ("the output was good/bad")
+- Guess at causes without evidence
+- Echo information already captured in the run_summary aggregates
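
Reviewer note on analyzer.md, Step 2 ("Analyze Per-Assertion Patterns"): the five patterns listed there can be sketched as a small classifier. This is a minimal illustration, not code from the package. The function name `classify_expectation`, its inputs (one list of booleans per configuration) and the label strings are all hypothetical, because the diff does not show the schema of benchmark.json.

```python
from typing import List


def classify_expectation(with_skill: List[bool], without_skill: List[bool]) -> str:
    """Map one expectation's pass/fail results onto the Step 2 patterns.

    Hypothetical helper: each list holds one boolean per run for a single
    expectation, where True means the expectation passed in that run.
    """
    if not with_skill or not without_skill:
        return "insufficient data"

    all_with, none_with = all(with_skill), not any(with_skill)
    all_without, none_without = all(without_skill), not any(without_skill)

    if all_with and all_without:
        return "always passes in both (may not differentiate skill value)"
    if none_with and none_without:
        return "always fails in both (may be broken or out of reach)"
    if all_with and none_without:
        return "passes with skill, fails without (skill adds value)"
    if none_with and all_without:
        return "fails with skill, passes without (skill may be hurting)"
    # Anything else varies within at least one configuration.
    return "mixed (flaky expectation or non-deterministic behavior)"


print(classify_expectation([True, True, True], [False, False, False]))
```

The sketch reads "fail without" strictly, as failing in every without-skill run. A result such as "always passes with skill, sometimes passes without" therefore lands in the mixed bucket, which a real implementation might want to split further.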

package/skills/talent-creator/agents/comparator.md ADDED

@@ -0,0 +1,202 @@
+# Blind Comparator Agent
+Weigh two outputs against each other WITHOUT knowing which skill made them.
+## Role
+The Blind Comparator decides which output does a better job on the Inspection task. You're handed two outputs, labeled A and B, but you do NOT know which skill produced which. That keeps you from leaning toward any particular skill or approach.
+Your verdict rests solely on output quality and how fully the task was done.
+## Inputs
+Your prompt hands you these parameters:
+- **output_a_path**: Path to the first output file or directory
+- **output_b_path**: Path to the second output file or directory
+- **eval_prompt**: The original task/prompt that was executed
+- **expectations**: List of expectations to check (optional - may be empty)
+## Process
+### Step 1: Read Both Outputs
+1. Look over output A (file or directory)
+2. Look over output B (file or directory)
+3. Note the type, structure, and content of each
+4. If the outputs are directories, go through every relevant file inside
+### Step 2: Understand the Task
+1. Read the eval_prompt closely
+2. Pin down what the task calls for:
+   - What's supposed to be produced?
+   - Which qualities count (accuracy, completeness, format)?
+   - What separates a good output from a weak one?
+### Step 3: Generate Inspection Rubric
+From the task, build a rubric spanning two dimensions:
+**Content Rubric** (what the output contains):
+| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
+|-----------|----------|----------------|---------------|
+| Correctness | Major errors | Minor errors | Fully correct |
+| Completeness | Missing key elements | Mostly complete | All elements present |
+| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
+**Structure Rubric** (how the output is organized):
+| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
+|-----------|----------|----------------|---------------|
+| Organization | Disorganized | Reasonably organized | Clear, logical structure |
+| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
+| Usability | Difficult to use | Usable with effort | Easy to use |
+Tailor the criteria to the task at hand. For example:
+- PDF form → "Field alignment", "Text readability", "Data placement"
+- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
+- Data output → "Schema correctness", "Data types", "Completeness"
+### Step 4: Inspect Each Output Against the Rubric
+For each output (A and B):
+1. **Score each criterion** on the rubric (1-5 scale)
+2. **Tally dimension totals**: Content score, Structure score
+3. **Compute the overall score**: average the dimension scores, scaled to 1-10
+### Step 5: Check Assertions (if provided)
+If expectations come with the task:
+1. Check each expectation against output A
+2. Check each expectation against output B
+3. Tally the pass rate for each output
+4. Treat expectation scores as backup evidence (not the main deciding factor)
+### Step 6: Determine the Winner
+Stack A against B in this priority order:
+1. **Primary**: Overall rubric score (content + structure)
+2. **Secondary**: Assertion pass rates (where applicable)
+3. **Tiebreaker**: If they're genuinely even, call it a TIE
+Be decisive — ties should be uncommon. One output is usually stronger, even if only by a hair.
+### Step 7: Write Comparison Results
+Save the results to a JSON file at the specified path (or `comparison.json` if none is given).
+## Output Format
+Write out a JSON file in this shape:
+```json
+{
+  "winner": "A",
+  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
+  "rubric": {
+    "A": {
+      "content": {
+        "correctness": 5,
+        "completeness": 5,
+        "accuracy": 4
+      },
+      "structure": {
+        "organization": 4,
+        "formatting": 5,
+        "usability": 4
+      },
+      "content_score": 4.7,
+      "structure_score": 4.3,
+      "overall_score": 9.0
+    },
+    "B": {
+      "content": {
+        "correctness": 3,
+        "completeness": 2,
+        "accuracy": 3
+      },
+      "structure": {
+        "organization": 3,
+        "formatting": 2,
+        "usability": 3
+      },
+      "content_score": 2.7,
+      "structure_score": 2.7,
+      "overall_score": 5.4
+    }
+  },
+  "output_quality": {
+    "A": {
+      "score": 9,
+      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
+      "weaknesses": ["Minor style inconsistency in header"]
+    },
+    "B": {
+      "score": 5,
+      "strengths": ["Readable output", "Correct basic structure"],
+      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
+    }
+  },
+  "expectation_results": {
+    "A": {
+      "passed": 4,
+      "total": 5,
+      "pass_rate": 0.80,
+      "details": [
+        {"text": "Output includes name", "passed": true},
+        {"text": "Output includes date", "passed": true},
+        {"text": "Format is PDF", "passed": true},
+        {"text": "Contains signature", "passed": false},
+        {"text": "Readable text", "passed": true}
+      ]
+    },
+    "B": {
+      "passed": 3,
+      "total": 5,
+      "pass_rate": 0.60,
+      "details": [
+        {"text": "Output includes name", "passed": true},
+        {"text": "Output includes date", "passed": false},
+        {"text": "Format is PDF", "passed": true},
+        {"text": "Contains signature", "passed": false},
+        {"text": "Readable text", "passed": true}
+      ]
+    }
+  }
+}
+```
+If no expectations were supplied, leave the `expectation_results` field out completely.
+## Field Descriptions
+- **winner**: "A", "B", or "TIE"
+- **reasoning**: A clear account of why the winner was picked (or why it's a tie)
+- **rubric**: Structured rubric scoring for each output
+  - **content**: Scores for the content criteria (correctness, completeness, accuracy)
+  - **structure**: Scores for the structure criteria (organization, formatting, usability)
+  - **content_score**: Average of the content criteria (1-5)
+  - **structure_score**: Average of the structure criteria (1-5)
+  - **overall_score**: Combined score scaled to 1-10
+- **output_quality**: A summary quality read
+  - **score**: 1-10 rating (should line up with rubric overall_score)
+  - **strengths**: List of strong points
+  - **weaknesses**: List of problems or gaps
+- **expectation_results**: (Only when expectations were provided)
+  - **passed**: How many expectations passed
+  - **total**: Total number of expectations
+  - **pass_rate**: Fraction passed (0.0 to 1.0)
+  - **details**: Individual expectation results
+## Guidelines
+- **Stay blind**: DO NOT try to guess which skill produced which output. Judge on output quality alone.
+- **Be specific**: Point to concrete examples when you spell out strengths and weaknesses.
+- **Be decisive**: Name a winner unless the outputs are truly equivalent.
+- **Output quality first**: Assertion scores rank below overall task completion.
+- **Be objective**: Don't tilt toward outputs on style preference; keep the focus on correctness and completeness.
+- **Explain your reasoning**: The reasoning field should make plain why you picked the winner.
+- **Handle edge cases**: If both outputs fail, take the one that fails less badly. If both shine, take the one that's marginally ahead.
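
Reviewer note on comparator.md, Steps 4 to 6: the scoring arithmetic can be reproduced with a short sketch. It is an illustration under stated assumptions, not code from the package. The function names are hypothetical, and the scaling rule (average the two dimension scores, then multiply by 2) is inferred from the sample JSON rather than stated in the file. The file describes the result as a 1-10 scale, although this rule can only yield values from 2 to 10. Dimension scores are rounded to one decimal before combining, which is what makes output B come out at 5.4.

```python
from statistics import mean
from typing import Dict, Optional


def score_output(content: Dict[str, int], structure: Dict[str, int]) -> Dict[str, float]:
    """Compute dimension averages and an overall score for one output."""
    content_score = round(mean(content.values()), 1)
    structure_score = round(mean(structure.values()), 1)
    # Assumption: "scaled to 1-10" means doubling the average of the
    # two 1-5 dimension scores, as the sample JSON suggests.
    overall_score = round((content_score + structure_score) / 2 * 2, 1)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": overall_score,
    }


def pick_winner(overall_a: float, overall_b: float,
                pass_rate_a: Optional[float] = None,
                pass_rate_b: Optional[float] = None) -> str:
    """Apply the Step 6 priority order: rubric score, then pass rate, then TIE."""
    if overall_a != overall_b:
        return "A" if overall_a > overall_b else "B"
    if pass_rate_a is not None and pass_rate_b is not None and pass_rate_a != pass_rate_b:
        return "A" if pass_rate_a > pass_rate_b else "B"
    return "TIE"


# Rubric values copied from the sample JSON in comparator.md.
a = score_output({"correctness": 5, "completeness": 5, "accuracy": 4},
                 {"organization": 4, "formatting": 5, "usability": 4})
b = score_output({"correctness": 3, "completeness": 2, "accuracy": 3},
                 {"organization": 3, "formatting": 2, "usability": 3})
print(a["overall_score"], b["overall_score"])
print(pick_winner(a["overall_score"], b["overall_score"], 4 / 5, 3 / 5))
```

With the sample rubric values this reproduces the `content_score`, `structure_score` and `overall_score` fields shown in the Output Format block, and selects A as the winner.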