npm - claude-dev-env - Versions diffs - 1.15.0 → 1.16.1 - Mend

claude-dev-env 1.15.0 → 1.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16) hide show

package/agents/skill-writer-agent.md +3 -3
package/hooks/HOOK_SPECS_PROMPT_WORKFLOW.md +6 -0
package/hooks/blocking/prompt_workflow_gate_core.py +174 -56
package/hooks/blocking/test_prompt_workflow_gate_core.py +84 -13
package/package.json +1 -1
package/skills/prompt-generator/REFERENCE.md +2 -2
package/skills/prompt-generator/REFINEMENT_PIPELINE_RUNBOOK.md +4 -4
package/skills/prompt-generator/SKILL.md +24 -14
package/skills/prompt-generator/TARGET_OUTPUT.md +11 -9
package/skills/prompt-generator/evals/prompt-generator.json +25 -13
package/skills/skill-builder/SKILL.md +87 -0
package/skills/skill-builder/references/delegation-map.md +151 -0
package/skills/skill-builder/templates/gap-analysis.md +41 -0
package/skills/skill-builder/workflows/improve-skill.md +88 -0
package/skills/skill-builder/workflows/new-skill.md +223 -0
package/skills/skill-builder/workflows/polish-skill.md +83 -0

package/skills/skill-builder/workflows/improve-skill.md ADDED Viewed

@@ -0,0 +1,88 @@
+# Improve Skill Workflow
+Observation-first flow for iterating on an existing skill.
+## Prerequisites
+- An existing skill that needs improvement
+- The skill has been used at least once (or the user has observed specific issues)
+---
+## Phase 1: Observe
+**Goal:** Document the existing skill's current behavior by running it on real tasks.
+> "Use the Skill in real workflows: Give Claude B (with the Skill loaded) actual tasks, not test scenarios"
+### Process
+1. Identify the skill to improve. Read its current SKILL.md and any reference files.
+2. Ask the user what prompted the improvement:
+   - "What specific issue did you observe?"
+   - "Can you give me a concrete task where the skill underperformed?"
+   - "Is this a triggering issue (skill does not activate), a quality issue (skill activates but produces poor results), or a scope issue (skill does the wrong thing)?"
+3. Run the existing skill on 2-3 real tasks. For each, spawn a subagent:
+   ```
+   Execute this task using the skill at [path-to-existing-skill]:
+   - Read the skill at [path]/SKILL.md and follow its instructions
+   - Task: [realistic task from user]
+   - Save outputs to: [skill-name]-workspace/observation/task-[N]/outputs/
+   - Save transcript to: [skill-name]-workspace/observation/task-[N]/transcript.md
+   ```
+4. Analyze the transcripts. Document observations:
+   - Where did the skill work well?
+   - Where did it fail or produce subpar results?
+   - Did Claude B follow the skill's instructions as written?
+   - Did Claude B ignore any sections or files?
+   - Did Claude B explore in unexpected directions?
+5. Generate a gap analysis (same template as new-skill Phase 1) focused on the delta between current behavior and desired behavior.
+**Output:** `[skill-name]-workspace/gap-analysis.md` with observation-based gaps
+---
+## Phase 2-6: Follow the New Skill Workflow
+From here, follow the same phases as `${CLAUDE_SKILL_DIR}/workflows/new-skill.md`, starting at Phase 2 (Build Evals).
+Key differences from the new-skill flow:
+- **Phase 2 (Build Evals):** Evals should test the specific issues observed in Phase 1, not hypothetical gaps.
+- **Phase 3 (Write Skill):** Instead of writing from scratch, invoke `/skill-writer` with:
+  ```
+  Refine this existing skill based on observation findings.
+  Current SKILL.md: [reference or paste current skill]
+  Gap analysis: [reference observation-based gaps]
+  Eval scenarios: [reference evals]
+  Constraint: Preserve what works. Only change what the observations demand.
+  ```
+- **Phase 4 (Test):** The baseline is the CURRENT skill (snapshot it before editing). Compare old-skill vs new-skill, not with-skill vs without-skill.
+  Before making any changes, snapshot the existing skill:
+  ```bash
+  cp -r [skill-path] [workspace]/skill-snapshot/
+  ```
+  Then for baseline runs, point subagents at the snapshot:
+  ```
+  Execute this task using the ORIGINAL skill at [workspace]/skill-snapshot/:
+  - Read the skill and follow its instructions
+  - Task: [eval prompt]
+  - Save outputs to: [workspace]/iteration-N/eval-[name]/old_skill/outputs/
+  - Save transcript to: [workspace]/iteration-N/eval-[name]/old_skill/transcript.md
+  ```
+- **Phase 5 (Iterate):** Same process. The improvement loop compares new version against the snapshot.
+- **Phase 6 (Polish):** Same process. Run description optimization if triggering was an issue.

package/skills/skill-builder/workflows/new-skill.md ADDED Viewed

@@ -0,0 +1,223 @@
+# New Skill Workflow
+Full evaluation-driven lifecycle for building a new skill from scratch.
+## Prerequisites
+- The user has a task or domain they want to capture as a skill
+- No existing skill for this capability (or intentionally starting fresh)
+---
+## Phase 1: Identify Gaps
+**Goal:** Document what fails or requires repeated context when working without a skill.
+### Process
+1. Have a guided conversation to uncover gaps. Explore these areas:
+   - "What task were you doing when you realized you needed a skill?"
+   - "What context did you repeatedly provide to Claude?"
+   - "Where did Claude fail or produce subpar results without guidance?"
+   - "What domain knowledge was missing?"
+   - "What specific format or structure did you need?"
+   - "Were there tools or scripts that needed to be used in a particular way?"
+   - "What rules or constraints did Claude violate?"
+2. As patterns emerge, probe for eval-worthy scenarios:
+   - "Can you give me a concrete example of a task where this failed?"
+   - "What would success look like for that specific task?"
+   - "Are there edge cases where the right approach changes?"
+3. Generate `gap-analysis.md` from the conversation using the template at `${CLAUDE_SKILL_DIR}/templates/gap-analysis.md`. Fill in all sections from what was discussed.
+4. Review the gap analysis with the user. Confirm completeness before moving to Phase 2.
+**Output:** `[skill-name]-workspace/gap-analysis.md`
+---
+## Phase 2: Build Evals
+**Goal:** Create 3+ evaluation scenarios that test the identified gaps. Establish a baseline.
+### Process
+1. Transform each gap into at least one eval scenario. Each scenario needs:
+   - A realistic user prompt (detailed and specific, like a real request)
+   - A description of what success looks like
+   - Objectively verifiable expectations (assertions)
+2. Draft evals using the schema at `${CLAUDE_SKILL_DIR}/templates/eval-scenario.json`. Ensure:
+   - Minimum 3 scenarios (official requirement)
+   - Every identified gap has at least one scenario testing it
+   - Expectations are objectively verifiable, not subjective
+   - Prompts sound like things a real user would say
+3. Review eval scenarios with the user. Adjust until both sides are satisfied.
+4. Save to `[skill-name]-workspace/evals/evals.json`.
+5. **Establish baseline.** For each eval, spawn a subagent WITHOUT any skill:
+   ```
+   Execute this task with NO skill loaded:
+   - Task: [eval prompt]
+   - Input files: [eval files if any, or "none"]
+   - Save all output files to: [workspace]/iteration-0/eval-[name]/without_skill/outputs/
+   - Save a complete transcript of your work to: [workspace]/iteration-0/eval-[name]/without_skill/transcript.md
+   ```
+   Spawn all baseline runs in parallel. Capture timing data when each completes.
+6. Grade baseline results using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for exact grading invocation.
+**Output:** `[skill-name]-workspace/evals/evals.json` and baseline results in `iteration-0/`
+---
+## Phase 3: Write Minimal Skill
+**Goal:** Create just enough skill content to address the documented gaps and pass evaluations.
+### Process
+1. Invoke `/skill-writer` with this context:
+   ```
+   Create a skill based on this gap analysis and eval scenarios.
+   Gap analysis: [reference or paste gap-analysis.md]
+   Eval scenarios: [reference or paste evals.json expected_output and expectations]
+   Baseline failures: [summarize what Claude got wrong in iteration-0]
+   Constraint: Write the minimum instructions needed to address these specific gaps.
+   Every line must serve a documented gap. Do not over-document.
+   ```
+2. `/skill-writer` will run its workflow: classify type, set degree of freedom, ask clarifying questions, produce the SKILL.md artifact.
+3. Review the draft with the user:
+   - "Does this address all the gaps we identified?"
+   - "Is anything here unnecessary or over-engineered?"
+   - "Would this pass our eval scenarios?"
+4. Save the skill to its target directory.
+**Output:** The skill's SKILL.md (and optional reference files)
+---
+## Phase 4: Test (Feedback Loop)
+**Goal:** Run the skill on eval scenarios, compare against baseline, identify remaining gaps.
+### Process
+1. **Spawn all runs in parallel.** For each eval scenario, launch a with-skill subagent:
+   ```
+   Execute this task:
+   - Read the skill at [path-to-skill]/SKILL.md and follow its instructions
+   - Task: [eval prompt from evals.json]
+   - Input files: [eval files if any, or "none"]
+   - Save all output files to: [workspace]/iteration-N/eval-[name]/with_skill/outputs/
+   - Save a complete transcript of your work to: [workspace]/iteration-N/eval-[name]/with_skill/transcript.md
+   ```
+   For iteration-1, the without-skill baseline already exists from Phase 2.
+2. **While runs are in progress**, review and refine assertions if needed based on what was learned from the baseline.
+3. **When runs complete**, immediately capture timing data (`total_tokens`, `duration_ms`) to `timing.json` in each run directory. This data is only available in the task completion notification.
+4. **Grade each run** using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the grading process.
+5. **Aggregate into benchmark** using skill-creator's aggregation script. See delegation-map.md for the exact command.
+6. **Launch the eval viewer** using skill-creator's generate_review.py. See delegation-map.md for the exact command. For iteration 2+, include `--previous-workspace` to show diffs.
+7. Tell the user to review in the viewer:
+   - "Outputs" tab: click through each test case, leave feedback
+   - "Benchmark" tab: quantitative comparison (pass rates, timing, tokens)
+8. Wait for the user to complete their review.
+**Output:** `grading.json`, `benchmark.json`, `feedback.json` in the iteration directory
+---
+## Phase 5: Iterate
+**Goal:** Refine the skill based on observed Claude B behavior and user feedback.
+### Process
+1. Read `feedback.json` from the viewer. Empty feedback means the user was satisfied with that test case.
+2. Read transcripts from Phase 4 runs. Watch for the signals the official docs highlight:
+   - **Unexpected exploration paths** -- Claude B read files in an order you did not anticipate
+   - **Missed connections** -- Claude B did not follow references to important files
+   - **Overreliance on certain sections** -- content that should be promoted to SKILL.md
+   - **Ignored content** -- files Claude B never accessed (may be unnecessary or poorly signaled)
+   - **Repeated work across test cases** -- all subagents wrote similar helper scripts (bundle them into the skill)
+3. Synthesize observations into actionable improvements. For each piece of feedback, identify the specific skill change that would fix it.
+4. Apply improvements. For significant changes, re-invoke `/skill-writer` with:
+   ```
+   Refine this existing skill based on testing observations.
+   Current SKILL.md: [reference or paste]
+   User feedback: [from feedback.json -- only non-empty entries]
+   Behavioral observations: [from transcript analysis]
+   Specific issues to address:
+   1. [Issue]
+   2. [Issue]
+   Constraint: Only change what the feedback demands. Do not reorganize working content.
+   ```
+5. Key principles for this phase (from the official docs):
+   - **Generalize from feedback** -- the skill will be used across many different prompts, not just these test cases
+   - **Keep the prompt lean** -- remove instructions that are not pulling their weight
+   - **Explain the why** -- theory of mind beats rigid MUSTs
+   - **Bundle repeated work** -- if subagents all wrote similar scripts, add them to the skill
+6. Return to Phase 4 with the refined skill. Continue iterating until:
+   - User feedback is all empty (satisfied with every test case)
+   - Pass rates meet acceptable thresholds
+   - No meaningful progress between iterations
+---
+## Phase 6: Polish
+**Goal:** Optimize the skill description for triggering accuracy and run final validation.
+### Process
+1. **Description optimization.** Follow the process in `${CLAUDE_SKILL_DIR}/workflows/polish-skill.md`.
+2. **Final validation.** Run the skill-writer self-check rubric against the finished skill:
+   - [ ] Description is third person with trigger phrases
+   - [ ] Under 500 lines
+   - [ ] States what to do in positive terms (not prohibition-heavy)
+   - [ ] Degree of freedom matches task fragility
+   - [ ] Progressive disclosure used (heavy content in separate files)
+   - [ ] Examples are concrete, not abstract
+   - [ ] Frontmatter fields are valid
+   - [ ] One skill = one capability
+3. **Final checklist** from the official Anthropic docs:
+   - [ ] At least 3 evaluation scenarios created and passing
+   - [ ] Tested with real usage scenarios
+   - [ ] Skill solves documented gaps (not imagined requirements)
+   - [ ] Iterative refinement based on observed behavior (not assumptions)
+4. Present the finished skill to the user with:
+   - Final benchmark comparison (latest iteration vs baseline)
+   - Summary of gaps addressed
+   - Any remaining limitations or known edge cases

package/skills/skill-builder/workflows/polish-skill.md ADDED Viewed

@@ -0,0 +1,83 @@
+# Polish Skill Workflow
+Final optimization pass for a skill that is functionally complete.
+## Prerequisites
+- The skill passes its evaluation scenarios
+- The user is satisfied with output quality
+- This is the final step before the skill is considered done
+---
+## Step 1: Description Optimization
+Optimize the skill's description for triggering accuracy using the skill-creator's trigger eval system.
+### Generate trigger eval queries
+Create 20 eval queries: 10 should-trigger and 10 should-not-trigger.
+**Should-trigger queries (10):** Different phrasings of the same intent. Include:
+- Formal and casual variations
+- Cases where the user does not explicitly name the skill but clearly needs it
+- Uncommon use cases
+- Cases where this skill competes with another but should win
+**Should-not-trigger queries (10):** Near-misses that share keywords but need something different. Include:
+- Adjacent domains with overlapping terminology
+- Ambiguous phrasing where naive keyword matching would falsely trigger
+- Tasks that touch the skill's domain but in a context where another tool is better
+All queries must be realistic -- detailed, specific, with file paths, personal context, casual speech. Not abstract one-liners.
+### Review with user
+Present the eval set using the skill-creator's HTML review template. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact process.
+The user can edit queries, toggle should-trigger, and add/remove entries.
+### Run optimization loop
+See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command. The loop:
+1. Splits eval set into 60% train / 40% held-out test
+2. Evaluates current description (3 runs per query for reliability)
+3. Proposes improvements based on failures
+4. Re-evaluates on both train and test
+5. Iterates up to 5 times
+6. Selects best description by test score (avoids overfitting)
+### Apply result
+Update the skill's SKILL.md frontmatter with the optimized description. Show the user before/after with scores.
+---
+## Step 2: Final Validation
+Run the skill-writer self-check rubric:
+- [ ] Description is third person with trigger phrases
+- [ ] SKILL.md body under 500 lines
+- [ ] States what to do in positive terms (not prohibition-heavy)
+- [ ] Degree of freedom matches task fragility
+- [ ] Progressive disclosure used (heavy content in separate files)
+- [ ] No time-sensitive claims unless clearly dated
+- [ ] Examples are concrete, not abstract
+- [ ] Frontmatter fields are valid per official docs
+- [ ] One skill = one capability
+- [ ] Consistent terminology throughout
+- [ ] File references are one level deep from SKILL.md
+- [ ] Files over 100 lines have a table of contents
+---
+## Step 3: Final Summary
+Present the finished skill to the user:
+1. **Benchmark summary:** Final pass rate vs baseline, with delta
+2. **Gaps addressed:** Map each original gap to the skill content that addresses it
+3. **Description optimization:** Before/after trigger accuracy scores
+4. **Known limitations:** Anything the skill does not handle (scope boundaries)
+5. **Maintenance notes:** What to watch for in future usage that might warrant re-iteration