npm - claude-dev-env - Versions diffs - 1.16.0 → 1.17.0 - Mend

claude-dev-env 1.16.0 → 1.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/hooks/HOOK_SPECS_PROMPT_WORKFLOW.md +6 -0
package/hooks/blocking/prompt_workflow_gate_core.py +174 -56
package/hooks/blocking/test_prompt_workflow_gate_core.py +84 -13
package/package.json +1 -1
package/skills/prompt-generator/SKILL.md +2 -0
package/skills/prompt-generator/evals/prompt-generator.json +5 -5
package/skills/prompt-generator/templates/skill-from-ground-up.md +104 -0
package/skills/prompt-generator/templates/skill-refinement-package.md +109 -0
package/skills/skill-builder/SKILL.md +107 -0
package/skills/skill-builder/references/delegation-map.md +153 -0
package/skills/skill-builder/templates/gap-analysis.md +41 -0
package/skills/skill-builder/workflows/improve-skill.md +97 -0
package/skills/skill-builder/workflows/new-skill.md +235 -0
package/skills/skill-builder/workflows/polish-skill.md +92 -0
package/skills/skill-writer/SKILL.md +18 -0

package/skills/skill-builder/workflows/new-skill.md ADDED Viewed

@@ -0,0 +1,235 @@
+# New Skill Workflow
+Full evaluation-driven lifecycle for building a new skill from scratch.
+## Prerequisites
+- The user has a task or domain they want to capture as a skill
+- No existing skill for this capability (or intentionally starting fresh)
+### Ground-up package layout (required before multi-file implementation)
+When the outcome includes **ARCHITECTURE.md**, **REFERENCE / EXAMPLES / WORKFLOWS**, and **`evals/*.json`** under a workspace (Anthropic-style progressive disclosure plus checkpointed rollout):
+1. Read `prompt-generator/templates/skill-from-ground-up.md` from the claude-dev-env `skills/` tree (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-from-ground-up.md`).
+2. Run `/prompt-generator` using that template (substitute tokens per its table) **before** Phase 3 expands the repo; align the XML scope block with this workflow’s workspace and evidence rules.
+3. Keep Phase 1–2 artifacts honest: eval prompts and expectations stay grounded in **real** user scenarios; the template reinforces eval rows that reference pasted or explicitly approved evidence only.
+Skip this block only when the user explicitly wants a **single-file** SKILL.md with no staged package plan.
+Refinements to an **existing** skill package use `prompt-generator/templates/skill-refinement-package.md` instead (see `improve-skill.md`).
+---
+## Phase 1: Identify Gaps
+**Goal:** Document what fails or requires repeated context when working without a skill.
+### Process
+1. Have a guided conversation to uncover gaps. Explore these areas:
+   - "What task were you doing when you realized you needed a skill?"
+   - "What context did you repeatedly provide to Claude?"
+   - "Where did Claude fail or produce subpar results without guidance?"
+   - "What domain knowledge was missing?"
+   - "What specific format or structure did you need?"
+   - "Were there tools or scripts that needed to be used in a particular way?"
+   - "What rules or constraints did Claude violate?"
+2. As patterns emerge, probe for eval-worthy scenarios:
+   - "Can you give me a concrete example of a task where this failed?"
+   - "What would success look like for that specific task?"
+   - "Are there edge cases where the right approach changes?"
+3. Generate `gap-analysis.md` from the conversation using the template at `${CLAUDE_SKILL_DIR}/templates/gap-analysis.md`. Fill in all sections from what was discussed.
+4. Review the gap analysis with the user. Confirm completeness before moving to Phase 2.
+**Output:** `[skill-name]-workspace/gap-analysis.md`
+---
+## Phase 2: Build Evals
+**Goal:** Create 3+ evaluation scenarios that test the identified gaps. Establish a baseline.
+### Process
+1. Transform each gap into at least one eval scenario. Each scenario needs:
+   - A realistic user prompt (detailed and specific, like a real request)
+   - A description of what success looks like
+   - Objectively verifiable expectations (assertions)
+2. Draft evals using the schema at `${CLAUDE_SKILL_DIR}/templates/eval-scenario.json`. Ensure:
+   - Minimum 3 scenarios (official requirement)
+   - Every identified gap has at least one scenario testing it
+   - Expectations are objectively verifiable, not subjective
+   - Prompts sound like things a real user would say
+3. Review eval scenarios with the user. Adjust until both sides are satisfied.
+4. Save to `[skill-name]-workspace/evals/evals.json`.
+5. **Establish baseline.** For each eval, spawn a subagent WITHOUT any skill:
+   ```
+   Execute this task with NO skill loaded:
+   - Task: [eval prompt]
+   - Input files: [eval files if any, or "none"]
+   - Save all output files to: [workspace]/iteration-0/eval-[name]/without_skill/outputs/
+   - Save a complete transcript of your work to: [workspace]/iteration-0/eval-[name]/without_skill/transcript.md
+   ```
+   Spawn all baseline runs in parallel. Capture timing data when each completes.
+6. Grade baseline results using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for exact grading invocation.
+**Output:** `[skill-name]-workspace/evals/evals.json` and baseline results in `iteration-0/`
+---
+## Phase 3: Write Minimal Skill
+**Goal:** Create just enough skill content to address the documented gaps and pass evaluations.
+### Process
+1. Invoke `/skill-writer` with this context:
+   ```
+   Create a skill based on this gap analysis and eval scenarios.
+   Gap analysis: [reference or paste gap-analysis.md]
+   Eval scenarios: [reference or paste evals.json expected_output and expectations]
+   Baseline failures: [summarize what Claude got wrong in iteration-0]
+   Constraint: Write the minimum instructions needed to address these specific gaps.
+   Every line must serve a documented gap. Do not over-document.
+   ```
+2. `/skill-writer` will run its workflow: classify type, set degree of freedom, ask clarifying questions, produce the SKILL.md artifact.
+3. Review the draft with the user:
+   - "Does this address all the gaps we identified?"
+   - "Is anything here unnecessary or over-engineered?"
+   - "Would this pass our eval scenarios?"
+4. Save the skill to its target directory.
+**Output:** The skill's SKILL.md (and optional reference files)
+---
+## Phase 4: Test (Feedback Loop)
+**Goal:** Run the skill on eval scenarios, compare against baseline, identify remaining gaps.
+### Process
+1. **Spawn all runs in parallel.** For each eval scenario, launch a with-skill subagent:
+   ```
+   Execute this task:
+   - Read the skill at [path-to-skill]/SKILL.md and follow its instructions
+   - Task: [eval prompt from evals.json]
+   - Input files: [eval files if any, or "none"]
+   - Save all output files to: [workspace]/iteration-N/eval-[name]/with_skill/outputs/
+   - Save a complete transcript of your work to: [workspace]/iteration-N/eval-[name]/with_skill/transcript.md
+   ```
+   For iteration-1, the without-skill baseline already exists from Phase 2.
+2. **While runs are in progress**, review and refine assertions if needed based on what was learned from the baseline.
+3. **When runs complete**, immediately capture timing data (`total_tokens`, `duration_ms`) to `timing.json` in each run directory. This data is only available in the task completion notification.
+4. **Grade each run** using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the grading process.
+5. **Aggregate into benchmark** using skill-creator's aggregation script. See delegation-map.md for the exact command.
+6. **Launch the eval viewer** using skill-creator's generate_review.py. See delegation-map.md for the exact command. For iteration 2+, include `--previous-workspace` to show diffs.
+7. Tell the user to review in the viewer:
+   - "Outputs" tab: click through each test case, leave feedback
+   - "Benchmark" tab: quantitative comparison (pass rates, timing, tokens)
+8. Wait for the user to complete their review.
+**Output:** `grading.json`, `benchmark.json`, `feedback.json` in the iteration directory
+---
+## Phase 5: Iterate
+**Goal:** Refine the skill based on observed Claude B behavior and user feedback.
+### Process
+1. Read `feedback.json` from the viewer. Empty feedback means the user was satisfied with that test case.
+2. Read transcripts from Phase 4 runs. Watch for the signals the official docs highlight:
+   - **Unexpected exploration paths** -- Claude B read files in an order you did not anticipate
+   - **Missed connections** -- Claude B did not follow references to important files
+   - **Overreliance on certain sections** -- content that should be promoted to SKILL.md
+   - **Ignored content** -- files Claude B never accessed (may be unnecessary or poorly signaled)
+   - **Repeated work across test cases** -- all subagents wrote similar helper scripts (bundle them into the skill)
+3. Synthesize observations into actionable improvements. For each piece of feedback, identify the specific skill change that would fix it.
+4. Apply improvements. For significant changes, re-invoke `/skill-writer` with:
+   ```
+   Refine this existing skill based on testing observations.
+   Current SKILL.md: [reference or paste]
+   User feedback: [from feedback.json -- only non-empty entries]
+   Behavioral observations: [from transcript analysis]
+   Specific issues to address:
+   1. [Issue]
+   2. [Issue]
+   Constraint: Only change what the feedback demands. Do not reorganize working content.
+   ```
+5. Key principles for this phase (from the official docs):
+   - **Generalize from feedback** -- the skill will be used across many different prompts, not just these test cases
+   - **Keep the prompt lean** -- remove instructions that are not pulling their weight
+   - **Explain the why** -- theory of mind beats rigid MUSTs
+   - **Bundle repeated work** -- if subagents all wrote similar scripts, add them to the skill
+6. Return to Phase 4 with the refined skill. Continue iterating until:
+   - User feedback is all empty (satisfied with every test case)
+   - Pass rates meet acceptable thresholds
+   - No meaningful progress between iterations
+---
+## Phase 6: Polish
+**Goal:** Optimize the skill description for triggering accuracy and run final validation.
+### Process
+1. **Description optimization.** Follow the process in `${CLAUDE_SKILL_DIR}/workflows/polish-skill.md`.
+2. **Final validation.** Run the skill-writer self-check rubric against the finished skill:
+   - [ ] Description is third person with trigger phrases
+   - [ ] Under 500 lines
+   - [ ] States what to do in positive terms (not prohibition-heavy)
+   - [ ] Degree of freedom matches task fragility
+   - [ ] Progressive disclosure used (heavy content in separate files)
+   - [ ] Examples are concrete, not abstract
+   - [ ] Frontmatter fields are valid
+   - [ ] One skill = one capability
+3. **Final checklist** from the official Anthropic docs:
+   - [ ] At least 3 evaluation scenarios created and passing
+   - [ ] Tested with real usage scenarios
+   - [ ] Skill solves documented gaps (not imagined requirements)
+   - [ ] Iterative refinement based on observed behavior (not assumptions)
+4. Present the finished skill to the user with:
+   - Final benchmark comparison (latest iteration vs baseline)
+   - Summary of gaps addressed
+   - Any remaining limitations or known edge cases

package/skills/skill-builder/workflows/polish-skill.md ADDED Viewed

@@ -0,0 +1,92 @@
+# Polish Skill Workflow
+Final optimization pass for a skill that is functionally complete.
+## Prerequisites
+- The skill passes its evaluation scenarios
+- The user is satisfied with output quality
+- This is the final step before the skill is considered done
+### Package-aware polish (recommended)
+When the polish pass will touch **more than frontmatter alone** (for example `REFERENCE.md`, `EXAMPLES.md`, `WORKFLOWS.md`, link structure, or eval JSON), or the user wants **checkpointed** multi-file updates alongside description work:
+1. Read `prompt-generator/templates/skill-refinement-package.md` (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md`).
+2. Run `/prompt-generator` with tokens filled so `ARCHITECTURE.md` records baseline inventory, planned deltas for polish, and evidence rules for any new trigger or behavior evals.
+Purely **single-field** `description` edits with no structural package changes can skip this block.
+---
+## Step 1: Description Optimization
+Optimize the skill's description for triggering accuracy using the skill-creator's trigger eval system.
+### Generate trigger eval queries
+Create 20 eval queries: 10 should-trigger and 10 should-not-trigger.
+**Should-trigger queries (10):** Different phrasings of the same intent. Include:
+- Formal and casual variations
+- Cases where the user does not explicitly name the skill but clearly needs it
+- Uncommon use cases
+- Cases where this skill competes with another but should win
+**Should-not-trigger queries (10):** Near-misses that share keywords but need something different. Include:
+- Adjacent domains with overlapping terminology
+- Ambiguous phrasing where naive keyword matching would falsely trigger
+- Tasks that touch the skill's domain but in a context where another tool is better
+All queries must be realistic -- detailed, specific, with file paths, personal context, casual speech. Not abstract one-liners.
+### Review with user
+Present the eval set using the skill-creator's HTML review template. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact process.
+The user can edit queries, toggle should-trigger, and add/remove entries.
+### Run optimization loop
+See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command. The loop:
+1. Splits eval set into 60% train / 40% held-out test
+2. Evaluates current description (3 runs per query for reliability)
+3. Proposes improvements based on failures
+4. Re-evaluates on both train and test
+5. Iterates up to 5 times
+6. Selects best description by test score (avoids overfitting)
+### Apply result
+Update the skill's SKILL.md frontmatter with the optimized description. Show the user before/after with scores.
+---
+## Step 2: Final Validation
+Run the skill-writer self-check rubric:
+- [ ] Description is third person with trigger phrases
+- [ ] SKILL.md body under 500 lines
+- [ ] States what to do in positive terms (not prohibition-heavy)
+- [ ] Degree of freedom matches task fragility
+- [ ] Progressive disclosure used (heavy content in separate files)
+- [ ] No time-sensitive claims unless clearly dated
+- [ ] Examples are concrete, not abstract
+- [ ] Frontmatter fields are valid per official docs
+- [ ] One skill = one capability
+- [ ] Consistent terminology throughout
+- [ ] File references are one level deep from SKILL.md
+- [ ] Files over 100 lines have a table of contents
+---
+## Step 3: Final Summary
+Present the finished skill to the user:
+1. **Benchmark summary:** Final pass rate vs baseline, with delta
+2. **Gaps addressed:** Map each original gap to the skill content that addresses it
+3. **Description optimization:** Before/after trigger accuracy scores
+4. **Known limitations:** Anything the skill does not handle (scope boundaries)
+5. **Maintenance notes:** What to watch for in future usage that might warrant re-iteration

package/skills/skill-writer/SKILL.md CHANGED Viewed

@@ -32,6 +32,24 @@ Use this skill when the user needs a structured skill artifact; for quick answer
 When invoked with arguments (e.g. `/skill-writer improve this: [paste]`), treat `$ARGUMENTS` as the skill content to refine.
+### Ground-up multi-file packages (required)
+When the user is creating a **new** skill as a **package** (workspace with `ARCHITECTURE.md`, `REFERENCE.md`, `EXAMPLES.md`, `WORKFLOWS.md`, `evals/*.json`, per-file human review), **before** drafting `SKILL.md`:
+1. Read `packages/claude-dev-env/skills/prompt-generator/templates/skill-from-ground-up.md` (installed layout: sibling folder `prompt-generator/templates/skill-from-ground-up.md` under the same `skills/` parent).
+2. Ensure `/prompt-generator` has run with that template filled so architecture-first steps, checkpoint gates, and eval evidence rules are already agreed.
+If the task is **only** editing an existing `SKILL.md` or a small single-file tweak, this subsection does not apply.
+### Refinement multi-file packages (required)
+When the user is **refining** an existing skill as a **package** (baseline skill directory, `ARCHITECTURE.md` with planned deltas, checkpointed updates to REFERENCE / EXAMPLES / WORKFLOWS / `evals/`), **before** rewriting multiple files:
+1. Read `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md` (installed layout: `prompt-generator/templates/skill-refinement-package.md` under the same `skills/` parent).
+2. Ensure `/prompt-generator` has run with that template filled so baseline root, workspace root, observation inputs, and evidence rules are fixed before edits proceed.
+If the change set is a **small single-file** tweak, this subsection does not apply.
 ## Workflow (run in order)
 ### 1. Classify the skill type