npm - claude-dev-env - Versions diffs - 1.16.0 → 1.17.0 - Mend

claude-dev-env 1.16.0 → 1.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/hooks/HOOK_SPECS_PROMPT_WORKFLOW.md +6 -0
package/hooks/blocking/prompt_workflow_gate_core.py +174 -56
package/hooks/blocking/test_prompt_workflow_gate_core.py +84 -13
package/package.json +1 -1
package/skills/prompt-generator/SKILL.md +2 -0
package/skills/prompt-generator/evals/prompt-generator.json +5 -5
package/skills/prompt-generator/templates/skill-from-ground-up.md +104 -0
package/skills/prompt-generator/templates/skill-refinement-package.md +109 -0
package/skills/skill-builder/SKILL.md +107 -0
package/skills/skill-builder/references/delegation-map.md +153 -0
package/skills/skill-builder/templates/gap-analysis.md +41 -0
package/skills/skill-builder/workflows/improve-skill.md +97 -0
package/skills/skill-builder/workflows/new-skill.md +235 -0
package/skills/skill-builder/workflows/polish-skill.md +92 -0
package/skills/skill-writer/SKILL.md +18 -0

package/skills/prompt-generator/templates/skill-from-ground-up.md ADDED Viewed

@@ -0,0 +1,104 @@
+# Template: ground-up skill build (collaborative)
+Use this artifact when you want a **new** Agent Skill package built **architecture-first**, then **implementation in layers**, with **human checkpoints** and **evaluation rows tied to real evidence** (pasted threads or explicit user approval), aligned with [Agent Skills best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
+## Before you paste
+Replace every `[[TOKEN]]` below (keep the brackets off in the final prompt you paste into a session, or substitute literal values).
+| Token | Meaning |
+| --- | --- |
+| `[[SKILL_PURPOSE]]` | Two to five sentences: capability, user pain, success shape. |
+| `[[WORKSPACE_ROOT]]` | Repo-relative folder for all Phase A–B files (e.g. `my-skill-workspace/`). |
+| `[[DESIGN_INPUT_GLOB]]` | Hypothesis or gap doc path (e.g. `my-skill-workspace/gap-analysis.md`), or `None` carried into `ARCHITECTURE.md` `Design inputs`. If `None`, omit the trailing `, plus \`[[DESIGN_INPUT_GLOB]]\`` fragment from `target_file_globs` in the XML you paste. |
+| `[[OPTIONAL_PACKAGE_TARGET]]` | Future install path (e.g. `packages/claude-dev-env/skills/my-skill/`) or `deferred until the user requests packaging`. |
+| `[[SKILL_SLUG]]` | Short kebab-case id for eval filename and metadata (e.g. `my-skill`). |
+| `[[DESCRIPTION_TRIGGERS]]` | Comma-separated phrases for third-person `description` triggers (user says, user asks, …). |
+| `[[EVIDENCE_RULE]]` | One sentence: what counts as authorized eval material (e.g. pasted excerpts only). |
+Optional: run `/prompt-generator` with this template as the brief so **AskUserQuestion** can lock scope before drafting.
+---
+## Paste-ready XML (after substitution)
+```xml
+<role>
+You are a skill architect and implementer for Claude Agent Skills. You build a skill package under human review. You follow Anthropic Agent Skills best practices for structure, progressive disclosure, workflows, examples, and evaluation-driven iteration. You treat the user as the sole authority for real evaluation evidence.
+</role>
+<background>
+Motivation: [[SKILL_PURPOSE]] The user builds collaboratively: evaluation scenarios must trace to material the user pasted or explicitly approved in thread. Work proceeds architecture-first (full end-state inventory), then implementation from foundation upward, with a full stop for review after each new or materially rewritten file.
+Scope block (ground every instruction here):
+- target_local_roots: `[[WORKSPACE_ROOT]]` at the repository root of this project (primary drafting and documentation area).
+- target_canonical_roots: Optional future install under `[[OPTIONAL_PACKAGE_TARGET]]`; until packaging is requested, keep all writes inside `[[WORKSPACE_ROOT]]`.
+- target_file_globs: `[[WORKSPACE_ROOT]]ARCHITECTURE.md`, `[[WORKSPACE_ROOT]]SKILL.md`, `[[WORKSPACE_ROOT]]REFERENCE.md`, `[[WORKSPACE_ROOT]]EXAMPLES.md`, `[[WORKSPACE_ROOT]]WORKFLOWS.md`, `[[WORKSPACE_ROOT]]evals/*.json`, `[[WORKSPACE_ROOT]]scripts/**` when present, plus `[[DESIGN_INPUT_GLOB]]` when it names a real path.
+- comparison_basis: Anthropic Agent Skills best practices for concise metadata, third-person descriptions, progressive disclosure (SKILL.md as table of contents, linked files one level deep, table of contents atop references over one hundred lines), copy-paste checklists for complex flows, template plus example pairs, evaluation JSON with expected_behavior arrays, and iterative testing across models the user cares about.
+- completion_boundary: Phase A delivers `[[WORKSPACE_ROOT]]ARCHITECTURE.md` listing every end-state file, its purpose, load order, eval evidence policy, and optional hook milestones; the user approves or edits that architecture in chat. Phase B creates or updates exactly one primary file per user checkpoint; each checkpoint ends with a short summary naming the next file queued. Phase C closes when SKILL.md plus linked references exist, `evals/` contains JSON whose populated rows include evidence references, and a final inventory table lists every path with a one-line purpose.
+Hypothesis note: When `[[DESIGN_INPUT_GLOB]]` names a file, treat that file as design input; each eval row still needs a matching pasted thread excerpt or explicit user approval before it counts as grounded behavior truth.
+Source priority for structure and quality bars: Anthropic Agent Skills documentation and prompting best practices first; user pasted materials for behavioral truth; repository AGENTS.md or package README when they touch path layout.
+Citation policy: Quote user-pasted excerpts inside evaluation records or EXAMPLES.md when illustrating a scenario; paraphrase only when the user labels a paste as approved summary material. Ground structural claims about Anthropic guidance in links recorded in REFERENCE.md.
+Evaluation evidence rule: [[EVIDENCE_RULE]]
+</background>
+<instructions>
+1. Phase A — Architecture inventory: Create or replace `[[WORKSPACE_ROOT]]ARCHITECTURE.md` with an end-state map: a `Design inputs` line carrying either the path `[[DESIGN_INPUT_GLOB]]` or the literal `None`; required YAML frontmatter fields for SKILL.md; third-person description triggers covering [[DESCRIPTION_TRIGGERS]]; planned progressive disclosure files (each linked one level deep from SKILL.md); workflow checklist locations; EXAMPLES.md plan (input and output pairs aligned to user-provided threads when available); `evals/` JSON filename, schema fields (`query`, `expected_behavior`, optional `files`, `evidence_ref` pointing to paste ids or approved labels); optional future hook integration as a separate milestone requiring explicit user activation; forward-slash paths only. Close Phase A with a brief chat summary listing deliverables and ask the user to approve or edit `ARCHITECTURE.md` before Phase B begins.
+2. Phase B — Ground-up implementation after approval: (a) Pull vocabulary and pain language from the design source path recorded in `ARCHITECTURE.md` under `Design inputs` when that section lists a path; reuse stable terms across new files. When `Design inputs` lists `None`, rely on repository conventions plus stated chat goals for terminology. (b) Draft `[[WORKSPACE_ROOT]]SKILL.md` with valid `name` (lowercase, hyphens, within length limits) and a rich third-person `description` covering what the skill does and exact triggers. (c) Add `REFERENCE.md` for terminology, links to Anthropic docs, and eval evidence rules. (d) Add `EXAMPLES.md` with template openings; pair each example with a user-supplied or user-approved excerpt when available; leave labeled slots where pastes are still outstanding. (e) Add `WORKFLOWS.md` containing copy-paste checklists for the main user journeys. (f) Create `evals/[[SKILL_SLUG]].json` using the schema from ARCHITECTURE.md; keep `expected_behavior` empty or placeholder only until matching pasted excerpts arrive; each populated row includes `evidence_ref` to the user paste or approval line. (g) Keep SKILL.md body under five hundred lines; push depth into linked files; add a table of contents at the top of any reference file expected to exceed one hundred lines.
+3. Collaboration gate: After every new file or material rewrite in Phase B, pause the thread for user review; continue only after explicit user approval naming the next file to tackle.
+4. Evaluation hygiene: Add or tighten evaluation rows solely from text the user pasted or explicitly approved in chat; when coverage gaps remain, document the gap inside REFERENCE.md under a heading such as “Pending approved excerpts” with positive wording about what artifact will unlock each slot.
+5. Quality pass before each handoff: Verify third-person description, consistent terminology, one-level-deep links from SKILL.md, forward-slash paths, and checklist presence for multi-step flows.
+6. Optional Phase C — Distribution: Copy or adapt the approved package into the package skill path only when the user requests packaging; mirror the same file set; update cross-links.
+</instructions>
+<constraints>
+- Keep writes primarily inside `[[WORKSPACE_ROOT]]` until the user expands scope in chat.
+- Limit each checkpoint to one primary file (plus tiny cross-link tweaks the user OKs in the same message).
+- Prefer additive edits: create new files listed in ARCHITECTURE.md before renaming or collapsing documents; describe planned moves in the checkpoint summary.
+- Defer hook installation, Stop-hook changes, or clipboard automation to an explicit milestone the user starts later.
+- Align naming with Anthropic guidance: gerund or action-oriented skill names; descriptive file names; ship only path segments that name the skill domain clearly (for example `reference/`, `evals/`, `EXAMPLES.md`).
+</constraints>
+<output_format>
+Every assistant milestone message uses this shape:
+1. `## Summary` — bullets for files touched, decisions made, and the single filename queued for the next review.
+2. `## Checkpoint` — one paragraph restating what the user should verify before the next file starts.
+3. When showing draft bodies inline, use fenced code blocks with language tags (`markdown`, `json`) matching the destination file.
+4. Phase A exit criterion: user approves `ARCHITECTURE.md` content in chat or edits it and signals approval.
+5. Phase B exit criterion: SKILL.md, REFERENCE.md, EXAMPLES.md, WORKFLOWS.md, and `evals/*.json` exist; eval JSON lists only user-pasted or user-approved evidence references in populated rows; REFERENCE.md lists any still-empty slots.
+6. Ship a closing `## Inventory` table: path, one-line purpose, progressive disclosure load order.
+Light self-check the implementer runs silently before sending each milestone: scope paths remain under `[[WORKSPACE_ROOT]]` while Phase C stays inactive; when Phase C is active, allow writes under the package target too; collaboration gate honored; evaluation rows tied to approved evidence only; links one level deep from SKILL.md; forward slashes throughout.
+</output_format>
+<illustrations>
+    Phase A summary template (markdown):
+    ## Architecture ready for review
+    - End-state files: [list]
+    - Eval evidence policy: [restated in one line]
+    - Next action after your OK: create `SKILL.md` skeleton
+    Eval row stub (JSON shape):
+    {
+      "skills": ["[[SKILL_SLUG]]"],
+      "query": "[paste user question text here]",
+      "files": [],
+      "expected_behavior": [],
+      "evidence_ref": "[chat message id or user approval line]"
+    }
+    Checkpoint paragraph pattern:
+    Please review `REFERENCE.md`. When ready, reply with approval to proceed to `EXAMPLES.md`.
+</illustrations>
+```
+---
+Notation: The claude-dev-env **skill-builder** and **skill-writer** skills reference this file as a required step for **net-new** full skill packages. For refinements anchored to an existing skill directory, use the sibling template `skill-refinement-package.md` in this folder. Keep this file self-contained and token-driven so it stays paste-ready for `/prompt-generator`.

package/skills/prompt-generator/templates/skill-refinement-package.md ADDED Viewed

@@ -0,0 +1,109 @@
+# Template: skill package refinement (collaborative)
+Use this artifact when you are **refining an existing** Agent Skill package **architecture-first**, then **applying changes in layers**, with **human checkpoints** and **evaluation rows tied to real evidence** (observation transcripts, pasted threads, or explicit user approval). Same orchestration shape as `skill-from-ground-up.md`; this variant anchors every plan to a **baseline skill directory** and a **delta** (observations, gap analysis), and favors **surgical edits** over wholesale rewrite.
+Source layout mirrors: [Agent Skills best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
+## Before you paste
+Replace every `[[TOKEN]]` below (drop brackets in the final prompt, or substitute literals).
+| Token | Meaning |
+| --- | --- |
+| `[[REFINEMENT_PURPOSE]]` | Two to five sentences: what must improve, what must stay stable, success shape for the refined package. |
+| `[[BASELINE_SKILL_ROOT]]` | Repo-relative path to the **current** skill package root (the live `SKILL.md` directory you read first). |
+| `[[WORKSPACE_ROOT]]` | Repo-relative folder for `ARCHITECTURE.md`, iteration notes, evals, and optional draft copies (can match `[[BASELINE_SKILL_ROOT]]` for in-place work, or a dedicated `[name]-workspace/` when using a snapshot workflow). |
+| `[[DESIGN_INPUT_GLOB]]` | Observation or gap doc path (e.g. `[name]-workspace/gap-analysis.md`). Carry into `ARCHITECTURE.md` under `Observation inputs`. If truly absent, set `Design inputs` to `None` in `ARCHITECTURE.md` and omit the trailing `plus \`[[DESIGN_INPUT_GLOB]]\`` fragment from `target_file_globs` in the pasted XML. |
+| `[[OPTIONAL_PACKAGE_TARGET]]` | Future install path (e.g. `packages/claude-dev-env/skills/my-skill/`) or `deferred until the user requests packaging`. |
+| `[[SKILL_SLUG]]` | Short kebab-case id for eval filename and metadata (usually matches existing skill `name`). |
+| `[[DESCRIPTION_TRIGGERS]]` | Comma-separated phrases for third-person `description` triggers after refinement (retain strong triggers; add any new ones). |
+| `[[EVIDENCE_RULE]]` | One sentence: authorized eval material (e.g. observation transcripts plus pasted or explicitly approved thread excerpts only). |
+Optional: run `/prompt-generator` with this template as the brief so **AskUserQuestion** can lock scope before drafting.
+---
+## Paste-ready XML (after substitution)
+```xml
+<role>
+You are a skill architect and implementer for Claude Agent Skills. You refine an existing skill package under human review. You follow Anthropic Agent Skills best practices for structure, progressive disclosure, workflows, examples, and evaluation-driven iteration. You treat the user and observation artifacts as the sole authority for behavioral truth; you treat the baseline skill tree as the anchor for what already works.
+</role>
+<background>
+Motivation: [[REFINEMENT_PURPOSE]] The user refines collaboratively: evaluation scenarios must trace to observation transcripts, pasted material, or explicitly approved thread lines. Work proceeds architecture-first (inventory baseline plus planned delta), then implementation in controlled layers, with a full stop for review after each materially touched file.
+Scope block (ground every instruction here):
+- target_local_roots: `[[WORKSPACE_ROOT]]` at the repository root of this project (planning, evals, optional drafts). Baseline read source: `[[BASELINE_SKILL_ROOT]]` (always list both in ARCHITECTURE.md).
+- target_canonical_roots: Optional future install under `[[OPTIONAL_PACKAGE_TARGET]]`; until packaging is requested, keep planning and agreed writes scoped per ARCHITECTURE.md checkpoints.
+- target_file_globs: `[[WORKSPACE_ROOT]]ARCHITECTURE.md`, files under `[[BASELINE_SKILL_ROOT]]**` that ARCHITECTURE.md lists as in scope, `[[WORKSPACE_ROOT]]evals/*.json`, `[[WORKSPACE_ROOT]]scripts/**` when present, plus `[[DESIGN_INPUT_GLOB]]` when it names a real path.
+- comparison_basis: Anthropic Agent Skills best practices for concise metadata, third-person descriptions, progressive disclosure (SKILL.md as table of contents, linked files one level deep, table of contents atop references over one hundred lines), copy-paste checklists, template plus example pairs, evaluation JSON with expected_behavior arrays, and iterative testing across models the user cares about; plus **delta discipline**: preserve sections the observation record marks as still effective; change only what observations or approved gaps demand.
+- completion_boundary: Phase A delivers or updates `[[WORKSPACE_ROOT]]ARCHITECTURE.md` with sections `Baseline inventory` (paths under `[[BASELINE_SKILL_ROOT]]`), `Observation inputs` (link to `[[DESIGN_INPUT_GLOB]]` or `None`), `Planned deltas` (bullet list tied to observations), `Files unchanged this pass` (explicit list), eval evidence policy, optional hook milestones; the user approves or edits in chat. Phase B updates exactly one primary file per user checkpoint (baseline path or workspace draft per ARCHITECTURE.md); each checkpoint ends naming the next file. Phase C closes when planned deltas are implemented, eval JSON populated rows carry evidence references per [[EVIDENCE_RULE]], and a final `## Inventory` table lists every touched path with one-line purpose.
+Hypothesis note: `[[DESIGN_INPUT_GLOB]]` when present carries observation-grounded gaps; each new or tightened eval row still needs a matching transcript excerpt, paste, or explicit user approval before it counts as grounded behavior truth.
+Source priority for structure and quality bars: Anthropic Agent Skills documentation and prompting best practices first; observation transcripts and user pasted materials for behavioral truth; repository AGENTS.md or package README when they touch path layout.
+Citation policy: Quote observation snippets and user-pasted excerpts inside evaluation records or EXAMPLES.md when illustrating a scenario; ground structural claims about Anthropic guidance in links recorded in REFERENCE.md.
+Evaluation evidence rule: [[EVIDENCE_RULE]]
+</background>
+<instructions>
+1. Phase A — Refinement architecture: Create or replace `[[WORKSPACE_ROOT]]ARCHITECTURE.md` with: `Baseline inventory` listing every file under `[[BASELINE_SKILL_ROOT]]` that stays in scope; `Observation inputs` with path `[[DESIGN_INPUT_GLOB]]` or `None`; `Planned deltas` as bullets each referencing an observation or approved gap; `Files unchanged this pass` listing files explicitly left alone; required YAML frontmatter fields for SKILL.md after refinement; third-person description triggers covering [[DESCRIPTION_TRIGGERS]]; progressive disclosure plan (links one level deep from SKILL.md); workflow checklist touchpoints; EXAMPLES.md delta plan; `evals/` JSON filename and schema fields (`query`, `expected_behavior`, optional `files`, `evidence_ref`); optional hook milestone flagged for explicit user activation; forward-slash paths only. Close Phase A with a brief chat summary and obtain user approval on `ARCHITECTURE.md` before Phase B.
+2. Phase B — Layered implementation after approval: (a) Before editing any file, read the current version from `[[BASELINE_SKILL_ROOT]]` (or from the approved snapshot path recorded in ARCHITECTURE.md when using a copy workflow). (b) Apply only the deltas listed in `Planned deltas`; retain prose and structure the observation record marks as still working. (c) Update `[[BASELINE_SKILL_ROOT]]SKILL.md` or workspace draft per ARCHITECTURE.md with valid `name` and refined third-person `description`. (d) Update REFERENCE.md, EXAMPLES.md, WORKFLOWS.md only when ARCHITECTURE.md lists them in scope for this pass. (e) Update or create `evals/[[SKILL_SLUG]].json` per ARCHITECTURE.md; new or tightened rows include `evidence_ref` to observation transcript ids, pastes, or approval lines. (f) Keep SKILL.md body under five hundred lines; push new depth into linked files; add a table of contents at the top of any reference file expected to exceed one hundred lines.
+3. Collaboration gate: After every materially touched file in Phase B, pause for user review; continue only after explicit user approval naming the next file.
+4. Evaluation hygiene: Add or tighten evaluation rows solely from observation artifacts or text the user pasted or explicitly approved; document remaining coverage needs under REFERENCE.md in a “Pending approved excerpts” style section with positive wording about what artifact unlocks each slot.
+5. Quality pass before each handoff: Verify third-person description, consistent terminology, one-level-deep links from SKILL.md, forward-slash paths, checklist presence for multi-step flows, and that unchanged files match the `Files unchanged this pass` list; when the user approves a scope change, update that list in `ARCHITECTURE.md` in the same checkpoint before touching additional files.
+6. Optional Phase C — Distribution: Copy or adapt the approved package into the package skill path only when the user requests packaging; mirror the agreed file set; update cross-links.
+</instructions>
+<constraints>
+- Read baseline files from `[[BASELINE_SKILL_ROOT]]` before drafting replacements; prefer minimal diffs aligned to ARCHITECTURE.md.
+- Limit each checkpoint to one primary file (plus tiny cross-link tweaks the user OKs in the same message).
+- Prefer additive edits and explicit rename plans recorded in the checkpoint summary before broad moves.
+- Defer hook installation, Stop-hook changes, or clipboard automation to an explicit milestone the user starts later.
+- Align naming with Anthropic guidance; ship only path segments that name the skill domain clearly (for example `reference/`, `evals/`, `EXAMPLES.md`).
+</constraints>
+<output_format>
+Every assistant milestone message uses this shape:
+1. `## Summary` — bullets for files touched, decisions made, baseline subsection preserved or changed, single filename queued for next review.
+2. `## Checkpoint` — one paragraph restating what the user should verify before the next file starts.
+3. When showing draft bodies inline, use fenced code blocks with language tags (`markdown`, `json`) matching the destination file.
+4. Phase A exit criterion: user approves `ARCHITECTURE.md` in chat or edits it and signals approval.
+5. Phase B exit criterion: planned deltas from ARCHITECTURE.md are implemented; eval JSON populated rows reference evidence per [[EVIDENCE_RULE]]; REFERENCE.md lists any still-empty eval slots.
+6. Ship a closing `## Inventory` table: path, one-line purpose, progressive disclosure load order, note whether each path was baseline-updated or newly added.
+Light self-check before each milestone: scope matches ARCHITECTURE.md; collaboration gate honored; evaluation rows tied to approved evidence only; links one level deep from SKILL.md; forward slashes throughout; baseline sections marked stable stay intact until the user approves an explicit change to them.
+</output_format>
+<illustrations>
+    Phase A summary template (markdown):
+    ## Refinement architecture ready for review
+    - Baseline root: [[BASELINE_SKILL_ROOT]]
+    - Observation inputs: [[DESIGN_INPUT_GLOB]]
+    - Planned deltas: [bullets tied to observations]
+    - Unchanged this pass: [file list]
+    - Next action after your OK: edit [single file name]
+    Eval row stub (JSON shape):
+    {
+      "skills": ["[[SKILL_SLUG]]"],
+      "query": "[task that failed under old skill]",
+      "files": [],
+      "expected_behavior": [],
+      "evidence_ref": "[observation transcript id or user approval line]"
+    }
+    Checkpoint paragraph pattern:
+    Please review the updated `SKILL.md` delta against baseline. When ready, reply with approval to proceed to `REFERENCE.md`.
+</illustrations>
+```
+---
+Notation: Derived from `skill-from-ground-up.md` in this folder; use **that** template for **net-new** packages, **this** template when a baseline skill directory exists. The claude-dev-env **skill-builder** workflows (`improve-skill`, `polish-skill`) and **skill-writer** reference this file for checkpointed, multi-file refinements.

package/skills/skill-builder/SKILL.md ADDED Viewed

@@ -0,0 +1,107 @@
+---
+name: skill-builder
+description: >-
+  Orchestrates the complete skill-building lifecycle using evaluation-driven
+  development. Routes through gap analysis, eval creation, skill writing (via
+  skill-writer), subagent testing (via skill-creator infrastructure), and
+  iterative refinement. Use when creating new skills, improving existing skills,
+  or optimizing skill descriptions. Triggers: 'build a skill', 'new skill
+  workflow', 'improve this skill', 'optimize skill description', 'skill
+  development lifecycle'.
+---
+@${CLAUDE_SKILL_DIR}/references/eval-driven-flow.md
+# Skill Builder
+**Core principle:** Evaluation-driven development. Build evals BEFORE writing extensive documentation. This ensures skills solve real problems rather than documenting imagined ones.
+Source: [Anthropic Skill Best Practices - Evaluation and Iteration](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices#evaluation-and-iteration)
+## When this skill applies
+Trigger for requests to **build**, **improve**, or **polish** a skill through the full evaluation-driven lifecycle. This skill orchestrates the process -- it delegates writing to `/skill-writer` and evaluation infrastructure to the `skill-creator` plugin.
+For quick skill syntax questions or one-off SKILL.md edits, use `/skill-writer` directly instead.
+## Routing
+Assess the user's intent from conversation context and existing artifacts. Route directly:
+**Creating a new skill?**
+Read `${CLAUDE_SKILL_DIR}/workflows/new-skill.md` and follow it.
+**Improving an existing skill?**
+Read `${CLAUDE_SKILL_DIR}/workflows/improve-skill.md` and follow it.
+**Final polish only (description optimization, trigger eval)?**
+Read `${CLAUDE_SKILL_DIR}/workflows/polish-skill.md` and follow it.
+**Ambiguous?** Ask: "Are you creating a new skill, improving an existing one, or doing a final polish pass?"
+## The Claude A / Claude B Pattern
+You and the user are **Claude A** -- the expert who designs and refines the skill. Subagents running the built skill on eval tasks are **Claude B** -- the agent using the skill to perform real work.
+> "Work with one instance of Claude ('Claude A') to create a Skill that is used by other instances ('Claude B'). Claude A helps you design and refine instructions, while Claude B tests them in real tasks."
+The feedback loop: observe Claude B's behavior, bring insights back, refine the skill, test again.
+## Phase Overview
+| Phase | Purpose | Delegated To |
+|-------|---------|-------------|
+| 1. Identify gaps | Document what fails without the skill | This skill (guided conversation) |
+| 2. Build evals | Create 3+ scenarios testing the gaps | This skill (templates + user input) |
+| 3. Write skill | Minimal instructions addressing gaps | `/skill-writer` |
+| 4. Test | Subagent runs with/without skill, grade, benchmark | `skill-creator` eval infrastructure |
+| 5. Iterate | Review results, refine, re-test | This skill + `/skill-writer` + Phase 4 |
+| 6. Polish | Description optimization, trigger eval, final check | `skill-creator` description optimizer |
+## Ground-up skill package (required reference)
+When building a **new** skill as a **full package** (architecture inventory, progressive disclosure files, `evals/`, **human checkpoint after each file**), treat the following as **mandatory** before implementation:
+- **Read:** `prompt-generator/templates/skill-from-ground-up.md` inside the claude-dev-env `skills/` directory (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-from-ground-up.md`).
+- **Do:** Run `/prompt-generator` with that file’s token table filled so the downstream session follows architecture-first sequencing, per-file review gates, and eval rows tied only to user-pasted or explicitly approved evidence.
+Use this **together with** gap-analysis and eval-scenario templates in this package; the ground-up template supplies the **orchestration contract** for the multi-file layout Anthropic recommends.
+## Refinement skill package (required reference)
+When **improving** an existing skill as a **multi-file** or **checkpointed** package (baseline directory plus planned deltas, observation-grounded evals), treat the following as **mandatory** before Phase 2–6 file work in `improve-skill.md` or package-aware steps in `polish-skill.md`:
+- **Read:** `prompt-generator/templates/skill-refinement-package.md` (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md`).
+- **Do:** Run `/prompt-generator` with that file’s token table filled (`[[BASELINE_SKILL_ROOT]]`, `[[WORKSPACE_ROOT]]`, observation gap path, evidence rule) so rollout stays architecture-first, delta-focused, and tied to real observation or approved excerpts.
+Net-new packages without a baseline skill directory use `skill-from-ground-up.md` instead.
+## Principles (apply across all phases)
+1. **Evals before documentation.** Never write extensive skill content without evaluation scenarios to validate it.
+2. **Minimal instructions first.** Write just enough to pass evaluations. Resist the urge to over-document.
+3. **Generalize from feedback.** The skill will be used across many prompts. Do not overfit to test cases.
+4. **Explain the why.** Theory of mind beats rigid rules. Help the model understand reasoning, not just constraints.
+5. **Observe, do not assume.** Iterate based on what Claude B actually does, not what you think it should do.
+## Delegation Details
+See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for exact invocation patterns and integration points between this orchestrator, `/skill-writer`, and `skill-creator`.
+## File Index
+| File | Purpose |
+|------|---------|
+| `workflows/new-skill.md` | Full lifecycle for new skills (6 phases) |
+| `workflows/improve-skill.md` | Observation-first flow for existing skills |
+| `workflows/polish-skill.md` | Description optimization and final validation |
+| `references/eval-driven-flow.md` | Official Anthropic methodology with citations |
+| `references/delegation-map.md` | Integration map for skill-writer and skill-creator |
+| `templates/gap-analysis.md` | Template for Phase 1 gap documentation |
+| `templates/eval-scenario.json` | Eval template matching skill-creator schema |
+| `../prompt-generator/templates/skill-from-ground-up.md` | Required orchestration template for **net-new** full skill packages (architecture-first, checkpoints, evidence-backed evals) |
+| `../prompt-generator/templates/skill-refinement-package.md` | Required orchestration template for **existing-skill** multi-file refinements and package-aware polish (baseline + delta, checkpoints, evidence-backed evals) |

package/skills/skill-builder/references/delegation-map.md ADDED Viewed

@@ -0,0 +1,153 @@
+# Delegation Map
+How the skill-builder orchestrator integrates with external skills at each phase.
+## Phase 1: Identify Gaps -- This Orchestrator
+No external delegation. The orchestrator guides a conversation with the user to document what fails without a skill.
+**Output:** `[skill-name]-workspace/gap-analysis.md` using the template at `templates/gap-analysis.md`.
+## Phase 2: Build Evals -- This Orchestrator
+No external delegation. The orchestrator helps the user transform gaps into eval scenarios.
+**Output:** `[skill-name]-workspace/evals/evals.json` using the template at `templates/eval-scenario.json`.
+**Baseline runs:** Spawn subagents WITHOUT any skill for each eval scenario. These run as background Agent tasks.
+## Phase 3: Write Skill -- Delegate to `/skill-writer`
+**Full package path:** If the user approved a package plan from `prompt-generator/templates/skill-from-ground-up.md` (net-new) or `prompt-generator/templates/skill-refinement-package.md` (existing baseline), paste the approved architecture summary, baseline inventory, planned deltas, and checkpoint list into the skill-writer handoff so file order and scope stay aligned.
+Invoke `/skill-writer` with the following context in your prompt:
+```
+Create a skill based on this gap analysis and eval scenarios.
+Gap analysis: [paste or reference gap-analysis.md]
+Eval scenarios: [paste or reference evals.json expected_output fields]
+Baseline failures: [summarize what Claude got wrong without the skill]
+Constraint: Write the minimum instructions needed to address these specific gaps.
+Do not over-document. Every line must serve a documented gap.
+```
+skill-writer handles: type classification, degree of freedom, frontmatter, body structure, progressive disclosure, self-check.
+**Output:** The skill's SKILL.md (and optional REFERENCE.md, scripts, etc.)
+## Phase 4: Test -- Delegate to skill-creator Infrastructure
+The skill-creator plugin provides the eval infrastructure. Reference its components directly:
+### Spawning test runs
+For each eval, spawn TWO subagents in the SAME turn (parallel):
+**With-skill subagent:**
+```
+Execute this task:
+- Read the skill at [path-to-skill]/SKILL.md and follow its instructions
+- Task: [eval prompt from evals.json]
+- Input files: [eval files if any]
+- Save all output files to: [workspace]/iteration-N/eval-[name]/with_skill/outputs/
+- Save a transcript of your complete work to: [workspace]/iteration-N/eval-[name]/with_skill/transcript.md
+- At the end, write a metrics.json with tool call counts and file list
+```
+**Without-skill subagent** (baseline):
+For iteration-1, reuse baseline results from Phase 2 (iteration-0). For later iterations, the original baseline persists.
+### Grading
+Read the grading agent instructions from the skill-creator plugin:
+`[skill-creator-plugin-path]/agents/grader.md`
+Spawn a grader subagent for each run with:
+- The expectations from evals.json
+- The transcript path
+- The outputs directory
+**Output:** `grading.json` in each run directory.
+### Benchmarking
+Run the aggregation script from the skill-creator plugin directory:
+```bash
+cd [skill-creator-plugin-path] && python -m scripts.aggregate_benchmark [workspace]/iteration-N --skill-name [name]
+```
+**Output:** `benchmark.json` and `benchmark.md` in the iteration directory.
+### Eval Viewer
+Launch the viewer from the skill-creator plugin:
+```bash
+python [skill-creator-plugin-path]/eval-viewer/generate_review.py \
+  [workspace]/iteration-N \
+  --skill-name "[name]" \
+  --benchmark [workspace]/iteration-N/benchmark.json
+```
+For iteration 2+, add: `--previous-workspace [workspace]/iteration-[N-1]`
+If no browser/display available, add: `--static [workspace]/iteration-N/review.html`
+**Output:** Browser-based reviewer where the user inspects outputs and leaves feedback.
+### Finding the skill-creator plugin path
+The skill-creator plugin is installed at a path like:
+`~/.claude/plugins/marketplaces/claude-plugins-official/plugins/skill-creator/skills/skill-creator/`
+To find it dynamically, search for the skill-creator SKILL.md:
+```bash
+find ~/.claude/plugins -name "SKILL.md" -path "*/skill-creator/*" 2>/dev/null | head -1
+```
+Then derive the plugin root from that path.
+## Phase 5: Iterate -- This Orchestrator + `/skill-writer`
+The orchestrator reads feedback.json and transcripts, synthesizes observations, then delegates refinement to `/skill-writer`:
+```
+Refine this existing skill based on these observations from testing.
+Current SKILL.md: [paste or reference]
+User feedback: [from feedback.json]
+Behavioral observations: [from transcript analysis]
+Specific issues to address:
+1. [Issue from feedback]
+2. [Issue from observation]
+Constraint: Only change what the feedback demands. Do not reorganize working content.
+```
+Then return to Phase 4 with the refined skill.
+## Phase 6: Polish -- Delegate to skill-creator Description Optimizer
+The skill-creator plugin includes a description optimization loop:
+### Trigger eval generation
+Generate 20 realistic eval queries (10 should-trigger, 10 should-not-trigger). Use the HTML review template from:
+`[skill-creator-plugin-path]/assets/eval_review.html`
+### Optimization loop
+```bash
+cd [skill-creator-plugin-path] && python -m scripts.run_loop \
+  --eval-set [path-to-trigger-eval.json] \
+  --skill-path [path-to-skill] \
+  --model [current-model-id] \
+  --max-iterations 5 \
+  --verbose
+```
+### Final validation
+Run the skill-writer self-check rubric (from skill-writer's Step 9) against the finished skill. All items must pass.

package/skills/skill-builder/templates/gap-analysis.md ADDED Viewed

@@ -0,0 +1,41 @@
+# Gap Analysis: [Skill Name]
+## Task Description
+[What the user is trying to accomplish -- the capability this skill should provide]
+## Gaps Identified
+### Gap 1: [Descriptive Name]
+- **What happened:** [Description of the failure or missing context when working without a skill]
+- **What was needed:** [The specific context, instruction, or knowledge that would fix it]
+- **Frequency:** [How often this comes up in real usage]
+- **Example task:** [A concrete task that exposes this gap]
+### Gap 2: [Descriptive Name]
+- **What happened:** [Description]
+- **What was needed:** [Context/instruction needed]
+- **Frequency:** [Frequency]
+- **Example task:** [Concrete example]
+### Gap 3: [Descriptive Name]
+- **What happened:** [Description]
+- **What was needed:** [Context/instruction needed]
+- **Frequency:** [Frequency]
+- **Example task:** [Concrete example]
+## Patterns
+- [Recurring themes across gaps -- e.g., "Claude consistently lacks knowledge about X"]
+- [Common failure modes -- e.g., "Without guidance, Claude chooses library A when library B is required"]
+- [Context that was repeatedly provided manually]
+## Candidate Eval Scenarios
+- [Task that would expose Gap 1 -- becomes the seed for an eval]
+- [Task that would expose Gap 2]
+- [Task that would expose multiple gaps simultaneously]
+- [Edge case that tests boundary behavior]

package/skills/skill-builder/workflows/improve-skill.md ADDED Viewed

@@ -0,0 +1,97 @@
+# Improve Skill Workflow
+Observation-first flow for iterating on an existing skill.
+## Prerequisites
+- An existing skill that needs improvement
+- The skill has been used at least once (or the user has observed specific issues)
+---
+## Phase 1: Observe
+**Goal:** Document the existing skill's current behavior by running it on real tasks.
+> "Use the Skill in real workflows: Give Claude B (with the Skill loaded) actual tasks, not test scenarios"
+### Process
+1. Identify the skill to improve. Read its current SKILL.md and any reference files.
+2. Ask the user what prompted the improvement:
+   - "What specific issue did you observe?"
+   - "Can you give me a concrete task where the skill underperformed?"
+   - "Is this a triggering issue (skill does not activate), a quality issue (skill activates but produces poor results), or a scope issue (skill does the wrong thing)?"
+3. Run the existing skill on 2-3 real tasks. For each, spawn a subagent:
+   ```
+   Execute this task using the skill at [path-to-existing-skill]:
+   - Read the skill at [path]/SKILL.md and follow its instructions
+   - Task: [realistic task from user]
+   - Save outputs to: [skill-name]-workspace/observation/task-[N]/outputs/
+   - Save transcript to: [skill-name]-workspace/observation/task-[N]/transcript.md
+   ```
+4. Analyze the transcripts. Document observations:
+   - Where did the skill work well?
+   - Where did it fail or produce subpar results?
+   - Did Claude B follow the skill's instructions as written?
+   - Did Claude B ignore any sections or files?
+   - Did Claude B explore in unexpected directions?
+5. Generate a gap analysis (same template as new-skill Phase 1) focused on the delta between current behavior and desired behavior.
+**Output:** `[skill-name]-workspace/gap-analysis.md` with observation-based gaps
+---
+## Phase 2-6: Follow the New Skill Workflow
+From here, follow the same phases as `${CLAUDE_SKILL_DIR}/workflows/new-skill.md`, starting at Phase 2 (Build Evals).
+### Collaborative package orchestration (Phases 2–6)
+Whenever Phases 2–6 will touch **multiple files**, **progressive disclosure layout**, or use **checkpointed file-by-file rollout**, treat this as **required** before expanding or rewriting the tree:
+1. Read `prompt-generator/templates/skill-refinement-package.md` from the claude-dev-env `skills/` tree (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md`).
+2. Run `/prompt-generator` with that template’s token table filled: set `[[BASELINE_SKILL_ROOT]]` to the existing skill directory, `[[WORKSPACE_ROOT]]` to your iteration workspace (in-place or snapshot per user preference), and `[[DESIGN_INPUT_GLOB]]` to this workflow’s observation-based `gap-analysis.md` when it exists.
+Use `skill-from-ground-up.md` **only** for **greenfield** packages where no baseline skill directory exists yet; use `skill-refinement-package.md` for every refinement anchored to an existing skill.
+Key differences from the new-skill flow:
+- **Phase 2 (Build Evals):** Evals should test the specific issues observed in Phase 1, not hypothetical gaps.
+- **Phase 3 (Write Skill):** Instead of writing from scratch, invoke `/skill-writer` with:
+  ```
+  Refine this existing skill based on observation findings.
+  Current SKILL.md: [reference or paste current skill]
+  Gap analysis: [reference observation-based gaps]
+  Eval scenarios: [reference evals]
+  Constraint: Preserve what works. Only change what the observations demand.
+  ```
+- **Phase 4 (Test):** The baseline is the CURRENT skill (snapshot it before editing). Compare old-skill vs new-skill, not with-skill vs without-skill.
+  Before making any changes, snapshot the existing skill:
+  ```bash
+  cp -r [skill-path] [workspace]/skill-snapshot/
+  ```
+  Then for baseline runs, point subagents at the snapshot:
+  ```
+  Execute this task using the ORIGINAL skill at [workspace]/skill-snapshot/:
+  - Read the skill and follow its instructions
+  - Task: [eval prompt]
+  - Save outputs to: [workspace]/iteration-N/eval-[name]/old_skill/outputs/
+  - Save transcript to: [workspace]/iteration-N/eval-[name]/old_skill/transcript.md
+  ```
+- **Phase 5 (Iterate):** Same process. The improvement loop compares new version against the snapshot.
+- **Phase 6 (Polish):** Same process. Run description optimization if triggering was an issue.