claude-dev-env 1.16.0 → 1.17.0

@@ -0,0 +1,235 @@
1
+ # New Skill Workflow
2
+
3
+ Full evaluation-driven lifecycle for building a new skill from scratch.
4
+
5
+ ## Prerequisites
6
+
7
+ - The user has a task or domain they want to capture as a skill
8
+ - No existing skill for this capability (or intentionally starting fresh)
9
+
10
+ ### Ground-up package layout (required before multi-file implementation)
11
+
12
+ When the outcome includes **ARCHITECTURE.md**, **REFERENCE / EXAMPLES / WORKFLOWS**, and **`evals/*.json`** under a workspace (Anthropic-style progressive disclosure plus checkpointed rollout):
13
+
14
+ 1. Read `prompt-generator/templates/skill-from-ground-up.md` from the claude-dev-env `skills/` tree (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-from-ground-up.md`).
15
+ 2. Run `/prompt-generator` using that template (substitute tokens per its table) **before** Phase 3 expands the repo; align the XML scope block with this workflow’s workspace and evidence rules.
16
+ 3. Keep Phase 1–2 artifacts honest: eval prompts and expectations stay grounded in **real** user scenarios, and eval rows reference only pasted or explicitly approved evidence, as the template reinforces.
17
+
18
+ Skip this block only when the user explicitly wants a **single-file** SKILL.md with no staged package plan.
19
+
20
+ Refinements to an **existing** skill package use `prompt-generator/templates/skill-refinement-package.md` instead (see `improve-skill.md`).
21
+
22
+ ---
23
+
24
+ ## Phase 1: Identify Gaps
25
+
26
+ **Goal:** Document what fails or requires repeated context when working without a skill.
27
+
28
+ ### Process
29
+
30
+ 1. Have a guided conversation to uncover gaps. Explore these areas:
31
+ - "What task were you doing when you realized you needed a skill?"
32
+ - "What context did you repeatedly provide to Claude?"
33
+ - "Where did Claude fail or produce subpar results without guidance?"
34
+ - "What domain knowledge was missing?"
35
+ - "What specific format or structure did you need?"
36
+ - "Were there tools or scripts that needed to be used in a particular way?"
37
+ - "What rules or constraints did Claude violate?"
38
+
39
+ 2. As patterns emerge, probe for eval-worthy scenarios:
40
+ - "Can you give me a concrete example of a task where this failed?"
41
+ - "What would success look like for that specific task?"
42
+ - "Are there edge cases where the right approach changes?"
43
+
44
+ 3. Generate `gap-analysis.md` from the conversation using the template at `${CLAUDE_SKILL_DIR}/templates/gap-analysis.md`. Fill in all sections from what was discussed.
45
+
46
+ 4. Review the gap analysis with the user. Confirm completeness before moving to Phase 2.
47
+
48
+ **Output:** `[skill-name]-workspace/gap-analysis.md`
49
+
50
+ ---
51
+
52
+ ## Phase 2: Build Evals
53
+
54
+ **Goal:** Create 3+ evaluation scenarios that test the identified gaps. Establish a baseline.
55
+
56
+ ### Process
57
+
58
+ 1. Transform each gap into at least one eval scenario. Each scenario needs:
59
+ - A realistic user prompt (detailed and specific, like a real request)
60
+ - A description of what success looks like
61
+ - Objectively verifiable expectations (assertions)
62
+
63
+ 2. Draft evals using the schema at `${CLAUDE_SKILL_DIR}/templates/eval-scenario.json`. Ensure:
64
+ - Minimum 3 scenarios (official requirement)
65
+ - Every identified gap has at least one scenario testing it
66
+ - Expectations are objectively verifiable, not subjective
67
+ - Prompts sound like things a real user would say
68
+
69
+ 3. Review eval scenarios with the user. Adjust until both sides are satisfied.
70
+
71
+ 4. Save to `[skill-name]-workspace/evals/evals.json`.
72
+
73
+ 5. **Establish baseline.** For each eval, spawn a subagent WITHOUT any skill:
74
+
75
+ ```
76
+ Execute this task with NO skill loaded:
77
+ - Task: [eval prompt]
78
+ - Input files: [eval files if any, or "none"]
79
+ - Save all output files to: [workspace]/iteration-0/eval-[name]/without_skill/outputs/
80
+ - Save a complete transcript of your work to: [workspace]/iteration-0/eval-[name]/without_skill/transcript.md
81
+ ```
82
+
83
+ Spawn all baseline runs in parallel. Capture timing data when each completes.
84
+
85
+ 6. Grade baseline results using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for exact grading invocation.
86
+
87
+ **Output:** `[skill-name]-workspace/evals/evals.json` and baseline results in `iteration-0/`
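The Phase 2 checks (at least 3 scenarios, every gap covered, every scenario carrying expectations) can be sketched as a small validator. This is a minimal sketch, not the package's own tooling: the real schema lives in `templates/eval-scenario.json`, which this diff does not show, so the `name`, `prompt`, and `gaps` keys below are assumptions chosen for illustration. Only `expectations` appears in the workflow text.

```python
from typing import Any

MIN_SCENARIOS = 3  # "Minimum 3 scenarios" from the Phase 2 checklist


def check_evals(scenarios: list[dict[str, Any]], gap_ids: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed.

    Assumed (hypothetical) scenario keys: "name", "prompt", "gaps".
    "expectations" is the only key named in the workflow itself.
    """
    problems: list[str] = []
    if len(scenarios) < MIN_SCENARIOS:
        problems.append(
            f"need at least {MIN_SCENARIOS} scenarios, found {len(scenarios)}"
        )
    covered: set[str] = set()
    for index, scenario in enumerate(scenarios):
        label = scenario.get("name", f"scenario #{index}")
        if not str(scenario.get("prompt", "")).strip():
            problems.append(f"{label}: missing prompt")
        if not scenario.get("expectations"):
            problems.append(f"{label}: no expectations")
        covered.update(scenario.get("gaps", []))
    # Every identified gap needs at least one scenario that tests it.
    for gap in sorted(gap_ids - covered):
        problems.append(f"gap {gap!r} has no scenario testing it")
    return problems
```

A check like this only covers the mechanical rules. Whether expectations are objectively verifiable and whether prompts sound like a real user still needs the review in step 3.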
88
+
89
+ ---
90
+
91
+ ## Phase 3: Write Minimal Skill
92
+
93
+ **Goal:** Create just enough skill content to address the documented gaps and pass evaluations.
94
+
95
+ ### Process
96
+
97
+ 1. Invoke `/skill-writer` with this context:
98
+
99
+ ```
100
+ Create a skill based on this gap analysis and eval scenarios.
101
+
102
+ Gap analysis: [reference or paste gap-analysis.md]
103
+ Eval scenarios: [reference or paste evals.json expected_output and expectations]
104
+ Baseline failures: [summarize what Claude got wrong in iteration-0]
105
+
106
+ Constraint: Write the minimum instructions needed to address these specific gaps.
107
+ Every line must serve a documented gap. Do not over-document.
108
+ ```
109
+
110
+ 2. `/skill-writer` will run its workflow: classify type, set degree of freedom, ask clarifying questions, and produce the SKILL.md artifact.
111
+
112
+ 3. Review the draft with the user:
113
+ - "Does this address all the gaps we identified?"
114
+ - "Is anything here unnecessary or over-engineered?"
115
+ - "Would this pass our eval scenarios?"
116
+
117
+ 4. Save the skill to its target directory.
118
+
119
+ **Output:** The skill's SKILL.md (and optional reference files)
120
+
121
+ ---
122
+
123
+ ## Phase 4: Test (Feedback Loop)
124
+
125
+ **Goal:** Run the skill on eval scenarios, compare against baseline, identify remaining gaps.
126
+
127
+ ### Process
128
+
129
+ 1. **Spawn all runs in parallel.** For each eval scenario, launch a with-skill subagent:
130
+
131
+ ```
132
+ Execute this task:
133
+ - Read the skill at [path-to-skill]/SKILL.md and follow its instructions
134
+ - Task: [eval prompt from evals.json]
135
+ - Input files: [eval files if any, or "none"]
136
+ - Save all output files to: [workspace]/iteration-N/eval-[name]/with_skill/outputs/
137
+ - Save a complete transcript of your work to: [workspace]/iteration-N/eval-[name]/with_skill/transcript.md
138
+ ```
139
+
140
+ For iteration-1, the without-skill baseline already exists from Phase 2.
141
+
142
+ 2. **While runs are in progress**, review and refine assertions if needed based on what was learned from the baseline.
143
+
144
+ 3. **When runs complete**, immediately capture timing data (`total_tokens`, `duration_ms`) to `timing.json` in each run directory. This data is only available in the task completion notification.
145
+
146
+ 4. **Grade each run** using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the grading process.
147
+
148
+ 5. **Aggregate into benchmark** using skill-creator's aggregation script. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command.
149
+
150
+ 6. **Launch the eval viewer** using skill-creator's `generate_review.py`. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command. For iteration 2 and later, include `--previous-workspace` to show diffs.
151
+
152
+ 7. Tell the user to review in the viewer:
153
+ - "Outputs" tab: click through each test case, leave feedback
154
+ - "Benchmark" tab: quantitative comparison (pass rates, timing, tokens)
155
+
156
+ 8. Wait for the user to complete their review.
157
+
158
+ **Output:** `grading.json`, `benchmark.json`, `feedback.json` in the iteration directory
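The run-directory convention used in Phases 2 and 4 (`[workspace]/iteration-N/eval-[name]/with_skill/` or `without_skill/`) and the `timing.json` capture can be sketched as two helpers. This is an illustrative sketch under assumptions: the function names `run_dir` and `save_timing` are hypothetical, and the flat two-key JSON shape is assumed, since the workflow names only the `total_tokens` and `duration_ms` fields.

```python
import json
from pathlib import Path


def run_dir(workspace: Path, iteration: int, eval_name: str, with_skill: bool) -> Path:
    """Build the per-run directory following the layout in the subagent prompts."""
    variant = "with_skill" if with_skill else "without_skill"
    return workspace / f"iteration-{iteration}" / f"eval-{eval_name}" / variant


def save_timing(directory: Path, total_tokens: int, duration_ms: int) -> Path:
    """Write timing.json into the run directory (file shape is an assumption)."""
    directory.mkdir(parents=True, exist_ok=True)
    timing_path = directory / "timing.json"
    payload = {"total_tokens": total_tokens, "duration_ms": duration_ms}
    timing_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return timing_path
```

For example, `run_dir(Path("my-skill-workspace"), 0, "csv-report", with_skill=False)` would point at the Phase 2 baseline directory for a hypothetical `csv-report` eval.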
159
+
160
+ ---
161
+
162
+ ## Phase 5: Iterate
163
+
164
+ **Goal:** Refine the skill based on observed Claude B behavior (Claude B is the fresh instance that runs the skill, here the Phase 4 subagents) and user feedback.
165
+
166
+ ### Process
167
+
168
+ 1. Read `feedback.json` from the viewer. Empty feedback means the user was satisfied with that test case.
169
+
170
+ 2. Read transcripts from Phase 4 runs. Watch for the signals the official docs highlight:
171
+ - **Unexpected exploration paths** -- Claude B read files in an order you did not anticipate
172
+ - **Missed connections** -- Claude B did not follow references to important files
173
+ - **Overreliance on certain sections** -- content that should be promoted to SKILL.md
174
+ - **Ignored content** -- files Claude B never accessed (may be unnecessary or poorly signaled)
175
+ - **Repeated work across test cases** -- all subagents wrote similar helper scripts (bundle them into the skill)
176
+
177
+ 3. Synthesize observations into actionable improvements. For each piece of feedback, identify the specific skill change that would fix it.
178
+
179
+ 4. Apply improvements. For significant changes, re-invoke `/skill-writer` with:
180
+
181
+ ```
182
+ Refine this existing skill based on testing observations.
183
+
184
+ Current SKILL.md: [reference or paste]
185
+ User feedback: [from feedback.json -- only non-empty entries]
186
+ Behavioral observations: [from transcript analysis]
187
+
188
+ Specific issues to address:
189
+ 1. [Issue]
190
+ 2. [Issue]
191
+
192
+ Constraint: Only change what the feedback demands. Do not reorganize working content.
193
+ ```
194
+
195
+ 5. Key principles for this phase (from the official docs):
196
+ - **Generalize from feedback** -- the skill will be used across many different prompts, not just these test cases
197
+ - **Keep the prompt lean** -- remove instructions that are not pulling their weight
198
+ - **Explain the why** -- theory of mind beats rigid MUSTs
199
+ - **Bundle repeated work** -- if subagents all wrote similar scripts, add them to the skill
200
+
201
+ 6. Return to Phase 4 with the refined skill. Continue iterating until:
202
+ - User feedback is all empty (satisfied with every test case)
203
+ - Pass rates meet acceptable thresholds
204
+ - No meaningful progress between iterations
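The three exit conditions above can be written as a single stopping rule. This is a rough sketch under stated assumptions: the workflow does not say whether one condition or all of them must hold, so the code assumes any one is enough, and the `threshold` and `min_gain` defaults are placeholder values, not numbers from the source.

```python
from typing import Sequence


def should_stop(
    feedback: Sequence[str],
    pass_rates: Sequence[float],
    threshold: float = 0.9,   # assumed "acceptable" pass rate
    min_gain: float = 0.02,   # assumed smallest gain that counts as progress
) -> tuple[bool, str]:
    """Decide whether to leave the Phase 4/5 loop.

    feedback: one entry per test case from the latest review ("" means satisfied).
    pass_rates: overall pass rate per iteration, oldest first.
    """
    if feedback and all(not entry.strip() for entry in feedback):
        return True, "all feedback empty"
    if pass_rates and pass_rates[-1] >= threshold:
        return True, "pass rate meets threshold"
    if len(pass_rates) >= 2 and pass_rates[-1] - pass_rates[-2] < min_gain:
        return True, "no meaningful progress"
    return False, "keep iterating"
```

Returning the reason alongside the decision makes it easy to report to the user why iteration ended, which matters most in the "no meaningful progress" case where gaps may remain.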
205
+
206
+ ---
207
+
208
+ ## Phase 6: Polish
209
+
210
+ **Goal:** Optimize the skill description for triggering accuracy and run final validation.
211
+
212
+ ### Process
213
+
214
+ 1. **Description optimization.** Follow the process in `${CLAUDE_SKILL_DIR}/workflows/polish-skill.md`.
215
+
216
+ 2. **Final validation.** Run the skill-writer self-check rubric against the finished skill:
217
+ - [ ] Description is third person with trigger phrases
218
+    - [ ] SKILL.md body under 500 lines
219
+ - [ ] States what to do in positive terms (not prohibition-heavy)
220
+ - [ ] Degree of freedom matches task fragility
221
+ - [ ] Progressive disclosure used (heavy content in separate files)
222
+ - [ ] Examples are concrete, not abstract
223
+ - [ ] Frontmatter fields are valid
224
+ - [ ] One skill = one capability
225
+
226
+ 3. **Final checklist** from the official Anthropic docs:
227
+ - [ ] At least 3 evaluation scenarios created and passing
228
+ - [ ] Tested with real usage scenarios
229
+ - [ ] Skill solves documented gaps (not imagined requirements)
230
+ - [ ] Iterative refinement based on observed behavior (not assumptions)
231
+
232
+ 4. Present the finished skill to the user with:
233
+ - Final benchmark comparison (latest iteration vs baseline)
234
+ - Summary of gaps addressed
235
+ - Any remaining limitations or known edge cases
@@ -0,0 +1,92 @@
1
+ # Polish Skill Workflow
2
+
3
+ Final optimization pass for a skill that is functionally complete.
4
+
5
+ ## Prerequisites
6
+
7
+ - The skill passes its evaluation scenarios
8
+ - The user is satisfied with output quality
9
+ - This is the final step before the skill is considered done
10
+
11
+ ### Package-aware polish (recommended)
12
+
13
+ When the polish pass will touch **more than frontmatter alone** (for example `REFERENCE.md`, `EXAMPLES.md`, `WORKFLOWS.md`, link structure, or eval JSON), or the user wants **checkpointed** multi-file updates alongside description work:
14
+
15
+ 1. Read `prompt-generator/templates/skill-refinement-package.md` (repository path: `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md`).
16
+ 2. Run `/prompt-generator` with tokens filled so `ARCHITECTURE.md` records baseline inventory, planned deltas for polish, and evidence rules for any new trigger or behavior evals.
17
+
18
+ Purely **single-field** `description` edits with no structural package changes can skip this block.
19
+
20
+ ---
21
+
22
+ ## Step 1: Description Optimization
23
+
24
+ Optimize the skill's description for triggering accuracy using the skill-creator's trigger eval system.
25
+
26
+ ### Generate trigger eval queries
27
+
28
+ Create 20 eval queries: 10 should-trigger and 10 should-not-trigger.
29
+
30
+ **Should-trigger queries (10):** Different phrasings of the same intent. Include:
31
+ - Formal and casual variations
32
+ - Cases where the user does not explicitly name the skill but clearly needs it
33
+ - Uncommon use cases
34
+ - Cases where this skill competes with another but should win
35
+
36
+ **Should-not-trigger queries (10):** Near-misses that share keywords but need something different. Include:
37
+ - Adjacent domains with overlapping terminology
38
+ - Ambiguous phrasing where naive keyword matching would falsely trigger
39
+ - Tasks that touch the skill's domain but in a context where another tool is better
40
+
41
+ All queries must be realistic: detailed and specific, with file paths, personal context, and casual speech, not abstract one-liners.
42
+
43
+ ### Review with user
44
+
45
+ Present the eval set using the skill-creator's HTML review template. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact process.
46
+
47
+ The user can edit queries, toggle should-trigger, and add/remove entries.
48
+
49
+ ### Run optimization loop
50
+
51
+ See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command. The loop:
52
+ 1. Splits eval set into 60% train / 40% held-out test
53
+ 2. Evaluates current description (3 runs per query for reliability)
54
+ 3. Proposes improvements based on failures
55
+ 4. Re-evaluates on both train and test
56
+ 5. Iterates up to 5 times
57
+ 6. Selects best description by test score (avoids overfitting)
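The loop described above can be illustrated with a simplified sketch. It is not skill-creator's actual implementation (the real command is in `delegation-map.md`): the `triggers` and `propose` callables are hypothetical stand-ins for "does this description trigger on this query" and "suggest a better description from these failures", and the sketch scores only the held-out set when choosing the winner.

```python
import random
from typing import Callable

Query = tuple[str, bool]  # (query text, should_trigger)


def split_queries(
    queries: list[Query], train_fraction: float = 0.6, seed: int = 0
) -> tuple[list[Query], list[Query]]:
    """Shuffle deterministically, then split into train and held-out test."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def score(
    description: str,
    queries: list[Query],
    triggers: Callable[[str, str], bool],
    runs: int = 3,
) -> float:
    """Fraction of (query, run) pairs whose trigger decision matches the label."""
    if not queries:
        return 0.0
    hits = sum(
        triggers(description, text) == expected
        for text, expected in queries
        for _ in range(runs)
    )
    return hits / (len(queries) * runs)


def optimize(
    description: str,
    queries: list[Query],
    triggers: Callable[[str, str], bool],
    propose: Callable[[str, list[str]], str],
    max_iterations: int = 5,
) -> tuple[str, float]:
    """Return the description with the best held-out score, and that score."""
    train, test = split_queries(queries)
    best = (description, score(description, test, triggers))
    current = description
    for _ in range(max_iterations):
        # Improvements are proposed from training failures only.
        failures = [text for text, expected in train if triggers(current, text) != expected]
        if not failures:
            break
        current = propose(current, failures)
        candidate = score(current, test, triggers)
        if candidate > best[1]:  # selecting on test score guards against overfitting
            best = (current, candidate)
    return best
```

With 20 queries, the 60/40 split yields 12 training and 8 held-out queries. Keeping the held-out set out of `propose` is what makes the final selection a fair check rather than a memorization score.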
58
+
59
+ ### Apply result
60
+
61
+ Update the skill's SKILL.md frontmatter with the optimized description. Show the user before/after with scores.
62
+
63
+ ---
64
+
65
+ ## Step 2: Final Validation
66
+
67
+ Run the skill-writer self-check rubric:
68
+
69
+ - [ ] Description is third person with trigger phrases
70
+ - [ ] SKILL.md body under 500 lines
71
+ - [ ] States what to do in positive terms (not prohibition-heavy)
72
+ - [ ] Degree of freedom matches task fragility
73
+ - [ ] Progressive disclosure used (heavy content in separate files)
74
+ - [ ] No time-sensitive claims unless clearly dated
75
+ - [ ] Examples are concrete, not abstract
76
+ - [ ] Frontmatter fields are valid per official docs
77
+ - [ ] One skill = one capability
78
+ - [ ] Consistent terminology throughout
79
+ - [ ] File references are one level deep from SKILL.md
80
+ - [ ] Files over 100 lines have a table of contents
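Two of the rubric items above are mechanical enough to check in code: the 500-line limit on the SKILL.md body and the table-of-contents rule for files over 100 lines. The sketch below is a heuristic under assumptions: it treats the frontmatter as the block between the first two `---` lines, and it treats any heading containing the word "contents" as a table of contents. Neither rule is stated by the source, and the function names are hypothetical.

```python
def body_line_count(skill_md: str) -> int:
    """Count lines after the frontmatter (assumed to sit between two '---' lines)."""
    lines = skill_md.splitlines()
    if lines and lines[0].strip() == "---":
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                return len(lines) - index - 1
    return len(lines)  # no frontmatter found: the whole file is body


def missing_toc(text: str, limit: int = 100) -> bool:
    """True when a file is over the limit and has no heading mentioning 'contents'."""
    lines = text.splitlines()
    if len(lines) <= limit:
        return False
    return not any(
        line.lstrip().startswith("#") and "contents" in line.lower()
        for line in lines
    )
```

The remaining rubric items (tone, degree of freedom, concreteness of examples) are judgment calls and are better left to the human review.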
81
+
82
+ ---
83
+
84
+ ## Step 3: Final Summary
85
+
86
+ Present the finished skill to the user:
87
+
88
+ 1. **Benchmark summary:** Final pass rate vs baseline, with delta
89
+ 2. **Gaps addressed:** Map each original gap to the skill content that addresses it
90
+ 3. **Description optimization:** Before/after trigger accuracy scores
91
+ 4. **Known limitations:** Anything the skill does not handle (scope boundaries)
92
+ 5. **Maintenance notes:** What to watch for in future usage that might warrant re-iteration
@@ -32,6 +32,24 @@ Use this skill when the user needs a structured skill artifact; for quick answer
32
32
 
33
33
  When invoked with arguments (e.g. `/skill-writer improve this: [paste]`), treat `$ARGUMENTS` as the skill content to refine.
34
34
 
35
+ ### Ground-up multi-file packages (required)
36
+
37
+ When the user is creating a **new** skill as a **package** (workspace with `ARCHITECTURE.md`, `REFERENCE.md`, `EXAMPLES.md`, `WORKFLOWS.md`, `evals/*.json`, per-file human review), **before** drafting `SKILL.md`:
38
+
39
+ 1. Read `packages/claude-dev-env/skills/prompt-generator/templates/skill-from-ground-up.md` (installed layout: sibling folder `prompt-generator/templates/skill-from-ground-up.md` under the same `skills/` parent).
40
+ 2. Ensure `/prompt-generator` has run with that template filled so architecture-first steps, checkpoint gates, and eval evidence rules are already agreed.
41
+
42
+ If the task is **only** editing an existing `SKILL.md` or a small single-file tweak, this subsection does not apply.
43
+
44
+ ### Refinement multi-file packages (required)
45
+
46
+ When the user is **refining** an existing skill as a **package** (baseline skill directory, `ARCHITECTURE.md` with planned deltas, checkpointed updates to REFERENCE / EXAMPLES / WORKFLOWS / `evals/`), **before** rewriting multiple files:
47
+
48
+ 1. Read `packages/claude-dev-env/skills/prompt-generator/templates/skill-refinement-package.md` (installed layout: `prompt-generator/templates/skill-refinement-package.md` under the same `skills/` parent).
49
+ 2. Ensure `/prompt-generator` has run with that template filled so baseline root, workspace root, observation inputs, and evidence rules are fixed before edits proceed.
50
+
51
+ If the change set is a **small single-file** tweak, this subsection does not apply.
52
+
35
53
  ## Workflow (run in order)
36
54
 
37
55
  ### 1. Classify the skill type