claude-dev-env 1.15.0 → 1.16.1

@@ -0,0 +1,88 @@
1
+ # Improve Skill Workflow
2
+
3
+ Observation-first flow for iterating on an existing skill.
4
+
5
+ ## Prerequisites
6
+
7
+ - An existing skill that needs improvement
8
+ - The skill has been used at least once (or the user has observed specific issues)
9
+
10
+ ---
11
+
12
+ ## Phase 1: Observe
13
+
14
+ **Goal:** Document the existing skill's current behavior by running it on real tasks.
15
+
16
+ > "Use the Skill in real workflows: Give Claude B (with the Skill loaded) actual tasks, not test scenarios"
17
+
18
+ ### Process
19
+
20
+ 1. Identify the skill to improve. Read its current SKILL.md and any reference files.
21
+
22
+ 2. Ask the user what prompted the improvement:
23
+ - "What specific issue did you observe?"
24
+ - "Can you give me a concrete task where the skill underperformed?"
25
+ - "Is this a triggering issue (skill does not activate), a quality issue (skill activates but produces poor results), or a scope issue (skill does the wrong thing)?"
26
+
27
+ 3. Run the existing skill on 2-3 real tasks. For each, spawn a subagent:
28
+
29
+ ```
30
+ Execute this task using the skill at [path-to-existing-skill]:
31
+ - Read the skill at [path]/SKILL.md and follow its instructions
32
+ - Task: [realistic task from user]
33
+ - Save outputs to: [skill-name]-workspace/observation/task-[N]/outputs/
34
+ - Save transcript to: [skill-name]-workspace/observation/task-[N]/transcript.md
35
+ ```
36
+
37
+ 4. Analyze the transcripts. Document observations:
38
+ - Where did the skill work well?
39
+ - Where did it fail or produce subpar results?
40
+ - Did Claude B follow the skill's instructions as written?
41
+ - Did Claude B ignore any sections or files?
42
+ - Did Claude B explore in unexpected directions?
43
+
44
+ 5. Generate a gap analysis (same template as new-skill Phase 1) focused on the delta between current behavior and desired behavior.
45
+
46
+ **Output:** `[skill-name]-workspace/gap-analysis.md` with observation-based gaps
47
+
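The observation runs in step 3 can be scripted. The following is a minimal sketch, assuming Python is available; `observation_prompt` is a hypothetical helper (not part of claude-dev-env) that fills in the subagent prompt template above for one task.

```python
from pathlib import Path


def observation_prompt(skill_path: str, skill_name: str, task: str, task_number: int) -> str:
    """Fill the Phase 1 subagent prompt for one observation task (hypothetical helper)."""
    # Workspace layout taken from the prompt template: [skill-name]-workspace/observation/task-[N]/
    run_dir = (Path(f"{skill_name}-workspace") / "observation" / f"task-{task_number}").as_posix()
    return "\n".join([
        f"Execute this task using the skill at {skill_path}:",
        f"- Read the skill at {skill_path}/SKILL.md and follow its instructions",
        f"- Task: {task}",
        f"- Save outputs to: {run_dir}/outputs/",
        f"- Save transcript to: {run_dir}/transcript.md",
    ])
```

Generating one prompt per task keeps the output and transcript paths consistent across the 2-3 runs, which makes the transcripts easier to compare in step 4.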
48
+ ---
49
+
50
+ ## Phases 2-6: Follow the New Skill Workflow
51
+
52
+ From here, follow the same phases as `${CLAUDE_SKILL_DIR}/workflows/new-skill.md`, starting at Phase 2 (Build Evals).
53
+
54
+ Key differences from the new-skill flow:
55
+
56
+ - **Phase 2 (Build Evals):** Evals should test the specific issues observed in Phase 1, not hypothetical gaps.
57
+
58
+ - **Phase 3 (Write Skill):** Instead of writing from scratch, invoke `/skill-writer` with:
59
+
60
+ ```
61
+ Refine this existing skill based on observation findings.
62
+
63
+ Current SKILL.md: [reference or paste current skill]
64
+ Gap analysis: [reference observation-based gaps]
65
+ Eval scenarios: [reference evals]
66
+
67
+ Constraint: Preserve what works. Only change what the observations demand.
68
+ ```
69
+
70
+ - **Phase 4 (Test):** The baseline is the CURRENT skill (snapshot it before editing). Compare old-skill vs new-skill, not with-skill vs without-skill.
71
+
72
+ Before making any changes, snapshot the existing skill:
73
+ ```bash
74
+ cp -r [skill-path] [workspace]/skill-snapshot/
75
+ ```
76
+
77
+ Then for baseline runs, point subagents at the snapshot:
78
+ ```
79
+ Execute this task using the ORIGINAL skill at [workspace]/skill-snapshot/:
80
+ - Read the skill and follow its instructions
81
+ - Task: [eval prompt]
82
+ - Save outputs to: [workspace]/iteration-N/eval-[name]/old_skill/outputs/
83
+ - Save transcript to: [workspace]/iteration-N/eval-[name]/old_skill/transcript.md
84
+ ```
85
+
86
+ - **Phase 5 (Iterate):** Same process. The improvement loop compares the new version against the snapshot.
87
+
88
+ - **Phase 6 (Polish):** Same process. Run description optimization if triggering was an issue.
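
The snapshot step under Phase 4 above can also be done from Python. This is a sketch under assumptions, not the package's own tooling: `snapshot_skill` is a hypothetical helper that mirrors the `cp -r` command but refuses to overwrite an existing snapshot, so the baseline cannot be replaced by an already-edited skill.

```python
import shutil
from pathlib import Path


def snapshot_skill(skill_path: str, workspace: str) -> Path:
    """Copy the current skill into [workspace]/skill-snapshot/ before any edits."""
    destination = Path(workspace) / "skill-snapshot"
    # copytree raises FileExistsError if the destination already exists,
    # which protects an earlier baseline snapshot from being overwritten.
    shutil.copytree(skill_path, destination)
    return destination
```

Failing loudly on a second call is a deliberate choice here: a silent overwrite would make old-skill vs new-skill comparisons meaningless.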
@@ -0,0 +1,223 @@
1
+ # New Skill Workflow
2
+
3
+ Full evaluation-driven lifecycle for building a new skill from scratch.
4
+
5
+ ## Prerequisites
6
+
7
+ - The user has a task or domain they want to capture as a skill
8
+ - No existing skill for this capability (or intentionally starting fresh)
9
+
10
+ ---
11
+
12
+ ## Phase 1: Identify Gaps
13
+
14
+ **Goal:** Document what fails or requires repeated context when working without a skill.
15
+
16
+ ### Process
17
+
18
+ 1. Have a guided conversation to uncover gaps. Explore these areas:
19
+ - "What task were you doing when you realized you needed a skill?"
20
+ - "What context did you repeatedly provide to Claude?"
21
+ - "Where did Claude fail or produce subpar results without guidance?"
22
+ - "What domain knowledge was missing?"
23
+ - "What specific format or structure did you need?"
24
+ - "Were there tools or scripts that needed to be used in a particular way?"
25
+ - "What rules or constraints did Claude violate?"
26
+
27
+ 2. As patterns emerge, probe for eval-worthy scenarios:
28
+ - "Can you give me a concrete example of a task where this failed?"
29
+ - "What would success look like for that specific task?"
30
+ - "Are there edge cases where the right approach changes?"
31
+
32
+ 3. Generate `gap-analysis.md` from the conversation using the template at `${CLAUDE_SKILL_DIR}/templates/gap-analysis.md`. Fill in all sections from what was discussed.
33
+
34
+ 4. Review the gap analysis with the user. Confirm completeness before moving to Phase 2.
35
+
36
+ **Output:** `[skill-name]-workspace/gap-analysis.md`
37
+
38
+ ---
39
+
40
+ ## Phase 2: Build Evals
41
+
42
+ **Goal:** Create 3+ evaluation scenarios that test the identified gaps. Establish a baseline.
43
+
44
+ ### Process
45
+
46
+ 1. Transform each gap into at least one eval scenario. Each scenario needs:
47
+ - A realistic user prompt (detailed and specific, like a real request)
48
+ - A description of what success looks like
49
+ - Objectively verifiable expectations (assertions)
50
+
51
+ 2. Draft evals using the schema at `${CLAUDE_SKILL_DIR}/templates/eval-scenario.json`. Ensure:
52
+ - Minimum 3 scenarios (official requirement)
53
+ - Every identified gap has at least one scenario testing it
54
+ - Expectations are objectively verifiable, not subjective
55
+ - Prompts sound like things a real user would say
56
+
57
+ 3. Review eval scenarios with the user. Adjust until both sides are satisfied.
58
+
59
+ 4. Save to `[skill-name]-workspace/evals/evals.json`.
60
+
61
+ 5. **Establish baseline.** For each eval, spawn a subagent WITHOUT any skill:
62
+
63
+ ```
64
+ Execute this task with NO skill loaded:
65
+ - Task: [eval prompt]
66
+ - Input files: [eval files if any, or "none"]
67
+ - Save all output files to: [workspace]/iteration-0/eval-[name]/without_skill/outputs/
68
+ - Save a complete transcript of your work to: [workspace]/iteration-0/eval-[name]/without_skill/transcript.md
69
+ ```
70
+
71
+ Spawn all baseline runs in parallel. Capture timing data when each completes.
72
+
73
+ 6. Grade baseline results using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for exact grading invocation.
74
+
75
+ **Output:** `[skill-name]-workspace/evals/evals.json` and baseline results in `iteration-0/`
76
+
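The checks in step 2 (minimum of 3 scenarios, every gap covered, every scenario carrying expectations) lend themselves to a quick automated pass before the user review. The sketch below is illustrative only: the real schema lives in `templates/eval-scenario.json`, which is not shown here, so the `gaps` field and the `check_eval_set` helper are assumptions. Only `expectations` is named in the workflow text.

```python
def check_eval_set(scenarios: list[dict], gap_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(scenarios) < 3:
        problems.append(f"need at least 3 scenarios, found {len(scenarios)}")
    covered = set()
    for index, scenario in enumerate(scenarios):
        # Every scenario needs at least one verifiable expectation.
        if not scenario.get("expectations"):
            problems.append(f"scenario {index} has no expectations")
        # "gaps" is an assumed field linking a scenario to gap-analysis entries.
        covered.update(scenario.get("gaps", []))
    for gap in sorted(gap_ids - covered):
        problems.append(f"gap {gap!r} has no scenario testing it")
    return problems
```

A check like this cannot judge whether an expectation is objectively verifiable or whether a prompt sounds realistic; those still need the review in step 3.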
77
+ ---
78
+
79
+ ## Phase 3: Write Minimal Skill
80
+
81
+ **Goal:** Create just enough skill content to address the documented gaps and pass evaluations.
82
+
83
+ ### Process
84
+
85
+ 1. Invoke `/skill-writer` with this context:
86
+
87
+ ```
88
+ Create a skill based on this gap analysis and eval scenarios.
89
+
90
+ Gap analysis: [reference or paste gap-analysis.md]
91
+ Eval scenarios: [reference or paste evals.json expected_output and expectations]
92
+ Baseline failures: [summarize what Claude got wrong in iteration-0]
93
+
94
+ Constraint: Write the minimum instructions needed to address these specific gaps.
95
+ Every line must serve a documented gap. Do not over-document.
96
+ ```
97
+
98
+ 2. `/skill-writer` will run its workflow: classify type, set degree of freedom, ask clarifying questions, produce the SKILL.md artifact.
99
+
100
+ 3. Review the draft with the user:
101
+ - "Does this address all the gaps we identified?"
102
+ - "Is anything here unnecessary or over-engineered?"
103
+ - "Would this pass our eval scenarios?"
104
+
105
+ 4. Save the skill to its target directory.
106
+
107
+ **Output:** The skill's SKILL.md (and optional reference files)
108
+
109
+ ---
110
+
111
+ ## Phase 4: Test (Feedback Loop)
112
+
113
+ **Goal:** Run the skill on eval scenarios, compare against baseline, identify remaining gaps.
114
+
115
+ ### Process
116
+
117
+ 1. **Spawn all runs in parallel.** For each eval scenario, launch a with-skill subagent:
118
+
119
+ ```
120
+ Execute this task:
121
+ - Read the skill at [path-to-skill]/SKILL.md and follow its instructions
122
+ - Task: [eval prompt from evals.json]
123
+ - Input files: [eval files if any, or "none"]
124
+ - Save all output files to: [workspace]/iteration-N/eval-[name]/with_skill/outputs/
125
+ - Save a complete transcript of your work to: [workspace]/iteration-N/eval-[name]/with_skill/transcript.md
126
+ ```
127
+
128
+ For iteration-1, the without-skill baseline already exists from Phase 2.
129
+
130
+ 2. **While runs are in progress**, review and refine assertions if needed based on what was learned from the baseline.
131
+
132
+ 3. **When runs complete**, immediately capture timing data (`total_tokens`, `duration_ms`) to `timing.json` in each run directory. This data is only available in the task completion notification.
133
+
134
+ 4. **Grade each run** using the skill-creator grading agent. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the grading process.
135
+
136
+ 5. **Aggregate into benchmark** using skill-creator's aggregation script. See delegation-map.md for the exact command.
137
+
138
+ 6. **Launch the eval viewer** using skill-creator's generate_review.py. See delegation-map.md for the exact command. For iteration 2+, include `--previous-workspace` to show diffs.
139
+
140
+ 7. Tell the user to review in the viewer:
141
+ - "Outputs" tab: click through each test case, leave feedback
142
+ - "Benchmark" tab: quantitative comparison (pass rates, timing, tokens)
143
+
144
+ 8. Wait for the user to complete their review.
145
+
146
+ **Output:** `grading.json`, `benchmark.json`, `feedback.json` in the iteration directory
147
+
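Because the timing data in step 3 is only available in the task completion notification, it helps to persist it immediately. A minimal sketch, assuming the two values have already been read from the notification; `save_timing` is a hypothetical helper, and the file layout beyond the two named keys is an assumption.

```python
import json
from pathlib import Path


def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> Path:
    """Write timing.json into a run directory as soon as the run completes."""
    target = Path(run_dir) / "timing.json"
    # Create the run directory if the subagent has not written anything yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"total_tokens": total_tokens, "duration_ms": duration_ms}
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

For example, `save_timing("ws/iteration-1/eval-merge/with_skill", 48210, 93500)` would write the file next to that run's `outputs/` directory (the path and numbers here are made up for illustration).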
148
+ ---
149
+
150
+ ## Phase 5: Iterate
151
+
152
+ **Goal:** Refine the skill based on observed Claude B behavior and user feedback.
153
+
154
+ ### Process
155
+
156
+ 1. Read `feedback.json` from the viewer. Empty feedback means the user was satisfied with that test case.
157
+
158
+ 2. Read transcripts from Phase 4 runs. Watch for the signals the official docs highlight:
159
+ - **Unexpected exploration paths** -- Claude B read files in an order you did not anticipate
160
+ - **Missed connections** -- Claude B did not follow references to important files
161
+ - **Overreliance on certain sections** -- content that should be promoted to SKILL.md
162
+ - **Ignored content** -- files Claude B never accessed (may be unnecessary or poorly signaled)
163
+ - **Repeated work across test cases** -- all subagents wrote similar helper scripts (bundle them into the skill)
164
+
165
+ 3. Synthesize observations into actionable improvements. For each piece of feedback, identify the specific skill change that would fix it.
166
+
167
+ 4. Apply improvements. For significant changes, re-invoke `/skill-writer` with:
168
+
169
+ ```
170
+ Refine this existing skill based on testing observations.
171
+
172
+ Current SKILL.md: [reference or paste]
173
+ User feedback: [from feedback.json -- only non-empty entries]
174
+ Behavioral observations: [from transcript analysis]
175
+
176
+ Specific issues to address:
177
+ 1. [Issue]
178
+ 2. [Issue]
179
+
180
+ Constraint: Only change what the feedback demands. Do not reorganize working content.
181
+ ```
182
+
183
+ 5. Key principles for this phase (from the official docs):
184
+ - **Generalize from feedback** -- the skill will be used across many different prompts, not just these test cases
185
+ - **Keep the prompt lean** -- remove instructions that are not pulling their weight
186
+ - **Explain the why** -- instructions that give their reasoning generalize better than rigid MUSTs
187
+ - **Bundle repeated work** -- if subagents all wrote similar scripts, add them to the skill
188
+
189
+ 6. Return to Phase 4 with the refined skill. Continue iterating until:
190
+ - User feedback is all empty (satisfied with every test case)
191
+ - Pass rates meet acceptable thresholds
192
+ - No meaningful progress between iterations
193
+
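The stopping conditions in step 6 can be written down as a small predicate. This sketch treats the three conditions as alternatives (any one ends the loop), which is an interpretation rather than something the workflow states. The `threshold` and `min_gain` values are hypothetical placeholders, since the workflow does not define "acceptable" or "meaningful".

```python
def should_stop(feedback: list[str], pass_rate: float, previous_pass_rate: float,
                threshold: float = 0.9, min_gain: float = 0.02) -> bool:
    """Decide whether to stop iterating after a Phase 4 run (illustrative only)."""
    # Empty feedback entries mean the user was satisfied with that test case.
    all_satisfied = all(not entry.strip() for entry in feedback)
    # threshold is an assumed value for "acceptable" pass rates.
    meets_threshold = pass_rate >= threshold
    # min_gain is an assumed value for "meaningful progress".
    stalled = (pass_rate - previous_pass_rate) < min_gain
    return all_satisfied or meets_threshold or stalled
```

In practice the thresholds should be agreed with the user up front, so the loop does not end on numbers nobody chose.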
194
+ ---
195
+
196
+ ## Phase 6: Polish
197
+
198
+ **Goal:** Optimize the skill description for triggering accuracy and run final validation.
199
+
200
+ ### Process
201
+
202
+ 1. **Description optimization.** Follow the process in `${CLAUDE_SKILL_DIR}/workflows/polish-skill.md`.
203
+
204
+ 2. **Final validation.** Run the skill-writer self-check rubric against the finished skill:
205
+ - [ ] Description is third person with trigger phrases
206
+ - [ ] SKILL.md body under 500 lines
207
+ - [ ] States what to do in positive terms (not prohibition-heavy)
208
+ - [ ] Degree of freedom matches task fragility
209
+ - [ ] Progressive disclosure used (heavy content in separate files)
210
+ - [ ] Examples are concrete, not abstract
211
+ - [ ] Frontmatter fields are valid
212
+ - [ ] One skill = one capability
213
+
214
+ 3. **Final checklist** from the official Anthropic docs:
215
+ - [ ] At least 3 evaluation scenarios created and passing
216
+ - [ ] Tested with real usage scenarios
217
+ - [ ] Skill solves documented gaps (not imagined requirements)
218
+ - [ ] Iterative refinement based on observed behavior (not assumptions)
219
+
220
+ 4. Present the finished skill to the user with:
221
+ - Final benchmark comparison (latest iteration vs baseline)
222
+ - Summary of gaps addressed
223
+ - Any remaining limitations or known edge cases
@@ -0,0 +1,83 @@
1
+ # Polish Skill Workflow
2
+
3
+ Final optimization pass for a skill that is functionally complete.
4
+
5
+ ## Prerequisites
6
+
7
+ - The skill passes its evaluation scenarios
8
+ - The user is satisfied with output quality
9
+ - This is the final step before the skill is considered done
10
+
11
+ ---
12
+
13
+ ## Step 1: Description Optimization
14
+
15
+ Optimize the skill's description for triggering accuracy using the skill-creator's trigger eval system.
16
+
17
+ ### Generate trigger eval queries
18
+
19
+ Create 20 eval queries: 10 should-trigger and 10 should-not-trigger.
20
+
21
+ **Should-trigger queries (10):** Different phrasings of the same intent. Include:
22
+ - Formal and casual variations
23
+ - Cases where the user does not explicitly name the skill but clearly needs it
24
+ - Uncommon use cases
25
+ - Cases where this skill competes with another but should win
26
+
27
+ **Should-not-trigger queries (10):** Near-misses that share keywords but need something different. Include:
28
+ - Adjacent domains with overlapping terminology
29
+ - Ambiguous phrasing where naive keyword matching would falsely trigger
30
+ - Tasks that touch the skill's domain but in a context where another tool is better
31
+
32
+ All queries must be realistic: detailed and specific, with file paths, personal context, and casual speech, not abstract one-liners.
33
+
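Before the review, it is easy to confirm that the eval set has the intended 10/10 balance. A minimal sketch; the `should_trigger` field name is an assumption based on the "toggle should-trigger" wording, and `check_trigger_set` is a hypothetical helper.

```python
def check_trigger_set(queries: list[dict]) -> dict:
    """Count should-trigger and should-not-trigger queries in an eval set."""
    positives = sum(1 for query in queries if query["should_trigger"])
    negatives = len(queries) - positives
    return {
        "should_trigger": positives,
        "should_not_trigger": negatives,
        # The workflow asks for exactly 10 of each.
        "balanced": positives == 10 and negatives == 10,
    }
```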
34
+ ### Review with user
35
+
36
+ Present the eval set using the skill-creator's HTML review template. See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact process.
37
+
38
+ The user can edit queries, toggle should-trigger, and add/remove entries.
39
+
40
+ ### Run optimization loop
41
+
42
+ See `${CLAUDE_SKILL_DIR}/references/delegation-map.md` for the exact command. The loop:
43
+ 1. Splits eval set into 60% train / 40% held-out test
44
+ 2. Evaluates current description (3 runs per query for reliability)
45
+ 3. Proposes improvements based on failures
46
+ 4. Re-evaluates on both train and test
47
+ 5. Iterates up to 5 times
48
+ 6. Selects the best description by held-out test score, which guards against overfitting to the train split
49
+
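The split and selection steps of the loop can be illustrated as follows. This is a sketch of the general technique, not the skill-creator implementation (which is only reachable through `delegation-map.md`); the function names, the fixed seed, and the `test_score` key are assumptions, and the real loop may split differently, for example by stratifying on should-trigger.

```python
import random


def split_eval_set(queries: list[dict], train_fraction: float = 0.6, seed: int = 0):
    """Shuffle and split queries into train and held-out test subsets."""
    shuffled = list(queries)
    # A seeded generator keeps the split reproducible between iterations.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def select_best(candidates: list[dict]) -> dict:
    """Pick the candidate description with the highest held-out test score."""
    return max(candidates, key=lambda candidate: candidate["test_score"])
```

With 20 queries this yields 12 train and 8 test queries. Selecting on `test_score` rather than train score matters because each proposed description is written in response to train failures, so train accuracy alone overstates how well it generalizes.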
50
+ ### Apply result
51
+
52
+ Update the skill's SKILL.md frontmatter with the optimized description. Show the user before/after with scores.
53
+
54
+ ---
55
+
56
+ ## Step 2: Final Validation
57
+
58
+ Run the skill-writer self-check rubric:
59
+
60
+ - [ ] Description is third person with trigger phrases
61
+ - [ ] SKILL.md body under 500 lines
62
+ - [ ] States what to do in positive terms (not prohibition-heavy)
63
+ - [ ] Degree of freedom matches task fragility
64
+ - [ ] Progressive disclosure used (heavy content in separate files)
65
+ - [ ] No time-sensitive claims unless clearly dated
66
+ - [ ] Examples are concrete, not abstract
67
+ - [ ] Frontmatter fields are valid per official docs
68
+ - [ ] One skill = one capability
69
+ - [ ] Consistent terminology throughout
70
+ - [ ] File references are one level deep from SKILL.md
71
+ - [ ] Files over 100 lines have a table of contents
72
+
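The "SKILL.md body under 500 lines" item is one of the few rubric checks that can be measured mechanically. A minimal sketch, assuming "body" means everything after the YAML frontmatter block delimited by `---` lines; `body_line_count` is a hypothetical helper.

```python
def body_line_count(skill_md: str) -> int:
    """Count lines in SKILL.md, excluding a leading '---' frontmatter block."""
    lines = skill_md.splitlines()
    if lines and lines[0].strip() == "---":
        # Skip frontmatter: everything up to and including the closing '---'.
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                return len(lines) - (index + 1)
    # No frontmatter (or it is never closed): count every line.
    return len(lines)
```

The remaining rubric items, such as degree of freedom or consistent terminology, still call for a human or model read-through.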
73
+ ---
74
+
75
+ ## Step 3: Final Summary
76
+
77
+ Present the finished skill to the user:
78
+
79
+ 1. **Benchmark summary:** Final pass rate vs baseline, with delta
80
+ 2. **Gaps addressed:** Map each original gap to the skill content that addresses it
81
+ 3. **Description optimization:** Before/after trigger accuracy scores
82
+ 4. **Known limitations:** Anything the skill does not handle (scope boundaries)
83
+ 5. **Maintenance notes:** What to watch for in future usage that might warrant re-iteration