@buaa_smat/hometrans 0.1.7 → 0.1.8

@@ -1,300 +0,0 @@
1
- > ## Documentation Index
2
- > Fetch the complete documentation index at: https://agentskills.io/llms.txt
3
- > Use this file to discover all available pages before exploring further.
4
-
5
- # Evaluating skill output quality
6
-
7
- > How to test whether your skill produces good outputs using eval-driven iteration.
8
-
9
- You wrote a skill, tried it on a prompt, and it seemed to work. But does it work reliably — across varied prompts, in edge cases, better than no skill at all? Running structured evaluations (evals) answers these questions and gives you a feedback loop for improving the skill systematically.
10
-
11
- ## Designing test cases
12
-
13
- A test case has three parts:
14
-
15
- * **Prompt**: a realistic user message — the kind of thing someone would actually type.
16
- * **Expected output**: a human-readable description of what success looks like.
17
- * **Input files** (optional): files the skill needs to work with.
18
-
19
- Store test cases in `evals/evals.json` inside your skill directory:
20
-
21
- ```json evals/evals.json theme={null}
22
- {
23
-   "skill_name": "csv-analyzer",
24
-   "evals": [
25
-     {
26
-       "id": 1,
27
-       "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
28
-       "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
29
-       "files": ["evals/files/sales_2025.csv"]
30
-     },
31
-     {
32
-       "id": 2,
33
-       "prompt": "there's a csv in my downloads called customers.csv, some rows have missing emails — can you clean it up and tell me how many were missing?",
34
-       "expected_output": "A cleaned CSV with missing emails handled, plus a count of how many were missing.",
35
-       "files": ["evals/files/customers.csv"]
36
-     }
37
-   ]
38
- }
39
- ```
40
-
41
- **Tips for writing good test prompts:**
42
-
43
- * **Start with 2-3 test cases.** Don't over-invest before you've seen your first round of results. You can expand the set later.
44
- * **Vary the prompts.** Use different phrasings, levels of detail, and formality. Some prompts should be casual ("hey can you clean up this csv"), others precise ("Parse the CSV at data/input.csv, drop rows where column B is null, and write the result to data/output.csv").
45
- * **Cover edge cases.** Include at least one prompt that tests a boundary condition — a malformed input, an unusual request, or a case where the skill's instructions might be ambiguous.
46
- * **Use realistic context.** Real users mention file paths, column names, and personal context. Prompts like "process this data" are too vague to test anything useful.
47
-
48
- Don't worry about defining specific pass/fail checks yet — just the prompts and expected outputs. You'll add detailed checks (called assertions) after you see what the first run produces.
49
-
50
- ## Running evals
51
-
52
- The core pattern is to run each test case twice: once **with the skill** and once **without it** (or with a previous version). This gives you a baseline to compare against.
53
-
54
- ### Workspace structure
55
-
56
- Organize eval results in a workspace directory alongside your skill directory. Each pass through the full eval loop gets its own `iteration-N/` directory. Within that, each test case gets an eval directory with `with_skill/` and `without_skill/` subdirectories:
57
-
58
- ```
59
- csv-analyzer/
60
- ├── SKILL.md
61
- └── evals/
62
-     └── evals.json
63
- csv-analyzer-workspace/
64
- └── iteration-1/
65
-     ├── eval-top-months-chart/
66
-     │   ├── with_skill/
67
-     │   │   ├── outputs/        # Files produced by the run
68
-     │   │   ├── timing.json     # Tokens and duration
69
-     │   │   └── grading.json    # Assertion results
70
-     │   └── without_skill/
71
-     │       ├── outputs/
72
-     │       ├── timing.json
73
-     │       └── grading.json
74
-     ├── eval-clean-missing-emails/
75
-     │   ├── with_skill/
76
-     │   │   ├── outputs/
77
-     │   │   ├── timing.json
78
-     │   │   └── grading.json
79
-     │   └── without_skill/
80
-     │       ├── outputs/
81
-     │       ├── timing.json
82
-     │       └── grading.json
83
-     └── benchmark.json          # Aggregated statistics
84
- ```
85
-
86
- The main file you author by hand is `evals/evals.json`. The other JSON files (`grading.json`, `timing.json`, `benchmark.json`) are produced during the eval process — by the agent, by scripts, or by you.
87
-
88
- ### Spawning runs
89
-
90
- Each eval run should start with a clean context — no leftover state from previous runs or from the skill development process. This ensures the agent follows only what the `SKILL.md` tells it. In environments that support subagents (Claude Code, for example), this isolation comes naturally: each child task starts fresh. Without subagents, use a separate session for each run.
91
-
92
- For each run, provide:
93
-
94
- * The skill path (or no skill for the baseline)
95
- * The test prompt
96
- * Any input files
97
- * The output directory
98
-
99
- Here's an example of the instructions you'd give the agent for a single with-skill run:
100
-
101
- ```
102
- Execute this task:
103
- - Skill path: /path/to/csv-analyzer
104
- - Task: I have a CSV of monthly sales data in data/sales_2025.csv.
105
- Can you find the top 3 months by revenue and make a bar chart?
106
- - Input files: evals/files/sales_2025.csv
107
- - Save outputs to: csv-analyzer-workspace/iteration-1/eval-top-months-chart/with_skill/outputs/
108
- ```
109
-
110
- For the baseline, use the same prompt but without the skill path, saving to `without_skill/outputs/`.
111
-
112
- When improving an existing skill, use the previous version as your baseline. Snapshot it before editing (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline run at the snapshot, and save to `old_skill/outputs/` instead of `without_skill/`.
113
-
114
- ### Capturing timing data
115
-
116
- Timing data lets you compare how much time and how many tokens the skill costs relative to the baseline. A skill that dramatically improves output quality but triples token usage is a different trade-off than one that's both better and cheaper. When each run completes, record the token count and duration:
117
-
118
- ```json timing.json theme={null}
119
- {
120
-   "total_tokens": 84852,
121
-   "duration_ms": 23332
122
- }
123
- ```
124
-
125
- <Tip>
126
- In Claude Code, when a subagent task finishes, the [task completion notification](https://platform.claude.com/docs/en/agent-sdk/typescript#sdk-task-notification-message) includes `total_tokens` and `duration_ms`. Save these values immediately — they aren't persisted anywhere else.
127
- </Tip>
128
-
129
- ## Writing assertions
130
-
131
- Assertions are verifiable statements about what the output should contain or achieve. Add them after you see your first round of outputs — you often don't know what "good" looks like until the skill has run.
132
-
133
- Good assertions:
134
-
135
- * `"The output file is valid JSON"` — programmatically verifiable.
136
- * `"The bar chart has labeled axes"` — specific and observable.
137
- * `"The report includes at least 3 recommendations"` — countable.
138
-
139
- Weak assertions:
140
-
141
- * `"The output is good"` — too vague to grade.
142
- * `"The output uses exactly the phrase 'Total Revenue: $X'"` — too brittle; correct output with different wording would fail.
143
-
144
- Not everything needs an assertion. Some qualities — writing style, visual design, whether the output "feels right" — are hard to decompose into pass/fail checks. These are better caught during [human review](#reviewing-results-with-a-human). Reserve assertions for things that can be checked objectively.
145
-
146
- Add assertions to each test case in `evals/evals.json`:
147
-
148
- ```json evals/evals.json highlight={9-14} theme={null}
149
- {
150
-   "skill_name": "csv-analyzer",
151
-   "evals": [
152
-     {
153
-       "id": 1,
154
-       "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
155
-       "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
156
-       "files": ["evals/files/sales_2025.csv"],
157
-       "assertions": [
158
-         "The output includes a bar chart image file",
159
-         "The chart shows exactly 3 months",
160
-         "Both axes are labeled",
161
-         "The chart title or caption mentions revenue"
162
-       ]
163
-     }
164
-   ]
165
- }
166
- ```
167
-
168
- ## Grading outputs
169
-
170
- Grading means evaluating each assertion against the actual outputs and recording **PASS** or **FAIL** with specific evidence. The evidence should quote or reference the output, not just state an opinion.
171
-
172
- The simplest approach is to give the outputs and assertions to an LLM and ask it to evaluate each one. For assertions that can be checked by code (valid JSON, correct row count, file exists with expected dimensions), use a verification script — scripts are more reliable than LLM judgment for mechanical checks and reusable across iterations.
173
-
174
- ```json grading.json theme={null}
175
- {
176
-   "assertion_results": [
177
-     {
178
-       "text": "The output includes a bar chart image file",
179
-       "passed": true,
180
-       "evidence": "Found chart.png (45KB) in outputs directory"
181
-     },
182
-     {
183
-       "text": "The chart shows exactly 3 months",
184
-       "passed": true,
185
-       "evidence": "Chart displays bars for March, July, and November"
186
-     },
187
-     {
188
-       "text": "Both axes are labeled",
189
-       "passed": false,
190
-       "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
191
-     },
192
-     {
193
-       "text": "The chart title or caption mentions revenue",
194
-       "passed": true,
195
-       "evidence": "Chart title reads 'Top 3 Months by Revenue'"
196
-     }
197
-   ],
198
-   "summary": {
199
-     "passed": 3,
200
-     "failed": 1,
201
-     "total": 4,
202
-     "pass_rate": 0.75
203
-   }
204
- }
205
- ```
206
-
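The mechanical side of grading can be scripted. The following is a minimal Python sketch, not part of the original guide: `check_valid_json` and `summarize` are hypothetical helper names, and the field names simply mirror the `grading.json` example above. It shows one code-checkable assertion and how the `summary` object can be derived from the per-assertion results rather than typed by hand.

```python
import json


def check_valid_json(text):
    """Mechanical check for 'the output is valid JSON'. Returns (passed, evidence)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"json.loads failed: {exc.msg} (line {exc.lineno})"
    return True, "Parsed without errors using json.loads"


def summarize(assertion_results):
    """Build the summary object from a list of graded assertions."""
    total = len(assertion_results)
    passed = sum(1 for result in assertion_results if result["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }


# Results copied from the grading.json example above.
assertion_results = [
    {"text": "The output includes a bar chart image file", "passed": True,
     "evidence": "Found chart.png (45KB) in outputs directory"},
    {"text": "The chart shows exactly 3 months", "passed": True,
     "evidence": "Chart displays bars for March, July, and November"},
    {"text": "Both axes are labeled", "passed": False,
     "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"},
    {"text": "The chart title or caption mentions revenue", "passed": True,
     "evidence": "Chart title reads 'Top 3 Months by Revenue'"},
]

grading = {"assertion_results": assertion_results,
           "summary": summarize(assertion_results)}
print(json.dumps(grading["summary"]))
```

Deriving the summary from the results keeps the counts and `pass_rate` consistent with the individual assertions when a grade is later corrected.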
207
- ### Grading principles
208
-
209
- * **Require concrete evidence for a PASS.** Don't give the benefit of the doubt. If an assertion says "includes a summary" and the output has a section titled "Summary" with one vague sentence, that's a FAIL — the label is there but the substance isn't.
210
- * **Review the assertions themselves, not just the results.** While grading, notice when assertions are too easy (always pass regardless of skill quality), too hard (always fail even when the output is good), or unverifiable (can't be checked from the output alone). Fix these for the next iteration.
211
-
212
- <Tip>
213
- For comparing two skill versions, try **blind comparison**: present both outputs to an LLM judge without revealing which came from which version. The judge scores holistic qualities — organization, formatting, usability, polish — on its own rubric, free from bias about which version "should" be better. This complements assertion grading: two outputs might both pass all assertions but differ significantly in overall quality.
214
- </Tip>
215
-
216
- ## Aggregating results
217
-
218
- Once every run in the iteration is graded, compute summary statistics per configuration and save them to `benchmark.json` alongside the eval directories (e.g., `csv-analyzer-workspace/iteration-1/benchmark.json`):
219
-
220
- ```json benchmark.json theme={null}
221
- {
221
-   "run_summary": {
222
-     "with_skill": {
223
-       "pass_rate": { "mean": 0.83, "stddev": 0.06 },
224
-       "time_seconds": { "mean": 45.0, "stddev": 12.0 },
225
-       "tokens": { "mean": 3800, "stddev": 400 }
226
-     },
227
-     "without_skill": {
228
-       "pass_rate": { "mean": 0.33, "stddev": 0.10 },
229
-       "time_seconds": { "mean": 32.0, "stddev": 8.0 },
230
-       "tokens": { "mean": 2100, "stddev": 300 }
231
-     },
232
-     "delta": {
233
-       "pass_rate": 0.50,
234
-       "time_seconds": 13.0,
235
-       "tokens": 1700
236
-     }
237
-   }
239
- }
240
- ```
241
-
242
- The `delta` tells you what the skill costs (more time, more tokens) and what it buys (higher pass rate). A skill that adds 13 seconds but improves pass rate by 50 percentage points is probably worth it. A skill that doubles token usage for a 2-point improvement might not be.
243
-
244
- <Note>
245
- Standard deviation (`stddev`) is only meaningful with multiple runs per eval. In early iterations with just 2-3 test cases and single runs, focus on the raw pass counts and the delta — the statistical measures become useful as you expand the test set and run each eval multiple times.
246
- </Note>
247
-
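One way to produce a file shaped like the `benchmark.json` example is sketched below in Python. The input layout (a list of per-run records for each configuration), the `describe` and `build_benchmark` names, and the sample numbers are assumptions made for illustration. The sketch also assumes durations were already converted from `duration_ms` to seconds, and it uses the sample standard deviation, which the guide does not specify.

```python
from statistics import mean, stdev

METRICS = ("pass_rate", "time_seconds", "tokens")


def describe(values):
    """Mean and sample standard deviation; stddev is 0.0 for a single run."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"mean": round(mean(values), 2), "stddev": round(spread, 2)}


def build_benchmark(runs):
    """runs maps 'with_skill' / 'without_skill' to a list of per-run records."""
    summary = {
        config: {metric: describe([record[metric] for record in records])
                 for metric in METRICS}
        for config, records in runs.items()
    }
    # Delta is the with-skill mean minus the baseline mean for each metric.
    summary["delta"] = {
        metric: round(summary["with_skill"][metric]["mean"]
                      - summary["without_skill"][metric]["mean"], 2)
        for metric in METRICS
    }
    return {"run_summary": summary}


# Made-up illustration values: two runs per configuration.
runs = {
    "with_skill": [
        {"pass_rate": 0.5, "time_seconds": 40.0, "tokens": 3600},
        {"pass_rate": 1.0, "time_seconds": 50.0, "tokens": 4000},
    ],
    "without_skill": [
        {"pass_rate": 0.25, "time_seconds": 30.0, "tokens": 2000},
        {"pass_rate": 0.25, "time_seconds": 34.0, "tokens": 2200},
    ],
}
benchmark = build_benchmark(runs)
print(benchmark["run_summary"]["delta"])
```

With a single run per eval, `describe` reports a `stddev` of 0.0, which matches the note above: the spread carries no information until each eval is run several times.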
248
- ## Analyzing patterns
249
-
250
- Aggregate statistics can hide important patterns. After computing the benchmarks:
251
-
252
- * **Remove or replace assertions that always pass in both configurations.** These don't tell you anything useful — the model handles them fine without the skill. They inflate the with-skill pass rate without reflecting actual skill value.
253
- * **Investigate assertions that always fail in both configurations.** Either the assertion is broken (asking for something the model can't do), the test case is too hard, or the assertion is checking for the wrong thing. Fix these before the next iteration.
254
- * **Study assertions that pass with the skill but fail without.** This is where the skill is clearly adding value. Understand *why* — which instructions or scripts made the difference?
255
- * **Tighten instructions when results are inconsistent across runs.** If the same eval passes sometimes and fails others (reflected as high `stddev` in the benchmark), the eval may be flaky (sensitive to model randomness), or the skill's instructions may be ambiguous enough that the model interprets them differently each time. Add examples or more specific guidance to reduce ambiguity.
256
- * **Check time and token outliers.** If one eval takes 3x longer than the others, read its execution transcript (the full log of what the model did during the run) to find the bottleneck.
257
-
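The first four checks in this list can be approximated with a small classifier. This Python sketch is an illustration only: the `classify_assertion` name, the label strings, and the input shape (one list of pass/fail booleans per configuration, one entry per run) are assumptions, not something the guide prescribes.

```python
def classify_assertion(with_skill, without_skill):
    """Label one assertion from its results across runs.

    Each argument is a non-empty list of booleans, one per run.
    """
    if all(with_skill) and all(without_skill):
        return "always passes: remove or replace"
    if not any(with_skill) and not any(without_skill):
        return "always fails: investigate the assertion or test case"
    if all(with_skill) and not any(without_skill):
        return "skill adds value: study why"
    if any(with_skill) and not all(with_skill):
        return "inconsistent with skill: tighten instructions"
    return "other: review manually"


# Made-up illustration results: three runs per configuration.
results = {
    "The output includes a bar chart image file": ([True, True, True], [True, True, True]),
    "Both axes are labeled": ([True, True, True], [False, False, False]),
    "The chart shows exactly 3 months": ([True, False, True], [False, False, False]),
}
for text, (with_skill, without_skill) in results.items():
    print(f"{text} -> {classify_assertion(with_skill, without_skill)}")
```

The labels are a starting point for review; the time and token outlier check still requires reading the execution transcripts.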
258
- ## Reviewing results with a human
259
-
260
- Assertion grading and pattern analysis catch a lot, but they only check what you thought to write assertions for. A human reviewer brings a fresh perspective — catching issues you didn't anticipate, noticing when the output is technically correct but misses the point, or spotting problems that are hard to express as pass/fail checks. For each test case, review the actual outputs alongside the grades.
261
-
262
- Record specific feedback for each test case and save it in the workspace (e.g., as a `feedback.json` alongside the eval directories):
263
-
264
- ```json feedback.json theme={null}
265
- {
266
-   "eval-top-months-chart": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
267
-   "eval-clean-missing-emails": ""
268
- }
269
- ```
270
-
271
- "The chart is missing axis labels" is actionable; "looks bad" is not. Empty feedback means the output looked fine — that test case passed your review. During the [iteration step](#iterating-on-the-skill), focus your improvements on the test cases where you had specific complaints.
272
-
273
- ## Iterating on the skill
274
-
275
- After grading and reviewing, you have three sources of signal:
276
-
277
- * **Failed assertions** point to specific gaps — a missing step, an unclear instruction, or a case the skill doesn't handle.
278
- * **Human feedback** points to broader quality issues — the approach was wrong, the output was poorly structured, or the skill produced a technically correct but unhelpful result.
279
- * **Execution transcripts** reveal *why* things went wrong. If the agent ignored an instruction, the instruction may be ambiguous. If the agent spent time on unproductive steps, those instructions may need to be simplified or removed.
280
-
281
- The most effective way to turn these signals into skill improvements is to give all three — along with the current `SKILL.md` — to an LLM and ask it to propose changes. The LLM can synthesize patterns across failed assertions, reviewer complaints, and transcript behavior that would be tedious to connect manually. When prompting the LLM, include these guidelines:
282
-
283
- * **Generalize from feedback.** The skill will be used across many different prompts, not just the test cases. Fixes should address underlying issues broadly rather than adding narrow patches for specific examples.
284
- * **Keep the skill lean.** Fewer, better instructions often outperform exhaustive rules. If transcripts show wasted work (unnecessary validation, unneeded intermediate outputs), remove those instructions. If pass rates plateau despite adding more rules, the skill may be over-constrained — try removing instructions and see if results hold or improve.
285
- * **Explain the why.** Reasoning-based instructions ("Do X because Y tends to cause Z") work better than rigid directives ("ALWAYS do X, NEVER do Y"). Models follow instructions more reliably when they understand the purpose.
286
- * **Bundle repeated work.** If every test run independently wrote a similar helper script (a chart builder, a data parser), that's a signal to bundle the script into the skill's `scripts/` directory. See [Using scripts](/skill-creation/using-scripts) for how to do this.
287
-
288
- ### The loop
289
-
290
- 1. Give the eval signals and current `SKILL.md` to an LLM and ask it to propose improvements.
291
- 2. Review and apply the changes.
292
- 3. Rerun all test cases in a new `iteration-<N+1>/` directory.
293
- 4. Grade and aggregate the new results.
294
- 5. Review with a human. Repeat.
295
-
296
- Stop when you're satisfied with the results, feedback is consistently empty, or you're no longer seeing meaningful improvement between iterations.
297
-
298
- <Tip>
299
- The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates much of this workflow — running evals, grading assertions, aggregating benchmarks, and presenting results for human review.
300
- </Tip>
@@ -1,196 +0,0 @@
1
- > ## Documentation Index
2
- > Fetch the complete documentation index at: https://agentskills.io/llms.txt
3
- > Use this file to discover all available pages before exploring further.
4
-
5
- # Optimizing skill descriptions
6
-
7
- > How to improve your skill's description so it triggers reliably on relevant prompts.
8
-
9
- A skill only helps if it gets activated. The `description` field in your `SKILL.md` frontmatter is the primary mechanism agents use to decide whether to load a skill for a given task. An under-specified description means the skill won't trigger when it should; an over-broad description means it triggers when it shouldn't.
10
-
11
- This guide covers how to systematically test and improve your skill's description for triggering accuracy.
12
-
13
- ## How skill triggering works
14
-
15
- Agents use [progressive disclosure](/specification#progressive-disclosure) to manage context. At startup, they load only the `name` and `description` of each available skill — just enough to decide when a skill might be relevant. When a user's task matches a description, the agent reads the full `SKILL.md` into context and follows its instructions.
16
-
17
- This means the description carries the entire burden of triggering. If the description doesn't convey when the skill is useful, the agent won't know to reach for it.
18
-
19
- One important nuance: agents typically only consult skills for tasks that require knowledge or capabilities beyond what they can handle alone. A simple, one-step request like "read this PDF" may not trigger a PDF skill even if the description matches perfectly, because the agent can handle it with basic tools. Tasks that involve specialized knowledge — an unfamiliar API, a domain-specific workflow, or an uncommon format — are where a well-written description can make the difference.
20
-
21
- ## Writing effective descriptions
22
-
23
- Before testing, it helps to know what a good description looks like. A few principles:
24
-
25
- * **Use imperative phrasing.** Frame the description as an instruction to the agent: "Use this skill when..." rather than "This skill does..." The agent is deciding whether to act, so tell it when to act.
26
- * **Focus on user intent, not implementation.** Describe what the user is trying to achieve, not the skill's internal mechanics. The agent matches against what the user asked for.
27
- * **Err on the side of being pushy.** Explicitly list contexts where the skill applies, including cases where the user doesn't name the domain directly: "even if they don't explicitly mention 'CSV' or 'analysis.'"
28
- * **Keep it concise.** A few sentences to a short paragraph is usually right — long enough to cover the skill's scope, short enough that it doesn't bloat the agent's context across many skills. The [specification](/specification#description-field) enforces a hard limit of 1024 characters.
29
-
30
- ## Designing trigger eval queries
31
-
32
- To test triggering, you need a set of eval queries — realistic user prompts labeled with whether they should or shouldn't trigger your skill.
33
-
34
- ```json eval_queries.json theme={null}
35
- [
36
-   { "query": "I've got a spreadsheet in ~/data/q4_results.xlsx with revenue in col C and expenses in col D — can you add a profit margin column and highlight anything under 10%?", "should_trigger": true },
37
-   { "query": "whats the quickest way to convert this json file to yaml", "should_trigger": false }
38
- ]
39
- ```
40
-
41
- Aim for about 20 queries: 8-10 that should trigger and 8-10 that shouldn't.
42
-
43
- ### Should-trigger queries
44
-
45
- These test whether the description captures the skill's scope. Vary them along several axes:
46
-
47
- * **Phrasing**: some formal, some casual, some with typos or abbreviations.
48
- * **Explicitness**: some name the skill's domain directly ("analyze this CSV"), others describe the need without naming it ("my boss wants a chart from this data file").
49
- * **Detail**: mix terse prompts with context-heavy ones — a short "analyze my sales CSV and make a chart" alongside a longer message with file paths, column names, and backstory.
50
- * **Complexity**: vary the number of steps and decision points. Include single-step tasks alongside multi-step workflows to test whether the agent can discern the skill is relevant when the task it addresses is buried in a larger chain.
51
-
52
- The most useful should-trigger queries are ones where the skill would help but the connection isn't obvious from the query alone. These are the cases where description wording makes the difference — if the query already asks for exactly what the skill does, any reasonable description would trigger.
53
-
54
- ### Should-not-trigger queries
55
-
56
- The most valuable negative test cases are **near-misses** — queries that share keywords or concepts with your skill but actually need something different. These test whether the description is precise, not just broad.
57
-
58
- For a CSV analysis skill, weak negative examples would be:
59
-
60
- * `"Write a fibonacci function"` — obviously irrelevant, tests nothing.
61
- * `"What's the weather today?"` — no keyword overlap, too easy.
62
-
63
- Strong negative examples:
64
-
65
- * `"I need to update the formulas in my Excel budget spreadsheet"` — shares "spreadsheet" and "data" concepts, but needs Excel editing, not CSV analysis.
66
- * `"can you write a python script that reads a csv and uploads each row to our postgres database"` — involves CSV, but the task is database ETL, not analysis.
67
-
68
- ### Tips for realism
69
-
70
- Real user prompts contain context that generic test queries lack. Include:
71
-
72
- * File paths (`~/Downloads/report_final_v2.xlsx`)
73
- * Personal context (`"my manager asked me to..."`)
74
- * Specific details (column names, company names, data values)
75
- * Casual language, abbreviations, and occasional typos
76
-
77
- ## Testing whether a description triggers
78
-
79
- The basic approach: run each query through your agent with the skill installed and observe whether the agent invokes it. Make sure the skill is registered and discoverable by your agent — how this works varies by client (e.g., a skills directory, a configuration file, or a CLI flag).
80
-
81
- Most agent clients provide some form of observability — execution logs, tool call histories, or verbose output — that lets you see which skills were consulted during a run. Check your client's documentation for details. The skill triggered if the agent loaded your skill's `SKILL.md`; it didn't trigger if the agent proceeded without consulting it.
82
-
83
- A query "passes" if:
84
-
85
- * `should_trigger` is `true` and the skill was invoked, or
86
- * `should_trigger` is `false` and the skill was not invoked.
87
-
88
- ### Running multiple times
89
-
90
- Model behavior is nondeterministic — the same query might trigger the skill on one run but not the next. Run each query multiple times (3 is a reasonable starting point) and compute a **trigger rate**: the fraction of runs where the skill was invoked.
91
-
92
- A should-trigger query passes if its trigger rate is above a threshold (0.5 is a reasonable default). A should-not-trigger query passes if its trigger rate is below that threshold.
93
-
94
- With 20 queries at 3 runs each, that's 60 invocations. You'll want to script this. Here's the general structure — replace the `claude` invocation and detection logic in `check_triggered` with whatever your agent client provides:
95
-
96
- ```bash theme={null}
97
- #!/bin/bash
98
- QUERIES_FILE="${1:?Usage: $0 <queries.json>}"
99
- SKILL_NAME="my-skill"
100
- RUNS=3
101
-
102
- # This example uses Claude Code's JSON output to check for Skill tool calls.
103
- # Replace this function with detection logic for your agent client.
104
- # Should return 0 (success) if the skill was invoked, 1 otherwise.
105
- check_triggered() {
106
-   local query="$1"
107
-   claude -p "$query" --output-format json 2>/dev/null \
108
-     | jq -e --arg skill "$SKILL_NAME" \
109
-       'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == $skill)' \
110
-       > /dev/null 2>&1
111
- }
112
-
113
- count=$(jq length "$QUERIES_FILE")
114
- for i in $(seq 0 $((count - 1))); do
115
-   query=$(jq -r ".[$i].query" "$QUERIES_FILE")
116
-   should_trigger=$(jq -r ".[$i].should_trigger" "$QUERIES_FILE")
117
-   triggers=0
118
-
119
-   for run in $(seq 1 $RUNS); do
120
-     check_triggered "$query" && triggers=$((triggers + 1))
121
-   done
122
-
123
-   jq -n \
124
-     --arg query "$query" \
125
-     --argjson should_trigger "$should_trigger" \
126
-     --argjson triggers "$triggers" \
127
-     --argjson runs "$RUNS" \
128
-     '{query: $query, should_trigger: $should_trigger, triggers: $triggers, runs: $runs, trigger_rate: ($triggers / $runs)}'
129
- done | jq -s '.'
130
- ```
131
-
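The records emitted by a script like the one above still need to be turned into pass/fail results. A minimal Python sketch of the threshold rule follows; the `query_passes` and `pass_rate` names and the sample records are hypothetical. It reads the rule literally, so a trigger rate exactly equal to the threshold passes neither kind of query.

```python
def query_passes(record, threshold=0.5):
    """Apply the threshold rule to one record with should_trigger and trigger_rate."""
    rate = record["trigger_rate"]
    if record["should_trigger"]:
        return rate > threshold   # should trigger: rate must be above the threshold
    return rate < threshold       # should not trigger: rate must be below it


def pass_rate(records, threshold=0.5):
    """Fraction of queries that pass; 0.0 for an empty set."""
    if not records:
        return 0.0
    return sum(query_passes(r, threshold) for r in records) / len(records)


# Made-up illustration records in the shape the script prints (3 runs each).
records = [
    {"query": "analyze my sales CSV and make a chart", "should_trigger": True,
     "triggers": 3, "runs": 3, "trigger_rate": 3 / 3},
    {"query": "my boss wants a chart from this data file", "should_trigger": True,
     "triggers": 1, "runs": 3, "trigger_rate": 1 / 3},
    {"query": "What's the weather today?", "should_trigger": False,
     "triggers": 0, "runs": 3, "trigger_rate": 0 / 3},
    {"query": "I need to update the formulas in my Excel budget spreadsheet",
     "should_trigger": False, "triggers": 2, "runs": 3, "trigger_rate": 2 / 3},
]
failures = [r["query"] for r in records if not query_passes(r)]
print(pass_rate(records), failures)
```

With an odd number of runs and a 0.5 threshold the tie case cannot occur; with an even number of runs, decide up front how a tie should count.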
132
- <Tip>
133
- If your agent client supports it, you can stop a run early once the outcome is clear — the agent either consulted the skill or started working without it. This can significantly reduce the time and cost of running the full eval set.
134
- </Tip>
135
-
136
- ## Avoiding overfitting with train/validation splits
137
-
138
- If you optimize the description against all your queries, you risk overfitting — crafting a description that works for these specific phrasings but fails on new ones.
139
-
140
- The solution is to split your query set:
141
-
142
- * **Train set (\~60%)**: the queries you use to identify failures and guide improvements.
143
- * **Validation set (\~40%)**: queries you set aside and only use to check whether improvements generalize.
144
-
145
- Make sure both sets contain a proportional mix of should-trigger and should-not-trigger queries — don't accidentally put all the positives in one set. Shuffle randomly and keep the split fixed across iterations so you're comparing apples to apples.
146
-
147
- If you're using a script like the one [above](#running-multiple-times), you can split your queries into two files — `train_queries.json` and `validation_queries.json` — and run the script against each one separately.
148
-
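A proportional split with a fixed shuffle can be sketched in a few lines of Python. The `split_queries` name, the default seed, and the generated placeholder queries are assumptions for illustration; the 60/40 ratio and the file names come from the text above.

```python
import json
import random


def split_queries(queries, train_fraction=0.6, seed=0):
    """Split queries into train/validation, keeping the label mix proportional."""
    rng = random.Random(seed)  # fixed seed keeps the split stable across iterations
    train, validation = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation


# Placeholder queries: 10 that should trigger and 10 that should not.
queries = [{"query": f"placeholder query {i}", "should_trigger": i < 10}
           for i in range(20)]
train, validation = split_queries(queries)
print(len(train), len(validation))

# To persist the split for the eval script, write each list to its own file:
# with open("train_queries.json", "w") as f:
#     json.dump(train, f, indent=2)
# with open("validation_queries.json", "w") as f:
#     json.dump(validation, f, indent=2)
```

Splitting each label group separately is what keeps the positives and negatives proportional in both sets.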
149
- ## The optimization loop
150
-
151
- 1. **Evaluate** the current description on both *train and validation sets*. The train results guide your changes; the validation results tell you whether those changes are generalizing.
152
- 2. **Identify failures** in the *train set*: which should-trigger queries didn't trigger? Which should-not-trigger queries did?
153
-    * Only use train set failures to guide your changes — whether you're revising the description yourself or prompting an LLM, keep validation set results out of the process.
154
- 3. **Revise the description.** Focus on generalizing:
155
-    * If should-trigger queries are failing, the description may be too narrow. Broaden the scope or add context about when the skill is useful.
156
-    * If should-not-trigger queries are false-triggering, the description may be too broad. Add specificity about what the skill does *not* do, or clarify the boundary between this skill and adjacent capabilities.
157
-    * Avoid adding specific keywords from failed queries — that's overfitting. Instead, find the general category or concept those queries represent and address that.
158
-    * If you're stuck after several iterations, try a structurally different approach to the description rather than incremental tweaks. A different framing or sentence structure may break through where refinement can't.
159
-    * Check that the description stays under the 1024-character limit — descriptions tend to grow during optimization.
160
- 4. **Repeat** steps 1-3 until all *train set* queries pass or you stop seeing meaningful improvement.
161
- 5. **Select the best iteration** by its validation pass rate — the fraction of queries in the *validation set* that passed. Note that the best description may not be the last one you produced; an earlier iteration might have a higher validation pass rate than later ones that overfit to the train set.
162
-
163
- Five iterations is usually enough. If performance isn't improving, the issue may be with the queries (too easy, too hard, or poorly labeled) rather than the description.
164
-
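Step 5 of the loop, together with the length check, can be sketched as a small selection function in Python. The record layout, the `select_best` name, and the sample pass rates are made up for illustration. The sketch also assumes a description of exactly 1024 characters is allowed.

```python
MAX_DESCRIPTION_CHARS = 1024  # limit quoted in this guide; inclusive bound is an assumption


def select_best(iterations):
    """Pick the iteration with the highest validation pass rate.

    Descriptions over the character limit are skipped. On ties, max() keeps
    the first match, so the earliest iteration wins.
    """
    eligible = [it for it in iterations
                if len(it["description"]) <= MAX_DESCRIPTION_CHARS]
    if not eligible:
        raise ValueError("no description fits within the character limit")
    return max(eligible, key=lambda it: it["validation_pass_rate"])


# Made-up illustration history: train keeps improving, validation peaks early.
history = [
    {"iteration": 1, "description": "Process CSV files.",
     "train_pass_rate": 0.58, "validation_pass_rate": 0.625},
    {"iteration": 2, "description": "Use this skill when the user has tabular data...",
     "train_pass_rate": 0.83, "validation_pass_rate": 0.875},
    {"iteration": 3, "description": "Use this skill when the user mentions sales CSVs...",
     "train_pass_rate": 1.0, "validation_pass_rate": 0.75},
]
best = select_best(history)
print(best["iteration"], best["validation_pass_rate"])
```

In this made-up history the last iteration has a perfect train pass rate but a lower validation pass rate, which is the overfitting pattern described in step 5.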
165
- <Tip>
166
- The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates this loop end-to-end: it splits the eval set, evaluates trigger rates in parallel, proposes description improvements using Claude, and generates a live HTML report you can watch as it runs.
167
- </Tip>
168
-
169
- ## Applying the result
170
-
171
- Once you've selected the best description:
172
-
173
- 1. Update the `description` field in your `SKILL.md` frontmatter.
174
- 2. Verify the description is under the [1024-character limit](/specification#description-field).
175
- 3. Verify the description triggers as expected. Try a few prompts manually as a quick sanity check. For a more rigorous test, write 5-10 fresh queries (a mix of should-trigger and should-not-trigger) and run them through the eval script — since these queries were never part of the optimization process, they give you an honest check on whether the description generalizes.
176
-
177
- Before and after:
178
-
179
- ```yaml theme={null}
180
- # Before
181
- description: Process CSV files.
182
-
183
- # After
184
- description: >
185
-   Analyze CSV and tabular data files — compute summary statistics,
186
-   add derived columns, generate charts, and clean messy data. Use this
187
-   skill when the user has a CSV, TSV, or Excel file and wants to
188
-   explore, transform, or visualize the data, even if they don't
189
-   explicitly mention "CSV" or "analysis."
190
- ```
191
-
192
- The improved description is more specific about what the skill does (summary stats, derived columns, charts, cleaning) and broader about when it applies (CSV, TSV, Excel; even without explicit keywords).
193
-
194
- ## Next steps
195
-
196
- Once your skill triggers reliably, you'll want to evaluate whether it produces good outputs. See [Evaluating skill output quality](/skill-creation/evaluating-skills) for how to set up test cases, grade results, and iterate.