@buaa_smat/hometrans 0.1.6 → 0.1.7
- package/README.md +113 -53
- package/package.json +1 -1
- package/skills/hmos-incremental-ui-align/SKILL.md +19 -17
- package/skills/skill-quality-evaluator/SKILL.md +138 -0
- package/skills/skill-quality-evaluator/assets/SKILL_TEMPLATE.md +77 -0
- package/skills/skill-quality-evaluator/references/Best-practices-for-skill-creators.md +277 -0
- package/skills/skill-quality-evaluator/references/Evaluating-skill-output-quality.md +300 -0
- package/skills/skill-quality-evaluator/references/Optimizing-skill-descriptions.md +196 -0
- package/skills/skill-quality-evaluator/references/Specification.md +272 -0
- package/skills/skill-quality-evaluator/references/Using-scripts-in-skills.md +308 -0
- package/skills/skill-quality-evaluator/references/report-template.md +163 -0
- package/skills/skill-quality-evaluator/references/scoring-rubric.md +269 -0
|
@@ -0,0 +1,277 @@
|
|
|
1
|
+
> ## Documentation Index
|
|
2
|
+
> Fetch the complete documentation index at: https://agentskills.io/llms.txt
|
|
3
|
+
> Use this file to discover all available pages before exploring further.
|
|
4
|
+
|
|
5
|
+
# Best practices for skill creators
|
|
6
|
+
|
|
7
|
+
> How to write skills that are well-scoped and calibrated to the task.
|
|
8
|
+
|
|
9
|
+
## Start from real expertise
|
|
10
|
+
|
|
11
|
+
A common pitfall in skill creation is asking an LLM to generate a skill without providing domain-specific context — relying solely on the LLM's general training knowledge. The result is vague, generic procedures ("handle errors appropriately," "follow best practices for authentication") rather than the specific API patterns, edge cases, and project conventions that make a skill valuable.
|
|
12
|
+
|
|
13
|
+
Effective skills are grounded in real expertise. The key is feeding domain-specific context into the creation process.
|
|
14
|
+
|
|
15
|
+
### Extract from a hands-on task
|
|
16
|
+
|
|
17
|
+
Complete a real task in conversation with an agent, providing context, corrections, and preferences along the way. Then extract the reusable pattern into a skill. Pay attention to:
|
|
18
|
+
|
|
19
|
+
* **Steps that worked** — the sequence of actions that led to success
|
|
20
|
+
* **Corrections you made** — places where you steered the agent's approach (e.g., "use library X instead of Y," "check for edge case Z")
|
|
21
|
+
* **Input/output formats** — what the data looked like going in and coming out
|
|
22
|
+
* **Context you provided** — project-specific facts, conventions, or constraints the agent didn't already know
|
|
23
|
+
|
|
24
|
+
### Synthesize from existing project artifacts
|
|
25
|
+
|
|
26
|
+
When you have a body of existing knowledge, you can feed it into an LLM and ask it to synthesize a skill. A data-pipeline skill synthesized from your team's actual incident reports and runbooks will outperform one synthesized from a generic "data engineering best practices" article, because it captures *your* schemas, failure modes, and recovery procedures. The key is project-specific material, not generic references.
|
|
27
|
+
|
|
28
|
+
Good source material includes:
|
|
29
|
+
|
|
30
|
+
* Internal documentation, runbooks, and style guides
|
|
31
|
+
* API specifications, schemas, and configuration files
|
|
32
|
+
* Code review comments and issue trackers (captures recurring concerns and reviewer expectations)
|
|
33
|
+
* Version control history, especially patches and fixes (reveals patterns through what actually changed)
|
|
34
|
+
* Real-world failure cases and their resolutions
|
|
35
|
+
|
|
36
|
+
## Refine with real execution
|
|
37
|
+
|
|
38
|
+
The first draft of a skill usually needs refinement. Run the skill against real tasks, then feed the results — all of them, not just failures — back into the creation process. Ask: what triggered false positives? What was missed? What could be cut?
|
|
39
|
+
|
|
40
|
+
Even a single pass of execute-then-revise noticeably improves quality, and complex domains often benefit from several.
|
|
41
|
+
|
|
42
|
+
<Tip>
|
|
43
|
+
Read agent execution traces, not just final outputs. If the agent wastes time on unproductive steps, common causes include instructions that are too vague (the agent tries several approaches before finding one that works), instructions that don't apply to the current task (the agent follows them anyway), or too many options presented without a clear default.
|
|
44
|
+
</Tip>
|
|
45
|
+
|
|
46
|
+
For a more structured approach to iteration, including test cases, assertions, and grading, see [Evaluating skill output quality](/skill-creation/evaluating-skills).
|
|
47
|
+
|
|
48
|
+
## Spending context wisely
|
|
49
|
+
|
|
50
|
+
Once a skill activates, its full `SKILL.md` body loads into the agent's context window alongside conversation history, system context, and other active skills. Every token in your skill competes for the agent's attention with everything else in that window.
|
|
51
|
+
|
|
52
|
+
### Add what the agent lacks, omit what it knows
|
|
53
|
+
|
|
54
|
+
Focus on what the agent *wouldn't* know without your skill: project-specific conventions, domain-specific procedures, non-obvious edge cases, and the particular tools or APIs to use. You don't need to explain what a PDF is, how HTTP works, or what a database migration does.
|
|
55
|
+
|
|
56
|
+
````markdown theme={null}
|
|
57
|
+
<!-- Too verbose — the agent already knows what PDFs are -->
|
|
58
|
+
## Extract PDF text
|
|
59
|
+
|
|
60
|
+
PDF (Portable Document Format) files are a common file format that contains
|
|
61
|
+
text, images, and other content. To extract text from a PDF, you'll need to
|
|
62
|
+
use a library. pdfplumber is recommended because it handles most cases well.
|
|
63
|
+
|
|
64
|
+
<!-- Better — jumps straight to what the agent wouldn't know on its own -->
|
|
65
|
+
## Extract PDF text
|
|
66
|
+
|
|
67
|
+
Use pdfplumber for text extraction. For scanned documents, fall back to
|
|
68
|
+
pdf2image with pytesseract.
|
|
69
|
+
|
|
70
|
+
```python
|
|
71
|
+
import pdfplumber
|
|
72
|
+
|
|
73
|
+
with pdfplumber.open("file.pdf") as pdf:
|
|
74
|
+
    text = pdf.pages[0].extract_text()
|
|
75
|
+
```
|
|
76
|
+
````
|
|
77
|
+
|
|
78
|
+
Ask yourself about each piece of content: "Would the agent get this wrong without this instruction?" If the answer is no, cut it. If you're unsure, test it. And if the agent already handles the entire task well without the skill, the skill may not be adding value. See [Evaluating skill output quality](/skill-creation/evaluating-skills) for how to test this systematically.
|
|
79
|
+
|
|
80
|
+
### Design coherent units
|
|
81
|
+
|
|
82
|
+
Deciding what a skill should cover is like deciding what a function should do: you want it to encapsulate a coherent unit of work that composes well with other skills. Skills scoped too narrowly force multiple skills to load for a single task, risking overhead and conflicting instructions. Skills scoped too broadly become hard to activate precisely. A skill for querying a database and formatting the results may be one coherent unit, while a skill that also covers database administration is probably trying to do too much.
|
|
83
|
+
|
|
84
|
+
### Aim for moderate detail
|
|
85
|
+
|
|
86
|
+
Overly comprehensive skills can hurt more than they help — the agent struggles to extract what's relevant and may pursue unproductive paths triggered by instructions that don't apply to the current task. Concise, stepwise guidance with a working example tends to outperform exhaustive documentation. When you find yourself covering every edge case, consider whether most are better handled by the agent's own judgment.
|
|
87
|
+
|
|
88
|
+
### Structure large skills with progressive disclosure
|
|
89
|
+
|
|
90
|
+
The [specification](/specification#progressive-disclosure) recommends keeping `SKILL.md` under 500 lines and 5,000 tokens — just the core instructions the agent needs on every run. When a skill legitimately needs more content, move detailed reference material to separate files in `references/` or similar directories.
|
|
91
|
+
|
|
92
|
+
The key is telling the agent *when* to load each file. "Read `references/api-errors.md` if the API returns a non-200 status code" is more useful than a generic "see references/ for details." This lets the agent load context on demand rather than up front, which is how [progressive disclosure](/specification#progressive-disclosure) is designed to work.
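A quick size check can flag a `SKILL.md` that has outgrown the budget quoted above. The following is a minimal sketch, not part of the specification: the function names are hypothetical, and the token count is only a rough estimate that assumes about four characters per token rather than using a real tokenizer.

```python
from pathlib import Path

MAX_LINES = 500       # line budget quoted in the text above
MAX_TOKENS = 5_000    # token budget quoted in the text above
CHARS_PER_TOKEN = 4   # assumption: rough heuristic, not a real tokenizer


def check_skill_size(text: str) -> list[str]:
    """Return warnings for a SKILL.md body; an empty list means it fits."""
    warnings = []
    line_count = len(text.splitlines())
    approx_tokens = len(text) // CHARS_PER_TOKEN
    if line_count >= MAX_LINES:
        warnings.append(f"{line_count} lines (budget: under {MAX_LINES})")
    if approx_tokens >= MAX_TOKENS:
        warnings.append(f"~{approx_tokens} tokens (budget: under {MAX_TOKENS})")
    return warnings


def check_skill_file(path: str) -> list[str]:
    """Read a SKILL.md file from disk and run the same check."""
    return check_skill_size(Path(path).read_text(encoding="utf-8"))
```

If the check reports a warning, that is a prompt to move reference material into `references/` and tell the agent when to load it, not a hard failure.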
|
|
93
|
+
|
|
94
|
+
## Calibrating control
|
|
95
|
+
|
|
96
|
+
Not every part of a skill needs the same level of prescriptiveness. Match the specificity of your instructions to the fragility of the task.
|
|
97
|
+
|
|
98
|
+
### Match specificity to fragility
|
|
99
|
+
|
|
100
|
+
**Give the agent freedom** when multiple approaches are valid and the task tolerates variation. For flexible instructions, explaining *why* can be more effective than rigid directives — an agent that understands the purpose behind an instruction makes better context-dependent decisions. A code review skill can describe what to look for without prescribing exact steps:
|
|
101
|
+
|
|
102
|
+
```markdown theme={null}
|
|
103
|
+
## Code review process
|
|
104
|
+
|
|
105
|
+
1. Check all database queries for SQL injection (use parameterized queries)
|
|
106
|
+
2. Verify authentication checks on every endpoint
|
|
107
|
+
3. Look for race conditions in concurrent code paths
|
|
108
|
+
4. Confirm error messages don't leak internal details
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
**Be prescriptive** when operations are fragile, consistency matters, or a specific sequence must be followed:
|
|
112
|
+
|
|
113
|
+
````markdown theme={null}
|
|
114
|
+
## Database migration
|
|
115
|
+
|
|
116
|
+
Run exactly this sequence:
|
|
117
|
+
|
|
118
|
+
```bash
|
|
119
|
+
python scripts/migrate.py --verify --backup
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
Do not modify the command or add additional flags.
|
|
123
|
+
````
|
|
124
|
+
|
|
125
|
+
Most skills have a mix. Calibrate each part independently.
|
|
126
|
+
|
|
127
|
+
### Provide defaults, not menus
|
|
128
|
+
|
|
129
|
+
When multiple tools or approaches could work, pick a default and mention alternatives briefly rather than presenting them as equal options.
|
|
130
|
+
|
|
131
|
+
````markdown theme={null}
|
|
132
|
+
<!-- Too many options -->
|
|
133
|
+
You can use pypdf, pdfplumber, PyMuPDF, or pdf2image...
|
|
134
|
+
|
|
135
|
+
<!-- Clear default with escape hatch -->
|
|
136
|
+
Use pdfplumber for text extraction:
|
|
137
|
+
|
|
138
|
+
```python
|
|
139
|
+
import pdfplumber
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
For scanned PDFs requiring OCR, use pdf2image with pytesseract instead.
|
|
143
|
+
````
|
|
144
|
+
|
|
145
|
+
### Favor procedures over declarations
|
|
146
|
+
|
|
147
|
+
A skill should teach the agent *how to approach* a class of problems, not *what to produce* for a specific instance. Compare:
|
|
148
|
+
|
|
149
|
+
```markdown theme={null}
|
|
150
|
+
<!-- Specific answer — only useful for this exact task -->
|
|
151
|
+
Join the `orders` table to `customers` on `customer_id`, filter where
|
|
152
|
+
`region = 'EMEA'`, and sum the `amount` column.
|
|
153
|
+
|
|
154
|
+
<!-- Reusable method — works for any analytical query -->
|
|
155
|
+
1. Read the schema from `references/schema.yaml` to find relevant tables
|
|
156
|
+
2. Join tables using the `_id` foreign key convention
|
|
157
|
+
3. Apply any filters from the user's request as WHERE clauses
|
|
158
|
+
4. Aggregate numeric columns as needed and format as a markdown table
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
This doesn't mean skills can't include specific details — output format templates (see [Templates for output format](#templates-for-output-format)), constraints like "never output PII," and tool-specific instructions are all valuable. The point is that the *approach* should generalize even when individual details are specific.
|
|
162
|
+
|
|
163
|
+
## Patterns for effective instructions
|
|
164
|
+
|
|
165
|
+
These are reusable techniques for structuring skill content. Not every skill needs all of them — use the ones that fit your task.
|
|
166
|
+
|
|
167
|
+
### Gotchas sections
|
|
168
|
+
|
|
169
|
+
The highest-value content in many skills is a list of gotchas — environment-specific facts that defy reasonable assumptions. These aren't general advice ("handle errors appropriately") but concrete corrections to mistakes the agent will make without being told otherwise:
|
|
170
|
+
|
|
171
|
+
```markdown theme={null}
|
|
172
|
+
## Gotchas
|
|
173
|
+
|
|
174
|
+
- The `users` table uses soft deletes. Queries must include
|
|
175
|
+
`WHERE deleted_at IS NULL` or results will include deactivated accounts.
|
|
176
|
+
- The user ID is `user_id` in the database, `uid` in the auth service,
|
|
177
|
+
and `accountId` in the billing API. All three refer to the same value.
|
|
178
|
+
- The `/health` endpoint returns 200 as long as the web server is running,
|
|
179
|
+
even if the database connection is down. Use `/ready` to check full
|
|
180
|
+
service health.
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
Keep gotchas in `SKILL.md` where the agent reads them before encountering the situation. A separate reference file works if you tell the agent when to load it, but for non-obvious issues, the agent may not recognize the trigger.
|
|
184
|
+
|
|
185
|
+
<Tip>
|
|
186
|
+
When an agent makes a mistake you have to correct, add the correction to the gotchas section. This is one of the most direct ways to improve a skill iteratively (see [Refine with real execution](#refine-with-real-execution)).
|
|
187
|
+
</Tip>
|
|
188
|
+
|
|
189
|
+
### Templates for output format
|
|
190
|
+
|
|
191
|
+
When you need the agent to produce output in a specific format, provide a template. This is more reliable than describing the format in prose, because agents pattern-match well against concrete structures. Short templates can live inline in `SKILL.md`; for longer templates, or templates only needed in certain cases, store them in `assets/` and reference them from `SKILL.md` so they only load when needed.
|
|
192
|
+
|
|
193
|
+
````markdown theme={null}
|
|
194
|
+
## Report structure
|
|
195
|
+
|
|
196
|
+
Use this template, adapting sections as needed for the specific analysis:
|
|
197
|
+
|
|
198
|
+
```markdown
|
|
199
|
+
# [Analysis Title]
|
|
200
|
+
|
|
201
|
+
## Executive summary
|
|
202
|
+
[One-paragraph overview of key findings]
|
|
203
|
+
|
|
204
|
+
## Key findings
|
|
205
|
+
- Finding 1 with supporting data
|
|
206
|
+
- Finding 2 with supporting data
|
|
207
|
+
|
|
208
|
+
## Recommendations
|
|
209
|
+
1. Specific actionable recommendation
|
|
210
|
+
2. Specific actionable recommendation
|
|
211
|
+
```
|
|
212
|
+
````
|
|
213
|
+
|
|
214
|
+
### Checklists for multi-step workflows
|
|
215
|
+
|
|
216
|
+
An explicit checklist helps the agent track progress and avoid skipping steps, especially when steps have dependencies or validation gates.
|
|
217
|
+
|
|
218
|
+
```markdown theme={null}
|
|
219
|
+
## Form processing workflow
|
|
220
|
+
|
|
221
|
+
Progress:
|
|
222
|
+
- [ ] Step 1: Analyze the form (run `scripts/analyze_form.py`)
|
|
223
|
+
- [ ] Step 2: Create field mapping (edit `fields.json`)
|
|
224
|
+
- [ ] Step 3: Validate mapping (run `scripts/validate_fields.py`)
|
|
225
|
+
- [ ] Step 4: Fill the form (run `scripts/fill_form.py`)
|
|
226
|
+
- [ ] Step 5: Verify output (run `scripts/verify_output.py`)
|
|
227
|
+
```
|
|
228
|
+
|
|
229
|
+
### Validation loops
|
|
230
|
+
|
|
231
|
+
Instruct the agent to validate its own work before moving on. The pattern is: do the work, run a validator (a script, a reference checklist, or a self-check), fix any issues, and repeat until validation passes.
|
|
232
|
+
|
|
233
|
+
```markdown theme={null}
|
|
234
|
+
## Editing workflow
|
|
235
|
+
|
|
236
|
+
1. Make your edits
|
|
237
|
+
2. Run validation: `python scripts/validate.py output/`
|
|
238
|
+
3. If validation fails:
|
|
239
|
+
   - Review the error message
|
|
240
|
+
   - Fix the issues
|
|
241
|
+
   - Run validation again
|
|
242
|
+
4. Only proceed when validation passes
|
|
243
|
+
```
|
|
244
|
+
|
|
245
|
+
A reference document can also serve as the "validator" — instruct the agent to check its work against the reference before finalizing.
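The loop itself is simple enough to sketch in a few lines. This is an illustration of the do, validate, fix, repeat pattern only: `validate` and `fix` are hypothetical callables you would supply, and the `max_attempts` cap is an assumption added here so the loop cannot run forever.

```python
from typing import Callable


def validate_until_clean(
    validate: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    draft: str,
    max_attempts: int = 3,  # assumption: a cap so a stuck loop terminates
) -> tuple[str, list[str]]:
    """Re-run the validator after each fix until it reports no errors."""
    errors = validate(draft)
    attempts = 0
    while errors and attempts < max_attempts:
        draft = fix(draft, errors)   # fix the reported issues
        errors = validate(draft)     # run validation again
        attempts += 1
    # An empty error list means validation passed and work can proceed.
    return draft, errors
```

The remaining errors are returned rather than raised, so the caller can decide whether to stop or escalate when the attempts run out.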
|
|
246
|
+
|
|
247
|
+
### Plan-validate-execute
|
|
248
|
+
|
|
249
|
+
For batch or destructive operations, have the agent create an intermediate plan in a structured format, validate it against a source of truth, and only then execute.
|
|
250
|
+
|
|
251
|
+
```markdown theme={null}
|
|
252
|
+
## PDF form filling
|
|
253
|
+
|
|
254
|
+
1. Extract form fields: `python scripts/analyze_form.py input.pdf` → `form_fields.json`
|
|
255
|
+
(lists every field name, type, and whether it's required)
|
|
256
|
+
2. Create `field_values.json` mapping each field name to its intended value
|
|
257
|
+
3. Validate: `python scripts/validate_fields.py form_fields.json field_values.json`
|
|
258
|
+
(checks that every field name exists in the form, types are compatible, and
|
|
259
|
+
required fields aren't missing)
|
|
260
|
+
4. If validation fails, revise `field_values.json` and re-validate
|
|
261
|
+
5. Fill the form: `python scripts/fill_form.py input.pdf field_values.json output.pdf`
|
|
262
|
+
```
|
|
263
|
+
|
|
264
|
+
The key ingredient is step 3: a validation script that checks the plan (`field_values.json`) against the source of truth (`form_fields.json`). Errors like "Field 'signature\_date' not found — available fields: customer\_name, order\_total, signature\_date\_signed" give the agent enough information to self-correct.
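To make the validation step concrete, here is a minimal sketch of what such a checker might look like. The file layouts are assumptions: `form_fields` is taken to be a list of objects with `name` and `required` keys, and `field_values` a mapping from field name to value. Type-compatibility checks are left out to keep the sketch short.

```python
def validate_plan(form_fields: list[dict], field_values: dict) -> list[str]:
    """Check the plan (field_values) against the source of truth (form_fields)."""
    known = {field["name"]: field for field in form_fields}
    errors = []
    # Every name in the plan must exist in the form.
    for name in field_values:
        if name not in known:
            available = ", ".join(sorted(known))
            errors.append(f"Field '{name}' not found; available fields: {available}")
    # Every required form field must have a value in the plan.
    for name, spec in known.items():
        if spec.get("required") and name not in field_values:
            errors.append(f"Required field '{name}' has no value")
    return errors


if __name__ == "__main__":
    form = [
        {"name": "customer_name", "required": True},
        {"name": "order_total", "required": False},
        {"name": "signature_date_signed", "required": True},
    ]
    plan = {"customer_name": "Ada", "signature_date": "2025-01-01"}
    for message in validate_plan(form, plan):
        print(message)
```

Listing the available field names in the error is the design choice that matters: the agent can see the near-miss name and correct the plan without re-reading the form.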
|
|
265
|
+
|
|
266
|
+
### Bundling reusable scripts
|
|
267
|
+
|
|
268
|
+
When [iterating on a skill](/skill-creation/evaluating-skills), compare the agent's execution traces across test cases. If you notice the agent independently reinventing the same logic each run — building charts, parsing a specific format, validating output — that's a signal to write a tested script once and bundle it in `scripts/`.
|
|
269
|
+
|
|
270
|
+
For more on designing and bundling scripts, see [Using scripts in skills](/skill-creation/using-scripts).
|
|
271
|
+
|
|
272
|
+
## Next steps
|
|
273
|
+
|
|
274
|
+
Once you have a working skill, two guides can help you refine it further:
|
|
275
|
+
|
|
276
|
+
* **[Evaluating skill output quality](/skill-creation/evaluating-skills)** — Set up test cases, grade results, and iterate systematically.
|
|
277
|
+
* **[Optimizing skill descriptions](/skill-creation/optimizing-descriptions)** — Test and improve your skill's `description` field so it triggers on the right prompts.
|
|
@@ -0,0 +1,300 @@
|
|
|
1
|
+
> ## Documentation Index
|
|
2
|
+
> Fetch the complete documentation index at: https://agentskills.io/llms.txt
|
|
3
|
+
> Use this file to discover all available pages before exploring further.
|
|
4
|
+
|
|
5
|
+
# Evaluating skill output quality
|
|
6
|
+
|
|
7
|
+
> How to test whether your skill produces good outputs using eval-driven iteration.
|
|
8
|
+
|
|
9
|
+
You wrote a skill, tried it on a prompt, and it seemed to work. But does it work reliably — across varied prompts, in edge cases, better than no skill at all? Running structured evaluations (evals) answers these questions and gives you a feedback loop for improving the skill systematically.
|
|
10
|
+
|
|
11
|
+
## Designing test cases
|
|
12
|
+
|
|
13
|
+
A test case has three parts:
|
|
14
|
+
|
|
15
|
+
* **Prompt**: a realistic user message — the kind of thing someone would actually type.
|
|
16
|
+
* **Expected output**: a human-readable description of what success looks like.
|
|
17
|
+
* **Input files** (optional): files the skill needs to work with.
|
|
18
|
+
|
|
19
|
+
Store test cases in `evals/evals.json` inside your skill directory:
|
|
20
|
+
|
|
21
|
+
```json evals/evals.json theme={null}
|
|
22
|
+
{
|
|
23
|
+
"skill_name": "csv-analyzer",
|
|
24
|
+
"evals": [
|
|
25
|
+
{
|
|
26
|
+
"id": 1,
|
|
27
|
+
"prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
|
|
28
|
+
"expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
|
|
29
|
+
"files": ["evals/files/sales_2025.csv"]
|
|
30
|
+
},
|
|
31
|
+
{
|
|
32
|
+
"id": 2,
|
|
33
|
+
"prompt": "there's a csv in my downloads called customers.csv, some rows have missing emails — can you clean it up and tell me how many were missing?",
|
|
34
|
+
"expected_output": "A cleaned CSV with missing emails handled, plus a count of how many were missing.",
|
|
35
|
+
"files": ["evals/files/customers.csv"]
|
|
36
|
+
}
|
|
37
|
+
]
|
|
38
|
+
}
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
**Tips for writing good test prompts:**
|
|
42
|
+
|
|
43
|
+
* **Start with 2-3 test cases.** Don't over-invest before you've seen your first round of results. You can expand the set later.
|
|
44
|
+
* **Vary the prompts.** Use different phrasings, levels of detail, and formality. Some prompts should be casual ("hey can you clean up this csv"), others precise ("Parse the CSV at data/input.csv, drop rows where column B is null, and write the result to data/output.csv").
|
|
45
|
+
* **Cover edge cases.** Include at least one prompt that tests a boundary condition — a malformed input, an unusual request, or a case where the skill's instructions might be ambiguous.
|
|
46
|
+
* **Use realistic context.** Real users mention file paths, column names, and personal context. Prompts like "process this data" are too vague to test anything useful.
|
|
47
|
+
|
|
48
|
+
Don't worry about defining specific pass/fail checks yet — just the prompts and expected outputs. You'll add detailed checks (called assertions) after you see what the first run produces.
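Before the first run, it can help to sanity-check the file's shape. The sketch below assumes the `evals/evals.json` layout shown above (`skill_name`, plus an `evals` list whose entries carry `id`, `prompt`, and `expected_output`); the function names and the specific checks are illustrative, not a published schema.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("id", "prompt", "expected_output")


def check_evals(data: dict) -> list[str]:
    """Return a list of structural problems; empty means the file looks usable."""
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    cases = data.get("evals", [])
    if not cases:
        problems.append("no test cases defined")
    seen_ids = set()
    for index, case in enumerate(cases):
        for key in REQUIRED_KEYS:
            if key not in case:
                problems.append(f"eval #{index}: missing '{key}'")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"eval #{index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems


def check_evals_file(path: str) -> list[str]:
    """Load evals.json from disk and run the same checks."""
    return check_evals(json.loads(Path(path).read_text(encoding="utf-8")))
```

The `files` and `assertions` keys are deliberately not required here, since input files are optional and assertions are added after the first run.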
|
|
49
|
+
|
|
50
|
+
## Running evals
|
|
51
|
+
|
|
52
|
+
The core pattern is to run each test case twice: once **with the skill** and once **without it** (or with a previous version). This gives you a baseline to compare against.
|
|
53
|
+
|
|
54
|
+
### Workspace structure
|
|
55
|
+
|
|
56
|
+
Organize eval results in a workspace directory alongside your skill directory. Each pass through the full eval loop gets its own `iteration-N/` directory. Within that, each test case gets an eval directory with `with_skill/` and `without_skill/` subdirectories:
|
|
57
|
+
|
|
58
|
+
```
|
|
59
|
+
csv-analyzer/
|
|
60
|
+
├── SKILL.md
|
|
61
|
+
└── evals/
|
|
62
|
+
    └── evals.json
|
|
63
|
+
csv-analyzer-workspace/
|
|
64
|
+
└── iteration-1/
|
|
65
|
+
    ├── eval-top-months-chart/
|
|
66
|
+
    │   ├── with_skill/
|
|
67
|
+
    │   │   ├── outputs/       # Files produced by the run
|
|
68
|
+
    │   │   ├── timing.json    # Tokens and duration
|
|
69
|
+
    │   │   └── grading.json   # Assertion results
|
|
70
|
+
    │   └── without_skill/
|
|
71
|
+
    │       ├── outputs/
|
|
72
|
+
    │       ├── timing.json
|
|
73
|
+
    │       └── grading.json
|
|
74
|
+
    ├── eval-clean-missing-emails/
|
|
75
|
+
    │   ├── with_skill/
|
|
76
|
+
    │   │   ├── outputs/
|
|
77
|
+
    │   │   ├── timing.json
|
|
78
|
+
    │   │   └── grading.json
|
|
79
|
+
    │   └── without_skill/
|
|
80
|
+
    │       ├── outputs/
|
|
81
|
+
    │       ├── timing.json
|
|
82
|
+
    │       └── grading.json
|
|
83
|
+
    └── benchmark.json         # Aggregated statistics
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
The main file you author by hand is `evals/evals.json`. The other JSON files (`grading.json`, `timing.json`, `benchmark.json`) are produced during the eval process — by the agent, by scripts, or by you.
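Creating the directory skeleton by hand gets tedious once you have more than a couple of test cases. The helper below is a hypothetical sketch that follows the layout described above; the function name and its parameters are assumptions, and only the directory names come from the text.

```python
import tempfile
from pathlib import Path


def make_run_dirs(workspace, iteration, eval_names,
                  configs=("with_skill", "without_skill")):
    """Create <workspace>/iteration-N/eval-<name>/<config>/outputs/ for each run."""
    created = []
    iteration_dir = Path(workspace) / f"iteration-{iteration}"
    for name in eval_names:
        for config in configs:
            outputs = iteration_dir / f"eval-{name}" / config / "outputs"
            outputs.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(outputs)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        names = ["top-months-chart", "clean-missing-emails"]
        for path in make_run_dirs(tmp, 1, names):
            print(path.relative_to(tmp))
```

When comparing against a previous skill version, passing `configs=("with_skill", "old_skill")` would produce the matching layout.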
|
|
87
|
+
|
|
88
|
+
### Spawning runs
|
|
89
|
+
|
|
90
|
+
Each eval run should start with a clean context — no leftover state from previous runs or from the skill development process. This ensures the agent follows only what the `SKILL.md` tells it. In environments that support subagents (Claude Code, for example), this isolation comes naturally: each child task starts fresh. Without subagents, use a separate session for each run.
|
|
91
|
+
|
|
92
|
+
For each run, provide:
|
|
93
|
+
|
|
94
|
+
* The skill path (or no skill for the baseline)
|
|
95
|
+
* The test prompt
|
|
96
|
+
* Any input files
|
|
97
|
+
* The output directory
|
|
98
|
+
|
|
99
|
+
Here's an example of the instructions you'd give the agent for a single with-skill run:
|
|
100
|
+
|
|
101
|
+
```
|
|
102
|
+
Execute this task:
|
|
103
|
+
- Skill path: /path/to/csv-analyzer
|
|
104
|
+
- Task: I have a CSV of monthly sales data in data/sales_2025.csv.
|
|
105
|
+
Can you find the top 3 months by revenue and make a bar chart?
|
|
106
|
+
- Input files: evals/files/sales_2025.csv
|
|
107
|
+
- Save outputs to: csv-analyzer-workspace/iteration-1/eval-top-months-chart/with_skill/outputs/
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
For the baseline, use the same prompt but without the skill path, saving to `without_skill/outputs/`.
|
|
111
|
+
|
|
112
|
+
When improving an existing skill, use the previous version as your baseline. Snapshot it before editing (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline run at the snapshot, and save to `old_skill/outputs/` instead of `without_skill/outputs/`.
|
|
113
|
+
|
|
114
|
+
### Capturing timing data
|
|
115
|
+
|
|
116
|
+
Timing data lets you compare how much time and how many tokens the skill costs relative to the baseline. A skill that dramatically improves output quality but triples token usage is a different trade-off than one that is both better and cheaper. When each run completes, record the token count and duration:
|
|
117
|
+
|
|
118
|
+
```json timing.json theme={null}
|
|
119
|
+
{
|
|
120
|
+
"total_tokens": 84852,
|
|
121
|
+
"duration_ms": 23332
|
|
122
|
+
}
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
<Tip>
|
|
126
|
+
In Claude Code, when a subagent task finishes, the [task completion notification](https://platform.claude.com/docs/en/agent-sdk/typescript#sdk-task-notification-message) includes `total_tokens` and `duration_ms`. Save these values immediately — they aren't persisted anywhere else.
|
|
127
|
+
</Tip>
|
|
128
|
+
|
|
129
|
+
## Writing assertions
|
|
130
|
+
|
|
131
|
+
Assertions are verifiable statements about what the output should contain or achieve. Add them after you see your first round of outputs — you often don't know what "good" looks like until the skill has run.
|
|
132
|
+
|
|
133
|
+
Good assertions:
|
|
134
|
+
|
|
135
|
+
* `"The output file is valid JSON"` — programmatically verifiable.
|
|
136
|
+
* `"The bar chart has labeled axes"` — specific and observable.
|
|
137
|
+
* `"The report includes at least 3 recommendations"` — countable.
|
|
138
|
+
|
|
139
|
+
Weak assertions:
|
|
140
|
+
|
|
141
|
+
* `"The output is good"` — too vague to grade.
|
|
142
|
+
* `"The output uses exactly the phrase 'Total Revenue: $X'"` — too brittle; correct output with different wording would fail.
|
|
143
|
+
|
|
144
|
+
Not everything needs an assertion. Some qualities — writing style, visual design, whether the output "feels right" — are hard to decompose into pass/fail checks. These are better caught during [human review](#reviewing-results-with-a-human). Reserve assertions for things that can be checked objectively.
|
|
145
|
+
|
|
146
|
+
Add assertions to each test case in `evals/evals.json`:
|
|
147
|
+
|
|
148
|
+
```json evals/evals.json highlight={9-14} theme={null}
|
|
149
|
+
{
|
|
150
|
+
"skill_name": "csv-analyzer",
|
|
151
|
+
"evals": [
|
|
152
|
+
{
|
|
153
|
+
"id": 1,
|
|
154
|
+
"prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
|
|
155
|
+
"expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
|
|
156
|
+
"files": ["evals/files/sales_2025.csv"],
|
|
157
|
+
"assertions": [
|
|
158
|
+
"The output includes a bar chart image file",
|
|
159
|
+
"The chart shows exactly 3 months",
|
|
160
|
+
"Both axes are labeled",
|
|
161
|
+
"The chart title or caption mentions revenue"
|
|
162
|
+
]
|
|
163
|
+
}
|
|
164
|
+
]
|
|
165
|
+
}
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
## Grading outputs
|
|
169
|
+
|
|
170
|
+
Grading means evaluating each assertion against the actual outputs and recording **PASS** or **FAIL** with specific evidence. The evidence should quote or reference the output, not just state an opinion.
|
|
171
|
+
|
|
172
|
+
The simplest approach is to give the outputs and assertions to an LLM and ask it to evaluate each one. For assertions that can be checked by code (valid JSON, correct row count, file exists with expected dimensions), use a verification script: scripts are more reliable than LLM judgment for mechanical checks, and they can be reused across iterations.
|
|
173
|
+
|
|
174
|
+
```json grading.json theme={null}
|
|
175
|
+
{
|
|
176
|
+
"assertion_results": [
|
|
177
|
+
{
|
|
178
|
+
"text": "The output includes a bar chart image file",
|
|
179
|
+
"passed": true,
|
|
180
|
+
"evidence": "Found chart.png (45KB) in outputs directory"
|
|
181
|
+
},
|
|
182
|
+
{
|
|
183
|
+
"text": "The chart shows exactly 3 months",
|
|
184
|
+
"passed": true,
|
|
185
|
+
"evidence": "Chart displays bars for March, July, and November"
|
|
186
|
+
},
|
|
187
|
+
{
|
|
188
|
+
"text": "Both axes are labeled",
|
|
189
|
+
"passed": false,
|
|
190
|
+
"evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
|
|
191
|
+
},
|
|
192
|
+
{
|
|
193
|
+
"text": "The chart title or caption mentions revenue",
|
|
194
|
+
"passed": true,
|
|
195
|
+
"evidence": "Chart title reads 'Top 3 Months by Revenue'"
|
|
196
|
+
}
|
|
197
|
+
],
|
|
198
|
+
"summary": {
|
|
199
|
+
"passed": 3,
|
|
200
|
+
"failed": 1,
|
|
201
|
+
"total": 4,
|
|
202
|
+
"pass_rate": 0.75
|
|
203
|
+
}
|
|
204
|
+
}
|
|
205
|
+
```
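For the mechanical checks, a small script can produce entries in the same shape as the `grading.json` above. This is a sketch under that assumption: the function names are hypothetical, and only the `text`/`passed`/`evidence` and `summary` fields are taken from the example.

```python
import json


def check_valid_json(text: str) -> dict:
    """Mechanical check for the assertion 'The output file is valid JSON'."""
    assertion = "The output file is valid JSON"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # Record the parser's message as evidence rather than an opinion.
        return {"text": assertion, "passed": False, "evidence": f"Parse error: {exc}"}
    return {"text": assertion, "passed": True, "evidence": "Parsed without errors"}


def summarize(assertion_results: list[dict]) -> dict:
    """Build the summary block from a list of graded assertions."""
    total = len(assertion_results)
    passed = sum(1 for result in assertion_results if result["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
    }
```

Results graded by an LLM and results from scripts like this one can be merged into one `assertion_results` list before calling `summarize`.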
|
|
206
|
+
|
|
207
|
+
### Grading principles
|
|
208
|
+
|
|
209
|
+
* **Require concrete evidence for a PASS.** Don't give the benefit of the doubt. If an assertion says "includes a summary" and the output has a section titled "Summary" with one vague sentence, that's a FAIL — the label is there but the substance isn't.
|
|
210
|
+
* **Review the assertions themselves, not just the results.** While grading, notice when assertions are too easy (always pass regardless of skill quality), too hard (always fail even when the output is good), or unverifiable (can't be checked from the output alone). Fix these for the next iteration.
|
|
211
|
+
|
|
212
|
+
<Tip>
|
|
213
|
+
For comparing two skill versions, try **blind comparison**: present both outputs to an LLM judge without revealing which came from which version. The judge scores holistic qualities — organization, formatting, usability, polish — on its own rubric, free from bias about which version "should" be better. This complements assertion grading: two outputs might both pass all assertions but differ significantly in overall quality.
|
|
214
|
+
</Tip>
|
|
215
|
+
|
|
216
|
+
## Aggregating results
|
|
217
|
+
|
|
218
|
+
Once every run in the iteration is graded, compute summary statistics per configuration and save them to `benchmark.json` alongside the eval directories (e.g., `csv-analyzer-workspace/iteration-1/benchmark.json`):
|
|
219
|
+
|
|
220
|
+
```json benchmark.json theme={null}
{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.83, "stddev": 0.06 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    },
    "without_skill": {
      "pass_rate": { "mean": 0.33, "stddev": 0.10 },
      "time_seconds": { "mean": 32.0, "stddev": 8.0 },
      "tokens": { "mean": 2100, "stddev": 300 }
    },
    "delta": {
      "pass_rate": 0.50,
      "time_seconds": 13.0,
      "tokens": 1700
    }
  }
}
```
|
|
241
|
+
|
|
242
|
+
The `delta` tells you what the skill costs (more time, more tokens) and what it buys (higher pass rate). A skill that adds 13 seconds but improves pass rate by 50 percentage points is probably worth it. A skill that doubles token usage for a 2-point improvement might not be.
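As a rough sketch of how these numbers could be produced, the code below computes the mean and standard deviation for each configuration and then the delta between the two means. The per-run values are invented for illustration, and `summarize` and `build_benchmark` are hypothetical helper names; only the output shape follows the `benchmark.json` example.

```python
from statistics import mean, stdev

METRICS = ("pass_rate", "time_seconds", "tokens")


def summarize(values):
    """Mean and sample standard deviation; stddev is 0.0 for a single run."""
    return {
        "mean": round(mean(values), 2),
        "stddev": round(stdev(values), 2) if len(values) > 1 else 0.0,
    }


def build_benchmark(runs):
    """runs maps a configuration name to a list of per-run metric dicts."""
    summary = {
        config: {m: summarize([run[m] for run in results]) for m in METRICS}
        for config, results in runs.items()
    }
    # Delta is with_skill minus without_skill, computed on the means.
    summary["delta"] = {
        m: round(summary["with_skill"][m]["mean"] - summary["without_skill"][m]["mean"], 2)
        for m in METRICS
    }
    return {"run_summary": summary}


# Hypothetical per-run measurements, three runs per configuration.
runs = {
    "with_skill": [
        {"pass_rate": 0.75, "time_seconds": 40.0, "tokens": 3600},
        {"pass_rate": 1.0, "time_seconds": 45.0, "tokens": 3800},
        {"pass_rate": 0.75, "time_seconds": 50.0, "tokens": 4000},
    ],
    "without_skill": [
        {"pass_rate": 0.25, "time_seconds": 30.0, "tokens": 2000},
        {"pass_rate": 0.5, "time_seconds": 32.0, "tokens": 2100},
        {"pass_rate": 0.25, "time_seconds": 34.0, "tokens": 2200},
    ],
}
benchmark = build_benchmark(runs)
```

With these inputs the delta matches the example: 0.5 for pass rate, 13.0 seconds, and 1700 tokens.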
|
|
243
|
+
|
|
244
|
+
<Note>
|
|
245
|
+
Standard deviation (`stddev`) is only meaningful with multiple runs per eval. In early iterations with just 2-3 test cases and single runs, focus on the raw pass counts and the delta — the statistical measures become useful as you expand the test set and run each eval multiple times.
|
|
246
|
+
</Note>
|
|
247
|
+
|
|
248
|
+
## Analyzing patterns
|
|
249
|
+
|
|
250
|
+
Aggregate statistics can hide important patterns. After computing the benchmarks:
|
|
251
|
+
|
|
252
|
+
* **Remove or replace assertions that always pass in both configurations.** These don't tell you anything useful — the model handles them fine without the skill. They inflate the with-skill pass rate without reflecting actual skill value.
|
|
253
|
+
* **Investigate assertions that always fail in both configurations.** Either the assertion is broken (asking for something the model can't do), the test case is too hard, or the assertion is checking for the wrong thing. Fix these before the next iteration.
|
|
254
|
+
* **Study assertions that pass with the skill but fail without.** This is where the skill is clearly adding value. Understand *why* — which instructions or scripts made the difference?
|
|
255
|
+
* **Tighten instructions when results are inconsistent across runs.** If the same eval passes on some runs and fails on others (reflected as high `stddev` in the benchmark), the eval may be flaky (sensitive to model randomness), or the skill's instructions may be ambiguous enough that the model interprets them differently each time. Add examples or more specific guidance to reduce ambiguity.
|
|
256
|
+
* **Check time and token outliers.** If one eval takes 3x longer than the others, read its execution transcript (the full log of what the model did during the run) to find the bottleneck.
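The first four checks above can be sketched as a small classifier over per-assertion results. This is a minimal illustration, assuming each assertion's outcomes are available as one boolean per run for each configuration; `classify_assertion` and its return labels are hypothetical.

```python
def classify_assertion(with_skill, without_skill):
    """Classify one assertion from its pass/fail results across runs.

    Each argument is a non-empty list of booleans, one per run.
    """
    if all(with_skill) and all(without_skill):
        return "always passes: remove or replace"
    if not any(with_skill) and not any(without_skill):
        return "always fails: investigate"
    if all(with_skill) and not any(without_skill):
        return "skill adds value: study why"
    if len(set(with_skill)) > 1 or len(set(without_skill)) > 1:
        return "inconsistent across runs: tighten instructions"
    # Remaining case (fails only with the skill) is not covered by the list above.
    return "unclassified: review manually"


print(classify_assertion([True, True, True], [False, False, False]))
```

Running this over every assertion gives a quick worklist before reading individual transcripts.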
|
|
257
|
+
|
|
258
|
+
## Reviewing results with a human
|
|
259
|
+
|
|
260
|
+
Assertion grading and pattern analysis catch a lot, but they only check what you thought to write assertions for. A human reviewer brings a fresh perspective — catching issues you didn't anticipate, noticing when the output is technically correct but misses the point, or spotting problems that are hard to express as pass/fail checks. For each test case, review the actual outputs alongside the grades.
|
|
261
|
+
|
|
262
|
+
Record specific feedback for each test case and save it in the workspace (e.g., as a `feedback.json` alongside the eval directories):
|
|
263
|
+
|
|
264
|
+
```json feedback.json theme={null}
{
  "eval-top-months-chart": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
  "eval-clean-missing-emails": ""
}
```
|
|
270
|
+
|
|
271
|
+
"The chart is missing axis labels" is actionable; "looks bad" is not. Empty feedback means the output looked fine — that test case passed your review. During the [iteration step](#iterating-on-the-skill), focus your improvements on the test cases where you had specific complaints.
|
|
272
|
+
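Because empty feedback means a test case passed review, the list of cases to focus on can be read straight from the feedback file. A minimal sketch, using the example content from `feedback.json` inline rather than reading it from disk (the variable names are illustrative):

```python
import json

feedback_text = """
{
  "eval-top-months-chart": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
  "eval-clean-missing-emails": ""
}
"""

feedback = json.loads(feedback_text)

# Keep only test cases with a non-empty (non-whitespace) complaint.
needs_work = [name for name, note in feedback.items() if note.strip()]
print(needs_work)
```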
|
|
273
|
+
## Iterating on the skill
|
|
274
|
+
|
|
275
|
+
After grading and reviewing, you have three sources of signal:
|
|
276
|
+
|
|
277
|
+
* **Failed assertions** point to specific gaps — a missing step, an unclear instruction, or a case the skill doesn't handle.
|
|
278
|
+
* **Human feedback** points to broader quality issues — the approach was wrong, the output was poorly structured, or the skill produced a technically correct but unhelpful result.
|
|
279
|
+
* **Execution transcripts** reveal *why* things went wrong. If the agent ignored an instruction, the instruction may be ambiguous. If the agent spent time on unproductive steps, those instructions may need to be simplified or removed.
|
|
280
|
+
|
|
281
|
+
The most effective way to turn these signals into skill improvements is to give all three — along with the current `SKILL.md` — to an LLM and ask it to propose changes. The LLM can synthesize patterns across failed assertions, reviewer complaints, and transcript behavior that would be tedious to connect manually. When prompting the LLM, include these guidelines:
|
|
282
|
+
|
|
283
|
+
* **Generalize from feedback.** The skill will be used across many different prompts, not just the test cases. Fixes should address underlying issues broadly rather than adding narrow patches for specific examples.
|
|
284
|
+
* **Keep the skill lean.** Fewer, better instructions often outperform exhaustive rules. If transcripts show wasted work (unnecessary validation, unneeded intermediate outputs), remove those instructions. If pass rates plateau despite adding more rules, the skill may be over-constrained — try removing instructions and see if results hold or improve.
|
|
285
|
+
* **Explain the why.** Reasoning-based instructions ("Do X because Y tends to cause Z") work better than rigid directives ("ALWAYS do X, NEVER do Y"). Models follow instructions more reliably when they understand the purpose.
|
|
286
|
+
* **Bundle repeated work.** If every test run independently wrote a similar helper script (a chart builder, a data parser), that's a signal to bundle the script into the skill's `scripts/` directory. See [Using scripts](/skill-creation/using-scripts) for how to do this.
|
|
287
|
+
|
|
288
|
+
### The loop
|
|
289
|
+
|
|
290
|
+
1. Give the eval signals and current `SKILL.md` to an LLM and ask it to propose improvements.
|
|
291
|
+
2. Review and apply the changes.
|
|
292
|
+
3. Rerun all test cases in a new `iteration-<N+1>/` directory.
|
|
293
|
+
4. Grade and aggregate the new results.
|
|
294
|
+
5. Review with a human, then repeat from step 1.
|
|
295
|
+
|
|
296
|
+
Stop when you're satisfied with the results, feedback is consistently empty, or you're no longer seeing meaningful improvement between iterations.
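Two pieces of bookkeeping in this loop are easy to script: choosing the next `iteration-<N+1>/` directory and checking a stopping condition. The sketch below is an assumption-heavy illustration: `next_iteration_dir` and `should_stop` are hypothetical helpers, the `min_gain` threshold of 0.02 is an arbitrary stand-in for "meaningful improvement", and the feedback check looks at a single iteration rather than several.

```python
import re
from pathlib import Path


def next_iteration_dir(workspace: Path) -> Path:
    """Return the path for the next iteration-<N+1>/ directory (not created)."""
    numbers = []
    for entry in workspace.iterdir():
        match = re.fullmatch(r"iteration-(\d+)", entry.name)
        if entry.is_dir() and match:
            numbers.append(int(match.group(1)))
    return workspace / f"iteration-{max(numbers, default=0) + 1}"


def should_stop(feedback: dict, pass_rates: list, min_gain: float = 0.02) -> bool:
    """Stop when all feedback is empty or the latest gain is below min_gain."""
    feedback_empty = all(not note.strip() for note in feedback.values())
    plateaued = len(pass_rates) >= 2 and pass_rates[-1] - pass_rates[-2] < min_gain
    return feedback_empty or plateaued
```

Being satisfied with the results remains a human judgment that this check does not capture.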
|
|
297
|
+
|
|
298
|
+
<Tip>
|
|
299
|
+
The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates much of this workflow — running evals, grading assertions, aggregating benchmarks, and presenting results for human review.
|
|
300
|
+
</Tip>
|