skilltest 0.7.0 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +11 -7
- package/README.md +267 -12
- package/dist/index.js +1699 -173
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -19,12 +19,15 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/linter/`: lint check modules and orchestrator
 - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
 - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+- `src/core/tool-environment.ts`: mock tool environment + agentic loop for tool-aware eval
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
-- `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
+- `src/providers/`: LLM provider abstraction (`sendMessage`, `sendWithTools`) and provider implementations
 - `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers
 
+Eval supports optional mock tool environments for testing skills that invoke tools.
+
 ## Build and Test Locally
 
 Install deps:
@@ -68,6 +71,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 
 - Minimal provider interface:
   - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
+  - `sendWithTools(systemPrompt, messages, { model, tools }) => ProviderToolResponse`
 - Lint is fully offline and first-class.
 - Trigger/eval rely on the same provider abstraction.
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
@@ -77,6 +81,9 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - default concurrency is `5`
 - `--concurrency 1` preserves the old sequential behavior
 - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
+- Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
+- Tool-aware eval uses mock responses only. No real tool execution happens during eval.
+- Tool assertions are evaluated structurally, without the grader model, so those checks stay deterministic.
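The "evaluated structurally" note above can be illustrated with a small sketch. This is a hypothetical implementation, not skilltest's actual code; the function name is invented, and the assertion/call shapes are assumed from the `toolAssertions` examples in the README:

```javascript
// Hypothetical sketch: structural checks over recorded tool calls, no grader model.
// Shapes assumed from the README's `toolAssertions` examples; the real logic may differ.
function evaluateToolAssertion(assertion, recordedCalls) {
  const matching = recordedCalls.filter((c) => c.toolName === assertion.toolName);
  switch (assertion.type) {
    case "tool_called":
      return matching.length > 0;
    case "tool_not_called":
      return matching.length === 0;
    case "tool_argument_match":
      // Pass if any recorded call supplied every expected argument value.
      return matching.some((c) =>
        Object.entries(assertion.expectedArgs).every(([key, val]) => c.args[key] === val)
      );
    default:
      return false;
  }
}
```

Because every branch is a pure comparison over the recorded call transcript, the result is deterministic for a given eval run.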
 - JSON mode is strict:
   - no spinners
   - no colored output
@@ -103,11 +110,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Security heuristics: `src/core/linter/security.ts`
 - Progressive disclosure: `src/core/linter/disclosure.ts`
 - Compatibility hints: `src/core/linter/compat.ts`
-
+- Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
+- Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
+- Mock tool environment + agentic loop: `src/core/tool-environment.ts`
 - Eval grading schema: `src/core/grader.ts`
 - Combined quality gate orchestration: `src/core/check-runner.ts`
-
-## Future Work (Not Implemented Yet)
-
-- Config file support (`.skilltestrc`)
-- Plugin linter rules
package/README.md
CHANGED
|
@@ -4,7 +4,7 @@
 [](./LICENSE)
 [](#cicd-integration)
 
-The testing framework for Agent Skills. Lint, test triggering, and evaluate your SKILL.md files.
+The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your `SKILL.md` files.
 
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
 
@@ -12,12 +12,6 @@ The repository itself uses a fast Vitest suite for offline unit and integration
 coverage of the parser, linters, trigger math, config resolution, reporters,
 and linter orchestration.
 
-## Demo
-
-GIF coming soon.
-
-<!--  -->
-
 ## Why skilltest?
 
 Agent Skills are quick to write but hard to validate before deployment:
@@ -27,7 +21,7 @@ Agent Skills are quick to write but hard to validate before deployment:
 - You cannot easily measure trigger precision/recall.
 - You do not know whether outputs are good until users exercise the skill.
 
-`skilltest` closes this gap with one CLI and
+`skilltest` closes this gap with one CLI and five modes.
 
 ## Install
 
@@ -71,6 +65,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```
 
+Propose a verified rewrite without touching the source file:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic
+```
+
+Apply the verified rewrite in place:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic --apply
+```
+
 Write a self-contained HTML report:
 
 ```bash
@@ -80,8 +86,8 @@ skilltest check ./path/to/skill --html ./reports/check.html
 Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
 the old sequential execution order. Seeded trigger runs stay deterministic regardless
 of concurrency.
-
-
+`lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports.
+`improve` is terminal/JSON only in v1.
 
 Example lint summary:
 
@@ -115,7 +121,8 @@ Example `.skilltestrc`:
   },
   "eval": {
     "numRuns": 5,
-    "threshold": 0.9
+    "threshold": 0.9,
+    "maxToolIterations": 10
   }
 }
 ```
@@ -163,6 +170,72 @@ What it checks:
 Flags:
 
 - `--html <path>` write a self-contained HTML report
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+
+### Plugin Rules
+
+You can run custom lint rules alongside the built-in checks. Plugin rules use the
+same `LintContext` and `LintIssue` types as the core linter, and their results
+appear in the same `LintReport`.
+
+Config:
+
+```json
+{
+  "lint": {
+    "plugins": ["./my-rules.js"]
+  }
+}
+```
+
+CLI:
+
+```bash
+skilltest lint ./skill --plugin ./my-rules.js
+```
+
+Minimal plugin example:
+
+```js
+export default {
+  rules: [
+    {
+      checkId: "custom:no-todo",
+      title: "No TODO comments",
+      check(context) {
+        const body = context.frontmatter.content;
+        if (/\bTODO\b/.test(body)) {
+          return [
+            {
+              id: "custom.no-todo",
+              checkId: "custom:no-todo",
+              title: "No TODO comments",
+              status: "warn",
+              message: "SKILL.md contains a TODO marker."
+            }
+          ];
+        }
+        return [
+          {
+            id: "custom.no-todo",
+            checkId: "custom:no-todo",
+            title: "No TODO comments",
+            status: "pass",
+            message: "No TODO markers found."
+          }
+        ];
+      }
+    }
+  ]
+};
+```
+
+Notes:
+
+- Plugin files are loaded with dynamic `import()`.
+- `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
+- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
+- CLI `--plugin` values replace config-file `lint.plugins` values.
 
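To see how a plugin's rules fold into one issue list, here is a hypothetical driver. The function name is invented and the real orchestration (in `src/core/linter/plugin.ts`) may differ; only the `rules`/`check(context)` shape is taken from the plugin example above:

```javascript
// Hypothetical driver: collect issues from one loaded plugin's rules.
// `context` carries the parsed skill, as in the plugin example above.
function runPluginRules(plugin, context) {
  const issues = [];
  for (const rule of plugin.rules ?? []) {
    // Each rule's check() returns an array of LintIssue-shaped objects.
    issues.push(...rule.check(context));
  }
  return issues;
}

// A tiny inline plugin mirroring the no-todo example's shape.
const noTodoPlugin = {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const body = context.frontmatter.content;
        return /\bTODO\b/.test(body)
          ? [{ id: "custom.no-todo", checkId: "custom:no-todo", status: "warn", message: "SKILL.md contains a TODO marker." }]
          : [{ id: "custom.no-todo", checkId: "custom:no-todo", status: "pass", message: "No TODO markers found." }];
      }
    }
  ]
};
```

In the real CLI the plugin object would come from a dynamic `import()` of the `--plugin` path rather than an inline constant.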
 ### `skilltest trigger <path-to-skill>`
 
@@ -175,6 +248,7 @@ Flow:
 3. For each query, asks model to select one skill from a mixed list:
    - your skill under test
    - realistic fake skills
+   - optional sibling competitor skills from `--compare`
 4. Computes TP, TN, FP, FN, precision, recall, F1.
 
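The metrics in step 4 follow the standard confusion-matrix formulas; an illustrative helper (hypothetical, not a skilltest API):

```javascript
// Standard confusion-matrix math behind the trigger report's numbers.
// tp = should-trigger queries where the skill was selected, fp = selected
// when it should not have been, fn = missed when it should have triggered.
function triggerMetrics({ tp, fp, fn }) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```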
 
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
@@ -188,6 +262,7 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--provider <anthropic|openai>` default: `anthropic`
 - `--queries <path>` use custom queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
 - `--concurrency <n>` default: `5`
@@ -196,6 +271,28 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
 
+### Comparative Trigger Testing
+
+Test whether your skill is distinctive enough to be selected over similar real skills:
+
+```bash
+skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
+```
+
+Config:
+
+```json
+{
+  "trigger": {
+    "compare": ["../similar-skill-1", "../similar-skill-2"]
+  }
+}
+```
+
+Comparative mode includes the real competitor skills in the candidate list alongside
+fake skills. This reveals confusion between skills with overlapping descriptions that
+standard trigger testing would miss.
+
 ### `skilltest eval <path-to-skill>`
 
 Runs full skill behavior and grades outputs against assertions.
@@ -219,6 +316,70 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
 
+Config-only eval setting:
+
+- `eval.maxToolIterations` default: `10` safety cap for tool-aware eval loops
+
+### Tool-Aware Eval
+
+When an eval prompt defines `tools`, `skilltest` runs the prompt in a mock tool
+environment instead of plain text-only execution. The model can call the mocked
+tools during eval, and `skilltest` records the calls alongside the normal grader
+assertions.
+
+Tool responses are always mocked. `skilltest` does not execute real tools,
+scripts, shell commands, MCP servers, or APIs during eval.
+
+Example prompt file:
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention the missing rollback plan"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
+Run it with:
+
+```bash
+skilltest eval ./my-skill --prompts ./eval-prompts.json
+```
+
 
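The `responses` map in the tool-aware eval example above keys mock replies by the tool's argument payload. A hedged sketch of one plausible lookup, assuming exact match on the JSON-serialized arguments with `"*"` as the fallback; the real matcher in `src/core/tool-environment.ts` may normalize keys (for example, argument order) differently:

```javascript
// Assumed lookup: exact match on JSON-serialized args, then the "*" wildcard.
// Note: JSON.stringify preserves insertion order, so key order must match the map.
function resolveMockResponse(responses, args) {
  const key = JSON.stringify(args);
  if (Object.prototype.hasOwnProperty.call(responses, key)) {
    return responses[key];
  }
  if ("*" in responses) {
    return responses["*"];
  }
  return "[mock] no response configured";
}
```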
 ### `skilltest check <path-to-skill>`
 
 Runs `lint + trigger + eval` in one command and applies quality thresholds.
@@ -238,9 +399,11 @@ Flags:
 - `--grader-model <model>` default: same as resolved `--model`
 - `--api-key <key>` explicit key override
 - `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
 - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
 - `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
@@ -249,6 +412,52 @@ Flags:
 - `--continue-on-lint-fail` continue trigger/eval even if lint fails
 - `--verbose` include detailed trigger/eval sections
 
+### `skilltest improve <path-to-skill>`
+
+Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally
+applies it.
+
+Default behavior:
+
+1. Run a baseline `check` with `continue-on-lint-fail=true`.
+2. Freeze the exact trigger queries and eval prompts used in that baseline run.
+3. Ask the model for a structured JSON rewrite of `SKILL.md`.
+4. Rebuild and validate the candidate locally:
+   - must stay parseable
+   - must keep the same skill `name`
+   - must keep the current `license` when one already exists
+   - must not introduce broken relative references
+5. Verify the candidate by rerunning `check` against a copied skill directory with
+   the frozen trigger/eval inputs.
+6. Only write files when the candidate measurably improves the skill and passes the
+   configured quality gates.
+
+Flags:
+
+- `--provider <anthropic|openai>` default: `anthropic`
+- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+- `--api-key <key>` explicit key override
+- `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
+- `--num-queries <n>` default: `20` (must be even when auto-generating)
+- `--seed <number>` RNG seed for reproducible trigger sampling
+- `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+- `--concurrency <n>` default: `5`
+- `--output <path>` write the verified candidate `SKILL.md` to a separate file
+- `--save-results <path>` save full improve result JSON
+- `--min-f1 <n>` default: `0.8`
+- `--min-assert-pass-rate <n>` default: `0.9`
+- `--apply` write the verified rewrite back to the source `SKILL.md`
+- `--verbose` include full baseline and verification reports
+
+Notes:
+
+- `improve` is dry-run by default.
+- `--apply` only writes when parse, lint, trigger, and eval verification all pass.
+- Before/after metrics are measured against the same generated or user-supplied
+  trigger queries and eval prompts, not a fresh sample.
+
 ## Global Flags
 
 - `--help` show help
@@ -287,12 +496,56 @@ Eval prompts (`--prompts`):
 ]
 ```
 
+Tool-aware eval prompts (`--prompts`):
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention remediation steps"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
 ## Output and Exit Codes
 
 Exit codes:
 
 - `0`: success
-- `1`: quality gate failed (`lint`, `check`
+- `1`: quality gate failed (`lint`, `check`, `improve` blocked, or other command-specific failure conditions)
 - `2`: runtime/config/API/parse error
 
 JSON mode examples:
@@ -302,6 +555,7 @@ skilltest lint ./skill --json
 skilltest trigger ./skill --json
 skilltest eval ./skill --json
 skilltest check ./skill --json
+skilltest improve ./skill --json
 ```
 
 HTML report examples:
@@ -433,6 +687,7 @@ node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
 node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
+node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```
 
 ## Release Checklist