skilltest 0.5.0 → 0.7.0
- package/CLAUDE.md +5 -3
- package/README.md +77 -3
- package/dist/index.js +1061 -176
- package/dist/index.js.map +1 -1
- package/package.json +4 -3
package/CLAUDE.md
CHANGED
@@ -22,7 +22,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
 - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
-- `src/reporters/`: terminal
+- `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers
 
 ## Build and Test Locally
@@ -73,6 +73,10 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
   - trigger F1
   - eval assertion pass rate
+- Trigger/eval work is concurrency-limited instead of fully unbounded:
+  - default concurrency is `5`
+  - `--concurrency 1` preserves the old sequential behavior
+  - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
 - JSON mode is strict:
   - no spinners
   - no colored output
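Reviewer's illustration (not part of the package diff): the added notes describe bounded concurrency with all seed-dependent setup done before any request starts. The sketch below shows one way such a scheme could look. Every name in it (`mulberry32`, `precomputeCases`, `mapWithConcurrency`, `TriggerCase`) is hypothetical and not taken from the skilltest source, and the PRNG is a stand-in for whatever seeded generator the package actually uses.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not skilltest internals.

// Small deterministic PRNG (mulberry32-style), used only to make the example seedable.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface TriggerCase {
  query: string;
  fakeSkill: string;
}

// All RNG draws happen here, in input order, before any request is issued,
// so the resulting cases depend only on the seed and never on request timing.
function precomputeCases(queries: string[], fakeSkills: string[], seed: number): TriggerCase[] {
  const rand = mulberry32(seed);
  return queries.map((query) => ({
    query,
    fakeSkill: fakeSkills[Math.floor(rand() * fakeSkills.length)],
  }));
}

// Runs `worker` over `items` with at most `limit` calls in flight.
// Results are stored by index, so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let r = 0; r < runnerCount; r++) runners.push(runner());
  await Promise.all(runners);
  return results;
}
```

With `limit` set to `1` the runner processes items strictly one after another, which corresponds to the sequential behavior the notes attribute to `--concurrency 1`.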
@@ -106,6 +110,4 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 ## Future Work (Not Implemented Yet)
 
 - Config file support (`.skilltestrc`)
-- Parallel execution
-- HTML reporting
 - Plugin linter rules
package/README.md
CHANGED
@@ -8,11 +8,15 @@ The testing framework for Agent Skills. Lint, test triggering, and evaluate your
 
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
 
+The repository itself uses a fast Vitest suite for offline unit and integration
+coverage of the parser, linters, trigger math, config resolution, reporters,
+and linter orchestration.
+
 ## Demo
 
 GIF coming soon.
 
-
+<!--  -->
 
 ## Why skilltest?
 
@@ -67,6 +71,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```
 
+Write a self-contained HTML report:
+
+```bash
+skilltest check ./path/to/skill --html ./reports/check.html
+```
+
+Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
+the old sequential execution order. Seeded trigger runs stay deterministic regardless
+of concurrency.
+All four commands also support `--html <path>` for an offline HTML report, and
+`--json` can be used with `--html` in the same run.
+
 Example lint summary:
 
 ```text
@@ -75,6 +91,35 @@ target: ./test-fixtures/sample-skill
 summary: 29/29 checks passed, 0 warnings, 0 failures
 ```
 
+## Configuration
+
+`skilltest` resolves config in this order:
+
+1. `.skilltestrc` in the target skill root
+2. `.skilltestrc` in the current working directory
+3. the nearest `package.json` containing `skilltestrc`
+
+CLI flags override config values.
+
+Example `.skilltestrc`:
+
+```json
+{
+  "provider": "anthropic",
+  "model": "claude-sonnet-4-5-20250929",
+  "concurrency": 5,
+  "trigger": {
+    "numQueries": 20,
+    "threshold": 0.8,
+    "seed": 123
+  },
+  "eval": {
+    "numRuns": 5,
+    "threshold": 0.9
+  }
+}
+```
+
 ## Commands
 
 ### `skilltest lint <path-to-skill>`
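Reviewer's illustration (not part of the package diff): the new Configuration section lists three config locations in priority order and states that CLI flags override config values. A minimal sketch of that precedence follows. It assumes that the first location that yields a config wins outright and that CLI flags are shallow-merged on top. The README does not say whether sources are merged or how nested keys such as `trigger` are handled, so both points are assumptions, and `resolveConfig` and `Candidate` are hypothetical names.

```typescript
// Hypothetical sketch; the real resolver in skilltest may merge sources differently.
type Config = Record<string, unknown>;

interface Candidate {
  source: string;        // label for where the config came from
  config: Config | null; // null when that location has no config
}

// Picks the first candidate that has a config, then lets defined CLI flags override it.
function resolveConfig(
  candidates: Candidate[],
  cliFlags: Config,
): { source: string; config: Config } {
  let source = "defaults";
  let base: Config = {};
  for (const candidate of candidates) {
    if (candidate.config !== null) {
      source = candidate.source;
      base = candidate.config;
      break;
    }
  }
  const merged: Config = { ...base };
  for (const key of Object.keys(cliFlags)) {
    // Flags the user did not pass are left undefined and must not clobber config values.
    if (cliFlags[key] !== undefined) merged[key] = cliFlags[key];
  }
  return { source, config: merged };
}

// Candidates are listed in the documented order: skill root, working directory, package.json.
const resolved = resolveConfig(
  [
    { source: "skill-root/.skilltestrc", config: null },
    { source: "cwd/.skilltestrc", config: { provider: "anthropic", concurrency: 5 } },
    { source: "package.json#skilltestrc", config: { provider: "openai" } },
  ],
  { concurrency: 1, provider: undefined },
);
// resolved.source is "cwd/.skilltestrc"; resolved.config is { provider: "anthropic", concurrency: 1 }
```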
@@ -115,6 +160,10 @@ What it checks:
 - warns on provider-specific conventions such as `allowed-tools`
 - emits a likely compatibility summary
 
+Flags:
+
+- `--html <path>` write a self-contained HTML report
+
 ### `skilltest trigger <path-to-skill>`
 
 Measures trigger behavior for your skill description with model simulation.
@@ -131,6 +180,8 @@ Flow:
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
 terminal and JSON output include it so the run can be repeated exactly. If you use
 `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
+The fake-skill setup is precomputed before requests begin, so the same seed produces
+the same trigger cases at any concurrency level.
 
 Flags:
 
@@ -139,6 +190,8 @@ Flags:
 - `--queries <path>` use custom queries JSON
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-queries <path>` save generated query set
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
@@ -160,6 +213,8 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--grader-model <model>` default: same as `--model`
 - `--provider <anthropic|openai>` default: `anthropic`
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-results <path>` write full JSON result
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
@@ -173,7 +228,8 @@ Default behavior:
 1. Run lint.
 2. Stop before model calls if lint has failures.
 3. Run trigger and eval only when lint passes.
-4.
+4. When concurrency is greater than `1`, run trigger and eval in parallel.
+5. Fail quality gate when either threshold is below target.
 
 Flags:
 
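Reviewer's illustration (not part of the package diff): the revised "Default behavior" list for `check` can be read as a short control flow. Lint gates the model calls, trigger and eval run in parallel only when concurrency exceeds `1`, and the gate then compares both scores with their minimums. The sketch below assumes a score passes when it is greater than or equal to its minimum; `runCheck`, `CheckInput`, and `CheckOutcome` are hypothetical names, and the real logic in `src/core/check-runner.ts` may differ.

```typescript
// Hypothetical sketch of the documented check flow; not the package's implementation.
interface CheckInput {
  lintFailures: number;
  concurrency: number;
  minF1: number;                     // mirrors --min-f1
  minAssertPassRate: number;         // mirrors --min-assert-pass-rate
  runTrigger: () => Promise<number>; // resolves to the trigger F1 score
  runEval: () => Promise<number>;    // resolves to the eval assertion pass rate
}

interface CheckOutcome {
  passed: boolean;
  ranModels: boolean;
  f1?: number;
  passRate?: number;
}

async function runCheck(input: CheckInput): Promise<CheckOutcome> {
  // Steps 1-2: lint failures stop the run before any model call is made.
  if (input.lintFailures > 0) {
    return { passed: false, ranModels: false };
  }

  let f1: number;
  let passRate: number;
  if (input.concurrency > 1) {
    // Step 4: trigger and eval start together.
    [f1, passRate] = await Promise.all([input.runTrigger(), input.runEval()]);
  } else {
    // Concurrency 1: trigger finishes before eval starts.
    f1 = await input.runTrigger();
    passRate = await input.runEval();
  }

  // Step 5: the gate fails if either score is below its minimum.
  const passed = f1 >= input.minF1 && passRate >= input.minAssertPassRate;
  return { passed, ranModels: true, f1, passRate };
}
```

Passing the two stages as functions keeps the sketch offline-testable: stubs that resolve to fixed scores are enough to exercise each branch.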
@@ -185,6 +241,8 @@ Flags:
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
+- `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
 - `--min-assert-pass-rate <n>` default: `0.9`
 - `--save-results <path>` save combined check result JSON
@@ -246,6 +304,15 @@ skilltest eval ./skill --json
 skilltest check ./skill --json
 ```
 
+HTML report examples:
+
+```bash
+skilltest lint ./skill --html ./reports/lint.html
+skilltest trigger ./skill --html ./reports/trigger.html
+skilltest eval ./skill --html ./reports/eval.html
+skilltest check ./skill --json --html ./reports/check.html
+```
+
 Seeded trigger example:
 
 ```bash
@@ -312,6 +379,8 @@ jobs:
   with:
     node-version: "20"
 - run: npm ci
+- run: npm run lint
+- run: npm run test
 - run: npm run build
 - run: npx skilltest lint path/to/skill --json
 ```
@@ -347,14 +416,19 @@ jobs:
 ```bash
 npm install
 npm run lint
+npm run test
 npm run build
 node dist/index.js --help
 ```
 
-
+`npm test` runs the Vitest suite. The tests are offline and do not call model
+providers.
+
+Manual CLI smoke tests:
 
 ```bash
 node dist/index.js lint test-fixtures/sample-skill/
+node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
 node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json