skilltest 0.4.0 → 0.6.0
- package/CLAUDE.md +5 -3
- package/README.md +78 -1
- package/dist/index.js +1475 -257
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -22,7 +22,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
 - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
-- `src/reporters/`: terminal
+- `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers

 ## Build and Test Locally
@@ -73,6 +73,10 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
   - trigger F1
   - eval assertion pass rate
+- Trigger/eval work is concurrency-limited instead of fully unbounded:
+  - default concurrency is `5`
+  - `--concurrency 1` preserves the old sequential behavior
+  - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
 - JSON mode is strict:
   - no spinners
   - no colored output
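
Note on the hunk above: the "concurrency-limited instead of fully unbounded" behavior can be pictured as a small worker pool. The sketch below is an illustration only and is not taken from `dist/index.js`; `mapWithConcurrency` is a hypothetical name, and the real implementation may be structured differently.

```typescript
// Hypothetical helper (not the package's actual code): run `worker` over
// `items` with at most `limit` calls in flight at any time.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims the next unclaimed index until the queue is drained.
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  };

  // Start `limit` lanes (or fewer if there are fewer items) and wait for all.
  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With `limit` set to `1` there is a single lane, so items are processed strictly one after another, which matches the "old sequential behavior" wording for `--concurrency 1`.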
@@ -106,6 +110,4 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 ## Future Work (Not Implemented Yet)

 - Config file support (`.skilltestrc`)
-- Parallel execution
-- HTML reporting
 - Plugin linter rules
package/README.md
CHANGED
@@ -67,6 +67,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```

+Write a self-contained HTML report:
+
+```bash
+skilltest check ./path/to/skill --html ./reports/check.html
+```
+
+Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
+the old sequential execution order. Seeded trigger runs stay deterministic regardless
+of concurrency.
+All four commands also support `--html <path>` for an offline HTML report, and
+`--json` can be used with `--html` in the same run.
+
 Example lint summary:

 ```text
@@ -75,6 +87,35 @@ target: ./test-fixtures/sample-skill
 summary: 29/29 checks passed, 0 warnings, 0 failures
 ```

+## Configuration
+
+`skilltest` resolves config in this order:
+
+1. `.skilltestrc` in the target skill root
+2. `.skilltestrc` in the current working directory
+3. the nearest `package.json` containing `skilltestrc`
+
+CLI flags override config values.
+
+Example `.skilltestrc`:
+
+```json
+{
+  "provider": "anthropic",
+  "model": "claude-sonnet-4-5-20250929",
+  "concurrency": 5,
+  "trigger": {
+    "numQueries": 20,
+    "threshold": 0.8,
+    "seed": 123
+  },
+  "eval": {
+    "numRuns": 5,
+    "threshold": 0.9
+  }
+}
+```
+
 ## Commands

 ### `skilltest lint <path-to-skill>`
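
Note on the Configuration section added in the hunk above: the lookup order plus "CLI flags override config values" can be sketched as two small pure functions. This is an illustration under assumptions, not the package's code: `pickConfig` and `applyCliOverrides` are hypothetical names, and the README does not say whether several config files are merged, so the sketch assumes the first location that yields a config wins.

```typescript
// Shape mirrors the example `.skilltestrc` in the README; all keys optional.
interface SkilltestConfig {
  provider?: string;
  model?: string;
  concurrency?: number;
  trigger?: { numQueries?: number; threshold?: number; seed?: number };
  eval?: { numRuns?: number; threshold?: number };
}

// `sources` is listed in lookup order (skill root, cwd, package.json).
// `undefined` means nothing was found at that location.
// Assumption: the first config found is used as-is, without merging later ones.
function pickConfig(
  sources: ReadonlyArray<SkilltestConfig | undefined>,
): SkilltestConfig {
  return sources.find((source) => source !== undefined) ?? {};
}

// CLI flags win over file values; nested sections are merged key by key so
// that overriding `trigger.seed` does not discard `trigger.numQueries`.
function applyCliOverrides(
  file: SkilltestConfig,
  cli: SkilltestConfig,
): SkilltestConfig {
  return {
    ...file,
    ...cli,
    trigger: { ...file.trigger, ...cli.trigger },
    eval: { ...file.eval, ...cli.eval },
  };
}
```

For example, if only the working-directory `.skilltestrc` exists and the user passes `--seed 7`, the result keeps the file's `concurrency` and `trigger.numQueries` while `trigger.seed` becomes `7`.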
@@ -115,6 +156,10 @@ What it checks:
 - warns on provider-specific conventions such as `allowed-tools`
 - emits a likely compatibility summary

+Flags:
+
+- `--html <path>` write a self-contained HTML report
+
 ### `skilltest trigger <path-to-skill>`

 Measures trigger behavior for your skill description with model simulation.
@@ -128,12 +173,21 @@ Flow:
 - realistic fake skills
 4. Computes TP, TN, FP, FN, precision, recall, F1.

+For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
+terminal and JSON output include it so the run can be repeated exactly. If you use
+`.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
+The fake-skill setup is precomputed before requests begin, so the same seed produces
+the same trigger cases at any concurrency level.
+
 Flags:

 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--provider <anthropic|openai>` default: `anthropic`
 - `--queries <path>` use custom queries JSON
 - `--num-queries <n>` default: `20` (must be even)
+- `--seed <number>` RNG seed for reproducible fake-skill sampling
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-queries <path>` save generated query set
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
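
Note on the seed paragraph added in the hunk above: "precomputed before requests begin" means every random draw happens in a fixed order before any concurrent work starts, so request completion order cannot change the sampled cases. The sketch below shows one way to do that. It is an assumption-laden illustration: `planTriggerCases` and `TriggerCase` are hypothetical names, and `mulberry32` is a well-known small PRNG used here as a stand-in, not necessarily the generator `skilltest` uses.

```typescript
// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical shape for one planned trigger request.
interface TriggerCase {
  query: string;
  fakeSkills: string[];
}

// Draw all random choices up front, in query order, from a single seeded
// generator. No RNG call happens once requests are in flight.
function planTriggerCases(
  queries: readonly string[],
  fakeSkillPool: readonly string[],
  fakeSkillsPerCase: number,
  seed: number,
): TriggerCase[] {
  const rand = mulberry32(seed);
  return queries.map((query) => {
    const shuffled = [...fakeSkillPool];
    // Fisher-Yates shuffle driven by the seeded generator.
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = shuffled[i] as string;
      shuffled[i] = shuffled[j] as string;
      shuffled[j] = tmp;
    }
    return { query, fakeSkills: shuffled.slice(0, fakeSkillsPerCase) };
  });
}
```

Because the plan is a plain array built before any request is sent, it can be handed to a concurrency-limited runner at any `--concurrency` value and still describe the same cases for the same seed.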
@@ -155,6 +209,8 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--grader-model <model>` default: same as `--model`
 - `--provider <anthropic|openai>` default: `anthropic`
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-results <path>` write full JSON result
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
@@ -168,7 +224,8 @@ Default behavior:
 1. Run lint.
 2. Stop before model calls if lint has failures.
 3. Run trigger and eval only when lint passes.
-4.
+4. When concurrency is greater than `1`, run trigger and eval in parallel.
+5. Fail quality gate when either threshold is below target.

 Flags:

@@ -178,7 +235,10 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--queries <path>` custom trigger queries JSON
 - `--num-queries <n>` default: `20` (must be even)
+- `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
+- `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
 - `--min-assert-pass-rate <n>` default: `0.9`
 - `--save-results <path>` save combined check result JSON
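
Note on the `check` flow and its `--min-f1` / `--min-assert-pass-rate` flags: the gate logic can be sketched as a pure function over already-collected results. This is an illustration, not the package's code. `f1Score`, `evaluateGate`, and the result shapes are hypothetical; the default thresholds `0.8` and `0.9` are the documented flag defaults, and treating an undefined precision, recall, or pass rate as `0` is an assumption.

```typescript
interface TriggerCounts {
  tp: number;
  tn: number;
  fp: number;
  fn: number;
}

// F1 is the harmonic mean of precision and recall.
// Assumption: a ratio with a zero denominator counts as 0.
function f1Score({ tp, fp, fn }: TriggerCounts): number {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

// Mirrors the documented order: lint failures stop the run first, then
// each score is compared against its minimum.
function evaluateGate(
  lintFailures: number,
  counts: TriggerCounts,
  assertionsPassed: number,
  assertionsTotal: number,
  minF1 = 0.8,
  minAssertPassRate = 0.9,
): GateResult {
  if (lintFailures > 0) {
    return { pass: false, reasons: ["lint has failures"] };
  }
  const reasons: string[] = [];
  const f1 = f1Score(counts);
  const passRate = assertionsTotal === 0 ? 0 : assertionsPassed / assertionsTotal;
  if (f1 < minF1) reasons.push("trigger F1 below minimum");
  if (passRate < minAssertPassRate) reasons.push("assertion pass rate below minimum");
  return { pass: reasons.length === 0, reasons };
}
```

As a worked case, 5 true positives, 5 false negatives, and 0 false positives give precision 1 and recall 0.5, so F1 is 2 × 0.5 / 1.5 ≈ 0.667, which fails the default `--min-f1 0.8`.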
@@ -240,6 +300,21 @@ skilltest eval ./skill --json
 skilltest check ./skill --json
 ```

+HTML report examples:
+
+```bash
+skilltest lint ./skill --html ./reports/lint.html
+skilltest trigger ./skill --html ./reports/trigger.html
+skilltest eval ./skill --html ./reports/eval.html
+skilltest check ./skill --json --html ./reports/check.html
+```
+
+Seeded trigger example:
+
+```bash
+skilltest trigger ./skill --seed 123
+```
+
 ## API Keys

 Anthropic:
@@ -343,7 +418,9 @@ Smoke tests:

 ```bash
 node dist/index.js lint test-fixtures/sample-skill/
+node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
 node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
+node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
 node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```