skilltest 0.5.0 → 0.7.0

package/CLAUDE.md CHANGED
@@ -22,7 +22,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
22
22
  - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
23
23
  - `src/core/grader.ts`: structured grader prompt + JSON parse
24
24
  - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
25
- - `src/reporters/`: terminal rendering and JSON output helper
25
+ - `src/reporters/`: terminal, JSON, and HTML output helpers
26
26
  - `src/utils/`: filesystem and API key config helpers
27
27
 
28
28
  ## Build and Test Locally
@@ -73,6 +73,10 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
73
73
  - `check` wraps lint + trigger + eval and enforces minimum thresholds:
74
74
  - trigger F1
75
75
  - eval assertion pass rate
76
+ - Trigger/eval work is concurrency-limited instead of fully unbounded:
77
+ - default concurrency is `5`
78
+ - `--concurrency 1` preserves the old sequential behavior
79
+ - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
76
80
  - JSON mode is strict:
77
81
  - no spinners
78
82
  - no colored output
@@ -106,6 +110,4 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
106
110
  ## Future Work (Not Implemented Yet)
107
111
 
108
112
  - Config file support (`.skilltestrc`)
109
- - Parallel execution
110
- - HTML reporting
111
113
  - Plugin linter rules
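
Note on the CLAUDE.md hunk above: it says trigger/eval work is concurrency-limited (default `5`) and that the RNG-dependent fake-skill setup is precomputed before any request starts, so a seed gives the same cases at any concurrency. The diff does not show the implementation, so the following is only a minimal sketch of that idea. Every name in it (`mulberry32`, `planTriggerCases`, `mapWithConcurrency`, `TriggerCase`) is hypothetical, and the PRNG is a stand-in rather than whatever `skilltest` actually uses.

```typescript
// Hypothetical sketch; none of these names come from the skilltest source.

// Small deterministic PRNG (mulberry32-style) used as a stand-in seeded RNG.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface TriggerCase {
  query: string;
  fakeSkills: string[];
}

// Draw all random choices up front, in a fixed order, before any request runs.
// Request completion order can then never influence which values the RNG yields.
function planTriggerCases(
  queries: string[],
  fakeSkillPool: string[],
  seed: number,
  perCase = 2
): TriggerCase[] {
  const rng = mulberry32(seed);
  return queries.map((query) => {
    const pool = [...fakeSkillPool];
    const fakeSkills: string[] = [];
    for (let i = 0; i < perCase && pool.length > 0; i++) {
      const idx = Math.floor(rng() * pool.length);
      fakeSkills.push(pool.splice(idx, 1)[0]);
    }
    return { query, fakeSkills };
  });
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are stored by input index, so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

In this sketch a limit of `1` leaves a single lane that processes items strictly in input order, which mirrors the note that `--concurrency 1` preserves sequential behavior. Because `planTriggerCases` finishes before `mapWithConcurrency` starts, the planned cases depend only on the seed and the inputs, not on the limit.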
package/README.md CHANGED
@@ -8,11 +8,15 @@ The testing framework for Agent Skills. Lint, test triggering, and evaluate your
8
8
 
9
9
  `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
10
10
 
11
+ The repository itself uses a fast Vitest suite for offline unit and integration
12
+ coverage of the parser, linters, trigger math, config resolution, reporters,
13
+ and linter orchestration.
14
+
11
15
  ## Demo
12
16
 
13
17
  GIF coming soon.
14
18
 
15
- ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon)
19
+ <!-- ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon) -->
16
20
 
17
21
  ## Why skilltest?
18
22
 
@@ -67,6 +71,18 @@ Run full quality gate:
67
71
  skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
68
72
  ```
69
73
 
74
+ Write a self-contained HTML report:
75
+
76
+ ```bash
77
+ skilltest check ./path/to/skill --html ./reports/check.html
78
+ ```
79
+
80
+ Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
81
+ the old sequential execution order. Seeded trigger runs stay deterministic regardless
82
+ of concurrency.
83
+ All four commands also support `--html <path>` for an offline HTML report, and
84
+ `--json` can be used with `--html` in the same run.
85
+
70
86
  Example lint summary:
71
87
 
72
88
  ```text
@@ -75,6 +91,35 @@ target: ./test-fixtures/sample-skill
75
91
  summary: 29/29 checks passed, 0 warnings, 0 failures
76
92
  ```
77
93
 
94
+ ## Configuration
95
+
96
+ `skilltest` resolves config in this order:
97
+
98
+ 1. `.skilltestrc` in the target skill root
99
+ 2. `.skilltestrc` in the current working directory
100
+ 3. the nearest `package.json` containing `skilltestrc`
101
+
102
+ CLI flags override config values.
103
+
104
+ Example `.skilltestrc`:
105
+
106
+ ```json
107
+ {
108
+ "provider": "anthropic",
109
+ "model": "claude-sonnet-4-5-20250929",
110
+ "concurrency": 5,
111
+ "trigger": {
112
+ "numQueries": 20,
113
+ "threshold": 0.8,
114
+ "seed": 123
115
+ },
116
+ "eval": {
117
+ "numRuns": 5,
118
+ "threshold": 0.9
119
+ }
120
+ }
121
+ ```
122
+
78
123
  ## Commands
79
124
 
80
125
  ### `skilltest lint <path-to-skill>`
@@ -115,6 +160,10 @@ What it checks:
115
160
  - warns on provider-specific conventions such as `allowed-tools`
116
161
  - emits a likely compatibility summary
117
162
 
163
+ Flags:
164
+
165
+ - `--html <path>` write a self-contained HTML report
166
+
118
167
  ### `skilltest trigger <path-to-skill>`
119
168
 
120
169
  Measures trigger behavior for your skill description with model simulation.
@@ -131,6 +180,8 @@ Flow:
131
180
  For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
132
181
  terminal and JSON output include it so the run can be repeated exactly. If you use
133
182
  `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
183
+ The fake-skill setup is precomputed before requests begin, so the same seed produces
184
+ the same trigger cases at any concurrency level.
134
185
 
135
186
  Flags:
136
187
 
@@ -139,6 +190,8 @@ Flags:
139
190
  - `--queries <path>` use custom queries JSON
140
191
  - `--num-queries <n>` default: `20` (must be even)
141
192
  - `--seed <number>` RNG seed for reproducible fake-skill sampling
193
+ - `--concurrency <n>` default: `5`
194
+ - `--html <path>` write a self-contained HTML report
142
195
  - `--save-queries <path>` save generated query set
143
196
  - `--api-key <key>` explicit key override
144
197
  - `--verbose` show full model decision text
@@ -160,6 +213,8 @@ Flags:
160
213
  - `--model <model>` default: `claude-sonnet-4-5-20250929`
161
214
  - `--grader-model <model>` default: same as `--model`
162
215
  - `--provider <anthropic|openai>` default: `anthropic`
216
+ - `--concurrency <n>` default: `5`
217
+ - `--html <path>` write a self-contained HTML report
163
218
  - `--save-results <path>` write full JSON result
164
219
  - `--api-key <key>` explicit key override
165
220
  - `--verbose` show full model responses
@@ -173,7 +228,8 @@ Default behavior:
173
228
  1. Run lint.
174
229
  2. Stop before model calls if lint has failures.
175
230
  3. Run trigger and eval only when lint passes.
176
- 4. Fail quality gate when either threshold is below target.
231
+ 4. When concurrency is greater than `1`, run trigger and eval in parallel.
232
+ 5. Fail quality gate when either threshold is below target.
177
233
 
178
234
  Flags:
179
235
 
@@ -185,6 +241,8 @@ Flags:
185
241
  - `--num-queries <n>` default: `20` (must be even)
186
242
  - `--seed <number>` RNG seed for reproducible trigger sampling
187
243
  - `--prompts <path>` custom eval prompts JSON
244
+ - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
245
+ - `--html <path>` write a self-contained HTML report
188
246
  - `--min-f1 <n>` default: `0.8`
189
247
  - `--min-assert-pass-rate <n>` default: `0.9`
190
248
  - `--save-results <path>` save combined check result JSON
@@ -246,6 +304,15 @@ skilltest eval ./skill --json
246
304
  skilltest check ./skill --json
247
305
  ```
248
306
 
307
+ HTML report examples:
308
+
309
+ ```bash
310
+ skilltest lint ./skill --html ./reports/lint.html
311
+ skilltest trigger ./skill --html ./reports/trigger.html
312
+ skilltest eval ./skill --html ./reports/eval.html
313
+ skilltest check ./skill --json --html ./reports/check.html
314
+ ```
315
+
249
316
  Seeded trigger example:
250
317
 
251
318
  ```bash
@@ -312,6 +379,8 @@ jobs:
312
379
  with:
313
380
  node-version: "20"
314
381
  - run: npm ci
382
+ - run: npm run lint
383
+ - run: npm run test
315
384
  - run: npm run build
316
385
  - run: npx skilltest lint path/to/skill --json
317
386
  ```
@@ -347,14 +416,19 @@ jobs:
347
416
  ```bash
348
417
  npm install
349
418
  npm run lint
419
+ npm run test
350
420
  npm run build
351
421
  node dist/index.js --help
352
422
  ```
353
423
 
354
- Smoke tests:
424
+ `npm test` runs the Vitest suite. The tests are offline and do not call model
425
+ providers.
426
+
427
+ Manual CLI smoke tests:
355
428
 
356
429
  ```bash
357
430
  node dist/index.js lint test-fixtures/sample-skill/
431
+ node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
358
432
  node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
359
433
  node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
360
434
  node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json