skilltest 0.4.0 → 0.6.0

package/CLAUDE.md CHANGED
@@ -22,7 +22,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
  - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
  - `src/core/grader.ts`: structured grader prompt + JSON parse
  - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
- - `src/reporters/`: terminal rendering and JSON output helper
+ - `src/reporters/`: terminal, JSON, and HTML output helpers
  - `src/utils/`: filesystem and API key config helpers

  ## Build and Test Locally
@@ -73,6 +73,10 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  - `check` wraps lint + trigger + eval and enforces minimum thresholds:
  - trigger F1
  - eval assertion pass rate
+ - Trigger/eval work is concurrency-limited instead of fully unbounded:
+ - default concurrency is `5`
+ - `--concurrency 1` preserves the old sequential behavior
+ - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
  - JSON mode is strict:
  - no spinners
  - no colored output
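
The concurrency notes in the hunk above can be illustrated with a minimal TypeScript sketch. It is not skilltest's implementation: `mulberry32`, `planCases`, and `mapWithConcurrency` are hypothetical names, and the PRNG is only a stand-in for whatever seeded generator the tool uses. What it shows is that drawing every random choice before any request starts makes the sampled cases independent of how many requests are in flight.

```typescript
// Hypothetical sketch; none of these names come from the skilltest source.

// Small deterministic PRNG (mulberry32 variant), used here as a stand-in seeded RNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw all RNG-dependent choices up front, in input order, so request
// completion order cannot influence which fake skill each query gets.
function planCases(queries: readonly string[], fakeSkills: readonly string[], seed: number) {
  const rand = mulberry32(seed);
  return queries.map((query) => ({
    query,
    fakeSkill: fakeSkills[Math.floor(rand() * fakeSkills.length)],
  }));
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are stored by index, so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

With `limit` set to `1` the lanes collapse to a single sequential loop, which mirrors the `--concurrency 1` behavior described above.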
@@ -106,6 +110,4 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  ## Future Work (Not Implemented Yet)

  - Config file support (`.skilltestrc`)
- - Parallel execution
- - HTML reporting
  - Plugin linter rules
package/README.md CHANGED
@@ -67,6 +67,18 @@ Run full quality gate:
  skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
  ```

+ Write a self-contained HTML report:
+
+ ```bash
+ skilltest check ./path/to/skill --html ./reports/check.html
+ ```
+
+ Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
+ the old sequential execution order. Seeded trigger runs stay deterministic regardless
+ of concurrency.
+ All four commands also support `--html <path>` for an offline HTML report, and
+ `--json` can be used with `--html` in the same run.
+
  Example lint summary:

  ```text
@@ -75,6 +87,35 @@ target: ./test-fixtures/sample-skill
  summary: 29/29 checks passed, 0 warnings, 0 failures
  ```

+ ## Configuration
+
+ `skilltest` resolves config in this order:
+
+ 1. `.skilltestrc` in the target skill root
+ 2. `.skilltestrc` in the current working directory
+ 3. the nearest `package.json` containing `skilltestrc`
+
+ CLI flags override config values.
+
+ Example `.skilltestrc`:
+
+ ```json
+ {
+   "provider": "anthropic",
+   "model": "claude-sonnet-4-5-20250929",
+   "concurrency": 5,
+   "trigger": {
+     "numQueries": 20,
+     "threshold": 0.8,
+     "seed": 123
+   },
+   "eval": {
+     "numRuns": 5,
+     "threshold": 0.9
+   }
+ }
+ ```
+
  ## Commands

  ### `skilltest lint <path-to-skill>`
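
A rough sketch of the lookup order and override rule from the Configuration section added in the hunk above. The README does not say whether the sources are merged or whether the first one found wins; this sketch assumes first-found-wins. All type and function names are hypothetical, and the fallback values `5` and `20` are simply the flag defaults listed in the README.

```typescript
// Hypothetical types and helpers; only the lookup order and the
// "CLI flags override config values" rule come from the README.
interface SkilltestConfig {
  provider?: string;
  model?: string;
  concurrency?: number;
  trigger?: { numQueries?: number; threshold?: number; seed?: number };
  eval?: { numRuns?: number; threshold?: number };
}

interface TriggerFlags {
  seed?: number;
  concurrency?: number;
  numQueries?: number;
}

// Candidates are passed in README order:
//   1. .skilltestrc in the skill root
//   2. .skilltestrc in the current working directory
//   3. the nearest package.json containing `skilltestrc`
// `null` means that source was not found.
function pickConfig(candidates: Array<SkilltestConfig | null>): SkilltestConfig {
  // Assumption: the first source found wins; later sources are ignored.
  return candidates.find((c): c is SkilltestConfig => c !== null) ?? {};
}

// CLI flag first, then config value, then the documented default.
function resolveTriggerOptions(config: SkilltestConfig, flags: TriggerFlags) {
  return {
    concurrency: flags.concurrency ?? config.concurrency ?? 5,
    numQueries: flags.numQueries ?? config.trigger?.numQueries ?? 20,
    seed: flags.seed ?? config.trigger?.seed,
  };
}
```

Using `??` rather than `||` keeps an explicit `0` (for example `--seed 0`) from being treated as "not set".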
@@ -115,6 +156,10 @@ What it checks:
  - warns on provider-specific conventions such as `allowed-tools`
  - emits a likely compatibility summary

+ Flags:
+
+ - `--html <path>` write a self-contained HTML report
+
  ### `skilltest trigger <path-to-skill>`

  Measures trigger behavior for your skill description with model simulation.
@@ -128,12 +173,21 @@ Flow:
  - realistic fake skills
  4. Computes TP, TN, FP, FN, precision, recall, F1.

+ For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
+ terminal and JSON output include it so the run can be repeated exactly. If you use
+ `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
+ The fake-skill setup is precomputed before requests begin, so the same seed produces
+ the same trigger cases at any concurrency level.
+
  Flags:

  - `--model <model>` default: `claude-sonnet-4-5-20250929`
  - `--provider <anthropic|openai>` default: `anthropic`
  - `--queries <path>` use custom queries JSON
  - `--num-queries <n>` default: `20` (must be even)
+ - `--seed <number>` RNG seed for reproducible fake-skill sampling
+ - `--concurrency <n>` default: `5`
+ - `--html <path>` write a self-contained HTML report
  - `--save-queries <path>` save generated query set
  - `--api-key <key>` explicit key override
  - `--verbose` show full model decision text
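
Step 4 of the trigger flow in the hunk above lists precision, recall, and F1. As a reminder of how those follow from the confusion counts, here is a small sketch. The `Confusion` shape and the `triggerMetrics` name are hypothetical, and returning `0` on an empty denominator is an assumption rather than documented behavior.

```typescript
// Hypothetical shape for the TP/TN/FP/FN counts a trigger run produces.
interface Confusion {
  tp: number; // skill should trigger and did
  tn: number; // skill should not trigger and did not
  fp: number; // skill triggered when it should not have
  fn: number; // skill failed to trigger when it should have
}

function triggerMetrics({ tp, fp, fn }: Confusion) {
  // Assumption: an empty denominator yields 0 instead of NaN.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  // F1 is the harmonic mean of precision and recall.
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

For example, counts of 8 TP, 1 FP, and 2 FN give precision 8/9, recall 0.8, and F1 16/19 (about 0.84), which would clear the documented `--min-f1` default of `0.8`.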
@@ -155,6 +209,8 @@ Flags:
  - `--model <model>` default: `claude-sonnet-4-5-20250929`
  - `--grader-model <model>` default: same as `--model`
  - `--provider <anthropic|openai>` default: `anthropic`
+ - `--concurrency <n>` default: `5`
+ - `--html <path>` write a self-contained HTML report
  - `--save-results <path>` write full JSON result
  - `--api-key <key>` explicit key override
  - `--verbose` show full model responses
@@ -168,7 +224,8 @@ Default behavior:
  1. Run lint.
  2. Stop before model calls if lint has failures.
  3. Run trigger and eval only when lint passes.
- 4. Fail quality gate when either threshold is below target.
+ 4. When concurrency is greater than `1`, run trigger and eval in parallel.
+ 5. Fail quality gate when either threshold is below target.

  Flags:

@@ -178,7 +235,10 @@ Flags:
  - `--api-key <key>` explicit key override
  - `--queries <path>` custom trigger queries JSON
  - `--num-queries <n>` default: `20` (must be even)
+ - `--seed <number>` RNG seed for reproducible trigger sampling
  - `--prompts <path>` custom eval prompts JSON
+ - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
+ - `--html <path>` write a self-contained HTML report
  - `--min-f1 <n>` default: `0.8`
  - `--min-assert-pass-rate <n>` default: `0.9`
  - `--save-results <path>` save combined check result JSON
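
The five-step `check` flow and the threshold flags in the two hunks above can be sketched as follows. This is an illustration under assumptions, not the code in `src/core/check-runner.ts`: the result shapes, the `CheckDeps` interface, and `runCheck` are all hypothetical, and `Promise.all` is just one way to run the two stages in parallel.

```typescript
// Hypothetical result shapes and dependency hooks.
interface LintResult { failures: number }
interface TriggerResult { f1: number }
interface EvalResult { assertPassRate: number }

interface CheckDeps {
  lint: () => Promise<LintResult>;
  trigger: () => Promise<TriggerResult>;
  evaluate: () => Promise<EvalResult>;
}

interface CheckOptions {
  concurrency: number;       // documented default: 5
  minF1: number;             // documented default: 0.8
  minAssertPassRate: number; // documented default: 0.9
}

async function runCheck(deps: CheckDeps, opts: CheckOptions) {
  // Steps 1-2: lint first, and stop before any model call if it fails.
  const lint = await deps.lint();
  if (lint.failures > 0) return { passed: false, stage: "lint" as const };

  // Steps 3-4: run trigger and eval, in parallel when concurrency > 1.
  let trigger: TriggerResult;
  let evalResult: EvalResult;
  if (opts.concurrency > 1) {
    [trigger, evalResult] = await Promise.all([deps.trigger(), deps.evaluate()]);
  } else {
    trigger = await deps.trigger();
    evalResult = await deps.evaluate();
  }

  // Step 5: the gate fails when either metric is below its minimum.
  const passed =
    trigger.f1 >= opts.minF1 && evalResult.assertPassRate >= opts.minAssertPassRate;
  return { passed, stage: "gate" as const };
}
```

Passing the stages in as functions keeps the sketch free of provider details and makes the early exit on lint failures easy to check.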
@@ -240,6 +300,21 @@ skilltest eval ./skill --json
  skilltest check ./skill --json
  ```

+ HTML report examples:
+
+ ```bash
+ skilltest lint ./skill --html ./reports/lint.html
+ skilltest trigger ./skill --html ./reports/trigger.html
+ skilltest eval ./skill --html ./reports/eval.html
+ skilltest check ./skill --json --html ./reports/check.html
+ ```
+
+ Seeded trigger example:
+
+ ```bash
+ skilltest trigger ./skill --seed 123
+ ```
+
  ## API Keys

  Anthropic:
@@ -343,7 +418,9 @@ Smoke tests:

  ```bash
  node dist/index.js lint test-fixtures/sample-skill/
+ node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
  node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
+ node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
  node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
  node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
  ```