skilltest 0.2.0 → 0.4.0

package/CLAUDE.md CHANGED
@@ -7,6 +7,7 @@
  - `lint`: static/offline quality checks
  - `trigger`: model-based triggerability testing
  - `eval`: end-to-end execution + grader-based scoring
+ - `check`: lint + trigger + eval quality gates in one run
 
  The CLI is published as `skilltest` and built for `npx skilltest` usage.
 
@@ -18,6 +19,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
  - `src/core/linter/`: lint check modules and orchestrator
  - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
  - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+ - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
  - `src/core/grader.ts`: structured grader prompt + JSON parse
  - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
  - `src/reporters/`: terminal rendering and JSON output helper
@@ -68,6 +70,9 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
  - Lint is fully offline and first-class.
  - Trigger/eval rely on the same provider abstraction.
+ - `check` wraps lint + trigger + eval and enforces minimum thresholds:
+ - trigger F1
+ - eval assertion pass rate
  - JSON mode is strict:
  - no spinners
  - no colored output
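The two thresholds added in this hunk can be illustrated with a minimal sketch. F1 is the harmonic mean of precision and recall, which simplifies to 2TP / (2TP + FP + FN). The type and function names below are hypothetical and are not taken from `check-runner.ts`. The defaults `0.8` and `0.9` match the README flags in this diff, while the inclusive `>=` comparison and the zero-denominator convention are assumptions.

```typescript
// Hypothetical shape; skilltest's real result types are not shown in this diff.
interface TriggerCounts {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

// Standard F1: harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN).
function f1Score(counts: TriggerCounts): number {
  const { truePositives: tp, falsePositives: fp, falseNegatives: fn } = counts;
  const denominator = 2 * tp + fp + fn;
  // Assumed convention: with no positives predicted or expected, report 0.
  return denominator === 0 ? 0 : (2 * tp) / denominator;
}

// Fraction of eval assertions that passed (assumed to be 0 when there are none).
function assertionPassRate(passed: number, total: number): number {
  return total === 0 ? 0 : passed / total;
}

// Assumption: a metric exactly equal to its threshold passes the gate.
function passesGates(
  f1: number,
  passRate: number,
  minF1 = 0.8,
  minAssertPassRate = 0.9,
): boolean {
  return f1 >= minF1 && passRate >= minAssertPassRate;
}
```

For example, 8 true positives with 2 false positives and 2 false negatives give 16 / 20 = 0.8, which only just meets the default `--min-f1` under the inclusive comparison assumed here.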
@@ -79,6 +84,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  ## Gotchas
 
  - `trigger --num-queries` must be even for balanced positive/negative cases.
+ - `check` also requires even `--num-queries`.
+ - `check` stops after lint failures unless `--continue-on-lint-fail` is set.
  - OpenAI provider is implemented via dynamic import so Anthropic-only installs do not crash if optional deps are skipped.
  - Frontmatter is validated with both `gray-matter` and `js-yaml`; malformed YAML should fail fast.
  - Keep file references relative to skill root; out-of-root refs are lint failures.
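The even `--num-queries` requirement follows from splitting the budget into equal positive and negative cases. A rough sketch of such a validation is shown below. The function name `splitQueryBudget` and the error message are hypothetical, and rejecting non-positive values is an assumption that the diff does not state.

```typescript
// Hypothetical helper; the real validation in skilltest is not shown in this diff.
function splitQueryBudget(numQueries: number): { positive: number; negative: number } {
  // Assumption: zero and negative budgets are rejected along with odd ones.
  if (!Number.isInteger(numQueries) || numQueries <= 0 || numQueries % 2 !== 0) {
    throw new Error(`--num-queries must be a positive even integer, got ${numQueries}`);
  }
  // An even budget splits into equal positive and negative halves.
  const half = numQueries / 2;
  return { positive: half, negative: half };
}
```

Under this sketch, the smoke-test value `--num-queries 2` yields one positive and one negative query.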
@@ -94,6 +101,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  - Compatibility hints: `src/core/linter/compat.ts`
  - Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
  - Eval grading schema: `src/core/grader.ts`
+ - Combined quality gate orchestration: `src/core/check-runner.ts`
 
  ## Future Work (Not Implemented Yet)
 
package/README.md CHANGED
@@ -23,7 +23,7 @@ Agent Skills are quick to write but hard to validate before deployment:
  - You cannot easily measure trigger precision/recall.
  - You do not know whether outputs are good until users exercise the skill.
 
- `skilltest` closes this gap with one CLI and three modes.
+ `skilltest` closes this gap with one CLI and four modes.
 
  ## Install
 
@@ -61,12 +61,18 @@ End-to-end eval:
  skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
  ```
 
+ Run full quality gate:
+
+ ```bash
+ skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
+ ```
+
  Example lint summary:
 
  ```text
  skilltest lint
  target: ./test-fixtures/sample-skill
- summary: 25/25 checks passed, 0 warnings, 0 failures
+ summary: 29/29 checks passed, 0 warnings, 0 failures
  ```
 
  ## Commands
  ## Commands
@@ -153,6 +159,32 @@ Flags:
  - `--api-key <key>` explicit key override
  - `--verbose` show full model responses
 
+ ### `skilltest check <path-to-skill>`
+
+ Runs `lint + trigger + eval` in one command and applies quality thresholds.
+
+ Default behavior:
+
+ 1. Run lint.
+ 2. Stop before model calls if lint has failures.
+ 3. Run trigger and eval only when lint passes.
+ 4. Fail quality gate when either threshold is below target.
+
+ Flags:
+
+ - `--provider <anthropic|openai>` default: `anthropic`
+ - `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+ - `--grader-model <model>` default: same as resolved `--model`
+ - `--api-key <key>` explicit key override
+ - `--queries <path>` custom trigger queries JSON
+ - `--num-queries <n>` default: `20` (must be even)
+ - `--prompts <path>` custom eval prompts JSON
+ - `--min-f1 <n>` default: `0.8`
+ - `--min-assert-pass-rate <n>` default: `0.9`
+ - `--save-results <path>` save combined check result JSON
+ - `--continue-on-lint-fail` continue trigger/eval even if lint fails
+ - `--verbose` include detailed trigger/eval sections
+
  ## Global Flags
 
  - `--help` show help
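The four-step default behavior documented for `check` can be sketched as a small control-flow function. This is an illustration under stated assumptions and not the contents of `src/core/check-runner.ts`. All interface and function names are hypothetical, the stage callbacks are synchronous only for brevity, and the rule that a lint failure still fails the gate under `--continue-on-lint-fail` is an assumption.

```typescript
// Hypothetical stage callbacks standing in for the real lint/trigger/eval runners.
// The real trigger and eval stages call an LLM provider and would be async;
// they are synchronous here only to keep the control flow easy to read.
interface CheckStages {
  lint: () => { failures: number };
  trigger: () => { f1: number };
  evaluate: () => { assertPassRate: number };
}

interface CheckOptions {
  minF1: number;
  minAssertPassRate: number;
  continueOnLintFail: boolean;
}

interface CheckOutcome {
  lintPassed: boolean;
  modelStagesRan: boolean;
  gatePassed: boolean;
}

function runCheck(stages: CheckStages, options: CheckOptions): CheckOutcome {
  // Step 1: lint always runs first and needs no model calls.
  const lintPassed = stages.lint().failures === 0;

  // Step 2: stop before any model call unless the caller opted to continue.
  if (!lintPassed && !options.continueOnLintFail) {
    return { lintPassed, modelStagesRan: false, gatePassed: false };
  }

  // Step 3: run the model-backed stages.
  const { f1 } = stages.trigger();
  const { assertPassRate } = stages.evaluate();

  // Step 4: apply both thresholds. Assumed here: a lint failure still fails
  // the gate even when the later stages were allowed to run.
  const gatePassed =
    lintPassed && f1 >= options.minF1 && assertPassRate >= options.minAssertPassRate;
  return { lintPassed, modelStagesRan: true, gatePassed };
}
```

Returning early before the trigger and eval callbacks matches the stated goal of stopping before model calls, so a skill that fails lint spends no API budget by default.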
@@ -195,8 +227,8 @@ Eval prompts (`--prompts`):
 
  Exit codes:
 
- - `0`: success with no lint failures
- - `1`: lint failures present
+ - `0`: success
+ - `1`: quality gate failed (`lint`, `check` thresholds, or command-specific failure conditions)
  - `2`: runtime/config/API/parse error
 
  JSON mode examples:
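The revised exit-code contract separates a failed quality gate (`1`) from a runtime, config, API, or parse error (`2`). One way to picture that mapping is the sketch below. The function name `exitCodeFor` is hypothetical, and treating any thrown error as exit code `2` is an assumption about how the CLI classifies errors.

```typescript
// Hypothetical mapping from a command run to the documented exit codes.
// `run` returns true when the quality gate passed and throws on runtime,
// config, API, or parse errors (assumed error model).
function exitCodeFor(run: () => boolean): 0 | 1 | 2 {
  try {
    return run() ? 0 : 1; // 0: success, 1: quality gate failed
  } catch {
    return 2; // 2: runtime/config/API/parse error
  }
}
```

Keeping `1` and `2` distinct lets a CI job tell a skill that missed its thresholds apart from a misconfigured or failing run.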
@@ -205,6 +237,7 @@ JSON mode examples:
  skilltest lint ./skill --json
  skilltest trigger ./skill --json
  skilltest eval ./skill --json
+ skilltest check ./skill --json
  ```
 
  ## API Keys
@@ -294,6 +327,7 @@ jobs:
  - run: npm run build
  - run: npx skilltest trigger path/to/skill --num-queries 20 --json
  - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
+ - run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
  ```
 
  ## Local Development
@@ -311,6 +345,7 @@ Smoke tests:
  node dist/index.js lint test-fixtures/sample-skill/
  node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
  node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
+ node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
  ```
 
  ## Release Checklist