skilltest 0.2.0 → 0.4.0
- package/CLAUDE.md +8 -0
- package/README.md +39 -4
- package/dist/index.js +1160 -370
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -7,6 +7,7 @@
 - `lint`: static/offline quality checks
 - `trigger`: model-based triggerability testing
 - `eval`: end-to-end execution + grader-based scoring
+- `check`: lint + trigger + eval quality gates in one run
 
 The CLI is published as `skilltest` and built for `npx skilltest` usage.
 
@@ -18,6 +19,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/linter/`: lint check modules and orchestrator
 - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
 - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+- `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
 - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
 - `src/reporters/`: terminal rendering and JSON output helper
@@ -68,6 +70,9 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
 - Lint is fully offline and first-class.
 - Trigger/eval rely on the same provider abstraction.
+- `check` wraps lint + trigger + eval and enforces minimum thresholds:
+- trigger F1
+- eval assertion pass rate
 - JSON mode is strict:
 - no spinners
 - no colored output
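The trigger F1 threshold added in this hunk can be illustrated with the standard F1 definition (the harmonic mean of precision and recall). This is a minimal sketch, not code from `src/core/trigger-tester.ts`: the `TriggerCounts` shape and the `f1Score` and `passesF1Gate` names are hypothetical, and the package may count or round differently.

```typescript
// Hypothetical shape: outcome counts over a set of positive/negative trigger queries.
interface TriggerCounts {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

// Standard F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
// of precision and recall.
function f1Score(counts: TriggerCounts): number {
  const { truePositives, falsePositives, falseNegatives } = counts;
  const denominator = 2 * truePositives + falsePositives + falseNegatives;
  // With no positives predicted or expected, report 0 instead of NaN.
  return denominator === 0 ? 0 : (2 * truePositives) / denominator;
}

// Hypothetical gate: a score equal to the minimum is treated as passing.
function passesF1Gate(counts: TriggerCounts, minF1: number): boolean {
  return f1Score(counts) >= minF1;
}
```

Under these assumptions, 8 true positives with 2 false positives and 2 false negatives gives 16 / 20, which sits exactly on a 0.8 minimum.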
@@ -79,6 +84,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 ## Gotchas
 
 - `trigger --num-queries` must be even for balanced positive/negative cases.
+- `check` also requires even `--num-queries`.
+- `check` stops after lint failures unless `--continue-on-lint-fail` is set.
 - OpenAI provider is implemented via dynamic import so Anthropic-only installs do not crash if optional deps are skipped.
 - Frontmatter is validated with both `gray-matter` and `js-yaml`; malformed YAML should fail fast.
 - Keep file references relative to skill root; out-of-root refs are lint failures.
@@ -94,6 +101,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Compatibility hints: `src/core/linter/compat.ts`
 - Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
 - Eval grading schema: `src/core/grader.ts`
+- Combined quality gate orchestration: `src/core/check-runner.ts`
 
 ## Future Work (Not Implemented Yet)
 
package/README.md
CHANGED
@@ -23,7 +23,7 @@ Agent Skills are quick to write but hard to validate before deployment:
 - You cannot easily measure trigger precision/recall.
 - You do not know whether outputs are good until users exercise the skill.
 
-`skilltest` closes this gap with one CLI and
+`skilltest` closes this gap with one CLI and four modes.
 
 ## Install
 
@@ -61,12 +61,18 @@ End-to-end eval:
 skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
 ```
 
+Run full quality gate:
+
+```bash
+skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
+```
+
 Example lint summary:
 
 ```text
 skilltest lint
 target: ./test-fixtures/sample-skill
-summary:
+summary: 29/29 checks passed, 0 warnings, 0 failures
 ```
 
 ## Commands
@@ -153,6 +159,32 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
 
+### `skilltest check <path-to-skill>`
+
+Runs `lint + trigger + eval` in one command and applies quality thresholds.
+
+Default behavior:
+
+1. Run lint.
+2. Stop before model calls if lint has failures.
+3. Run trigger and eval only when lint passes.
+4. Fail quality gate when either threshold is below target.
+
+Flags:
+
+- `--provider <anthropic|openai>` default: `anthropic`
+- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+- `--grader-model <model>` default: same as resolved `--model`
+- `--api-key <key>` explicit key override
+- `--queries <path>` custom trigger queries JSON
+- `--num-queries <n>` default: `20` (must be even)
+- `--prompts <path>` custom eval prompts JSON
+- `--min-f1 <n>` default: `0.8`
+- `--min-assert-pass-rate <n>` default: `0.9`
+- `--save-results <path>` save combined check result JSON
+- `--continue-on-lint-fail` continue trigger/eval even if lint fails
+- `--verbose` include detailed trigger/eval sections
+
 ## Global Flags
 
 - `--help` show help
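The default behavior and threshold flags documented for `check` in this hunk can be sketched as two small pure functions. This is an illustrative sketch only: the type and function names are hypothetical and are not taken from `src/core/check-runner.ts`, and the treatment of a score exactly equal to its minimum as passing is an assumption based on the "below target" wording.

```typescript
// Hypothetical types; field names mirror the documented flags, not the real source.
interface Thresholds {
  minF1: number;             // --min-f1 (documented default 0.8)
  minAssertPassRate: number; // --min-assert-pass-rate (documented default 0.9)
}

interface CheckResults {
  lintFailures: number;
  triggerF1?: number;        // undefined when the trigger stage was skipped
  assertPassRate?: number;   // undefined when the eval stage was skipped
}

// Steps 2 and 3: model-backed stages run only when lint passes,
// unless --continue-on-lint-fail is set.
function shouldRunModelStages(lintFailures: number, continueOnLintFail: boolean): boolean {
  return lintFailures === 0 || continueOnLintFail;
}

// Step 4: collect every reason the gate fails; an empty list means it passed.
function gateFailures(results: CheckResults, thresholds: Thresholds): string[] {
  const reasons: string[] = [];
  if (results.lintFailures > 0) {
    reasons.push(`lint: ${results.lintFailures} failure(s)`);
  }
  if (results.triggerF1 !== undefined && results.triggerF1 < thresholds.minF1) {
    reasons.push(`trigger F1 ${results.triggerF1} is below ${thresholds.minF1}`);
  }
  if (
    results.assertPassRate !== undefined &&
    results.assertPassRate < thresholds.minAssertPassRate
  ) {
    reasons.push(
      `assertion pass rate ${results.assertPassRate} is below ${thresholds.minAssertPassRate}`
    );
  }
  return reasons;
}
```

Keeping the decision logic separate from the stage runners means the gate can be exercised without any provider calls.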
@@ -195,8 +227,8 @@ Eval prompts (`--prompts`):
 
 Exit codes:
 
-- `0`: success
-- `1`: lint
+- `0`: success
+- `1`: quality gate failed (`lint`, `check` thresholds, or command-specific failure conditions)
 - `2`: runtime/config/API/parse error
 
 JSON mode examples:
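A wrapper script that consumes the exit codes listed in this hunk might classify them as follows. This is a hedged sketch: the `ExitOutcome` type and `classifyExitCode` function are hypothetical helpers for a caller, not part of `skilltest`, and any code outside the three documented values is simply reported as unknown.

```typescript
// Hypothetical labels for the three documented exit codes.
type ExitOutcome = "success" | "quality-gate-failed" | "runtime-error" | "unknown";

// Map a process exit code to the meaning given in the README's exit code list.
function classifyExitCode(code: number): ExitOutcome {
  switch (code) {
    case 0:
      return "success";
    case 1:
      // Lint failures, `check` thresholds, or command-specific failure conditions.
      return "quality-gate-failed";
    case 2:
      // Runtime, config, API, or parse error.
      return "runtime-error";
    default:
      return "unknown";
  }
}
```

Separating code 1 from code 2 lets a CI job distinguish a skill that needs work from a run that should be retried or reconfigured.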
@@ -205,6 +237,7 @@ JSON mode examples:
 skilltest lint ./skill --json
 skilltest trigger ./skill --json
 skilltest eval ./skill --json
+skilltest check ./skill --json
 ```
 
 ## API Keys
@@ -294,6 +327,7 @@ jobs:
 - run: npm run build
 - run: npx skilltest trigger path/to/skill --num-queries 20 --json
 - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
+- run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
 ```
 
 ## Local Development
@@ -311,6 +345,7 @@ Smoke tests:
 node dist/index.js lint test-fixtures/sample-skill/
 node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
+node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```
 
 ## Release Checklist