skilltest 0.8.0 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +8 -1
- package/README.md +175 -12
- package/dist/index.js +1325 -79
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -19,12 +19,15 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/linter/`: lint check modules and orchestrator
 - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
 - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+- `src/core/tool-environment.ts`: mock tool environment + agentic loop for tool-aware eval
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
-- `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
+- `src/providers/`: LLM provider abstraction (`sendMessage`, `sendWithTools`) and provider implementations
 - `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers
 
+Eval supports optional mock tool environments for testing skills that invoke tools.
+
 ## Build and Test Locally
 
 Install deps:
@@ -68,6 +71,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 
 - Minimal provider interface:
   - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
+  - `sendWithTools(systemPrompt, messages, { model, tools }) => ProviderToolResponse`
 - Lint is fully offline and first-class.
 - Trigger/eval rely on the same provider abstraction.
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
@@ -78,6 +82,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - `--concurrency 1` preserves the old sequential behavior
 - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
 - Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
+- Tool-aware eval uses mock responses only. No real tool execution happens during eval.
+- Tool assertions are evaluated structurally, without the grader model, so those checks stay deterministic.
 - JSON mode is strict:
   - no spinners
   - no colored output
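The "structural, without the grader model" evaluation of tool assertions mentioned in the hunk above could look roughly like the following sketch. This is an assumed reading of the three assertion types shown later in the README diff; `checkToolAssertion` and the type names are hypothetical, not skilltest's actual code.

```typescript
// Hypothetical sketch: grade toolAssertions purely from the recorded tool
// calls, with no model involved, so the result is deterministic per transcript.
type ToolAssertion =
  | { type: "tool_called"; toolName: string }
  | { type: "tool_not_called"; toolName: string }
  | { type: "tool_argument_match"; toolName: string; expectedArgs: Record<string, unknown> };

interface RecordedCall {
  toolName: string;
  args: Record<string, unknown>;
}

function checkToolAssertion(assertion: ToolAssertion, calls: RecordedCall[]): boolean {
  const matching = calls.filter((c) => c.toolName === assertion.toolName);
  switch (assertion.type) {
    case "tool_called":
      return matching.length > 0;
    case "tool_not_called":
      return matching.length === 0;
    case "tool_argument_match":
      // Pass when any call to the tool supplied every expected argument value.
      return matching.some((c) =>
        Object.entries(assertion.expectedArgs).every(([k, v]) => c.args[k] === v),
      );
  }
  return false; // unreachable for the union above
}
```

Because each check is a pure function of the recorded calls, rerunning the same transcript always grades the same way, which is what makes these checks deterministic while the text assertions still go through the grader model.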
@@ -106,5 +112,6 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Compatibility hints: `src/core/linter/compat.ts`
 - Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
 - Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
+- Mock tool environment + agentic loop: `src/core/tool-environment.ts`
 - Eval grading schema: `src/core/grader.ts`
 - Combined quality gate orchestration: `src/core/check-runner.ts`
package/README.md
CHANGED
@@ -4,7 +4,7 @@
 [](./LICENSE)
 [](#cicd-integration)
 
-The testing framework for Agent Skills. Lint, test triggering, and evaluate your SKILL.md files.
+The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your `SKILL.md` files.
 
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
 
@@ -12,12 +12,6 @@ The repository itself uses a fast Vitest suite for offline unit and integration
 coverage of the parser, linters, trigger math, config resolution, reporters,
 and linter orchestration.
 
-## Demo
-
-GIF coming soon.
-
-<!--  -->
-
 ## Why skilltest?
 
 Agent Skills are quick to write but hard to validate before deployment:
@@ -27,7 +21,7 @@ Agent Skills are quick to write but hard to validate before deployment:
 - You cannot easily measure trigger precision/recall.
 - You do not know whether outputs are good until users exercise the skill.
 
-`skilltest` closes this gap with one CLI and
+`skilltest` closes this gap with one CLI and five modes.
 
 ## Install
 
@@ -71,6 +65,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```
 
+Propose a verified rewrite without touching the source file:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic
+```
+
+Apply the verified rewrite in place:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic --apply
+```
+
 Write a self-contained HTML report:
 
 ```bash
@@ -80,8 +86,8 @@ skilltest check ./path/to/skill --html ./reports/check.html
 Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
 the old sequential execution order. Seeded trigger runs stay deterministic regardless
 of concurrency.
-
-
+`lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports.
+`improve` is terminal/JSON only in v1.
 
 Example lint summary:
 
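The concurrency behavior described in the hunk above (parallel dispatch, with `--concurrency 1` reproducing sequential execution and results staying in input order) can be sketched as a generic worker pool. This is an illustrative pattern under those stated assumptions, not skilltest's actual implementation.

```typescript
// Illustrative bounded-concurrency map (assumes limit >= 1). With limit = 1 it
// degenerates to strictly sequential execution; results are written into their
// input-order slots either way, so concurrency never reorders the output.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unclaimed index and fills its slot.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => worker()));
  return results;
}
```

Note that deterministic output order is separate from deterministic sampling: the seeded trigger setup must still be computed before any requests are dispatched, which is exactly what the CLAUDE.md notes above call out.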
@@ -115,7 +121,8 @@ Example `.skilltestrc`:
   },
   "eval": {
     "numRuns": 5,
-    "threshold": 0.9
+    "threshold": 0.9,
+    "maxToolIterations": 10
   }
 }
 ```
@@ -309,6 +316,70 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
 
+Config-only eval setting:
+
+- `eval.maxToolIterations` default: `10` safety cap for tool-aware eval loops
+
+### Tool-Aware Eval
+
+When an eval prompt defines `tools`, `skilltest` runs the prompt in a mock tool
+environment instead of plain text-only execution. The model can call the mocked
+tools during eval, and `skilltest` records the calls alongside the normal grader
+assertions.
+
+Tool responses are always mocked. `skilltest` does not execute real tools,
+scripts, shell commands, MCP servers, or APIs during eval.
+
+Example prompt file:
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention the missing rollback plan"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
+Run it with:
+
+```bash
+skilltest eval ./my-skill --prompts ./eval-prompts.json
+```
+
 ### `skilltest check <path-to-skill>`
 
 Runs `lint + trigger + eval` in one command and applies quality thresholds.
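The `responses` map in the example above pairs a serialized-arguments key with a canned reply, plus a `"*"` wildcard fallback. A minimal sketch of that lookup, under assumed semantics (`lookupMockResponse` is a hypothetical helper, not skilltest's API):

```typescript
// Hypothetical resolution of a mock tool's "responses" map: try an exact
// match on the JSON-serialized call arguments, then fall back to "*".
// Caveat of this sketch: JSON.stringify is key-order sensitive, so response
// keys must serialize arguments in the same property order the model emits.
interface MockTool {
  name: string;
  responses: Record<string, string>;
}

function lookupMockResponse(tool: MockTool, args: Record<string, unknown>): string {
  const key = JSON.stringify(args); // e.g. {"path":"checklist.md"}
  if (key in tool.responses) return tool.responses[key];
  if ("*" in tool.responses) return tool.responses["*"];
  return `[mock] No response configured for ${tool.name}`;
}

const readFile: MockTool = {
  name: "read_file",
  responses: {
    '{"path":"checklist.md"}': "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
    "*": "[mock] File not found",
  },
};
// An exact-args hit returns the checklist; any other path hits the wildcard.
```

This keeps the whole tool environment offline: the agentic loop feeds these canned strings back to the model instead of executing anything.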
@@ -341,6 +412,52 @@ Flags:
 - `--continue-on-lint-fail` continue trigger/eval even if lint fails
 - `--verbose` include detailed trigger/eval sections
 
+### `skilltest improve <path-to-skill>`
+
+Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally
+applies it.
+
+Default behavior:
+
+1. Run a baseline `check` with `continue-on-lint-fail=true`.
+2. Freeze the exact trigger queries and eval prompts used in that baseline run.
+3. Ask the model for a structured JSON rewrite of `SKILL.md`.
+4. Rebuild and validate the candidate locally:
+   - must stay parseable
+   - must keep the same skill `name`
+   - must keep the current `license` when one already exists
+   - must not introduce broken relative references
+5. Verify the candidate by rerunning `check` against a copied skill directory with
+   the frozen trigger/eval inputs.
+6. Only write files when the candidate measurably improves the skill and passes the
+   configured quality gates.
+
+Flags:
+
+- `--provider <anthropic|openai>` default: `anthropic`
+- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+- `--api-key <key>` explicit key override
+- `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
+- `--num-queries <n>` default: `20` (must be even when auto-generating)
+- `--seed <number>` RNG seed for reproducible trigger sampling
+- `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+- `--concurrency <n>` default: `5`
+- `--output <path>` write the verified candidate `SKILL.md` to a separate file
+- `--save-results <path>` save full improve result JSON
+- `--min-f1 <n>` default: `0.8`
+- `--min-assert-pass-rate <n>` default: `0.9`
+- `--apply` write the verified rewrite back to the source `SKILL.md`
+- `--verbose` include full baseline and verification reports
+
+Notes:
+
+- `improve` is dry-run by default.
+- `--apply` only writes when parse, lint, trigger, and eval verification all pass.
+- Before/after metrics are measured against the same generated or user-supplied
+  trigger queries and eval prompts, not a fresh sample.
+
 ## Global Flags
 
 - `--help` show help
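The accept/reject step of `improve` (write only when the candidate passes the gates and measurably improves on the frozen inputs) could be sketched as below. The metric names and the exact reading of "measurably improves" (no regression on either metric, strict gain on at least one) are assumptions for illustration, not skilltest's precise rule.

```typescript
// Hypothetical improve gate: compare baseline vs. candidate metrics measured
// on the SAME frozen trigger queries and eval prompts, then apply thresholds.
interface CheckMetrics {
  f1: number;             // trigger F1 on the frozen query set
  assertPassRate: number; // eval assertion pass rate on the frozen prompts
}

function shouldApply(
  baseline: CheckMetrics,
  candidate: CheckMetrics,
  gates = { minF1: 0.8, minAssertPassRate: 0.9 }, // mirrors --min-f1 / --min-assert-pass-rate defaults
): boolean {
  const passesGates =
    candidate.f1 >= gates.minF1 &&
    candidate.assertPassRate >= gates.minAssertPassRate;
  // Assumed reading of "measurably improves": no metric regresses and at
  // least one strictly improves.
  const improves =
    candidate.f1 >= baseline.f1 &&
    candidate.assertPassRate >= baseline.assertPassRate &&
    (candidate.f1 > baseline.f1 || candidate.assertPassRate > baseline.assertPassRate);
  return passesGates && improves;
}
```

Freezing the inputs is what makes this comparison meaningful: resampling queries or prompts between the baseline and verification runs would confound the rewrite's effect with sampling noise.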
@@ -379,12 +496,56 @@ Eval prompts (`--prompts`):
 ]
 ```
 
+Tool-aware eval prompts (`--prompts`):
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention remediation steps"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
 ## Output and Exit Codes
 
 Exit codes:
 
 - `0`: success
-- `1`: quality gate failed (`lint`, `check`
+- `1`: quality gate failed (`lint`, `check`, `improve` blocked, or other command-specific failure conditions)
 - `2`: runtime/config/API/parse error
 
 JSON mode examples:
@@ -394,6 +555,7 @@ skilltest lint ./skill --json
 skilltest trigger ./skill --json
 skilltest eval ./skill --json
 skilltest check ./skill --json
+skilltest improve ./skill --json
 ```
 
 HTML report examples:
@@ -525,6 +687,7 @@ node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
 node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
+node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```
 
 ## Release Checklist