skilltest 0.8.0 → 0.10.0

This diff shows the changes between publicly released versions of the package, as published to one of the supported registries, and is provided for informational purposes only.
package/CLAUDE.md CHANGED
@@ -19,12 +19,15 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
  - `src/core/linter/`: lint check modules and orchestrator
  - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
  - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+ - `src/core/tool-environment.ts`: mock tool environment + agentic loop for tool-aware eval
  - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
  - `src/core/grader.ts`: structured grader prompt + JSON parse
- - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
+ - `src/providers/`: LLM provider abstraction (`sendMessage`, `sendWithTools`) and provider implementations
  - `src/reporters/`: terminal, JSON, and HTML output helpers
  - `src/utils/`: filesystem and API key config helpers
 
+ Eval supports optional mock tool environments for testing skills that invoke tools.
+
  ## Build and Test Locally
 
  Install deps:
@@ -68,6 +71,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 
  - Minimal provider interface:
  - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
+ - `sendWithTools(systemPrompt, messages, { model, tools }) => ProviderToolResponse`
  - Lint is fully offline and first-class.
  - Trigger/eval rely on the same provider abstraction.
  - `check` wraps lint + trigger + eval and enforces minimum thresholds:
@@ -78,6 +82,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  - `--concurrency 1` preserves the old sequential behavior
  - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
  - Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
+ - Tool-aware eval uses mock responses only. No real tool execution happens during eval.
+ - Tool assertions are evaluated structurally, without the grader model, so those checks stay deterministic.
  - JSON mode is strict:
  - no spinners
  - no colored output
@@ -106,5 +112,6 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
  - Compatibility hints: `src/core/linter/compat.ts`
  - Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
  - Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
+ - Mock tool environment + agentic loop: `src/core/tool-environment.ts`
  - Eval grading schema: `src/core/grader.ts`
  - Combined quality gate orchestration: `src/core/check-runner.ts`
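The mock tool environment diffed above resolves tool calls from a static `responses` map keyed by the serialized arguments, with `"*"` as a catch-all (per the README examples in this release). A minimal TypeScript sketch of that lookup, for illustration only — the name `resolveMockResponse` and the sorted-key normalization are assumptions, not the package's actual internals:

```typescript
// Hypothetical sketch: exact match on JSON-serialized arguments,
// falling back to the "*" wildcard entry when no key matches.
type MockResponses = Record<string, string>;

function resolveMockResponse(
  responses: MockResponses,
  args: Record<string, unknown>
): string {
  // Serialize with sorted keys so argument order does not affect matching.
  const key = JSON.stringify(
    Object.fromEntries(Object.entries(args).sort(([a], [b]) => a.localeCompare(b)))
  );
  if (key in responses) return responses[key];
  if ("*" in responses) return responses["*"];
  throw new Error(`No mock response configured for arguments: ${key}`);
}

// Example mirroring the read_file mock from the README diff below.
const readFileResponses: MockResponses = {
  '{"path":"checklist.md"}':
    "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
  "*": "[mock] File not found",
};

console.log(resolveMockResponse(readFileResponses, { path: "checklist.md" })); // checklist contents
console.log(resolveMockResponse(readFileResponses, { path: "other.md" })); // prints "[mock] File not found"
```

Keying on a canonical serialization keeps the mock deterministic across runs, which is what makes tool-aware eval results reproducible.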
package/README.md CHANGED
@@ -4,7 +4,7 @@
  [![License](https://img.shields.io/badge/license-MIT-green)](./LICENSE)
  [![CI](https://img.shields.io/badge/ci-placeholder-lightgrey)](#cicd-integration)
 
- The testing framework for Agent Skills. Lint, test triggering, and evaluate your SKILL.md files.
+ The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your `SKILL.md` files.
 
  `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
 
@@ -12,12 +12,6 @@ The repository itself uses a fast Vitest suite for offline unit and integration
  coverage of the parser, linters, trigger math, config resolution, reporters,
  and linter orchestration.
 
- ## Demo
-
- GIF coming soon.
-
- <!-- ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon) -->
-
  ## Why skilltest?
 
  Agent Skills are quick to write but hard to validate before deployment:
@@ -27,7 +21,7 @@ Agent Skills are quick to write but hard to validate before deployment:
  - You cannot easily measure trigger precision/recall.
  - You do not know whether outputs are good until users exercise the skill.
 
- `skilltest` closes this gap with one CLI and four modes.
+ `skilltest` closes this gap with one CLI and five modes.
 
  ## Install
 
@@ -71,6 +65,18 @@ Run full quality gate:
  skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
  ```
 
+ Propose a verified rewrite without touching the source file:
+
+ ```bash
+ skilltest improve ./path/to/skill --provider anthropic
+ ```
+
+ Apply the verified rewrite in place:
+
+ ```bash
+ skilltest improve ./path/to/skill --provider anthropic --apply
+ ```
+
  Write a self-contained HTML report:
 
  ```bash
@@ -80,8 +86,8 @@ skilltest check ./path/to/skill --html ./reports/check.html
  Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
  the old sequential execution order. Seeded trigger runs stay deterministic regardless
  of concurrency.
- All four commands also support `--html <path>` for an offline HTML report, and
- `--json` can be used with `--html` in the same run.
+ `lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports.
+ `improve` is terminal/JSON only in v1.
 
  Example lint summary:
 
@@ -115,7 +121,8 @@ Example `.skilltestrc`:
  },
  "eval": {
  "numRuns": 5,
- "threshold": 0.9
+ "threshold": 0.9,
+ "maxToolIterations": 10
  }
  }
  ```
@@ -309,6 +316,70 @@ Flags:
  - `--api-key <key>` explicit key override
  - `--verbose` show full model responses
 
+ Config-only eval setting:
+
+ - `eval.maxToolIterations` default: `10` safety cap for tool-aware eval loops
+
+ ### Tool-Aware Eval
+
+ When an eval prompt defines `tools`, `skilltest` runs the prompt in a mock tool
+ environment instead of plain text-only execution. The model can call the mocked
+ tools during eval, and `skilltest` records the calls alongside the normal grader
+ assertions.
+
+ Tool responses are always mocked. `skilltest` does not execute real tools,
+ scripts, shell commands, MCP servers, or APIs during eval.
+
+ Example prompt file:
+
+ ```json
+ [
+ {
+ "prompt": "Parse this deployment checklist and tell me what is missing.",
+ "assertions": ["output should mention the missing rollback plan"],
+ "tools": [
+ {
+ "name": "read_file",
+ "description": "Read a file from the workspace",
+ "parameters": [
+ { "name": "path", "type": "string", "description": "File path to read", "required": true }
+ ],
+ "responses": {
+ "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+ "*": "[mock] File not found"
+ }
+ },
+ {
+ "name": "run_script",
+ "description": "Execute a shell script",
+ "parameters": [
+ { "name": "command", "type": "string", "description": "Command to run", "required": true }
+ ],
+ "responses": {
+ "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+ }
+ }
+ ],
+ "toolAssertions": [
+ { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+ { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+ {
+ "type": "tool_argument_match",
+ "toolName": "read_file",
+ "expectedArgs": { "path": "checklist.md" },
+ "description": "Model should read checklist.md specifically"
+ }
+ ]
+ }
+ ]
+ ```
+
+ Run it with:
+
+ ```bash
+ skilltest eval ./my-skill --prompts ./eval-prompts.json
+ ```
+
  ### `skilltest check <path-to-skill>`
 
  Runs `lint + trigger + eval` in one command and applies quality thresholds.
@@ -341,6 +412,52 @@ Flags:
  - `--continue-on-lint-fail` continue trigger/eval even if lint fails
  - `--verbose` include detailed trigger/eval sections
 
+ ### `skilltest improve <path-to-skill>`
+
+ Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally
+ applies it.
+
+ Default behavior:
+
+ 1. Run a baseline `check` with `continue-on-lint-fail=true`.
+ 2. Freeze the exact trigger queries and eval prompts used in that baseline run.
+ 3. Ask the model for a structured JSON rewrite of `SKILL.md`.
+ 4. Rebuild and validate the candidate locally:
+ - must stay parseable
+ - must keep the same skill `name`
+ - must keep the current `license` when one already exists
+ - must not introduce broken relative references
+ 5. Verify the candidate by rerunning `check` against a copied skill directory with
+ the frozen trigger/eval inputs.
+ 6. Only write files when the candidate measurably improves the skill and passes the
+ configured quality gates.
+
+ Flags:
+
+ - `--provider <anthropic|openai>` default: `anthropic`
+ - `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+ - `--api-key <key>` explicit key override
+ - `--queries <path>` custom trigger queries JSON
+ - `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
+ - `--num-queries <n>` default: `20` (must be even when auto-generating)
+ - `--seed <number>` RNG seed for reproducible trigger sampling
+ - `--prompts <path>` custom eval prompts JSON
+ - `--plugin <path>` load a custom lint plugin file (repeatable)
+ - `--concurrency <n>` default: `5`
+ - `--output <path>` write the verified candidate `SKILL.md` to a separate file
+ - `--save-results <path>` save full improve result JSON
+ - `--min-f1 <n>` default: `0.8`
+ - `--min-assert-pass-rate <n>` default: `0.9`
+ - `--apply` write the verified rewrite back to the source `SKILL.md`
+ - `--verbose` include full baseline and verification reports
+
+ Notes:
+
+ - `improve` is dry-run by default.
+ - `--apply` only writes when parse, lint, trigger, and eval verification all pass.
+ - Before/after metrics are measured against the same generated or user-supplied
+ trigger queries and eval prompts, not a fresh sample.
+
  ## Global Flags
 
  - `--help` show help
@@ -379,12 +496,56 @@ Eval prompts (`--prompts`):
  ]
  ```
 
+ Tool-aware eval prompts (`--prompts`):
+
+ ```json
+ [
+ {
+ "prompt": "Parse this deployment checklist and tell me what is missing.",
+ "assertions": ["output should mention remediation steps"],
+ "tools": [
+ {
+ "name": "read_file",
+ "description": "Read a file from the workspace",
+ "parameters": [
+ { "name": "path", "type": "string", "description": "File path to read", "required": true }
+ ],
+ "responses": {
+ "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+ "*": "[mock] File not found"
+ }
+ },
+ {
+ "name": "run_script",
+ "description": "Execute a shell script",
+ "parameters": [
+ { "name": "command", "type": "string", "description": "Command to run", "required": true }
+ ],
+ "responses": {
+ "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+ }
+ }
+ ],
+ "toolAssertions": [
+ { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+ { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+ {
+ "type": "tool_argument_match",
+ "toolName": "read_file",
+ "expectedArgs": { "path": "checklist.md" },
+ "description": "Model should read checklist.md specifically"
+ }
+ ]
+ }
+ ]
+ ```
+
  ## Output and Exit Codes
 
  Exit codes:
 
  - `0`: success
- - `1`: quality gate failed (`lint`, `check` thresholds, or command-specific failure conditions)
+ - `1`: quality gate failed (`lint`, `check`, `improve` blocked, or other command-specific failure conditions)
  - `2`: runtime/config/API/parse error
 
  JSON mode examples:
@@ -394,6 +555,7 @@ skilltest lint ./skill --json
  skilltest trigger ./skill --json
  skilltest eval ./skill --json
  skilltest check ./skill --json
+ skilltest improve ./skill --json
  ```
 
  HTML report examples:
560
 
399
561
  HTML report examples:
@@ -525,6 +687,7 @@ node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
  node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
  node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
  node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
+ node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
  ```
 
  ## Release Checklist
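Editor's addendum: the `toolAssertions` introduced in this release are evaluated structurally, without the grader model. A hedged TypeScript sketch of what such deterministic checks could look like — the types and the `checkToolAssertion` name are illustrative assumptions, not skilltest's actual API:

```typescript
// Illustrative types for recorded tool calls and the three assertion
// kinds shown in the README diff; names are assumptions, not internals.
interface ToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

type ToolAssertion =
  | { type: "tool_called"; toolName: string; description: string }
  | { type: "tool_not_called"; toolName: string; description: string }
  | {
      type: "tool_argument_match";
      toolName: string;
      expectedArgs: Record<string, unknown>;
      description: string;
    };

function checkToolAssertion(assertion: ToolAssertion, calls: ToolCall[]): boolean {
  const matching = calls.filter((c) => c.toolName === assertion.toolName);
  switch (assertion.type) {
    case "tool_called":
      return matching.length > 0;
    case "tool_not_called":
      return matching.length === 0;
    case "tool_argument_match":
      // Pass if any call to the tool supplied every expected argument value.
      return matching.some((c) =>
        Object.entries(assertion.expectedArgs).every(
          ([k, v]) => JSON.stringify(c.args[k]) === JSON.stringify(v)
        )
      );
  }
}

// Example: a run that read checklist.md and deleted nothing.
const calls: ToolCall[] = [{ toolName: "read_file", args: { path: "checklist.md" } }];
console.log(checkToolAssertion({ type: "tool_called", toolName: "read_file", description: "" }, calls)); // true
console.log(checkToolAssertion({ type: "tool_not_called", toolName: "delete_file", description: "" }, calls)); // true
```

Because these checks are pure functions over the recorded call log, they stay deterministic across runs, unlike grader-model assertions.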