skilltest 0.7.0 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +11 -7
- package/README.md +267 -12
- package/dist/index.js +1699 -173
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -19,12 +19,15 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/linter/`: lint check modules and orchestrator
 - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
 - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+- `src/core/tool-environment.ts`: mock tool environment + agentic loop for tool-aware eval
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
-- `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
+- `src/providers/`: LLM provider abstraction (`sendMessage`, `sendWithTools`) and provider implementations
 - `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers
 
+Eval supports optional mock tool environments for testing skills that invoke tools.
+
 ## Build and Test Locally
 
 Install deps:
@@ -68,6 +71,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 
 - Minimal provider interface:
   - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
+  - `sendWithTools(systemPrompt, messages, { model, tools }) => ProviderToolResponse`
 - Lint is fully offline and first-class.
 - Trigger/eval rely on the same provider abstraction.
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
@@ -77,6 +81,9 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - default concurrency is `5`
 - `--concurrency 1` preserves the old sequential behavior
 - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
+- Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
+- Tool-aware eval uses mock responses only. No real tool execution happens during eval.
+- Tool assertions are evaluated structurally, without the grader model, so those checks stay deterministic.
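The "evaluated structurally" note above can be illustrated with a small sketch. This is a hypothetical implementation, not skilltest's actual code; the function name is invented, and the assertion/call shapes are assumed from the `toolAssertions` examples in the README:

```javascript
// Hypothetical sketch: structural checks over recorded tool calls, no grader model.
// Shapes assumed from the README's `toolAssertions` examples; the real logic may differ.
function evaluateToolAssertion(assertion, recordedCalls) {
  const matching = recordedCalls.filter((c) => c.toolName === assertion.toolName);
  switch (assertion.type) {
    case "tool_called":
      return matching.length > 0;
    case "tool_not_called":
      return matching.length === 0;
    case "tool_argument_match":
      // Pass if any recorded call supplied every expected argument value.
      return matching.some((c) =>
        Object.entries(assertion.expectedArgs).every(([key, val]) => c.args[key] === val)
      );
    default:
      return false;
  }
}
```

Because every branch is a pure comparison over the recorded call transcript, the result is deterministic for a given eval run.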
 - JSON mode is strict:
   - no spinners
   - no colored output
@@ -103,11 +110,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Security heuristics: `src/core/linter/security.ts`
 - Progressive disclosure: `src/core/linter/disclosure.ts`
 - Compatibility hints: `src/core/linter/compat.ts`
-
+- Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
+- Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
+- Mock tool environment + agentic loop: `src/core/tool-environment.ts`
 - Eval grading schema: `src/core/grader.ts`
 - Combined quality gate orchestration: `src/core/check-runner.ts`
-
-## Future Work (Not Implemented Yet)
-
-- Config file support (`.skilltestrc`)
-- Plugin linter rules
package/README.md
CHANGED
|
@@ -4,7 +4,7 @@
 [](./LICENSE)
 [](#cicd-integration)
 
-The testing framework for Agent Skills. Lint, test triggering, and evaluate your SKILL.md files.
+The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your `SKILL.md` files.
 
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
 
@@ -12,12 +12,6 @@ The repository itself uses a fast Vitest suite for offline unit and integration
 coverage of the parser, linters, trigger math, config resolution, reporters,
 and linter orchestration.
 
-## Demo
-
-GIF coming soon.
-
-<!--  -->
-
 ## Why skilltest?
 
 Agent Skills are quick to write but hard to validate before deployment:
@@ -27,7 +21,7 @@ Agent Skills are quick to write but hard to validate before deployment:
 - You cannot easily measure trigger precision/recall.
 - You do not know whether outputs are good until users exercise the skill.
 
-`skilltest` closes this gap with one CLI and
+`skilltest` closes this gap with one CLI and five modes.
 
 ## Install
 
@@ -71,6 +65,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```
 
+Propose a verified rewrite without touching the source file:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic
+```
+
+Apply the verified rewrite in place:
+
+```bash
+skilltest improve ./path/to/skill --provider anthropic --apply
+```
+
 Write a self-contained HTML report:
 
 ```bash
@@ -80,8 +86,8 @@ skilltest check ./path/to/skill --html ./reports/check.html
 Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
 the old sequential execution order. Seeded trigger runs stay deterministic regardless
 of concurrency.
-
-
+`lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports.
+`improve` is terminal/JSON only in v1.
 
 Example lint summary:
 
@@ -115,7 +121,8 @@ Example `.skilltestrc`:
   },
   "eval": {
     "numRuns": 5,
-    "threshold": 0.9
+    "threshold": 0.9,
+    "maxToolIterations": 10
   }
 }
 ```
@@ -163,6 +170,72 @@ What it checks:
 Flags:
 
 - `--html <path>` write a self-contained HTML report
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+
+### Plugin Rules
+
+You can run custom lint rules alongside the built-in checks. Plugin rules use the
+same `LintContext` and `LintIssue` types as the core linter, and their results
+appear in the same `LintReport`.
+
+Config:
+
+```json
+{
+  "lint": {
+    "plugins": ["./my-rules.js"]
+  }
+}
+```
+
+CLI:
+
+```bash
+skilltest lint ./skill --plugin ./my-rules.js
+```
+
+Minimal plugin example:
+
+```js
+export default {
+  rules: [
+    {
+      checkId: "custom:no-todo",
+      title: "No TODO comments",
+      check(context) {
+        const body = context.frontmatter.content;
+        if (/\bTODO\b/.test(body)) {
+          return [
+            {
+              id: "custom.no-todo",
+              checkId: "custom:no-todo",
+              title: "No TODO comments",
+              status: "warn",
+              message: "SKILL.md contains a TODO marker."
+            }
+          ];
+        }
+        return [
+          {
+            id: "custom.no-todo",
+            checkId: "custom:no-todo",
+            title: "No TODO comments",
+            status: "pass",
+            message: "No TODO markers found."
+          }
+        ];
+      }
+    }
+  ]
+};
+```
+
+Notes:
+
+- Plugin files are loaded with dynamic `import()`.
+- `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
+- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
+- CLI `--plugin` values replace config-file `lint.plugins` values.
 
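To see how a plugin's rules fold into one issue list, here is a hypothetical driver. The function name is invented and the real orchestration (in `src/core/linter/plugin.ts`) may differ; only the `rules`/`check(context)` shape is taken from the plugin example above:

```javascript
// Hypothetical driver: collect issues from one loaded plugin's rules.
// `context` carries the parsed skill, as in the plugin example above.
function runPluginRules(plugin, context) {
  const issues = [];
  for (const rule of plugin.rules ?? []) {
    // Each rule's check() returns an array of LintIssue-shaped objects.
    issues.push(...rule.check(context));
  }
  return issues;
}

// A tiny inline plugin mirroring the no-todo example's shape.
const noTodoPlugin = {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const body = context.frontmatter.content;
        return /\bTODO\b/.test(body)
          ? [{ id: "custom.no-todo", checkId: "custom:no-todo", status: "warn", message: "SKILL.md contains a TODO marker." }]
          : [{ id: "custom.no-todo", checkId: "custom:no-todo", status: "pass", message: "No TODO markers found." }];
      }
    }
  ]
};
```

In the real CLI the plugin object would come from a dynamic `import()` of the `--plugin` path rather than an inline constant.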
 ### `skilltest trigger <path-to-skill>`
 
@@ -175,6 +248,7 @@ Flow:
 3. For each query, asks model to select one skill from a mixed list:
    - your skill under test
    - realistic fake skills
+   - optional sibling competitor skills from `--compare`
 4. Computes TP, TN, FP, FN, precision, recall, F1.
 
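The metrics in step 4 follow the standard confusion-matrix formulas; an illustrative helper (hypothetical, not a skilltest API):

```javascript
// Standard confusion-matrix math behind the trigger report's numbers.
// tp = should-trigger queries where the skill was selected, fp = selected
// when it should not have been, fn = missed when it should have triggered.
function triggerMetrics({ tp, fp, fn }) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```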
 
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
@@ -188,6 +262,7 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--provider <anthropic|openai>` default: `anthropic`
 - `--queries <path>` use custom queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
 - `--concurrency <n>` default: `5`
@@ -196,6 +271,28 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
 
+### Comparative Trigger Testing
+
+Test whether your skill is distinctive enough to be selected over similar real skills:
+
+```bash
+skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
+```
+
+Config:
+
+```json
+{
+  "trigger": {
+    "compare": ["../similar-skill-1", "../similar-skill-2"]
+  }
+}
+```
+
+Comparative mode includes the real competitor skills in the candidate list alongside
+fake skills. This reveals confusion between skills with overlapping descriptions that
+standard trigger testing would miss.
+
 ### `skilltest eval <path-to-skill>`
 
 Runs full skill behavior and grades outputs against assertions.
@@ -219,6 +316,70 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
 
+Config-only eval setting:
+
+- `eval.maxToolIterations` default: `10` safety cap for tool-aware eval loops
+
+### Tool-Aware Eval
+
+When an eval prompt defines `tools`, `skilltest` runs the prompt in a mock tool
+environment instead of plain text-only execution. The model can call the mocked
+tools during eval, and `skilltest` records the calls alongside the normal grader
+assertions.
+
+Tool responses are always mocked. `skilltest` does not execute real tools,
+scripts, shell commands, MCP servers, or APIs during eval.
+
+Example prompt file:
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention the missing rollback plan"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
+Run it with:
+
+```bash
+skilltest eval ./my-skill --prompts ./eval-prompts.json
+```
+
 
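The `responses` map in the tool-aware eval example above keys mock replies by the tool's argument payload. A hedged sketch of one plausible lookup, assuming exact match on the JSON-serialized arguments with `"*"` as the fallback; the real matcher in `src/core/tool-environment.ts` may normalize keys (for example, argument order) differently:

```javascript
// Assumed lookup: exact match on JSON-serialized args, then the "*" wildcard.
// Note: JSON.stringify preserves insertion order, so key order must match the map.
function resolveMockResponse(responses, args) {
  const key = JSON.stringify(args);
  if (Object.prototype.hasOwnProperty.call(responses, key)) {
    return responses[key];
  }
  if ("*" in responses) {
    return responses["*"];
  }
  return "[mock] no response configured";
}
```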
 ### `skilltest check <path-to-skill>`
 
 Runs `lint + trigger + eval` in one command and applies quality thresholds.
@@ -238,9 +399,11 @@ Flags:
 - `--grader-model <model>` default: same as resolved `--model`
 - `--api-key <key>` explicit key override
 - `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
 - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
 - `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
@@ -249,6 +412,52 @@ Flags:
 - `--continue-on-lint-fail` continue trigger/eval even if lint fails
 - `--verbose` include detailed trigger/eval sections
 
+### `skilltest improve <path-to-skill>`
+
+Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally
+applies it.
+
+Default behavior:
+
+1. Run a baseline `check` with `continue-on-lint-fail=true`.
+2. Freeze the exact trigger queries and eval prompts used in that baseline run.
+3. Ask the model for a structured JSON rewrite of `SKILL.md`.
+4. Rebuild and validate the candidate locally:
+   - must stay parseable
+   - must keep the same skill `name`
+   - must keep the current `license` when one already exists
+   - must not introduce broken relative references
+5. Verify the candidate by rerunning `check` against a copied skill directory with
+   the frozen trigger/eval inputs.
+6. Only write files when the candidate measurably improves the skill and passes the
+   configured quality gates.
+
+Flags:
+
+- `--provider <anthropic|openai>` default: `anthropic`
+- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+- `--api-key <key>` explicit key override
+- `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
+- `--num-queries <n>` default: `20` (must be even when auto-generating)
+- `--seed <number>` RNG seed for reproducible trigger sampling
+- `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+- `--concurrency <n>` default: `5`
+- `--output <path>` write the verified candidate `SKILL.md` to a separate file
+- `--save-results <path>` save full improve result JSON
+- `--min-f1 <n>` default: `0.8`
+- `--min-assert-pass-rate <n>` default: `0.9`
+- `--apply` write the verified rewrite back to the source `SKILL.md`
+- `--verbose` include full baseline and verification reports
+
+Notes:
+
+- `improve` is dry-run by default.
+- `--apply` only writes when parse, lint, trigger, and eval verification all pass.
+- Before/after metrics are measured against the same generated or user-supplied
+  trigger queries and eval prompts, not a fresh sample.
+
 ## Global Flags
 
 - `--help` show help
@@ -287,12 +496,56 @@ Eval prompts (`--prompts`):
 ]
 ```
 
+Tool-aware eval prompts (`--prompts`):
+
+```json
+[
+  {
+    "prompt": "Parse this deployment checklist and tell me what is missing.",
+    "assertions": ["output should mention remediation steps"],
+    "tools": [
+      {
+        "name": "read_file",
+        "description": "Read a file from the workspace",
+        "parameters": [
+          { "name": "path", "type": "string", "description": "File path to read", "required": true }
+        ],
+        "responses": {
+          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
+          "*": "[mock] File not found"
+        }
+      },
+      {
+        "name": "run_script",
+        "description": "Execute a shell script",
+        "parameters": [
+          { "name": "command", "type": "string", "description": "Command to run", "required": true }
+        ],
+        "responses": {
+          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
+        }
+      }
+    ],
+    "toolAssertions": [
+      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
+      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
+      {
+        "type": "tool_argument_match",
+        "toolName": "read_file",
+        "expectedArgs": { "path": "checklist.md" },
+        "description": "Model should read checklist.md specifically"
+      }
+    ]
+  }
+]
+```
+
 ## Output and Exit Codes
 
 Exit codes:
 
 - `0`: success
-- `1`: quality gate failed (`lint`, `check`
+- `1`: quality gate failed (`lint`, `check`, `improve` blocked, or other command-specific failure conditions)
 - `2`: runtime/config/API/parse error
 
 JSON mode examples:
@@ -302,6 +555,7 @@ skilltest lint ./skill --json
 skilltest trigger ./skill --json
 skilltest eval ./skill --json
 skilltest check ./skill --json
+skilltest improve ./skill --json
 ```
 
 HTML report examples:
@@ -433,6 +687,7 @@ node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
 node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
+node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```
 
 ## Release Checklist