npm: skilltest 0.6.0 → 0.8.0

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.

Files changed (5)

package/CLAUDE.md CHANGED

@@ -77,6 +77,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
   - default concurrency is `5`
   - `--concurrency 1` preserves the old sequential behavior
   - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
+- Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
 - JSON mode is strict:
   - no spinners
   - no colored output
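The note above about precomputing RNG-dependent setup before requests begin can be illustrated with a minimal sketch. This is not the code in `src/core/trigger-tester.ts`; the names `makeLcg`, `precomputeSetups`, and `runWithConcurrency` are hypothetical, and the generator is a stand-in for whatever seeded RNG the project uses. The idea is that every random draw happens sequentially in query order, so the result does not depend on how concurrent requests interleave.

```javascript
// Hypothetical sketch, not skilltest's implementation.

// Tiny seeded generator (Lehmer LCG), used here only as a stand-in RNG.
function makeLcg(seed) {
  let state = (seed % 2147483646) + 1; // keep state in [1, 2147483646]
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647; // value in (0, 1)
  };
}

// Step 1: draw all random choices up front, in query order.
function precomputeSetups(queries, fakePool, fakesPerQuery, rng) {
  return queries.map((query) => {
    const pool = [...fakePool];
    const fakes = [];
    for (let i = 0; i < fakesPerQuery && pool.length > 0; i++) {
      const index = Math.floor(rng() * pool.length);
      fakes.push(pool.splice(index, 1)[0]); // sample without replacement
    }
    return { query, fakes };
  });
}

// Step 2: process the precomputed setups with at most `concurrency` in flight.
// Results are stored by index, so output order matches input order.
async function runWithConcurrency(items, concurrency, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage: the setups are identical for concurrency 1 and concurrency 5,
// because no random draw happens inside the workers.
const setups = precomputeSetups(["q1", "q2"], ["a", "b", "c"], 2, makeLcg(42));
runWithConcurrency(setups, 5, async (setup) => setup.fakes.length);
```

With this split, `--concurrency` only changes how quickly the requests complete, not which fake skills each query sees.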
@@ -103,11 +104,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Security heuristics: `src/core/linter/security.ts`
 - Progressive disclosure: `src/core/linter/disclosure.ts`
 - Compatibility hints: `src/core/linter/compat.ts`
-- Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
+- Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
+- Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
 - Eval grading schema: `src/core/grader.ts`
 - Combined quality gate orchestration: `src/core/check-runner.ts`
-## Future Work (Not Implemented Yet)
-- Config file support (`.skilltestrc`)
-- Plugin linter rules

package/README.md CHANGED

@@ -8,11 +8,15 @@ The testing framework for Agent Skills. Lint, test triggering, and evaluate your
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
+The repository itself uses a fast Vitest suite for offline unit and integration
+coverage of the parser, linters, trigger math, config resolution, reporters,
+and linter orchestration.
 ## Demo
 GIF coming soon.
-![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon)
+<!-- ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon) -->
 ## Why skilltest?
@@ -159,6 +163,72 @@ What it checks:
 Flags:
 - `--html <path>` write a self-contained HTML report
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+### Plugin Rules
+You can run custom lint rules alongside the built-in checks. Plugin rules use the
+same `LintContext` and `LintIssue` types as the core linter, and their results
+appear in the same `LintReport`.
+Config:
+```json
+{
+  "lint": {
+    "plugins": ["./my-rules.js"]
+  }
+}
+```
+CLI:
+```bash
+skilltest lint ./skill --plugin ./my-rules.js
+```
+Minimal plugin example:
+```js
+export default {
+  rules: [
+    {
+      checkId: "custom:no-todo",
+      title: "No TODO comments",
+      check(context) {
+        const body = context.frontmatter.content;
+        if (/\bTODO\b/.test(body)) {
+          return [
+            {
+              id: "custom.no-todo",
+              checkId: "custom:no-todo",
+              title: "No TODO comments",
+              status: "warn",
+              message: "SKILL.md contains a TODO marker."
+            }
+          ];
+        }
+        return [
+          {
+            id: "custom.no-todo",
+            checkId: "custom:no-todo",
+            title: "No TODO comments",
+            status: "pass",
+            message: "No TODO markers found."
+          }
+        ];
+      }
+    }
+  ]
+};
+```
+Notes:
+- Plugin files are loaded with dynamic `import()`.
+- `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
+- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
+- CLI `--plugin` values replace config-file `lint.plugins` values.
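The notes above on precedence and ordering can be sketched as follows. This is an illustration only, not the code in `src/core/linter/plugin.ts`: the helpers `resolvePluginPaths` and `runPluginRules` are hypothetical, `check` is assumed to be synchronous, and the mock context carries only the `frontmatter.content` field used by the README example.

```javascript
// Hypothetical sketch, not skilltest's implementation.

// CLI values, when present, replace (not merge with) config-file values.
function resolvePluginPaths(cliPlugins, configPlugins) {
  return cliPlugins.length > 0 ? cliPlugins : configPlugins;
}

// Run each plugin's rules in the order the plugins are listed.
function runPluginRules(plugins, context) {
  const issues = [];
  for (const plugin of plugins) {
    for (const rule of plugin.rules) {
      issues.push(...rule.check(context));
    }
  }
  return issues;
}

// Same rule shape as the README's minimal plugin example, condensed.
const noTodoPlugin = {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const hasTodo = /\bTODO\b/.test(context.frontmatter.content);
        return [
          {
            id: "custom.no-todo",
            checkId: "custom:no-todo",
            title: "No TODO comments",
            status: hasTodo ? "warn" : "pass",
            message: hasTodo ? "SKILL.md contains a TODO marker." : "No TODO markers found."
          }
        ];
      }
    }
  ]
};

// Mock context for illustration; the real LintContext has more fields.
const mockContext = { frontmatter: { content: "# My skill\nTODO: write usage notes" } };
const pluginIssues = runPluginRules([noTodoPlugin], mockContext);
console.log(pluginIssues[0].status); // prints "warn"
console.log(resolvePluginPaths(["./cli-rules.js"], ["./my-rules.js"])); // CLI list wins
```

Calling a rule's `check` directly with a mock context like this is also a quick way to try out a plugin before passing it to `skilltest lint`.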
 ### `skilltest trigger <path-to-skill>`
@@ -171,6 +241,7 @@ Flow:
 3. For each query, asks model to select one skill from a mixed list:
    - your skill under test
    - realistic fake skills
+   - optional sibling competitor skills from `--compare`
 4. Computes TP, TN, FP, FN, precision, recall, F1.
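The metrics in step 4 follow the standard definitions and can be sketched as below. The function name `computeMetrics` is hypothetical, and returning `0` when a denominator is zero is an assumption; the actual scoring in `src/core/trigger-tester.ts` may handle that case differently.

```javascript
// Hypothetical sketch of the standard precision / recall / F1 formulas.
function computeMetrics({ tp, tn, fp, fn }) {
  // precision: of the queries where the skill was selected, how many should have selected it
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  // recall: of the queries that should have selected the skill, how many did
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  // F1: harmonic mean of precision and recall
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Example with made-up counts for 20 queries (10 positive, 10 negative).
const metrics = computeMetrics({ tp: 8, tn: 9, fp: 1, fn: 2 });
console.log(metrics.f1.toFixed(3)); // prints "0.842"
```

Note that `tn` does not enter any of the three formulas; it is reported alongside them but only the other three counts affect F1.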
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
@@ -184,6 +255,7 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--provider <anthropic|openai>` default: `anthropic`
 - `--queries <path>` use custom queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
 - `--concurrency <n>` default: `5`
@@ -192,6 +264,28 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
+### Comparative Trigger Testing
+Test whether your skill is distinctive enough to be selected over similar real skills:
+```bash
+skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
+```
+Config:
+```json
+{
+  "trigger": {
+    "compare": ["../similar-skill-1", "../similar-skill-2"]
+  }
+}
+```
+Comparative mode includes the real competitor skills in the candidate list alongside
+fake skills. This reveals confusion between skills with overlapping descriptions that
+standard trigger testing would miss.
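As a rough illustration of the candidate list described above, the sketch below mixes the skill under test, the `--compare` competitors, and fake skills, then shuffles them with a seeded generator so the order is reproducible. All names (`seededRandom`, `seededShuffle`, `buildCandidates`) and the skill objects are hypothetical; how skilltest actually orders or samples candidates is not shown in this diff.

```javascript
// Hypothetical sketch, not skilltest's implementation.

// Small seeded generator (mulberry32-style), used only for reproducible shuffling.
function seededRandom(seed) {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, driven by the supplied generator.
function seededShuffle(items, rng) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Combine the three sources into one list the model chooses from.
function buildCandidates(skillUnderTest, competitors, fakeSkills, seed) {
  const all = [skillUnderTest, ...competitors, ...fakeSkills];
  return seededShuffle(all, seededRandom(seed));
}

// Made-up skills for illustration.
const candidates = buildCandidates(
  { name: "my-skill", description: "Summarize PDF reports" },
  [{ name: "similar-skill-1", description: "Summarize Word documents" }],
  [{ name: "fake-weather", description: "Look up weather forecasts" }],
  123
);
console.log(candidates.map((skill) => skill.name));
```

A false positive against a fake skill and a false positive against a real competitor mean different things: the latter points at overlapping descriptions that need to be made more distinctive.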
 ### `skilltest eval <path-to-skill>`
 Runs full skill behavior and grades outputs against assertions.
@@ -234,9 +328,11 @@ Flags:
 - `--grader-model <model>` default: same as resolved `--model`
 - `--api-key <key>` explicit key override
 - `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
 - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
 - `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
@@ -375,6 +471,8 @@ jobs:
         with:
           node-version: "20"
       - run: npm ci
+      - run: npm run lint
+      - run: npm run test
       - run: npm run build
       - run: npx skilltest lint path/to/skill --json
 ```
@@ -410,11 +508,15 @@ jobs:
 ```bash
 npm install
 npm run lint
+npm run test
 npm run build
 node dist/index.js --help
 ```
-Smoke tests:
+`npm test` runs the Vitest suite. The tests are offline and do not call model
+providers.
+Manual CLI smoke tests:
 ```bash
 node dist/index.js lint test-fixtures/sample-skill/