npm - skilltest - Versions diffs - 0.5.0 → 0.7.0 - Mend

skilltest 0.5.0 → 0.7.0

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.

Files changed (5)

package/CLAUDE.md CHANGED

@@ -22,7 +22,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
 - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
-- `src/reporters/`: terminal rendering and JSON output helper
+- `src/reporters/`: terminal, JSON, and HTML output helpers
 - `src/utils/`: filesystem and API key config helpers
 ## Build and Test Locally
@@ -73,6 +73,10 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - `check` wraps lint + trigger + eval and enforces minimum thresholds:
   - trigger F1
   - eval assertion pass rate
+- Trigger/eval work is concurrency-limited instead of fully unbounded:
+  - default concurrency is `5`
+  - `--concurrency 1` preserves the old sequential behavior
+  - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
 - JSON mode is strict:
   - no spinners
   - no colored output
@@ -106,6 +110,4 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 ## Future Work (Not Implemented Yet)
 - Config file support (`.skilltestrc`)
-- Parallel execution
-- HTML reporting
 - Plugin linter rules
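
Note on the concurrency hunk in this file: the new bullets describe two ideas, a cap on in-flight trigger/eval work and RNG-dependent setup that finishes before any request starts. The sketch below is one way those ideas could fit together. It is not taken from the package source: `mulberry32`, `precomputeCases`, `mapWithConcurrency`, and the `decoy` field are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch; none of these names come from skilltest's source.

// Small deterministic PRNG so the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Every RNG draw happens here, before any request is sent,
// so request completion order cannot change which values are drawn.
function precomputeCases(
  queries: readonly string[],
  seed: number,
): { query: string; decoy: number }[] {
  const rand = mulberry32(seed);
  return queries.map((query) => ({ query, decoy: Math.floor(rand() * 1000) }));
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are stored by input index, so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const runLane = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  };
  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}
```

With `limit` set to `1` the single lane processes items strictly in order, which matches the sequential behavior the diff attributes to `--concurrency 1`.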

package/README.md CHANGED

@@ -8,11 +8,15 @@ The testing framework for Agent Skills. Lint, test triggering, and evaluate your
 `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
+The repository itself uses a fast Vitest suite for offline unit and integration
+coverage of the parser, linters, trigger math, config resolution, reporters,
+and linter orchestration.
 ## Demo
 GIF coming soon.
-![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon)
+<!-- ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon) -->
 ## Why skilltest?
@@ -67,6 +71,18 @@ Run full quality gate:
 skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
 ```
+Write a self-contained HTML report:
+```bash
+skilltest check ./path/to/skill --html ./reports/check.html
+```
+Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force
+the old sequential execution order. Seeded trigger runs stay deterministic regardless
+of concurrency.
+All four commands also support `--html <path>` for an offline HTML report, and
+`--json` can be used with `--html` in the same run.
 Example lint summary:
 ```text
@@ -75,6 +91,35 @@ target: ./test-fixtures/sample-skill
 summary: 29/29 checks passed, 0 warnings, 0 failures
 ```
+## Configuration
+`skilltest` resolves config in this order:
+1. `.skilltestrc` in the target skill root
+2. `.skilltestrc` in the current working directory
+3. the nearest `package.json` containing `skilltestrc`
+CLI flags override config values.
+Example `.skilltestrc`:
+```json
+{
+  "provider": "anthropic",
+  "model": "claude-sonnet-4-5-20250929",
+  "concurrency": 5,
+  "trigger": {
+    "numQueries": 20,
+    "threshold": 0.8,
+    "seed": 123
+  },
+  "eval": {
+    "numRuns": 5,
+    "threshold": 0.9
+  }
+}
+```
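Note on the Configuration lines above: the README gives a lookup order and says CLI flags override config values, but it does not say whether multiple sources are merged. The sketch below assumes the first source found wins outright and that CLI flags are layered on top. `resolveConfig`, `Candidate`, and `definedOnly` are hypothetical names, not the package's API; the default values are the ones listed in the README.

```typescript
// Hypothetical sketch of the documented lookup order; not skilltest's actual code.
type Config = { provider?: string; model?: string; concurrency?: number };

// Each candidate returns a config if its source exists, otherwise undefined.
type Candidate = { label: string; load: () => Config | undefined };

// Copy only the fields that were actually set, so an unset flag
// does not erase a value that came from a config source.
function definedOnly(flags: Config): Config {
  const out: Config = {};
  if (flags.provider !== undefined) out.provider = flags.provider;
  if (flags.model !== undefined) out.model = flags.model;
  if (flags.concurrency !== undefined) out.concurrency = flags.concurrency;
  return out;
}

// Assumption: the first existing source wins; sources are not merged with each other.
function resolveConfig(
  candidates: readonly Candidate[],
  cliFlags: Config,
): { source: string; config: Config } {
  const defaults: Config = {
    provider: "anthropic",
    model: "claude-sonnet-4-5-20250929",
    concurrency: 5,
  };
  for (const candidate of candidates) {
    const found = candidate.load();
    if (found !== undefined) {
      return {
        source: candidate.label,
        config: { ...defaults, ...found, ...definedOnly(cliFlags) },
      };
    }
  }
  return { source: "defaults", config: { ...defaults, ...definedOnly(cliFlags) } };
}
```

Candidates would be passed in the documented order: skill root, then current working directory, then the nearest `package.json` with a `skilltestrc` key.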
 ## Commands
 ### `skilltest lint <path-to-skill>`
@@ -115,6 +160,10 @@ What it checks:
   - warns on provider-specific conventions such as `allowed-tools`
   - emits a likely compatibility summary
+Flags:
+- `--html <path>` write a self-contained HTML report
 ### `skilltest trigger <path-to-skill>`
 Measures trigger behavior for your skill description with model simulation.
@@ -131,6 +180,8 @@ Flow:
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
 terminal and JSON output include it so the run can be repeated exactly. If you use
 `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
+The fake-skill setup is precomputed before requests begin, so the same seed produces
+the same trigger cases at any concurrency level.
 Flags:
@@ -139,6 +190,8 @@ Flags:
 - `--queries <path>` use custom queries JSON
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-queries <path>` save generated query set
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
@@ -160,6 +213,8 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--grader-model <model>` default: same as `--model`
 - `--provider <anthropic|openai>` default: `anthropic`
+- `--concurrency <n>` default: `5`
+- `--html <path>` write a self-contained HTML report
 - `--save-results <path>` write full JSON result
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
@@ -173,7 +228,8 @@ Default behavior:
 1. Run lint.
 2. Stop before model calls if lint has failures.
 3. Run trigger and eval only when lint passes.
-4. Fail quality gate when either threshold is below target.
+4. When concurrency is greater than `1`, run trigger and eval in parallel.
+5. Fail quality gate when either threshold is below target.
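Note on the numbered steps above: a minimal sketch of that control flow, assuming each stage is an async function and that the sequential path runs trigger before eval (the README does not state the order). `runCheck`, `Stages`, and the result field names are hypothetical, not the package's API.

```typescript
// Hypothetical sketch of the documented `check` flow; names are illustrative.
interface Stages {
  lint: () => Promise<{ failures: number }>;
  trigger: () => Promise<{ f1: number }>;
  evaluate: () => Promise<{ assertPassRate: number }>;
}

interface CheckOptions {
  concurrency: number;
  minF1: number;
  minAssertPassRate: number;
}

async function runCheck(
  stages: Stages,
  opts: CheckOptions,
): Promise<{ ranModels: boolean; passed: boolean }> {
  // Steps 1-2: lint first, and stop before any model call if it has failures.
  const lint = await stages.lint();
  if (lint.failures > 0) return { ranModels: false, passed: false };

  let f1: number;
  let passRate: number;
  if (opts.concurrency > 1) {
    // Step 4: start trigger and eval together.
    const both = await Promise.all([stages.trigger(), stages.evaluate()]);
    f1 = both[0].f1;
    passRate = both[1].assertPassRate;
  } else {
    // Concurrency 1: one stage after the other (order assumed).
    f1 = (await stages.trigger()).f1;
    passRate = (await stages.evaluate()).assertPassRate;
  }

  // Step 5: the gate fails if either metric is under its minimum.
  const passed = f1 >= opts.minF1 && passRate >= opts.minAssertPassRate;
  return { ranModels: true, passed };
}
```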
 Flags:
@@ -185,6 +241,8 @@ Flags:
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
+- `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`
 - `--min-assert-pass-rate <n>` default: `0.9`
 - `--save-results <path>` save combined check result JSON
@@ -246,6 +304,15 @@ skilltest eval ./skill --json
 skilltest check ./skill --json
 ```
+HTML report examples:
+```bash
+skilltest lint ./skill --html ./reports/lint.html
+skilltest trigger ./skill --html ./reports/trigger.html
+skilltest eval ./skill --html ./reports/eval.html
+skilltest check ./skill --json --html ./reports/check.html
+```
 Seeded trigger example:
 ```bash
@@ -312,6 +379,8 @@ jobs:
         with:
           node-version: "20"
       - run: npm ci
+      - run: npm run lint
+      - run: npm run test
       - run: npm run build
       - run: npx skilltest lint path/to/skill --json
 ```
@@ -347,14 +416,19 @@ jobs:
 ```bash
 npm install
 npm run lint
+npm run test
 npm run build
 node dist/index.js --help
 ```
-Smoke tests:
+`npm test` runs the Vitest suite. The tests are offline and do not call model
+providers.
+Manual CLI smoke tests:
 ```bash
 node dist/index.js lint test-fixtures/sample-skill/
+node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
 node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json