skilltest 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/CLAUDE.md CHANGED

@@ -7,6 +7,7 @@
 - `lint`: static/offline quality checks
 - `trigger`: model-based triggerability testing
 - `eval`: end-to-end execution + grader-based scoring
+- `check`: lint + trigger + eval quality gates in one run
 The CLI is published as `skilltest` and built for `npx skilltest` usage.
@@ -18,6 +19,7 @@ The CLI is published as `skilltest` and built for `npx skilltest` usage.
 - `src/core/linter/`: lint check modules and orchestrator
 - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
 - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+- `src/core/check-runner.ts`: orchestrates lint + trigger + eval with threshold gates
 - `src/core/grader.ts`: structured grader prompt + JSON parse
 - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
 - `src/reporters/`: terminal rendering and JSON output helper
@@ -68,6 +70,9 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
   - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
 - Lint is fully offline and first-class.
 - Trigger/eval rely on the same provider abstraction.
+- `check` wraps lint + trigger + eval and enforces minimum thresholds:
+  - trigger F1
+  - eval assertion pass rate
 - JSON mode is strict:
   - no spinners
   - no colored output
@@ -79,6 +84,8 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 ## Gotchas
 - `trigger --num-queries` must be even for balanced positive/negative cases.
+- `check` also requires even `--num-queries`.
+- `check` stops after lint failures unless `--continue-on-lint-fail` is set.
 - OpenAI provider is implemented via dynamic import so Anthropic-only installs do not crash if optional deps are skipped.
 - Frontmatter is validated with both `gray-matter` and `js-yaml`; malformed YAML should fail fast.
 - Keep file references relative to skill root; out-of-root refs are lint failures.
@@ -94,6 +101,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Compatibility hints: `src/core/linter/compat.ts`
 - Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
 - Eval grading schema: `src/core/grader.ts`
+- Combined quality gate orchestration: `src/core/check-runner.ts`
 ## Future Work (Not Implemented Yet)
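
Reviewer note: the CLAUDE.md hunks above describe `check` as lint, then trigger and eval, with minimum thresholds and an early stop on lint failures. The sketch below is one way that flow could look. It is not the contents of `src/core/check-runner.ts`, which this diff does not show. All names here (`GateInputs`, `ModelResults`, `runGate`) are hypothetical, the model stages are stubbed as a synchronous callback for brevity, and treating lint failures as a gate failure even when `--continue-on-lint-fail` is set is an assumption.

```typescript
// Hypothetical shapes; the real check-runner types are not part of this diff.
interface GateInputs {
  lintFailures: number;
  continueOnLintFail: boolean; // mirrors --continue-on-lint-fail
  minF1: number;               // mirrors --min-f1
  minAssertPassRate: number;   // mirrors --min-assert-pass-rate
}

interface ModelResults {
  triggerF1: number;
  assertPassRate: number;
}

type GateOutcome =
  | { stage: "lint"; passed: false }
  | { stage: "thresholds"; passed: boolean; reasons: string[] };

function runGate(
  inputs: GateInputs,
  runModelStages: () => ModelResults, // stand-in for trigger + eval
): GateOutcome {
  // Stop before any model call when lint fails, unless told to continue.
  if (inputs.lintFailures > 0 && !inputs.continueOnLintFail) {
    return { stage: "lint", passed: false };
  }

  const { triggerF1, assertPassRate } = runModelStages();
  const reasons: string[] = [];

  // Assumption: lint failures still fail the gate when execution continues.
  if (inputs.lintFailures > 0) {
    reasons.push(`lint: ${inputs.lintFailures} failure(s)`);
  }
  // A metric fails only when it is strictly below its target.
  if (triggerF1 < inputs.minF1) {
    reasons.push(`trigger F1 ${triggerF1} < ${inputs.minF1}`);
  }
  if (assertPassRate < inputs.minAssertPassRate) {
    reasons.push(`assert pass rate ${assertPassRate} < ${inputs.minAssertPassRate}`);
  }
  return { stage: "thresholds", passed: reasons.length === 0, reasons };
}
```

Passing the model stages in as a callback keeps the early-stop behaviour easy to verify: when lint fails and the continue flag is off, the callback is never invoked.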

package/README.md CHANGED

@@ -23,7 +23,7 @@ Agent Skills are quick to write but hard to validate before deployment:
 - You cannot easily measure trigger precision/recall.
 - You do not know whether outputs are good until users exercise the skill.
-`skilltest` closes this gap with one CLI and three modes.
+`skilltest` closes this gap with one CLI and four modes.
 ## Install
@@ -61,12 +61,18 @@ End-to-end eval:
 skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
 ```
+Run full quality gate:
+```bash
+skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
+```
 Example lint summary:
 ```text
 skilltest lint
 target: ./test-fixtures/sample-skill
-summary: 25/25 checks passed, 0 warnings, 0 failures
+summary: 29/29 checks passed, 0 warnings, 0 failures
 ```
 ## Commands
@@ -153,6 +159,32 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model responses
+### `skilltest check <path-to-skill>`
+Runs `lint + trigger + eval` in one command and applies quality thresholds.
+Default behavior:
+1. Run lint.
+2. Stop before model calls if lint has failures.
+3. Run trigger and eval only when lint passes.
+4. Fail quality gate when either threshold is below target.
+Flags:
+- `--provider <anthropic|openai>` default: `anthropic`
+- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
+- `--grader-model <model>` default: same as resolved `--model`
+- `--api-key <key>` explicit key override
+- `--queries <path>` custom trigger queries JSON
+- `--num-queries <n>` default: `20` (must be even)
+- `--prompts <path>` custom eval prompts JSON
+- `--min-f1 <n>` default: `0.8`
+- `--min-assert-pass-rate <n>` default: `0.9`
+- `--save-results <path>` save combined check result JSON
+- `--continue-on-lint-fail` continue trigger/eval even if lint fails
+- `--verbose` include detailed trigger/eval sections
 ## Global Flags
 - `--help` show help
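
Reviewer note: the hunk above introduces `--min-f1` and requires an even `--num-queries`. As a reminder of what the F1 threshold measures, here is a minimal sketch using the standard definition F1 = 2TP / (2TP + FP + FN). The function and type names are hypothetical, returning 0 when there are no positives at all is an assumed convention, and the actual scoring in `src/core/trigger-tester.ts` is not shown in this diff.

```typescript
// Hypothetical confusion counts for a trigger run.
interface TriggerCounts {
  truePositives: number;  // skill triggered when it should have
  falsePositives: number; // skill triggered when it should not have
  falseNegatives: number; // skill missed when it should have triggered
}

// Standard F1: harmonic mean of precision and recall, written in count form.
function f1Score(c: TriggerCounts): number {
  const denominator = 2 * c.truePositives + c.falsePositives + c.falseNegatives;
  // Assumed convention: no positives and no errors scores 0 rather than NaN.
  return denominator === 0 ? 0 : (2 * c.truePositives) / denominator;
}

// An even count lets queries split equally into positive and negative cases.
function assertEvenQueryCount(n: number): void {
  if (!Number.isInteger(n) || n <= 0 || n % 2 !== 0) {
    throw new Error(`--num-queries must be a positive even integer, got ${n}`);
  }
}
```

For example, with the default 20 queries split 10/10, a run with 8 true positives, 2 false positives, and 2 false negatives gives 16 / 20, which sits exactly at the default `--min-f1` of `0.8`.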
@@ -195,8 +227,8 @@ Eval prompts (`--prompts`):
 Exit codes:
-- `0`: success with no lint failures
-- `1`: lint failures present
+- `0`: success
+- `1`: quality gate failed (`lint`, `check` thresholds, or command-specific failure conditions)
 - `2`: runtime/config/API/parse error
 JSON mode examples:
@@ -205,6 +237,7 @@ JSON mode examples:
 skilltest lint ./skill --json
 skilltest trigger ./skill --json
 skilltest eval ./skill --json
+skilltest check ./skill --json
 ```
 ## API Keys
@@ -294,6 +327,7 @@ jobs:
       - run: npm run build
       - run: npx skilltest trigger path/to/skill --num-queries 20 --json
       - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
+      - run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
 ```
 ## Local Development
@@ -311,6 +345,7 @@ Smoke tests:
 node dist/index.js lint test-fixtures/sample-skill/
 node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
 node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
+node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
 ```
 ## Release Checklist
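
Reviewer note: the README now documents three exit codes (`0` success, `1` quality gate failed, `2` runtime/config/API/parse error), which is what CI steps such as the added `npx skilltest check ... --json` line depend on. The mapping can be sketched as below. The `Outcome` labels and `exitCodeFor` are hypothetical names for illustration, not identifiers from the package.

```typescript
// Hypothetical outcome labels; only the numeric codes come from the README.
type Outcome = "success" | "gate-failed" | "runtime-error";

function exitCodeFor(outcome: Outcome): 0 | 1 | 2 {
  switch (outcome) {
    case "success":
      return 0; // all gates passed
    case "gate-failed":
      return 1; // lint failures, check thresholds, or command-specific failures
    case "runtime-error":
      return 2; // runtime, config, API, or parse error
  }
}
```

Keeping gate failures (`1`) separate from runtime errors (`2`) lets a pipeline distinguish "the skill is below quality targets" from "the tool could not run".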