skilltest 0.1.0

package/CLAUDE.md ADDED
@@ -0,0 +1,103 @@
+ # CLAUDE.md
+
+ ## Project Overview
+
+ `skilltest` is a TypeScript CLI for validating Agent Skills (`SKILL.md` files). It provides:
+
+ - `lint`: static/offline quality checks
+ - `trigger`: model-based triggerability testing
+ - `eval`: end-to-end execution + grader-based scoring
+
+ The CLI is published as `skilltest` and built for `npx skilltest` usage.
+
+ ## Architecture
+
+ - `src/index.ts`: commander setup, global flags, command registration
+ - `src/commands/`: command handlers and CLI-level error/output behavior
+ - `src/core/skill-parser.ts`: skill path resolution, frontmatter parsing, reference extraction
+ - `src/core/linter/`: lint check modules and orchestrator
+ - `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
+ - `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
+ - `src/core/grader.ts`: structured grader prompt + JSON parse
+ - `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
+ - `src/reporters/`: terminal rendering and JSON output helper
+ - `src/utils/`: filesystem and API key config helpers
+
+ ## Build and Test Locally
+
+ Install deps:
+
+ ```bash
+ npm install
+ ```
+
+ Build:
+
+ ```bash
+ npm run build
+ ```
+
+ Type-check:
+
+ ```bash
+ npm run lint
+ ```
+
+ Smoke test lint command:
+
+ ```bash
+ node dist/index.js lint test-fixtures/sample-skill/
+ ```
+
+ Help/version:
+
+ ```bash
+ node dist/index.js --help
+ node dist/index.js --version
+ ```
+
+ Trigger test (requires key):
+
+ ```bash
+ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill/
+ ```
+
+ ## Key Design Decisions
+
+ - Minimal provider interface:
+   - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
+ - Lint is fully offline and first-class.
+ - Trigger/eval rely on the same provider abstraction.
+ - JSON mode is strict:
+   - no spinners
+   - no colored output
+   - stdout only JSON payload
+ - Error semantics:
+   - lint failures => exit code `1`
+   - runtime/config errors => exit code `2`
+
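The minimal provider interface listed under Key Design Decisions could be typed roughly as follows. This is a sketch under assumptions, not the package's source: the names `Provider`, `SendMessageOptions`, `OpenAIStubProvider`, and `EchoProvider` are hypothetical. Only the `sendMessage(systemPrompt, userMessage, { model })` shape and the stub's error message come from this file.

```typescript
// Hypothetical typing of the provider abstraction; all names are illustrative.
interface SendMessageOptions {
  model: string;
}

interface Provider {
  sendMessage(
    systemPrompt: string,
    userMessage: string,
    options: SendMessageOptions
  ): Promise<string>;
}

// Stub mirroring the documented v1 behavior of the OpenAI provider.
class OpenAIStubProvider implements Provider {
  async sendMessage(): Promise<string> {
    throw new Error("OpenAI provider coming soon.");
  }
}

// Test double (an assumption, not part of the package) for offline checks.
class EchoProvider implements Provider {
  async sendMessage(
    _systemPrompt: string,
    userMessage: string,
    options: SendMessageOptions
  ): Promise<string> {
    return `[${options.model}] ${userMessage}`;
  }
}
```

Keeping the surface to a single method means trigger and eval logic can be exercised against a double like `EchoProvider` without network access or an API key.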
+ ## Gotchas
+
+ - `trigger --num-queries` must be even for balanced positive/negative cases.
+ - OpenAI provider is intentionally a stub in v1 and throws `"OpenAI provider coming soon."`.
+ - Frontmatter is validated with both `gray-matter` and `js-yaml`; malformed YAML should fail fast.
+ - Keep file references relative to skill root; out-of-root refs are lint failures.
+ - If you modify reporter formatting, ensure JSON mode remains machine-safe.
+
+ ## File-Level Logic Map
+
+ - Frontmatter checks: `src/core/linter/frontmatter.ts`
+ - Structure checks: `src/core/linter/structure.ts`
+ - Content heuristics: `src/core/linter/content.ts`
+ - Progressive disclosure: `src/core/linter/disclosure.ts`
+ - Compatibility hints: `src/core/linter/compat.ts`
+ - Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
+ - Eval grading schema: `src/core/grader.ts`
+
+ ## Future Work (Not Implemented Yet)
+
+ - Real OpenAI provider implementation
+ - Config file support (`.skilltestrc`)
+ - Parallel execution
+ - HTML reporting
+ - Plugin linter rules
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,326 @@
+ # skilltest
+
+ [![npm version](https://img.shields.io/badge/npm-skilltest-blue)](https://www.npmjs.com/package/skilltest)
+ [![License](https://img.shields.io/badge/license-MIT-green)](./LICENSE)
+ [![CI](https://img.shields.io/badge/ci-placeholder-lightgrey)](#cicd-integration)
+
+ The testing framework for Agent Skills. Lint, test triggering, and evaluate your SKILL.md files.
+
+ `skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
+
+ ## Demo
+
+ GIF coming soon.
+
+ ![skilltest demo placeholder](https://via.placeholder.com/1200x420?text=skilltest+demo+gif+coming+soon)
+
+ ## Why skilltest?
+
+ Agent Skills are quick to write but hard to validate before deployment:
+
+ - Descriptions can be too vague to trigger reliably.
+ - Broken paths in `scripts/`, `references/`, or `assets/` fail silently.
+ - You cannot easily measure trigger precision/recall.
+ - You do not know whether outputs are good until users exercise the skill.
+
+ `skilltest` closes this gap with one CLI and three modes.
+
+ ## Install
+
+ Global:
+
+ ```bash
+ npm install -g skilltest
+ ```
+
+ Without install:
+
+ ```bash
+ npx skilltest --help
+ ```
+
+ Requires Node.js `>=18`.
+
+ ## Quick Start
+
+ Lint a skill:
+
+ ```bash
+ skilltest lint ./path/to/skill
+ ```
+
+ Trigger test:
+
+ ```bash
+ skilltest trigger ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
+ ```
+
+ End-to-end eval:
+
+ ```bash
+ skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
+ ```
+
+ Example lint summary:
+
+ ```text
+ skilltest lint
+ target: ./test-fixtures/sample-skill
+ summary: 25/25 checks passed, 0 warnings, 0 failures
+ ```
+
+ ## Commands
+
+ ### `skilltest lint <path-to-skill>`
+
+ Static analysis only. Fast and offline.
+
+ What it checks:
+
+ - Frontmatter:
+   - YAML presence and validity
+   - `name` required, max 64, lowercase/numbers/hyphens, no leading/trailing/consecutive hyphens
+   - `description` required, non-empty, max 1024
+   - warn if no `license`
+   - warn if description is weak on both what and when
+ - Structure:
+   - warns if `SKILL.md` exceeds 500 lines
+   - warns if long references (300+ lines) have no table of contents
+   - validates referenced files in `scripts/`, `references/`, `assets/`
+   - detects broken relative file references
+ - Content heuristics:
+   - warns if no headers
+   - warns if no examples
+   - warns on vague phrases
+   - warns on angle brackets in frontmatter
+   - fails on obvious secret patterns
+   - warns on empty/too-short body
+   - warns on very short description
+ - Progressive disclosure:
+   - warns if `SKILL.md` is large and no `references/` exists
+   - validates references are relative and inside skill root
+   - warns on deep reference chains
+ - Compatibility hints:
+   - warns on provider-specific conventions such as `allowed-tools`
+   - emits a likely compatibility summary
+
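The `name` rules in the frontmatter checks above (max 64 characters, lowercase letters, numbers, and hyphens, with no leading, trailing, or consecutive hyphens) can be captured by one regular expression plus a length check. The sketch below is illustrative only: `validateSkillName`, `NAME_PATTERN`, and the message strings are hypothetical, and the actual check in `src/core/linter/frontmatter.ts` may be written differently.

```typescript
// Hypothetical helper; not taken from the package source.
// One or more alphanumeric segments joined by single hyphens, which rules out
// leading, trailing, and consecutive hyphens in a single pattern.
const NAME_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const MAX_NAME_LENGTH = 64;

function validateSkillName(name: unknown): string[] {
  if (typeof name !== "string" || name.length === 0) {
    return ["name is required"];
  }
  const problems: string[] = [];
  if (name.length > MAX_NAME_LENGTH) {
    problems.push(`name exceeds ${MAX_NAME_LENGTH} characters`);
  }
  if (!NAME_PATTERN.test(name)) {
    problems.push(
      "name must use lowercase letters, numbers, and single hyphens, " +
        "with no leading or trailing hyphen"
    );
  }
  return problems;
}
```

An empty array means the name passed; returning a list rather than throwing lets a linter report every problem in one run.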
+ ### `skilltest trigger <path-to-skill>`
+
+ Measures trigger behavior for your skill description with model simulation.
+
+ Flow:
+
+ 1. Reads `name` and `description` from frontmatter.
+ 2. Generates balanced trigger/non-trigger queries (or loads custom query file).
+ 3. For each query, asks model to select one skill from a mixed list:
+    - your skill under test
+    - realistic fake skills
+ 4. Computes TP, TN, FP, FN, precision, recall, F1.
+
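Step 4 of the flow above uses the standard confusion-matrix definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. A minimal sketch follows; `TriggerOutcome`, `TriggerMetrics`, and `computeMetrics` are hypothetical names, and returning 0 when a denominator is zero is an assumed convention that the package may not share.

```typescript
// Hypothetical types; the real scoring lives in src/core/trigger-tester.ts.
interface TriggerOutcome {
  shouldTrigger: boolean; // expected label for the query
  didTrigger: boolean; // whether the model picked the skill under test
}

interface TriggerMetrics {
  tp: number;
  tn: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
  f1: number;
}

function computeMetrics(outcomes: TriggerOutcome[]): TriggerMetrics {
  let tp = 0;
  let tn = 0;
  let fp = 0;
  let fn = 0;
  for (const o of outcomes) {
    if (o.shouldTrigger && o.didTrigger) tp++;
    else if (!o.shouldTrigger && !o.didTrigger) tn++;
    else if (!o.shouldTrigger && o.didTrigger) fp++;
    else fn++;
  }
  // Assumed convention: report 0 instead of NaN when a denominator is 0.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { tp, tn, fp, fn, precision, recall, f1 };
}
```

With a balanced query set, a false positive (the skill fires on an unrelated query) lowers precision, while a false negative (the skill is missed) lowers recall.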
+ Flags:
+
+ - `--model <model>` default: `claude-sonnet-4-5-20250929`
+ - `--provider <anthropic|openai>` default: `anthropic`
+ - `--queries <path>` use custom queries JSON
+ - `--num-queries <n>` default: `20` (must be even)
+ - `--save-queries <path>` save generated query set
+ - `--api-key <key>` explicit key override
+ - `--verbose` show full model decision text
+
+ ### `skilltest eval <path-to-skill>`
+
+ Runs full skill behavior and grades outputs against assertions.
+
+ Flow:
+
+ 1. Loads prompts from file or auto-generates 5 prompts.
+ 2. Injects full `SKILL.md` as system instructions.
+ 3. Runs prompt on chosen model.
+ 4. Uses grader model to score each assertion with evidence.
+
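The eval flow above can be pictured as a nested loop over prompts and assertions. The sketch below is an assumption-heavy illustration, not the code in `src/core/eval-runner.ts`: `runEval`, `SendMessage`, `EvalPrompt`, `AssertionResult`, the grader instruction text, and the `{"passed", "evidence"}` JSON shape are all hypothetical. Only the four steps and the rule that the grader model defaults to the main model come from the README.

```typescript
// All names and the grader JSON shape are hypothetical.
type SendMessage = (
  systemPrompt: string,
  userMessage: string,
  options: { model: string }
) => Promise<string>;

interface EvalPrompt {
  prompt: string;
  assertions: string[];
}

interface AssertionResult {
  assertion: string;
  passed: boolean;
  evidence: string;
}

async function runEval(
  skillMarkdown: string,
  prompts: EvalPrompt[],
  send: SendMessage,
  model: string,
  graderModel: string = model // the grader defaults to the main model
): Promise<AssertionResult[][]> {
  const results: AssertionResult[][] = [];
  for (const { prompt, assertions } of prompts) {
    // Steps 2 and 3: full SKILL.md as system instructions, then run the prompt.
    const output = await send(skillMarkdown, prompt, { model });
    const graded: AssertionResult[] = [];
    for (const assertion of assertions) {
      // Step 4: ask the grader for a JSON verdict (assumed response shape).
      const raw = await send(
        'Reply with JSON only: {"passed": boolean, "evidence": string}',
        `Assertion: ${assertion}\n\nOutput:\n${output}`,
        { model: graderModel }
      );
      const verdict = JSON.parse(raw) as { passed?: unknown; evidence?: unknown };
      graded.push({
        assertion,
        passed: verdict.passed === true,
        evidence: String(verdict.evidence ?? ""),
      });
    }
    results.push(graded);
  }
  return results;
}
```

Passing `send` as a parameter keeps the loop testable with a fake function. A production version would also need to handle grader replies that are not valid JSON, which this sketch lets propagate as a thrown error.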
+ Flags:
+
+ - `--prompts <path>` custom prompts JSON
+ - `--model <model>` default: `claude-sonnet-4-5-20250929`
+ - `--grader-model <model>` default: same as `--model`
+ - `--provider <anthropic|openai>` default: `anthropic`
+ - `--save-results <path>` write full JSON result
+ - `--api-key <key>` explicit key override
+ - `--verbose` show full model responses
+
+ ## Global Flags
+
+ - `--help` show help
+ - `--version` show version
+ - `--json` output only valid JSON to stdout
+ - `--no-color` disable terminal colors
+
+ ## Input File Formats
+
+ Trigger queries (`--queries`):
+
+ ```json
+ [
+   {
+     "query": "Please validate this deployment checklist and score it.",
+     "should_trigger": true
+   },
+   {
+     "query": "Write a SQL migration for adding an index.",
+     "should_trigger": false
+   }
+ ]
+ ```
+
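A custom queries file in the format above is a JSON array of objects with a string `query` and a boolean `should_trigger`. The following sketch shows one way to validate that shape before use. `TriggerQuery` and `parseTriggerQueries` are hypothetical names and the error messages are invented for illustration; the package's own loader may validate differently.

```typescript
// Hypothetical loader; field names match the README example above.
interface TriggerQuery {
  query: string;
  should_trigger: boolean;
}

function parseTriggerQueries(json: string): TriggerQuery[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("queries file must contain a JSON array");
  }
  return data.map((item: unknown, i: number) => {
    // Treat null and undefined as an empty object so destructuring is safe.
    const { query, should_trigger } = (item ?? {}) as Record<string, unknown>;
    if (typeof query !== "string" || typeof should_trigger !== "boolean") {
      throw new Error(
        `entry ${i} must have a string "query" and a boolean "should_trigger"`
      );
    }
    return { query, should_trigger };
  });
}
```

Rejecting malformed entries up front avoids spending model calls on a query set that cannot be scored.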
+ Eval prompts (`--prompts`):
+
+ ```json
+ [
+   {
+     "prompt": "Validate this markdown checklist for a production release.",
+     "assertions": [
+       "output should include pass/warn/fail style categorization",
+       "output should provide at least one remediation recommendation"
+     ]
+   }
+ ]
+ ```
+
+ ## Output and Exit Codes
+
+ Exit codes:
+
+ - `0`: success with no lint failures
+ - `1`: lint failures present
+ - `2`: runtime/config/API/parse error
+
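The three exit codes above map naturally onto a small discriminated union. This is a sketch only: `RunOutcome` and `exitCodeFor` are hypothetical names, and the README does not say how the CLI represents outcomes internally.

```typescript
// Hypothetical outcome model; only the numeric codes come from the README.
type RunOutcome =
  | { kind: "ok" }
  | { kind: "lint-failures"; count: number }
  | { kind: "error"; message: string };

function exitCodeFor(outcome: RunOutcome): 0 | 1 | 2 {
  switch (outcome.kind) {
    case "ok":
      return 0;
    case "lint-failures":
      // Warnings alone do not count; only failures produce exit code 1.
      return outcome.count > 0 ? 1 : 0;
    case "error":
      return 2; // runtime, config, API, or parse error
  }
}
```

In a Node.js CLI the result could be assigned to `process.exitCode`, which lets pending output flush before the process ends. CI can then distinguish a skill that fails lint (1) from a broken run (2).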
+ JSON mode examples:
+
+ ```bash
+ skilltest lint ./skill --json
+ skilltest trigger ./skill --json
+ skilltest eval ./skill --json
+ ```
+
+ ## API Keys
+
+ Anthropic:
+
+ ```bash
+ export ANTHROPIC_API_KEY=your-key
+ ```
+
+ OpenAI:
+
+ ```bash
+ export OPENAI_API_KEY=your-key
+ ```
+
+ Override at runtime:
+
+ ```bash
+ skilltest trigger ./skill --api-key your-key
+ ```
+
+ Current provider status:
+
+ - `anthropic`: implemented
+ - `openai`: interface wired, command currently returns "OpenAI provider coming soon."
+
+ ## CICD Integration
+
+ GitHub Actions example to lint skills on pull requests:
+
+ ```yaml
+ name: skill-lint
+
+ on:
+   pull_request:
+     paths:
+       - "**/SKILL.md"
+       - "**/references/**"
+       - "**/scripts/**"
+       - "**/assets/**"
+
+ jobs:
+   lint:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-node@v4
+         with:
+           node-version: "20"
+       - run: npm ci
+       - run: npm run build
+       - run: npx skilltest lint path/to/skill --json
+ ```
+
+ Optional nightly trigger/eval:
+
+ ```yaml
+ name: skill-eval-nightly
+
+ on:
+   schedule:
+     - cron: "0 4 * * *"
+
+ jobs:
+   trigger-eval:
+     runs-on: ubuntu-latest
+     env:
+       ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-node@v4
+         with:
+           node-version: "20"
+       - run: npm ci
+       - run: npm run build
+       - run: npx skilltest trigger path/to/skill --num-queries 20 --json
+       - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
+ ```
+
+ ## Local Development
+
+ ```bash
+ npm install
+ npm run lint
+ npm run build
+ node dist/index.js --help
+ ```
+
+ Smoke tests:
+
+ ```bash
+ node dist/index.js lint test-fixtures/sample-skill/
+ node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
+ node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
+ ```
+
+ ## Release Checklist
+
+ ```bash
+ npm run lint
+ npm run build
+ npm run test
+ npm pack --dry-run
+ npm publish --dry-run
+ ```
+
+ Then publish:
+
+ ```bash
+ npm publish
+ ```
+
+ ## Contributing
+
+ Issues and pull requests are welcome. Include:
+
+ - clear reproduction steps
+ - expected vs actual behavior
+ - sample `SKILL.md` or fixtures when relevant
+
+ ## License
+
+ MIT