skilltest 0.1.0
- package/CLAUDE.md +103 -0
- package/LICENSE +21 -0
- package/README.md +326 -0
- package/dist/index.js +1626 -0
- package/dist/index.js.map +1 -0
- package/package.json +51 -0
package/CLAUDE.md
ADDED
@@ -0,0 +1,103 @@
# CLAUDE.md

## Project Overview

`skilltest` is a TypeScript CLI for validating Agent Skills (`SKILL.md` files). It provides:

- `lint`: static/offline quality checks
- `trigger`: model-based triggerability testing
- `eval`: end-to-end execution + grader-based scoring

The CLI is published as `skilltest` and built for `npx skilltest` usage.

## Architecture

- `src/index.ts`: commander setup, global flags, command registration
- `src/commands/`: command handlers and CLI-level error/output behavior
- `src/core/skill-parser.ts`: skill path resolution, frontmatter parsing, reference extraction
- `src/core/linter/`: lint check modules and orchestrator
- `src/core/trigger-tester.ts`: query generation + trigger simulation + metrics
- `src/core/eval-runner.ts`: prompt generation/loading + skill execution + grading loop
- `src/core/grader.ts`: structured grader prompt + JSON parse
- `src/providers/`: LLM provider abstraction (`sendMessage`) and provider implementations
- `src/reporters/`: terminal rendering and JSON output helper
- `src/utils/`: filesystem and API key config helpers

## Build and Test Locally

Install deps:

```bash
npm install
```

Build:

```bash
npm run build
```

Type-check:

```bash
npm run lint
```

Smoke test lint command:

```bash
node dist/index.js lint test-fixtures/sample-skill/
```

Help/version:

```bash
node dist/index.js --help
node dist/index.js --version
```

Trigger test (requires key):

```bash
ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill/
```

## Key Design Decisions

- Minimal provider interface:
  - `sendMessage(systemPrompt, userMessage, { model }) => Promise<string>`
- Lint is fully offline and first-class.
- Trigger/eval rely on the same provider abstraction.
- JSON mode is strict:
  - no spinners
  - no colored output
  - stdout contains only the JSON payload
- Error semantics:
  - lint failures => exit code `1`
  - runtime/config errors => exit code `2`
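
The provider signature listed above could be written out in TypeScript roughly as follows. This is a minimal sketch: `SendOptions`, `Provider`, and `EchoProvider` are illustrative names that do not come from the package source; only the `sendMessage(systemPrompt, userMessage, { model })` shape is taken from this document.

```typescript
// Illustrative sketch only: the interface and class names here are assumptions.
interface SendOptions {
  model: string;
}

interface Provider {
  sendMessage(
    systemPrompt: string,
    userMessage: string,
    options: SendOptions,
  ): Promise<string>;
}

// A stand-in provider for offline experiments; it never calls a real API.
class EchoProvider implements Provider {
  async sendMessage(
    systemPrompt: string,
    userMessage: string,
    options: SendOptions,
  ): Promise<string> {
    return `[${options.model}] system=${systemPrompt.length} user=${userMessage}`;
  }
}
```

Because trigger and eval depend only on this one method, a stand-in like `EchoProvider` would be enough to exercise command plumbing without an API key.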

## Gotchas

- `trigger --num-queries` must be even for balanced positive/negative cases.
- OpenAI provider is intentionally a stub in v1 and throws `"OpenAI provider coming soon."`.
- Frontmatter is validated with both `gray-matter` and `js-yaml`; malformed YAML should fail fast.
- Keep file references relative to skill root; out-of-root refs are lint failures.
- If you modify reporter formatting, ensure JSON mode remains machine-safe.

## File-Level Logic Map

- Frontmatter checks: `src/core/linter/frontmatter.ts`
- Structure checks: `src/core/linter/structure.ts`
- Content heuristics: `src/core/linter/content.ts`
- Progressive disclosure: `src/core/linter/disclosure.ts`
- Compatibility hints: `src/core/linter/compat.ts`
- Trigger fake skill pool + scoring: `src/core/trigger-tester.ts`
- Eval grading schema: `src/core/grader.ts`

## Future Work (Not Implemented Yet)

- Real OpenAI provider implementation
- Config file support (`.skilltestrc`)
- Parallel execution
- HTML reporting
- Plugin linter rules
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,326 @@
# skilltest

[npm package](https://www.npmjs.com/package/skilltest)
[MIT license](./LICENSE)
[CI/CD integration](#cicd-integration)

The testing framework for Agent Skills: lint, trigger-test, and evaluate your `SKILL.md` files.

`skilltest` is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.

## Demo

GIF coming soon.

## Why skilltest?

Agent Skills are quick to write but hard to validate before deployment:

- Descriptions can be too vague to trigger reliably.
- Broken paths in `scripts/`, `references/`, or `assets/` fail silently.
- You cannot easily measure trigger precision/recall.
- You do not know whether outputs are good until users exercise the skill.

`skilltest` closes this gap with one CLI and three modes.

## Install

Global:

```bash
npm install -g skilltest
```

Without install:

```bash
npx skilltest --help
```

Requires Node.js `>=18`.

## Quick Start

Lint a skill:

```bash
skilltest lint ./path/to/skill
```

Trigger test:

```bash
skilltest trigger ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

End-to-end eval:

```bash
skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

Example lint summary:

```text
skilltest lint
target: ./test-fixtures/sample-skill
summary: 25/25 checks passed, 0 warnings, 0 failures
```

## Commands

### `skilltest lint <path-to-skill>`

Static analysis only. Fast and offline.

What it checks:

- Frontmatter:
  - YAML presence and validity
  - `name` required, max 64 characters, lowercase/numbers/hyphens, no leading/trailing/consecutive hyphens
  - `description` required, non-empty, max 1024 characters
  - warns if `license` is missing
  - warns if the description is weak on both what the skill does and when to use it
- Structure:
  - warns if `SKILL.md` exceeds 500 lines
  - warns if long references (300+ lines) have no table of contents
  - validates referenced files in `scripts/`, `references/`, `assets/`
  - detects broken relative file references
- Content heuristics:
  - warns if no headers
  - warns if no examples
  - warns on vague phrases
  - warns on angle brackets in frontmatter
  - fails on obvious secret patterns
  - warns on empty/too-short body
  - warns on very short description
- Progressive disclosure:
  - warns if `SKILL.md` is large and no `references/` exists
  - validates references are relative and inside skill root
  - warns on deep reference chains
- Compatibility hints:
  - warns on provider-specific conventions such as `allowed-tools`
  - emits a likely compatibility summary
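
As an illustration of the `name` rule in the frontmatter checks above, here is one way the constraints could be expressed. It is a hedged sketch, not the package's actual implementation: `checkSkillName`, `NAME_PATTERN`, and the problem messages are hypothetical.

```typescript
// Hypothetical helper; the real checks live in src/core/linter/frontmatter.ts and may differ.
// One or more lowercase alphanumeric runs joined by single hyphens:
// this rules out leading, trailing, and consecutive hyphens in one pattern.
const NAME_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const MAX_NAME_LENGTH = 64;

function checkSkillName(name: string): string[] {
  const problems: string[] = [];
  if (name.length === 0) {
    problems.push("name is required");
    return problems;
  }
  if (name.length > MAX_NAME_LENGTH) {
    problems.push(`name exceeds ${MAX_NAME_LENGTH} characters`);
  }
  if (!NAME_PATTERN.test(name)) {
    problems.push("name must use lowercase letters, numbers, and single hyphens");
  }
  return problems;
}
```

Returning a list of problems rather than a boolean lets a caller report every violation in one lint pass.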

### `skilltest trigger <path-to-skill>`

Measures how reliably your skill description triggers, using model simulation.

Flow:

1. Reads `name` and `description` from frontmatter.
2. Generates balanced trigger/non-trigger queries (or loads a custom query file).
3. For each query, asks the model to select one skill from a mixed list:
   - your skill under test
   - realistic fake skills
4. Computes TP, TN, FP, FN, precision, recall, F1.
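
The metrics in step 4 follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. A minimal sketch is shown below; `TriggerCase`, `TriggerMetrics`, and `computeMetrics` are hypothetical names, and returning `0` for a zero denominator is an assumption rather than documented behavior.

```typescript
// Hypothetical types; field names are illustrative, not taken from the package.
interface TriggerCase {
  shouldTrigger: boolean; // expected label for the query
  didTrigger: boolean; // whether the model selected the skill under test
}

interface TriggerMetrics {
  tp: number;
  tn: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
  f1: number;
}

function computeMetrics(cases: TriggerCase[]): TriggerMetrics {
  let tp = 0;
  let tn = 0;
  let fp = 0;
  let fn = 0;
  for (const c of cases) {
    if (c.shouldTrigger && c.didTrigger) tp++;
    else if (!c.shouldTrigger && !c.didTrigger) tn++;
    else if (!c.shouldTrigger && c.didTrigger) fp++;
    else fn++;
  }
  // Assumption: report 0 instead of NaN when a denominator is zero.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, tn, fp, fn, precision, recall, f1 };
}
```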

Flags:

- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--provider <anthropic|openai>` default: `anthropic`
- `--queries <path>` use custom queries JSON
- `--num-queries <n>` default: `20` (must be even)
- `--save-queries <path>` save generated query set
- `--api-key <key>` explicit key override
- `--verbose` show full model decision text
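
The even-number requirement on `--num-queries` exists so that the query set can be split equally into trigger and non-trigger cases. A small sketch of that split follows; `splitQueryBudget` and its error message are hypothetical, and rejecting zero or negative values is an added assumption.

```typescript
// Hypothetical helper: halves the query budget into positive and negative cases.
function splitQueryBudget(numQueries: number): { positive: number; negative: number } {
  // Assumption: only positive even integers are accepted.
  if (!Number.isInteger(numQueries) || numQueries <= 0 || numQueries % 2 !== 0) {
    throw new Error("--num-queries must be a positive even integer");
  }
  const half = numQueries / 2;
  return { positive: half, negative: half };
}
```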

### `skilltest eval <path-to-skill>`

Runs the skill end to end and grades the outputs against assertions.

Flow:

1. Loads prompts from file or auto-generates 5 prompts.
2. Injects full `SKILL.md` as system instructions.
3. Runs each prompt on the chosen model.
4. Uses the grader model to score each assertion with evidence.
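
Steps 2 to 4 of the flow above can be sketched as a loop over prompts. Everything named here (`runEval`, `SendMessage`, `AssertionResult`, `GRADER_SYSTEM_PROMPT`) is hypothetical, and the grader reply format (a JSON array with one object per assertion) is an assumption made for illustration; the package's real grading schema may differ.

```typescript
// Sketch under stated assumptions: names and the grader's JSON reply shape are hypothetical.
type SendMessage = (
  systemPrompt: string,
  userMessage: string,
  options: { model: string },
) => Promise<string>;

interface EvalPrompt {
  prompt: string;
  assertions: string[];
}

interface AssertionResult {
  assertion: string;
  passed: boolean;
  evidence: string;
}

// Hypothetical grader instructions; the real prompt lives in the package source.
const GRADER_SYSTEM_PROMPT =
  "Grade the output against each assertion. Reply with a JSON array of " +
  "{ assertion, passed, evidence } objects.";

async function runEval(
  skillMarkdown: string,
  prompts: EvalPrompt[],
  send: SendMessage,
  model: string,
  graderModel: string = model, // mirrors the documented --grader-model default
): Promise<AssertionResult[][]> {
  const results: AssertionResult[][] = [];
  for (const { prompt, assertions } of prompts) {
    // Steps 2 and 3: the full SKILL.md is the system prompt, the test prompt is the user message.
    const output = await send(skillMarkdown, prompt, { model });
    // Step 4: ask the grader for one verdict per assertion.
    const graderInput = JSON.stringify({ output, assertions });
    const raw = await send(GRADER_SYSTEM_PROMPT, graderInput, { model: graderModel });
    results.push(JSON.parse(raw) as AssertionResult[]);
  }
  return results;
}
```

Passing `send` as a parameter keeps the loop testable with a fake provider, so no API key is needed to check the control flow.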

Flags:

- `--prompts <path>` custom prompts JSON
- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--grader-model <model>` default: same as `--model`
- `--provider <anthropic|openai>` default: `anthropic`
- `--save-results <path>` write full JSON result
- `--api-key <key>` explicit key override
- `--verbose` show full model responses

## Global Flags

- `--help` show help
- `--version` show version
- `--json` output only valid JSON to stdout
- `--no-color` disable terminal colors

## Input File Formats

Trigger queries (`--queries`):

```json
[
  {
    "query": "Please validate this deployment checklist and score it.",
    "should_trigger": true
  },
  {
    "query": "Write a SQL migration for adding an index.",
    "should_trigger": false
  }
]
```
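
For readers who generate query files programmatically, the shape above can be captured as a TypeScript type with a small validating parser. This is a sketch: `TriggerQuery`, `parseTriggerQueries`, and the error messages are hypothetical; only the `query` and `should_trigger` field names come from the example.

```typescript
// Field names match the JSON example; the type and function names are assumptions.
interface TriggerQuery {
  query: string;
  should_trigger: boolean;
}

function parseTriggerQueries(jsonText: string): TriggerQuery[] {
  const data: unknown = JSON.parse(jsonText);
  if (!Array.isArray(data)) {
    throw new Error("queries file must contain a JSON array");
  }
  return data.map((item: unknown, index: number): TriggerQuery => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`entry ${index} must be an object`);
    }
    const record = item as Record<string, unknown>;
    const query = record["query"];
    const shouldTrigger = record["should_trigger"];
    if (typeof query !== "string" || typeof shouldTrigger !== "boolean") {
      throw new Error(`entry ${index} needs a string "query" and a boolean "should_trigger"`);
    }
    return { query, should_trigger: shouldTrigger };
  });
}
```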

Eval prompts (`--prompts`):

```json
[
  {
    "prompt": "Validate this markdown checklist for a production release.",
    "assertions": [
      "output should include pass/warn/fail style categorization",
      "output should provide at least one remediation recommendation"
    ]
  }
]
```

## Output and Exit Codes

Exit codes:

- `0`: success with no lint failures
- `1`: lint failures present
- `2`: runtime/config/API/parse error
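
The exit-code contract above could be modeled as a mapping from a command outcome to a code. The `Outcome` type and `exitCodeFor` function below are hypothetical and only illustrate the three documented cases.

```typescript
// Hypothetical outcome model; only the codes 0, 1, and 2 come from the list above.
type Outcome =
  | { kind: "ok" }
  | { kind: "lint-failures"; count: number }
  | { kind: "error"; message: string };

function exitCodeFor(outcome: Outcome): 0 | 1 | 2 {
  switch (outcome.kind) {
    case "ok":
      return 0; // success with no lint failures
    case "lint-failures":
      return 1; // at least one lint check failed
    case "error":
      return 2; // runtime, config, API, or parse error
  }
}
```

Keeping lint failures (`1`) distinct from tool errors (`2`) lets a CI job tell a skill that needs fixing apart from a broken setup.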

JSON mode examples:

```bash
skilltest lint ./skill --json
skilltest trigger ./skill --json
skilltest eval ./skill --json
```

## API Keys

Anthropic:

```bash
export ANTHROPIC_API_KEY=your-key
```

OpenAI:

```bash
export OPENAI_API_KEY=your-key
```

Override at runtime:

```bash
skilltest trigger ./skill --api-key your-key
```

Current provider status:

- `anthropic`: implemented
- `openai`: interface wired up, but the provider is a stub; commands currently fail with "OpenAI provider coming soon."

## CI/CD Integration

GitHub Actions example to lint skills on pull requests:

```yaml
name: skill-lint

on:
  pull_request:
    paths:
      - "**/SKILL.md"
      - "**/references/**"
      - "**/scripts/**"
      - "**/assets/**"

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run build
      - run: npx skilltest lint path/to/skill --json
```

Optional nightly trigger/eval:

```yaml
name: skill-eval-nightly

on:
  schedule:
    - cron: "0 4 * * *"

jobs:
  trigger-eval:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run build
      - run: npx skilltest trigger path/to/skill --num-queries 20 --json
      - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
```

## Local Development

```bash
npm install
npm run lint
npm run build
node dist/index.js --help
```

Smoke tests:

```bash
node dist/index.js lint test-fixtures/sample-skill/
node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
```

## Release Checklist

```bash
npm run lint
npm run build
npm run test
npm pack --dry-run
npm publish --dry-run
```

Then publish:

```bash
npm publish
```

## Contributing

Issues and pull requests are welcome. Include:

- clear reproduction steps
- expected vs actual behavior
- sample `SKILL.md` or fixtures when relevant

## License

MIT