skilltest 0.7.0 → 0.8.0
- package/CLAUDE.md +3 -6
- package/README.md +92 -0
- package/dist/index.js +378 -98
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/CLAUDE.md
CHANGED
@@ -77,6 +77,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - default concurrency is `5`
 - `--concurrency 1` preserves the old sequential behavior
 - trigger RNG-dependent fake-skill setup is precomputed before requests begin, preserving seed determinism
+- Comparative trigger testing is opt-in via `--compare`; standard fake-skill pool is the default.
 - JSON mode is strict:
   - no spinners
   - no colored output
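The concurrency notes in the hunk above say that RNG-dependent setup is precomputed before requests begin so that a seed gives the same results at any concurrency. A minimal sketch of that idea follows. Every name here (`makeRng`, `precomputeSetups`, `mapWithConcurrency`, `fakeRequest`) is hypothetical, and the small linear congruential generator only stands in for whatever RNG skilltest actually uses.

```javascript
// Hypothetical sketch; none of these names come from skilltest.

// Tiny seeded generator (a linear congruential generator) standing in for the real RNG.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Step 1: draw every random value up front, in query order, before any request starts.
function precomputeSetups(queries, seed) {
  const rng = makeRng(seed);
  return queries.map((query) => ({ query, draw: rng() }));
}

// Step 2: process the prepared setups with at most `limit` requests in flight.
// Results are stored by index, so completion order does not affect the output.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}

// Stand-in for a model request; resolves after a delay that varies per setup.
function fakeRequest(setup) {
  const delayMs = Math.floor(setup.draw * 10);
  return new Promise((resolve) =>
    setTimeout(() => resolve(`${setup.query}:${setup.draw.toFixed(4)}`), delayMs)
  );
}

async function run(concurrency) {
  const setups = precomputeSetups(["q1", "q2", "q3", "q4"], 42);
  return mapWithConcurrency(setups, concurrency, fakeRequest);
}

// Concurrency 5 and concurrency 1 produce the same array for the same seed.
Promise.all([run(5), run(1)]).then(([parallel, sequential]) => {
  console.log(JSON.stringify(parallel) === JSON.stringify(sequential));
});
```

Because no random draw happens inside a worker, the order in which requests finish cannot change which values each query received.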
@@ -103,11 +104,7 @@ ANTHROPIC_API_KEY=your-key node dist/index.js trigger test-fixtures/sample-skill
 - Security heuristics: `src/core/linter/security.ts`
 - Progressive disclosure: `src/core/linter/disclosure.ts`
 - Compatibility hints: `src/core/linter/compat.ts`
--
+- Plugin loading + validation + rule execution: `src/core/linter/plugin.ts`
+- Trigger fake skill pool + comparative competitor loading + scoring: `src/core/trigger-tester.ts`
 - Eval grading schema: `src/core/grader.ts`
 - Combined quality gate orchestration: `src/core/check-runner.ts`
-
-## Future Work (Not Implemented Yet)
-
-- Config file support (`.skilltestrc`)
-- Plugin linter rules
package/README.md
CHANGED
@@ -163,6 +163,72 @@ What it checks:
 Flags:
 
 - `--html <path>` write a self-contained HTML report
+- `--plugin <path>` load a custom lint plugin file (repeatable)
+
+### Plugin Rules
+
+You can run custom lint rules alongside the built-in checks. Plugin rules use the
+same `LintContext` and `LintIssue` types as the core linter, and their results
+appear in the same `LintReport`.
+
+Config:
+
+```json
+{
+  "lint": {
+    "plugins": ["./my-rules.js"]
+  }
+}
+```
+
+CLI:
+
+```bash
+skilltest lint ./skill --plugin ./my-rules.js
+```
+
+Minimal plugin example:
+
+```js
+export default {
+  rules: [
+    {
+      checkId: "custom:no-todo",
+      title: "No TODO comments",
+      check(context) {
+        const body = context.frontmatter.content;
+        if (/\bTODO\b/.test(body)) {
+          return [
+            {
+              id: "custom.no-todo",
+              checkId: "custom:no-todo",
+              title: "No TODO comments",
+              status: "warn",
+              message: "SKILL.md contains a TODO marker."
+            }
+          ];
+        }
+        return [
+          {
+            id: "custom.no-todo",
+            checkId: "custom:no-todo",
+            title: "No TODO comments",
+            status: "pass",
+            message: "No TODO markers found."
+          }
+        ];
+      }
+    }
+  ]
+};
+```
+
+Notes:
+
+- Plugin files are loaded with dynamic `import()`.
+- `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
+- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
+- CLI `--plugin` values replace config-file `lint.plugins` values.
 
 ### `skilltest trigger <path-to-skill>`
 
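The plugin notes in the hunk above describe two behaviors: CLI `--plugin` values replace config-file `lint.plugins` values, and plugin rules run after the built-in checks in listing order. The sketch below illustrates both under stated assumptions. `resolvePluginPaths` and `collectIssues` are hypothetical helpers, not skilltest's API, and the fallback to the config list when no `--plugin` flag is given is an assumption rather than documented behavior.

```javascript
// Sketch only: these helper names are hypothetical, not skilltest's API.

// CLI --plugin values replace config-file lint.plugins values (no merging).
// Assumption: the config list applies only when no --plugin flag was passed.
function resolvePluginPaths(cliPlugins, config) {
  const fromConfig = config?.lint?.plugins ?? [];
  return cliPlugins.length > 0 ? [...cliPlugins] : [...fromConfig];
}

// Plugin issues are appended after the built-in issues, plugin by plugin,
// in the order the plugins were listed.
function collectIssues(builtInIssues, plugins, context) {
  const issues = [...builtInIssues];
  for (const plugin of plugins) {
    for (const rule of plugin.rules) {
      issues.push(...rule.check(context));
    }
  }
  return issues;
}

// A plugin shaped like the README's minimal example.
const noTodoPlugin = {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const found = /\bTODO\b/.test(context.frontmatter.content);
        return [
          {
            id: "custom.no-todo",
            checkId: "custom:no-todo",
            title: "No TODO comments",
            status: found ? "warn" : "pass",
            message: found ? "SKILL.md contains a TODO marker." : "No TODO markers found."
          }
        ];
      }
    }
  ]
};

const config = { lint: { plugins: ["./my-rules.js"] } };
console.log(resolvePluginPaths(["./cli-rules.js"], config)); // CLI list only
console.log(resolvePluginPaths([], config)); // falls back to the config list

// "builtin.example" is a made-up stand-in for a built-in lint result.
const context = { frontmatter: { content: "TODO: write usage notes" } };
const builtIn = [{ id: "builtin.example", status: "pass" }];
console.log(collectIssues(builtIn, [noTodoPlugin], context).map((issue) => issue.status));
```

Replacing rather than merging keeps a one-off CLI run from silently picking up rules listed in the config file.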
@@ -175,6 +241,7 @@ Flow:
 3. For each query, asks model to select one skill from a mixed list:
    - your skill under test
    - realistic fake skills
+   - optional sibling competitor skills from `--compare`
 4. Computes TP, TN, FP, FN, precision, recall, F1.
 
 For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used,
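Step 4 of the flow above derives precision, recall, and F1 from the TP, TN, FP, and FN counts. The sketch below uses the standard definitions. The function name `computeTriggerMetrics` and the choice to return `0` when a denominator is zero are assumptions, not taken from skilltest.

```javascript
// Standard definitions; the function name and the zero-division convention
// (return 0) are assumptions for illustration.
function computeTriggerMetrics({ tp, fp, fn }) {
  // Precision: of the queries where the skill was selected, how many should have selected it.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  // Recall: of the queries that should have selected the skill, how many did.
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  // F1: harmonic mean of precision and recall.
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Example counts for 20 queries; TN is tracked but does not enter these three metrics.
const counts = { tp: 8, tn: 9, fp: 1, fn: 2 };
console.log(computeTriggerMetrics(counts));
```

With these example counts, precision is 8/9, recall is 8/10, and F1 is 16/19 (about 0.84).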
@@ -188,6 +255,7 @@ Flags:
 - `--model <model>` default: `claude-sonnet-4-5-20250929`
 - `--provider <anthropic|openai>` default: `anthropic`
 - `--queries <path>` use custom queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible fake-skill sampling
 - `--concurrency <n>` default: `5`
@@ -196,6 +264,28 @@ Flags:
 - `--api-key <key>` explicit key override
 - `--verbose` show full model decision text
 
+### Comparative Trigger Testing
+
+Test whether your skill is distinctive enough to be selected over similar real skills:
+
+```bash
+skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
+```
+
+Config:
+
+```json
+{
+  "trigger": {
+    "compare": ["../similar-skill-1", "../similar-skill-2"]
+  }
+}
+```
+
+Comparative mode includes the real competitor skills in the candidate list alongside
+fake skills. This reveals confusion between skills with overlapping descriptions that
+standard trigger testing would miss.
+
 ### `skilltest eval <path-to-skill>`
 
 Runs full skill behavior and grades outputs against assertions.
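The comparative-mode description above says competitor skills join the candidate list alongside the fake skills. A rough sketch of how such a list could be assembled reproducibly from a seed is shown below. All names are hypothetical, skill entries are reduced to plain strings, and the mulberry32 generator is a swapped-in stand-in, since the source does not say which RNG skilltest uses.

```javascript
// Hypothetical sketch; not skilltest's implementation.

// Small seeded PRNG (mulberry32), used here only so the example is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle that returns a new array and draws from the given RNG.
function shuffle(items, rng) {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Candidate list = skill under test + sampled fake skills + real competitors,
// shuffled so the position of the skill under test carries no signal.
function buildCandidates(target, fakePool, competitors, fakeCount, seed) {
  const rng = mulberry32(seed);
  const fakes = shuffle(fakePool, rng).slice(0, fakeCount);
  return shuffle([target, ...fakes, ...competitors], rng);
}

const fakePool = ["fake-a", "fake-b", "fake-c", "fake-d"];
const competitors = ["similar-skill-1", "similar-skill-2"];
console.log(buildCandidates("my-skill", fakePool, competitors, 2, 7));
```

Calling `buildCandidates` twice with the same seed yields the same list, which mirrors the reproducibility that `--seed` is documented to provide for fake-skill sampling.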
@@ -238,9 +328,11 @@ Flags:
 - `--grader-model <model>` default: same as resolved `--model`
 - `--api-key <key>` explicit key override
 - `--queries <path>` custom trigger queries JSON
+- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
 - `--num-queries <n>` default: `20` (must be even)
 - `--seed <number>` RNG seed for reproducible trigger sampling
 - `--prompts <path>` custom eval prompts JSON
+- `--plugin <path>` load a custom lint plugin file (repeatable)
 - `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
 - `--html <path>` write a self-contained HTML report
 - `--min-f1 <n>` default: `0.8`