snapeval 1.0.1 → 1.4.0
- package/README.md +96 -95
- package/assets/ideation-viewer.html +469 -0
- package/bin/snapeval.ts +65 -12
- package/dist/bin/snapeval.js +59 -12
- package/dist/bin/snapeval.js.map +1 -1
- package/dist/src/adapters/copilot-sdk-client.d.ts +13 -0
- package/dist/src/adapters/copilot-sdk-client.js +59 -0
- package/dist/src/adapters/copilot-sdk-client.js.map +1 -0
- package/dist/src/adapters/inference/copilot-sdk.d.ts +9 -0
- package/dist/src/adapters/inference/copilot-sdk.js +41 -0
- package/dist/src/adapters/inference/copilot-sdk.js.map +1 -0
- package/dist/src/adapters/inference/copilot.js +1 -2
- package/dist/src/adapters/inference/copilot.js.map +1 -1
- package/dist/src/adapters/inference/resolve.js +13 -4
- package/dist/src/adapters/inference/resolve.js.map +1 -1
- package/dist/src/adapters/report/html.d.ts +8 -0
- package/dist/src/adapters/report/html.js +283 -0
- package/dist/src/adapters/report/html.js.map +1 -0
- package/dist/src/adapters/report/terminal.js +1 -1
- package/dist/src/adapters/report/terminal.js.map +1 -1
- package/dist/src/adapters/skill/copilot-cli.d.ts +1 -0
- package/dist/src/adapters/skill/copilot-cli.js +17 -6
- package/dist/src/adapters/skill/copilot-cli.js.map +1 -1
- package/dist/src/adapters/skill/copilot-sdk.d.ts +6 -0
- package/dist/src/adapters/skill/copilot-sdk.js +68 -0
- package/dist/src/adapters/skill/copilot-sdk.js.map +1 -0
- package/dist/src/commands/check.d.ts +0 -2
- package/dist/src/commands/check.js +4 -5
- package/dist/src/commands/check.js.map +1 -1
- package/dist/src/commands/ideate.d.ts +1 -0
- package/dist/src/commands/ideate.js +69 -0
- package/dist/src/commands/ideate.js.map +1 -0
- package/dist/src/commands/report.d.ts +2 -1
- package/dist/src/commands/report.js +9 -0
- package/dist/src/commands/report.js.map +1 -1
- package/dist/src/commands/review.d.ts +7 -0
- package/dist/src/commands/review.js +34 -0
- package/dist/src/commands/review.js.map +1 -0
- package/dist/src/config.js +0 -1
- package/dist/src/config.js.map +1 -1
- package/dist/src/engine/comparison/judge.d.ts +4 -0
- package/dist/src/engine/comparison/judge.js +12 -3
- package/dist/src/engine/comparison/judge.js.map +1 -1
- package/dist/src/engine/comparison/pipeline.d.ts +1 -5
- package/dist/src/engine/comparison/pipeline.js +9 -22
- package/dist/src/engine/comparison/pipeline.js.map +1 -1
- package/dist/src/engine/generator.d.ts +5 -0
- package/dist/src/engine/generator.js +13 -0
- package/dist/src/engine/generator.js.map +1 -1
- package/dist/src/types.d.ts +33 -5
- package/package.json +16 -5
- package/plugin.json +7 -3
- package/skills/snapeval/SKILL.md +130 -23
- package/src/adapters/copilot-sdk-client.ts +66 -0
- package/src/adapters/inference/copilot-sdk.ts +48 -0
- package/src/adapters/inference/copilot.ts +1 -2
- package/src/adapters/inference/resolve.ts +17 -4
- package/src/adapters/report/html.ts +304 -0
- package/src/adapters/report/terminal.ts +1 -1
- package/src/adapters/skill/copilot-cli.ts +21 -7
- package/src/adapters/skill/copilot-sdk.ts +72 -0
- package/src/commands/check.ts +4 -5
- package/src/commands/ideate.ts +101 -0
- package/src/commands/report.ts +13 -2
- package/src/commands/review.ts +48 -0
- package/src/config.ts +0 -1
- package/src/engine/comparison/judge.ts +14 -4
- package/src/engine/comparison/pipeline.ts +8 -27
- package/src/engine/generator.ts +21 -0
- package/src/types.ts +30 -5
package/README.md
CHANGED
@@ -6,102 +6,94 @@ Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free infere
 [](https://www.npmjs.com/package/snapeval)
 [](https://opensource.org/licenses/MIT)
 
-snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It
+snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It analyzes your skill's `SKILL.md`, collaborates with you to design a test strategy through an interactive browser-based viewer, then captures baselines and detects regressions — all with zero manual test authoring.
 
 ## Why snapeval?
 
-- **
-- **
-- **
-- **
-- **Platform-agnostic** — Adapter-based architecture. Copilot CLI first,
+- **Interactive ideation** — AI decomposes your skill into behaviors, dimensions, and failure modes, then opens a visual viewer where you shape the test strategy together.
+- **Zero assertions** — No test logic to write. The AI generates realistic, messy prompts that mirror how real users actually type.
+- **Semantic comparison** — Tiered pipeline: schema check (free) → LLM judge with order-swap debiasing (when needed). Most checks cost $0.
+- **Free inference** — Uses gpt-5-mini via Copilot CLI and GitHub Models API.
+- **Platform-agnostic** — Adapter-based architecture. Copilot CLI first, others coming.
 
-##
+## Install
 
-###
+### From the marketplace
 
-
+The snapeval marketplace is bundled with the repo. Add it once, then install by name:
 
 ```bash
-
+copilot plugin marketplace add matantsach/snapeval
+copilot plugin install snapeval@snapeval-marketplace
 ```
 
-
+### From GitHub directly
 
 ```bash
-
-gh copilot -- plugin install snapeval@snapeval-marketplace
+copilot plugin install matantsach/snapeval
 ```
 
-
-
-```
-> evaluate my code-reviewer skill
-> check skills/code-reviewer for regressions
-> approve scenario 3
-```
-
-The agent will use the snapeval skill automatically based on your prompt.
-
-### As a CLI
+### Verify installation
 
 ```bash
-
-npx snapeval capture <skill-path>    # Run tests, save baseline snapshots
-npx snapeval check <skill-path>      # Compare current output to baselines
-npx snapeval approve [--scenario N]  # Accept new behavior as baseline
-npx snapeval report <skill-path>     # Generate benchmark.json
+copilot plugin list
 ```
 
-
+## Usage
 
-
+In Copilot CLI, just talk naturally:
 
-```
-
-
-
+```
+> evaluate my greeter skill
+> test skills/code-reviewer for regressions
+> check if I broke anything in my-skill
+> approve scenario 3
 ```
 
-
+snapeval activates automatically based on your prompt.
 
-
-gh copilot -- --plugin-dir /path/to/snapeval
-```
+### What happens when you evaluate
 
-
+1. **Analyze** — snapeval reads your SKILL.md and reasons through behaviors, input dimensions, failure modes, and ambiguities
+2. **View** — A browser-based viewer opens showing the analysis with proposed scenarios you can toggle, edit, and extend
+3. **Confirm** — You review, make changes, and click "Confirm & Run" to export your plan
+4. **Capture** — snapeval writes `evals.json` and runs the scenarios against your skill, saving baseline snapshots
 
-
+After initial setup, use `check` to detect regressions and `approve` to accept intentional changes.
 
-
-
-
-on: [pull_request]
+## CLI Reference
+
+The CLI is the headless backend — useful for CI, scripting, and power users.
 
-
-
-
-
-
-
-
-          node-version: 22
-      - run: npm ci
-      - run: npx tsx bin/snapeval.ts check skills/my-skill --ci --skip-embedding
+```
+snapeval init [skill-dir]      Generate test cases from SKILL.md
+snapeval capture [skill-dir]   Run scenarios and save baseline snapshots
+snapeval check [skill-dir]     Compare current output against baselines
+snapeval approve [skill-dir]   Approve regressed scenarios as new baselines
+snapeval report [skill-dir]    Write results with optional HTML viewer
+snapeval ideate [skill-dir]    Open the interactive scenario ideation viewer
 ```
 
-
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--adapter <name>` | Skill adapter | `copilot-cli` |
+| `--inference <name>` | Inference adapter | `auto` |
+| `--budget <amount>` | Spend cap in USD | `unlimited` |
+| `--runs <n>` | Baseline runs per scenario | `1` |
+| `--ci` | CI mode: exit 1 on regressions | off |
+| `--html` | Generate HTML report viewer | off |
+| `--scenario <ids>` | Comma-separated scenario IDs | all |
+| `--verbose` | Verbose output | off |
 
 ## How It Works
 
 ```
-SKILL.md → AI
-
-
-
-
-
-LLM Judge agrees? → PASS/REGRESSED (expensive)
+SKILL.md → AI analyzes skill → Interactive ideation viewer → Capture baselines
+↓
+Modify skill → Re-run scenarios → Compare via tiered pipeline
+↓
+Schema match? → PASS (free, instant)
+LLM Judge agrees? → PASS/REGRESSED
 ```
 
 ### Comparison Pipeline
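Reviewer note: the README's flag table describes `--scenario <ids>` as a comma-separated list of scenario IDs with a default of "all". The sketch below shows one way such a value could be parsed. It is not taken from the snapeval source: `parseScenarioIds` is a hypothetical helper, and the assumption that IDs are positive integers rests only on README examples such as "approve scenario 3".

```typescript
// Hypothetical helper, not part of snapeval.
// Assumption: scenario IDs are positive integers.
function parseScenarioIds(raw: string | undefined): number[] | "all" {
  // The README lists "all" as the default when the flag is absent.
  if (raw === undefined || raw.trim() === "") return "all";

  const ids: number[] = [];
  for (const part of raw.split(",")) {
    const text = part.trim();
    // Reject empty segments, zero, negatives, and non-numeric input.
    if (!/^[1-9]\d*$/.test(text)) {
      throw new Error(`Invalid scenario ID: "${text}"`);
    }
    const id = Number(text);
    // Drop duplicates while keeping first-seen order.
    if (ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}
```

Under these assumptions, `parseScenarioIds("3, 1,3")` yields `[3, 1]`, and a missing flag yields `"all"`.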
@@ -109,8 +101,7 @@ SKILL.md → AI generates test scenarios → Capture baseline snapshots
 | Tier | Method | Cost | When Used |
 |------|--------|------|-----------|
 | 1 | Schema check | Free | Structural skeleton matches |
-| 2 |
-| 3 | LLM judge (order-swap) | Expensive | Ambiguous cases only |
+| 2 | LLM judge (order-swap) | Cheap | Schema differs, needs semantic comparison |
 
 Most stable skills are checked entirely at Tier 1 — $0.00 per run.
 
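Reviewer note: the two-tier table in the new README (schema check first, LLM judge with order-swap only when the schema differs) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `pipeline.ts` or `judge.ts`: the `skeleton` function, the `Judge` signature, and the rule that a disagreement between the two judge orderings counts as `REGRESSED` are all hypothetical. The judge is synchronous here for brevity; a real LLM call would be asynchronous.

```typescript
type Verdict = "PASS" | "REGRESSED";

// Hypothetical judge: returns true when the two outputs are judged equivalent.
type Judge = (first: string, second: string) => boolean;

// Assumed notion of "structural skeleton": one token per non-empty line.
function skeleton(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      if (/^#{1,6}\s/.test(line)) return "heading";
      if (/^([-*]|\d+\.)\s/.test(line)) return "list-item";
      return "text";
    });
}

function compare(
  baseline: string,
  current: string,
  judge: Judge
): { verdict: Verdict; tier: 1 | 2 } {
  // Tier 1: free structural check, no model call.
  const a = skeleton(baseline);
  const b = skeleton(current);
  if (a.length === b.length && a.every((kind, i) => kind === b[i])) {
    return { verdict: "PASS", tier: 1 };
  }

  // Tier 2: ask the judge in both orders to cancel out position bias.
  const forward = judge(baseline, current);
  const swapped = judge(current, baseline);
  // Assumption: a split decision is treated conservatively as a regression.
  return { verdict: forward && swapped ? "PASS" : "REGRESSED", tier: 2 };
}
```

Requiring agreement across both orderings means a judge that merely favors whichever text appears first cannot produce a `PASS` on its own.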
@@ -122,7 +113,8 @@ snapeval follows the [agentskills.io evaluation standard](https://agentskills.io
 my-skill/
 ├── SKILL.md
 └── evals/
-    ├── evals.json ← AI-generated
+    ├── evals.json ← Test scenarios (AI-generated or from ideation)
+    ├── analysis.json ← Skill analysis (behaviors, dimensions, gaps)
     ├── snapshots/ ← Captured baseline outputs
     └── results/
         └── iteration-N/
@@ -131,6 +123,40 @@ my-skill/
             └── benchmark.json
 ```
 
+## In CI
+
+Commit your `evals.json` and `snapshots/` directory, then add a workflow:
+
+```yaml
+name: Skill Evaluation
+on: [pull_request]
+
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+      - run: npm ci
+      - run: npx snapeval check skills/my-skill --ci
+```
+
+## Local Development
+
+```bash
+git clone https://github.com/matantsach/snapeval.git
+cd snapeval && npm install
+npx tsx bin/snapeval.ts check <skill-path>
+```
+
+Or load as a local plugin:
+
+```bash
+copilot plugin install ./path/to/snapeval
+```
+
 ## Configuration
 
 Create `snapeval.config.json` in your skill or project root:
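Reviewer note: the CI workflow relies on `--ci`, which the flag table describes as "exit 1 on regressions". A minimal sketch of that contract follows. The `ScenarioResult` shape and `exitCodeFor` are hypothetical names, and the assumption that a run without `--ci` exits 0 even when regressions exist is inferred from the table, not confirmed by the source.

```typescript
// Hypothetical result shape, not the type exported by snapeval.
interface ScenarioResult {
  id: number;
  verdict: "PASS" | "REGRESSED";
}

// Map a finished check run to a process exit code.
function exitCodeFor(results: ScenarioResult[], ci: boolean): number {
  const regressed = results.some((r) => r.verdict === "REGRESSED");
  // Assumption: only CI mode turns regressions into a failing exit code.
  return ci && regressed ? 1 : 0;
}
```

A caller would pass the returned value to `process.exit`, which is what lets the `npx snapeval check ... --ci` step fail the pull request job.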
@@ -139,7 +165,6 @@ Create `snapeval.config.json` in your skill or project root:
 {
   "adapter": "copilot-cli",
   "inference": "auto",
-  "threshold": 0.85,
   "runs": 3,
   "budget": "unlimited"
 }
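Reviewer note: with `threshold` removed, the config file carries four keys, and the README states that CLI flags override config file values. One way to express that precedence (defaults, then file, then flags) is sketched below. The `SnapevalConfig` interface and `resolveConfig` are hypothetical and only mirror the keys and defaults shown in the README; the actual logic in `config.ts` may differ.

```typescript
// Hypothetical shape mirroring the README's config keys.
interface SnapevalConfig {
  adapter: string;
  inference: string;
  runs: number;
  budget: number | "unlimited"; // assumption: a USD amount or the literal "unlimited"
}

// Defaults as listed in the README's flag table.
const DEFAULTS: SnapevalConfig = {
  adapter: "copilot-cli",
  inference: "auto",
  runs: 1,
  budget: "unlimited",
};

// Precedence: CLI flag, then config file, then built-in default.
function resolveConfig(
  file: Partial<SnapevalConfig>,
  flags: Partial<SnapevalConfig>
): SnapevalConfig {
  return {
    adapter: flags.adapter ?? file.adapter ?? DEFAULTS.adapter,
    inference: flags.inference ?? file.inference ?? DEFAULTS.inference,
    runs: flags.runs ?? file.runs ?? DEFAULTS.runs,
    budget: flags.budget ?? file.budget ?? DEFAULTS.budget,
  };
}
```

With the file shown in this hunk (`"runs": 3`) and a command line of `--runs 5`, this sketch resolves `runs` to 5 and leaves the other keys at their file or default values.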
@@ -147,43 +172,19 @@ Create `snapeval.config.json` in your skill or project root:
 
 CLI flags override config file values.
 
-## CLI Reference
-
-```
-snapeval init [skill-dir]      Generate test cases from SKILL.md using AI
-snapeval capture [skill-dir]   Run skill against all scenarios, save baselines
-snapeval check [skill-dir]     Compare current output against baselines
-snapeval approve [skill-dir]   Approve regressed scenarios as new baselines
-snapeval report [skill-dir]    Write results to evals/results/iteration-N/
-```
-
-**Common flags:**
-
-| Flag | Description | Default |
-|------|-------------|---------|
-| `--adapter <name>` | Skill adapter | `copilot-cli` |
-| `--inference <name>` | Inference adapter | `auto` |
-| `--threshold <n>` | Embedding similarity threshold | `0.85` |
-| `--budget <amount>` | Spend cap in USD | `unlimited` |
-| `--runs <n>` | Baseline runs per scenario | `1` |
-| `--ci` | CI mode: exit 1 on regressions | off |
-| `--skip-embedding` | Skip Tier 2 (embedding) | off |
-| `--scenario <ids>` | Comma-separated scenario IDs | all |
-| `--verbose` | Verbose output | off |
-
 ## Architecture
 
 Three surfaces over a shared core engine:
 
 - **Plugin** (SKILL.md) — Interactive product. AI handles everything.
 - **CLI** (`npx snapeval`) — Headless backend for CI and power users.
-- **GitHub Action** — CI wrapper (
+- **GitHub Action** — CI wrapper (planned).
 
-
+Adapter layers for platform independence:
 
-- **SkillAdapter** — How to invoke a skill (Copilot CLI,
+- **SkillAdapter** — How to invoke a skill (Copilot CLI, others planned)
 - **InferenceAdapter** — Where to get LLM capabilities (Copilot gpt-5-mini, GitHub Models API)
-- **ReportAdapter** — How to present results (terminal, JSON,
+- **ReportAdapter** — How to present results (terminal, JSON, HTML viewer)
 
 ## Contributing
 
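Reviewer note: the Architecture section names three adapter layers but the README shows no interfaces. As an illustration of the adapter idea only, the sketch below models a report adapter with two implementations selected by name. `ResultRow`, `ReportAdapter`, `terminalReport`, `jsonReport`, and `pickReport` are all hypothetical; the real declarations live in `src/types.ts` and `src/adapters/report/` and may look quite different.

```typescript
// Hypothetical result row, not snapeval's exported type.
interface ResultRow {
  id: number;
  verdict: "PASS" | "REGRESSED";
}

// Hypothetical adapter contract: turn results into a presentable string.
interface ReportAdapter {
  readonly name: string;
  render(rows: ResultRow[]): string;
}

const terminalReport: ReportAdapter = {
  name: "terminal",
  // One line per scenario, e.g. "#1 PASS".
  render: (rows) => rows.map((r) => `#${r.id} ${r.verdict}`).join("\n"),
};

const jsonReport: ReportAdapter = {
  name: "json",
  render: (rows) => JSON.stringify(rows),
};

// Look an adapter up by name so callers never depend on a concrete class.
function pickReport(name: string, adapters: ReportAdapter[]): ReportAdapter {
  for (const adapter of adapters) {
    if (adapter.name === name) return adapter;
  }
  throw new Error(`Unknown report adapter: ${name}`);
}
```

Because callers only see the `ReportAdapter` contract, adding a new presentation such as the HTML viewer would mean registering one more object without touching the engine.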