snapeval 1.0.1 → 1.4.0

Files changed (70)
  1. package/README.md +96 -95
  2. package/assets/ideation-viewer.html +469 -0
  3. package/bin/snapeval.ts +65 -12
  4. package/dist/bin/snapeval.js +59 -12
  5. package/dist/bin/snapeval.js.map +1 -1
  6. package/dist/src/adapters/copilot-sdk-client.d.ts +13 -0
  7. package/dist/src/adapters/copilot-sdk-client.js +59 -0
  8. package/dist/src/adapters/copilot-sdk-client.js.map +1 -0
  9. package/dist/src/adapters/inference/copilot-sdk.d.ts +9 -0
  10. package/dist/src/adapters/inference/copilot-sdk.js +41 -0
  11. package/dist/src/adapters/inference/copilot-sdk.js.map +1 -0
  12. package/dist/src/adapters/inference/copilot.js +1 -2
  13. package/dist/src/adapters/inference/copilot.js.map +1 -1
  14. package/dist/src/adapters/inference/resolve.js +13 -4
  15. package/dist/src/adapters/inference/resolve.js.map +1 -1
  16. package/dist/src/adapters/report/html.d.ts +8 -0
  17. package/dist/src/adapters/report/html.js +283 -0
  18. package/dist/src/adapters/report/html.js.map +1 -0
  19. package/dist/src/adapters/report/terminal.js +1 -1
  20. package/dist/src/adapters/report/terminal.js.map +1 -1
  21. package/dist/src/adapters/skill/copilot-cli.d.ts +1 -0
  22. package/dist/src/adapters/skill/copilot-cli.js +17 -6
  23. package/dist/src/adapters/skill/copilot-cli.js.map +1 -1
  24. package/dist/src/adapters/skill/copilot-sdk.d.ts +6 -0
  25. package/dist/src/adapters/skill/copilot-sdk.js +68 -0
  26. package/dist/src/adapters/skill/copilot-sdk.js.map +1 -0
  27. package/dist/src/commands/check.d.ts +0 -2
  28. package/dist/src/commands/check.js +4 -5
  29. package/dist/src/commands/check.js.map +1 -1
  30. package/dist/src/commands/ideate.d.ts +1 -0
  31. package/dist/src/commands/ideate.js +69 -0
  32. package/dist/src/commands/ideate.js.map +1 -0
  33. package/dist/src/commands/report.d.ts +2 -1
  34. package/dist/src/commands/report.js +9 -0
  35. package/dist/src/commands/report.js.map +1 -1
  36. package/dist/src/commands/review.d.ts +7 -0
  37. package/dist/src/commands/review.js +34 -0
  38. package/dist/src/commands/review.js.map +1 -0
  39. package/dist/src/config.js +0 -1
  40. package/dist/src/config.js.map +1 -1
  41. package/dist/src/engine/comparison/judge.d.ts +4 -0
  42. package/dist/src/engine/comparison/judge.js +12 -3
  43. package/dist/src/engine/comparison/judge.js.map +1 -1
  44. package/dist/src/engine/comparison/pipeline.d.ts +1 -5
  45. package/dist/src/engine/comparison/pipeline.js +9 -22
  46. package/dist/src/engine/comparison/pipeline.js.map +1 -1
  47. package/dist/src/engine/generator.d.ts +5 -0
  48. package/dist/src/engine/generator.js +13 -0
  49. package/dist/src/engine/generator.js.map +1 -1
  50. package/dist/src/types.d.ts +33 -5
  51. package/package.json +16 -5
  52. package/plugin.json +7 -3
  53. package/skills/snapeval/SKILL.md +130 -23
  54. package/src/adapters/copilot-sdk-client.ts +66 -0
  55. package/src/adapters/inference/copilot-sdk.ts +48 -0
  56. package/src/adapters/inference/copilot.ts +1 -2
  57. package/src/adapters/inference/resolve.ts +17 -4
  58. package/src/adapters/report/html.ts +304 -0
  59. package/src/adapters/report/terminal.ts +1 -1
  60. package/src/adapters/skill/copilot-cli.ts +21 -7
  61. package/src/adapters/skill/copilot-sdk.ts +72 -0
  62. package/src/commands/check.ts +4 -5
  63. package/src/commands/ideate.ts +101 -0
  64. package/src/commands/report.ts +13 -2
  65. package/src/commands/review.ts +48 -0
  66. package/src/config.ts +0 -1
  67. package/src/engine/comparison/judge.ts +14 -4
  68. package/src/engine/comparison/pipeline.ts +8 -27
  69. package/src/engine/generator.ts +21 -0
  70. package/src/types.ts +30 -5
package/README.md CHANGED
@@ -6,102 +6,94 @@ Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free infere
6
6
  [![npm version](https://img.shields.io/npm/v/snapeval.svg)](https://www.npmjs.com/package/snapeval)
7
7
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
8
 
9
- snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It generates test cases from your skill's `SKILL.md`, captures baseline outputs, and detects regressions through a tiered comparison pipeline — all with zero manual test authoring.
9
+ snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It analyzes your skill's `SKILL.md`, collaborates with you to design a test strategy through an interactive browser-based viewer, then captures baselines and detects regressions — all with zero manual test authoring.
10
10
 
11
11
  ## Why snapeval?
12
12
 
13
- - **Zero assertions** — AI generates test cases from your SKILL.md. You never write test logic.
14
- - **Semantic comparison** — Three-tier pipeline: schema check (free) → embedding similarity (cheap) → LLM judge with order-swap debiasing (expensive). Most checks cost $0.
15
- - **Free inference** — Uses gpt-5-mini via Copilot CLI (0x multiplier on paid plans) and GitHub Models API (free with GITHUB_TOKEN).
16
- - **Non-determinism handling** — Variance envelope from N baseline runs prevents false regressions.
17
- - **Platform-agnostic** — Adapter-based architecture. Copilot CLI first, Claude Code and others coming.
13
+ - **Interactive ideation** — AI decomposes your skill into behaviors, dimensions, and failure modes, then opens a visual viewer where you shape the test strategy together.
14
+ - **Zero assertions** — No test logic to write. The AI generates realistic, messy prompts that mirror how real users actually type.
15
+ - **Semantic comparison** — Tiered pipeline: schema check (free) → LLM judge with order-swap debiasing (when needed). Most checks cost $0.
16
+ - **Free inference** — Uses gpt-5-mini via Copilot CLI and GitHub Models API.
17
+ - **Platform-agnostic** — Adapter-based architecture. Copilot CLI first, others coming.
18
18
 
19
- ## Quick Start
19
+ ## Install
20
20
 
21
- ### As a Copilot CLI Plugin
21
+ ### From the marketplace
22
22
 
23
- Install directly from the GitHub repo:
23
+ The snapeval marketplace is bundled with the repo. Add it once, then install by name:
24
24
 
25
25
  ```bash
26
- gh copilot -- plugin install matantsach/snapeval
26
+ copilot plugin marketplace add matantsach/snapeval
27
+ copilot plugin install snapeval@snapeval-marketplace
27
28
  ```
28
29
 
29
- Or register the marketplace first, then install by name:
30
+ ### From GitHub directly
30
31
 
31
32
  ```bash
32
- gh copilot -- plugin marketplace add matantsach/snapeval
33
- gh copilot -- plugin install snapeval@snapeval-marketplace
33
+ copilot plugin install matantsach/snapeval
34
34
  ```
35
35
 
36
- Then in Copilot CLI interactive mode, just ask naturally:
37
-
38
- ```
39
- > evaluate my code-reviewer skill
40
- > check skills/code-reviewer for regressions
41
- > approve scenario 3
42
- ```
43
-
44
- The agent will use the snapeval skill automatically based on your prompt.
45
-
46
- ### As a CLI
36
+ ### Verify installation
47
37
 
48
38
  ```bash
49
- npx snapeval init <skill-path> # AI generates test cases from SKILL.md
50
- npx snapeval capture <skill-path> # Run tests, save baseline snapshots
51
- npx snapeval check <skill-path> # Compare current output to baselines
52
- npx snapeval approve [--scenario N] # Accept new behavior as baseline
53
- npx snapeval report <skill-path> # Generate benchmark.json
39
+ copilot plugin list
54
40
  ```
55
41
 
56
- ### Local Development
42
+ ## Usage
57
43
 
58
- For development without `npx`, clone and use `tsx` directly:
44
+ In Copilot CLI, just talk naturally:
59
45
 
60
- ```bash
61
- git clone https://github.com/matantsach/snapeval.git
62
- cd snapeval && npm install
63
- npx tsx bin/snapeval.ts init <skill-path>
46
+ ```
47
+ > evaluate my greeter skill
48
+ > test skills/code-reviewer for regressions
49
+ > check if I broke anything in my-skill
50
+ > approve scenario 3
64
51
  ```
65
52
 
66
- Or load as a local plugin during development:
53
+ snapeval activates automatically based on your prompt.
67
54
 
68
- ```bash
69
- gh copilot -- --plugin-dir /path/to/snapeval
70
- ```
55
+ ### What happens when you evaluate
71
56
 
72
- ### In CI
57
+ 1. **Analyze** — snapeval reads your SKILL.md and reasons through behaviors, input dimensions, failure modes, and ambiguities
58
+ 2. **View** — A browser-based viewer opens showing the analysis with proposed scenarios you can toggle, edit, and extend
59
+ 3. **Confirm** — You review, make changes, and click "Confirm & Run" to export your plan
60
+ 4. **Capture** — snapeval writes `evals.json` and runs the scenarios against your skill, saving baseline snapshots
73
61
 
74
- Commit your `evals.json` and `snapshots/` directory, then add a workflow:
62
+ After initial setup, use `check` to detect regressions and `approve` to accept intentional changes.
75
63
 
76
- ```yaml
77
- # .github/workflows/skill-eval.yml
78
- name: Skill Evaluation
79
- on: [pull_request]
64
+ ## CLI Reference
65
+
66
+ The CLI is the headless backend — useful for CI, scripting, and power users.
80
67
 
81
- jobs:
82
-   eval:
83
-     runs-on: ubuntu-latest
84
-     steps:
85
-       - uses: actions/checkout@v4
86
-       - uses: actions/setup-node@v4
87
-         with:
88
-           node-version: 22
89
-       - run: npm ci
90
-       - run: npx tsx bin/snapeval.ts check skills/my-skill --ci --skip-embedding
68
+ ```
69
+ snapeval init [skill-dir]     Generate test cases from SKILL.md
70
+ snapeval capture [skill-dir]  Run scenarios and save baseline snapshots
71
+ snapeval check [skill-dir]    Compare current output against baselines
72
+ snapeval approve [skill-dir]  Approve regressed scenarios as new baselines
73
+ snapeval report [skill-dir]   Write results with optional HTML viewer
74
+ snapeval ideate [skill-dir]   Open the interactive scenario ideation viewer
91
75
  ```
92
76
 
93
- > **Note:** The `--skip-embedding` flag runs Tier 1 (schema) and Tier 3 (LLM judge) only, skipping Tier 2 which requires the GitHub Models embedding API. For Tier 1-only checks (fastest, free, no API needed), committed baselines with stable output structures will pass without any inference calls.
77
+ | Flag | Description | Default |
78
+ |------|-------------|---------|
79
+ | `--adapter <name>` | Skill adapter | `copilot-cli` |
80
+ | `--inference <name>` | Inference adapter | `auto` |
81
+ | `--budget <amount>` | Spend cap in USD | `unlimited` |
82
+ | `--runs <n>` | Baseline runs per scenario | `1` |
83
+ | `--ci` | CI mode: exit 1 on regressions | off |
84
+ | `--html` | Generate HTML report viewer | off |
85
+ | `--scenario <ids>` | Comma-separated scenario IDs | all |
86
+ | `--verbose` | Verbose output | off |
94
87
 
95
88
  ## How It Works
96
89
 
97
90
  ```
98
- SKILL.md → AI generates test scenarios → Capture baseline snapshots
99
-
100
- Modify skill → Re-run scenarios → Compare via tiered pipeline
101
-
102
- Schema match? → PASS (free, instant)
103
- Embedding > 0.85? → PASS (cheap)
104
- LLM Judge agrees? → PASS/REGRESSED (expensive)
91
+ SKILL.md → AI analyzes skill → Interactive ideation viewer → Capture baselines
92
+
93
+ Modify skill → Re-run scenarios → Compare via tiered pipeline
94
+
95
+ Schema match? → PASS (free, instant)
96
+ LLM Judge agrees? → PASS/REGRESSED
105
97
  ```
106
98
 
107
99
  ### Comparison Pipeline
@@ -109,8 +101,7 @@ SKILL.md → AI generates test scenarios → Capture baseline snapshots
109
101
  | Tier | Method | Cost | When Used |
110
102
  |------|--------|------|-----------|
111
103
  | 1 | Schema check | Free | Structural skeleton matches |
112
- | 2 | Embedding similarity | Cheap | Schema differs but meaning similar |
113
- | 3 | LLM judge (order-swap) | Expensive | Ambiguous cases only |
104
+ | 2 | LLM judge (order-swap) | Cheap | Schema differs, needs semantic comparison |
114
105
 
115
106
  Most stable skills are checked entirely at Tier 1 — $0.00 per run.
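The two-tier flow in the updated table can be sketched roughly as follows. This is a minimal illustration, not snapeval's actual implementation: `skeleton`, `compare`, and the `Judge` type are hypothetical names, the skeleton rules are a guess at what a "structural skeleton" might mean for Markdown output, and treating a disagreement between the two judge orderings as a regression is an assumption.

```typescript
// Hypothetical names throughout; not snapeval's real API.
type Verdict = "PASS" | "REGRESSED";

// A judge returns true when it considers the two outputs semantically equivalent.
type Judge = (first: string, second: string) => Promise<boolean>;

// Reduce an output to its structural skeleton: keep Markdown markers, drop the wording.
function skeleton(output: string): string {
  return output
    .split("\n")
    .map((line) => {
      const t = line.trim();
      if (t === "") return "";
      if (/^#{1,6}\s/.test(t)) return t.match(/^#+/)![0]; // heading level
      if (/^[-*+]\s/.test(t)) return "-"; // bullet item
      if (/^\d+\.\s/.test(t)) return "1."; // numbered item
      if (t.startsWith("```")) return "```"; // code fence
      return "text"; // any other non-empty line
    })
    .filter((s) => s !== "")
    .join("\n");
}

async function compare(
  baseline: string,
  current: string,
  judge: Judge,
): Promise<{ tier: 1 | 2; verdict: Verdict }> {
  // Tier 1: free structural check, no inference call.
  if (skeleton(baseline) === skeleton(current)) {
    return { tier: 1, verdict: "PASS" };
  }
  // Tier 2: ask the judge twice with the argument order swapped, so a judge
  // that favors whichever output it sees first cannot decide the result alone.
  const forward = await judge(baseline, current);
  const backward = await judge(current, baseline);
  // Assumption: both orderings must agree on equivalence to pass.
  return { tier: 2, verdict: forward && backward ? "PASS" : "REGRESSED" };
}
```

With a stable output structure, `compare` returns at Tier 1 and the judge is never called, which matches the "$0.00 per run" claim above.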
116
107
 
@@ -122,7 +113,8 @@ snapeval follows the [agentskills.io evaluation standard](https://agentskills.io
122
113
  my-skill/
123
114
  ├── SKILL.md
124
115
  └── evals/
125
-     ├── evals.json ← AI-generated test cases
116
+     ├── evals.json ← Test scenarios (AI-generated or from ideation)
117
+     ├── analysis.json ← Skill analysis (behaviors, dimensions, gaps)
126
118
      ├── snapshots/ ← Captured baseline outputs
127
119
      └── results/
128
120
          └── iteration-N/
@@ -131,6 +123,40 @@ my-skill/
131
123
              └── benchmark.json
132
124
  ```
133
125
 
126
+ ## In CI
127
+
128
+ Commit your `evals.json` and `snapshots/` directory, then add a workflow:
129
+
130
+ ```yaml
131
+ name: Skill Evaluation
132
+ on: [pull_request]
133
+
134
+ jobs:
135
+   eval:
136
+     runs-on: ubuntu-latest
137
+     steps:
138
+       - uses: actions/checkout@v4
139
+       - uses: actions/setup-node@v4
140
+         with:
141
+           node-version: 22
142
+       - run: npm ci
143
+       - run: npx snapeval check skills/my-skill --ci
144
+ ```
145
+
146
+ ## Local Development
147
+
148
+ ```bash
149
+ git clone https://github.com/matantsach/snapeval.git
150
+ cd snapeval && npm install
151
+ npx tsx bin/snapeval.ts check <skill-path>
152
+ ```
153
+
154
+ Or load as a local plugin:
155
+
156
+ ```bash
157
+ copilot plugin install ./path/to/snapeval
158
+ ```
159
+
134
160
  ## Configuration
135
161
 
136
162
  Create `snapeval.config.json` in your skill or project root:
@@ -139,7 +165,6 @@ Create `snapeval.config.json` in your skill or project root:
139
165
  {
140
166
  "adapter": "copilot-cli",
141
167
  "inference": "auto",
142
- "threshold": 0.85,
143
168
  "runs": 3,
144
169
  "budget": "unlimited"
145
170
  }
@@ -147,43 +172,19 @@ Create `snapeval.config.json` in your skill or project root:
147
172
 
148
173
  CLI flags override config file values.
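One way to picture the precedence rule is the sketch below, where a CLI flag wins over the config file and the config file wins over a built-in default. The `SnapevalConfig` type and `resolveConfig` function are hypothetical names for illustration; the default values are taken from the flag table in the README, and the real resolution logic in `src/config.ts` may differ.

```typescript
// Hypothetical shape; mirrors the keys shown in snapeval.config.json.
type SnapevalConfig = {
  adapter: string;
  inference: string;
  runs: number;
  budget: string;
};

// Defaults as listed in the README's flag table.
const DEFAULTS: SnapevalConfig = {
  adapter: "copilot-cli",
  inference: "auto",
  runs: 1,
  budget: "unlimited",
};

// Precedence per field: CLI flag, then config file, then default.
function resolveConfig(
  fileConfig: Partial<SnapevalConfig>,
  cliFlags: Partial<SnapevalConfig>,
): SnapevalConfig {
  return {
    adapter: cliFlags.adapter ?? fileConfig.adapter ?? DEFAULTS.adapter,
    inference: cliFlags.inference ?? fileConfig.inference ?? DEFAULTS.inference,
    runs: cliFlags.runs ?? fileConfig.runs ?? DEFAULTS.runs,
    budget: cliFlags.budget ?? fileConfig.budget ?? DEFAULTS.budget,
  };
}
```

For example, with `"runs": 3` in the config file and `--runs 5` on the command line, this sketch resolves `runs` to 5; without the flag it resolves to 3.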
149
174
 
150
- ## CLI Reference
151
-
152
- ```
153
- snapeval init [skill-dir]     Generate test cases from SKILL.md using AI
154
- snapeval capture [skill-dir]  Run skill against all scenarios, save baselines
155
- snapeval check [skill-dir]    Compare current output against baselines
156
- snapeval approve [skill-dir]  Approve regressed scenarios as new baselines
157
- snapeval report [skill-dir]   Write results to evals/results/iteration-N/
158
- ```
159
-
160
- **Common flags:**
161
-
162
- | Flag | Description | Default |
163
- |------|-------------|---------|
164
- | `--adapter <name>` | Skill adapter | `copilot-cli` |
165
- | `--inference <name>` | Inference adapter | `auto` |
166
- | `--threshold <n>` | Embedding similarity threshold | `0.85` |
167
- | `--budget <amount>` | Spend cap in USD | `unlimited` |
168
- | `--runs <n>` | Baseline runs per scenario | `1` |
169
- | `--ci` | CI mode: exit 1 on regressions | off |
170
- | `--skip-embedding` | Skip Tier 2 (embedding) | off |
171
- | `--scenario <ids>` | Comma-separated scenario IDs | all |
172
- | `--verbose` | Verbose output | off |
173
-
174
175
  ## Architecture
175
176
 
176
177
  Three surfaces over a shared core engine:
177
178
 
178
179
  - **Plugin** (SKILL.md) — Interactive product. AI handles everything.
179
180
  - **CLI** (`npx snapeval`) — Headless backend for CI and power users.
180
- - **GitHub Action** — CI wrapper (coming in v2).
181
+ - **GitHub Action** — CI wrapper (planned).
181
182
 
182
- Three adapter layers for platform independence:
183
+ Adapter layers for platform independence:
183
184
 
184
- - **SkillAdapter** — How to invoke a skill (Copilot CLI, Claude Code, generic)
185
+ - **SkillAdapter** — How to invoke a skill (Copilot CLI, others planned)
185
186
  - **InferenceAdapter** — Where to get LLM capabilities (Copilot gpt-5-mini, GitHub Models API)
186
- - **ReportAdapter** — How to present results (terminal, JSON, PR comment)
187
+ - **ReportAdapter** — How to present results (terminal, JSON, HTML viewer)
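As a rough illustration of the adapter split, the three layers could be modeled as small interfaces like the ones below. These shapes are assumptions made for this sketch: the actual definitions live in `src/types.ts` and are not shown in this part of the diff, so method names such as `invoke`, `chat`, and `report`, along with `ScenarioResult`, `MemoryReporter`, and `echoSkill`, are hypothetical.

```typescript
// Hypothetical interfaces; the real ones in src/types.ts may look different.
interface ScenarioResult {
  scenarioId: number;
  verdict: "PASS" | "REGRESSED";
}

// How to invoke a skill on some platform.
interface SkillAdapter {
  name: string;
  invoke(skillPath: string, prompt: string): Promise<string>;
}

// Where LLM capabilities come from.
interface InferenceAdapter {
  name: string;
  chat(prompt: string): Promise<string>;
}

// How results are presented.
interface ReportAdapter {
  name: string;
  report(results: ScenarioResult[]): Promise<void>;
}

// Stand-in skill adapter that echoes the prompt instead of calling a real platform.
const echoSkill: SkillAdapter = {
  name: "echo",
  invoke: async (_skillPath, prompt) => `echo: ${prompt}`,
};

// Stand-in reporter that collects lines in memory instead of printing them.
class MemoryReporter implements ReportAdapter {
  readonly name = "memory";
  readonly lines: string[] = [];

  async report(results: ScenarioResult[]): Promise<void> {
    for (const r of results) {
      this.lines.push(`scenario ${r.scenarioId}: ${r.verdict}`);
    }
  }
}
```

Because the engine would depend only on the interfaces, swapping the terminal reporter for an HTML one, or one skill adapter for another, should not require changes to the comparison logic.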
187
188
 
188
189
  ## Contributing
189
190