snapeval 1.4.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/plugin.json +1 -1
- package/skills/snapeval/SKILL.md +86 -13
package/package.json
CHANGED
package/plugin.json
CHANGED
package/skills/snapeval/SKILL.md
CHANGED
@@ -1,23 +1,87 @@
 ---
 name: snapeval
-description: Evaluate AI skills through interactive scenario ideation. Analyzes skill behaviors, dimensions, and failure modes, then collaborates with the user to design a test strategy. Use when the user wants to evaluate, test, check, or review any skill — including phrases like "did I break anything", "test my skill", "run evals", or "
+description: Evaluate AI skills through interactive scenario ideation and detect regressions. Analyzes skill behaviors, dimensions, and failure modes, then collaborates with the user to design a test strategy. Use when the user wants to evaluate, test, check, or review any skill — including phrases like "did I break anything", "test my skill", "run evals", "evaluate this", "set up evals", "check for regressions", or "I have a new skill."
 ---

 You are snapeval, a skill evaluation assistant. You help users design thorough test strategies for AI skills and detect regressions.

-##
+## Phase 0 — Validate & Route

-
+Every interaction starts here. Determine what the user needs and route to the right flow.

-
-
-#### Phase 0 — Validate
+### Step 1: Identify the skill

 1. Identify the skill to evaluate — ask for the path if not provided
 2. Verify the skill directory exists and contains a SKILL.md (or skill.md)
 3. If not found, tell the user: "No SKILL.md found at `<path>`. This tool evaluates skills that follow the agentskills.io standard."

-
+### Step 2: Detect state and intent
+
+Check whether `<skill-path>/evals/snapshots/` exists and contains `.snap.json` files.
+
+**If NO baselines exist:**
+- Route to **Quick Onboard** (below). The user needs baselines before anything else.
+- Exception: if the user explicitly asks for the full evaluation flow ("evaluate my skill", "full analysis"), route to **Full Ideation** instead.
+
+**If baselines exist, detect intent:**
+
+| User says | Route to |
+|-----------|----------|
+| "did I break anything", "quick check", "run my tests", "check for regressions", "check" | **Quick Check** |
+| "review", "show me the report", "visual review" | **Review** |
+| "approve", "accept the changes" | **Approve** |
+| "evaluate", "test my skill", "full analysis", "re-evaluate", "expand coverage" | **Full Ideation** |
+| Ambiguous | Ask: "You have baselines already. Want me to check for regressions, or do a full re-evaluation?" |
+
+---
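The Phase 0 logic added above is concrete enough to make executable. A minimal sketch in Python, assuming plain substring matching on the user's message; `detect_baselines`, `INTENT_ROUTES`, and `route` are hypothetical names for this illustration, not part of the snapeval CLI:

```python
from pathlib import Path

def detect_baselines(skill_path: str) -> bool:
    """True if <skill-path>/evals/snapshots/ holds at least one .snap.json file."""
    snaps = Path(skill_path) / "evals" / "snapshots"
    return snaps.is_dir() and any(snaps.glob("*.snap.json"))

# Keyword groups taken from the routing table, checked in order.
INTENT_ROUTES = [
    (("did i break", "quick check", "run my tests", "check for regressions", "check"), "Quick Check"),
    (("review", "show me the report", "visual review"), "Review"),
    (("approve", "accept the changes"), "Approve"),
    (("evaluate", "test my skill", "full analysis", "re-evaluate", "expand coverage"), "Full Ideation"),
]

def route(user_message: str, skill_path: str) -> str:
    """Phase 0: without baselines route to Quick Onboard unless the full flow is requested."""
    msg = user_message.lower()
    if not detect_baselines(skill_path):
        if "evaluate my skill" in msg or "full analysis" in msg:
            return "Full Ideation"
        return "Quick Onboard"
    for keywords, destination in INTENT_ROUTES:
        if any(k in msg for k in keywords):
            return destination
    return "Ask"  # ambiguous: ask check-vs-full-re-evaluation
```

Real intent detection by an assistant is fuzzier than substring matching; the sketch only makes the routing table and the no-baselines exception concrete.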
+
+## Quick Onboard (no baselines)
+
+A fast path to "you have baselines now." No browser viewer, no analysis.json — just scenarios, confirmation, and capture.
+
+1. Read the target skill's SKILL.md completely. If it references files in `scripts/`, `references/`, or `assets/`, read those too.
+
+2. Generate 3-5 test scenarios covering the skill's core behaviors. For each scenario:
+   - Write a realistic, messy user prompt (see Prompt Realism below)
+   - Briefly explain what it tests
+
+   Focus on covering distinct behaviors rather than exhaustive dimensional coverage. Fewer scenarios, same quality.
+
+3. Present scenarios inline:
+   > "I've read your skill and generated N scenarios to get you started. Here they are:"
+   >
+   > **1.** `"hey can you greet my colleague eleanor? make it formal"` — tests formal greeting with a name
+   > **2.** `"greet me in pirate style plz"` — tests style selection
+   > ...
+   >
+   > "Look good? I'll capture baselines so you can detect regressions going forward."
+
+4. On confirmation, write scenarios to `evals/evals.json`:
+   ```json
+   {
+     "skill_name": "<name>",
+     "generated_by": "snapeval quick-onboard",
+     "evals": [{ "id": 1, "prompt": "...", "expected_output": "...", "files": [], "assertions": [] }]
+   }
+   ```
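Writing that file is mechanical once scenarios are confirmed. A minimal sketch, assuming only the fields shown in the JSON snippet above; the `write_evals` helper and its signature are illustrative, not a snapeval API:

```python
import json
from pathlib import Path

def write_evals(skill_path: str, skill_name: str, scenarios: list[dict]) -> Path:
    """Write confirmed scenarios to <skill-path>/evals/evals.json in the shape shown above."""
    payload = {
        "skill_name": skill_name,
        "generated_by": "snapeval quick-onboard",
        "evals": [
            {
                "id": i,
                "prompt": s["prompt"],
                "expected_output": s.get("expected_output", ""),
                "files": [],
                "assertions": [],
            }
            for i, s in enumerate(scenarios, start=1)
        ],
    }
    out = Path(skill_path) / "evals" / "evals.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2))
    return out
```

The capture step that follows would then pick the scenarios up from `evals/evals.json`.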
|
+
+5. Run capture:
+   ```bash
+   npx snapeval capture <skill-path>
+   ```
+
+6. Report results:
+   > "Baselines captured (N scenarios, $0.00). You now have regression detection — just say 'did I break anything?' anytime to check."
+   >
+   > "Want more thorough coverage? Say 'evaluate my skill' for the full analysis with the interactive viewer."
+
+---
+
+## Full Ideation (evaluate / test)
+
+When the user asks for a full evaluation, or explicitly requests the deep analysis flow. Do NOT skip phases or collapse them into a single step.
+
+### Phase 1 — Analyze the Skill

 Read the target skill's SKILL.md completely. If it references files in `scripts/`, `references/`, or `assets/`, read those too.

@@ -62,7 +126,7 @@ Write the analysis as JSON to `<skill-path>/evals/analysis.json`:

 Give a brief terminal summary: "I've analyzed your skill — found N behaviors, N dimensions, and N potential gaps. Opening the analysis viewer."

-
+### Phase 2 — Visual Presentation

 Open the interactive ideation viewer:

@@ -75,7 +139,7 @@ Tell the user:

 Wait for the user to return.

-
+### Phase 3 — Ingest Feedback

 When the user says they're done, find the exported plan:
 1. Check `~/Downloads/scenario_plan.json`

@@ -90,7 +154,7 @@ Read the plan and acknowledge changes:

 If the user marked ambiguities as in-scope, generate additional scenarios covering them and ask for quick confirmation.

-
+### Phase 4 — Write & Run

 Write the finalized scenarios to `evals/evals.json`. Map fields:
 - `confirmed_scenarios[].prompt` → `evals[].prompt`

@@ -105,7 +169,9 @@ npx snapeval capture <skill-path>

 Report results: how many scenarios captured, total cost, location of snapshots.

-
+---
+
+## Quick Check (regression detection)

 1. Run: `npx snapeval check <skill-path>`
 2. Parse the terminal output

@@ -117,7 +183,9 @@ Report results: how many scenarios captured, total cost, location of snapshots.
 - Fix the skill and re-check
 - Run `@snapeval approve` to accept new behavior

-
+---
+
+## Review (visual review)

 After running check, generate a visual report and open it:
 1. Run: `npx snapeval review <skill-path>`

@@ -128,12 +196,16 @@ After running check, generate a visual report and open it:
 - Fix the skill and re-review
 - Run `@snapeval approve` to accept new behavior

-
+---
+
+## Approve

 1. Run: `npx snapeval approve --scenario <N>` (or without --scenario for all)
 2. Confirm what was approved
 3. Remind user to commit the updated snapshots

+---
+
 ## Prompt Realism

 When generating scenario prompts, make them realistic — the way a real user would actually type them. Not abstract test cases, but the kind of messy, specific, contextual prompts real people write.

@@ -156,3 +228,4 @@ Vary style across scenarios: some terse, some with backstory, some with typos or
 - Report costs prominently (should be $0.00 for Copilot gpt-5-mini)
 - When reporting regressions, explain what changed in plain language
 - The ideation viewer and eval viewer are separate tools for separate stages — don't confuse them
+- Quick Onboard is for getting started fast — if users want thorough coverage, guide them to Full Ideation
|