npm - snapeval - Versions diffs - 1.0.1 → 1.4.0 - Mend

snapeval 1.0.1 → 1.4.0

Files changed (70)

package/README.md +96 -95
package/assets/ideation-viewer.html +469 -0
package/bin/snapeval.ts +65 -12
package/dist/bin/snapeval.js +59 -12
package/dist/bin/snapeval.js.map +1 -1
package/dist/src/adapters/copilot-sdk-client.d.ts +13 -0
package/dist/src/adapters/copilot-sdk-client.js +59 -0
package/dist/src/adapters/copilot-sdk-client.js.map +1 -0
package/dist/src/adapters/inference/copilot-sdk.d.ts +9 -0
package/dist/src/adapters/inference/copilot-sdk.js +41 -0
package/dist/src/adapters/inference/copilot-sdk.js.map +1 -0
package/dist/src/adapters/inference/copilot.js +1 -2
package/dist/src/adapters/inference/copilot.js.map +1 -1
package/dist/src/adapters/inference/resolve.js +13 -4
package/dist/src/adapters/inference/resolve.js.map +1 -1
package/dist/src/adapters/report/html.d.ts +8 -0
package/dist/src/adapters/report/html.js +283 -0
package/dist/src/adapters/report/html.js.map +1 -0
package/dist/src/adapters/report/terminal.js +1 -1
package/dist/src/adapters/report/terminal.js.map +1 -1
package/dist/src/adapters/skill/copilot-cli.d.ts +1 -0
package/dist/src/adapters/skill/copilot-cli.js +17 -6
package/dist/src/adapters/skill/copilot-cli.js.map +1 -1
package/dist/src/adapters/skill/copilot-sdk.d.ts +6 -0
package/dist/src/adapters/skill/copilot-sdk.js +68 -0
package/dist/src/adapters/skill/copilot-sdk.js.map +1 -0
package/dist/src/commands/check.d.ts +0 -2
package/dist/src/commands/check.js +4 -5
package/dist/src/commands/check.js.map +1 -1
package/dist/src/commands/ideate.d.ts +1 -0
package/dist/src/commands/ideate.js +69 -0
package/dist/src/commands/ideate.js.map +1 -0
package/dist/src/commands/report.d.ts +2 -1
package/dist/src/commands/report.js +9 -0
package/dist/src/commands/report.js.map +1 -1
package/dist/src/commands/review.d.ts +7 -0
package/dist/src/commands/review.js +34 -0
package/dist/src/commands/review.js.map +1 -0
package/dist/src/config.js +0 -1
package/dist/src/config.js.map +1 -1
package/dist/src/engine/comparison/judge.d.ts +4 -0
package/dist/src/engine/comparison/judge.js +12 -3
package/dist/src/engine/comparison/judge.js.map +1 -1
package/dist/src/engine/comparison/pipeline.d.ts +1 -5
package/dist/src/engine/comparison/pipeline.js +9 -22
package/dist/src/engine/comparison/pipeline.js.map +1 -1
package/dist/src/engine/generator.d.ts +5 -0
package/dist/src/engine/generator.js +13 -0
package/dist/src/engine/generator.js.map +1 -1
package/dist/src/types.d.ts +33 -5
package/package.json +16 -5
package/plugin.json +7 -3
package/skills/snapeval/SKILL.md +130 -23
package/src/adapters/copilot-sdk-client.ts +66 -0
package/src/adapters/inference/copilot-sdk.ts +48 -0
package/src/adapters/inference/copilot.ts +1 -2
package/src/adapters/inference/resolve.ts +17 -4
package/src/adapters/report/html.ts +304 -0
package/src/adapters/report/terminal.ts +1 -1
package/src/adapters/skill/copilot-cli.ts +21 -7
package/src/adapters/skill/copilot-sdk.ts +72 -0
package/src/commands/check.ts +4 -5
package/src/commands/ideate.ts +101 -0
package/src/commands/report.ts +13 -2
package/src/commands/review.ts +48 -0
package/src/config.ts +0 -1
package/src/engine/comparison/judge.ts +14 -4
package/src/engine/comparison/pipeline.ts +8 -27
package/src/engine/generator.ts +21 -0
package/src/types.ts +30 -5

package/README.md CHANGED

@@ -6,102 +6,94 @@ Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free infere
 [![npm version](https://img.shields.io/npm/v/snapeval.svg)](https://www.npmjs.com/package/snapeval)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It generates test cases from your skill's `SKILL.md`, captures baseline outputs, and detects regressions through a tiered comparison pipeline — all with zero manual test authoring.
+snapeval evaluates [agentskills.io](https://agentskills.io) skills through semantic snapshot testing. It analyzes your skill's `SKILL.md`, collaborates with you to design a test strategy through an interactive browser-based viewer, then captures baselines and detects regressions — all with zero manual test authoring.
 ## Why snapeval?
-- **Zero assertions** — AI generates test cases from your SKILL.md. You never write test logic.
-- **Semantic comparison** — Three-tier pipeline: schema check (free) → embedding similarity (cheap) → LLM judge with order-swap debiasing (expensive). Most checks cost $0.
-- **Free inference** — Uses gpt-5-mini via Copilot CLI (0x multiplier on paid plans) and GitHub Models API (free with GITHUB_TOKEN).
-- **Non-determinism handling** — Variance envelope from N baseline runs prevents false regressions.
-- **Platform-agnostic** — Adapter-based architecture. Copilot CLI first, Claude Code and others coming.
+- **Interactive ideation** — AI decomposes your skill into behaviors, dimensions, and failure modes, then opens a visual viewer where you shape the test strategy together.
+- **Zero assertions** — No test logic to write. The AI generates realistic, messy prompts that mirror how real users actually type.
+- **Semantic comparison** — Tiered pipeline: schema check (free) → LLM judge with order-swap debiasing (when needed). Most checks cost $0.
+- **Free inference** — Uses gpt-5-mini via Copilot CLI and GitHub Models API.
+- **Platform-agnostic** — Adapter-based architecture. Copilot CLI first, others coming.
-## Quick Start
+## Install
-### As a Copilot CLI Plugin
+### From the marketplace
-Install directly from the GitHub repo:
+The snapeval marketplace is bundled with the repo. Add it once, then install by name:
 ```bash
-gh copilot -- plugin install matantsach/snapeval
+copilot plugin marketplace add matantsach/snapeval
+copilot plugin install snapeval@snapeval-marketplace
 ```
-Or register the marketplace first, then install by name:
+### From GitHub directly
 ```bash
-gh copilot -- plugin marketplace add matantsach/snapeval
-gh copilot -- plugin install snapeval@snapeval-marketplace
+copilot plugin install matantsach/snapeval
 ```
-Then in Copilot CLI interactive mode, just ask naturally:
-```
-> evaluate my code-reviewer skill
-> check skills/code-reviewer for regressions
-> approve scenario 3
-```
-The agent will use the snapeval skill automatically based on your prompt.
-### As a CLI
+### Verify installation
 ```bash
-npx snapeval init <skill-path>       # AI generates test cases from SKILL.md
-npx snapeval capture <skill-path>    # Run tests, save baseline snapshots
-npx snapeval check <skill-path>      # Compare current output to baselines
-npx snapeval approve [--scenario N]  # Accept new behavior as baseline
-npx snapeval report <skill-path>     # Generate benchmark.json
+copilot plugin list
 ```
-### Local Development
+## Usage
-For development without `npx`, clone and use `tsx` directly:
+In Copilot CLI, just talk naturally:
-```bash
-git clone https://github.com/matantsach/snapeval.git
-cd snapeval && npm install
-npx tsx bin/snapeval.ts init <skill-path>
+```
+> evaluate my greeter skill
+> test skills/code-reviewer for regressions
+> check if I broke anything in my-skill
+> approve scenario 3
 ```
-Or load as a local plugin during development:
+snapeval activates automatically based on your prompt.
-```bash
-gh copilot -- --plugin-dir /path/to/snapeval
-```
+### What happens when you evaluate
-### In CI
+1. **Analyze** — snapeval reads your SKILL.md and reasons through behaviors, input dimensions, failure modes, and ambiguities
+2. **View** — A browser-based viewer opens showing the analysis with proposed scenarios you can toggle, edit, and extend
+3. **Confirm** — You review, make changes, and click "Confirm & Run" to export your plan
+4. **Capture** — snapeval writes `evals.json` and runs the scenarios against your skill, saving baseline snapshots
-Commit your `evals.json` and `snapshots/` directory, then add a workflow:
+After initial setup, use `check` to detect regressions and `approve` to accept intentional changes.
-```yaml
-# .github/workflows/skill-eval.yml
-name: Skill Evaluation
-on: [pull_request]
+## CLI Reference
+The CLI is the headless backend — useful for CI, scripting, and power users.
-jobs:
-  eval:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-node@v4
-        with:
-          node-version: 22
-      - run: npm ci
-      - run: npx tsx bin/snapeval.ts check skills/my-skill --ci --skip-embedding
+```
+snapeval init [skill-dir]         Generate test cases from SKILL.md
+snapeval capture [skill-dir]      Run scenarios and save baseline snapshots
+snapeval check [skill-dir]        Compare current output against baselines
+snapeval approve [skill-dir]      Approve regressed scenarios as new baselines
+snapeval report [skill-dir]       Write results with optional HTML viewer
+snapeval ideate [skill-dir]       Open the interactive scenario ideation viewer
 ```
-> **Note:** The `--skip-embedding` flag runs Tier 1 (schema) and Tier 3 (LLM judge) only, skipping Tier 2 which requires the GitHub Models embedding API. For Tier 1-only checks (fastest, free, no API needed), committed baselines with stable output structures will pass without any inference calls.
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--adapter <name>` | Skill adapter | `copilot-cli` |
+| `--inference <name>` | Inference adapter | `auto` |
+| `--budget <amount>` | Spend cap in USD | `unlimited` |
+| `--runs <n>` | Baseline runs per scenario | `1` |
+| `--ci` | CI mode: exit 1 on regressions | off |
+| `--html` | Generate HTML report viewer | off |
+| `--scenario <ids>` | Comma-separated scenario IDs | all |
+| `--verbose` | Verbose output | off |
 ## How It Works
 ```
-SKILL.md → AI generates test scenarios → Capture baseline snapshots
-                                                    ↓
-         Modify skill → Re-run scenarios → Compare via tiered pipeline
-                                                    ↓
-                              Schema match? → PASS (free, instant)
-                              Embedding > 0.85? → PASS (cheap)
-                              LLM Judge agrees? → PASS/REGRESSED (expensive)
+SKILL.md → AI analyzes skill → Interactive ideation viewer → Capture baselines
+                                                                     ↓
+              Modify skill → Re-run scenarios → Compare via tiered pipeline
+                                                                     ↓
+                                     Schema match? → PASS (free, instant)
+                                     LLM Judge agrees? → PASS/REGRESSED
 ```
 ### Comparison Pipeline
@@ -109,8 +101,7 @@ SKILL.md → AI generates test scenarios → Capture baseline snapshots
 | Tier | Method | Cost | When Used |
 |------|--------|------|-----------|
 | 1 | Schema check | Free | Structural skeleton matches |
-| 2 | Embedding similarity | Cheap | Schema differs but meaning similar |
-| 3 | LLM judge (order-swap) | Expensive | Ambiguous cases only |
+| 2 | LLM judge (order-swap) | Cheap | Schema differs, needs semantic comparison |
 Most stable skills are checked entirely at Tier 1 — $0.00 per run.
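
Reviewer note: the two-tier flow in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in `pipeline.ts` or `judge.ts`. Every name here (`skeleton`, `schemaMatches`, `judgeWithOrderSwap`, `compare`, the `Judge` type) is hypothetical, outputs are assumed to be JSON-like values, and the judge is a synchronous callback for brevity, whereas a real LLM judge would presumably be asynchronous.

```typescript
// Hypothetical types; none of these names come from the snapeval source.
type Verdict = "PASS" | "REGRESSED";
// Assumed judge contract: returns true when the two outputs are semantically equivalent.
type Judge = (first: string, second: string) => boolean;

// Tier 1 helper: reduce a JSON-like value to its structural skeleton
// (sorted keys and value types only, no concrete values).
function skeleton(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(skeleton);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      out[key] = skeleton(source[key]);
    }
    return out;
  }
  return value === null ? "null" : typeof value;
}

// Tier 1: free structural check, no inference call.
function schemaMatches(baseline: unknown, current: unknown): boolean {
  return JSON.stringify(skeleton(baseline)) === JSON.stringify(skeleton(current));
}

// Tier 2: ask the judge twice with the argument order swapped and
// accept only if both orderings agree, which dampens position bias.
function judgeWithOrderSwap(judge: Judge, baseline: string, current: string): boolean {
  return judge(baseline, current) && judge(current, baseline);
}

// Run Tier 1 first; fall through to the judge only when the skeletons differ.
function compare(
  baseline: unknown,
  current: unknown,
  judge: Judge
): { verdict: Verdict; tier: 1 | 2 } {
  if (schemaMatches(baseline, current)) return { verdict: "PASS", tier: 1 };
  const same = judgeWithOrderSwap(judge, JSON.stringify(baseline), JSON.stringify(current));
  return { verdict: same ? "PASS" : "REGRESSED", tier: 2 };
}
```

Under these assumptions the judge is never called when the skeletons match, which is consistent with the README's claim that stable skills cost nothing at Tier 1.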
@@ -122,7 +113,8 @@ snapeval follows the [agentskills.io evaluation standard](https://agentskills.io
 my-skill/
 ├── SKILL.md
 └── evals/
-    ├── evals.json          ← AI-generated test cases
+    ├── evals.json          ← Test scenarios (AI-generated or from ideation)
+    ├── analysis.json       ← Skill analysis (behaviors, dimensions, gaps)
     ├── snapshots/          ← Captured baseline outputs
     └── results/
         └── iteration-N/
@@ -131,6 +123,40 @@ my-skill/
             └── benchmark.json
 ```
+## In CI
+Commit your `evals.json` and `snapshots/` directory, then add a workflow:
+```yaml
+name: Skill Evaluation
+on: [pull_request]
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+      - run: npm ci
+      - run: npx snapeval check skills/my-skill --ci
+```
+## Local Development
+```bash
+git clone https://github.com/matantsach/snapeval.git
+cd snapeval && npm install
+npx tsx bin/snapeval.ts check <skill-path>
+```
+Or load as a local plugin:
+```bash
+copilot plugin install ./path/to/snapeval
+```
 ## Configuration
 Create `snapeval.config.json` in your skill or project root:
@@ -139,7 +165,6 @@ Create `snapeval.config.json` in your skill or project root:
 {
   "adapter": "copilot-cli",
   "inference": "auto",
-  "threshold": 0.85,
   "runs": 3,
   "budget": "unlimited"
 }
@@ -147,43 +172,19 @@ Create `snapeval.config.json` in your skill or project root:
 CLI flags override config file values.
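
Reviewer note: one way to picture the precedence stated above (defaults, then `snapeval.config.json`, then CLI flags) is the sketch below. It is an assumption-laden illustration rather than the logic in `config.ts`: the `SnapevalConfig` interface, the `DEFAULTS` constant, and `resolveConfig` are hypothetical names, the default values are copied from the flags table in the new README, and `budget` is modeled as a plain string because the diff only ever shows the value `"unlimited"`.

```typescript
// Hypothetical config shape based on the keys shown in snapeval.config.json.
interface SnapevalConfig {
  adapter: string;
  inference: string;
  runs: number;
  budget: string; // assumption: kept as a string, e.g. "unlimited"
}

// Defaults as listed in the README flags table.
const DEFAULTS: SnapevalConfig = {
  adapter: "copilot-cli",
  inference: "auto",
  runs: 1,
  budget: "unlimited",
};

// Later sources win: defaults < config file < CLI flags.
// Undefined entries are skipped so an unset flag never clobbers a configured value.
function resolveConfig(
  fileConfig: Partial<SnapevalConfig>,
  cliFlags: Partial<SnapevalConfig>
): SnapevalConfig {
  const merged: SnapevalConfig = { ...DEFAULTS };
  for (const source of [fileConfig, cliFlags]) {
    if (source.adapter !== undefined) merged.adapter = source.adapter;
    if (source.inference !== undefined) merged.inference = source.inference;
    if (source.runs !== undefined) merged.runs = source.runs;
    if (source.budget !== undefined) merged.budget = source.budget;
  }
  return merged;
}
```

With the config file from this hunk (`"runs": 3`) and a hypothetical `--runs 5` on the command line, this sketch would resolve `runs` to 5 and leave the other keys at their file or default values.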
-## CLI Reference
-```
-snapeval init [skill-dir]         Generate test cases from SKILL.md using AI
-snapeval capture [skill-dir]      Run skill against all scenarios, save baselines
-snapeval check [skill-dir]        Compare current output against baselines
-snapeval approve [skill-dir]      Approve regressed scenarios as new baselines
-snapeval report [skill-dir]       Write results to evals/results/iteration-N/
-```
-**Common flags:**
-| Flag | Description | Default |
-|------|-------------|---------|
-| `--adapter <name>` | Skill adapter | `copilot-cli` |
-| `--inference <name>` | Inference adapter | `auto` |
-| `--threshold <n>` | Embedding similarity threshold | `0.85` |
-| `--budget <amount>` | Spend cap in USD | `unlimited` |
-| `--runs <n>` | Baseline runs per scenario | `1` |
-| `--ci` | CI mode: exit 1 on regressions | off |
-| `--skip-embedding` | Skip Tier 2 (embedding) | off |
-| `--scenario <ids>` | Comma-separated scenario IDs | all |
-| `--verbose` | Verbose output | off |
 ## Architecture
 Three surfaces over a shared core engine:
 - **Plugin** (SKILL.md) — Interactive product. AI handles everything.
 - **CLI** (`npx snapeval`) — Headless backend for CI and power users.
-- **GitHub Action** — CI wrapper (coming in v2).
+- **GitHub Action** — CI wrapper (planned).
-Three adapter layers for platform independence:
+Adapter layers for platform independence:
-- **SkillAdapter** — How to invoke a skill (Copilot CLI, Claude Code, generic)
+- **SkillAdapter** — How to invoke a skill (Copilot CLI, others planned)
 - **InferenceAdapter** — Where to get LLM capabilities (Copilot gpt-5-mini, GitHub Models API)
-- **ReportAdapter** — How to present results (terminal, JSON, PR comment)
+- **ReportAdapter** — How to present results (terminal, JSON, HTML viewer)
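
Reviewer note: the adapter layering described above could look something like the sketch below, shown for the report layer only. `ReportAdapter` is the name used in the README, but the `render` method, the `ScenarioResult` shape, the registry object, and `resolveReportAdapter` are all hypothetical and are not taken from `src/adapters/report/`.

```typescript
// Hypothetical result shape; the real types live in src/types.ts and may differ.
interface ScenarioResult {
  id: number;
  verdict: "PASS" | "REGRESSED";
}

// Assumed adapter contract: turn results into a presentable string.
interface ReportAdapter {
  render(results: ScenarioResult[]): string;
}

const terminalReport: ReportAdapter = {
  render: (results) =>
    results.map((r) => `scenario ${r.id}: ${r.verdict}`).join("\n"),
};

const jsonReport: ReportAdapter = {
  render: (results) => JSON.stringify({ results }, null, 2),
};

// Registry keyed by adapter name, so a new surface only needs a new entry.
const reportAdapters: Record<string, ReportAdapter> = {
  terminal: terminalReport,
  json: jsonReport,
};

// Look an adapter up by name and fail loudly on an unknown one.
function resolveReportAdapter(name: string): ReportAdapter {
  if (!Object.prototype.hasOwnProperty.call(reportAdapters, name)) {
    throw new Error(`Unknown report adapter: ${name}`);
  }
  return reportAdapters[name];
}
```

The point of the pattern is that the core engine depends only on the interface, so an HTML viewer can be added as another registry entry without touching the comparison code.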
 ## Contributing