skills-eval 0.0.1

package/README.md ADDED
@@ -0,0 +1,92 @@
+ # Skills Eval
+
+ A tool for evaluating and improving Claude Code skills (agent prompts). It has two main features:
+
+ 1. **Eval Pipeline**: Run a skill against test cases, grade the outputs with assertions, compare iterations, and refine the skill prompt until quality improves.
+ 2. **CLI Playground**: Experiment with `claude -p` flag combinations (model, effort, tools, JSON schema, system prompt, MCP config, etc.) in a visual interface before using them in production.
+
+ ## Quick Start
+
+ ```bash
+ bun install
+ bun run dev
+ ```
+
+ Open http://localhost:5173 for the eval pipeline, or http://localhost:5173/playground for the CLI playground.
+
+ ## Eval Pipeline
+
+ The eval pipeline is the main product. Given a skill (a markdown prompt file) and context about the user's use case, it supports the following workflow:
+
+ 1. **Generate test cases**: Create diverse prompts to exercise the skill
+ 2. **Run evals**: Execute each test case with and without the skill prompt
+ 3. **Grade results**: Write assertions and judge outputs
+ 4. **Compare iterations**: See how changes to the skill affect output quality
+ 5. **Improve**: Iterate on the skill prompt based on grading feedback
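
The grading and comparison steps above can be illustrated with a minimal sketch. It assumes an assertion is simply a described predicate over the output text, and that "with and without the skill prompt" means grading two outputs against the same assertions. The `Assertion`, `GradeResult`, `grade`, and `compareRuns` names are hypothetical and are not taken from the package.

```typescript
// Hypothetical types: the README does not document the tool's real data model.
type Assertion = {
  description: string;
  check: (output: string) => boolean;
};

type GradeResult = {
  passed: number;
  total: number;
  passRate: number; // fraction in [0, 1]
  failures: string[]; // descriptions of the assertions that failed
};

// Grade a single output against a list of assertions.
function grade(output: string, assertions: Assertion[]): GradeResult {
  const failures = assertions
    .filter((a) => !a.check(output))
    .map((a) => a.description);
  const total = assertions.length;
  const passed = total - failures.length;
  return { passed, total, passRate: total === 0 ? 0 : passed / total, failures };
}

// Compare the run that used the skill prompt against the baseline run without it.
function compareRuns(
  withSkill: string,
  withoutSkill: string,
  assertions: Assertion[],
): { withSkill: GradeResult; baseline: GradeResult; delta: number } {
  const a = grade(withSkill, assertions);
  const b = grade(withoutSkill, assertions);
  return { withSkill: a, baseline: b, delta: a.passRate - b.passRate };
}

// Example: two assertions for an imagined "commit message" skill.
const assertions: Assertion[] = [
  {
    description: "subject line is at most 72 characters",
    check: (o) => (o.split("\n")[0] ?? "").length <= 72,
  },
  {
    description: "uses a conventional-commit prefix",
    check: (o) => /^(feat|fix|docs|chore)(\(.+\))?: /.test(o),
  },
];

const comparison = compareRuns(
  "fix(parser): handle empty input", // output produced with the skill prompt
  "Fixed a bug", // output produced without it
  assertions,
);
console.log(comparison.delta, comparison.baseline.failures);
```

A positive `delta` suggests the skill prompt helps on this test case; the `failures` list is the kind of feedback that could drive the next iteration of the prompt.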
+
+ ### Usage
+
+ ```bash
+ # Dev mode with example skill (single skill, HMR)
+ bun run dev
+
+ # With a custom skill
+ bun run dev:cli -- --skill-path path/to/SKILL.md --context "description of the use case"
+
+ # Production-like (build first)
+ bun run build
+ bun run dogfood:launch
+ ```
+
+ ### Multi-Skill Support
+
+ You can evaluate multiple skills concurrently. Each skill gets its own server on an auto-selected port:
+
+ ```bash
+ bun run build
+ bun bin/skill-eval.ts --skill-path path/to/skill-a/SKILL.md --context "context a"
+ bun bin/skill-eval.ts --skill-path path/to/skill-b/SKILL.md --context "context b"
+ ```
+
+ Each launch auto-selects an available port (starting from 3000), opens a browser tab, and tracks its state in `.skill-eval/servers.json`. If a server is already running for a skill, subsequent launches reuse it instead of starting a new one.
+
+ All skills within the same project share a single SQLite database (`.skill-eval/eval.db`), with data logically separated by skill ID.
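
The reuse-or-start behavior described above can be sketched as a pure function. The README does not show the schema of `.skill-eval/servers.json`, so the `ServerEntry` shape, the `decideLaunch` name, and the `portsInUse` parameter are assumptions made for illustration; a real launcher would presumably also probe the operating system for free ports.

```typescript
// Hypothetical shape of one servers.json entry; the real schema is not documented in the README.
type ServerEntry = { skillPath: string; port: number; running: boolean };

type LaunchDecision =
  | { action: "reuse"; port: number }
  | { action: "start"; port: number };

const BASE_PORT = 3000; // the README states that port selection starts from 3000

// Decide whether to reuse a running server for this skill or start a new one.
function decideLaunch(
  registry: ServerEntry[],
  skillPath: string,
  portsInUse: number[] = [], // ports known to be occupied by other processes (assumed input)
): LaunchDecision {
  const existing = registry.find((e) => e.skillPath === skillPath && e.running);
  if (existing) {
    return { action: "reuse", port: existing.port };
  }
  // Collect every port that is unavailable, then pick the lowest free one.
  const taken = new Set<number>(portsInUse);
  for (const entry of registry) {
    if (entry.running) taken.add(entry.port);
  }
  let port = BASE_PORT;
  while (taken.has(port)) port += 1;
  return { action: "start", port };
}

const registry: ServerEntry[] = [
  { skillPath: "path/to/skill-a/SKILL.md", port: 3000, running: true },
  { skillPath: "path/to/skill-b/SKILL.md", port: 3001, running: false },
];

const reuseA = decideLaunch(registry, "path/to/skill-a/SKILL.md");
const startB = decideLaunch(registry, "path/to/skill-b/SKILL.md");
const startC = decideLaunch(registry, "path/to/skill-c/SKILL.md", [3001]);
console.log(reuseA, startB, startC);
```

In this sketch a stopped server's port is treated as free again, which is one plausible reading of "reuse it if already running"; the package may handle stale entries differently.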
+
+ ## CLI Playground
+
+ An interactive tool for experimenting with `claude -p` (print mode) flags. It is useful for finding which settings produce the best results before wiring them into the eval pipeline.
+
+ **Features:**
+ - Controls panel with all relevant CLI flags (model, effort, output format, tools, JSON schema, system prompt, debug, MCP, setting sources, max turns)
+ - Interdependency enforcement (e.g., verbose auto-enables stream-json, effort max requires opus)
+ - Real execution via server-spawned `claude -p` with streaming output via WebSocket
+ - Tabbed run history with per-run stats (duration, cost, tool counts, model)
+ - Copy button generates a ready-to-paste CLI command (prompt piped via stdin)
+ - Subfolder picker for `claude-cli/` configurations; each subfolder can have its own `.claude/settings.json`
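
The interdependency enforcement mentioned in the feature list can be sketched as a normalization step over the selected options. The two rules come from the list above; everything else is an assumption: the `PlaygroundOptions` field names, the set of effort levels other than `max`, and the choice to resolve a conflict by switching the model instead of lowering the effort. The field names are not CLI flag spellings.

```typescript
// Hypothetical option names mirroring the playground controls; not actual CLI flags.
type OutputFormat = "text" | "json" | "stream-json";

type PlaygroundOptions = {
  model: string;
  effort: "low" | "medium" | "high" | "max"; // only "max" is named in the README
  outputFormat: OutputFormat;
  verbose: boolean;
};

// Apply the two documented rules and report which adjustments were made.
function enforceInterdependencies(
  input: PlaygroundOptions,
): { options: PlaygroundOptions; notes: string[] } {
  const options: PlaygroundOptions = { ...input }; // never mutate the caller's object
  const notes: string[] = [];

  // Rule 1: verbose auto-enables stream-json output.
  if (options.verbose && options.outputFormat !== "stream-json") {
    options.outputFormat = "stream-json";
    notes.push("verbose enabled: output format switched to stream-json");
  }

  // Rule 2: effort "max" requires opus (assumed resolution: switch the model).
  if (options.effort === "max" && !options.model.includes("opus")) {
    options.model = "opus";
    notes.push("effort max selected: model switched to opus");
  }

  return { options, notes };
}

const original: PlaygroundOptions = {
  model: "sonnet",
  effort: "max",
  outputFormat: "text",
  verbose: true,
};
const normalized = enforceInterdependencies(original);
console.log(normalized.options, normalized.notes);
```

Returning the list of adjustments alongside the normalized options lets a UI explain why a control changed instead of silently overriding the user's selection.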
+
+ ### CLI Config Subfolders
+
+ ```bash
+ # Create a config for testing with specific settings
+ mkdir -p claude-cli/my-experiment/.claude
+ echo '{"permissions":{"allow":["Bash(*)"]}}' > claude-cli/my-experiment/.claude/settings.json
+ ```
+
+ The playground auto-detects subfolders in `claude-cli/` and lets you pick which one to `cd` into before running.
+
+ ## Development
+
+ ```bash
+ bun run dev        # Full dev (Fastify + Vite HMR)
+ bun run build      # Build for production
+ bun run typecheck  # Type check
+ bun x vitest run   # Run tests
+ bun run check      # Lint
+ ```
+
+ ## Stack
+
+ - **Frontend**: React 19, Vite 8, Tailwind CSS v4, shadcn/ui (radix-nova)
+ - **Backend**: Fastify 5, SQLite (drizzle-orm + libsql)
+ - **Testing**: Vitest 4, Testing Library
+ - **Runtime**: Bun, TypeScript 5.9