skills-eval 0.0.1
- package/README.md +92 -0
- package/dist/client/assets/geist-cyrillic-wght-normal-CHSlOQsW.woff2 +0 -0
- package/dist/client/assets/geist-latin-ext-wght-normal-DMtmJ5ZE.woff2 +0 -0
- package/dist/client/assets/geist-latin-wght-normal-Dm3htQBi.woff2 +0 -0
- package/dist/client/assets/index-2gXzIx5u.js +62 -0
- package/dist/client/assets/index-B2aEHgRo.css +2 -0
- package/dist/client/favicon.svg +1 -0
- package/dist/client/icons.svg +24 -0
- package/dist/client/index.html +14 -0
- package/dist/server/skill-eval.js +282 -0
- package/package.json +82 -0
package/README.md
ADDED
@@ -0,0 +1,92 @@
# Skills Eval

A tool for evaluating and improving Claude Code skills (agent prompts). Two main features:
1. **Eval Pipeline** — Run skill outputs against test cases, grade them with assertions, compare iterations, and iterate on the skill prompt until quality improves.
2. **CLI Playground** — Experiment with `claude -p` flag combinations (model, effort, tools, JSON schema, system prompt, MCP config, etc.) in a visual interface before using them in production.
## Quick Start

```bash
bun install
bun run dev
```

Open http://localhost:5173 for the eval pipeline, or http://localhost:5173/playground for the CLI playground.
## Eval Pipeline

The eval pipeline is the main product. Given a skill (a markdown prompt file) and context about the user's use case, it lets you:
1. **Generate test cases** — Create diverse prompts to exercise the skill
2. **Run evals** — Execute the skill against test cases with and without the skill prompt
3. **Grade results** — Write assertions and judge outputs
4. **Compare iterations** — See how changes to the skill affect output quality
5. **Improve** — Iterate on the skill prompt based on grading feedback
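
The grading and comparison steps above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the package's implementation: the `Assertion` and `GradeResult` types and the `grade` and `compareRuns` functions are hypothetical names, and the assertions here are plain string checks rather than whatever judging the tool actually performs.

```typescript
// Hypothetical types: the package's real data model is not shown in this README.
type Assertion = { name: string; check: (output: string) => boolean };
type GradeResult = { passed: string[]; failed: string[]; passRate: number };

// Grade one output against a list of assertions.
function grade(output: string, assertions: Assertion[]): GradeResult {
  const passed: string[] = [];
  const failed: string[] = [];
  for (const assertion of assertions) {
    if (assertion.check(output)) {
      passed.push(assertion.name);
    } else {
      failed.push(assertion.name);
    }
  }
  const passRate = assertions.length === 0 ? 0 : passed.length / assertions.length;
  return { passed, failed, passRate };
}

// Compare a run that used the skill prompt against a baseline run without it.
function compareRuns(withSkill: string, withoutSkill: string, assertions: Assertion[]) {
  const skillGrade = grade(withSkill, assertions);
  const baselineGrade = grade(withoutSkill, assertions);
  return {
    withSkill: skillGrade,
    baseline: baselineGrade,
    delta: skillGrade.passRate - baselineGrade.passRate,
  };
}

const exampleAssertions: Assertion[] = [
  { name: "mentions a test plan", check: (o) => /test plan/i.test(o) },
  { name: "stays under 200 characters", check: (o) => o.length <= 200 },
];

const comparison = compareRuns(
  "Summary of the change, followed by a test plan.",
  "Summary of the change.",
  exampleAssertions,
);
console.log(comparison.delta, comparison.baseline.failed);
```

A positive `delta` suggests the skill prompt helped on this test case; the names in `failed` point at what to change in the next iteration.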
### Usage

```bash
# Dev mode with example skill (single skill, HMR)
bun run dev

# With a custom skill
bun run dev:cli -- --skill-path path/to/SKILL.md --context "description of the use case"

# Production-like (build first)
bun run build
bun run dogfood:launch
```
### Multi-Skill Support

You can evaluate multiple skills concurrently. Each skill gets its own server on an auto-selected port:

```bash
bun run build
bun bin/skill-eval.ts --skill-path path/to/skill-a/SKILL.md --context "context a"
bun bin/skill-eval.ts --skill-path path/to/skill-b/SKILL.md --context "context b"
```
Each launch auto-selects an available port (starting from 3000), opens a browser tab, and tracks its state in `.skill-eval/servers.json`. If a server is already running for a skill, subsequent launches reuse it instead of starting a new one.

All skills within the same project share a single SQLite database (`.skill-eval/eval.db`), with data logically separated by skill ID.
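
The launch behaviour described above (ports allocated from 3000 upward, reuse of an already-running server) might look roughly like the sketch below. The `ServerEntry` shape is an assumption, since the schema of `.skill-eval/servers.json` is not documented here, and `nextFreePort` and `launch` are hypothetical names. The sketch only consults the in-memory registry; a real launcher would presumably also check that the port is free at the OS level and that the recorded process is still alive.

```typescript
// Hypothetical shape for entries in .skill-eval/servers.json (real schema not shown in the README).
type ServerEntry = { skillPath: string; port: number; pid: number };

// Pick the first port at or above `start` that no tracked server is using.
function nextFreePort(entries: ServerEntry[], start = 3000): number {
  const used = new Set(entries.map((e) => e.port));
  let port = start;
  while (used.has(port)) port += 1;
  return port;
}

// Reuse the entry for a skill if one exists; otherwise register a new one.
function launch(
  entries: ServerEntry[],
  skillPath: string,
  pid: number,
): { entry: ServerEntry; reused: boolean } {
  const existing = entries.find((e) => e.skillPath === skillPath);
  if (existing) return { entry: existing, reused: true };
  const entry: ServerEntry = { skillPath, port: nextFreePort(entries), pid };
  entries.push(entry);
  return { entry, reused: false };
}

const registry: ServerEntry[] = [];
const first = launch(registry, "path/to/skill-a/SKILL.md", 101);
const second = launch(registry, "path/to/skill-b/SKILL.md", 102);
const repeat = launch(registry, "path/to/skill-a/SKILL.md", 103);
console.log(first.entry.port, second.entry.port, repeat.reused);
```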
## CLI Playground

An interactive tool for experimenting with `claude -p` (print mode) flags. It is useful for finding which settings produce the best results before wiring them into the eval pipeline.

**Features:**
- Controls panel with all relevant CLI flags (model, effort, output format, tools, JSON schema, system prompt, debug, MCP, setting sources, max turns)
- Interdependency enforcement (e.g., verbose auto-enables stream-json, effort max requires opus)
- Real execution via server-spawned `claude -p` with streaming output via WebSocket
- Tabbed run history with per-run stats (duration, cost, tool counts, model)
- Copy button generates a ready-to-paste CLI command (prompt piped via stdin)
- Subfolder picker for `claude-cli/` configurations — each subfolder can have its own `.claude/settings.json`
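
As an illustration of the interdependency rules and the copy-to-CLI feature listed above, here is a minimal sketch. The `PlaygroundOptions` shape and the `enforceRules`, `shellQuote`, and `toCommand` functions are hypothetical, not the package's code. The sketch resolves "effort max requires opus" by switching the model, which is only one possible policy, and it leaves the effort setting out of the generated command because this README does not give the flag's spelling.

```typescript
// Hypothetical option shape; field names are illustrative only.
type PlaygroundOptions = {
  model: string;
  effort: string;
  outputFormat: "text" | "json" | "stream-json";
  verbose: boolean;
};

// Apply the two example rules from the feature list.
function enforceRules(opts: PlaygroundOptions): PlaygroundOptions {
  const next: PlaygroundOptions = { ...opts };
  // Rule 1: verbose auto-enables stream-json output.
  if (next.verbose) next.outputFormat = "stream-json";
  // Rule 2: effort "max" requires an opus model (assumed policy: switch the model).
  if (next.effort === "max" && !next.model.includes("opus")) next.model = "opus";
  return next;
}

// Single-quote a value for a POSIX shell.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Build a paste-ready command that pipes the prompt file into `claude -p` via stdin.
function toCommand(opts: PlaygroundOptions, promptFile: string): string {
  const args = [
    "claude", "-p",
    "--model", shellQuote(opts.model),
    "--output-format", opts.outputFormat,
  ];
  if (opts.verbose) args.push("--verbose");
  return `cat ${shellQuote(promptFile)} | ${args.join(" ")}`;
}

const normalized = enforceRules({
  model: "sonnet",
  effort: "max",
  outputFormat: "text",
  verbose: true,
});
const command = toCommand(normalized, "prompt.txt");
console.log(command);
```

Normalizing the options before building the command keeps the copied command consistent with what the controls panel shows.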
### CLI Config Subfolders

```bash
# Create a config for testing with specific settings
mkdir -p claude-cli/my-experiment/.claude
echo '{"permissions":{"allow":["Bash(*)"]}}' > claude-cli/my-experiment/.claude/settings.json
```
The playground auto-detects subfolders in `claude-cli/` and lets you pick which one to `cd` into before running.
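
Subfolder detection of this kind can be done with a directory listing. The sketch below is an assumption about how it might work, not the package's code: `listCliConfigs` and the `CliConfig` type are hypothetical names, and the check for `.claude/settings.json` simply mirrors the layout created in the example above.

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical result shape for one detected configuration folder.
type CliConfig = { name: string; hasSettings: boolean };

// List the immediate subdirectories of `root` and note which ones carry their own settings file.
function listCliConfigs(root = "claude-cli"): CliConfig[] {
  if (!existsSync(root)) return [];
  return readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({
      name: entry.name,
      hasSettings: existsSync(join(root, entry.name, ".claude", "settings.json")),
    }))
    .sort((x, y) => x.name.localeCompare(y.name));
}

console.log(listCliConfigs());
```

A runner would then spawn `claude -p` with its working directory set to the chosen folder, so that folder's `.claude/settings.json` applies.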
## Development

```bash
bun run dev          # Full dev (Fastify + Vite HMR)
bun run build        # Build for production
bun run typecheck    # Type check
bun x vitest run     # Run tests
bun run check        # Lint
```
## Stack

- **Frontend**: React 19, Vite 8, Tailwind CSS v4, shadcn/ui (radix-nova)
- **Backend**: Fastify 5, SQLite (drizzle-orm + libsql)
- **Testing**: Vitest 4, Testing Library
- **Runtime**: Bun, TypeScript 5.9