selftune 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +23 -0
- package/README.md +259 -0
- package/bin/selftune.cjs +29 -0
- package/cli/selftune/constants.ts +71 -0
- package/cli/selftune/eval/hooks-to-evals.ts +422 -0
- package/cli/selftune/evolution/audit.ts +44 -0
- package/cli/selftune/evolution/deploy-proposal.ts +244 -0
- package/cli/selftune/evolution/evolve.ts +406 -0
- package/cli/selftune/evolution/extract-patterns.ts +145 -0
- package/cli/selftune/evolution/propose-description.ts +146 -0
- package/cli/selftune/evolution/rollback.ts +242 -0
- package/cli/selftune/evolution/stopping-criteria.ts +69 -0
- package/cli/selftune/evolution/validate-proposal.ts +137 -0
- package/cli/selftune/grading/grade-session.ts +459 -0
- package/cli/selftune/hooks/prompt-log.ts +52 -0
- package/cli/selftune/hooks/session-stop.ts +54 -0
- package/cli/selftune/hooks/skill-eval.ts +73 -0
- package/cli/selftune/index.ts +104 -0
- package/cli/selftune/ingestors/codex-rollout.ts +416 -0
- package/cli/selftune/ingestors/codex-wrapper.ts +332 -0
- package/cli/selftune/ingestors/opencode-ingest.ts +565 -0
- package/cli/selftune/init.ts +297 -0
- package/cli/selftune/monitoring/watch.ts +328 -0
- package/cli/selftune/observability.ts +255 -0
- package/cli/selftune/types.ts +255 -0
- package/cli/selftune/utils/jsonl.ts +75 -0
- package/cli/selftune/utils/llm-call.ts +192 -0
- package/cli/selftune/utils/logging.ts +40 -0
- package/cli/selftune/utils/schema-validator.ts +47 -0
- package/cli/selftune/utils/seeded-random.ts +31 -0
- package/cli/selftune/utils/transcript.ts +260 -0
- package/package.json +29 -0
- package/skill/SKILL.md +120 -0
- package/skill/Workflows/Doctor.md +145 -0
- package/skill/Workflows/Evals.md +193 -0
- package/skill/Workflows/Evolve.md +159 -0
- package/skill/Workflows/Grade.md +157 -0
- package/skill/Workflows/Ingest.md +159 -0
- package/skill/Workflows/Initialize.md +125 -0
- package/skill/Workflows/Rollback.md +131 -0
- package/skill/Workflows/Watch.md +128 -0
- package/skill/references/grading-methodology.md +176 -0
- package/skill/references/invocation-taxonomy.md +144 -0
- package/skill/references/logs.md +168 -0
- package/skill/settings_snippet.json +41 -0
package/skill/SKILL.md
ADDED

---
name: selftune
description: >
  Skill observability and continuous improvement. Use when the user wants to:
  grade a session, generate evals, check undertriggering, evolve a skill
  description, rollback an evolution, monitor post-deploy performance, run
  health checks, or ingest sessions from Codex/OpenCode.
---

# selftune

Observe real agent sessions, detect missed triggers, grade execution quality,
and evolve skill descriptions toward the language real users actually use.

## Bootstrap

If `~/.selftune/config.json` does not exist, read `Workflows/Initialize.md`
first. Do not proceed with other commands until initialization is complete.

## Command Execution Policy

Build every CLI invocation from the config:

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH <command> [options]
```

Fallback (if config is missing or stale):

```bash
bun run <repo-path>/cli/selftune/index.ts <command> [options]
```

All commands output deterministic JSON. Always parse JSON output -- never
text-match against output strings.

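As a sketch of the parse-don't-grep rule, a caller can branch on parsed fields instead of matching output text. The sample JSON below stands in for a live `doctor` run (its shape follows `Workflows/Doctor.md`):

```shell
# Hedged sketch: the sample string stands in for real command output,
# e.g. output=$(bun run "$CLI_PATH" doctor)
output='{"healthy": true, "summary": {"passed": 6, "failed": 0, "total": 6}}'
healthy=$(printf '%s' "$output" | jq -r '.healthy')
failed=$(printf '%s' "$output" | jq -r '.summary.failed')
if [ "$healthy" = "true" ] && [ "$failed" -eq 0 ]; then
  status="ok"
else
  status="needs attention"
fi
echo "$status"
```

Branching on `.healthy` stays stable even if the human-readable wording of the output changes.
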
## Quick Reference

```bash
selftune grade --skill <name> [--expectations "..."] [--use-agent]
selftune evals --skill <name> [--list-skills] [--stats] [--max N]
selftune evolve --skill <name> --skill-path <path> [--dry-run]
selftune rollback --skill <name> --skill-path <path> [--proposal-id <id>]
selftune watch --skill <name> --skill-path <path> [--auto-rollback]
selftune doctor
selftune ingest-codex
selftune ingest-opencode
selftune wrap-codex -- <codex args>
```

## Workflow Routing

| Trigger keywords | Workflow | File |
|------------------|----------|------|
| grade, score, evaluate, assess session | Grade | Workflows/Grade.md |
| evals, eval set, undertriggering, skill stats | Evals | Workflows/Evals.md |
| evolve, improve, triggers, catch more queries | Evolve | Workflows/Evolve.md |
| rollback, undo, restore, revert evolution | Rollback | Workflows/Rollback.md |
| watch, monitor, regression, post-deploy, performing | Watch | Workflows/Watch.md |
| doctor, health, hooks, broken, diagnose | Doctor | Workflows/Doctor.md |
| ingest, import, codex logs, opencode, wrap codex | Ingest | Workflows/Ingest.md |
| init, setup, bootstrap, first time | Initialize | Workflows/Initialize.md |

## The Feedback Loop

```
Observe --> Detect --> Diagnose --> Propose --> Validate --> Deploy --> Watch
   |                                                                      |
   +----------------------------------------------------------------------+
```

1. **Observe** -- Hooks capture every session (queries, triggers, metrics)
2. **Detect** -- `evals` finds missed triggers across invocation types
3. **Diagnose** -- `grade` evaluates session quality with evidence
4. **Propose** -- `evolve` generates description improvements
5. **Validate** -- Evolution is tested against the eval set
6. **Deploy** -- Updated description replaces the original (with backup)
7. **Watch** -- `watch` monitors for regressions post-deploy

## Resource Index

| Resource | Purpose |
|----------|---------|
| `SKILL.md` | This file -- routing, triggers, quick reference |
| `references/logs.md` | Log file formats (telemetry, usage, queries, audit) |
| `references/grading-methodology.md` | 3-tier grading model, evidence standards, grading.json schema |
| `references/invocation-taxonomy.md` | 4 invocation types, coverage analysis, evolution connection |
| `settings_snippet.json` | Claude Code hook configuration template |
| `Workflows/Initialize.md` | First-time setup and config bootstrap |
| `Workflows/Grade.md` | Grade a session with expectations and evidence |
| `Workflows/Evals.md` | Generate eval sets, list skills, show stats |
| `Workflows/Evolve.md` | Evolve a skill description from failure patterns |
| `Workflows/Rollback.md` | Undo an evolution, restore previous description |
| `Workflows/Watch.md` | Post-deploy regression monitoring |
| `Workflows/Doctor.md` | Health checks on logs, hooks, schema |
| `Workflows/Ingest.md` | Import sessions from Codex and OpenCode |

## Examples

- "Grade my last pptx session"
- "What skills are undertriggering?"
- "Generate evals for the pptx skill"
- "Evolve the pptx skill to catch more queries"
- "Rollback the last evolution"
- "Is the skill performing well after the change?"
- "Check selftune health"
- "Ingest my codex logs"
- "Show me skill stats"

## Negative Examples

These should NOT trigger selftune:

- "Fix this React hydration bug"
- "Create a PowerPoint about Q3 results" (this is pptx, not selftune)
- "Run my unit tests"
- "What does this error mean?"

Route to other skills or general workflows unless the user explicitly
asks about grading, evals, evolution, monitoring, or skill observability.
package/skill/Workflows/Doctor.md
ADDED

# selftune Doctor Workflow

Run health checks on selftune logs, hooks, and schema integrity.
Reports pass/fail status for each check with actionable guidance.

## Default Command

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH doctor
```

Fallback:

```bash
bun run <repo-path>/cli/selftune/index.ts doctor
```

## Options

None. Doctor runs all checks unconditionally.

## Output Format

```json
{
  "healthy": true,
  "checks": [
    {
      "name": "session_telemetry_log exists",
      "status": "pass",
      "detail": "Found 142 entries"
    },
    {
      "name": "skill_usage_log parseable",
      "status": "pass",
      "detail": "All 89 entries valid JSON"
    },
    {
      "name": "hooks installed",
      "status": "fail",
      "detail": "PostToolUse hook not found in ~/.claude/settings.json"
    }
  ],
  "summary": {
    "passed": 5,
    "failed": 1,
    "total": 6
  }
}
```

The process exits with code 0 if `healthy: true`, code 1 otherwise.

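The exit-code contract can be exercised with `jq -e`, which itself exits non-zero when a filter yields `false` or `null`. A sketch with an inline sample standing in for a real doctor run:

```shell
# Hedged sketch: sample JSON stands in for: bun run "$CLI_PATH" doctor
doctor_json='{"healthy": false}'
if printf '%s' "$doctor_json" | jq -e '.healthy' > /dev/null; then
  result="healthy"     # filter yielded true: exit code 0
else
  result="unhealthy"   # non-zero exit: inspect the failing checks next
fi
echo "$result"
```
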
## Parsing Instructions

### Check Overall Health

```bash
# Parse: .healthy (boolean)
# Quick check: exit code 0 = healthy, 1 = unhealthy
```

### Find Failed Checks

```bash
# Parse: .checks[] | select(.status == "fail") | { name, detail }
```

### Get Summary Counts

```bash
# Parse: .summary.passed, .summary.failed, .summary.total
```

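For example, the failed-check filter can be run directly with `jq`; the sample document here stands in for real doctor output:

```shell
# Hedged sketch: sample doctor output with one failing check
doctor_json='{"healthy":false,"checks":[{"name":"hooks installed","status":"fail","detail":"PostToolUse hook not found"},{"name":"skill_usage_log parseable","status":"pass","detail":"ok"}],"summary":{"passed":1,"failed":1,"total":2}}'
failures=$(printf '%s' "$doctor_json" |
  jq -r '.checks[] | select(.status == "fail") | "\(.name): \(.detail)"')
echo "$failures"
```

This prints one `name: detail` line per failing check, ready to report back to the user.
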
## Health Checks

Doctor validates these areas:

### Log File Checks

| Check | What it validates |
|-------|-------------------|
| Log files exist | `session_telemetry_log.jsonl`, `skill_usage_log.jsonl`, `all_queries_log.jsonl` exist in `~/.claude/` |
| Logs are parseable | Every line in each log file is valid JSON |
| Schema conformance | Required fields present per log type (see `references/logs.md`) |

### Hook Checks

| Check | What it validates |
|-------|-------------------|
| Hooks installed | `UserPromptSubmit`, `PostToolUse`, and `Stop` hooks are configured in `~/.claude/settings.json` |
| Hook scripts exist | The script files referenced by hooks exist on disk |

### Evolution Audit Checks

| Check | What it validates |
|-------|-------------------|
| Audit log integrity | `evolution_audit_log.jsonl` entries have required fields (`timestamp`, `proposal_id`, `action`) |
| Valid action values | All entries use known action types: `created`, `validated`, `deployed`, `rolled_back` |

## Steps

### 1. Run Doctor

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH doctor
```

### 2. Check Results

Parse the JSON output. If `healthy: true`, selftune is fully operational.

### 3. Fix Any Issues

For each failed check, take the appropriate action:

| Failed check | Fix |
|--------------|-----|
| Log files missing | Run a session to generate initial log entries. Check hook installation. |
| Logs not parseable | Inspect the corrupted log file. Remove or fix invalid lines. |
| Hooks not installed | Merge `skill/settings_snippet.json` into `~/.claude/settings.json`. Update paths. |
| Hook scripts missing | Verify the selftune repo path. Re-run `init` if the repo was moved. |
| Audit log invalid | Remove corrupted entries. Future operations will append clean entries. |

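For the "logs not parseable" row, one conservative repair is to keep only the lines `jq` accepts. A sketch against a throwaway sample file (the file name and `.clean` suffix are illustrative, not selftune behavior):

```shell
# Hedged sketch: build a sample corrupt log, then keep only valid JSON lines
log="sample_usage_log.jsonl"
printf '%s\n' '{"skill":"pptx"}' 'not-json garbage' '{"skill":"selftune"}' > "$log"
: > "$log.clean"
valid=0
while IFS= read -r line; do
  if printf '%s\n' "$line" | jq -e . > /dev/null 2>&1; then
    printf '%s\n' "$line" >> "$log.clean"
    valid=$((valid + 1))
  fi
done < "$log"
echo "kept $valid valid lines"
```

Review `"$log.clean"` before swapping it in for the original, so no legitimate entries are dropped.
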
### 4. Re-run Doctor

After fixes, run doctor again to verify all checks pass.

## Common Patterns

**"Something seems broken"**
> Run doctor first. Report any failing checks with their detail messages.

**"Are my hooks working?"**
> Doctor checks hook installation. If hooks pass but no data appears,
> verify the hook script paths point to actual files.

**"No telemetry available"**
> Doctor will report missing log files. Install hooks using the
> `settings_snippet.json` in the skill directory, then run a session.

**"Check selftune health"**
> Run doctor and report the summary. A clean bill of health means
> all checks pass and selftune is ready to grade/evolve/watch.
package/skill/Workflows/Evals.md
ADDED

# selftune Evals Workflow

Generate eval sets from hook logs. Detects false negatives (queries that
should have triggered a skill but did not) and annotates each entry with
its invocation type.

## Default Command

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH evals --skill <name> [options]
```

Fallback:

```bash
bun run <repo-path>/cli/selftune/index.ts evals --skill <name> [options]
```

## Options

| Flag | Description | Default |
|------|-------------|---------|
| `--skill <name>` | Skill to generate evals for | Required (unless `--list-skills`) |
| `--list-skills` | List all logged skills with query counts | Off |
| `--stats` | Show aggregate telemetry stats for the skill | Off |
| `--max <n>` | Maximum eval entries to generate | 50 |
| `--seed <n>` | Random seed for negative sampling | Random |
| `--out <path>` | Output file path | `evals-<skill>.json` |

## Output Format

### Eval Set (default)

```json
[
  {
    "id": 1,
    "query": "Make me a slide deck for the Q3 board meeting",
    "expected": true,
    "invocation_type": "contextual",
    "skill_name": "pptx",
    "source_session": "abc123"
  },
  {
    "id": 2,
    "query": "What format should I use for a presentation?",
    "expected": false,
    "invocation_type": "negative",
    "skill_name": "pptx",
    "source_session": null
  }
]
```

### List Skills

```json
{
  "skills": [
    { "name": "pptx", "query_count": 42, "session_count": 15 },
    { "name": "selftune", "query_count": 28, "session_count": 10 }
  ]
}
```

### Stats

```json
{
  "skill_name": "pptx",
  "sessions": 15,
  "avg_turns": 4.2,
  "tool_call_breakdown": { "Read": 30, "Write": 15, "Bash": 45 },
  "error_rate": 0.13,
  "bash_patterns": ["pip install python-pptx", "python3 /tmp/create_pptx.py"]
}
```

## Parsing Instructions

### Count by Invocation Type

```bash
# Parse: group_by(.invocation_type) | map({ type: .[0].invocation_type, count: length })
```

### Find Missed Queries (False Negatives)

```bash
# Parse: .[] | select(.expected == true and .invocation_type != "explicit")
# These are queries that should trigger but might be missed
```

### Get Negative Examples

```bash
# Parse: .[] | select(.expected == false)
```

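The invocation-type count can be run directly against an eval set. The inline sample below stands in for a generated `evals-<skill>.json`:

```shell
# Hedged sketch: sample eval set standing in for a generated file
evals='[
  {"query": "make slides", "expected": true, "invocation_type": "contextual"},
  {"query": "use the pptx skill", "expected": true, "invocation_type": "explicit"},
  {"query": "what format for slides?", "expected": false, "invocation_type": "negative"},
  {"query": "build a deck", "expected": true, "invocation_type": "contextual"}
]'
counts=$(printf '%s' "$evals" |
  jq -c 'group_by(.invocation_type) | map({type: .[0].invocation_type, count: length})')
echo "$counts"
```

Note that `group_by` sorts groups by key, so the order of types in the result is alphabetical, not input order.
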
## Sub-Workflows

### List Skills

Discover which skills have telemetry data and how many queries each has.

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH evals --list-skills
```

Use this first to identify which skills have enough data for eval generation.

### Generate Evals

Cross-reference `skill_usage_log.jsonl` (positive triggers) against
`all_queries_log.jsonl` (all queries, including non-triggers) to produce
an eval set annotated with invocation types.

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH evals --skill pptx --max 50 --out evals-pptx.json
```

The command:
1. Reads positive triggers from `skill_usage_log.jsonl`
2. Reads all queries from `all_queries_log.jsonl`
3. Identifies queries that should have triggered but did not
4. Samples negative examples (unrelated queries)
5. Annotates each entry with invocation type
6. Writes the eval set to the output file

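The cross-reference in step 3 can be sketched with `jq` alone. The arrays below stand in for the two logs, and this is a simplification: real generation also judges relevance before flagging a query as a miss.

```shell
# Hedged sketch: queries logged overall but never triggering the skill
all_queries='["make slides for the board", "fix this bug", "build a deck"]'
triggered='["fix this bug"]'
missed=$(jq -nc --argjson all "$all_queries" --argjson hit "$triggered" \
  '$all | map(select(. as $q | $hit | index($q) | not))')
echo "$missed"
```
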
### Show Stats

View aggregate telemetry for a skill: average turns, tool call breakdown,
error rates, and common bash command patterns.

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH evals --skill pptx --stats
```

## Steps

### 1. List Available Skills

Run `--list-skills` to see which skills have telemetry data. If the target
skill has zero or very few queries, more sessions are needed before
eval generation is useful.

### 2. Generate the Eval Set

Run with `--skill <name>`. Review the output file for:
- Balance between positive and negative entries
- Coverage of all three positive invocation types (explicit, implicit, contextual)
- Reasonable negative examples (keyword overlap but wrong intent)

### 3. Review Invocation Type Distribution

A healthy eval set has:
- Some explicit queries (easy baseline)
- Many implicit queries (natural usage)
- Several contextual queries (real-world usage)
- Enough negatives to prevent false positives

See `references/invocation-taxonomy.md` for what each type means and
what a healthy distribution looks like.

### 4. Identify Coverage Gaps

If the eval set is missing implicit or contextual queries, the skill may be
undertriggering. This is the signal for `evolve` to improve the description.

### 5. Optional: Check Stats

Use `--stats` to understand session patterns before evolution. High error
rates or unusual tool call distributions may indicate process issues
beyond trigger coverage.

## Common Patterns

**"What skills are undertriggering?"**
> Run `--list-skills`, then for each skill with significant query counts,
> generate evals and check for missed implicit/contextual queries.

**"Generate evals for pptx"**
> Run `evals --skill pptx`. Review the invocation type distribution.
> Feed the output to `evolve` if coverage gaps exist.

**"Show me skill stats"**
> Run `evals --skill <name> --stats` for aggregate telemetry.

**"I want reproducible evals"**
> Use `--seed <n>` to fix the random sampling of negative examples.
package/skill/Workflows/Evolve.md
ADDED

# selftune Evolve Workflow

Improve a skill's description based on real usage signal. Analyzes failure
patterns from eval sets and proposes description changes that catch more
natural-language queries without breaking existing triggers.

## Default Command

```bash
CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
bun run $CLI_PATH evolve --skill <name> --skill-path <path> [options]
```

Fallback:

```bash
bun run <repo-path>/cli/selftune/index.ts evolve --skill <name> --skill-path <path> [options]
```

## Options

| Flag | Description | Default |
|------|-------------|---------|
| `--skill <name>` | Skill name | Required |
| `--skill-path <path>` | Path to the skill's SKILL.md | Required |
| `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
| `--mode api\|agent` | LLM mode for proposal generation | `agent` |
| `--agent <name>` | Agent CLI binary to use | Auto-detected |
| `--dry-run` | Propose and validate without deploying | Off |
| `--confidence <n>` | Minimum confidence threshold (0-1) | 0.7 |
| `--max-iterations <n>` | Maximum retry iterations | 3 |

## Output Format

Each evolution action is logged to `~/.claude/evolution_audit_log.jsonl`.
See `references/logs.md` for the audit log schema.

### Proposal Output (dry-run or pre-deploy)

```json
{
  "proposal_id": "evolve-pptx-1709125200000",
  "skill_name": "pptx",
  "iteration": 1,
  "original_pass_rate": 0.70,
  "proposed_pass_rate": 0.92,
  "regression_count": 0,
  "confidence": 0.85,
  "status": "validated",
  "changes_summary": "Added implicit triggers: 'slide deck', 'presentation', 'board meeting slides'"
}
```

### Audit Log Entries

The evolution process writes multiple audit entries:

| Action | When | Key details |
|--------|------|-------------|
| `created` | Proposal generated | `details` contains `original_description:` prefix |
| `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
| `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |

## Parsing Instructions

### Track Evolution Progress

```bash
# Read audit log for the proposal
# Parse: entries where proposal_id matches
# Check: action sequence should be created -> validated -> deployed
```

### Check for Regression

```bash
# Parse: .eval_snapshot in validated entry
# Verify: proposed pass_rate > original pass_rate
# Verify: regression_count < 5% of total evals
```

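The regression gate can be applied directly to a proposal record. The sample below stands in for real CLI output, and the strict `== 0` check is a simplified stand-in for the 5% rule:

```shell
# Hedged sketch: sample proposal standing in for real evolve output
proposal='{"original_pass_rate":0.70,"proposed_pass_rate":0.92,"regression_count":0,"confidence":0.85}'
ok=$(printf '%s' "$proposal" |
  jq '(.proposed_pass_rate > .original_pass_rate) and (.regression_count == 0)')
echo "$ok"
```
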
## Steps

### 1. Load or Generate Eval Set

If `--eval-set` is provided, use it directly. Otherwise, the command
generates one from logs (equivalent to running `evals --skill <name>`).

An eval set is required for validation. Without enough telemetry data,
evolution cannot reliably measure improvement.

### 2. Extract Failure Patterns

The command groups missed queries by invocation type:
- Missed explicit: description is broken (rare, high priority)
- Missed implicit: description is too narrow (common, evolve target)
- Missed contextual: description lacks domain vocabulary (evolve target)

See `references/invocation-taxonomy.md` for the taxonomy.

### 3. Propose Description Changes

An LLM generates a candidate description that would catch the missed
queries. The candidate:
- Preserves existing trigger phrases that work
- Adds new phrases covering missed patterns
- Maintains the description's structure and tone

### 4. Validate Against Eval Set

The candidate is tested against the full eval set:
- Must improve overall pass rate
- Must not regress more than 5% on previously-passing entries
- Must exceed the `--confidence` threshold

If validation fails, the command retries up to `--max-iterations` times
with adjusted proposals.

### 5. Deploy (or Preview)

If `--dry-run`, the proposal is printed but not deployed. The audit log
still records `created` and `validated` entries for review.

If deploying:
1. The current SKILL.md is backed up to `SKILL.md.bak`
2. The updated description is written to SKILL.md
3. A `deployed` entry is logged to the evolution audit

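The backup-then-write step amounts to a copy before the overwrite. A sketch against a scratch file (the file name and description text are illustrative):

```shell
# Hedged sketch of the deploy backup: copy the old file before overwriting
skill_md="./SKILL.demo.md"
printf 'description: old triggers\n' > "$skill_md"          # current file
cp "$skill_md" "$skill_md.bak"                              # step 1: backup
printf 'description: old triggers, plus deck\n' > "$skill_md"  # step 2: deploy
backup=$(cat "$skill_md.bak")
echo "$backup"
```

The `.bak` copy is what `Workflows/Rollback.md` relies on to restore the previous description.
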
### Stopping Criteria

The evolution loop stops when any of these conditions is met (priority order):

| # | Condition | Meaning |
|---|-----------|---------|
| 1 | **Converged** | Pass rate >= 0.95 |
| 2 | **Max iterations** | Reached `--max-iterations` limit |
| 3 | **Low confidence** | Proposal confidence below `--confidence` threshold |
| 4 | **Plateau** | Pass rate unchanged across 3 consecutive iterations |
| 5 | **Continue** | None of the above -- keep iterating |

## Common Patterns

**"Evolve the pptx skill"**
> Needs `--skill pptx` and `--skill-path /path/to/pptx/SKILL.md`.
> If the user hasn't specified the path, search for the SKILL.md file
> in the workspace or ask.

**"Just show me what would change"**
> Use `--dry-run` to preview proposals without deploying.

**"The evolution didn't help enough"**
> Check the eval set quality. Missing contextual examples will limit
> what evolution can learn. Generate a richer eval set first.

**"Evolution keeps failing validation"**
> Lower `--confidence` slightly or increase `--max-iterations`.
> Also check whether the eval set has contradictory expectations.

**"I want to use the API directly"**
> Pass `--mode api`. Requires `ANTHROPIC_API_KEY` in the environment.