selftune 0.1.0

Files changed (45)
  1. package/CHANGELOG.md +23 -0
  2. package/README.md +259 -0
  3. package/bin/selftune.cjs +29 -0
  4. package/cli/selftune/constants.ts +71 -0
  5. package/cli/selftune/eval/hooks-to-evals.ts +422 -0
  6. package/cli/selftune/evolution/audit.ts +44 -0
  7. package/cli/selftune/evolution/deploy-proposal.ts +244 -0
  8. package/cli/selftune/evolution/evolve.ts +406 -0
  9. package/cli/selftune/evolution/extract-patterns.ts +145 -0
  10. package/cli/selftune/evolution/propose-description.ts +146 -0
  11. package/cli/selftune/evolution/rollback.ts +242 -0
  12. package/cli/selftune/evolution/stopping-criteria.ts +69 -0
  13. package/cli/selftune/evolution/validate-proposal.ts +137 -0
  14. package/cli/selftune/grading/grade-session.ts +459 -0
  15. package/cli/selftune/hooks/prompt-log.ts +52 -0
  16. package/cli/selftune/hooks/session-stop.ts +54 -0
  17. package/cli/selftune/hooks/skill-eval.ts +73 -0
  18. package/cli/selftune/index.ts +104 -0
  19. package/cli/selftune/ingestors/codex-rollout.ts +416 -0
  20. package/cli/selftune/ingestors/codex-wrapper.ts +332 -0
  21. package/cli/selftune/ingestors/opencode-ingest.ts +565 -0
  22. package/cli/selftune/init.ts +297 -0
  23. package/cli/selftune/monitoring/watch.ts +328 -0
  24. package/cli/selftune/observability.ts +255 -0
  25. package/cli/selftune/types.ts +255 -0
  26. package/cli/selftune/utils/jsonl.ts +75 -0
  27. package/cli/selftune/utils/llm-call.ts +192 -0
  28. package/cli/selftune/utils/logging.ts +40 -0
  29. package/cli/selftune/utils/schema-validator.ts +47 -0
  30. package/cli/selftune/utils/seeded-random.ts +31 -0
  31. package/cli/selftune/utils/transcript.ts +260 -0
  32. package/package.json +29 -0
  33. package/skill/SKILL.md +120 -0
  34. package/skill/Workflows/Doctor.md +145 -0
  35. package/skill/Workflows/Evals.md +193 -0
  36. package/skill/Workflows/Evolve.md +159 -0
  37. package/skill/Workflows/Grade.md +157 -0
  38. package/skill/Workflows/Ingest.md +159 -0
  39. package/skill/Workflows/Initialize.md +125 -0
  40. package/skill/Workflows/Rollback.md +131 -0
  41. package/skill/Workflows/Watch.md +128 -0
  42. package/skill/references/grading-methodology.md +176 -0
  43. package/skill/references/invocation-taxonomy.md +144 -0
  44. package/skill/references/logs.md +168 -0
  45. package/skill/settings_snippet.json +41 -0
package/skill/SKILL.md ADDED
@@ -0,0 +1,120 @@
---
name: selftune
description: >
  Skill observability and continuous improvement. Use when the user wants to:
  grade a session, generate evals, check undertriggering, evolve a skill
  description, rollback an evolution, monitor post-deploy performance, run
  health checks, or ingest sessions from Codex/OpenCode.
---

# selftune

Observe real agent sessions, detect missed triggers, grade execution quality,
and evolve skill descriptions toward the language real users actually use.

## Bootstrap

If `~/.selftune/config.json` does not exist, read `Workflows/Initialize.md`
first. Do not proceed with other commands until initialization is complete.

## Command Execution Policy

Build every CLI invocation from the config:

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH <command> [options]
```

Fallback (if config is missing or stale):
```bash
bun run <repo-path>/cli/selftune/index.ts <command> [options]
```

All commands output deterministic JSON. Always parse JSON output -- never
text-match against output strings.
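
A minimal sketch of that rule, using a hard-coded JSON literal as a stand-in for captured `doctor` output (the literal is illustrative, not real telemetry):

```shell
# Hypothetical captured output standing in for `bun run $CLI_PATH doctor`.
output='{"healthy": false, "summary": {"passed": 5, "failed": 1, "total": 6}}'

# Branch on parsed fields, never on substring matches against the raw text.
if [ "$(printf '%s' "$output" | jq -r '.healthy')" = "true" ]; then
  echo "all checks passed"
else
  echo "failed checks: $(printf '%s' "$output" | jq -r '.summary.failed')"
fi
```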

## Quick Reference

```bash
selftune grade --skill <name> [--expectations "..."] [--use-agent]
selftune evals --skill <name> [--list-skills] [--stats] [--max N]
selftune evolve --skill <name> --skill-path <path> [--dry-run]
selftune rollback --skill <name> --skill-path <path> [--proposal-id <id>]
selftune watch --skill <name> --skill-path <path> [--auto-rollback]
selftune doctor
selftune ingest-codex
selftune ingest-opencode
selftune wrap-codex -- <codex args>
```

## Workflow Routing

| Trigger keywords | Workflow | File |
|------------------|----------|------|
| grade, score, evaluate, assess session | Grade | Workflows/Grade.md |
| evals, eval set, undertriggering, skill stats | Evals | Workflows/Evals.md |
| evolve, improve, triggers, catch more queries | Evolve | Workflows/Evolve.md |
| rollback, undo, restore, revert evolution | Rollback | Workflows/Rollback.md |
| watch, monitor, regression, post-deploy, performing | Watch | Workflows/Watch.md |
| doctor, health, hooks, broken, diagnose | Doctor | Workflows/Doctor.md |
| ingest, import, codex logs, opencode, wrap codex | Ingest | Workflows/Ingest.md |
| init, setup, bootstrap, first time | Initialize | Workflows/Initialize.md |

## The Feedback Loop

```
Observe --> Detect --> Diagnose --> Propose --> Validate --> Deploy --> Watch
   |                                                                      |
   +----------------------------------------------------------------------+
```

1. **Observe** -- Hooks capture every session (queries, triggers, metrics)
2. **Detect** -- `evals` finds missed triggers across invocation types
3. **Diagnose** -- `grade` evaluates session quality with evidence
4. **Propose** -- `evolve` generates description improvements
5. **Validate** -- Evolution is tested against the eval set
6. **Deploy** -- Updated description replaces the original (with backup)
7. **Watch** -- `watch` monitors for regressions post-deploy

## Resource Index

| Resource | Purpose |
|----------|---------|
| `SKILL.md` | This file -- routing, triggers, quick reference |
| `references/logs.md` | Log file formats (telemetry, usage, queries, audit) |
| `references/grading-methodology.md` | 3-tier grading model, evidence standards, grading.json schema |
| `references/invocation-taxonomy.md` | 4 invocation types, coverage analysis, evolution connection |
| `settings_snippet.json` | Claude Code hook configuration template |
| `Workflows/Initialize.md` | First-time setup and config bootstrap |
| `Workflows/Grade.md` | Grade a session with expectations and evidence |
| `Workflows/Evals.md` | Generate eval sets, list skills, show stats |
| `Workflows/Evolve.md` | Evolve a skill description from failure patterns |
| `Workflows/Rollback.md` | Undo an evolution, restore previous description |
| `Workflows/Watch.md` | Post-deploy regression monitoring |
| `Workflows/Doctor.md` | Health checks on logs, hooks, schema |
| `Workflows/Ingest.md` | Import sessions from Codex and OpenCode |

## Examples

- "Grade my last pptx session"
- "What skills are undertriggering?"
- "Generate evals for the pptx skill"
- "Evolve the pptx skill to catch more queries"
- "Rollback the last evolution"
- "Is the skill performing well after the change?"
- "Check selftune health"
- "Ingest my codex logs"
- "Show me skill stats"

## Negative Examples

These should NOT trigger selftune:

- "Fix this React hydration bug"
- "Create a PowerPoint about Q3 results" (this is pptx, not selftune)
- "Run my unit tests"
- "What does this error mean?"

Route to other skills or general workflows unless the user explicitly
asks about grading, evals, evolution, monitoring, or skill observability.
package/skill/Workflows/Doctor.md ADDED
@@ -0,0 +1,145 @@
# selftune Doctor Workflow

Run health checks on selftune logs, hooks, and schema integrity.
Reports pass/fail status for each check with actionable guidance.

## Default Command

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH doctor
```

Fallback:
```bash
bun run <repo-path>/cli/selftune/index.ts doctor
```

## Options

None. Doctor runs all checks unconditionally.

## Output Format

```json
{
  "healthy": true,
  "checks": [
    {
      "name": "session_telemetry_log exists",
      "status": "pass",
      "detail": "Found 142 entries"
    },
    {
      "name": "skill_usage_log parseable",
      "status": "pass",
      "detail": "All 89 entries valid JSON"
    },
    {
      "name": "hooks installed",
      "status": "fail",
      "detail": "PostToolUse hook not found in ~/.claude/settings.json"
    }
  ],
  "summary": {
    "passed": 5,
    "failed": 1,
    "total": 6
  }
}
```

The process exits with code 0 if `healthy: true`, code 1 otherwise.

## Parsing Instructions

### Check Overall Health

```bash
# Parse: .healthy (boolean)
# Quick check: exit code 0 = healthy, 1 = unhealthy
```

### Find Failed Checks

```bash
# Parse: .checks[] | select(.status == "fail") | { name, detail }
```

### Get Summary Counts

```bash
# Parse: .summary.passed, .summary.failed, .summary.total
```
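
Putting those filters together against the sample output format above (the JSON literal here is a stand-in for a live `doctor` run):

```shell
# Stand-in for `bun run $CLI_PATH doctor` output, shaped like the example above.
doctor_json='{"healthy":false,"checks":[{"name":"hooks installed","status":"fail","detail":"PostToolUse hook not found"}],"summary":{"passed":5,"failed":1,"total":6}}'

# Overall health.
printf '%s' "$doctor_json" | jq -r '.healthy'

# Failed checks with their detail messages.
printf '%s' "$doctor_json" | jq -r '.checks[] | select(.status == "fail") | "\(.name): \(.detail)"'

# Summary counts.
printf '%s' "$doctor_json" | jq -r '.summary | "\(.passed)/\(.total) passed"'
```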

## Health Checks

Doctor validates these areas:

### Log File Checks

| Check | What it validates |
|-------|-------------------|
| Log files exist | `session_telemetry_log.jsonl`, `skill_usage_log.jsonl`, `all_queries_log.jsonl` exist in `~/.claude/` |
| Logs are parseable | Every line in each log file is valid JSON |
| Schema conformance | Required fields present per log type (see `references/logs.md`) |

### Hook Checks

| Check | What it validates |
|-------|-------------------|
| Hooks installed | `UserPromptSubmit`, `PostToolUse`, and `Stop` hooks are configured in `~/.claude/settings.json` |
| Hook scripts exist | The script files referenced by hooks exist on disk |

### Evolution Audit Checks

| Check | What it validates |
|-------|-------------------|
| Audit log integrity | `evolution_audit_log.jsonl` entries have required fields (`timestamp`, `proposal_id`, `action`) |
| Valid action values | All entries use known action types: `created`, `validated`, `deployed`, `rolled_back` |

## Steps

### 1. Run Doctor

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH doctor
```

### 2. Check Results

Parse the JSON output. If `healthy: true`, selftune is fully operational.

### 3. Fix Any Issues

For each failed check, take the appropriate action:

| Failed check | Fix |
|--------------|-----|
| Log files missing | Run a session to generate initial log entries. Check hook installation. |
| Logs not parseable | Inspect the corrupted log file. Remove or fix invalid lines. |
| Hooks not installed | Merge `skill/settings_snippet.json` into `~/.claude/settings.json`. Update paths. |
| Hook scripts missing | Verify the selftune repo path. Re-run `init` if the repo was moved. |
| Audit log invalid | Remove corrupted entries. Future operations will append clean entries. |
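
For the not-parseable case, one conservative cleanup is to keep only the lines jq accepts. A sketch against throwaway demo data (the paths and entries below are placeholders, not the real `~/.claude/` logs):

```shell
tmp=$(mktemp -d)
# Demo JSONL with one corrupted line in the middle.
printf '%s\n' '{"skill_name":"pptx"}' 'not json' '{"skill_name":"selftune"}' > "$tmp/usage_log.jsonl"

# Keep only valid-JSON lines; write a cleaned copy rather than editing in place.
while IFS= read -r line; do
  printf '%s\n' "$line" | jq -e . >/dev/null 2>&1 && printf '%s\n' "$line"
done < "$tmp/usage_log.jsonl" > "$tmp/usage_log.cleaned.jsonl"
```

Review the cleaned copy before swapping it in, since dropped lines are lost for good.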

### 4. Re-run Doctor

After fixes, run doctor again to verify all checks pass.

## Common Patterns

**"Something seems broken"**
> Run doctor first. Report any failing checks with their detail messages.

**"Are my hooks working?"**
> Doctor checks hook installation. If hooks pass but no data appears,
> verify the hook script paths point to actual files.

**"No telemetry available"**
> Doctor will report missing log files. Install hooks using the
> `settings_snippet.json` in the skill directory, then run a session.

**"Check selftune health"**
> Run doctor and report the summary. A clean bill of health means
> all checks pass and selftune is ready to grade/evolve/watch.
package/skill/Workflows/Evals.md ADDED
@@ -0,0 +1,193 @@
# selftune Evals Workflow

Generate eval sets from hook logs. Detects false negatives (queries that
should have triggered a skill but did not) and annotates each entry with
its invocation type.

## Default Command

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH evals --skill <name> [options]
```

Fallback:
```bash
bun run <repo-path>/cli/selftune/index.ts evals --skill <name> [options]
```

## Options

| Flag | Description | Default |
|------|-------------|---------|
| `--skill <name>` | Skill to generate evals for | Required (unless `--list-skills`) |
| `--list-skills` | List all logged skills with query counts | Off |
| `--stats` | Show aggregate telemetry stats for the skill | Off |
| `--max <n>` | Maximum eval entries to generate | 50 |
| `--seed <n>` | Random seed for negative sampling | Random |
| `--out <path>` | Output file path | `evals-<skill>.json` |

## Output Format

### Eval Set (default)

```json
[
  {
    "id": 1,
    "query": "Make me a slide deck for the Q3 board meeting",
    "expected": true,
    "invocation_type": "contextual",
    "skill_name": "pptx",
    "source_session": "abc123"
  },
  {
    "id": 2,
    "query": "What format should I use for a presentation?",
    "expected": false,
    "invocation_type": "negative",
    "skill_name": "pptx",
    "source_session": null
  }
]
```

### List Skills

```json
{
  "skills": [
    { "name": "pptx", "query_count": 42, "session_count": 15 },
    { "name": "selftune", "query_count": 28, "session_count": 10 }
  ]
}
```

### Stats

```json
{
  "skill_name": "pptx",
  "sessions": 15,
  "avg_turns": 4.2,
  "tool_call_breakdown": { "Read": 30, "Write": 15, "Bash": 45 },
  "error_rate": 0.13,
  "bash_patterns": ["pip install python-pptx", "python3 /tmp/create_pptx.py"]
}
```

## Parsing Instructions

### Count by Invocation Type

```bash
# Parse: group_by(.invocation_type) | map({ type: .[0].invocation_type, count: length })
```

### Find Missed Queries (False Negatives)

```bash
# Parse: .[] | select(.expected == true and .invocation_type != "explicit")
# These are queries that should trigger but might be missed
```

### Get Negative Examples

```bash
# Parse: .[] | select(.expected == false)
```
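
Applied to a two-entry sample shaped like the eval set above (the literal stands in for a real `evals-<skill>.json`):

```shell
# Two entries standing in for a generated eval set file.
evals='[{"id":1,"expected":true,"invocation_type":"contextual"},{"id":2,"expected":false,"invocation_type":"negative"}]'

# Count entries per invocation type.
printf '%s' "$evals" | jq -c 'group_by(.invocation_type) | map({type: .[0].invocation_type, count: length})'

# IDs of negative examples.
printf '%s' "$evals" | jq -c '[.[] | select(.expected == false) | .id]'
```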

## Sub-Workflows

### List Skills

Discover which skills have telemetry data and how many queries each has.

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH evals --list-skills
```

Use this first to identify which skills have enough data for eval generation.

### Generate Evals

Cross-reference `skill_usage_log.jsonl` (positive triggers) against
`all_queries_log.jsonl` (all queries, including non-triggers) to produce
an eval set annotated with invocation types.

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH evals --skill pptx --max 50 --out evals-pptx.json
```

The command:
1. Reads positive triggers from `skill_usage_log.jsonl`
2. Reads all queries from `all_queries_log.jsonl`
3. Identifies queries that should have triggered but did not
4. Samples negative examples (unrelated queries)
5. Annotates each entry with invocation type
6. Writes the eval set to the output file
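
Steps 1-3 can be approximated by hand with jq and `comm`, assuming each log entry carries a `query` field (and `skill_name` in the usage log) -- the real schemas live in `references/logs.md`, and the files below are demo stand-ins, not the actual `~/.claude/` logs:

```shell
tmp=$(mktemp -d)
# Demo stand-ins for skill_usage_log.jsonl and all_queries_log.jsonl.
printf '%s\n' '{"skill_name":"pptx","query":"make slides"}' > "$tmp/skill_usage_log.jsonl"
printf '%s\n' '{"query":"make slides"}' '{"query":"build a deck for the board"}' '{"query":"fix my tests"}' > "$tmp/all_queries_log.jsonl"

# Queries that triggered the skill vs. all queries seen.
jq -r 'select(.skill_name == "pptx") | .query' "$tmp/skill_usage_log.jsonl" | sort -u > "$tmp/triggered.txt"
jq -r '.query' "$tmp/all_queries_log.jsonl" | sort -u > "$tmp/all.txt"

# Candidate misses: queries never attributed to the skill.
comm -23 "$tmp/all.txt" "$tmp/triggered.txt"
```

The real command additionally labels each candidate (should-trigger vs. genuine negative), which this sketch does not do.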

### Show Stats

View aggregate telemetry for a skill: average turns, tool call breakdown,
error rates, and common bash command patterns.

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH evals --skill pptx --stats
```

## Steps

### 1. List Available Skills

Run `--list-skills` to see what skills have telemetry data. If the target
skill has zero or very few queries, more sessions are needed before
eval generation is useful.

### 2. Generate the Eval Set

Run with `--skill <name>`. Review the output file for:
- Balance between positive and negative entries
- Coverage of all three positive invocation types (explicit, implicit, contextual)
- Reasonable negative examples (keyword overlap but wrong intent)

### 3. Review Invocation Type Distribution

A healthy eval set has:
- Some explicit queries (easy baseline)
- Many implicit queries (natural usage)
- Several contextual queries (real-world usage)
- Enough negatives to prevent false positives

See `references/invocation-taxonomy.md` for what each type means and
what a healthy distribution looks like.

### 4. Identify Coverage Gaps

If the eval set is missing implicit or contextual queries, the skill may be
undertriggering. This is the signal for `evolve` to improve the description.

### 5. Optional: Check Stats

Use `--stats` to understand session patterns before evolution. High error
rates or unusual tool call distributions may indicate process issues
beyond trigger coverage.

## Common Patterns

**"What skills are undertriggering?"**
> Run `--list-skills`, then for each skill with significant query counts,
> generate evals and check for missed implicit/contextual queries.

**"Generate evals for pptx"**
> Run `evals --skill pptx`. Review the invocation type distribution.
> Feed the output to `evolve` if coverage gaps exist.

**"Show me skill stats"**
> Run `evals --skill <name> --stats` for aggregate telemetry.

**"I want reproducible evals"**
> Use `--seed <n>` to fix the random sampling of negative examples.
package/skill/Workflows/Evolve.md ADDED
@@ -0,0 +1,159 @@
# selftune Evolve Workflow

Improve a skill's description based on real usage signal. Analyzes failure
patterns from eval sets and proposes description changes that catch more
natural-language queries without breaking existing triggers.

## Default Command

```bash
CLI_PATH=$(cat ~/.selftune/config.json | jq -r .cli_path)
bun run $CLI_PATH evolve --skill <name> --skill-path <path> [options]
```

Fallback:
```bash
bun run <repo-path>/cli/selftune/index.ts evolve --skill <name> --skill-path <path> [options]
```

## Options

| Flag | Description | Default |
|------|-------------|---------|
| `--skill <name>` | Skill name | Required |
| `--skill-path <path>` | Path to the skill's SKILL.md | Required |
| `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
| `--mode api\|agent` | LLM mode for proposal generation | `agent` |
| `--agent <name>` | Agent CLI binary to use | Auto-detected |
| `--dry-run` | Propose and validate without deploying | Off |
| `--confidence <n>` | Minimum confidence threshold (0-1) | 0.7 |
| `--max-iterations <n>` | Maximum retry iterations | 3 |

## Output Format

Each evolution action is logged to `~/.claude/evolution_audit_log.jsonl`.
See `references/logs.md` for the audit log schema.

### Proposal Output (dry-run or pre-deploy)

```json
{
  "proposal_id": "evolve-pptx-1709125200000",
  "skill_name": "pptx",
  "iteration": 1,
  "original_pass_rate": 0.70,
  "proposed_pass_rate": 0.92,
  "regression_count": 0,
  "confidence": 0.85,
  "status": "validated",
  "changes_summary": "Added implicit triggers: 'slide deck', 'presentation', 'board meeting slides'"
}
```

### Audit Log Entries

The evolution process writes multiple audit entries:

| Action | When | Key details |
|--------|------|-------------|
| `created` | Proposal generated | `details` contains `original_description:` prefix |
| `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
| `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |

## Parsing Instructions

### Track Evolution Progress

```bash
# Read audit log for the proposal
# Parse: entries where proposal_id matches
# Check: action sequence should be created -> validated -> deployed
```

### Check for Regression

```bash
# Parse: .eval_snapshot in validated entry
# Verify: proposed pass_rate > original pass_rate
# Verify: regression_count < 5% of total evals
```
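
A sketch of the progress check against demo audit entries (field names follow the audit schema described above; the file path and entries are stand-ins for the real `~/.claude/evolution_audit_log.jsonl`):

```shell
tmp=$(mktemp -d)
# Demo audit log with one proposal's lifecycle.
printf '%s\n' \
  '{"timestamp":"2024-02-28T12:00:00Z","proposal_id":"evolve-pptx-1","action":"created"}' \
  '{"timestamp":"2024-02-28T12:01:00Z","proposal_id":"evolve-pptx-1","action":"validated"}' \
  '{"timestamp":"2024-02-28T12:02:00Z","proposal_id":"evolve-pptx-1","action":"deployed"}' \
  > "$tmp/evolution_audit_log.jsonl"

# Action sequence for one proposal (expect created -> validated -> deployed).
jq -r 'select(.proposal_id == "evolve-pptx-1") | .action' "$tmp/evolution_audit_log.jsonl"
```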

## Steps

### 1. Load or Generate Eval Set

If `--eval-set` is provided, use it directly. Otherwise, the command
generates one from logs (equivalent to running `evals --skill <name>`).

An eval set is required for validation. Without enough telemetry data,
evolution cannot reliably measure improvement.

### 2. Extract Failure Patterns

The command groups missed queries by invocation type:
- Missed explicit: the description is broken (rare, high priority)
- Missed implicit: the description is too narrow (common, evolve target)
- Missed contextual: the description lacks domain vocabulary (evolve target)

See `references/invocation-taxonomy.md` for the taxonomy.

### 3. Propose Description Changes

An LLM generates a candidate description that would catch the missed
queries. The candidate:
- Preserves existing trigger phrases that work
- Adds new phrases covering missed patterns
- Maintains the description's structure and tone

### 4. Validate Against Eval Set

The candidate is tested against the full eval set:
- Must improve overall pass rate
- Must not regress more than 5% on previously-passing entries
- Must exceed the `--confidence` threshold

If validation fails, the command retries up to `--max-iterations` times
with adjusted proposals.

### 5. Deploy (or Preview)

If `--dry-run`, the proposal is printed but not deployed. The audit log
still records `created` and `validated` entries for review.

If deploying:
1. The current SKILL.md is backed up to `SKILL.md.bak`
2. The updated description is written to SKILL.md
3. A `deployed` entry is logged to the evolution audit

### Stopping Criteria

After each iteration, these conditions are evaluated in priority order;
the first four stop the loop, and the fifth continues it:

| # | Condition | Meaning |
|---|-----------|---------|
| 1 | **Converged** | Pass rate >= 0.95 |
| 2 | **Max iterations** | Reached `--max-iterations` limit |
| 3 | **Low confidence** | Proposal confidence below `--confidence` threshold |
| 4 | **Plateau** | Pass rate unchanged across 3 consecutive iterations |
| 5 | **Continue** | None of the above -- keep iterating |
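
For illustration only -- the authoritative logic lives in `evolution/stopping-criteria.ts` -- the first three checks can be sketched as a jq classification of a proposal JSON at the default thresholds (plateau detection needs iteration history, so it is omitted):

```shell
# Proposal fields taken from the example output above.
proposal='{"iteration":1,"proposed_pass_rate":0.92,"confidence":0.85}'

# Classify in the same priority order as the table.
printf '%s' "$proposal" | jq -r '
  if .proposed_pass_rate >= 0.95 then "converged"
  elif .iteration >= 3 then "max_iterations"
  elif .confidence < 0.7 then "low_confidence"
  else "continue" end'
```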

## Common Patterns

**"Evolve the pptx skill"**
> Need `--skill pptx` and `--skill-path /path/to/pptx/SKILL.md`.
> If the user hasn't specified the path, search for the SKILL.md file
> in the workspace or ask.

**"Just show me what would change"**
> Use `--dry-run` to preview proposals without deploying.

**"The evolution didn't help enough"**
> Check the eval set quality. Missing contextual examples will limit
> what evolution can learn. Generate a richer eval set first.

**"Evolution keeps failing validation"**
> Lower `--confidence` slightly or increase `--max-iterations`.
> Also check if the eval set has contradictory expectations.

**"I want to use the API directly"**
> Pass `--mode api`. Requires `ANTHROPIC_API_KEY` in the environment.