selftune 0.2.9 → 0.2.12
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +35 -35
- package/apps/local-dashboard/dist/assets/index-4_dAY17K.js +16 -0
- package/apps/local-dashboard/dist/assets/index-BxV5WZHc.css +2 -0
- package/apps/local-dashboard/dist/assets/rolldown-runtime-Dw2cE7zH.js +1 -0
- package/apps/local-dashboard/dist/assets/vendor-react-CKkiCskZ.js +11 -0
- package/apps/local-dashboard/dist/assets/vendor-table-pHbDxq36.js +8 -0
- package/apps/local-dashboard/dist/assets/vendor-ui-7xD7fNEU.js +12 -0
- package/apps/local-dashboard/dist/index.html +16 -15
- package/bin/selftune.cjs +1 -1
- package/cli/selftune/activation-rules.ts +1 -0
- package/cli/selftune/alpha-upload/build-payloads.ts +18 -2
- package/cli/selftune/alpha-upload/stage-canonical.ts +94 -0
- package/cli/selftune/auth/device-code.ts +32 -0
- package/cli/selftune/auto-update.ts +12 -0
- package/cli/selftune/badge/badge.ts +1 -0
- package/cli/selftune/canonical-export.ts +5 -0
- package/cli/selftune/claude-agents.ts +154 -0
- package/cli/selftune/contribute/bundle.ts +1 -0
- package/cli/selftune/contribute/contribute.ts +1 -0
- package/cli/selftune/cron/setup.ts +2 -2
- package/cli/selftune/dashboard-server.ts +1 -0
- package/cli/selftune/eval/hooks-to-evals.ts +1 -0
- package/cli/selftune/eval/import-skillsbench.ts +1 -0
- package/cli/selftune/eval/synthetic-evals.ts +2 -3
- package/cli/selftune/eval/unit-test.ts +1 -0
- package/cli/selftune/evolution/deploy-proposal.ts +9 -238
- package/cli/selftune/evolution/evolve-body.ts +93 -6
- package/cli/selftune/evolution/evolve.ts +3 -7
- package/cli/selftune/evolution/propose-body.ts +3 -2
- package/cli/selftune/evolution/propose-routing.ts +3 -2
- package/cli/selftune/evolution/refine-body.ts +3 -2
- package/cli/selftune/evolution/rollback.ts +1 -1
- package/cli/selftune/export.ts +1 -0
- package/cli/selftune/grading/grade-session.ts +8 -0
- package/cli/selftune/hooks/auto-activate.ts +1 -0
- package/cli/selftune/hooks/evolution-guard.ts +1 -1
- package/cli/selftune/hooks/prompt-log.ts +1 -0
- package/cli/selftune/hooks/session-stop.ts +34 -40
- package/cli/selftune/hooks/skill-change-guard.ts +1 -0
- package/cli/selftune/hooks/skill-eval.ts +1 -1
- package/cli/selftune/index.ts +23 -14
- package/cli/selftune/ingestors/claude-replay.ts +1 -0
- package/cli/selftune/ingestors/codex-rollout.ts +1 -0
- package/cli/selftune/ingestors/codex-wrapper.ts +1 -0
- package/cli/selftune/ingestors/openclaw-ingest.ts +1 -0
- package/cli/selftune/ingestors/opencode-ingest.ts +1 -0
- package/cli/selftune/init.ts +121 -29
- package/cli/selftune/localdb/db.ts +1 -0
- package/cli/selftune/localdb/direct-write.ts +39 -0
- package/cli/selftune/localdb/materialize.ts +2 -0
- package/cli/selftune/localdb/queries.ts +53 -0
- package/cli/selftune/localdb/schema.ts +28 -0
- package/cli/selftune/normalization.ts +1 -0
- package/cli/selftune/observability.ts +1 -0
- package/cli/selftune/repair/skill-usage.ts +1 -0
- package/cli/selftune/routes/orchestrate-runs.ts +1 -0
- package/cli/selftune/routes/overview.ts +1 -0
- package/cli/selftune/routes/report.ts +1 -1
- package/cli/selftune/routes/skill-report.ts +2 -1
- package/cli/selftune/status.ts +1 -1
- package/cli/selftune/sync.ts +30 -1
- package/cli/selftune/uninstall.ts +412 -0
- package/cli/selftune/utils/canonical-log.ts +2 -0
- package/cli/selftune/utils/frontmatter.ts +50 -7
- package/cli/selftune/utils/jsonl.ts +1 -0
- package/cli/selftune/utils/llm-call.ts +131 -3
- package/cli/selftune/utils/skill-log.ts +1 -0
- package/cli/selftune/utils/transcript.ts +1 -0
- package/cli/selftune/utils/trigger-check.ts +1 -1
- package/cli/selftune/workflows/skill-md-writer.ts +5 -5
- package/cli/selftune/workflows/workflows.ts +1 -0
- package/package.json +37 -33
- package/packages/telemetry-contract/fixtures/golden.test.ts +1 -0
- package/packages/telemetry-contract/package.json +1 -1
- package/packages/telemetry-contract/src/schemas.ts +1 -0
- package/packages/telemetry-contract/tests/compatibility.test.ts +1 -0
- package/packages/ui/README.md +35 -34
- package/packages/ui/package.json +3 -3
- package/packages/ui/src/components/ActivityTimeline.tsx +50 -43
- package/packages/ui/src/components/EvidenceViewer.tsx +306 -182
- package/packages/ui/src/components/EvolutionTimeline.tsx +83 -72
- package/packages/ui/src/components/InfoTip.tsx +4 -3
- package/packages/ui/src/components/OrchestrateRunsPanel.tsx +60 -53
- package/packages/ui/src/components/section-cards.tsx +20 -25
- package/packages/ui/src/components/skill-health-grid.tsx +213 -193
- package/packages/ui/src/lib/constants.tsx +1 -0
- package/packages/ui/src/primitives/badge.tsx +12 -15
- package/packages/ui/src/primitives/button.tsx +7 -7
- package/packages/ui/src/primitives/card.tsx +15 -26
- package/packages/ui/src/primitives/checkbox.tsx +7 -8
- package/packages/ui/src/primitives/collapsible.tsx +5 -5
- package/packages/ui/src/primitives/dropdown-menu.tsx +45 -55
- package/packages/ui/src/primitives/label.tsx +6 -6
- package/packages/ui/src/primitives/select.tsx +28 -37
- package/packages/ui/src/primitives/table.tsx +17 -44
- package/packages/ui/src/primitives/tabs.tsx +14 -21
- package/packages/ui/src/primitives/tooltip.tsx +10 -22
- package/skill/SKILL.md +70 -57
- package/skill/Workflows/AlphaUpload.md +4 -4
- package/skill/Workflows/AutoActivation.md +11 -6
- package/skill/Workflows/Badge.md +22 -16
- package/skill/Workflows/Baseline.md +34 -36
- package/skill/Workflows/Composability.md +16 -11
- package/skill/Workflows/Contribute.md +26 -21
- package/skill/Workflows/Cron.md +23 -22
- package/skill/Workflows/Dashboard.md +32 -27
- package/skill/Workflows/Doctor.md +33 -27
- package/skill/Workflows/Evals.md +48 -47
- package/skill/Workflows/EvolutionMemory.md +31 -21
- package/skill/Workflows/Evolve.md +84 -82
- package/skill/Workflows/EvolveBody.md +58 -47
- package/skill/Workflows/Grade.md +16 -13
- package/skill/Workflows/ImportSkillsBench.md +9 -6
- package/skill/Workflows/Ingest.md +36 -21
- package/skill/Workflows/Initialize.md +108 -40
- package/skill/Workflows/Orchestrate.md +22 -16
- package/skill/Workflows/Replay.md +12 -7
- package/skill/Workflows/Rollback.md +13 -6
- package/skill/Workflows/Schedule.md +6 -6
- package/skill/Workflows/Sync.md +18 -11
- package/skill/Workflows/UnitTest.md +28 -17
- package/skill/Workflows/Watch.md +28 -21
- package/skill/agents/diagnosis-analyst.md +11 -0
- package/skill/agents/evolution-reviewer.md +15 -1
- package/skill/agents/integration-guide.md +10 -0
- package/skill/agents/pattern-analyst.md +12 -1
- package/skill/references/grading-methodology.md +23 -24
- package/skill/references/interactive-config.md +7 -7
- package/skill/references/invocation-taxonomy.md +22 -20
- package/skill/references/logs.md +14 -6
- package/skill/references/setup-patterns.md +4 -2
- package/.claude/agents/diagnosis-analyst.md +0 -156
- package/.claude/agents/evolution-reviewer.md +0 -180
- package/.claude/agents/integration-guide.md +0 -212
- package/.claude/agents/pattern-analyst.md +0 -160
- package/apps/local-dashboard/dist/assets/index-Bs3Y4ixf.css +0 -1
- package/apps/local-dashboard/dist/assets/index-C4UYGWKr.js +0 -15
- package/apps/local-dashboard/dist/assets/vendor-react-BQH_6WrG.js +0 -60
- package/apps/local-dashboard/dist/assets/vendor-table-dK1QMLq9.js +0 -26
- package/apps/local-dashboard/dist/assets/vendor-ui-CO2mrx6e.js +0 -341
package/skill/references/interactive-config.md
CHANGED
@@ -9,21 +9,21 @@ execution mode, model selection, and key parameters.
 Each mutating workflow has a **Pre-Flight Configuration** step. Follow this pattern:

 1. Present a brief summary of what the command will do
-2. Use the `AskUserQuestion` tool to present structured options
+2. Use the `AskUserQuestion` tool to present structured options, one question per tool call. Mark recommended defaults in option text with `(recommended)`.
 3. Parse the user's selections from the tool response
 4. Show a confirmation summary of selected options before executing

-**IMPORTANT:**
+**IMPORTANT:** Prefer `AskUserQuestion` for pre-flight, but never batch multiple questions into one payload. Ask one question at a time. If `AskUserQuestion` is not available or Claude Code does not invoke it, fall back to inline numbered options. Do not invent tool responses.

 ## Model Tier Reference

 When presenting model choices, use this table:

-| Tier
-
-| Fast
-| Balanced | `sonnet` | ~5s/call
-| Best
+| Tier | Model | Speed | Cost | Quality | Best for |
+| -------- | -------- | --------- | ---- | --------- | ---------------------------------- |
+| Fast | `haiku` | ~2s/call | $ | Good | Iteration loops, bulk validation |
+| Balanced | `sonnet` | ~5s/call | $$ | Great | Single-pass proposals, gate checks |
+| Best | `opus` | ~10s/call | $$$ | Excellent | High-stakes final validation |

 ## Quick Path

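The Model Tier Reference table above could be kept as typed data so that every pre-flight prompt lists the same options. This is a minimal sketch, not selftune's implementation: `MODEL_TIERS` and `formatTierOption` are hypothetical names, and marking the Balanced tier as the default is an assumption made only to show the `(recommended)` marker.

```typescript
// Hypothetical sketch: the Model Tier Reference table as typed data.
type Tier = "Fast" | "Balanced" | "Best";

interface ModelTier {
  tier: Tier;
  model: string;
  speed: string;
  cost: string;
  quality: string;
  bestFor: string;
}

// Values copied from the table in the diff above.
const MODEL_TIERS: ModelTier[] = [
  { tier: "Fast", model: "haiku", speed: "~2s/call", cost: "$", quality: "Good", bestFor: "Iteration loops, bulk validation" },
  { tier: "Balanced", model: "sonnet", speed: "~5s/call", cost: "$$", quality: "Great", bestFor: "Single-pass proposals, gate checks" },
  { tier: "Best", model: "opus", speed: "~10s/call", cost: "$$$", quality: "Excellent", bestFor: "High-stakes final validation" },
];

// Render one option line; a recommended default gets the "(recommended)" marker.
function formatTierOption(t: ModelTier, recommended: boolean): string {
  const suffix = recommended ? " (recommended)" : "";
  return `${t.tier}: ${t.model}, ${t.speed}, ${t.cost}${suffix}`;
}

// Assumption for illustration only: Balanced is treated as the default.
const options = MODEL_TIERS.map((t) => formatTierOption(t, t.tier === "Balanced"));
console.log(options.join("\n"));
```

The same option strings would serve the inline numbered fallback when `AskUserQuestion` is unavailable.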
package/skill/references/invocation-taxonomy.md
CHANGED
@@ -67,10 +67,10 @@ Separate from eval types, selftune classifies each **live** skill invocation by
 the user triggered it. This is shown as the `invocation_mode` field in canonical
 telemetry and the "Mode" column in the dashboard.

-| Mode
-
-| `explicit` | User typed a slash command (`/skillname`)
-| `implicit` | User mentioned the skill by name in their prompt
+| Mode | Definition | Example |
+| ---------- | -------------------------------------------------------- | ----------------------------------------------- |
+| `explicit` | User typed a slash command (`/skillname`) | `/selftune grade` |
+| `implicit` | User mentioned the skill by name in their prompt | `evolve the selftune skill` |
 | `inferred` | Agent chose the skill autonomously — user never named it | `show me the dashboard` → agent invokes Browser |

 ### How classification works
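The three runtime modes in the table above can be illustrated with a small classifier. This is a sketch under stated assumptions: `classifyInvocationMode` is a hypothetical helper, not the `deriveInvocationMode` function the reference points to, and it assumes the skill is already known to have been invoked, so only the user's prompt text is inspected.

```typescript
type InvocationMode = "explicit" | "implicit" | "inferred";

// Hypothetical helper: classify one confirmed skill invocation from the prompt text.
function classifyInvocationMode(prompt: string, skillName: string): InvocationMode {
  const text = prompt.trim().toLowerCase();
  const name = skillName.toLowerCase();
  // Slash command: "/skillname", optionally followed by arguments.
  if (text === `/${name}` || text.startsWith(`/${name} `)) return "explicit";
  // The skill is named somewhere in the prompt.
  if (text.includes(name)) return "implicit";
  // The user never named the skill; the agent chose it on its own.
  return "inferred";
}

console.log(classifyInvocationMode("/selftune grade", "selftune"));
console.log(classifyInvocationMode("evolve the selftune skill", "selftune"));
console.log(classifyInvocationMode("show me the dashboard", "Browser"));
```

Plain substring matching is the simplest choice for a sketch; a production classifier would likely need word boundaries and alias handling.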
@@ -84,10 +84,10 @@ and mapped to canonical modes in `cli/selftune/normalization.ts` (`deriveInvocat

 ### Eval types vs runtime modes

-| Concept
-
-| **Eval invocation type**
-| **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred
+| Concept | Purpose | Values |
+| --------------------------- | ---------------------- | ---------------------------------------- |
+| **Eval invocation type** | Classifying test cases | explicit, implicit, contextual, negative |
+| **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred |

 `contextual` (eval) and `inferred` (runtime) are related but different: contextual means
 the user's intent is buried in domain context, while inferred means the agent chose the
@@ -99,12 +99,12 @@ skill without any user mention at all.

 A healthy skill catches all three positive invocation types:

-| Type
-
-| Explicit
-| Implicit
+| Type | Healthy | Unhealthy |
+| ---------- | ------------ | ------------------------------------------------------- |
+| Explicit | Catches all | Misses some (broken) |
+| Implicit | Catches most | Only catches explicit (too rigid) |
 | Contextual | Catches many | Only catches explicit + some implicit (needs evolution) |
-| Negative
+| Negative | Rejects all | False positives on keyword overlap |

 ### The Coverage Spectrum

@@ -128,6 +128,7 @@ The invocation taxonomy directly drives the evolution feedback loop:

 When `selftune eval generate` shows implicit queries that don't trigger the skill, the
 description is too narrow. The `evolve` command will:
+
 1. Extract the missed implicit patterns
 2. Propose description changes that cover them
 3. Validate that existing triggers still work
@@ -146,6 +147,7 @@ Evolution should tighten the scope or add "Don't Use When" clauses.
 ### The Evolution Priority

 Fix in this order:
+
 1. **Missed explicit** -- broken, fix immediately
 2. **Missed implicit** -- undertriggering, evolve next
 3. **Missed contextual** -- under-evolved, evolve when implicit is clean
@@ -168,11 +170,11 @@ Each entry in a generated eval set looks like:
 }
 ```

-| Field
-
-| `id`
-| `query`
-| `expected`
+| Field | Description |
+| ----------------- | -------------------------------------------------------- |
+| `id` | Sequential identifier |
+| `query` | The user's original query text |
+| `expected` | `true` = should trigger, `false` = should not |
 | `invocation_type` | One of: `explicit`, `implicit`, `contextual`, `negative` |
-| `skill_name`
-| `source_session`
+| `skill_name` | The skill this eval targets |
+| `source_session` | Session ID the query came from (if positive) |
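The field table above maps naturally onto a typed record. The sketch below is illustrative only: the numeric `id`, the optional `source_session`, and the rule that `negative` entries carry `expected: false` are assumptions inferred from the descriptions, and `isConsistent`, `countByType`, and the sample entries are hypothetical.

```typescript
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

interface EvalEntry {
  id: number; // "Sequential identifier"; assumed numeric
  query: string;
  expected: boolean; // true = should trigger, false = should not
  invocation_type: InvocationType;
  skill_name: string;
  source_session?: string; // assumed absent for negative entries
}

// Assumed rule: negative entries should not trigger, all other types should.
function isConsistent(e: EvalEntry): boolean {
  return e.expected === (e.invocation_type !== "negative");
}

// Count entries per invocation type to check the balance of an eval set.
function countByType(entries: EvalEntry[]): Record<InvocationType, number> {
  const counts: Record<InvocationType, number> = { explicit: 0, implicit: 0, contextual: 0, negative: 0 };
  for (const e of entries) counts[e.invocation_type] += 1;
  return counts;
}

// Made-up sample data, for illustration only.
const sample: EvalEntry[] = [
  { id: 1, query: "/selftune grade", expected: true, invocation_type: "explicit", skill_name: "selftune", source_session: "s-1" },
  { id: 2, query: "evolve the selftune skill", expected: true, invocation_type: "implicit", skill_name: "selftune", source_session: "s-2" },
  { id: 3, query: "tune my guitar", expected: false, invocation_type: "negative", skill_name: "selftune" },
];

console.log(countByType(sample), sample.every(isConsistent));
```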
package/skill/references/logs.md
CHANGED
@@ -43,6 +43,7 @@ One JSON record per line. Each record is one completed agent session.
 ```

 **source values:**
+
 - `claude_code` — written by session-stop.ts (Stop hook)
 - `codex` — written by ingestors/codex-wrapper.ts
 - `codex_rollout` — written by ingestors/codex-rollout.ts
@@ -78,6 +79,7 @@ One record per skill trigger event. Populated by skill-eval.ts (PostToolUse hook
 ```

 Optional provenance fields:
+
 - `skill_scope`: `project | global | admin | system | unknown`
 - `skill_project_root`: resolved repo/worktree root when the skill came from a project-local registry
 - `skill_registry_dir`: the registry directory where the resolved `SKILL.md` came from
@@ -88,6 +90,7 @@ record shape, but is rebuilt from source-truth transcripts/rollouts rather than
 hooks alone.

 Notes:
+
 - `launcher_base_dir` means selftune recovered Claude's `Base directory for this skill:` launcher metadata. If that recovered `SKILL.md` path points into a stable registry like `~/.agents/skills`, `~/.claude/skills`, `/etc/codex/skills`, or a real project checkout, selftune now reclassifies the invocation into the matching `skill_scope`. Ephemeral temp launcher directories still remain `unknown`.
 - `fallback` means selftune confirmed a skill invocation but could not yet resolve a concrete `SKILL.md` path.

@@ -243,6 +246,7 @@ Consumed signal example:
 ```

 **signal_type values:**
+
 - `correction` — User pointing out a missed skill ("why didn't you use X?", "you should have used X", "next time use X")
 - `explicit_request` — User asking to use a skill ("please use the X skill", "use the commit skill")
 - `manual_invocation` — Direct `/skill` invocation detected
@@ -265,12 +269,13 @@ One record per evolution action. Written by the evolution and rollback modules.
 "total": 50,
 "passed": 35,
 "failed": 15,
-"pass_rate": 0.
+"pass_rate": 0.7
 }
 }
 ```

 **action values:**
+
 - `created` — New evolution proposal generated. `details` starts with `original_description:` prefix preserving the pre-evolution SKILL.md content.
 - `validated` — Proposal tested against eval set. `eval_snapshot` contains before/after pass rates.
 - `deployed` — Updated SKILL.md written to disk. `eval_snapshot` contains final pass rates.
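The snapshot numbers in the hunk above are internally consistent: 35 passed out of 50 gives a pass rate of 35 / 50 = 0.7, and 50 - 35 = 15 failed. A minimal sketch of that arithmetic follows; `buildEvalSnapshot` is a hypothetical helper, and returning 0 for an empty eval set is an assumption.

```typescript
interface EvalSnapshot {
  total: number;
  passed: number;
  failed: number;
  pass_rate: number;
}

// Hypothetical helper: derive `failed` and `pass_rate` from the two counts.
function buildEvalSnapshot(total: number, passed: number): EvalSnapshot {
  if (!Number.isInteger(total) || !Number.isInteger(passed) || passed < 0 || passed > total) {
    throw new RangeError("expected integers with 0 <= passed <= total");
  }
  // Assumption: an empty eval set reports a pass rate of 0 instead of NaN.
  const pass_rate = total === 0 ? 0 : passed / total;
  return { total, passed, failed: total - passed, pass_rate };
}

const snapshot = buildEvalSnapshot(50, 35);
console.log(JSON.stringify(snapshot));
```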
@@ -289,6 +294,7 @@ One record per evolution action. Written by the evolution and rollback modules.
 One JSON object per line. Two observed variants:

 **Variant A (nested, current):**
+
 ```json
 {"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}}
 {"type": "assistant", "message": {"role": "assistant", "content": [
@@ -298,6 +304,7 @@ One JSON object per line. Two observed variants:
 ```

 **Variant B (flat, older):**
+
 ```json
 {"role": "user", "content": "..."}
 {"role": "assistant", "content": [{"type": "tool_use", "name": "Bash", "input": {"command": "..."}}]}
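A reader of these transcripts has to accept both variants. The sketch below normalizes one JSONL line into a common shape and then looks for skill reads, meaning `Read` tool calls whose `input.file_path` ends in `SKILL.md`. `normalizeLine` and `skillReads` are hypothetical names, and the sketch assumes each line is valid JSON; it is not selftune's parser.

```typescript
interface ContentBlock {
  type: string;
  text?: string;
  name?: string;
  input?: Record<string, unknown>;
}

interface Message {
  role: string;
  content: string | ContentBlock[];
}

// Accept Variant A ({"type", "message": {...}}) and Variant B ({"role", "content"}).
function normalizeLine(line: string): Message | null {
  const rec = JSON.parse(line) as Record<string, unknown>;
  const nested = rec.message;
  const candidate = (typeof nested === "object" && nested !== null ? nested : rec) as Record<string, unknown>;
  if (typeof candidate.role !== "string") return null;
  const content = candidate.content;
  if (typeof content !== "string" && !Array.isArray(content)) return null;
  return { role: candidate.role, content: content as string | ContentBlock[] };
}

// Collect the paths of `Read` tool calls that target a SKILL.md file.
function skillReads(msg: Message): string[] {
  if (typeof msg.content === "string") return [];
  const paths: string[] = [];
  for (const block of msg.content) {
    const filePath = block.input?.["file_path"];
    if (block.type === "tool_use" && block.name === "Read" && typeof filePath === "string" && filePath.endsWith("SKILL.md")) {
      paths.push(filePath);
    }
  }
  return paths;
}

const variantA = '{"type": "assistant", "message": {"role": "assistant", "content": [{"type": "tool_use", "name": "Read", "input": {"file_path": "/skills/pptx/SKILL.md"}}]}}';
const parsedA = normalizeLine(variantA);
console.log(parsedA ? skillReads(parsedA) : []);
```

Lines that carry no `role`, such as other event records, come back as `null` so the caller can skip them.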
@@ -309,7 +316,7 @@ Skill reads appear as `Read` tool calls where `input.file_path` ends in `SKILL.m

 ---

-## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout
+## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-\*.jsonl)

 ```json
 {"type": "thread.started", "thread_id": "..."}
@@ -332,13 +339,14 @@ Content is a JSON string containing an array of blocks. Anthropic format:

 ```json
 [
-{"type": "text", "text": "I'll create the presentation."},
-{"type": "tool_use", "name": "Bash", "input": {"command": "pip install python-pptx"}},
-{"type": "tool_use", "name": "Read", "input": {"file_path": "/skills/pptx/SKILL.md"}}
+{ "type": "text", "text": "I'll create the presentation." },
+{ "type": "tool_use", "name": "Bash", "input": { "command": "pip install python-pptx" } },
+{ "type": "tool_use", "name": "Read", "input": { "file_path": "/skills/pptx/SKILL.md" } }
 ]
 ```

 Tool results appear in subsequent user messages:
+
 ```json
-[{"type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false}]
+[{ "type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false }]
 ```
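Because the content field here is a JSON string and not a parsed array, a consumer has to decode it before inspecting blocks. The following sketch decodes such a string, lists the tools used, and checks whether any tool result reports an error. `parseBlocks`, `toolNames`, and `hasToolError` are hypothetical helpers written for this illustration.

```typescript
type Block = Record<string, unknown>;

// Decode a content string into its array of blocks; non-array payloads yield [].
function parseBlocks(contentJson: string): Block[] {
  const parsed: unknown = JSON.parse(contentJson);
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((b): b is Block => typeof b === "object" && b !== null);
}

// Names of all tool_use blocks, in order of appearance.
function toolNames(contentJson: string): string[] {
  const names: string[] = [];
  for (const block of parseBlocks(contentJson)) {
    if (block.type === "tool_use" && typeof block.name === "string") names.push(block.name);
  }
  return names;
}

// True when any tool_result block carries is_error: true.
function hasToolError(contentJson: string): boolean {
  return parseBlocks(contentJson).some((b) => b.type === "tool_result" && b.is_error === true);
}

const assistantContent = JSON.stringify([
  { type: "text", text: "I'll create the presentation." },
  { type: "tool_use", name: "Bash", input: { command: "pip install python-pptx" } },
  { type: "tool_use", name: "Read", input: { file_path: "/skills/pptx/SKILL.md" } },
]);
const userContent = '[{ "type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false }]';

console.log(toolNames(assistantContent), hasToolError(userContent));
```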
package/skill/references/setup-patterns.md
CHANGED
@@ -61,5 +61,7 @@ combined.
 ## Optional Repository Extensions

 selftune bundles specialized agent instruction files in `skill/agents/` for
-diagnosis, evolution review, pattern analysis, and setup help. These
-
+diagnosis, evolution review, pattern analysis, and setup help. These bundled
+files are the canonical source. On Claude Code, `selftune init` syncs
+compatibility copies into `~/.claude/agents/` so native subagent calls keep
+matching the bundled definitions.
package/.claude/agents/diagnosis-analyst.md
DELETED
@@ -1,156 +0,0 @@
----
-name: diagnosis-analyst
-description: Deep-dive analysis of underperforming skills with root cause identification and actionable recommendations.
----
-
-# Diagnosis Analyst
-
-## Role
-
-Investigate why a specific skill is underperforming. Analyze telemetry logs,
-grading results, and session transcripts to identify root causes and recommend
-targeted fixes.
-
-**Activation policy:** This is a subagent-only role, spawned by the main agent.
-If a user asks for diagnosis directly, the main agent should route to this subagent.
-
-## Connection to Workflows
-
-This agent is spawned by the main agent as a subagent when deeper analysis is
-needed — it is not called directly by the user.
-
-**Connected workflows:**
-- **Doctor** — when `selftune doctor` reveals persistent issues with a specific skill, spawn this agent for root cause analysis
-- **Grade** — when grades are consistently low for a skill, spawn this agent to investigate why
-- **Status** — when `selftune status` shows CRITICAL or WARNING flags on a skill, spawn this agent for a deep dive
-
-The main agent decides when to escalate to this subagent based on severity
-and persistence of the issue. One-off failures are handled inline; recurring
-or unexplained failures warrant spawning this agent.
-
-## Context
-
-You need access to:
-- `~/.claude/session_telemetry_log.jsonl` — session-level metrics
-- `~/.claude/skill_usage_log.jsonl` — skill trigger events
-- `~/.claude/all_queries_log.jsonl` — all user queries (triggered and missed)
-- `~/.claude/evolution_audit_log.jsonl` — evolution history
-- The target skill's `SKILL.md` file
-- Session transcripts referenced in telemetry entries
-
-## Workflow
-
-### Step 1: Identify the target skill
-
-Ask the user which skill to diagnose, or infer from context. Confirm the
-skill name before proceeding.
-
-### Step 2: Gather current health snapshot
-
-```bash
-selftune status
-selftune last
-```
-
-Parse JSON output. Note the skill's current pass rate, session count, and
-any warnings or regression flags.
-
-### Step 3: Pull telemetry stats
-
-```bash
-selftune eval generate --skill <name> --stats
-```
-
-Review aggregate metrics:
-- **Error rate** — high error rate suggests process failures, not trigger issues
-- **Tool call breakdown** — unusual patterns (e.g., excessive Bash retries) indicate thrashing
-- **Average turns** — abnormally high turn count suggests the agent is struggling
-
-### Step 4: Analyze trigger coverage
-
-```bash
-selftune eval generate --skill <name> --max 50
-```
-
-Review the generated eval set. Count entries by invocation type:
-- **Explicit missed** = description is fundamentally broken (critical)
-- **Implicit missed** = description too narrow (common, fixable via evolve)
-- **Contextual missed** = lacks domain vocabulary (fixable via evolve)
-- **False-positive negatives** = overtriggering (description too broad)
-
-Reference `skill/references/invocation-taxonomy.md` for the full taxonomy.
-
-### Step 5: Review grading evidence
-
-Read the skill's `SKILL.md` and check recent grading results. For each
-failed expectation, look at:
-- **Trigger tier** — did the skill fire at all?
-- **Process tier** — did the agent follow the right steps?
-- **Quality tier** — was the output actually good?
-
-Reference `skill/references/grading-methodology.md` for the 3-tier model.
-
-### Step 6: Check evolution history
-
-Read `~/.claude/evolution_audit_log.jsonl` for entries matching the skill.
-Look for:
-- Recent evolutions that may have introduced regressions
-- Rollbacks that suggest instability
-- Plateau patterns (repeated evolutions with no improvement)
-
-### Step 7: Inspect session transcripts
-
-For the worst-performing sessions, read the transcript JSONL files. Look for:
-- SKILL.md not being read (trigger failure)
-- Steps executed out of order (process failure)
-- Repeated errors or thrashing (quality failure)
-- Missing tool calls that should have occurred
-
-### Step 8: Synthesize diagnosis
-
-Compile findings into a structured report.
-
-## Commands
-
-| Command | Purpose |
-|---------|---------|
-| `selftune status` | Overall health snapshot |
-| `selftune last` | Most recent session details |
-| `selftune eval generate --skill <name> --stats` | Aggregate telemetry |
-| `selftune eval generate --skill <name> --max 50` | Generate eval set for coverage analysis |
-| `selftune doctor` | Check infrastructure health |
-
-## Output
-
-Produce a structured diagnosis report:
-
-```markdown
-## Diagnosis Report: <skill-name>
-
-### Summary
-[One-paragraph overview of the problem]
-
-### Health Metrics
-- Pass rate: X%
-- Sessions analyzed: N
-- Error rate: X%
-- Trigger coverage: explicit X% / implicit X% / contextual X%
-
-### Root Cause
-[Primary reason for underperformance, categorized as:]
-- TRIGGER: Skill not firing when it should
-- PROCESS: Skill fires but agent follows wrong steps
-- QUALITY: Steps are correct but output is poor
-- INFRASTRUCTURE: Hooks, logs, or config issues
-
-### Evidence
-[Specific log entries, transcript lines, or metrics supporting the diagnosis]
-
-### Recommendations
-1. [Highest priority fix]
-2. [Secondary fix]
-3. [Optional improvement]
-
-### Suggested Commands
-[Exact selftune commands to execute the recommended fixes]
-```
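Step 4 of the removed diagnosis-analyst file maps each kind of trigger miss to a finding. One way to sketch that mapping is shown below; `EvalResult`, `FINDINGS`, and `summarizeFailures` are hypothetical names, the `triggered` flag is an assumed input, and the report order simply follows the order in which Step 4 lists the four cases.

```typescript
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

interface EvalResult {
  invocation_type: InvocationType;
  expected: boolean; // should the skill have triggered?
  triggered: boolean; // did it actually trigger? (assumed input)
}

// Finding text per invocation type, paraphrasing Step 4.
const FINDINGS: Record<InvocationType, string> = {
  explicit: "description is fundamentally broken (critical)",
  implicit: "description too narrow (fixable via evolve)",
  contextual: "lacks domain vocabulary (fixable via evolve)",
  negative: "overtriggering (description too broad)",
};

// Report one finding per invocation type that has at least one wrong outcome.
function summarizeFailures(results: EvalResult[]): string[] {
  const failedTypes = new Set<InvocationType>();
  for (const r of results) {
    if (r.triggered !== r.expected) failedTypes.add(r.invocation_type);
  }
  const order: InvocationType[] = ["explicit", "implicit", "contextual", "negative"];
  return order.filter((t) => failedTypes.has(t)).map((t) => `${t}: ${FINDINGS[t]}`);
}

const findings = summarizeFailures([
  { invocation_type: "explicit", expected: true, triggered: true },
  { invocation_type: "implicit", expected: true, triggered: false },
  { invocation_type: "negative", expected: false, triggered: true },
]);
console.log(findings.join("\n"));
```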
package/.claude/agents/evolution-reviewer.md
DELETED
@@ -1,180 +0,0 @@
----
-name: evolution-reviewer
-description: Safety gate that reviews pending evolution proposals before deployment, checking for regressions and quality.
----
-
-# Evolution Reviewer
-
-## Role
-
-Review pending evolution proposals before they are deployed. Act as a safety
-gate that checks for regressions, validates eval set coverage, compares old
-vs. new descriptions, and provides an approve/reject verdict with reasoning.
-
-**Activate when the user says:**
-- "review evolution proposal"
-- "check before deploying evolution"
-- "is this evolution safe"
-- "review pending changes"
-- "should I deploy this evolution"
-
-## Connection to Workflows
-
-This agent is spawned by the main agent as a subagent to provide a safety
-review before deploying an evolution.
-
-**Connected workflows:**
-- **Evolve** — in the review-before-deploy step, spawn this agent to evaluate the proposal for regressions, scope creep, and eval set quality
-- **EvolveBody** — same role for full-body and routing-table evolutions
-
-**Mode behavior:**
-- **Interactive mode** — spawn this agent before deploying an evolution to get a human-readable safety review with an approve/reject verdict
-- **Autonomous mode** — the orchestrator handles validation internally using regression thresholds and auto-rollback; this agent is for interactive safety reviews only
-
-## Context
-
-You need access to:
-- `~/.claude/evolution_audit_log.jsonl` — proposal entries with before/after data
-- The target skill's `SKILL.md` file (current version)
-- The skill's `SKILL.md.bak` file (pre-evolution backup, if it exists)
-- The eval set used for validation (path from evolve output or `evals-<skill>.json`)
-- `skill/references/invocation-taxonomy.md` — invocation type definitions
-- `skill/references/grading-methodology.md` — grading standards
-
-## Workflow
-
-### Step 1: Identify the proposal
-
-Ask the user for the proposal ID, or find the latest pending proposal:
-
-```bash
-# Read the evolution audit log and find the most recent 'validated' entry
-# that has not yet been 'deployed'
-```
-
-Parse `~/.claude/evolution_audit_log.jsonl` for entries matching the skill.
-The latest `validated` entry without a subsequent `deployed` entry is the
-pending proposal.
-
-### Step 2: Run a dry-run if no proposal exists
-
-If no pending proposal is found, generate one:
-
-```bash
-selftune evolve --skill <name> --skill-path <path> --dry-run
-```
-
-Parse the JSON output for the proposal details.
-
-### Step 3: Compare descriptions
-
-Extract the original description from the audit log `created` entry
-(the `details` field starts with `original_description:`). Compare against
-the proposed new description.
-
-**Fallback:** If `created.details` does not contain the `original_description:`
-prefix, read the skill's `SKILL.md.bak` file (created by the evolve workflow
-as a pre-evolution backup) to obtain the original description.
-
-Check for:
-- **Preserved triggers** — all existing trigger phrases still present
-- **Added triggers** — new phrases covering missed queries
-- **Removed content** — anything removed that should not have been
-- **Tone consistency** — new text matches the style of the original
-- **Scope creep** — new description doesn't expand beyond the skill's purpose
-
-### Step 4: Validate eval set quality
-
-Read the eval set used for validation. Check:
-- **Size** — at least 20 entries for meaningful coverage
-- **Type balance** — mix of explicit, implicit, contextual, and negative
-- **Negative coverage** — enough negatives to catch overtriggering
-- **Representativeness** — queries reflect real usage, not synthetic edge cases
-
-Reference `skill/references/invocation-taxonomy.md` for healthy distribution.
-
-### Step 5: Check regression metrics
-
-From the proposal output or audit log `validated` entry, verify:
-- **Pass rate improved** — proposed rate > original rate
-- **No excessive regressions** — regression count < 5% of total evals
-- **Confidence above threshold** — proposal confidence >= 0.7
-- **No explicit regressions** — zero previously-passing explicit queries now failing
-
-### Step 6: Review evolution history
-
-Check for patterns that suggest instability:
-- Multiple evolutions in a short time (churn)
-- Previous rollbacks for this skill (fragility)
-- Plateau pattern (evolution not producing meaningful gains)
-
-### Step 7: Cross-check with watch baseline
-
-If the skill has been monitored with `selftune watch`, check:
-
-```bash
-selftune watch --skill <name> --skill-path <path>
-```
-
-Ensure the current baseline is healthy before introducing changes.
-
-### Step 8: Render verdict
-
-Issue an approve or reject decision with full reasoning.
-
-## Commands
-
-| Command | Purpose |
-|---------|---------|
-| `selftune evolve --skill <name> --skill-path <path> --dry-run` | Generate proposal without deploying |
-| Read eval file from evolve output or audit log | Inspect the exact eval set used for validation |
-| `selftune watch --skill <name> --skill-path <path>` | Check current performance baseline |
-| `selftune status` | Overall skill health context |
-
-## Output
-
-Produce a structured review verdict:
-
-```
-## Evolution Review: <skill-name>
-
-### Proposal ID
-<proposal-id>
-
-### Verdict: APPROVE / REJECT
-
-### Description Diff
-- Added: [new trigger phrases or content]
-- Removed: [anything removed]
-- Changed: [modified sections]
-
-### Metrics
-| Metric | Before | After | Delta |
-|--------|--------|-------|-------|
-| Pass rate | X% | Y% | +Z% |
-| Regression count | - | N | - |
-| Confidence | - | 0.XX | - |
-
-### Eval Set Assessment
-- Total entries: N
-- Type distribution: explicit X / implicit Y / contextual Z / negative W
-- Quality: [adequate / insufficient — with reason]
-
-### Risk Assessment
-- Regression risk: LOW / MEDIUM / HIGH
-- Overtriggering risk: LOW / MEDIUM / HIGH
-- Stability history: [stable / unstable — based on evolution history]
-
-### Reasoning
-[Detailed explanation of the verdict, citing specific evidence]
-
-### Conditions (if APPROVE)
-[Any conditions that should be met post-deploy:]
-- Run `selftune watch` for N sessions after deployment
-- Re-evaluate if pass rate drops below X%
-
-### Required Changes (if REJECT)
-[Specific changes needed before re-review:]
-1. [First required change]
-2. [Second required change]
-```
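The four regression checks in Step 5 of the removed evolution-reviewer file are mechanical enough to express as a gate function. The sketch below encodes only the thresholds stated there (improved pass rate, regressions under 5% of evals, confidence of at least 0.7, zero explicit regressions). `ProposalMetrics` and `failedGateChecks` are hypothetical names, and the field names are assumptions, not the audit log's schema.

```typescript
// Assumed input shape; field names are illustrative.
interface ProposalMetrics {
  originalPassRate: number;
  proposedPassRate: number;
  regressionCount: number;
  totalEvals: number;
  confidence: number;
  explicitRegressions: number;
}

// Return the list of failed checks; an empty list means the gate passes.
function failedGateChecks(m: ProposalMetrics): string[] {
  const failures: string[] = [];
  if (!(m.proposedPassRate > m.originalPassRate)) failures.push("pass rate did not improve");
  if (!(m.regressionCount < 0.05 * m.totalEvals)) failures.push("regressions at or above 5% of evals");
  if (!(m.confidence >= 0.7)) failures.push("confidence below 0.7");
  if (m.explicitRegressions !== 0) failures.push("explicit queries regressed");
  return failures;
}

// Made-up numbers for illustration.
const gateFailures = failedGateChecks({
  originalPassRate: 0.6,
  proposedPassRate: 0.7,
  regressionCount: 1,
  totalEvals: 50,
  confidence: 0.8,
  explicitRegressions: 0,
});
console.log(gateFailures.length === 0 ? "APPROVE" : `REJECT: ${gateFailures.join("; ")}`);
```

Returning the failed checks, not a bare boolean, lets the verdict cite a specific reason for each rejection, which matches the review template's "Required Changes" section.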