selftune 0.2.8 → 0.2.10

Files changed (140)
  1. package/README.md +35 -35
  2. package/apps/local-dashboard/dist/assets/index-BZVLv70T.js +16 -0
  3. package/apps/local-dashboard/dist/assets/{index-CRtLkBTi.css → index-Bs3Y4ixf.css} +1 -1
  4. package/apps/local-dashboard/dist/assets/{vendor-react-BQH_6WrG.js → vendor-react-BXP54cYo.js} +4 -4
  5. package/apps/local-dashboard/dist/assets/{vendor-table-dK1QMLq9.js → vendor-table-DTF_SXoy.js} +1 -1
  6. package/apps/local-dashboard/dist/assets/{vendor-ui-CO2mrx6e.js → vendor-ui-CWU0d1wd.js} +66 -66
  7. package/apps/local-dashboard/dist/index.html +15 -15
  8. package/bin/selftune.cjs +1 -1
  9. package/cli/selftune/activation-rules.ts +37 -18
  10. package/cli/selftune/agent-guidance.ts +16 -16
  11. package/cli/selftune/alpha-identity.ts +1 -2
  12. package/cli/selftune/alpha-upload/build-payloads.ts +18 -2
  13. package/cli/selftune/alpha-upload/flush.ts +2 -2
  14. package/cli/selftune/alpha-upload/stage-canonical.ts +106 -3
  15. package/cli/selftune/auth/device-code.ts +32 -0
  16. package/cli/selftune/auto-update.ts +12 -0
  17. package/cli/selftune/badge/badge.ts +1 -0
  18. package/cli/selftune/canonical-export.ts +5 -0
  19. package/cli/selftune/claude-agents.ts +154 -0
  20. package/cli/selftune/contribute/bundle.ts +2 -0
  21. package/cli/selftune/contribute/contribute.ts +1 -0
  22. package/cli/selftune/cron/setup.ts +2 -2
  23. package/cli/selftune/dashboard-contract.ts +1 -1
  24. package/cli/selftune/dashboard-server.ts +11 -52
  25. package/cli/selftune/eval/hooks-to-evals.ts +13 -6
  26. package/cli/selftune/eval/import-skillsbench.ts +1 -0
  27. package/cli/selftune/eval/synthetic-evals.ts +2 -3
  28. package/cli/selftune/eval/unit-test.ts +1 -0
  29. package/cli/selftune/evolution/deploy-proposal.ts +1 -0
  30. package/cli/selftune/evolution/evolve-body.ts +93 -6
  31. package/cli/selftune/evolution/evolve.ts +0 -1
  32. package/cli/selftune/evolution/propose-body.ts +3 -2
  33. package/cli/selftune/evolution/propose-routing.ts +3 -2
  34. package/cli/selftune/evolution/refine-body.ts +3 -2
  35. package/cli/selftune/export.ts +1 -0
  36. package/cli/selftune/grading/auto-grade.ts +1 -0
  37. package/cli/selftune/grading/grade-session.ts +9 -0
  38. package/cli/selftune/hooks/auto-activate.ts +6 -0
  39. package/cli/selftune/hooks/evolution-guard.ts +12 -15
  40. package/cli/selftune/hooks/prompt-log.ts +1 -0
  41. package/cli/selftune/hooks/session-stop.ts +34 -40
  42. package/cli/selftune/hooks/skill-change-guard.ts +1 -0
  43. package/cli/selftune/hooks/skill-eval.ts +1 -1
  44. package/cli/selftune/index.ts +23 -14
  45. package/cli/selftune/ingestors/claude-replay.ts +1 -0
  46. package/cli/selftune/ingestors/codex-rollout.ts +1 -0
  47. package/cli/selftune/ingestors/codex-wrapper.ts +1 -0
  48. package/cli/selftune/ingestors/openclaw-ingest.ts +1 -0
  49. package/cli/selftune/ingestors/opencode-ingest.ts +1 -0
  50. package/cli/selftune/init.ts +197 -96
  51. package/cli/selftune/localdb/db.ts +1 -0
  52. package/cli/selftune/localdb/direct-write.ts +93 -12
  53. package/cli/selftune/localdb/materialize.ts +2 -0
  54. package/cli/selftune/localdb/queries.ts +210 -0
  55. package/cli/selftune/localdb/schema.ts +72 -1
  56. package/cli/selftune/monitoring/watch.ts +1 -0
  57. package/cli/selftune/normalization.ts +4 -0
  58. package/cli/selftune/observability.ts +14 -7
  59. package/cli/selftune/orchestrate.ts +15 -37
  60. package/cli/selftune/repair/skill-usage.ts +7 -3
  61. package/cli/selftune/routes/orchestrate-runs.ts +1 -0
  62. package/cli/selftune/routes/overview.ts +1 -0
  63. package/cli/selftune/routes/skill-report.ts +1 -0
  64. package/cli/selftune/sync.ts +31 -1
  65. package/cli/selftune/types.ts +2 -2
  66. package/cli/selftune/uninstall.ts +412 -0
  67. package/cli/selftune/utils/canonical-log.ts +2 -0
  68. package/cli/selftune/utils/jsonl.ts +1 -0
  69. package/cli/selftune/utils/llm-call.ts +131 -3
  70. package/cli/selftune/utils/skill-log.ts +1 -0
  71. package/cli/selftune/utils/transcript.ts +1 -0
  72. package/cli/selftune/utils/trigger-check.ts +1 -1
  73. package/cli/selftune/workflows/skill-md-writer.ts +5 -5
  74. package/cli/selftune/workflows/workflows.ts +1 -0
  75. package/package.json +38 -33
  76. package/packages/telemetry-contract/fixtures/golden.test.ts +1 -0
  77. package/packages/telemetry-contract/package.json +3 -3
  78. package/packages/telemetry-contract/src/index.ts +0 -1
  79. package/packages/telemetry-contract/src/schemas.ts +6 -24
  80. package/packages/telemetry-contract/tests/compatibility.test.ts +1 -0
  81. package/packages/ui/README.md +35 -34
  82. package/packages/ui/package.json +3 -3
  83. package/packages/ui/src/components/ActivityTimeline.tsx +49 -42
  84. package/packages/ui/src/components/EvidenceViewer.tsx +306 -182
  85. package/packages/ui/src/components/EvolutionTimeline.tsx +83 -72
  86. package/packages/ui/src/components/InfoTip.tsx +4 -3
  87. package/packages/ui/src/components/OrchestrateRunsPanel.tsx +60 -53
  88. package/packages/ui/src/components/section-cards.tsx +19 -24
  89. package/packages/ui/src/components/skill-health-grid.tsx +213 -193
  90. package/packages/ui/src/lib/constants.tsx +1 -0
  91. package/packages/ui/src/primitives/badge.tsx +12 -15
  92. package/packages/ui/src/primitives/button.tsx +7 -7
  93. package/packages/ui/src/primitives/card.tsx +15 -26
  94. package/packages/ui/src/primitives/checkbox.tsx +7 -8
  95. package/packages/ui/src/primitives/collapsible.tsx +5 -5
  96. package/packages/ui/src/primitives/dropdown-menu.tsx +45 -55
  97. package/packages/ui/src/primitives/label.tsx +6 -6
  98. package/packages/ui/src/primitives/select.tsx +28 -37
  99. package/packages/ui/src/primitives/table.tsx +17 -44
  100. package/packages/ui/src/primitives/tabs.tsx +14 -21
  101. package/packages/ui/src/primitives/tooltip.tsx +10 -22
  102. package/skill/SKILL.md +72 -59
  103. package/skill/Workflows/AlphaUpload.md +4 -4
  104. package/skill/Workflows/AutoActivation.md +11 -6
  105. package/skill/Workflows/Badge.md +22 -16
  106. package/skill/Workflows/Baseline.md +34 -36
  107. package/skill/Workflows/Composability.md +16 -11
  108. package/skill/Workflows/Contribute.md +26 -21
  109. package/skill/Workflows/Cron.md +23 -22
  110. package/skill/Workflows/Dashboard.md +40 -40
  111. package/skill/Workflows/Doctor.md +40 -34
  112. package/skill/Workflows/Evals.md +48 -47
  113. package/skill/Workflows/EvolutionMemory.md +31 -21
  114. package/skill/Workflows/Evolve.md +84 -82
  115. package/skill/Workflows/EvolveBody.md +58 -47
  116. package/skill/Workflows/Grade.md +16 -13
  117. package/skill/Workflows/ImportSkillsBench.md +9 -6
  118. package/skill/Workflows/Ingest.md +36 -21
  119. package/skill/Workflows/Initialize.md +138 -97
  120. package/skill/Workflows/Orchestrate.md +22 -16
  121. package/skill/Workflows/Replay.md +12 -7
  122. package/skill/Workflows/Rollback.md +13 -6
  123. package/skill/Workflows/Schedule.md +6 -6
  124. package/skill/Workflows/Sync.md +18 -11
  125. package/skill/Workflows/UnitTest.md +28 -17
  126. package/skill/Workflows/Watch.md +28 -21
  127. package/skill/agents/diagnosis-analyst.md +11 -0
  128. package/skill/agents/evolution-reviewer.md +15 -1
  129. package/skill/agents/integration-guide.md +10 -0
  130. package/skill/agents/pattern-analyst.md +12 -1
  131. package/skill/references/grading-methodology.md +23 -24
  132. package/skill/references/interactive-config.md +7 -7
  133. package/skill/references/invocation-taxonomy.md +22 -20
  134. package/skill/references/logs.md +20 -6
  135. package/skill/references/setup-patterns.md +4 -2
  136. package/.claude/agents/diagnosis-analyst.md +0 -156
  137. package/.claude/agents/evolution-reviewer.md +0 -180
  138. package/.claude/agents/integration-guide.md +0 -212
  139. package/.claude/agents/pattern-analyst.md +0 -160
  140. package/apps/local-dashboard/dist/assets/index-Bk9vSHHd.js +0 -15
@@ -66,6 +66,7 @@ Then read the actual `SKILL.md` files for the skills in scope.
  ### 2. Extract each skill's ownership contract

  For each skill, capture:
+
  - frontmatter description
  - workflow-routing triggers
  - explicit exclusions or negative examples
@@ -74,6 +75,7 @@ For each skill, capture:
  ### 3. Detect conflicts and gaps

  Compare trigger keywords and description phrases across all skills. Flag:
+
  - direct conflicts
  - semantic overlaps
  - negative-example gaps
@@ -83,6 +85,7 @@ Compare trigger keywords and description phrases across all skills. Flag:
  ### 4. Analyze real query behavior

  Read the logs and look for:
+
  - queries that triggered multiple skills
  - queries that triggered no skills despite matching one or more descriptions
  - queries that appear to have been routed to the wrong skill
@@ -103,6 +106,7 @@ shifted ownership or introduced churn.
  ### 6. Recommend ownership changes

  For each important conflict, state:
+
  - which skill should own the query family
  - which skill should back off
  - whether the fix is a description change, routing-table change, negative
@@ -111,6 +115,7 @@ For each important conflict, state:
  ## Stop Conditions

  Stop and return to the parent if:
+
  - the skills in scope are not identifiable
  - there is not enough log data to say anything useful
  - the question is really about one underperforming skill rather than
@@ -124,26 +129,32 @@ Return a compact report with these sections:
  ## Cross-Skill Pattern Analysis

  ### Summary
+
  [2-4 sentence overview]

  ### Findings
+
  - [Finding 1]
  - [Finding 2]
  - [Finding 3]

  ### Conflict Matrix
+
  | Skill A | Skill B | Problem | Evidence | Recommended Owner |
- |---------|---------|---------|----------|-------------------|
+ | ------- | ------- | ------- | -------- | ----------------- |
  | ... | ... | ... | ... | ... |

  ### Coverage Gaps
+
  - [query family or sample]

  ### Recommended Changes
+
  1. [Highest-priority change]
  2. [Second change]
  3. [Optional follow-up]

  ### Confidence
+
  [high / medium / low]
  ```
@@ -9,11 +9,11 @@ referenced by evolution workflows to understand quality signals.

  Every session is graded across three tiers, each answering a different question:

- | Tier | Question | Example expectation |
- |------|----------|---------------------|
- | **Trigger** | Did the skill fire at all? | `skills_triggered` contains the skill name |
- | **Process** | Did the agent follow the right steps? | SKILL.md was read before main work started |
- | **Quality** | Was the output actually good? | Output file has correct content and structure |
+ | Tier | Question | Example expectation |
+ | ----------- | ------------------------------------- | --------------------------------------------- |
+ | **Trigger** | Did the skill fire at all? | `skills_triggered` contains the skill name |
+ | **Process** | Did the agent follow the right steps? | SKILL.md was read before main work started |
+ | **Quality** | Was the output actually good? | Output file has correct content and structure |

  A session can pass Trigger but fail Process (skill fired, but steps were wrong),
  or pass Process but fail Quality (steps were right, but output was bad).
@@ -71,13 +71,14 @@ Always include at least one Process and one Quality expectation.
  After grading explicit expectations, extract 2-4 implicit claims from the transcript.
  Each claim falls into one of three types:

- | Type | What it captures | Example |
- |------|------------------|---------|
- | **Factual** | A verifiable statement the agent made | "The agent said 12 slides were created" |
- | **Process** | An observed behavior pattern | "The agent read SKILL.md before making any file changes" |
- | **Quality** | An output characteristic | "The output file was named correctly" |
+ | Type | What it captures | Example |
+ | ----------- | ------------------------------------- | -------------------------------------------------------- |
+ | **Factual** | A verifiable statement the agent made | "The agent said 12 slides were created" |
+ | **Process** | An observed behavior pattern | "The agent read SKILL.md before making any file changes" |
+ | **Quality** | An output characteristic | "The output file was named correctly" |

  For each claim:
+
  1. State the claim clearly
  2. Classify its type
  3. Mark `verified: true` or `verified: false`
@@ -153,9 +154,7 @@ Only raise things worth improving. The goal is actionable feedback, not exhausti
  }
  ],
  "eval_feedback": {
- "suggestions": [
- { "reason": "No expectation checks slide content" }
- ],
+ "suggestions": [{ "reason": "No expectation checks slide content" }],
  "overall": "Process coverage good; add output quality assertions."
  }
  }
@@ -163,14 +162,14 @@ Only raise things worth improving. The goal is actionable feedback, not exhausti

  ### Field descriptions

- | Field | Type | Description |
- |-------|------|-------------|
- | `session_id` | string | From session telemetry |
- | `skill_name` | string | The skill being graded |
- | `transcript_path` | string | Path to the session transcript JSONL |
- | `graded_at` | string | ISO 8601 timestamp of grading |
- | `expectations[]` | array | Each expectation with verdict and evidence |
- | `summary` | object | Aggregate pass/fail counts and rate |
- | `execution_metrics` | object | Raw metrics from session telemetry |
- | `claims[]` | array | Implicit claims extracted from transcript |
- | `eval_feedback` | object | Suggestions for improving the eval set |
+ | Field | Type | Description |
+ | ------------------- | ------ | ------------------------------------------ |
+ | `session_id` | string | From session telemetry |
+ | `skill_name` | string | The skill being graded |
+ | `transcript_path` | string | Path to the session transcript JSONL |
+ | `graded_at` | string | ISO 8601 timestamp of grading |
+ | `expectations[]` | array | Each expectation with verdict and evidence |
+ | `summary` | object | Aggregate pass/fail counts and rate |
+ | `execution_metrics` | object | Raw metrics from session telemetry |
+ | `claims[]` | array | Implicit claims extracted from transcript |
+ | `eval_feedback` | object | Suggestions for improving the eval set |
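The field table in the hunk above can be read as a record type. The following is a minimal TypeScript sketch of that reading, not the package's actual type definitions. Only the top-level field names and the `suggestions`/`overall` keys appear in the diff. The inner shapes (`text`, `passed`, `evidence`, and the `summary` keys) and the `summarize` helper are assumptions made for illustration.

```typescript
// Hypothetical inner shape: the diff only says each expectation carries
// a verdict and evidence, so these property names are assumed.
interface Expectation {
  text: string;
  passed: boolean;
  evidence: string;
}

// Assumed layout for "aggregate pass/fail counts and rate".
interface Summary {
  total: number;
  passed: number;
  failed: number;
  pass_rate: number;
}

// Top-level field names follow the "Field descriptions" table.
interface GradingResult {
  session_id: string;
  skill_name: string;
  transcript_path: string;
  graded_at: string; // ISO 8601 timestamp
  expectations: Expectation[];
  summary: Summary;
  execution_metrics: Record<string, unknown>;
  claims: unknown[];
  eval_feedback: { suggestions: { reason: string }[]; overall: string };
}

// Illustrative helper: derive the aggregate counts from the verdicts.
function summarize(expectations: Expectation[]): Summary {
  const total = expectations.length;
  const passed = expectations.filter((e) => e.passed).length;
  const failed = total - passed;
  // Avoid dividing by zero when nothing was graded.
  const pass_rate = total === 0 ? 0 : passed / total;
  return { total, passed, failed, pass_rate };
}
```

Deriving `summary` from `expectations` in one place keeps the counts and the rate from drifting apart. Whether selftune computes it this way is not shown in the diff.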
@@ -9,21 +9,21 @@ execution mode, model selection, and key parameters.
  Each mutating workflow has a **Pre-Flight Configuration** step. Follow this pattern:

  1. Present a brief summary of what the command will do
- 2. Use the `AskUserQuestion` tool to present structured options (max 4 questions per call — split into multiple calls if needed). Mark recommended defaults in option text with `(recommended)`.
+ 2. Use the `AskUserQuestion` tool to present structured options, one question per tool call. Mark recommended defaults in option text with `(recommended)`.
  3. Parse the user's selections from the tool response
  4. Show a confirmation summary of selected options before executing

- **IMPORTANT:** Always use `AskUserQuestion` for pre-flight never present options as inline numbered text. The tool provides a structured UI that is easier for users to interact with. If `AskUserQuestion` is not available, fall back to inline numbered options.
+ **IMPORTANT:** Prefer `AskUserQuestion` for pre-flight, but never batch multiple questions into one payload. Ask one question at a time. If `AskUserQuestion` is not available or Claude Code does not invoke it, fall back to inline numbered options. Do not invent tool responses.

  ## Model Tier Reference

  When presenting model choices, use this table:

- | Tier | Model | Speed | Cost | Quality | Best for |
- |------|-------|-------|------|---------|----------|
- | Fast | `haiku` | ~2s/call | $ | Good | Iteration loops, bulk validation |
- | Balanced | `sonnet` | ~5s/call | $$ | Great | Single-pass proposals, gate checks |
- | Best | `opus` | ~10s/call | $$$ | Excellent | High-stakes final validation |
+ | Tier | Model | Speed | Cost | Quality | Best for |
+ | -------- | -------- | --------- | ---- | --------- | ---------------------------------- |
+ | Fast | `haiku` | ~2s/call | $ | Good | Iteration loops, bulk validation |
+ | Balanced | `sonnet` | ~5s/call | $$ | Great | Single-pass proposals, gate checks |
+ | Best | `opus` | ~10s/call | $$$ | Excellent | High-stakes final validation |

  ## Quick Path

@@ -67,10 +67,10 @@ Separate from eval types, selftune classifies each **live** skill invocation by
  the user triggered it. This is shown as the `invocation_mode` field in canonical
  telemetry and the "Mode" column in the dashboard.

- | Mode | Definition | Example |
- |------|-----------|---------|
- | `explicit` | User typed a slash command (`/skillname`) | `/selftune grade` |
- | `implicit` | User mentioned the skill by name in their prompt | `evolve the selftune skill` |
+ | Mode | Definition | Example |
+ | ---------- | -------------------------------------------------------- | ----------------------------------------------- |
+ | `explicit` | User typed a slash command (`/skillname`) | `/selftune grade` |
+ | `implicit` | User mentioned the skill by name in their prompt | `evolve the selftune skill` |
  | `inferred` | Agent chose the skill autonomously — user never named it | `show me the dashboard` → agent invokes Browser |

  ### How classification works
@@ -84,10 +84,10 @@ and mapped to canonical modes in `cli/selftune/normalization.ts` (`deriveInvocat

  ### Eval types vs runtime modes

- | Concept | Purpose | Values |
- |---------|---------|--------|
- | **Eval invocation type** | Classifying test cases | explicit, implicit, contextual, negative |
- | **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred |
+ | Concept | Purpose | Values |
+ | --------------------------- | ---------------------- | ---------------------------------------- |
+ | **Eval invocation type** | Classifying test cases | explicit, implicit, contextual, negative |
+ | **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred |

  `contextual` (eval) and `inferred` (runtime) are related but different: contextual means
  the user's intent is buried in domain context, while inferred means the agent chose the
@@ -99,12 +99,12 @@ skill without any user mention at all.

  A healthy skill catches all three positive invocation types:

- | Type | Healthy | Unhealthy |
- |------|---------|-----------|
- | Explicit | Catches all | Misses some (broken) |
- | Implicit | Catches most | Only catches explicit (too rigid) |
+ | Type | Healthy | Unhealthy |
+ | ---------- | ------------ | ------------------------------------------------------- |
+ | Explicit | Catches all | Misses some (broken) |
+ | Implicit | Catches most | Only catches explicit (too rigid) |
  | Contextual | Catches many | Only catches explicit + some implicit (needs evolution) |
- | Negative | Rejects all | False positives on keyword overlap |
+ | Negative | Rejects all | False positives on keyword overlap |

  ### The Coverage Spectrum

@@ -128,6 +128,7 @@ The invocation taxonomy directly drives the evolution feedback loop:

  When `selftune eval generate` shows implicit queries that don't trigger the skill, the
  description is too narrow. The `evolve` command will:
+
  1. Extract the missed implicit patterns
  2. Propose description changes that cover them
  3. Validate that existing triggers still work
@@ -146,6 +147,7 @@ Evolution should tighten the scope or add "Don't Use When" clauses.
  ### The Evolution Priority

  Fix in this order:
+
  1. **Missed explicit** -- broken, fix immediately
  2. **Missed implicit** -- undertriggering, evolve next
  3. **Missed contextual** -- under-evolved, evolve when implicit is clean
@@ -168,11 +170,11 @@ Each entry in a generated eval set looks like:
  }
  ```

- | Field | Description |
- |-------|-------------|
- | `id` | Sequential identifier |
- | `query` | The user's original query text |
- | `expected` | `true` = should trigger, `false` = should not |
+ | Field | Description |
+ | ----------------- | -------------------------------------------------------- |
+ | `id` | Sequential identifier |
+ | `query` | The user's original query text |
+ | `expected` | `true` = should trigger, `false` = should not |
  | `invocation_type` | One of: `explicit`, `implicit`, `contextual`, `negative` |
- | `skill_name` | The skill this eval targets |
- | `source_session` | Session ID the query came from (if positive) |
+ | `skill_name` | The skill this eval targets |
+ | `source_session` | Session ID the query came from (if positive) |
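The eval-entry fields and "The Evolution Priority" ordering from the hunks above can be combined into a small triage helper. This TypeScript sketch is illustrative only. The `EvalEntry` property names come from the field table, but the numeric `id`, the `triggered` map, the `missesByPriority` function, and the placement of `negative` last in the ordering are assumptions, not part of selftune.

```typescript
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

// Property names follow the eval-set field table; `id` is assumed numeric.
interface EvalEntry {
  id: number;
  query: string;
  expected: boolean; // true = should trigger, false = should not
  invocation_type: InvocationType;
  skill_name: string;
  source_session?: string; // present only for positive entries
}

// explicit -> implicit -> contextual follows "The Evolution Priority";
// putting `negative` last is an assumption made for this sketch.
const PRIORITY: InvocationType[] = ["explicit", "implicit", "contextual", "negative"];

// `triggered` is a hypothetical map from eval id to whether the skill fired.
// Returns the entries whose observed behavior differs from `expected`,
// ordered so the most urgent category comes first.
function missesByPriority(
  entries: EvalEntry[],
  triggered: Map<number, boolean>,
): EvalEntry[] {
  return entries
    .filter((e) => (triggered.get(e.id) ?? false) !== e.expected)
    .sort(
      (a, b) =>
        PRIORITY.indexOf(a.invocation_type) - PRIORITY.indexOf(b.invocation_type),
    );
}
```

An entry with no recorded outcome is treated as "did not trigger", so a missing positive counts as a miss and a missing negative counts as a correct rejection.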
@@ -4,6 +4,12 @@ selftune writes raw legacy logs plus a canonical event log. This reference
  describes each format in detail for the skill to use when parsing sessions,
  audit trails, and cloud-ingest exports.

+ > **Note:** JSONL files are now backup/recovery only. SQLite (`~/.selftune/selftune.db`)
+ > is the sole operational store for all runtime reads. JSONL writes are retained for
+ > append-only durability, but all dashboard queries, hook reads, grading, monitoring,
+ > and upload staging read from SQLite. JSONL reads only occur when custom log paths
+ > are provided (e.g., `--telemetry-log`, `--skill-log`) for test isolation.
+
  ---

  ## ~/.claude/session_telemetry_log.jsonl
@@ -37,6 +43,7 @@ One JSON record per line. Each record is one completed agent session.
  ```

  **source values:**
+
  - `claude_code` — written by session-stop.ts (Stop hook)
  - `codex` — written by ingestors/codex-wrapper.ts
  - `codex_rollout` — written by ingestors/codex-rollout.ts
@@ -72,6 +79,7 @@ One record per skill trigger event. Populated by skill-eval.ts (PostToolUse hook
  ```

  Optional provenance fields:
+
  - `skill_scope`: `project | global | admin | system | unknown`
  - `skill_project_root`: resolved repo/worktree root when the skill came from a project-local registry
  - `skill_registry_dir`: the registry directory where the resolved `SKILL.md` came from
@@ -82,6 +90,7 @@ record shape, but is rebuilt from source-truth transcripts/rollouts rather than
  hooks alone.

  Notes:
+
  - `launcher_base_dir` means selftune recovered Claude's `Base directory for this skill:` launcher metadata. If that recovered `SKILL.md` path points into a stable registry like `~/.agents/skills`, `~/.claude/skills`, `/etc/codex/skills`, or a real project checkout, selftune now reclassifies the invocation into the matching `skill_scope`. Ephemeral temp launcher directories still remain `unknown`.
  - `fallback` means selftune confirmed a skill invocation but could not yet resolve a concrete `SKILL.md` path.

@@ -237,6 +246,7 @@ Consumed signal example:
  ```

  **signal_type values:**
+
  - `correction` — User pointing out a missed skill ("why didn't you use X?", "you should have used X", "next time use X")
  - `explicit_request` — User asking to use a skill ("please use the X skill", "use the commit skill")
  - `manual_invocation` — Direct `/skill` invocation detected
@@ -259,12 +269,13 @@ One record per evolution action. Written by the evolution and rollback modules.
  "total": 50,
  "passed": 35,
  "failed": 15,
- "pass_rate": 0.70
+ "pass_rate": 0.7
  }
  }
  ```

  **action values:**
+
  - `created` — New evolution proposal generated. `details` starts with `original_description:` prefix preserving the pre-evolution SKILL.md content.
  - `validated` — Proposal tested against eval set. `eval_snapshot` contains before/after pass rates.
  - `deployed` — Updated SKILL.md written to disk. `eval_snapshot` contains final pass rates.
@@ -283,6 +294,7 @@ One record per evolution action. Written by the evolution and rollback modules.
  One JSON object per line. Two observed variants:

  **Variant A (nested, current):**
+
  ```json
  {"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}}
  {"type": "assistant", "message": {"role": "assistant", "content": [
@@ -292,6 +304,7 @@ One JSON object per line. Two observed variants:
  ```

  **Variant B (flat, older):**
+
  ```json
  {"role": "user", "content": "..."}
  {"role": "assistant", "content": [{"type": "tool_use", "name": "Bash", "input": {"command": "..."}}]}
@@ -303,7 +316,7 @@ Skill reads appear as `Read` tool calls where `input.file_path` ends in `SKILL.m

  ---

- ## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl)
+ ## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-\*.jsonl)

  ```json
  {"type": "thread.started", "thread_id": "..."}
@@ -326,13 +339,14 @@ Content is a JSON string containing an array of blocks. Anthropic format:

  ```json
  [
- {"type": "text", "text": "I'll create the presentation."},
- {"type": "tool_use", "name": "Bash", "input": {"command": "pip install python-pptx"}},
- {"type": "tool_use", "name": "Read", "input": {"file_path": "/skills/pptx/SKILL.md"}}
+ { "type": "text", "text": "I'll create the presentation." },
+ { "type": "tool_use", "name": "Bash", "input": { "command": "pip install python-pptx" } },
+ { "type": "tool_use", "name": "Read", "input": { "file_path": "/skills/pptx/SKILL.md" } }
  ]
  ```

  Tool results appear in subsequent user messages:
+
  ```json
- [{"type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false}]
+ [{ "type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false }]
  ```
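The hunks above state that content is a JSON string holding an array of blocks, and that skill reads appear as `Read` tool calls whose `input.file_path` ends in `SKILL.md`. A minimal TypeScript sketch of that check follows. The `ContentBlock` type and the `findSkillReads` function are hypothetical names for illustration and are not taken from the selftune source.

```typescript
// Loose model of one block in the Anthropic-format content array.
// Only the properties visible in the documented examples are listed.
interface ContentBlock {
  type: string;
  name?: string;
  text?: string;
  input?: { file_path?: string; command?: string };
}

// Parse the JSON string and return every SKILL.md path that was
// opened through a `Read` tool call.
function findSkillReads(contentJson: string): string[] {
  const blocks = JSON.parse(contentJson) as ContentBlock[];
  return blocks
    .filter((b) => b.type === "tool_use" && b.name === "Read")
    .map((b) => b.input?.file_path ?? "")
    .filter((p) => p.endsWith("SKILL.md"));
}
```

Applied to the three-block example in the hunk above, this returns the single path `/skills/pptx/SKILL.md`. The `Bash` call and the text block are ignored.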
@@ -61,5 +61,7 @@ combined.
  ## Optional Repository Extensions

  selftune bundles specialized agent instruction files in `skill/agents/` for
- diagnosis, evolution review, pattern analysis, and setup help. These ship with
- the skill package and are read directly when needed no installation step required.
+ diagnosis, evolution review, pattern analysis, and setup help. These bundled
+ files are the canonical source. On Claude Code, `selftune init` syncs
+ compatibility copies into `~/.claude/agents/` so native subagent calls keep
+ matching the bundled definitions.
@@ -1,156 +0,0 @@
- ---
- name: diagnosis-analyst
- description: Deep-dive analysis of underperforming skills with root cause identification and actionable recommendations.
- ---
-
- # Diagnosis Analyst
-
- ## Role
-
- Investigate why a specific skill is underperforming. Analyze telemetry logs,
- grading results, and session transcripts to identify root causes and recommend
- targeted fixes.
-
- **Activation policy:** This is a subagent-only role, spawned by the main agent.
- If a user asks for diagnosis directly, the main agent should route to this subagent.
-
- ## Connection to Workflows
-
- This agent is spawned by the main agent as a subagent when deeper analysis is
- needed — it is not called directly by the user.
-
- **Connected workflows:**
- - **Doctor** — when `selftune doctor` reveals persistent issues with a specific skill, spawn this agent for root cause analysis
- - **Grade** — when grades are consistently low for a skill, spawn this agent to investigate why
- - **Status** — when `selftune status` shows CRITICAL or WARNING flags on a skill, spawn this agent for a deep dive
-
- The main agent decides when to escalate to this subagent based on severity
- and persistence of the issue. One-off failures are handled inline; recurring
- or unexplained failures warrant spawning this agent.
-
- ## Context
-
- You need access to:
- - `~/.claude/session_telemetry_log.jsonl` — session-level metrics
- - `~/.claude/skill_usage_log.jsonl` — skill trigger events
- - `~/.claude/all_queries_log.jsonl` — all user queries (triggered and missed)
- - `~/.claude/evolution_audit_log.jsonl` — evolution history
- - The target skill's `SKILL.md` file
- - Session transcripts referenced in telemetry entries
-
- ## Workflow
-
- ### Step 1: Identify the target skill
-
- Ask the user which skill to diagnose, or infer from context. Confirm the
- skill name before proceeding.
-
- ### Step 2: Gather current health snapshot
-
- ```bash
- selftune status
- selftune last
- ```
-
- Parse JSON output. Note the skill's current pass rate, session count, and
- any warnings or regression flags.
-
- ### Step 3: Pull telemetry stats
-
- ```bash
- selftune eval generate --skill <name> --stats
- ```
-
- Review aggregate metrics:
- - **Error rate** — high error rate suggests process failures, not trigger issues
- - **Tool call breakdown** — unusual patterns (e.g., excessive Bash retries) indicate thrashing
- - **Average turns** — abnormally high turn count suggests the agent is struggling
-
- ### Step 4: Analyze trigger coverage
-
- ```bash
- selftune eval generate --skill <name> --max 50
- ```
-
- Review the generated eval set. Count entries by invocation type:
- - **Explicit missed** = description is fundamentally broken (critical)
- - **Implicit missed** = description too narrow (common, fixable via evolve)
- - **Contextual missed** = lacks domain vocabulary (fixable via evolve)
- - **False-positive negatives** = overtriggering (description too broad)
-
- Reference `skill/references/invocation-taxonomy.md` for the full taxonomy.
-
- ### Step 5: Review grading evidence
-
- Read the skill's `SKILL.md` and check recent grading results. For each
- failed expectation, look at:
- - **Trigger tier** — did the skill fire at all?
- - **Process tier** — did the agent follow the right steps?
- - **Quality tier** — was the output actually good?
-
- Reference `skill/references/grading-methodology.md` for the 3-tier model.
-
- ### Step 6: Check evolution history
-
- Read `~/.claude/evolution_audit_log.jsonl` for entries matching the skill.
- Look for:
- - Recent evolutions that may have introduced regressions
- - Rollbacks that suggest instability
- - Plateau patterns (repeated evolutions with no improvement)
-
- ### Step 7: Inspect session transcripts
-
- For the worst-performing sessions, read the transcript JSONL files. Look for:
- - SKILL.md not being read (trigger failure)
- - Steps executed out of order (process failure)
- - Repeated errors or thrashing (quality failure)
- - Missing tool calls that should have occurred
-
- ### Step 8: Synthesize diagnosis
-
- Compile findings into a structured report.
-
- ## Commands
-
- | Command | Purpose |
- |---------|---------|
- | `selftune status` | Overall health snapshot |
- | `selftune last` | Most recent session details |
- | `selftune eval generate --skill <name> --stats` | Aggregate telemetry |
- | `selftune eval generate --skill <name> --max 50` | Generate eval set for coverage analysis |
- | `selftune doctor` | Check infrastructure health |
-
- ## Output
-
- Produce a structured diagnosis report:
-
- ```markdown
- ## Diagnosis Report: <skill-name>
-
- ### Summary
- [One-paragraph overview of the problem]
-
- ### Health Metrics
- - Pass rate: X%
- - Sessions analyzed: N
- - Error rate: X%
- - Trigger coverage: explicit X% / implicit X% / contextual X%
-
- ### Root Cause
- [Primary reason for underperformance, categorized as:]
- - TRIGGER: Skill not firing when it should
- - PROCESS: Skill fires but agent follows wrong steps
- - QUALITY: Steps are correct but output is poor
- - INFRASTRUCTURE: Hooks, logs, or config issues
-
- ### Evidence
- [Specific log entries, transcript lines, or metrics supporting the diagnosis]
-
- ### Recommendations
- 1. [Highest priority fix]
- 2. [Secondary fix]
- 3. [Optional improvement]
-
- ### Suggested Commands
- [Exact selftune commands to execute the recommended fixes]
- ```