npm - selftune - Versions diffs - 0.2.8 → 0.2.10 - Mend

selftune 0.2.8 → 0.2.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (140) hide show

package/README.md +35 -35
package/apps/local-dashboard/dist/assets/index-BZVLv70T.js +16 -0
package/apps/local-dashboard/dist/assets/{index-CRtLkBTi.css → index-Bs3Y4ixf.css} +1 -1
package/apps/local-dashboard/dist/assets/{vendor-react-BQH_6WrG.js → vendor-react-BXP54cYo.js} +4 -4
package/apps/local-dashboard/dist/assets/{vendor-table-dK1QMLq9.js → vendor-table-DTF_SXoy.js} +1 -1
package/apps/local-dashboard/dist/assets/{vendor-ui-CO2mrx6e.js → vendor-ui-CWU0d1wd.js} +66 -66
package/apps/local-dashboard/dist/index.html +15 -15
package/bin/selftune.cjs +1 -1
package/cli/selftune/activation-rules.ts +37 -18
package/cli/selftune/agent-guidance.ts +16 -16
package/cli/selftune/alpha-identity.ts +1 -2
package/cli/selftune/alpha-upload/build-payloads.ts +18 -2
package/cli/selftune/alpha-upload/flush.ts +2 -2
package/cli/selftune/alpha-upload/stage-canonical.ts +106 -3
package/cli/selftune/auth/device-code.ts +32 -0
package/cli/selftune/auto-update.ts +12 -0
package/cli/selftune/badge/badge.ts +1 -0
package/cli/selftune/canonical-export.ts +5 -0
package/cli/selftune/claude-agents.ts +154 -0
package/cli/selftune/contribute/bundle.ts +2 -0
package/cli/selftune/contribute/contribute.ts +1 -0
package/cli/selftune/cron/setup.ts +2 -2
package/cli/selftune/dashboard-contract.ts +1 -1
package/cli/selftune/dashboard-server.ts +11 -52
package/cli/selftune/eval/hooks-to-evals.ts +13 -6
package/cli/selftune/eval/import-skillsbench.ts +1 -0
package/cli/selftune/eval/synthetic-evals.ts +2 -3
package/cli/selftune/eval/unit-test.ts +1 -0
package/cli/selftune/evolution/deploy-proposal.ts +1 -0
package/cli/selftune/evolution/evolve-body.ts +93 -6
package/cli/selftune/evolution/evolve.ts +0 -1
package/cli/selftune/evolution/propose-body.ts +3 -2
package/cli/selftune/evolution/propose-routing.ts +3 -2
package/cli/selftune/evolution/refine-body.ts +3 -2
package/cli/selftune/export.ts +1 -0
package/cli/selftune/grading/auto-grade.ts +1 -0
package/cli/selftune/grading/grade-session.ts +9 -0
package/cli/selftune/hooks/auto-activate.ts +6 -0
package/cli/selftune/hooks/evolution-guard.ts +12 -15
package/cli/selftune/hooks/prompt-log.ts +1 -0
package/cli/selftune/hooks/session-stop.ts +34 -40
package/cli/selftune/hooks/skill-change-guard.ts +1 -0
package/cli/selftune/hooks/skill-eval.ts +1 -1
package/cli/selftune/index.ts +23 -14
package/cli/selftune/ingestors/claude-replay.ts +1 -0
package/cli/selftune/ingestors/codex-rollout.ts +1 -0
package/cli/selftune/ingestors/codex-wrapper.ts +1 -0
package/cli/selftune/ingestors/openclaw-ingest.ts +1 -0
package/cli/selftune/ingestors/opencode-ingest.ts +1 -0
package/cli/selftune/init.ts +197 -96
package/cli/selftune/localdb/db.ts +1 -0
package/cli/selftune/localdb/direct-write.ts +93 -12
package/cli/selftune/localdb/materialize.ts +2 -0
package/cli/selftune/localdb/queries.ts +210 -0
package/cli/selftune/localdb/schema.ts +72 -1
package/cli/selftune/monitoring/watch.ts +1 -0
package/cli/selftune/normalization.ts +4 -0
package/cli/selftune/observability.ts +14 -7
package/cli/selftune/orchestrate.ts +15 -37
package/cli/selftune/repair/skill-usage.ts +7 -3
package/cli/selftune/routes/orchestrate-runs.ts +1 -0
package/cli/selftune/routes/overview.ts +1 -0
package/cli/selftune/routes/skill-report.ts +1 -0
package/cli/selftune/sync.ts +31 -1
package/cli/selftune/types.ts +2 -2
package/cli/selftune/uninstall.ts +412 -0
package/cli/selftune/utils/canonical-log.ts +2 -0
package/cli/selftune/utils/jsonl.ts +1 -0
package/cli/selftune/utils/llm-call.ts +131 -3
package/cli/selftune/utils/skill-log.ts +1 -0
package/cli/selftune/utils/transcript.ts +1 -0
package/cli/selftune/utils/trigger-check.ts +1 -1
package/cli/selftune/workflows/skill-md-writer.ts +5 -5
package/cli/selftune/workflows/workflows.ts +1 -0
package/package.json +38 -33
package/packages/telemetry-contract/fixtures/golden.test.ts +1 -0
package/packages/telemetry-contract/package.json +3 -3
package/packages/telemetry-contract/src/index.ts +0 -1
package/packages/telemetry-contract/src/schemas.ts +6 -24
package/packages/telemetry-contract/tests/compatibility.test.ts +1 -0
package/packages/ui/README.md +35 -34
package/packages/ui/package.json +3 -3
package/packages/ui/src/components/ActivityTimeline.tsx +49 -42
package/packages/ui/src/components/EvidenceViewer.tsx +306 -182
package/packages/ui/src/components/EvolutionTimeline.tsx +83 -72
package/packages/ui/src/components/InfoTip.tsx +4 -3
package/packages/ui/src/components/OrchestrateRunsPanel.tsx +60 -53
package/packages/ui/src/components/section-cards.tsx +19 -24
package/packages/ui/src/components/skill-health-grid.tsx +213 -193
package/packages/ui/src/lib/constants.tsx +1 -0
package/packages/ui/src/primitives/badge.tsx +12 -15
package/packages/ui/src/primitives/button.tsx +7 -7
package/packages/ui/src/primitives/card.tsx +15 -26
package/packages/ui/src/primitives/checkbox.tsx +7 -8
package/packages/ui/src/primitives/collapsible.tsx +5 -5
package/packages/ui/src/primitives/dropdown-menu.tsx +45 -55
package/packages/ui/src/primitives/label.tsx +6 -6
package/packages/ui/src/primitives/select.tsx +28 -37
package/packages/ui/src/primitives/table.tsx +17 -44
package/packages/ui/src/primitives/tabs.tsx +14 -21
package/packages/ui/src/primitives/tooltip.tsx +10 -22
package/skill/SKILL.md +72 -59
package/skill/Workflows/AlphaUpload.md +4 -4
package/skill/Workflows/AutoActivation.md +11 -6
package/skill/Workflows/Badge.md +22 -16
package/skill/Workflows/Baseline.md +34 -36
package/skill/Workflows/Composability.md +16 -11
package/skill/Workflows/Contribute.md +26 -21
package/skill/Workflows/Cron.md +23 -22
package/skill/Workflows/Dashboard.md +40 -40
package/skill/Workflows/Doctor.md +40 -34
package/skill/Workflows/Evals.md +48 -47
package/skill/Workflows/EvolutionMemory.md +31 -21
package/skill/Workflows/Evolve.md +84 -82
package/skill/Workflows/EvolveBody.md +58 -47
package/skill/Workflows/Grade.md +16 -13
package/skill/Workflows/ImportSkillsBench.md +9 -6
package/skill/Workflows/Ingest.md +36 -21
package/skill/Workflows/Initialize.md +138 -97
package/skill/Workflows/Orchestrate.md +22 -16
package/skill/Workflows/Replay.md +12 -7
package/skill/Workflows/Rollback.md +13 -6
package/skill/Workflows/Schedule.md +6 -6
package/skill/Workflows/Sync.md +18 -11
package/skill/Workflows/UnitTest.md +28 -17
package/skill/Workflows/Watch.md +28 -21
package/skill/agents/diagnosis-analyst.md +11 -0
package/skill/agents/evolution-reviewer.md +15 -1
package/skill/agents/integration-guide.md +10 -0
package/skill/agents/pattern-analyst.md +12 -1
package/skill/references/grading-methodology.md +23 -24
package/skill/references/interactive-config.md +7 -7
package/skill/references/invocation-taxonomy.md +22 -20
package/skill/references/logs.md +20 -6
package/skill/references/setup-patterns.md +4 -2
package/.claude/agents/diagnosis-analyst.md +0 -156
package/.claude/agents/evolution-reviewer.md +0 -180
package/.claude/agents/integration-guide.md +0 -212
package/.claude/agents/pattern-analyst.md +0 -160
package/apps/local-dashboard/dist/assets/index-Bk9vSHHd.js +0 -15

package/skill/agents/pattern-analyst.md CHANGED Viewed

@@ -66,6 +66,7 @@ Then read the actual `SKILL.md` files for the skills in scope.
 ### 2. Extract each skill's ownership contract
 For each skill, capture:
 - frontmatter description
 - workflow-routing triggers
 - explicit exclusions or negative examples
@@ -74,6 +75,7 @@ For each skill, capture:
 ### 3. Detect conflicts and gaps
 Compare trigger keywords and description phrases across all skills. Flag:
 - direct conflicts
 - semantic overlaps
 - negative-example gaps
@@ -83,6 +85,7 @@ Compare trigger keywords and description phrases across all skills. Flag:
 ### 4. Analyze real query behavior
 Read the logs and look for:
 - queries that triggered multiple skills
 - queries that triggered no skills despite matching one or more descriptions
 - queries that appear to have been routed to the wrong skill
@@ -103,6 +106,7 @@ shifted ownership or introduced churn.
 ### 6. Recommend ownership changes
 For each important conflict, state:
 - which skill should own the query family
 - which skill should back off
 - whether the fix is a description change, routing-table change, negative
@@ -111,6 +115,7 @@ For each important conflict, state:
 ## Stop Conditions
 Stop and return to the parent if:
 - the skills in scope are not identifiable
 - there is not enough log data to say anything useful
 - the question is really about one underperforming skill rather than
@@ -124,26 +129,32 @@ Return a compact report with these sections:
 ## Cross-Skill Pattern Analysis
 ### Summary
 [2-4 sentence overview]
 ### Findings
 - [Finding 1]
 - [Finding 2]
 - [Finding 3]
 ### Conflict Matrix
 | Skill A | Skill B | Problem | Evidence | Recommended Owner |
-|---------|---------|---------|----------|-------------------|
+| ------- | ------- | ------- | -------- | ----------------- |
 | ...     | ...     | ...     | ...      | ...               |
 ### Coverage Gaps
 - [query family or sample]
 ### Recommended Changes
 1. [Highest-priority change]
 2. [Second change]
 3. [Optional follow-up]
 ### Confidence
 [high / medium / low]
 ```

package/skill/references/grading-methodology.md CHANGED Viewed

@@ -9,11 +9,11 @@ referenced by evolution workflows to understand quality signals.
 Every session is graded across three tiers, each answering a different question:
-| Tier | Question | Example expectation |
-|------|----------|---------------------|
-| **Trigger** | Did the skill fire at all? | `skills_triggered` contains the skill name |
-| **Process** | Did the agent follow the right steps? | SKILL.md was read before main work started |
-| **Quality** | Was the output actually good? | Output file has correct content and structure |
+| Tier        | Question                              | Example expectation                           |
+| ----------- | ------------------------------------- | --------------------------------------------- |
+| **Trigger** | Did the skill fire at all?            | `skills_triggered` contains the skill name    |
+| **Process** | Did the agent follow the right steps? | SKILL.md was read before main work started    |
+| **Quality** | Was the output actually good?         | Output file has correct content and structure |
 A session can pass Trigger but fail Process (skill fired, but steps were wrong),
 or pass Process but fail Quality (steps were right, but output was bad).
@@ -71,13 +71,14 @@ Always include at least one Process and one Quality expectation.
 After grading explicit expectations, extract 2-4 implicit claims from the transcript.
 Each claim falls into one of three types:
-| Type | What it captures | Example |
-|------|------------------|---------|
-| **Factual** | A verifiable statement the agent made | "The agent said 12 slides were created" |
-| **Process** | An observed behavior pattern | "The agent read SKILL.md before making any file changes" |
-| **Quality** | An output characteristic | "The output file was named correctly" |
+| Type        | What it captures                      | Example                                                  |
+| ----------- | ------------------------------------- | -------------------------------------------------------- |
+| **Factual** | A verifiable statement the agent made | "The agent said 12 slides were created"                  |
+| **Process** | An observed behavior pattern          | "The agent read SKILL.md before making any file changes" |
+| **Quality** | An output characteristic              | "The output file was named correctly"                    |
 For each claim:
 1. State the claim clearly
 2. Classify its type
 3. Mark `verified: true` or `verified: false`
@@ -153,9 +154,7 @@ Only raise things worth improving. The goal is actionable feedback, not exhausti
     }
   ],
   "eval_feedback": {
-    "suggestions": [
-      { "reason": "No expectation checks slide content" }
-    ],
+    "suggestions": [{ "reason": "No expectation checks slide content" }],
     "overall": "Process coverage good; add output quality assertions."
   }
 }
@@ -163,14 +162,14 @@ Only raise things worth improving. The goal is actionable feedback, not exhausti
 ### Field descriptions
-| Field | Type | Description |
-|-------|------|-------------|
-| `session_id` | string | From session telemetry |
-| `skill_name` | string | The skill being graded |
-| `transcript_path` | string | Path to the session transcript JSONL |
-| `graded_at` | string | ISO 8601 timestamp of grading |
-| `expectations[]` | array | Each expectation with verdict and evidence |
-| `summary` | object | Aggregate pass/fail counts and rate |
-| `execution_metrics` | object | Raw metrics from session telemetry |
-| `claims[]` | array | Implicit claims extracted from transcript |
-| `eval_feedback` | object | Suggestions for improving the eval set |
+| Field               | Type   | Description                                |
+| ------------------- | ------ | ------------------------------------------ |
+| `session_id`        | string | From session telemetry                     |
+| `skill_name`        | string | The skill being graded                     |
+| `transcript_path`   | string | Path to the session transcript JSONL       |
+| `graded_at`         | string | ISO 8601 timestamp of grading              |
+| `expectations[]`    | array  | Each expectation with verdict and evidence |
+| `summary`           | object | Aggregate pass/fail counts and rate        |
+| `execution_metrics` | object | Raw metrics from session telemetry         |
+| `claims[]`          | array  | Implicit claims extracted from transcript  |
+| `eval_feedback`     | object | Suggestions for improving the eval set     |

package/skill/references/interactive-config.md CHANGED Viewed

@@ -9,21 +9,21 @@ execution mode, model selection, and key parameters.
 Each mutating workflow has a **Pre-Flight Configuration** step. Follow this pattern:
 1. Present a brief summary of what the command will do
-2. Use the `AskUserQuestion` tool to present structured options (max 4 questions per call — split into multiple calls if needed). Mark recommended defaults in option text with `(recommended)`.
+2. Use the `AskUserQuestion` tool to present structured options, one question per tool call. Mark recommended defaults in option text with `(recommended)`.
 3. Parse the user's selections from the tool response
 4. Show a confirmation summary of selected options before executing
-**IMPORTANT:** Always use `AskUserQuestion` for pre-flight — never present options as inline numbered text. The tool provides a structured UI that is easier for users to interact with. If `AskUserQuestion` is not available, fall back to inline numbered options.
+**IMPORTANT:** Prefer `AskUserQuestion` for pre-flight, but never batch multiple questions into one payload. Ask one question at a time. If `AskUserQuestion` is not available or Claude Code does not invoke it, fall back to inline numbered options. Do not invent tool responses.
 ## Model Tier Reference
 When presenting model choices, use this table:
-| Tier | Model | Speed | Cost | Quality | Best for |
-|------|-------|-------|------|---------|----------|
-| Fast | `haiku` | ~2s/call | $ | Good | Iteration loops, bulk validation |
-| Balanced | `sonnet` | ~5s/call | $$ | Great | Single-pass proposals, gate checks |
-| Best | `opus` | ~10s/call | $$$ | Excellent | High-stakes final validation |
+| Tier     | Model    | Speed     | Cost | Quality   | Best for                           |
+| -------- | -------- | --------- | ---- | --------- | ---------------------------------- |
+| Fast     | `haiku`  | ~2s/call  | $    | Good      | Iteration loops, bulk validation   |
+| Balanced | `sonnet` | ~5s/call  | $$   | Great     | Single-pass proposals, gate checks |
+| Best     | `opus`   | ~10s/call | $$$  | Excellent | High-stakes final validation       |
 ## Quick Path

package/skill/references/invocation-taxonomy.md CHANGED Viewed

@@ -67,10 +67,10 @@ Separate from eval types, selftune classifies each **live** skill invocation by
 the user triggered it. This is shown as the `invocation_mode` field in canonical
 telemetry and the "Mode" column in the dashboard.
-| Mode | Definition | Example |
-|------|-----------|---------|
-| `explicit` | User typed a slash command (`/skillname`) | `/selftune grade` |
-| `implicit` | User mentioned the skill by name in their prompt | `evolve the selftune skill` |
+| Mode       | Definition                                               | Example                                         |
+| ---------- | -------------------------------------------------------- | ----------------------------------------------- |
+| `explicit` | User typed a slash command (`/skillname`)                | `/selftune grade`                               |
+| `implicit` | User mentioned the skill by name in their prompt         | `evolve the selftune skill`                     |
 | `inferred` | Agent chose the skill autonomously — user never named it | `show me the dashboard` → agent invokes Browser |
 ### How classification works
@@ -84,10 +84,10 @@ and mapped to canonical modes in `cli/selftune/normalization.ts` (`deriveInvocat
 ### Eval types vs runtime modes
-| Concept | Purpose | Values |
-|---------|---------|--------|
-| **Eval invocation type** | Classifying test cases | explicit, implicit, contextual, negative |
-| **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred |
+| Concept                     | Purpose                | Values                                   |
+| --------------------------- | ---------------------- | ---------------------------------------- |
+| **Eval invocation type**    | Classifying test cases | explicit, implicit, contextual, negative |
+| **Runtime invocation mode** | Classifying live usage | explicit, implicit, inferred             |
 `contextual` (eval) and `inferred` (runtime) are related but different: contextual means
 the user's intent is buried in domain context, while inferred means the agent chose the
@@ -99,12 +99,12 @@ skill without any user mention at all.
 A healthy skill catches all three positive invocation types:
-| Type | Healthy | Unhealthy |
-|------|---------|-----------|
-| Explicit | Catches all | Misses some (broken) |
-| Implicit | Catches most | Only catches explicit (too rigid) |
+| Type       | Healthy      | Unhealthy                                               |
+| ---------- | ------------ | ------------------------------------------------------- |
+| Explicit   | Catches all  | Misses some (broken)                                    |
+| Implicit   | Catches most | Only catches explicit (too rigid)                       |
 | Contextual | Catches many | Only catches explicit + some implicit (needs evolution) |
-| Negative | Rejects all | False positives on keyword overlap |
+| Negative   | Rejects all  | False positives on keyword overlap                      |
 ### The Coverage Spectrum
@@ -128,6 +128,7 @@ The invocation taxonomy directly drives the evolution feedback loop:
 When `selftune eval generate` shows implicit queries that don't trigger the skill, the
 description is too narrow. The `evolve` command will:
 1. Extract the missed implicit patterns
 2. Propose description changes that cover them
 3. Validate that existing triggers still work
@@ -146,6 +147,7 @@ Evolution should tighten the scope or add "Don't Use When" clauses.
 ### The Evolution Priority
 Fix in this order:
 1. **Missed explicit** -- broken, fix immediately
 2. **Missed implicit** -- undertriggering, evolve next
 3. **Missed contextual** -- under-evolved, evolve when implicit is clean
@@ -168,11 +170,11 @@ Each entry in a generated eval set looks like:
 }
 ```
-| Field | Description |
-|-------|-------------|
-| `id` | Sequential identifier |
-| `query` | The user's original query text |
-| `expected` | `true` = should trigger, `false` = should not |
+| Field             | Description                                              |
+| ----------------- | -------------------------------------------------------- |
+| `id`              | Sequential identifier                                    |
+| `query`           | The user's original query text                           |
+| `expected`        | `true` = should trigger, `false` = should not            |
 | `invocation_type` | One of: `explicit`, `implicit`, `contextual`, `negative` |
-| `skill_name` | The skill this eval targets |
-| `source_session` | Session ID the query came from (if positive) |
+| `skill_name`      | The skill this eval targets                              |
+| `source_session`  | Session ID the query came from (if positive)             |

package/skill/references/logs.md CHANGED Viewed

@@ -4,6 +4,12 @@ selftune writes raw legacy logs plus a canonical event log. This reference
 describes each format in detail for the skill to use when parsing sessions,
 audit trails, and cloud-ingest exports.
+> **Note:** JSONL files are now backup/recovery only. SQLite (`~/.selftune/selftune.db`)
+> is the sole operational store for all runtime reads. JSONL writes are retained for
+> append-only durability, but all dashboard queries, hook reads, grading, monitoring,
+> and upload staging read from SQLite. JSONL reads only occur when custom log paths
+> are provided (e.g., `--telemetry-log`, `--skill-log`) for test isolation.
 ---
 ## ~/.claude/session_telemetry_log.jsonl
@@ -37,6 +43,7 @@ One JSON record per line. Each record is one completed agent session.
 ```
 **source values:**
 - `claude_code` — written by session-stop.ts (Stop hook)
 - `codex` — written by ingestors/codex-wrapper.ts
 - `codex_rollout` — written by ingestors/codex-rollout.ts
@@ -72,6 +79,7 @@ One record per skill trigger event. Populated by skill-eval.ts (PostToolUse hook
 ```
 Optional provenance fields:
 - `skill_scope`: `project | global | admin | system | unknown`
 - `skill_project_root`: resolved repo/worktree root when the skill came from a project-local registry
 - `skill_registry_dir`: the registry directory where the resolved `SKILL.md` came from
@@ -82,6 +90,7 @@ record shape, but is rebuilt from source-truth transcripts/rollouts rather than
 hooks alone.
 Notes:
 - `launcher_base_dir` means selftune recovered Claude's `Base directory for this skill:` launcher metadata. If that recovered `SKILL.md` path points into a stable registry like `~/.agents/skills`, `~/.claude/skills`, `/etc/codex/skills`, or a real project checkout, selftune now reclassifies the invocation into the matching `skill_scope`. Ephemeral temp launcher directories still remain `unknown`.
 - `fallback` means selftune confirmed a skill invocation but could not yet resolve a concrete `SKILL.md` path.
@@ -237,6 +246,7 @@ Consumed signal example:
 ```
 **signal_type values:**
 - `correction` — User pointing out a missed skill ("why didn't you use X?", "you should have used X", "next time use X")
 - `explicit_request` — User asking to use a skill ("please use the X skill", "use the commit skill")
 - `manual_invocation` — Direct `/skill` invocation detected
@@ -259,12 +269,13 @@ One record per evolution action. Written by the evolution and rollback modules.
     "total": 50,
     "passed": 35,
     "failed": 15,
-    "pass_rate": 0.70
+    "pass_rate": 0.7
   }
 }
 ```
 **action values:**
 - `created` — New evolution proposal generated. `details` starts with `original_description:` prefix preserving the pre-evolution SKILL.md content.
 - `validated` — Proposal tested against eval set. `eval_snapshot` contains before/after pass rates.
 - `deployed` — Updated SKILL.md written to disk. `eval_snapshot` contains final pass rates.
@@ -283,6 +294,7 @@ One record per evolution action. Written by the evolution and rollback modules.
 One JSON object per line. Two observed variants:
 **Variant A (nested, current):**
 ```json
 {"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}}
 {"type": "assistant", "message": {"role": "assistant", "content": [
@@ -292,6 +304,7 @@ One JSON object per line. Two observed variants:
 ```
 **Variant B (flat, older):**
 ```json
 {"role": "user", "content": "..."}
 {"role": "assistant", "content": [{"type": "tool_use", "name": "Bash", "input": {"command": "..."}}]}
@@ -303,7 +316,7 @@ Skill reads appear as `Read` tool calls where `input.file_path` ends in `SKILL.m
 ---
-## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl)
+## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-\*.jsonl)
 ```json
 {"type": "thread.started", "thread_id": "..."}
@@ -326,13 +339,14 @@ Content is a JSON string containing an array of blocks. Anthropic format:
 ```json
 [
-  {"type": "text", "text": "I'll create the presentation."},
-  {"type": "tool_use", "name": "Bash", "input": {"command": "pip install python-pptx"}},
-  {"type": "tool_use", "name": "Read", "input": {"file_path": "/skills/pptx/SKILL.md"}}
+  { "type": "text", "text": "I'll create the presentation." },
+  { "type": "tool_use", "name": "Bash", "input": { "command": "pip install python-pptx" } },
+  { "type": "tool_use", "name": "Read", "input": { "file_path": "/skills/pptx/SKILL.md" } }
 ]
 ```
 Tool results appear in subsequent user messages:
 ```json
-[{"type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false}]
+[{ "type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false }]
 ```

package/skill/references/setup-patterns.md CHANGED Viewed

@@ -61,5 +61,7 @@ combined.
 ## Optional Repository Extensions
 selftune bundles specialized agent instruction files in `skill/agents/` for
-diagnosis, evolution review, pattern analysis, and setup help. These ship with
-the skill package and are read directly when needed — no installation step required.
+diagnosis, evolution review, pattern analysis, and setup help. These bundled
+files are the canonical source. On Claude Code, `selftune init` syncs
+compatibility copies into `~/.claude/agents/` so native subagent calls keep
+matching the bundled definitions.

package/.claude/agents/diagnosis-analyst.md DELETED Viewed

@@ -1,156 +0,0 @@
----
-name: diagnosis-analyst
-description: Deep-dive analysis of underperforming skills with root cause identification and actionable recommendations.
----
-# Diagnosis Analyst
-## Role
-Investigate why a specific skill is underperforming. Analyze telemetry logs,
-grading results, and session transcripts to identify root causes and recommend
-targeted fixes.
-**Activation policy:** This is a subagent-only role, spawned by the main agent.
-If a user asks for diagnosis directly, the main agent should route to this subagent.
-## Connection to Workflows
-This agent is spawned by the main agent as a subagent when deeper analysis is
-needed — it is not called directly by the user.
-**Connected workflows:**
-- **Doctor** — when `selftune doctor` reveals persistent issues with a specific skill, spawn this agent for root cause analysis
-- **Grade** — when grades are consistently low for a skill, spawn this agent to investigate why
-- **Status** — when `selftune status` shows CRITICAL or WARNING flags on a skill, spawn this agent for a deep dive
-The main agent decides when to escalate to this subagent based on severity
-and persistence of the issue. One-off failures are handled inline; recurring
-or unexplained failures warrant spawning this agent.
-## Context
-You need access to:
-- `~/.claude/session_telemetry_log.jsonl` — session-level metrics
-- `~/.claude/skill_usage_log.jsonl` — skill trigger events
-- `~/.claude/all_queries_log.jsonl` — all user queries (triggered and missed)
-- `~/.claude/evolution_audit_log.jsonl` — evolution history
-- The target skill's `SKILL.md` file
-- Session transcripts referenced in telemetry entries
-## Workflow
-### Step 1: Identify the target skill
-Ask the user which skill to diagnose, or infer from context. Confirm the
-skill name before proceeding.
-### Step 2: Gather current health snapshot
-```bash
-selftune status
-selftune last
-```
-Parse JSON output. Note the skill's current pass rate, session count, and
-any warnings or regression flags.
-### Step 3: Pull telemetry stats
-```bash
-selftune eval generate --skill <name> --stats
-```
-Review aggregate metrics:
-- **Error rate** — high error rate suggests process failures, not trigger issues
-- **Tool call breakdown** — unusual patterns (e.g., excessive Bash retries) indicate thrashing
-- **Average turns** — abnormally high turn count suggests the agent is struggling
-### Step 4: Analyze trigger coverage
-```bash
-selftune eval generate --skill <name> --max 50
-```
-Review the generated eval set. Count entries by invocation type:
-- **Explicit missed** = description is fundamentally broken (critical)
-- **Implicit missed** = description too narrow (common, fixable via evolve)
-- **Contextual missed** = lacks domain vocabulary (fixable via evolve)
-- **False-positive negatives** = overtriggering (description too broad)
-Reference `skill/references/invocation-taxonomy.md` for the full taxonomy.
-### Step 5: Review grading evidence
-Read the skill's `SKILL.md` and check recent grading results. For each
-failed expectation, look at:
-- **Trigger tier** — did the skill fire at all?
-- **Process tier** — did the agent follow the right steps?
-- **Quality tier** — was the output actually good?
-Reference `skill/references/grading-methodology.md` for the 3-tier model.
-### Step 6: Check evolution history
-Read `~/.claude/evolution_audit_log.jsonl` for entries matching the skill.
-Look for:
-- Recent evolutions that may have introduced regressions
-- Rollbacks that suggest instability
-- Plateau patterns (repeated evolutions with no improvement)
-### Step 7: Inspect session transcripts
-For the worst-performing sessions, read the transcript JSONL files. Look for:
-- SKILL.md not being read (trigger failure)
-- Steps executed out of order (process failure)
-- Repeated errors or thrashing (quality failure)
-- Missing tool calls that should have occurred
-### Step 8: Synthesize diagnosis
-Compile findings into a structured report.
-## Commands
-| Command | Purpose |
-|---------|---------|
-| `selftune status` | Overall health snapshot |
-| `selftune last` | Most recent session details |
-| `selftune eval generate --skill <name> --stats` | Aggregate telemetry |
-| `selftune eval generate --skill <name> --max 50` | Generate eval set for coverage analysis |
-| `selftune doctor` | Check infrastructure health |
-## Output
-Produce a structured diagnosis report:
-```markdown
-## Diagnosis Report: <skill-name>
-### Summary
-[One-paragraph overview of the problem]
-### Health Metrics
-- Pass rate: X%
-- Sessions analyzed: N
-- Error rate: X%
-- Trigger coverage: explicit X% / implicit X% / contextual X%
-### Root Cause
-[Primary reason for underperformance, categorized as:]
-- TRIGGER: Skill not firing when it should
-- PROCESS: Skill fires but agent follows wrong steps
-- QUALITY: Steps are correct but output is poor
-- INFRASTRUCTURE: Hooks, logs, or config issues
-### Evidence
-[Specific log entries, transcript lines, or metrics supporting the diagnosis]
-### Recommendations
-1. [Highest priority fix]
-2. [Secondary fix]
-3. [Optional improvement]
-### Suggested Commands
-[Exact selftune commands to execute the recommended fixes]
-```