selftune 0.1.4 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (153)
  1. package/.claude/agents/diagnosis-analyst.md +156 -0
  2. package/.claude/agents/evolution-reviewer.md +180 -0
  3. package/.claude/agents/integration-guide.md +212 -0
  4. package/.claude/agents/pattern-analyst.md +160 -0
  5. package/CHANGELOG.md +46 -1
  6. package/README.md +105 -257
  7. package/apps/local-dashboard/dist/assets/geist-cyrillic-wght-normal-CHSlOQsW.woff2 +0 -0
  8. package/apps/local-dashboard/dist/assets/geist-latin-ext-wght-normal-DMtmJ5ZE.woff2 +0 -0
  9. package/apps/local-dashboard/dist/assets/geist-latin-wght-normal-Dm3htQBi.woff2 +0 -0
  10. package/apps/local-dashboard/dist/assets/index-C4EOTFZ2.js +15 -0
  11. package/apps/local-dashboard/dist/assets/index-bl-Webyd.css +1 -0
  12. package/apps/local-dashboard/dist/assets/vendor-react-U7zYD9Rg.js +60 -0
  13. package/apps/local-dashboard/dist/assets/vendor-table-B7VF2Ipl.js +26 -0
  14. package/apps/local-dashboard/dist/assets/vendor-ui-D7_zX_qy.js +346 -0
  15. package/apps/local-dashboard/dist/favicon.png +0 -0
  16. package/apps/local-dashboard/dist/index.html +17 -0
  17. package/apps/local-dashboard/dist/logo.png +0 -0
  18. package/apps/local-dashboard/dist/logo.svg +9 -0
  19. package/assets/BeforeAfter.gif +0 -0
  20. package/assets/FeedbackLoop.gif +0 -0
  21. package/assets/logo.svg +9 -0
  22. package/assets/skill-health-badge.svg +20 -0
  23. package/cli/selftune/activation-rules.ts +171 -0
  24. package/cli/selftune/badge/badge-data.ts +108 -0
  25. package/cli/selftune/badge/badge-svg.ts +212 -0
  26. package/cli/selftune/badge/badge.ts +99 -0
  27. package/cli/selftune/canonical-export.ts +183 -0
  28. package/cli/selftune/constants.ts +103 -1
  29. package/cli/selftune/contribute/bundle.ts +314 -0
  30. package/cli/selftune/contribute/contribute.ts +214 -0
  31. package/cli/selftune/contribute/sanitize.ts +162 -0
  32. package/cli/selftune/cron/setup.ts +266 -0
  33. package/cli/selftune/dashboard-contract.ts +202 -0
  34. package/cli/selftune/dashboard-server.ts +1049 -0
  35. package/cli/selftune/dashboard.ts +43 -156
  36. package/cli/selftune/eval/baseline.ts +248 -0
  37. package/cli/selftune/eval/composability-v2.ts +273 -0
  38. package/cli/selftune/eval/composability.ts +117 -0
  39. package/cli/selftune/eval/generate-unit-tests.ts +143 -0
  40. package/cli/selftune/eval/hooks-to-evals.ts +101 -16
  41. package/cli/selftune/eval/import-skillsbench.ts +221 -0
  42. package/cli/selftune/eval/synthetic-evals.ts +172 -0
  43. package/cli/selftune/eval/unit-test-cli.ts +152 -0
  44. package/cli/selftune/eval/unit-test.ts +196 -0
  45. package/cli/selftune/evolution/deploy-proposal.ts +142 -1
  46. package/cli/selftune/evolution/evidence.ts +26 -0
  47. package/cli/selftune/evolution/evolve-body.ts +586 -0
  48. package/cli/selftune/evolution/evolve.ts +825 -116
  49. package/cli/selftune/evolution/extract-patterns.ts +105 -16
  50. package/cli/selftune/evolution/pareto.ts +314 -0
  51. package/cli/selftune/evolution/propose-body.ts +171 -0
  52. package/cli/selftune/evolution/propose-description.ts +100 -2
  53. package/cli/selftune/evolution/propose-routing.ts +166 -0
  54. package/cli/selftune/evolution/refine-body.ts +141 -0
  55. package/cli/selftune/evolution/rollback.ts +21 -4
  56. package/cli/selftune/evolution/validate-body.ts +254 -0
  57. package/cli/selftune/evolution/validate-proposal.ts +257 -35
  58. package/cli/selftune/evolution/validate-routing.ts +177 -0
  59. package/cli/selftune/grading/auto-grade.ts +200 -0
  60. package/cli/selftune/grading/grade-session.ts +513 -42
  61. package/cli/selftune/grading/pre-gates.ts +104 -0
  62. package/cli/selftune/grading/results.ts +42 -0
  63. package/cli/selftune/hooks/auto-activate.ts +185 -0
  64. package/cli/selftune/hooks/evolution-guard.ts +165 -0
  65. package/cli/selftune/hooks/prompt-log.ts +172 -2
  66. package/cli/selftune/hooks/session-stop.ts +123 -3
  67. package/cli/selftune/hooks/skill-change-guard.ts +112 -0
  68. package/cli/selftune/hooks/skill-eval.ts +119 -3
  69. package/cli/selftune/index.ts +415 -48
  70. package/cli/selftune/ingestors/claude-replay.ts +377 -0
  71. package/cli/selftune/ingestors/codex-rollout.ts +345 -46
  72. package/cli/selftune/ingestors/codex-wrapper.ts +207 -39
  73. package/cli/selftune/ingestors/openclaw-ingest.ts +573 -0
  74. package/cli/selftune/ingestors/opencode-ingest.ts +193 -17
  75. package/cli/selftune/init.ts +376 -16
  76. package/cli/selftune/last.ts +14 -5
  77. package/cli/selftune/localdb/db.ts +63 -0
  78. package/cli/selftune/localdb/materialize.ts +428 -0
  79. package/cli/selftune/localdb/queries.ts +376 -0
  80. package/cli/selftune/localdb/schema.ts +204 -0
  81. package/cli/selftune/memory/writer.ts +447 -0
  82. package/cli/selftune/monitoring/watch.ts +90 -16
  83. package/cli/selftune/normalization.ts +682 -0
  84. package/cli/selftune/observability.ts +19 -44
  85. package/cli/selftune/orchestrate.ts +1073 -0
  86. package/cli/selftune/quickstart.ts +203 -0
  87. package/cli/selftune/repair/skill-usage.ts +576 -0
  88. package/cli/selftune/schedule.ts +561 -0
  89. package/cli/selftune/status.ts +59 -33
  90. package/cli/selftune/sync.ts +627 -0
  91. package/cli/selftune/types.ts +525 -5
  92. package/cli/selftune/utils/canonical-log.ts +45 -0
  93. package/cli/selftune/utils/frontmatter.ts +217 -0
  94. package/cli/selftune/utils/hooks.ts +41 -0
  95. package/cli/selftune/utils/html.ts +27 -0
  96. package/cli/selftune/utils/llm-call.ts +103 -19
  97. package/cli/selftune/utils/math.ts +10 -0
  98. package/cli/selftune/utils/query-filter.ts +139 -0
  99. package/cli/selftune/utils/skill-discovery.ts +340 -0
  100. package/cli/selftune/utils/skill-log.ts +68 -0
  101. package/cli/selftune/utils/skill-usage-confidence.ts +18 -0
  102. package/cli/selftune/utils/transcript.ts +307 -26
  103. package/cli/selftune/utils/trigger-check.ts +89 -0
  104. package/cli/selftune/utils/tui.ts +156 -0
  105. package/cli/selftune/workflows/discover.ts +254 -0
  106. package/cli/selftune/workflows/skill-md-writer.ts +288 -0
  107. package/cli/selftune/workflows/workflows.ts +188 -0
  108. package/package.json +28 -11
  109. package/packages/telemetry-contract/README.md +11 -0
  110. package/packages/telemetry-contract/fixtures/golden.json +87 -0
  111. package/packages/telemetry-contract/fixtures/golden.test.ts +42 -0
  112. package/packages/telemetry-contract/index.ts +1 -0
  113. package/packages/telemetry-contract/package.json +19 -0
  114. package/packages/telemetry-contract/src/index.ts +2 -0
  115. package/packages/telemetry-contract/src/types.ts +163 -0
  116. package/packages/telemetry-contract/src/validators.ts +109 -0
  117. package/skill/SKILL.md +180 -33
  118. package/skill/Workflows/AutoActivation.md +145 -0
  119. package/skill/Workflows/Badge.md +124 -0
  120. package/skill/Workflows/Baseline.md +144 -0
  121. package/skill/Workflows/Composability.md +107 -0
  122. package/skill/Workflows/Contribute.md +94 -0
  123. package/skill/Workflows/Cron.md +132 -0
  124. package/skill/Workflows/Dashboard.md +214 -0
  125. package/skill/Workflows/Doctor.md +63 -14
  126. package/skill/Workflows/Evals.md +110 -18
  127. package/skill/Workflows/EvolutionMemory.md +154 -0
  128. package/skill/Workflows/Evolve.md +181 -21
  129. package/skill/Workflows/EvolveBody.md +159 -0
  130. package/skill/Workflows/Grade.md +36 -31
  131. package/skill/Workflows/ImportSkillsBench.md +117 -0
  132. package/skill/Workflows/Ingest.md +142 -21
  133. package/skill/Workflows/Initialize.md +91 -23
  134. package/skill/Workflows/Orchestrate.md +139 -0
  135. package/skill/Workflows/Replay.md +91 -0
  136. package/skill/Workflows/Rollback.md +23 -4
  137. package/skill/Workflows/Schedule.md +61 -0
  138. package/skill/Workflows/Sync.md +88 -0
  139. package/skill/Workflows/UnitTest.md +150 -0
  140. package/skill/Workflows/Watch.md +33 -1
  141. package/skill/Workflows/Workflows.md +129 -0
  142. package/skill/assets/activation-rules-default.json +26 -0
  143. package/skill/assets/multi-skill-settings.json +63 -0
  144. package/skill/assets/single-skill-settings.json +57 -0
  145. package/skill/references/invocation-taxonomy.md +2 -2
  146. package/skill/references/logs.md +164 -2
  147. package/skill/references/setup-patterns.md +65 -0
  148. package/skill/references/version-history.md +40 -0
  149. package/skill/settings_snippet.json +23 -0
  150. package/templates/activation-rules-default.json +27 -0
  151. package/templates/multi-skill-settings.json +64 -0
  152. package/templates/single-skill-settings.json +58 -0
  153. package/dashboard/index.html +0 -1119
package/CHANGELOG.md CHANGED
@@ -5,6 +5,49 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/).
7
7
 
8
+ ## [Unreleased]
9
+
10
+ ### Added
11
+
12
+ - **Real-time improvement signal detection** — `prompt-log` hook detects user corrections ("why didn't you use X?") and explicit skill requests via pure regex patterns. Signals are logged to `~/.claude/improvement_signals.jsonl` with skill name extraction from installed skills.
13
+ - **Signal-reactive orchestration** — `session-stop` hook checks for pending improvement signals and spawns a focused `selftune orchestrate --max-skills 2` run in the background. Respects a 30-minute lockfile to prevent concurrent runs.
14
+ - **Signal-aware candidate selection** — Orchestrator reads pending signals and boosts priority for mentioned skills (+150 per signal, capped at +450). Signaled skills bypass the minimum evidence gate and the "UNGRADED with 0 missed queries" gate.
15
+ - **Orchestrate lockfile** — `acquireLock()`/`releaseLock()` with PID+timestamp in `~/.claude/.orchestrate.lock`. 30-minute stale threshold prevents deadlocks from crashed runs.
16
+ - **Signal consumption** — After an orchestrate run completes, consumed signals are marked with `consumed: true`, `consumed_at`, and `consumed_by_run` so they don't affect subsequent runs.
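
The signal-aware selection and signal consumption entries above can be illustrated with a minimal sketch. This is not the package's code: the `ImprovementSignal` shape, its `skill` field, and the function names `signalBoost` and `consumeSignals` are assumptions made for illustration. Only the +150/+450 numbers and the `consumed`, `consumed_at`, and `consumed_by_run` field names come from the changelog.

```typescript
// Hypothetical record shape; only the three consumed* field names come from the changelog.
interface ImprovementSignal {
  skill: string; // assumed field name for the mentioned skill
  consumed?: boolean;
  consumed_at?: string;
  consumed_by_run?: string;
}

const BOOST_PER_SIGNAL = 150; // per-signal boost stated in the changelog
const MAX_BOOST = 450; // cap stated in the changelog

// Priority boost for one skill: +150 per pending (unconsumed) signal, capped at +450.
function signalBoost(signals: ImprovementSignal[], skill: string): number {
  const pending = signals.filter((s) => s.skill === skill && !s.consumed).length;
  return Math.min(pending * BOOST_PER_SIGNAL, MAX_BOOST);
}

// Return copies of the signals with every pending one marked as consumed by `runId`.
function consumeSignals(
  signals: ImprovementSignal[],
  runId: string,
  now: Date = new Date(),
): ImprovementSignal[] {
  return signals.map((s) =>
    s.consumed
      ? s
      : { ...s, consumed: true, consumed_at: now.toISOString(), consumed_by_run: runId },
  );
}

const signals: ImprovementSignal[] = [
  { skill: "pptx" },
  { skill: "pptx" },
  { skill: "pptx" },
  { skill: "pptx" },
  { skill: "pdf" },
];

console.log(signalBoost(signals, "pptx")); // 450 (four signals would be 600, so the cap applies)
console.log(signalBoost(signals, "pdf")); // 150
console.log(signalBoost(consumeSignals(signals, "run-1"), "pptx")); // 0 once consumed
```

Because consumed signals no longer count as pending, a later run sees a boost of zero for the same skill, which matches the stated goal that consumed signals "don't affect subsequent runs".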
17
+
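
As a rough illustration of the orchestrate lockfile entry, the sketch below shows one way a PID+timestamp lock with a 30-minute stale threshold could work. It is a sketch under assumptions, not the package's implementation: the `LockRecord` shape, the explicit `lockPath` and `now` parameters, and the choice to treat an unreadable lock as stale are all hypothetical. Only the function names `acquireLock()`/`releaseLock()`, the PID+timestamp contents, and the 30-minute threshold come from the changelog.

```typescript
import { mkdtempSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical on-disk record; the real format is not shown in this diff.
interface LockRecord {
  pid: number;
  timestamp: number; // epoch milliseconds
}

const STALE_MS = 30 * 60 * 1000; // 30-minute stale threshold from the changelog

// Try to take the lock. Returns true on success, false if a fresh lock is held.
function acquireLock(lockPath: string, now: number = Date.now()): boolean {
  const record: LockRecord = { pid: process.pid, timestamp: now };
  try {
    // The "wx" flag fails with EEXIST if the file already exists.
    writeFileSync(lockPath, JSON.stringify(record), { flag: "wx" });
    return true;
  } catch (err) {
    if ((err as { code?: string }).code !== "EEXIST") throw err;
  }
  // A lock file exists: honour it unless it is older than the stale threshold.
  let existing: LockRecord | null = null;
  try {
    existing = JSON.parse(readFileSync(lockPath, "utf8")) as LockRecord;
  } catch {
    existing = null; // assumption: an unreadable or corrupt lock counts as stale
  }
  if (existing !== null && now - existing.timestamp < STALE_MS) return false;
  // Stale lock, presumably from a crashed run: take it over.
  // (Two processes racing on a stale lock could both reach this line; a sketch only.)
  writeFileSync(lockPath, JSON.stringify(record));
  return true;
}

// Remove the lock; a missing file is not an error.
function releaseLock(lockPath: string): void {
  try {
    unlinkSync(lockPath);
  } catch (err) {
    if ((err as { code?: string }).code !== "ENOENT") throw err;
  }
}

// Usage against a throwaway path (the changelog places the real lock at ~/.claude/.orchestrate.lock).
const demoLock = join(mkdtempSync(join(tmpdir(), "selftune-lock-")), ".orchestrate.lock");
console.log(acquireLock(demoLock)); // true: first caller takes the lock
console.log(acquireLock(demoLock)); // false: the lock is still fresh
releaseLock(demoLock);
```

Storing the timestamp in the file is what lets a later run distinguish a live lock from one left behind by a crash, without needing to probe the recorded PID.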
18
+ ## [0.2.0] — 2026-03-08
19
+
20
+ ### Added
21
+
22
+ - **Full skill body evolution** — Teacher-student model for evolving routing tables and complete skill bodies with 3-gate validation (structural, trigger, quality)
23
+ - **Synthetic eval generation** — `selftune eval generate --synthetic --skill <name> --skill-path <path>` generates eval sets from SKILL.md via LLM without needing real session logs. Solves cold-start for new skills.
24
+ - **Batch trigger validation** — `validateProposalBatched()` batches 10 queries per LLM call (configurable via `TRIGGER_CHECK_BATCH_SIZE`). ~10x faster evolution loops. Sequential `validateProposalSequential()` kept for backward compat.
25
+ - **Cheap-loop evolution mode** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate. New `--gate-model` and `--proposal-model` flags for manual per-stage control.
26
+ - **Validation model selection** — `--validation-model` flag on `evolve` and `evolve body` commands (default: `haiku`).
27
+ - **Proposal model selection** — `--proposal-model` flag on `evolve`, passed through to `generateProposal()` and `generateMultipleProposals()`.
28
+ - **Gate validation dependency injection** — `gateValidateProposal` added to `EvolveDeps` for testability.
29
+ - **Auto-activation system** — `auto-activate.ts` UserPromptSubmit hook detects when selftune should run and outputs formatted suggestions; session state tracking prevents repeated nags; PAI coexistence support
30
+ - **Skill change guard** — `skill-change-guard.ts` PreToolUse hook detects Write/Edit to SKILL.md files and suggests running `selftune watch`
31
+ - **Evolution memory** — 3-file persistence system at `~/.selftune/memory/` (context.md, plan.md, decisions.md) survives context resets; auto-maintained by evolve, rollback, and watch commands
32
+ - **Specialized agents** — 4 purpose-built Claude Code agents: diagnosis-analyst, pattern-analyst, evolution-reviewer, integration-guide
33
+ - **Enforcement guardrails** — `evolution-guard.ts` PreToolUse hook blocks SKILL.md edits on actively monitored skills unless `selftune watch` has been run recently
34
+ - **Integration guide** — Comprehensive `docs/integration-guide.md` with project-type patterns (single-skill, multi-skill, monorepo, Codex-only, OpenCode-only, mixed)
35
+ - **Settings templates** — `templates/single-skill-settings.json`, `templates/multi-skill-settings.json`, `templates/activation-rules-default.json`
36
+ - **Enhanced init** — `selftune init` now detects workspace structure (skill count, monorepo layout) and suggests appropriate template
37
+ - **Dashboard server** — `selftune dashboard --serve` launches live Bun.serve server with SSE auto-refresh, action buttons (watch/evolve/rollback), and evolution timeline
38
+ - **Activation rules engine** — Configurable trigger rules for auto-activation (grading thresholds, stale evolutions, regression detection)
39
+ - **Sandbox test harness** (`tests/sandbox/run-sandbox.ts`): Exercises all CLI commands and hooks against fixture data in an isolated `/tmp` environment. Runs in ~400ms with 10/10 tests passing.
40
+ - **Devcontainer-based LLM testing** (`.devcontainer/` + `tests/sandbox/docker/`): Based on the official Claude Code devcontainer reference. Uses `claude -p` with `--dangerously-skip-permissions` for unattended LLM-dependent testing (grade, evolve, watch). No API key required — uses existing Claude subscription.
41
+ - **Realistic test fixtures**: 3 skills from skills.sh (find-skills, frontend-design, ai-image-generation) with 15 sessions, 30 queries, 7 skill usage records, and evolution audit history.
42
+ - **Hook integration tests**: All 3 Claude Code hooks (prompt-log, skill-eval, session-stop) tested via stdin payload injection.
43
+
44
+ ### Changed
45
+
46
+ - `validateProposal()` now delegates to `validateProposalBatched()` by default (was sequential).
47
+ - `hooks-to-evals.ts` `cliMain()` is now async to support synthetic generation.
48
+ - `EvolveOptions` extended with `validationModel`, `cheapLoop`, `gateModel`, `proposalModel`.
49
+ - `EvolveResult` extended with `gateValidation`.
50
+
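
The batching idea behind `validateProposalBatched()` can be sketched as follows. The real function's signature is not shown in this diff, so the names `chunk`, `BatchChecker`, and `checkTriggersBatched` are hypothetical, and the checker stands in for whatever LLM call the package makes. Only the default batch size of 10 and the `TRIGGER_CHECK_BATCH_SIZE` name come from the changelog.

```typescript
const TRIGGER_CHECK_BATCH_SIZE = 10; // default batch size stated in the changelog

// Split items into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical checker: receives one batch of queries, returns one verdict per query.
type BatchChecker = (queries: string[]) => Promise<boolean[]>;

// Run the checker once per batch and flatten the verdicts back into query order.
async function checkTriggersBatched(
  queries: string[],
  check: BatchChecker,
  batchSize: number = TRIGGER_CHECK_BATCH_SIZE,
): Promise<boolean[]> {
  const results: boolean[] = [];
  for (const batch of chunk(queries, batchSize)) {
    const verdicts = await check(batch);
    if (verdicts.length !== batch.length) {
      throw new Error("checker returned the wrong number of verdicts");
    }
    results.push(...verdicts);
  }
  return results;
}

// Demo with a stub checker that counts how often it is called.
let demoCalls = 0;
const demoQueries = Array.from({ length: 25 }, (_, i) => `query ${i}`);
checkTriggersBatched(demoQueries, async (batch) => {
  demoCalls += 1;
  return batch.map(() => true);
}).then((verdicts) => console.log(demoCalls, verdicts.length)); // 3 25
```

With 25 queries and the default batch size, the stub is called three times rather than 25, which is the source of the roughly tenfold reduction in calls the changelog describes for larger eval sets.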
8
51
  ## [0.1.4] - 2026-03-01
9
52
 
10
53
  ### Added
@@ -12,6 +55,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
12
55
  - `selftune status` — CLI skill health summary with pass rates, trends, and system health
13
56
  - `selftune last` — Quick insight from the most recent session
14
57
  - `selftune dashboard` — Skill-health-centric HTML dashboard with grid view and drill-down
58
+ - `selftune ingest claude` — Claude Code transcript replay for retroactive log backfill
59
+ - `selftune contribute` — Opt-in anonymized data export for community contribution
15
60
  - CI/CD workflows: publish, auto-bump, CodeQL, scorecard
16
61
  - FOSS governance: LICENSE (MIT), CODE_OF_CONDUCT, CONTRIBUTING, SECURITY
17
62
  - npm package configuration with CJS bin entry point
@@ -20,7 +65,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
20
65
 
21
66
  ### Added
22
67
 
23
- - CLI entry point with 10 commands: `init`, `evals`, `grade`, `evolve`, `rollback`, `watch`, `doctor`, `ingest-codex`, `ingest-opencode`, `wrap-codex`
68
+ - CLI entry point with 10 commands: `init`, `eval generate`, `grade`, `evolve`, `evolve rollback`, `watch`, `doctor`, `ingest codex`, `ingest opencode`, `ingest wrap-codex`
24
69
  - Agent auto-detection for Claude Code, Codex, and OpenCode
25
70
  - Telemetry hooks for Claude Code (`prompt-log`, `skill-eval`, `session-stop`)
26
71
  - Codex wrapper and batch ingestor for rollout logs
package/README.md CHANGED
@@ -1,316 +1,164 @@
1
- [![CI](https://github.com/WellDunDun/selftune/actions/workflows/ci.yml/badge.svg)](https://github.com/WellDunDun/selftune/actions/workflows/ci.yml)
2
- [![CodeQL](https://github.com/WellDunDun/selftune/actions/workflows/codeql.yml/badge.svg)](https://github.com/WellDunDun/selftune/actions/workflows/codeql.yml)
3
- [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/WellDunDun/selftune/badge)](https://securityscorecards.dev/viewer/?uri=github.com/WellDunDun/selftune)
4
- [![npm version](https://img.shields.io/npm/v/selftune)](https://www.npmjs.com/package/selftune)
5
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
6
- [![TypeScript](https://img.shields.io/badge/TypeScript-5.0-blue.svg)](https://www.typescriptlang.org/)
7
- [![Zero Dependencies](https://img.shields.io/badge/dependencies-0-brightgreen)](https://www.npmjs.com/package/selftune?activeTab=dependencies)
8
- [![Bun](https://img.shields.io/badge/runtime-bun%20%7C%20node-black)](https://bun.sh)
1
+ <div align="center">
2
+
3
+ <img src="assets/logo.svg" alt="selftune logo" width="80" />
4
+
5
+ # selftune
9
6
 
10
- # selftune — Skill Observability & Continuous Improvement CLI
7
+ **Self-improving skills for AI agents.**
11
8
 
9
+ [![CI](https://github.com/selftune-dev/selftune/actions/workflows/ci.yml/badge.svg)](https://github.com/selftune-dev/selftune/actions/workflows/ci.yml)
10
+ [![CodeQL](https://github.com/selftune-dev/selftune/actions/workflows/codeql.yml/badge.svg)](https://github.com/selftune-dev/selftune/actions/workflows/codeql.yml)
11
+ [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/selftune-dev/selftune/badge)](https://securityscorecards.dev/viewer/?uri=github.com/selftune-dev/selftune)
12
12
  [![npm version](https://img.shields.io/npm/v/selftune)](https://www.npmjs.com/package/selftune)
13
- [![CI](https://github.com/WellDunDun/selftune/actions/workflows/ci.yml/badge.svg)](https://github.com/WellDunDun/selftune/actions/workflows/ci.yml)
14
- [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
13
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
14
+ [![TypeScript](https://img.shields.io/badge/TypeScript-blue.svg)](https://www.typescriptlang.org/)
15
15
  [![Zero Dependencies](https://img.shields.io/badge/dependencies-0-brightgreen)](https://www.npmjs.com/package/selftune?activeTab=dependencies)
16
16
  [![Bun](https://img.shields.io/badge/runtime-bun%20%7C%20node-black)](https://bun.sh)
17
17
 
18
- Observe real sessions, detect missed triggers, grade execution quality, and automatically evolve skill descriptions toward the language real users actually use.
18
+ Your agent skills learn how you work. Detect what's broken. Fix it automatically.
19
19
 
20
- Works with **Claude Code**, **Codex**, and **OpenCode**.
20
+ **[Install](#install)** · **[Use Cases](#built-for-how-you-actually-work)** · **[How It Works](#how-it-works)** · **[Commands](#commands)** · **[Platforms](#platforms)** · **[Docs](docs/integration-guide.md)**
21
21
 
22
- ```
23
- Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat
24
- ```
22
+ </div>
25
23
 
26
24
  ---
27
25
 
28
- ## Install
26
+ Your skills don't understand how you talk. You say "make me a slide deck" and nothing happens — no error, no log, no signal. selftune watches your real sessions, learns how you actually speak, and rewrites skill descriptions to match. Automatically.
29
27
 
30
- ```bash
31
- npx selftune@latest doctor
32
- ```
28
+ Works with **Claude Code** (primary). Codex, OpenCode, and OpenClaw adapters are experimental. Zero runtime dependencies.
33
29
 
34
- Or install globally:
30
+ ## Install
35
31
 
36
32
  ```bash
37
- npm install -g selftune
38
- selftune doctor
33
+ npx skills add selftune-dev/selftune
39
34
  ```
40
35
 
41
- Requires [Bun](https://bun.sh) or Node.js 18+ with [tsx](https://github.com/privatenumber/tsx).
42
-
43
- ---
44
-
45
- ## Why
46
-
47
- Agent skills are static, but users are not. When a skill undertriggers — when someone says "make me a slide deck" and the pptx skill doesn't fire — that failure is invisible. The user concludes "AI doesn't follow directions" rather than recognizing the skill description doesn't match how real people talk.
48
-
49
- selftune closes this feedback loop.
50
-
51
- ---
52
-
53
- ## What It Does
36
+ Then tell your agent: **"initialize selftune"**
54
37
 
55
- | Capability | Description |
56
- |---|---|
57
- | **Session telemetry** | Captures per-session process metrics across all three platforms |
58
- | **False negative detection** | Surfaces queries where a skill should have fired but didn't |
59
- | **Eval set generation** | Converts hook logs into trigger eval sets with real usage as ground truth |
60
- | **Session grading** | 3-tier evaluation (Trigger / Process / Quality) using the agent you already have |
61
- | **Skill evolution** | Proposes improved descriptions, validates them, deploys with audit trail |
62
- | **Post-deploy monitoring** | Watches evolved skills for regressions, auto-rollback on pass rate drops |
63
-
64
- ---
38
+ Two minutes. No API keys. No external services. No configuration ceremony. Uses your existing agent subscription. You'll see which skills are undertriggering.
65
39
 
66
- ## Setup
67
-
68
- ### 1. Add the skill
40
+ **CLI only** (no skill, just the CLI):
69
41
 
70
42
  ```bash
71
- npx skills add WellDunDun/selftune
43
+ npx selftune@latest doctor
72
44
  ```
73
45
 
74
- ### 2. Initialize
46
+ ## Before / After
75
47
 
76
- Tell your agent: **"initialize selftune"**
48
+ <p align="center">
49
+ <img src="./assets/BeforeAfter.gif" alt="Before: 47% pass rate → After: 89% pass rate" width="800">
50
+ </p>
77
51
 
78
- The agent will install the CLI (`npm install -g selftune`) if needed, run `selftune init` to bootstrap config, install hooks, and verify with `selftune doctor`.
52
+ selftune learned that real users say "slides", "deck", "presentation for Monday" — none of which matched the original skill description. It rewrote the description to match how people actually talk. Validated against the eval set. Deployed with a backup. Done.
79
53
 
80
- ---
81
-
82
- ## Development
83
-
84
- For contributors running from source.
85
-
86
- ### 1. Initialize
87
-
88
- ```bash
89
- npx selftune@latest init
90
- ```
54
+ ## Built for How You Actually Work
91
55
 
92
- The `init` command auto-detects your agent environment (Claude Code, Codex, or OpenCode), resolves the CLI path, determines the LLM mode, and writes config to `~/.selftune/config.json`. All subsequent commands read from this config.
56
+ **I write and use my own skills** — Your skill descriptions don't match how you actually talk. Tell your agent "improve my skills" and selftune learns your language from real sessions, evolves descriptions to match, and validates before deploying. No manual tuning.
93
57
 
94
- Use `--agent claude_code|codex|opencode` to override detection, `--llm-mode agent|api` to override LLM mode, or `--force` to reinitialize.
58
+ **I publish skills others install** — Your skill works for you, but every user talks differently. selftune ships skills that get better for every user automatically — adapting descriptions to how each person actually works.
95
59
 
96
- ### 4. Install hooks (Claude Code)
60
+ **I manage an agent setup with many skills** — You have 15+ skills installed. Some work. Some don't. Some conflict. Tell your agent "how are my skills doing?" and selftune gives you a health dashboard and automatically improves the skills that aren't keeping up.
97
61
 
98
- If `init` reports hooks are not installed, merge the entries from `skill/settings_snippet.json` into `~/.claude/settings.json`. Derive hook script paths from the `cli_path` field in `~/.selftune/config.json` — the hooks directory is at `dirname(cli_path)/hooks/`.
62
+ ## How It Works
99
63
 
100
- ### 5. Verify setup
64
+ <p align="center">
65
+ <img src="./assets/FeedbackLoop.gif" alt="Observe → Detect → Evolve → Watch" width="800">
66
+ </p>
101
67
 
102
- ```bash
103
- selftune doctor
104
- ```
68
+ A continuous feedback loop that makes your skills learn and adapt. Automatically. Your agent runs everything — you just install the skill and talk naturally.
105
69
 
106
- Doctor checks log file health, hook installation, schema validity, and config status.
70
+ **Observe** — Hooks capture every query and which skills fired. On Claude Code, hooks install automatically during `selftune init`. Backfill existing transcripts with `selftune ingest claude`.
107
71
 
108
- ### Platform-Specific Notes
72
+ **Detect** — Finds the gap between how you talk and how your skills are described. You say "make me a slide deck" and your pptx skill stays silent — selftune catches that mismatch. Real-time correction signals ("why didn't you use X?") are detected and trigger immediate improvement.
109
73
 
110
- **Claude Code** — Hooks capture telemetry automatically after installation. Zero configuration once hooks are in `settings.json`.
74
+ **Evolve** — Rewrites skill descriptions — and full skill bodies — to match how you actually work. Cheap-loop mode uses haiku for the loop, sonnet for the gate (~80% cost reduction). Teacher-student body evolution with 3-gate validation. Automatic backup.
111
75
 
112
- **Codex** — Use the wrapper for real-time capture or the batch ingestor for historical logs:
113
- ```bash
114
- selftune wrap-codex -- <your codex args>
115
- selftune ingest-codex
116
- ```
76
+ **Watch** — After deploying changes, selftune monitors skill trigger rates. If anything regresses, it rolls back automatically.
117
77
 
118
- **OpenCode** — Backfill historical sessions from SQLite:
119
- ```bash
120
- selftune ingest-opencode
121
- ```
78
+ **Automate** — Run `selftune cron setup` to install OS-level scheduling. selftune syncs, evaluates, evolves, and watches on a schedule — no manual intervention needed.
122
79
 
123
- All platforms write to the same shared JSONL log schema at `~/.claude/`.
80
+ ## What's New in v0.2.0
124
81
 
125
- ---
82
+ - **Full skill body evolution** — Beyond descriptions: evolve routing tables and entire skill bodies using teacher-student model with structural, trigger, and quality gates
83
+ - **Synthetic eval generation** — `selftune eval generate --synthetic` generates eval sets from SKILL.md via LLM, no session logs needed. Solves cold-start: new skills get evals immediately.
84
+ - **Cheap-loop evolution** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate. ~80% cost reduction.
85
+ - **Batch trigger validation** — Validation now batches 10 queries per LLM call instead of one-per-query. ~10x faster evolution loops.
86
+ - **Per-stage model control** — `--validation-model`, `--proposal-model`, and `--gate-model` flags give fine-grained control over which model runs each evolution stage.
87
+ - **Auto-activation system** — Hooks detect when selftune should run and suggest actions
88
+ - **Enforcement guardrails** — Blocks SKILL.md edits on monitored skills unless `selftune watch` has been run
89
+ - **Live dashboard server** — `selftune dashboard --serve` with SSE auto-refresh and action buttons
90
+ - **Evolution memory** — Persists context, plans, and decisions across context resets
91
+ - **4 specialized agents** — Diagnosis analyst, pattern analyst, evolution reviewer, integration guide
92
+ - **Sandbox test harness** — Comprehensive automated test coverage, including devcontainer-based LLM testing
126
93
 
127
94
  ## Commands
128
95
 
129
- ```
130
- selftune <command> [options]
131
- ```
132
-
133
- | Command | Purpose |
96
+ Your agent runs these — you just say what you want ("improve my skills", "show the dashboard").
97
+
98
+ | Group | Command | What it does |
99
+ |-------|---------|-------------|
100
+ | | `selftune status` | See which skills are undertriggering and why |
101
+ | | `selftune orchestrate` | Run the full autonomous loop (sync → evolve → watch) |
102
+ | | `selftune dashboard` | Open the visual skill health dashboard |
103
+ | | `selftune doctor` | Health check: logs, hooks, config, permissions |
104
+ | **ingest** | `selftune ingest claude` | Backfill from Claude Code transcripts |
105
+ | | `selftune ingest codex` | Import Codex rollout logs (experimental) |
106
+ | **grade** | `selftune grade --skill <name>` | Grade a skill session with evidence |
107
+ | | `selftune grade baseline --skill <name>` | Measure skill value vs no-skill baseline |
108
+ | **evolve** | `selftune evolve --skill <name>` | Propose, validate, and deploy improved descriptions |
109
+ | | `selftune evolve body --skill <name>` | Evolve full skill body or routing table |
110
+ | | `selftune evolve rollback --skill <name>` | Rollback a previous evolution |
111
+ | **eval** | `selftune eval generate --skill <name>` | Generate eval sets (`--synthetic` for cold-start) |
112
+ | | `selftune eval unit-test --skill <name>` | Run or generate skill-level unit tests |
113
+ | | `selftune eval composability --skill <name>` | Detect conflicts between co-occurring skills |
114
+ | | `selftune eval import` | Import external eval corpus from [SkillsBench](https://github.com/benchflow-ai/skillsbench) |
115
+ | **auto** | `selftune cron setup` | Install OS-level scheduling (cron/launchd/systemd) |
116
+ | | `selftune watch --skill <name>` | Monitor after deploy. Auto-rollback on regression. |
117
+
118
+ Full command reference: `selftune --help`
119
+
120
+ ## Why Not Just Rewrite Skills Manually?
121
+
122
+ | Approach | Problem |
134
123
  |---|---|
135
- | `init` | Auto-detect agent environment, write `~/.selftune/config.json` |
136
- | `grade --skill <name>` | Grade a session (3-tier: trigger, process, quality) |
137
- | `evals --skill <name>` | Generate eval set from real usage logs |
138
- | `evals --list-skills` | Show logged skills and query counts |
139
- | `evolve --skill <name> --skill-path <path>` | Analyze failures, propose and deploy improved description |
140
- | `rollback --skill <name> --skill-path <path>` | Restore pre-evolution description |
141
- | `watch --skill <name> --skill-path <path>` | Monitor post-deploy pass rates, detect regressions |
142
- | `status` | Show skill health summary (pass rates, trends, missed queries) |
143
- | `last` | Show quick insight from the most recent session |
144
- | `doctor` | Health checks on logs, hooks, config, and schema |
145
- | `dashboard` | Open skill-health-centric HTML dashboard in browser |
146
- | `ingest-codex` | Batch ingest Codex rollout logs |
147
- | `ingest-opencode` | Backfill historical OpenCode sessions from SQLite |
148
- | `wrap-codex -- <args>` | Real-time Codex wrapper with telemetry |
149
-
150
- No separate API key required — grading and evolution use whatever agent CLI you already have installed (Claude Code, Codex, or OpenCode).
151
-
152
- See `skill/Workflows/` for detailed step-by-step guides for each command.
153
-
154
- ---
155
-
156
- ## How It Works
157
-
158
- ### Telemetry Capture
159
-
160
- ```
161
- Claude Code (hooks): OpenCode (hooks):
162
- UserPromptSubmit → prompt-log.ts message.* → opencode-prompt-log.ts
163
- PostToolUse → skill-eval.ts tool.execute.after → opencode-skill-eval.ts
164
- Stop → session-stop.ts session.idle → opencode-session-stop.ts
165
- │ │
166
- └──────────┬─────────────────────────┘
167
-
168
- Shared JSONL Log Schema (~/.claude/)
169
- ├── all_queries_log.jsonl
170
- ├── skill_usage_log.jsonl
171
- └── session_telemetry_log.jsonl
172
-
173
- Codex (wrapper/ingestor — hooks not yet available):
174
- codex-wrapper.ts (real-time tee of JSONL stream)
175
- codex-rollout.ts (batch ingest from rollout logs)
176
-
177
- └──→ Same shared JSONL schema
178
- ```
179
-
180
- ### Eval & Grading
181
-
182
- ```
183
- selftune evals cross-references the two query logs:
184
- Positives = skill_usage_log entries for target skill
185
- Negatives = all_queries_log entries NOT in positives
186
-
187
- selftune grade reads:
188
- session_telemetry_log → process metrics (tool calls, errors, turns)
189
- transcript JSONL → what actually happened
190
- expectations → what should have happened
191
- ```
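
The cross-referencing described in the removed block above can be sketched as follows. This is a minimal illustration, not selftune's actual implementation: `buildEvalSet`, `EvalEntry`, and `should_trigger` are hypothetical names, and the record shapes keep only the fields the sketch needs.

```typescript
// Hypothetical, reduced record shapes (field names taken from the log schema table).
interface QueryLogRecord {
  session_id: string;
  query: string;
}

interface SkillUsageRecord {
  session_id: string;
  skill_name: string;
  query: string;
  triggered: boolean;
}

// Hypothetical eval entry: a query plus whether the skill should fire for it.
interface EvalEntry {
  query: string;
  should_trigger: boolean;
}

function buildEvalSet(
  allQueries: QueryLogRecord[],
  skillUsage: SkillUsageRecord[],
  skill: string,
): EvalEntry[] {
  // Positives: queries logged against the target skill.
  const positives = new Set(
    skillUsage.filter((r) => r.skill_name === skill).map((r) => r.query),
  );
  // Negatives: every logged query that is not among the positives.
  const negatives = new Set(
    allQueries.map((r) => r.query).filter((q) => !positives.has(q)),
  );
  return [
    ...Array.from(positives).map((query) => ({ query, should_trigger: true })),
    ...Array.from(negatives).map((query) => ({ query, should_trigger: false })),
  ];
}
```

Using sets de-duplicates repeated queries, so a query that appears many times in the logs contributes a single eval entry.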
192
-
193
- ### Evolution Loop
194
-
195
- ```
196
- selftune evolve:
197
- 1. Load eval set (or generate from logs)
198
- 2. Extract failure patterns (missed queries grouped by invocation type)
199
- 3. Generate improved description via LLM
200
- 4. Validate against eval set (must improve, <5% regression)
201
- 5. Deploy updated SKILL.md + PR + audit trail
202
-
203
- selftune watch:
204
- Monitor pass rate over sliding window of recent sessions
205
- Alert (or auto-rollback) on regression > threshold
206
- ```
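
The validation gate in step 4 and the `watch` check can be sketched as below. The source states only "must improve, <5% regression" and "regression > threshold"; this sketch assumes that regression means the share of previously passing eval queries that fail after the change, and that `watch` compares a recent-window pass rate to a stored baseline. All names here (`shouldDeploy`, `hasRegressed`, `EvalResult`) are hypothetical.

```typescript
// Hypothetical per-query eval outcome.
interface EvalResult {
  query: string;
  passed: boolean;
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Assumed gate: overall pass rate must rise, and fewer than `maxRegression`
// of the queries that passed before may fail afterwards.
function shouldDeploy(
  before: EvalResult[],
  after: EvalResult[],
  maxRegression = 0.05,
): boolean {
  const afterByQuery = new Map<string, boolean>(
    after.map((r): [string, boolean] => [r.query, r.passed]),
  );
  const previouslyPassing = before.filter((r) => r.passed);
  const regressed = previouslyPassing.filter(
    (r) => afterByQuery.get(r.query) === false,
  ).length;
  const regressionRate =
    previouslyPassing.length === 0 ? 0 : regressed / previouslyPassing.length;
  return passRate(after) > passRate(before) && regressionRate < maxRegression;
}

// Assumed watch check: pass rate over the last `windowSize` sessions
// has dropped more than `threshold` below the baseline.
function hasRegressed(
  sessionPassed: boolean[],
  windowSize: number,
  baseline: number,
  threshold: number,
): boolean {
  const recent = sessionPassed.slice(-windowSize);
  if (recent.length === 0) return false;
  const rate = recent.filter(Boolean).length / recent.length;
  return baseline - rate > threshold;
}
```

Checking regressions per query, rather than only comparing aggregate pass rates, keeps a proposal from trading away queries that already worked in exchange for new ones.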
207
-
208
- ---
209
-
210
- ## Architecture
211
-
212
- ```
213
- cli/selftune/
214
- ├── index.ts CLI entry point (command router)
215
- ├── init.ts Agent detection, config bootstrap
216
- ├── types.ts, constants.ts Shared interfaces and constants
217
- ├── observability.ts Health checks (doctor command)
218
- ├── status.ts Skill health summary (status command)
219
- ├── last.ts Last session insight (last command)
220
- ├── dashboard.ts HTML dashboard builder (dashboard command)
221
- ├── utils/ JSONL, transcript parsing, LLM calls, schema validation
222
- ├── hooks/ Claude Code + OpenCode telemetry capture
223
- ├── ingestors/ Codex adapters + OpenCode backfill
224
- ├── eval/ False negative detection, eval set generation
225
- ├── grading/ 3-tier session grading (agent or API mode)
226
- ├── evolution/ Failure extraction, proposal, validation, deploy, rollback
227
- └── monitoring/ Post-deploy regression detection
228
-
229
- dashboard/
230
- └── index.html Skill-health-centric HTML dashboard template
231
-
232
- skill/
233
- ├── SKILL.md Routing table (~120 lines)
234
- ├── settings_snippet.json Claude Code hook config template
235
- ├── references/ Domain knowledge (logs, grading methodology, taxonomy)
236
- └── Workflows/ Step-by-step guides (1 per command)
237
- ```
238
-
239
- Dependencies flow forward only: `shared → hooks/ingestors → eval → grading → evolution → monitoring`. Enforced by `lint-architecture.ts`.
240
-
241
- Config persists at `~/.selftune/config.json` (written by `init`, read by all commands via skill workflows).
242
-
243
- See [ARCHITECTURE.md](ARCHITECTURE.md) for the full domain map and module rules.
244
-
245
- ---
246
-
247
- ## Log Schema
248
-
249
- Four append-only JSONL files at `~/.claude/`:
250
-
251
- | File | Record type | Key fields |
252
- |---|---|---|
253
- | `all_queries_log.jsonl` | `QueryLogRecord` | `timestamp`, `session_id`, `query`, `source?` |
254
- | `skill_usage_log.jsonl` | `SkillUsageRecord` | `timestamp`, `session_id`, `skill_name`, `query`, `triggered` |
255
- | `session_telemetry_log.jsonl` | `SessionTelemetryRecord` | `timestamp`, `session_id`, `tool_calls`, `bash_commands`, `skills_triggered`, `errors_encountered` |
256
- | `evolution_audit_log.jsonl` | `EvolutionAuditEntry` | `timestamp`, `proposal_id`, `action`, `details`, `eval_snapshot?` |
257
-
258
- The `source` field identifies the platform: `claude_code`, `codex`, or `opencode`.
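
As an illustration of reading these logs, the sketch below parses JSONL text into typed records and counts queries per platform. The field names come from the table above, but the TypeScript types (for example `timestamp` as a string) and the helpers `parseJsonl` and `countBySource` are assumptions, not selftune's own definitions.

```typescript
// Platform values listed in the README for the `source` field.
type Source = "claude_code" | "codex" | "opencode";

// Assumed shape of one line in all_queries_log.jsonl.
interface QueryLogRecord {
  timestamp: string;
  session_id: string;
  query: string;
  source?: Source;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl<T>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

// Count records per platform; records without `source` fall under "unknown".
function countBySource(records: QueryLogRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    const key = r.source ?? "unknown";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

Because the files are append-only with one record per line, a reader like this can process them incrementally and skip blank lines.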
259
-
260
- ---
124
+ | Rewrite the description yourself | No data on how users actually talk. No validation. No regression detection. |
125
+ | Add "ALWAYS invoke when..." directives | Brittle. One agent rewrite away from breaking. |
126
+ | Force-load skills on every prompt | Doesn't fix the description. Expensive band-aid. |
127
+ | **selftune** | Learns from real usage, rewrites descriptions to match how you work, validates against eval sets, and rolls back automatically on regressions. |
261
128
 
262
- ## Development
129
+ ## Different Layer, Different Problem
263
130
 
264
- ```bash
265
- make check # lint + architecture lint + all tests
266
- make lint # biome check + architecture lint
267
- make test # bun test
268
- ```
131
+ LLM observability tools trace API calls. Infrastructure tools monitor servers. Neither knows whether the right skill fired for the right person. selftune does — and fixes it automatically.
269
132
 
270
- Zero runtime dependencies. Uses Bun built-ins only.
133
+ selftune is complementary to these tools, not competitive. They trace what happens inside the LLM. selftune makes sure the right skill is called in the first place.
271
134
 
272
- ---
135
+ | Dimension | selftune | Langfuse | LangSmith | OpenLIT |
136
+ |-----------|----------|----------|-----------|---------|
137
+ | **Layer** | Skill-specific | LLM call | Agent trace | Infrastructure |
138
+ | **Detects** | Missed triggers, false negatives, skill conflicts | Token usage, latency | Chain failures | System metrics |
139
+ | **Improves** | Descriptions, body, and routing automatically | — | — | — |
140
+ | **Setup** | Zero deps, zero API keys | Self-host or cloud | Cloud required | Helm chart |
141
+ | **Price** | Free (MIT) | Freemium | Paid | Free |
142
+ | **Unique** | Self-improving skills + auto-rollback | Prompt management | Evaluations | Dashboards |
273
143
 
274
- ## Tips
144
+ ## Platforms
275
145
 
276
- - Run `selftune init` first; everything else reads from the config it writes.
277
- - Let logs accumulate over several days before running evals — more diverse real queries = more reliable signal.
278
- - All hooks are silent (exit 0) and take <50ms. Negligible overhead.
279
- - Logs are append-only JSONL. Safe to delete to start fresh, or archive old files.
280
- - Use `--max 75` to increase eval set size once you have enough data.
281
- - Use `--seed 123` for a different random sample of negatives.
282
- - Use `--dry-run` with `evolve` to preview proposals without deploying.
283
- - The `doctor` command checks log health, hook presence, config status, and schema validity.
146
+ **Claude Code** (fully supported) — Hooks install automatically. `selftune ingest claude` backfills existing transcripts. This is the primary supported platform.
284
147
 
285
- ---
148
+ **Codex** (experimental) — `selftune ingest wrap-codex -- <args>` or `selftune ingest codex`. Adapter exists but is not actively tested.
286
149
 
287
- ## Contributing
150
+ **OpenCode** (experimental) — `selftune ingest opencode`. Adapter exists but is not actively tested.
288
151
 
289
- See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, architecture rules, and PR guidelines.
152
+ **OpenClaw** (experimental): `selftune ingest openclaw` + `selftune cron setup` for autonomous evolution. Adapter exists but is not actively tested.
290
153
 
291
- Please follow our [Code of Conduct](CODE_OF_CONDUCT.md).
154
+ Requires [Bun](https://bun.sh) or Node.js 18+. No extra API keys.
292
155
 
293
156
  ---
294
157
 
295
- ## Security
296
-
297
- To report a vulnerability, see [SECURITY.md](SECURITY.md).
158
+ <div align="center">
298
159
 
299
- ---
300
-
301
- ## Sponsor
302
-
303
- If selftune saves you time, consider [sponsoring the project](https://github.com/sponsors/WellDunDun).
304
-
305
- ---
160
+ [Architecture](ARCHITECTURE.md) · [Contributing](CONTRIBUTING.md) · [Security](SECURITY.md) · [Integration Guide](docs/integration-guide.md) · [Sponsor](https://github.com/sponsors/WellDunDun)
306
161
 
307
- ## Milestones
162
+ MIT licensed. Free forever. Primary support for Claude Code; experimental adapters for Codex, OpenCode, and OpenClaw.
308
163
 
309
- | Version | Scope | Status |
310
- |---|---|---|
311
- | v0.1 | Hooks, ingestors, shared schema, eval generation | Done |
312
- | v0.2 | Session grading, grader skill | Done |
313
- | v0.3 | Evolution loop (propose, validate, deploy, rollback) | Done |
314
- | v0.4 | Post-deploy monitoring, regression detection | Done |
315
- | v0.5 | Agent-first skill restructure, `init` command, config bootstrap | Done |
316
- | v0.6 | Three-layer observability: `status`, `last`, redesigned dashboard | Done |
164
+ </div>