selftune 0.1.4 → 0.2.1
- package/.claude/agents/diagnosis-analyst.md +156 -0
- package/.claude/agents/evolution-reviewer.md +180 -0
- package/.claude/agents/integration-guide.md +212 -0
- package/.claude/agents/pattern-analyst.md +160 -0
- package/CHANGELOG.md +46 -1
- package/README.md +105 -257
- package/apps/local-dashboard/dist/assets/geist-cyrillic-wght-normal-CHSlOQsW.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/geist-latin-ext-wght-normal-DMtmJ5ZE.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/geist-latin-wght-normal-Dm3htQBi.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/index-C4EOTFZ2.js +15 -0
- package/apps/local-dashboard/dist/assets/index-bl-Webyd.css +1 -0
- package/apps/local-dashboard/dist/assets/vendor-react-U7zYD9Rg.js +60 -0
- package/apps/local-dashboard/dist/assets/vendor-table-B7VF2Ipl.js +26 -0
- package/apps/local-dashboard/dist/assets/vendor-ui-D7_zX_qy.js +346 -0
- package/apps/local-dashboard/dist/favicon.png +0 -0
- package/apps/local-dashboard/dist/index.html +17 -0
- package/apps/local-dashboard/dist/logo.png +0 -0
- package/apps/local-dashboard/dist/logo.svg +9 -0
- package/assets/BeforeAfter.gif +0 -0
- package/assets/FeedbackLoop.gif +0 -0
- package/assets/logo.svg +9 -0
- package/assets/skill-health-badge.svg +20 -0
- package/cli/selftune/activation-rules.ts +171 -0
- package/cli/selftune/badge/badge-data.ts +108 -0
- package/cli/selftune/badge/badge-svg.ts +212 -0
- package/cli/selftune/badge/badge.ts +99 -0
- package/cli/selftune/canonical-export.ts +183 -0
- package/cli/selftune/constants.ts +103 -1
- package/cli/selftune/contribute/bundle.ts +314 -0
- package/cli/selftune/contribute/contribute.ts +214 -0
- package/cli/selftune/contribute/sanitize.ts +162 -0
- package/cli/selftune/cron/setup.ts +266 -0
- package/cli/selftune/dashboard-contract.ts +202 -0
- package/cli/selftune/dashboard-server.ts +1049 -0
- package/cli/selftune/dashboard.ts +43 -156
- package/cli/selftune/eval/baseline.ts +248 -0
- package/cli/selftune/eval/composability-v2.ts +273 -0
- package/cli/selftune/eval/composability.ts +117 -0
- package/cli/selftune/eval/generate-unit-tests.ts +143 -0
- package/cli/selftune/eval/hooks-to-evals.ts +101 -16
- package/cli/selftune/eval/import-skillsbench.ts +221 -0
- package/cli/selftune/eval/synthetic-evals.ts +172 -0
- package/cli/selftune/eval/unit-test-cli.ts +152 -0
- package/cli/selftune/eval/unit-test.ts +196 -0
- package/cli/selftune/evolution/deploy-proposal.ts +142 -1
- package/cli/selftune/evolution/evidence.ts +26 -0
- package/cli/selftune/evolution/evolve-body.ts +586 -0
- package/cli/selftune/evolution/evolve.ts +825 -116
- package/cli/selftune/evolution/extract-patterns.ts +105 -16
- package/cli/selftune/evolution/pareto.ts +314 -0
- package/cli/selftune/evolution/propose-body.ts +171 -0
- package/cli/selftune/evolution/propose-description.ts +100 -2
- package/cli/selftune/evolution/propose-routing.ts +166 -0
- package/cli/selftune/evolution/refine-body.ts +141 -0
- package/cli/selftune/evolution/rollback.ts +21 -4
- package/cli/selftune/evolution/validate-body.ts +254 -0
- package/cli/selftune/evolution/validate-proposal.ts +257 -35
- package/cli/selftune/evolution/validate-routing.ts +177 -0
- package/cli/selftune/grading/auto-grade.ts +200 -0
- package/cli/selftune/grading/grade-session.ts +513 -42
- package/cli/selftune/grading/pre-gates.ts +104 -0
- package/cli/selftune/grading/results.ts +42 -0
- package/cli/selftune/hooks/auto-activate.ts +185 -0
- package/cli/selftune/hooks/evolution-guard.ts +165 -0
- package/cli/selftune/hooks/prompt-log.ts +172 -2
- package/cli/selftune/hooks/session-stop.ts +123 -3
- package/cli/selftune/hooks/skill-change-guard.ts +112 -0
- package/cli/selftune/hooks/skill-eval.ts +119 -3
- package/cli/selftune/index.ts +415 -48
- package/cli/selftune/ingestors/claude-replay.ts +377 -0
- package/cli/selftune/ingestors/codex-rollout.ts +345 -46
- package/cli/selftune/ingestors/codex-wrapper.ts +207 -39
- package/cli/selftune/ingestors/openclaw-ingest.ts +573 -0
- package/cli/selftune/ingestors/opencode-ingest.ts +193 -17
- package/cli/selftune/init.ts +376 -16
- package/cli/selftune/last.ts +14 -5
- package/cli/selftune/localdb/db.ts +63 -0
- package/cli/selftune/localdb/materialize.ts +428 -0
- package/cli/selftune/localdb/queries.ts +376 -0
- package/cli/selftune/localdb/schema.ts +204 -0
- package/cli/selftune/memory/writer.ts +447 -0
- package/cli/selftune/monitoring/watch.ts +90 -16
- package/cli/selftune/normalization.ts +682 -0
- package/cli/selftune/observability.ts +19 -44
- package/cli/selftune/orchestrate.ts +1073 -0
- package/cli/selftune/quickstart.ts +203 -0
- package/cli/selftune/repair/skill-usage.ts +576 -0
- package/cli/selftune/schedule.ts +561 -0
- package/cli/selftune/status.ts +59 -33
- package/cli/selftune/sync.ts +627 -0
- package/cli/selftune/types.ts +525 -5
- package/cli/selftune/utils/canonical-log.ts +45 -0
- package/cli/selftune/utils/frontmatter.ts +217 -0
- package/cli/selftune/utils/hooks.ts +41 -0
- package/cli/selftune/utils/html.ts +27 -0
- package/cli/selftune/utils/llm-call.ts +103 -19
- package/cli/selftune/utils/math.ts +10 -0
- package/cli/selftune/utils/query-filter.ts +139 -0
- package/cli/selftune/utils/skill-discovery.ts +340 -0
- package/cli/selftune/utils/skill-log.ts +68 -0
- package/cli/selftune/utils/skill-usage-confidence.ts +18 -0
- package/cli/selftune/utils/transcript.ts +307 -26
- package/cli/selftune/utils/trigger-check.ts +89 -0
- package/cli/selftune/utils/tui.ts +156 -0
- package/cli/selftune/workflows/discover.ts +254 -0
- package/cli/selftune/workflows/skill-md-writer.ts +288 -0
- package/cli/selftune/workflows/workflows.ts +188 -0
- package/package.json +28 -11
- package/packages/telemetry-contract/README.md +11 -0
- package/packages/telemetry-contract/fixtures/golden.json +87 -0
- package/packages/telemetry-contract/fixtures/golden.test.ts +42 -0
- package/packages/telemetry-contract/index.ts +1 -0
- package/packages/telemetry-contract/package.json +19 -0
- package/packages/telemetry-contract/src/index.ts +2 -0
- package/packages/telemetry-contract/src/types.ts +163 -0
- package/packages/telemetry-contract/src/validators.ts +109 -0
- package/skill/SKILL.md +180 -33
- package/skill/Workflows/AutoActivation.md +145 -0
- package/skill/Workflows/Badge.md +124 -0
- package/skill/Workflows/Baseline.md +144 -0
- package/skill/Workflows/Composability.md +107 -0
- package/skill/Workflows/Contribute.md +94 -0
- package/skill/Workflows/Cron.md +132 -0
- package/skill/Workflows/Dashboard.md +214 -0
- package/skill/Workflows/Doctor.md +63 -14
- package/skill/Workflows/Evals.md +110 -18
- package/skill/Workflows/EvolutionMemory.md +154 -0
- package/skill/Workflows/Evolve.md +181 -21
- package/skill/Workflows/EvolveBody.md +159 -0
- package/skill/Workflows/Grade.md +36 -31
- package/skill/Workflows/ImportSkillsBench.md +117 -0
- package/skill/Workflows/Ingest.md +142 -21
- package/skill/Workflows/Initialize.md +91 -23
- package/skill/Workflows/Orchestrate.md +139 -0
- package/skill/Workflows/Replay.md +91 -0
- package/skill/Workflows/Rollback.md +23 -4
- package/skill/Workflows/Schedule.md +61 -0
- package/skill/Workflows/Sync.md +88 -0
- package/skill/Workflows/UnitTest.md +150 -0
- package/skill/Workflows/Watch.md +33 -1
- package/skill/Workflows/Workflows.md +129 -0
- package/skill/assets/activation-rules-default.json +26 -0
- package/skill/assets/multi-skill-settings.json +63 -0
- package/skill/assets/single-skill-settings.json +57 -0
- package/skill/references/invocation-taxonomy.md +2 -2
- package/skill/references/logs.md +164 -2
- package/skill/references/setup-patterns.md +65 -0
- package/skill/references/version-history.md +40 -0
- package/skill/settings_snippet.json +23 -0
- package/templates/activation-rules-default.json +27 -0
- package/templates/multi-skill-settings.json +64 -0
- package/templates/single-skill-settings.json +58 -0
- package/dashboard/index.html +0 -1119
package/CHANGELOG.md
CHANGED
|
@@ -5,6 +5,49 @@ All notable changes to this project will be documented in this file.
|
|
|
5
5
|
The format is based on [Keep a Changelog](https://keepachangelog.com/),
|
|
6
6
|
and this project adheres to [Semantic Versioning](https://semver.org/).
|
|
7
7
|
|
|
8
|
+
## [Unreleased]
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
|
|
12
|
+
- **Real-time improvement signal detection** — `prompt-log` hook detects user corrections ("why didn't you use X?") and explicit skill requests via pure regex patterns. Signals are logged to `~/.claude/improvement_signals.jsonl` with skill name extraction from installed skills.
|
|
13
|
+
- **Signal-reactive orchestration** — `session-stop` hook checks for pending improvement signals and spawns a focused `selftune orchestrate --max-skills 2` run in the background. Respects a 30-minute lockfile to prevent concurrent runs.
|
|
14
|
+
- **Signal-aware candidate selection** — Orchestrator reads pending signals and boosts priority for mentioned skills (+150 per signal, capped at +450). Signaled skills bypass the minimum evidence gate and the "UNGRADED with 0 missed queries" gate.
|
|
15
|
+
- **Orchestrate lockfile** — `acquireLock()`/`releaseLock()` with PID+timestamp in `~/.claude/.orchestrate.lock`. 30-minute stale threshold prevents deadlocks from crashed runs.
|
|
16
|
+
- **Signal consumption** — After an orchestrate run completes, consumed signals are marked with `consumed: true`, `consumed_at`, and `consumed_by_run` so they don't affect subsequent runs.
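
The two numeric rules in the entries above (a +150 boost per signal capped at +450, and a 30-minute stale threshold for the lock) can be sketched as pure functions. This is a minimal illustration, not the package's code: the constant names, the `LockInfo` shape, and the helpers `signalBoost` and `isLockStale` are all hypothetical. Only the numbers come from the changelog.

```typescript
// Hypothetical constants mirroring the changelog numbers; the real names may differ.
const BOOST_PER_SIGNAL = 150;
const MAX_SIGNAL_BOOST = 450;
const LOCK_STALE_MS = 30 * 60 * 1000; // 30 minutes

// Assumed shape of the lock file contents (PID plus timestamp).
interface LockInfo {
  pid: number;
  timestamp: string; // ISO 8601
}

// Priority boost for a skill mentioned in `signalCount` pending signals.
function signalBoost(signalCount: number): number {
  const count = Math.max(0, Math.floor(signalCount));
  return Math.min(count * BOOST_PER_SIGNAL, MAX_SIGNAL_BOOST);
}

// A lock older than the threshold is treated as stale so a crashed run
// cannot block later runs forever. An unparseable timestamp also counts as stale.
function isLockStale(lock: LockInfo, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - new Date(lock.timestamp).getTime();
  return Number.isNaN(ageMs) || ageMs > LOCK_STALE_MS;
}
```

With these assumptions, a skill mentioned in four signals gets the same boost as one mentioned in three, since the cap applies from the third signal onward.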
|
|
17
|
+
|
|
18
|
+
## [0.2.0] — 2026-03-08
|
|
19
|
+
|
|
20
|
+
### Added
|
|
21
|
+
|
|
22
|
+
- **Full skill body evolution** — Teacher-student model for evolving routing tables and complete skill bodies with 3-gate validation (structural, trigger, quality)
|
|
23
|
+
- **Synthetic eval generation** — `selftune eval generate --synthetic --skill <name> --skill-path <path>` generates eval sets from SKILL.md via LLM without needing real session logs. Solves cold-start for new skills.
|
|
24
|
+
- **Batch trigger validation** — `validateProposalBatched()` batches 10 queries per LLM call (configurable via `TRIGGER_CHECK_BATCH_SIZE`). ~10x faster evolution loops. Sequential `validateProposalSequential()` kept for backward compat.
|
|
25
|
+
- **Cheap-loop evolution mode** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate. New `--gate-model` and `--proposal-model` flags for manual per-stage control.
|
|
26
|
+
- **Validation model selection** — `--validation-model` flag on `evolve` and `evolve body` commands (default: `haiku`).
|
|
27
|
+
- **Proposal model selection** — `--proposal-model` flag on `evolve`, passed through to `generateProposal()` and `generateMultipleProposals()`.
|
|
28
|
+
- **Gate validation dependency injection** — `gateValidateProposal` added to `EvolveDeps` for testability.
|
|
29
|
+
- **Auto-activation system** — `auto-activate.ts` UserPromptSubmit hook detects when selftune should run and outputs formatted suggestions; session state tracking prevents repeated nags; PAI coexistence support
|
|
30
|
+
- **Skill change guard** — `skill-change-guard.ts` PreToolUse hook detects Write/Edit to SKILL.md files and suggests running `selftune watch`
|
|
31
|
+
- **Evolution memory** — 3-file persistence system at `~/.selftune/memory/` (context.md, plan.md, decisions.md) survives context resets; auto-maintained by evolve, rollback, and watch commands
|
|
32
|
+
- **Specialized agents** — 4 purpose-built Claude Code agents: diagnosis-analyst, pattern-analyst, evolution-reviewer, integration-guide
|
|
33
|
+
- **Enforcement guardrails** — `evolution-guard.ts` PreToolUse hook blocks SKILL.md edits on actively monitored skills unless `selftune watch` has been run recently
|
|
34
|
+
- **Integration guide** — Comprehensive `docs/integration-guide.md` with project-type patterns (single-skill, multi-skill, monorepo, Codex-only, OpenCode-only, mixed)
|
|
35
|
+
- **Settings templates** — `templates/single-skill-settings.json`, `templates/multi-skill-settings.json`, `templates/activation-rules-default.json`
|
|
36
|
+
- **Enhanced init** — `selftune init` now detects workspace structure (skill count, monorepo layout) and suggests appropriate template
|
|
37
|
+
- **Dashboard server** — `selftune dashboard --serve` launches live Bun.serve server with SSE auto-refresh, action buttons (watch/evolve/rollback), and evolution timeline
|
|
38
|
+
- **Activation rules engine** — Configurable trigger rules for auto-activation (grading thresholds, stale evolutions, regression detection)
|
|
39
|
+
- **Sandbox test harness** (`tests/sandbox/run-sandbox.ts`): Exercises all CLI commands and hooks against fixture data in an isolated `/tmp` environment. Runs in ~400ms with 10/10 tests passing.
|
|
40
|
+
- **Devcontainer-based LLM testing** (`.devcontainer/` + `tests/sandbox/docker/`): Based on the official Claude Code devcontainer reference. Uses `claude -p` with `--dangerously-skip-permissions` for unattended LLM-dependent testing (grade, evolve, watch). No API key required — uses existing Claude subscription.
|
|
41
|
+
- **Realistic test fixtures**: 3 skills from skills.sh (find-skills, frontend-design, ai-image-generation) with 15 sessions, 30 queries, 7 skill usage records, and evolution audit history.
|
|
42
|
+
- **Hook integration tests**: All 3 Claude Code hooks (prompt-log, skill-eval, session-stop) tested via stdin payload injection.
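
The batch trigger validation entry above describes grouping eval queries into batches of 10 per LLM call. A rough sketch of that grouping step follows. The helper `toBatches` is hypothetical, and while `TRIGGER_CHECK_BATCH_SIZE` is named in the changelog, its declaration here is only an assumption about how it might be defined.

```typescript
// Assumed default matching the "10 queries per LLM call" figure above.
const TRIGGER_CHECK_BATCH_SIZE = 10;

// Split items into consecutive batches of at most `size` elements each.
function toBatches<T>(
  items: readonly T[],
  size: number = TRIGGER_CHECK_BATCH_SIZE,
): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 25 queries become three batches: 10, 10, and 5 queries.
const queries = Array.from({ length: 25 }, (_, i) => `query ${i + 1}`);
const batches = toBatches(queries);
console.log(batches.map((b) => b.length));
```

Each batch would then go to the model in a single call, which is where the roughly tenfold reduction in calls comes from.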
|
|
43
|
+
|
|
44
|
+
### Changed
|
|
45
|
+
|
|
46
|
+
- `validateProposal()` now delegates to `validateProposalBatched()` by default (was sequential).
|
|
47
|
+
- `hooks-to-evals.ts` `cliMain()` is now async to support synthetic generation.
|
|
48
|
+
- `EvolveOptions` extended with `validationModel`, `cheapLoop`, `gateModel`, `proposalModel`.
|
|
49
|
+
- `EvolveResult` extended with `gateValidation`.
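
One way the new `EvolveOptions` fields might combine is sketched below. Only the four field names and the `haiku` / `sonnet` roles come from the changelog. The field types, the `StageModels` shape, the `resolveStageModels` helper, and the rule that explicit flags take precedence over `--cheap-loop` are all assumptions made for illustration.

```typescript
// Assumed subset of EvolveOptions; only the field names come from the changelog.
interface EvolveModelOptions {
  validationModel?: string;
  cheapLoop?: boolean;
  gateModel?: string;
  proposalModel?: string;
}

// Hypothetical result: which model runs each stage.
// `undefined` means "leave it to the agent CLI's default".
interface StageModels {
  proposal: string | undefined;
  validation: string;
  gate: string | undefined;
}

// Assumption: an explicit per-stage flag wins over the cheap-loop preset.
function resolveStageModels(opts: EvolveModelOptions): StageModels {
  const cheap = opts.cheapLoop === true;
  return {
    proposal: opts.proposalModel ?? (cheap ? "haiku" : undefined),
    validation: opts.validationModel ?? "haiku", // documented default
    gate: opts.gateModel ?? (cheap ? "sonnet" : undefined),
  };
}
```

Under these assumptions, `resolveStageModels({ cheapLoop: true })` selects `haiku` for proposal and validation and `sonnet` for the final gate, matching the cheap-loop description.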
|
|
50
|
+
|
|
8
51
|
## [0.1.4] - 2026-03-01
|
|
9
52
|
|
|
10
53
|
### Added
|
|
@@ -12,6 +55,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
|
|
|
12
55
|
- `selftune status` — CLI skill health summary with pass rates, trends, and system health
|
|
13
56
|
- `selftune last` — Quick insight from the most recent session
|
|
14
57
|
- `selftune dashboard` — Skill-health-centric HTML dashboard with grid view and drill-down
|
|
58
|
+
- `selftune ingest claude` — Claude Code transcript replay for retroactive log backfill
|
|
59
|
+
- `selftune contribute` — Opt-in anonymized data export for community contribution
|
|
15
60
|
- CI/CD workflows: publish, auto-bump, CodeQL, scorecard
|
|
16
61
|
- FOSS governance: LICENSE (MIT), CODE_OF_CONDUCT, CONTRIBUTING, SECURITY
|
|
17
62
|
- npm package configuration with CJS bin entry point
|
|
@@ -20,7 +65,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
|
|
|
20
65
|
|
|
21
66
|
### Added
|
|
22
67
|
|
|
23
|
-
- CLI entry point with 10 commands: `init`, `
|
|
68
|
+
- CLI entry point with 10 commands: `init`, `eval generate`, `grade`, `evolve`, `evolve rollback`, `watch`, `doctor`, `ingest codex`, `ingest opencode`, `ingest wrap-codex`
|
|
24
69
|
- Agent auto-detection for Claude Code, Codex, and OpenCode
|
|
25
70
|
- Telemetry hooks for Claude Code (`prompt-log`, `skill-eval`, `session-stop`)
|
|
26
71
|
- Codex wrapper and batch ingestor for rollout logs
|
package/README.md
CHANGED
|
@@ -1,316 +1,164 @@
|
|
|
1
|
-
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
[](https://www.typescriptlang.org/)
|
|
7
|
-
[](https://www.npmjs.com/package/selftune?activeTab=dependencies)
|
|
8
|
-
[](https://bun.sh)
|
|
1
|
+
<div align="center">
|
|
2
|
+
|
|
3
|
+
<img src="assets/logo.svg" alt="selftune logo" width="80" />
|
|
4
|
+
|
|
5
|
+
# selftune
|
|
9
6
|
|
|
10
|
-
|
|
7
|
+
**Self-improving skills for AI agents.**
|
|
11
8
|
|
|
9
|
+
[](https://github.com/selftune-dev/selftune/actions/workflows/ci.yml)
|
|
10
|
+
[](https://github.com/selftune-dev/selftune/actions/workflows/codeql.yml)
|
|
11
|
+
[](https://securityscorecards.dev/viewer/?uri=github.com/selftune-dev/selftune)
|
|
12
12
|
[](https://www.npmjs.com/package/selftune)
|
|
13
|
-
[](LICENSE)
|
|
14
|
+
[](https://www.typescriptlang.org/)
|
|
15
15
|
[](https://www.npmjs.com/package/selftune?activeTab=dependencies)
|
|
16
16
|
[](https://bun.sh)
|
|
17
17
|
|
|
18
|
-
|
|
18
|
+
Your agent skills learn how you work. Detect what's broken. Fix it automatically.
|
|
19
19
|
|
|
20
|
-
Works
|
|
20
|
+
**[Install](#install)** · **[Use Cases](#built-for-how-you-actually-work)** · **[How It Works](#how-it-works)** · **[Commands](#commands)** · **[Platforms](#platforms)** · **[Docs](docs/integration-guide.md)**
|
|
21
21
|
|
|
22
|
-
|
|
23
|
-
Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat
|
|
24
|
-
```
|
|
22
|
+
</div>
|
|
25
23
|
|
|
26
24
|
---
|
|
27
25
|
|
|
28
|
-
|
|
26
|
+
Your skills don't understand how you talk. You say "make me a slide deck" and nothing happens — no error, no log, no signal. selftune watches your real sessions, learns how you actually speak, and rewrites skill descriptions to match. Automatically.
|
|
29
27
|
|
|
30
|
-
|
|
31
|
-
npx selftune@latest doctor
|
|
32
|
-
```
|
|
28
|
+
Works with **Claude Code** (primary). Codex, OpenCode, and OpenClaw adapters are experimental. Zero runtime dependencies.
|
|
33
29
|
|
|
34
|
-
|
|
30
|
+
## Install
|
|
35
31
|
|
|
36
32
|
```bash
|
|
37
|
-
|
|
38
|
-
selftune doctor
|
|
33
|
+
npx skills add selftune-dev/selftune
|
|
39
34
|
```
|
|
40
35
|
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
---
|
|
44
|
-
|
|
45
|
-
## Why
|
|
46
|
-
|
|
47
|
-
Agent skills are static, but users are not. When a skill undertriggers — when someone says "make me a slide deck" and the pptx skill doesn't fire — that failure is invisible. The user concludes "AI doesn't follow directions" rather than recognizing the skill description doesn't match how real people talk.
|
|
48
|
-
|
|
49
|
-
selftune closes this feedback loop.
|
|
50
|
-
|
|
51
|
-
---
|
|
52
|
-
|
|
53
|
-
## What It Does
|
|
36
|
+
Then tell your agent: **"initialize selftune"**
|
|
54
37
|
|
|
55
|
-
|
|
56
|
-
|---|---|
|
|
57
|
-
| **Session telemetry** | Captures per-session process metrics across all three platforms |
|
|
58
|
-
| **False negative detection** | Surfaces queries where a skill should have fired but didn't |
|
|
59
|
-
| **Eval set generation** | Converts hook logs into trigger eval sets with real usage as ground truth |
|
|
60
|
-
| **Session grading** | 3-tier evaluation (Trigger / Process / Quality) using the agent you already have |
|
|
61
|
-
| **Skill evolution** | Proposes improved descriptions, validates them, deploys with audit trail |
|
|
62
|
-
| **Post-deploy monitoring** | Watches evolved skills for regressions, auto-rollback on pass rate drops |
|
|
63
|
-
|
|
64
|
-
---
|
|
38
|
+
Two minutes. No API keys. No external services. No configuration ceremony. Uses your existing agent subscription. You'll see which skills are undertriggering.
|
|
65
39
|
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
### 1. Add the skill
|
|
40
|
+
**CLI only** (no skill, just the CLI):
|
|
69
41
|
|
|
70
42
|
```bash
|
|
71
|
-
npx
|
|
43
|
+
npx selftune@latest doctor
|
|
72
44
|
```
|
|
73
45
|
|
|
74
|
-
|
|
46
|
+
## Before / After
|
|
75
47
|
|
|
76
|
-
|
|
48
|
+
<p align="center">
|
|
49
|
+
<img src="./assets/BeforeAfter.gif" alt="Before: 47% pass rate → After: 89% pass rate" width="800">
|
|
50
|
+
</p>
|
|
77
51
|
|
|
78
|
-
|
|
52
|
+
selftune learned that real users say "slides", "deck", "presentation for Monday" — none of which matched the original skill description. It rewrote the description to match how people actually talk. Validated against the eval set. Deployed with a backup. Done.
|
|
79
53
|
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
## Development
|
|
83
|
-
|
|
84
|
-
For contributors running from source.
|
|
85
|
-
|
|
86
|
-
### 1. Initialize
|
|
87
|
-
|
|
88
|
-
```bash
|
|
89
|
-
npx selftune@latest init
|
|
90
|
-
```
|
|
54
|
+
## Built for How You Actually Work
|
|
91
55
|
|
|
92
|
-
|
|
56
|
+
**I write and use my own skills** — Your skill descriptions don't match how you actually talk. Tell your agent "improve my skills" and selftune learns your language from real sessions, evolves descriptions to match, and validates before deploying. No manual tuning.
|
|
93
57
|
|
|
94
|
-
|
|
58
|
+
**I publish skills others install** — Your skill works for you, but every user talks differently. selftune ships skills that get better for every user automatically — adapting descriptions to how each person actually works.
|
|
95
59
|
|
|
96
|
-
|
|
60
|
+
**I manage an agent setup with many skills** — You have 15+ skills installed. Some work. Some don't. Some conflict. Tell your agent "how are my skills doing?" and selftune gives you a health dashboard and automatically improves the skills that aren't keeping up.
|
|
97
61
|
|
|
98
|
-
|
|
62
|
+
## How It Works
|
|
99
63
|
|
|
100
|
-
|
|
64
|
+
<p align="center">
|
|
65
|
+
<img src="./assets/FeedbackLoop.gif" alt="Observe → Detect → Evolve → Watch" width="800">
|
|
66
|
+
</p>
|
|
101
67
|
|
|
102
|
-
|
|
103
|
-
selftune doctor
|
|
104
|
-
```
|
|
68
|
+
A continuous feedback loop that makes your skills learn and adapt. Automatically. Your agent runs everything — you just install the skill and talk naturally.
|
|
105
69
|
|
|
106
|
-
|
|
70
|
+
**Observe** — Hooks capture every query and which skills fired. On Claude Code, hooks install automatically during `selftune init`. Backfill existing transcripts with `selftune ingest claude`.
|
|
107
71
|
|
|
108
|
-
|
|
72
|
+
**Detect** — Finds the gap between how you talk and how your skills are described. You say "make me a slide deck" and your pptx skill stays silent — selftune catches that mismatch. Real-time correction signals ("why didn't you use X?") are detected and trigger immediate improvement.
|
|
109
73
|
|
|
110
|
-
**
|
|
74
|
+
**Evolve** — Rewrites skill descriptions — and full skill bodies — to match how you actually work. Cheap-loop mode uses haiku for the loop, sonnet for the gate (~80% cost reduction). Teacher-student body evolution with 3-gate validation. Automatic backup.
|
|
111
75
|
|
|
112
|
-
**
|
|
113
|
-
```bash
|
|
114
|
-
selftune wrap-codex -- <your codex args>
|
|
115
|
-
selftune ingest-codex
|
|
116
|
-
```
|
|
76
|
+
**Watch** — After deploying changes, selftune monitors skill trigger rates. If anything regresses, it rolls back automatically.
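
The regression check behind the watch step could look roughly like the sketch below, which compares the pass rate over a window of recent sessions with a pre-deploy baseline. The `SessionOutcome` type, the `passRate` and `hasRegressed` helpers, and the default window size and threshold are all hypothetical and are not taken from the package.

```typescript
// Hypothetical record: whether the skill triggered correctly in one session.
interface SessionOutcome {
  passed: boolean;
}

// Fraction of sessions that passed; 0 for an empty list.
function passRate(outcomes: readonly SessionOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((o) => o.passed).length / outcomes.length;
}

// True when the pass rate over the most recent `windowSize` sessions
// has dropped below `baseline` by more than `threshold`.
// The default values are illustrative assumptions.
function hasRegressed(
  outcomes: readonly SessionOutcome[],
  baseline: number,
  windowSize: number = 20,
  threshold: number = 0.1,
): boolean {
  const recent = outcomes.slice(-windowSize);
  if (recent.length === 0) return false; // no data yet, no verdict
  return baseline - passRate(recent) > threshold;
}
```

A caller would run this after each new session and trigger a rollback only when it returns `true`, so an empty or very short history never causes one.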
|
|
117
77
|
|
|
118
|
-
**
|
|
119
|
-
```bash
|
|
120
|
-
selftune ingest-opencode
|
|
121
|
-
```
|
|
78
|
+
**Automate** — Run `selftune cron setup` to install OS-level scheduling. selftune syncs, evaluates, evolves, and watches on a schedule — no manual intervention needed.
|
|
122
79
|
|
|
123
|
-
|
|
80
|
+
## What's New in v0.2.0
|
|
124
81
|
|
|
125
|
-
|
|
82
|
+
- **Full skill body evolution** — Beyond descriptions: evolve routing tables and entire skill bodies using teacher-student model with structural, trigger, and quality gates
|
|
83
|
+
- **Synthetic eval generation** — `selftune eval generate --synthetic` generates eval sets from SKILL.md via LLM, no session logs needed. Solves cold-start: new skills get evals immediately.
|
|
84
|
+
- **Cheap-loop evolution** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate. ~80% cost reduction.
|
|
85
|
+
- **Batch trigger validation** — Validation now batches 10 queries per LLM call instead of one-per-query. ~10x faster evolution loops.
|
|
86
|
+
- **Per-stage model control** — `--validation-model`, `--proposal-model`, and `--gate-model` flags give fine-grained control over which model runs each evolution stage.
|
|
87
|
+
- **Auto-activation system** — Hooks detect when selftune should run and suggest actions
|
|
88
|
+
- **Enforcement guardrails** — Blocks SKILL.md edits on monitored skills unless `selftune watch` has been run
|
|
89
|
+
- **Live dashboard server** — `selftune dashboard --serve` with SSE auto-refresh and action buttons
|
|
90
|
+
- **Evolution memory** — Persists context, plans, and decisions across context resets
|
|
91
|
+
- **4 specialized agents** — Diagnosis analyst, pattern analyst, evolution reviewer, integration guide
|
|
92
|
+
- **Sandbox test harness** — Comprehensive automated test coverage, including devcontainer-based LLM testing
|
|
126
93
|
|
|
127
94
|
## Commands
|
|
128
95
|
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
|
96
|
+
Your agent runs these — you just say what you want ("improve my skills", "show the dashboard").
|
|
97
|
+
|
|
98
|
+
| Group | Command | What it does |
|
|
99
|
+
|-------|---------|-------------|
|
|
100
|
+
| | `selftune status` | See which skills are undertriggering and why |
|
|
101
|
+
| | `selftune orchestrate` | Run the full autonomous loop (sync → evolve → watch) |
|
|
102
|
+
| | `selftune dashboard` | Open the visual skill health dashboard |
|
|
103
|
+
| | `selftune doctor` | Health check: logs, hooks, config, permissions |
|
|
104
|
+
| **ingest** | `selftune ingest claude` | Backfill from Claude Code transcripts |
|
|
105
|
+
| | `selftune ingest codex` | Import Codex rollout logs (experimental) |
|
|
106
|
+
| **grade** | `selftune grade --skill <name>` | Grade a skill session with evidence |
|
|
107
|
+
| | `selftune grade baseline --skill <name>` | Measure skill value vs no-skill baseline |
|
|
108
|
+
| **evolve** | `selftune evolve --skill <name>` | Propose, validate, and deploy improved descriptions |
|
|
109
|
+
| | `selftune evolve body --skill <name>` | Evolve full skill body or routing table |
|
|
110
|
+
| | `selftune evolve rollback --skill <name>` | Roll back a previous evolution |
|
|
111
|
+
| **eval** | `selftune eval generate --skill <name>` | Generate eval sets (`--synthetic` for cold-start) |
|
|
112
|
+
| | `selftune eval unit-test --skill <name>` | Run or generate skill-level unit tests |
|
|
113
|
+
| | `selftune eval composability --skill <name>` | Detect conflicts between co-occurring skills |
|
|
114
|
+
| | `selftune eval import` | Import external eval corpus from [SkillsBench](https://github.com/benchflow-ai/skillsbench) |
|
|
115
|
+
| **auto** | `selftune cron setup` | Install OS-level scheduling (cron/launchd/systemd) |
|
|
116
|
+
| | `selftune watch --skill <name>` | Monitor after deploy. Auto-rollback on regression. |
|
|
117
|
+
|
|
118
|
+
Full command reference: `selftune --help`
|
|
119
|
+
|
|
120
|
+
## Why Not Just Rewrite Skills Manually?
|
|
121
|
+
|
|
122
|
+
| Approach | Problem |
|
|
134
123
|
|---|---|
|
|
135
|
-
|
|
|
136
|
-
|
|
|
137
|
-
|
|
|
138
|
-
|
|
|
139
|
-
| `evolve --skill <name> --skill-path <path>` | Analyze failures, propose and deploy improved description |
|
|
140
|
-
| `rollback --skill <name> --skill-path <path>` | Restore pre-evolution description |
|
|
141
|
-
| `watch --skill <name> --skill-path <path>` | Monitor post-deploy pass rates, detect regressions |
|
|
142
|
-
| `status` | Show skill health summary (pass rates, trends, missed queries) |
|
|
143
|
-
| `last` | Show quick insight from the most recent session |
|
|
144
|
-
| `doctor` | Health checks on logs, hooks, config, and schema |
|
|
145
|
-
| `dashboard` | Open skill-health-centric HTML dashboard in browser |
|
|
146
|
-
| `ingest-codex` | Batch ingest Codex rollout logs |
|
|
147
|
-
| `ingest-opencode` | Backfill historical OpenCode sessions from SQLite |
|
|
148
|
-
| `wrap-codex -- <args>` | Real-time Codex wrapper with telemetry |
|
|
149
|
-
|
|
150
|
-
No separate API key required — grading and evolution use whatever agent CLI you already have installed (Claude Code, Codex, or OpenCode).
|
|
151
|
-
|
|
152
|
-
See `skill/Workflows/` for detailed step-by-step guides for each command.
|
|
153
|
-
|
|
154
|
-
---
|
|
155
|
-
|
|
156
|
-
## How It Works
|
|
157
|
-
|
|
158
|
-
### Telemetry Capture
|
|
159
|
-
|
|
160
|
-
```
|
|
161
|
-
Claude Code (hooks): OpenCode (hooks):
|
|
162
|
-
UserPromptSubmit → prompt-log.ts message.* → opencode-prompt-log.ts
|
|
163
|
-
PostToolUse → skill-eval.ts tool.execute.after → opencode-skill-eval.ts
|
|
164
|
-
Stop → session-stop.ts session.idle → opencode-session-stop.ts
|
|
165
|
-
│ │
|
|
166
|
-
└──────────┬─────────────────────────┘
|
|
167
|
-
▼
|
|
168
|
-
Shared JSONL Log Schema (~/.claude/)
|
|
169
|
-
├── all_queries_log.jsonl
|
|
170
|
-
├── skill_usage_log.jsonl
|
|
171
|
-
└── session_telemetry_log.jsonl
|
|
172
|
-
|
|
173
|
-
Codex (wrapper/ingestor — hooks not yet available):
|
|
174
|
-
codex-wrapper.ts (real-time tee of JSONL stream)
|
|
175
|
-
codex-rollout.ts (batch ingest from rollout logs)
|
|
176
|
-
│
|
|
177
|
-
└──→ Same shared JSONL schema
|
|
178
|
-
```
|
|
179
|
-
|
|
180
|
-
### Eval & Grading
|
|
181
|
-
|
|
182
|
-
```
|
|
183
|
-
selftune evals cross-references the two query logs:
|
|
184
|
-
Positives = skill_usage_log entries for target skill
|
|
185
|
-
Negatives = all_queries_log entries NOT in positives
|
|
186
|
-
|
|
187
|
-
selftune grade reads:
|
|
188
|
-
session_telemetry_log → process metrics (tool calls, errors, turns)
|
|
189
|
-
transcript JSONL → what actually happened
|
|
190
|
-
expectations → what should have happened
|
|
191
|
-
```
|
|
192
|
-
|
|
193
|
-
### Evolution Loop
|
|
194
|
-
|
|
195
|
-
```
|
|
196
|
-
selftune evolve:
|
|
197
|
-
1. Load eval set (or generate from logs)
|
|
198
|
-
2. Extract failure patterns (missed queries grouped by invocation type)
|
|
199
|
-
3. Generate improved description via LLM
|
|
200
|
-
4. Validate against eval set (must improve, <5% regression)
|
|
201
|
-
5. Deploy updated SKILL.md + PR + audit trail
|
|
202
|
-
|
|
203
|
-
selftune watch:
|
|
204
|
-
Monitor pass rate over sliding window of recent sessions
|
|
205
|
-
Alert (or auto-rollback) on regression > threshold
|
|
206
|
-
```
|
|
207
|
-
|
|
208
|
-
---
|
|
209
|
-
|
|
210
|
-
## Architecture
|
|
211
|
-
|
|
212
|
-
```
|
|
213
|
-
cli/selftune/
|
|
214
|
-
├── index.ts CLI entry point (command router)
|
|
215
|
-
├── init.ts Agent detection, config bootstrap
|
|
216
|
-
├── types.ts, constants.ts Shared interfaces and constants
|
|
217
|
-
├── observability.ts Health checks (doctor command)
|
|
218
|
-
├── status.ts Skill health summary (status command)
|
|
219
|
-
├── last.ts Last session insight (last command)
|
|
220
|
-
├── dashboard.ts HTML dashboard builder (dashboard command)
|
|
221
|
-
├── utils/ JSONL, transcript parsing, LLM calls, schema validation
|
|
222
|
-
├── hooks/ Claude Code + OpenCode telemetry capture
|
|
223
|
-
├── ingestors/ Codex adapters + OpenCode backfill
|
|
224
|
-
├── eval/ False negative detection, eval set generation
|
|
225
|
-
├── grading/ 3-tier session grading (agent or API mode)
|
|
226
|
-
├── evolution/ Failure extraction, proposal, validation, deploy, rollback
|
|
227
|
-
└── monitoring/ Post-deploy regression detection
|
|
228
|
-
|
|
229
|
-
dashboard/
|
|
230
|
-
└── index.html Skill-health-centric HTML dashboard template
|
|
231
|
-
|
|
232
|
-
skill/
|
|
233
|
-
├── SKILL.md Routing table (~120 lines)
|
|
234
|
-
├── settings_snippet.json Claude Code hook config template
|
|
235
|
-
├── references/ Domain knowledge (logs, grading methodology, taxonomy)
|
|
236
|
-
└── Workflows/ Step-by-step guides (1 per command)
|
|
237
|
-
```
|
|
238
|
-
|
|
239
|
-
Dependencies flow forward only: `shared → hooks/ingestors → eval → grading → evolution → monitoring`. Enforced by `lint-architecture.ts`.
|
|
240
|
-
|
|
241
|
-
Config persists at `~/.selftune/config.json` (written by `init`, read by all commands via skill workflows).
|
|
242
|
-
|
|
243
|
-
See [ARCHITECTURE.md](ARCHITECTURE.md) for the full domain map and module rules.
|
|
244
|
-
|
|
245
|
-
---
|
|
246
|
-
|
|
247
|
-
## Log Schema
|
|
248
|
-
|
|
249
|
-
Four append-only JSONL files at `~/.claude/`:
|
|
250
|
-
|
|
251
|
-
| File | Record type | Key fields |
|
|
252
|
-
|---|---|---|
|
|
253
|
-
| `all_queries_log.jsonl` | `QueryLogRecord` | `timestamp`, `session_id`, `query`, `source?` |
|
|
254
|
-
| `skill_usage_log.jsonl` | `SkillUsageRecord` | `timestamp`, `session_id`, `skill_name`, `query`, `triggered` |
|
|
255
|
-
| `session_telemetry_log.jsonl` | `SessionTelemetryRecord` | `timestamp`, `session_id`, `tool_calls`, `bash_commands`, `skills_triggered`, `errors_encountered` |
|
|
256
|
-
| `evolution_audit_log.jsonl` | `EvolutionAuditEntry` | `timestamp`, `proposal_id`, `action`, `details`, `eval_snapshot?` |
|
|
257
|
-
|
|
258
|
-
The `source` field identifies the platform: `claude_code`, `codex`, or `opencode`.
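As a rough illustration of how these records might be consumed, the sketch below types two of the record shapes and parses JSONL text. The field names come from the table above; the field types (for example, `timestamp` as a string and `triggered` as a boolean) and the helper functions `parseJsonl`, `queriesFrom`, and `triggerRate` are assumptions for this example, not selftune's exported API.

```typescript
// Platform values as listed for the `source` field.
type Source = "claude_code" | "codex" | "opencode";

// Field names from the table; field types are assumed.
interface QueryLogRecord {
  timestamp: string;
  session_id: string;
  query: string;
  source?: Source;
}

interface SkillUsageRecord {
  timestamp: string;
  session_id: string;
  skill_name: string;
  query: string;
  triggered: boolean;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl<T>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

// Keep only the queries recorded for one platform.
function queriesFrom(records: QueryLogRecord[], source: Source): QueryLogRecord[] {
  return records.filter((r) => r.source === source);
}

// Share of logged usages for a skill where it actually triggered.
function triggerRate(records: SkillUsageRecord[], skillName: string): number {
  const forSkill = records.filter((r) => r.skill_name === skillName);
  if (forSkill.length === 0) return 0;
  return forSkill.filter((r) => r.triggered).length / forSkill.length;
}
```

Because the files are append-only, a reader like this can re-parse the whole file or process only the lines added since the last read.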
|
|
259
|
-
|
|
260
|
-
---
|
|
124
|
+
| Rewrite the description yourself | No data on how users actually talk. No validation. No regression detection. |
|
|
125
|
+
| Add "ALWAYS invoke when..." directives | Brittle. One agent rewrite away from breaking. |
|
|
126
|
+
| Force-load skills on every prompt | Doesn't fix the description. Expensive band-aid. |
|
|
127
|
+
| **selftune** | Learns from real usage, rewrites descriptions to match how you work, validates against eval sets, and automatically rolls back on regressions. |
|
|
261
128
|
|
|
262
|
-
##
|
|
129
|
+
## Different Layer, Different Problem
|
|
263
130
|
|
|
264
|
-
|
|
265
|
-
make check # lint + architecture lint + all tests
|
|
266
|
-
make lint # biome check + architecture lint
|
|
267
|
-
make test # bun test
|
|
268
|
-
```
|
|
131
|
+
LLM observability tools trace API calls. Infrastructure tools monitor servers. Neither knows whether the right skill fired for the right person. selftune does — and fixes it automatically.
|
|
269
132
|
|
|
270
|
-
|
|
133
|
+
selftune is complementary to these tools, not competitive. They trace what happens inside the LLM. selftune makes sure the right skill is called in the first place.
|
|
271
134
|
|
|
272
|
-
|
|
135
|
+
| Dimension | selftune | Langfuse | LangSmith | OpenLIT |
|
|
136
|
+
|-----------|----------|----------|-----------|---------|
|
|
137
|
+
| **Layer** | Skill-specific | LLM call | Agent trace | Infrastructure |
|
|
138
|
+
| **Detects** | Missed triggers, false negatives, skill conflicts | Token usage, latency | Chain failures | System metrics |
|
|
139
|
+
| **Improves** | Descriptions, body, and routing automatically | — | — | — |
|
|
140
|
+
| **Setup** | Zero deps, zero API keys | Self-host or cloud | Cloud required | Helm chart |
|
|
141
|
+
| **Price** | Free (MIT) | Freemium | Paid | Free |
|
|
142
|
+
| **Unique** | Self-improving skills + auto-rollback | Prompt management | Evaluations | Dashboards |
|
|
273
143
|
|
|
274
|
-
##
|
|
144
|
+
## Platforms
|
|
275
145
|
|
|
276
|
-
|
|
277
|
-
- Let logs accumulate over several days before running evals — more diverse real queries = more reliable signal.
|
|
278
|
-
- All hooks are silent (exit 0) and take <50ms. Negligible overhead.
|
|
279
|
-
- Logs are append-only JSONL. Safe to delete to start fresh, or archive old files.
|
|
280
|
-
- Use `--max 75` to increase eval set size once you have enough data.
|
|
281
|
-
- Use `--seed 123` for a different random sample of negatives.
|
|
282
|
-
- Use `--dry-run` with `evolve` to preview proposals without deploying.
|
|
283
|
-
- The `doctor` command checks log health, hook presence, config status, and schema validity.
|
|
146
|
+
**Claude Code** (fully supported) — Hooks install automatically. `selftune ingest claude` backfills existing transcripts. This is the primary supported platform.
|
|
284
147
|
|
|
285
|
-
|
|
148
|
+
**Codex** (experimental) — `selftune ingest wrap-codex -- <args>` or `selftune ingest codex`. Adapter exists but is not actively tested.
|
|
286
149
|
|
|
287
|
-
|
|
150
|
+
**OpenCode** (experimental) — `selftune ingest opencode`. Adapter exists but is not actively tested.
|
|
288
151
|
|
|
289
|
-
|
|
152
|
+
**OpenClaw** (experimental) — `selftune ingest openclaw` + `selftune cron setup` for autonomous evolution. Adapter exists but is not actively tested.
|
|
290
153
|
|
|
291
|
-
|
|
154
|
+
Requires [Bun](https://bun.sh) or Node.js 18+. No extra API keys.
|
|
292
155
|
|
|
293
156
|
---
|
|
294
157
|
|
|
295
|
-
|
|
296
|
-
|
|
297
|
-
To report a vulnerability, see [SECURITY.md](SECURITY.md).
|
|
158
|
+
<div align="center">
|
|
298
159
|
|
|
299
|
-
|
|
300
|
-
|
|
301
|
-
## Sponsor
|
|
302
|
-
|
|
303
|
-
If selftune saves you time, consider [sponsoring the project](https://github.com/sponsors/WellDunDun).
|
|
304
|
-
|
|
305
|
-
---
|
|
160
|
+
[Architecture](ARCHITECTURE.md) · [Contributing](CONTRIBUTING.md) · [Security](SECURITY.md) · [Integration Guide](docs/integration-guide.md) · [Sponsor](https://github.com/sponsors/WellDunDun)
|
|
306
161
|
|
|
307
|
-
|
|
162
|
+
MIT licensed. Free forever. Primary support for Claude Code; experimental adapters for Codex, OpenCode, and OpenClaw.
|
|
308
163
|
|
|
309
|
-
|
|
310
|
-
|---|---|---|
|
|
311
|
-
| v0.1 | Hooks, ingestors, shared schema, eval generation | Done |
|
|
312
|
-
| v0.2 | Session grading, grader skill | Done |
|
|
313
|
-
| v0.3 | Evolution loop (propose, validate, deploy, rollback) | Done |
|
|
314
|
-
| v0.4 | Post-deploy monitoring, regression detection | Done |
|
|
315
|
-
| v0.5 | Agent-first skill restructure, `init` command, config bootstrap | Done |
|
|
316
|
-
| v0.6 | Three-layer observability: `status`, `last`, redesigned dashboard | Done |
|
|
164
|
+
</div>
|
|
Binary file
|
|
Binary file
|