selftune 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. package/CHANGELOG.md +23 -0
  2. package/README.md +259 -0
  3. package/bin/selftune.cjs +29 -0
  4. package/cli/selftune/constants.ts +71 -0
  5. package/cli/selftune/eval/hooks-to-evals.ts +422 -0
  6. package/cli/selftune/evolution/audit.ts +44 -0
  7. package/cli/selftune/evolution/deploy-proposal.ts +244 -0
  8. package/cli/selftune/evolution/evolve.ts +406 -0
  9. package/cli/selftune/evolution/extract-patterns.ts +145 -0
  10. package/cli/selftune/evolution/propose-description.ts +146 -0
  11. package/cli/selftune/evolution/rollback.ts +242 -0
  12. package/cli/selftune/evolution/stopping-criteria.ts +69 -0
  13. package/cli/selftune/evolution/validate-proposal.ts +137 -0
  14. package/cli/selftune/grading/grade-session.ts +459 -0
  15. package/cli/selftune/hooks/prompt-log.ts +52 -0
  16. package/cli/selftune/hooks/session-stop.ts +54 -0
  17. package/cli/selftune/hooks/skill-eval.ts +73 -0
  18. package/cli/selftune/index.ts +104 -0
  19. package/cli/selftune/ingestors/codex-rollout.ts +416 -0
  20. package/cli/selftune/ingestors/codex-wrapper.ts +332 -0
  21. package/cli/selftune/ingestors/opencode-ingest.ts +565 -0
  22. package/cli/selftune/init.ts +297 -0
  23. package/cli/selftune/monitoring/watch.ts +328 -0
  24. package/cli/selftune/observability.ts +255 -0
  25. package/cli/selftune/types.ts +255 -0
  26. package/cli/selftune/utils/jsonl.ts +75 -0
  27. package/cli/selftune/utils/llm-call.ts +192 -0
  28. package/cli/selftune/utils/logging.ts +40 -0
  29. package/cli/selftune/utils/schema-validator.ts +47 -0
  30. package/cli/selftune/utils/seeded-random.ts +31 -0
  31. package/cli/selftune/utils/transcript.ts +260 -0
  32. package/package.json +29 -0
  33. package/skill/SKILL.md +120 -0
  34. package/skill/Workflows/Doctor.md +145 -0
  35. package/skill/Workflows/Evals.md +193 -0
  36. package/skill/Workflows/Evolve.md +159 -0
  37. package/skill/Workflows/Grade.md +157 -0
  38. package/skill/Workflows/Ingest.md +159 -0
  39. package/skill/Workflows/Initialize.md +125 -0
  40. package/skill/Workflows/Rollback.md +131 -0
  41. package/skill/Workflows/Watch.md +128 -0
  42. package/skill/references/grading-methodology.md +176 -0
  43. package/skill/references/invocation-taxonomy.md +144 -0
  44. package/skill/references/logs.md +168 -0
  45. package/skill/settings_snippet.json +41 -0
package/skill/Workflows/Grade.md
@@ -0,0 +1,157 @@
1
+ # selftune Grade Workflow
2
+
3
+ Grade a completed skill session against expectations. Produces `grading.json`
4
+ with a 3-tier evaluation covering trigger, process, and quality.
5
+
6
+ ## Default Command
7
+
8
+ ```bash
9
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
10
+ bun run "$CLI_PATH" grade --skill <name> [options]
11
+ ```
12
+
13
+ Fallback:
14
+ ```bash
15
+ bun run <repo-path>/cli/selftune/index.ts grade --skill <name> [options]
16
+ ```
17
+
18
+ ## Options
19
+
20
+ | Flag | Description | Default |
21
+ |------|-------------|---------|
22
+ | `--skill <name>` | Skill name to grade | Required |
23
+ | `--expectations "..."` | Explicit expectations (semicolon-separated) | Auto-derived |
24
+ | `--evals-json <path>` | Pre-built eval set JSON file | None |
25
+ | `--eval-id <n>` | Specific eval ID to grade from the eval set | None |
26
+ | `--use-agent` | Grade via agent subprocess (no API key needed) | Off (uses API) |
27
+
28
+ ## Output Format
29
+
30
+ The command produces `grading.json`. See `references/grading-methodology.md`
31
+ for the full schema. Key fields:
32
+
33
+ ```json
34
+ {
35
+ "session_id": "abc123",
36
+ "skill_name": "pptx",
37
+ "transcript_path": "~/.claude/projects/.../abc123.jsonl",
38
+ "graded_at": "2026-02-28T12:00:00Z",
39
+ "expectations": [
40
+ { "text": "...", "passed": true, "evidence": "..." }
41
+ ],
42
+ "summary": { "passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67 },
43
+ "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 },
44
+ "claims": [
45
+ { "claim": "...", "type": "factual", "verified": true, "evidence": "..." }
46
+ ],
47
+ "eval_feedback": { "suggestions": [], "overall": "..." }
48
+ }
49
+ ```
50
+
51
+ ## Parsing Instructions
52
+
53
+ ### Get Pass Rate
54
+
55
+ ```bash
56
+ # Parse: .summary.pass_rate (float 0-1)
57
+ # Parse: .summary.passed / .summary.total
58
+ ```
59
+
60
+ ### Find Failed Expectations
61
+
62
+ ```bash
63
+ # Parse: .expectations[] | select(.passed == false) | .text
64
+ ```
65
+
66
+ ### Extract Claims
67
+
68
+ ```bash
69
+ # Parse: .claims[] | { claim, type, verified }
70
+ ```
71
+
72
+ ### Get Eval Feedback
73
+
74
+ ```bash
75
+ # Parse: .eval_feedback.suggestions[].reason
76
+ # Parse: .eval_feedback.overall
77
+ ```
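The four parsing recipes above can be sketched in Python as well. This is only an illustration under assumptions: the helper name `read_grading` is hypothetical, and it relies solely on the fields shown in the "Key fields" example plus the `reason` key named in the feedback recipe.

```python
import json


def read_grading(grading: dict) -> dict:
    """Pull the fields named in the parsing recipes out of a grading dict."""
    feedback = grading.get("eval_feedback", {})
    return {
        # .summary.pass_rate (float 0-1)
        "pass_rate": grading["summary"]["pass_rate"],
        # .expectations[] | select(.passed == false) | .text
        "failed": [e["text"] for e in grading["expectations"] if e["passed"] is False],
        # .claims[] | { claim, type, verified }
        "claims": [
            {"claim": c["claim"], "type": c["type"], "verified": c["verified"]}
            for c in grading.get("claims", [])
        ],
        # .eval_feedback.suggestions[].reason and .eval_feedback.overall
        "suggestion_reasons": [s.get("reason") for s in feedback.get("suggestions", [])],
        "overall": feedback.get("overall"),
    }


def load_grading(path: str = "grading.json") -> dict:
    """Read grading.json from disk and return the extracted fields."""
    with open(path, encoding="utf-8") as fh:
        return read_grading(json.load(fh))
```

Calling `load_grading()` in the directory where the command ran should return one dict holding everything needed for the summary step.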
78
+
79
+ ## Steps
80
+
81
+ ### 1. Find the Session
82
+
83
+ Read `~/.claude/session_telemetry_log.jsonl` and find the most recent entry
84
+ where `skills_triggered` contains the target skill name.
85
+
86
+ Note the `transcript_path`, `tool_calls`, `errors_encountered`, and
87
+ `session_id` fields. See `references/logs.md` for the telemetry format.
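A minimal sketch of this lookup, assuming the telemetry log is append-only (so the last matching line is the most recent session) and that `skills_triggered` is a list of skill names. The function name `find_latest_session` is hypothetical and not part of the selftune CLI.

```python
import json


def find_latest_session(log_path: str, skill: str):
    """Return the last telemetry entry whose skills_triggered contains skill."""
    latest = None
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting
            if isinstance(entry, dict) and skill in (entry.get("skills_triggered") or []):
                latest = entry  # later lines overwrite earlier matches
    return latest
```

The returned entry carries the `transcript_path`, `tool_calls`, `errors_encountered`, and `session_id` fields used in the following steps; `None` means the skill never triggered in the log.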
88
+
89
+ ### 2. Read the Transcript
90
+
91
+ Parse the JSONL file at `transcript_path`. Identify:
92
+ - User messages (what was asked)
93
+ - Assistant tool calls (what the agent did)
94
+ - Tool results (what happened)
95
+ - Error patterns (what went wrong)
96
+
97
+ See `references/logs.md` for transcript format variants.
98
+
99
+ ### 3. Determine Expectations
100
+
101
+ If the user provided `--expectations`, parse the semicolon-separated list.
102
+ Otherwise, derive defaults. See `references/grading-methodology.md` for the
103
+ full default expectations list.
104
+
105
+ Always include at least one Process expectation and one Quality expectation.
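Splitting the `--expectations` value could look like the sketch below. The helper name is hypothetical, and dropping empty items and trimming whitespace are assumptions about reasonable behavior rather than documented CLI semantics.

```python
def parse_expectations(raw: str) -> list[str]:
    """Split a semicolon-separated expectations string into clean items."""
    return [part.strip() for part in raw.split(";") if part.strip()]
```

For example, `parse_expectations("expect1; expect2;;expect3")` yields three items, ignoring the empty one between the doubled semicolons.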
106
+
107
+ ### 4. Grade Each Expectation
108
+
109
+ For each expectation, search both the telemetry record and the transcript
110
+ for evidence. Mark as:
111
+ - **PASS** if evidence exists and supports the expectation
112
+ - **FAIL** if evidence is absent or contradicts the expectation
113
+
114
+ Cite specific evidence: transcript line numbers, tool call names, bash output.
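Once every expectation has a `passed` flag, the `summary` object follows mechanically. The sketch below assumes the pass rate is rounded to two decimals, which matches the `0.67` in the example output but is not confirmed by the source; `summarize` is a hypothetical name.

```python
def summarize(expectations: list[dict]) -> dict:
    """Build the summary block from graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e.get("passed") is True)
    # Guard against an empty list; rounding to 2 places is an assumption.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"passed": passed, "failed": total - passed, "total": total, "pass_rate": pass_rate}
```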
115
+
116
+ ### 5. Extract Implicit Claims
117
+
118
+ Pull 2-4 claims from the transcript that are not covered by the explicit
119
+ expectations. Classify each as factual, process, or quality. Verify each
120
+ against the transcript. See `references/grading-methodology.md` for claim
121
+ types and examples.
122
+
123
+ ### 6. Flag Eval Gaps
124
+
125
+ Review each passed expectation. If it would also pass for wrong output,
126
+ note it in `eval_feedback.suggestions`. See `references/grading-methodology.md`
127
+ for gap flagging criteria.
128
+
129
+ ### 7. Write grading.json
130
+
131
+ Write the full grading result to `grading.json` in the current directory.
132
+
133
+ ### 8. Summarize
134
+
135
+ Report to the user:
136
+ - Pass rate (e.g., "2/3 passed, 67%")
137
+ - Failed expectations with evidence
138
+ - Notable claims
139
+ - Top eval feedback suggestion
140
+
141
+ Keep the summary concise. The full details are in `grading.json`.
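As an illustration of the report shape, the sketch below renders the headline and the failed expectations from a grading dict. The function name and the exact wording of each line are assumptions; only the "2/3 passed, 67%" form comes from the text above.

```python
def format_report(grading: dict) -> str:
    """Render a short, human-readable summary of a grading result."""
    s = grading["summary"]
    lines = [f"{s['passed']}/{s['total']} passed, {s['pass_rate']:.0%}"]
    for e in grading.get("expectations", []):
        if not e.get("passed"):
            lines.append(f"FAILED: {e['text']} (evidence: {e.get('evidence', 'none')})")
    return "\n".join(lines)
```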
142
+
143
+ ## Common Patterns
144
+
145
+ **"Grade my last pptx session"**
146
+ > Find the most recent telemetry entry for `pptx`. Use default expectations.
147
+ > Ask if the user wants custom expectations or proceed with defaults.
148
+
149
+ **"Grade with these specific expectations"**
150
+ > Pass `--expectations "expect1;expect2;expect3"` to override defaults.
151
+
152
+ **"Grade using an eval set"**
153
+ > Pass `--evals-json path/to/evals.json` and optionally `--eval-id N`
154
+ > to grade a specific eval scenario.
155
+
156
+ **"I don't have an API key"**
157
+ > Use `--use-agent` to grade via agent subprocess instead of direct API.
package/skill/Workflows/Ingest.md
@@ -0,0 +1,159 @@
1
+ # selftune Ingest Workflow
2
+
3
+ Import sessions from non-Claude-Code agent platforms into the shared
4
+ selftune log format. Covers three sub-commands: `ingest-codex`,
5
+ `ingest-opencode`, and `wrap-codex`.
6
+
7
+ ## When to Use Each
8
+
9
+ | Sub-command | Platform | Mode | When |
10
+ |-------------|----------|------|------|
11
+ | `ingest-codex` | Codex | Batch | Import existing Codex rollout logs |
12
+ | `ingest-opencode` | OpenCode | Batch | Import existing OpenCode sessions |
13
+ | `wrap-codex` | Codex | Real-time | Wrap `codex exec` to capture telemetry live |
14
+
15
+ ---
16
+
17
+ ## ingest-codex
18
+
19
+ Batch ingest Codex rollout logs into the shared JSONL schema.
20
+
21
+ ### Default Command
22
+
23
+ ```bash
24
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
25
+ bun run "$CLI_PATH" ingest-codex
26
+ ```
27
+
28
+ Fallback:
29
+ ```bash
30
+ bun run <repo-path>/cli/selftune/index.ts ingest-codex
31
+ ```
32
+
33
+ ### Options
34
+
35
+ None. Reads from the standard Codex session directory.
36
+
37
+ ### Source
38
+
39
+ Reads from `$CODEX_HOME/sessions/` directory. Expects the Codex rollout
40
+ JSONL format. See `references/logs.md` for the Codex rollout format.
41
+
42
+ ### Output
43
+
44
+ Writes to:
45
+ - `~/.claude/all_queries_log.jsonl` -- extracted user queries
46
+ - `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "codex_rollout"`
47
+
48
+ ### Steps
49
+
50
+ 1. Verify `$CODEX_HOME/sessions/` directory exists and contains session files
51
+ 2. Run `ingest-codex`
52
+ 3. Verify entries were written by checking log file line counts
53
+ 4. Run `doctor` to confirm logs are healthy
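Step 3 can be checked by counting lines before and after the ingest run. The sketch below is one way to do that; the helper names are hypothetical, and treating a missing file as zero lines is an assumption.

```python
from pathlib import Path

LOGS = ["~/.claude/all_queries_log.jsonl", "~/.claude/session_telemetry_log.jsonl"]


def count_lines(path: str) -> int:
    """Count lines in a log file; a missing file counts as 0."""
    p = Path(path).expanduser()
    if not p.exists():
        return 0
    with p.open("rb") as fh:
        return sum(1 for _ in fh)


def line_counts(paths=LOGS) -> dict:
    """Map each log path to its current line count."""
    return {path: count_lines(path) for path in paths}
```

Call `line_counts()` once before running `ingest-codex` and once after; any path whose count did not grow received no new entries.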
54
+
55
+ ---
56
+
57
+ ## ingest-opencode
58
+
59
+ Ingest OpenCode sessions from the SQLite database.
60
+
61
+ ### Default Command
62
+
63
+ ```bash
64
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
65
+ bun run "$CLI_PATH" ingest-opencode
66
+ ```
67
+
68
+ Fallback:
69
+ ```bash
70
+ bun run <repo-path>/cli/selftune/index.ts ingest-opencode
71
+ ```
72
+
73
+ ### Options
74
+
75
+ None. Auto-discovers the database location.
76
+
77
+ ### Source
78
+
79
+ Primary: `~/.local/share/opencode/opencode.db` (SQLite database)
80
+ Fallback: Legacy JSON session files in the OpenCode data directory
81
+
82
+ See `references/logs.md` for the OpenCode message format.
83
+
84
+ ### Output
85
+
86
+ Writes to:
87
+ - `~/.claude/all_queries_log.jsonl` -- extracted user queries
88
+ - `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "opencode"` or `"opencode_json"`
89
+
90
+ ### Steps
91
+
92
+ 1. Verify the OpenCode database exists at the expected path
93
+ 2. Run `ingest-opencode`
94
+ 3. Verify entries were written by checking log file line counts
95
+ 4. Run `doctor` to confirm logs are healthy
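Because each ingestor tags its telemetry entries with a `source` value, tallying entries by source is another way to confirm that the ingest wrote something. This is a sketch with a hypothetical helper name; entries without a `source` field are grouped under `"unknown"` as an assumption.

```python
import json
from collections import Counter


def count_by_source(log_path: str) -> Counter:
    """Tally telemetry entries by their source field."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore unparseable lines
            if isinstance(entry, dict):
                counts[entry.get("source", "unknown")] += 1
    return counts
```

After `ingest-opencode`, the count for `"opencode"` or `"opencode_json"` should be greater than zero.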
96
+
97
+ ---
98
+
99
+ ## wrap-codex
100
+
101
+ Wrap `codex exec` with real-time telemetry capture. Drop-in replacement
102
+ that tees the JSONL stream while passing through to Codex.
103
+
104
+ ### Default Command
105
+
106
+ ```bash
107
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
108
+ bun run "$CLI_PATH" wrap-codex -- <your codex args>
109
+ ```
110
+
111
+ Fallback:
112
+ ```bash
113
+ bun run <repo-path>/cli/selftune/index.ts wrap-codex -- <your codex args>
114
+ ```
115
+
116
+ ### Usage
117
+
118
+ Everything after `--` is passed directly to `codex exec`:
119
+
120
+ ```bash
121
+ bun run "$CLI_PATH" wrap-codex -- --model o3 "Fix the failing tests"
122
+ ```
123
+
124
+ ### Output
125
+
126
+ Writes to:
127
+ - `~/.claude/all_queries_log.jsonl` -- the user query
128
+ - `~/.claude/session_telemetry_log.jsonl` -- session metrics with `source: "codex"`
129
+
130
+ The Codex output is passed through unchanged. The wrapper only tees the
131
+ stream for telemetry; it does not modify Codex behavior.
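The tee idea can be illustrated in a few lines. This is a generic sketch, not the actual `wrap-codex` implementation: it runs an arbitrary command, echoes each stdout line unchanged, and keeps a copy for later processing. The function name `run_and_tee` is hypothetical.

```python
import subprocess
import sys


def run_and_tee(cmd: list[str], sink: list[str]) -> int:
    """Run cmd, pass its stdout through unchanged, and append a copy to sink."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    if proc.stdout is None:
        raise RuntimeError("stdout pipe was not created")
    for line in proc.stdout:
        sys.stdout.write(line)  # pass-through: the caller sees the original output
        sink.append(line)       # captured copy for telemetry
    return proc.wait()
```

The key design point is that the wrapped process's output is written out before anything else happens to it, so downstream consumers see exactly what the unwrapped command would have produced.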
132
+
133
+ ### Steps
134
+
135
+ 1. Build the wrap-codex command with the desired Codex arguments
136
+ 2. Run the command (replaces `codex exec` in your workflow)
137
+ 3. Session telemetry is captured automatically
138
+ 4. Verify with `doctor` after first use
139
+
140
+ ---
141
+
142
+ ## Common Patterns
143
+
144
+ **"Ingest codex logs"**
145
+ > Run `ingest-codex`. No options needed. Reads from `$CODEX_HOME/sessions/`.
146
+
147
+ **"Import opencode sessions"**
148
+ > Run `ingest-opencode`. Reads from the SQLite database automatically.
149
+
150
+ **"Run codex through selftune"**
151
+ > Use `wrap-codex -- <codex args>` instead of `codex exec <args>` directly.
152
+
153
+ **"Batch ingest vs real-time"**
154
+ > Use `ingest-codex` or `ingest-opencode` for historical sessions.
155
+ > Use `wrap-codex` for ongoing sessions. Both produce the same log format.
156
+
157
+ **"How do I know it worked?"**
158
+ > Run `doctor` after ingestion. Check that log files exist and are parseable.
159
+ > Run `evals --list-skills` to see if the ingested sessions appear.
package/skill/Workflows/Initialize.md
@@ -0,0 +1,125 @@
1
+ # selftune Initialize Workflow
2
+
3
+ Bootstrap selftune for first-time use or after changing environments.
4
+
5
+ ## When to Use
6
+
7
+ - First time using selftune in a new environment
8
+ - After switching agent platforms (Claude Code, Codex, OpenCode)
9
+ - After reinstalling or moving the selftune repository
10
+ - When `~/.selftune/config.json` does not exist
11
+
12
+ ## Default Command
13
+
14
+ ```bash
15
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
16
+ bun run "$CLI_PATH" init [--agent <type>] [--cli-path <path>] [--llm-mode agent|api]
17
+ ```
18
+
19
+ Fallback (if config does not exist yet):
20
+ ```bash
21
+ bun run <repo-path>/cli/selftune/index.ts init [options]
22
+ ```
23
+
24
+ ## Options
25
+
26
+ | Flag | Description | Default |
27
+ |------|-------------|---------|
28
+ | `--agent <type>` | Agent platform: `claude`, `codex`, `opencode` | Auto-detected |
29
+ | `--cli-path <path>` | Absolute path to `cli/selftune/index.ts` | Derived from repo location |
30
+ | `--llm-mode <mode>` | `agent` (use agent subprocess) or `api` (use Anthropic API directly) | `agent` |
31
+
32
+ ## Output Format
33
+
34
+ Creates `~/.selftune/config.json`:
35
+
36
+ ```json
37
+ {
38
+ "agent_type": "claude",
39
+ "cli_path": "/Users/you/selftune/cli/selftune/index.ts",
40
+ "llm_mode": "agent",
41
+ "agent_cli": "claude",
42
+ "hooks_installed": true,
43
+ "initialized_at": "2026-02-28T10:00:00Z"
44
+ }
45
+ ```
46
+
47
+ ### Field Descriptions
48
+
49
+ | Field | Type | Description |
50
+ |-------|------|-------------|
51
+ | `agent_type` | string | Detected or specified agent platform |
52
+ | `cli_path` | string | Absolute path to the CLI entry point |
53
+ | `llm_mode` | string | How LLM calls are made: `agent` or `api` |
54
+ | `agent_cli` | string | CLI binary name for the detected agent |
55
+ | `hooks_installed` | boolean | Whether telemetry hooks are installed |
56
+ | `initialized_at` | string | ISO 8601 timestamp |
57
+
58
+ ## Steps
59
+
60
+ ### 1. Check Existing Config
61
+
62
+ ```bash
63
+ cat ~/.selftune/config.json 2>/dev/null
64
+ ```
65
+
66
+ If the file exists and is valid JSON, selftune is already initialized.
67
+ Skip to Step 5 (verify with doctor) unless the user wants to reinitialize.
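The "exists and is valid JSON" check can be made explicit, since `cat` alone does not validate the contents. A minimal sketch, assuming a valid config is a JSON object; `load_config` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_config(path: str = "~/.selftune/config.json"):
    """Return the parsed config dict, or None if it is missing or invalid."""
    p = Path(path).expanduser()
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # A config that parses but is not an object is treated as invalid.
    return data if isinstance(data, dict) else None
```

A `None` result means init should run; a dict means selftune is already initialized and the doctor check can follow.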
68
+
69
+ ### 2. Run Init
70
+
71
+ ```bash
72
+ bun run /path/to/cli/selftune/index.ts init --agent claude --cli-path /path/to/cli/selftune/index.ts
73
+ ```
74
+
75
+ Replace paths with the actual selftune repository location.
76
+
77
+ ### 3. Install Hooks (Claude Code)
78
+
79
+ For Claude Code agents, merge the hooks from `skill/settings_snippet.json`
80
+ into `~/.claude/settings.json`. Three hooks are required:
81
+
82
+ | Hook | Script | Purpose |
83
+ |------|--------|---------|
84
+ | `UserPromptSubmit` | `hooks/prompt-log.ts` | Log every user query |
85
+ | `PostToolUse` (Read) | `hooks/skill-eval.ts` | Track skill triggers |
86
+ | `Stop` | `hooks/session-stop.ts` | Capture session telemetry |
87
+
88
+ Replace `/PATH/TO/` in the snippet with the actual `cli/selftune/` directory.
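The merge can be sketched as follows. Two things here are assumptions, because the contents of `settings_snippet.json` are not shown in this diff: that both files keep hooks under a top-level `hooks` object mapping event names to lists of entries, and that the placeholder appears inside string values. The function names are hypothetical.

```python
import copy


def _substitute(value, old: str, new: str):
    """Recursively replace old with new in every string of a JSON-like value."""
    if isinstance(value, str):
        return value.replace(old, new)
    if isinstance(value, list):
        return [_substitute(v, old, new) for v in value]
    if isinstance(value, dict):
        return {k: _substitute(v, old, new) for k, v in value.items()}
    return value


def merge_hooks(settings: dict, snippet: dict, cli_dir: str) -> dict:
    """Return a copy of settings with the snippet's hooks appended, without duplicates."""
    resolved = _substitute(snippet, "/PATH/TO/", cli_dir.rstrip("/") + "/")
    merged = copy.deepcopy(settings)  # leave the caller's dict untouched
    hooks = merged.setdefault("hooks", {})
    for event, entries in resolved.get("hooks", {}).items():
        existing = hooks.setdefault(event, [])
        for entry in entries:
            if entry not in existing:  # keeps the merge idempotent
                existing.append(entry)
    return merged
```

Appending rather than overwriting preserves any hooks the user already configured, and the duplicate check means re-running the merge after a reinstall does not register the same hook twice.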
89
+
90
+ ### 4. Platform-Specific Setup
91
+
92
+ **Codex agents:**
93
+ - Use `wrap-codex` for real-time telemetry capture (see `Workflows/Ingest.md`)
94
+ - Or batch-ingest existing sessions with `ingest-codex`
95
+
96
+ **OpenCode agents:**
97
+ - Use `ingest-opencode` to import sessions from the SQLite database
98
+ - See `Workflows/Ingest.md` for details
99
+
100
+ ### 5. Verify with Doctor
101
+
102
+ ```bash
103
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
104
+ bun run "$CLI_PATH" doctor
105
+ ```
106
+
107
+ Parse the JSON output. All checks should pass. If any fail, address the
108
+ reported issues before proceeding.
109
+
110
+ ## Common Patterns
111
+
112
+ **"I just cloned the selftune repo"**
113
+ > Run init with `--cli-path` pointing to the cloned `cli/selftune/index.ts`.
114
+ > Then install hooks for your agent platform.
115
+
116
+ **"I moved the repo to a new directory"**
117
+ > Re-run init with the updated `--cli-path`. The config will be overwritten.
118
+
119
+ **"Hooks aren't capturing data"**
120
+ > Run `doctor` to check hook installation. Verify paths in
121
+ > `~/.claude/settings.json` point to actual files.
122
+
123
+ **"Config exists but seems stale"**
124
+ > Delete `~/.selftune/config.json` and re-run init, or run init with
125
+ > `--cli-path` to update the path.
package/skill/Workflows/Rollback.md
@@ -0,0 +1,131 @@
1
+ # selftune Rollback Workflow
2
+
3
+ Undo a skill evolution by restoring the pre-evolution description.
4
+ Records the rollback in the evolution audit log for traceability.
5
+
6
+ ## Default Command
7
+
8
+ ```bash
9
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
10
+ bun run "$CLI_PATH" rollback --skill <name> --skill-path <path> [options]
11
+ ```
12
+
13
+ Fallback:
14
+ ```bash
15
+ bun run <repo-path>/cli/selftune/index.ts rollback --skill <name> --skill-path <path> [options]
16
+ ```
17
+
18
+ ## Options
19
+
20
+ | Flag | Description | Default |
21
+ |------|-------------|---------|
22
+ | `--skill <name>` | Skill name | Required |
23
+ | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
24
+ | `--proposal-id <id>` | Specific proposal to roll back | Latest evolution |
25
+
26
+ ## Output Format
27
+
28
+ The command writes a `rolled_back` entry to `~/.claude/evolution_audit_log.jsonl`:
29
+
30
+ ```json
31
+ {
32
+ "timestamp": "2026-02-28T14:00:00Z",
33
+ "proposal_id": "evolve-pptx-1709125200000",
34
+ "action": "rolled_back",
35
+ "details": "Restored from backup file",
36
+ "eval_snapshot": {
37
+ "total": 50,
38
+ "passed": 35,
39
+ "failed": 15,
40
+ "pass_rate": 0.70
41
+ }
42
+ }
43
+ ```
44
+
45
+ ## Parsing Instructions
46
+
47
+ ### Verify Rollback Succeeded
48
+
49
+ ```bash
50
+ # Parse: latest entry in evolution_audit_log.jsonl for the skill
51
+ # Check: .action == "rolled_back"
52
+ # Check: .proposal_id matches the target proposal
53
+ ```
54
+
55
+ ### Check Restoration Source
56
+
57
+ ```bash
58
+ # Parse: .details field
59
+ # Values: "Restored from backup file" or "Restored from audit trail"
60
+ ```
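Both checks can be sketched in Python. The example entry shows no separate skill field, so this sketch matches on `proposal_id` only; the helper names are hypothetical.

```python
import json


def last_entry_for(log_path: str, proposal_id: str):
    """Return the most recent audit entry for proposal_id, or None."""
    last = None
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("proposal_id") == proposal_id:
                last = entry  # the log is assumed to be append-only
    return last


def rollback_succeeded(log_path: str, proposal_id: str) -> bool:
    """True when the latest entry for the proposal is a rolled_back action."""
    entry = last_entry_for(log_path, proposal_id)
    return entry is not None and entry.get("action") == "rolled_back"
```

The `details` value of the entry returned by `last_entry_for` indicates which restoration source was used.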
61
+
62
+ ## Restoration Strategies
63
+
64
+ The command tries these strategies in order:
65
+
66
+ ### 1. Backup File
67
+
68
+ Looks for `SKILL.md.bak` alongside the skill file. This is the most
69
+ reliable source -- created automatically during `evolve --deploy`.
70
+
71
+ ### 2. Audit Trail
72
+
73
+ If no backup file exists, reads the evolution audit log for the `created`
74
+ entry with the matching `proposal_id`. The `details` field starts with
75
+ `original_description:` followed by the pre-evolution content.
76
+
77
+ ### 3. Failure
78
+
79
+ If neither source is available, the rollback fails with an error message.
80
+ Manual restoration from version control is required.
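The fallback order can be summarized in code. This is a sketch of the described strategy, not the package's `rollback.ts`: the function name is hypothetical, stripping leading whitespace after the `original_description:` prefix is an assumption, and the sketch only locates the original text without writing it back.

```python
import json
from pathlib import Path

PREFIX = "original_description:"


def find_original(skill_path: str, audit_log: str, proposal_id: str):
    """Return (original_text, source) using backup file first, then audit trail."""
    skill = Path(skill_path)
    backup = skill.with_name(skill.name + ".bak")  # e.g. SKILL.md.bak
    if backup.exists():
        return backup.read_text(encoding="utf-8"), "backup file"

    log = Path(audit_log)
    if log.exists():
        for line in log.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            details = str(entry.get("details", ""))
            if (entry.get("proposal_id") == proposal_id
                    and entry.get("action") == "created"
                    and details.startswith(PREFIX)):
                return details[len(PREFIX):].lstrip(), "audit trail"

    # Neither source is available: manual restoration is required.
    raise LookupError("no backup file or audit-trail entry; restore from version control")
```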
81
+
82
+ ## Steps
83
+
84
+ ### 1. Find the Last Evolution
85
+
86
+ Read `~/.claude/evolution_audit_log.jsonl` and find the most recent
87
+ `deployed` entry for the target skill. Note the `proposal_id`.
88
+
89
+ If `--proposal-id` is specified, use that instead.
90
+
91
+ ### 2. Run Rollback
92
+
93
+ ```bash
94
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
95
+ bun run "$CLI_PATH" rollback --skill pptx --skill-path /path/to/SKILL.md
96
+ ```
97
+
98
+ Or, to roll back a specific proposal:
99
+
100
+ ```bash
101
+ bun run "$CLI_PATH" rollback --skill pptx --skill-path /path/to/SKILL.md --proposal-id evolve-pptx-1709125200000
102
+ ```
103
+
104
+ ### 3. Verify Restoration
105
+
106
+ After rollback, verify the SKILL.md content is restored:
107
+ - Read the file and confirm it matches the pre-evolution version
108
+ - Check the audit log for the `rolled_back` entry
109
+ - Optionally re-run evals to confirm the original pass rate
110
+
111
+ ### 4. Post-Rollback Audit
112
+
113
+ The rollback is logged. Future `evolve` runs will see the rollback in the
114
+ audit trail and can use it to avoid repeating failed evolution patterns.
115
+
116
+ ## Common Patterns
117
+
118
+ **"Rollback the last evolution"**
119
+ > Run rollback with `--skill` and `--skill-path`. The command automatically
120
+ > finds the latest `deployed` entry in the audit log.
121
+
122
+ **"Undo the pptx skill change"**
123
+ > Same as above, specifying `--skill pptx`.
124
+
125
+ **"Restore the original description"**
126
+ > If multiple evolutions have occurred, use `--proposal-id` to target a
127
+ > specific one. Without it, only the most recent evolution is rolled back.
128
+
129
+ **"The rollback says no backup found"**
130
+ > Check version control (git) for the pre-evolution SKILL.md. The audit
131
+ > trail may also contain the original description in a `created` entry.
package/skill/Workflows/Watch.md
@@ -0,0 +1,128 @@
1
+ # selftune Watch Workflow
2
+
3
+ Monitor post-deploy skill performance for regressions. Compares current
4
+ pass rates against a baseline within a sliding window of recent sessions.
5
+
6
+ ## Default Command
7
+
8
+ ```bash
9
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
10
+ bun run "$CLI_PATH" watch --skill <name> --skill-path <path> [options]
11
+ ```
12
+
13
+ Fallback:
14
+ ```bash
15
+ bun run <repo-path>/cli/selftune/index.ts watch --skill <name> --skill-path <path> [options]
16
+ ```
17
+
18
+ ## Options
19
+
20
+ | Flag | Description | Default |
21
+ |------|-------------|---------|
22
+ | `--skill <name>` | Skill name | Required |
23
+ | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
24
+ | `--window <n>` | Sliding window size (number of sessions) | 20 |
25
+ | `--threshold <n>` | Regression threshold (maximum tolerated drop in pass rate from baseline) | 0.1 |
26
+ | `--baseline <n>` | Explicit baseline pass rate (0-1) | Auto-detected from last deploy |
27
+ | `--auto-rollback` | Automatically roll back on detected regression | Off |
28
+
29
+ ## Output Format
30
+
31
+ ```json
32
+ {
33
+ "skill_name": "pptx",
34
+ "window_size": 20,
35
+ "sessions_evaluated": 18,
36
+ "current_pass_rate": 0.89,
37
+ "baseline_pass_rate": 0.92,
38
+ "threshold": 0.1,
39
+ "regression_detected": false,
40
+ "delta": -0.03,
41
+ "status": "healthy",
42
+ "evaluated_at": "2026-02-28T14:00:00Z"
43
+ }
44
+ ```
45
+
46
+ ### Status Values
47
+
48
+ | Status | Meaning |
49
+ |--------|---------|
50
+ | `healthy` | Current pass rate is within threshold of baseline |
51
+ | `warning` | Pass rate dropped but within threshold |
52
+ | `regression` | Pass rate dropped below baseline minus threshold |
53
+ | `insufficient_data` | Not enough sessions in the window to evaluate |
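The regression rule in the table can be written out as arithmetic. The sketch below assumes `delta` is the current pass rate minus the baseline (consistent with the example output, where 0.89 against 0.92 gives -0.03) and that a regression requires a drop strictly greater than the threshold. It deliberately does not classify `healthy` versus `warning`, since the source does not state where that boundary lies. `check_regression` is a hypothetical name.

```python
def check_regression(current: float, baseline: float, threshold: float = 0.1) -> dict:
    """Compute delta and the regression flag for a watch snapshot."""
    return {
        # Rounded to hide floating-point noise; the precision is an assumption.
        "delta": round(current - baseline, 4),
        # "dropped below baseline minus threshold"
        "regression_detected": current < baseline - threshold,
    }
```

With the example values, the drop of 0.03 is well inside the 0.1 threshold, so no regression is flagged.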
54
+
55
+ ## Parsing Instructions
56
+
57
+ ### Check Regression Status
58
+
59
+ ```bash
60
+ # Parse: .regression_detected (boolean)
61
+ # Parse: .status (string)
62
+ # Parse: .delta (float, current minus baseline; negative = pass rate dropped)
63
+ ```
64
+
65
+ ### Get Key Metrics
66
+
67
+ ```bash
68
+ # Parse: .current_pass_rate vs .baseline_pass_rate
69
+ # Parse: .sessions_evaluated (should be close to .window_size)
70
+ ```
71
+
72
+ ## Steps
73
+
74
+ ### 1. Run Watch
75
+
76
+ ```bash
77
+ CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
78
+ bun run "$CLI_PATH" watch --skill pptx --skill-path /path/to/SKILL.md
79
+ ```
80
+
81
+ ### 2. Check Regression Status
82
+
83
+ Parse the JSON output. Key decision points:
84
+
85
+ | Status | Action |
86
+ |--------|--------|
87
+ | `healthy` | No action needed. Skill is performing well. |
88
+ | `warning` | Monitor closely. Consider re-running after more sessions. |
89
+ | `regression` | Investigate. Consider rollback. |
90
+ | `insufficient_data` | Wait for more sessions before evaluating. |
91
+
92
+ ### 3. Decide Action
93
+
94
+ If regression is detected:
95
+ - Review recent session transcripts to understand what changed
96
+ - Check if the eval set is still representative
97
+ - Run `rollback` if the regression is confirmed (see `Workflows/Rollback.md`)
98
+
99
+ If `--auto-rollback` was set, the command automatically restores the
100
+ previous description and logs a `rolled_back` entry.
101
+
102
+ ### 4. Report
103
+
104
+ Summarize the snapshot for the user:
105
+ - Current pass rate vs baseline
106
+ - Number of sessions evaluated
107
+ - Whether regression was detected
108
+ - Recommended action
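A report along these lines could be assembled directly from the JSON output. The sketch below reuses the status-to-action table from step 2; the function name, the sentence layout, and the fallback text for an unrecognized status are assumptions.

```python
ACTIONS = {
    "healthy": "No action needed.",
    "warning": "Monitor closely; consider re-running after more sessions.",
    "regression": "Investigate and consider rollback.",
    "insufficient_data": "Wait for more sessions before evaluating.",
}


def watch_report(snapshot: dict) -> str:
    """Turn a watch snapshot into a one-line summary with a recommended action."""
    action = ACTIONS.get(snapshot["status"], "Unknown status; inspect the raw output.")
    regressed = "yes" if snapshot["regression_detected"] else "no"
    return (
        f"{snapshot['skill_name']}: pass rate {snapshot['current_pass_rate']:.0%} "
        f"vs baseline {snapshot['baseline_pass_rate']:.0%} "
        f"({snapshot['sessions_evaluated']}/{snapshot['window_size']} sessions); "
        f"regression detected: {regressed}. {action}"
    )
```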
109
+
110
+ ## Common Patterns
111
+
112
+ **"Is the skill performing well after the change?"**
113
+ > Run watch with the skill name and path. Report the snapshot.
114
+
115
+ **"Check for regressions"**
116
+ > Same as above. Focus on the `regression_detected` and `delta` fields.
117
+
118
+ **"How is the skill doing?"**
119
+ > Run watch. If `insufficient_data`, tell the user to wait for more
120
+ > sessions before drawing conclusions.
121
+
122
+ **"Auto-rollback if it regresses"**
123
+ > Use `--auto-rollback`. The command will restore the previous description
124
+ > automatically if pass rate drops below baseline minus threshold.
125
+
126
+ **"Set a custom baseline"**
127
+ > Use `--baseline 0.85` to override auto-detection. Useful when the
128
+ > auto-detected baseline is from an older evolution.