selftune 0.1.0

Files changed (45)

package/CHANGELOG.md +23 -0
package/README.md +259 -0
package/bin/selftune.cjs +29 -0
package/cli/selftune/constants.ts +71 -0
package/cli/selftune/eval/hooks-to-evals.ts +422 -0
package/cli/selftune/evolution/audit.ts +44 -0
package/cli/selftune/evolution/deploy-proposal.ts +244 -0
package/cli/selftune/evolution/evolve.ts +406 -0
package/cli/selftune/evolution/extract-patterns.ts +145 -0
package/cli/selftune/evolution/propose-description.ts +146 -0
package/cli/selftune/evolution/rollback.ts +242 -0
package/cli/selftune/evolution/stopping-criteria.ts +69 -0
package/cli/selftune/evolution/validate-proposal.ts +137 -0
package/cli/selftune/grading/grade-session.ts +459 -0
package/cli/selftune/hooks/prompt-log.ts +52 -0
package/cli/selftune/hooks/session-stop.ts +54 -0
package/cli/selftune/hooks/skill-eval.ts +73 -0
package/cli/selftune/index.ts +104 -0
package/cli/selftune/ingestors/codex-rollout.ts +416 -0
package/cli/selftune/ingestors/codex-wrapper.ts +332 -0
package/cli/selftune/ingestors/opencode-ingest.ts +565 -0
package/cli/selftune/init.ts +297 -0
package/cli/selftune/monitoring/watch.ts +328 -0
package/cli/selftune/observability.ts +255 -0
package/cli/selftune/types.ts +255 -0
package/cli/selftune/utils/jsonl.ts +75 -0
package/cli/selftune/utils/llm-call.ts +192 -0
package/cli/selftune/utils/logging.ts +40 -0
package/cli/selftune/utils/schema-validator.ts +47 -0
package/cli/selftune/utils/seeded-random.ts +31 -0
package/cli/selftune/utils/transcript.ts +260 -0
package/package.json +29 -0
package/skill/SKILL.md +120 -0
package/skill/Workflows/Doctor.md +145 -0
package/skill/Workflows/Evals.md +193 -0
package/skill/Workflows/Evolve.md +159 -0
package/skill/Workflows/Grade.md +157 -0
package/skill/Workflows/Ingest.md +159 -0
package/skill/Workflows/Initialize.md +125 -0
package/skill/Workflows/Rollback.md +131 -0
package/skill/Workflows/Watch.md +128 -0
package/skill/references/grading-methodology.md +176 -0
package/skill/references/invocation-taxonomy.md +144 -0
package/skill/references/logs.md +168 -0
package/skill/settings_snippet.json +41 -0

package/skill/Workflows/Grade.md ADDED

@@ -0,0 +1,157 @@
+# selftune Grade Workflow
+Grade a completed skill session against expectations. Produces `grading.json`
+with a 3-tier evaluation covering trigger, process, and quality.
+## Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" grade --skill <name> [options]
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts grade --skill <name> [options]
+```
+## Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--skill <name>` | Skill name to grade | Required |
+| `--expectations "..."` | Explicit expectations (semicolon-separated) | Auto-derived |
+| `--evals-json <path>` | Pre-built eval set JSON file | None |
+| `--eval-id <n>` | Specific eval ID to grade from the eval set | None |
+| `--use-agent` | Grade via agent subprocess (no API key needed) | Off (uses API) |
+## Output Format
+The command produces `grading.json`. See `references/grading-methodology.md`
+for the full schema. Key fields:
+```json
+{
+  "session_id": "abc123",
+  "skill_name": "pptx",
+  "transcript_path": "~/.claude/projects/.../abc123.jsonl",
+  "graded_at": "2026-02-28T12:00:00Z",
+  "expectations": [
+    { "text": "...", "passed": true, "evidence": "..." }
+  ],
+  "summary": { "passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67 },
+  "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 },
+  "claims": [
+    { "claim": "...", "type": "factual", "verified": true, "evidence": "..." }
+  ],
+  "eval_feedback": { "suggestions": [], "overall": "..." }
+}
+```
+## Parsing Instructions
+### Get Pass Rate
+```bash
+# Parse: .summary.pass_rate (float 0-1)
+# Parse: .summary.passed / .summary.total
+```
+### Find Failed Expectations
+```bash
+# Parse: .expectations[] | select(.passed == false) | .text
+```
+### Extract Claims
+```bash
+# Parse: .claims[] | { claim, type, verified }
+```
+### Get Eval Feedback
+```bash
+# Parse: .eval_feedback.suggestions[].reason
+# Parse: .eval_feedback.overall
+```
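The parsing paths above could be applied in TypeScript roughly as follows. This is a minimal sketch and not part of the published package: the `GradingResult` type covers only the fields shown in the key-fields excerpt, and `readGrading` is a hypothetical helper name.

```typescript
// Hypothetical types covering only the fields shown in the excerpt above.
interface Expectation {
  text: string;
  passed: boolean;
  evidence: string;
}

interface GradingResult {
  summary: { passed: number; failed: number; total: number; pass_rate: number };
  expectations: Expectation[];
}

// Mirrors `.summary.pass_rate` and
// `.expectations[] | select(.passed == false) | .text`.
function readGrading(json: string): { passRate: number; failedTexts: string[] } {
  const result = JSON.parse(json) as GradingResult;
  const failedTexts = result.expectations
    .filter((e) => !e.passed)
    .map((e) => e.text);
  return { passRate: result.summary.pass_rate, failedTexts };
}
```

Pass the raw text of `grading.json` to `readGrading`; the full schema in `references/grading-methodology.md` may contain fields this sketch ignores.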
+## Steps
+### 1. Find the Session
+Read `~/.claude/session_telemetry_log.jsonl` and find the most recent entry
+where `skills_triggered` contains the target skill name.
+Note the `transcript_path`, `tool_calls`, `errors_encountered`, and
+`session_id` fields. See `references/logs.md` for the telemetry format.
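Step 1 can be sketched as a scan over the telemetry log text. This is an illustration rather than selftune's own lookup: it assumes the log is append-only (so the last matching line is the most recent), models only the fields named in this step, and uses the hypothetical name `findLatestSession`.

```typescript
// Hypothetical record shape: only the fields named in Step 1 are modeled.
interface TelemetryRecord {
  session_id: string;
  transcript_path: string;
  skills_triggered: string[];
  errors_encountered: number;
}

// Returns the last matching line, assuming later lines are more recent.
// Blank or unparseable lines are skipped.
function findLatestSession(jsonl: string, skill: string): TelemetryRecord | undefined {
  let latest: TelemetryRecord | undefined;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const record = JSON.parse(line) as Partial<TelemetryRecord>;
      if (
        Array.isArray(record.skills_triggered) &&
        record.skills_triggered.indexOf(skill) !== -1
      ) {
        latest = record as TelemetryRecord;
      }
    } catch {
      // Ignore malformed lines rather than failing the whole lookup.
    }
  }
  return latest;
}
```

If the real log carries a timestamp field, sorting on it would be more robust than relying on file order; see `references/logs.md` for the actual format.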
+### 2. Read the Transcript
+Parse the JSONL file at `transcript_path`. Identify:
+- User messages (what was asked)
+- Assistant tool calls (what the agent did)
+- Tool results (what happened)
+- Error patterns (what went wrong)
+See `references/logs.md` for transcript format variants.
+### 3. Determine Expectations
+If the user provided `--expectations`, parse the semicolon-separated list.
+Otherwise, derive defaults. See `references/grading-methodology.md` for the
+full default expectations list.
+Always include at least one Process expectation and one Quality expectation.
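Splitting the `--expectations` value might look like the sketch below. Trimming whitespace and dropping empty segments are assumptions made here for illustration; the actual CLI may handle those cases differently, and `parseExpectations` is a hypothetical name.

```typescript
// Splits an `--expectations` value on semicolons, trimming whitespace and
// dropping empty segments (for example from a trailing semicolon).
function parseExpectations(raw: string): string[] {
  return raw
    .split(";")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}
```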
+### 4. Grade Each Expectation
+For each expectation, search both the telemetry record and the transcript
+for evidence. Mark as:
+- **PASS** if evidence exists and supports the expectation
+- **FAIL** if evidence is absent or contradicts the expectation
+Cite specific evidence: transcript line numbers, tool call names, bash output.
+### 5. Extract Implicit Claims
+Pull 2-4 claims from the transcript that are not covered by the explicit
+expectations. Classify each as factual, process, or quality. Verify each
+against the transcript. See `references/grading-methodology.md` for claim
+types and examples.
+### 6. Flag Eval Gaps
+Review each passed expectation. If it would also pass for wrong output,
+note it in `eval_feedback.suggestions`. See `references/grading-methodology.md`
+for gap flagging criteria.
+### 7. Write grading.json
+Write the full grading result to `grading.json` in the current directory.
+### 8. Summarize
+Report to the user:
+- Pass rate (e.g., "2/3 passed, 67%")
+- Failed expectations with evidence
+- Notable claims
+- Top eval feedback suggestion
+Keep the summary concise. The full details are in `grading.json`.
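The pass-rate line in the report can be derived from the `summary` counts. A small sketch, with the hypothetical helper `formatPassRate` and an assumed `n/a` fallback for an empty expectation list:

```typescript
// Formats a summary line in the style of the example "2/3 passed, 67%".
// The percentage is derived from passed/total and rounded to a whole number.
function formatPassRate(passed: number, total: number): string {
  if (total <= 0) return "0/0 passed, n/a";
  const percent = Math.round((passed / total) * 100);
  return `${passed}/${total} passed, ${percent}%`;
}
```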
+## Common Patterns
+**"Grade my last pptx session"**
+> Find the most recent telemetry entry for `pptx`. Use default expectations.
+> Ask if the user wants custom expectations or proceed with defaults.
+**"Grade with these specific expectations"**
+> Pass `--expectations "expect1;expect2;expect3"` to override defaults.
+**"Grade using an eval set"**
+> Pass `--evals-json path/to/evals.json` and optionally `--eval-id N`
+> to grade a specific eval scenario.
+**"I don't have an API key"**
+> Use `--use-agent` to grade via agent subprocess instead of direct API.

package/skill/Workflows/Ingest.md ADDED

@@ -0,0 +1,159 @@
+# selftune Ingest Workflow
+Import sessions from non-Claude-Code agent platforms into the shared
+selftune log format. Covers three sub-commands: `ingest-codex`,
+`ingest-opencode`, and `wrap-codex`.
+## When to Use Each
+| Sub-command | Platform | Mode | When |
+|-------------|----------|------|------|
+| `ingest-codex` | Codex | Batch | Import existing Codex rollout logs |
+| `ingest-opencode` | OpenCode | Batch | Import existing OpenCode sessions |
+| `wrap-codex` | Codex | Real-time | Wrap `codex exec` to capture telemetry live |
+---
+## ingest-codex
+Batch ingest Codex rollout logs into the shared JSONL schema.
+### Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" ingest-codex
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts ingest-codex
+```
+### Options
+None. Reads from the standard Codex session directory.
+### Source
+Reads from `$CODEX_HOME/sessions/` directory. Expects the Codex rollout
+JSONL format. See `references/logs.md` for the Codex rollout format.
+### Output
+Writes to:
+- `~/.claude/all_queries_log.jsonl` -- extracted user queries
+- `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "codex_rollout"`
+### Steps
+1. Verify `$CODEX_HOME/sessions/` directory exists and contains session files
+2. Run `ingest-codex`
+3. Verify entries were written by checking log file line counts
+4. Run `doctor` to confirm logs are healthy
+---
+## ingest-opencode
+Ingest OpenCode sessions from the SQLite database.
+### Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" ingest-opencode
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts ingest-opencode
+```
+### Options
+None. Auto-discovers the database location.
+### Source
+Primary: `~/.local/share/opencode/opencode.db` (SQLite database)
+Fallback: Legacy JSON session files in the OpenCode data directory
+See `references/logs.md` for the OpenCode message format.
+### Output
+Writes to:
+- `~/.claude/all_queries_log.jsonl` -- extracted user queries
+- `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "opencode"` or `"opencode_json"`
+### Steps
+1. Verify the OpenCode database exists at the expected path
+2. Run `ingest-opencode`
+3. Verify entries were written by checking log file line counts
+4. Run `doctor` to confirm logs are healthy
+---
+## wrap-codex
+Wrap `codex exec` with real-time telemetry capture. Drop-in replacement
+that tees the JSONL stream while passing through to Codex.
+### Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" wrap-codex -- <your codex args>
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts wrap-codex -- <your codex args>
+```
+### Usage
+Everything after `--` is passed directly to `codex exec`:
+```bash
+bun run "$CLI_PATH" wrap-codex -- --model o3 "Fix the failing tests"
+```
+### Output
+Writes to:
+- `~/.claude/all_queries_log.jsonl` -- the user query
+- `~/.claude/session_telemetry_log.jsonl` -- session metrics with `source: "codex"`
+The Codex output is passed through unchanged. The wrapper only tees the
+stream for telemetry; it does not modify Codex behavior.
+### Steps
+1. Build the wrap-codex command with the desired Codex arguments
+2. Run the command (replaces `codex exec` in your workflow)
+3. Session telemetry is captured automatically
+4. Verify with `doctor` after first use
+---
+## Common Patterns
+**"Ingest codex logs"**
+> Run `ingest-codex`. No options needed. Reads from `$CODEX_HOME/sessions/`.
+**"Import opencode sessions"**
+> Run `ingest-opencode`. Reads from the SQLite database automatically.
+**"Run codex through selftune"**
+> Use `wrap-codex -- <codex args>` instead of `codex exec <args>` directly.
+**"Batch ingest vs real-time"**
+> Use `ingest-codex` or `ingest-opencode` for historical sessions.
+> Use `wrap-codex` for ongoing sessions. Batch and real-time capture write to the same log files.
+**"How do I know it worked?"**
+> Run `doctor` after ingestion. Check that log files exist and are parseable.
+> Run `evals --list-skills` to see if the ingested sessions appear.
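For a quick manual check that an ingested log is parseable, something like the sketch below could be run over the file contents. It is not how `doctor` works internally (that is not shown in this diff); `checkJsonl` is a hypothetical helper that only counts lines.

```typescript
// Counts JSONL lines that parse as JSON objects versus lines that do not.
function checkJsonl(content: string): { valid: number; invalid: number } {
  let valid = 0;
  let invalid = 0;
  for (const line of content.split("\n")) {
    if (line.trim() === "") continue; // a trailing newline is not an error
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        valid++;
      } else {
        invalid++;
      }
    } catch {
      invalid++;
    }
  }
  return { valid, invalid };
}
```

Comparing `valid` before and after an ingest run gives the same signal as the line-count check in the steps above.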

package/skill/Workflows/Initialize.md ADDED

@@ -0,0 +1,125 @@
+# selftune Initialize Workflow
+Bootstrap selftune for first-time use or after changing environments.
+## When to Use
+- First time using selftune in a new environment
+- After switching agent platforms (Claude Code, Codex, OpenCode)
+- After reinstalling or moving the selftune repository
+- When `~/.selftune/config.json` does not exist
+## Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" init [--agent <type>] [--cli-path <path>] [--llm-mode agent|api]
+```
+Fallback (if config does not exist yet):
+```bash
+bun run <repo-path>/cli/selftune/index.ts init [options]
+```
+## Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--agent <type>` | Agent platform: `claude`, `codex`, `opencode` | Auto-detected |
+| `--cli-path <path>` | Absolute path to `cli/selftune/index.ts` | Derived from repo location |
+| `--llm-mode <mode>` | `agent` (use agent subprocess) or `api` (use Anthropic API directly) | `agent` |
+## Output Format
+Creates `~/.selftune/config.json`:
+```json
+{
+  "agent_type": "claude",
+  "cli_path": "/Users/you/selftune/cli/selftune/index.ts",
+  "llm_mode": "agent",
+  "agent_cli": "claude",
+  "hooks_installed": true,
+  "initialized_at": "2026-02-28T10:00:00Z"
+}
+```
+### Field Descriptions
+| Field | Type | Description |
+|-------|------|-------------|
+| `agent_type` | string | Detected or specified agent platform |
+| `cli_path` | string | Absolute path to the CLI entry point |
+| `llm_mode` | string | How LLM calls are made: `agent` or `api` |
+| `agent_cli` | string | CLI binary name for the detected agent |
+| `hooks_installed` | boolean | Whether telemetry hooks are installed |
+| `initialized_at` | string | ISO 8601 timestamp |
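The field table translates directly into a TypeScript type. The sketch below adds a hypothetical runtime guard, `isSelftuneConfig`, which is not taken from the package; the package's own definitions in `cli/selftune/types.ts` may differ.

```typescript
// Shape of ~/.selftune/config.json as described in the field table above.
interface SelftuneConfig {
  agent_type: string;
  cli_path: string;
  llm_mode: "agent" | "api";
  agent_cli: string;
  hooks_installed: boolean;
  initialized_at: string;
}

// Hypothetical runtime check; returns true only when every documented field
// is present with the documented type and the timestamp is parseable.
function isSelftuneConfig(value: unknown): value is SelftuneConfig {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.agent_type === "string" &&
    typeof v.cli_path === "string" &&
    (v.llm_mode === "agent" || v.llm_mode === "api") &&
    typeof v.agent_cli === "string" &&
    typeof v.hooks_installed === "boolean" &&
    typeof v.initialized_at === "string" &&
    !isNaN(Date.parse(v.initialized_at))
  );
}
```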
+## Steps
+### 1. Check Existing Config
+```bash
+cat ~/.selftune/config.json 2>/dev/null
+```
+If the file exists and is valid JSON, selftune is already initialized.
+Skip to Step 5 (verify with doctor) unless the user wants to reinitialize.
+### 2. Run Init
+```bash
+bun run /path/to/cli/selftune/index.ts init --agent claude --cli-path /path/to/cli/selftune/index.ts
+```
+Replace paths with the actual selftune repository location.
+### 3. Install Hooks (Claude Code)
+For Claude Code agents, merge the hooks from `skill/settings_snippet.json`
+into `~/.claude/settings.json`. Three hooks are required:
+| Hook | Script | Purpose |
+|------|--------|---------|
+| `UserPromptSubmit` | `hooks/prompt-log.ts` | Log every user query |
+| `PostToolUse` (Read) | `hooks/skill-eval.ts` | Track skill triggers |
+| `Stop` | `hooks/session-stop.ts` | Capture session telemetry |
+Replace `/PATH/TO/` in the snippet with the actual `cli/selftune/` directory.
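The placeholder substitution can be done on the snippet text before merging it into `~/.claude/settings.json`. A minimal sketch: the contents of `settings_snippet.json` are not shown in this section, so the function works on any string, and `fillHookPaths` is a hypothetical name.

```typescript
// Replaces every `/PATH/TO/` placeholder in the snippet text with the given
// directory, normalizing the directory to end with exactly one slash.
function fillHookPaths(snippet: string, selftuneDir: string): string {
  const dir = selftuneDir.replace(/\/+$/, "") + "/";
  return snippet.split("/PATH/TO/").join(dir);
}
```

The merge into the existing settings file is left out here because it depends on what hooks the user already has configured.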
+### 4. Platform-Specific Setup
+**Codex agents:**
+- Use `wrap-codex` for real-time telemetry capture (see `Workflows/Ingest.md`)
+- Or batch-ingest existing sessions with `ingest-codex`
+**OpenCode agents:**
+- Use `ingest-opencode` to import sessions from the SQLite database
+- See `Workflows/Ingest.md` for details
+### 5. Verify with Doctor
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" doctor
+```
+Parse the JSON output. All checks should pass. If any fail, address the
+reported issues before proceeding.
+## Common Patterns
+**"I just cloned the selftune repo"**
+> Run init with `--cli-path` pointing to the cloned `cli/selftune/index.ts`.
+> Then install hooks for your agent platform.
+**"I moved the repo to a new directory"**
+> Re-run init with the updated `--cli-path`. The config will be overwritten.
+**"Hooks aren't capturing data"**
+> Run `doctor` to check hook installation. Verify paths in
+> `~/.claude/settings.json` point to actual files.
+**"Config exists but seems stale"**
+> Delete `~/.selftune/config.json` and re-run init, or run init with
+> `--cli-path` to update the path.

package/skill/Workflows/Rollback.md ADDED

@@ -0,0 +1,131 @@
+# selftune Rollback Workflow
+Undo a skill evolution by restoring the pre-evolution description.
+Records the rollback in the evolution audit log for traceability.
+## Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" rollback --skill <name> --skill-path <path> [options]
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts rollback --skill <name> --skill-path <path> [options]
+```
+## Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--skill <name>` | Skill name | Required |
+| `--skill-path <path>` | Path to the skill's SKILL.md | Required |
+| `--proposal-id <id>` | Specific proposal to roll back | Latest evolution |
+## Output Format
+The command writes a `rolled_back` entry to `~/.claude/evolution_audit_log.jsonl`:
+```json
+{
+  "timestamp": "2026-02-28T14:00:00Z",
+  "proposal_id": "evolve-pptx-1709125200000",
+  "action": "rolled_back",
+  "details": "Restored from backup file",
+  "eval_snapshot": {
+    "total": 50,
+    "passed": 35,
+    "failed": 15,
+    "pass_rate": 0.70
+  }
+}
+```
+## Parsing Instructions
+### Verify Rollback Succeeded
+```bash
+# Parse: latest entry in evolution_audit_log.jsonl for the skill
+# Check: .action == "rolled_back"
+# Check: .proposal_id matches the target proposal
+```
+### Check Restoration Source
+```bash
+# Parse: .details field
+# Values: "Restored from backup file" or "Restored from audit trail"
+```
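The verification checks above could be scripted along these lines. This is a sketch only: it assumes the audit log is append-only, matches on `proposal_id` because the sample entry has no separate skill field, and uses the hypothetical name `rollbackSucceeded`.

```typescript
// Fields taken from the sample `rolled_back` entry above.
interface AuditEntry {
  timestamp: string;
  proposal_id: string;
  action: string;
  details: string;
}

// Returns true when the last entry for `proposalId` has action "rolled_back".
// Assumes the audit log is append-only, so file order is chronological.
function rollbackSucceeded(jsonl: string, proposalId: string): boolean {
  const entries = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AuditEntry)
    .filter((entry) => entry.proposal_id === proposalId);
  const last = entries[entries.length - 1];
  return last !== undefined && last.action === "rolled_back";
}
```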
+## Restoration Strategies
+The command tries these strategies in order:
+### 1. Backup File
+Looks for `SKILL.md.bak` alongside the skill file. This is the most
+reliable source -- created automatically during `evolve --deploy`.
+### 2. Audit Trail
+If no backup file exists, reads the evolution audit log for the `created`
+entry with the matching `proposal_id`. The `details` field starts with
+`original_description:` followed by the pre-evolution content.
+### 3. Failure
+If neither source is available, the rollback fails with an error message.
+Manual restoration from version control is required.
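The fallback order described above can be sketched as a pure function. This is an illustration under stated assumptions and not the code in `evolution/rollback.ts`: file reading is left to the caller, the text after the `original_description:` prefix is returned as-is, and the names `resolveRestore` and `RestoreResult` are hypothetical.

```typescript
type RestoreResult =
  | { source: "backup" | "audit"; content: string; details: string }
  | { source: "none"; error: string };

interface CreatedEntry {
  proposal_id: string;
  action: string;
  details: string;
}

const ORIGINAL_PREFIX = "original_description:";

// Tries the backup content first, then the audit trail, then reports failure.
// `backup` is the text of SKILL.md.bak, or null when that file does not exist.
function resolveRestore(
  backup: string | null,
  auditEntries: CreatedEntry[],
  proposalId: string,
): RestoreResult {
  if (backup !== null) {
    return { source: "backup", content: backup, details: "Restored from backup file" };
  }
  for (const entry of auditEntries) {
    if (
      entry.proposal_id === proposalId &&
      entry.action === "created" &&
      entry.details.indexOf(ORIGINAL_PREFIX) === 0
    ) {
      const content = entry.details.slice(ORIGINAL_PREFIX.length);
      return { source: "audit", content, details: "Restored from audit trail" };
    }
  }
  return { source: "none", error: "No backup file or audit trail entry found" };
}
```

Keeping the strategy selection free of I/O makes the ordering easy to test separately from the file writes.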
+## Steps
+### 1. Find the Last Evolution
+Read `~/.claude/evolution_audit_log.jsonl` and find the most recent
+`deployed` entry for the target skill. Note the `proposal_id`.
+If `--proposal-id` is specified, use that instead.
+### 2. Run Rollback
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" rollback --skill pptx --skill-path /path/to/SKILL.md
+```
+Or to roll back a specific proposal:
+```bash
+bun run "$CLI_PATH" rollback --skill pptx --skill-path /path/to/SKILL.md --proposal-id evolve-pptx-1709125200000
+```
+### 3. Verify Restoration
+After rollback, verify the SKILL.md content is restored:
+- Read the file and confirm it matches the pre-evolution version
+- Check the audit log for the `rolled_back` entry
+- Optionally re-run evals to confirm the original pass rate
+### 4. Post-Rollback Audit
+The rollback is logged. Future `evolve` runs will see the rollback in the
+audit trail and can use it to avoid repeating failed evolution patterns.
+## Common Patterns
+**"Rollback the last evolution"**
+> Run rollback with `--skill` and `--skill-path`. The command automatically
+> finds the latest `deployed` entry in the audit log.
+**"Undo the pptx skill change"**
+> Same as above, specifying `--skill pptx`.
+**"Restore the original description"**
+> If multiple evolutions have occurred, use `--proposal-id` to target a
+> specific one. Without it, only the most recent evolution is rolled back.
+**"The rollback says no backup found"**
+> Check version control (git) for the pre-evolution SKILL.md. The audit
+> trail may also contain the original description in a `created` entry.

package/skill/Workflows/Watch.md ADDED

@@ -0,0 +1,128 @@
+# selftune Watch Workflow
+Monitor post-deploy skill performance for regressions. Compares current
+pass rates against a baseline within a sliding window of recent sessions.
+## Default Command
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" watch --skill <name> --skill-path <path> [options]
+```
+Fallback:
+```bash
+bun run <repo-path>/cli/selftune/index.ts watch --skill <name> --skill-path <path> [options]
+```
+## Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--skill <name>` | Skill name | Required |
+| `--skill-path <path>` | Path to the skill's SKILL.md | Required |
+| `--window <n>` | Sliding window size (number of sessions) | 20 |
+| `--threshold <n>` | Regression threshold (drop from baseline) | 0.1 |
+| `--baseline <n>` | Explicit baseline pass rate (0-1) | Auto-detected from last deploy |
+| `--auto-rollback` | Automatically roll back on detected regression | Off |
+## Output Format
+```json
+{
+  "skill_name": "pptx",
+  "window_size": 20,
+  "sessions_evaluated": 18,
+  "current_pass_rate": 0.89,
+  "baseline_pass_rate": 0.92,
+  "threshold": 0.1,
+  "regression_detected": false,
+  "delta": -0.03,
+  "status": "healthy",
+  "evaluated_at": "2026-02-28T14:00:00Z"
+}
+```
+### Status Values
+| Status | Meaning |
+|--------|---------|
+| `healthy` | Current pass rate is within threshold of baseline |
+| `warning` | Pass rate dropped but within threshold |
+| `regression` | Pass rate dropped below baseline minus threshold |
+| `insufficient_data` | Not enough sessions in the window to evaluate |
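The regression rule in the table (pass rate below baseline minus threshold) and the `delta` field can be reproduced with a few lines of arithmetic. A sketch, assuming `delta` is simply current minus baseline as the sample output suggests; the rounding step and the name `detectRegression` are choices made here, not taken from `monitoring/watch.ts`.

```typescript
// Computes the delta and regression flag from the rule in the status table:
// a regression means the pass rate fell below baseline minus threshold.
function detectRegression(
  current: number,
  baseline: number,
  threshold: number,
): { delta: number; regressionDetected: boolean } {
  // Round to 4 decimals so floating-point noise does not leak into the delta.
  const delta = Math.round((current - baseline) * 1e4) / 1e4;
  return { delta, regressionDetected: current < baseline - threshold };
}
```

With the sample values (current 0.89, baseline 0.92, threshold 0.1) the cutoff is 0.82, so the drop of 0.03 does not count as a regression.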
+## Parsing Instructions
+### Check Regression Status
+```bash
+# Parse: .regression_detected (boolean)
+# Parse: .status (string)
+# Parse: .delta (float, negative = pass rate below baseline)
+```
+### Get Key Metrics
+```bash
+# Parse: .current_pass_rate vs .baseline_pass_rate
+# Parse: .sessions_evaluated (should be close to .window_size)
+```
+## Steps
+### 1. Run Watch
+```bash
+CLI_PATH=$(jq -r .cli_path ~/.selftune/config.json)
+bun run "$CLI_PATH" watch --skill pptx --skill-path /path/to/SKILL.md
+```
+### 2. Check Regression Status
+Parse the JSON output. Key decision points:
+| Status | Action |
+|--------|--------|
+| `healthy` | No action needed. Skill is performing well. |
+| `warning` | Monitor closely. Consider re-running after more sessions. |
+| `regression` | Investigate. Consider rollback. |
+| `insufficient_data` | Wait for more sessions before evaluating. |
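The decision table maps cleanly onto a typed switch over the `status` field of the watch output. A minimal sketch with hypothetical helper names; the action strings paraphrase the table and are not emitted by the CLI.

```typescript
type WatchStatus = "healthy" | "warning" | "regression" | "insufficient_data";

// Maps each documented status to the action from the decision table.
function recommendedAction(status: WatchStatus): string {
  switch (status) {
    case "healthy":
      return "No action needed.";
    case "warning":
      return "Monitor closely; consider re-running after more sessions.";
    case "regression":
      return "Investigate; consider rollback.";
    case "insufficient_data":
      return "Wait for more sessions before evaluating.";
  }
}

// Reads `.status` from the raw JSON output of the watch command.
function actionForWatchOutput(json: string): string {
  const output = JSON.parse(json) as { status: WatchStatus };
  return recommendedAction(output.status);
}
```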
+### 3. Decide Action
+If regression is detected:
+- Review recent session transcripts to understand what changed
+- Check if the eval set is still representative
+- Run `rollback` if the regression is confirmed (see `Workflows/Rollback.md`)
+If `--auto-rollback` was set, the command automatically restores the
+previous description and logs a `rolled_back` entry.
+### 4. Report
+Summarize the snapshot for the user:
+- Current pass rate vs baseline
+- Number of sessions evaluated
+- Whether regression was detected
+- Recommended action
+## Common Patterns
+**"Is the skill performing well after the change?"**
+> Run watch with the skill name and path. Report the snapshot.
+**"Check for regressions"**
+> Same as above. Focus on the `regression_detected` and `delta` fields.
+**"How is the skill doing?"**
+> Run watch. If `insufficient_data`, tell the user to wait for more
+> sessions before drawing conclusions.
+**"Auto-rollback if it regresses"**
+> Use `--auto-rollback`. The command will restore the previous description
+> automatically if pass rate drops below baseline minus threshold.
+**"Set a custom baseline"**
+> Use `--baseline 0.85` to override auto-detection. Useful when the
+> auto-detected baseline is from an older evolution.