selftune 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45) hide show
  1. package/CHANGELOG.md +23 -0
  2. package/README.md +259 -0
  3. package/bin/selftune.cjs +29 -0
  4. package/cli/selftune/constants.ts +71 -0
  5. package/cli/selftune/eval/hooks-to-evals.ts +422 -0
  6. package/cli/selftune/evolution/audit.ts +44 -0
  7. package/cli/selftune/evolution/deploy-proposal.ts +244 -0
  8. package/cli/selftune/evolution/evolve.ts +406 -0
  9. package/cli/selftune/evolution/extract-patterns.ts +145 -0
  10. package/cli/selftune/evolution/propose-description.ts +146 -0
  11. package/cli/selftune/evolution/rollback.ts +242 -0
  12. package/cli/selftune/evolution/stopping-criteria.ts +69 -0
  13. package/cli/selftune/evolution/validate-proposal.ts +137 -0
  14. package/cli/selftune/grading/grade-session.ts +459 -0
  15. package/cli/selftune/hooks/prompt-log.ts +52 -0
  16. package/cli/selftune/hooks/session-stop.ts +54 -0
  17. package/cli/selftune/hooks/skill-eval.ts +73 -0
  18. package/cli/selftune/index.ts +104 -0
  19. package/cli/selftune/ingestors/codex-rollout.ts +416 -0
  20. package/cli/selftune/ingestors/codex-wrapper.ts +332 -0
  21. package/cli/selftune/ingestors/opencode-ingest.ts +565 -0
  22. package/cli/selftune/init.ts +297 -0
  23. package/cli/selftune/monitoring/watch.ts +328 -0
  24. package/cli/selftune/observability.ts +255 -0
  25. package/cli/selftune/types.ts +255 -0
  26. package/cli/selftune/utils/jsonl.ts +75 -0
  27. package/cli/selftune/utils/llm-call.ts +192 -0
  28. package/cli/selftune/utils/logging.ts +40 -0
  29. package/cli/selftune/utils/schema-validator.ts +47 -0
  30. package/cli/selftune/utils/seeded-random.ts +31 -0
  31. package/cli/selftune/utils/transcript.ts +260 -0
  32. package/package.json +29 -0
  33. package/skill/SKILL.md +120 -0
  34. package/skill/Workflows/Doctor.md +145 -0
  35. package/skill/Workflows/Evals.md +193 -0
  36. package/skill/Workflows/Evolve.md +159 -0
  37. package/skill/Workflows/Grade.md +157 -0
  38. package/skill/Workflows/Ingest.md +159 -0
  39. package/skill/Workflows/Initialize.md +125 -0
  40. package/skill/Workflows/Rollback.md +131 -0
  41. package/skill/Workflows/Watch.md +128 -0
  42. package/skill/references/grading-methodology.md +176 -0
  43. package/skill/references/invocation-taxonomy.md +144 -0
  44. package/skill/references/logs.md +168 -0
  45. package/skill/settings_snippet.json +41 -0
@@ -0,0 +1,176 @@
1
+ # Grading Methodology Reference
2
+
3
+ How selftune evaluates skill sessions. Used by the `grade` command and
4
+ referenced by evolution workflows to understand quality signals.
5
+
6
+ ---
7
+
8
+ ## 3-Tier Grading Model
9
+
10
+ Every session is graded across three tiers, each answering a different question:
11
+
12
+ | Tier | Question | Example expectation |
13
+ |------|----------|---------------------|
14
+ | **Trigger** | Did the skill fire at all? | `skills_triggered` contains the skill name |
15
+ | **Process** | Did the agent follow the right steps? | SKILL.md was read before main work started |
16
+ | **Quality** | Was the output actually good? | Output file has correct content and structure |
17
+
18
+ A session can pass Trigger but fail Process (skill fired, but steps were wrong),
19
+ or pass Process but fail Quality (steps were right, but output was bad).
20
+
21
+ ---
22
+
23
+ ## Expectation Derivation
24
+
25
+ When the user does not supply explicit expectations, derive reasonable defaults.
26
+ Always include at least one Process and one Quality expectation.
27
+
28
+ ### Default Expectations
29
+
30
+ 1. **SKILL.md was read before main work started** (Process)
31
+ - Evidence: a `Read` tool call with a path ending in `SKILL.md` appears before
32
+ any `Write`, `Edit`, or significant `Bash` command.
33
+
34
+ 2. **No more than 1 error encountered** (Quality)
35
+ - Evidence: `errors_encountered` field in session telemetry is 0 or 1.
36
+
37
+ 3. **Expected output type exists** (Quality)
38
+ - Evidence: the file, command output, or artifact the user asked for is present.
39
+
40
+ 4. **No thrashing** (Process)
41
+ - Evidence: no single bash command or tool call is repeated more than 3 times
42
+ consecutively in the transcript.
43
+
44
+ 5. **Skill steps followed in order** (Process)
45
+ - Evidence: the sequence of tool calls matches the step order in SKILL.md.
46
+
47
+ ---
48
+
49
+ ## Evidence Standards
50
+
51
+ ### What counts as evidence
52
+
53
+ - A specific tool call from the transcript (e.g., `[TOOL:Read] /path/to/SKILL.md`)
54
+ - A bash command and its output (e.g., `Bash output: 'presentation.pptx created'`)
55
+ - A telemetry field value (e.g., `errors_encountered: 0`)
56
+ - A transcript line number and content
57
+
58
+ ### Strictness rules
59
+
60
+ - **A file existing is NOT evidence it has correct content.** Verify content claims
61
+ separately from existence claims.
62
+ - **Absence of evidence IS evidence of absence** for process steps. If the transcript
63
+ does not show SKILL.md being read, the expectation fails.
64
+ - **Cite specific evidence.** Never mark PASS without pointing to a transcript line,
65
+ tool call, or telemetry field.
66
+
67
+ ---
68
+
69
+ ## Claims Extraction
70
+
71
+ After grading explicit expectations, extract 2-4 implicit claims from the transcript.
72
+ Each claim falls into one of three types:
73
+
74
+ | Type | What it captures | Example |
75
+ |------|------------------|---------|
76
+ | **Factual** | A verifiable statement the agent made | "The agent said 12 slides were created" |
77
+ | **Process** | An observed behavior pattern | "The agent read SKILL.md before making any file changes" |
78
+ | **Quality** | An output characteristic | "The output file was named correctly" |
79
+
80
+ For each claim:
81
+ 1. State the claim clearly
82
+ 2. Classify its type
83
+ 3. Mark `verified: true` or `verified: false`
84
+ 4. Cite evidence (or note its absence)
85
+
86
+ ---
87
+
88
+ ## Eval Feedback and Eval Gap Flagging
89
+
90
+ After grading, review each PASSED expectation and ask:
91
+
92
+ > "Would this expectation also pass if the agent produced wrong output?"
93
+
94
+ If yes, flag it in `eval_feedback.suggestions` with a reason. This drives
95
+ eval set improvement over time.
96
+
97
+ ### When to flag
98
+
99
+ - An expectation checks file existence but not content
100
+ - An expectation checks command success but not output correctness
101
+ - An expectation is too generic to catch quality regressions
102
+
103
+ ### When NOT to flag
104
+
105
+ - The expectation is already specific enough
106
+ - The gap is trivial or not worth the eval set complexity
107
+
108
+ Only raise things worth improving. The goal is actionable feedback, not exhaustive nitpicking.
109
+
110
+ ---
111
+
112
+ ## grading.json Schema
113
+
114
+ ```json
115
+ {
116
+ "session_id": "abc123",
117
+ "skill_name": "pptx",
118
+ "transcript_path": "/home/user/.claude/projects/.../abc123.jsonl",
119
+ "graded_at": "2026-02-28T12:00:00Z",
120
+ "expectations": [
121
+ {
122
+ "text": "SKILL.md was read before any file was created",
123
+ "passed": true,
124
+ "evidence": "Transcript line 3: [TOOL:Read] /path/to/SKILL.md"
125
+ },
126
+ {
127
+ "text": "Output file has correct slide count",
128
+ "passed": false,
129
+ "evidence": "Expected 12 slides, found 8 in bash output"
130
+ }
131
+ ],
132
+ "summary": {
133
+ "passed": 1,
134
+ "failed": 1,
135
+ "total": 2,
136
+ "pass_rate": 0.5
137
+ },
138
+ "execution_metrics": {
139
+ "tool_calls": { "Read": 2, "Write": 1, "Bash": 3 },
140
+ "total_tool_calls": 6,
141
+ "total_steps": 4,
142
+ "bash_commands_run": 3,
143
+ "errors_encountered": 0,
144
+ "skills_triggered": ["pptx"],
145
+ "transcript_chars": 4200
146
+ },
147
+ "claims": [
148
+ {
149
+ "claim": "Output was a .pptx file",
150
+ "type": "factual",
151
+ "verified": true,
152
+ "evidence": "Bash output: 'presentation.pptx created'"
153
+ }
154
+ ],
155
+ "eval_feedback": {
156
+ "suggestions": [
157
+ { "reason": "No expectation checks slide content" }
158
+ ],
159
+ "overall": "Process coverage good; add output quality assertions."
160
+ }
161
+ }
162
+ ```
163
+
164
+ ### Field descriptions
165
+
166
+ | Field | Type | Description |
167
+ |-------|------|-------------|
168
+ | `session_id` | string | From session telemetry |
169
+ | `skill_name` | string | The skill being graded |
170
+ | `transcript_path` | string | Path to the session transcript JSONL |
171
+ | `graded_at` | string | ISO 8601 timestamp of grading |
172
+ | `expectations[]` | array | Each expectation with verdict and evidence |
173
+ | `summary` | object | Aggregate pass/fail counts and rate |
174
+ | `execution_metrics` | object | Raw metrics from session telemetry |
175
+ | `claims[]` | array | Implicit claims extracted from transcript |
176
+ | `eval_feedback` | object | Suggestions for improving the eval set |
@@ -0,0 +1,144 @@
1
+ # Invocation Taxonomy Reference
2
+
3
+ How selftune classifies the ways users trigger (or should trigger) a skill.
4
+ Used by the `evals` command and referenced by evolution workflows to understand
5
+ coverage gaps.
6
+
7
+ ---
8
+
9
+ ## The 4 Invocation Types
10
+
11
+ Every query in an eval set is annotated with one of four invocation types.
12
+ Three are positive (should trigger the skill), one is negative (should not).
13
+
14
+ ### Explicit
15
+
16
+ The user names the skill directly.
17
+
18
+ > "Use the pptx skill to make slides"
19
+ > "Run the selftune grade command"
20
+ > "Open the reins audit tool"
21
+
22
+ **What it means:** The user knows the skill exists and asks for it by name.
23
+ This is the easiest type to catch. If a skill misses explicit invocations,
24
+ something is fundamentally broken.
25
+
26
+ ### Implicit
27
+
28
+ The user describes the task without naming the skill.
29
+
30
+ > "Make me a slide deck"
31
+ > "Grade my last session"
32
+ > "Score this project's readiness"
33
+
34
+ **What it means:** The user knows what they want but not which skill does it.
35
+ The skill description's trigger phrases must cover these natural-language
36
+ variations. Missing implicit invocations means the description is too narrow.
37
+
38
+ ### Contextual
39
+
40
+ The user describes the task with domain-specific noise and context.
41
+
42
+ > "I need slides for the Q3 board meeting with revenue charts"
43
+ > "After that deploy, check if the skill is still working"
44
+ > "The last codex run felt off, can you evaluate it"
45
+
46
+ **What it means:** The user is thinking about their domain, not about skills.
47
+ The query contains the intent buried in context. Missing contextual invocations
48
+ means the skill description lacks real-world vocabulary.
49
+
50
+ ### Negative
51
+
52
+ The query should NOT trigger the skill.
53
+
54
+ > "What format should I use for a presentation?"
55
+ > "Explain what eval means in machine learning"
56
+ > "How do I write a grading rubric for my class"
57
+
58
+ **What it means:** The query contains keywords that might confuse a matcher
59
+ (e.g., "presentation", "eval", "grading") but the intent does not match
60
+ the skill's purpose. Negative examples prevent false positives.
61
+
62
+ ---
63
+
64
+ ## What "Healthy" Looks Like
65
+
66
+ A healthy skill catches all three positive invocation types:
67
+
68
+ | Type | Healthy | Unhealthy |
69
+ |------|---------|-----------|
70
+ | Explicit | Catches all | Misses some (broken) |
71
+ | Implicit | Catches most | Only catches explicit (too rigid) |
72
+ | Contextual | Catches many | Only catches explicit + some implicit (needs evolution) |
73
+ | Negative | Rejects all | False positives on keyword overlap |
74
+
75
+ ### The Coverage Spectrum
76
+
77
+ ```
78
+ Explicit only --> Skill is too rigid, users must babysit
79
+ + Implicit --> Skill works for informed users
80
+ + Contextual --> Skill works naturally in real workflows
81
+ - Negative clean --> No false positives
82
+ ```
83
+
84
+ A skill that only catches explicit invocations is forcing users to know its
85
+ name and syntax. That defeats the purpose of skill-based routing.
86
+
87
+ ---
88
+
89
+ ## Connection to Evolution
90
+
91
+ The invocation taxonomy directly drives the evolution feedback loop:
92
+
93
+ ### Missed Implicit = Undertriggering
94
+
95
+ When `evals` shows implicit queries that don't trigger the skill, the
96
+ description is too narrow. The `evolve` command will:
97
+ 1. Extract the missed implicit patterns
98
+ 2. Propose description changes that cover them
99
+ 3. Validate that existing triggers still work
100
+
101
+ ### Missed Contextual = Under-evolved
102
+
103
+ When implicit queries trigger but contextual ones don't, the skill needs
104
+ richer vocabulary. Evolution should add domain-specific language to the
105
+ description's trigger phrases.
106
+
107
+ ### False-Positive Negatives = Overtriggering
108
+
109
+ When negative queries trigger the skill, the description is too broad.
110
+ Evolution should tighten the scope or add "Don't Use When" clauses.
111
+
112
+ ### The Evolution Priority
113
+
114
+ Fix in this order:
115
+ 1. **Missed explicit** -- broken, fix immediately
116
+ 2. **Missed implicit** -- undertriggering, evolve next
117
+ 3. **Missed contextual** -- under-evolved, evolve when implicit is clean
118
+ 4. **False-positive negatives** -- overtriggering, tighten after broadening
119
+
120
+ ---
121
+
122
+ ## Eval Set Structure
123
+
124
+ Each entry in a generated eval set looks like:
125
+
126
+ ```json
127
+ {
128
+ "id": 1,
129
+ "query": "Make me a slide deck for the Q3 board meeting",
130
+ "expected": true,
131
+ "invocation_type": "contextual",
132
+ "skill_name": "pptx",
133
+ "source_session": "abc123"
134
+ }
135
+ ```
136
+
137
+ | Field | Description |
138
+ |-------|-------------|
139
+ | `id` | Sequential identifier |
140
+ | `query` | The user's original query text |
141
+ | `expected` | `true` = should trigger, `false` = should not |
142
+ | `invocation_type` | One of: `explicit`, `implicit`, `contextual`, `negative` |
143
+ | `skill_name` | The skill this eval targets |
144
+ | `source_session` | Session ID the query came from (if positive) |
@@ -0,0 +1,168 @@
1
+ # Log Format Reference
2
+
3
+ selftune writes to four log files. This reference describes each format
4
+ in detail for the skill to use when parsing sessions and audit trails.
5
+
6
+ ---
7
+
8
+ ## ~/.claude/session_telemetry_log.jsonl
9
+
10
+ One JSON record per line. Each record is one completed agent session.
11
+
12
+ ```json
13
+ {
14
+ "timestamp": "2026-02-28T10:00:00+00:00",
15
+ "session_id": "abc123",
16
+ "source": "claude_code",
17
+ "cwd": "/home/user/projects/myapp",
18
+ "transcript_path": "/home/user/.claude/projects/.../abc123.jsonl",
19
+ "last_user_query": "Make me a slide deck for the board meeting",
20
+ "tool_calls": {
21
+ "Read": 2,
22
+ "Write": 1,
23
+ "Bash": 3,
24
+ "Edit": 0
25
+ },
26
+ "total_tool_calls": 6,
27
+ "bash_commands": [
28
+ "pip install python-pptx --break-system-packages",
29
+ "python3 /tmp/create_pptx.py"
30
+ ],
31
+ "skills_triggered": ["pptx"],
32
+ "assistant_turns": 5,
33
+ "errors_encountered": 0,
34
+ "transcript_chars": 4200
35
+ }
36
+ ```
37
+
38
+ **source values:**
39
+ - `claude_code` — written by session-stop.ts (Stop hook)
40
+ - `codex` — written by ingestors/codex-wrapper.ts
41
+ - `codex_rollout` — written by ingestors/codex-rollout.ts
42
+ - `opencode` — written by ingestors/opencode-ingest.ts
43
+ - `opencode_json` — legacy OpenCode JSON files
44
+
45
+ ---
46
+
47
+ ## ~/.claude/skill_usage_log.jsonl
48
+
49
+ One record per skill trigger event. Populated by skill-eval.ts (PostToolUse hook).
50
+
51
+ ```json
52
+ {
53
+ "timestamp": "2026-02-28T10:00:00+00:00",
54
+ "session_id": "abc123",
55
+ "skill_name": "pptx",
56
+ "skill_path": "/mnt/skills/public/pptx/SKILL.md",
57
+ "query": "Make me a slide deck for the board meeting",
58
+ "triggered": true,
59
+ "source": "claude_code"
60
+ }
61
+ ```
62
+
63
+ ---
64
+
65
+ ## ~/.claude/all_queries_log.jsonl
66
+
67
+ Every user query, whether or not it triggered a skill. Populated by prompt-log.ts (UserPromptSubmit hook).
68
+
69
+ ```json
70
+ {
71
+ "timestamp": "2026-02-28T10:00:00+00:00",
72
+ "session_id": "abc123",
73
+ "query": "Make me a slide deck for the board meeting",
74
+ "source": "claude_code"
75
+ }
76
+ ```
77
+
78
+ ---
79
+
80
+ ## ~/.claude/evolution_audit_log.jsonl
81
+
82
+ One record per evolution action. Written by the evolution and rollback modules.
83
+
84
+ ```json
85
+ {
86
+ "timestamp": "2026-02-28T12:00:00+00:00",
87
+ "proposal_id": "evolve-pptx-1709125200000",
88
+ "action": "created",
89
+ "details": "original_description: Create PowerPoint presentations from user-provided content...",
90
+ "eval_snapshot": {
91
+ "total": 50,
92
+ "passed": 35,
93
+ "failed": 15,
94
+ "pass_rate": 0.70
95
+ }
96
+ }
97
+ ```
98
+
99
+ **action values:**
100
+ - `created` — New evolution proposal generated. `details` starts with `original_description:` prefix preserving the pre-evolution SKILL.md content.
101
+ - `validated` — Proposal tested against eval set. `eval_snapshot` contains before/after pass rates.
102
+ - `deployed` — Updated SKILL.md written to disk. `eval_snapshot` contains final pass rates.
103
+ - `rolled_back` — SKILL.md restored to pre-evolution state (from `.bak` file or audit trail).
104
+
105
+ **Required fields:** `timestamp`, `proposal_id`, `action`
106
+
107
+ **Optional fields:** `details`, `eval_snapshot`
108
+
109
+ ---
110
+
111
+ ## Claude Code Transcript Format (~/.claude/projects/.../session.jsonl)
112
+
113
+ One JSON object per line. Two observed variants:
114
+
115
+ **Variant A (nested, current):**
116
+ ```json
117
+ {"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}}
118
+ {"type": "assistant", "message": {"role": "assistant", "content": [
119
+ {"type": "text", "text": "I'll read the skill first."},
120
+ {"type": "tool_use", "name": "Read", "input": {"file_path": "/path/to/SKILL.md"}}
121
+ ]}}
122
+ ```
123
+
124
+ **Variant B (flat, older):**
125
+ ```json
126
+ {"role": "user", "content": "..."}
127
+ {"role": "assistant", "content": [{"type": "tool_use", "name": "Bash", "input": {"command": "..."}}]}
128
+ ```
129
+
130
+ Tool use always appears in assistant content blocks as `{"type": "tool_use", "name": "ToolName", "input": {...}}`.
131
+
132
+ Skill reads appear as `Read` tool calls where `input.file_path` ends in `SKILL.md`.
133
+
134
+ ---
135
+
136
+ ## Codex Rollout Format ($CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl)
137
+
138
+ ```json
139
+ {"type": "thread.started", "thread_id": "..."}
140
+ {"type": "turn.started"}
141
+ {"type": "item.completed", "item": {"id": "i0", "item_type": "reasoning", "text": "I should use the setup-demo-app skill"}}
142
+ {"type": "item.completed", "item": {"id": "i1", "item_type": "command_execution", "command": "npm install", "exit_code": 0}}
143
+ {"type": "item.completed", "item": {"id": "i2", "item_type": "file_change", "changes": [{"path": "..."}]}}
144
+ {"type": "item.completed", "item": {"id": "i3", "item_type": "agent_message", "text": "Done!"}}
145
+ {"type": "turn.completed", "usage": {"input_tokens": 1200, "output_tokens": 450}}
146
+ ```
147
+
148
+ Item types: `reasoning`, `command_execution`, `file_change`, `agent_message`,
149
+ `mcp_tool_call`, `web_search`, `todo_list`, `error`
150
+
151
+ ---
152
+
153
+ ## OpenCode Message Format (in SQLite message.content column)
154
+
155
+ Content is a JSON string containing an array of blocks. Anthropic format:
156
+
157
+ ```json
158
+ [
159
+ {"type": "text", "text": "I'll create the presentation."},
160
+ {"type": "tool_use", "name": "Bash", "input": {"command": "pip install python-pptx"}},
161
+ {"type": "tool_use", "name": "Read", "input": {"file_path": "/skills/pptx/SKILL.md"}}
162
+ ]
163
+ ```
164
+
165
+ Tool results appear in subsequent user messages:
166
+ ```json
167
+ [{"type": "tool_result", "tool_use_id": "...", "content": "OK", "is_error": false}]
168
+ ```
@@ -0,0 +1,41 @@
1
+ {
2
+ "_readme": "Merge the 'hooks' block below into your ~/.claude/settings.json",
3
+ "_readme2": "Replace /PATH/TO/ with the actual directory where you saved the scripts",
4
+
5
+ "hooks": {
6
+ "UserPromptSubmit": [
7
+ {
8
+ "hooks": [
9
+ {
10
+ "type": "command",
11
+ "command": "bun run /PATH/TO/cli/selftune/hooks/prompt-log.ts",
12
+ "timeout": 5
13
+ }
14
+ ]
15
+ }
16
+ ],
17
+ "PostToolUse": [
18
+ {
19
+ "matcher": "Read",
20
+ "hooks": [
21
+ {
22
+ "type": "command",
23
+ "command": "bun run /PATH/TO/cli/selftune/hooks/skill-eval.ts",
24
+ "timeout": 5
25
+ }
26
+ ]
27
+ }
28
+ ],
29
+ "Stop": [
30
+ {
31
+ "hooks": [
32
+ {
33
+ "type": "command",
34
+ "command": "bun run /PATH/TO/cli/selftune/hooks/session-stop.ts",
35
+ "timeout": 15
36
+ }
37
+ ]
38
+ }
39
+ ]
40
+ }
41
+ }