selftune 0.1.2 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (89) hide show
  1. package/.claude/agents/diagnosis-analyst.md +146 -0
  2. package/.claude/agents/evolution-reviewer.md +167 -0
  3. package/.claude/agents/integration-guide.md +200 -0
  4. package/.claude/agents/pattern-analyst.md +147 -0
  5. package/CHANGELOG.md +38 -1
  6. package/README.md +96 -256
  7. package/assets/BeforeAfter.gif +0 -0
  8. package/assets/FeedbackLoop.gif +0 -0
  9. package/assets/logo.svg +9 -0
  10. package/assets/skill-health-badge.svg +20 -0
  11. package/cli/selftune/activation-rules.ts +171 -0
  12. package/cli/selftune/badge/badge-data.ts +108 -0
  13. package/cli/selftune/badge/badge-svg.ts +212 -0
  14. package/cli/selftune/badge/badge.ts +103 -0
  15. package/cli/selftune/constants.ts +75 -1
  16. package/cli/selftune/contribute/bundle.ts +314 -0
  17. package/cli/selftune/contribute/contribute.ts +214 -0
  18. package/cli/selftune/contribute/sanitize.ts +162 -0
  19. package/cli/selftune/cron/setup.ts +266 -0
  20. package/cli/selftune/dashboard-server.ts +582 -0
  21. package/cli/selftune/dashboard.ts +31 -12
  22. package/cli/selftune/eval/baseline.ts +247 -0
  23. package/cli/selftune/eval/composability.ts +117 -0
  24. package/cli/selftune/eval/generate-unit-tests.ts +143 -0
  25. package/cli/selftune/eval/hooks-to-evals.ts +68 -2
  26. package/cli/selftune/eval/import-skillsbench.ts +221 -0
  27. package/cli/selftune/eval/synthetic-evals.ts +172 -0
  28. package/cli/selftune/eval/unit-test-cli.ts +152 -0
  29. package/cli/selftune/eval/unit-test.ts +196 -0
  30. package/cli/selftune/evolution/deploy-proposal.ts +142 -1
  31. package/cli/selftune/evolution/evolve-body.ts +492 -0
  32. package/cli/selftune/evolution/evolve.ts +479 -104
  33. package/cli/selftune/evolution/extract-patterns.ts +32 -1
  34. package/cli/selftune/evolution/pareto.ts +314 -0
  35. package/cli/selftune/evolution/propose-body.ts +171 -0
  36. package/cli/selftune/evolution/propose-description.ts +100 -2
  37. package/cli/selftune/evolution/propose-routing.ts +166 -0
  38. package/cli/selftune/evolution/refine-body.ts +141 -0
  39. package/cli/selftune/evolution/rollback.ts +20 -3
  40. package/cli/selftune/evolution/validate-body.ts +254 -0
  41. package/cli/selftune/evolution/validate-proposal.ts +257 -35
  42. package/cli/selftune/evolution/validate-routing.ts +177 -0
  43. package/cli/selftune/grading/grade-session.ts +145 -19
  44. package/cli/selftune/grading/pre-gates.ts +104 -0
  45. package/cli/selftune/hooks/auto-activate.ts +185 -0
  46. package/cli/selftune/hooks/evolution-guard.ts +165 -0
  47. package/cli/selftune/hooks/skill-change-guard.ts +112 -0
  48. package/cli/selftune/index.ts +88 -0
  49. package/cli/selftune/ingestors/claude-replay.ts +351 -0
  50. package/cli/selftune/ingestors/codex-rollout.ts +1 -1
  51. package/cli/selftune/ingestors/openclaw-ingest.ts +440 -0
  52. package/cli/selftune/ingestors/opencode-ingest.ts +2 -2
  53. package/cli/selftune/init.ts +168 -5
  54. package/cli/selftune/last.ts +2 -2
  55. package/cli/selftune/memory/writer.ts +447 -0
  56. package/cli/selftune/monitoring/watch.ts +25 -2
  57. package/cli/selftune/status.ts +18 -15
  58. package/cli/selftune/types.ts +377 -5
  59. package/cli/selftune/utils/frontmatter.ts +217 -0
  60. package/cli/selftune/utils/llm-call.ts +29 -3
  61. package/cli/selftune/utils/transcript.ts +35 -0
  62. package/cli/selftune/utils/trigger-check.ts +89 -0
  63. package/cli/selftune/utils/tui.ts +156 -0
  64. package/dashboard/index.html +585 -19
  65. package/package.json +17 -6
  66. package/skill/SKILL.md +127 -10
  67. package/skill/Workflows/AutoActivation.md +144 -0
  68. package/skill/Workflows/Badge.md +118 -0
  69. package/skill/Workflows/Baseline.md +121 -0
  70. package/skill/Workflows/Composability.md +100 -0
  71. package/skill/Workflows/Contribute.md +91 -0
  72. package/skill/Workflows/Cron.md +155 -0
  73. package/skill/Workflows/Dashboard.md +203 -0
  74. package/skill/Workflows/Doctor.md +37 -1
  75. package/skill/Workflows/Evals.md +73 -5
  76. package/skill/Workflows/EvolutionMemory.md +152 -0
  77. package/skill/Workflows/Evolve.md +111 -6
  78. package/skill/Workflows/EvolveBody.md +159 -0
  79. package/skill/Workflows/ImportSkillsBench.md +111 -0
  80. package/skill/Workflows/Ingest.md +129 -15
  81. package/skill/Workflows/Initialize.md +58 -3
  82. package/skill/Workflows/Replay.md +70 -0
  83. package/skill/Workflows/Rollback.md +20 -1
  84. package/skill/Workflows/UnitTest.md +138 -0
  85. package/skill/Workflows/Watch.md +22 -0
  86. package/skill/settings_snippet.json +23 -0
  87. package/templates/activation-rules-default.json +27 -0
  88. package/templates/multi-skill-settings.json +64 -0
  89. package/templates/single-skill-settings.json +58 -0
@@ -0,0 +1,111 @@
1
+ # selftune Import SkillsBench Workflow
2
+
3
+ Import evaluation tasks from the SkillsBench corpus (87 real-world agent
4
+ benchmarks) and convert them to selftune eval entries. This enriches
5
+ your skill's eval set with externally validated test cases.
6
+
7
+ ## Default Command
8
+
9
+ ```bash
10
+ selftune import-skillsbench --dir <path> --skill <name> --output <path> [options]
11
+ ```
12
+
13
+ ## Options
14
+
15
+ | Flag | Description | Default |
16
+ |------|-------------|---------|
17
+ | `--dir <path>` | Path to SkillsBench tasks directory | Required |
18
+ | `--skill <name>` | Target skill to match tasks against | Required |
19
+ | `--output <path>` | Output eval set JSON file | Required |
20
+ | `--match-strategy <type>` | Matching strategy: `exact` or `fuzzy` | `exact` |
21
+
22
+ ## Match Strategies
23
+
24
+ ### `exact`
25
+
26
+ Matches tasks where `expected_skill` in `task.toml` exactly matches the
27
+ target skill name. Precise but may miss relevant tasks.
28
+
29
+ ### `fuzzy`
30
+
31
+ Uses keyword overlap between the task's category/tags and the skill name.
32
+ Casts a wider net but may include marginally relevant tasks. Review the
33
+ output and remove false matches.
34
+
35
+ ## SkillsBench Directory Structure
36
+
37
+ The importer expects this layout:
38
+
39
+ ```
40
+ tasks/
41
+ ├── task-001/
42
+ │ ├── instruction.md # Task description (used as query)
43
+ │ └── task.toml # Metadata (difficulty, category, tags, expected_skill)
44
+ ├── task-002/
45
+ │ ├── instruction.md
46
+ │ └── task.toml
47
+ └── ...
48
+ ```
49
+
50
+ ### `task.toml` Format
51
+
52
+ ```toml
53
+ difficulty = "medium"
54
+ category = "research"
55
+ tags = ["web-search", "analysis", "summarization"]
56
+ expected_skill = "Research"
57
+ expected_tools = ["WebSearch", "Read"]
58
+ ```
59
+
60
+ All fields are optional. Tasks without `task.toml` use default values.
61
+
62
+ ## Output Format
63
+
64
+ Standard selftune eval entries:
65
+
66
+ ```json
67
+ [
68
+ {
69
+ "id": 1,
70
+ "query": "Find and summarize the latest papers on transformer architectures",
71
+ "expected": true,
72
+ "invocation_type": "implicit",
73
+ "skill_name": "Research",
74
+ "source_session": null,
75
+ "source": "skillsbench"
76
+ }
77
+ ]
78
+ ```
79
+
80
+ ## Steps
81
+
82
+ ### 1. Obtain SkillsBench Corpus
83
+
84
+ Clone or download the SkillsBench repository containing the task directory.
85
+
86
+ ### 2. Import Tasks
87
+
88
+ ```bash
89
+ selftune import-skillsbench --dir /path/to/skillsbench/tasks --skill Research --output evals-bench.json
90
+ ```
91
+
92
+ ### 3. Review Output
93
+
94
+ Inspect the generated eval entries. Remove any that don't match your skill's
95
+ intended scope. Adjust match strategy if needed.
96
+
97
+ ### 4. Merge with Existing Evals
98
+
99
+ Combine imported entries with your existing eval set for a richer validation
100
+ corpus. Use the merged set with `selftune evolve --eval-set merged-evals.json`.
101
+
102
+ ## Common Patterns
103
+
104
+ **"Import SkillsBench tasks for Research"**
105
+ > `selftune import-skillsbench --dir /path/tasks --skill Research --output bench-evals.json`
106
+
107
+ **"Use fuzzy matching for broader coverage"**
108
+ > `selftune import-skillsbench --dir /path/tasks --skill pptx --output bench-evals.json --match-strategy fuzzy`
109
+
110
+ **"Enrich my eval set with external benchmarks"**
111
+ > Import with `selftune import-skillsbench`, then pass the output to `selftune evolve --eval-set`.
@@ -1,19 +1,69 @@
1
1
  # selftune Ingest Workflow
2
2
 
3
- Import sessions from non-Claude-Code agent platforms into the shared
4
- selftune log format. Covers three sub-commands: `ingest-codex`,
5
- `ingest-opencode`, and `wrap-codex`.
3
+ Import sessions from agent platforms into the shared selftune log format.
4
+ Covers five sub-commands: `replay`, `ingest-codex`, `ingest-opencode`,
5
+ `ingest-openclaw`, and `wrap-codex`.
6
6
 
7
7
  ## When to Use Each
8
8
 
9
9
  | Sub-command | Platform | Mode | When |
10
10
  |-------------|----------|------|------|
11
+ | `replay` | Claude Code | Batch | Backfill logs from existing Claude Code transcripts |
11
12
  | `ingest-codex` | Codex | Batch | Import existing Codex rollout logs |
12
13
  | `ingest-opencode` | OpenCode | Batch | Import existing OpenCode sessions |
14
+ | `ingest-openclaw` | OpenClaw | Batch | Import existing OpenClaw agent sessions |
13
15
  | `wrap-codex` | Codex | Real-time | Wrap `codex exec` to capture telemetry live |
14
16
 
15
17
  ---
16
18
 
19
+ ## replay
20
+
21
+ Batch ingest existing Claude Code session transcripts into the shared JSONL schema.
22
+
23
+ ### Default Command
24
+
25
+ ```bash
26
+ selftune replay
27
+ ```
28
+
29
+ ### Options
30
+
31
+ | Flag | Description |
32
+ |------|-------------|
33
+ | `--since <date>` | Only ingest sessions modified after this date (e.g., `2026-01-01`) |
34
+ | `--dry-run` | Show what would be ingested without writing to logs |
35
+ | `--force` | Re-ingest all sessions, ignoring the marker file |
36
+ | `--verbose` | Show per-file progress during ingestion |
37
+ | `--projects-dir <path>` | Override default `~/.claude/projects/` directory |
38
+
39
+ ### Source
40
+
41
+ Reads from `~/.claude/projects/<hash>/<session-id>.jsonl`. These are the
42
+ transcript files Claude Code automatically saves for every session.
43
+
44
+ ### Output
45
+
46
+ Writes to:
47
+ - `~/.claude/all_queries_log.jsonl` -- extracted user queries (one record per user query, not just the last)
48
+ - `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "claude_code_replay"`
49
+ - `~/.claude/skill_usage_log.jsonl` -- skill triggers with `source: "claude_code_replay"`
50
+
51
+ ### Steps
52
+
53
+ 1. Run `selftune replay --dry-run` to preview what would be ingested
54
+ 2. Run `selftune replay` to ingest all sessions
55
+ 3. Run `selftune doctor` to confirm logs are healthy
56
+ 4. Run `selftune evals --list-skills` to see if the ingested sessions appear
57
+
58
+ ### Notes
59
+
60
+ - Idempotent: uses a marker file (`~/.claude/claude_code_ingested_sessions.json`) to track
61
+ which transcripts have already been ingested. Safe to run repeatedly.
62
+ - Extracts ALL user queries per session, not just the last one.
63
+ - Filters out system messages, short queries (<4 chars), and queries matching `SKIP_PREFIXES`.
64
+
65
+ ---
66
+
17
67
  ## ingest-codex
18
68
 
19
69
  Batch ingest Codex rollout logs into the shared JSONL schema.
@@ -42,9 +92,9 @@ Writes to:
42
92
  ### Steps
43
93
 
44
94
  1. Verify `$CODEX_HOME/sessions/` directory exists and contains session files
45
- 2. Run `ingest-codex`
95
+ 2. Run `selftune ingest-codex`
46
96
  3. Verify entries were written by checking log file line counts
47
- 4. Run `doctor` to confirm logs are healthy
97
+ 4. Run `selftune doctor` to confirm logs are healthy
48
98
 
49
99
  ---
50
100
 
@@ -78,9 +128,61 @@ Writes to:
78
128
  ### Steps
79
129
 
80
130
  1. Verify the OpenCode database exists at the expected path
81
- 2. Run `ingest-opencode`
131
+ 2. Run `selftune ingest-opencode`
82
132
  3. Verify entries were written by checking log file line counts
83
- 4. Run `doctor` to confirm logs are healthy
133
+ 4. Run `selftune doctor` to confirm logs are healthy
134
+
135
+ ---
136
+
137
+ ## ingest-openclaw
138
+
139
+ Batch ingest OpenClaw agent session histories into the shared JSONL schema.
140
+ Supports multiple agents and auto-discovers session files across all agent directories.
141
+
142
+ ### Default Command
143
+
144
+ ```bash
145
+ selftune ingest-openclaw
146
+ ```
147
+
148
+ ### Options
149
+
150
+ | Flag | Description |
151
+ |------|-------------|
152
+ | `--agents-dir <path>` | Override default `~/.openclaw/agents/` directory |
153
+ | `--since <date>` | Only ingest sessions modified after this date (e.g., `2026-01-01`) |
154
+ | `--dry-run` | Show what would be ingested without writing to logs |
155
+ | `--force` | Re-ingest all sessions, ignoring the marker file |
156
+ | `--verbose` / `-v` | Show per-session progress during ingestion |
157
+
158
+ ### Source
159
+
160
+ Reads from `~/.openclaw/agents/<agentId>/sessions/*.jsonl`. Each JSONL file contains:
161
+ - Line 1 (session header): `{"type":"session","version":5,"id":"<uuid>","timestamp":"<iso>","cwd":"<path>"}`
162
+ - Line 2+ (messages): `{"role":"user|assistant|toolResult","content":[...],"timestamp":<ms>}`
163
+
164
+ ### Output
165
+
166
+ Writes to:
167
+ - `~/.claude/all_queries_log.jsonl` -- extracted user queries
168
+ - `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "openclaw"`
169
+ - `~/.claude/skill_usage_log.jsonl` -- skill triggers with `source: "openclaw"`
170
+
171
+ ### Steps
172
+
173
+ 1. Run `selftune ingest-openclaw --dry-run` to preview what would be ingested
174
+ 2. Run `selftune ingest-openclaw` to ingest all sessions
175
+ 3. Run `selftune doctor` to confirm logs are healthy
176
+ 4. Run `selftune evals --list-skills` to see if the ingested sessions appear
177
+
178
+ ### Notes
179
+
180
+ - Idempotent: uses a marker file to track which sessions have already been ingested.
181
+ Safe to run repeatedly. Use `--force` to re-ingest everything.
182
+ - Skill detection heuristic: identifies skills by checking for `SKILL.md` file reads in
183
+ tool calls and by matching known skill names in assistant text content.
184
+ - Multi-agent support: scans all agent directories under the agents root, ingesting
185
+ sessions from every agent found.
84
186
 
85
187
  ---
86
188
 
@@ -117,25 +219,37 @@ stream for telemetry; it does not modify Codex behavior.
117
219
  1. Build the wrap-codex command with the desired Codex arguments
118
220
  2. Run the command (replaces `codex exec` in your workflow)
119
221
  3. Session telemetry is captured automatically
120
- 4. Verify with `doctor` after first use
222
+ 4. Verify with `selftune doctor` after first use
121
223
 
122
224
  ---
123
225
 
124
226
  ## Common Patterns
125
227
 
228
+ **"Backfill Claude Code sessions"**
229
+ > Run `selftune replay`. No options needed. Reads from `~/.claude/projects/`.
230
+
231
+ **"Replay only recent Claude Code sessions"**
232
+ > Run `selftune replay --since 2026-02-01` with an appropriate date.
233
+
126
234
  **"Ingest codex logs"**
127
- > Run `ingest-codex`. No options needed. Reads from `$CODEX_HOME/sessions/`.
235
+ > Run `selftune ingest-codex`. No options needed. Reads from `$CODEX_HOME/sessions/`.
128
236
 
129
237
  **"Import opencode sessions"**
130
- > Run `ingest-opencode`. Reads from the SQLite database automatically.
238
+ > Run `selftune ingest-opencode`. Reads from the SQLite database automatically.
239
+
240
+ **"Ingest OpenClaw sessions"**
241
+ > Run `selftune ingest-openclaw`. Reads from `~/.openclaw/agents/` automatically.
242
+
243
+ **"Import only recent OpenClaw sessions"**
244
+ > Run `selftune ingest-openclaw --since 2026-02-01` with an appropriate date.
131
245
 
132
246
  **"Run codex through selftune"**
133
- > Use `wrap-codex -- <codex args>` instead of `codex exec <args>` directly.
247
+ > Use `selftune wrap-codex -- <codex args>` instead of `codex exec <args>` directly.
134
248
 
135
249
  **"Batch ingest vs real-time"**
136
- > Use `ingest-codex` or `ingest-opencode` for historical sessions.
137
- > Use `wrap-codex` for ongoing sessions. Both produce the same log format.
250
+ > Use `selftune ingest-codex` or `selftune ingest-opencode` for historical sessions.
251
+ > Use `selftune wrap-codex` for ongoing sessions. Both produce the same log format.
138
252
 
139
253
  **"How do I know it worked?"**
140
- > Run `doctor` after ingestion. Check that log files exist and are parseable.
141
- > Run `evals --list-skills` to see if the ingested sessions appear.
254
+ > Run `selftune doctor` after ingestion. Check that log files exist and are parseable.
255
+ > Run `selftune evals --list-skills` to see if the ingested sessions appear.
@@ -19,6 +19,7 @@ selftune init [--agent <type>] [--cli-path <path>] [--force]
19
19
  | Flag | Description | Default |
20
20
  |------|-------------|---------|
21
21
  | `--agent <type>` | Agent platform: `claude`, `codex`, `opencode` | Auto-detected |
22
+ | `--cli-path <path>` | Override auto-detected CLI entry-point path | Auto-detected |
22
23
  | `--force` | Reinitialize even if config already exists | Off |
23
24
 
24
25
  ## Output Format
@@ -68,7 +69,7 @@ cat ~/.selftune/config.json 2>/dev/null
68
69
  ```
69
70
 
70
71
  If the file exists and is valid JSON, selftune is already initialized.
71
- Skip to Step 5 (verify with doctor) unless the user wants to reinitialize.
72
+ Skip to Step 8 (verify with doctor) unless the user wants to reinitialize.
72
73
 
73
74
  ### 3. Run Init
74
75
 
@@ -79,12 +80,15 @@ selftune init
79
80
  ### 4. Install Hooks (Claude Code)
80
81
 
81
82
  If `init` reports hooks are not installed, merge the entries from
82
- `skill/settings_snippet.json` into `~/.claude/settings.json`. Three hooks
83
+ `skill/settings_snippet.json` into `~/.claude/settings.json`. Six hooks
83
84
  are required:
84
85
 
85
86
  | Hook | Script | Purpose |
86
87
  |------|--------|---------|
87
88
  | `UserPromptSubmit` | `hooks/prompt-log.ts` | Log every user query |
89
+ | `UserPromptSubmit` | `hooks/auto-activate.ts` | Suggest skills before prompt processing |
90
+ | `PreToolUse` (Write/Edit) | `hooks/skill-change-guard.ts` | Detect uncontrolled skill edits |
91
+ | `PreToolUse` (Write/Edit) | `hooks/evolution-guard.ts` | Block SKILL.md edits on monitored skills |
88
92
  | `PostToolUse` (Read) | `hooks/skill-eval.ts` | Track skill triggers |
89
93
  | `Stop` | `hooks/session-stop.ts` | Capture session telemetry |
90
94
 
@@ -99,7 +103,48 @@ The hooks directory is at `dirname(cli_path)/hooks/`.
99
103
  - Use `selftune ingest-opencode` to import sessions from the SQLite database
100
104
  - See `Workflows/Ingest.md` for details
101
105
 
102
- ### 5. Verify with Doctor
106
+ ### 5. Initialize Memory Directory
107
+
108
+ Create the memory directory if it does not exist:
109
+
110
+ ```bash
111
+ mkdir -p ~/.selftune/memory
112
+ ```
113
+
114
+ The memory system stores three files at `~/.selftune/memory/`:
115
+ - `context.md` -- active evolution state and session context
116
+ - `decisions.md` -- evolution decisions and rollback history
117
+ - `plan.md` -- current priorities and evolution strategy
118
+
119
+ These files are created automatically by the memory writer during evolve,
120
+ watch, and rollback workflows. The directory just needs to exist.
121
+
122
+ ### 6. Set Up Activation Rules
123
+
124
+ Copy the default activation rules template:
125
+
126
+ ```bash
127
+ cp templates/activation-rules-default.json ~/.selftune/activation-rules.json
128
+ ```
129
+
130
+ The activation rules file configures auto-activation behavior -- which skills
131
+ get suggested and under what conditions. Edit `~/.selftune/activation-rules.json`
132
+ to customize thresholds and skill mappings for your project.
133
+
134
+ ### 7. Verify Agent Availability
135
+
136
+ Check that the specialized agent files are present:
137
+
138
+ ```bash
139
+ ls .claude/agents/
140
+ ```
141
+
142
+ Expected agents: `diagnosis-analyst.md`, `pattern-analyst.md`,
143
+ `evolution-reviewer.md`, `integration-guide.md`. These are used by evolve
144
+ and doctor workflows for deeper analysis. If missing, copy them from the
145
+ selftune repository's `.claude/agents/` directory.
146
+
147
+ ### 8. Verify with Doctor
103
148
 
104
149
  ```bash
105
150
  selftune doctor
@@ -108,6 +153,16 @@ selftune doctor
108
153
  Parse the JSON output. All checks should pass. If any fail, address the
109
154
  reported issues before proceeding.
110
155
 
156
+ ## Integration Guide
157
+
158
+ For project-type-specific setup (single-skill, multi-skill, monorepo, Codex,
159
+ OpenCode, mixed agents), see [docs/integration-guide.md](../../docs/integration-guide.md).
160
+
161
+ Templates for each project type are in the `templates/` directory:
162
+ - `templates/single-skill-settings.json` — hooks for single-skill projects
163
+ - `templates/multi-skill-settings.json` — hooks for multi-skill projects with activation rules
164
+ - `templates/activation-rules-default.json` — default auto-activation rule configuration
165
+
111
166
  ## Common Patterns
112
167
 
113
168
  **"Initialize selftune"**
@@ -0,0 +1,70 @@
1
+ # selftune Replay Workflow
2
+
3
+ Backfill the shared JSONL logs from existing Claude Code conversation
4
+ transcripts. Useful for bootstrapping selftune with historical session data.
5
+
6
+ ## When to Use
7
+
8
+ - New selftune installation with months of Claude Code history
9
+ - After re-initializing logs and wanting to recover data
10
+ - To populate eval data without waiting for new sessions
11
+
12
+ ## Key Difference from Hooks
13
+
14
+ Real-time hooks capture only the **last** user query per session. Replay
15
+ extracts **all** user queries, writing one `QueryLogRecord` per message.
16
+ This produces much richer eval data from historical sessions.
17
+
18
+ ## Default Command
19
+
20
+ ```bash
21
+ selftune replay
22
+ ```
23
+
24
+ ## Options
25
+
26
+ | Flag | Description |
27
+ |------|-------------|
28
+ | `--since <date>` | Only include transcripts modified after this date |
29
+ | `--dry-run` | Preview what would be ingested without writing |
30
+ | `--force` | Re-ingest all transcripts (ignore marker file) |
31
+ | `--verbose` | Show detailed progress per file |
32
+ | `--projects-dir <path>` | Override default `~/.claude/projects/` path |
33
+
34
+ ## Source
35
+
36
+ Reads Claude Code transcripts from `~/.claude/projects/<hash>/<session>.jsonl`.
37
+ Each transcript is a JSONL file containing user and assistant messages.
38
+
39
+ ## Output
40
+
41
+ Writes to:
42
+ - `~/.claude/all_queries_log.jsonl` -- one record per user query (all messages, not just the last)
43
+ - `~/.claude/session_telemetry_log.jsonl` -- per-session metrics with `source: "claude_code_replay"`
44
+ - `~/.claude/skill_usage_log.jsonl` -- skill triggers detected in transcripts
45
+
46
+ ## Idempotency
47
+
48
+ Uses a marker file at `~/.claude/claude_code_ingested_sessions.json` to track
49
+ which transcripts have already been ingested. Use `--force` to re-ingest all.
50
+
51
+ ## Steps
52
+
53
+ 1. Run `selftune replay --dry-run` to preview what would be ingested
54
+ 2. Run `selftune replay` to perform the ingestion
55
+ 3. Run `selftune doctor` to verify logs are healthy
56
+ 4. Run `selftune evals --list-skills` to see if replayed sessions appear
57
+
58
+ ## Common Patterns
59
+
60
+ **"Backfill my logs"**
61
+ > Run `selftune replay`. No options needed.
62
+
63
+ **"Only replay recent sessions"**
64
+ > Run `selftune replay --since 2026-02-01`
65
+
66
+ **"Re-ingest everything"**
67
+ > Run `selftune replay --force`
68
+
69
+ **"How do I know it worked?"**
70
+ > Run `selftune doctor` after replay. Check log file line counts increased.
@@ -75,6 +75,16 @@ Manual restoration from version control is required.
75
75
 
76
76
  ## Steps
77
77
 
78
+ ### 0. Read Evolution Context
79
+
80
+ Before starting, read `~/.selftune/memory/context.md` for session context:
81
+ - Active evolutions and their current status
82
+ - Previous rollback history
83
+ - Last update timestamp
84
+
85
+ This provides continuity across context resets. If the file doesn't exist,
86
+ proceed normally — it will be created after the first rollback.
87
+
78
88
  ### 1. Find the Last Evolution
79
89
 
80
90
  Read `~/.claude/evolution_audit_log.jsonl` and find the most recent
@@ -101,7 +111,16 @@ After rollback, verify the SKILL.md content is restored:
101
111
  - Check the audit log for the `rolled_back` entry
102
112
  - Optionally re-run evals to confirm the original pass rate
103
113
 
104
- ### 4. Post-Rollback Audit
114
+ ### 4. Update Memory
115
+
116
+ After rollback completes, the memory writer updates:
117
+ - `~/.selftune/memory/decisions.md` -- records the rollback decision and reason
118
+ - `~/.selftune/memory/context.md` -- clears the active evolution state and notes the rollback
119
+
120
+ This ensures future evolve and watch workflows have context about why the
121
+ rollback occurred, even across context window resets.
122
+
123
+ ### 5. Post-Rollback Audit
105
124
 
106
125
  The rollback is logged. Future `evolve` runs will see the rollback in the
107
126
  audit trail and can use it to avoid repeating failed evolution patterns.
@@ -0,0 +1,138 @@
1
+ # selftune Unit Test Workflow
2
+
3
+ Run or generate unit tests for individual skills. Tests verify trigger
4
+ accuracy, output content, and tool usage with deterministic assertions.
5
+
6
+ ## Default Command
7
+
8
+ ```bash
9
+ selftune unit-test --skill <name> --tests <path> [options]
10
+ ```
11
+
12
+ ## Options
13
+
14
+ | Flag | Description | Default |
15
+ |------|-------------|---------|
16
+ | `--skill <name>` | Skill name | Required |
17
+ | `--tests <path>` | Path to unit test JSON file | `~/.selftune/unit-tests/<skill>.json` |
18
+ | `--run-agent` | Run agent-based assertions (not just trigger checks) | Off |
19
+ | `--generate` | Generate tests from skill content instead of running | Off |
20
+ | `--skill-path <path>` | Path to SKILL.md (required for `--generate`) | None |
21
+ | `--eval-set <path>` | Eval set for failure context (used with `--generate`) | None |
22
+ | `--model <flag>` | Model flag for LLM calls | Agent default |
23
+
24
+ ## Test Format
25
+
26
+ Tests are stored as JSON arrays in `~/.selftune/unit-tests/<skill>.json`:
27
+
28
+ ```json
29
+ [
30
+ {
31
+ "test_id": "research-trigger-1",
32
+ "skill_name": "Research",
33
+ "description": "Should trigger on explicit research request",
34
+ "query": "Research the latest trends in AI safety",
35
+ "expected_trigger": true,
36
+ "assertions": [
37
+ {
38
+ "type": "trigger_check",
39
+ "value": "true",
40
+ "description": "Skill should trigger for this query"
41
+ }
42
+ ],
43
+ "tags": ["explicit", "core"],
44
+ "source": "manual"
45
+ }
46
+ ]
47
+ ```
48
+
49
+ ## Assertion Types
50
+
51
+ | Type | What it checks | Requires agent? |
52
+ |------|---------------|-----------------|
53
+ | `trigger_check` | Query triggers the skill description | No (LLM only) |
54
+ | `output_contains` | Agent output contains expected text | Yes |
55
+ | `output_matches_regex` | Agent output matches regex pattern | Yes |
56
+ | `tool_called` | Agent used a specific tool | Yes |
57
+
58
+ Trigger check assertions are cheap (single LLM call). Agent-based assertions
59
+ require `--run-agent` and run the query through the full agent.
60
+
61
+ ## Output Format
62
+
63
+ ```json
64
+ {
65
+ "skill_name": "Research",
66
+ "total": 10,
67
+ "passed": 8,
68
+ "failed": 2,
69
+ "pass_rate": 0.80,
70
+ "results": [
71
+ {
72
+ "test_id": "research-trigger-1",
73
+ "overall_passed": true,
74
+ "trigger_passed": true,
75
+ "assertion_results": [
76
+ { "type": "trigger_check", "value": "true", "passed": true, "evidence": "LLM responded YES" }
77
+ ],
78
+ "duration_ms": 450
79
+ }
80
+ ],
81
+ "ran_at": "2026-03-04T12:00:00.000Z"
82
+ }
83
+ ```
84
+
85
+ ## Steps
86
+
87
+ ### 1. Generate Tests (First Time)
88
+
89
+ For a new skill, generate initial tests from the skill content:
90
+
91
+ ```bash
92
+ selftune unit-test --skill Research --generate --skill-path ~/.claude/skills/Research/SKILL.md
93
+ ```
94
+
95
+ This uses an LLM to create test cases covering:
96
+ - Explicit trigger queries
97
+ - Implicit trigger queries
98
+ - Contextual trigger queries
99
+ - Negative examples (should NOT trigger)
100
+
101
+ Tests are saved to `~/.selftune/unit-tests/Research.json`.
102
+
103
+ ### 2. Run Tests
104
+
105
+ ```bash
106
+ selftune unit-test --skill Research --tests ~/.selftune/unit-tests/Research.json
107
+ ```
108
+
109
+ By default, only `trigger_check` assertions run (fast, no agent needed).
110
+ Add `--run-agent` for full agent-based assertions.
111
+
112
+ ### 3. Review Results
113
+
114
+ Check `pass_rate` and investigate failures:
115
+ - Failed trigger checks → description needs improvement
116
+ - Failed output assertions → skill workflow needs fixes
117
+ - Failed tool assertions → skill routing is broken
118
+
119
+ ### 4. Iterate
120
+
121
+ After evolving a skill, re-run unit tests to verify improvements:
122
+ 1. Evolve: `selftune evolve --skill Research --skill-path /path/SKILL.md`
123
+ 2. Test: `selftune unit-test --skill Research`
124
+ 3. Check pass rate improved
125
+
126
+ ## Common Patterns
127
+
128
+ **"Generate tests for the pptx skill"**
129
+ > `selftune unit-test --skill pptx --generate --skill-path /path/SKILL.md`
130
+
131
+ **"Run existing tests"**
132
+ > `selftune unit-test --skill pptx --tests ~/.selftune/unit-tests/pptx.json`
133
+
134
+ **"Run full agent tests"**
135
+ > `selftune unit-test --skill pptx --tests /path/tests.json --run-agent`
136
+
137
+ **"Test after evolution"**
138
+ > Run `selftune unit-test` after each `selftune evolve` to verify improvements.
@@ -65,6 +65,21 @@ selftune watch --skill <name> --skill-path <path> [options]
65
65
 
66
66
  ## Steps
67
67
 
68
+ ### 0. Read Evolution Context
69
+
70
+ Before starting, read `~/.selftune/memory/context.md` for session context:
71
+ - Active evolutions and their current status
72
+ - Known issues and regression history
73
+ - Last update timestamp
74
+
75
+ This provides continuity across context resets. If the file doesn't exist,
76
+ proceed normally -- it will be created after the first watch.
77
+
78
+ The evolution-guard hook prevents conflicting SKILL.md edits while watch is
79
+ evaluating the skill. The auto-activation system uses watch results to
80
+ adjust suggestion confidence -- skills showing regressions get flagged for
81
+ attention in subsequent prompts.
82
+
68
83
  ### 1. Run Watch
69
84
 
70
85
  ```bash
@@ -100,6 +115,13 @@ Summarize the snapshot for the user:
100
115
  - Whether regression was detected
101
116
  - Recommended action
102
117
 
118
+ ### 5. Update Memory
119
+
120
+ After watch completes, the memory writer updates
121
+ `~/.selftune/memory/context.md` with the current regression status,
122
+ pass rates, and recommended next action. This ensures continuity if the
123
+ context window resets before the user acts on the results.
124
+
103
125
  ## Common Patterns
104
126
 
105
127
  **"Is the skill performing well after the change?"**