selftune 0.2.0 → 0.2.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/agents/diagnosis-analyst.md +20 -10
- package/.claude/agents/evolution-reviewer.md +14 -1
- package/.claude/agents/integration-guide.md +18 -6
- package/.claude/agents/pattern-analyst.md +18 -5
- package/CHANGELOG.md +12 -4
- package/README.md +43 -35
- package/apps/local-dashboard/dist/assets/geist-cyrillic-wght-normal-CHSlOQsW.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/geist-latin-ext-wght-normal-DMtmJ5ZE.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/geist-latin-wght-normal-Dm3htQBi.woff2 +0 -0
- package/apps/local-dashboard/dist/assets/index-C4EOTFZ2.js +15 -0
- package/apps/local-dashboard/dist/assets/index-bl-Webyd.css +1 -0
- package/apps/local-dashboard/dist/assets/vendor-react-U7zYD9Rg.js +60 -0
- package/apps/local-dashboard/dist/assets/vendor-table-B7VF2Ipl.js +26 -0
- package/apps/local-dashboard/dist/assets/vendor-ui-D7_zX_qy.js +346 -0
- package/apps/local-dashboard/dist/favicon.png +0 -0
- package/apps/local-dashboard/dist/index.html +17 -0
- package/apps/local-dashboard/dist/logo.png +0 -0
- package/apps/local-dashboard/dist/logo.svg +9 -0
- package/cli/selftune/badge/badge-data.ts +1 -1
- package/cli/selftune/badge/badge.ts +4 -8
- package/cli/selftune/canonical-export.ts +183 -0
- package/cli/selftune/constants.ts +28 -0
- package/cli/selftune/contribute/contribute.ts +1 -1
- package/cli/selftune/cron/setup.ts +17 -17
- package/cli/selftune/dashboard-contract.ts +202 -0
- package/cli/selftune/dashboard-server.ts +653 -186
- package/cli/selftune/dashboard.ts +41 -176
- package/cli/selftune/eval/baseline.ts +5 -4
- package/cli/selftune/eval/composability-v2.ts +273 -0
- package/cli/selftune/eval/hooks-to-evals.ts +34 -15
- package/cli/selftune/eval/unit-test-cli.ts +1 -1
- package/cli/selftune/evolution/evidence.ts +26 -0
- package/cli/selftune/evolution/evolve-body.ts +105 -11
- package/cli/selftune/evolution/evolve.ts +371 -25
- package/cli/selftune/evolution/extract-patterns.ts +87 -29
- package/cli/selftune/evolution/rollback.ts +2 -2
- package/cli/selftune/grading/auto-grade.ts +200 -0
- package/cli/selftune/grading/grade-session.ts +448 -97
- package/cli/selftune/grading/results.ts +42 -0
- package/cli/selftune/hooks/prompt-log.ts +172 -2
- package/cli/selftune/hooks/session-stop.ts +123 -3
- package/cli/selftune/hooks/skill-eval.ts +119 -3
- package/cli/selftune/index.ts +395 -116
- package/cli/selftune/ingestors/claude-replay.ts +140 -114
- package/cli/selftune/ingestors/codex-rollout.ts +345 -46
- package/cli/selftune/ingestors/codex-wrapper.ts +207 -39
- package/cli/selftune/ingestors/openclaw-ingest.ts +141 -8
- package/cli/selftune/ingestors/opencode-ingest.ts +193 -17
- package/cli/selftune/init.ts +227 -14
- package/cli/selftune/last.ts +14 -5
- package/cli/selftune/localdb/db.ts +63 -0
- package/cli/selftune/localdb/materialize.ts +428 -0
- package/cli/selftune/localdb/queries.ts +376 -0
- package/cli/selftune/localdb/schema.ts +204 -0
- package/cli/selftune/monitoring/watch.ts +66 -15
- package/cli/selftune/normalization.ts +682 -0
- package/cli/selftune/observability.ts +19 -44
- package/cli/selftune/orchestrate.ts +1073 -0
- package/cli/selftune/quickstart.ts +203 -0
- package/cli/selftune/repair/skill-usage.ts +576 -0
- package/cli/selftune/schedule.ts +561 -0
- package/cli/selftune/status.ts +48 -26
- package/cli/selftune/sync.ts +627 -0
- package/cli/selftune/types.ts +148 -0
- package/cli/selftune/utils/canonical-log.ts +45 -0
- package/cli/selftune/utils/hooks.ts +41 -0
- package/cli/selftune/utils/html.ts +27 -0
- package/cli/selftune/utils/llm-call.ts +78 -20
- package/cli/selftune/utils/math.ts +10 -0
- package/cli/selftune/utils/query-filter.ts +139 -0
- package/cli/selftune/utils/skill-discovery.ts +340 -0
- package/cli/selftune/utils/skill-log.ts +68 -0
- package/cli/selftune/utils/skill-usage-confidence.ts +18 -0
- package/cli/selftune/utils/transcript.ts +272 -26
- package/cli/selftune/workflows/discover.ts +254 -0
- package/cli/selftune/workflows/skill-md-writer.ts +288 -0
- package/cli/selftune/workflows/workflows.ts +188 -0
- package/package.json +21 -8
- package/packages/telemetry-contract/README.md +11 -0
- package/packages/telemetry-contract/fixtures/golden.json +87 -0
- package/packages/telemetry-contract/fixtures/golden.test.ts +42 -0
- package/packages/telemetry-contract/index.ts +1 -0
- package/packages/telemetry-contract/package.json +19 -0
- package/packages/telemetry-contract/src/index.ts +2 -0
- package/packages/telemetry-contract/src/types.ts +163 -0
- package/packages/telemetry-contract/src/validators.ts +109 -0
- package/skill/SKILL.md +84 -53
- package/skill/Workflows/AutoActivation.md +17 -16
- package/skill/Workflows/Badge.md +6 -0
- package/skill/Workflows/Baseline.md +46 -23
- package/skill/Workflows/Composability.md +12 -5
- package/skill/Workflows/Contribute.md +17 -14
- package/skill/Workflows/Cron.md +56 -79
- package/skill/Workflows/Dashboard.md +45 -34
- package/skill/Workflows/Doctor.md +30 -17
- package/skill/Workflows/Evals.md +64 -40
- package/skill/Workflows/EvolutionMemory.md +2 -0
- package/skill/Workflows/Evolve.md +102 -47
- package/skill/Workflows/EvolveBody.md +6 -6
- package/skill/Workflows/Grade.md +36 -31
- package/skill/Workflows/ImportSkillsBench.md +11 -5
- package/skill/Workflows/Ingest.md +43 -36
- package/skill/Workflows/Initialize.md +44 -30
- package/skill/Workflows/Orchestrate.md +139 -0
- package/skill/Workflows/Replay.md +39 -18
- package/skill/Workflows/Rollback.md +3 -3
- package/skill/Workflows/Schedule.md +61 -0
- package/skill/Workflows/Sync.md +88 -0
- package/skill/Workflows/UnitTest.md +34 -22
- package/skill/Workflows/Watch.md +14 -4
- package/skill/Workflows/Workflows.md +129 -0
- package/skill/assets/activation-rules-default.json +26 -0
- package/skill/assets/multi-skill-settings.json +63 -0
- package/skill/assets/single-skill-settings.json +57 -0
- package/skill/references/invocation-taxonomy.md +2 -2
- package/skill/references/logs.md +164 -2
- package/skill/references/setup-patterns.md +65 -0
- package/skill/references/version-history.md +40 -0
- package/skill/settings_snippet.json +1 -1
- package/templates/multi-skill-settings.json +7 -7
- package/templates/single-skill-settings.json +6 -6
- package/dashboard/index.html +0 -1680
package/skill/Workflows/Evals.md
CHANGED

@@ -4,10 +4,19 @@ Generate eval sets from hook logs. Detects false negatives (queries that
 should have triggered a skill but did not) and annotates each entry with
 its invocation type.

+## When to Invoke
+
+Invoke this workflow when the user requests any of the following:
+- Generating eval sets or test data for a skill
+- Checking which skills are undertriggering
+- Viewing skill telemetry or usage stats
+- Preparing data before running the Evolve workflow
+- Any request containing "evals", "eval set", "test queries", or "skill stats"
+
 ## Default Command

 ```bash
-selftune
+selftune eval generate --skill <name> [options]
 ```

 ## Options
@@ -101,10 +110,10 @@ selftune evals --skill <name> [options]
 Discover which skills have telemetry data and how many queries each has.

 ```bash
-selftune
+selftune eval generate --list-skills
 ```

-
+Run this first to identify which skills have enough data for eval generation.

 ### Generate Synthetic Evals (Cold Start)

@@ -112,7 +121,7 @@ When a skill has no telemetry data yet, use `--synthetic` to generate eval
 queries directly from the SKILL.md content via an LLM.

 ```bash
-selftune
+selftune eval generate --skill pptx --synthetic --skill-path /path/to/skills/pptx/SKILL.md
 ```

 The command:
@@ -125,7 +134,7 @@ The command:
 Use `--model` to override the default LLM model:

 ```bash
-selftune
+selftune eval generate --skill pptx --synthetic --skill-path ./skills/pptx/SKILL.md --model claude-sonnet-4-5-20250514
 ```

 ### Generate Evals (Log-Based)
@@ -135,7 +144,7 @@ Cross-reference `skill_usage_log.jsonl` (positive triggers) against
 an eval set annotated with invocation types.

 ```bash
-selftune
+selftune eval generate --skill pptx --max 50 --out evals-pptx.json
 ```

 The command:
@@ -152,40 +161,53 @@ View aggregate telemetry for a skill: average turns, tool call breakdown,
 error rates, and common bash command patterns.

 ```bash
-selftune
+selftune eval generate --skill pptx --stats
 ```

 ## Steps

 ### 0. Pre-Flight Configuration

-Before generating evals, present configuration options to the user.
-If the user says "use defaults" or similar, skip to step 1 with recommended defaults.
+Before generating evals, present numbered configuration options to the user inline in your response, then wait for the user's answer before proceeding.

-
+If the user responds with "use defaults", "just do it", or similar shorthand, skip to step 1 using the recommended defaults.

-
+For `--list-skills` or `--stats` requests, skip pre-flight entirely — these are read-only operations.

-
-selftune evals — Pre-Flight Configuration
+Present the following options inline in your response:

-1. Generation Mode
-a) Log-based — build evals from real usage logs (recommended if logs exist)
-b) Synthetic — generate evals from SKILL.md via LLM (for new skills with no data)
+1. **Generation Mode**
+- a) Log-based — build evals from real usage logs (recommended if logs exist)
+- b) Synthetic — generate evals from SKILL.md via LLM (for new skills with no data)

-2.
+2. **Skill Path** (synthetic mode only)
+- Provide absolute or relative path to the target SKILL.md
+- Example: `./skills/pptx/SKILL.md`

-3.
-a) Fast (haiku) — quick generation
-b) Balanced (sonnet) — better query diversity (recommended)
-c) Best (opus) — highest quality synthetic queries
+3. **Max Entries:** 50 (default — how many eval entries to generate)

-4.
+4. **Model** (synthetic mode only)
+- a) Fast (haiku) — quick generation
+- b) Balanced (sonnet) — better query diversity (recommended)
+- c) Best (opus) — highest quality synthetic queries

-
-```
+5. **Output Path:** `evals-<skill>.json` (default)

-
+Ask: "Reply with your choices or 'use defaults' for recommended settings."
+
+After the user responds, parse their selections and map each choice to the corresponding CLI flags:
+
+| Selection | CLI Flag |
+|-----------|----------|
+| 1a (log-based) | _(no flag, default)_ |
+| 1b (synthetic) | `--synthetic --skill-path <path>` |
+| Custom max entries | `--max <value>` |
+| 4a (haiku) | `--model haiku` (resolved internally by selftune) |
+| 4b (sonnet) | `--model sonnet` |
+| 4c (opus) | `--model opus` |
+| Custom output path | `--out <path>` |
+
+Show a confirmation summary to the user:

 ```text
 Configuration Summary:
@@ -196,15 +218,17 @@ Configuration Summary:
 Proceeding...
 ```

+Build the CLI command string with all selected flags and continue to step 1.
+
 ### 1. List Available Skills

-Run `selftune
+Run `selftune eval generate --list-skills` to see what skills have telemetry data. If the target
 skill has zero or very few queries, more sessions are needed before
 eval generation is useful.

 ### 2. Generate the Eval Set

-Run with `--skill <name>`.
+Run with `--skill <name>`. Parse the JSON output and review for:
 - Balance between positive and negative entries
 - Coverage of all three positive invocation types (explicit, implicit, contextual)
 - Reasonable negative examples (keyword overlap but wrong intent)
@@ -233,20 +257,20 @@ beyond trigger coverage.

 ## Common Patterns

-**
-
-
+**User asks which skills are undertriggering:**
+Run `selftune eval generate --list-skills`, then for each skill with significant query counts,
+generate evals and check for missed implicit/contextual queries.

-**
-
-
+**User asks to generate evals for a specific skill:**
+Run `selftune eval generate --skill <name>`. Parse the JSON output and review the invocation type distribution.
+Feed the output to the Evolve workflow if coverage gaps exist.

-**
-
+**User asks for skill telemetry or stats:**
+Run `selftune eval generate --skill <name> --stats` for aggregate telemetry.

-**
-
-
+**User has a new skill with no usage data:**
+Use `selftune eval generate --skill <name> --synthetic --skill-path /path/to/SKILL.md`.
+This generates eval queries from the skill description without needing session logs.

-**
-
+**User wants reproducible evals:**
+Add `--seed <n>` to fix the random sampling of negative examples.
package/skill/Workflows/EvolutionMemory.md
CHANGED

@@ -1,5 +1,7 @@
 # selftune Evolution Memory

+This reference documents the evolution memory system. The agent reads these files automatically during evolve, watch, and rollback workflows for session continuity.
+
 Human-readable session context that survives context window resets. Provides
 continuity across evolve, watch, and rollback workflows by recording outcomes,
 decisions, and known issues in plain markdown files.
package/skill/Workflows/Evolve.md
CHANGED

@@ -4,6 +4,14 @@ Improve a skill's description based on real usage signal. Analyzes failure
 patterns from eval sets and proposes description changes that catch more
 natural-language queries without breaking existing triggers.

+## When to Invoke
+
+Invoke this workflow when the user requests any of the following:
+- Improving or evolving a skill's trigger coverage
+- Fixing undertriggering or missed queries for a skill
+- Optimizing a skill description based on usage data
+- Any request containing "evolve", "improve triggers", or "fix skill matching"
+
 ## Default Command

 ```bash
@@ -25,6 +33,8 @@ selftune evolve --skill <name> --skill-path <path> [options]
 | `--cheap-loop` | Use cheap models for loop, expensive for final gate | Off |
 | `--gate-model <model>` | Model for final gate validation | `sonnet` (when `--cheap-loop`) |
 | `--proposal-model <model>` | Model for proposal generation LLM calls | None |
+| `--sync-first` | Refresh source-truth telemetry before generating evals/failure patterns | Off |
+| `--sync-force` | Force a full source rescan during `--sync-first` | Off |

 ## Output Format

@@ -79,41 +89,51 @@ The evolution process writes multiple audit entries:

 ### 0. Pre-Flight Configuration

-Before running the evolve command, present configuration options to the user.
-If the user says "use defaults", "just run it", or similar, skip to step 1
-with the recommended defaults marked below.
+Before running the evolve command, present numbered configuration options to the user inline in your response, then wait for the user's answer before proceeding.

-
+If the user responds with "use defaults", "just run it", or similar shorthand, skip to step 1 using the recommended defaults marked below.

-
-selftune evolve — Pre-Flight Configuration
+Present the following options inline in your response:

-1. Execution Mode
-a) Dry run — preview proposal without deploying (recommended for first run)
-b) Live — validate and deploy if improved
+1. **Execution Mode**
+- a) Dry run — preview proposal without deploying (recommended for first run)
+- b) Live — validate and deploy if improved

-2. Model Tier (see SKILL.md Model Tier Reference)
-a) Fast (haiku) — cheapest, ~2s/call (recommended with cheap-loop)
-b) Balanced (sonnet) — good quality, ~5s/call
-c) Best (opus) — highest quality, ~10s/call
+2. **Model Tier** (see SKILL.md Model Tier Reference)
+- a) Fast (haiku) — cheapest, ~2s/call (recommended with cheap-loop)
+- b) Balanced (sonnet) — good quality, ~5s/call
+- c) Best (opus) — highest quality, ~10s/call

-3. Cost Optimization
-a) Cheap loop — haiku for iteration, sonnet for final gate (recommended)
-b) Single model — use one model throughout
+3. **Cost Optimization**
+- a) Cheap loop — haiku for iteration, sonnet for final gate (recommended)
+- b) Single model — use one model throughout

-4. Confidence Threshold
+4. **Confidence Threshold:** 0.6 (default, higher = stricter)

-5. Max Iterations
+5. **Max Iterations:** 3 (default, more = longer but better results)

-6. Multi-Candidate Selection
-a) Single candidate — one proposal per iteration (recommended)
-b) Pareto mode — generate multiple candidates, pick best on frontier
+6. **Multi-Candidate Selection**
+- a) Single candidate — one proposal per iteration (recommended)
+- b) Pareto mode — generate multiple candidates, pick best on frontier

-
-
-
+Ask: "Reply with your choices (e.g., '1a, 2a, 3a, defaults for rest') or 'use defaults' for recommended settings."
+
+After the user responds, parse their selections and map each choice to the corresponding CLI flags:

-
+| Selection | CLI Flag |
+|-----------|----------|
+| 1a (dry run) | `--dry-run` |
+| 1b (live) | _(no flag)_ |
+| 2a (haiku) | `--validation-model haiku` |
+| 2b (sonnet) | `--validation-model sonnet` |
+| 2c (opus) | `--validation-model opus` |
+| 3a (cheap loop) | `--cheap-loop` |
+| 3b (single model) | _(no flag)_ |
+| Custom confidence | `--confidence <value>` |
+| Custom iterations | `--max-iterations <value>` |
+| 6b (pareto) | `--pareto` |
+
+Show a confirmation summary to the user:

 ```
 Configuration Summary:
@@ -126,7 +146,7 @@ Configuration Summary:
 Proceeding...
 ```

-
+Build the CLI command string with all selected flags and continue to step 1.

 ### 1. Read Evolution Context

@@ -143,7 +163,20 @@ edits on monitored skills during active evolution. This prevents conflicting
 changes while the evolve process is running. The guard is automatically
 engaged when evolve starts and released when it completes.

-### 2.
+### 2. Refresh Source Truth (Recommended)
+
+If the host has accumulated significant agent activity or is known to be
+polluted, prefer:
+
+```bash
+selftune evolve --skill <name> --skill-path <path> --sync-first
+```
+
+`--sync-first` runs the authoritative transcript/rollout sync before eval-set
+generation and failure-pattern extraction. Use `--sync-force` when you need
+to ignore markers and rescan everything.
+
+### 3. Load or Generate Eval Set

 If `--eval-set` is provided, use it directly. Otherwise, the command
 generates one from logs (equivalent to running `evals --skill <name>`).
@@ -151,7 +184,7 @@ generates one from logs (equivalent to running `evals --skill <name>`).
 An eval set is required for validation. Without enough telemetry data,
 evolution cannot reliably measure improvement.

-###
+### 4. Extract Failure Patterns

 The command groups missed queries by invocation type:
 - Missed explicit: description is broken (rare, high priority)
@@ -160,7 +193,7 @@ The command groups missed queries by invocation type:

 See `references/invocation-taxonomy.md` for the taxonomy.

-###
+### 5. Propose Description Changes

 An LLM generates a candidate description that would catch the missed
 queries. The candidate:
@@ -168,7 +201,7 @@ queries. The candidate:
 - Adds new phrases covering missed patterns
 - Maintains the description's structure and tone

-###
+### 6. Validate Against Eval Set

 The candidate is tested against the full eval set:
 - Must improve overall pass rate
@@ -178,7 +211,7 @@ The candidate is tested against the full eval set:
 If validation fails, the command retries up to `--max-iterations` times
 with adjusted proposals.

-###
+### 7. Deploy (or Preview)

 If `--dry-run`, the proposal is printed but not deployed. The audit log
 still records `created` and `validated` entries for review.
@@ -188,7 +221,7 @@ If deploying:
 2. The updated description is written to SKILL.md
 3. A `deployed` entry is logged to the evolution audit

-###
+### 8. Update Memory

 After evolution completes (deploy or dry-run), the memory writer updates:
 - `~/.selftune/memory/context.md` -- records the evolution outcome and current state
@@ -237,22 +270,44 @@ selftune evolve --skill X --skill-path Y --proposal-model haiku --validation-mod

 ## Common Patterns

-**"
-
-
-
+**User asks to evolve a specific skill (e.g., "evolve the pptx skill"):**
+Requires `--skill pptx` and `--skill-path /path/to/pptx/SKILL.md`.
+If the user has not specified the path, search for the SKILL.md file
+in the workspace. If not found, ask the user for the path.
+
+**User wants a preview without deployment (e.g., "just show me what would change"):**
+Add `--dry-run` to preview proposals without deploying.
+
+**Evolution results are insufficient:**
+Check the eval set quality. Missing contextual examples limit
+what evolution can learn. Generate a richer eval set first using the Evals workflow.
+
+**Evolution keeps failing validation:**
+Lower `--confidence` slightly or increase `--max-iterations`.
+Also check if the eval set has contradictory expectations.
+
+**Agent CLI override needed:**
+The evolve command auto-detects the installed agent CLI.
+Use `--agent <name>` to override (claude, codex, opencode).
+
+## Subagent Escalation
+
+For high-stakes evolutions, consider spawning the `evolution-reviewer` agent
+as a subagent to review the proposal before deploying. This is especially
+valuable when the skill has a history of regressions, the evolution touches
+many trigger phrases, or the confidence score is near the threshold.

-
-> Use `--dry-run` to preview proposals without deploying.
+## Autonomous Mode

-
-
-> what evolution can learn. Generate a richer eval set first.
+When called by `selftune orchestrate` (via cron or --loop), evolution runs
+without user interaction:

-
-
-
+- Pre-flight is skipped entirely — defaults are used
+- The orchestrator selects candidate skills based on health scores
+- Evolution uses cheap-loop mode (Haiku) by default
+- Validation runs automatically against the eval set
+- Deploy happens if validation passes the regression threshold
+- Results are logged to orchestrate-runs.jsonl

-
-
-> Use `--agent <name>` to override (claude, codex, opencode).
+No user confirmation is needed. The safety controls (regression threshold,
+auto-rollback via watch, SKILL.md backup) provide the guardrails.
package/skill/Workflows/EvolveBody.md
CHANGED

@@ -7,7 +7,7 @@ LLM validates them through a 3-gate pipeline.
 ## Default Command

 ```bash
-selftune evolve
+selftune evolve body --skill <name> --skill-path <path> --target <target> [options]
 ```

 ## Options
@@ -65,7 +65,7 @@ If the user says "use defaults" or similar, skip to step 1 with recommended defa
 Present these options:

 ```
-selftune evolve
+selftune evolve body — Pre-Flight Configuration

 1. Evolution Target
 a) Routing table — optimize the workflow routing table only
@@ -116,7 +116,7 @@ The command reads SKILL.md and splits it into sections using `parseSkillSections
 ### 2. Build Eval Set

 If `--eval-set` is provided, use it directly. Otherwise, generate from logs
-(same as `selftune
+(same as `selftune eval generate --skill <name>`).

 ### 3. Extract Failure Patterns

@@ -147,13 +147,13 @@ If `--dry-run`, prints the proposal without deploying. Otherwise:
 ## Common Patterns

 **"Evolve the routing table for the Research skill"**
-> `selftune evolve
+> `selftune evolve body --skill Research --skill-path ~/.claude/skills/Research/SKILL.md --target routing_table`

 **"Rewrite the entire skill body"**
-> `selftune evolve
+> `selftune evolve body --skill Research --skill-path ~/.claude/skills/Research/SKILL.md --target full_body --dry-run`

 **"Use a stronger model for generation"**
-> `selftune evolve
+> `selftune evolve body --skill pptx --skill-path /path/SKILL.md --target full_body --teacher-model opus --student-model haiku`

 **"Preview what would change"**
 > Always start with `--dry-run` to review the proposal before deploying.
package/skill/Workflows/Grade.md
CHANGED

@@ -74,15 +74,14 @@ for the full schema. Key fields:

 ### 1. Find the Session

-Read `~/.claude/session_telemetry_log.jsonl
-where `skills_triggered` contains the target skill name.
-
-
-`session_id` fields. See `references/logs.md` for the telemetry format.
+Read `~/.claude/session_telemetry_log.jsonl`. Find the most recent entry
+where `skills_triggered` contains the target skill name. Extract the
+`transcript_path`, `tool_calls`, `errors_encountered`, and `session_id`
+fields. See `references/logs.md` for the telemetry format.

 ### 2. Read the Transcript

-Parse the JSONL file at `transcript_path`.
+Parse the JSONL file at `transcript_path`. Extract:
 - User messages (what was asked)
 - Assistant tool calls (what the agent did)
 - Tool results (what happened)
@@ -92,16 +91,14 @@ See `references/logs.md` for transcript format variants.

 ### 3. Determine Expectations

-If
-Otherwise, derive defaults
-
-
-Always include at least one Process expectation and one Quality expectation.
+If `--expectations` was provided, parse the semicolon-separated list.
+Otherwise, derive defaults from `references/grading-methodology.md`.
+Ensure at least one Process expectation and one Quality expectation.

 ### 4. Grade Each Expectation

-
-
+Search both the telemetry record and the transcript for evidence per
+expectation. Mark as:
 - **PASS** if evidence exists and supports the expectation
 - **FAIL** if evidence is absent or contradicts the expectation

@@ -109,22 +106,21 @@ Cite specific evidence: transcript line numbers, tool call names, bash output.

 ### 5. Extract Implicit Claims

-Pull 2-4 claims from the transcript
-
-
-types and examples.
+Pull 2-4 claims from the transcript not covered by explicit expectations.
+Classify each as factual, process, or quality. Verify each against the
+transcript. See `references/grading-methodology.md` for claim types.

 ### 6. Flag Eval Gaps

 Review each passed expectation. If it would also pass for wrong output,
-
-for gap flagging criteria.
+record it in `eval_feedback.suggestions`. See
+`references/grading-methodology.md` for gap flagging criteria.

 ### 7. Write grading.json

 Write the full grading result to `grading.json` in the current directory.

-### 8.
+### 8. Report Results

 Report to the user:
 - Pass rate (e.g., "2/3 passed, 67%")
@@ -136,17 +132,26 @@ Keep the summary concise. The full details are in `grading.json`.

 ## Common Patterns

-**
->
->
+**User asks to grade a skill session**
+> Run `selftune grade --skill <name>` with default expectations. Results are
+> written to `grading.json`. Read that file and report the pass rate and any
+> failures to the user.
+
+**User provides specific expectations**
+> Run `selftune grade --skill <name> --expectations "expect1;expect2;expect3"`.
+> Parse results and report.
+
+**User wants to grade from an eval set**
+> Run `selftune grade --skill <name> --evals-json path/to/evals.json`.
+> Optionally add `--eval-id N` for a specific scenario.

-**
->
+**Agent detection override needed**
+> The grader auto-detects the agent CLI. If detection fails or the user
+> specifies an agent, pass `--agent <name>` to override.

-
-> Pass `--evals-json path/to/evals.json` and optionally `--eval-id N`
-> to grade a specific eval scenario.
+## Autonomous Mode

-
-
-
+Grading runs implicitly during orchestrate as part of status computation.
+The orchestrator reads grading results to determine which skills are
+candidates for evolution. No explicit grade command is called — the
+grading results from previous sessions feed into candidate selection.
package/skill/Workflows/ImportSkillsBench.md
CHANGED

@@ -1,5 +1,11 @@
 # selftune Import SkillsBench Workflow

+## When to Use
+
+When the user wants to enrich eval sets with external benchmark tasks or import SkillsBench corpora.
+
+## Overview
+
 Import evaluation tasks from the SkillsBench corpus (87 real-world agent
 benchmarks) and convert them to selftune eval entries. This enriches
 your skill's eval set with externally validated test cases.
@@ -7,7 +13,7 @@ your skill's eval set with externally validated test cases.
 ## Default Command

 ```bash
-selftune import
+selftune eval import --dir <path> --skill <name> --output <path> [options]
 ```

 ## Options
@@ -86,7 +92,7 @@ Clone or download the SkillsBench repository containing the task directory.
 ### 2. Import Tasks

 ```bash
-selftune import
+selftune eval import --dir /path/to/skillsbench/tasks --skill Research --output evals-bench.json
 ```

 ### 3. Review Output
@@ -102,10 +108,10 @@ corpus. Use the merged set with `selftune evolve --eval-set merged-evals.json`.
 ## Common Patterns

 **"Import SkillsBench tasks for Research"**
-> `selftune import
+> `selftune eval import --dir /path/tasks --skill Research --output bench-evals.json`

 **"Use fuzzy matching for broader coverage"**
-> `selftune import
+> `selftune eval import --dir /path/tasks --skill pptx --output bench-evals.json --match-strategy fuzzy`

 **"Enrich my eval set with external benchmarks"**
-> Import with `import
+> Import with `eval import`, then pass the output to `evolve --eval-set`.