npm - selftune - Versions diffs - 0.2.31 → 0.2.32 - Mend

selftune 0.2.31 → 0.2.32

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (95)

package/README.md +83 -56
package/apps/local-dashboard/dist/assets/index-B-ut4w0B.js +15 -0
package/apps/local-dashboard/dist/assets/index-BFGfCVrL.css +1 -0
package/apps/local-dashboard/dist/assets/vendor-ui-DfowE3Hu.js +1 -0
package/apps/local-dashboard/dist/index.html +3 -3
package/cli/selftune/command-surface.ts +613 -2
package/cli/selftune/create/baseline.ts +429 -0
package/cli/selftune/create/check.ts +35 -0
package/cli/selftune/create/init.ts +115 -0
package/cli/selftune/create/package-candidate-state.ts +771 -0
package/cli/selftune/create/package-evaluator.ts +710 -0
package/cli/selftune/create/package-fingerprint.ts +142 -0
package/cli/selftune/create/package-search.ts +377 -0
package/cli/selftune/create/publish.ts +431 -0
package/cli/selftune/create/readiness.ts +495 -0
package/cli/selftune/create/replay.ts +330 -0
package/cli/selftune/create/report.ts +74 -0
package/cli/selftune/create/scaffold.ts +121 -0
package/cli/selftune/create/skills-ref-adapter.ts +177 -0
package/cli/selftune/create/status.ts +33 -0
package/cli/selftune/create/templates.ts +249 -0
package/cli/selftune/cron/setup.ts +1 -1
package/cli/selftune/dashboard-action-events.ts +4 -1
package/cli/selftune/dashboard-action-result.ts +789 -24
package/cli/selftune/dashboard-action-stream.ts +80 -0
package/cli/selftune/dashboard-contract.ts +146 -3
package/cli/selftune/dashboard-server.ts +5 -4
package/cli/selftune/eval/hooks-to-evals.ts +58 -35
package/cli/selftune/eval/synthetic-evals.ts +145 -17
package/cli/selftune/evolution/bounded-mutations.ts +1045 -0
package/cli/selftune/evolution/evolve-body.ts +9 -36
package/cli/selftune/evolution/evolve.ts +8 -72
package/cli/selftune/evolution/stopping-criteria.ts +5 -13
package/cli/selftune/evolution/unblock-suggestions.ts +0 -16
package/cli/selftune/evolution/validate-host-replay.ts +115 -15
package/cli/selftune/improve.ts +206 -0
package/cli/selftune/index.ts +123 -6
package/cli/selftune/init.ts +1 -1
package/cli/selftune/localdb/queries/dashboard.ts +30 -0
package/cli/selftune/localdb/schema.ts +52 -0
package/cli/selftune/monitoring/watch.ts +257 -23
package/cli/selftune/orchestrate/execute.ts +300 -1
package/cli/selftune/orchestrate/finalize.ts +14 -0
package/cli/selftune/orchestrate/plan.ts +22 -5
package/cli/selftune/orchestrate/prepare.ts +59 -4
package/cli/selftune/orchestrate/report.ts +1 -1
package/cli/selftune/orchestrate.ts +34 -1
package/cli/selftune/publish.ts +35 -0
package/cli/selftune/routes/actions.ts +81 -15
package/cli/selftune/routes/overview.ts +1 -1
package/cli/selftune/routes/skill-report.ts +147 -2
package/cli/selftune/run.ts +18 -0
package/cli/selftune/schedule.ts +3 -3
package/cli/selftune/search-run.ts +703 -0
package/cli/selftune/status.ts +35 -11
package/cli/selftune/testing-readiness.ts +431 -40
package/cli/selftune/types.ts +316 -0
package/cli/selftune/utils/eval-readiness.ts +1 -0
package/cli/selftune/utils/json-output.ts +11 -0
package/cli/selftune/utils/lifecycle-surface.ts +48 -0
package/cli/selftune/utils/query-filter.ts +82 -1
package/cli/selftune/utils/tui.ts +85 -2
package/cli/selftune/verify.ts +205 -0
package/cli/selftune/workflows/proposals.ts +1 -1
package/cli/selftune/workflows/skill-scaffold.ts +141 -63
package/cli/selftune/workflows/workflows.ts +4 -4
package/package.json +1 -1
package/skill/SKILL.md +148 -85
package/skill/references/cli-quick-reference.md +16 -1
package/skill/references/creator-playbook.md +31 -10
package/skill/workflows/Baseline.md +8 -9
package/skill/workflows/Contributions.md +4 -4
package/skill/workflows/Create.md +173 -0
package/skill/workflows/CreateTestDeploy.md +34 -30
package/skill/workflows/Cron.md +2 -2
package/skill/workflows/Dashboard.md +3 -3
package/skill/workflows/Evals.md +13 -7
package/skill/workflows/Evolve.md +75 -32
package/skill/workflows/EvolveBody.md +22 -15
package/skill/workflows/Hook.md +1 -1
package/skill/workflows/Improve.md +168 -0
package/skill/workflows/Initialize.md +3 -3
package/skill/workflows/Orchestrate.md +49 -12
package/skill/workflows/Publish.md +100 -0
package/skill/workflows/Run.md +72 -0
package/skill/workflows/Schedule.md +2 -2
package/skill/workflows/SearchRun.md +89 -0
package/skill/workflows/SignalsDashboard.md +2 -2
package/skill/workflows/UnitTest.md +13 -4
package/skill/workflows/Verify.md +136 -0
package/skill/workflows/Watch.md +114 -47
package/skill/workflows/Workflows.md +13 -8
package/apps/local-dashboard/dist/assets/index-B7v_o1WC.js +0 -15
package/apps/local-dashboard/dist/assets/index-CrO77SVi.css +0 -1
package/apps/local-dashboard/dist/assets/vendor-ui-B0H8s1mP.js +0 -1

package/skill/workflows/Create.md ADDED

@@ -0,0 +1,173 @@
+# selftune Create Workflow
+## When to Use
+When the user wants to author a brand-new skill package, bootstrap a clean draft
+skill, or start from a package skeleton instead of mutating an existing skill.
+## Overview
+`Create` is the beginning of the lifecycle for first-class package drafts.
+Today the command surface is still split:
+- `selftune create init` starts from a blank package
+- `selftune create scaffold` starts from a discovered workflow
+- `selftune create status` tells you where the draft is in the lifecycle
+After authoring, move to `Verify` rather than staying in low-level `create`
+subcommands longer than necessary.
+## Primary Commands
+```bash
+selftune create init --name <name> --description <text> [--output-dir <path>] [--force] [--json]
+selftune create scaffold --from-workflow <id|index> [--output-dir <path>] [--skill-name <name>] [--description <text>] [--write] [--force] [--json]
+selftune create status --skill-path <path> [--json]
+selftune verify --skill-path <path> [--json]
+selftune create check --skill-path <path> [--json]
+selftune create replay --skill-path <path> [--mode routing|package] [--agent AGENT] [--eval-set PATH] [--json]
+selftune create baseline --skill-path <path> [--mode routing|package] [--agent AGENT] [--eval-set PATH] [--json]
+selftune create report --skill-path <path> [--agent AGENT] [--eval-set PATH] [--json]
+selftune publish --skill-path <path> [--json]
+selftune create publish --skill-path <path> [--watch] [--ignore-watch-alerts] [--json]
+```
+## Options
+- `--name <name>`: Display name for the new skill package. Required.
+- `--description <text>`: Short routing description for the draft skill.
+  Required.
+- `--output-dir <path>`: Parent directory for the new package. Default: the
+  repo-root `.agents/skills` directory.
+- `--from-workflow <id|index>`: Workflow ID or 1-based index from
+  `selftune workflows`. Required for `scaffold`.
+- `--skill-name <name>`: Override the generated scaffolded skill name.
+- `--force`: Overwrite scaffold files if the package directory already exists.
+- `--write`: Persist the workflow-derived scaffold to disk. Without this flag,
+  `scaffold` previews the package only.
+- `--min-occurrences <n>`: Minimum workflow frequency to consider while
+  resolving `--from-workflow`.
+- `--skill <name>`: Restrict workflow discovery to chains containing the named
+  skill during `scaffold`.
+- `--json`: Emit the created package summary as JSON.
+- `--skill-path <path>`: Path to a skill directory or `SKILL.md`. Required for
+  `status`, `check`, `replay`, `baseline`, `report`, and `publish`.
+- `--mode routing|package`: Replay or baseline only the router, or the full
+  package tree.
+- `--agent AGENT`: Runtime agent for replay, baseline, or report execution.
+- `--eval-set PATH`: Override the canonical eval-set path for replay,
+  baseline, or report.
+- `--watch`: Start watch immediately after `create publish` succeeds.
+- `--ignore-watch-alerts`: Bypass the publish-time watch gate after watch
+  runs.
+- `-h, --help`: Show command help.
+## Generated Layout
+```text
+<skill-name>/
+├── SKILL.md
+├── workflows/
+│   └── default.md
+├── references/
+│   └── overview.md
+├── scripts/
+├── assets/
+└── selftune.create.json
+```
+## What Each File Is For
+- `SKILL.md`: The trigger surface and top-level routing contract.
+- `workflows/default.md`: The first execution path once the skill triggers.
+- `references/overview.md`: Background context that should be loaded on demand.
+- `scripts/`: Deterministic helpers you want the agent to reuse.
+- `assets/`: Static templates or seed artifacts.
+- `selftune.create.json`: selftune-specific package metadata for readiness and
+  future package replay.
+## Examples
+```bash
+selftune create init --name "Research Assistant" --description "Use when the user needs structured research help."
+selftune create status --skill-path .agents/skills/research-assistant
+selftune verify --skill-path .agents/skills/research-assistant
+selftune create scaffold --from-workflow 1
+selftune create replay --skill-path .agents/skills/research-assistant --mode package
+selftune create baseline --skill-path .agents/skills/research-assistant --mode package
+selftune create report --skill-path .agents/skills/research-assistant
+selftune publish --skill-path .agents/skills/research-assistant
+selftune create scaffold --from-workflow "Copywriting→MarketingAutomation→SelfTuneBlog" --skill-name "blog publisher" --write
+selftune create init --name "Release Note Writer" --description "Use when the user needs changelog-ready release notes." --output-dir .agents/skills
+selftune create init --name "Internal Docs Helper" --description "Use when the user needs internal documentation updates." --json
+```
+## Common Patterns
+- "Start a brand-new skill package"
+  `selftune create init --name "Research Assistant" --description "Use when the user needs structured research help."`
+- "Write the scaffold into a different local registry"
+  `selftune create init --name "Research Assistant" --description "Use when the user needs structured research help." --output-dir ~/skills`
+- "Replace an older draft with a fresh scaffold"
+  `selftune create init --name "Research Assistant" --description "Use when the user needs structured research help." --force`
+- "Preview a package scaffold from telemetry"
+  `selftune create scaffold --from-workflow 1`
+- "Write a workflow-derived package draft"
+  `selftune create scaffold --from-workflow 1 --output-dir .agents/skills --write`
+- "See where the draft is in the lifecycle"
+  `selftune create status --skill-path .agents/skills/research-assistant`
+- "Run the lifecycle-first draft verification step"
+  `selftune verify --skill-path .agents/skills/research-assistant`
+- "Run the low-level draft readiness check"
+  `selftune create check --skill-path .agents/skills/research-assistant`
+- "Replay-validate the whole draft package"
+  `selftune create replay --skill-path .agents/skills/research-assistant --mode package`
+- "Measure draft-package lift versus no-skill"
+  `selftune create baseline --skill-path .agents/skills/research-assistant --mode package`
+- "Render the benchmark-style package report"
+  `selftune create report --skill-path .agents/skills/research-assistant`
+- "Ship the draft through the lifecycle-first surface"
+  `selftune publish --skill-path .agents/skills/research-assistant`
+- "Ship the draft through the legacy create surface"
+  `selftune create publish --skill-path .agents/skills/research-assistant --watch`
+## Follow-on Workflows
+After the draft exists:
+- use `workflows/Verify.md` to build trust evidence
+- use `workflows/Publish.md` to ship the draft safely
+## Notes
+- The generated package is intentionally sparse. It is a draft, not a published
+  skill.
+- Replace the placeholder routing and workflow text before distribution.
+- `Create` only owns draft authoring and local draft state.
+- `Verify` owns trust evidence.
+- `Publish` owns shipping + watch handoff.
+- Lower-level `create check`, `create replay`, `create baseline`, `create report`,
+  and `create publish` still exist, but they are no longer the primary teaching
+  path in the skill surface.
+- `create publish --watch --json` now returns both the raw nested `watch_result`
+  payload and a normalized `package_evaluation.watch` block, so agents can read
+  post-deploy pass rates, invocation totals, rollback state, and grade-watch
+  deltas from the same measured package-evaluation contract they already use for
+  replay and baseline evidence.
+- The publish payload now also surfaces `watch_gate_passed`,
+  `watch_gate_warnings`, and `watch_trust_score`, so agents can tell whether the
+  latest watch signal cleared the advisory trust gate without parsing prose.
+- `create report` and `create publish --json` now also surface
+  `package_evaluation.grading` when grading baselines and recent grading runs
+  exist, so agents can compare draft-package replay/baseline results against
+  observed execution quality instead of treating grading as a separate watch-only
+  signal.
+- selftune now stores the latest measured package-evaluation summary
+  canonically in SQLite and mirrors it to
+  `~/.selftune/package-evaluations/<skill>.json`, so later publish/report/watch
+  steps can reuse one measured artifact instead of treating package evaluation
+  as stdout-only output.
+- `selftune workflows scaffold` now writes the same package shape for backward
+  compatibility, but `selftune create scaffold` is the primary authoring
+  surface.
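
The Notes in the Create.md hunk above say the `create publish --json` payload surfaces `watch_gate_passed`, `watch_gate_warnings`, and `watch_trust_score`. A minimal sketch of how an agent might read those fields without parsing prose follows. Only the three field names come from the diff; their types (boolean, string array, number), the `PublishPayload` interface, and the `summarizeWatchGate` helper are assumptions for illustration, not selftune's actual contract.

```typescript
// Hypothetical shape: field names are from Create.md, types are assumed.
interface PublishPayload {
  watch_gate_passed?: boolean;
  watch_gate_warnings?: string[];
  watch_trust_score?: number;
}

// Hypothetical helper: turn the raw JSON text into a one-line verdict.
function summarizeWatchGate(rawJson: string): string {
  const payload = JSON.parse(rawJson) as PublishPayload;

  // Without --watch the gate fields may be absent (assumption).
  if (typeof payload.watch_gate_passed !== "boolean") {
    return "no watch gate signal";
  }

  const verdict = payload.watch_gate_passed ? "passed" : "failed";
  const warnings = payload.watch_gate_warnings ?? [];
  const score =
    typeof payload.watch_trust_score === "number"
      ? String(payload.watch_trust_score)
      : "n/a";

  return `watch gate ${verdict}; trust score ${score}; ${warnings.length} warning(s)`;
}
```

In practice the input would be the stdout of `selftune create publish --skill-path <path> --watch --json`; the sketch takes a string so it stays independent of how the command is spawned.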

package/skill/workflows/CreateTestDeploy.md CHANGED

@@ -1,37 +1,38 @@
 # selftune Create, Test, and Deploy Workflow
 Use this when the user wants one guided path from a new or shaky skill to a
-safe shipped skill.
+safe shipped package.
 This is a composed workflow. It does not replace the atomic `Evals`,
 `UnitTest`, `Baseline`, `Evolve`, or `Watch` workflows. It decides which one
-comes next and keeps the creator trust loop in order.
+comes next and keeps the package evaluation pipeline in order.
 ## When to Use
 - The user says "create, test, and deploy"
-- The user wants the full creator loop end to end
+- The user wants the full package evaluation pipeline end to end
 - The user asks "how do I know this skill works?" before shipping
 - The user asks whether a skill is ready to deploy
 - The user wants one recommended path from cold start to live watch
 ## Default Path
-There is no single `selftune create-test-deploy` command yet. Run the loop
-step by step:
+Prefer the newer lifecycle:
 ```bash
-selftune eval generate --skill <name> --skill-path <path>
-selftune eval unit-test --skill <name> --generate --skill-path <path>
-selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
-selftune grade baseline --skill <name> --skill-path <path>
-selftune evolve --skill <name> --skill-path <path> --with-baseline
-selftune watch --skill <name>
+# author or inspect the draft
+selftune create status --skill-path <path>
+# build trust evidence
+selftune verify --skill-path <path>
+# ship safely
+selftune publish --skill-path <path>
 ```
 ## How to Run It
-### 1. Resolve the current loop position
+### 1. Resolve the current lifecycle position
 Start with one of these surfaces:
@@ -90,11 +91,10 @@ Then continue to replay dry-run validation.
 Run:
 ```bash
-selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
+selftune create replay --skill-path <path> --mode package
 ```
-This is the pre-deploy proof step. It validates against runtime-style routing
-without mutating the skill.
+This is the runtime proof step behind `verify`.
 Then continue to baseline.
@@ -103,21 +103,21 @@ Then continue to baseline.
 Run:
 ```bash
-selftune grade baseline --skill <name> --skill-path <path>
+selftune create baseline --skill-path <path> --mode package
 ```
-Then continue to live deploy.
+Then re-run `verify`.
 #### Ready to deploy
 Run:
 ```bash
-selftune evolve --skill <name> --skill-path <path> --with-baseline
+selftune publish --skill-path <path>
 ```
-This is the recommended creator ship command because it deploys only after the
-candidate clears the earlier trust gates.
+This is the recommended creator ship command because it re-runs the draft
+package validation gates and starts watch automatically.
 Then continue to watch.
@@ -134,13 +134,19 @@ another iteration.
 ## Which workflow to read next
-Load the atomic workflow that matches the next missing step:
+Prefer the newer primary workflows:
+- authoring -> `workflows/Create.md`
+- trust-building -> `workflows/Verify.md`
+- shipping -> `workflows/Publish.md`
+Load the lower-level workflows only when the user explicitly wants the details:
-- eval generation -> `workflows/Evals.md`
-- unit tests -> `workflows/UnitTest.md`
-- replay dry-run / deploy -> `workflows/Evolve.md`
-- baseline -> `workflows/Baseline.md`
-- live monitoring -> `workflows/Watch.md`
+- `workflows/Evals.md`
+- `workflows/UnitTest.md`
+- `workflows/Replay.md`
+- `workflows/Baseline.md`
+- `workflows/Watch.md`
 Use `references/creator-playbook.md` when the user is publishing a skill other
 people will install and needs before-ship versus after-ship guidance.
@@ -150,13 +156,11 @@ people will install and needs before-ship versus after-ship guidance.
 **User asks for one end-to-end shipping path**
 > Use this workflow. Check the current readiness surface first, then run the
-> next missing creator-loop step instead of dumping every command at once.
+> next missing pipeline step instead of dumping every command at once.
 **User asks whether a skill is safe to ship**
-> Use `selftune status` or the dashboard to confirm evals, unit tests, replay
-> validation, and baseline exist. If all four are complete, run `selftune
-> evolve --with-baseline`. Otherwise run the missing step first.
+> Use `Verify` first. If the skill is already verified, move to `Publish`.
 **User already shipped the skill**

package/skill/workflows/Cron.md CHANGED

@@ -94,7 +94,7 @@ no token cost for routine runs.
 OS scheduler fires (cron/launchd/systemd)
     |
     v
-selftune orchestrate --max-skills 3   (CLI runs directly, no agent)
+selftune run --max-skills 3           (CLI runs directly, no agent)
     |
     v
 sync → candidate selection → evolve → validate → deploy → watch
@@ -107,7 +107,7 @@ Next interactive agent session uses updated description
 ```
 This is distinct from interactive mode where the user says "improve my skills"
-and the agent runs orchestrate. Automated mode is for routine maintenance;
+and the agent runs `selftune run`. Automated mode is for routine maintenance;
 interactive mode is for user-directed improvements.
 ## Safety Controls

package/skill/workflows/Dashboard.md CHANGED

@@ -72,7 +72,7 @@ staying stale.
 The dashboard connects to `/api/v2/events` via Server-Sent Events.
 The server watches the SQLite WAL file for changes and broadcasts an
 `update` event when new data is written. The dashboard also broadcasts
-`action` events while creator-loop commands are running so the UI can
+`action` events while lifecycle commands are running so the UI can
 show live stdout/stderr and terminal success/failure. This works for
 both dashboard-triggered actions and supported `selftune` commands run
 directly in another terminal, because the CLI writes a shared action
@@ -81,7 +81,7 @@ invalidates cached queries on updates and terminal action events (~1s
 latency for DB-backed updates).
 For demo or operator workflows, the skill report can open a dedicated
-live-run screen. That screen follows one active creator-loop run at a
+live-run screen. That screen follows one active lifecycle run at a
 time, keeps a larger terminal log visible, and shows parsed dry-run
 summary fields plus historical model/platform/token aggregates from the
 skill report. Replay dry-runs also attach live `metrics` events when the
@@ -105,7 +105,7 @@ See [docs/design-docs/live-dashboard-sse.md](../../docs/design-docs/live-dashboa
 Action buttons in the dashboard trigger selftune commands via POST
 requests. Each endpoint spawns a `bun run` subprocess.
-**Creator-loop and watch/deploy actions** request body:
+**Lifecycle and watch/deploy actions** request body:
 ```json
 {

package/skill/workflows/Evals.md CHANGED

@@ -20,24 +20,27 @@ Invoke this workflow when the user requests any of the following:
 selftune eval generate --skill <name> [options]
 ```
-## Recommended Creator Loop
+## Recommended Package Evaluation Pipeline
-Use eval generation as step 1 of the default creator loop:
+Use eval generation as step 1 of the package evaluation pipeline:
 ```bash
+selftune verify --skill-path <path>
 selftune eval generate --skill <name>
+selftune verify --skill-path <path>
 selftune eval unit-test --skill <name> --generate --skill-path <path>
-selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
-selftune grade baseline --skill <name> --skill-path <path>
-selftune evolve --skill <name> --skill-path <path> --with-baseline
-selftune watch --skill <name>
+selftune verify --skill-path <path>
 ```
 The command still writes the requested output path, and it now also mirrors a canonical copy into
 `~/.selftune/eval-sets/<skill>.json` so the dashboard and `selftune status` can track whether eval
-coverage exists. Once the earlier steps are complete, the creator loop surfaces now flip from
+coverage exists. Once the earlier steps are complete, the pipeline surfaces now flip from
 "needs testing" to "ready to deploy" and then "watching" after ship.
+For already-published skills, eval generation is still a common supporting step
+before `selftune improve` / `selftune evolve` when you need fresher trigger
+evidence.
 ## Options
 | Flag                               | Description                                           | Default                           |
@@ -51,6 +54,7 @@ coverage exists. Once the earlier steps are complete, the creator loop surfaces
 | `--no-negatives`                   | Exclude negative examples from output                 | Off                               |
 | `--no-taxonomy`                    | Skip invocation_type classification                   | Off                               |
 | `--skill-log <path>`               | Path to skill_usage_log.jsonl                         | Default log path                  |
+| `--agent <name>`                   | Agent CLI for synthetic/blended eval generation (`claude`, `codex`, `opencode`, `pi`) | Auto-detected          |
 | `--query-log <path>`               | Path to all_queries_log.jsonl                         | Default log path                  |
 | `--telemetry-log <path>`           | Path to session_telemetry_log.jsonl                   | Default log path                  |
 | `--synthetic`                      | Generate evals from SKILL.md via LLM (no logs needed) | Off                               |
@@ -184,6 +188,7 @@ queries directly from the SKILL.md content via an LLM.
 ```bash
 selftune eval generate --skill pptx --synthetic --skill-path /path/to/skills/pptx/SKILL.md
+selftune eval generate --skill pptx --synthetic --skill-path /path/to/skills/pptx/SKILL.md --agent opencode
 ```
 If the skill is installed locally but has no trusted trigger history yet, use the faster creator
@@ -191,6 +196,7 @@ onboarding path:
 ```bash
 selftune eval generate --skill pptx --auto-synthetic --skill-path /path/to/skills/pptx/SKILL.md
+selftune eval generate --skill pptx --auto-synthetic --skill-path /path/to/skills/pptx/SKILL.md --agent opencode
 ```
 `--auto-synthetic` keeps the normal log-based path when real trigger data exists, but falls back

package/skill/workflows/Evolve.md CHANGED

@@ -1,8 +1,10 @@
 # selftune Evolve Workflow
-Improve a skill's description based on real usage signal. Analyzes failure
-patterns from eval sets and proposes description changes that catch more
-natural-language queries without breaking existing triggers.
+Improve a skill's description as part of the package evaluation pipeline.
+Analyzes failure patterns from eval sets and proposes description changes
+that catch more natural-language queries without breaking existing triggers.
+Each proposal is evaluated through replay, baseline, and grading before
+acceptance into the measured frontier.
 ## When to Invoke
@@ -19,19 +21,31 @@ Invoke this workflow when the user requests any of the following:
 selftune evolve --skill <name> --skill-path <path> [options]
 ```
-## Recommended Creator Loop
+## Recommended Package Evaluation Pipeline
 Do not treat `evolve` as the first step when a creator asks whether a skill is
-ready. The default loop is:
+ready. The default package evaluation pipeline is:
 ```bash
+selftune create status --skill-path <path>
+selftune verify --skill-path <path>
 selftune eval generate --skill <name> --skill-path <path>
 selftune eval unit-test --skill <name> --generate --skill-path <path>
-selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
-selftune grade baseline --skill <name> --skill-path <path>
+selftune create replay --skill-path <path> --mode package
+selftune create baseline --skill-path <path> --mode package
+selftune verify --skill-path <path>
+selftune publish --skill-path <path>
 ```
-Then move to a live `selftune evolve ...` or `selftune watch ...` run.
+For already-published skills, this workflow is the right mutation surface. The
+lifecycle alias is `selftune improve`; use `selftune evolve` directly when you
+need exact advanced flags:
+```bash
+selftune improve --skill <name> --skill-path <path> --dry-run --validation-mode replay
+selftune evolve --skill <name> --skill-path <path>
+selftune watch --skill <name>
+```
 If canonical evals or stored unit-test results already exist, reuse them rather
 than regenerating everything.
@@ -45,7 +59,7 @@ than regenerating everything.
 | `--eval-set <path>`          | Pre-built eval set JSON                                                 | Auto-generated from logs       |
 | `--agent <name>`             | Agent CLI to use (claude, codex, opencode, pi)                          | Auto-detected                  |
 | `--dry-run`                  | Propose and validate without deploying                                  | Off                            |
-| `--confidence <n>`           | Minimum confidence threshold (0-1)                                      | 0.6                            |
+| `--confidence <n>`           | Low-confidence review threshold (0-1)                                   | 0.6                            |
 | `--max-iterations <n>`       | Maximum retry iterations                                                | 3                              |
 | `--validation-model <model>` | Model for trigger-check validation LLM calls                            | `haiku`                        |
 | `--pareto`                   | Generate multiple candidates per iteration                              | On                             |
@@ -56,7 +70,7 @@ than regenerating everything.
 | `--full-model`               | Use full-cost model throughout (disables cheap-loop)                    | Off                            |
 | `--verbose`                  | Print detailed progress during evolution                                | Off                            |
 | `--gate-model <model>`       | Model for final gate validation                                         | `sonnet` (when `--cheap-loop`) |
-| `--gate-effort <level>`      | Thinking effort for the final gate (`low|medium|high|max`)              | None                           |
+| `--gate-effort <level>`      | Thinking effort for the final gate (`low\|medium\|high\|max`)           | None                           |
 | `--adaptive-gate`            | Escalate risky gate checks to `opus` + `high` effort                    | Off                            |
 | `--proposal-model <model>`   | Model for proposal generation LLM calls                                 | None                           |
 | `--validation-mode <mode>`   | Validation strategy: `auto`, `replay`, or `judge`                       | `auto`                         |
@@ -300,7 +314,8 @@ The candidate is tested against the full eval set:
 - Must improve overall pass rate
 - Must not regress more than 5% on previously-passing entries
-- Must exceed the `--confidence` threshold
+- May still deploy when confidence is low if measured validation is strong;
+  `--confidence` only controls warning/review sensitivity
 If validation fails, the command retries up to `--max-iterations` times
 with adjusted proposals.
@@ -385,18 +400,18 @@ Proposals are scored on heuristic quality criteria (no LLM required). The compos
 The evolution loop uses a modular stopping criteria evaluator
 (`evolution/stopping-criteria.ts`) that checks conditions in priority order
-after each validation pass. The evaluator receives the current pass rate,
-historical pass rates from previous iterations, and proposal confidence to
-make a unified stop/continue decision. The stopping reason is recorded in
-audit entries for traceability.
-| #   | Condition          | Meaning                                                        |
-| --- | ------------------ | -------------------------------------------------------------- |
-| 1   | **Converged**      | Pass rate >= 0.95                                              |
-| 2   | **Max iterations** | Reached `--max-iterations` limit                               |
-| 3   | **Low confidence** | Proposal confidence below `--confidence` threshold             |
-| 4   | **Plateau**        | < 1% pass rate variation across 3 consecutive iterations       |
-| 5   | **Continue**       | None of the above -- keep iterating                            |
+after each validation pass. The evaluator receives the current pass rate and
+historical pass rates from previous iterations to make a unified
+stop/continue decision. Confidence is still recorded as metadata and may
+raise warnings or gate-review risk, but it is not a standalone stop reason.
+The stopping reason is recorded in audit entries for traceability.
+| #   | Condition          | Meaning                                                  |
+| --- | ------------------ | -------------------------------------------------------- |
+| 1   | **Converged**      | Pass rate >= 0.95                                        |
+| 2   | **Max iterations** | Reached `--max-iterations` limit                         |
+| 3   | **Plateau**        | < 1% pass rate variation across 3 consecutive iterations |
+| 4   | **Continue**       | None of the above -- keep iterating                      |
 ## Cheap Loop Mode
@@ -447,11 +462,11 @@ selftune evolve apply-proposal --id <proposal-id> --skill-path <path> [--dry-run
 ### Apply-Proposal Options
-| Flag              | Description                                     | Default  |
-| ----------------- | ----------------------------------------------- | -------- |
-| `--id <uuid>`     | Proposal UUID from the dashboard                | Required |
-| `--skill-path`    | Path to the target SKILL.md                     | Required |
-| `--dry-run`       | Preview the proposal without writing to disk    | Off      |
+| Flag           | Description                                  | Default  |
+| -------------- | -------------------------------------------- | -------- |
+| `--id <uuid>`  | Proposal UUID from the dashboard             | Required |
+| `--skill-path` | Path to the target SKILL.md                  | Required |
+| `--dry-run`    | Preview the proposal without writing to disk | Off      |
 ### Apply-Proposal Flow
@@ -482,8 +497,8 @@ Check the eval set quality. Missing contextual examples limit
 what evolution can learn. Generate a richer eval set first using the Evals workflow.
 **Evolution keeps failing validation:**
-Lower `--confidence` slightly or increase `--max-iterations`.
-Also check if the eval set has contradictory expectations.
+Increase `--max-iterations` or improve the eval set.
+Lower `--confidence` only if you want fewer low-confidence review warnings.
 **Agent CLI override needed:**
 The evolve command auto-detects the installed agent CLI.
@@ -497,10 +512,35 @@ This is especially valuable when the skill has a history of regressions,
 the evolution touches many trigger phrases, or the confidence score is near
 the threshold.
+## Scope: Description vs Package
+The `evolve` command operates on description-level triggers and phrasing. For
+package-level improvement (routing tables, body content, and the full skill
+package), use `selftune improve --scope package` or `selftune search-run`,
+which delegates to the bounded package search flow. That search path now uses
+reflective proposals first, then measured targeted routing/body variants, then
+deterministic fallback. It also evaluates a merged routing/body candidate when
+both surfaces produce accepted improvements.
+When `selftune orchestrate` (or `selftune run`) selects candidates
+automatically, it chooses between description-level evolve and package-level
+search based on evidence:
+- Plain `selftune improve` also auto-selects package search for package-shaped
+  skills with draft manifests or package-frontier evidence.
+- Skills with an accepted package frontier or canonical package evaluation
+  showing room for improvement are routed to package search.
+- Skills without package evaluation history continue through description-level
+  evolve.
+This means `evolve` remains the right tool for description and trigger
+coverage, while package-level mutations are handled by the search pipeline.
 ## Autonomous Mode
-When called by `selftune orchestrate` (via cron or --loop), evolution runs
-without user interaction:
+When called by `selftune orchestrate` (via cron or --loop), description-level
+evolution runs without user interaction:
 - Pre-flight is skipped entirely — defaults are used
 - The orchestrator selects candidate skills based on health scores
@@ -511,3 +551,6 @@ without user interaction:
 No user confirmation is needed. The safety controls (regression threshold,
 auto-rollback via watch, SKILL.md backup) provide the guardrails.
+For package-level candidates, orchestrate delegates to bounded search instead
+of evolve. See `workflows/Orchestrate.md` for the full scope-selection logic.
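
The revised stopping-criteria table in the Evolve.md hunk above drops "Low confidence" as a stop reason and leaves four conditions checked in priority order. The sketch below illustrates that ordering. It is not the code in `evolution/stopping-criteria.ts`: the type names, the `evaluateStopping` function, and the reading of "< 1% variation across 3 consecutive iterations" as a max-minus-min spread below 0.01 are all assumptions.

```typescript
// Hypothetical names; only the conditions and thresholds come from the table.
type StopReason = "converged" | "max_iterations" | "plateau" | "continue";

interface StoppingInput {
  passRate: number;            // current pass rate, 0..1
  previousPassRates: number[]; // earlier iterations, oldest first
  iteration: number;           // 1-based count of completed iterations
  maxIterations: number;       // value of --max-iterations
}

function evaluateStopping(input: StoppingInput): StopReason {
  const { passRate, previousPassRates, iteration, maxIterations } = input;

  // 1. Converged: pass rate >= 0.95.
  if (passRate >= 0.95) return "converged";

  // 2. Max iterations: reached the --max-iterations limit.
  if (iteration >= maxIterations) return "max_iterations";

  // 3. Plateau: assumed to mean the last three pass rates span less than 0.01.
  const recent = [...previousPassRates, passRate].slice(-3);
  if (recent.length === 3) {
    const spread = Math.max(...recent) - Math.min(...recent);
    if (spread < 0.01) return "plateau";
  }

  // 4. Continue: none of the above. Confidence is deliberately not consulted.
  return "continue";
}
```

Note that confidence does not appear in the input at all, which matches the diff's statement that it is recorded as metadata and may raise warnings but is not a standalone stop reason.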