npm - selftune - Versions diffs - 0.2.23 → 0.2.25 - Mend

selftune 0.2.23 → 0.2.25

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (219)

package/skill/{Workflows → workflows}/CreatorContributions.md RENAMED

@@ -1,10 +1,12 @@
 # selftune Creator-Contributions Workflow
-Manage the creator-side `selftune.contribute.json` file bundled with a skill.
+Manage the **creator sharing setup** — the `selftune.contribute.json` file
+bundled with a skill package.
 This is **not** the same as:
-- `selftune contributions` — end-user opt-in / opt-out preferences
-- `selftune contribute` — community export bundle
+- `selftune contributions` — end-user **sharing preferences** (opt-in / opt-out)
+- `selftune contribute` — community **export bundle** (anonymized data export)
+- The signals dashboard — viewing aggregated **contributor signal data** from all contributors
 ## When to Use
@@ -45,8 +47,17 @@ selftune creator-contributions disable --skill <name> [--skill-path <path>]
 ## Notes
 - This is local packaging/setup only. It does **not** upload creator-directed signals yet.
-- The creator ID is currently sourced from `--creator-id` or the local alpha identity's `cloud_user_id`.
+- The `creator_id` field must be the creator's cloud user UUID (the `cloud_user_id` from alpha enrollment). This is the canonical identifier used to route signals back to the correct creator account.
+- The creator ID is sourced from `--creator-id` or the local alpha identity's `cloud_user_id`.
 - Use this workflow when the user is preparing a skill package.
+- For the full creator lifecycle, read `references/creator-playbook.md` before shipping.
+## Selftune Dogfood Config
+The selftune skill itself ships a bundled `selftune.contribute.json` at
+`oss/selftune/skill/selftune.contribute.json`. This is the selftune project
+dogfooding its own creator-directed relay flow. The `creator_id` field is
+set to the production selftune creator's cloud user UUID.
 ## Common Patterns
@@ -60,13 +71,14 @@ selftune creator-contributions disable --skill <name> [--skill-path <path>]
 > Run `selftune creator-contributions enable --skill <name>`.
 > If auto-discovery fails, rerun with `--skill-path /path/to/SKILL.md`.
 > If no creator identity is available locally, rerun with `--creator-id <id>`.
-> Example: `selftune creator-contributions enable --skill sc-search --skill-path ./skills/sc-search/SKILL.md --creator-id cr_state_change --signals trigger,grade,miss_category --message "Share privacy-safe usage signals with the skill creator." --privacy-url https://statechange.ai/privacy`
+> The command rejects non-UUID creator IDs and unsupported signal names.
+> Example: `selftune creator-contributions enable --skill sc-search --skill-path ./skills/sc-search/SKILL.md --creator-id 550e8400-e29b-41d4-a716-446655440000 --signals trigger,grade,miss_category --message "Share privacy-safe usage signals with the skill creator." --privacy-url https://statechange.ai/privacy`
 **User wants to enable creator contributions for a whole installed skill suite**
 > Run `selftune creator-contributions enable --all --prefix sc-`.
 > This is the fastest path when preparing a whole family of skills like State Change skills.
-> Example: `selftune creator-contributions enable --all --prefix sc- --creator-id cr_state_change`
+> Example: `selftune creator-contributions enable --all --prefix sc- --creator-id 550e8400-e29b-41d4-a716-446655440000`
 **User wants to stop bundling creator contribution config**
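
The notes in this hunk say the command rejects non-UUID creator IDs and unsupported signal names. A minimal Python sketch of that kind of check follows. The function name `validate_enable_args` and the `KNOWN_SIGNALS` set are assumptions for illustration: only `trigger`, `grade`, and `miss_category` appear in the diff, and selftune's actual validation code is not shown here.

```python
import uuid

# Assumption: only these three signal names appear in the diff's example;
# the real CLI may accept a different or larger set.
KNOWN_SIGNALS = {"trigger", "grade", "miss_category"}


def validate_enable_args(creator_id: str, signals: str) -> list[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []

    # uuid.UUID raises ValueError for anything that is not a well-formed UUID,
    # for example the old placeholder "cr_state_change".
    try:
        uuid.UUID(creator_id)
    except ValueError:
        problems.append(f"creator id is not a UUID: {creator_id!r}")

    # Signals arrive as a comma-separated string, as in the --signals example.
    requested = [name.strip() for name in signals.split(",") if name.strip()]
    unsupported = [name for name in requested if name not in KNOWN_SIGNALS]
    if unsupported:
        problems.append("unsupported signals: " + ", ".join(unsupported))

    return problems
```

Under these assumptions, the old `cr_state_change` placeholder fails the check while the UUID used in the updated examples passes.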

package/skill/{Workflows → workflows}/Cron.md RENAMED

@@ -130,4 +130,4 @@ interactive mode is for user-directed improvements.
 - **User needs a specific timezone (OpenClaw)** -- Run `selftune cron setup --platform openclaw --tz America/New_York`.
 - **User asks what jobs are registered** -- Run `selftune cron list`. Shows a table of all selftune cron jobs with their schedules and descriptions.
 - **User wants to remove cron automation** -- Run `selftune cron remove`. Preview first with `selftune cron remove --dry-run`.
-- **Skill regressed after cron evolution** -- The watch job should catch this automatically. If not, run `selftune evolve rollback --skill <name> --skill-path <path>` manually. See `Workflows/Rollback.md`.
+- **Skill regressed after cron evolution** -- The watch job should catch this automatically. If not, run `selftune evolve rollback --skill <name> --skill-path <path>` manually. See `workflows/Rollback.md`.

package/skill/{Workflows → workflows}/Dashboard.md RENAMED

@@ -22,6 +22,7 @@ generate JSONL from SQLite for debugging or offline analysis.
 | Flag            | Description                               | Default |
 | --------------- | ----------------------------------------- | ------- |
 | `--port <port>` | Custom port for the server                | 3141    |
+| `--restart`     | Force-restart an existing dashboard on the target port | Off |
 | `--no-open`     | Start server without opening browser      | Off     |
 | `--serve`       | _(Deprecated)_ Alias for default behavior | —       |
@@ -35,6 +36,16 @@ suggesting `selftune dashboard` instead.
 The live server binds to `localhost:3141` by default. Use `--port` to
 override.
+If a healthy selftune dashboard is already running on the requested port,
+`selftune dashboard` reuses it instead of failing. If the running standalone
+dashboard version is older than the installed CLI, the command restarts it
+automatically to pick up the update. Use `--restart` to force that behavior
+even when the versions match.
+The dashboard client also polls `/api/health` for `spa_build_id`. If the server
+is newer than the loaded client, the UI shows a reload prompt instead of silently
+staying stale.
 ### Endpoints
 | Method | Path                       | Description                                                |
@@ -162,6 +173,7 @@ checked file paths.
 ```bash
 selftune dashboard
 selftune dashboard --port 8080
+selftune dashboard --restart
 selftune dashboard --no-open
 ```
@@ -182,6 +194,14 @@ to trigger watch, evolve, or rollback directly from the dashboard.
 > Run `selftune dashboard`. The server provides real-time updates via SSE
 > (~1 second latency).
+**User just updated selftune and wants the dashboard to pick up the new UI**
+> Run `selftune dashboard`. It reuses a healthy instance when possible and
+> automatically restarts an older standalone dashboard version on the same port.
+> If the user explicitly wants a restart, run `selftune dashboard --restart`.
+> If the browser still has an older client loaded, the dashboard shows a reload
+> prompt based on `/api/health` build metadata.
 **Dashboard shows no data**
 > Run `selftune doctor` to verify hooks are installed. If hooks are missing,
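
The reuse and restart behavior added in this file can be read as a small decision table. The Python sketch below is one way to express it. The function names and the version format are assumptions (plain dotted numeric versions such as `0.2.25`); how selftune actually probes the port and compares versions is not shown in the diff.

```python
from typing import Optional


def parse_version(version: str) -> tuple[int, ...]:
    # Assumption: dotted numeric versions only, with no pre-release tags.
    return tuple(int(part) for part in version.split("."))


def decide_dashboard_action(
    running_version: Optional[str],
    installed_version: str,
    force_restart: bool = False,
) -> str:
    """Return "start", "reuse", or "restart" for the requested port.

    running_version is None when no healthy dashboard answers on the port.
    """
    if running_version is None:
        return "start"
    if force_restart:
        # Mirrors --restart: restart even when the versions match.
        return "restart"
    if parse_version(running_version) < parse_version(installed_version):
        # An older standalone dashboard is restarted to pick up the update.
        return "restart"
    return "reuse"
```

Comparing parsed tuples rather than raw strings keeps `0.2.9` ordered before `0.2.10`.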

package/skill/{Workflows → workflows}/Doctor.md RENAMED

@@ -163,7 +163,7 @@ For each failed check, take the appropriate action:
 | `evolution_audit`          | Remove corrupted entries. Future operations will append clean entries.                                                                           |
 | `dashboard_freshness_mode` | This is an operator warning, not a broken install. Expect possible freshness gaps for SQLite-only writes and export before destructive recovery. |
 | `skill_version_sync`       | Run `bun run sync-version` to stamp SKILL.md from package.json.                                                                                  |
-| `version_up_to_date`       | Run `npm install -g selftune` to update.                                                                                                         |
+| `version_up_to_date`       | Follow `.checks[].guidance.next_command` for the active install source. Common fixes are `npm install -g selftune@latest`, `bun add -g selftune@latest`, or `npx skills add selftune-dev/selftune`. |
 ### 4. Re-run Doctor
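
The updated `version_up_to_date` row points at `.checks[].guidance.next_command` in the doctor output. A hedged Python sketch of reading that path follows. Only the `checks`, `guidance`, and `next_command` keys come from the diff; the per-check `name` key and the helper name `next_commands` are assumptions.

```python
import json


def next_commands(doctor_json: str) -> dict[str, str]:
    """Map each check that carries guidance to its suggested next command."""
    report = json.loads(doctor_json)
    commands = {}
    for check in report.get("checks", []):
        guidance = check.get("guidance") or {}
        command = guidance.get("next_command")
        if command:
            # Assumption: each check identifies itself with a "name" key.
            commands[check.get("name", "<unnamed>")] = command
    return commands
```

Checks without guidance are skipped, so the result lists only the checks that suggest a follow-up command.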

package/skill/{Workflows → workflows}/Evals.md RENAMED

@@ -20,6 +20,24 @@ Invoke this workflow when the user requests any of the following:
 selftune eval generate --skill <name> [options]
 ```
+## Recommended Creator Loop
+Use eval generation as step 1 of the default creator loop:
+```bash
+selftune eval generate --skill <name>
+selftune eval unit-test --skill <name> --generate --skill-path <path>
+selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
+selftune grade baseline --skill <name> --skill-path <path>
+selftune evolve --skill <name> --skill-path <path> --with-baseline
+selftune watch --skill <name>
+```
+The command still writes the requested output path, and it now also mirrors a canonical copy into
+`~/.selftune/eval-sets/<skill>.json` so the dashboard and `selftune status` can track whether eval
+coverage exists. Once the earlier steps are complete, the creator loop surfaces now flip from
+"needs testing" to "ready to deploy" and then "watching" after ship.
 ## Options
 | Flag                               | Description                                           | Default                           |
@@ -39,6 +57,8 @@ selftune eval generate --skill <name> [options]
 | `--auto-synthetic`                 | Fall back to SKILL.md-based cold-start evals when no trusted triggers exist | Off                  |
 | `--skill-path <path>`              | Path to SKILL.md (required with `--synthetic`)        | —                                 |
 | `--model <model>`                  | LLM model to use for synthetic generation             | Agent default                     |
+| `--blend`                          | Blend log-based and synthetic evals into one set      | Off                               |
+| `--help`                           | Show command help                                     | Off                               |
 ## Output Format
@@ -49,11 +69,14 @@ selftune eval generate --skill <name> [options]
   {
     "query": "Make me a slide deck for the Q3 board meeting",
     "should_trigger": true,
-    "invocation_type": "contextual"
+    "invocation_type": "contextual",
+    "source": "log",
+    "created_at": "2026-04-01T12:00:00Z"
   },
   {
     "query": "What format should I use for a presentation?",
-    "should_trigger": false
+    "should_trigger": false,
+    "source": "synthetic"
   }
 ]
 ```
@@ -61,6 +84,24 @@ selftune eval generate --skill <name> [options]
 Each entry has `query` (string, max 500 chars), `should_trigger` (boolean),
 and optional `invocation_type` (omitted when `--no-taxonomy` is set).
+Entries also carry optional provenance fields:
+- `source` — `"log"` (from real usage logs), `"synthetic"` (LLM-generated from SKILL.md), or `"blended"` (synthetic entry that survived dedup in a blended set)
+- `created_at` — ISO timestamp of when the entry was created
+Use `computeEvalSourceStats(entries)` to get aggregate provenance statistics:
+```json
+{
+  "total": 80,
+  "synthetic": 10,
+  "log": 50,
+  "blended": 20,
+  "oldest": "2026-03-01T00:00:00Z",
+  "newest": "2026-04-01T12:00:00Z"
+}
+```
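
For readers who want to see how such statistics could be derived, here is a Python sketch that produces the same shape as the JSON above. It is not selftune's `computeEvalSourceStats`. It assumes that entries without a `source` count only toward `total`, and that `oldest` and `newest` are taken over entries that carry `created_at`.

```python
from datetime import datetime


def _parse_ts(timestamp: str) -> datetime:
    # fromisoformat does not accept a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def compute_eval_source_stats(entries: list[dict]) -> dict:
    stats = {
        "total": len(entries),
        "synthetic": 0,
        "log": 0,
        "blended": 0,
        "oldest": None,
        "newest": None,
    }
    stamps = []
    for entry in entries:
        source = entry.get("source")
        if source in ("synthetic", "log", "blended"):
            stats[source] += 1
        if entry.get("created_at"):
            stamps.append(entry["created_at"])
    if stamps:
        # Compare parsed datetimes but report the original strings.
        stats["oldest"] = min(stamps, key=_parse_ts)
        stats["newest"] = max(stamps, key=_parse_ts)
    return stats
```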
 ### List Skills
 ```json
@@ -181,6 +222,30 @@ Use `--model` to override the default LLM model:
 selftune eval generate --skill pptx --synthetic --skill-path ./skills/pptx/SKILL.md --model claude-sonnet-4-5-20250514
 ```
+### Generate Blended Evals
+When a skill has real log data but you want to fill coverage gaps with synthetic
+entries, use `--blend` to combine both sources into one eval set.
+```bash
+selftune eval generate --skill pptx --blend --skill-path /path/to/skills/pptx/SKILL.md
+```
+The blending policy:
+1. Keep ALL log-based entries (marked `source: "log"`)
+2. Generate synthetic entries from SKILL.md
+3. Deduplicate: drop any synthetic entry whose normalized Levenshtein distance to any log entry is < 0.3
+4. Mark surviving synthetic entries as `source: "blended"`
+5. Cap total entries at 2x the log-based count
+This preserves real-world boundary cases from logs while filling underrepresented
+invocation types with synthetic entries. The 2x cap prevents synthetic entries from
+overwhelming log signal.
+`--blend` requires a resolvable SKILL.md path. Use `--skill-path` or install the
+skill locally so selftune can find it.
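
The blending policy above can be sketched in a few lines of Python. This is an illustration, not selftune's implementation. It assumes the distance is normalized by the length of the longer query, that the cap is applied by dropping surplus synthetic entries in order, and that the helper names (`levenshtein`, `normalized_distance`, `blend`) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    # Assumption: normalize by the longer string so the result lies in [0, 1].
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def blend(log_entries, synthetic_entries, threshold=0.3, cap_multiplier=2):
    # Step 1: keep every log-based entry.
    blended = [{**entry, "source": "log"} for entry in log_entries]
    # Step 5: total size is capped at cap_multiplier times the log count.
    max_total = cap_multiplier * len(log_entries)
    for candidate in synthetic_entries:
        if len(blended) >= max_total:
            break
        # Step 3: drop synthetic entries that are near-duplicates of a log entry.
        if any(normalized_distance(candidate["query"], log["query"]) < threshold
               for log in log_entries):
            continue
        # Step 4: survivors are marked as blended.
        blended.append({**candidate, "source": "blended"})
    return blended
```

With one log entry, the cap leaves room for a single surviving synthetic entry, which matches the intent that synthetic entries never outnumber log entries.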
 ### Generate Evals (Log-Based)
 Cross-reference `skill_usage_log.jsonl` (positive triggers) against

package/skill/{Workflows → workflows}/Evolve.md RENAMED

@@ -19,6 +19,23 @@ Invoke this workflow when the user requests any of the following:
 selftune evolve --skill <name> --skill-path <path> [options]
 ```
+## Recommended Creator Loop
+Do not treat `evolve` as the first step when a creator asks whether a skill is
+ready. The default loop is:
+```bash
+selftune eval generate --skill <name> --skill-path <path>
+selftune eval unit-test --skill <name> --generate --skill-path <path>
+selftune evolve --skill <name> --skill-path <path> --dry-run --validation-mode replay
+selftune grade baseline --skill <name> --skill-path <path>
+```
+Then move to a live `selftune evolve ...` or `selftune watch ...` run.
+If canonical evals or stored unit-test results already exist, reuse them rather
+than regenerating everything.
 ## Options
 | Flag                         | Description                                                             | Default                        |
@@ -26,7 +43,7 @@ selftune evolve --skill <name> --skill-path <path> [options]
 | `--skill <name>`             | Skill name                                                              | Required                       |
 | `--skill-path <path>`        | Path to the skill's SKILL.md                                            | Required                       |
 | `--eval-set <path>`          | Pre-built eval set JSON                                                 | Auto-generated from logs       |
-| `--agent <name>`             | Agent CLI to use (claude, codex, opencode)                              | Auto-detected                  |
+| `--agent <name>`             | Agent CLI to use (claude, codex, opencode, pi)                          | Auto-detected                  |
 | `--dry-run`                  | Propose and validate without deploying                                  | Off                            |
 | `--confidence <n>`           | Minimum confidence threshold (0-1)                                      | 0.6                            |
 | `--max-iterations <n>`       | Maximum retry iterations                                                | 3                              |
@@ -42,8 +59,10 @@ selftune evolve --skill <name> --skill-path <path> [options]
 | `--gate-effort <level>`      | Thinking effort for the final gate (`low|medium|high|max`)              | None                           |
 | `--adaptive-gate`            | Escalate risky gate checks to `opus` + `high` effort                    | Off                            |
 | `--proposal-model <model>`   | Model for proposal generation LLM calls                                 | None                           |
+| `--validation-mode <mode>`   | Validation strategy: `auto`, `replay`, or `judge`                       | `auto`                         |
 | `--sync-first`               | Refresh source-truth telemetry before generating evals/failure patterns | Off                            |
 | `--sync-force`               | Force a full source rescan during `--sync-first`                        | Off                            |
+| `--help`                     | Show command help                                                       | Off                            |
 ## Output Format
@@ -83,37 +102,42 @@ Routing/body validation may also carry provenance fields such as:
 - `validation_fixture_id` — fixture identifier when replay-backed validation is used
 - `before_pass_rate` / `after_pass_rate` — only present when trigger validation actually ran; structural-guard exits do not emit synthetic pass rates
-Most evolve runs today still validate through `llm_judge`. Routing evolution now
-auto-builds a replay fixture from the target skill plus installed sibling
-skills in the same registry, so replay-backed validation is preferred whenever
-that local fixture can be constructed because it captures host-style routing
-behavior instead of model judgment.
-For Claude Code, the replay path now stages a temporary project-local
-`.claude/skills` registry, swaps in the candidate routing table, and runs a
-one-turn Claude print-mode session with project/local settings only. Validation
-records whether Claude actually invoked the target skill, invoked a competing
-skill, invoked an unrelated skill, or made no routing decision at all.
-Unrelated skill use is treated as a replay failure even on negative evals,
-because it still indicates the runtime routed somewhere unexpected. If that
-runtime path is unavailable or fails to reach a runtime decision, selftune
-falls back to the existing fixture-backed surface simulation and notes the
-fallback in the replay evidence instead of pretending it was a runtime result.
-For non-Claude platforms today, replay remains fixture-backed: it evaluates the
-target routing table against the installed target/competing skill surfaces in a
-controlled replay fixture and records per-entry evidence. That is still a
-stronger signal than a free-form judge prompt, but you should describe it as
-replay-backed validation, not as live operator telemetry.
+Most evolve runs today still validate through `llm_judge`. Replay-backed
+validation is only considered available when selftune can run a real
+host/runtime replay for the target host. Today that means the Claude Code,
+Codex, and OpenCode paths can stage a temporary local registry, apply the
+candidate skill content, and observe the runtime's actual routing decision;
+when that runtime path is unavailable, `auto` falls back to `llm_judge` and
+`replay` errors explicitly instead of silently downgrading to fixture
+simulation.
+Description, routing, and full-body evolution now share the same public
+validation contract: `auto` prefers replay and falls back to judge, `replay`
+requires a replay path, and `judge` bypasses replay entirely. Audit and
+evidence records may also include `validation_fallback_reason` when `auto`
+had to fall back from replay to judge.
+Replay stages the candidate into the target host's project-local registry:
+Claude Code uses `.claude/skills`, Codex uses `.agents/skills`, and OpenCode
+uses `.opencode/skills`. Validation records whether the runtime selected the
+target skill, selected a competing skill, selected an unrelated skill, or made
+no routing decision at all. Reads outside the staged skill set are treated as
+replay failures even on negative evals, because they indicate the runtime left
+the controlled evaluation surface.
+For hosts without runtime replay support today, replay is not available. In
+`auto` mode selftune falls back to `llm_judge`; in `replay` mode it exits with
+`REPLAY_UNAVAILABLE`. Do not describe fixture-only surface matching as replay
+validation in user-facing summaries.
 Replay parsing is intentionally conservative: unreadable skill files degrade to
 empty surfaces instead of throwing, and malformed routing rows with empty
-trigger cells are ignored rather than treated as valid triggers. Claude replay
-also normalizes observed `Read` paths against the staged workspace, so relative
-skill reads still count as read-only evidence for the target or competing
-skill. Reads outside the staged skill set are treated as replay failures rather
-than benign negatives, because they indicate the runtime left the controlled
-evaluation surface.
+trigger cells are ignored rather than treated as valid triggers. Replay also
+normalizes observed skill reads against the staged workspace, so relative skill
+paths from Claude, Codex, or OpenCode still count as evidence for the target or
+competing skill. Reads outside the staged skill set are treated as replay
+failures rather than benign negatives, because they indicate the runtime left
+the controlled evaluation surface.
 ## Parsing Instructions
@@ -281,6 +305,40 @@ The candidate is tested against the full eval set:
 If validation fails, the command retries up to `--max-iterations` times
 with adjusted proposals.
+### Validation Mode (`--validation-mode`)
+The `--validation-mode` flag controls which validation engine is used for
+description proposals. Three modes are available:
+| Mode     | Behavior                                                                 |
+| -------- | ------------------------------------------------------------------------ |
+| `auto`   | Try replay-based validation first; fall back to LLM judge if unavailable |
+| `replay` | Replay engine only; error if no replay fixture or runner is available    |
+| `judge`  | LLM judge only (legacy path via `validateProposal`)                      |
+The default is `auto`, which provides the strongest available signal without
+requiring manual fixture configuration. When replay is available, it stages the
+candidate skill content into a temporary local registry and records the
+runtime's actual routing decision per eval entry. For description evolution,
+that means the proposed description is applied to the target skill before
+replay. When replay is not available, `auto` falls back to the LLM judge and
+logs the fallback.
+The actual mode used is recorded as `validation_mode` in audit entries
+(`llm_judge`, `host_replay`, or `structural_guard`), along with
+`validation_agent` and `validation_fixture_id` when applicable.
+```bash
+# Default: auto (replay-first, judge fallback)
+selftune evolve --skill pptx --skill-path ./skills/pptx/SKILL.md
+# Force replay only (error if unavailable)
+selftune evolve --skill pptx --skill-path ./skills/pptx/SKILL.md --validation-mode replay
+# Force judge only (legacy behavior)
+selftune evolve --skill pptx --skill-path ./skills/pptx/SKILL.md --validation-mode judge
+```
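
The three modes can be summarized as a small resolution function. The Python sketch below follows the contract described in this file: `judge` bypasses replay, `replay` fails with `REPLAY_UNAVAILABLE` when no runtime replay exists, and `auto` falls back to `llm_judge` and records a reason. The function name, the fallback reason string, and the idea of deciding availability purely from the host name are assumptions; the real CLI checks whether the runtime path actually works.

```python
from typing import Optional

# Project-local staging registries named in the diff, keyed by --agent value.
HOST_STAGING_DIRS = {
    "claude": ".claude/skills",
    "codex": ".agents/skills",
    "opencode": ".opencode/skills",
}


class ReplayUnavailableError(RuntimeError):
    """Raised in replay mode when the host has no runtime replay path."""


def resolve_validation(mode: str, agent: str) -> tuple[str, Optional[str]]:
    """Return (validation_mode recorded in the audit, fallback reason or None)."""
    # Assumption: availability is decided only by the host name here.
    replay_available = agent in HOST_STAGING_DIRS

    if mode == "judge":
        return ("llm_judge", None)
    if mode == "replay":
        if not replay_available:
            raise ReplayUnavailableError("REPLAY_UNAVAILABLE")
        return ("host_replay", None)
    if mode == "auto":
        if replay_available:
            return ("host_replay", None)
        # Hypothetical reason text for validation_fallback_reason.
        return ("llm_judge", f"no runtime replay for host {agent!r}")
    raise ValueError(f"unknown validation mode: {mode!r}")
```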
 ### Aggregate Metrics To Report
 When summarizing an evolution run, include these aggregate metrics rather
@@ -378,6 +436,37 @@ selftune evolve --skill X --skill-path Y --cheap-loop --gate-model opus --gate-e
 selftune evolve --skill X --skill-path Y --proposal-model haiku --validation-model sonnet
 ```
+## Apply Contributor Proposal
+The `apply-proposal` subcommand fetches an approved contributor aggregate
+proposal from the cloud dashboard and applies it to the local SKILL.md.
+```bash
+selftune evolve apply-proposal --id <proposal-id> --skill-path <path> [--dry-run]
+```
+### Apply-Proposal Options
+| Flag              | Description                                     | Default  |
+| ----------------- | ----------------------------------------------- | -------- |
+| `--id <uuid>`     | Proposal UUID from the dashboard                | Required |
+| `--skill-path`    | Path to the target SKILL.md                     | Required |
+| `--dry-run`       | Preview the proposal without writing to disk    | Off      |
+### Apply-Proposal Flow
+1. Fetch the proposal via `GET /api/v1/proposals/:id`
+2. Verify `proposed_by` is `contributor_aggregate` and status is `approved`
+3. Display a summary (type, reason, pass rate change, diff preview)
+4. If not `--dry-run`: back up SKILL.md, apply the proposed value, and
+   `PATCH /api/v1/proposals/:id` with status `applied`
+### When to Use
+- After reviewing and approving a contributor proposal in the cloud dashboard
+- When community signal suggests a description or body improvement
+- As the final step in the contributor-driven evolution workflow
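
The apply-proposal flow gates every write on two fields of the fetched proposal. A rough Python sketch of that gate and the resulting step plan follows. The `proposed_by` and `status` keys and their expected values come from the flow above; the function name `plan_apply_proposal` and the step labels are hypothetical, and no network or file access is performed.

```python
def plan_apply_proposal(proposal: dict, dry_run: bool = False) -> list[str]:
    """Validate a fetched proposal and list the steps that would run."""
    # Step 2 of the flow: only approved contributor aggregate proposals apply.
    if proposal.get("proposed_by") != "contributor_aggregate":
        raise ValueError("proposal was not produced by contributor_aggregate")
    if proposal.get("status") != "approved":
        raise ValueError("proposal status is not approved")

    # Step 3 always runs; step 4 is skipped under --dry-run.
    steps = ["display summary"]
    if not dry_run:
        steps += [
            "back up SKILL.md",
            "apply proposed value",
            "PATCH /api/v1/proposals/:id status=applied",
        ]
    return steps
```

A dry run therefore still validates the proposal and shows the summary, but it never touches SKILL.md or the proposal status.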
 ## Common Patterns
 **User asks to evolve a specific skill (e.g., "evolve the pptx skill"):**
@@ -398,7 +487,7 @@ Also check if the eval set has contradictory expectations.
 **Agent CLI override needed:**
 The evolve command auto-detects the installed agent CLI.
-Use `--agent <name>` to override (claude, codex, opencode).
+Use `--agent <name>` to override (claude, codex, opencode, pi).
 ## Subagent Escalation

package/skill/{Workflows → workflows}/EvolveBody.md RENAMED

@@ -10,6 +10,22 @@ LLM validates them through a 3-gate pipeline.
 selftune evolve body --skill <name> --skill-path <path> --target <target> [options]
 ```
+## Recommended Creator Loop
+Before mutating routing or the full body, make sure the creator trust loop is in
+place:
+```bash
+selftune eval generate --skill <name> --skill-path <path>
+selftune eval unit-test --skill <name> --generate --skill-path <path>
+selftune evolve body --skill <name> --skill-path <path> --target <target> --dry-run --validation-mode replay
+selftune grade baseline --skill <name> --skill-path <path>
+```
+If replay validation or the baseline is still missing, prefer filling that gap
+before live deployment. Body and routing evolution are much harder to trust than
+description-only changes when the creator loop is incomplete.
 ## Options
 | Flag                         | Description                                                                           | Default                  |
@@ -26,6 +42,7 @@ selftune evolve body --skill <name> --skill-path <path> --target <target> [optio
 | `--max-iterations <n>`       | Maximum refinement iterations                                                         | 3                        |
 | `--task-description <text>`  | Context for the evolution goal                                                        | None                     |
 | `--validation-model <model>` | Model for trigger-check validation calls (overrides `--student-model` for validation) | None                     |
+| `--validation-mode <mode>`   | Validation strategy: `auto`, `replay`, or `judge`                                     | `auto`                   |
 | `--teacher-effort <level>`   | Effort level for teacher LLM: `low`, `medium`, `high`, `max`                          | `high`                   |
 | `--review`                   | Run `evolution-reviewer` subagent as Gate 4 before deployment                         | Off                      |
 | `--few-shot <paths>`         | Comma-separated paths to example SKILL.md files                                       | None                     |
@@ -51,7 +68,7 @@ Every proposal passes through three sequential gates:
 | Gate                          | Type        | What it checks                                                                                  | Cost     |
 | ----------------------------- | ----------- | ----------------------------------------------------------------------------------------------- | -------- |
 | **Gate 1: Structural**        | Pure code   | YAML frontmatter present, `# Title` exists, `## Workflow Routing` preserved if original had one | Free     |
-| **Gate 2: Trigger Accuracy**  | Student LLM | YES/NO trigger check per eval entry on the extracted description                                | Cheap    |
+| **Gate 2: Trigger Accuracy**  | Replay or student LLM | Runtime replay when available; otherwise YES/NO trigger check per eval entry                     | Cheap    |
 | **Gate 3: Quality**           | Student LLM | Body clarity and completeness score (0.0-1.0)                                                   | Cheap    |
 | **Gate 4: Reviewer** (opt-in) | Subagent    | `evolution-reviewer` multi-turn review — reads files, checks evidence, APPROVE/REJECT verdict   | Moderate |
@@ -141,6 +158,25 @@ Few-shot examples from `--few-shot` paths provide structural guidance.
 Each gate runs in sequence. If a gate fails, the teacher receives the
 failure details and generates a refined proposal.
+### Validation Mode (`--validation-mode`)
+`evolve body` uses the same validation contract as `evolve`:
+| Mode     | Behavior                                                                 |
+| -------- | ------------------------------------------------------------------------ |
+| `auto`   | Try replay-backed validation first; fall back to LLM judge if unavailable |
+| `replay` | Replay engine only; error if no replay fixture or runner is available    |
+| `judge`  | LLM judge only                                                           |
+When replay is available, selftune stages the candidate skill content into a
+temporary local registry before running the real host/runtime replay. Claude
+Code uses `.claude/skills`, Codex uses `.agents/skills`, and OpenCode uses
+`.opencode/skills`. Routing targets stage the candidate `## Workflow Routing`
+section; body targets stage the full candidate body while preserving
+frontmatter and title. When replay is not available, `auto` falls back to the
+LLM judge and records the `validation_fallback_reason` in audit/evidence
+output.
 ### 6. Deploy or Preview
 If `--dry-run`, prints the proposal without deploying. Otherwise:
@@ -164,6 +200,10 @@ If `--dry-run`, prints the proposal without deploying. Otherwise:
 > `selftune evolve body --skill pptx --skill-path /path/SKILL.md --target body --teacher-model opus --student-model haiku`
+**"Force replay-only validation for a routing change"**
+> `selftune evolve body --skill Research --skill-path ~/.claude/skills/Research/SKILL.md --target routing --validation-mode replay`
 **"Preview what would change"**
 > Always start with `--dry-run` to review the proposal before deploying.

package/skill/{Workflows → workflows}/Grade.md RENAMED

@@ -17,7 +17,7 @@ selftune grade --skill <name> [options]
 | `--expectations "..."` | Explicit expectations (semicolon-separated) | Auto-derived  |
 | `--evals-json <path>`  | Pre-built eval set JSON file                | None          |
 | `--eval-id <n>`        | Specific eval ID to grade from the eval set | None          |
-| `--agent <name>`       | Agent CLI to use (claude, codex, opencode)  | Auto-detected |
+| `--agent <name>`       | Agent CLI to use (claude, codex, opencode, pi)  | Auto-detected |
 ## Output Format

package/skill/{Workflows → workflows}/Initialize.md RENAMED

@@ -89,9 +89,12 @@ which selftune
 If `selftune` is not on PATH, install it:
 ```bash
-npm install -g selftune
+npx skills add selftune-dev/selftune
 ```
+If you manage the CLI directly instead of using the skill installer, use
+`npm install -g selftune` or `bun add -g selftune`.
 ### 2. Check Existing Config
 ```bash
@@ -172,7 +175,7 @@ selftune cline install      # creates hook scripts
 selftune pi install         # creates extension hook scripts
 ```
-Use `--dry-run` first if the user wants to preview. See `Workflows/PlatformHooks.md`
+Use `--dry-run` first if the user wants to preview. See `workflows/PlatformHooks.md`
 for platform-specific details.
 **Batch ingest** fallback for platforms without real-time hooks or to backfill history:
@@ -415,8 +418,9 @@ retrying with `selftune init --alpha --alpha-email <email> --force`.
 **User asks to set up or initialize selftune**
-> Run `which selftune` to check installation. If missing, install with
-> `npm install -g selftune`. Run `selftune init`, then verify with
+> Run `which selftune` to check installation. If missing, install or refresh with
+> `npx skills add selftune-dev/selftune`. If the user manages the CLI directly,
+> use `npm install -g selftune` or `bun add -g selftune`. Run `selftune init`, then verify with
 > `selftune doctor`. Report results to the user.
 **User wants alpha enrollment**

package/skill/{Workflows → workflows}/Orchestrate.md RENAMED

@@ -50,6 +50,7 @@ proposalModel = haiku
 | `--max-auto-grade <n>`      | Max ungraded skills to auto-grade per run (0 to disable)   | `5`        |
 | `--loop`                    | Run as a long-lived process that cycles continuously       | Off        |
 | `--loop-interval <seconds>` | Pause between cycles (minimum 60)                          | `3600`     |
+| `--help`                    | Show command help                                          | Off        |
 ## Default Behavior
@@ -57,7 +58,12 @@ proposalModel = haiku
 - Auto-grade up to 5 ungraded skills that have session data (enables evolution on first run after ingest)
 - Prioritize critical/warning/ungraded skills with real missed-query signal
 - Deploy validated low-risk description changes automatically
-- Watch recent deployments and roll back regressions automatically
+- Auto-grade and write grading baselines for freshly deployed skills
+- Generate review-first new skill proposals from strong workflow patterns
+- Watch recent deployments (including skills freshly deployed in the same run) and roll back regressions automatically
+- Monitor grade regression alongside trigger regression during watch
+- Upload personal telemetry to cloud (alpha users)
+- Flush staged creator-directed contribution signals for opted-in skills
 Use `--review-required` only when you want a stricter policy for a specific run.
@@ -111,6 +117,7 @@ Machine-readable JSON with the summary fields plus a `decisions` array containin
 - `skill`, `action`, `reason`
 - `deployed`, `evolveReason`, `validation` (before/after pass rates, improved flag) — when evolved
 - `alert`, `rolledBack`, `passRate`, `recommendation` — when watched
+- `freshlyWatchedSkills` — array of skill names that were deployed and watched in the same run
 This is the recommended runtime for recurring autonomous scheduling.
@@ -162,8 +169,11 @@ In autonomous mode, orchestrate calls sub-workflows in this fixed order:
 2. **Status** — compute skill health using existing grade results (reads `grading.json` outputs from previous sessions)
 3. **Auto-grade** — grade up to `--max-auto-grade` (default 5) ungraded skills that have session data but no grades yet. Skipped during `--dry-run` (grading makes LLM calls). After grading, status is recomputed so candidate selection sees updated grades. Fail-open: individual grading errors are logged but never block the loop.
 4. **Evolve** — run evolution on selected candidates (pre-flight is skipped; Pareto mode uses 3 candidates; cheap-loop uses `haiku` for proposal + validation and `sonnet` for the final gate; adaptive gate escalation promotes risky proposals to `opus` + `high` effort; baseline and token-efficiency stay off)
-5. **Watch** — monitor recently evolved skills (auto-rollback enabled by default, `--recent-window` hours lookback)
-6. **Alpha Upload** — if enrolled in the alpha program (`config.alpha.enrolled === true`) and an API key is configured, stage new canonical records (sessions, invocations, evolution evidence, orchestrate runs) into `canonical_upload_staging`, build V2 push payloads, and flush to the cloud API (`POST /api/v1/push`) with Bearer auth. Fail-open: upload errors never block the orchestrate loop. Respects `--dry-run`.
+5. **Post-deploy grade + baseline** — for each freshly deployed skill, grade the most recent session and write a grading baseline to SQLite (`grading_baselines` table). The baseline records the measured pass rate and sample size, anchoring future grade regression detection. Fail-open: individual grading errors are logged but never block the loop.
+6. **Watch** — monitor recently evolved skills (auto-rollback enabled by default, `--recent-window` hours lookback). Skills freshly deployed in this run are included in the watch set immediately, so they are monitored in the same orchestrate cycle rather than waiting for the next run. These appear in `freshlyWatchedSkills` in the output. Grade watch (`enableGradeWatch: true`) runs alongside trigger regression for all watched skills.
+7. **Workflow proposals** — discover repeated multi-skill patterns and create review-first `new_skill` proposals when a workflow is strong enough to merit codification. These are never auto-deployed; they are surfaced as proposals for review.
+8. **Alpha Upload** — if enrolled in the alpha program (`config.alpha.enrolled === true`) and an API key is configured, stage new canonical records (sessions, invocations, evolution evidence, orchestrate runs) into `canonical_upload_staging`, build V2 push payloads, and flush to the cloud API (`POST /api/v1/push`) with Bearer auth. Fail-open: upload errors never block the orchestrate loop. Respects `--dry-run`.
+9. **Contribution relay flush** — if an API key is configured, flush any staged creator-directed contribution signals for opted-in skills. Fail-open: relay errors never block the orchestrate loop. Respects `--dry-run`.
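Several of the steps above are described as fail-open: an error in one step is logged but never blocks the rest of the loop. A minimal sketch of that pattern in shell follows. The names `run_fail_open` and `run_cycle` are hypothetical, and `true` and `false` stand in for real steps. This illustrates the pattern only and is not selftune's implementation.

```bash
# Hypothetical illustration of the fail-open pattern; not selftune's code.
# Runs one step and reports its status, but always returns 0 so that
# later steps still run even when this one fails.
run_fail_open() {
  step_name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $step_name"
  else
    echo "failed (continuing): $step_name"
  fi
  return 0
}

# Stand-in cycle: `true` simulates a successful step, `false` a failing one.
run_cycle() {
  run_fail_open "post-deploy grade" true
  run_fail_open "alpha upload" false
  run_fail_open "contribution relay flush" true
}
```

In this sketch, `run_cycle` still reaches the third step after the simulated upload failure, which is the behavior the fail-open notes describe.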
 When orchestrate invokes evolve for a selected candidate, it always passes
 `confidenceThreshold: 0.6` and `maxIterations: 3`, plus the autonomous evolve

package/skill/{Workflows → workflows}/Schedule.md RENAMED

@@ -4,7 +4,7 @@ Generate ready-to-use scheduling examples for automating selftune with
 standard system tools. This is the **primary automation path** — it works
 on any machine without requiring a specific agent runtime.
-For OpenClaw-specific scheduling, see `Workflows/Cron.md`.
+For OpenClaw-specific scheduling, see `workflows/Cron.md`.
 ## When to Use
@@ -51,7 +51,7 @@ Outputs examples for all three scheduling systems (cron, launchd, systemd).
 ## Alias
-`selftune schedule` is now an alias for `selftune cron`. Both commands are interchangeable. See `Workflows/Cron.md` for the full cron workflow reference.
+`selftune schedule` is now an alias for `selftune cron`. Both commands are interchangeable. See `workflows/Cron.md` for the full cron workflow reference.
 ## PATH Resolution (All Platforms)
@@ -69,4 +69,4 @@ environments that don't include homebrew, bun, or node binary locations.
 - **User wants quick setup on a Linux server** -- Run `selftune schedule --install --format cron`.
 - **User wants setup on macOS** -- Run `selftune schedule --install --format launchd`.
 - **User wants setup on a systemd-based server** -- Run `selftune schedule --install --format systemd`.
-- **User mentions OpenClaw** -- Use `selftune cron setup --platform openclaw` for the OpenClaw scheduler adapter. The default product path is still `selftune schedule --install`. See `Workflows/Cron.md`.
+- **User mentions OpenClaw** -- Use `selftune cron setup --platform openclaw` for the OpenClaw scheduler adapter. The default product path is still `selftune schedule --install`. See `workflows/Cron.md`.