selftune 0.2.9 → 0.2.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (130) hide show
  1. package/README.md +35 -35
  2. package/apps/local-dashboard/dist/assets/index-BZVLv70T.js +16 -0
  3. package/apps/local-dashboard/dist/assets/{vendor-react-BQH_6WrG.js → vendor-react-BXP54cYo.js} +4 -4
  4. package/apps/local-dashboard/dist/assets/{vendor-table-dK1QMLq9.js → vendor-table-DTF_SXoy.js} +1 -1
  5. package/apps/local-dashboard/dist/assets/{vendor-ui-CO2mrx6e.js → vendor-ui-CWU0d1wd.js} +66 -66
  6. package/apps/local-dashboard/dist/index.html +15 -15
  7. package/bin/selftune.cjs +1 -1
  8. package/cli/selftune/activation-rules.ts +1 -0
  9. package/cli/selftune/alpha-upload/build-payloads.ts +18 -2
  10. package/cli/selftune/alpha-upload/stage-canonical.ts +94 -0
  11. package/cli/selftune/auth/device-code.ts +32 -0
  12. package/cli/selftune/auto-update.ts +12 -0
  13. package/cli/selftune/badge/badge.ts +1 -0
  14. package/cli/selftune/canonical-export.ts +5 -0
  15. package/cli/selftune/claude-agents.ts +154 -0
  16. package/cli/selftune/contribute/bundle.ts +1 -0
  17. package/cli/selftune/contribute/contribute.ts +1 -0
  18. package/cli/selftune/cron/setup.ts +2 -2
  19. package/cli/selftune/dashboard-server.ts +1 -0
  20. package/cli/selftune/eval/hooks-to-evals.ts +1 -0
  21. package/cli/selftune/eval/import-skillsbench.ts +1 -0
  22. package/cli/selftune/eval/synthetic-evals.ts +2 -3
  23. package/cli/selftune/eval/unit-test.ts +1 -0
  24. package/cli/selftune/evolution/deploy-proposal.ts +1 -0
  25. package/cli/selftune/evolution/evolve-body.ts +93 -6
  26. package/cli/selftune/evolution/evolve.ts +0 -1
  27. package/cli/selftune/evolution/propose-body.ts +3 -2
  28. package/cli/selftune/evolution/propose-routing.ts +3 -2
  29. package/cli/selftune/evolution/refine-body.ts +3 -2
  30. package/cli/selftune/export.ts +1 -0
  31. package/cli/selftune/grading/grade-session.ts +8 -0
  32. package/cli/selftune/hooks/auto-activate.ts +1 -0
  33. package/cli/selftune/hooks/evolution-guard.ts +1 -1
  34. package/cli/selftune/hooks/prompt-log.ts +1 -0
  35. package/cli/selftune/hooks/session-stop.ts +34 -40
  36. package/cli/selftune/hooks/skill-change-guard.ts +1 -0
  37. package/cli/selftune/hooks/skill-eval.ts +1 -1
  38. package/cli/selftune/index.ts +23 -14
  39. package/cli/selftune/ingestors/claude-replay.ts +1 -0
  40. package/cli/selftune/ingestors/codex-rollout.ts +1 -0
  41. package/cli/selftune/ingestors/codex-wrapper.ts +1 -0
  42. package/cli/selftune/ingestors/openclaw-ingest.ts +1 -0
  43. package/cli/selftune/ingestors/opencode-ingest.ts +1 -0
  44. package/cli/selftune/init.ts +121 -29
  45. package/cli/selftune/localdb/db.ts +1 -0
  46. package/cli/selftune/localdb/direct-write.ts +39 -0
  47. package/cli/selftune/localdb/materialize.ts +2 -0
  48. package/cli/selftune/localdb/queries.ts +53 -0
  49. package/cli/selftune/localdb/schema.ts +28 -0
  50. package/cli/selftune/normalization.ts +1 -0
  51. package/cli/selftune/observability.ts +1 -0
  52. package/cli/selftune/repair/skill-usage.ts +1 -0
  53. package/cli/selftune/routes/orchestrate-runs.ts +1 -0
  54. package/cli/selftune/routes/overview.ts +1 -0
  55. package/cli/selftune/routes/skill-report.ts +1 -0
  56. package/cli/selftune/sync.ts +30 -1
  57. package/cli/selftune/uninstall.ts +412 -0
  58. package/cli/selftune/utils/canonical-log.ts +2 -0
  59. package/cli/selftune/utils/jsonl.ts +1 -0
  60. package/cli/selftune/utils/llm-call.ts +131 -3
  61. package/cli/selftune/utils/skill-log.ts +1 -0
  62. package/cli/selftune/utils/transcript.ts +1 -0
  63. package/cli/selftune/utils/trigger-check.ts +1 -1
  64. package/cli/selftune/workflows/skill-md-writer.ts +5 -5
  65. package/cli/selftune/workflows/workflows.ts +1 -0
  66. package/package.json +37 -33
  67. package/packages/telemetry-contract/fixtures/golden.test.ts +1 -0
  68. package/packages/telemetry-contract/package.json +1 -1
  69. package/packages/telemetry-contract/src/schemas.ts +1 -0
  70. package/packages/telemetry-contract/tests/compatibility.test.ts +1 -0
  71. package/packages/ui/README.md +35 -34
  72. package/packages/ui/package.json +3 -3
  73. package/packages/ui/src/components/ActivityTimeline.tsx +49 -42
  74. package/packages/ui/src/components/EvidenceViewer.tsx +306 -182
  75. package/packages/ui/src/components/EvolutionTimeline.tsx +83 -72
  76. package/packages/ui/src/components/InfoTip.tsx +4 -3
  77. package/packages/ui/src/components/OrchestrateRunsPanel.tsx +60 -53
  78. package/packages/ui/src/components/section-cards.tsx +19 -24
  79. package/packages/ui/src/components/skill-health-grid.tsx +213 -193
  80. package/packages/ui/src/lib/constants.tsx +1 -0
  81. package/packages/ui/src/primitives/badge.tsx +12 -15
  82. package/packages/ui/src/primitives/button.tsx +7 -7
  83. package/packages/ui/src/primitives/card.tsx +15 -26
  84. package/packages/ui/src/primitives/checkbox.tsx +7 -8
  85. package/packages/ui/src/primitives/collapsible.tsx +5 -5
  86. package/packages/ui/src/primitives/dropdown-menu.tsx +45 -55
  87. package/packages/ui/src/primitives/label.tsx +6 -6
  88. package/packages/ui/src/primitives/select.tsx +28 -37
  89. package/packages/ui/src/primitives/table.tsx +17 -44
  90. package/packages/ui/src/primitives/tabs.tsx +14 -21
  91. package/packages/ui/src/primitives/tooltip.tsx +10 -22
  92. package/skill/SKILL.md +70 -57
  93. package/skill/Workflows/AlphaUpload.md +4 -4
  94. package/skill/Workflows/AutoActivation.md +11 -6
  95. package/skill/Workflows/Badge.md +22 -16
  96. package/skill/Workflows/Baseline.md +34 -36
  97. package/skill/Workflows/Composability.md +16 -11
  98. package/skill/Workflows/Contribute.md +26 -21
  99. package/skill/Workflows/Cron.md +23 -22
  100. package/skill/Workflows/Dashboard.md +32 -27
  101. package/skill/Workflows/Doctor.md +33 -27
  102. package/skill/Workflows/Evals.md +48 -47
  103. package/skill/Workflows/EvolutionMemory.md +31 -21
  104. package/skill/Workflows/Evolve.md +84 -82
  105. package/skill/Workflows/EvolveBody.md +58 -47
  106. package/skill/Workflows/Grade.md +16 -13
  107. package/skill/Workflows/ImportSkillsBench.md +9 -6
  108. package/skill/Workflows/Ingest.md +36 -21
  109. package/skill/Workflows/Initialize.md +108 -40
  110. package/skill/Workflows/Orchestrate.md +22 -16
  111. package/skill/Workflows/Replay.md +12 -7
  112. package/skill/Workflows/Rollback.md +13 -6
  113. package/skill/Workflows/Schedule.md +6 -6
  114. package/skill/Workflows/Sync.md +18 -11
  115. package/skill/Workflows/UnitTest.md +28 -17
  116. package/skill/Workflows/Watch.md +28 -21
  117. package/skill/agents/diagnosis-analyst.md +11 -0
  118. package/skill/agents/evolution-reviewer.md +15 -1
  119. package/skill/agents/integration-guide.md +10 -0
  120. package/skill/agents/pattern-analyst.md +12 -1
  121. package/skill/references/grading-methodology.md +23 -24
  122. package/skill/references/interactive-config.md +7 -7
  123. package/skill/references/invocation-taxonomy.md +22 -20
  124. package/skill/references/logs.md +14 -6
  125. package/skill/references/setup-patterns.md +4 -2
  126. package/.claude/agents/diagnosis-analyst.md +0 -156
  127. package/.claude/agents/evolution-reviewer.md +0 -180
  128. package/.claude/agents/integration-guide.md +0 -212
  129. package/.claude/agents/pattern-analyst.md +0 -160
  130. package/apps/local-dashboard/dist/assets/index-C4UYGWKr.js +0 -15
@@ -7,6 +7,7 @@ natural-language queries without breaking existing triggers.
7
7
  ## When to Invoke
8
8
 
9
9
  Invoke this workflow when the user requests any of the following:
10
+
10
11
  - Improving or evolving a skill's trigger coverage
11
12
  - Fixing undertriggering or missed queries for a skill
12
13
  - Optimizing a skill description based on usage data
@@ -20,27 +21,27 @@ selftune evolve --skill <name> --skill-path <path> [options]
20
21
 
21
22
  ## Options
22
23
 
23
- | Flag | Description | Default |
24
- |------|-------------|---------|
25
- | `--skill <name>` | Skill name | Required |
26
- | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
27
- | `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
28
- | `--agent <name>` | Agent CLI to use (claude, codex, opencode) | Auto-detected |
29
- | `--dry-run` | Propose and validate without deploying | Off |
30
- | `--confidence <n>` | Minimum confidence threshold (0-1) | 0.6 |
31
- | `--max-iterations <n>` | Maximum retry iterations | 3 |
32
- | `--validation-model <model>` | Model for trigger-check validation LLM calls | `haiku` |
33
- | `--pareto` | Generate multiple candidates per iteration | Off |
34
- | `--candidates <n>` | Number of candidates per iteration (with `--pareto`) | 3 |
35
- | `--token-efficiency` | Optimize for token efficiency in proposals | Off |
36
- | `--with-baseline` | Include a no-skill baseline comparison | Off |
37
- | `--cheap-loop` | Use cheap models for loop, expensive for final gate | On |
38
- | `--full-model` | Use full-cost model throughout (disables cheap-loop) | Off |
39
- | `--verbose` | Print detailed progress during evolution | Off |
40
- | `--gate-model <model>` | Model for final gate validation | `sonnet` (when `--cheap-loop`) |
41
- | `--proposal-model <model>` | Model for proposal generation LLM calls | None |
42
- | `--sync-first` | Refresh source-truth telemetry before generating evals/failure patterns | Off |
43
- | `--sync-force` | Force a full source rescan during `--sync-first` | Off |
24
+ | Flag | Description | Default |
25
+ | ---------------------------- | ----------------------------------------------------------------------- | ------------------------------ |
26
+ | `--skill <name>` | Skill name | Required |
27
+ | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
28
+ | `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
29
+ | `--agent <name>` | Agent CLI to use (claude, codex, opencode) | Auto-detected |
30
+ | `--dry-run` | Propose and validate without deploying | Off |
31
+ | `--confidence <n>` | Minimum confidence threshold (0-1) | 0.6 |
32
+ | `--max-iterations <n>` | Maximum retry iterations | 3 |
33
+ | `--validation-model <model>` | Model for trigger-check validation LLM calls | `haiku` |
34
+ | `--pareto` | Generate multiple candidates per iteration | Off |
35
+ | `--candidates <n>` | Number of candidates per iteration (with `--pareto`) | 3 |
36
+ | `--token-efficiency` | Optimize for token efficiency in proposals | Off |
37
+ | `--with-baseline` | Include a no-skill baseline comparison | Off |
38
+ | `--cheap-loop` | Use cheap models for loop, expensive for final gate | On |
39
+ | `--full-model` | Use full-cost model throughout (disables cheap-loop) | Off |
40
+ | `--verbose` | Print detailed progress during evolution | Off |
41
+ | `--gate-model <model>` | Model for final gate validation | `sonnet` (when `--cheap-loop`) |
42
+ | `--proposal-model <model>` | Model for proposal generation LLM calls | None |
43
+ | `--sync-first` | Refresh source-truth telemetry before generating evals/failure patterns | Off |
44
+ | `--sync-force` | Force a full source rescan during `--sync-first` | Off |
44
45
 
45
46
  ## Output Format
46
47
 
@@ -54,7 +55,7 @@ See `references/logs.md` for the audit log schema.
54
55
  "proposal_id": "evolve-pptx-1709125200000",
55
56
  "skill_name": "pptx",
56
57
  "iteration": 1,
57
- "original_pass_rate": 0.70,
58
+ "original_pass_rate": 0.7,
58
59
  "proposed_pass_rate": 0.92,
59
60
  "regression_count": 0,
60
61
  "confidence": 0.85,
@@ -67,11 +68,11 @@ See `references/logs.md` for the audit log schema.
67
68
 
68
69
  The evolution process writes multiple audit entries:
69
70
 
70
- | Action | When | Key details |
71
- |--------|------|-------------|
72
- | `created` | Proposal generated | `details` contains `original_description:` prefix |
73
- | `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
74
- | `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |
71
+ | Action | When | Key details |
72
+ | ----------- | -------------------------------- | ------------------------------------------------- |
73
+ | `created` | Proposal generated | `details` contains `original_description:` prefix |
74
+ | `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
75
+ | `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |
75
76
 
76
77
  ## Parsing Instructions
77
78
 
@@ -97,51 +98,45 @@ The evolution process writes multiple audit entries:
97
98
 
98
99
  Before running the evolve command, use the `AskUserQuestion` tool to present structured configuration options. If the user responds with "use defaults" or similar shorthand, skip to step 1 using the recommended defaults. If the user cancels, stop and do not continue.
99
100
 
100
- Use `AskUserQuestion` with these questions (max 4 per call — split if needed):
101
-
102
- **Call 1:**
103
-
104
- ```json
105
- {
106
- "questions": [
107
- {
108
- "question": "Execution Mode",
109
- "options": ["Dry run — preview without deploying (recommended for first run)", "Live validate and deploy if improved"]
110
- },
111
- {
112
- "question": "Model Tier (see SKILL.md reference)",
113
- "options": ["Fast (haiku)cheapest, ~2s/call (recommended with cheap-loop)", "Balanced (sonnet) good quality, ~5s/call", "Best (opus) — highest quality, ~10s/call"]
114
- },
115
- {
116
- "question": "Cost Optimization",
117
- "options": ["Cheap loop haiku for iteration, sonnet for final gate (recommended)", "Single model — use one model throughout"]
118
- },
119
- {
120
- "question": "Advanced Options",
121
- "options": ["Defaults (0.6 confidence, 3 iterations, single candidate) (recommended)", "Stricter (0.7 confidence, 5 iterations)", "Pareto mode (multiple candidates per iteration)"]
122
- }
123
- ]
124
- }
125
- ```
126
-
127
- If `AskUserQuestion` is not available, fall back to presenting these as inline numbered options.
101
+ Ask one `AskUserQuestion` at a time in this order:
102
+
103
+ 1. `Execution Mode`
104
+ Options:
105
+ - `Dry run — preview without deploying (recommended for first run)`
106
+ - `Live — validate and deploy if improved`
107
+ 2. `Model Tier (see SKILL.md reference)`
108
+ Options:
109
+ - `Fast (haiku) — cheapest, ~2s/call (recommended with cheap-loop)`
110
+ - `Balanced (sonnet) — good quality, ~5s/call`
111
+ - `Best (opus) — highest quality, ~10s/call`
112
+ 3. `Cost Optimization`
113
+ Options:
114
+ - `Cheap loop — haiku for iteration, sonnet for final gate (recommended)`
115
+ - `Single model — use one model throughout`
116
+ 4. `Advanced Options`
117
+ Options:
118
+ - `Defaults (0.6 confidence, 3 iterations, single candidate) (recommended)`
119
+ - `Stricter (0.7 confidence, 5 iterations)`
120
+ - `Pareto mode (multiple candidates per iteration)`
121
+
122
+ If `AskUserQuestion` is not available or Claude does not invoke it, fall back to presenting the same choices as inline numbered options.
128
123
 
129
124
  If the user cancels, stop -- do not proceed with defaults. If the user selects "use defaults", skip to step 1 with recommended defaults.
130
125
 
131
126
  After the user responds, parse their selections and map each choice to the corresponding CLI flags:
132
127
 
133
- | Selection | CLI Flag |
134
- |-----------|----------|
135
- | 1a (dry run) | `--dry-run` |
136
- | 1b (live) | _(no flag)_ |
137
- | 2a (haiku) | `--validation-model haiku` |
138
- | 2b (sonnet) | `--validation-model sonnet` |
139
- | 2c (opus) | `--validation-model opus` |
140
- | 3a (cheap loop) | `--cheap-loop` |
141
- | 3b (single model) | _(no flag)_ |
142
- | Custom confidence | `--confidence <value>` |
143
- | Custom iterations | `--max-iterations <value>` |
144
- | 6b (pareto) | `--pareto` |
128
+ | Selection | CLI Flag |
129
+ | ----------------- | --------------------------- |
130
+ | 1a (dry run) | `--dry-run` |
131
+ | 1b (live) | _(no flag)_ |
132
+ | 2a (haiku) | `--validation-model haiku` |
133
+ | 2b (sonnet) | `--validation-model sonnet` |
134
+ | 2c (opus) | `--validation-model opus` |
135
+ | 3a (cheap loop) | `--cheap-loop` |
136
+ | 3b (single model) | _(no flag)_ |
137
+ | Custom confidence | `--confidence <value>` |
138
+ | Custom iterations | `--max-iterations <value>` |
139
+ | 6b (pareto) | `--pareto` |
145
140
 
146
141
  Show a confirmation summary to the user:
147
142
 
@@ -161,6 +156,7 @@ Build the CLI command string with all selected flags and continue to step 1.
161
156
  ### 1. Read Evolution Context
162
157
 
163
158
  Before running, read `~/.selftune/memory/context.md` for session context:
159
+
164
160
  - Active evolutions and their current status
165
161
  - Known issues from previous runs
166
162
  - Last update timestamp
@@ -197,6 +193,7 @@ evolution cannot reliably measure improvement.
197
193
  ### 4. Extract Failure Patterns
198
194
 
199
195
  The command groups missed queries by invocation type:
196
+
200
197
  - Missed explicit: description is broken (rare, high priority)
201
198
  - Missed implicit: description is too narrow (common, evolve target)
202
199
  - Missed contextual: description lacks domain vocabulary (evolve target)
@@ -227,6 +224,7 @@ For body evolution (`evolve body`), only the size constraint applies.
227
224
 
228
225
  An LLM generates a candidate description that would catch the missed
229
226
  queries. The candidate:
227
+
230
228
  - Preserves existing trigger phrases that work
231
229
  - Adds new phrases covering missed patterns
232
230
  - Maintains the description's structure and tone
@@ -234,6 +232,7 @@ queries. The candidate:
234
232
  ### 6. Validate Against Eval Set
235
233
 
236
234
  The candidate is tested against the full eval set:
235
+
237
236
  - Must improve overall pass rate
238
237
  - Must not regress more than 5% on previously-passing entries
239
238
  - Must exceed the `--confidence` threshold
@@ -246,14 +245,14 @@ with adjusted proposals.
246
245
  When summarizing an evolution run, include these aggregate metrics rather
247
246
  than only saying "passed" or "failed":
248
247
 
249
- | Metric | Meaning |
250
- |--------|---------|
251
- | `original_pass_rate` | Baseline pass rate before the proposal |
252
- | `proposed_pass_rate` | Pass rate after applying the proposal |
253
- | `regression_count` | Eval entries that passed before and failed after |
254
- | `net_change` | Total passes gained minus regressions introduced |
255
- | `iteration` / `iterations_used` | Which retry produced the current candidate |
256
- | `baseline_lift` | Additional lift over the no-skill baseline when `--with-baseline` is enabled |
248
+ | Metric | Meaning |
249
+ | ------------------------------- | ---------------------------------------------------------------------------- |
250
+ | `original_pass_rate` | Baseline pass rate before the proposal |
251
+ | `proposed_pass_rate` | Pass rate after applying the proposal |
252
+ | `regression_count` | Eval entries that passed before and failed after |
253
+ | `net_change` | Total passes gained minus regressions introduced |
254
+ | `iteration` / `iterations_used` | Which retry produced the current candidate |
255
+ | `baseline_lift` | Additional lift over the no-skill baseline when `--with-baseline` is enabled |
257
256
 
258
257
  These metrics explain whether the proposal is genuinely better, merely
259
258
  different, or too risky to deploy.
@@ -264,6 +263,7 @@ If `--dry-run`, the proposal is printed but not deployed. The audit log
264
263
  still records `created` and `validated` entries for review.
265
264
 
266
265
  If deploying:
266
+
267
267
  1. The current SKILL.md is backed up to `SKILL.md.bak`
268
268
  2. The updated description is written to SKILL.md
269
269
  3. A `deployed` entry is logged to the evolution audit
@@ -271,6 +271,7 @@ If deploying:
271
271
  ### 8. Update Memory
272
272
 
273
273
  After evolution completes (deploy or dry-run), the memory writer updates:
274
+
274
275
  - `~/.selftune/memory/context.md` -- records the evolution outcome and current state
275
276
  - `~/.selftune/memory/decisions.md` -- logs the decision rationale and proposal details
276
277
 
@@ -281,13 +282,13 @@ even after a context window reset.
281
282
 
282
283
  The evolution loop stops when any of these conditions is met (priority order):
283
284
 
284
- | # | Condition | Meaning |
285
- |---|-----------|---------|
286
- | 1 | **Converged** | Pass rate >= 0.95 |
287
- | 2 | **Max iterations** | Reached `--max-iterations` limit |
288
- | 3 | **Low confidence** | Proposal confidence below `--confidence` threshold |
289
- | 4 | **Plateau** | Pass rate unchanged across 3 consecutive iterations |
290
- | 5 | **Continue** | None of the above -- keep iterating |
285
+ | # | Condition | Meaning |
286
+ | --- | ------------------ | --------------------------------------------------- |
287
+ | 1 | **Converged** | Pass rate >= 0.95 |
288
+ | 2 | **Max iterations** | Reached `--max-iterations` limit |
289
+ | 3 | **Low confidence** | Proposal confidence below `--confidence` threshold |
290
+ | 4 | **Plateau** | Pass rate unchanged across 3 consecutive iterations |
291
+ | 5 | **Continue** | None of the above -- keep iterating |
291
292
 
292
293
  ## Cheap Loop Mode
293
294
 
@@ -296,6 +297,7 @@ and only uses an expensive model (sonnet) for a final gate validation before
296
297
  deploying. This reduces cost while maintaining deployment quality.
297
298
 
298
299
  When `--cheap-loop` is set:
300
+
299
301
  - `--proposal-model` defaults to `haiku`
300
302
  - `--validation-model` defaults to `haiku`
301
303
  - `--gate-model` defaults to `sonnet`
@@ -12,21 +12,23 @@ selftune evolve body --skill <name> --skill-path <path> --target <target> [optio
12
12
 
13
13
  ## Options
14
14
 
15
- | Flag | Description | Default |
16
- |------|-------------|---------|
17
- | `--skill <name>` | Skill name | Required |
18
- | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
19
- | `--target <type>` | Evolution target: `routing` or `body` | Required |
20
- | `--teacher-agent <name>` | Agent CLI for proposal generation | Auto-detected |
21
- | `--student-agent <name>` | Agent CLI for validation | Same as teacher |
22
- | `--teacher-model <flag>` | Model flag for teacher (e.g. `opus`) | Agent default |
23
- | `--student-model <flag>` | Model flag for student (e.g. `haiku`) | Agent default |
24
- | `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
25
- | `--dry-run` | Propose and validate without deploying | Off |
26
- | `--max-iterations <n>` | Maximum refinement iterations | 3 |
27
- | `--task-description <text>` | Context for the evolution goal | None |
28
- | `--validation-model <model>` | Model for trigger-check validation calls (overrides `--student-model` for validation) | None |
29
- | `--few-shot <paths>` | Comma-separated paths to example SKILL.md files | None |
15
+ | Flag | Description | Default |
16
+ | ---------------------------- | ------------------------------------------------------------------------------------- | ------------------------ |
17
+ | `--skill <name>` | Skill name | Required |
18
+ | `--skill-path <path>` | Path to the skill's SKILL.md | Required |
19
+ | `--target <type>` | Evolution target: `routing` or `body` | Required |
20
+ | `--teacher-agent <name>` | Agent CLI for proposal generation | Auto-detected |
21
+ | `--student-agent <name>` | Agent CLI for validation | Same as teacher |
22
+ | `--teacher-model <flag>` | Model flag for teacher | `opus` |
23
+ | `--student-model <flag>` | Model flag for student | `haiku` |
24
+ | `--eval-set <path>` | Pre-built eval set JSON | Auto-generated from logs |
25
+ | `--dry-run` | Propose and validate without deploying | Off |
26
+ | `--max-iterations <n>` | Maximum refinement iterations | 3 |
27
+ | `--task-description <text>` | Context for the evolution goal | None |
28
+ | `--validation-model <model>` | Model for trigger-check validation calls (overrides `--student-model` for validation) | None |
29
+ | `--teacher-effort <level>` | Effort level for teacher LLM: `low`, `medium`, `high`, `max` | `high` |
30
+ | `--review` | Run `evolution-reviewer` subagent as Gate 4 before deployment | Off |
31
+ | `--few-shot <paths>` | Comma-separated paths to example SKILL.md files | None |
30
32
 
31
33
  ## Evolution Targets
32
34
 
@@ -46,11 +48,12 @@ teacher generates a complete replacement, validated through 3 gates.
46
48
 
47
49
  Every proposal passes through three sequential gates:
48
50
 
49
- | Gate | Type | What it checks | Cost |
50
- |------|------|---------------|------|
51
- | **Gate 1: Structural** | Pure code | YAML frontmatter present, `# Title` exists, `## Workflow Routing` preserved if original had one | Free |
52
- | **Gate 2: Trigger Accuracy** | Student LLM | YES/NO trigger check per eval entry on the extracted description | Cheap |
53
- | **Gate 3: Quality** | Student LLM | Body clarity and completeness score (0.0-1.0) | Cheap |
51
+ | Gate | Type | What it checks | Cost |
52
+ | ----------------------------- | ----------- | ----------------------------------------------------------------------------------------------- | -------- |
53
+ | **Gate 1: Structural** | Pure code | YAML frontmatter present, `# Title` exists, `## Workflow Routing` preserved if original had one | Free |
54
+ | **Gate 2: Trigger Accuracy** | Student LLM | YES/NO trigger check per eval entry on the extracted description | Cheap |
55
+ | **Gate 3: Quality** | Student LLM | Body clarity and completeness score (0.0-1.0) | Cheap |
56
+ | **Gate 4: Reviewer** (opt-in) | Subagent | `evolution-reviewer` multi-turn review — reads files, checks evidence, APPROVE/REJECT verdict | Moderate |
54
57
 
55
58
  If any gate fails, the teacher receives structured feedback and generates
56
59
  a refined proposal. This repeats up to `--max-iterations` times.
@@ -62,32 +65,32 @@ a refined proposal. This repeats up to `--max-iterations` times.
62
65
  Before running evolve-body, use the `AskUserQuestion` tool to present structured configuration options.
63
66
  If the user says "use defaults" or similar, skip to step 1 with recommended defaults. If the user cancels, abort the workflow -- do not proceed with defaults.
64
67
 
65
- Use `AskUserQuestion` with these questions:
66
-
67
- ```json
68
- {
69
- "questions": [
70
- {
71
- "question": "Evolution Target",
72
- "options": ["Routing table — optimize workflow routing only (recommended)", "Full body — rewrite entire SKILL.md (more aggressive)"]
73
- },
74
- {
75
- "question": "Execution Mode",
76
- "options": ["Dry run — preview without deploying (recommended)", "Live — validate and deploy if improved"]
77
- },
78
- {
79
- "question": "Teacher Model (generates proposals)",
80
- "options": ["Balanced (sonnet) — good quality (recommended)", "Best (opus) — highest quality, slower"]
81
- },
82
- {
83
- "question": "Student Model & Iterations",
84
- "options": ["Fast (haiku) + 3 iterations (recommended)", "Balanced (sonnet) + 3 iterations", "Fast (haiku) + 5 iterations"]
85
- }
86
- ]
87
- }
88
- ```
89
-
90
- If `AskUserQuestion` is not available, fall back to presenting these as inline numbered options.
68
+ Ask one `AskUserQuestion` at a time in this order:
69
+
70
+ 1. `Evolution Target`
71
+ Options:
72
+ - `Routing table — optimize workflow routing only (recommended)`
73
+ - `Full body — rewrite entire SKILL.md (more aggressive)`
74
+ 2. `Execution Mode`
75
+ Options:
76
+ - `Dry run — preview without deploying (recommended)`
77
+ - `Live — validate and deploy if improved`
78
+ 3. `Teacher Model (generates proposals)`
79
+ Options:
80
+ - `Best (opus + high effort) — thinking-enabled, highest quality (recommended)`
81
+ - `Balanced (sonnet) — good quality, faster`
82
+ 4. `Student Model & Iterations`
83
+ Options:
84
+ - `Fast (haiku) + 3 iterations (recommended)`
85
+ - `Balanced (sonnet) + 3 iterations`
86
+ - `Fast (haiku) + 5 iterations`
87
+ 5. `Teacher Effort (thinking depth for proposal generation)`
88
+ Options:
89
+ - `High — extended thinking for better proposals (recommended)`
90
+ - `Max — maximum thinking depth (Opus 4.6 only, slower)`
91
+ - `Medium — standard reasoning`
92
+
93
+ If `AskUserQuestion` is not available or Claude does not invoke it, fall back to presenting the same choices as inline numbered options.
91
94
 
92
95
  After the user responds, show a confirmation summary:
93
96
 
@@ -95,7 +98,8 @@ After the user responds, show a confirmation summary:
95
98
  Configuration Summary:
96
99
  Target: routing
97
100
  Mode: dry-run
98
- Teacher model: sonnet
101
+ Teacher model: opus
102
+ Teacher effort: high
99
103
  Student model: haiku
100
104
  Iterations: 3
101
105
  Few-shot: none
@@ -106,6 +110,7 @@ Proceeding...
106
110
  ### 1. Parse Current Skill
107
111
 
108
112
  The command reads SKILL.md and splits it into sections using `parseSkillSections()`:
113
+
109
114
  - Frontmatter (YAML between `---` markers)
110
115
  - Title (first `# Heading`)
111
116
  - Description (text between title and first `## Section`)
@@ -125,6 +130,7 @@ pipeline. See `references/invocation-taxonomy.md`.
125
130
  ### 4. Generate Proposal (Teacher)
126
131
 
127
132
  The teacher LLM generates a proposal based on the target:
133
+
128
134
  - **routing**: Optimized `## Workflow Routing` markdown table
129
135
  - **body**: Complete SKILL.md body replacement
130
136
 
@@ -138,6 +144,7 @@ failure details and generates a refined proposal.
138
144
  ### 6. Deploy or Preview
139
145
 
140
146
  If `--dry-run`, prints the proposal without deploying. Otherwise:
147
+
141
148
  1. Creates a timestamped backup of the current SKILL.md
142
149
  2. Applies the change: `replaceSection()` for routing, `replaceBody()` for body
143
150
  3. Records audit entries
@@ -146,13 +153,17 @@ If `--dry-run`, prints the proposal without deploying. Otherwise:
146
153
  ## Common Patterns
147
154
 
148
155
  **"Evolve the routing table for the Research skill"**
156
+
149
157
  > `selftune evolve body --skill Research --skill-path ~/.claude/skills/Research/SKILL.md --target routing`
150
158
 
151
159
  **"Rewrite the entire skill body"**
160
+
152
161
  > `selftune evolve body --skill Research --skill-path ~/.claude/skills/Research/SKILL.md --target body --dry-run`
153
162
 
154
163
  **"Use a stronger model for generation"**
164
+
155
165
  > `selftune evolve body --skill pptx --skill-path /path/SKILL.md --target body --teacher-model opus --student-model haiku`
156
166
 
157
167
  **"Preview what would change"**
168
+
158
169
  > Always start with `--dry-run` to review the proposal before deploying.
@@ -11,13 +11,13 @@ selftune grade --skill <name> [options]
11
11
 
12
12
  ## Options
13
13
 
14
- | Flag | Description | Default |
15
- |------|-------------|---------|
16
- | `--skill <name>` | Skill name to grade | Required |
17
- | `--expectations "..."` | Explicit expectations (semicolon-separated) | Auto-derived |
18
- | `--evals-json <path>` | Pre-built eval set JSON file | None |
19
- | `--eval-id <n>` | Specific eval ID to grade from the eval set | None |
20
- | `--agent <name>` | Agent CLI to use (claude, codex, opencode) | Auto-detected |
14
+ | Flag | Description | Default |
15
+ | ---------------------- | ------------------------------------------- | ------------- |
16
+ | `--skill <name>` | Skill name to grade | Required |
17
+ | `--expectations "..."` | Explicit expectations (semicolon-separated) | Auto-derived |
18
+ | `--evals-json <path>` | Pre-built eval set JSON file | None |
19
+ | `--eval-id <n>` | Specific eval ID to grade from the eval set | None |
20
+ | `--agent <name>` | Agent CLI to use (claude, codex, opencode) | Auto-detected |
21
21
 
22
22
  ## Output Format
23
23
 
@@ -30,14 +30,10 @@ for the full schema. Key fields:
30
30
  "skill_name": "pptx",
31
31
  "transcript_path": "~/.claude/projects/.../abc123.jsonl",
32
32
  "graded_at": "2026-02-28T12:00:00Z",
33
- "expectations": [
34
- { "text": "...", "passed": true, "evidence": "..." }
35
- ],
33
+ "expectations": [{ "text": "...", "passed": true, "evidence": "..." }],
36
34
  "summary": { "passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67 },
37
35
  "execution_metrics": { "tool_calls": {}, "total_tool_calls": 6, "errors_encountered": 0 },
38
- "claims": [
39
- { "claim": "...", "type": "factual", "verified": true, "evidence": "..." }
40
- ],
36
+ "claims": [{ "claim": "...", "type": "factual", "verified": true, "evidence": "..." }],
41
37
  "eval_feedback": { "suggestions": [], "overall": "..." }
42
38
  }
43
39
  ```
@@ -82,6 +78,7 @@ fields. See `references/logs.md` for the telemetry format.
82
78
  ### 2. Read the Transcript
83
79
 
84
80
  Parse the JSONL file at `transcript_path`. Extract:
81
+
85
82
  - User messages (what was asked)
86
83
  - Assistant tool calls (what the agent did)
87
84
  - Tool results (what happened)
@@ -99,6 +96,7 @@ Ensure at least one Process expectation and one Quality expectation.
99
96
 
100
97
  Search both the telemetry record and the transcript for evidence per
101
98
  expectation. Mark as:
99
+
102
100
  - **PASS** if evidence exists and supports the expectation
103
101
  - **FAIL** if evidence is absent or contradicts the expectation
104
102
 
@@ -123,6 +121,7 @@ Write the full grading result to `grading.json` in the current directory.
123
121
  ### 8. Report Results
124
122
 
125
123
  Report to the user:
124
+
126
125
  - Pass rate (e.g., "2/3 passed, 67%")
127
126
  - Failed expectations with evidence
128
127
  - Notable claims
@@ -133,19 +132,23 @@ Keep the summary concise. The full details are in `grading.json`.
133
132
  ## Common Patterns
134
133
 
135
134
  **User asks to grade a skill session**
135
+
136
136
  > Run `selftune grade --skill <name>` with default expectations. Results are
137
137
  > written to `grading.json`. Read that file and report the pass rate and any
138
138
  > failures to the user.
139
139
 
140
140
  **User provides specific expectations**
141
+
141
142
  > Run `selftune grade --skill <name> --expectations "expect1;expect2;expect3"`.
142
143
  > Parse results and report.
143
144
 
144
145
  **User wants to grade from an eval set**
146
+
145
147
  > Run `selftune grade --skill <name> --evals-json path/to/evals.json`.
146
148
  > Optionally add `--eval-id N` for a specific scenario.
147
149
 
148
150
  **Agent detection override needed**
151
+
149
152
  > The grader auto-detects the agent CLI. If detection fails or the user
150
153
  > specifies an agent, pass `--agent <name>` to override.
151
154
 
@@ -18,12 +18,12 @@ selftune eval import --dir <path> --skill <name> --output <path> [options]
18
18
 
19
19
  ## Options
20
20
 
21
- | Flag | Description | Default |
22
- |------|-------------|---------|
23
- | `--dir <path>` | Path to SkillsBench tasks directory | Required |
24
- | `--skill <name>` | Target skill to match tasks against | Required |
25
- | `--output <path>` | Output eval set JSON file | Required |
26
- | `--match-strategy <type>` | Matching strategy: `exact` or `fuzzy` | `exact` |
21
+ | Flag | Description | Default |
22
+ | ------------------------- | ------------------------------------- | -------- |
23
+ | `--dir <path>` | Path to SkillsBench tasks directory | Required |
24
+ | `--skill <name>` | Target skill to match tasks against | Required |
25
+ | `--output <path>` | Output eval set JSON file | Required |
26
+ | `--match-strategy <type>` | Matching strategy: `exact` or `fuzzy` | `exact` |
27
27
 
28
28
  ## Match Strategies
29
29
 
@@ -108,10 +108,13 @@ corpus. Use the merged set with `selftune evolve --eval-set merged-evals.json`.
108
108
  ## Common Patterns
109
109
 
110
110
  **"Import SkillsBench tasks for Research"**
111
+
111
112
  > `selftune eval import --dir /path/tasks --skill Research --output bench-evals.json`
112
113
 
113
114
  **"Use fuzzy matching for broader coverage"**
115
+
114
116
  > `selftune eval import --dir /path/tasks --skill pptx --output bench-evals.json --match-strategy fuzzy`
115
117
 
116
118
  **"Enrich my eval set with external benchmarks"**
119
+
117
120
  > Import with `eval import`, then pass the output to `evolve --eval-set`.