selftune 0.2.19 → 0.2.21

@@ -31,6 +31,10 @@ export interface EvolutionEntry {
  action: string;
  details: string;
  eval_snapshot?: EvalSnapshot | null;
+ validation_mode?: "structural_guard" | "host_replay" | "llm_judge" | null;
+ validation_agent?: string | null;
+ validation_fixture_id?: string | null;
+ validation_evidence_ref?: string | null;
  }

  export interface UnmatchedQuery {
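The new provenance fields on `EvolutionEntry` can be consumed roughly as follows. This is a minimal sketch: `EvalSnapshot`'s shape is not shown in this diff, and the `describeValidation` helper is a hypothetical name, not part of the package.

```typescript
// Assumed placeholder; the real EvalSnapshot shape is not in this diff.
interface EvalSnapshot {
  before_pass_rate?: number;
  after_pass_rate?: number;
}

interface EvolutionEntry {
  action: string;
  details: string;
  eval_snapshot?: EvalSnapshot | null;
  validation_mode?: "structural_guard" | "host_replay" | "llm_judge" | null;
  validation_agent?: string | null;
  validation_fixture_id?: string | null;
  validation_evidence_ref?: string | null;
}

// Hypothetical helper: summarize how an entry was validated.
function describeValidation(entry: EvolutionEntry): string {
  if (!entry.validation_mode) return "unvalidated";
  const agent = entry.validation_agent ?? "unknown agent";
  return entry.validation_mode === "host_replay"
    ? `host_replay via ${agent} (fixture ${entry.validation_fixture_id ?? "n/a"})`
    : entry.validation_mode;
}

const entry: EvolutionEntry = {
  action: "validated",
  details: "routing table candidate",
  validation_mode: "host_replay",
  validation_agent: "claude-code",
  validation_fixture_id: "fixture-001",
};
console.log(describeValidation(entry));
```

Since every field is optional and nullable, older audit entries that predate these fields still typecheck and simply read as "unvalidated".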
@@ -104,6 +104,19 @@ This is for packaging questions like:
  - "Are my sibling skills competing for the same user intent?"
  - "Should I stop evolving these independently and redesign the family?"

+ When trusted telemetry is sparse, the same command also emits a
+ `cold_start_suspicion` block. That is a weaker, earlier signal based on the
+ installed skill surfaces:
+
+ 1. Frontmatter / top-level description similarity
+ 2. Overlap in `## When to Use` language
+ 3. Shared command surface (for example, siblings that both wrap `mentor search`)
+ 4. Synthetic sibling-confusion probes derived from those overlapping surfaces
+
+ Treat `cold_start_suspicion.candidate` as architecture suspicion, not proof.
+ It is meant to tell you "this family may want a parent skill" before enough
+ real usage exists to confirm it through trusted positive-query overlap.
+
  ## Steps

  ### 1. Run Analysis
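The first cold-start signal, description similarity, could be approximated with a naive token-overlap score. This is illustrative only: the diff names the signals but not the scoring, so this Jaccard similarity and the `tokenSet`/`jaccard` names are assumptions, not selftune's actual metric.

```typescript
// Tokenize a skill description into a lowercase word set.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in [0, 1].
function jaccard(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  const inter = Array.from(sa).filter((t) => sb.has(t)).length;
  const union = new Set([...Array.from(sa), ...Array.from(sb)]).size;
  return union === 0 ? 0 : inter / union;
}

// Two sibling descriptions that both wrap `mentor search`:
const simA = "Search mentor docs for API answers";
const simB = "Search mentor docs for usage answers";
console.log(jaccard(simA, simB) > 0.5); // true: high overlap is a suspicion signal
```

Any threshold here (0.5 above) is a tuning choice; the point is that near-duplicate surfaces are flaggable before real routing telemetry exists.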
@@ -140,6 +153,7 @@ Interpretation:

  - `consolidation_candidate: false` means keep improving the sibling descriptions/workflows separately
  - `consolidation_candidate: true` means the problem is likely packaging, not just wording
+ - `cold_start_suspicion.candidate: true` means installed skill surfaces already look suspicious even though trusted telemetry is still sparse
  - `refactor_proposal` is a draft for human review only; do not auto-deploy a family rewrite

  ## Subagent Escalation
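The interpretation rules above can be sketched as a small decision function. Only `consolidation_candidate` and `cold_start_suspicion.candidate` come from the doc; the `FamilyOverlapReport` type, `nextStep` name, and result strings are illustrative assumptions.

```typescript
interface FamilyOverlapReport {
  consolidation_candidate: boolean;
  cold_start_suspicion?: { candidate: boolean } | null;
}

function nextStep(report: FamilyOverlapReport): string {
  if (report.consolidation_candidate) {
    // Trusted telemetry says the problem is likely packaging, not wording.
    return "review refactor_proposal (human review only; do not auto-deploy)";
  }
  if (report.cold_start_suspicion?.candidate) {
    // Architecture suspicion from installed surfaces, not proof.
    return "flag family for possible parent skill; gather more real usage";
  }
  return "keep improving sibling descriptions/workflows separately";
}

console.log(nextStep({ consolidation_candidate: false, cold_start_suspicion: { candidate: true } }));
```

Note the ordering: a trusted `consolidation_candidate` verdict takes precedence over cold-start suspicion, which only matters while live evidence is still sparse.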
@@ -173,4 +187,4 @@ resolution plan with trigger ownership recommendations.

  **"Should I consolidate this sibling skill family?"**

- > Run `selftune eval family-overlap` and look for `consolidation_candidate` plus the `refactor_proposal`.
+ > Run `selftune eval family-overlap` and look for `consolidation_candidate` when you have live evidence, or `cold_start_suspicion` when you only have installed skill surfaces plus cold-start evals.
@@ -76,6 +76,45 @@ The evolution process writes multiple audit entries:
  | `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
  | `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |

+ Routing/body validation may also carry provenance fields such as:
+
+ - `validation_mode` — `llm_judge`, `host_replay`, or `structural_guard`
+ - `validation_agent` — which host/agent performed the validation
+ - `validation_fixture_id` — fixture identifier when replay-backed validation is used
+ - `before_pass_rate` / `after_pass_rate` — only present when trigger validation actually ran; structural-guard exits do not emit synthetic pass rates
+
+ Most evolve runs today still validate through `llm_judge`. Routing evolution now
+ auto-builds a replay fixture from the target skill plus installed sibling
+ skills in the same registry, so replay-backed validation is preferred whenever
+ that local fixture can be constructed, because it captures host-style routing
+ behavior instead of model judgment.
+
+ For Claude Code, the replay path now stages a temporary project-local
+ `.claude/skills` registry, swaps in the candidate routing table, and runs a
+ one-turn Claude print-mode session with project/local settings only. Validation
+ records whether Claude actually invoked the target skill, invoked a competing
+ skill, invoked an unrelated skill, or made no routing decision at all.
+ Unrelated skill use is treated as a replay failure even on negative evals,
+ because it still indicates the runtime routed somewhere unexpected. If that
+ runtime path is unavailable or fails to reach a runtime decision, selftune
+ falls back to the existing fixture-backed surface simulation and notes the
+ fallback in the replay evidence instead of pretending it was a runtime result.
+
+ For non-Claude platforms today, replay remains fixture-backed: it evaluates the
+ target routing table against the installed target/competing skill surfaces in a
+ controlled replay fixture and records per-entry evidence. That is still a
+ stronger signal than a free-form judge prompt, but you should describe it as
+ replay-backed validation, not as live operator telemetry.
+
+ Replay parsing is intentionally conservative: unreadable skill files degrade to
+ empty surfaces instead of throwing, and malformed routing rows with empty
+ trigger cells are ignored rather than treated as valid triggers. Claude replay
+ also normalizes observed `Read` paths against the staged workspace, so relative
+ skill reads still count as read-only evidence for the target or competing
+ skill. Reads outside the staged skill set are treated as replay failures rather
+ than benign negatives, because they indicate the runtime left the controlled
+ evaluation surface.
+
  ## Parsing Instructions

  ### Track Evolution Progress
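Tracking progress across audit entries has one subtlety documented above: pass rates are only present when trigger validation actually ran, and structural-guard exits emit none. A sketch of honoring that, where the `AuditEntry` type and `passRateDelta` helper are hypothetical names built from the fields this diff adds:

```typescript
interface AuditEntry {
  action: string; // e.g. "validated", "deployed"
  validation_mode?: "structural_guard" | "host_replay" | "llm_judge" | null;
  eval_snapshot?: { before_pass_rate?: number; after_pass_rate?: number } | null;
}

// Return the pass-rate movement for an entry, or null when validation
// did not produce pass rates (e.g. a structural_guard exit).
function passRateDelta(entry: AuditEntry): number | null {
  const snap = entry.eval_snapshot;
  if (!snap || snap.before_pass_rate == null || snap.after_pass_rate == null) {
    return null;
  }
  return snap.after_pass_rate - snap.before_pass_rate;
}

const entries: AuditEntry[] = [
  {
    action: "validated",
    validation_mode: "host_replay",
    eval_snapshot: { before_pass_rate: 0.5, after_pass_rate: 0.9 },
  },
  { action: "validated", validation_mode: "structural_guard" }, // no pass rates
];

const deltas = entries.map(passRateDelta); // first delta ≈ 0.4, second null
console.log(deltas);
```

Returning `null` instead of `0` keeps structural-guard exits distinguishable from runs where validation ran and measured no improvement.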