selftune 0.2.19 → 0.2.21
- package/apps/local-dashboard/dist/assets/{index-DnhnXQm6.js → index-D8O-RG1I.js} +2 -2
- package/apps/local-dashboard/dist/index.html +1 -1
- package/cli/selftune/dashboard-contract.ts +4 -0
- package/cli/selftune/eval/family-overlap.ts +320 -1
- package/cli/selftune/evolution/evidence.ts +5 -0
- package/cli/selftune/evolution/evolve-body.ts +86 -2
- package/cli/selftune/evolution/evolve.ts +58 -1
- package/cli/selftune/evolution/validate-body.ts +10 -0
- package/cli/selftune/evolution/validate-host-replay.ts +624 -0
- package/cli/selftune/evolution/validate-proposal.ts +10 -0
- package/cli/selftune/evolution/validate-routing.ts +112 -5
- package/cli/selftune/localdb/direct-write.ts +8 -3
- package/cli/selftune/localdb/materialize.ts +7 -2
- package/cli/selftune/localdb/queries.ts +11 -1
- package/cli/selftune/localdb/schema.ts +10 -1
- package/cli/selftune/routes/skill-report.ts +6 -1
- package/cli/selftune/types.ts +54 -0
- package/cli/selftune/utils/text-similarity.ts +73 -0
- package/package.json +1 -1
- package/packages/ui/src/components/EvidenceViewer.tsx +85 -2
- package/packages/ui/src/components/EvolutionTimeline.tsx +23 -1
- package/packages/ui/src/types.ts +4 -0
- package/skill/Workflows/Composability.md +15 -1
- package/skill/Workflows/Evolve.md +39 -0
package/packages/ui/src/types.ts
CHANGED
@@ -31,6 +31,10 @@ export interface EvolutionEntry {
   action: string;
   details: string;
   eval_snapshot?: EvalSnapshot | null;
+  validation_mode?: "structural_guard" | "host_replay" | "llm_judge" | null;
+  validation_agent?: string | null;
+  validation_fixture_id?: string | null;
+  validation_evidence_ref?: string | null;
 }
 
 export interface UnmatchedQuery {
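The four optional fields added to `EvolutionEntry` read as provenance for a validation step. A minimal sketch of how a consumer might summarize them, assuming a trimmed-down local copy of the interface (the real one has more fields) and a hypothetical helper `describeValidation` that is not part of selftune:

```typescript
// Trimmed-down copy of the interface from the hunk; only the fields needed
// for this sketch are included.
type ValidationMode = "structural_guard" | "host_replay" | "llm_judge";

interface EvolutionEntry {
  action: string;
  details: string;
  validation_mode?: ValidationMode | null;
  validation_agent?: string | null;
  validation_fixture_id?: string | null;
  validation_evidence_ref?: string | null;
}

// Hypothetical helper: builds a one-line provenance summary. It tolerates
// entries where the new fields are absent (undefined) or cleared (null).
function describeValidation(entry: EvolutionEntry): string {
  if (entry.validation_mode == null) return "validation: unknown";
  const parts: string[] = [`validation: ${entry.validation_mode}`];
  if (entry.validation_agent) parts.push(`agent=${entry.validation_agent}`);
  if (entry.validation_fixture_id) parts.push(`fixture=${entry.validation_fixture_id}`);
  if (entry.validation_evidence_ref) parts.push(`evidence=${entry.validation_evidence_ref}`);
  return parts.join(" ");
}
```

Because every new field is optional and nullable, older entries without them still type-check and fall through to the "unknown" branch.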
package/skill/Workflows/Composability.md
CHANGED
@@ -104,6 +104,19 @@ This is for packaging questions like:
 - "Are my sibling skills competing for the same user intent?"
 - "Should I stop evolving these independently and redesign the family?"
 
+When trusted telemetry is sparse, the same command also emits a
+`cold_start_suspicion` block. That is a weaker, earlier signal based on the
+installed skill surfaces:
+
+1. Frontmatter / top-level description similarity
+2. Overlap in `## When to Use` language
+3. Shared command surface (for example, siblings that both wrap `mentor search`)
+4. Synthetic sibling-confusion probes derived from those overlapping surfaces
+
+Treat `cold_start_suspicion.candidate` as architecture suspicion, not proof.
+It is meant to tell you "this family may want a parent skill" before enough
+real usage exists to confirm it through trusted positive-query overlap.
+
 ## Steps
 
 ### 1. Run Analysis
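The first signal in the cold-start list, description similarity, can be illustrated with a token-set comparison. This is a sketch only: the diff does not show which metric selftune uses, so Jaccard similarity and the helper names `tokenize` and `jaccard` are assumptions chosen for illustration.

```typescript
// Lowercase the text, split on non-alphanumeric runs, and drop empty tokens.
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0),
  );
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Defined here as 0 when both
// inputs have no tokens, so empty descriptions never look "similar".
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}
```

Any cutoff above which two sibling descriptions count as suspicious would be a further assumption; the source does not state one.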
@@ -140,6 +153,7 @@ Interpretation:
 
 - `consolidation_candidate: false` means keep improving the sibling descriptions/workflows separately
 - `consolidation_candidate: true` means the problem is likely packaging, not just wording
+- `cold_start_suspicion.candidate: true` means installed skill surfaces already look suspicious even though trusted telemetry is still sparse
 - `refactor_proposal` is a draft for human review only; do not auto-deploy a family rewrite
 
 ## Subagent Escalation
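The interpretation rules in this hunk can be written as a small decision function. The result shape below is an assumption that models only the fields named in the list, and `nextStep` and its return labels are hypothetical names, not selftune output.

```typescript
// Assumed, simplified shape of the analysis output.
interface FamilyOverlapResult {
  consolidation_candidate: boolean;
  cold_start_suspicion?: { candidate: boolean } | null;
}

type NextStep =
  | "packaging_problem_needs_human_review"
  | "architecture_suspicion_only"
  | "keep_evolving_siblings_separately";

// Live evidence (consolidation_candidate) outranks the weaker cold-start
// signal; neither branch deploys anything automatically.
function nextStep(result: FamilyOverlapResult): NextStep {
  if (result.consolidation_candidate) return "packaging_problem_needs_human_review";
  if (result.cold_start_suspicion?.candidate) return "architecture_suspicion_only";
  return "keep_evolving_siblings_separately";
}
```

Checking `consolidation_candidate` first reflects the hunk's framing of `cold_start_suspicion` as suspicion rather than proof.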
@@ -173,4 +187,4 @@ resolution plan with trigger ownership recommendations.
 
 **"Should I consolidate this sibling skill family?"**
 
-> Run `selftune eval family-overlap` and look for `consolidation_candidate` plus
+> Run `selftune eval family-overlap` and look for `consolidation_candidate` when you have live evidence, or `cold_start_suspicion` when you only have installed skill surfaces plus cold-start evals.
package/skill/Workflows/Evolve.md
CHANGED
@@ -76,6 +76,45 @@ The evolution process writes multiple audit entries:
 | `validated` | Proposal tested against eval set | `eval_snapshot` with before/after pass rates |
 | `deployed` | Updated SKILL.md written to disk | `eval_snapshot` with final rates |
 
+Routing/body validation may also carry provenance fields such as:
+
+- `validation_mode` — `llm_judge`, `host_replay`, or `structural_guard`
+- `validation_agent` — which host/agent performed the validation
+- `validation_fixture_id` — fixture identifier when replay-backed validation is used
+- `before_pass_rate` / `after_pass_rate` — only present when trigger validation actually ran; structural-guard exits do not emit synthetic pass rates
+
+Most evolve runs today still validate through `llm_judge`. Routing evolution now
+auto-builds a replay fixture from the target skill plus installed sibling
+skills in the same registry, so replay-backed validation is preferred whenever
+that local fixture can be constructed because it captures host-style routing
+behavior instead of model judgment.
+
+For Claude Code, the replay path now stages a temporary project-local
+`.claude/skills` registry, swaps in the candidate routing table, and runs a
+one-turn Claude print-mode session with project/local settings only. Validation
+records whether Claude actually invoked the target skill, invoked a competing
+skill, invoked an unrelated skill, or made no routing decision at all.
+Unrelated skill use is treated as a replay failure even on negative evals,
+because it still indicates the runtime routed somewhere unexpected. If that
+runtime path is unavailable or fails to reach a runtime decision, selftune
+falls back to the existing fixture-backed surface simulation and notes the
+fallback in the replay evidence instead of pretending it was a runtime result.
+
+For non-Claude platforms today, replay remains fixture-backed: it evaluates the
+target routing table against the installed target/competing skill surfaces in a
+controlled replay fixture and records per-entry evidence. That is still a
+stronger signal than a free-form judge prompt, but you should describe it as
+replay-backed validation, not as live operator telemetry.
+
+Replay parsing is intentionally conservative: unreadable skill files degrade to
+empty surfaces instead of throwing, and malformed routing rows with empty
+trigger cells are ignored rather than treated as valid triggers. Claude replay
+also normalizes observed `Read` paths against the staged workspace, so relative
+skill reads still count as read-only evidence for the target or competing
+skill. Reads outside the staged skill set are treated as replay failures rather
+than benign negatives, because they indicate the runtime left the controlled
+evaluation surface.
+
 ## Parsing Instructions
 
 ### Track Evolution Progress
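The replay outcomes described in the Evolve.md hunk (target invoked, competing skill invoked, unrelated skill invoked, no routing decision) suggest a simple pass/fail rule. The sketch below is one plausible reading of that paragraph, not the code in `validate-host-replay.ts`: the type `ReplayOutcome`, the function `replayPasses`, and the handling of the competing and no-decision cases on negative evals are all assumptions. The fallback to fixture-backed simulation is not modeled.

```typescript
// The four outcomes the hunk says a Claude replay records.
type ReplayOutcome = "target" | "competing" | "unrelated" | "none";

// shouldTrigger is true for a positive eval (the query should route to the
// target skill) and false for a negative eval.
function replayPasses(shouldTrigger: boolean, outcome: ReplayOutcome): boolean {
  // Stated in the source: unrelated skill use fails even on negative evals.
  if (outcome === "unrelated") return false;
  // Assumed: a positive eval passes only when the target was invoked.
  if (shouldTrigger) return outcome === "target";
  // Assumed: a negative eval passes as long as the target was not invoked.
  return outcome !== "target";
}
```

Handling `unrelated` before anything else keeps that rule independent of the eval's polarity, which matches the source's point that routing somewhere unexpected is never a benign negative.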