@rudderhq/agent-runtime-gemini-local 0.2.1 → 0.2.2-canary.0
- package/package.json +2 -2
- package/skills/conversation-to-skill/LICENSE.txt +202 -0
- package/skills/conversation-to-skill/SKILL.md +428 -0
- package/skills/conversation-to-skill/agents/analyzer.md +274 -0
- package/skills/conversation-to-skill/agents/comparator.md +202 -0
- package/skills/conversation-to-skill/agents/grader.md +223 -0
- package/skills/conversation-to-skill/assets/eval_review.html +146 -0
- package/skills/conversation-to-skill/eval-viewer/generate_review.py +471 -0
- package/skills/conversation-to-skill/eval-viewer/viewer.html +1325 -0
- package/skills/conversation-to-skill/references/compatibility.md +36 -0
- package/skills/conversation-to-skill/references/description-optimization.md +113 -0
- package/skills/conversation-to-skill/references/evaluation-suite.md +410 -0
- package/skills/conversation-to-skill/references/schemas.md +431 -0
- package/skills/conversation-to-skill/scripts/__init__.py +0 -0
- package/skills/conversation-to-skill/scripts/aggregate_benchmark.py +401 -0
- package/skills/conversation-to-skill/scripts/generate_report.py +335 -0
- package/skills/conversation-to-skill/scripts/improve_description.py +197 -0
- package/skills/conversation-to-skill/scripts/model_backends.py +115 -0
- package/skills/conversation-to-skill/scripts/package_skill.py +136 -0
- package/skills/conversation-to-skill/scripts/quick_validate.py +103 -0
- package/skills/conversation-to-skill/scripts/run_eval.py +363 -0
- package/skills/conversation-to-skill/scripts/run_loop.py +319 -0
- package/skills/conversation-to-skill/scripts/utils.py +223 -0
- package/skills/rudder/references/organization-skills.md +1 -1
- package/skills/skill-creator/SKILL.md +9 -0
- package/skills/skill-optimizer/CHANGELOG.md +29 -0
- package/skills/skill-optimizer/SKILL.md +205 -0
- package/skills/skill-optimizer/references/adapters/creative-brand-content.md +30 -0
- package/skills/skill-optimizer/references/adapters/customer-support-sales.md +30 -0
- package/skills/skill-optimizer/references/adapters/document-data-processing.md +31 -0
- package/skills/skill-optimizer/references/adapters/education-training.md +31 -0
- package/skills/skill-optimizer/references/adapters/finance-accounting.md +31 -0
- package/skills/skill-optimizer/references/adapters/healthcare-operations.md +30 -0
- package/skills/skill-optimizer/references/adapters/hr-people-ops.md +31 -0
- package/skills/skill-optimizer/references/adapters/legal-compliance.md +31 -0
- package/skills/skill-optimizer/references/adapters/operations-supply-chain.md +31 -0
- package/skills/skill-optimizer/references/adapters/personal-productivity.md +29 -0
- package/skills/skill-optimizer/references/adapters/research-knowledge.md +31 -0
- package/skills/skill-optimizer/references/adapters/software-ai.md +31 -0
- package/skills/skill-optimizer/references/domain-adapter-patterns.md +66 -0
- package/skills/skill-optimizer/references/eval-method.md +17 -0
- package/skills/skill-optimizer/references/universal-optimization-lens.md +73 -0
package/skills/conversation-to-skill/references/compatibility.md
@@ -0,0 +1,36 @@
|
|
|
1
|
+
# Host Compatibility
|
|
2
|
+
|
|
3
|
+
This skill is designed to work across multiple agent hosts. The workflow stays
|
|
4
|
+
the same, but some mechanics change.
|
|
5
|
+
|
|
6
|
+
## Capability Matrix
|
|
7
|
+
|
|
8
|
+
| Host | Draft skill | Run evals | Parallel baseline | Description tuning | Packaging |
|
|
9
|
+
|------|-------------|-----------|-------------------|--------------------|-----------|
|
|
10
|
+
| Codex | Yes | Yes, judged via `codex exec` | Usually yes | Yes | Yes |
|
|
11
|
+
| Claude Code | Yes | Yes, observed via `claude -p` | Yes | Yes | Yes |
|
|
12
|
+
| Claude.ai | Yes | Manual/serial | No | Usually no | Yes |
|
|
13
|
+
| Generic shell agent | Yes | Yes if a CLI exists | Depends | Depends | Yes |
|
|
14
|
+
|
|
15
|
+
## Important Differences
|
|
16
|
+
|
|
17
|
+
- **Claude Code**
|
|
18
|
+
- `scripts/run_eval.py --backend claude` measures real observed triggering.
|
|
19
|
+
- This is the highest-fidelity description benchmark.
|
|
20
|
+
|
|
21
|
+
- **Codex**
|
|
22
|
+
- `scripts/run_eval.py --backend codex` measures judged routing, not native
|
|
23
|
+
skill invocation. It is still useful for testing whether the description
|
|
24
|
+
makes the intended use cases obvious.
|
|
25
|
+
- Treat Codex trigger numbers as a proxy, not ground truth.
|
|
26
|
+
|
|
27
|
+
- **Claude.ai or chat-only hosts**
|
|
28
|
+
- Skip automated trigger benchmarks unless you have a shell + CLI bridge.
|
|
29
|
+
- Do serial manual evals and focus on qualitative output review.
|
|
30
|
+
|
|
31
|
+
## Recommended Defaults
|
|
32
|
+
|
|
33
|
+
- If you are improving instructions, tool choices, examples, or bundled scripts:
|
|
34
|
+
qualitative evals matter most.
|
|
35
|
+
- If you are improving trigger phrasing:
|
|
36
|
+
prefer `--backend claude` when available, otherwise `--backend codex`.
|
|
package/skills/conversation-to-skill/references/description-optimization.md
@@ -0,0 +1,113 @@
|
|
|
1
|
+
# Description Optimization
|
|
2
|
+
|
|
3
|
+
Use this reference when the skill itself is already decent but its frontmatter
|
|
4
|
+
description may under-trigger or over-trigger.
|
|
5
|
+
|
|
6
|
+
The description is the primary routing surface.
|
|
7
|
+
Treat optimization as a real evaluation problem, not copy editing.
|
|
8
|
+
|
|
9
|
+
## When To Use This
|
|
10
|
+
|
|
11
|
+
Use description optimization when:
|
|
12
|
+
|
|
13
|
+
- the user asks to improve triggering
|
|
14
|
+
- the skill reads well but is not being invoked reliably
|
|
15
|
+
- the skill is firing on near-miss prompts
|
|
16
|
+
- you have already finished at least a solid draft of the skill
|
|
17
|
+
|
|
18
|
+
Do not optimize the description first if the skill body is still weak.
|
|
19
|
+
Fix the capability before tuning the routing surface.
|
|
20
|
+
|
|
21
|
+
## Step 1. Generate trigger eval queries
|
|
22
|
+
|
|
23
|
+
Create roughly 20 realistic queries split between:
|
|
24
|
+
|
|
25
|
+
- `should_trigger: true`
|
|
26
|
+
- `should_trigger: false`
|
|
27
|
+
|
|
28
|
+
Save them as JSON:
|
|
29
|
+
|
|
30
|
+
```json
|
|
31
|
+
[
|
|
32
|
+
{"query": "the user prompt", "should_trigger": true},
|
|
33
|
+
{"query": "another prompt", "should_trigger": false}
|
|
34
|
+
]
|
|
35
|
+
```
|
|
36
|
+
|
|
37
|
+
The queries should feel like real user inputs, not toy examples.
|
|
38
|
+
Include concrete details such as file names, work context, URLs, messy wording,
|
|
39
|
+
typos, abbreviations, or ambiguous phrasing.
|
|
40
|
+
|
|
41
|
+
Bad negative cases are obviously irrelevant.
|
|
42
|
+
Good negative cases are near-misses that share vocabulary but should route
|
|
43
|
+
somewhere else.
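
A quick sanity check on the eval set can catch malformed entries before the loop runs. The following is a minimal sketch, assuming only the two fields shown above (`query` and `should_trigger`); the function name `validate_trigger_evals` is hypothetical and is not part of the bundled scripts.

```python
import json


def validate_trigger_evals(raw_json: str) -> dict:
    """Parse a trigger eval set and report how it is split (hypothetical helper)."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        raise ValueError("eval set must be a JSON array")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index}: expected an object")
        query = entry.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"entry {index}: 'query' must be a non-empty string")
        if not isinstance(entry.get("should_trigger"), bool):
            raise ValueError(f"entry {index}: 'should_trigger' must be true or false")
    positives = sum(1 for entry in entries if entry["should_trigger"])
    return {
        "total": len(entries),
        "positive": positives,
        "negative": len(entries) - positives,
    }


sample = (
    '[{"query": "the user prompt", "should_trigger": true},'
    ' {"query": "another prompt", "should_trigger": false}]'
)
print(validate_trigger_evals(sample))
```

The counts make it easy to see whether the set is lopsided toward positive or negative cases before spending time on a run.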
|
|
44
|
+
|
|
45
|
+
## Step 2. Review the query set with the user
|
|
46
|
+
|
|
47
|
+
Before running the loop, let the user review the query set.
|
|
48
|
+
Use the local bundled HTML review asset in `assets/eval_review.html` to present
|
|
49
|
+
and edit the eval set.
|
|
50
|
+
|
|
51
|
+
The template workflow is:
|
|
52
|
+
|
|
53
|
+
1. load the HTML template
|
|
54
|
+
2. inject eval data, skill name, and current description
|
|
55
|
+
3. write a temp HTML file
|
|
56
|
+
4. open it for the user
|
|
57
|
+
5. read the exported eval set from Downloads
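
Steps 1 to 3 of the workflow above can be sketched as follows. This is an illustration under stated assumptions: the placeholder tokens `__EVAL_DATA__`, `__SKILL_NAME__`, and `__DESCRIPTION__` are hypothetical, and the real `assets/eval_review.html` may use different markers, so check the template before relying on this.

```python
import html
import json
import tempfile
from pathlib import Path


def render_review_page(template: str, evals: list, skill_name: str, description: str) -> str:
    """Fill a template that uses the hypothetical placeholder tokens."""
    # Escape "</" so the JSON cannot close an enclosing <script> tag early.
    data = json.dumps(evals).replace("</", "<\\/")
    return (
        template.replace("__SKILL_NAME__", html.escape(skill_name))
        .replace("__DESCRIPTION__", html.escape(description))
        .replace("__EVAL_DATA__", data)
    )


def write_review_page(page: str) -> Path:
    """Write the rendered page to a temp HTML file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(page)
    return Path(handle.name)


demo_template = "<h1>__SKILL_NAME__</h1><p>__DESCRIPTION__</p><script>const evals = __EVAL_DATA__;</script>"
demo_page = render_review_page(
    demo_template,
    [{"query": "the user prompt", "should_trigger": True}],
    "example-skill",
    "Use when <condition> applies",
)
demo_path = write_review_page(demo_page)
print(demo_path)
```

Opening the file for the user (step 4) could be done with the standard library's `webbrowser.open(demo_path.as_uri())` on hosts that have a display.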
|
|
58
|
+
|
|
59
|
+
This step matters because poor trigger queries produce poor descriptions.
|
|
60
|
+
|
|
61
|
+
## Step 3. Run the optimization loop
|
|
62
|
+
|
|
63
|
+
Run the local bundled optimization loop from the skill root:
|
|
64
|
+
|
|
65
|
+
```bash
|
|
66
|
+
python scripts/run_loop.py \
|
|
67
|
+
--eval-set <path-to-trigger-eval.json> \
|
|
68
|
+
--skill-path <path-to-skill> \
|
|
69
|
+
--backend auto \
|
|
70
|
+
--max-iterations 5 \
|
|
71
|
+
--verbose
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
If the backend supports model selection, use the current host model when that
|
|
75
|
+
keeps the results closer to the user's real experience.
|
|
76
|
+
|
|
77
|
+
While the loop runs:
|
|
78
|
+
|
|
79
|
+
- report progress to the user
|
|
80
|
+
- watch train and held-out test behavior
|
|
81
|
+
- avoid selecting descriptions based only on training performance
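
The last point can be made concrete with a small selection rule. This is a sketch only: the record shape (`description`, `train_score`, `test_score`) is an assumption for illustration and is not necessarily what `scripts/run_loop.py` emits.

```python
def pick_description(history: list) -> dict:
    """Choose the candidate with the best held-out score.

    Training score is used only to break ties, so a description that
    merely memorizes the training queries cannot win on that alone.
    """
    if not history:
        raise ValueError("no candidates to choose from")
    return max(history, key=lambda item: (item["test_score"], item["train_score"]))


# Hypothetical loop history for illustration.
history = [
    {"description": "v1", "train_score": 0.70, "test_score": 0.65},
    {"description": "v2", "train_score": 0.95, "test_score": 0.60},
    {"description": "v3", "train_score": 0.85, "test_score": 0.80},
]
print(pick_description(history)["description"])
```

Here `v2` has the best training score but the worst held-out score, so it is passed over in favor of `v3`.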
|
|
82
|
+
|
|
83
|
+
## Step 4. Apply the result carefully
|
|
84
|
+
|
|
85
|
+
Take the best description from the loop, update frontmatter, then show the user:
|
|
86
|
+
|
|
87
|
+
- before
|
|
88
|
+
- after
|
|
89
|
+
- relevant scores or win rate
|
|
90
|
+
|
|
91
|
+
Do not silently swap descriptions with no explanation.
|
|
92
|
+
|
|
93
|
+
## Query Design Rules
|
|
94
|
+
|
|
95
|
+
For positive cases:
|
|
96
|
+
|
|
97
|
+
- vary phrasing from formal to casual
|
|
98
|
+
- include cases where the user does not explicitly name the skill
|
|
99
|
+
- include edge cases where this skill should win over a competing one
|
|
100
|
+
|
|
101
|
+
For negative cases:
|
|
102
|
+
|
|
103
|
+
- use adjacent domains and semantic overlaps
|
|
104
|
+
- include ambiguous prompts a naive keyword match would misroute
|
|
105
|
+
- avoid "obviously unrelated" filler
|
|
106
|
+
|
|
107
|
+
## Host Notes
|
|
108
|
+
|
|
109
|
+
On hosts that do not expose a shell-accessible backend for trigger evaluation,
|
|
110
|
+
skip the automated loop and do a manual description review instead.
|
|
111
|
+
|
|
112
|
+
On Codex-like judged backends, treat the results as a routing proxy rather than
|
|
113
|
+
a perfect measurement of native host behavior.
|
|
package/skills/conversation-to-skill/references/evaluation-suite.md
@@ -0,0 +1,410 @@
|
|
|
1
|
+
# Evaluation Suite
|
|
2
|
+
|
|
3
|
+
Use this reference whenever `conversation-to-skill` decides a skill should be
|
|
4
|
+
evaluated rather than merely drafted.
|
|
5
|
+
|
|
6
|
+
This is the full evaluation suite.
|
|
7
|
+
Do not reduce it to "run one prompt and eyeball it" unless the host truly
|
|
8
|
+
cannot support more.
|
|
9
|
+
|
|
10
|
+
This skill bundles the required evaluation support files locally:
|
|
11
|
+
|
|
12
|
+
- `agents/grader.md`
|
|
13
|
+
- `agents/comparator.md`
|
|
14
|
+
- `agents/analyzer.md`
|
|
15
|
+
- `eval-viewer/generate_review.py`
|
|
16
|
+
- `assets/eval_review.html`
|
|
17
|
+
- `scripts/*.py`
|
|
18
|
+
- `references/compatibility.md`
|
|
19
|
+
- `references/schemas.md`
|
|
20
|
+
|
|
21
|
+
Prefer these local copies over reaching into another skill directory.
|
|
22
|
+
|
|
23
|
+
## When To Use This
|
|
24
|
+
|
|
25
|
+
Use the full suite when at least one of these is true:
|
|
26
|
+
|
|
27
|
+
- the user asks to test, benchmark, compare, or improve a skill
|
|
28
|
+
- the skill has objectively checkable outputs
|
|
29
|
+
- the first draft is clearly weak and needs iteration
|
|
30
|
+
- description quality or trigger accuracy matters
|
|
31
|
+
|
|
32
|
+
You may skip the full suite when the user explicitly wants a draft only, or when
|
|
33
|
+
the skill output is so subjective that formal grading would only give false precision.
|
|
34
|
+
|
|
35
|
+
## Layout
|
|
36
|
+
|
|
37
|
+
Choose the skill location first, then set up evaluation paths around it.
|
|
38
|
+
|
|
39
|
+
- Global skill: `~/.agents/skills/<skill-name>`
|
|
40
|
+
- Project-based skill: `<project-path>/.agents/skills/<skill-name>`
|
|
41
|
+
- Eval workspace: sibling directory named `<skill-name>-workspace/`
|
|
42
|
+
|
|
43
|
+
Within the workspace, organize by iteration:
|
|
44
|
+
|
|
45
|
+
```text
|
|
46
|
+
<skill-name>-workspace/
|
|
47
|
+
├── skill-snapshot/ # optional baseline snapshot for existing skill
|
|
48
|
+
├── iteration-1/
|
|
49
|
+
│ ├── eval-0-<name>/
|
|
50
|
+
│ ├── eval-1-<name>/
|
|
51
|
+
│ └── benchmark.json
|
|
52
|
+
└── iteration-2/
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
Within each eval directory, keep run outputs separated:
|
|
56
|
+
|
|
57
|
+
```text
|
|
58
|
+
eval-0-descriptive-name/
|
|
59
|
+
├── eval_metadata.json
|
|
60
|
+
├── with_skill/
|
|
61
|
+
│ ├── outputs/
|
|
62
|
+
│ ├── grading.json
|
|
63
|
+
│ └── timing.json
|
|
64
|
+
└── without_skill/ # or old_skill/
|
|
65
|
+
├── outputs/
|
|
66
|
+
├── grading.json
|
|
67
|
+
└── timing.json
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
Do not create every directory upfront.
|
|
71
|
+
Create only what the current iteration needs.
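
One way to follow this rule is to create a run's `outputs/` directory only at the moment a run needs it. The sketch below mirrors the layout shown above; the helper name `run_output_dir` is hypothetical and not part of the bundled scripts.

```python
import tempfile
from pathlib import Path

VARIANTS = {"with_skill", "without_skill", "old_skill"}


def run_output_dir(workspace: Path, iteration: int, eval_id: int, eval_name: str, variant: str) -> Path:
    """Create and return the outputs directory for a single run, on demand."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    path = (
        workspace
        / f"iteration-{iteration}"
        / f"eval-{eval_id}-{eval_name}"
        / variant
        / "outputs"
    )
    # parents=True builds only the directories this one run requires.
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "example-skill-workspace"
    created = run_output_dir(workspace, 1, 0, "descriptive-name", "with_skill")
    print(created.relative_to(workspace))
    demo_exists = created.is_dir()
    demo_baseline_exists = (created.parent.parent / "without_skill").exists()
```

Because nothing is created until it is requested, the baseline directory does not appear until the baseline run actually asks for it.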
|
|
72
|
+
|
|
73
|
+
## Before Running Anything
|
|
74
|
+
|
|
75
|
+
### 1. Write realistic test prompts
|
|
76
|
+
|
|
77
|
+
Create 2-3 realistic prompts that a real user would plausibly type.
|
|
78
|
+
Share them with the user when that feedback would help.
|
|
79
|
+
|
|
80
|
+
Save them to `evals/evals.json`:
|
|
81
|
+
|
|
82
|
+
```json
|
|
83
|
+
{
|
|
84
|
+
"skill_name": "example-skill",
|
|
85
|
+
"evals": [
|
|
86
|
+
{
|
|
87
|
+
"id": 1,
|
|
88
|
+
"prompt": "User's task prompt",
|
|
89
|
+
"expected_output": "Description of expected result",
|
|
90
|
+
"files": []
|
|
91
|
+
}
|
|
92
|
+
]
|
|
93
|
+
}
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
Do not write assertions yet.
|
|
97
|
+
You will draft them while the runs are in progress.
|
|
98
|
+
|
|
99
|
+
### 2. Decide the baseline
|
|
100
|
+
|
|
101
|
+
Use the right baseline:
|
|
102
|
+
|
|
103
|
+
- New skill: `without_skill`
|
|
104
|
+
- Existing skill being improved: snapshot the old version first, then compare against `old_skill`
|
|
105
|
+
|
|
106
|
+
If improving an existing skill, snapshot before editing:
|
|
107
|
+
|
|
108
|
+
```bash
|
|
109
|
+
cp -r <skill-path> <workspace>/skill-snapshot/
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
### 3. Create eval metadata
|
|
113
|
+
|
|
114
|
+
Each eval directory should contain an `eval_metadata.json`:
|
|
115
|
+
|
|
116
|
+
```json
|
|
117
|
+
{
|
|
118
|
+
"eval_id": 0,
|
|
119
|
+
"eval_name": "descriptive-name-here",
|
|
120
|
+
"prompt": "The user's task prompt",
|
|
121
|
+
"assertions": []
|
|
122
|
+
}
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
Use descriptive eval names.
|
|
126
|
+
Avoid generic names like `eval-0` when a better label exists.
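
A short helper can keep eval names descriptive and consistent with the directory layout. This is a minimal sketch using the fields from the example above; `slugify` and `build_eval_metadata` are hypothetical names, not functions from the bundled scripts.

```python
import json
import re


def slugify(label: str) -> str:
    """Turn a human-readable label into a lowercase, hyphenated name."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def build_eval_metadata(eval_id: int, label: str, prompt: str) -> dict:
    """Build the eval_metadata.json payload with an empty assertions list."""
    name = slugify(label)
    if not name:
        raise ValueError("label must contain at least one letter or digit")
    return {
        "eval_id": eval_id,
        "eval_name": name,
        "prompt": prompt,
        "assertions": [],
    }


metadata = build_eval_metadata(0, "Quarterly Report: CSV to chart", "The user's task prompt")
print(json.dumps(metadata, indent=2))
```

The `assertions` list starts empty on purpose, since assertions are drafted while the runs are in progress.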
|
|
127
|
+
|
|
128
|
+
## Running The Suite
|
|
129
|
+
|
|
130
|
+
This sequence is continuous.
|
|
131
|
+
Do not stop after spawning runs or after creating the benchmark.
|
|
132
|
+
|
|
133
|
+
### Step 1. Spawn all runs in the same turn
|
|
134
|
+
|
|
135
|
+
For each test case, start both variants at once:
|
|
136
|
+
|
|
137
|
+
- one run with the skill
|
|
138
|
+
- one baseline run without the skill, or against the old snapshot
|
|
139
|
+
|
|
140
|
+
Do not launch all with-skill runs first and defer baselines.
|
|
141
|
+
Parallel launch keeps the comparison cleaner and faster.
|
|
142
|
+
|
|
143
|
+
Use prompts shaped like:
|
|
144
|
+
|
|
145
|
+
```text
|
|
146
|
+
Execute this task:
|
|
147
|
+
- Skill path: <path-to-skill>
|
|
148
|
+
- Task: <eval prompt>
|
|
149
|
+
- Input files: <eval files if any, or "none">
|
|
150
|
+
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
|
|
151
|
+
- Outputs to save: <what the user actually cares about>
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
Baseline run:
|
|
155
|
+
|
|
156
|
+
```text
|
|
157
|
+
Execute this task:
|
|
158
|
+
- Skill path: none # or old snapshot path
|
|
159
|
+
- Task: <eval prompt>
|
|
160
|
+
- Input files: <eval files if any, or "none">
|
|
161
|
+
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/<baseline>/outputs/
|
|
162
|
+
- Outputs to save: <same deliverables as with-skill>
|
|
163
|
+
```
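
Generating both prompts from one function helps keep the two variants identical apart from the skill path and the output directory. The sketch below reproduces the two prompt shapes above; the function names and argument names are hypothetical.

```python
def build_run_prompt(task, output_dir, deliverables, skill_path=None, input_files=None):
    """Render the run prompt for one variant of one eval."""
    files = ", ".join(input_files) if input_files else "none"
    lines = [
        "Execute this task:",
        f"- Skill path: {skill_path or 'none'}",
        f"- Task: {task}",
        f"- Input files: {files}",
        f"- Save outputs to: {output_dir}",
        f"- Outputs to save: {deliverables}",
    ]
    return "\n".join(lines)


def build_prompt_pair(workspace, iteration, eval_dir, task, deliverables,
                      skill_path, baseline="without_skill", baseline_skill_path=None):
    """Return (with_skill_prompt, baseline_prompt) for the same task."""
    base = f"{workspace}/iteration-{iteration}/{eval_dir}"
    with_skill = build_run_prompt(
        task, f"{base}/with_skill/outputs/", deliverables, skill_path=skill_path
    )
    baseline_prompt = build_run_prompt(
        task, f"{base}/{baseline}/outputs/", deliverables, skill_path=baseline_skill_path
    )
    return with_skill, baseline_prompt


with_prompt, base_prompt = build_prompt_pair(
    "/tmp/example-skill-workspace", 1, "eval-0-descriptive-name",
    "Summarize the attached notes", "summary.md", "/skills/example-skill",
)
print(with_prompt)
print(base_prompt)
```

For an existing skill, pass `baseline="old_skill"` and set `baseline_skill_path` to the snapshot directory.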
|
|
164
|
+
|
|
165
|
+
### Step 2. Draft assertions while runs are executing
|
|
166
|
+
|
|
167
|
+
Do not wait idly.
|
|
168
|
+
While runs are in progress:
|
|
169
|
+
|
|
170
|
+
- draft assertions for each eval
|
|
171
|
+
- update `eval_metadata.json`
|
|
172
|
+
- update `evals/evals.json`
|
|
173
|
+
- explain to the user what each assertion checks if that context matters
|
|
174
|
+
|
|
175
|
+
Good assertions are:
|
|
176
|
+
|
|
177
|
+
- objective
|
|
178
|
+
- descriptive
|
|
179
|
+
- easy to understand in the benchmark viewer
|
|
180
|
+
|
|
181
|
+
Bad assertions are vague or subjective.
|
|
182
|
+
If quality depends on human judgment, keep that part qualitative.
|
|
183
|
+
|
|
184
|
+
### Step 3. Capture timing as runs finish
|
|
185
|
+
|
|
186
|
+
When each run completes, capture the timing data immediately:
|
|
187
|
+
|
|
188
|
+
```json
|
|
189
|
+
{
|
|
190
|
+
"total_tokens": 84852,
|
|
191
|
+
"duration_ms": 23332,
|
|
192
|
+
"total_duration_seconds": 23.3
|
|
193
|
+
}
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
Save it to `timing.json` in the run directory.
|
|
197
|
+
Do not delay this step if the host exposes timing only in completion
|
|
198
|
+
notifications.
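
Writing the file as soon as a run reports completion can be as small as the sketch below. It assumes the host supplies a token count and a duration in milliseconds; `save_timing` is a hypothetical helper, and the seconds value is derived from `duration_ms` rounded to one decimal place, as in the example above.

```python
import json
import tempfile
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Write timing.json into the run directory and return the payload."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "timing.json").write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return timing


with tempfile.TemporaryDirectory() as tmp:
    demo_run_dir = Path(tmp) / "eval-0-descriptive-name" / "with_skill"
    demo_timing = save_timing(demo_run_dir, 84852, 23332)
    demo_saved = json.loads((demo_run_dir / "timing.json").read_text(encoding="utf-8"))
    print(demo_saved)
```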
|
|
199
|
+
|
|
200
|
+
### Step 4. Grade each run
|
|
201
|
+
|
|
202
|
+
Evaluate each assertion against the outputs and save `grading.json`.
|
|
203
|
+
|
|
204
|
+
The expectations array must use these exact fields:
|
|
205
|
+
|
|
206
|
+
- `text`
|
|
207
|
+
- `passed`
|
|
208
|
+
- `evidence`
|
|
209
|
+
|
|
210
|
+
Example:
|
|
211
|
+
|
|
212
|
+
```json
|
|
213
|
+
{
|
|
214
|
+
"expectations": [
|
|
215
|
+
{
|
|
216
|
+
"text": "Output includes a trigger-oriented description",
|
|
217
|
+
"passed": true,
|
|
218
|
+
"evidence": "Frontmatter description mentions both capability and trigger contexts."
|
|
219
|
+
}
|
|
220
|
+
]
|
|
221
|
+
}
|
|
222
|
+
```
|
|
223
|
+
|
|
224
|
+
If an assertion can be checked programmatically, prefer a script over manual
|
|
225
|
+
inspection.
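
A scripted check can emit an expectation entry directly in the required shape. The following is a minimal sketch that grades by regular-expression match; `grade_pattern` is a hypothetical helper, and real assertions may need richer checks than a single pattern.

```python
import json
import re


def grade_pattern(text: str, output: str, pattern: str) -> dict:
    """Grade one assertion by searching the output for a regex pattern."""
    match = re.search(pattern, output, flags=re.MULTILINE)
    if match:
        evidence = f"Matched {match.group(0)!r}"
    else:
        evidence = f"No match for pattern {pattern!r}"
    # The three keys below are the exact fields the expectations array requires.
    return {"text": text, "passed": match is not None, "evidence": evidence}


skill_output = "---\nname: example-skill\ndescription: Use when the user asks for a summary\n---\n"
grading = {
    "expectations": [
        grade_pattern("Frontmatter includes a description", skill_output, r"^description: .+$"),
        grade_pattern("Frontmatter includes a license", skill_output, r"^license: .+$"),
    ]
}
print(json.dumps(grading, indent=2))
```

Recording the matched text as evidence keeps the grade auditable in the benchmark viewer.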
|
|
226
|
+
|
|
227
|
+
### Step 5. Aggregate into a benchmark
|
|
228
|
+
|
|
229
|
+
Once all runs are graded, aggregate the iteration into benchmark artifacts.
|
|
230
|
+
|
|
231
|
+
Run the local aggregation script from the skill root:
|
|
232
|
+
|
|
233
|
+
```bash
|
|
234
|
+
python scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>
|
|
235
|
+
```
|
|
236
|
+
|
|
237
|
+
Expected outputs:
|
|
238
|
+
|
|
239
|
+
- `benchmark.json`
|
|
240
|
+
- `benchmark.md`
|
|
241
|
+
|
|
242
|
+
Put each `with_skill` result before its baseline counterpart in summaries so the
|
|
243
|
+
comparison is easy to scan.
|
|
244
|
+
|
|
245
|
+
### Step 6. Do an analyst pass
|
|
246
|
+
|
|
247
|
+
Read the benchmark and look for patterns that averages hide:
|
|
248
|
+
|
|
249
|
+
- assertions that always pass regardless of the skill
|
|
250
|
+
- flaky or high-variance evals
|
|
251
|
+
- speed or token regressions that are not buying quality
|
|
252
|
+
- cases where the skill only improves one narrow prompt
|
|
253
|
+
|
|
254
|
+
If a metric is non-discriminating, change the eval design rather than pretending
|
|
255
|
+
it is useful.
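
The first pattern in the list above, assertions that pass regardless of the skill, can be detected mechanically. This sketch assumes each `grading.json` has the `expectations` shape shown in Step 4; the function name is hypothetical, and `scripts/aggregate_benchmark.py` may already report something similar.

```python
from collections import defaultdict


def non_discriminating_assertions(gradings: list) -> list:
    """Return assertion texts that passed in every run where they appear.

    Pass in the gradings of both the with-skill and baseline runs. An
    assertion that passes everywhere says nothing about the skill.
    """
    outcomes = defaultdict(list)
    for grading in gradings:
        for expectation in grading["expectations"]:
            outcomes[expectation["text"]].append(bool(expectation["passed"]))
    return sorted(
        text for text, results in outcomes.items()
        if len(results) > 1 and all(results)
    )


with_skill = {"expectations": [
    {"text": "Output is valid Markdown", "passed": True, "evidence": "parsed"},
    {"text": "Includes axis labels", "passed": True, "evidence": "found labels"},
]}
baseline = {"expectations": [
    {"text": "Output is valid Markdown", "passed": True, "evidence": "parsed"},
    {"text": "Includes axis labels", "passed": False, "evidence": "no labels"},
]}
print(non_discriminating_assertions([with_skill, baseline]))
```

In this example only the Markdown check is flagged, because the axis-label check separates the two variants.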
|
|
256
|
+
|
|
257
|
+
### Step 7. Generate the review viewer
|
|
258
|
+
|
|
259
|
+
Do not stop at `benchmark.json`.
|
|
260
|
+
Always generate a reviewable artifact for the human.
|
|
261
|
+
|
|
262
|
+
Generate the review viewer with the local bundled script:
|
|
263
|
+
|
|
264
|
+
```bash
|
|
265
|
+
nohup python eval-viewer/generate_review.py \
|
|
266
|
+
<workspace>/iteration-N \
|
|
267
|
+
--skill-name "<name>" \
|
|
268
|
+
--benchmark <workspace>/iteration-N/benchmark.json \
|
|
269
|
+
> /dev/null 2>&1 &
|
|
270
|
+
VIEWER_PID=$!
|
|
271
|
+
```
|
|
272
|
+
|
|
273
|
+
For iteration 2+, also pass:
|
|
274
|
+
|
|
275
|
+
```bash
|
|
276
|
+
--previous-workspace <workspace>/iteration-<N-1>
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
If a synthetic benchmark has produced no real outputs yet, add a minimal file such as
|
|
280
|
+
`outputs/summary.md` so the viewer has something to render.
|
|
281
|
+
|
|
282
|
+
### Step 8. Hand the viewer to the user in the same turn
|
|
283
|
+
|
|
284
|
+
Do not make the user ask for the results viewer later.
|
|
285
|
+
Tell them where it is immediately.
|
|
286
|
+
|
|
287
|
+
If a server-backed viewer is available, say it is open and explain the two main
|
|
288
|
+
tabs:
|
|
289
|
+
|
|
290
|
+
- `Outputs`: prompt, output artifacts, grades, and feedback box
|
|
291
|
+
- `Benchmark`: pass rate, time, token usage, and analyst observations
|
|
292
|
+
|
|
293
|
+
### Step 9. Read feedback and iterate
|
|
294
|
+
|
|
295
|
+
When the user finishes review, read `feedback.json`:
|
|
296
|
+
|
|
297
|
+
```json
|
|
298
|
+
{
|
|
299
|
+
"reviews": [
|
|
300
|
+
{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
|
|
301
|
+
{"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
|
|
302
|
+
],
|
|
303
|
+
"status": "complete"
|
|
304
|
+
}
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
Empty feedback usually means the user found that case acceptable.
|
|
308
|
+
Focus changes on cases with specific complaints.
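
Pulling out only the reviews that carry a complaint can be sketched as follows, assuming the `feedback.json` shape shown above; `reviews_needing_changes` is a hypothetical helper.

```python
import json


def reviews_needing_changes(feedback: dict) -> list:
    """Return only the reviews whose feedback text is non-empty."""
    return [
        review for review in feedback.get("reviews", [])
        if review.get("feedback", "").strip()
    ]


raw_feedback = """
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}
"""
feedback = json.loads(raw_feedback)
for review in reviews_needing_changes(feedback):
    print(review["run_id"], "->", review["feedback"])
```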
|
|
309
|
+
|
|
310
|
+
If you started a viewer server, stop it afterwards:
|
|
311
|
+
|
|
312
|
+
```bash
|
|
313
|
+
kill $VIEWER_PID 2>/dev/null
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
## Improvement Loop
|
|
317
|
+
|
|
318
|
+
Use the feedback to improve the skill, not to overfit it.
|
|
319
|
+
|
|
320
|
+
Priorities:
|
|
321
|
+
|
|
322
|
+
1. generalize from repeated failures instead of patching one exact prompt
|
|
323
|
+
2. remove instructions that create busywork without quality gains
|
|
324
|
+
3. explain why important behavior matters
|
|
325
|
+
4. package repeated deterministic work into scripts when multiple runs rediscover it
|
|
326
|
+
|
|
327
|
+
After revising the skill:
|
|
328
|
+
|
|
329
|
+
1. rerun all evals into `iteration-<N+1>/`
|
|
330
|
+
2. keep the same baseline logic unless there is a clear reason to change it
|
|
331
|
+
3. regenerate the viewer with `--previous-workspace`
|
|
332
|
+
4. collect user feedback again
|
|
333
|
+
5. repeat until quality is acceptable or progress stalls
|
|
334
|
+
|
|
335
|
+
Stop when:
|
|
336
|
+
|
|
337
|
+
- the user is happy
|
|
338
|
+
- feedback is empty across the board
|
|
339
|
+
- the skill is no longer improving meaningfully
|
|
340
|
+
|
|
341
|
+
## Blind Comparison
|
|
342
|
+
|
|
343
|
+
If the user specifically wants a more rigorous A/B comparison between two skill
|
|
344
|
+
versions, run a blind comparison:
|
|
345
|
+
|
|
346
|
+
- give two outputs to an independent grader without saying which is which
|
|
347
|
+
- have it judge quality
|
|
348
|
+
- analyze why the winner won
|
|
349
|
+
|
|
350
|
+
This is optional.
|
|
351
|
+
Use it when normal human review is not enough.
|
|
352
|
+
|
|
353
|
+
## Host-Specific Adaptation
|
|
354
|
+
|
|
355
|
+
### Chat-only host
|
|
356
|
+
|
|
357
|
+
If the host has no subagents:
|
|
358
|
+
|
|
359
|
+
- run test cases one by one
|
|
360
|
+
- skip baseline if independent comparison is impossible
|
|
361
|
+
- present outputs directly in chat or save them for the user to inspect
|
|
362
|
+
- focus on qualitative feedback
|
|
363
|
+
- skip benchmarking if it would be fake rigor
|
|
364
|
+
|
|
365
|
+
### Headless worker host
|
|
366
|
+
|
|
367
|
+
If the host has no browser or display:
|
|
368
|
+
|
|
369
|
+
- still run the full evaluation workflow
|
|
370
|
+
- generate a static HTML review artifact instead of opening a live viewer
|
|
371
|
+
- provide the exact output path to the user
|
|
372
|
+
- expect feedback to arrive as a downloaded `feedback.json`
|
|
373
|
+
|
|
374
|
+
For headless review generation, prefer:
|
|
375
|
+
|
|
376
|
+
```bash
|
|
377
|
+
python eval-viewer/generate_review.py \
|
|
378
|
+
<workspace>/iteration-N \
|
|
379
|
+
--skill-name "<name>" \
|
|
380
|
+
--benchmark <workspace>/iteration-N/benchmark.json \
|
|
381
|
+
--static <workspace>/iteration-N/review.html
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
## Packaging
|
|
385
|
+
|
|
386
|
+
If the user wants a distributable skill package and the appropriate tooling is
|
|
387
|
+
available, package it after the skill is stable.
|
|
388
|
+
|
|
389
|
+
Use the local bundled script:
|
|
390
|
+
|
|
391
|
+
```bash
|
|
392
|
+
python scripts/package_skill.py <path/to/skill-folder>
|
|
393
|
+
```
|
|
394
|
+
|
|
395
|
+
## Final Rule
|
|
396
|
+
|
|
397
|
+
If you chose evaluation, follow through.
|
|
398
|
+
Do not stop after writing prompts.
|
|
399
|
+
Do not stop after generating benchmarks.
|
|
400
|
+
Do not stop after revising the skill once.
|
|
401
|
+
|
|
402
|
+
The full suite is:
|
|
403
|
+
|
|
404
|
+
- draft or revise the skill
|
|
405
|
+
- run test cases
|
|
406
|
+
- grade and benchmark them
|
|
407
|
+
- generate a human-review artifact
|
|
408
|
+
- collect feedback
|
|
409
|
+
- improve the skill
|
|
410
|
+
- rerun and compare again
|