@rudderhq/agent-runtime-gemini-local 0.2.1 → 0.2.2-canary.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. package/package.json +2 -2
  2. package/skills/conversation-to-skill/LICENSE.txt +202 -0
  3. package/skills/conversation-to-skill/SKILL.md +428 -0
  4. package/skills/conversation-to-skill/agents/analyzer.md +274 -0
  5. package/skills/conversation-to-skill/agents/comparator.md +202 -0
  6. package/skills/conversation-to-skill/agents/grader.md +223 -0
  7. package/skills/conversation-to-skill/assets/eval_review.html +146 -0
  8. package/skills/conversation-to-skill/eval-viewer/generate_review.py +471 -0
  9. package/skills/conversation-to-skill/eval-viewer/viewer.html +1325 -0
  10. package/skills/conversation-to-skill/references/compatibility.md +36 -0
  11. package/skills/conversation-to-skill/references/description-optimization.md +113 -0
  12. package/skills/conversation-to-skill/references/evaluation-suite.md +410 -0
  13. package/skills/conversation-to-skill/references/schemas.md +431 -0
  14. package/skills/conversation-to-skill/scripts/__init__.py +0 -0
  15. package/skills/conversation-to-skill/scripts/aggregate_benchmark.py +401 -0
  16. package/skills/conversation-to-skill/scripts/generate_report.py +335 -0
  17. package/skills/conversation-to-skill/scripts/improve_description.py +197 -0
  18. package/skills/conversation-to-skill/scripts/model_backends.py +115 -0
  19. package/skills/conversation-to-skill/scripts/package_skill.py +136 -0
  20. package/skills/conversation-to-skill/scripts/quick_validate.py +103 -0
  21. package/skills/conversation-to-skill/scripts/run_eval.py +363 -0
  22. package/skills/conversation-to-skill/scripts/run_loop.py +319 -0
  23. package/skills/conversation-to-skill/scripts/utils.py +223 -0
  24. package/skills/rudder/references/organization-skills.md +1 -1
  25. package/skills/skill-creator/SKILL.md +9 -0
  26. package/skills/skill-optimizer/CHANGELOG.md +29 -0
  27. package/skills/skill-optimizer/SKILL.md +205 -0
  28. package/skills/skill-optimizer/references/adapters/creative-brand-content.md +30 -0
  29. package/skills/skill-optimizer/references/adapters/customer-support-sales.md +30 -0
  30. package/skills/skill-optimizer/references/adapters/document-data-processing.md +31 -0
  31. package/skills/skill-optimizer/references/adapters/education-training.md +31 -0
  32. package/skills/skill-optimizer/references/adapters/finance-accounting.md +31 -0
  33. package/skills/skill-optimizer/references/adapters/healthcare-operations.md +30 -0
  34. package/skills/skill-optimizer/references/adapters/hr-people-ops.md +31 -0
  35. package/skills/skill-optimizer/references/adapters/legal-compliance.md +31 -0
  36. package/skills/skill-optimizer/references/adapters/operations-supply-chain.md +31 -0
  37. package/skills/skill-optimizer/references/adapters/personal-productivity.md +29 -0
  38. package/skills/skill-optimizer/references/adapters/research-knowledge.md +31 -0
  39. package/skills/skill-optimizer/references/adapters/software-ai.md +31 -0
  40. package/skills/skill-optimizer/references/domain-adapter-patterns.md +66 -0
  41. package/skills/skill-optimizer/references/eval-method.md +17 -0
  42. package/skills/skill-optimizer/references/universal-optimization-lens.md +73 -0
@@ -0,0 +1,36 @@
1
+ # Host Compatibility
2
+
3
+ This skill is designed to work across multiple agent hosts. The workflow stays
4
+ the same, but some mechanics change.
5
+
6
+ ## Capability Matrix
7
+
8
+ | Host | Draft skill | Run evals | Parallel baseline | Description tuning | Packaging |
9
+ |------|-------------|-----------|-------------------|--------------------|-----------|
10
+ | Codex | Yes | Yes, judged via `codex exec` | Usually yes | Yes | Yes |
11
+ | Claude Code | Yes | Yes, observed via `claude -p` | Yes | Yes | Yes |
12
+ | Claude.ai | Yes | Manual/serial | No | Usually no | Yes |
13
+ | Generic shell agent | Yes | Yes if a CLI exists | Depends | Depends | Yes |
14
+
15
+ ## Important Differences
16
+
17
+ - **Claude Code**
18
+ - `scripts/run_eval.py --backend claude` measures real observed triggering.
19
+ - This is the highest-fidelity description benchmark.
20
+
21
+ - **Codex**
22
+ - `scripts/run_eval.py --backend codex` measures judged routing, not native
23
+ skill invocation. It is still useful for testing whether the description
24
+ makes the intended use cases obvious.
25
+ - Treat Codex trigger numbers as a proxy, not ground truth.
26
+
27
+ - **Claude.ai or chat-only hosts**
28
+ - Skip automated trigger benchmarks unless you have a shell + CLI bridge.
29
+ - Do serial manual evals and focus on qualitative output review.
30
+
31
+ ## Recommended Defaults
32
+
33
+ - If you are improving instructions, tool choices, examples, or bundled scripts:
34
+ qualitative evals matter most.
35
+ - If you are improving trigger phrasing:
36
+ prefer `--backend claude` when available, otherwise `--backend codex`.
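The default above ("prefer `--backend claude` when available, otherwise `--backend codex`") can be sketched as a small lookup. This is a minimal illustration, not the logic of `scripts/run_eval.py`: the helper name `pick_backend` is hypothetical, and it assumes the two CLIs are installed under the executable names `claude` and `codex` used in the capability matrix.

```python
import shutil
from typing import Callable, Optional


def pick_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return 'claude' if that CLI is on PATH, else 'codex', else None.

    `which` is injectable so the preference order can be checked without
    either CLI being installed.
    """
    for name in ("claude", "codex"):  # order encodes the stated preference
        if which(name) is not None:
            return name
    return None  # no shell-accessible backend: fall back to manual review


backend = pick_backend()
print(backend or "no CLI backend found; use serial manual evals")
```

Returning `None` instead of raising keeps the chat-only case explicit, so the caller can switch to the manual path described for Claude.ai.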
@@ -0,0 +1,113 @@
1
+ # Description Optimization
2
+
3
+ Use this reference when the skill itself is already decent but its frontmatter
4
+ description may under-trigger or over-trigger.
5
+
6
+ The description is the primary routing surface.
7
+ Treat optimization as a real evaluation problem, not copy editing.
8
+
9
+ ## When To Use This
10
+
11
+ Use description optimization when:
12
+
13
+ - the user asks to improve triggering
14
+ - the skill reads well but is not being invoked reliably
15
+ - the skill is firing on near-miss prompts
16
+ - you have already finished at least a solid draft of the skill
17
+
18
+ Do not optimize the description first if the skill body is still weak.
19
+ Fix the capability before tuning the routing surface.
20
+
21
+ ## Step 1. Generate trigger eval queries
22
+
23
+ Create roughly 20 realistic queries split between:
24
+
25
+ - `should_trigger: true`
26
+ - `should_trigger: false`
27
+
28
+ Save them as JSON:
29
+
30
+ ```json
31
+ [
32
+ {"query": "the user prompt", "should_trigger": true},
33
+ {"query": "another prompt", "should_trigger": false}
34
+ ]
35
+ ```
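Before handing the set to the loop, it can help to check that every entry has the two fields shown above and that both labels are represented. The following is a minimal sketch; the function name `validate_trigger_evals` is hypothetical and is not part of the bundled scripts.

```python
import json
from collections import Counter


def validate_trigger_evals(raw: str) -> Counter:
    """Parse a trigger eval set and count positive vs. negative cases."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("eval set must be a JSON array")
    counts: Counter = Counter()
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {i}: expected an object")
        query = item.get("query")
        label = item.get("should_trigger")
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"item {i}: 'query' must be a non-empty string")
        if not isinstance(label, bool):
            raise ValueError(f"item {i}: 'should_trigger' must be true or false")
        counts[label] += 1
    return counts


sample = """[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]"""
counts = validate_trigger_evals(sample)
print(f"positive={counts[True]} negative={counts[False]}")  # positive=1 negative=1
```

A lopsided count is a hint that the set needs more near-miss negatives or more unprompted positives.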
36
+
37
+ The queries should feel like real user inputs, not toy examples.
38
+ Include concrete details such as file names, work context, URLs, messy wording,
39
+ typos, abbreviations, or ambiguous phrasing.
40
+
41
+ Bad negative cases are obviously irrelevant.
42
+ Good negative cases are near-misses that share vocabulary but should route
43
+ somewhere else.
44
+
45
+ ## Step 2. Review the query set with the user
46
+
47
+ Before running the loop, let the user review the query set.
48
+ Use the local bundled HTML review asset in `assets/eval_review.html` to present
49
+ and edit the eval set.
50
+
51
+ The template workflow is:
52
+
53
+ 1. load the HTML template
54
+ 2. inject eval data, skill name, and current description
55
+ 3. write a temp HTML file
56
+ 4. open it for the user
57
+ 5. read the exported eval set from Downloads
58
+
59
+ This step matters because poor trigger queries produce poor descriptions.
60
+
61
+ ## Step 3. Run the optimization loop
62
+
63
+ Run the local bundled optimization loop from the skill root:
64
+
65
+ ```bash
66
+ python scripts/run_loop.py \
67
+ --eval-set <path-to-trigger-eval.json> \
68
+ --skill-path <path-to-skill> \
69
+ --backend auto \
70
+ --max-iterations 5 \
71
+ --verbose
72
+ ```
73
+
74
+ If the backend supports model selection, use the current host model when that
75
+ keeps the results closer to the user's real experience.
76
+
77
+ While the loop runs:
78
+
79
+ - report progress to the user
80
+ - watch train and held-out test behavior
81
+ - avoid selecting descriptions based only on training performance
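The last point can be made concrete: rank candidates by held-out score and use the training score only as a tie-breaker. This is a sketch under assumptions. The `Candidate` record and the numbers are invented for illustration, and the actual output format of `scripts/run_loop.py` is not shown in this reference.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    description: str
    train_score: float  # fraction of training queries routed correctly
    test_score: float   # fraction of held-out queries routed correctly


def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick by held-out score first; training score only breaks ties."""
    return max(candidates, key=lambda c: (c.test_score, c.train_score))


# Illustrative values only, not measured results.
candidates = [
    Candidate("v1: broad wording", train_score=0.80, test_score=0.75),
    Candidate("v2: keyword-stuffed", train_score=0.95, test_score=0.60),
    Candidate("v3: trigger contexts spelled out", train_score=0.90, test_score=0.85),
]
best = select_best(candidates)
print(best.description)  # v3: trigger contexts spelled out
```

Here `v2` has the highest training score but the worst held-out score, which is the overfitting pattern the bullet warns about.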
82
+
83
+ ## Step 4. Apply the result carefully
84
+
85
+ Take the best description from the loop, update frontmatter, then show the user:
86
+
87
+ - before
88
+ - after
89
+ - relevant scores or win rate
90
+
91
+ Do not silently swap descriptions with no explanation.
92
+
93
+ ## Query Design Rules
94
+
95
+ For positive cases:
96
+
97
+ - vary phrasing from formal to casual
98
+ - include cases where the user does not explicitly name the skill
99
+ - include edge cases where this skill should win over a competing one
100
+
101
+ For negative cases:
102
+
103
+ - use adjacent domains and semantic overlaps
104
+ - include ambiguous prompts a naive keyword match would misroute
105
+ - avoid "obviously unrelated" filler
106
+
107
+ ## Host Notes
108
+
109
+ On hosts that do not expose a shell-accessible backend for trigger evaluation,
110
+ skip the automated loop and do a manual description review instead.
111
+
112
+ On Codex-like judged backends, treat the results as a routing proxy rather than
113
+ a perfect measurement of native host behavior.
@@ -0,0 +1,410 @@
1
+ # Evaluation Suite
2
+
3
+ Use this reference whenever `conversation-to-skill` decides a skill should be
4
+ evaluated rather than merely drafted.
5
+
6
+ This is the full evaluation suite.
7
+ Do not reduce it to "run one prompt and eyeball it" unless the host truly
8
+ cannot support more.
9
+
10
+ This skill bundles the required evaluation support files locally:
11
+
12
+ - `agents/grader.md`
13
+ - `agents/comparator.md`
14
+ - `agents/analyzer.md`
15
+ - `eval-viewer/generate_review.py`
16
+ - `assets/eval_review.html`
17
+ - `scripts/*.py`
18
+ - `references/compatibility.md`
19
+ - `references/schemas.md`
20
+
21
+ Prefer these local copies over reaching into another skill directory.
22
+
23
+ ## When To Use This
24
+
25
+ Use the full suite when at least one of these is true:
26
+
27
+ - the user asks to test, benchmark, compare, or improve a skill
28
+ - the skill has objectively checkable outputs
29
+ - the first draft is clearly weak and needs iteration
30
+ - description quality or trigger accuracy matters
31
+
32
+ You may skip the full suite when the user explicitly wants a draft only, or when
33
+ the skill output is so subjective that formal grading would be fake precision.
34
+
35
+ ## Layout
36
+
37
+ Choose the skill location first, then set up evaluation paths around it.
38
+
39
+ - Global skill: `~/.agents/skills/<skill-name>`
40
+ - Project-based skill: `<project-path>/.agents/skills/<skill-name>`
41
+ - Eval workspace: sibling directory named `<skill-name>-workspace/`
42
+
43
+ Within the workspace, organize by iteration:
44
+
45
+ ```text
46
+ <skill-name>-workspace/
47
+ ├── skill-snapshot/ # optional baseline snapshot for existing skill
48
+ ├── iteration-1/
49
+ │ ├── eval-0-<name>/
50
+ │ ├── eval-1-<name>/
51
+ │ └── benchmark.json
52
+ └── iteration-2/
53
+ ```
54
+
55
+ Within each eval directory, keep run outputs separated:
56
+
57
+ ```text
58
+ eval-0-descriptive-name/
59
+ ├── eval_metadata.json
60
+ ├── with_skill/
61
+ │ ├── outputs/
62
+ │ ├── grading.json
63
+ │ └── timing.json
64
+ └── without_skill/ # or old_skill/
65
+ ├── outputs/
66
+ ├── grading.json
67
+ └── timing.json
68
+ ```
69
+
70
+ Do not create every directory upfront.
71
+ Create only what the current iteration needs.
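One way to follow the "create only what is needed" rule is to build each run directory at the moment a run is launched. The helper below is a hypothetical sketch that mirrors the layout shown above; it is not part of the bundled scripts.

```python
import tempfile
from pathlib import Path

VARIANTS = {"with_skill", "without_skill", "old_skill"}


def run_outputs_dir(workspace: Path, iteration: int, eval_id: int,
                    eval_name: str, variant: str) -> Path:
    """Create and return the outputs/ directory for one run, on demand."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    outputs = (workspace / f"iteration-{iteration}"
               / f"eval-{eval_id}-{eval_name}" / variant / "outputs")
    outputs.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return outputs


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "example-skill-workspace"
    out = run_outputs_dir(workspace, 1, 0, "descriptive-name", "with_skill")
    print(out.relative_to(tmp).as_posix())
    created_baseline = (out.parent.parent / "without_skill").exists()
    print(created_baseline)  # False: the baseline directory is not made yet
```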
72
+
73
+ ## Before Running Anything
74
+
75
+ ### 1. Write realistic test prompts
76
+
77
+ Create 2-3 realistic prompts that a real user would plausibly type.
78
+ Share them with the user when that feedback would help.
79
+
80
+ Save them to `evals/evals.json`:
81
+
82
+ ```json
83
+ {
84
+ "skill_name": "example-skill",
85
+ "evals": [
86
+ {
87
+ "id": 1,
88
+ "prompt": "User's task prompt",
89
+ "expected_output": "Description of expected result",
90
+ "files": []
91
+ }
92
+ ]
93
+ }
94
+ ```
95
+
96
+ Do not write assertions yet.
97
+ You will draft them while the runs are in progress.
98
+
99
+ ### 2. Decide the baseline
100
+
101
+ Use the right comparator:
102
+
103
+ - New skill: `without_skill`
104
+ - Existing skill being improved: snapshot the old version first, then compare against `old_skill`
105
+
106
+ If improving an existing skill, snapshot before editing:
107
+
108
+ ```bash
109
+ cp -r <skill-path> <workspace>/skill-snapshot/
110
+ ```
111
+
112
+ ### 3. Create eval metadata
113
+
114
+ Each eval directory should contain an `eval_metadata.json`:
115
+
116
+ ```json
117
+ {
118
+ "eval_id": 0,
119
+ "eval_name": "descriptive-name-here",
120
+ "prompt": "The user's task prompt",
121
+ "assertions": []
122
+ }
123
+ ```
124
+
125
+ Use descriptive eval names.
126
+ Avoid generic names like `eval-0` when a better label exists.
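A metadata file can be derived from an entry in `evals/evals.json` plus a human-chosen label. The sketch below assumes the entry's `id` is reused as `eval_id`, which the reference does not state, and the helper names `slugify` and `build_eval_metadata` are hypothetical.

```python
import json
import re


def slugify(label: str) -> str:
    """Lowercase the label and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def build_eval_metadata(entry: dict, label: str) -> dict:
    """Build the eval_metadata.json payload; assertions are drafted later."""
    return {
        "eval_id": entry["id"],  # assumption: reuse the evals.json id
        "eval_name": slugify(label),
        "prompt": entry["prompt"],
        "assertions": [],
    }


entry = {
    "id": 1,
    "prompt": "User's task prompt",
    "expected_output": "Description of expected result",
    "files": [],
}
meta = build_eval_metadata(entry, "Quarterly report: CSV to chart")
print(json.dumps(meta, indent=2))
```

The label `Quarterly report: CSV to chart` becomes `quarterly-report-csv-to-chart`, which is easier to scan in the viewer than a bare index.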
127
+
128
+ ## Running The Suite
129
+
130
+ This sequence is continuous.
131
+ Do not stop after spawning runs or after creating the benchmark.
132
+
133
+ ### Step 1. Spawn all runs in the same turn
134
+
135
+ For each test case, start both variants at once:
136
+
137
+ - one run with the skill
138
+ - one baseline run without the skill, or against the old snapshot
139
+
140
+ Do not launch all with-skill runs first and defer baselines.
141
+ Parallel launch keeps the comparison cleaner and faster.
142
+
143
+ Use prompts shaped like:
144
+
145
+ ```text
146
+ Execute this task:
147
+ - Skill path: <path-to-skill>
148
+ - Task: <eval prompt>
149
+ - Input files: <eval files if any, or "none">
150
+ - Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
151
+ - Outputs to save: <what the user actually cares about>
152
+ ```
153
+
154
+ Baseline run:
155
+
156
+ ```text
157
+ Execute this task:
158
+ - Skill path: none # or old snapshot path
159
+ - Task: <eval prompt>
160
+ - Input files: <eval files if any, or "none">
161
+ - Save outputs to: <workspace>/iteration-<N>/eval-<ID>/<baseline>/outputs/
162
+ - Outputs to save: <same deliverables as with-skill>
163
+ ```
164
+
165
+ ### Step 2. Draft assertions while runs are executing
166
+
167
+ Do not wait idly.
168
+ While runs are in progress:
169
+
170
+ - draft assertions for each eval
171
+ - update `eval_metadata.json`
172
+ - update `evals/evals.json`
173
+ - explain to the user what each assertion checks if that context matters
174
+
175
+ Good assertions are:
176
+
177
+ - objective
178
+ - descriptive
179
+ - easy to understand in the benchmark viewer
180
+
181
+ Bad assertions are vague or subjective.
182
+ If quality depends on human judgment, keep that part qualitative.
183
+
184
+ ### Step 3. Capture timing as runs finish
185
+
186
+ When each run completes, capture the timing data immediately:
187
+
188
+ ```json
189
+ {
190
+ "total_tokens": 84852,
191
+ "duration_ms": 23332,
192
+ "total_duration_seconds": 23.3
193
+ }
194
+ ```
195
+
196
+ Save it to `timing.json` in the run directory.
197
+ Do not delay this step if the host exposes timing only in completion
198
+ notifications.
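Writing the file as soon as a run reports completion might look like the sketch below. The helper name `write_timing` is hypothetical, and deriving `total_duration_seconds` from `duration_ms` by rounding to one decimal is an assumption that happens to match the example values above.

```python
import json
import tempfile
from pathlib import Path


def write_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Persist timing.json for one run and return the payload."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "timing.json").write_text(
        json.dumps(timing, indent=2), encoding="utf-8"
    )
    return timing


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    timing = write_timing(Path(tmp) / "with_skill", 84852, 23332)
    print(timing["total_duration_seconds"])  # 23.3
```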
199
+
200
+ ### Step 4. Grade each run
201
+
202
+ Evaluate each assertion against the outputs and save `grading.json`.
203
+
204
+ The expectations array must use these exact fields:
205
+
206
+ - `text`
207
+ - `passed`
208
+ - `evidence`
209
+
210
+ Example:
211
+
212
+ ```json
213
+ {
214
+ "expectations": [
215
+ {
216
+ "text": "Output includes a trigger-oriented description",
217
+ "passed": true,
218
+ "evidence": "Frontmatter description mentions both capability and trigger contexts."
219
+ }
220
+ ]
221
+ }
222
+ ```
223
+
224
+ If an assertion can be checked programmatically, prefer a script over manual
225
+ inspection.
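As an illustration of a scripted check, the sketch below tests whether an output file contains a given string and returns an expectation with the three required fields. The helper name `grade_file_contains` and the sample file contents are hypothetical; `agents/grader.md` defines the real grading procedure.

```python
import json
import tempfile
from pathlib import Path


def grade_file_contains(outputs: Path, filename: str, needle: str) -> dict:
    """Check one assertion and return a text/passed/evidence expectation."""
    text = f"{filename} contains '{needle}'"
    target = outputs / filename
    if not target.is_file():
        return {"text": text, "passed": False,
                "evidence": f"{filename} was not produced."}
    content = target.read_text(encoding="utf-8")
    index = content.find(needle)
    if index == -1:
        return {"text": text, "passed": False,
                "evidence": f"'{needle}' not found in {filename}."}
    line_no = content.count("\n", 0, index) + 1  # 1-based line of the match
    return {"text": text, "passed": True,
            "evidence": f"Found on line {line_no} of {filename}."}


# Demo against a throwaway outputs directory.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp)
    (outputs / "SKILL.md").write_text(
        "---\nname: example-skill\ndescription: Use when drafting skills\n---\n",
        encoding="utf-8",
    )
    grading = {"expectations": [
        grade_file_contains(outputs, "SKILL.md", "description:"),
        grade_file_contains(outputs, "README.md", "usage"),
    ]}
print(json.dumps(grading, indent=2))
```

Recording the matching line number as evidence lets a reviewer confirm the pass without rerunning the check.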
226
+
227
+ ### Step 5. Aggregate into a benchmark
228
+
229
+ Once all runs are graded, aggregate the iteration into benchmark artifacts.
230
+
231
+ Run the local aggregation script from the skill root:
232
+
233
+ ```bash
234
+ python scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>
235
+ ```
236
+
237
+ Expected outputs:
238
+
239
+ - `benchmark.json`
240
+ - `benchmark.md`
241
+
242
+ Put each `with_skill` result before its baseline counterpart in summaries so the
243
+ comparison is easy to scan.
244
+
245
+ ### Step 6. Do an analyst pass
246
+
247
+ Read the benchmark and look for patterns that averages hide:
248
+
249
+ - assertions that always pass regardless of the skill
250
+ - flaky or high-variance evals
251
+ - speed or token regressions that are not buying quality
252
+ - cases where the skill only improves one narrow prompt
253
+
254
+ If a metric is non-discriminating, change the eval design rather than pretending
255
+ it is useful.
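The first pattern in the list, assertions that pass regardless of the skill, is easy to detect mechanically. The sketch below takes pairs of `grading.json` payloads (with-skill, baseline) and reports assertion texts that passed in every run. The function name and the sample data are illustrative assumptions, not the behavior of `agents/analyzer.md`.

```python
def non_discriminating(pairs: list[tuple[dict, dict]]) -> list[str]:
    """Return assertion texts that passed in every with-skill and baseline run."""
    results: dict[str, list[bool]] = {}
    for with_skill, baseline in pairs:
        for grading in (with_skill, baseline):
            for exp in grading["expectations"]:
                results.setdefault(exp["text"], []).append(exp["passed"])
    return [text for text, passed in results.items() if all(passed)]


# Illustrative gradings for a single eval, not real benchmark output.
with_skill = {"expectations": [
    {"text": "Output file exists", "passed": True, "evidence": "..."},
    {"text": "Chart has axis labels", "passed": True, "evidence": "..."},
]}
baseline = {"expectations": [
    {"text": "Output file exists", "passed": True, "evidence": "..."},
    {"text": "Chart has axis labels", "passed": False, "evidence": "..."},
]}
print(non_discriminating([(with_skill, baseline)]))  # ['Output file exists']
```

An assertion flagged here tells you nothing about the skill, so it is a candidate for tightening or removal.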
256
+
257
+ ### Step 7. Generate the review viewer
258
+
259
+ Do not stop at `benchmark.json`.
260
+ Always generate a reviewable artifact for the human.
261
+
262
+ Generate the review viewer with the local bundled script:
263
+
264
+ ```bash
265
+ nohup python eval-viewer/generate_review.py \
266
+ <workspace>/iteration-N \
267
+ --skill-name "<name>" \
268
+ --benchmark <workspace>/iteration-N/benchmark.json \
269
+ > /dev/null 2>&1 &
270
+ VIEWER_PID=$!
271
+ ```
272
+
273
+ For iteration 2+, also pass:
274
+
275
+ ```bash
276
+ --previous-workspace <workspace>/iteration-<N-1>
277
+ ```
278
+
279
+ If a synthetic benchmark created no real outputs yet, add a minimal file such as
280
+ `outputs/summary.md` so the viewer has something to render.
281
+
282
+ ### Step 8. Hand the viewer to the user in the same turn
283
+
284
+ Do not make the user ask for the results viewer later.
285
+ Tell them where it is immediately.
286
+
287
+ If a server-backed viewer is available, say it is open and explain the two main
288
+ tabs:
289
+
290
+ - `Outputs`: prompt, output artifacts, grades, and feedback box
291
+ - `Benchmark`: pass rate, time, token usage, and analyst observations
292
+
293
+ ### Step 9. Read feedback and iterate
294
+
295
+ When the user finishes review, read `feedback.json`:
296
+
297
+ ```json
298
+ {
299
+ "reviews": [
300
+ {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
301
+ {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
302
+ ],
303
+ "status": "complete"
304
+ }
305
+ ```
306
+
307
+ Empty feedback usually means the user found that case acceptable.
308
+ Focus changes on cases with specific complaints.
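Filtering for the cases with specific complaints can be sketched as follows. The function name `actionable_feedback` is hypothetical; the sample mirrors the `feedback.json` shape shown above.

```python
import json


def actionable_feedback(raw: str) -> dict[str, str]:
    """Map run_id to feedback text, skipping empty or whitespace-only entries."""
    data = json.loads(raw)
    return {
        review["run_id"]: review["feedback"].strip()
        for review in data.get("reviews", [])
        if review.get("feedback", "").strip()
    }


sample = """{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}"""
todo = actionable_feedback(sample)
print(todo)  # {'eval-0-with_skill': 'the chart is missing axis labels'}
```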
309
+
310
+ If you started a viewer server, stop it afterwards:
311
+
312
+ ```bash
313
+ kill $VIEWER_PID 2>/dev/null
314
+ ```
315
+
316
+ ## Improvement Loop
317
+
318
+ Use the feedback to improve the skill, not to overfit it.
319
+
320
+ Priorities:
321
+
322
+ 1. generalize from repeated failures instead of patching one exact prompt
323
+ 2. remove instructions that create busywork without quality gains
324
+ 3. explain why important behavior matters
325
+ 4. package repeated deterministic work into scripts when multiple runs rediscover it
326
+
327
+ After revising the skill:
328
+
329
+ 1. rerun all evals into `iteration-<N+1>/`
330
+ 2. keep the same baseline logic unless there is a clear reason to change it
331
+ 3. regenerate the viewer with `--previous-workspace`
332
+ 4. collect user feedback again
333
+ 5. repeat until quality is acceptable or progress stalls
334
+
335
+ Stop when:
336
+
337
+ - the user is happy
338
+ - feedback is empty across the board
339
+ - the skill is no longer improving meaningfully
340
+
341
+ ## Blind Comparison
342
+
343
+ If the user specifically wants a more rigorous A/B comparison between two skill
344
+ versions, run a blind comparison:
345
+
346
+ - give two outputs to an independent grader without saying which is which
347
+ - have it judge quality
348
+ - analyze why the winner won
349
+
350
+ This is optional.
351
+ Use it when normal human review is not enough.
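The blinding step can be sketched as shuffling the two outputs behind neutral labels and keeping the answer key away from the grader. This is a minimal illustration with a hypothetical helper name; `agents/comparator.md` defines the actual comparison procedure.

```python
import random


def blind_pair(outputs: dict[str, str],
               rng: random.Random) -> tuple[dict[str, str], dict[str, str]]:
    """Return (shown, key): outputs under labels A/B, and the label-to-version map."""
    if len(outputs) != 2:
        raise ValueError("blind comparison needs exactly two outputs")
    items = list(outputs.items())
    rng.shuffle(items)  # randomize which version becomes A
    labels = ("A", "B")
    shown = {label: text for label, (_, text) in zip(labels, items)}
    key = {label: name for label, (name, _) in zip(labels, items)}
    return shown, key


outputs = {"with_skill": "output from the new version",
           "old_skill": "output from the snapshot"}
shown, key = blind_pair(outputs, random.Random())
# Give only `shown` to the grader; consult `key` after it has picked a winner.
print(sorted(shown))  # ['A', 'B']
```

Keeping `key` out of the grader's prompt is the whole point: the analysis of why the winner won happens only after the verdict is recorded.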
352
+
353
+ ## Host-Specific Adaptation
354
+
355
+ ### Chat-only host
356
+
357
+ If the host has no subagents:
358
+
359
+ - run test cases one by one
360
+ - skip baseline if independent comparison is impossible
361
+ - present outputs directly in chat or save them for the user to inspect
362
+ - focus on qualitative feedback
363
+ - skip benchmarking if it would be fake rigor
364
+
365
+ ### Headless worker host
366
+
367
+ If the host has no browser or display:
368
+
369
+ - still run the full evaluation workflow
370
+ - generate a static HTML review artifact instead of opening a live viewer
371
+ - provide the exact output path to the user
372
+ - expect feedback to arrive as a downloaded `feedback.json`
373
+
374
+ For headless review generation, prefer:
375
+
376
+ ```bash
377
+ python eval-viewer/generate_review.py \
378
+ <workspace>/iteration-N \
379
+ --skill-name "<name>" \
380
+ --benchmark <workspace>/iteration-N/benchmark.json \
381
+ --static <workspace>/iteration-N/review.html
382
+ ```
383
+
384
+ ## Packaging
385
+
386
+ If the user wants a distributable skill package and the appropriate tooling is
387
+ available, package it after the skill is stable.
388
+
389
+ If the user wants packaging, use the local bundled script:
390
+
391
+ ```bash
392
+ python scripts/package_skill.py <path/to/skill-folder>
393
+ ```
394
+
395
+ ## Final Rule
396
+
397
+ If you chose evaluation, follow through.
398
+ Do not stop after writing prompts.
399
+ Do not stop after generating benchmarks.
400
+ Do not stop after revising the skill once.
401
+
402
+ The full suite is:
403
+
404
+ - draft or revise the skill
405
+ - run test cases
406
+ - grade and benchmark them
407
+ - generate a human-review artifact
408
+ - collect feedback
409
+ - improve the skill
410
+ - rerun and compare again