@ax-llm/ax 19.0.21 → 19.0.23
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/index.cjs +403 -421
- package/index.cjs.map +1 -1
- package/index.d.cts +147 -137
- package/index.d.ts +147 -137
- package/index.global.js +280 -298
- package/index.global.js.map +1 -1
- package/index.js +403 -421
- package/index.js.map +1 -1
- package/package.json +1 -1
- package/skills/ax-agent-optimize.md +339 -0
- package/skills/ax-agent.md +167 -117
- package/skills/ax-ai.md +1 -1
- package/skills/ax-flow.md +1 -1
- package/skills/ax-gen.md +1 -1
- package/skills/ax-gepa.md +18 -2
- package/skills/ax-learn.md +1 -1
- package/skills/ax-llm.md +2 -2
- package/skills/ax-signature.md +1 -1
package/skills/ax-agent-optimize.md
ADDED
@@ -0,0 +1,339 @@
---
name: ax-agent-optimize
description: This skill helps an LLM generate correct AxAgent tuning and evaluation code using @ax-llm/ax. Use when the user asks about agent.optimize(...), judgeOptions, eval datasets, optimization targets, saved optimizedProgram artifacts, or recursive optimization guidance.
version: "19.0.23"
---

# AxAgent Optimize Codegen Rules (@ax-llm/ax)

Use this skill for `agent.optimize(...)` workflows. Prefer short, modern, copyable patterns. Do not repeat general agent-authoring guidance unless the user needs it.

Your job is to help the model choose a good optimization setup for the user's actual goal:

- If the user wants better tool use, prefer action-aware tasks and either a deterministic metric or the built-in judge, depending on how objective the scoring is.
- If the user wants better wording only, responder optimization may be enough.
- If the user wants reusable improvements, include artifact save/load.
- If the user wants cost or recursion behavior improved, make the eval tasks expose those tradeoffs explicitly.

## Use These Defaults

- Use `agent.optimize(...)` only after the agent is already configured and runnable.
- Prefer a deterministic custom `metric` when success is easy to score from the prediction and task record.
- Prefer the built-in judge path for open-ended assistant tasks: `judgeAI` plus `judgeOptions`.
- Only reach for a plain typed `AxGen` evaluator when the user needs LLM-as-judge behavior outside the built-in `agent.optimize(...)` flow.
- The default optimize target is `root.actor`; use `target: 'responder'` or explicit program IDs only when the user clearly asks for that.
- Use eval-safe tools or in-memory mocks, because optimization replays tasks many times.
- Prefer precise tool return schemas such as `f.object(...)` over vague `f.json(...)` whenever the agent must reason about returned fields.
- Prefer task wording with canonical entity names like "the Atlas project" instead of ambiguous labels like "Atlas" when ambiguity could trigger pointless clarification.
- Save `result.optimizedProgram`, then restore with `new AxOptimizedProgramImpl(...)` and `agent.applyOptimization(...)`.
- When recursive behavior matters, keep `mode: 'advanced'` on the agent and tune against realistic `recursionOptions`.

## Decision Guide

Pick the optimization shape from the user's need:

- "Make the agent use tools correctly" -> optimize `root.actor` with `expectedActions` and `forbiddenActions`.
- "Make final answers read better" -> consider `target: 'responder'`, but only if the task is not mostly tool-selection or clarification behavior.
- "Make the whole agent better" -> use the default actor target first; only broaden target selection when the user clearly wants that extra scope.
- "Tune recursive delegation" -> keep `mode: 'advanced'` and use tasks that actually exercise recursion depth, fan-out, and termination choices.
- "Compare before and after" -> include a held-out task plus artifact save/load and replay.

Choose task design carefully:

- Prefer a small number of realistic tasks over broad but vague datasets.
- Prefer concrete criteria over generic "be helpful" language.
- Prefer explicit action expectations when correctness depends on tools, recipients, dates, or side effects.
- Prefer eval-safe mocks anytime the task touches email, scheduling, external APIs, or persistence.

## Make Agents Optimizable

Optimization works much better when the agent and dataset remove avoidable ambiguity:

- Prefer typed tool outputs over free-form JSON blobs so the actor can rely on exact field names.
- Tell the actor the exact tool fields it may use when payload shape matters.
- Explicitly ban invented fields if the model has any reason to guess hidden IDs or alternate key names.
- If recursive children only see explicit `llmQuery(..., context)` payloads, say that directly in the actor prompt.
- For recursive synthesis, tell the agent what the narrowed context should look like before delegation.
- Keep `maxSubAgentCalls` small in examples unless the user is explicitly testing broad fan-out behavior.
- Use canonical, unambiguous task wording so the model does not burn turns asking for fake clarification.
- In JS-runtime agents, require raw runnable JavaScript only. Ban `javascript:` prefixes, mixed prose/code, and multi-snippet turns.

Good pattern:

- tool schema says exactly what fields exist
- task names the exact entity to look up
- actor prompt says which fields to extract before delegation
- metric or judge penalizes unnecessary recursion and tool misuse

Bad pattern:

- tool returns `json` with an underspecified shape
- task uses overloaded names like `Atlas` without clarifying whether that is a project, team, or account
- recursive child is expected to infer hidden parent state that was never passed in context
- code agent is allowed to mix natural language with JavaScript in the same turn
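
The typed-output principle above can be sketched in plain TypeScript. This is only an illustration of why exact field names help the actor; in real ax code the return schema comes from `f.object(...)`, and the `ProjectLookupResult` shape and `narrowForDelegation` helper here are hypothetical, not library APIs.

```typescript
// Sketch of the "typed tool outputs" principle in plain TypeScript.
// A vague f.json(...) return is the equivalent of Record<string, unknown>:
// the actor must guess which keys exist. A precise f.object(...) schema is
// the equivalent of an interface like this one.

interface ProjectLookupResult {
  projectId: string; // canonical ID to pass to child agents
  ownerEmail: string; // exact recipient for follow-up email tools
  status: 'active' | 'archived';
}

// The narrowing step an actor prompt can require before delegation:
function narrowForDelegation(result: ProjectLookupResult): string {
  return `project=${result.projectId} owner=${result.ownerEmail} status=${result.status}`;
}

const context = narrowForDelegation({
  projectId: 'atlas-42',
  ownerEmail: 'jim@example.com',
  status: 'active',
});
```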

## Metric vs Judge

Choose the scoring path based on how objectively the task can be measured:

- Use a custom `metric` when you can score success directly from `prediction` and `example`.
- Use the built-in agent judge when success depends on a full-run qualitative review across tool choices, clarifications, and final output.
- Use `judgeOptions.description` to tell the built-in judge what to value most.
- Use helper-based judge code only when the user is not inside `agent.optimize(...)` and still wants LLM judging.

Quick rules:

- Tool correctness with exact expected calls or forbidden calls: prefer a deterministic metric first.
- Simple extraction or classification with known correct answers: prefer a deterministic metric.
- Open-ended assistant quality, nuanced clarification behavior, or broad synthesis quality: prefer the built-in judge.
- GEPA or optimizer flows outside agents that still need LLM judging: use a plain typed `AxGen` evaluator.

Important:

- A custom `metric` overrides the built-in judge path entirely.
- Do not introduce a dedicated judge abstraction in new examples; prefer a plain typed `AxGen`.
- Do not add both a custom `metric` and judge guidance unless the user explicitly wants two separate scoring systems and understands that only the custom metric drives optimization.
- If the user builds a plain `AxGen` judge metric, prefer a numeric `score:number` output over a string tier when possible. It is simpler and less fragile in practice.

## Canonical Pattern

```typescript
import {
  AxAIGoogleGeminiModel,
  AxJSRuntime,
  AxOptimizedProgramImpl,
  axDefaultOptimizerLogger,
  agent,
  ai,
  f,
  fn,
} from '@ax-llm/ax';

const tools = [
  fn('sendEmail')
    .namespace('email')
    .description('Send an email message')
    .arg('to', f.string('Recipient email address'))
    .arg('body', f.string('Email body text'))
    .returns(
      f.object({
        sent: f.boolean('Whether the email was sent'),
        to: f.string('Recipient email address'),
      })
    )
    .handler(async ({ to }) => ({ sent: true, to }))
    .build(),
];

const studentAI = ai({
  name: 'google-gemini',
  apiKey: process.env.GOOGLE_APIKEY!,
  config: { model: AxAIGoogleGeminiModel.Gemini25FlashLite, temperature: 0.2 },
});

const judgeAI = ai({
  name: 'google-gemini',
  apiKey: process.env.GOOGLE_APIKEY!,
  config: { model: AxAIGoogleGeminiModel.Gemini3Pro, temperature: 1.0 },
});

const assistant = agent('query:string -> answer:string', {
  ai: studentAI,
  judgeAI,
  contextFields: [],
  runtime: new AxJSRuntime(),
  functions: { local: tools },
  contextPolicy: { preset: 'adaptive' },
  judgeOptions: {
    description: 'Prefer correct tool use over polished wording.',
    model: 'judge-model',
  },
});

const tasks = [
  {
    input: { query: 'Send an email to Jim saying good morning.' },
    criteria: 'Use the email tool and send the message to Jim.',
    expectedActions: ['email.sendEmail'],
  },
];

const result = await assistant.optimize(tasks, {
  target: 'actor',
  maxMetricCalls: 12,
  verbose: true,
  optimizerLogger: axDefaultOptimizerLogger,
  onProgress: (progress) => {
    console.log(
      `round ${progress.round}/${progress.totalRounds} current=${progress.currentScore} best=${progress.bestScore}`
    );
  },
});

const saved = JSON.stringify(result.optimizedProgram, null, 2);
const restored = new AxOptimizedProgramImpl(JSON.parse(saved));
assistant.applyOptimization(restored);
```

## Deterministic Metric Pattern

Use this when the task has crisp correctness and cost/behavior tradeoffs:

```typescript
const result = await assistant.optimize(tasks, {
  target: 'actor',
  metric: ({ prediction, example }) => {
    if (prediction.completionType !== 'final' || !prediction.output) {
      return 0;
    }

    let score = 0;

    if (prediction.output.answer.includes('Jim')) score += 0.4;

    if (
      prediction.functionCalls.some(
        (call) => call.qualifiedName === 'email.sendEmail'
      )
    ) {
      score += 0.4;
    }

    if ((prediction.recursiveStats?.recursiveCallCount ?? 0) === 0) {
      score += 0.2;
    }

    return score;
  },
});
```

Use this pattern when:

- the task has a known correct answer or exact action pattern
- recursion cost or tool count must be measured explicitly
- you want repeatable, low-variance optimization runs

## Built-In Judge Pattern

Use this when the agent behavior needs holistic review:

```typescript
const result = await assistant.optimize(tasks, {
  judgeAI,
  judgeOptions: {
    model: AxAIGoogleGeminiModel.Gemini3Pro,
    description:
      'Be strict about unnecessary delegation, weak clarifications, and incorrect tool choices.',
  },
  maxMetricCalls: 12,
});
```

Use this pattern when:

- task quality is open-ended or hard to score exactly
- the final answer quality matters together with the action trace
- the user wants a judge to consider clarifications, tool errors, and overall completion quality

## Plain `AxGen` Judge Pattern

Use this only when the user needs LLM judging outside the built-in `agent.optimize(...)` path:

```typescript
import { AxGen, s } from '@ax-llm/ax';

const judgeGen = new AxGen(
  s(`
    taskInput:json "Task input",
    candidateOutput:json "Candidate output",
    expectedOutput?:json "Optional reference output"
    ->
    score:number "Normalized score from 0 to 1"
  `)
);
judgeGen.setInstruction(
  'Score the candidate output from 0 to 1. Reward correctness and task completion. Return only the score field.'
);

const metric = async ({ prediction, example }) => {
  const result = await judgeGen.forward(judgeAI, {
    taskInput: example,
    candidateOutput: prediction,
    expectedOutput: example.expectedOutput,
  });

  return Math.max(0, Math.min(1, result.score));
};

const result = await optimizer.compile(program, train, metric, {
  validationExamples: validation,
});
```

Use this pattern when:

- the user is optimizing an `AxGen`, flow, or another program directly
- the user wants LLM judging without the higher-level `agent.optimize(...)` wrapper
- the user wants to inspect judge results directly, not just a numeric score

## Dataset And Judge Rules

- Pass already-loaded tasks. Do not invent a benchmark loader unless the user asks for one.
- Use `expectedActions` and `forbiddenActions` when tool correctness matters.
- `judgeOptions` mirrors normal forward options and supports extra judge guidance through `description`.
- The built-in judge scores from the full agent run, not just the final reply. It can see completion type, clarification payload, final output, action log, normalized function calls, tool errors, and turn count.
- For recursive advanced-mode evals, the built-in judge can also see `recursiveTrace` and `recursiveStats`.
- If the user provides a custom `metric`, that overrides the built-in judge path.
- If the user provides an LLM-based custom metric, keep the output schema as small as possible and prefer a direct numeric score.

Decision rules:

- Prefer a custom metric when the user has deterministic business scoring, exact action expectations, or explicit cost tradeoffs.
- Prefer the built-in judge when the user wants practical assistant-quality tuning and does not already have a trusted metric.
- Prefer a plain typed `AxGen` evaluator when the user is not calling `agent.optimize(...)` but still wants LLM judging.
- Prefer `judgeOptions.description` to steer the judge toward the user's real priority, such as tool correctness, brevity, groundedness, or policy compliance.
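
What `expectedActions` and `forbiddenActions` express can be sketched as a small deterministic check in plain TypeScript. The `checkActions` helper below is illustrative, not an @ax-llm/ax API; it assumes the action log is already normalized to qualified names like `email.sendEmail`.

```typescript
// Illustrative deterministic check mirroring expectedActions/forbiddenActions
// on a task record. Not a library API; shown to make the scoring idea concrete.

interface ActionCheck {
  expectedActions?: string[];
  forbiddenActions?: string[];
}

function checkActions(calls: string[], task: ActionCheck): boolean {
  // Every expected action must appear in the normalized call log...
  const expectedOk = (task.expectedActions ?? []).every((a) =>
    calls.includes(a)
  );
  // ...and no forbidden action may appear.
  const forbiddenOk = (task.forbiddenActions ?? []).every(
    (a) => !calls.includes(a)
  );
  return expectedOk && forbiddenOk;
}

const ok = checkActions(['email.sendEmail'], {
  expectedActions: ['email.sendEmail'],
  forbiddenActions: ['calendar.createEvent'],
});
```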

## Eval Semantics

- `agent.optimize(...)` runs each evaluation rollout from a clean continuation state.
- Saved runtime state from `getState()` and `setState(...)` is not used during eval rollouts.
- During optimize/eval, `ask_clarification(...)` is treated as a scored evaluation outcome instead of going through the responder.
- For clarification outcomes in custom metrics, expect `prediction.completionType === 'ask_clarification'`, populated `prediction.clarification`, and absent `prediction.output`.
- For final outcomes in custom metrics, expect `prediction.completionType === 'final'` and populated `prediction.output`.
- `target: 'responder'` still works, but clarification-heavy tasks are usually low-signal for responder optimization.
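
A custom metric that respects both outcome shapes can be sketched like this. The `Prediction` interface is a minimal local stand-in for the fields the metric reads, not the full library type, and the scoring values are illustrative.

```typescript
// Sketch of a metric branching on the two eval outcomes described above.
// Prediction is a minimal stand-in, not the real @ax-llm/ax type.

interface Prediction {
  completionType: 'final' | 'ask_clarification';
  output?: { answer: string };
  clarification?: { question: string };
}

function scoreOutcome(
  prediction: Prediction,
  clarificationExpected: boolean
): number {
  if (prediction.completionType === 'ask_clarification') {
    // Reward clarification only on tasks designed to require it.
    return clarificationExpected && prediction.clarification ? 1 : 0;
  }
  // Final outcomes must carry an output payload.
  return prediction.output ? 1 : 0;
}
```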

## Recursive Optimization Notes

- Recursive-slot artifacts require an agent configured for recursive advanced mode.
- Keep `mode: 'advanced'` top-level; child recursion behavior still follows `recursionOptions`.
- When recursive behavior matters, tune against the same `maxDepth`, `promptLevel`, and tool/discovery structure you expect in production.
- Use recursive traces and recursive stats when the user wants to diagnose where token or delegation cost is coming from.
- For recursion-efficiency tuning, prefer a deterministic metric unless the user specifically needs a qualitative LLM review of decomposition quality.
- Prefer `recursionOptions.promptLevel: 'detailed'` in examples when child agents need to respect strict JS/runtime policy.
- Tell the actor that recursive children only see passed context, not parent globals or prior tool results.
- For synthesis-style recursive tasks, specify the desired delegation pattern explicitly, for example "use at most one focused delegated child analysis after narrowing the tool output in JS."
- Penalize over-decomposition directly in the metric or judge prompt.
- If one training task keeps collapsing to zero, inspect that task first instead of adding more optimizer rounds. Most failures come from task ambiguity, weak tool schemas, or vague delegation guidance rather than GEPA itself.
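
One way to penalize over-decomposition deterministically is a small penalty term on the recursive call count, in the spirit of the `recursiveStats` check in the deterministic metric pattern. The budget and slope below are illustrative choices, not library defaults.

```typescript
// Illustrative over-decomposition penalty term for a deterministic metric.
// recursiveCallCount mirrors prediction.recursiveStats?.recursiveCallCount;
// the 1-call budget and 0.1-per-call slope are assumptions for this sketch.

function recursionPenalty(recursiveCallCount: number, budget = 1): number {
  // Full 0.2 credit at or under budget; lose 0.1 per extra delegated call,
  // bottoming out at 0 so the term never goes negative.
  const excess = Math.max(0, recursiveCallCount - budget);
  return Math.max(0, 0.2 - 0.1 * excess);
}
```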

## Artifacts And Replay

- Save `result.optimizedProgram` if the user wants portable artifacts.
- Restore artifacts with `new AxOptimizedProgramImpl(...)`, then call `agent.applyOptimization(...)`.
- For demonstrations, use fresh eval-safe tool state for baseline, optimize, and restored replay so side effects do not leak across phases.
- If the user wants to show improvement, run a held-out task before optimization, then replay it on a freshly restored optimized agent.
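
The save/restore round trip can be sketched with plain JSON and the filesystem. The artifact is treated as an opaque JSON object here (the `optimizedProgram` literal is a stand-in, not the real artifact shape); feeding the loaded JSON back into a live agent still goes through `new AxOptimizedProgramImpl(...)` and `agent.applyOptimization(...)`, shown as comments.

```typescript
// Sketch of persisting an optimization artifact to disk and reloading it.
import { readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Stand-in for result.optimizedProgram from a real optimize run.
const optimizedProgram = { instruction: 'Prefer correct tool use.', demos: [] };

const artifactPath = join(tmpdir(), 'ax-agent-optimized.json');
writeFileSync(artifactPath, JSON.stringify(optimizedProgram, null, 2));

const loaded = JSON.parse(readFileSync(artifactPath, 'utf8'));
// const restored = new AxOptimizedProgramImpl(loaded);
// assistant.applyOptimization(restored);
```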

## Examples

- [RLM Agent Optimize](https://raw.githubusercontent.com/ax-llm/ax/refs/heads/main/src/examples/rlm-agent-optimize.ts) — Gemini office-assistant tuning with save/load
- [RLM Agent Recursive Optimize](https://raw.githubusercontent.com/ax-llm/ax/refs/heads/main/src/examples/rlm-agent-recursive-optimize.ts) — recursive-slot optimization artifacts

## Do Not Generate

- Do not optimize against production tools with real side effects unless the user explicitly wants that.
- Do not recommend responder-only optimization by default for clarification-heavy workflows.
- Do not omit artifact save/load steps when the user asks for reusable optimized configurations.
- Do not introduce a dedicated judge class or helper abstraction in new agent-optimize examples; prefer the built-in judge path or a plain typed `AxGen`.
- Do not rely on vague `json` tool returns when the agent must reason about specific fields across recursive steps.
- Do not leave recursive child context implicit. If the child needs a fact, pass it explicitly.
- Do not let code-generation agents mix prose and JavaScript if the user is optimizing runtime behavior.