@ax-llm/ax 21.0.12 → 21.0.13

@@ -1,57 +1,175 @@
1
1
  ---
2
2
  name: ax-audio
3
- description: This skill helps an LLM generate correct conversational audio I/O code with @ax-llm/ax. Use when the user asks about .chat() audio input, audio output, OpenAI gpt-audio or realtime models, Gemini Live native audio, Grok Voice Agent models, voices, formats, transcripts, or how audio fits with signatures and structured outputs.
4
- version: "21.0.12"
3
+ description: This skill helps an LLM generate correct audio code with @ax-llm/ax. Use when the user asks about ai.transcribe(), ai.speak(), signature audio inputs or outputs, agent audio behavior, .chat() conversational audio, OpenAI audio or realtime models, Gemini Live native audio, Grok Voice Agent models, voices, formats, transcripts, or how audio fits with structured outputs.
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # Audio I/O Codegen Rules (@ax-llm/ax)
8
8
 
9
- Use this skill for bounded-turn conversational audio through `.chat()`. Prefer short, modern, copyable examples. Do not model generated audio as a DSPy signature output field.
9
+ Use this skill for audio in Ax. Pick the smallest audio surface that matches the job:
10
10
 
11
- ## Core Rule
11
+ - Use `ai.transcribe(...)` for batch speech-to-text.
12
+ - Use `ai.speak(...)` for batch text-to-speech.
13
+ - Use `speech:audio` signature outputs for structured programs that should return synthesized audio artifacts.
14
+ - Use `.chat()` audio config for conversational or realtime audio turns.
12
15
 
13
- Audio output is returned on `AxChatResponseResult.audio`, not in signature fields.
16
+ ## Core Rules
14
17
 
15
- Signatures should keep text fields text-shaped:
18
+ - Input `:audio` is an audio input value: `{ data, format?, mimeType?, sampleRate?, channels? }`.
19
+ - Output `:audio` is a scripted audio artifact. The model returns plain text for that field; Ax synthesizes it after structured output parsing.
20
+ - Output audio JSON schema is model-facing `string`, not a binary object.
21
+ - Agents transcribe input audio fields before planner/executor/responder stages by default, so agent stages see text instead of base64 audio.
22
+ - Realtime and conversational audio still use `.chat()` and `modelConfig.audio`.
23
+ - Batch signature audio artifacts use forward-time `speech` options, not `modelConfig.audio`.
24
+
25
+ ## Direct Batch APIs
26
+
27
+ ```typescript
28
+ import { ai } from '@ax-llm/ax';
29
+
30
+ const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });
31
+
32
+ const transcript = await llm.transcribe({
33
+ audio: { data: base64Wav, format: 'wav' },
34
+ model: 'gpt-4o-mini-transcribe',
35
+ language: 'en',
36
+ prompt: 'Product support call',
37
+ });
38
+
39
+ const speech = await llm.speak({
40
+ text: transcript.text,
41
+ model: 'gpt-4o-mini-tts',
42
+ voice: 'alloy',
43
+ format: 'mp3',
44
+ });
45
+
46
+ console.log(transcript.text);
47
+ console.log(speech.data);
48
+ console.log(speech.transcript);
49
+ ```
50
+
51
+ Providers without the requested batch audio capability throw `AxMediaNotSupportedError`.
52
+
53
+ ## Signature Audio Artifacts
54
+
55
+ ```typescript
56
+ import { ai, ax } from '@ax-llm/ax';
57
+
58
+ const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });
59
+ const say = ax('question:string -> speech:audio, summary:string');
60
+
61
+ const result = await say.forward(
62
+ llm,
63
+ { question: 'Explain retries in one sentence.' },
64
+ {
65
+ speech: {
66
+ speak: { voice: 'alloy', format: 'mp3' },
67
+ fields: {
68
+ speech: { voice: 'alloy' },
69
+ },
70
+ },
71
+ }
72
+ );
73
+
74
+ console.log(result.summary);
75
+ console.log(result.speech.data);
76
+ console.log(result.speech.mimeType);
77
+ console.log(result.speech.transcript);
78
+ ```
79
+
80
+ The model emits a text script for `speech`; Ax replaces it with `AxChatAudioOutput` after result selection. If the field already contains an audio artifact with `{ data }` or `{ id }`, Ax leaves it alone.
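
The replacement rule above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `AudioArtifactSketch`, `Synthesize`, `isAudioArtifact`, and `resolveSpeechField` are hypothetical names, and the fake synthesizer stands in for whatever text-to-speech call Ax makes internally.

```typescript
// Hypothetical stand-in for the synthesized artifact (the diff names the real type AxChatAudioOutput).
type AudioArtifactSketch = {
  data?: string;
  id?: string;
  mimeType?: string;
  transcript?: string;
};

// Hypothetical synthesizer callback; synchronous here only to keep the sketch short.
type Synthesize = (script: string) => AudioArtifactSketch;

// True when the value already looks like an audio artifact ({ data } or { id }).
function isAudioArtifact(value: unknown): value is AudioArtifactSketch {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.data === 'string' || typeof v.id === 'string';
}

// Turn a text script into an artifact; pass existing artifacts through untouched.
function resolveSpeechField(
  value: string | AudioArtifactSketch,
  synthesize: Synthesize
): AudioArtifactSketch {
  return isAudioArtifact(value) ? value : synthesize(value);
}

// Fake synthesizer for illustration only; it produces no real audio.
const fakeTts: Synthesize = (script) => ({
  data: `fake-audio:${script}`,
  mimeType: 'audio/mpeg',
  transcript: script,
});

const fromScript = resolveSpeechField('Retries repeat a failed call.', fakeTts);
const existing = resolveSpeechField({ id: 'audio_123' }, fakeTts);
console.log(fromScript.transcript, existing.id);
```

The pass-through branch is what keeps an already-synthesized field from being synthesized a second time.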
81
+
82
+ ## Agent Audio Inputs
83
+
84
+ ```typescript
85
+ import { agent, ai } from '@ax-llm/ax';
86
+
87
+ const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });
88
+
89
+ const voiceAgent = agent(
90
+ 'recording:audio, question:string -> speech:audio, summary:string',
91
+ {
92
+ agentIdentity: {
93
+ name: 'Voice Assistant',
94
+ description: 'Answers spoken requests with spoken and written output',
95
+ },
96
+ contextFields: [],
97
+ }
98
+ );
99
+
100
+ const result = await voiceAgent.forward(
101
+ llm,
102
+ {
103
+ recording: { data: base64Wav, format: 'wav' },
104
+ question: 'What should I do next?',
105
+ },
106
+ {
107
+ speech: {
108
+ transcribe: { model: 'gpt-4o-mini-transcribe' },
109
+ speak: { voice: 'alloy', format: 'mp3' },
110
+ },
111
+ }
112
+ );
113
+
114
+ console.log(result.summary);
115
+ console.log(result.speech.data);
116
+ ```
117
+
118
+ The agent runtime transcribes `recording` first and passes the transcript through the internal agent stages. Use direct `ax(...)` or `.chat()` when you specifically want native audio understanding in the model call.
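
As a rough mental model of that preprocessing step, the sketch below swaps audio-shaped inputs for transcripts before anything else runs. It is an assumption-laden illustration: `AudioValue`, `Transcriber`, and `transcribeAudioInputs` are hypothetical names, the transcriber is synchronous for brevity, and the real runtime may detect audio fields from the signature rather than from the value's shape.

```typescript
// Hypothetical shape of an audio input value, reduced to the parts used here.
type AudioValue = { data: string; format?: string };

// Hypothetical transcriber; a real one would call a speech-to-text model.
type Transcriber = (audio: AudioValue) => string;

// Assumption: an audio input is any object carrying a string `data` field.
function isAudioValue(value: unknown): value is AudioValue {
  if (typeof value !== 'object' || value === null) return false;
  return typeof (value as Record<string, unknown>).data === 'string';
}

// Replace every audio-shaped input with its transcript; copy other inputs as-is.
function transcribeAudioInputs(
  inputs: Record<string, unknown>,
  transcribe: Transcriber
): Record<string, unknown> {
  const staged: Record<string, unknown> = {};
  for (const key of Object.keys(inputs)) {
    const value = inputs[key];
    staged[key] = isAudioValue(value) ? transcribe(value) : value;
  }
  return staged;
}

// Fake transcriber for illustration only.
const fakeTranscribe: Transcriber = (audio) =>
  `[transcript of ${audio.format ?? 'unknown'} audio]`;

const staged = transcribeAudioInputs(
  {
    recording: { data: 'UklGRg==', format: 'wav' },
    question: 'What should I do next?',
  },
  fakeTranscribe
);
console.log(staged);
```

After this step the later stages only ever see strings, which matches the rule that agent stages see text instead of base64 audio.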
119
+
120
+ ## Conversational `.chat()` Audio
121
+
122
+ Use `modelConfig.audio` for conversational audio turns where audio is part of the chat response instead of a structured signature field.
16
123
 
17
124
  ```typescript
18
- const result = await llm.chat({
125
+ const res = await llm.chat({
19
126
  chatPrompt: [{ role: 'user', content: 'Say hello out loud.' }],
20
127
  modelConfig: {
21
- audio: { output: { enabled: true } },
128
+ audio: { output: { enabled: true, voice: 'alloy', format: 'wav' } },
22
129
  },
23
130
  });
24
131
 
25
- console.log(result.results[0]?.content);
26
- console.log(result.results[0]?.audio?.data);
27
- console.log(result.results[0]?.audio?.transcript);
132
+ console.log(res.results[0]?.content);
133
+ console.log(res.results[0]?.audio?.data);
134
+ console.log(res.results[0]?.audio?.transcript);
28
135
  ```
29
136
 
30
- Do not write signatures like `question:string -> audio:audio`. Use `.chat()` for conversational audio and use `audio.data` for the generated bytes.
31
-
32
137
  ## Config Shape
33
138
 
34
139
  ```typescript
35
- type AxChatAudioConfig = {
36
- input?: {
37
- format?: 'wav' | 'mp3' | 'flac' | 'opus' | 'aac' | 'pcm16' | 'pcm' | 'ogg';
38
- mimeType?: string;
39
- sampleRate?: number;
40
- channels?: number;
41
- };
42
- output?: {
43
- enabled?: boolean;
44
- voice?: string | { id: string };
45
- format?: 'wav' | 'mp3' | 'flac' | 'opus' | 'aac' | 'pcm16' | 'pcm' | 'ogg';
46
- sampleRate?: number;
47
- channels?: number;
48
- includeTranscript?: boolean;
140
+ type AxAudioFormat =
141
+ | 'wav'
142
+ | 'mp3'
143
+ | 'flac'
144
+ | 'opus'
145
+ | 'aac'
146
+ | 'pcm16'
147
+ | 'pcm'
148
+ | 'ogg'
149
+ | 'raw'
150
+ | 'mulaw'
151
+ | 'ulaw'
152
+ | 'alaw';
153
+
154
+ type AxSpeechConfig = {
155
+ transcribe?: {
156
+ model?: string;
157
+ language?: string;
158
+ prompt?: string;
49
159
  };
50
- live?: {
51
- turnTimeoutMs?: number;
52
- enableAffectiveDialog?: boolean;
53
- proactiveAudio?: boolean;
160
+ speak?: {
161
+ model?: string;
162
+ voice?: string;
163
+ format?: AxAudioFormat;
54
164
  };
165
+ fields?: Record<
166
+ string,
167
+ {
168
+ model?: string;
169
+ voice?: string;
170
+ format?: AxAudioFormat;
171
+ }
172
+ >;
55
173
  };
56
174
  ```
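
One plausible reading of `AxSpeechConfig` is that `speak` holds defaults and `fields` holds per-field overrides. The diff does not state the precedence, so the sketch below treats it as an assumption; `SpeakOptions`, `SpeechConfigSketch`, and `resolveSpeakOptions` are hypothetical names, and the voice values are illustrative.

```typescript
// Hypothetical local copies of the option shapes shown in the config type.
type SpeakOptions = { model?: string; voice?: string; format?: string };
type SpeechConfigSketch = {
  speak?: SpeakOptions;
  fields?: Record<string, SpeakOptions>;
};

// Assumption: per-field options win over the shared `speak` defaults.
function resolveSpeakOptions(
  config: SpeechConfigSketch,
  fieldName: string
): SpeakOptions {
  const override = config.fields ? config.fields[fieldName] : undefined;
  return { ...config.speak, ...override };
}

const config: SpeechConfigSketch = {
  speak: { voice: 'alloy', format: 'mp3' },
  fields: { speech: { voice: 'echo' } },
};

const forSpeech = resolveSpeakOptions(config, 'speech'); // override applies
const forOther = resolveSpeakOptions(config, 'narration'); // defaults only
console.log(forSpeech, forOther);
```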
57
175
 
@@ -246,6 +364,10 @@ for await (const chunk of stream) {
246
364
 
247
365
  ## Structured Outputs
248
366
 
249
- Do not combine audio output with structured response formats. Audio chat may return a text transcript in `content`, but generated audio bytes live at `result.results[0].audio`.
367
+ Use signature audio outputs for structured speech artifacts:
368
+
369
+ ```typescript
370
+ const gen = ax('question:string -> answer:string, speech:audio');
371
+ ```
250
372
 
251
- For structured extraction from speech, use a text-only or transcription step first, then pass the transcript into `ax(...)` or `flow(...)`.
373
+ Use `.chat()` audio when the response itself is a conversational audio turn. Do not combine `.chat()` audio output with provider-native structured response formats unless that provider explicitly supports the combination.
package/skills/ax-flow.md CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-flow
3
3
  description: This skill helps an LLM generate correct AxFlow workflow code using @ax-llm/ax. Use when the user asks about flow(), AxFlow, workflow orchestration, parallel execution, DAG workflows, conditional routing, map/reduce patterns, or multi-node AI pipelines.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # AxFlow Codegen Rules (@ax-llm/ax)
package/skills/ax-gen.md CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-gen
3
3
  description: This skill helps an LLM generate correct AxGen code using @ax-llm/ax. Use when the user asks about ax(), AxGen, generators, forward(), streamingForward(), assertions, field processors, step hooks, self-tuning, or structured outputs.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # AxGen Codegen Rules (@ax-llm/ax)
package/skills/ax-gepa.md CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-gepa
3
3
  description: This skill helps an LLM generate correct AxGEPA optimization code using @ax-llm/ax. Use when the user asks about AxGEPA, GEPA, Pareto optimization, multi-objective prompt tuning, reflective prompt evolution, validationExamples, maxMetricCalls, or optimizing a generator, flow, or agent tree.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # AxGEPA Codegen Rules (@ax-llm/ax)
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-learn
3
3
  description: This skill helps an LLM generate correct AxLearn code using @ax-llm/ax. Use when the user asks about self-improving agents, trace-backed learning, feedback-aware updates, or AxLearn modes.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # AxLearn Codegen Rules (@ax-llm/ax)
package/skills/ax-llm.md CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-llm
3
3
  description: This skill helps with using the @ax-llm/ax TypeScript library for building LLM applications. Use when the user asks about ax(), ai(), f(), s(), agent(), flow(), AxGen, AxAgent, AxFlow, signatures, streaming, or mentions @ax-llm/ax.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # Ax Library (@ax-llm/ax) Quick Reference
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: ax-signature
3
3
  description: This skill helps an LLM generate correct DSPy signature code using @ax-llm/ax. Use when the user asks about signatures, s(), f(), field types, string syntax, fluent builder API, validation constraints, or type-safe inputs/outputs.
4
- version: "21.0.12"
4
+ version: "21.0.13"
5
5
  ---
6
6
 
7
7
  # Ax Signature Reference
@@ -25,7 +25,7 @@ version: "21.0.12"
25
25
  | DateRange | `:dateRange` | `{ start: Date; end: Date }` | `travelDates:dateRange` |
26
26
  | DateTimeRange | `:datetimeRange` | `{ start: Date; end: Date }` | `meetingWindow:datetimeRange` |
27
27
  | Image | `:image` | `{mimeType, data}` | `photo:image` (input only) |
28
- | Audio | `:audio` | `{format?, data}` | `recording:audio` (input only) |
28
+ | Audio | `:audio` | input: `AxAudioInput`; output: `AxChatAudioOutput` | `recording:audio`, `speech:audio` |
29
29
  | File | `:file` | `{mimeType, data}` | `document:file` (input only) |
30
30
  | URL | `:url` | `string` | `website:url` |
31
31
  | Code | `:code` | `string` | `pythonScript:code` |
@@ -256,9 +256,11 @@ Bad: `text`, `data`, `input`, `output`, `a`, `x`, `val` (too generic), `1field`
256
256
 
257
257
  ## Media Type Restrictions
258
258
 
259
- - Media types (image, audio, file) are **top-level input fields only**
260
- - Cannot be nested in objects
261
- - Cannot be output fields
259
+ - Image and file fields are top-level input fields only.
260
+ - Audio fields can be top-level inputs or single top-level outputs.
261
+ - Audio output fields are scripted speech artifacts: the model returns plain text, then Ax synthesizes `AxChatAudioOutput`.
262
+ - Media fields cannot be nested in objects.
263
+ - Media arrays are supported for inputs only; output `audio[]` is not supported.
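
The output-side restrictions above can be expressed as a small check. This is a sketch under stated assumptions, not Ax's validator: `checkOutputMedia` is a hypothetical helper, and it only handles simple `name:type` fields (no quoted class options or descriptions containing commas).

```typescript
// Returns rule violations for the output side of a simple signature string.
// Assumption: fields look like `name:type` and are separated by commas.
function checkOutputMedia(signature: string): string[] {
  const outputSide = signature.split('->')[1] ?? '';
  const types = outputSide
    .split(',')
    .map((field) => field.trim())
    .filter((field) => field.length > 0)
    .map((field) => (field.split(':')[1] ?? 'string').trim());

  const errors: string[] = [];
  // Image and file fields are input-only.
  if (types.some((t) => /^(image|file)(\[\])?$/.test(t))) {
    errors.push('image and file fields are input-only');
  }
  // Audio arrays are supported for inputs only.
  if (types.some((t) => t === 'audio[]')) {
    errors.push('output audio[] is not supported');
  }
  return errors;
}

const ok = checkOutputMedia('question:string -> speech:audio, summary:string');
const badArray = checkOutputMedia('question:string -> clips:audio[]');
const badImage = checkOutputMedia('question:string -> photo:image');
console.log(ok, badArray, badImage);
```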
262
264
 
263
265
  ## Common Patterns
264
266
 
@@ -269,9 +271,12 @@ Bad: `text`, `data`, `input`, `output`, `a`, `x`, `val` (too generic), `1field`
269
271
  // Classification
270
272
  'email:string -> priority:class "urgent, normal, low"'
271
273
 
272
- // Multi-modal
274
+ // Multi-modal input
273
275
  'imageData:image, question?:string -> description:string, objects:string[]'
274
276
 
277
+ // Scripted speech output
278
+ 'question:string -> speech:audio, summary:string'
279
+
275
280
  // Data Extraction
276
281
  'invoiceText:string -> invoiceNumber:string, totalAmount:number, lineItems:json[]'
277
282
 
@@ -283,13 +288,13 @@ Bad: `text`, `data`, `input`, `output`, `a`, `x`, `val` (too generic), `1field`
283
288
 
284
289
  - Use `f()` fluent builder, NOT nested `f.array(f.string())` -- those are removed.
285
290
  - Field names must be descriptive (not generic like `text`, `data`, `input`).
286
- - Media types are input-only, top-level only.
291
+ - Image/file media types are input-only, top-level only; audio may also be a single top-level output.
287
292
  - `.internal()` / `{ internal: true }` is output-only (for chain-of-thought reasoning).
288
293
  - `.cache()` / `{ cache: true }` is input-only (for prompt caching).
289
294
  - Validation errors trigger auto-retry with correction feedback.
290
295
  - `f.email()`, `f.url()`, `f.date()`, `f.datetime()` are shorthand for `f.string().email()` etc.; `f.dateRange()` and `f.datetimeRange()` return `{ start: Date; end: Date }`.
291
296
  - `z.enum()` maps to ax's `class` type — only valid on **output** fields.
292
- - For multimodal inputs (images, audio, files) use `f.image()` / `f.audio()` / `f.file()` — zod has no equivalent.
297
+ - For multimodal inputs (images, audio, files) and scripted audio outputs, use `f.image()` / `f.audio()` / `f.file()` — zod has no equivalent.
293
298
 
294
299
  ## Examples
295
300