@jambonz/schema 0.1.6 → 0.2.2
- package/AGENTS.md +3 -3
- package/README.md +1 -1
- package/callbacks/{pipeline-turn.schema.json → agent-turn.schema.json} +4 -4
- package/callbacks/call-status.schema.json +6 -1
- package/components/recognizer-assemblyAiOptions.schema.json +3 -3
- package/docs/guides/bridged-call-patterns.md +261 -0
- package/docs/guides/session-commands.md +51 -13
- package/docs/verbs/{pipeline.md → agent.md} +36 -36
- package/docs/verbs/dub.md +219 -0
- package/docs/verbs/transcribe.md +167 -0
- package/jambonz-app.schema.json +1 -1
- package/package.json +1 -1
- package/verbs/{pipeline.schema.json → agent.schema.json} +15 -15
- package/verbs/dub.schema.json +1 -1
- package/verbs/transcribe.schema.json +2 -1
package/AGENTS.md
CHANGED
|
@@ -35,7 +35,7 @@ The verb schemas and JSON structure are identical in both modes. The difference
|
|
|
35
35
|
- **Webhook**: Simple IVR, call routing, voicemail, basic gather-and-respond patterns.
|
|
36
36
|
- **WebSocket**: LLM-powered voice agents, real-time audio streaming, complex conversational flows, and anything requiring bidirectional communication, asynchronous logic, or streaming TTS.
|
|
37
37
|
|
|
38
|
-
**IMPORTANT**: Any application that uses a speech-to-speech verb (`openai_s2s`, `google_s2s`, `deepgram_s2s`, `ultravox_s2s`, `elevenlabs_s2s`, `s2s`, or `
|
|
38
|
+
**IMPORTANT**: Any application that uses a speech-to-speech verb (`openai_s2s`, `google_s2s`, `deepgram_s2s`, `ultravox_s2s`, `elevenlabs_s2s`, `s2s`, or `agent`) MUST use WebSocket transport, not webhooks. These verbs require persistent bidirectional communication for real-time audio and events.
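As a rough illustration of the rule above, an application could check its verb list before choosing a transport. The helper name `requiresWebSocket` and the `Verb` shape are hypothetical and not part of the schema package; only the verb names come from the rule itself.

```typescript
// Hypothetical helper (not part of @jambonz/schema): reports whether any verb
// in an application forces the WebSocket transport.
interface Verb {
  verb: string;
  [key: string]: unknown;
}

// Verb names copied from the rule above.
const WEBSOCKET_ONLY_VERBS = new Set<string>([
  'openai_s2s', 'google_s2s', 'deepgram_s2s', 'ultravox_s2s',
  'elevenlabs_s2s', 's2s', 'agent',
]);

function requiresWebSocket(app: Verb[]): boolean {
  return app.some((v) => WEBSOCKET_ONLY_VERBS.has(v.verb));
}

const ivrApp: Verb[] = [{ verb: 'say', text: 'Hello' }, { verb: 'gather' }];
const voiceAgentApp: Verb[] = [{ verb: 'agent' }];
console.log(requiresWebSocket(ivrApp));        // false
console.log(requiresWebSocket(voiceAgentApp)); // true
```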
|
|
39
39
|
|
|
40
40
|
## Schema
|
|
41
41
|
|
|
@@ -62,10 +62,10 @@ Two tools are available:
|
|
|
62
62
|
- **gather** — Collect speech (STT) and/or DTMF input. The workhorse for interactive menus and voice input.
|
|
63
63
|
|
|
64
64
|
### AI & Real-time
|
|
65
|
-
- **openai_s2s** / **google_s2s** / **deepgram_s2s** / **ultravox_s2s** — Connect the caller to a vendor-specific LLM for real-time voice conversation. These are the **preferred** verbs when the vendor is known. Each handles the full STT→LLM→TTS
|
|
65
|
+
- **openai_s2s** / **google_s2s** / **deepgram_s2s** / **ultravox_s2s** — Connect the caller to a vendor-specific LLM for real-time voice conversation. These are the **preferred** verbs when the vendor is known. Each handles the full STT→LLM→TTS flow with the vendor pre-set.
|
|
66
66
|
- **elevenlabs_s2s** — Connect the caller to an ElevenLabs Conversational AI agent. **Unlike other s2s vendors**, ElevenLabs requires a pre-configured `agent_id` (created in the ElevenLabs dashboard) rather than a model and messages. See [ElevenLabs S2S specifics](#elevenlabs-s2s-specifics) below.
|
|
67
67
|
- **s2s** — Generic LLM voice conversation verb. Use only when the vendor is determined at runtime (e.g. from an env var). Requires `vendor` to be specified.
|
|
68
|
-
- **
|
|
68
|
+
- **agent** — Higher-level voice AI agent with integrated turn detection. Mix-and-match STT, LLM, and TTS vendors.
|
|
69
69
|
- **dialogflow** — Connect the caller to a Google Dialogflow agent (ES, CX, or CES).
|
|
70
70
|
- **stream** — Stream raw audio to a websocket endpoint for custom processing.
|
|
71
71
|
- **transcribe** — Real-time call transcription sent to a webhook.
|
package/README.md
CHANGED
|
@@ -4,7 +4,7 @@ JSON Schema definitions and validation for jambonz verb applications.
|
|
|
4
4
|
|
|
5
5
|
## What's Included
|
|
6
6
|
|
|
7
|
-
- **33 verb schemas** (`verbs/`) -- every jambonz verb (say, gather, dial, openai_s2s,
|
|
7
|
+
- **33 verb schemas** (`verbs/`) -- every jambonz verb (say, gather, dial, openai_s2s, agent, etc.)
|
|
8
8
|
- **42 component schemas** (`components/`) -- shared types (synthesizer, recognizer, target, actionHook, etc.)
|
|
9
9
|
- **32 callback schemas** (`callbacks/`) -- actionHook payload definitions for each verb
|
|
10
10
|
- **AGENTS.md** -- language-agnostic developer guide covering the verb model, transport modes, and protocol
|
package/callbacks/{pipeline-turn.schema.json → agent-turn.schema.json}
|
@@ -1,8 +1,8 @@
|
|
|
1
1
|
{
|
|
2
2
|
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
|
3
|
-
"$id": "https://jambonz.org/schema/callbacks/
|
|
4
|
-
"title": "
|
|
5
|
-
"description": "Events sent to the
|
|
3
|
+
"$id": "https://jambonz.org/schema/callbacks/agent-turn",
|
|
4
|
+
"title": "Agent EventHook Events",
|
|
5
|
+
"description": "Events sent to the agent verb's eventHook during a conversation. These are sent as 'agent:event' messages over the WebSocket connection.",
|
|
6
6
|
"type": "object",
|
|
7
7
|
"oneOf": [
|
|
8
8
|
{
|
|
@@ -84,7 +84,7 @@
|
|
|
84
84
|
{
|
|
85
85
|
"properties": {
|
|
86
86
|
"type": {
|
|
87
|
-
"const": "
|
|
87
|
+
"const": "llm_response",
|
|
88
88
|
"description": "Sent when the LLM has finished generating its response for the current turn. Contains the complete response text."
|
|
89
89
|
},
|
|
90
90
|
"response": {
|
package/callbacks/call-status.schema.json
CHANGED
|
@@ -2,12 +2,17 @@
|
|
|
2
2
|
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
|
3
3
|
"$id": "https://jambonz.org/schema/callbacks/call-status",
|
|
4
4
|
"title": "Call Status Webhook Payload",
|
|
5
|
-
"description": "Payload sent to the call status webhook URL whenever the call state changes (e.g. trying, in-progress, completed). The status webhook is configured at the application level in jambonz. Multiple status events are sent over the life of a call. The final event (completed or failed) includes additional fields like duration and termination cause.",
|
|
5
|
+
"description": "Payload sent to the call status webhook URL whenever the call state changes (e.g. trying, in-progress, completed). The status webhook is configured at the application level in jambonz. Multiple status events are sent over the life of a call. The final event (completed or failed) includes additional fields like duration and termination cause.\n\n**Capturing B-leg call_sid:** When using the dial verb to bridge calls, status events are sent for both legs. The A-leg (original inbound call) has `direction: 'inbound'`. The B-leg (outbound dialed call) has `direction: 'outbound'`. To capture the B-leg's call_sid for later use (e.g., injecting commands to the B-leg), listen for status events where `direction === 'outbound'` and extract the `call_sid` field.",
|
|
6
6
|
"allOf": [
|
|
7
7
|
{ "$ref": "base" }
|
|
8
8
|
],
|
|
9
9
|
"type": "object",
|
|
10
10
|
"properties": {
|
|
11
|
+
"direction": {
|
|
12
|
+
"type": "string",
|
|
13
|
+
"enum": ["inbound", "outbound"],
|
|
14
|
+
"description": "Call direction. 'inbound' = A-leg (original incoming call to the application). 'outbound' = B-leg (call placed by the dial verb). Use this field to identify which leg generated the status event, especially when capturing the B-leg's call_sid for mid-call control."
|
|
15
|
+
},
|
|
11
16
|
"call_termination_by": {
|
|
12
17
|
"type": "string",
|
|
13
18
|
"enum": ["caller", "jambonz"],
|
package/components/recognizer-assemblyAiOptions.schema.json
CHANGED
|
@@ -28,15 +28,15 @@
|
|
|
28
28
|
},
|
|
29
29
|
"minEndOfTurnSilenceWhenConfident": {
|
|
30
30
|
"type": "number",
|
|
31
|
-
"description": "Minimum silence duration (
|
|
31
|
+
"description": "Minimum silence duration (milliseconds) to trigger end-of-turn when confidence is met. Default: 400."
|
|
32
32
|
},
|
|
33
33
|
"maxTurnSilence": {
|
|
34
34
|
"type": "number",
|
|
35
|
-
"description": "Maximum silence duration (
|
|
35
|
+
"description": "Maximum silence duration (milliseconds) before forcing end-of-turn. Default: 1280."
|
|
36
36
|
},
|
|
37
37
|
"minTurnSilence": {
|
|
38
38
|
"type": "number",
|
|
39
|
-
"description": "Minimum silence duration (
|
|
39
|
+
"description": "Minimum silence duration (milliseconds) before allowing end-of-turn."
|
|
40
40
|
},
|
|
41
41
|
"keyterms": {
|
|
42
42
|
"type": "array",
|
package/docs/guides/bridged-call-patterns.md
|
@@ -0,0 +1,261 @@
|
|
|
1
|
+
# Bridged Call Patterns
|
|
2
|
+
|
|
3
|
+
This guide covers common patterns for building applications that bridge two call legs (A-leg and B-leg) and need to interact with each party independently—such as real-time translation, call coaching, or call monitoring.
|
|
4
|
+
|
|
5
|
+
## Understanding A-leg and B-leg
|
|
6
|
+
|
|
7
|
+
When an inbound call arrives and your application uses the `dial` verb to connect the caller to another party:
|
|
8
|
+
|
|
9
|
+
- **A-leg**: The original inbound call (caller → jambonz)
|
|
10
|
+
- **B-leg**: The outbound call placed by the `dial` verb (jambonz → callee)
|
|
11
|
+
|
|
12
|
+
Each leg has its own `call_sid` identifier and can receive independent commands.
|
|
13
|
+
|
|
14
|
+
## Capturing the B-leg call_sid
|
|
15
|
+
|
|
16
|
+
Many patterns require knowing the B-leg's `call_sid` to inject commands to that leg. Capture it from `call:status` events:
|
|
17
|
+
|
|
18
|
+
```typescript
|
|
19
|
+
let dialCallSid: string;
|
|
20
|
+
|
|
21
|
+
session.on('call:status', (evt: Record<string, any>) => {
|
|
22
|
+
// B-leg events have direction === 'outbound'
|
|
23
|
+
if (evt.direction === 'outbound') {
|
|
24
|
+
dialCallSid = evt.call_sid;
|
|
25
|
+
console.log(`B-leg call_sid captured: ${dialCallSid}`);
|
|
26
|
+
}
|
|
27
|
+
});
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
**When it fires:** The `call:status` event with `direction: 'outbound'` fires when the dial verb initiates the B-leg call. You'll receive status updates for both legs throughout the call lifecycle.
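Because status events arrive for both legs, and the B-leg `call_sid` is unknown until the first outbound event, it can help to track both identifiers explicitly. The sketch below assumes only the `direction` and `call_sid` fields described above; `LegTracker` and `CallStatusEvent` are hypothetical names, not SDK types.

```typescript
// Hypothetical helper: remembers each leg's call_sid as status events arrive.
interface CallStatusEvent {
  call_sid: string;
  direction: 'inbound' | 'outbound';
}

class LegTracker {
  private aLegSid: string | undefined = undefined;
  private bLegSid: string | undefined = undefined;

  // Feed every call:status event through this method.
  handleStatus(evt: CallStatusEvent): void {
    if (evt.direction === 'outbound') {
      this.bLegSid = evt.call_sid;
    } else {
      this.aLegSid = evt.call_sid;
    }
  }

  // Returns undefined until the dial verb has started the B-leg.
  getDialCallSid(): string | undefined {
    return this.bLegSid;
  }

  getCallerCallSid(): string | undefined {
    return this.aLegSid;
  }
}

const legTracker = new LegTracker();
legTracker.handleStatus({ call_sid: 'a-leg-sid', direction: 'inbound' });
console.log(legTracker.getDialCallSid()); // undefined
legTracker.handleStatus({ call_sid: 'b-leg-sid', direction: 'outbound' });
console.log(legTracker.getDialCallSid()); // b-leg-sid
```

In a real handler, `session.on('call:status', (evt) => legTracker.handleStatus(evt))` would be the only wiring needed.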
|
|
31
|
+
|
|
32
|
+
## Setting Up Transcription on Both Legs
|
|
33
|
+
|
|
34
|
+
To transcribe both parties separately (e.g., for translation), configure transcription on each leg:
|
|
35
|
+
|
|
36
|
+
```typescript
|
|
37
|
+
session
|
|
38
|
+
// Transcribe A-leg (caller)
|
|
39
|
+
.transcribe({
|
|
40
|
+
transcriptionHook: '/transcription/caller',
|
|
41
|
+
channel: 1, // Near-end = caller's voice
|
|
42
|
+
recognizer: {
|
|
43
|
+
vendor: 'deepgram',
|
|
44
|
+
language: 'en-US',
|
|
45
|
+
deepgramOptions: { model: 'nova-2' }
|
|
46
|
+
}
|
|
47
|
+
})
|
|
48
|
+
// Bridge to B-leg with separate transcription
|
|
49
|
+
.dial({
|
|
50
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
51
|
+
transcribe: {
|
|
52
|
+
transcriptionHook: '/transcription/callee',
|
|
53
|
+
channel: 2, // Far-end from A-leg = callee's voice
|
|
54
|
+
recognizer: {
|
|
55
|
+
vendor: 'deepgram',
|
|
56
|
+
language: 'es-ES',
|
|
57
|
+
deepgramOptions: { model: 'nova-2' }
|
|
58
|
+
}
|
|
59
|
+
}
|
|
60
|
+
})
|
|
61
|
+
.send();
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
### Channel Values Explained
|
|
65
|
+
|
|
66
|
+
| Location | Channel | What it captures |
|
|
67
|
+
|----------|---------|------------------|
|
|
68
|
+
| A-leg transcribe | `1` (near-end) | Caller's voice |
|
|
69
|
+
| A-leg transcribe | `2` (far-end) | What caller hears (including B-leg) |
|
|
70
|
+
| dial.transcribe | `2` (far-end) | Callee's voice (B-leg inbound audio) |
|
|
71
|
+
| dial.transcribe | (omitted) | Both parties mixed |
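
The table can also be encoded as a small lookup, which is handy for logging or validating configuration. This is only a sketch: `capturedAudio` and `TranscribeSite` are hypothetical names, and combinations the table does not list return `undefined` rather than a guess.

```typescript
// Hypothetical lookup that mirrors the channel table above.
type TranscribeSite = 'a-leg' | 'dial';

function capturedAudio(site: TranscribeSite, channel?: 1 | 2): string | undefined {
  if (site === 'a-leg' && channel === 1) return "caller's voice";
  if (site === 'a-leg' && channel === 2) return 'what the caller hears (including B-leg)';
  if (site === 'dial' && channel === 2) return "callee's voice (B-leg inbound audio)";
  if (site === 'dial' && channel === undefined) return 'both parties mixed';
  return undefined; // not covered by the table
}

console.log(capturedAudio('a-leg', 1)); // caller's voice
console.log(capturedAudio('dial'));     // both parties mixed
```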
|
|
72
|
+
|
|
73
|
+
### Identifying the Speaker in Transcription Events
|
|
74
|
+
|
|
75
|
+
With a separate hook per leg, the hook path itself identifies who spoke:
|
|
76
|
+
|
|
77
|
+
```typescript
|
|
78
|
+
session.on('/transcription/caller', (evt: Record<string, any>) => {
|
|
79
|
+
if (evt.speech?.is_final) {
|
|
80
|
+
const transcript = evt.speech.alternatives[0].transcript;
|
|
81
|
+
handleCallerSpeech(transcript);
|
|
82
|
+
}
|
|
83
|
+
});
|
|
84
|
+
|
|
85
|
+
session.on('/transcription/callee', (evt: Record<string, any>) => {
|
|
86
|
+
if (evt.speech?.is_final) {
|
|
87
|
+
const transcript = evt.speech.alternatives[0].transcript;
|
|
88
|
+
handleCalleeSpeech(transcript);
|
|
89
|
+
}
|
|
90
|
+
});
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
Or with a single hook, use `call_sid` to differentiate:
|
|
94
|
+
|
|
95
|
+
```typescript
|
|
96
|
+
session.on('/transcription', (evt: Record<string, any>) => {
|
|
97
|
+
if (!evt.speech?.is_final) return;
|
|
98
|
+
|
|
99
|
+
const transcript = evt.speech.alternatives[0].transcript;
|
|
100
|
+
const speaker = evt.call_sid === dialCallSid ? 'callee' : 'caller';
|
|
101
|
+
|
|
102
|
+
console.log(`${speaker}: ${transcript}`);
|
|
103
|
+
});
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
## Creating Dub Tracks for Audio Injection
|
|
107
|
+
|
|
108
|
+
To inject audio to each party (e.g., translated speech), create tracks on each leg:
|
|
109
|
+
|
|
110
|
+
```typescript
|
|
111
|
+
session
|
|
112
|
+
// Track on A-leg — caller hears it
|
|
113
|
+
.dub({ action: 'addTrack', track: 'caller-audio' })
|
|
114
|
+
.dial({
|
|
115
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
116
|
+
// Track on B-leg — callee hears it
|
|
117
|
+
dub: [
|
|
118
|
+
{ action: 'addTrack', track: 'callee-audio' }
|
|
119
|
+
]
|
|
120
|
+
})
|
|
121
|
+
.send();
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
**Track routing rule:** Tracks are heard by the party on whose call leg they exist.
|
|
125
|
+
|
|
126
|
+
## Injecting Commands to Specific Legs
|
|
127
|
+
|
|
128
|
+
### Default: Commands go to A-leg
|
|
129
|
+
|
|
130
|
+
```typescript
|
|
131
|
+
session.injectCommand('dub', {
|
|
132
|
+
action: 'sayOnTrack',
|
|
133
|
+
track: 'caller-audio',
|
|
134
|
+
say: 'This message is for the caller'
|
|
135
|
+
});
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
### Targeting B-leg: Pass call_sid as third argument
|
|
139
|
+
|
|
140
|
+
```typescript
|
|
141
|
+
session.injectCommand('dub', {
|
|
142
|
+
action: 'sayOnTrack',
|
|
143
|
+
track: 'callee-audio',
|
|
144
|
+
say: 'This message is for the callee'
|
|
145
|
+
}, dialCallSid); // Third argument routes to B-leg
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
**Important:** The `call_sid` is passed as a separate third argument, not inside the data object.
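One way to keep the argument order straight, and to avoid sending a B-leg command before its `call_sid` is known, is a thin wrapper. The sketch below models only the `injectCommand(verb, data, callSid?)` call shape used in this guide; `CommandInjector`, `sayOnLeg`, and the stub session are hypothetical, and the track names follow the `caller-audio` / `callee-audio` convention from the examples above.

```typescript
// Minimal stand-in for the session object: only the call shape used in this
// guide (verb, data, optional call_sid) is modeled.
interface CommandInjector {
  injectCommand(verb: string, data: Record<string, unknown>, callSid?: string): void;
}

type Leg = 'caller' | 'callee';

// Hypothetical wrapper: speaks on the named leg's dub track and refuses to
// target the B-leg before its call_sid has been captured.
function sayOnLeg(
  session: CommandInjector,
  leg: Leg,
  text: string,
  dialCallSid?: string,
): boolean {
  const data = { action: 'sayOnTrack', track: `${leg}-audio`, say: text };
  if (leg === 'caller') {
    session.injectCommand('dub', data); // no third argument: A-leg
    return true;
  }
  if (!dialCallSid) return false; // B-leg not known yet, nothing is sent
  session.injectCommand('dub', data, dialCallSid); // third argument: B-leg
  return true;
}

// Stub session that records what would have been sent.
const sentCommands: Array<{ verb: string; callSid?: string }> = [];
const stubSession: CommandInjector = {
  injectCommand: (verb, _data, callSid) => {
    sentCommands.push({ verb, callSid });
  },
};

sayOnLeg(stubSession, 'callee', 'Hola');              // returns false, nothing sent
sayOnLeg(stubSession, 'callee', 'Hola', 'b-leg-sid'); // sent with the B-leg call_sid
```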
|
|
149
|
+
|
|
150
|
+
## Complete Example: Real-Time Translator
|
|
151
|
+
|
|
152
|
+
This example bridges an English caller with a Spanish-speaking callee, providing real-time translation in both directions.
|
|
153
|
+
|
|
154
|
+
```typescript
|
|
155
|
+
import { createEndpoint, Session } from '@jambonz/sdk/websocket';
|
|
156
|
+
import http from 'http';
|
|
157
|
+
|
|
158
|
+
const server = http.createServer();
|
|
159
|
+
const makeService = createEndpoint({ server, port: 3000 });
|
|
160
|
+
const svc = makeService({ path: '/' });
|
|
161
|
+
|
|
162
|
+
svc.on('session:new', (session: Session) => {
|
|
163
|
+
let dialCallSid: string;
|
|
164
|
+
|
|
165
|
+
// Capture B-leg call_sid
|
|
166
|
+
session.on('call:status', (evt: Record<string, any>) => {
|
|
167
|
+
if (evt.direction === 'outbound') {
|
|
168
|
+
dialCallSid = evt.call_sid;
|
|
169
|
+
}
|
|
170
|
+
});
|
|
171
|
+
|
|
172
|
+
// Handle caller's speech (English → Spanish for callee)
|
|
173
|
+
session.on('/transcription/caller', async (evt: Record<string, any>) => {
|
|
174
|
+
session.reply();
|
|
175
|
+
if (!evt.speech?.is_final) return;
|
|
176
|
+
|
|
177
|
+
const transcript = evt.speech.alternatives[0].transcript;
|
|
178
|
+
const translated = await translateToSpanish(transcript);
|
|
179
|
+
|
|
180
|
+
// Inject to B-leg track (callee hears it)
|
|
181
|
+
session.injectCommand('dub', {
|
|
182
|
+
action: 'sayOnTrack',
|
|
183
|
+
track: 'callee-audio',
|
|
184
|
+
say: {
|
|
185
|
+
text: translated,
|
|
186
|
+
synthesizer: { vendor: 'elevenlabs', language: 'es-ES' }
|
|
187
|
+
}
|
|
188
|
+
}, dialCallSid);
|
|
189
|
+
});
|
|
190
|
+
|
|
191
|
+
// Handle callee's speech (Spanish → English for caller)
|
|
192
|
+
session.on('/transcription/callee', async (evt: Record<string, any>) => {
|
|
193
|
+
session.reply();
|
|
194
|
+
if (!evt.speech?.is_final) return;
|
|
195
|
+
|
|
196
|
+
const transcript = evt.speech.alternatives[0].transcript;
|
|
197
|
+
const translated = await translateToEnglish(transcript);
|
|
198
|
+
|
|
199
|
+
// Inject to A-leg track (caller hears it)
|
|
200
|
+
session.injectCommand('dub', {
|
|
201
|
+
action: 'sayOnTrack',
|
|
202
|
+
track: 'caller-audio',
|
|
203
|
+
say: {
|
|
204
|
+
text: translated,
|
|
205
|
+
synthesizer: { vendor: 'elevenlabs', language: 'en-US' }
|
|
206
|
+
}
|
|
207
|
+
});
|
|
208
|
+
});
|
|
209
|
+
|
|
210
|
+
// Set up the call
|
|
211
|
+
session
|
|
212
|
+
.dub({ action: 'addTrack', track: 'caller-audio' })
|
|
213
|
+
.transcribe({
|
|
214
|
+
transcriptionHook: '/transcription/caller',
|
|
215
|
+
channel: 1,
|
|
216
|
+
recognizer: { vendor: 'deepgram', language: 'en-US' }
|
|
217
|
+
})
|
|
218
|
+
.dial({
|
|
219
|
+
target: [{ type: 'phone', number: process.env.CALLEE_NUMBER! }],
|
|
220
|
+
dub: [
|
|
221
|
+
{ action: 'addTrack', track: 'callee-audio' }
|
|
222
|
+
],
|
|
223
|
+
transcribe: {
|
|
224
|
+
transcriptionHook: '/transcription/callee',
|
|
225
|
+
channel: 2,
|
|
226
|
+
recognizer: { vendor: 'deepgram', language: 'es-ES' }
|
|
227
|
+
}
|
|
228
|
+
})
|
|
229
|
+
.send();
|
|
230
|
+
});
|
|
231
|
+
|
|
232
|
+
async function translateToSpanish(text: string): Promise<string> {
|
|
233
|
+
// Implement translation logic
|
|
234
|
+
return text;
|
|
235
|
+
}
|
|
236
|
+
|
|
237
|
+
async function translateToEnglish(text: string): Promise<string> {
|
|
238
|
+
// Implement translation logic
|
|
239
|
+
return text;
|
|
240
|
+
}
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
## Summary: Key Patterns
|
|
244
|
+
|
|
245
|
+
| Pattern | Implementation |
|
|
246
|
+
|---------|----------------|
|
|
247
|
+
| Capture B-leg call_sid | Listen for `call:status` where `direction === 'outbound'` |
|
|
248
|
+
| Transcribe caller | `transcribe` with `channel: 1` on A-leg |
|
|
249
|
+
| Transcribe callee | `transcribe` with `channel: 2` nested in `dial` |
|
|
250
|
+
| Audio track for caller | `dub` with `addTrack` on A-leg |
|
|
251
|
+
| Audio track for callee | `dub` with `addTrack` in `dial.dub` array |
|
|
252
|
+
| Inject to A-leg | `session.injectCommand(verb, data)` |
|
|
253
|
+
| Inject to B-leg | `session.injectCommand(verb, data, dialCallSid)` |
|
|
254
|
+
|
|
255
|
+
## See Also
|
|
256
|
+
|
|
257
|
+
- `docs/verbs/transcribe.md` - Transcribe verb usage guide
|
|
258
|
+
- `docs/verbs/dub.md` - Dub verb usage guide
|
|
259
|
+
- `docs/guides/session-commands.md` - Session commands reference
|
|
260
|
+
- `callback:call-status` - Call status webhook schema
|
|
261
|
+
- `verb:dial` - Dial verb schema
|
package/docs/guides/session-commands.md
CHANGED
|
@@ -143,16 +143,54 @@ For any command not covered by a specific method:
|
|
|
143
143
|
session.injectCommand('commandName', { ...data });
|
|
144
144
|
```
|
|
145
145
|
|
|
146
|
-
|
|
146
|
+
### Targeting Specific Call Legs
|
|
147
147
|
|
|
148
|
-
|
|
148
|
+
When bridging calls with the `dial` verb, you may need to inject commands to a specific call leg (A-leg or B-leg). By default, `injectCommand` targets the current session's call (A-leg). To target the B-leg (or any other call), pass the target `call_sid` as the third argument.
|
|
149
|
+
|
|
150
|
+
**Capturing the B-leg call_sid:**
|
|
151
|
+
|
|
152
|
+
Listen for `call:status` events with `direction === 'outbound'` to capture the B-leg's call_sid:
|
|
153
|
+
|
|
154
|
+
```typescript
|
|
155
|
+
let dialCallSid: string;
|
|
156
|
+
|
|
157
|
+
session.on('call:status', (evt: Record<string, any>) => {
|
|
158
|
+
if (evt.direction === 'outbound') {
|
|
159
|
+
dialCallSid = evt.call_sid;
|
|
160
|
+
}
|
|
161
|
+
});
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
**Injecting commands to specific legs:**
|
|
165
|
+
|
|
166
|
+
```typescript
|
|
167
|
+
// Target A-leg (current session) — omit the third argument:
|
|
168
|
+
session.injectCommand('dub', {
|
|
169
|
+
action: 'sayOnTrack',
|
|
170
|
+
track: 'caller-track',
|
|
171
|
+
say: 'Message for the caller'
|
|
172
|
+
});
|
|
173
|
+
|
|
174
|
+
// Target B-leg — pass call_sid as third argument:
|
|
175
|
+
session.injectCommand('dub', {
|
|
176
|
+
action: 'sayOnTrack',
|
|
177
|
+
track: 'callee-track',
|
|
178
|
+
say: 'Message for the callee'
|
|
179
|
+
}, dialCallSid);
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
This pattern is essential for applications like real-time translation, where you need to inject translated speech to each party separately based on which leg's audio was transcribed.
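For a translator, speech transcribed on one leg is played on the other. The routing decision can be sketched as a pure function; `targetForTranscript` and `InjectTarget` are hypothetical names, the track names match the example above, and comparing the event's `call_sid` with the captured B-leg `call_sid` is an assumption about how the speaker is identified.

```typescript
// Hypothetical routing helper: audio transcribed on one leg is spoken on the
// other leg.
interface InjectTarget {
  track: string;
  callSid?: string; // undefined means "omit the third argument" (A-leg)
}

function targetForTranscript(transcribedCallSid: string, dialCallSid: string): InjectTarget {
  const spokenByCallee = transcribedCallSid === dialCallSid;
  return spokenByCallee
    ? { track: 'caller-track' }                        // callee spoke: play to the A-leg
    : { track: 'callee-track', callSid: dialCallSid }; // caller spoke: play to the B-leg
}

const calleeSpoke = targetForTranscript('b-leg-sid', 'b-leg-sid');
const callerSpoke = targetForTranscript('a-leg-sid', 'b-leg-sid');
console.log(calleeSpoke.track); // caller-track
console.log(callerSpoke.track); // callee-track
```

The result maps directly onto `session.injectCommand('dub', { action: 'sayOnTrack', track, say }, callSid)`.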
|
|
183
|
+
|
|
184
|
+
## Agent Update
|
|
185
|
+
|
|
186
|
+
The `updateAgent()` method sends mid-conversation updates to an active `agent` verb. Four operation types are supported:
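The four operations can be modeled as a discriminated union so the compiler checks each payload. This is a sketch inferred from the examples in this guide, not the SDK's own types: the type names are hypothetical, which fields are required is an assumption, and the tool definition shape is deliberately left open.

```typescript
// Sketch of the four update payloads; field names follow this guide's examples.
interface UpdateInstructions {
  type: 'update_instructions';
  instructions: string;
}
interface InjectContext {
  type: 'inject_context';
  messages: Array<{ role: string; content: string }>;
}
interface UpdateTools {
  type: 'update_tools';
  tools: Array<Record<string, unknown>>; // tool schema not modeled here
}
interface GenerateReply {
  type: 'generate_reply';
  user_input: string;
  instructions?: string; // one-shot instructions
  interrupt?: boolean;   // cancel the current response first
}

type AgentUpdate = UpdateInstructions | InjectContext | UpdateTools | GenerateReply;

// The switch is exhaustive, so adding a fifth operation becomes a compile error here.
function describeUpdate(update: AgentUpdate): string {
  switch (update.type) {
    case 'update_instructions':
      return 'replace the system prompt';
    case 'inject_context':
      return `append ${update.messages.length} message(s) to the history`;
    case 'update_tools':
      return `replace the tool set with ${update.tools.length} tool(s)`;
    case 'generate_reply':
      return update.interrupt ? 'interrupt and generate now' : 'generate a reply (queued if busy)';
  }
}

const overrideUpdate: AgentUpdate = {
  type: 'generate_reply',
  user_input: 'Urgent: supervisor override',
  interrupt: true,
};
console.log(describeUpdate(overrideUpdate)); // interrupt and generate now
```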
|
|
149
187
|
|
|
150
188
|
### Update Instructions
|
|
151
189
|
|
|
152
190
|
Replace the LLM system prompt while the conversation is in progress:
|
|
153
191
|
|
|
154
192
|
```typescript
|
|
155
|
-
session.
|
|
193
|
+
session.updateAgent({
|
|
156
194
|
type: 'update_instructions',
|
|
157
195
|
instructions: 'You are now a billing support agent. Help the caller with invoice questions.',
|
|
158
196
|
});
|
|
@@ -163,7 +201,7 @@ session.updatePipeline({
|
|
|
163
201
|
Append messages to the LLM conversation history (e.g. CRM data retrieved after the call started):
|
|
164
202
|
|
|
165
203
|
```typescript
|
|
166
|
-
session.
|
|
204
|
+
session.updateAgent({
|
|
167
205
|
type: 'inject_context',
|
|
168
206
|
messages: [
|
|
169
207
|
{ role: 'system', content: 'Customer account #12345: Gold tier, 3 open tickets.' },
|
|
@@ -176,7 +214,7 @@ session.updatePipeline({
|
|
|
176
214
|
Replace the tool set available to the LLM:
|
|
177
215
|
|
|
178
216
|
```typescript
|
|
179
|
-
session.
|
|
217
|
+
session.updateAgent({
|
|
180
218
|
type: 'update_tools',
|
|
181
219
|
tools: [
|
|
182
220
|
{
|
|
@@ -193,24 +231,24 @@ session.updatePipeline({
|
|
|
193
231
|
|
|
194
232
|
### Generate Reply
|
|
195
233
|
|
|
196
|
-
Prompt the LLM to generate a new response. If the
|
|
234
|
+
Prompt the LLM to generate a new response. If the agent is not idle, the request is queued and executes when the current turn completes. Use `interrupt: true` to cancel the current response and generate immediately.
|
|
197
235
|
|
|
198
236
|
```typescript
|
|
199
237
|
// Simple prompt
|
|
200
|
-
session.
|
|
238
|
+
session.updateAgent({
|
|
201
239
|
type: 'generate_reply',
|
|
202
240
|
user_input: 'The customer just entered their account number: 12345',
|
|
203
241
|
});
|
|
204
242
|
|
|
205
243
|
// With one-shot instructions
|
|
206
|
-
session.
|
|
244
|
+
session.updateAgent({
|
|
207
245
|
type: 'generate_reply',
|
|
208
246
|
user_input: 'Customer is asking about refunds',
|
|
209
247
|
instructions: 'Be empathetic and offer a 20% discount before processing a refund.',
|
|
210
248
|
});
|
|
211
249
|
|
|
212
250
|
// Interrupt current response and generate a new one
|
|
213
|
-
session.
|
|
251
|
+
session.updateAgent({
|
|
214
252
|
type: 'generate_reply',
|
|
215
253
|
user_input: 'Urgent: supervisor override',
|
|
216
254
|
interrupt: true,
|
|
@@ -219,7 +257,7 @@ session.updatePipeline({
|
|
|
219
257
|
|
|
220
258
|
## LLM Tool Output
|
|
221
259
|
|
|
222
|
-
When using the `
|
|
260
|
+
When using the `agent` verb with a `toolHook`, tool call requests arrive as events. Return results with:
|
|
223
261
|
|
|
224
262
|
```typescript
|
|
225
263
|
session.on('/tool-hook', (evt: Record<string, any>) => {
|
|
@@ -237,7 +275,7 @@ The result is stringified and fed back to the LLM as the tool response.
|
|
|
237
275
|
|
|
238
276
|
## Building a Cascaded Voice AI Agent
|
|
239
277
|
|
|
240
|
-
The **
|
|
278
|
+
The **agent** verb is the simplest way to build a voice AI agent — jambonz manages everything. But when you need full control over the LLM interaction (custom tool handling, conversation history management, multiple LLM providers, etc.), build a **cascaded agent**: your app handles STT transcripts and LLM calls directly, piping responses back via TTS token streaming.
|
|
241
279
|
|
|
242
280
|
### Architecture
|
|
243
281
|
|
|
@@ -257,9 +295,9 @@ User speaks again → bargeIn fires → repeat
|
|
|
257
295
|
|
|
258
296
|
The key mechanism is the `bargeIn` actionHook on the `config` verb. When enabled with `sticky: true`, it persists across all verbs. Whenever the caller speaks, the `/speech-detected` hook fires with the speech transcript — even while TTS is playing (which triggers an interruption). Your app then calls the LLM and streams the response back.
|
|
259
297
|
|
|
260
|
-
### When to Use Cascaded vs
|
|
298
|
+
### When to Use Cascaded vs Agent
|
|
261
299
|
|
|
262
|
-
| |
|
|
300
|
+
| | Agent verb | Cascaded agent |
|
|
263
301
|
|---|---|---|
|
|
264
302
|
| **STT/LLM/TTS** | jambonz orchestrates all three | App owns the LLM; jambonz handles STT and TTS |
|
|
265
303
|
| **Turn detection** | Built-in (Krisp or STT-native) | App manages via bargeIn actionHook |
|
package/docs/verbs/{pipeline.md → agent.md}
|
@@ -1,8 +1,8 @@
|
|
|
1
1
|
## Overview
|
|
2
2
|
|
|
3
|
-
The
|
|
3
|
+
The agent verb orchestrates a complete voice AI agent by wiring together three separate components — STT, LLM, and TTS — with integrated turn detection. Unlike the s2s verbs (where a single vendor handles everything), the agent verb lets you mix and match: e.g. Deepgram for STT, Anthropic for the LLM, and Cartesia for TTS.
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
The agent manages the full conversational turn cycle:
|
|
6
6
|
1. User speaks → STT produces a transcript
|
|
7
7
|
2. Turn detection decides the user is done speaking
|
|
8
8
|
3. Transcript is sent to the LLM
|
|
@@ -12,7 +12,7 @@ Pipeline manages the full conversational turn cycle:
|
|
|
12
12
|
|
|
13
13
|
## Turn detection
|
|
14
14
|
|
|
15
|
-
The `turnDetection` property controls how the
|
|
15
|
+
The `turnDetection` property controls how the agent decides the user has finished speaking.
|
|
16
16
|
|
|
17
17
|
**`"stt"` (default)** — Uses the STT vendor's native end-of-utterance signal. For most vendors this is silence-based. Some vendors have smarter built-in turn detection:
|
|
18
18
|
- **deepgramflux** — Acoustic + semantic turn detection (Deepgram's "Flux" model)
|
|
@@ -105,15 +105,15 @@ The `eventHook` receives real-time events during the conversation. In WebSocket
|
|
|
105
105
|
| Event type | Description | Key fields |
|
|
106
106
|
|---|---|---|
|
|
107
107
|
| `user_transcript` | User speech recognized | `transcript` |
|
|
108
|
-
| `
|
|
108
|
+
| `llm_response` | Assistant reply text | `response` |
|
|
109
109
|
| `user_interruption` | User barged in | — |
|
|
110
110
|
| `turn_end` | End-of-turn summary | `transcript`, `response`, `interrupted`, `latency` |
|
|
111
111
|
|
|
112
|
-
The `turn_end` event is the most useful for observability. It includes per-component latency metrics (STT, LLM, TTS) in milliseconds. See the `callback:
|
|
112
|
+
The `turn_end` event is the most useful for observability. It includes per-component latency metrics (STT, LLM, TTS) in milliseconds. See the `callback:agent-turn` schema for the full payload structure.
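A handler for these events might narrow on the `type` field. The sketch below models only the key fields from the table above. `AgentEvent` and `summarizeAgentEvent` are hypothetical names, `interrupted` is assumed to be a boolean, and the keys inside `latency` are an assumption (the `callback:agent-turn` schema is authoritative).

```typescript
// Sketch of the eventHook payloads listed in the table above.
type AgentEvent =
  | { type: 'user_transcript'; transcript: string }
  | { type: 'llm_response'; response: string }
  | { type: 'user_interruption' }
  | {
      type: 'turn_end';
      transcript: string;
      response: string;
      interrupted: boolean;            // assumed boolean
      latency: Record<string, number>; // milliseconds per component; key names assumed
    };

function summarizeAgentEvent(evt: AgentEvent): string {
  switch (evt.type) {
    case 'user_transcript':
      return `user: ${evt.transcript}`;
    case 'llm_response':
      return `assistant: ${evt.response}`;
    case 'user_interruption':
      return 'user barged in';
    case 'turn_end': {
      const latency = evt.latency;
      const parts = Object.keys(latency).map((key) => `${key}=${latency[key]}ms`);
      return `turn ended (interrupted=${evt.interrupted}): ${parts.join(', ')}`;
    }
  }
}

console.log(summarizeAgentEvent({ type: 'llm_response', response: 'Hello!' })); // assistant: Hello!
```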
|
|
113
113
|
|
|
114
114
|
## toolHook (function calling)
|
|
115
115
|
|
|
116
|
-
When the LLM requests a tool/function call, the
|
|
116
|
+
When the LLM requests a tool/function call, the agent sends a request to the `toolHook` with:
|
|
117
117
|
|
|
118
118
|
```json
|
|
119
119
|
{
|
|
@@ -131,11 +131,11 @@ The `arguments` field is already parsed (an object, not a JSON string).
|
|
|
131
131
|
|
|
132
132
|
## MCP servers (external tools)
|
|
133
133
|
|
|
134
|
-
Instead of (or in addition to) defining tools inline via `llmOptions.tools` and handling them with `toolHook`, you can connect to external MCP servers. The
|
|
134
|
+
Instead of (or in addition to) defining tools inline via `llmOptions.tools` and handling them with `toolHook`, you can connect to external MCP servers. The agent connects to each server at startup via SSE transport, discovers available tools, and makes them available to the LLM alongside any inline tools.
|
|
135
135
|
|
|
136
136
|
```json
|
|
137
137
|
{
|
|
138
|
-
"verb": "
|
|
138
|
+
"verb": "agent",
|
|
139
139
|
"mcpServers": [
|
|
140
140
|
{
|
|
141
141
|
"url": "https://livescoremcp.com/sse"
|
|
@@ -155,7 +155,7 @@ Instead of (or in addition to) defining tools inline via `llmOptions.tools` and
|
|
|
155
155
|
}
|
|
156
156
|
```
|
|
157
157
|
|
|
158
|
-
The [LiveScore MCP server](https://livescoremcp.com/) is a free, public MCP server that exposes tools for live football scores, fixtures, team stats, and player data. The
|
|
158
|
+
The [LiveScore MCP server](https://livescoremcp.com/) is a free, public MCP server that exposes tools for live football scores, fixtures, team stats, and player data. The agent discovers these tools automatically at startup — no need to define tool schemas in `llmOptions.tools`. A caller can simply ask "what football matches are on right now?" and the LLM will use the `get_live_scores` tool to fetch real-time data.
|
|
159
159
|
|
|
160
160
|
If an MCP server requires authentication, pass credentials in the `auth` property:
|
|
161
161
|
|
|
@@ -172,13 +172,13 @@ If an MCP server requires authentication, pass credentials in the `auth` propert
|
|
|
172
172
|
}
|
|
173
173
|
```
|
|
174
174
|
|
|
175
|
-
**How tool dispatch works**: When the LLM requests a tool call, the
|
|
175
|
+
**How tool dispatch works**: When the LLM requests a tool call, the agent checks MCP servers first. If the tool name matches one discovered from an MCP server, the call is dispatched there directly and the result is fed back to the LLM. If no MCP server provides the tool, it falls through to the `toolHook` webhook. You can use both MCP servers and `toolHook` together — MCP handles the tools it knows about, and `toolHook` handles the rest.
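The dispatch order described above can be modeled as a small pure function. This is an illustrative model of the documented behavior, not jambonz's implementation; `resolveToolRoute` and its parameters are hypothetical.

```typescript
// Illustrative model of the dispatch order: MCP servers first, then toolHook.
type ToolRoute = 'mcp' | 'toolHook' | 'unhandled';

function resolveToolRoute(
  toolName: string,
  mcpToolNames: ReadonlySet<string>, // tool names discovered from MCP servers at startup
  hasToolHook: boolean,              // whether the verb was configured with a toolHook
): ToolRoute {
  if (mcpToolNames.has(toolName)) return 'mcp'; // an MCP server provides the tool
  if (hasToolHook) return 'toolHook';           // otherwise fall through to the webhook
  return 'unhandled';                           // nothing can service the call
}

const discoveredTools = new Set<string>(['get_live_scores']);
console.log(resolveToolRoute('get_live_scores', discoveredTools, true)); // mcp
console.log(resolveToolRoute('lookup_account', discoveredTools, true));  // toolHook
```

Here `lookup_account` stands in for any inline tool that no MCP server provides.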
|
|
176
176
|
|
|
177
|
-
**TypeScript example** —
|
|
177
|
+
**TypeScript example** — an agent with the LiveScore MCP server:
|
|
178
178
|
|
|
179
179
|
```typescript
|
|
180
180
|
session
|
|
181
|
-
.
|
|
181
|
+
.agent({
|
|
182
182
|
stt: { vendor: 'deepgram', language: 'en-US' },
|
|
183
183
|
tts: { vendor: 'cartesia', voice: 'sonic-english' },
|
|
184
184
|
llm: {
|
|
@@ -196,18 +196,18 @@ session
|
|
|
196
196
|
// { url: 'https://mcp.example.com/sse', auth: { apiKey: 'your-key' } },
|
|
197
197
|
],
|
|
198
198
|
turnDetection: 'krisp',
|
|
199
|
-
actionHook: '/
|
|
199
|
+
actionHook: '/agent-complete',
|
|
200
200
|
})
|
|
201
201
|
.send();
|
|
202
202
|
```
|
|
203
203
|
|
|
204
204
|
## Mid-conversation updates
|
|
205
205
|
|
|
206
|
-
The
|
|
206
|
+
The agent supports asynchronous updates while a conversation is in progress. These let you change the agent's behavior, inject new context, modify available tools, or trigger a new LLM response — without interrupting the current verb stack.
|
|
207
207
|
|
|
208
208
|
Updates can be sent via:
|
|
209
|
-
- **WebSocket**: `session.
|
|
210
|
-
- **REST API**: `client.calls.
|
|
209
|
+
- **WebSocket**: `session.updateAgent(data)` (sends an `agent:update` command)
|
|
210
|
+
- **REST API**: `client.calls.updateAgent(callSid, data)` (sends `agent_update` in the PUT body)
|
|
211
211
|
|
|
212
212
|
### update_instructions
|
|
213
213
|
|
|
@@ -215,13 +215,13 @@ Replace the LLM system prompt mid-conversation. Useful when the conversation tra
|
|
|
215
215
|
|
|
216
216
|
```typescript
|
|
217
217
|
// WebSocket
|
|
218
|
-
session.
|
|
218
|
+
session.updateAgent({
|
|
219
219
|
type: 'update_instructions',
|
|
220
220
|
instructions: 'You are now a billing support agent. Help the caller with invoice questions.',
|
|
221
221
|
});
|
|
222
222
|
|
|
223
223
|
// REST
|
|
224
|
-
await client.calls.
|
|
224
|
+
await client.calls.updateAgent(callSid, {
|
|
225
225
|
type: 'update_instructions',
|
|
226
226
|
instructions: 'You are now a billing support agent. Help the caller with invoice questions.',
|
|
227
227
|
});
|
|
@@ -232,7 +232,7 @@ await client.calls.updatePipeline(callSid, {
|
|
|
232
232
|
Append messages to the LLM conversation history. Useful for injecting CRM data, call notes, or other context retrieved after the call started.
|
|
233
233
|
|
|
234
234
|
```typescript
|
|
235
|
-
session.
|
|
235
|
+
session.updateAgent({
|
|
236
236
|
type: 'inject_context',
|
|
237
237
|
messages: [
|
|
238
238
|
{ role: 'system', content: 'Customer account #12345: Gold tier, 3 open tickets.' },
|
|
@@ -245,7 +245,7 @@ session.updatePipeline({
|
|
|
245
245
|
Replace the tool set available to the LLM. The new tools take effect on the next LLM turn.
|
|
246
246
|
|
|
247
247
|
```typescript
|
|
248
|
-
session.
|
|
248
|
+
session.updateAgent({
|
|
249
249
|
type: 'update_tools',
|
|
250
250
|
tools: [
|
|
251
251
|
{
|
|
@@ -262,26 +262,26 @@ session.updatePipeline({
|
|
|
262
262
|
|
|
263
263
|
### generate_reply
|
|
264
264
|
|
|
265
|
-
Prompt the LLM to generate a new response. If the
|
|
265
|
+
Prompt the LLM to generate a new response. If the agent is currently idle, the prompt executes immediately. If the agent is busy (e.g. the assistant is speaking), the request is queued and executes when the current turn completes.
|
|
266
266
|
|
|
267
267
|
Use `interrupt: true` to cancel the current response and generate immediately — useful for supervisor overrides or urgent context changes.
|
|
268
268
|
|
|
269
269
|
```typescript
|
|
270
270
|
// Simple prompt
|
|
271
|
-
session.
|
|
271
|
+
session.updateAgent({
|
|
272
272
|
type: 'generate_reply',
|
|
273
273
|
user_input: 'The customer just entered their account number: 12345',
|
|
274
274
|
});
|
|
275
275
|
|
|
276
276
|
// With one-shot instructions
|
|
277
|
-
session.
|
|
277
|
+
session.updateAgent({
|
|
278
278
|
type: 'generate_reply',
|
|
279
279
|
user_input: 'Customer is asking about refunds',
|
|
280
280
|
instructions: 'Be empathetic and offer a 20% discount before processing a refund.',
|
|
281
281
|
});
|
|
282
282
|
|
|
283
283
|
// Interrupt current response
|
|
284
|
-
session.
|
|
284
|
+
session.updateAgent({
|
|
285
285
|
type: 'generate_reply',
|
|
286
286
|
user_input: 'Urgent: supervisor override',
|
|
287
287
|
interrupt: true,
|
|
@@ -326,11 +326,11 @@ For Anthropic models, use `"vendor": "anthropic"` and structure messages accordi
|
|
|
326
326
|
|
|
327
327
|
## Greeting
|
|
328
328
|
|
|
329
|
-
By default (`greeting: true`), the
|
|
329
|
+
By default (`greeting: true`), the agent prompts the LLM to generate an initial greeting before the user speaks. Set `greeting: false` if you want the agent to wait silently for the user to speak first.
|
|
330
330
|
|
|
331
331
|
## Complete example (TypeScript)
|
|
332
332
|
|
|
333
|
-
A
|
|
333
|
+
A voice agent using Deepgram STT, OpenAI LLM, and Cartesia TTS with Krisp turn detection. Exposes multiple endpoints with different STT/TTS combinations:
|
|
334
334
|
|
|
335
335
|
```typescript
|
|
336
336
|
import * as http from 'node:http';
|
|
@@ -354,19 +354,19 @@ function handleSession(session: Session) {
|
|
|
354
354
|
const model = session.data.env_vars?.OPENAI_MODEL || 'gpt-4.1-mini';
|
|
355
355
|
const systemPrompt = session.data.env_vars?.SYSTEM_PROMPT || envVars.SYSTEM_PROMPT.default;
|
|
356
356
|
|
|
357
|
-
session.on('/
|
|
357
|
+
session.on('/agent-event', (evt: Record<string, unknown>) => {
|
|
358
358
|
if (evt.type === 'turn_end') {
|
|
359
359
|
const { transcript, response, interrupted, latency } = evt as Record<string, unknown>;
|
|
360
360
|
console.log('turn_end', JSON.stringify({ transcript, response, interrupted, latency }, null, 2));
|
|
361
361
|
}
|
|
362
362
|
});
|
|
363
363
|
|
|
364
|
-
session.on('/
|
|
364
|
+
session.on('/agent-complete', () => {
|
|
365
365
|
session.hangup().reply();
|
|
366
366
|
});
|
|
367
367
|
|
|
368
368
|
session
|
|
369
|
-
.
|
|
369
|
+
.agent({
|
|
370
370
|
stt: {
|
|
371
371
|
vendor: 'deepgram',
|
|
372
372
|
language: 'multi',
|
|
@@ -386,8 +386,8 @@ function handleSession(session: Session) {
|
|
|
386
386
|
turnDetection: 'krisp',
|
|
387
387
|
earlyGeneration: true,
|
|
388
388
|
bargeIn: { enable: true },
|
|
389
|
-
eventHook: '/
|
|
390
|
-
actionHook: '/
|
|
389
|
+
eventHook: '/agent-event',
|
|
390
|
+
actionHook: '/agent-complete',
|
|
391
391
|
})
|
|
392
392
|
.send();
|
|
393
393
|
}
|
|
@@ -426,19 +426,19 @@ function handleSession(session) {
|
|
|
426
426
|
const model = session.data.env_vars?.OPENAI_MODEL || 'gpt-4.1-mini';
|
|
427
427
|
const systemPrompt = session.data.env_vars?.SYSTEM_PROMPT || envVars.SYSTEM_PROMPT.default;
|
|
428
428
|
|
|
429
|
-
session.on('/
|
|
429
|
+
session.on('/agent-event', (evt) => {
|
|
430
430
|
if (evt.type === 'turn_end') {
|
|
431
431
|
const { transcript, response, interrupted, latency } = evt;
|
|
432
432
|
console.log('turn_end', JSON.stringify({ transcript, response, interrupted, latency }, null, 2));
|
|
433
433
|
}
|
|
434
434
|
});
|
|
435
435
|
|
|
436
|
-
session.on('/
|
|
436
|
+
session.on('/agent-complete', () => {
|
|
437
437
|
session.hangup().reply();
|
|
438
438
|
});
|
|
439
439
|
|
|
440
440
|
session
|
|
441
|
-
.
|
|
441
|
+
.agent({
|
|
442
442
|
stt: {
|
|
443
443
|
vendor: 'deepgram',
|
|
444
444
|
language: 'multi',
|
|
@@ -458,8 +458,8 @@ function handleSession(session) {
|
|
|
458
458
|
turnDetection: 'krisp',
|
|
459
459
|
earlyGeneration: true,
|
|
460
460
|
bargeIn: { enable: true },
|
|
461
|
-
eventHook: '/
|
|
462
|
-
actionHook: '/
|
|
461
|
+
eventHook: '/agent-event',
|
|
462
|
+
actionHook: '/agent-complete',
|
|
463
463
|
})
|
|
464
464
|
.send();
|
|
465
465
|
}
|
|
@@ -0,0 +1,219 @@
|
|
|
1
|
+
# Dub Verb Usage Guide
|
|
2
|
+
|
|
3
|
+
The `dub` verb manages auxiliary audio tracks that are mixed into the call audio. Tracks can play background music or coaching whispers, or inject synthesized speech, which is useful for real-time translation, agent assistance, or audio overlays.
|
|
4
|
+
|
|
5
|
+
## Track Routing: Who Hears What
|
|
6
|
+
|
|
7
|
+
**Critical concept:** Dub tracks are heard by the party on whose call leg they are created.
|
|
8
|
+
|
|
9
|
+
| Where track is created | Who hears it |
|
|
10
|
+
|------------------------|--------------|
|
|
11
|
+
| Main verb stack (A-leg) | Caller |
|
|
12
|
+
| Nested in dial verb's `dub` array | Callee |
|
|
13
|
+
|
|
14
|
+
When using `injectCommand` to play/say on a track, the command routes to the call leg where the track was created—unless you specify a different `call_sid` as the third argument.
|
|
15
|
+
|
|
16
|
+
## Basic Usage
|
|
17
|
+
|
|
18
|
+
### Creating Tracks
|
|
19
|
+
|
|
20
|
+
Create a track before playing audio on it:
|
|
21
|
+
|
|
22
|
+
```typescript
|
|
23
|
+
// Track on A-leg (caller hears it)
|
|
24
|
+
session
|
|
25
|
+
.dub({ action: 'addTrack', track: 'caller-audio' })
|
|
26
|
+
.dial({
|
|
27
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
28
|
+
dub: [
|
|
29
|
+
// Track on B-leg (callee hears it)
|
|
30
|
+
{ action: 'addTrack', track: 'callee-audio' }
|
|
31
|
+
]
|
|
32
|
+
})
|
|
33
|
+
.send();
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
### Playing Audio on Tracks
|
|
37
|
+
|
|
38
|
+
```typescript
|
|
39
|
+
// Play audio file
|
|
40
|
+
session
|
|
41
|
+
.dub({
|
|
42
|
+
action: 'playOnTrack',
|
|
43
|
+
track: 'bgm',
|
|
44
|
+
play: 'https://example.com/music.mp3',
|
|
45
|
+
loop: true,
|
|
46
|
+
gain: -15 // Reduce volume by 15dB
|
|
47
|
+
})
|
|
48
|
+
.send();
|
|
49
|
+
|
|
50
|
+
// Synthesize and play speech
|
|
51
|
+
session
|
|
52
|
+
.dub({
|
|
53
|
+
action: 'sayOnTrack',
|
|
54
|
+
track: 'coach',
|
|
55
|
+
say: 'Ask about their timeline'
|
|
56
|
+
})
|
|
57
|
+
.send();
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
### Controlling Tracks
|
|
61
|
+
|
|
62
|
+
```typescript
|
|
63
|
+
// Silence a track (mute without removing)
|
|
64
|
+
session.dub({ action: 'silenceTrack', track: 'bgm' }).send();
|
|
65
|
+
|
|
66
|
+
// Remove a track entirely
|
|
67
|
+
session.dub({ action: 'removeTrack', track: 'bgm' }).send();
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
## Mid-Call Audio Injection with injectCommand
|
|
71
|
+
|
|
72
|
+
The most powerful use of dub tracks is injecting audio mid-call via `injectCommand`. This is how you implement real-time features like translation or coaching.
|
|
73
|
+
|
|
74
|
+
### Injecting to the Current Call Leg (A-leg)
|
|
75
|
+
|
|
76
|
+
```typescript
|
|
77
|
+
session.injectCommand('dub', {
|
|
78
|
+
action: 'sayOnTrack',
|
|
79
|
+
track: 'caller-audio',
|
|
80
|
+
say: 'Translated text for the caller'
|
|
81
|
+
});
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
### Injecting to a Specific Call Leg (B-leg)
|
|
85
|
+
|
|
86
|
+
When the target track exists on a different call leg, pass the target `call_sid` as the third argument:
|
|
87
|
+
|
|
88
|
+
```typescript
|
|
89
|
+
// First, capture the B-leg call_sid from call:status events
|
|
90
|
+
let dialCallSid: string;
|
|
91
|
+
|
|
92
|
+
session.on('call:status', (evt: Record<string, any>) => {
|
|
93
|
+
if (evt.direction === 'outbound') {
|
|
94
|
+
dialCallSid = evt.call_sid;
|
|
95
|
+
}
|
|
96
|
+
});
|
|
97
|
+
|
|
98
|
+
// Then inject to the B-leg
|
|
99
|
+
session.injectCommand('dub', {
|
|
100
|
+
action: 'sayOnTrack',
|
|
101
|
+
track: 'callee-audio',
|
|
102
|
+
say: 'Translated text for the callee'
|
|
103
|
+
}, dialCallSid);
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
**Important:** The third argument is the `call_sid` of the call leg where the track exists, not part of the data object.
|
|
107
|
+
|
|
108
|
+
## Common Use Cases
|
|
109
|
+
|
|
110
|
+
### Real-Time Translation
|
|
111
|
+
|
|
112
|
+
Set up tracks for each party, then inject translated speech based on transcriptions:
|
|
113
|
+
|
|
114
|
+
```typescript
|
|
115
|
+
session
|
|
116
|
+
.dub({ action: 'addTrack', track: 'caller-translation' })
|
|
117
|
+
.dial({
|
|
118
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
119
|
+
dub: [
|
|
120
|
+
{ action: 'addTrack', track: 'callee-translation' }
|
|
121
|
+
],
|
|
122
|
+
transcribe: {
|
|
123
|
+
transcriptionHook: '/transcription',
|
|
124
|
+
recognizer: { vendor: 'deepgram' }
|
|
125
|
+
}
|
|
126
|
+
})
|
|
127
|
+
.send();
|
|
128
|
+
|
|
129
|
+
// When callee speaks, translate and play to caller:
|
|
130
|
+
session.injectCommand('dub', {
|
|
131
|
+
action: 'sayOnTrack',
|
|
132
|
+
track: 'caller-translation',
|
|
133
|
+
say: { text: translatedText, synthesizer: { language: 'en-US' } }
|
|
134
|
+
});
|
|
135
|
+
|
|
136
|
+
// When caller speaks, translate and play to callee:
|
|
137
|
+
session.injectCommand('dub', {
|
|
138
|
+
action: 'sayOnTrack',
|
|
139
|
+
track: 'callee-translation',
|
|
140
|
+
say: { text: translatedText, synthesizer: { language: 'es-ES' } }
|
|
141
|
+
}, dialCallSid);
|
|
142
|
+
```
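The routing decision in the translation flow above can be isolated into a small pure function. This is a minimal sketch, not part of the jambonz SDK: `routeTranslation`, `Speaker`, and `RoutedInjection` are hypothetical names, and the track names are the ones used in the example above. It only decides which track and which call leg should receive the translated speech.

```typescript
// Sketch only (hypothetical helper, not an SDK API): picks the dub track and,
// where needed, the call_sid for a translated utterance.
type Speaker = 'caller' | 'callee';

interface DubSay {
  action: 'sayOnTrack';
  track: string;
  say: string;
}

interface RoutedInjection {
  data: DubSay;
  callSid?: string; // third argument to injectCommand; left unset for the A-leg
}

function routeTranslation(
  speaker: Speaker,
  translatedText: string,
  dialCallSid?: string
): RoutedInjection {
  if (speaker === 'callee') {
    // The callee spoke, so the caller should hear it: the track is on the A-leg.
    return {
      data: { action: 'sayOnTrack', track: 'caller-translation', say: translatedText },
    };
  }
  // The caller spoke, so the callee should hear it: the track is on the B-leg.
  if (!dialCallSid) {
    throw new Error('B-leg call_sid has not been captured yet');
  }
  return {
    data: { action: 'sayOnTrack', track: 'callee-translation', say: translatedText },
    callSid: dialCallSid,
  };
}

// Example values are placeholders.
const toCaller = routeTranslation('callee', 'Hello');
const toCallee = routeTranslation('caller', 'Hola', 'b-leg-sid-example');
console.log(toCaller.data.track, toCaller.callSid);
console.log(toCallee.data.track, toCallee.callSid);
```

The result would then be handed to `session.injectCommand('dub', routed.data, routed.callSid)`. This assumes that an undefined third argument behaves the same as omitting it. Throwing when the B-leg `call_sid` is still unknown avoids silently playing the callee's translation on the wrong leg.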
|
|
143
|
+
|
|
144
|
+
### Agent Coaching / Whisper
|
|
145
|
+
|
|
146
|
+
Play prompts only the agent hears (A-leg is the agent):
|
|
147
|
+
|
|
148
|
+
```typescript
|
|
149
|
+
session
|
|
150
|
+
.dub({ action: 'addTrack', track: 'coach' })
|
|
151
|
+
.dial({
|
|
152
|
+
target: [{ type: 'phone', number: '+15551234567' }]
|
|
153
|
+
})
|
|
154
|
+
.send();
|
|
155
|
+
|
|
156
|
+
// Supervisor sends a coaching message:
|
|
157
|
+
session.injectCommand('dub', {
|
|
158
|
+
action: 'sayOnTrack',
|
|
159
|
+
track: 'coach',
|
|
160
|
+
say: 'Offer them a 20% discount'
|
|
161
|
+
});
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
### Background Music / Hold Music
|
|
165
|
+
|
|
166
|
+
```typescript
|
|
167
|
+
session
|
|
168
|
+
.dub({ action: 'addTrack', track: 'bgm' })
|
|
169
|
+
.dub({
|
|
170
|
+
action: 'playOnTrack',
|
|
171
|
+
track: 'bgm',
|
|
172
|
+
play: 'https://example.com/hold-music.mp3',
|
|
173
|
+
loop: true,
|
|
174
|
+
gain: -20
|
|
175
|
+
})
|
|
176
|
+
.send();
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
## Say Configuration Object
|
|
180
|
+
|
|
181
|
+
The `say` property can be a string or a configuration object for more control:
|
|
182
|
+
|
|
183
|
+
```typescript
|
|
184
|
+
session.injectCommand('dub', {
|
|
185
|
+
action: 'sayOnTrack',
|
|
186
|
+
track: 'translation',
|
|
187
|
+
say: {
|
|
188
|
+
text: 'Hello, how can I help you?',
|
|
189
|
+
synthesizer: {
|
|
190
|
+
vendor: 'elevenlabs',
|
|
191
|
+
voice: 'EXAVITQu4vr4xnSDxMaL',
|
|
192
|
+
language: 'en-US'
|
|
193
|
+
}
|
|
194
|
+
}
|
|
195
|
+
});
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
## Gain Control
|
|
199
|
+
|
|
200
|
+
Use the `gain` property to adjust track volume in dB:
|
|
201
|
+
|
|
202
|
+
- Negative values reduce volume (e.g., `-15` for quiet background music)
|
|
203
|
+
- Positive values increase volume (use carefully to avoid clipping)
|
|
204
|
+
- `0` is the default (no change)
|
|
205
|
+
|
|
206
|
+
```typescript
|
|
207
|
+
session.dub({
|
|
208
|
+
action: 'playOnTrack',
|
|
209
|
+
track: 'bgm',
|
|
210
|
+
play: 'https://example.com/music.mp3',
|
|
211
|
+
gain: -15
|
|
212
|
+
}).send();
|
|
213
|
+
```
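For a rough sense of what these dB values mean, the sketch below converts a gain in decibels to a linear amplitude multiplier. It assumes the conventional amplitude relation, ratio = 10^(dB / 20). How jambonz applies `gain` internally is not stated in this guide, and `dbToAmplitude` is a hypothetical helper rather than an SDK function.

```typescript
// Hypothetical helper (not part of the jambonz SDK): converts a gain in dB to
// a linear amplitude multiplier, assuming ratio = 10^(dB / 20).
function dbToAmplitude(gainDb: number): number {
  return Math.pow(10, gainDb / 20);
}

// Gain values taken from the examples in this guide.
for (const gainDb of [0, -15, -20]) {
  console.log(`${gainDb} dB -> x${dbToAmplitude(gainDb).toFixed(3)}`);
}
```

Under that assumption, `-15` corresponds to roughly 18% of the original amplitude and `-20` to 10%, which is why these values suit background music that should sit well below speech.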
|
|
214
|
+
|
|
215
|
+
## See Also
|
|
216
|
+
|
|
217
|
+
- `verb:dub` - Full schema with all properties
|
|
218
|
+
- `docs/guides/session-commands.md` - injectCommand documentation
|
|
219
|
+
- `docs/guides/bridged-call-patterns.md` - Complete guide to bridged call scenarios
|
|
@@ -0,0 +1,167 @@
|
|
|
1
|
+
# Transcribe Verb Usage Guide
|
|
2
|
+
|
|
3
|
+
The `transcribe` verb enables real-time speech-to-text on a call. Transcription runs as a background process—subsequent verbs execute immediately while transcription continues. Results are streamed to your `transcriptionHook` webhook as they are produced.
|
|
4
|
+
|
|
5
|
+
## Basic Usage
|
|
6
|
+
|
|
7
|
+
```typescript
|
|
8
|
+
session
|
|
9
|
+
.transcribe({
|
|
10
|
+
transcriptionHook: '/transcription',
|
|
11
|
+
recognizer: {
|
|
12
|
+
vendor: 'deepgram',
|
|
13
|
+
language: 'en-US',
|
|
14
|
+
deepgramOptions: {
|
|
15
|
+
model: 'nova-2',
|
|
16
|
+
smartFormatting: true
|
|
17
|
+
}
|
|
18
|
+
}
|
|
19
|
+
})
|
|
20
|
+
.dial({ target: [{ type: 'phone', number: '+15551234567' }] })
|
|
21
|
+
.send();
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
## Channel Isolation for Bridged Calls
|
|
25
|
+
|
|
26
|
+
When transcribing a bridged call (A-leg connected to B-leg via `dial`), you can isolate which party's audio to transcribe using the `channel` property:
|
|
27
|
+
|
|
28
|
+
| Channel | Description |
|
|
29
|
+
|---------|-------------|
|
|
30
|
+
| (omitted) | Both parties' audio, mixed |
|
|
31
|
+
| `1` | Near-end audio (local party—caller on A-leg, callee on B-leg) |
|
|
32
|
+
| `2` | Far-end audio (remote party) |
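The table above can be mirrored in a small lookup that also rejects unsupported values. This is an illustrative sketch only: `describeChannel` is a hypothetical helper, and the returned strings simply restate the table.

```typescript
// Hypothetical helper (not an SDK API): maps the `channel` setting to the
// audio it selects, following the table above. Only 1 and 2 are accepted.
function describeChannel(channel?: number): string {
  if (channel === undefined) return "both parties' audio, mixed";
  if (channel === 1) return 'near-end audio (local party)';
  if (channel === 2) return 'far-end audio (remote party)';
  throw new RangeError(`channel must be 1 or 2, got ${channel}`);
}

console.log(describeChannel());
console.log(describeChannel(1));
console.log(describeChannel(2));
```

Validating the value before sending the verb surfaces a typo such as `channel: 3` in application code rather than at schema validation time.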
|
|
33
|
+
|
|
34
|
+
### Transcribing Both Legs Separately
|
|
35
|
+
|
|
36
|
+
To get separate transcriptions for each party in a bridged call:
|
|
37
|
+
|
|
38
|
+
```typescript
|
|
39
|
+
session
|
|
40
|
+
.transcribe({
|
|
41
|
+
transcriptionHook: '/transcription/caller',
|
|
42
|
+
channel: 1, // Caller's audio (A-leg near-end)
|
|
43
|
+
recognizer: { vendor: 'deepgram', language: 'en-US' }
|
|
44
|
+
})
|
|
45
|
+
.dial({
|
|
46
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
47
|
+
transcribe: {
|
|
48
|
+
transcriptionHook: '/transcription/callee',
|
|
49
|
+
channel: 2, // Callee's audio (B-leg far-end from A-leg perspective)
|
|
50
|
+
recognizer: { vendor: 'deepgram', language: 'es-ES' }
|
|
51
|
+
}
|
|
52
|
+
})
|
|
53
|
+
.send();
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
**Important:** When `transcribe` is nested in the `dial` verb, channel 2 isolates the B-leg's inbound audio (what the callee is saying).
|
|
57
|
+
|
|
58
|
+
## Transcription Hook Payload
|
|
59
|
+
|
|
60
|
+
Your webhook receives transcription results with these key fields:
|
|
61
|
+
|
|
62
|
+
```typescript
|
|
63
|
+
session.on('/transcription', (evt: Record<string, any>) => {
|
|
64
|
+
const {
|
|
65
|
+
call_sid, // Which call leg generated this transcript
|
|
66
|
+
speech, // Speech recognition results
|
|
67
|
+
    // is_final is read from speech (speech.is_final): true for final results, false for interim
|
|
68
|
+
transcription_sid, // Unique ID for this transcription session
|
|
69
|
+
} = evt;
|
|
70
|
+
|
|
71
|
+
if (speech?.is_final) {
|
|
72
|
+
const { transcript, confidence } = speech.alternatives[0];
|
|
73
|
+
console.log(`[${call_sid}] ${transcript} (${confidence})`);
|
|
74
|
+
}
|
|
75
|
+
});
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
## Identifying Which Party Spoke
|
|
79
|
+
|
|
80
|
+
The `call_sid` in transcription events identifies which call leg generated the transcript. If you're tracking the B-leg's `call_sid` (captured from `call:status` events), you can identify the speaker:
|
|
81
|
+
|
|
82
|
+
```typescript
|
|
83
|
+
let dialCallSid: string;
|
|
84
|
+
|
|
85
|
+
session.on('call:status', (evt: Record<string, any>) => {
|
|
86
|
+
if (evt.direction === 'outbound') {
|
|
87
|
+
dialCallSid = evt.call_sid;
|
|
88
|
+
}
|
|
89
|
+
});
|
|
90
|
+
|
|
91
|
+
session.on('/transcription', (evt: Record<string, any>) => {
|
|
92
|
+
const speaker = evt.call_sid === dialCallSid ? 'callee' : 'caller';
|
|
93
|
+
console.log(`${speaker}: ${evt.speech?.alternatives[0]?.transcript}`);
|
|
94
|
+
});
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
## Nested Transcribe in Config vs Dial
|
|
98
|
+
|
|
99
|
+
You can enable transcription in two places:
|
|
100
|
+
|
|
101
|
+
### In the main verb stack (or config)
|
|
102
|
+
|
|
103
|
+
Transcribes the A-leg. Without `channel`, captures caller audio. With `channel: 2`, captures far-end (what caller hears).
|
|
104
|
+
|
|
105
|
+
```typescript
|
|
106
|
+
session
|
|
107
|
+
.config({
|
|
108
|
+
transcribe: {
|
|
109
|
+
enable: true,
|
|
110
|
+
transcriptionHook: '/transcription',
|
|
111
|
+
recognizer: { vendor: 'deepgram' }
|
|
112
|
+
}
|
|
113
|
+
})
|
|
114
|
+
.send();
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
### Nested in dial
|
|
118
|
+
|
|
119
|
+
Transcribes during the bridged call. Without `channel`, captures both parties mixed. With `channel: 2`, isolates B-leg audio.
|
|
120
|
+
|
|
121
|
+
```typescript
|
|
122
|
+
session
|
|
123
|
+
.dial({
|
|
124
|
+
target: [{ type: 'phone', number: '+15551234567' }],
|
|
125
|
+
transcribe: {
|
|
126
|
+
transcriptionHook: '/transcription',
|
|
127
|
+
channel: 2, // Just the callee's voice
|
|
128
|
+
recognizer: { vendor: 'deepgram', language: 'es-ES' }
|
|
129
|
+
}
|
|
130
|
+
})
|
|
131
|
+
.send();
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
## Interim vs Final Results
|
|
135
|
+
|
|
136
|
+
Most STT vendors provide both interim (partial) and final transcription results:
|
|
137
|
+
|
|
138
|
+
- **Interim results** (`is_final: false`): Real-time partial transcripts that update as speech continues. Useful for live displays.
|
|
139
|
+
- **Final results** (`is_final: true`): Complete utterance transcripts after the speaker pauses. Use these for processing/translation.
|
|
140
|
+
|
|
141
|
+
```typescript
|
|
142
|
+
session.on('/transcription', (evt: Record<string, any>) => {
|
|
143
|
+
if (evt.speech?.is_final) {
|
|
144
|
+
// Process complete utterance
|
|
145
|
+
processTranscript(evt.speech.alternatives[0].transcript);
|
|
146
|
+
}
|
|
147
|
+
// Optionally handle interim results for live display
|
|
148
|
+
});
|
|
149
|
+
```
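The same filtering can be written as a pure function over a batch of events, which is easier to unit test than a live handler. This is a sketch under assumptions: the `TranscriptionEvent` interface covers only the fields used in this guide (`call_sid`, `speech.is_final`, `speech.alternatives`), and `finalUtterances` is a hypothetical helper.

```typescript
// Assumed event shape, limited to the fields used in this guide.
interface TranscriptionEvent {
  call_sid: string;
  speech?: {
    is_final?: boolean;
    alternatives?: { transcript: string; confidence?: number }[];
  };
}

// Hypothetical helper: keeps only final results that carry at least one
// alternative, and returns one formatted line per utterance.
function finalUtterances(events: TranscriptionEvent[]): string[] {
  const lines: string[] = [];
  for (const evt of events) {
    const best = evt.speech?.alternatives?.[0];
    if (evt.speech?.is_final && best) {
      lines.push(`[${evt.call_sid}] ${best.transcript}`);
    }
  }
  return lines;
}

// Placeholder events: one interim, one final, one final with no alternatives.
const sample: TranscriptionEvent[] = [
  { call_sid: 'a-leg', speech: { is_final: false, alternatives: [{ transcript: 'hel' }] } },
  { call_sid: 'a-leg', speech: { is_final: true, alternatives: [{ transcript: 'hello there' }] } },
  { call_sid: 'b-leg', speech: { is_final: true, alternatives: [] } },
];
console.log(finalUtterances(sample));
```

Only the final "hello there" utterance survives. The interim result is skipped, and the final event with an empty `alternatives` array is ignored rather than causing an error.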
|
|
150
|
+
|
|
151
|
+
## Enabling/Disabling Transcription
|
|
152
|
+
|
|
153
|
+
Use `enable: false` to stop background transcription:
|
|
154
|
+
|
|
155
|
+
```typescript
|
|
156
|
+
session
|
|
157
|
+
.config({
|
|
158
|
+
transcribe: { enable: false }
|
|
159
|
+
})
|
|
160
|
+
.send();
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
## See Also
|
|
164
|
+
|
|
165
|
+
- `callback:transcribe` - Full transcription webhook payload schema
|
|
166
|
+
- `component:recognizer` - STT configuration options per vendor
|
|
167
|
+
- `docs/guides/bridged-call-patterns.md` - Complete guide to bridged call scenarios
|
package/jambonz-app.schema.json
CHANGED
|
@@ -28,7 +28,7 @@
|
|
|
28
28
|
{ "$ref": "verbs/deepgram_s2s" },
|
|
29
29
|
{ "$ref": "verbs/ultravox_s2s" },
|
|
30
30
|
{ "$ref": "verbs/dialogflow" },
|
|
31
|
-
{ "$ref": "verbs/
|
|
31
|
+
{ "$ref": "verbs/agent" },
|
|
32
32
|
{ "$ref": "verbs/conference" },
|
|
33
33
|
{ "$ref": "verbs/transcribe" },
|
|
34
34
|
{ "$ref": "verbs/enqueue" },
|
package/package.json
CHANGED
package/verbs/{pipeline.schema.json → agent.schema.json}
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
{
|
|
2
2
|
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
|
3
|
-
"$id": "https://jambonz.org/schema/verbs/
|
|
3
|
+
"$id": "https://jambonz.org/schema/verbs/agent",
|
|
4
4
|
"minVersion": "10.1.0",
|
|
5
|
-
"title": "
|
|
6
|
-
"description": "Configures a complete STT → LLM → TTS
|
|
5
|
+
"title": "Agent",
|
|
6
|
+
"description": "Configures a complete voice AI agent by wiring together STT → LLM → TTS with integrated turn detection. Provides a higher-level abstraction than manually orchestrating the individual components. Optimized for building voice AI agents with proper turn-taking behavior.",
|
|
7
7
|
"type": "object",
|
|
8
8
|
"properties": {
|
|
9
9
|
"verb": {
|
|
10
|
-
"const": "
|
|
10
|
+
"const": "agent"
|
|
11
11
|
},
|
|
12
12
|
"id": {
|
|
13
13
|
"type": "string",
|
|
@@ -15,11 +15,11 @@
|
|
|
15
15
|
},
|
|
16
16
|
"stt": {
|
|
17
17
|
"$ref": "../components/recognizer",
|
|
18
|
-
"description": "Speech-to-text configuration for the
|
|
18
|
+
"description": "Speech-to-text configuration for the agent."
|
|
19
19
|
},
|
|
20
20
|
"tts": {
|
|
21
21
|
"$ref": "../components/synthesizer",
|
|
22
|
-
"description": "Text-to-speech configuration for the
|
|
22
|
+
"description": "Text-to-speech configuration for the agent."
|
|
23
23
|
},
|
|
24
24
|
"turnDetection": {
|
|
25
25
|
"oneOf": [
|
|
@@ -53,7 +53,7 @@
|
|
|
53
53
|
}
|
|
54
54
|
],
|
|
55
55
|
"default": "stt",
|
|
56
|
-
"description": "Turn detection strategy. Controls when the
|
|
56
|
+
"description": "Turn detection strategy. Controls when the agent decides the user has finished speaking. STT vendors with native turn-taking (deepgramflux, assemblyai, speechmatics) always use their built-in detection regardless of this setting."
|
|
57
57
|
},
|
|
58
58
|
"bargeIn": {
|
|
59
59
|
"type": "object",
|
|
@@ -86,16 +86,16 @@
|
|
|
86
86
|
},
|
|
87
87
|
"llm": {
|
|
88
88
|
"type": "object",
|
|
89
|
-
"description": "LLM configuration for the
|
|
89
|
+
"description": "LLM configuration for the agent. See the 'llm' verb schema for details.",
|
|
90
90
|
"additionalProperties": true
|
|
91
91
|
},
|
|
92
92
|
"actionHook": {
|
|
93
93
|
"$ref": "../components/actionHook",
|
|
94
|
-
"description": "A webhook invoked when the
|
|
94
|
+
"description": "A webhook invoked when the agent ends."
|
|
95
95
|
},
|
|
96
96
|
"eventHook": {
|
|
97
97
|
"$ref": "../components/actionHook",
|
|
98
|
-
"description": "A webhook invoked for
|
|
98
|
+
"description": "A webhook invoked for agent events. Receives event types: 'user_transcript' (user speech recognized), 'llm_response' (assistant reply), 'user_interruption' (barge-in detected), and 'turn_end' (end-of-turn summary with transcript, response, and latency metrics)."
|
|
99
99
|
},
|
|
100
100
|
"toolHook": {
|
|
101
101
|
"$ref": "../components/actionHook",
|
|
@@ -171,7 +171,7 @@
|
|
|
171
171
|
},
|
|
172
172
|
"required": ["url"]
|
|
173
173
|
},
|
|
174
|
-
"description": "External MCP servers that provide tools to the LLM. The
|
|
174
|
+
"description": "External MCP servers that provide tools to the LLM. The agent connects at startup via SSE, discovers available tools, and makes them callable by the LLM."
|
|
175
175
|
}
|
|
176
176
|
},
|
|
177
177
|
"required": [
|
|
@@ -179,7 +179,7 @@
|
|
|
179
179
|
],
|
|
180
180
|
"examples": [
|
|
181
181
|
{
|
|
182
|
-
"verb": "
|
|
182
|
+
"verb": "agent",
|
|
183
183
|
"stt": {
|
|
184
184
|
"vendor": "deepgram",
|
|
185
185
|
"language": "en-US"
|
|
@@ -201,10 +201,10 @@
|
|
|
201
201
|
}
|
|
202
202
|
},
|
|
203
203
|
"turnDetection": "stt",
|
|
204
|
-
"actionHook": "/
|
|
204
|
+
"actionHook": "/agent-complete"
|
|
205
205
|
},
|
|
206
206
|
{
|
|
207
|
-
"verb": "
|
|
207
|
+
"verb": "agent",
|
|
208
208
|
"stt": {
|
|
209
209
|
"vendor": "deepgram",
|
|
210
210
|
"language": "en-US"
|
|
@@ -234,7 +234,7 @@
|
|
|
234
234
|
"minSpeechDuration": 0.3,
|
|
235
235
|
"sticky": false
|
|
236
236
|
},
|
|
237
|
-
"actionHook": "/
|
|
237
|
+
"actionHook": "/agent-complete"
|
|
238
238
|
}
|
|
239
239
|
]
|
|
240
240
|
}
|
package/verbs/dub.schema.json
CHANGED
|
@@ -3,7 +3,7 @@
|
|
|
3
3
|
"$id": "https://jambonz.org/schema/verbs/dub",
|
|
4
4
|
"minVersion": "0.9.6",
|
|
5
5
|
"title": "Dub",
|
|
6
|
-
"description": "Manages audio dubbing tracks on a call. Allows adding, removing, and controlling auxiliary audio tracks that are mixed into the call audio. Used for background music, coaching whispers, or injecting audio from external sources.",
|
|
6
|
+
"description": "Manages audio dubbing tracks on a call. Allows adding, removing, and controlling auxiliary audio tracks that are mixed into the call audio. Used for background music, coaching whispers, or injecting audio from external sources.\n\n**Track Routing:** Tracks are heard by the party on whose call leg they are created. A dub verb in the main verb stack (A-leg) creates tracks heard by the caller. A dub verb nested in the dial verb's `dub` array creates tracks heard by the callee. When using injectCommand to play/say on a track from a different call leg, pass the target call's `call_sid` as the third argument to `session.injectCommand()` to route the command to the correct leg.",
|
|
7
7
|
"type": "object",
|
|
8
8
|
"properties": {
|
|
9
9
|
"verb": {
|
package/verbs/transcribe.schema.json
CHANGED
|
@@ -37,7 +37,8 @@
|
|
|
37
37
|
},
|
|
38
38
|
"channel": {
|
|
39
39
|
"type": "number",
|
|
40
|
-
"
|
|
40
|
+
"enum": [1, 2],
|
|
41
|
+
"description": "Specific audio channel to transcribe. Channel 1 = near-end (local party's audio, i.e. caller on A-leg or callee on B-leg). Channel 2 = far-end (remote party's audio). When transcribe is nested in the dial verb, omitting channel captures both legs mixed; specifying channel: 2 isolates the B-leg's inbound audio."
|
|
41
42
|
}
|
|
42
43
|
},
|
|
43
44
|
"examples": [
|