@speech-sdk/core 0.6.2 → 0.8.0-alpha
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +202 -21
- package/README.md +267 -264
- package/dist/__tests__/e2e/_save-audio.d.ts +5 -24
- package/dist/__tests__/e2e/_save-audio.d.ts.map +1 -1
- package/dist/__tests__/e2e/_save-audio.js +19 -42
- package/dist/__tests__/e2e/_save-audio.js.map +1 -1
- package/dist/audio-duration.d.ts +0 -5
- package/dist/audio-duration.d.ts.map +1 -1
- package/dist/audio-duration.js +3 -10
- package/dist/audio-duration.js.map +1 -1
- package/dist/audio-utils.d.ts +1 -9
- package/dist/audio-utils.d.ts.map +1 -1
- package/dist/audio-utils.js +10 -13
- package/dist/audio-utils.js.map +1 -1
- package/dist/captions.d.ts +29 -0
- package/dist/captions.d.ts.map +1 -0
- package/dist/captions.js +193 -0
- package/dist/captions.js.map +1 -0
- package/dist/conversation/attribute-timestamps.d.ts +26 -0
- package/dist/conversation/attribute-timestamps.d.ts.map +1 -0
- package/dist/conversation/attribute-timestamps.js +276 -0
- package/dist/conversation/attribute-timestamps.js.map +1 -0
- package/dist/conversation/dispatch.d.ts +5 -5
- package/dist/conversation/dispatch.d.ts.map +1 -1
- package/dist/conversation/dispatch.js +18 -8
- package/dist/conversation/dispatch.js.map +1 -1
- package/dist/conversation/errors.d.ts +3 -0
- package/dist/conversation/errors.d.ts.map +1 -1
- package/dist/conversation/errors.js +6 -0
- package/dist/conversation/errors.js.map +1 -1
- package/dist/conversation/pcm-concat.d.ts +0 -23
- package/dist/conversation/pcm-concat.d.ts.map +1 -1
- package/dist/conversation/pcm-concat.js +5 -43
- package/dist/conversation/pcm-concat.js.map +1 -1
- package/dist/conversation/proportional-fill.d.ts +10 -0
- package/dist/conversation/proportional-fill.d.ts.map +1 -0
- package/dist/conversation/proportional-fill.js +64 -0
- package/dist/conversation/proportional-fill.js.map +1 -0
- package/dist/conversation/silence-detection.d.ts +14 -0
- package/dist/conversation/silence-detection.d.ts.map +1 -0
- package/dist/conversation/silence-detection.js +52 -0
- package/dist/conversation/silence-detection.js.map +1 -0
- package/dist/conversation/stitch.d.ts +3 -1
- package/dist/conversation/stitch.d.ts.map +1 -1
- package/dist/conversation/stitch.js +54 -13
- package/dist/conversation/stitch.js.map +1 -1
- package/dist/conversation/types.d.ts +1 -19
- package/dist/conversation/types.d.ts.map +1 -1
- package/dist/conversation/validate.d.ts +1 -16
- package/dist/conversation/validate.d.ts.map +1 -1
- package/dist/conversation/validate.js +29 -29
- package/dist/conversation/validate.js.map +1 -1
- package/dist/default-stt-fallback.d.ts +3 -0
- package/dist/default-stt-fallback.d.ts.map +1 -0
- package/dist/default-stt-fallback.js +11 -0
- package/dist/default-stt-fallback.js.map +1 -0
- package/dist/derive-timestamps.d.ts +10 -0
- package/dist/derive-timestamps.d.ts.map +1 -0
- package/dist/derive-timestamps.js +24 -0
- package/dist/derive-timestamps.js.map +1 -0
- package/dist/errors.d.ts +20 -2
- package/dist/errors.d.ts.map +1 -1
- package/dist/errors.js +28 -2
- package/dist/errors.js.map +1 -1
- package/dist/generate-conversation.d.ts +5 -4
- package/dist/generate-conversation.d.ts.map +1 -1
- package/dist/generate-conversation.js +191 -38
- package/dist/generate-conversation.js.map +1 -1
- package/dist/generate-speech.d.ts +2 -10
- package/dist/generate-speech.d.ts.map +1 -1
- package/dist/generate-speech.js +111 -33
- package/dist/generate-speech.js.map +1 -1
- package/dist/index.d.ts +5 -8
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +6 -4
- package/dist/index.js.map +1 -1
- package/dist/logger.d.ts +2 -0
- package/dist/logger.d.ts.map +1 -0
- package/dist/logger.js +29 -0
- package/dist/logger.js.map +1 -0
- package/dist/metadata.d.ts +0 -22
- package/dist/metadata.d.ts.map +1 -1
- package/dist/provider-utils.d.ts +3 -1
- package/dist/provider-utils.d.ts.map +1 -1
- package/dist/provider-utils.js +36 -39
- package/dist/provider-utils.js.map +1 -1
- package/dist/providers/cartesia/alignment.d.ts +8 -0
- package/dist/providers/cartesia/alignment.d.ts.map +1 -0
- package/dist/providers/cartesia/alignment.js +18 -0
- package/dist/providers/cartesia/alignment.js.map +1 -0
- package/dist/providers/cartesia/index.d.ts +11 -13
- package/dist/providers/cartesia/index.d.ts.map +1 -1
- package/dist/providers/cartesia/index.js +184 -61
- package/dist/providers/cartesia/index.js.map +1 -1
- package/dist/providers/deepgram/index.d.ts +7 -8
- package/dist/providers/deepgram/index.d.ts.map +1 -1
- package/dist/providers/deepgram/index.js +17 -18
- package/dist/providers/deepgram/index.js.map +1 -1
- package/dist/providers/elevenlabs/alignment.d.ts +10 -0
- package/dist/providers/elevenlabs/alignment.d.ts.map +1 -0
- package/dist/providers/elevenlabs/alignment.js +47 -0
- package/dist/providers/elevenlabs/alignment.js.map +1 -0
- package/dist/providers/elevenlabs/index.d.ts +10 -26
- package/dist/providers/elevenlabs/index.d.ts.map +1 -1
- package/dist/providers/elevenlabs/index.js +216 -154
- package/dist/providers/elevenlabs/index.js.map +1 -1
- package/dist/providers/fal/index.d.ts +7 -43
- package/dist/providers/fal/index.d.ts.map +1 -1
- package/dist/providers/fal/index.js +37 -86
- package/dist/providers/fal/index.js.map +1 -1
- package/dist/providers/fish-audio/index.d.ts +7 -8
- package/dist/providers/fish-audio/index.d.ts.map +1 -1
- package/dist/providers/fish-audio/index.js +23 -19
- package/dist/providers/fish-audio/index.js.map +1 -1
- package/dist/providers/gateway/index.d.ts +68 -0
- package/dist/providers/gateway/index.d.ts.map +1 -0
- package/dist/providers/gateway/index.js +236 -0
- package/dist/providers/gateway/index.js.map +1 -0
- package/dist/providers/google/index.d.ts +7 -20
- package/dist/providers/google/index.d.ts.map +1 -1
- package/dist/providers/google/index.js +161 -151
- package/dist/providers/google/index.js.map +1 -1
- package/dist/providers/hume/alignment.d.ts +33 -0
- package/dist/providers/hume/alignment.d.ts.map +1 -0
- package/dist/providers/hume/alignment.js +37 -0
- package/dist/providers/hume/alignment.js.map +1 -0
- package/dist/providers/hume/index.d.ts +11 -13
- package/dist/providers/hume/index.d.ts.map +1 -1
- package/dist/providers/hume/index.js +105 -41
- package/dist/providers/hume/index.js.map +1 -1
- package/dist/providers/inworld/alignment.d.ts +11 -0
- package/dist/providers/inworld/alignment.d.ts.map +1 -0
- package/dist/providers/inworld/alignment.js +24 -0
- package/dist/providers/inworld/alignment.js.map +1 -0
- package/dist/providers/inworld/index.d.ts +10 -14
- package/dist/providers/inworld/index.d.ts.map +1 -1
- package/dist/providers/inworld/index.js +55 -38
- package/dist/providers/inworld/index.js.map +1 -1
- package/dist/providers/mistral/index.d.ts +7 -8
- package/dist/providers/mistral/index.d.ts.map +1 -1
- package/dist/providers/mistral/index.js +39 -38
- package/dist/providers/mistral/index.js.map +1 -1
- package/dist/providers/murf/alignment.d.ts +13 -0
- package/dist/providers/murf/alignment.d.ts.map +1 -0
- package/dist/providers/murf/alignment.js +22 -0
- package/dist/providers/murf/alignment.js.map +1 -0
- package/dist/providers/murf/index.d.ts +11 -13
- package/dist/providers/murf/index.d.ts.map +1 -1
- package/dist/providers/murf/index.js +73 -56
- package/dist/providers/murf/index.js.map +1 -1
- package/dist/providers/openai/index.d.ts +36 -20
- package/dist/providers/openai/index.d.ts.map +1 -1
- package/dist/providers/openai/index.js +270 -102
- package/dist/providers/openai/index.js.map +1 -1
- package/dist/providers/resemble/alignment.d.ts +11 -0
- package/dist/providers/resemble/alignment.d.ts.map +1 -0
- package/dist/providers/resemble/alignment.js +54 -0
- package/dist/providers/resemble/alignment.js.map +1 -0
- package/dist/providers/resemble/index.d.ts +10 -8
- package/dist/providers/resemble/index.d.ts.map +1 -1
- package/dist/providers/resemble/index.js +58 -40
- package/dist/providers/resemble/index.js.map +1 -1
- package/dist/providers/xai/index.d.ts +7 -9
- package/dist/providers/xai/index.d.ts.map +1 -1
- package/dist/providers/xai/index.js +37 -40
- package/dist/providers/xai/index.js.map +1 -1
- package/dist/providers.d.ts +29 -0
- package/dist/providers.d.ts.map +1 -0
- package/dist/providers.js +15 -0
- package/dist/providers.js.map +1 -0
- package/dist/resolve-provider.d.ts.map +1 -1
- package/dist/resolve-provider.js +7 -59
- package/dist/resolve-provider.js.map +1 -1
- package/dist/speech-provider.d.ts +19 -15
- package/dist/speech-provider.d.ts.map +1 -1
- package/dist/speech-provider.js +9 -14
- package/dist/speech-provider.js.map +1 -1
- package/dist/speech-result.d.ts +5 -0
- package/dist/speech-result.d.ts.map +1 -1
- package/dist/speech-result.js.map +1 -1
- package/dist/speech-to-text-provider.d.ts +28 -0
- package/dist/speech-to-text-provider.d.ts.map +1 -0
- package/dist/speech-to-text-provider.js +2 -0
- package/dist/speech-to-text-provider.js.map +1 -0
- package/dist/stream-speech.d.ts.map +1 -1
- package/dist/stream-speech.js +2 -3
- package/dist/stream-speech.js.map +1 -1
- package/dist/timestamps.d.ts +9 -0
- package/dist/timestamps.d.ts.map +1 -0
- package/dist/timestamps.js +2 -0
- package/dist/timestamps.js.map +1 -0
- package/dist/turns.d.ts +9 -0
- package/dist/turns.d.ts.map +1 -0
- package/dist/turns.js +21 -0
- package/dist/turns.js.map +1 -0
- package/dist/types.d.ts +25 -0
- package/dist/types.d.ts.map +1 -1
- package/dist/volume-adjust.d.ts +0 -6
- package/dist/volume-adjust.d.ts.map +1 -1
- package/dist/volume-adjust.js +0 -6
- package/dist/volume-adjust.js.map +1 -1
- package/package.json +12 -63
package/README.md
CHANGED
@@ -1,31 +1,48 @@
+<div align="center">
+
+<img src="https://github.com/user-attachments/assets/42d9b528-e507-4162-8120-338bb0c92650" alt="Speech SDK" width="140" />
+
 # Speech SDK

-
-[](https://www.npmjs.com/package/@speech-sdk/core)
-[](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
+**Text-to-speech across 13 providers, one API.**

-
+A lightweight, provider-agnostic TypeScript SDK. Zero lock-in. Runs in Node.js, Edge runtimes, and the browser.

-
+[](https://www.npmjs.com/package/@speech-sdk/core)
+[](https://www.npmjs.com/package/@speech-sdk/core)
+[](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
+[](https://discord.gg/xcTQMU3nCV)
+[](https://github.com/Jellypod-Inc/speech-sdk/stargazers)

-
+**[Quick start](#quick-start)** · **[Providers](#supported-providers)** · **[Streaming](#streaming)** · **[Multi-Speaker Conversations](#conversations)** · **[Timestamps](#timestamps)**

+</div>

-
+<br />

-
-
-
+<img width="1200" height="630" alt="Speech SDK" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
+
+Learn more at [speechsdk.dev](https://speechsdk.dev/).

-
+## Features

-
+- **Universal** — one `generateSpeech()` call across every supported provider.
+- **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
+- **Conversations** — `generateConversation()` produces multi-speaker audio, picking a gateway, native-dialogue, or local-stitch path automatically.
+- **Word-level timestamps** — `timestamps: true` returns alignment, using the provider's native data or falling back to STT.
+- **Volume normalization** — RMS-level outputs to an absolute loudness target.
+- **Audio tags & voice cloning** — bracket cues like `[laugh]` and reference-audio cloning where supported.
+
+## Install

 ```bash
-
+npm install @speech-sdk/core
 ```

-
+> [!TIP]
+> Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
+
+## Quick start

 ```ts
 import { generateSpeech } from '@speech-sdk/core';
@@ -36,383 +53,369 @@ const result = await generateSpeech({
   voice: 'alloy',
 });

-// Access the audio
 result.audio.uint8Array; // Uint8Array
-result.audio.base64; // string (lazy
+result.audio.base64; // string (lazy)
 result.audio.mediaType; // "audio/mpeg"
 ```

-
+Pass a `provider/model` string, or just the provider name to use its default model. The string above is enough to get going — set one env var and you're done.

-
+## Gateway vs direct provider

-
-const result = await generateSpeech({
-  model: 'openai/gpt-4o-mini-tts',
-  text: 'Hello from speech-sdk!',
-  voice: 'alloy',
-  volumeDbfs: -20,
-});
+The SDK has two ways to reach a provider, and the choice is made by **how you pass `model`**:

-
+```ts
+// 1. String → routes through Speech Gateway (https://api.speechgateway.com)
+// Needs SPEECH_GATEWAY_API_KEY (sign up at https://speechgateway.com).
+await generateSpeech({ model: 'openai/gpt-4o-mini-tts', text: '...', voice: 'alloy' });
+
+// 2. Factory → calls the provider directly (no proxy hop)
+// Reads the provider's env var (e.g. OPENAI_API_KEY), or pass apiKey to the factory.
+import { createOpenAI } from '@speech-sdk/core/providers';
+await generateSpeech({ model: createOpenAI()('gpt-4o-mini-tts'), text: '...', voice: 'alloy' });
 ```

-
+| | Speech Gateway (string) | Direct provider (factory) |
+|---|---|---|
+| When to use | You want a single endpoint and easy provider swaps | You already have provider keys, want zero-hop latency, or need provider features the gateway hasn't surfaced |
+| Setup | `SPEECH_GATEWAY_API_KEY` only | One env var per provider you use |
+| Key resolution | `apiKey` option → `SPEECH_GATEWAY_API_KEY` | `createX({ apiKey })` → `<PROVIDER>_API_KEY` |
+| Endpoint | `api.speechgateway.com` | Provider's own API |
+
+The gateway also accepts `createSpeechGateway({ apiKey, baseURL })` if you want to construct it explicitly (e.g. for a custom proxy URL).
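The "Key resolution" row above describes a two-step lookup: an explicit `apiKey` wins, otherwise the named environment variable is read. A minimal sketch of that order follows. `resolveApiKey` is a hypothetical helper written for illustration, not an export of `@speech-sdk/core`, and the error it throws is an assumption about what a missing key might look like.

```typescript
// Hypothetical helper (not part of @speech-sdk/core): illustrates the
// "explicit option, then env var" order from the table above.
function resolveApiKey(
  explicit: string | undefined,
  envName: string,
  env: Record<string, string | undefined>,
): string {
  // 1. An explicitly passed key always takes precedence.
  if (explicit !== undefined && explicit !== '') return explicit;
  // 2. Otherwise fall back to the named environment variable.
  const fromEnv = env[envName];
  if (fromEnv !== undefined && fromEnv !== '') return fromEnv;
  // 3. Nothing found: fail loudly and name the variable to set.
  throw new Error(`No API key: pass apiKey or set ${envName}`);
}
```

Passing the environment as a parameter (for example `process.env` in Node.js) keeps the sketch runtime-agnostic.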
+
+## Supported providers
+
+| Provider | Prefix | Env var |
+|---|---|---|
+| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `OPENAI_API_KEY` |
+| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `ELEVENLABS_API_KEY` |
+| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `DEEPGRAM_API_KEY` |
+| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `CARTESIA_API_KEY` |
+| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `HUME_API_KEY` |
+| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `INWORLD_API_KEY` |
+| [Google Gemini TTS](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `GOOGLE_API_KEY` |
+| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `FISH_AUDIO_API_KEY` |
+| [Murf](https://murf.ai/api/docs) | `murf` | `MURF_API_KEY` |
+| [Resemble](https://docs.resemble.ai) | `resemble` | `RESEMBLE_API_KEY` |
+| [fal](https://fal.ai/models) | `fal-ai` | `FAL_API_KEY` |
+| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `MISTRAL_API_KEY` |
+| [xAI](https://docs.x.ai/docs/models) | `xai` | `XAI_API_KEY` |
+
+The env var applies when you call the provider directly via its factory. Pass a string `model` like `"openai/tts-1"` to route through Speech Gateway instead, which reads `SPEECH_GATEWAY_API_KEY` — see [Gateway vs direct provider](#gateway-vs-direct-provider). Most providers ship a default model (`createOpenAI()()`); a few (e.g. fal) require an explicit model id. See the linked docs for each provider's full model list.
+
+Provider-specific parameters pass through via `providerOptions` using each API's native field names.
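To make the `provider/model` string convention concrete, here is a small sketch of how such a string could be split into its two parts. `parseModelString` and `ParsedModel` are hypothetical names, and splitting on the first slash only is an assumption of this sketch; the SDK's own parsing may differ.

```typescript
// Hypothetical parser for "provider/model" strings; not an SDK export.
interface ParsedModel {
  provider: string;
  // undefined means "use the provider's default model".
  modelId: string | undefined;
}

function parseModelString(input: string): ParsedModel {
  const slash = input.indexOf('/');
  // No slash: the whole string is the provider name.
  if (slash === -1) return { provider: input, modelId: undefined };
  // Assumption: split on the FIRST slash only, so a model id may itself contain slashes.
  const rest = input.slice(slash + 1);
  return { provider: input.slice(0, slash), modelId: rest === '' ? undefined : rest };
}
```

With this sketch, `parseModelString('openai/gpt-4o-mini-tts')` yields provider `openai` and model id `gpt-4o-mini-tts`, while `parseModelString('openai')` leaves the model id undefined.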

 ## Streaming

-
+`streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.

 ```ts
-import { streamSpeech } from
+import { streamSpeech } from '@speech-sdk/core';

 const { audio, mediaType } = await streamSpeech({
-  model:
-  text:
-  voice:
+  model: 'cartesia/sonic-3',
+  text: 'Streaming straight to the client.',
+  voice: 'voice-id',
 });
+
+// Forward to an HTTP response:
+return new Response(audio, { headers: { 'Content-Type': mediaType } });
 ```

-
+> [!NOTE]
+> Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
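Because the stream is a standard `ReadableStream<Uint8Array>`, it can also be drained by hand with the Web Streams reader API. The sketch below uses only standard platform calls; `collectStream` is an illustrative name, not an SDK function. A mid-stream error surfaces as a rejected `read()`, which matches the note above about errors propagating to the consumer.

```typescript
// Drain a ReadableStream<Uint8Array> into one contiguous Uint8Array.
// Illustrative helper; relies only on the standard Web Streams API.
async function collectStream(stream: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const reader = stream.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    while (true) {
      const result = await reader.read(); // rejects if the stream errors mid-way
      if (result.done) break;
      chunks.push(result.value);
      total += result.value.byteLength;
    }
  } finally {
    reader.releaseLock(); // always release, even on error
  }
  // Copy every chunk into a single buffer, in arrival order.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Buffering the whole stream gives up the latency benefit of streaming, so this is mainly useful for saving to disk or for tests.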

-
-import { createWriteStream } from "node:fs";
-import { Readable } from "node:stream";
+## Conversations

-
-  model: "elevenlabs/eleven_flash_v2_5",
-  text: "Hello world",
-  voice: "JBFqnCBsd6RMkjVDRZzb",
-});
+`generateConversation()` produces a single multi-voice clip from an ordered array of turns. The path is chosen by what the turns are:

-
-
-
-```
+- **Gateway** — every turn uses a gateway-routed string model (e.g. `"openai/tts-1"`). One request to Speech Gateway; the server handles rendering, stitching, and normalization. The SDK never stitches locally on this path — clone voices on gateway models throw `StitchUnsupportedError`.
+- **Native dialogue** — every turn uses the same direct-provider model and that model exposes a multi-speaker endpoint. One API call, naturally mixed.
+- **Stitch** — direct-provider conversations that don't qualify for native dialogue (multi-provider, or no dialogue endpoint). Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.

-
+Mixing gateway-routed turns with direct-provider turns in one call throws `MixedDispatchError`.

 ```ts
-
-const { audio, mediaType } = await streamSpeech({
-  model: "cartesia/sonic-3",
-  text: "Streaming straight to the client.",
-  voice: "voice-id",
-});
-
-return new Response(audio, { headers: { "Content-Type": mediaType } });
-}
-```
-
-### Read chunks manually
+import { generateConversation } from '@speech-sdk/core';

-
-
-
-
-
-
-}
+const result = await generateConversation({
+  turns: [
+    { model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
+    { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
+    { model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
+  ],
+});
 ```

-
+Options: `gapMs` (default 300), `volumeDbfs` (default `-20`), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native). Native-dialogue models enforce their own voice-count and character limits; violations throw `DialogueConstraintError`.
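The three dispatch paths and the mixed-dispatch error described above can be summarized as a small decision function. This is only a sketch of the rules as the README states them: `DirectModel`, `supportsDialogue`, and `choosePath` are hypothetical names, and the SDK's real model objects and error classes are not reproduced here.

```typescript
// Hypothetical types for this sketch; the SDK's real model objects differ.
interface DirectModel {
  provider: string;
  modelId: string;
  supportsDialogue: boolean; // assumed flag: model exposes a multi-speaker endpoint
}
type TurnModel = string | DirectModel; // a string stands for a gateway-routed model
type ConversationPath = 'gateway' | 'native' | 'stitch';

function choosePath(models: TurnModel[]): ConversationPath {
  if (models.length === 0) throw new Error('at least one turn is required');
  const direct = models.filter((m): m is DirectModel => typeof m !== 'string');
  const first = direct[0];
  // No direct-provider models at all: every turn is a gateway string.
  if (first === undefined) return 'gateway';
  // Some strings and some factories: the README says this throws MixedDispatchError.
  if (direct.length !== models.length) {
    throw new Error('MixedDispatchError: gateway and direct-provider turns mixed');
  }
  // Native dialogue needs one shared model that supports multi-speaker output.
  const sameModel = direct.every(
    (m) => m.provider === first.provider && m.modelId === first.modelId,
  );
  return sameModel && first.supportsDialogue ? 'native' : 'stitch';
}
```

Under these assumptions, a conversation spanning two direct providers always falls through to the stitch path, which is why it returns a single WAV regardless of each provider's native format.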
+
+## Timestamps

-
+Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.

 ```ts
-
+const result = await generateSpeech({
+  model: 'elevenlabs/eleven_multilingual_v2',
+  text: 'Hello from speech-sdk!',
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  timestamps: true,
+});

-
-
-
-}
+result.timestamps;
+// [
+//   { text: "Hello", start: 0.00, end: 0.32 },
+//   { text: "from", start: 0.36, end: 0.55 },
+//   ...
+// ]
 ```

-
+| Value | Behavior |
+|---|---|
+| `true` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
+| `false` *(default)* | Never return timestamps. |

-
+With `timestamps: true`, models without native alignment require an STT fallback. The SDK automatically uses OpenAI Whisper when `OPENAI_API_KEY` is set in the environment — no extra configuration needed. Gateway-routed models (string model IDs like `"openai/tts-1"`) do not need a fallback — the gateway server provides it.

-
+**Resolution order:** factory `fallbackSTT` → `OPENAI_API_KEY` env var (automatic Whisper fallback) → throws `TimestampKeyMissingError`.

-
+Configure `fallbackSTT` on the factory to use a different key or STT model (set it once, applies to all calls):
+
+```ts
+import { generateSpeech } from '@speech-sdk/core';
+import { createOpenAI, createElevenLabs } from '@speech-sdk/core/providers';
+
+const elevenlabs = createElevenLabs({
+  apiKey: process.env.ELEVENLABS_API_KEY,
+  fallbackSTT: createOpenAI({ apiKey: process.env.MY_OPENAI_KEY }).stt('whisper-1'),
+});
+
+const result = await generateSpeech({
+  model: elevenlabs('eleven_flash_v2'),
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  text: 'Hello, world.',
+  timestamps: true,
+});
+```

-
+Whether a given model returns native alignment or transcribes via the STT fallback is a provider detail — both paths produce the same `WordTimestamp[]` shape.

-
-- **Stitch fallback** — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls `generateSpeech()` per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
+`generateConversation` accepts the same options and returns `ConversationWordTimestamp[]` — every word carries a `turnIndex: number` pointing back into the input `turns[]`. This is what lets you build chat-bubble UIs, speaker-attributed transcripts, and "who's speaking now?" lookups during playback without re-deriving turn boundaries.

 ```ts
-import { generateConversation } from
+import { generateConversation, timestampsToTurns } from '@speech-sdk/core';

 const result = await generateConversation({
+  model: 'elevenlabs/eleven_v3',
   turns: [
-    {
-    {
-    { model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
-    { model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
+    { voice: 'rachel', text: 'Hi there.' },
+    { voice: 'adam', text: 'Hello!' },
   ],
+  timestamps: true,
 });

-
-result.
+// Collapse consecutive words from the same turn into per-turn timings:
+const turnTimestamps = timestampsToTurns(result.timestamps ?? []);
 ```
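For readers curious what "collapse consecutive words from the same turn" amounts to, here is a self-contained sketch of that grouping. It is not the SDK's `timestampsToTurns` implementation: the `TurnSpan` output shape and the `collapseToTurns` name are assumptions, and only the `text`, `start`, `end`, and `turnIndex` fields come from the README.

```typescript
// Word shape as described in the README: seconds from the start of the audio.
interface ConversationWord {
  text: string;
  start: number;
  end: number;
  turnIndex: number;
}

// Hypothetical per-turn result shape for this sketch.
interface TurnSpan {
  turnIndex: number;
  start: number;
  end: number;
  text: string;
}

function collapseToTurns(words: ConversationWord[]): TurnSpan[] {
  const spans: TurnSpan[] = [];
  for (const w of words) {
    const last = spans[spans.length - 1];
    if (last !== undefined && last.turnIndex === w.turnIndex) {
      // Same turn as the previous word: extend the open span.
      last.end = w.end;
      last.text += ' ' + w.text;
    } else {
      // Turn changed (or first word): open a new span.
      spans.push({ turnIndex: w.turnIndex, start: w.start, end: w.end, text: w.text });
    }
  }
  return spans;
}
```

A "who's speaking now?" lookup then reduces to finding the span whose `start` and `end` bracket the current playback time.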

-
+### Captions (SRT / WebVTT)

-
+`timestampsToCaptions()` converts word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT.

 ```ts
-
-
-
-
-
-
-
-  maxRetries?: number, // per-turn retries, default 2
-  apiKey?: string,
-  providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
-  abortSignal?: AbortSignal,
-  headers?: Record<string, string>,
+import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';
+
+const { timestamps } = await generateSpeech({
+  model: 'elevenlabs/eleven_v3',
+  text: 'Hello world. This is a test.',
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  timestamps: true,
 });

-
-
-  text: string; // required, non-empty
-  model?: string | ResolvedModel; // per-turn override of the top-level model
-  providerOptions?: Record<string, unknown>, // stitch path only; see note below
-}
+const srt = timestampsToCaptions(timestamps ?? []);
+const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
 ```

-
+Cues break on sentence boundaries, then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
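As a rough illustration of the first step (breaking cues on sentence boundaries and emitting SRT), the sketch below groups words until one ends in `.`, `!`, or `?`, then formats each group with SRT's `HH:MM:SS,mmm` timing. It deliberately omits the character-count, duration, and comma subdivisions, and `srtTime` and `wordsToSrt` are illustrative names rather than SDK functions.

```typescript
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface Cue {
  start: number;
  end: number;
  text: string;
}

// Format seconds as an SRT timestamp, e.g. 1.5 -> "00:00:01,500".
// Sketch limitation: hours above 99 are not handled.
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, width: number): string => ('000' + n).slice(-width);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Simplified: one cue per sentence, no further subdivision.
function wordsToSrt(words: Word[]): string {
  const cues: Cue[] = [];
  let open: Cue | undefined;
  for (const w of words) {
    if (open === undefined) {
      open = { start: w.start, end: w.end, text: w.text };
    } else {
      open.end = w.end;
      open.text += ' ' + w.text;
    }
    // Sentence-final punctuation closes the current cue.
    if (/[.!?]$/.test(w.text)) {
      cues.push(open);
      open = undefined;
    }
  }
  if (open !== undefined) cues.push(open); // trailing words without punctuation
  return cues
    .map((c, i) => `${i + 1}\n${srtTime(c.start)} --> ${srtTime(c.end)}\n${c.text}\n`)
    .join('\n');
}
```

WebVTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.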

-
+## Volume normalization

-
+Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; `-20` is the broadcast/podcast convention).

 ```ts
-await
-
-
+const result = await generateSpeech({
+  model: 'openai/gpt-4o-mini-tts',
+  text: 'Hello!',
+  voice: 'alloy',
+  volumeDbfs: -20,
 });
-```

-
+result.audio.mediaType; // "audio/wav" — re-encoded after normalization
+```

-
+`generateConversation` always normalizes; override the target with `volumeDbfs`. A warning is surfaced (and the raw mix passes through) if the provider has no decodable PCM/WAV mode.
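RMS normalization to a dBFS target comes down to two formulas: the current level is `20 * log10(rms)` for samples scaled to the range -1 to 1, and the linear gain needed is `10 ^ ((target - current) / 20)`. The sketch below applies them to an array of float samples. `rmsDbfs` and `normalizeToDbfs` are illustrative names, and the clamping step is a choice made for this sketch, not a description of the SDK's internals.

```typescript
// RMS level in dBFS for float samples in [-1, 1]; -Infinity for silence.
function rmsDbfs(samples: number[]): number {
  if (samples.length === 0) return -Infinity;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return (20 * Math.log(rms)) / Math.LN10; // 20 * log10(rms)
}

// Scale samples so their RMS level lands on targetDbfs (must be <= 0).
function normalizeToDbfs(samples: number[], targetDbfs: number): number[] {
  if (targetDbfs > 0) throw new RangeError('targetDbfs must be <= 0');
  const current = rmsDbfs(samples);
  if (!isFinite(current)) return samples.slice(); // silence: nothing to scale
  const gain = Math.pow(10, (targetDbfs - current) / 20);
  // Clamp to full scale; if clipping occurs, the final RMS ends up below target.
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

For example, a square wave at amplitude 0.5 sits near -6 dBFS, so normalizing it to -20 dBFS scales every sample by about 0.2.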

-
+## Audio tags

-
+Bracket syntax `[tag]` adds expressive cues. Each provider handles tags natively where supported, maps them to its closest equivalent, or strips them and surfaces a warning in `result.warnings`.

-
-
-
-
-
+```ts
+await generateSpeech({
+  model: 'elevenlabs/eleven_v3',
+  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
+  voice: 'voice-id',
+});
+```
|
|
213
269
|
|
|
214
|
-
|
|
270
|
+
## Voice cloning
|
|
215
271
|
|
|
216
|
-
|
|
217
|
-
|---|---|---|
|
|
218
|
-
| ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 total chars |
|
|
219
|
-
| Google | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** (API requirement) |
|
|
220
|
-
| Hume | `octave-1`, `octave-2` | 1–4 voices |
|
|
221
|
-
| Fish Audio | `s2-pro` | 1–4 voices |
|
|
222
|
-
| fal | `dia-tts` | 1–2 voices |
|
|
223
|
-
|
|
224
|
-
## Supported Providers
|
|
225
|
-
|
|
226
|
-
Use `provider/model` strings. Passing just the provider name uses its default model.
|
|
227
|
-
|
|
228
|
-
| Provider | String Prefix | Default Model | Env Var | Docs |
|
|
229
|
-
|---|---|---|---|---|
|
|
230
|
-
| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | [API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech) |
|
|
231
|
-
| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | [API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) |
|
|
232
|
-
| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | [API Reference](https://developers.deepgram.com/docs/tts-models) |
|
|
233
|
-
| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | [API Reference](https://docs.cartesia.ai/api-reference/tts/bytes) |
|
|
234
|
-
| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` | [API Reference](https://dev.hume.ai/reference/text-to-speech-tts/synthesize-json) |
|
|
235
|
-
| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` | [API Reference](https://docs.inworld.ai/tts/api-reference) |
|
|
236
|
-
| [Google (Gemini TTS)](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | [API Reference](https://ai.google.dev/gemini-api/docs/text-generation) |
|
|
237
|
-
| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | [API Reference](https://docs.fish.audio/developer-guide/core-features/text-to-speech) |
|
|
238
|
-
| [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` | [API Reference](https://murf.ai/api/docs/api-reference/text-to-speech/generate) |
|
|
239
|
-
| [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` | [API Reference](https://docs.resemble.ai/api-reference/text-to-speech/synthesize) |
|
|
240
|
-
| [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` | [API Reference](https://fal.ai/models) |
|
|
241
|
-
| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | [API Reference](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) |
|
|
242
|
-
| [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` | [API Reference](https://docs.x.ai/docs/api-reference#text-to-speech) |
|
|
272
|
+
Some providers support reference-audio cloning. Pass a voice object instead of a string.
|
|
243
273
|
|
|
244
274
|
```ts
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
generateSpeech({
|
|
249
|
-
|
|
250
|
-
|
|
275
|
+
import { createFal, createMistral } from '@speech-sdk/core/providers';
|
|
276
|
+
|
|
277
|
+
// Base64 reference:
|
|
278
|
+
await generateSpeech({
|
|
279
|
+
model: createMistral()(),
|
|
280
|
+
text: 'Hello!',
|
|
281
|
+
voice: { audio: 'base64-encoded-audio...' },
|
|
282
|
+
});
|
|
251
283
|
|
|
252
|
-
|
|
284
|
+
// URL reference:
|
|
285
|
+
await generateSpeech({
|
|
286
|
+
model: createFal()('fal-ai/f5-tts'),
|
|
287
|
+
text: 'Hello!',
|
|
288
|
+
voice: { url: 'https://example.com/reference.wav' },
|
|
289
|
+
});
|
|
290
|
+
```
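The `{ audio }` form above takes base64 text, so raw reference audio has to be encoded first. A minimal sketch of that step follows; `toBase64` is a hypothetical helper, not part of the SDK, and it assumes a plain base64 string with no `data:` prefix, which the README does not state either way.

```ts
// Hypothetical helper (not an SDK export): encode raw audio bytes as base64.
function toBase64(bytes: Uint8Array): string {
  let binary = '';
  // Build the binary string byte by byte to avoid call-stack limits on large files.
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// The first four bytes of a WAV file spell "RIFF".
const riffHeader = new Uint8Array([0x52, 0x49, 0x46, 0x46]);
const referenceVoice = { audio: toBase64(riffHeader) };
console.log(referenceVoice.audio); // "UklGRg=="
```

The resulting object has the same shape as the `voice: { audio: ... }` argument shown above.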
|
|
253
291
|
|
|
254
|
-
## Custom
|
|
292
|
+
## Custom configuration
|
|
255
293
|
|
|
256
|
-
|
|
294
|
+
Factory functions give you custom API keys, base URLs, or `fetch` implementations:
|
|
257
295
|
|
|
258
296
|
```ts
|
|
259
297
|
import { generateSpeech } from '@speech-sdk/core';
|
|
260
|
-
import { createOpenAI } from '@speech-sdk/core/
|
|
261
|
-
import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
|
|
298
|
+
import { createOpenAI } from '@speech-sdk/core/providers';
|
|
262
299
|
|
|
263
300
|
const myOpenAI = createOpenAI({
|
|
264
301
|
apiKey: 'sk-...',
|
|
265
302
|
baseURL: 'https://my-proxy.com/v1',
|
|
266
303
|
});
|
|
267
304
|
|
|
268
|
-
|
|
305
|
+
await generateSpeech({
|
|
269
306
|
model: myOpenAI('gpt-4o-mini-tts'),
|
|
270
307
|
text: 'Hello!',
|
|
271
308
|
voice: 'alloy',
|
|
272
309
|
});
|
|
273
310
|
```
|
|
274
311
|
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
When using string models (e.g., `'openai/tts-1'`), API keys are resolved from environment variables (see table above). Factory functions accept an explicit `apiKey` option which takes precedence.
|
|
278
|
-
|
|
279
|
-
## Audio Tags
|
|
312
|
+
## Public imports
|
|
280
313
|
|
|
281
|
-
|
|
314
|
+
The root package exports the main runtime APIs:
|
|
282
315
|
|
|
283
316
|
```ts
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
|
|
288
|
-
|
|
289
|
-
|
|
290
|
-
|
|
317
|
+
import {
|
|
318
|
+
generateSpeech,
|
|
319
|
+
streamSpeech,
|
|
320
|
+
generateConversation,
|
|
321
|
+
timestampsToCaptions,
|
|
322
|
+
ApiError,
|
|
323
|
+
} from '@speech-sdk/core';
|
|
291
324
|
```
|
|
292
325
|
|
|
293
|
-
|
|
294
|
-
|
|
295
|
-
| Provider | Behavior |
|
|
296
|
-
|---|---|
|
|
297
|
-
| OpenAI (`gpt-4o-mini-tts`) | Tags mapped to the `instructions` field for expressive delivery control |
|
|
298
|
-
| ElevenLabs (`eleven_v3`) | All `[tag]` passed through natively |
|
|
299
|
-
| Google (`gemini-3.1-flash-tts-preview`) | All `[tag]` passed through natively (e.g. `[whispers]`, `[shouting]`, `[sighs]`, `[laugh]`) |
|
|
300
|
-
| Cartesia (`sonic-3`) | Emotion tags (`[happy]`, `[sad]`, `[angry]`, etc.) converted to SSML; `[laughter]` passed through; unknown tags stripped |
|
|
301
|
-
| All others | Tags stripped and warnings returned |
|
|
326
|
+
Provider and STT factories live under `@speech-sdk/core/providers`:
|
|
302
327
|
|
|
303
328
|
```ts
|
|
304
|
-
|
|
305
|
-
|
|
306
|
-
|
|
307
|
-
|
|
308
|
-
|
|
309
|
-
}
|
|
310
|
-
// Sent to OpenAI:
|
|
311
|
-
// input: "Hi John how are you? I'm feeling great"
|
|
312
|
-
// instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
|
|
313
|
-
console.log(result.warnings); // undefined
|
|
329
|
+
import {
|
|
330
|
+
createOpenAI,
|
|
331
|
+
createElevenLabs,
|
|
332
|
+
createCartesia,
|
|
333
|
+
createSpeechGateway,
|
|
334
|
+
} from '@speech-sdk/core/providers';
|
|
314
335
|
```
|
|
315
336
|
|
|
316
|
-
|
|
317
|
-
|
|
318
|
-
Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
|
|
337
|
+
Public types live under `@speech-sdk/core/types`:
|
|
319
338
|
|
|
320
339
|
```ts
|
|
321
|
-
import {
|
|
322
|
-
|
|
323
|
-
|
|
324
|
-
|
|
325
|
-
|
|
326
|
-
|
|
327
|
-
|
|
328
|
-
text: 'Hello!',
|
|
329
|
-
voice: { audio: 'base64-encoded-audio...' },
|
|
330
|
-
});
|
|
340
|
+
import type {
|
|
341
|
+
GenerateSpeechOptions,
|
|
342
|
+
SpeechResult,
|
|
343
|
+
ConversationResult,
|
|
344
|
+
Voice,
|
|
345
|
+
WordTimestamp,
|
|
346
|
+
} from '@speech-sdk/core/types';
|
|
331
347
|
```
|
|
332
348
|
|
|
333
|
-
|
|
334
|
-
|
|
335
|
-
```ts
|
|
336
|
-
import { createFal } from '@speech-sdk/core/fal-ai';
|
|
337
|
-
|
|
338
|
-
const fal = createFal();
|
|
339
|
-
const result = await generateSpeech({
|
|
340
|
-
model: fal('fal-ai/chatterbox'),
|
|
341
|
-
text: 'Hello!',
|
|
342
|
-
voice: { url: 'https://example.com/reference.wav' },
|
|
343
|
-
});
|
|
344
|
-
```
|
|
345
|
-
|
|
346
|
-
## Options
|
|
349
|
+
## API reference
|
|
347
350
|
|
|
348
351
|
```ts
|
|
349
352
|
generateSpeech({
|
|
350
|
-
model: string | ResolvedModel,
|
|
351
|
-
text: string,
|
|
352
|
-
voice: Voice,
|
|
353
|
-
providerOptions?: object,
|
|
354
|
-
|
|
355
|
-
|
|
356
|
-
|
|
357
|
-
|
|
358
|
-
|
|
359
|
-
|
|
360
|
-
## Result
|
|
353
|
+
model: string | ResolvedModel, // required
|
|
354
|
+
text: string, // required
|
|
355
|
+
voice: Voice, // required — string | { url } | { audio }
|
|
356
|
+
providerOptions?: object,
|
|
357
|
+
volumeDbfs?: number, // ≤ 0
|
|
358
|
+
timestamps?: boolean, // default false
|
|
359
|
+
maxRetries?: number, // default 2
|
|
360
|
+
abortSignal?: AbortSignal,
|
|
361
|
+
headers?: Record<string, string>,
|
|
362
|
+
}): Promise<SpeechResult>
|
|
361
363
|
|
|
362
|
-
```ts
|
|
363
364
|
interface SpeechResult {
|
|
364
|
-
audio: {
|
|
365
|
-
|
|
366
|
-
|
|
367
|
-
mediaType: string; // e.g. "audio/mpeg"
|
|
368
|
-
};
|
|
365
|
+
audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
|
|
366
|
+
metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
|
|
367
|
+
timestamps?: WordTimestamp[];
|
|
369
368
|
providerMetadata?: Record<string, unknown>;
|
|
369
|
+
warnings?: string[];
|
|
370
|
+
}
|
|
371
|
+
|
|
372
|
+
interface WordTimestamp { text: string; start: number; end: number } // seconds
|
|
373
|
+
|
|
374
|
+
// Returned by generateConversation — extends WordTimestamp with turnIndex
|
|
375
|
+
interface ConversationWordTimestamp extends WordTimestamp {
|
|
376
|
+
turnIndex: number; // index into the input turns[] array
|
|
370
377
|
}
|
|
371
378
|
```
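Because `WordTimestamp` values are in seconds, they map directly onto caption cues. The sketch below groups words into SRT cues to illustrate the data shape. It is not the package's `timestampsToCaptions` (whose signature is not shown in this diff); `formatSrtTime`, `wordsToSrt`, and `maxWordsPerCue` are hypothetical names.

```ts
interface WordTimestamp { text: string; start: number; end: number } // seconds

// Format seconds as an SRT timecode: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Hypothetical grouping rule: a fixed number of words per cue.
function wordsToSrt(words: WordTimestamp[], maxWordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const chunk = words.slice(i, i + maxWordsPerCue);
    const start = chunk[0].start;               // cue starts with its first word
    const end = chunk[chunk.length - 1].end;    // and ends with its last word
    const text = chunk.map((w) => w.text).join(' ');
    cues.push(`${cues.length + 1}\n${formatSrtTime(start)} --> ${formatSrtTime(end)}\n${text}`);
  }
  return cues.length > 0 ? cues.join('\n\n') + '\n' : '';
}

const demoWords: WordTimestamp[] = [
  { text: 'Hello', start: 0, end: 0.4 },
  { text: 'world', start: 0.5, end: 1.25 },
];
console.log(wordsToSrt(demoWords, 2));
```

A fixed word count keeps the example short; a real captioner would more likely split on pauses or punctuation.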
|
|
372
379
|
|
|
373
|
-
## Error
|
|
380
|
+
## Error handling
|
|
374
381
|
|
|
375
382
|
```ts
|
|
376
|
-
import { generateSpeech, ApiError
|
|
383
|
+
import { generateSpeech, ApiError } from '@speech-sdk/core';
|
|
377
384
|
|
|
378
385
|
try {
|
|
379
|
-
|
|
386
|
+
await generateSpeech({ /* ... */ });
|
|
380
387
|
} catch (error) {
|
|
381
388
|
if (error instanceof ApiError) {
|
|
382
|
-
|
|
383
|
-
|
|
384
|
-
|
|
389
|
+
error.statusCode; // 401, 429, 500, ...
|
|
390
|
+
error.responseBody;
|
|
391
|
+
error.code; // stable machine-readable code (optional)
|
|
385
392
|
}
|
|
386
393
|
}
|
|
387
394
|
```
|
|
388
395
|
|
|
396
|
+
`ApiError.code` is populated from the RFC 7807 `application/problem+json` `code` extension when the upstream provides one (currently only the Speech Gateway). Match on `error.code` rather than on `error.message` text: codes are a stable contract, messages aren't.
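One way to apply that advice is to branch on the code first and fall back to the HTTP status only when no code is present. The sketch below is self-contained, so it declares a local stand-in for `ApiError` with the three fields shown above. The code strings `'rate_limited'` and `'invalid_api_key'` and the `classify` helper are hypothetical; the real code values are not listed in this README.

```ts
// Local stand-in so the sketch runs without the SDK; the real class comes from '@speech-sdk/core'.
class ApiError extends Error {
  statusCode: number;
  responseBody?: string;
  code?: string;
  constructor(message: string, statusCode: number, responseBody?: string, code?: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on ES5 targets
    this.name = 'ApiError';
    this.statusCode = statusCode;
    this.responseBody = responseBody;
    this.code = code;
  }
}

type Action = 'retry-later' | 'check-credentials' | 'fail';

function classify(error: unknown): Action {
  if (!(error instanceof ApiError)) return 'fail';
  // Prefer the machine-readable code. These code strings are hypothetical examples.
  switch (error.code) {
    case 'rate_limited':
      return 'retry-later';
    case 'invalid_api_key':
      return 'check-credentials';
  }
  // No recognized code: fall back to the HTTP status, never the message text.
  if (error.statusCode === 429) return 'retry-later';
  if (error.statusCode === 401) return 'check-credentials';
  return 'fail';
}

console.log(classify(new ApiError('Too many requests', 429))); // "retry-later"
```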
|
|
397
|
+
|
|
389
398
|
| Error | When |
|
|
390
399
|
|---|---|
|
|
391
|
-
| `ApiError` | Provider
|
|
392
|
-
| `
|
|
393
|
-
| `
|
|
394
|
-
|
|
395
|
-
|
|
400
|
+
| `ApiError` | Provider returned non-2xx |
|
|
401
|
+
| `MissingApiKeyError` | No `apiKey` passed and the provider's env var is unset |
|
|
402
|
+
| `NoSpeechGeneratedError` | Empty input (after tag stripping) or empty provider response |
|
|
403
|
+
| `StreamingNotSupportedError` | `streamSpeech()` on a non-streaming model |
|
|
404
|
+
| `VolumeAdjustmentUnsupportedError` | `volumeDbfs` with no decodable output mode |
|
|
405
|
+
| `TimestampKeyMissingError` | `timestamps: true` with no native support, no `fallbackSTT` configured, and `OPENAI_API_KEY` not set |
|
|
406
|
+
| `ConversationInputError` / `DialogueConstraintError` / `StitchUnsupportedError` | `generateConversation` validation / native caps / stitch incompatibility |
|
|
407
|
+
| `SpeechSDKError` | Base class |
|
|
396
408
|
|
|
397
|
-
|
|
409
|
+
Requests that fail with a 5xx status or a network error are retried with exponential backoff ([p-retry](https://github.com/sindresorhus/p-retry)); 4xx responses are not retried. The default is 2 retries; override it via `maxRetries`.
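The retry policy can be sketched in a few lines. This is a hand-rolled illustration of the rule as described, not the SDK's code (which delegates to p-retry). `HttpError`, `isRetryable`, `withRetries`, and the 100 ms base delay are all assumptions made for the example.

```ts
// Hypothetical error type carrying an HTTP status.
class HttpError extends Error {
  status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on ES5 targets
    this.status = status;
  }
}

// 5xx is retryable, 4xx is not; any other thrown value is treated as a network error.
function isRetryable(error: unknown): boolean {
  if (error instanceof HttpError) return error.status >= 500;
  return true;
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 2,     // mirrors the documented default
  baseDelayMs = 100,  // assumed value; the README does not give one
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up once retries are exhausted or the error is not retryable.
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      const delayMs = baseDelayMs * 2 ** attempt; // 100, 200, 400, ...
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With `maxRetries = 2`, a call is attempted at most three times: the initial request plus two retries.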
|
|
398
410
|
|
|
399
411
|
## Development
|
|
400
412
|
|
|
401
413
|
```bash
|
|
402
414
|
pnpm install
|
|
403
|
-
pnpm test
|
|
404
|
-
pnpm run test:e2e
|
|
405
|
-
pnpm run typecheck
|
|
406
|
-
|
|
407
|
-
|
|
408
|
-
E2E tests hit real provider APIs. Set the relevant API key environment variables in a `.env` file or export them in your shell.
|
|
409
|
-
|
|
410
|
-
Set `SPEECH_SDK_E2E_OUTPUT_DIR` to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
|
|
411
|
-
|
|
412
|
-
```bash
|
|
413
|
-
SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
|
|
415
|
+
pnpm test # unit tests
|
|
416
|
+
pnpm run test:e2e # e2e tests (requires provider API keys)
|
|
417
|
+
pnpm run typecheck
|
|
418
|
+
pnpm fix # format + lint
|
|
414
419
|
```
|
|
415
420
|
|
|
416
|
-
|
|
417
|
-
|
|
418
|
-
MIT
|
|
421
|
+
E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.
|