@speech-sdk/core 0.6.1 → 0.7.0
- package/LICENSE +202 -21
- package/README.md +215 -269
- package/dist/__tests__/e2e/_save-audio.d.ts +51 -2
- package/dist/__tests__/e2e/_save-audio.d.ts.map +1 -1
- package/dist/__tests__/e2e/_save-audio.js +139 -11
- package/dist/__tests__/e2e/_save-audio.js.map +1 -1
- package/dist/audio-utils.d.ts +2 -0
- package/dist/audio-utils.d.ts.map +1 -1
- package/dist/audio-utils.js +9 -0
- package/dist/audio-utils.js.map +1 -1
- package/dist/captions.d.ts +137 -0
- package/dist/captions.d.ts.map +1 -0
- package/dist/captions.js +283 -0
- package/dist/captions.js.map +1 -0
- package/dist/conversation/stitch.d.ts +5 -0
- package/dist/conversation/stitch.d.ts.map +1 -1
- package/dist/conversation/stitch.js +37 -0
- package/dist/conversation/stitch.js.map +1 -1
- package/dist/conversation/types.d.ts +16 -0
- package/dist/conversation/types.d.ts.map +1 -1
- package/dist/conversation/validate.d.ts.map +1 -1
- package/dist/conversation/validate.js +0 -6
- package/dist/conversation/validate.js.map +1 -1
- package/dist/derive-timestamps.d.ts +14 -0
- package/dist/derive-timestamps.d.ts.map +1 -0
- package/dist/derive-timestamps.js +38 -0
- package/dist/derive-timestamps.js.map +1 -0
- package/dist/errors.d.ts +25 -0
- package/dist/errors.d.ts.map +1 -1
- package/dist/errors.js +28 -0
- package/dist/errors.js.map +1 -1
- package/dist/generate-conversation.d.ts +2 -1
- package/dist/generate-conversation.d.ts.map +1 -1
- package/dist/generate-conversation.js +72 -0
- package/dist/generate-conversation.js.map +1 -1
- package/dist/generate-speech.d.ts +18 -1
- package/dist/generate-speech.d.ts.map +1 -1
- package/dist/generate-speech.js +73 -16
- package/dist/generate-speech.js.map +1 -1
- package/dist/index.d.ts +6 -2
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +2 -1
- package/dist/index.js.map +1 -1
- package/dist/logger.d.ts +2 -0
- package/dist/logger.d.ts.map +1 -0
- package/dist/logger.js +40 -0
- package/dist/logger.js.map +1 -0
- package/dist/provider-utils.d.ts +8 -0
- package/dist/provider-utils.d.ts.map +1 -1
- package/dist/provider-utils.js +16 -2
- package/dist/provider-utils.js.map +1 -1
- package/dist/providers/cartesia/alignment.d.ts +24 -0
- package/dist/providers/cartesia/alignment.d.ts.map +1 -0
- package/dist/providers/cartesia/alignment.js +23 -0
- package/dist/providers/cartesia/alignment.js.map +1 -0
- package/dist/providers/cartesia/index.d.ts +12 -2
- package/dist/providers/cartesia/index.d.ts.map +1 -1
- package/dist/providers/cartesia/index.js +137 -2
- package/dist/providers/cartesia/index.js.map +1 -1
- package/dist/providers/elevenlabs/alignment.d.ts +24 -0
- package/dist/providers/elevenlabs/alignment.d.ts.map +1 -0
- package/dist/providers/elevenlabs/alignment.js +48 -0
- package/dist/providers/elevenlabs/alignment.js.map +1 -0
- package/dist/providers/elevenlabs/index.d.ts +19 -4
- package/dist/providers/elevenlabs/index.d.ts.map +1 -1
- package/dist/providers/elevenlabs/index.js +83 -13
- package/dist/providers/elevenlabs/index.js.map +1 -1
- package/dist/providers/fal/index.d.ts +0 -25
- package/dist/providers/fal/index.d.ts.map +1 -1
- package/dist/providers/fal/index.js +3 -58
- package/dist/providers/fal/index.js.map +1 -1
- package/dist/providers/hume/alignment.d.ts +38 -0
- package/dist/providers/hume/alignment.d.ts.map +1 -0
- package/dist/providers/hume/alignment.js +31 -0
- package/dist/providers/hume/alignment.js.map +1 -0
- package/dist/providers/hume/index.d.ts +8 -1
- package/dist/providers/hume/index.d.ts.map +1 -1
- package/dist/providers/hume/index.js +75 -1
- package/dist/providers/hume/index.js.map +1 -1
- package/dist/providers/inworld/alignment.d.ts +25 -0
- package/dist/providers/inworld/alignment.d.ts.map +1 -0
- package/dist/providers/inworld/alignment.js +23 -0
- package/dist/providers/inworld/alignment.js.map +1 -0
- package/dist/providers/inworld/index.d.ts +11 -2
- package/dist/providers/inworld/index.d.ts.map +1 -1
- package/dist/providers/inworld/index.js +11 -2
- package/dist/providers/inworld/index.js.map +1 -1
- package/dist/providers/murf/alignment.d.ts +22 -0
- package/dist/providers/murf/alignment.d.ts.map +1 -0
- package/dist/providers/murf/alignment.js +17 -0
- package/dist/providers/murf/alignment.js.map +1 -0
- package/dist/providers/murf/index.d.ts +8 -1
- package/dist/providers/murf/index.d.ts.map +1 -1
- package/dist/providers/murf/index.js +10 -1
- package/dist/providers/murf/index.js.map +1 -1
- package/dist/providers/openai/index.d.ts +12 -3
- package/dist/providers/openai/index.d.ts.map +1 -1
- package/dist/providers/openai/index.js +7 -3
- package/dist/providers/openai/index.js.map +1 -1
- package/dist/providers/resemble/alignment.d.ts +32 -0
- package/dist/providers/resemble/alignment.d.ts.map +1 -0
- package/dist/providers/resemble/alignment.js +57 -0
- package/dist/providers/resemble/alignment.js.map +1 -0
- package/dist/providers/resemble/index.d.ts +7 -1
- package/dist/providers/resemble/index.d.ts.map +1 -1
- package/dist/providers/resemble/index.js +13 -1
- package/dist/providers/resemble/index.js.map +1 -1
- package/dist/resolve-provider.d.ts.map +1 -1
- package/dist/resolve-provider.js +3 -12
- package/dist/resolve-provider.js.map +1 -1
- package/dist/speech-provider.d.ts +48 -4
- package/dist/speech-provider.d.ts.map +1 -1
- package/dist/speech-provider.js +16 -0
- package/dist/speech-provider.js.map +1 -1
- package/dist/speech-result.d.ts +10 -0
- package/dist/speech-result.d.ts.map +1 -1
- package/dist/speech-result.js.map +1 -1
- package/dist/speech-to-text-provider.d.ts +40 -0
- package/dist/speech-to-text-provider.d.ts.map +1 -0
- package/dist/speech-to-text-provider.js +2 -0
- package/dist/speech-to-text-provider.js.map +1 -0
- package/dist/stt-providers/openai/index.d.ts +42 -0
- package/dist/stt-providers/openai/index.d.ts.map +1 -0
- package/dist/stt-providers/openai/index.js +184 -0
- package/dist/stt-providers/openai/index.js.map +1 -0
- package/dist/timestamps.d.ts +23 -0
- package/dist/timestamps.d.ts.map +1 -0
- package/dist/timestamps.js +2 -0
- package/dist/timestamps.js.map +1 -0
- package/package.json +6 -2
package/README.md
CHANGED
|
@@ -4,28 +4,38 @@
|
|
|
4
4
|
[](https://www.npmjs.com/package/@speech-sdk/core)
|
|
5
5
|
[](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
|
|
6
6
|
|
|
7
|
-
|
|
7
|
+
A lightweight, provider-agnostic TypeScript SDK for text-to-speech. One API, 13 providers, zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
|
|
8
8
|
|
|
9
|
-
|
|
9
|
+
<img width="1200" height="630" alt="Speech SDK" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
|
|
10
10
|
|
|
11
|
-
|
|
11
|
+
Learn more at [speechsdk.dev](https://speechsdk.dev/).
|
|
12
12
|
|
|
13
|
+
## Features
|
|
13
14
|
|
|
14
|
-
|
|
15
|
+
- **Universal** — `generateSpeech()` works across OpenAI, ElevenLabs, Deepgram, Cartesia, Hume, Google Gemini TTS, Fish Audio, Inworld, Murf, Resemble, fal, Mistral, and xAI.
|
|
16
|
+
- **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
|
|
17
|
+
- **Conversations** — `generateConversation()` produces multi-speaker audio, using native dialogue endpoints when available and stitching locally when not.
|
|
18
|
+
- **Word-level timestamps** — `timestamps: "on"` returns alignment, using the provider's native data or falling back to STT.
|
|
19
|
+
- **Volume normalization** — RMS-level outputs to an absolute loudness target.
|
|
20
|
+
- **Audio tags & voice cloning** — `[laugh]`, `[sigh]`, emotion cues; reference-audio cloning where supported.
|
|
15
21
|
|
|
16
|
-
|
|
17
|
-
npm install @speech-sdk/core
|
|
18
|
-
```
|
|
22
|
+
## Contents
|
|
19
23
|
|
|
20
|
-
|
|
24
|
+
- [Install](#install) · [Quick start](#quick-start) · [Supported providers](#supported-providers)
|
|
25
|
+
- [Streaming](#streaming) · [Conversations](#conversations) · [Timestamps](#timestamps)
|
|
26
|
+
- [Volume normalization](#volume-normalization) · [Audio tags](#audio-tags) · [Voice cloning](#voice-cloning)
|
|
27
|
+
- [Custom configuration](#custom-configuration) · [API reference](#api-reference) · [Error handling](#error-handling) · [Development](#development)
|
|
21
28
|
|
|
22
|
-
|
|
29
|
+
## Install
|
|
23
30
|
|
|
24
31
|
```bash
|
|
25
|
-
|
|
32
|
+
npm install @speech-sdk/core
|
|
26
33
|
```
|
|
27
34
|
|
|
28
|
-
|
|
35
|
+
> [!TIP]
|
|
36
|
+
> Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
|
|
37
|
+
|
|
38
|
+
## Quick start
|
|
29
39
|
|
|
30
40
|
```ts
|
|
31
41
|
import { generateSpeech } from '@speech-sdk/core';
|
|
@@ -36,383 +46,319 @@ const result = await generateSpeech({
|
|
|
36
46
|
voice: 'alloy',
|
|
37
47
|
});
|
|
38
48
|
|
|
39
|
-
// Access the audio
|
|
40
49
|
result.audio.uint8Array; // Uint8Array
|
|
41
|
-
result.audio.base64; // string (lazy
|
|
50
|
+
result.audio.base64; // string (lazy)
|
|
42
51
|
result.audio.mediaType; // "audio/mpeg"
|
|
43
52
|
```
|
|
44
53
|
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
Pass `volumeDbfs` to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
|
|
54
|
+
Pass a `provider/model` string, or just the provider name to use its default model. API keys are read from env vars automatically.
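The `provider/model` convention described above can be illustrated with a small parser. This is a minimal sketch, not the SDK's own resolver: `parseModelString` and `ModelRef` are hypothetical names, and splitting on the first slash is an assumption about how such strings might be read.

```ts
// Hypothetical helper for illustration; not part of @speech-sdk/core.
interface ModelRef {
  provider: string;
  model?: string; // undefined means "use the provider's default model"
}

function parseModelString(input: string): ModelRef {
  const slash = input.indexOf('/');
  // No slash: only a provider name was given.
  if (slash === -1) return { provider: input };
  const provider = input.slice(0, slash);
  // Split on the first slash only, so model ids that contain slashes stay intact.
  const model = input.slice(slash + 1);
  return model.length > 0 ? { provider, model } : { provider };
}
```

Under these assumptions, `parseModelString('openai')` leaves `model` unset, which a caller could map to the default model listed for that provider.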
|
|
48
55
|
|
|
49
|
-
|
|
50
|
-
const result = await generateSpeech({
|
|
51
|
-
model: 'openai/gpt-4o-mini-tts',
|
|
52
|
-
text: 'Hello from speech-sdk!',
|
|
53
|
-
voice: 'alloy',
|
|
54
|
-
volumeDbfs: -20,
|
|
55
|
-
});
|
|
56
|
+
## Supported providers
|
|
56
57
|
|
|
57
|
-
|
|
58
|
-
|
|
58
|
+
| Provider | Prefix | Default model | Env var |
|
|
59
|
+
|---|---|---|---|
|
|
60
|
+
| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` |
|
|
61
|
+
| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` |
|
|
62
|
+
| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` |
|
|
63
|
+
| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` |
|
|
64
|
+
| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` |
|
|
65
|
+
| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` |
|
|
66
|
+
| [Google Gemini TTS](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` |
|
|
67
|
+
| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` |
|
|
68
|
+
| [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` |
|
|
69
|
+
| [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` |
|
|
70
|
+
| [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` |
|
|
71
|
+
| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` |
|
|
72
|
+
| [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` |
|
|
59
73
|
|
|
60
|
-
|
|
74
|
+
Provider-specific parameters pass through via `providerOptions` using each API's native field names.
|
|
61
75
|
|
|
62
76
|
## Streaming
|
|
63
77
|
|
|
64
|
-
|
|
78
|
+
`streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
|
|
65
79
|
|
|
66
80
|
```ts
|
|
67
|
-
import { streamSpeech } from
|
|
81
|
+
import { streamSpeech } from '@speech-sdk/core';
|
|
68
82
|
|
|
69
83
|
const { audio, mediaType } = await streamSpeech({
|
|
70
|
-
model:
|
|
71
|
-
text:
|
|
72
|
-
voice:
|
|
73
|
-
});
|
|
74
|
-
```
|
|
75
|
-
|
|
76
|
-
### Pipe to a file (Node)
|
|
77
|
-
|
|
78
|
-
```ts
|
|
79
|
-
import { createWriteStream } from "node:fs";
|
|
80
|
-
import { Readable } from "node:stream";
|
|
81
|
-
|
|
82
|
-
const { audio } = await streamSpeech({
|
|
83
|
-
model: "elevenlabs/eleven_flash_v2_5",
|
|
84
|
-
text: "Hello world",
|
|
85
|
-
voice: "JBFqnCBsd6RMkjVDRZzb",
|
|
86
|
-
});
|
|
87
|
-
|
|
88
|
-
await new Promise((resolve, reject) => {
|
|
89
|
-
Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
|
|
84
|
+
model: 'cartesia/sonic-3',
|
|
85
|
+
text: 'Streaming straight to the client.',
|
|
86
|
+
voice: 'voice-id',
|
|
90
87
|
});
|
|
91
|
-
```
|
|
92
|
-
|
|
93
|
-
### Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
|
|
94
|
-
|
|
95
|
-
```ts
|
|
96
|
-
export async function GET() {
|
|
97
|
-
const { audio, mediaType } = await streamSpeech({
|
|
98
|
-
model: "cartesia/sonic-3",
|
|
99
|
-
text: "Streaming straight to the client.",
|
|
100
|
-
voice: "voice-id",
|
|
101
|
-
});
|
|
102
|
-
|
|
103
|
-
return new Response(audio, { headers: { "Content-Type": mediaType } });
|
|
104
|
-
}
|
|
105
|
-
```
|
|
106
|
-
|
|
107
|
-
### Read chunks manually
|
|
108
88
|
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
while (true) {
|
|
112
|
-
const { value, done } = await reader.read();
|
|
113
|
-
if (done) break;
|
|
114
|
-
// value is a Uint8Array of audio bytes
|
|
115
|
-
}
|
|
116
|
-
```
|
|
117
|
-
|
|
118
|
-
### Capability check
|
|
119
|
-
|
|
120
|
-
Check whether a model supports streaming before calling `streamSpeech()`:
|
|
121
|
-
|
|
122
|
-
```ts
|
|
123
|
-
import { hasFeature } from "@speech-sdk/core";
|
|
124
|
-
|
|
125
|
-
const model = provider.models.find((m) => m.id === "tts-1");
|
|
126
|
-
if (hasFeature(model, "streaming")) {
|
|
127
|
-
// safe to call streamSpeech()
|
|
128
|
-
}
|
|
89
|
+
// Forward to an HTTP response:
|
|
90
|
+
return new Response(audio, { headers: { 'Content-Type': mediaType } });
|
|
129
91
|
```
|
|
130
92
|
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
### Errors and retries
|
|
134
|
-
|
|
135
|
-
Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the `ReadableStream` consumer as a stream error and are not retried. Pass `maxRetries` (default `2`) and an `abortSignal` the same way as `generateSpeech()`.
|
|
93
|
+
> [!NOTE]
|
|
94
|
+
> Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
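Because mid-stream errors reach the consumer rather than being retried, the code that reads the stream is where they need to be handled. The sketch below drains a standard `ReadableStream<Uint8Array>` into one buffer using only the Web Streams API; `collectStream` is a hypothetical helper, not an SDK export. If the stream errors partway through, `reader.read()` rejects and so does the returned promise.

```ts
// Hypothetical helper: drains a byte stream into a single Uint8Array.
async function collectStream(stream: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const reader = stream.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    while (true) {
      const result = await reader.read(); // rejects if the stream errors mid-flight
      if (result.done) break;
      chunks.push(result.value);
      total += result.value.byteLength;
    }
  } finally {
    reader.releaseLock();
  }
  // Concatenate the chunks in arrival order.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

A caller would typically wrap `await collectStream(audio)` in `try`/`catch` and decide there whether to restart the whole request.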
|
|
136
95
|
|
|
137
96
|
## Conversations
|
|
138
97
|
|
|
139
|
-
`generateConversation()` produces a single multi-voice
|
|
98
|
+
`generateConversation()` produces a single multi-voice clip from an ordered array of turns, picking the best path automatically:
|
|
140
99
|
|
|
141
|
-
- **Native dialogue** —
|
|
142
|
-
- **Stitch fallback** —
|
|
100
|
+
- **Native dialogue** — one provider with a multi-speaker endpoint (ElevenLabs v3, Gemini TTS, Hume Octave, Fish Audio S2-Pro, fal Dia). One API call, natural mix.
|
|
101
|
+
- **Stitch fallback** — multi-provider or no dialogue endpoint. Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
|
|
143
102
|
|
|
144
103
|
```ts
|
|
145
|
-
import { generateConversation } from
|
|
104
|
+
import { generateConversation } from '@speech-sdk/core/conversation';
|
|
146
105
|
|
|
147
106
|
const result = await generateConversation({
|
|
148
107
|
turns: [
|
|
149
|
-
{ model:
|
|
150
|
-
{ model:
|
|
151
|
-
{ model:
|
|
152
|
-
{ model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
|
|
108
|
+
{ model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
|
|
109
|
+
{ model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
|
|
110
|
+
{ model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
|
|
153
111
|
],
|
|
154
112
|
});
|
|
155
|
-
|
|
156
|
-
result.audio.uint8Array; // Uint8Array of one combined WAV
|
|
157
|
-
result.audio.mediaType; // "audio/wav"
|
|
158
113
|
```
|
|
159
114
|
|
|
160
|
-
|
|
115
|
+
Options: `gapMs` (default 300), `normalizeVolume` (default `true`), `volumeDbfs` (default `-20`), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `timestampProvider`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native).
|
|
116
|
+
|
|
117
|
+
**Native dialogue caps:**
|
|
118
|
+
|
|
119
|
+
| Provider | Models | Voice constraints |
|
|
120
|
+
|---|---|---|
|
|
121
|
+
| ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 chars |
|
|
122
|
+
| Google | `gemini-2.5-{flash,pro}-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** |
|
|
123
|
+
| Hume | `octave-1`, `octave-2` | 1–4 voices |
|
|
124
|
+
| Fish Audio | `s2-pro` | 1–4 voices |
|
|
125
|
+
|
|
126
|
+
## Timestamps
|
|
161
127
|
|
|
162
|
-
|
|
128
|
+
Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
|
|
163
129
|
|
|
164
130
|
```ts
|
|
165
|
-
|
|
166
|
-
model
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
volumeDbfs?: number, // RMS target loudness in dBFS (≤0), default -20
|
|
171
|
-
maxConcurrency?: number, // cap parallel generateSpeech calls, default 6
|
|
172
|
-
maxRetries?: number, // per-turn retries, default 2
|
|
173
|
-
apiKey?: string,
|
|
174
|
-
providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
|
|
175
|
-
abortSignal?: AbortSignal,
|
|
176
|
-
headers?: Record<string, string>,
|
|
131
|
+
const result = await generateSpeech({
|
|
132
|
+
model: 'elevenlabs/eleven_multilingual_v2',
|
|
133
|
+
text: 'Hello from speech-sdk!',
|
|
134
|
+
voice: 'JBFqnCBsd6RMkjVDRZzb',
|
|
135
|
+
timestamps: 'on',
|
|
177
136
|
});
|
|
178
137
|
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
138
|
+
result.timestamps;
|
|
139
|
+
// [
|
|
140
|
+
// { text: "Hello", start: 0.00, end: 0.32 },
|
|
141
|
+
// { text: "from", start: 0.36, end: 0.55 },
|
|
142
|
+
// ...
|
|
143
|
+
// ]
|
|
185
144
|
```
|
|
186
145
|
|
|
187
|
-
|
|
146
|
+
| Mode | Behavior |
|
|
147
|
+
|---|---|
|
|
148
|
+
| `"auto"` *(default)* | Return timestamps only if the provider supplies them natively. Free. |
|
|
149
|
+
| `"on"` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
|
|
150
|
+
| `"off"` | Never return timestamps. |
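The three modes in the table reduce to a small decision rule. The following sketch restates that rule as code; `chooseTimestampSource` and its types are hypothetical names used only for illustration, and the function models the documented behavior rather than the SDK's internals.

```ts
type TimestampMode = 'auto' | 'on' | 'off';
type TimestampSource = 'native' | 'stt-fallback' | 'none';

// Hypothetical helper mirroring the mode table above.
function chooseTimestampSource(
  mode: TimestampMode,
  providerHasNativeAlignment: boolean,
): TimestampSource {
  if (mode === 'off') return 'none';
  // Native alignment is preferred whenever the provider supplies it.
  if (providerHasNativeAlignment) return 'native';
  // Without native data, only "on" pays for the STT transcription.
  return mode === 'on' ? 'stt-fallback' : 'none';
}
```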
|
|
188
151
|
|
|
189
|
-
|
|
152
|
+
On `"on"`, the fallback defaults to OpenAI Whisper (`openai/whisper-1`, needs `OPENAI_API_KEY`). Override by constructing a `ResolvedSTTModel` via a factory and passing it as `timestampProvider`:
|
|
190
153
|
|
|
191
154
|
```ts
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
155
|
+
import { createOpenAISTT } from '@speech-sdk/core/stt/openai';
|
|
156
|
+
|
|
157
|
+
await generateSpeech({
|
|
158
|
+
model: 'cartesia/sonic-3',
|
|
159
|
+
text: 'Hello!',
|
|
160
|
+
voice: 'voice-id',
|
|
161
|
+
timestamps: 'on',
|
|
162
|
+
timestampProvider: createOpenAISTT({ apiKey: process.env.MY_WHISPER_KEY })('whisper-1'),
|
|
195
163
|
});
|
|
196
164
|
```
|
|
197
165
|
|
|
198
|
-
|
|
166
|
+
**Per-provider support:**
|
|
199
167
|
|
|
200
|
-
|
|
168
|
+
| Provider | Timestamps |
|
|
169
|
+
|---|---|
|
|
170
|
+
| ElevenLabs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2`, `eleven_flash_v2_5`) | **Native** — returned in the TTS response, free on `"auto"` |
|
|
171
|
+
| Murf (`GEN2`) | **Native** — `wordDurations` returned in the TTS response, free on `"auto"` (FALCON streaming model has no native alignment) |
|
|
172
|
+
| Hume (`octave-2`) | **Native** — word alignment from the JSON `/v0/tts` endpoint, free on `"auto"` (`octave-1` has no native alignment) |
|
|
173
|
+
| Inworld (`inworld-tts-1.5-max`, `inworld-tts-1.5-mini`) | **Native** — `timestampInfo.wordAlignment` returned in the TTS response, free on `"auto"` (best on English/Spanish) |
|
|
174
|
+
| Cartesia (`sonic-3`, `sonic-2`) | **Native** — routed through `/tts/sse` with `add_timestamps: true`; merges interleaved chunk + timestamps events into audio + `WordTimestamp[]` |
|
|
175
|
+
| Resemble (`default`) | **Native** — `audio_timestamps` always returned by `/synthesize`; SDK aggregates grapheme-level timing into words (mirrors ElevenLabs aggregator) |
|
|
176
|
+
| All others (OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI) | No native alignment; `"on"` transcribes via the STT fallback, `"auto"` returns `undefined` |
|
|
201
177
|
|
|
202
|
-
|
|
178
|
+
`generateConversation` accepts the same options and returns a flat `WordTimestamp[]` across all turns — stitch-path timings are offset by cumulative turn duration + gap.
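The offsetting described above (cumulative turn duration plus gap) can be sketched as follows. `TurnTimestamps` and `flattenTurnTimestamps` are hypothetical names, and the sketch assumes each turn's audio duration is already known in seconds; the `WordTimestamp` shape and the 300 ms default gap are taken from the README text.

```ts
interface WordTimestamp { text: string; start: number; end: number } // seconds

// Hypothetical per-turn input: timings are relative to the start of that turn.
interface TurnTimestamps {
  durationSec: number;
  words: WordTimestamp[];
}

function flattenTurnTimestamps(turns: TurnTimestamps[], gapMs = 300): WordTimestamp[] {
  const gapSec = gapMs / 1000;
  const flat: WordTimestamp[] = [];
  let offset = 0; // start of the current turn within the stitched audio
  for (const turn of turns) {
    for (const word of turn.words) {
      flat.push({ text: word.text, start: word.start + offset, end: word.end + offset });
    }
    // The next turn starts after this turn's audio plus the inserted silence.
    offset += turn.durationSec + gapSec;
  }
  return flat;
}
```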
|
|
203
179
|
|
|
204
|
-
|
|
180
|
+
### Captions (SRT / WebVTT)
|
|
205
181
|
|
|
206
|
-
|
|
207
|
-
|---|---|
|
|
208
|
-
| `ConversationInputError` | Validation failure — empty turns, blank text, more than 4 unique voices, or a turn missing a model |
|
|
209
|
-
| `DialogueConstraintError` | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
|
|
210
|
-
| `StitchUnsupportedError` | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
|
|
182
|
+
Convert word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT (required for HTML `<track>`).
|
|
211
183
|
|
|
212
|
-
|
|
184
|
+
```ts
|
|
185
|
+
import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';
|
|
213
186
|
|
|
214
|
-
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
| fal | `dia-tts` | 1–2 voices |
|
|
221
|
-
|
|
222
|
-
Across the SDK, conversations are capped at **4 unique voices** total regardless of provider.
|
|
223
|
-
|
|
224
|
-
## Supported Providers
|
|
225
|
-
|
|
226
|
-
Use `provider/model` strings. Passing just the provider name uses its default model.
|
|
227
|
-
|
|
228
|
-
| Provider | String Prefix | Default Model | Env Var | Docs |
|
|
229
|
-
|---|---|---|---|---|
|
|
230
|
-
| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | [API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech) |
|
|
231
|
-
| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | [API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) |
|
|
232
|
-
| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | [API Reference](https://developers.deepgram.com/docs/tts-models) |
|
|
233
|
-
| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | [API Reference](https://docs.cartesia.ai/api-reference/tts/bytes) |
|
|
234
|
-
| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` | [API Reference](https://dev.hume.ai/reference/text-to-speech-tts/synthesize-json) |
|
|
235
|
-
| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` | [API Reference](https://docs.inworld.ai/tts/api-reference) |
|
|
236
|
-
| [Google (Gemini TTS)](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | [API Reference](https://ai.google.dev/gemini-api/docs/text-generation) |
|
|
237
|
-
| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | [API Reference](https://docs.fish.audio/developer-guide/core-features/text-to-speech) |
|
|
238
|
-
| [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` | [API Reference](https://murf.ai/api/docs/api-reference/text-to-speech/generate) |
|
|
239
|
-
| [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` | [API Reference](https://docs.resemble.ai/api-reference/text-to-speech/synthesize) |
|
|
240
|
-
| [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` | [API Reference](https://fal.ai/models) |
|
|
241
|
-
| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | [API Reference](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) |
|
|
242
|
-
| [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` | [API Reference](https://docs.x.ai/docs/api-reference#text-to-speech) |
|
|
187
|
+
const { timestamps } = await generateSpeech({
|
|
188
|
+
model: 'elevenlabs/eleven_v3',
|
|
189
|
+
text: 'Hello world. This is a test.',
|
|
190
|
+
voice: 'JBFqnCBsd6RMkjVDRZzb',
|
|
191
|
+
timestamps: 'on',
|
|
192
|
+
});
|
|
243
193
|
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
194
|
+
const srt = timestampsToCaptions(timestamps ?? []);
|
|
195
|
+
// 1
|
|
196
|
+
// 00:00:00,000 --> 00:00:01,200
|
|
197
|
+
// Hello world.
|
|
198
|
+
//
|
|
199
|
+
// 2
|
|
200
|
+
// 00:00:01,300 --> 00:00:02,800
|
|
201
|
+
// This is a test.
|
|
202
|
+
|
|
203
|
+
const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
|
|
204
|
+
// WEBVTT
|
|
205
|
+
//
|
|
206
|
+
// 1
|
|
207
|
+
// 00:00:00.000 --> 00:00:01.200
|
|
208
|
+
// Hello world.
|
|
209
|
+
//
|
|
210
|
+
// 2
|
|
211
|
+
// 00:00:01.300 --> 00:00:02.800
|
|
212
|
+
// This is a test.
|
|
250
213
|
```
|
|
251
214
|
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
## Custom Configuration
|
|
215
|
+
Output follows the SubRip and [W3C WebVTT](https://www.w3.org/TR/webvtt1/) conventions: comma-decimal (SRT) vs period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (`&`, `<`, `>`) on the VTT path.
|
|
255
216
|
|
|
256
|
-
|
|
217
|
+
Cues break on sentence boundaries (`.`, `!`, `?`), then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
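To make the formatting rules concrete, here is a simplified sketch of two pieces: cue-time formatting (comma for SRT, period for VTT) and grouping words into cues at sentence-ending punctuation. `formatCueTime` and `toSentenceCues` are hypothetical helpers, not the package's `timestampsToCaptions`, and the sketch deliberately omits the character-count, duration, and comma-based subdivision that the README describes.

```ts
interface WordTimestamp { text: string; start: number; end: number } // seconds
interface Cue { text: string; start: number; end: number }

// Formats seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT).
function formatCueTime(seconds: number, format: 'srt' | 'vtt'): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  const sep = format === 'srt' ? ',' : '.';
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(ms, 3)}`;
}

// Starts a new cue after any word that ends with '.', '!' or '?'.
function toSentenceCues(words: WordTimestamp[]): Cue[] {
  const cues: Cue[] = [];
  let current: WordTimestamp[] = [];
  const flush = () => {
    if (current.length === 0) return;
    cues.push({
      text: current.map((w) => w.text).join(' '),
      start: current[0].start,
      end: current[current.length - 1].end,
    });
    current = [];
  };
  for (const word of words) {
    current.push(word);
    if (/[.!?]$/.test(word.text)) flush();
  }
  flush(); // trailing words without closing punctuation still form a cue
  return cues;
}
```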
|
|
257
218
|
|
|
258
|
-
|
|
259
|
-
import { generateSpeech } from '@speech-sdk/core';
|
|
260
|
-
import { createOpenAI } from '@speech-sdk/core/openai';
|
|
261
|
-
import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
|
|
219
|
+
## Volume normalization
|
|
262
220
|
|
|
263
|
-
|
|
264
|
-
apiKey: 'sk-...',
|
|
265
|
-
baseURL: 'https://my-proxy.com/v1',
|
|
266
|
-
});
|
|
221
|
+
Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; `-20` is the broadcast/podcast convention).
|
|
267
222
|
|
|
223
|
+
```ts
|
|
268
224
|
const result = await generateSpeech({
|
|
269
|
-
model:
|
|
225
|
+
model: 'openai/gpt-4o-mini-tts',
|
|
270
226
|
text: 'Hello!',
|
|
271
227
|
voice: 'alloy',
|
|
228
|
+
volumeDbfs: -20,
|
|
272
229
|
});
|
|
273
|
-
```
|
|
274
230
|
|
|
275
|
-
|
|
231
|
+
result.audio.mediaType; // "audio/wav" — re-encoded after normalization
|
|
232
|
+
```
|
|
276
233
|
|
|
277
|
-
|
|
234
|
+
`generateConversation` normalizes by default. Pass `normalizeVolume: false` to skip. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable PCM/WAV mode.
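RMS normalization to a dBFS target comes down to measuring the current level and applying one gain factor. The sketch below shows that arithmetic on decoded float samples in the range [-1, 1]. It is an illustration under stated assumptions, not the SDK's implementation: `rmsDbfs` and `normalizeToDbfs` are hypothetical names, 0 dBFS is taken to mean an RMS of 1.0, and clipping is handled by simple clamping.

```ts
// RMS level of the samples in dBFS, with 1.0 as full scale (assumed convention).
function rmsDbfs(samples: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  const rms = Math.sqrt(sumSquares / samples.length);
  return 20 * Math.log10(rms); // -Infinity for silence, NaN for empty input
}

// Scales the samples so their RMS level lands on targetDbfs.
function normalizeToDbfs(samples: Float32Array, targetDbfs: number): Float32Array {
  if (targetDbfs > 0) throw new RangeError('targetDbfs must be <= 0');
  const current = rmsDbfs(samples);
  // Silence or empty input has no finite level, so return an unchanged copy.
  if (!Number.isFinite(current)) return samples.slice();
  const gain = Math.pow(10, (targetDbfs - current) / 20);
  // Clamp to [-1, 1]; heavy clamping would pull the result below the target.
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

For example, a signal at about -6 dBFS needs a gain of roughly 0.2 to reach -20 dBFS, since 10^(-14/20) ≈ 0.2.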
|
|
278
235
|
|
|
279
|
-
## Audio
|
|
236
|
+
## Audio tags
|
|
280
237
|
|
|
281
|
-
|
|
238
|
+
Bracket syntax `[tag]` adds expressive cues. Unsupported tags are stripped with warnings in `result.warnings`.
|
|
282
239
|
|
|
283
240
|
```ts
|
|
284
|
-
|
|
241
|
+
await generateSpeech({
|
|
285
242
|
model: 'elevenlabs/eleven_v3',
|
|
286
243
|
text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
|
|
287
244
|
voice: 'voice-id',
|
|
288
245
|
});
|
|
289
|
-
|
|
290
|
-
console.log(result.warnings); // undefined — eleven_v3 supports all tags
|
|
291
246
|
```
|
|
292
247
|
|
|
293
|
-
### Provider behavior
|
|
294
|
-
|
|
295
248
|
| Provider | Behavior |
|
|
296
249
|
|---|---|
|
|
297
|
-
| OpenAI (`gpt-4o-mini-tts`) |
|
|
298
|
-
| ElevenLabs (`eleven_v3`) |
|
|
299
|
-
| Google (`gemini-3.1-flash-tts-preview`) |
|
|
300
|
-
| Cartesia (`sonic-3`) | Emotion tags
|
|
301
|
-
| All others |
|
|
302
|
-
|
|
303
|
-
```ts
|
|
304
|
-
// OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
|
|
305
|
-
const result = await generateSpeech({
|
|
306
|
-
model: 'openai/gpt-4o-mini-tts',
|
|
307
|
-
text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
|
|
308
|
-
voice: 'alloy',
|
|
309
|
-
});
|
|
310
|
-
// Sent to OpenAI:
|
|
311
|
-
// input: "Hi John how are you? I'm feeling great"
|
|
312
|
-
// instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
|
|
313
|
-
console.log(result.warnings); // undefined
|
|
314
|
-
```
|
|
250
|
+
| OpenAI (`gpt-4o-mini-tts`) | Mapped to the `instructions` field |
|
|
251
|
+
| ElevenLabs (`eleven_v3`) | Passed through natively |
|
|
252
|
+
| Google (`gemini-3.1-flash-tts-preview`) | Passed through natively |
|
|
253
|
+
| Cartesia (`sonic-3`) | Emotion tags → SSML; `[laughter]` passed through; unknown stripped |
|
|
254
|
+
| All others | Stripped with warnings |
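The "stripped with warnings" behavior in the table can be pictured with a regex pass over the text. This is a rough sketch only: `stripUnsupportedTags`, the `StripResult` shape, and the warning wording are all hypothetical, and the SDK's real tag handling (including the per-provider mappings in the table) is not shown.

```ts
interface StripResult {
  text: string;
  warnings: string[];
}

// Removes every [tag] that is not in the supported set and records a warning for it.
function stripUnsupportedTags(text: string, supported: ReadonlySet<string>): StripResult {
  const warnings: string[] = [];
  const stripped = text.replace(/\[([^\[\]]+)\]/g, (match: string, tag: string) => {
    if (supported.has(tag.trim().toLowerCase())) return match; // keep supported tags as-is
    warnings.push(`Unsupported audio tag removed: ${match}`); // hypothetical message format
    return '';
  });
  // Collapse the whitespace left behind by removed tags.
  return { text: stripped.replace(/\s{2,}/g, ' ').trim(), warnings };
}
```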
|
|
315
255
|
|
|
316
|
-
## Voice
|
|
256
|
+
## Voice cloning
|
|
317
257
|
|
|
318
|
-
Some providers support
|
|
258
|
+
Some providers support reference-audio cloning. Pass a voice object instead of a string.
|
|
319
259
|
|
|
320
260
|
```ts
|
|
321
261
|
import { createMistral } from '@speech-sdk/core/mistral';
|
|
262
|
+
import { createFal } from '@speech-sdk/core/fal-ai';
|
|
322
263
|
|
|
323
|
-
|
|
324
|
-
|
|
325
|
-
|
|
326
|
-
const result = await generateSpeech({
|
|
327
|
-
model: mistral(),
|
|
264
|
+
// Base64 reference:
|
|
265
|
+
await generateSpeech({
|
|
266
|
+
model: createMistral()(),
|
|
328
267
|
text: 'Hello!',
|
|
329
268
|
voice: { audio: 'base64-encoded-audio...' },
|
|
330
269
|
});
|
|
331
|
-
```
|
|
332
|
-
|
|
333
|
-
Clone from a URL (fal):
|
|
334
|
-
|
|
335
|
-
```ts
|
|
336
|
-
import { createFal } from '@speech-sdk/core/fal-ai';
|
|
337
270
|
|
|
338
|
-
|
|
339
|
-
|
|
340
|
-
model:
|
|
271
|
+
// URL reference:
|
|
272
|
+
await generateSpeech({
|
|
273
|
+
model: createFal()('fal-ai/f5-tts'),
|
|
341
274
|
text: 'Hello!',
|
|
342
275
|
voice: { url: 'https://example.com/reference.wav' },
|
|
343
276
|
});
|
|
344
277
|
```
|
|
345
278
|
|
|
346
|
-
##
|
|
279
|
+
## Custom configuration
|
|
280
|
+
|
|
281
|
+
Factory functions give you custom API keys, base URLs, or `fetch` implementations:
|
|
347
282
|
|
|
348
283
|
```ts
|
|
349
|
-
generateSpeech
|
|
350
|
-
|
|
351
|
-
|
|
352
|
-
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
|
|
356
|
-
|
|
284
|
+
import { generateSpeech } from '@speech-sdk/core';
|
|
285
|
+
import { createOpenAI } from '@speech-sdk/core/openai';
|
|
286
|
+
|
|
287
|
+
const myOpenAI = createOpenAI({
|
|
288
|
+
apiKey: 'sk-...',
|
|
289
|
+
baseURL: 'https://my-proxy.com/v1',
|
|
290
|
+
});
|
|
291
|
+
|
|
292
|
+
await generateSpeech({
|
|
293
|
+
model: myOpenAI('gpt-4o-mini-tts'),
|
|
294
|
+
text: 'Hello!',
|
|
295
|
+
voice: 'alloy',
|
|
357
296
|
});
|
|
358
297
|
```
|
|
359
298
|
|
|
360
|
-
##
|
|
299
|
+
## API reference
|
|
361
300
|
|
|
362
301
|
```ts
|
|
302
|
+
generateSpeech({
|
|
303
|
+
model: string | ResolvedModel, // required
|
|
304
|
+
text: string, // required
|
|
305
|
+
voice: Voice, // required — string | { url } | { audio }
|
|
306
|
+
providerOptions?: object,
|
|
307
|
+
volumeDbfs?: number, // ≤ 0
|
|
308
|
+
timestamps?: "on" | "auto" | "off", // default "auto"
|
|
309
|
+
timestampProvider?: ResolvedSTTModel, // override the STT fallback
|
|
310
|
+
maxRetries?: number, // default 2
|
|
311
|
+
abortSignal?: AbortSignal,
|
|
312
|
+
headers?: Record<string, string>,
|
|
313
|
+
}): Promise<SpeechResult>
|
|
314
|
+
|
|
363
315
|
interface SpeechResult {
|
|
364
|
-
audio: {
|
|
365
|
-
|
|
366
|
-
|
|
367
|
-
mediaType: string; // e.g. "audio/mpeg"
|
|
368
|
-
};
|
|
316
|
+
audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
|
|
317
|
+
metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
|
|
318
|
+
timestamps?: WordTimestamp[];
|
|
369
319
|
providerMetadata?: Record<string, unknown>;
|
|
320
|
+
warnings?: string[];
|
|
370
321
|
}
|
|
322
|
+
|
|
323
|
+
interface WordTimestamp { text: string; start: number; end: number } // seconds
|
|
371
324
|
```
|
|
372
325
|
|
|
373
|
-
## Error
|
|
326
|
+
## Error handling
|
|
374
327
|
|
|
375
328
|
```ts
|
|
376
|
-
import { generateSpeech, ApiError
|
|
329
|
+
import { generateSpeech, ApiError } from '@speech-sdk/core';
|
|
377
330
|
|
|
378
331
|
try {
|
|
379
|
-
|
|
332
|
+
await generateSpeech({ /* ... */ });
|
|
380
333
|
} catch (error) {
|
|
381
334
|
if (error instanceof ApiError) {
|
|
382
|
-
|
|
383
|
-
|
|
384
|
-
|
|
335
|
+
error.statusCode; // 401, 429, 500, ...
|
|
336
|
+
error.model; // "openai/gpt-4o-mini-tts"
|
|
337
|
+
error.responseBody;
|
|
385
338
|
}
|
|
386
339
|
}
|
|
387
340
|
```
|
|
388
341
|
|
|
389
342
|
| Error | When |
|
|
390
343
|
|---|---|
|
|
391
|
-
| `ApiError` | Provider
|
|
392
|
-
| `NoSpeechGeneratedError` |
|
|
393
|
-
| `
|
|
344
|
+
| `ApiError` | Provider returned non-2xx |
|
|
345
|
+
| `NoSpeechGeneratedError` | Empty input (after tag stripping) or empty provider response |
|
|
346
|
+
| `StreamingNotSupportedError` | `streamSpeech()` on a non-streaming model |
|
|
347
|
+
| `VolumeAdjustmentUnsupportedError` | `volumeDbfs` with no decodable output mode |
|
|
348
|
+
| `TimestampKeyMissingError` | `timestamps: "on"` fallback key missing |
|
|
349
|
+
| `ConversationInputError` / `DialogueConstraintError` / `StitchUnsupportedError` | `generateConversation` validation / native caps / stitch incompatibility |
|
|
350
|
+
| `SpeechSDKError` | Base class |
|
|
394
351
|
|
|
395
|
-
|
|
396
|
-
|
|
397
|
-
Built-in retry with exponential backoff via [p-retry](https://github.com/sindresorhus/p-retry). Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
|
|
352
|
+
Retries 5xx and network errors with exponential backoff ([p-retry](https://github.com/sindresorhus/p-retry)); does not retry 4xx. Default 2 retries; override via `maxRetries`.
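The retry policy above (retry 5xx and network failures, never 4xx, exponential backoff) can be sketched without any dependency. This is not how the SDK wires up p-retry; `HttpStatusError`, `isRetryable`, `withRetries`, and the 250 ms base delay are hypothetical choices for illustration.

```ts
// Hypothetical error type carrying an HTTP status code.
class HttpStatusError extends Error {
  readonly statusCode: number;
  constructor(statusCode: number) {
    super(`HTTP ${statusCode}`);
    this.statusCode = statusCode;
  }
}

function isRetryable(error: unknown): boolean {
  // Only server-side failures are retried; 4xx means the request itself is wrong.
  if (error instanceof HttpStatusError) return error.statusCode >= 500;
  // Anything else is treated as a network-level failure in this sketch.
  return true;
}

async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 250,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      const delayMs = baseDelayMs * Math.pow(2, attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With the default `maxRetries` of 2, a task is attempted at most three times before the last error is rethrown.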
|
|
398
353
|
|
|
399
354
|
## Development
|
|
400
355
|
|
|
401
356
|
```bash
|
|
402
357
|
pnpm install
|
|
403
|
-
pnpm test
|
|
404
|
-
pnpm run test:e2e
|
|
405
|
-
pnpm run typecheck
|
|
358
|
+
pnpm test # unit tests
|
|
359
|
+
pnpm run test:e2e # e2e tests (requires provider API keys)
|
|
360
|
+
pnpm run typecheck
|
|
361
|
+
pnpm fix # format + lint
|
|
406
362
|
```
|
|
407
363
|
|
|
408
|
-
E2E tests hit real provider APIs. Set the relevant
|
|
409
|
-
|
|
410
|
-
Set `SPEECH_SDK_E2E_OUTPUT_DIR` to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
|
|
411
|
-
|
|
412
|
-
```bash
|
|
413
|
-
SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
|
|
414
|
-
```
|
|
415
|
-
|
|
416
|
-
## License
|
|
417
|
-
|
|
418
|
-
MIT
|
|
364
|
+
E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.
|