@speech-sdk/core 0.6.1 → 0.7.0

Files changed (130)
  1. package/LICENSE +202 -21
  2. package/README.md +215 -269
  3. package/dist/__tests__/e2e/_save-audio.d.ts +51 -2
  4. package/dist/__tests__/e2e/_save-audio.d.ts.map +1 -1
  5. package/dist/__tests__/e2e/_save-audio.js +139 -11
  6. package/dist/__tests__/e2e/_save-audio.js.map +1 -1
  7. package/dist/audio-utils.d.ts +2 -0
  8. package/dist/audio-utils.d.ts.map +1 -1
  9. package/dist/audio-utils.js +9 -0
  10. package/dist/audio-utils.js.map +1 -1
  11. package/dist/captions.d.ts +137 -0
  12. package/dist/captions.d.ts.map +1 -0
  13. package/dist/captions.js +283 -0
  14. package/dist/captions.js.map +1 -0
  15. package/dist/conversation/stitch.d.ts +5 -0
  16. package/dist/conversation/stitch.d.ts.map +1 -1
  17. package/dist/conversation/stitch.js +37 -0
  18. package/dist/conversation/stitch.js.map +1 -1
  19. package/dist/conversation/types.d.ts +16 -0
  20. package/dist/conversation/types.d.ts.map +1 -1
  21. package/dist/conversation/validate.d.ts.map +1 -1
  22. package/dist/conversation/validate.js +0 -6
  23. package/dist/conversation/validate.js.map +1 -1
  24. package/dist/derive-timestamps.d.ts +14 -0
  25. package/dist/derive-timestamps.d.ts.map +1 -0
  26. package/dist/derive-timestamps.js +38 -0
  27. package/dist/derive-timestamps.js.map +1 -0
  28. package/dist/errors.d.ts +25 -0
  29. package/dist/errors.d.ts.map +1 -1
  30. package/dist/errors.js +28 -0
  31. package/dist/errors.js.map +1 -1
  32. package/dist/generate-conversation.d.ts +2 -1
  33. package/dist/generate-conversation.d.ts.map +1 -1
  34. package/dist/generate-conversation.js +72 -0
  35. package/dist/generate-conversation.js.map +1 -1
  36. package/dist/generate-speech.d.ts +18 -1
  37. package/dist/generate-speech.d.ts.map +1 -1
  38. package/dist/generate-speech.js +73 -16
  39. package/dist/generate-speech.js.map +1 -1
  40. package/dist/index.d.ts +6 -2
  41. package/dist/index.d.ts.map +1 -1
  42. package/dist/index.js +2 -1
  43. package/dist/index.js.map +1 -1
  44. package/dist/logger.d.ts +2 -0
  45. package/dist/logger.d.ts.map +1 -0
  46. package/dist/logger.js +40 -0
  47. package/dist/logger.js.map +1 -0
  48. package/dist/provider-utils.d.ts +8 -0
  49. package/dist/provider-utils.d.ts.map +1 -1
  50. package/dist/provider-utils.js +16 -2
  51. package/dist/provider-utils.js.map +1 -1
  52. package/dist/providers/cartesia/alignment.d.ts +24 -0
  53. package/dist/providers/cartesia/alignment.d.ts.map +1 -0
  54. package/dist/providers/cartesia/alignment.js +23 -0
  55. package/dist/providers/cartesia/alignment.js.map +1 -0
  56. package/dist/providers/cartesia/index.d.ts +12 -2
  57. package/dist/providers/cartesia/index.d.ts.map +1 -1
  58. package/dist/providers/cartesia/index.js +137 -2
  59. package/dist/providers/cartesia/index.js.map +1 -1
  60. package/dist/providers/elevenlabs/alignment.d.ts +24 -0
  61. package/dist/providers/elevenlabs/alignment.d.ts.map +1 -0
  62. package/dist/providers/elevenlabs/alignment.js +48 -0
  63. package/dist/providers/elevenlabs/alignment.js.map +1 -0
  64. package/dist/providers/elevenlabs/index.d.ts +19 -4
  65. package/dist/providers/elevenlabs/index.d.ts.map +1 -1
  66. package/dist/providers/elevenlabs/index.js +83 -13
  67. package/dist/providers/elevenlabs/index.js.map +1 -1
  68. package/dist/providers/fal/index.d.ts +0 -25
  69. package/dist/providers/fal/index.d.ts.map +1 -1
  70. package/dist/providers/fal/index.js +3 -58
  71. package/dist/providers/fal/index.js.map +1 -1
  72. package/dist/providers/hume/alignment.d.ts +38 -0
  73. package/dist/providers/hume/alignment.d.ts.map +1 -0
  74. package/dist/providers/hume/alignment.js +31 -0
  75. package/dist/providers/hume/alignment.js.map +1 -0
  76. package/dist/providers/hume/index.d.ts +8 -1
  77. package/dist/providers/hume/index.d.ts.map +1 -1
  78. package/dist/providers/hume/index.js +75 -1
  79. package/dist/providers/hume/index.js.map +1 -1
  80. package/dist/providers/inworld/alignment.d.ts +25 -0
  81. package/dist/providers/inworld/alignment.d.ts.map +1 -0
  82. package/dist/providers/inworld/alignment.js +23 -0
  83. package/dist/providers/inworld/alignment.js.map +1 -0
  84. package/dist/providers/inworld/index.d.ts +11 -2
  85. package/dist/providers/inworld/index.d.ts.map +1 -1
  86. package/dist/providers/inworld/index.js +11 -2
  87. package/dist/providers/inworld/index.js.map +1 -1
  88. package/dist/providers/murf/alignment.d.ts +22 -0
  89. package/dist/providers/murf/alignment.d.ts.map +1 -0
  90. package/dist/providers/murf/alignment.js +17 -0
  91. package/dist/providers/murf/alignment.js.map +1 -0
  92. package/dist/providers/murf/index.d.ts +8 -1
  93. package/dist/providers/murf/index.d.ts.map +1 -1
  94. package/dist/providers/murf/index.js +10 -1
  95. package/dist/providers/murf/index.js.map +1 -1
  96. package/dist/providers/openai/index.d.ts +12 -3
  97. package/dist/providers/openai/index.d.ts.map +1 -1
  98. package/dist/providers/openai/index.js +7 -3
  99. package/dist/providers/openai/index.js.map +1 -1
  100. package/dist/providers/resemble/alignment.d.ts +32 -0
  101. package/dist/providers/resemble/alignment.d.ts.map +1 -0
  102. package/dist/providers/resemble/alignment.js +57 -0
  103. package/dist/providers/resemble/alignment.js.map +1 -0
  104. package/dist/providers/resemble/index.d.ts +7 -1
  105. package/dist/providers/resemble/index.d.ts.map +1 -1
  106. package/dist/providers/resemble/index.js +13 -1
  107. package/dist/providers/resemble/index.js.map +1 -1
  108. package/dist/resolve-provider.d.ts.map +1 -1
  109. package/dist/resolve-provider.js +3 -12
  110. package/dist/resolve-provider.js.map +1 -1
  111. package/dist/speech-provider.d.ts +48 -4
  112. package/dist/speech-provider.d.ts.map +1 -1
  113. package/dist/speech-provider.js +16 -0
  114. package/dist/speech-provider.js.map +1 -1
  115. package/dist/speech-result.d.ts +10 -0
  116. package/dist/speech-result.d.ts.map +1 -1
  117. package/dist/speech-result.js.map +1 -1
  118. package/dist/speech-to-text-provider.d.ts +40 -0
  119. package/dist/speech-to-text-provider.d.ts.map +1 -0
  120. package/dist/speech-to-text-provider.js +2 -0
  121. package/dist/speech-to-text-provider.js.map +1 -0
  122. package/dist/stt-providers/openai/index.d.ts +42 -0
  123. package/dist/stt-providers/openai/index.d.ts.map +1 -0
  124. package/dist/stt-providers/openai/index.js +184 -0
  125. package/dist/stt-providers/openai/index.js.map +1 -0
  126. package/dist/timestamps.d.ts +23 -0
  127. package/dist/timestamps.d.ts.map +1 -0
  128. package/dist/timestamps.js +2 -0
  129. package/dist/timestamps.js.map +1 -0
  130. package/package.json +6 -2
package/README.md CHANGED
@@ -4,28 +4,38 @@
4
4
  [![npm downloads](https://img.shields.io/npm/dm/@speech-sdk/core)](https://www.npmjs.com/package/@speech-sdk/core)
5
5
  [![license](https://img.shields.io/npm/l/@speech-sdk/core)](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
6
6
 
7
- The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit designed to help build text-to-speech powered applications using popular providers like OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. Cross-platform (Node.js, Edge, Browser) with minimal dependencies.
7
+ A lightweight, provider-agnostic TypeScript SDK for text-to-speech. One API, 13 providers, zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
8
8
 
9
- To learn more about the Speech SDK, check out [https://speechsdk.dev/](https://speechsdk.dev/).
9
+ <img width="1200" height="630" alt="Speech SDK" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
10
10
 
11
- <img width="1200" height="630" alt="og-3" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
11
+ Learn more at [speechsdk.dev](https://speechsdk.dev/).
12
12
 
13
+ ## Features
13
14
 
14
- ## Install
15
+ - **Universal** — `generateSpeech()` works across OpenAI, ElevenLabs, Deepgram, Cartesia, Hume, Google Gemini TTS, Fish Audio, Inworld, Murf, Resemble, fal, Mistral, and xAI.
16
+ - **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
17
+ - **Conversations** — `generateConversation()` produces multi-speaker audio, using native dialogue endpoints when available and stitching locally when not.
18
+ - **Word-level timestamps** — `timestamps: "on"` returns alignment, using the provider's native data or falling back to STT.
19
+ - **Volume normalization** — RMS-level outputs to an absolute loudness target.
20
+ - **Audio tags & voice cloning** — `[laugh]`, `[sigh]`, emotion cues; reference-audio cloning where supported.
15
21
 
16
- ```bash
17
- npm install @speech-sdk/core
18
- ```
22
+ ## Contents
19
23
 
20
- ### Using an AI Coding Assistant?
24
+ - [Install](#install) · [Quick start](#quick-start) · [Supported providers](#supported-providers)
25
+ - [Streaming](#streaming) · [Conversations](#conversations) · [Timestamps](#timestamps)
26
+ - [Volume normalization](#volume-normalization) · [Audio tags](#audio-tags) · [Voice cloning](#voice-cloning)
27
+ - [Custom configuration](#custom-configuration) · [API reference](#api-reference) · [Error handling](#error-handling) · [Development](#development)
21
28
 
22
- Add the speech-sdk skill to give your AI assistant full knowledge of this library:
29
+ ## Install
23
30
 
24
31
  ```bash
25
- npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
32
+ npm install @speech-sdk/core
26
33
  ```
27
34
 
28
- ## Quick Start
35
+ > [!TIP]
36
+ > Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
37
+
38
+ ## Quick start
29
39
 
30
40
  ```ts
31
41
  import { generateSpeech } from '@speech-sdk/core';
@@ -36,383 +46,319 @@ const result = await generateSpeech({
36
46
  voice: 'alloy',
37
47
  });
38
48
 
39
- // Access the audio
40
49
  result.audio.uint8Array; // Uint8Array
41
- result.audio.base64; // string (lazy-computed)
50
+ result.audio.base64; // string (lazy)
42
51
  result.audio.mediaType; // "audio/mpeg"
43
52
  ```
44
53
 
45
- ### Volume normalization
46
-
47
- Pass `volumeDbfs` to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
54
+ Pass a `provider/model` string, or just the provider name to use its default model. API keys are read from env vars automatically.
48
55
 
49
- ```ts
50
- const result = await generateSpeech({
51
- model: 'openai/gpt-4o-mini-tts',
52
- text: 'Hello from speech-sdk!',
53
- voice: 'alloy',
54
- volumeDbfs: -20,
55
- });
56
+ ## Supported providers
56
57
 
57
- result.audio.mediaType; // "audio/wav" re-encoded after normalization
58
- ```
58
+ | Provider | Prefix | Default model | Env var |
59
+ |---|---|---|---|
60
+ | [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` |
61
+ | [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` |
62
+ | [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` |
63
+ | [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` |
64
+ | [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` |
65
+ | [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` |
66
+ | [Google Gemini TTS](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` |
67
+ | [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` |
68
+ | [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` |
69
+ | [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` |
70
+ | [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` |
71
+ | [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` |
72
+ | [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` |
59
73
 
60
- When `volumeDbfs` is set, the SDK transparently asks the provider for its decodable PCM/WAV mode, normalizes the samples, and returns 16-bit mono WAV, so the response `mediaType` switches to `audio/wav` regardless of the provider's native default. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable output mode.
74
+ Provider-specific parameters pass through via `providerOptions` using each API's native field names.
61
75
 
62
76
  ## Streaming
63
77
 
64
- Use `streamSpeech()` instead of `generateSpeech()` to receive audio bytes incrementally as the provider produces them. The result's `audio` field is a standard `ReadableStream<Uint8Array>` that works in Node, Edge runtimes, and browsers.
78
+ `streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
65
79
 
66
80
  ```ts
67
- import { streamSpeech } from "@speech-sdk/core";
81
+ import { streamSpeech } from '@speech-sdk/core';
68
82
 
69
83
  const { audio, mediaType } = await streamSpeech({
70
- model: "openai/tts-1",
71
- text: "Hello from the speech SDK!",
72
- voice: "alloy",
73
- });
74
- ```
75
-
76
- ### Pipe to a file (Node)
77
-
78
- ```ts
79
- import { createWriteStream } from "node:fs";
80
- import { Readable } from "node:stream";
81
-
82
- const { audio } = await streamSpeech({
83
- model: "elevenlabs/eleven_flash_v2_5",
84
- text: "Hello world",
85
- voice: "JBFqnCBsd6RMkjVDRZzb",
86
- });
87
-
88
- await new Promise((resolve, reject) => {
89
- Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
84
+ model: 'cartesia/sonic-3',
85
+ text: 'Streaming straight to the client.',
86
+ voice: 'voice-id',
90
87
  });
91
- ```
92
-
93
- ### Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
94
-
95
- ```ts
96
- export async function GET() {
97
- const { audio, mediaType } = await streamSpeech({
98
- model: "cartesia/sonic-3",
99
- text: "Streaming straight to the client.",
100
- voice: "voice-id",
101
- });
102
-
103
- return new Response(audio, { headers: { "Content-Type": mediaType } });
104
- }
105
- ```
106
-
107
- ### Read chunks manually
108
88
 
109
- ```ts
110
- const reader = audio.getReader();
111
- while (true) {
112
- const { value, done } = await reader.read();
113
- if (done) break;
114
- // value is a Uint8Array of audio bytes
115
- }
116
- ```
117
-
118
- ### Capability check
119
-
120
- Check whether a model supports streaming before calling `streamSpeech()`:
121
-
122
- ```ts
123
- import { hasFeature } from "@speech-sdk/core";
124
-
125
- const model = provider.models.find((m) => m.id === "tts-1");
126
- if (hasFeature(model, "streaming")) {
127
- // safe to call streamSpeech()
128
- }
89
+ // Forward to an HTTP response:
90
+ return new Response(audio, { headers: { 'Content-Type': mediaType } });
129
91
  ```
130
92
 
131
- Calling `streamSpeech()` on a model that doesn't declare the `"streaming"` feature throws `StreamingNotSupportedError`.
132
-
133
- ### Errors and retries
134
-
135
- Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the `ReadableStream` consumer as a stream error and are not retried. Pass `maxRetries` (default `2`) and an `abortSignal` the same way as `generateSpeech()`.
93
+ > [!NOTE]
94
+ > Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
136
95
 
137
96
  ## Conversations
138
97
 
139
- `generateConversation()` produces a single multi-voice audio clip from an ordered array of turns. It picks the best path automatically:
98
+ `generateConversation()` produces a single multi-voice clip from an ordered array of turns, picking the best path automatically:
140
99
 
141
- - **Native dialogue** — when every turn shares one model and that provider has a real multi-speaker dialogue endpoint, the SDK makes a single API call and returns the provider's natural mix. Works with **ElevenLabs v3**, **Google Gemini TTS** (exactly 2 voices), **Hume Octave**, **Fish Audio S2-Pro**, and **fal Dia**.
142
- - **Stitch fallback** — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls `generateSpeech()` per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
100
+ - **Native dialogue** — one provider with a multi-speaker endpoint (ElevenLabs v3, Gemini TTS, Hume Octave, Fish Audio S2-Pro, fal Dia). One API call, natural mix.
101
+ - **Stitch fallback** — multi-provider or no dialogue endpoint. Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
143
102
 
144
103
  ```ts
145
- import { generateConversation } from "@speech-sdk/core/conversation";
104
+ import { generateConversation } from '@speech-sdk/core/conversation';
146
105
 
147
106
  const result = await generateConversation({
148
107
  turns: [
149
- { model: "openai/tts-1", voice: "nova", text: "Hi, I'm hosted by OpenAI." },
150
- { model: "elevenlabs/eleven_multilingual_v2", voice: "JBFqnCBsd6RMkjVDRZzb", text: "And I'm hosted by ElevenLabs." },
151
- { model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
152
- { model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
108
+ { model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
109
+ { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
110
+ { model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
153
111
  ],
154
112
  });
155
-
156
- result.audio.uint8Array; // Uint8Array of one combined WAV
157
- result.audio.mediaType; // "audio/wav"
158
113
  ```
159
114
 
160
- The return type is the standard `SpeechResult`, so it composes with everything else in the SDK.
115
+ Options: `gapMs` (default 300), `normalizeVolume` (default `true`), `volumeDbfs` (default `-20`), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `timestampProvider`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native).
116
+
117
+ **Native dialogue caps:**
118
+
119
+ | Provider | Models | Voice constraints |
120
+ |---|---|---|
121
+ | ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 chars |
122
+ | Google | `gemini-2.5-{flash,pro}-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** |
123
+ | Hume | `octave-1`, `octave-2` | 1–4 voices |
124
+ | Fish Audio | `s2-pro` | 1–4 voices |
125
+
126
+ ## Timestamps
161
127
 
162
- ### Conversation options
128
+ Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
163
129
 
164
130
  ```ts
165
- generateConversation({
166
- model?: string | ResolvedModel, // default model for all turns
167
- turns: ConversationTurn[], // 1..N turns; up to 4 unique voices
168
- gapMs?: number, // silence between turns (stitch path), default 300
169
- normalizeVolume?: boolean, // RMS-level the output, default true
170
- volumeDbfs?: number, // RMS target loudness in dBFS (≤0), default -20
171
- maxConcurrency?: number, // cap parallel generateSpeech calls, default 6
172
- maxRetries?: number, // per-turn retries, default 2
173
- apiKey?: string,
174
- providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
175
- abortSignal?: AbortSignal,
176
- headers?: Record<string, string>,
131
+ const result = await generateSpeech({
132
+ model: 'elevenlabs/eleven_multilingual_v2',
133
+ text: 'Hello from speech-sdk!',
134
+ voice: 'JBFqnCBsd6RMkjVDRZzb',
135
+ timestamps: 'on',
177
136
  });
178
137
 
179
- interface ConversationTurn {
180
- voice: Voice; // required
181
- text: string; // required, non-empty
182
- model?: string | ResolvedModel; // per-turn override of the top-level model
183
- providerOptions?: Record<string, unknown>,
184
- }
138
+ result.timestamps;
139
+ // [
140
+ // { text: "Hello", start: 0.00, end: 0.32 },
141
+ // { text: "from", start: 0.36, end: 0.55 },
142
+ // ...
143
+ // ]
185
144
  ```
186
145
 
187
- ### Volume normalization
146
+ | Mode | Behavior |
147
+ |---|---|
148
+ | `"auto"` *(default)* | Return timestamps only if the provider supplies them natively. Free. |
149
+ | `"on"` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
150
+ | `"off"` | Never return timestamps. |
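The three modes in the table above amount to a small decision rule. The sketch below is only an illustration of that rule as the README describes it; `pickTimestampSource` and both type names are hypothetical and are not part of the SDK's API.

```ts
// Hypothetical names for illustration; the SDK's internal logic may differ.
type TimestampMode = 'auto' | 'on' | 'off';
type TimestampSource = 'native' | 'stt-fallback' | 'none';

// Decide where word timestamps would come from, following the mode table:
// "off" never returns them, native alignment is preferred when it exists,
// and only "on" is allowed to pay for an STT transcription pass.
function pickTimestampSource(
  mode: TimestampMode,
  providerHasNativeAlignment: boolean,
): TimestampSource {
  if (mode === 'off') return 'none';
  if (providerHasNativeAlignment) return 'native';
  return mode === 'on' ? 'stt-fallback' : 'none';
}
```

Under this reading, `"auto"` never incurs extra cost: it either reuses native alignment or returns nothing.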
188
151
 
189
- `normalizeVolume: true` (the default) RMS-normalizes the output to an absolute target loudness — broadcast/podcast voice convention — so two `generateConversation` calls produce comparable levels regardless of provider mix or content. The target defaults to **−20 dBFS** (~20 dB of peak headroom), and is configurable via `volumeDbfs` (must be ≤ 0; lower is quieter).
152
+ On `"on"`, the fallback defaults to OpenAI Whisper (`openai/whisper-1`, needs `OPENAI_API_KEY`). Override by constructing a `ResolvedSTTModel` via a factory and passing it as `timestampProvider`:
190
153
 
191
154
  ```ts
192
- await generateConversation({
193
- turns: [...],
194
- volumeDbfs: -16, // a touch louder than the default
155
+ import { createOpenAISTT } from '@speech-sdk/core/stt/openai';
156
+
157
+ await generateSpeech({
158
+ model: 'cartesia/sonic-3',
159
+ text: 'Hello!',
160
+ voice: 'voice-id',
161
+ timestamps: 'on',
162
+ timestampProvider: createOpenAISTT({ apiKey: process.env.MY_WHISPER_KEY })('whisper-1'),
195
163
  });
196
164
  ```
197
165
 
198
- Normalization runs on **both paths** — stitched multi-provider conversations and single-provider native dialogue. On the native path the SDK transparently asks the provider for its decodable PCM/WAV mode (via `getStitchOptions`), levels the result, and re-encodes as 16-bit mono WAV — so the response `mediaType` becomes `audio/wav` whenever normalization runs. If a native dialogue provider can't emit decodable audio, the request still succeeds but a `warning` is appended explaining that volume normalization was skipped.
166
+ **Per-provider support:**
199
167
 
200
- Pass `normalizeVolume: false` to skip normalization entirely (zero work) and keep the raw provider audio bytes and `mediaType` untouched.
168
+ | Provider | Timestamps |
169
+ |---|---|
170
+ | ElevenLabs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2`, `eleven_flash_v2_5`) | **Native** — returned in the TTS response, free on `"auto"` |
171
+ | Murf (`GEN2`) | **Native** — `wordDurations` returned in the TTS response, free on `"auto"` (FALCON streaming model has no native alignment) |
172
+ | Hume (`octave-2`) | **Native** — word alignment from the JSON `/v0/tts` endpoint, free on `"auto"` (`octave-1` has no native alignment) |
173
+ | Inworld (`inworld-tts-1.5-max`, `inworld-tts-1.5-mini`) | **Native** — `timestampInfo.wordAlignment` returned in the TTS response, free on `"auto"` (best on English/Spanish) |
174
+ | Cartesia (`sonic-3`, `sonic-2`) | **Native** — routed through `/tts/sse` with `add_timestamps: true`; merges interleaved chunk + timestamps events into audio + `WordTimestamp[]` |
175
+ | Resemble (`default`) | **Native** — `audio_timestamps` always returned by `/synthesize`; SDK aggregates grapheme-level timing into words (mirrors ElevenLabs aggregator) |
176
+ | All others (OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI) | No native alignment; `"on"` transcribes via the STT fallback, `"auto"` returns `undefined` |
201
177
 
202
- ### Errors
178
+ `generateConversation` accepts the same options and returns a flat `WordTimestamp[]` across all turns — stitch-path timings are offset by cumulative turn duration + gap.
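To make "offset by cumulative turn duration + gap" concrete, here is a minimal sketch of how per-turn timings could be flattened onto one timeline. The `WordTimestamp` shape follows the README example (`text`, `start`, `end`, in seconds); `TurnTiming` and `flattenTurnTimestamps` are hypothetical helpers, not SDK exports, and the real stitch path may differ.

```ts
// Shape taken from the README example; times are seconds from the start of the audio.
interface WordTimestamp {
  text: string;
  start: number;
  end: number;
}

// Hypothetical per-turn input: the turn's audio duration plus its own word timings,
// each measured from the start of that turn.
interface TurnTiming {
  durationSec: number;
  words: WordTimestamp[];
}

// Shift every turn's words by the total duration of the preceding turns
// plus one gap per preceding turn, producing a single flat list.
function flattenTurnTimestamps(turns: TurnTiming[], gapMs = 300): WordTimestamp[] {
  const gapSec = gapMs / 1000;
  const flat: WordTimestamp[] = [];
  let offsetSec = 0;
  for (const turn of turns) {
    for (const word of turn.words) {
      flat.push({
        text: word.text,
        start: word.start + offsetSec,
        end: word.end + offsetSec,
      });
    }
    offsetSec += turn.durationSec + gapSec;
  }
  return flat;
}
```

The default of 300 ms mirrors the documented `gapMs` default; everything else here is an assumption.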
203
179
 
204
- Conversation-specific errors (importable from `@speech-sdk/core/conversation/errors`):
180
+ ### Captions (SRT / WebVTT)
205
181
 
206
- | Error | When |
207
- |---|---|
208
- | `ConversationInputError` | Validation failure — empty turns, blank text, more than 4 unique voices, or a turn missing a model |
209
- | `DialogueConstraintError` | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
210
- | `StitchUnsupportedError` | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
182
+ Convert word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT (required for HTML `<track>`).
211
183
 
212
- ### Native dialogue caps
184
+ ```ts
185
+ import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';
213
186
 
214
- | Provider | Native dialogue model | Voice constraints |
215
- |---|---|---|
216
- | ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 total chars |
217
- | Google | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** (API requirement) |
218
- | Hume | `octave-1`, `octave-2` | 1–4 voices |
219
- | Fish Audio | `s2-pro` | 1–4 voices |
220
- | fal | `dia-tts` | 1–2 voices |
221
-
222
- Across the SDK, conversations are capped at **4 unique voices** total regardless of provider.
223
-
224
- ## Supported Providers
225
-
226
- Use `provider/model` strings. Passing just the provider name uses its default model.
227
-
228
- | Provider | String Prefix | Default Model | Env Var | Docs |
229
- |---|---|---|---|---|
230
- | [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | [API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech) |
231
- | [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | [API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) |
232
- | [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | [API Reference](https://developers.deepgram.com/docs/tts-models) |
233
- | [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | [API Reference](https://docs.cartesia.ai/api-reference/tts/bytes) |
234
- | [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` | [API Reference](https://dev.hume.ai/reference/text-to-speech-tts/synthesize-json) |
235
- | [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` | [API Reference](https://docs.inworld.ai/tts/api-reference) |
236
- | [Google (Gemini TTS)](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | [API Reference](https://ai.google.dev/gemini-api/docs/text-generation) |
237
- | [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | [API Reference](https://docs.fish.audio/developer-guide/core-features/text-to-speech) |
238
- | [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` | [API Reference](https://murf.ai/api/docs/api-reference/text-to-speech/generate) |
239
- | [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` | [API Reference](https://docs.resemble.ai/api-reference/text-to-speech/synthesize) |
240
- | [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` | [API Reference](https://fal.ai/models) |
241
- | [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | [API Reference](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) |
242
- | [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` | [API Reference](https://docs.x.ai/docs/api-reference#text-to-speech) |
187
+ const { timestamps } = await generateSpeech({
188
+ model: 'elevenlabs/eleven_v3',
189
+ text: 'Hello world. This is a test.',
190
+ voice: 'JBFqnCBsd6RMkjVDRZzb',
191
+ timestamps: 'on',
192
+ });
243
193
 
244
- ```ts
245
- generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
246
- generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
247
- generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
248
- generateSpeech({ model: 'inworld/inworld-tts-1.5-max', text: '...', voice: 'Ashley' });
249
- generateSpeech({ model: 'openai', text: '...', voice: 'alloy' }); // uses default model
194
+ const srt = timestampsToCaptions(timestamps ?? []);
195
+ // 1
196
+ // 00:00:00,000 --> 00:00:01,200
197
+ // Hello world.
198
+ //
199
+ // 2
200
+ // 00:00:01,300 --> 00:00:02,800
201
+ // This is a test.
202
+
203
+ const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
204
+ // WEBVTT
205
+ //
206
+ // 1
207
+ // 00:00:00.000 --> 00:00:01.200
208
+ // Hello world.
209
+ //
210
+ // 2
211
+ // 00:00:01.300 --> 00:00:02.800
212
+ // This is a test.
250
213
  ```
251
214
 
252
- Provider-specific API parameters can be passed via `providerOptions`; these are sent directly to the provider's API using the API's own field names.
253
-
254
- ## Custom Configuration
215
+ Output follows the SubRip and [W3C WebVTT](https://www.w3.org/TR/webvtt1/) conventions: comma-decimal (SRT) vs period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (`&`, `<`, `>`) on the VTT path.
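The two conventions named above, comma-decimal versus period-decimal timestamps and HTML escaping on the VTT path, can be sketched as follows. `formatCueTime` and `escapeVttText` are hypothetical helper names used for illustration only; they are not claimed to match the internals of `timestampsToCaptions`.

```ts
type CaptionFormat = 'srt' | 'vtt';

// Left-pad a non-negative integer with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = '0' + text;
  return text;
}

// Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT).
// Rounds to the nearest millisecond and clamps negatives to zero.
function formatCueTime(seconds: number, format: CaptionFormat = 'srt'): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const separator = format === 'srt' ? ',' : '.';
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)}${separator}${zeroPad(ms, 3)}`;
}

// Escape the three characters the README lists for VTT cue text.
// "&" is replaced first so already-escaped output is not double-processed.
function escapeVttText(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
```

For example, `formatCueTime(1.2)` yields `00:00:01,200` and `formatCueTime(1.3, 'vtt')` yields `00:00:01.300`, matching the sample output in the README.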
255
216
 
256
- Use factory functions when you need custom API keys, base URLs, or fetch implementations:
217
+ Cues break on sentence boundaries (`.`, `!`, `?`), then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
257
218
 
258
- ```ts
259
- import { generateSpeech } from '@speech-sdk/core';
260
- import { createOpenAI } from '@speech-sdk/core/openai';
261
- import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
219
+ ## Volume normalization
262
220
 
263
- const myOpenAI = createOpenAI({
264
- apiKey: 'sk-...',
265
- baseURL: 'https://my-proxy.com/v1',
266
- });
221
+ Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; `-20` is the broadcast/podcast convention).
267
222
 
223
+ ```ts
268
224
  const result = await generateSpeech({
269
- model: myOpenAI('gpt-4o-mini-tts'),
225
+ model: 'openai/gpt-4o-mini-tts',
270
226
  text: 'Hello!',
271
227
  voice: 'alloy',
228
+ volumeDbfs: -20,
272
229
  });
273
- ```
274
230
 
275
- ### API Key Resolution
231
+ result.audio.mediaType; // "audio/wav" re-encoded after normalization
232
+ ```
276
233
 
277
- When using string models (e.g., `'openai/tts-1'`), API keys are resolved from environment variables (see table above). Factory functions accept an explicit `apiKey` option which takes precedence.
234
+ `generateConversation` normalizes by default. Pass `normalizeVolume: false` to skip. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable PCM/WAV mode.
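As a rough illustration of what RMS normalization to a dBFS target involves, the sketch below scales float samples so their RMS level lands on the target. It assumes samples in the range [-1, 1] with full scale at amplitude 1.0; `rms` and `normalizeToDbfs` are hypothetical names, and the SDK's actual pipeline (decoding, 16-bit WAV re-encoding, peak handling) is not shown and may differ.

```ts
// Root-mean-square level of float samples in [-1, 1]; 0 for an empty buffer.
function rms(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sumOfSquares = samples.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumOfSquares / samples.length);
}

// Scale samples so their RMS matches targetDbfs (assumption: 0 dBFS == RMS of 1.0).
// The target must be <= 0, as the README requires for `volumeDbfs`.
function normalizeToDbfs(samples: number[], targetDbfs = -20): number[] {
  if (targetDbfs > 0) {
    throw new RangeError('targetDbfs must be <= 0');
  }
  const currentRms = rms(samples);
  if (currentRms === 0) return samples.slice(); // pure silence: nothing to scale
  const targetRms = Math.pow(10, targetDbfs / 20); // dB to linear amplitude
  const gain = targetRms / currentRms;
  // Hard-clamp to [-1, 1] as a crude guard against clipping on peaks.
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

At `-20` dBFS the target RMS is 0.1 of full scale, which is where the roughly 20 dB of peak headroom comes from.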
278
235
 
279
- ## Audio Tags
236
+ ## Audio tags
280
237
 
281
- Use bracket syntax `[tag]` to add expressive audio cues like laughter, sighs, or emotions. Provider support varies — unsupported tags are automatically stripped with warnings returned in `result.warnings`.
238
+ Bracket syntax `[tag]` adds expressive cues. Unsupported tags are stripped with warnings in `result.warnings`.
282
239
 
283
240
  ```ts
284
- const result = await generateSpeech({
241
+ await generateSpeech({
285
242
  model: 'elevenlabs/eleven_v3',
286
243
  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
287
244
  voice: 'voice-id',
288
245
  });
289
-
290
- console.log(result.warnings); // undefined — eleven_v3 supports all tags
291
246
  ```
292
247
 
293
- ### Provider behavior
294
-
295
248
  | Provider | Behavior |
296
249
  |---|---|
297
- | OpenAI (`gpt-4o-mini-tts`) | Tags mapped to the `instructions` field for expressive delivery control |
298
- | ElevenLabs (`eleven_v3`) | All `[tag]` passed through natively |
299
- | Google (`gemini-3.1-flash-tts-preview`) | All `[tag]` passed through natively (e.g. `[whispers]`, `[shouting]`, `[sighs]`, `[laugh]`) |
300
- | Cartesia (`sonic-3`) | Emotion tags (`[happy]`, `[sad]`, `[angry]`, etc.) converted to SSML; `[laughter]` passed through; unknown tags stripped |
301
- | All others | Tags stripped and warnings returned |
302
-
303
- ```ts
304
- // OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
305
- const result = await generateSpeech({
306
- model: 'openai/gpt-4o-mini-tts',
307
- text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
308
- voice: 'alloy',
309
- });
310
- // Sent to OpenAI:
311
- // input: "Hi John how are you? I'm feeling great"
312
- // instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
313
- console.log(result.warnings); // undefined
314
- ```
250
+ | OpenAI (`gpt-4o-mini-tts`) | Mapped to the `instructions` field |
251
+ | ElevenLabs (`eleven_v3`) | Passed through natively |
252
+ | Google (`gemini-3.1-flash-tts-preview`) | Passed through natively |
253
+ | Cartesia (`sonic-3`) | Emotion tags converted to SSML; `[laughter]` passed through; unknown tags stripped |
254
+ | All others | Stripped with warnings |
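
The "stripped with warnings" behavior in the last row can be pictured with a short sketch. The helper `stripUnsupportedTags`, its `supported` set, and the warning wording are all hypothetical; the SDK's real tag handling and warning strings may differ.

```ts
// Hypothetical helper for illustration only; not the SDK's implementation.
interface StripResult {
  text: string;
  warnings: string[];
}

// Remove bracketed tags that are not in the supported set and record a warning for each.
function stripUnsupportedTags(text: string, supported: ReadonlySet<string>): StripResult {
  const warnings: string[] = [];
  const stripped = text.replace(/\[([^\[\]]+)\]/g, (match: string, tag: string) => {
    if (supported.has(tag.toLowerCase())) return match; // keep supported tags as written
    warnings.push(`Unsupported audio tag removed: ${match}`); // assumed wording
    return "";
  });
  // Collapse the whitespace left behind by removed tags.
  return { text: stripped.replace(/\s{2,}/g, " ").trim(), warnings };
}

// A provider with no tag support strips everything:
const plain = stripUnsupportedTags(
  "[laugh] Oh that is so funny! [sigh] But seriously though.",
  new Set<string>(),
);
// plain.text is "Oh that is so funny! But seriously though." and plain.warnings has two entries.
```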
315
255
 
316
- ## Voice Cloning
256
+ ## Voice cloning
317
257
 
318
- Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
258
+ Some providers support reference-audio cloning. Pass a voice object instead of a string.
319
259
 
320
260
  ```ts
321
261
  import { createMistral } from '@speech-sdk/core/mistral';
262
+ import { createFal } from '@speech-sdk/core/fal-ai';
322
263
 
323
- const mistral = createMistral();
324
-
325
- // Clone from base64 audio
326
- const result = await generateSpeech({
327
- model: mistral(),
264
+ // Base64 reference:
265
+ await generateSpeech({
266
+ model: createMistral()(),
328
267
  text: 'Hello!',
329
268
  voice: { audio: 'base64-encoded-audio...' },
330
269
  });
331
- ```
332
-
333
- Clone from a URL (fal):
334
-
335
- ```ts
336
- import { createFal } from '@speech-sdk/core/fal-ai';
337
270
 
338
- const fal = createFal();
339
- const result = await generateSpeech({
340
- model: fal('fal-ai/chatterbox'),
271
+ // URL reference:
272
+ await generateSpeech({
273
+ model: createFal()('fal-ai/f5-tts'),
341
274
  text: 'Hello!',
342
275
  voice: { url: 'https://example.com/reference.wav' },
343
276
  });
344
277
  ```
345
278
 
346
- ## Options
279
+ ## Custom configuration
280
+
281
+ Factory functions give you custom API keys, base URLs, or `fetch` implementations:
347
282
 
348
283
  ```ts
349
- generateSpeech({
350
- model: string | ResolvedModel, // required
351
- text: string, // required
352
- voice: Voice, // required
353
- providerOptions?: object, // provider-specific API params
354
- maxRetries?: number, // default: 2 (retries on 5xx/network errors)
355
- abortSignal?: AbortSignal, // cancel the request
356
- headers?: Record<string, string>, // additional HTTP headers
284
+ import { generateSpeech } from '@speech-sdk/core';
285
+ import { createOpenAI } from '@speech-sdk/core/openai';
286
+
287
+ const myOpenAI = createOpenAI({
288
+ apiKey: 'sk-...',
289
+ baseURL: 'https://my-proxy.com/v1',
290
+ });
291
+
292
+ await generateSpeech({
293
+ model: myOpenAI('gpt-4o-mini-tts'),
294
+ text: 'Hello!',
295
+ voice: 'alloy',
357
296
  });
358
297
  ```
359
298
 
360
- ## Result
299
+ ## API reference
361
300
 
362
301
  ```ts
302
+ generateSpeech({
303
+ model: string | ResolvedModel, // required
304
+ text: string, // required
305
+ voice: Voice, // required — string | { url } | { audio }
306
+ providerOptions?: object,
307
+ volumeDbfs?: number, // ≤ 0
308
+ timestamps?: "on" | "auto" | "off", // default "auto"
309
+ timestampProvider?: ResolvedSTTModel, // override the STT fallback
310
+ maxRetries?: number, // default 2
311
+ abortSignal?: AbortSignal,
312
+ headers?: Record<string, string>,
313
+ }): Promise<SpeechResult>
314
+
363
315
  interface SpeechResult {
364
- audio: {
365
- uint8Array: Uint8Array; // raw audio bytes
366
- base64: string; // base64 encoded (lazy)
367
- mediaType: string; // e.g. "audio/mpeg"
368
- };
316
+ audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
317
+ metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
318
+ timestamps?: WordTimestamp[];
369
319
  providerMetadata?: Record<string, unknown>;
320
+ warnings?: string[];
370
321
  }
322
+
323
+ interface WordTimestamp { text: string; start: number; end: number } // seconds
371
324
  ```
372
325
 
373
- ## Error Handling
326
+ ## Error handling
374
327
 
375
328
  ```ts
376
- import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';
329
+ import { generateSpeech, ApiError } from '@speech-sdk/core';
377
330
 
378
331
  try {
379
- const result = await generateSpeech({ ... });
332
+ await generateSpeech({ /* ... */ });
380
333
  } catch (error) {
381
334
  if (error instanceof ApiError) {
382
- console.log(error.statusCode); // 401
383
- console.log(error.model); // "openai/gpt-4o-mini-tts"
384
- console.log(error.responseBody);
335
+ error.statusCode; // 401, 429, 500, ...
336
+ error.model; // "openai/gpt-4o-mini-tts"
337
+ error.responseBody;
385
338
  }
386
339
  }
387
340
  ```
388
341
 
389
342
  | Error | When |
390
343
  |---|---|
391
- | `ApiError` | Provider API returns a non-2xx response |
392
- | `NoSpeechGeneratedError` | Provider returned empty audio |
393
- | `SpeechSDKError` | Base class for all errors |
344
+ | `ApiError` | Provider returned non-2xx |
345
+ | `NoSpeechGeneratedError` | Empty input (after tag stripping) or empty provider response |
346
+ | `StreamingNotSupportedError` | `streamSpeech()` on a non-streaming model |
347
+ | `VolumeAdjustmentUnsupportedError` | `volumeDbfs` with no decodable output mode |
348
+ | `TimestampKeyMissingError` | `timestamps: "on"` fallback key missing |
349
+ | `ConversationInputError` / `DialogueConstraintError` / `StitchUnsupportedError` | `generateConversation` validation / native caps / stitch incompatibility |
350
+ | `SpeechSDKError` | Base class |
394
351
 
395
- ## Retry
396
-
397
- Built-in retry with exponential backoff via [p-retry](https://github.com/sindresorhus/p-retry). Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
352
+ Retries 5xx and network errors with exponential backoff ([p-retry](https://github.com/sindresorhus/p-retry)); does not retry 4xx. Default 2 retries; override via `maxRetries`.
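
The retry policy stated above (retry 5xx and network failures, never 4xx, back off exponentially) can be sketched without any dependency. This is not how the SDK wires up p-retry; `withRetry`, `isRetryable`, `backoffDelayMs`, and the 100 ms base delay are assumptions chosen for illustration.

```ts
// Hypothetical helpers for illustration only; the SDK itself delegates to p-retry.

// Read a numeric `statusCode` off an unknown error, if it has one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "statusCode" in error) {
    const code = (error as { statusCode: unknown }).statusCode;
    return typeof code === "number" ? code : undefined;
  }
  return undefined;
}

// 5xx responses and errors without a status (treated as network failures) are retryable.
function isRetryable(error: unknown): boolean {
  const status = statusOf(error);
  return status === undefined || status >= 500;
}

// Exponential backoff: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Run fn, retrying up to maxRetries times on retryable errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      const delay = backoffDelayMs(attempt, baseDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

With the default of 2 retries, a call is attempted at most three times before the last error is rethrown.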
398
353
 
399
354
  ## Development
400
355
 
401
356
  ```bash
402
357
  pnpm install
403
- pnpm test # unit tests
404
- pnpm run test:e2e # e2e tests (requires API keys)
405
- pnpm run typecheck # type-check without emitting
358
+ pnpm test # unit tests
359
+ pnpm run test:e2e # e2e tests (requires provider API keys)
360
+ pnpm run typecheck
361
+ pnpm fix # format + lint
406
362
  ```
407
363
 
408
- E2E tests hit real provider APIs. Set the relevant API key environment variables in a `.env` file or export them in your shell.
409
-
410
- Set `SPEECH_SDK_E2E_OUTPUT_DIR` to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
411
-
412
- ```bash
413
- SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
414
- ```
415
-
416
- ## License
417
-
418
- MIT
364
+ E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.