@speech-sdk/core 0.6.2 → 0.8.0-alpha

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (202)
  1. package/LICENSE +202 -21
  2. package/README.md +267 -264
  3. package/dist/__tests__/e2e/_save-audio.d.ts +5 -24
  4. package/dist/__tests__/e2e/_save-audio.d.ts.map +1 -1
  5. package/dist/__tests__/e2e/_save-audio.js +19 -42
  6. package/dist/__tests__/e2e/_save-audio.js.map +1 -1
  7. package/dist/audio-duration.d.ts +0 -5
  8. package/dist/audio-duration.d.ts.map +1 -1
  9. package/dist/audio-duration.js +3 -10
  10. package/dist/audio-duration.js.map +1 -1
  11. package/dist/audio-utils.d.ts +1 -9
  12. package/dist/audio-utils.d.ts.map +1 -1
  13. package/dist/audio-utils.js +10 -13
  14. package/dist/audio-utils.js.map +1 -1
  15. package/dist/captions.d.ts +29 -0
  16. package/dist/captions.d.ts.map +1 -0
  17. package/dist/captions.js +193 -0
  18. package/dist/captions.js.map +1 -0
  19. package/dist/conversation/attribute-timestamps.d.ts +26 -0
  20. package/dist/conversation/attribute-timestamps.d.ts.map +1 -0
  21. package/dist/conversation/attribute-timestamps.js +276 -0
  22. package/dist/conversation/attribute-timestamps.js.map +1 -0
  23. package/dist/conversation/dispatch.d.ts +5 -5
  24. package/dist/conversation/dispatch.d.ts.map +1 -1
  25. package/dist/conversation/dispatch.js +18 -8
  26. package/dist/conversation/dispatch.js.map +1 -1
  27. package/dist/conversation/errors.d.ts +3 -0
  28. package/dist/conversation/errors.d.ts.map +1 -1
  29. package/dist/conversation/errors.js +6 -0
  30. package/dist/conversation/errors.js.map +1 -1
  31. package/dist/conversation/pcm-concat.d.ts +0 -23
  32. package/dist/conversation/pcm-concat.d.ts.map +1 -1
  33. package/dist/conversation/pcm-concat.js +5 -43
  34. package/dist/conversation/pcm-concat.js.map +1 -1
  35. package/dist/conversation/proportional-fill.d.ts +10 -0
  36. package/dist/conversation/proportional-fill.d.ts.map +1 -0
  37. package/dist/conversation/proportional-fill.js +64 -0
  38. package/dist/conversation/proportional-fill.js.map +1 -0
  39. package/dist/conversation/silence-detection.d.ts +14 -0
  40. package/dist/conversation/silence-detection.d.ts.map +1 -0
  41. package/dist/conversation/silence-detection.js +52 -0
  42. package/dist/conversation/silence-detection.js.map +1 -0
  43. package/dist/conversation/stitch.d.ts +3 -1
  44. package/dist/conversation/stitch.d.ts.map +1 -1
  45. package/dist/conversation/stitch.js +54 -13
  46. package/dist/conversation/stitch.js.map +1 -1
  47. package/dist/conversation/types.d.ts +1 -19
  48. package/dist/conversation/types.d.ts.map +1 -1
  49. package/dist/conversation/validate.d.ts +1 -16
  50. package/dist/conversation/validate.d.ts.map +1 -1
  51. package/dist/conversation/validate.js +29 -29
  52. package/dist/conversation/validate.js.map +1 -1
  53. package/dist/default-stt-fallback.d.ts +3 -0
  54. package/dist/default-stt-fallback.d.ts.map +1 -0
  55. package/dist/default-stt-fallback.js +11 -0
  56. package/dist/default-stt-fallback.js.map +1 -0
  57. package/dist/derive-timestamps.d.ts +10 -0
  58. package/dist/derive-timestamps.d.ts.map +1 -0
  59. package/dist/derive-timestamps.js +24 -0
  60. package/dist/derive-timestamps.js.map +1 -0
  61. package/dist/errors.d.ts +20 -2
  62. package/dist/errors.d.ts.map +1 -1
  63. package/dist/errors.js +28 -2
  64. package/dist/errors.js.map +1 -1
  65. package/dist/generate-conversation.d.ts +5 -4
  66. package/dist/generate-conversation.d.ts.map +1 -1
  67. package/dist/generate-conversation.js +191 -38
  68. package/dist/generate-conversation.js.map +1 -1
  69. package/dist/generate-speech.d.ts +2 -10
  70. package/dist/generate-speech.d.ts.map +1 -1
  71. package/dist/generate-speech.js +111 -33
  72. package/dist/generate-speech.js.map +1 -1
  73. package/dist/index.d.ts +5 -8
  74. package/dist/index.d.ts.map +1 -1
  75. package/dist/index.js +6 -4
  76. package/dist/index.js.map +1 -1
  77. package/dist/logger.d.ts +2 -0
  78. package/dist/logger.d.ts.map +1 -0
  79. package/dist/logger.js +29 -0
  80. package/dist/logger.js.map +1 -0
  81. package/dist/metadata.d.ts +0 -22
  82. package/dist/metadata.d.ts.map +1 -1
  83. package/dist/provider-utils.d.ts +3 -1
  84. package/dist/provider-utils.d.ts.map +1 -1
  85. package/dist/provider-utils.js +36 -39
  86. package/dist/provider-utils.js.map +1 -1
  87. package/dist/providers/cartesia/alignment.d.ts +8 -0
  88. package/dist/providers/cartesia/alignment.d.ts.map +1 -0
  89. package/dist/providers/cartesia/alignment.js +18 -0
  90. package/dist/providers/cartesia/alignment.js.map +1 -0
  91. package/dist/providers/cartesia/index.d.ts +11 -13
  92. package/dist/providers/cartesia/index.d.ts.map +1 -1
  93. package/dist/providers/cartesia/index.js +184 -61
  94. package/dist/providers/cartesia/index.js.map +1 -1
  95. package/dist/providers/deepgram/index.d.ts +7 -8
  96. package/dist/providers/deepgram/index.d.ts.map +1 -1
  97. package/dist/providers/deepgram/index.js +17 -18
  98. package/dist/providers/deepgram/index.js.map +1 -1
  99. package/dist/providers/elevenlabs/alignment.d.ts +10 -0
  100. package/dist/providers/elevenlabs/alignment.d.ts.map +1 -0
  101. package/dist/providers/elevenlabs/alignment.js +47 -0
  102. package/dist/providers/elevenlabs/alignment.js.map +1 -0
  103. package/dist/providers/elevenlabs/index.d.ts +10 -26
  104. package/dist/providers/elevenlabs/index.d.ts.map +1 -1
  105. package/dist/providers/elevenlabs/index.js +216 -154
  106. package/dist/providers/elevenlabs/index.js.map +1 -1
  107. package/dist/providers/fal/index.d.ts +7 -43
  108. package/dist/providers/fal/index.d.ts.map +1 -1
  109. package/dist/providers/fal/index.js +37 -86
  110. package/dist/providers/fal/index.js.map +1 -1
  111. package/dist/providers/fish-audio/index.d.ts +7 -8
  112. package/dist/providers/fish-audio/index.d.ts.map +1 -1
  113. package/dist/providers/fish-audio/index.js +23 -19
  114. package/dist/providers/fish-audio/index.js.map +1 -1
  115. package/dist/providers/gateway/index.d.ts +68 -0
  116. package/dist/providers/gateway/index.d.ts.map +1 -0
  117. package/dist/providers/gateway/index.js +236 -0
  118. package/dist/providers/gateway/index.js.map +1 -0
  119. package/dist/providers/google/index.d.ts +7 -20
  120. package/dist/providers/google/index.d.ts.map +1 -1
  121. package/dist/providers/google/index.js +161 -151
  122. package/dist/providers/google/index.js.map +1 -1
  123. package/dist/providers/hume/alignment.d.ts +33 -0
  124. package/dist/providers/hume/alignment.d.ts.map +1 -0
  125. package/dist/providers/hume/alignment.js +37 -0
  126. package/dist/providers/hume/alignment.js.map +1 -0
  127. package/dist/providers/hume/index.d.ts +11 -13
  128. package/dist/providers/hume/index.d.ts.map +1 -1
  129. package/dist/providers/hume/index.js +105 -41
  130. package/dist/providers/hume/index.js.map +1 -1
  131. package/dist/providers/inworld/alignment.d.ts +11 -0
  132. package/dist/providers/inworld/alignment.d.ts.map +1 -0
  133. package/dist/providers/inworld/alignment.js +24 -0
  134. package/dist/providers/inworld/alignment.js.map +1 -0
  135. package/dist/providers/inworld/index.d.ts +10 -14
  136. package/dist/providers/inworld/index.d.ts.map +1 -1
  137. package/dist/providers/inworld/index.js +55 -38
  138. package/dist/providers/inworld/index.js.map +1 -1
  139. package/dist/providers/mistral/index.d.ts +7 -8
  140. package/dist/providers/mistral/index.d.ts.map +1 -1
  141. package/dist/providers/mistral/index.js +39 -38
  142. package/dist/providers/mistral/index.js.map +1 -1
  143. package/dist/providers/murf/alignment.d.ts +13 -0
  144. package/dist/providers/murf/alignment.d.ts.map +1 -0
  145. package/dist/providers/murf/alignment.js +22 -0
  146. package/dist/providers/murf/alignment.js.map +1 -0
  147. package/dist/providers/murf/index.d.ts +11 -13
  148. package/dist/providers/murf/index.d.ts.map +1 -1
  149. package/dist/providers/murf/index.js +73 -56
  150. package/dist/providers/murf/index.js.map +1 -1
  151. package/dist/providers/openai/index.d.ts +36 -20
  152. package/dist/providers/openai/index.d.ts.map +1 -1
  153. package/dist/providers/openai/index.js +270 -102
  154. package/dist/providers/openai/index.js.map +1 -1
  155. package/dist/providers/resemble/alignment.d.ts +11 -0
  156. package/dist/providers/resemble/alignment.d.ts.map +1 -0
  157. package/dist/providers/resemble/alignment.js +54 -0
  158. package/dist/providers/resemble/alignment.js.map +1 -0
  159. package/dist/providers/resemble/index.d.ts +10 -8
  160. package/dist/providers/resemble/index.d.ts.map +1 -1
  161. package/dist/providers/resemble/index.js +58 -40
  162. package/dist/providers/resemble/index.js.map +1 -1
  163. package/dist/providers/xai/index.d.ts +7 -9
  164. package/dist/providers/xai/index.d.ts.map +1 -1
  165. package/dist/providers/xai/index.js +37 -40
  166. package/dist/providers/xai/index.js.map +1 -1
  167. package/dist/providers.d.ts +29 -0
  168. package/dist/providers.d.ts.map +1 -0
  169. package/dist/providers.js +15 -0
  170. package/dist/providers.js.map +1 -0
  171. package/dist/resolve-provider.d.ts.map +1 -1
  172. package/dist/resolve-provider.js +7 -59
  173. package/dist/resolve-provider.js.map +1 -1
  174. package/dist/speech-provider.d.ts +19 -15
  175. package/dist/speech-provider.d.ts.map +1 -1
  176. package/dist/speech-provider.js +9 -14
  177. package/dist/speech-provider.js.map +1 -1
  178. package/dist/speech-result.d.ts +5 -0
  179. package/dist/speech-result.d.ts.map +1 -1
  180. package/dist/speech-result.js.map +1 -1
  181. package/dist/speech-to-text-provider.d.ts +28 -0
  182. package/dist/speech-to-text-provider.d.ts.map +1 -0
  183. package/dist/speech-to-text-provider.js +2 -0
  184. package/dist/speech-to-text-provider.js.map +1 -0
  185. package/dist/stream-speech.d.ts.map +1 -1
  186. package/dist/stream-speech.js +2 -3
  187. package/dist/stream-speech.js.map +1 -1
  188. package/dist/timestamps.d.ts +9 -0
  189. package/dist/timestamps.d.ts.map +1 -0
  190. package/dist/timestamps.js +2 -0
  191. package/dist/timestamps.js.map +1 -0
  192. package/dist/turns.d.ts +9 -0
  193. package/dist/turns.d.ts.map +1 -0
  194. package/dist/turns.js +21 -0
  195. package/dist/turns.js.map +1 -0
  196. package/dist/types.d.ts +25 -0
  197. package/dist/types.d.ts.map +1 -1
  198. package/dist/volume-adjust.d.ts +0 -6
  199. package/dist/volume-adjust.d.ts.map +1 -1
  200. package/dist/volume-adjust.js +0 -6
  201. package/dist/volume-adjust.js.map +1 -1
  202. package/package.json +12 -63
package/README.md CHANGED
@@ -1,31 +1,48 @@
1
+ <div align="center">
2
+
3
+ <img src="https://github.com/user-attachments/assets/42d9b528-e507-4162-8120-338bb0c92650" alt="Speech SDK" width="140" />
4
+
1
5
  # Speech SDK
2
6
 
3
- [![npm version](https://img.shields.io/npm/v/@speech-sdk/core)](https://www.npmjs.com/package/@speech-sdk/core)
4
- [![npm downloads](https://img.shields.io/npm/dm/@speech-sdk/core)](https://www.npmjs.com/package/@speech-sdk/core)
5
- [![license](https://img.shields.io/npm/l/@speech-sdk/core)](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
7
+ **Text-to-speech across 13 providers, one API.**
6
8
 
7
- The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit designed to help build text-to-speech powered applications using popular providers like OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. Cross-platform (Node.js, Edge, Browser) with minimal dependencies.
9
+ A lightweight, provider-agnostic TypeScript SDK. Zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
8
10
 
9
- To learn more about the Speech SDK, check out [https://speechsdk.dev/](https://speechsdk.dev/).
11
+ [![npm version](https://img.shields.io/npm/v/@speech-sdk/core?style=flat-square)](https://www.npmjs.com/package/@speech-sdk/core)
12
+ [![npm downloads](https://img.shields.io/npm/dm/@speech-sdk/core?style=flat-square)](https://www.npmjs.com/package/@speech-sdk/core)
13
+ [![license](https://img.shields.io/npm/l/@speech-sdk/core?style=flat-square)](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
14
+ [![Discord](https://img.shields.io/badge/Discord-Join-5865F2?style=flat-square&logo=discord&logoColor=white)](https://discord.gg/xcTQMU3nCV)
15
+ [![Stars](https://img.shields.io/github/stars/Jellypod-Inc/speech-sdk?style=flat-square&logo=github&label=stars)](https://github.com/Jellypod-Inc/speech-sdk/stargazers)
10
16
 
11
- <img width="1200" height="630" alt="og-3" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
17
+ **[Quick start](#quick-start)** · **[Providers](#supported-providers)** · **[Streaming](#streaming)** · **[Multi-Speaker Conversations](#conversations)** · **[Timestamps](#timestamps)**
12
18
 
19
+ </div>
13
20
 
14
- ## Install
21
+ <br />
15
22
 
16
- ```bash
17
- npm install @speech-sdk/core
18
- ```
23
+ <img width="1200" height="630" alt="Speech SDK" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
24
+
25
+ Learn more at [speechsdk.dev](https://speechsdk.dev/).
19
26
 
20
- ### Using an AI Coding Assistant?
27
+ ## Features
21
28
 
22
- Add the speech-sdk skill to give your AI assistant full knowledge of this library:
29
+ - **Universal**: one `generateSpeech()` call across every supported provider.
30
+ - **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
31
+ - **Conversations** — `generateConversation()` produces multi-speaker audio, picking a gateway, native-dialogue, or local-stitch path automatically.
32
+ - **Word-level timestamps** — `timestamps: true` returns alignment, using the provider's native data or falling back to STT.
33
+ - **Volume normalization** — RMS-level outputs to an absolute loudness target.
34
+ - **Audio tags & voice cloning** — bracket cues like `[laugh]` and reference-audio cloning where supported.
35
+
36
+ ## Install
23
37
 
24
38
  ```bash
25
- npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
39
+ npm install @speech-sdk/core
26
40
  ```
27
41
 
28
- ## Quick Start
42
+ > [!TIP]
43
+ > Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
44
+
45
+ ## Quick start
29
46
 
30
47
  ```ts
31
48
  import { generateSpeech } from '@speech-sdk/core';
@@ -36,383 +53,369 @@ const result = await generateSpeech({
36
53
  voice: 'alloy',
37
54
  });
38
55
 
39
- // Access the audio
40
56
  result.audio.uint8Array; // Uint8Array
41
- result.audio.base64; // string (lazy-computed)
57
+ result.audio.base64; // string (lazy)
42
58
  result.audio.mediaType; // "audio/mpeg"
43
59
  ```
44
60
 
45
- ### Volume normalization
61
+ Pass a `provider/model` string, or just the provider name to use its default model. The string above is enough to get going — set one env var and you're done.
46
62
 
47
- Pass `volumeDbfs` to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
63
+ ## Gateway vs direct provider
48
64
 
49
- ```ts
50
- const result = await generateSpeech({
51
- model: 'openai/gpt-4o-mini-tts',
52
- text: 'Hello from speech-sdk!',
53
- voice: 'alloy',
54
- volumeDbfs: -20,
55
- });
65
+ The SDK has two ways to reach a provider, and the choice is made by **how you pass `model`**:
56
66
 
57
- result.audio.mediaType; // "audio/wav" — re-encoded after normalization
67
+ ```ts
68
+ // 1. String → routes through Speech Gateway (https://api.speechgateway.com)
69
+ // Needs SPEECH_GATEWAY_API_KEY (sign up at https://speechgateway.com).
70
+ await generateSpeech({ model: 'openai/gpt-4o-mini-tts', text: '...', voice: 'alloy' });
71
+
72
+ // 2. Factory → calls the provider directly (no proxy hop)
73
+ // Reads the provider's env var (e.g. OPENAI_API_KEY), or pass apiKey to the factory.
74
+ import { createOpenAI } from '@speech-sdk/core/providers';
75
+ await generateSpeech({ model: createOpenAI()('gpt-4o-mini-tts'), text: '...', voice: 'alloy' });
58
76
  ```
59
77
 
60
- When `volumeDbfs` is set the SDK transparently asks the provider for its decodable PCM/WAV mode, normalizes the samples, and returns 16-bit mono WAV — so the response `mediaType` switches to `audio/wav` regardless of the provider's native default. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable output mode.
78
+ | | Speech Gateway (string) | Direct provider (factory) |
79
+ |---|---|---|
80
+ | When to use | You want a single endpoint and easy provider swaps | You already have provider keys, want zero-hop latency, or need provider features the gateway hasn't surfaced |
81
+ | Setup | `SPEECH_GATEWAY_API_KEY` only | One env var per provider you use |
82
+ | Key resolution | `apiKey` option → `SPEECH_GATEWAY_API_KEY` | `createX({ apiKey })` → `<PROVIDER>_API_KEY` |
83
+ | Endpoint | `api.speechgateway.com` | Provider's own API |
84
+
85
+ The gateway also accepts `createSpeechGateway({ apiKey, baseURL })` if you want to construct it explicitly (e.g. for a custom proxy URL).
86
+
87
+ ## Supported providers
88
+
89
+ | Provider | Prefix | Env var |
90
+ |---|---|---|
91
+ | [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `OPENAI_API_KEY` |
92
+ | [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `ELEVENLABS_API_KEY` |
93
+ | [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `DEEPGRAM_API_KEY` |
94
+ | [Cartesia](https://docs.cartesia.ai) | `cartesia` | `CARTESIA_API_KEY` |
95
+ | [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `HUME_API_KEY` |
96
+ | [Inworld](https://docs.inworld.ai/tts) | `inworld` | `INWORLD_API_KEY` |
97
+ | [Google Gemini TTS](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `GOOGLE_API_KEY` |
98
+ | [Fish Audio](https://docs.fish.audio) | `fish-audio` | `FISH_AUDIO_API_KEY` |
99
+ | [Murf](https://murf.ai/api/docs) | `murf` | `MURF_API_KEY` |
100
+ | [Resemble](https://docs.resemble.ai) | `resemble` | `RESEMBLE_API_KEY` |
101
+ | [fal](https://fal.ai/models) | `fal-ai` | `FAL_API_KEY` |
102
+ | [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `MISTRAL_API_KEY` |
103
+ | [xAI](https://docs.x.ai/docs/models) | `xai` | `XAI_API_KEY` |
104
+
105
+ The env var applies when you call the provider directly via its factory. Pass a string `model` like `"openai/tts-1"` to route through Speech Gateway instead, which reads `SPEECH_GATEWAY_API_KEY` — see [Gateway vs direct provider](#gateway-vs-direct-provider). Most providers ship a default model (`createOpenAI()()`); a few (e.g. fal) require an explicit model id. See the linked docs for each provider's full model list.
106
+
107
+ Provider-specific parameters pass through via `providerOptions` using each API's native field names.
61
108
 
62
109
  ## Streaming
63
110
 
64
- Use `streamSpeech()` instead of `generateSpeech()` to receive audio bytes incrementally as the provider produces them. The result's `audio` field is a standard `ReadableStream<Uint8Array>` that works in Node, Edge runtimes, and browsers.
111
+ `streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
65
112
 
66
113
  ```ts
67
- import { streamSpeech } from "@speech-sdk/core";
114
+ import { streamSpeech } from '@speech-sdk/core';
68
115
 
69
116
  const { audio, mediaType } = await streamSpeech({
70
- model: "openai/tts-1",
71
- text: "Hello from the speech SDK!",
72
- voice: "alloy",
117
+ model: 'cartesia/sonic-3',
118
+ text: 'Streaming straight to the client.',
119
+ voice: 'voice-id',
73
120
  });
121
+
122
+ // Forward to an HTTP response:
123
+ return new Response(audio, { headers: { 'Content-Type': mediaType } });
74
124
  ```
75
125
 
76
- ### Pipe to a file (Node)
126
+ > [!NOTE]
127
+ > Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
77
128
 
78
- ```ts
79
- import { createWriteStream } from "node:fs";
80
- import { Readable } from "node:stream";
129
+ ## Conversations
81
130
 
82
- const { audio } = await streamSpeech({
83
- model: "elevenlabs/eleven_flash_v2_5",
84
- text: "Hello world",
85
- voice: "JBFqnCBsd6RMkjVDRZzb",
86
- });
131
+ `generateConversation()` produces a single multi-voice clip from an ordered array of turns. The path is chosen by what the turns are:
87
132
 
88
- await new Promise((resolve, reject) => {
89
- Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
90
- });
91
- ```
133
+ - **Gateway** — every turn uses a gateway-routed string model (e.g. `"openai/tts-1"`). One request to Speech Gateway; the server handles rendering, stitching, and normalization. The SDK never stitches locally on this path — clone voices on gateway models throw `StitchUnsupportedError`.
134
+ - **Native dialogue** — every turn uses the same direct-provider model and that model exposes a multi-speaker endpoint. One API call, naturally mixed.
135
+ - **Stitch** — direct-provider conversations that don't qualify for native dialogue (multi-provider, or no dialogue endpoint). Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
92
136
 
93
- ### Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
137
+ Mixing gateway-routed turns with direct-provider turns in one call throws `MixedDispatchError`.
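
The path selection described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the SDK's dispatch code: `TurnModel`, `DirectModel`, `pickPath`, and the `supportsDialogue` flag are hypothetical names, and the plain `Error` stands in for `MixedDispatchError`.

```ts
// Hypothetical descriptors: a string means "gateway-routed",
// an object means "direct provider". These names are illustrative only.
interface DirectModel {
  provider: string;
  id: string;
  supportsDialogue: boolean;
}
type TurnModel = string | DirectModel;
type Path = 'gateway' | 'native-dialogue' | 'stitch';

function pickPath(models: TurnModel[]): Path {
  if (models.length === 0) throw new Error('at least one turn is required');

  const direct = models.filter(
    (m): m is DirectModel => typeof m !== 'string',
  );

  // Every turn is a string model: one request to the gateway.
  if (direct.length === 0) return 'gateway';

  // Some strings, some factories: the README says this combination throws.
  if (direct.length !== models.length) {
    throw new Error('mixed gateway and direct-provider turns');
  }

  const first = direct[0];
  if (first === undefined) throw new Error('unreachable');

  // Native dialogue needs one shared model that exposes a dialogue endpoint.
  const sameModel = direct.every(
    (m) => m.provider === first.provider && m.id === first.id,
  );
  return sameModel && first.supportsDialogue ? 'native-dialogue' : 'stitch';
}
```

Checking the all-gateway case first keeps the mixed-input error separate from the native-versus-stitch decision, which only applies to direct-provider turns.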
94
138
 
95
139
  ```ts
96
- export async function GET() {
97
- const { audio, mediaType } = await streamSpeech({
98
- model: "cartesia/sonic-3",
99
- text: "Streaming straight to the client.",
100
- voice: "voice-id",
101
- });
102
-
103
- return new Response(audio, { headers: { "Content-Type": mediaType } });
104
- }
105
- ```
106
-
107
- ### Read chunks manually
140
+ import { generateConversation } from '@speech-sdk/core';
108
141
 
109
- ```ts
110
- const reader = audio.getReader();
111
- while (true) {
112
- const { value, done } = await reader.read();
113
- if (done) break;
114
- // value is a Uint8Array of audio bytes
115
- }
142
+ const result = await generateConversation({
143
+ turns: [
144
+ { model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
145
+ { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
146
+ { model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
147
+ ],
148
+ });
116
149
  ```
117
150
 
118
- ### Capability check
151
+ Options: `gapMs` (default 300), `volumeDbfs` (default `-20`), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native). Native-dialogue models enforce their own voice-count and character limits; violations throw `DialogueConstraintError`.
152
+
153
+ ## Timestamps
119
154
 
120
- Check whether a model supports streaming before calling `streamSpeech()`:
155
+ Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
121
156
 
122
157
  ```ts
123
- import { hasFeature } from "@speech-sdk/core";
158
+ const result = await generateSpeech({
159
+ model: 'elevenlabs/eleven_multilingual_v2',
160
+ text: 'Hello from speech-sdk!',
161
+ voice: 'JBFqnCBsd6RMkjVDRZzb',
162
+ timestamps: true,
163
+ });
124
164
 
125
- const model = provider.models.find((m) => m.id === "tts-1");
126
- if (hasFeature(model, "streaming")) {
127
- // safe to call streamSpeech()
128
- }
165
+ result.timestamps;
166
+ // [
167
+ // { text: "Hello", start: 0.00, end: 0.32 },
168
+ // { text: "from", start: 0.36, end: 0.55 },
169
+ // ...
170
+ // ]
129
171
  ```
130
172
 
131
- Calling `streamSpeech()` on a model that doesn't declare the `"streaming"` feature throws `StreamingNotSupportedError`.
173
+ | Value | Behavior |
174
+ |---|---|
175
+ | `true` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
176
+ | `false` *(default)* | Never return timestamps. |
132
177
 
133
- ### Errors and retries
178
+ With `timestamps: true`, models without native alignment require an STT fallback. The SDK automatically uses OpenAI Whisper when `OPENAI_API_KEY` is set in the environment — no extra configuration needed. Gateway-routed models (string model IDs like `"openai/tts-1"`) do not need a fallback — the gateway server provides it.
134
179
 
135
- Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the `ReadableStream` consumer as a stream error and are not retried. Pass `maxRetries` (default `2`) and an `abortSignal` the same way as `generateSpeech()`.
180
+ **Resolution order:** factory `fallbackSTT` → `OPENAI_API_KEY` env var (automatic Whisper fallback) → throws `TimestampKeyMissingError`.
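
As a rough sketch of that resolution order (not the SDK's code): `SttHandle` and `resolveFallbackStt` are hypothetical names, the environment is passed in as a plain object to keep the example self-contained, and a plain `Error` stands in for `TimestampKeyMissingError`.

```ts
// Hypothetical handle describing where the STT fallback came from.
interface SttHandle {
  source: 'factory' | 'env';
  apiKey?: string;
}

function resolveFallbackStt(
  factoryFallback: SttHandle | undefined,
  env: Record<string, string | undefined>,
): SttHandle {
  // 1. A fallbackSTT configured on the factory always wins.
  if (factoryFallback) return factoryFallback;

  // 2. Otherwise use the OpenAI key from the environment, if present.
  const key = env['OPENAI_API_KEY'];
  if (key) return { source: 'env', apiKey: key };

  // 3. Nothing usable: the README says TimestampKeyMissingError is thrown here.
  throw new Error('TimestampKeyMissingError (stand-in)');
}
```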
136
181
 
137
- ## Conversations
182
+ Configure `fallbackSTT` on the factory to use a different key or STT model (set it once, applies to all calls):
183
+
184
+ ```ts
185
+ import { generateSpeech } from '@speech-sdk/core';
186
+ import { createOpenAI, createElevenLabs } from '@speech-sdk/core/providers';
187
+
188
+ const elevenlabs = createElevenLabs({
189
+ apiKey: process.env.ELEVENLABS_API_KEY,
190
+ fallbackSTT: createOpenAI({ apiKey: process.env.MY_OPENAI_KEY }).stt('whisper-1'),
191
+ });
192
+
193
+ const result = await generateSpeech({
194
+ model: elevenlabs('eleven_flash_v2'),
195
+ voice: 'JBFqnCBsd6RMkjVDRZzb',
196
+ text: 'Hello, world.',
197
+ timestamps: true,
198
+ });
199
+ ```
138
200
 
139
- `generateConversation()` produces a single multi-voice audio clip from an ordered array of turns. It picks the best path automatically:
201
+ Whether a given model returns native alignment or transcribes via the STT fallback is a provider detail — both paths produce the same `WordTimestamp[]` shape.
140
202
 
141
- - **Native dialogue**: when every turn shares one model and that provider has a real multi-speaker dialogue endpoint, the SDK makes a single API call and returns the provider's natural mix. Works with **ElevenLabs v3**, **Google Gemini TTS** (exactly 2 voices), **Hume Octave**, **Fish Audio S2-Pro**, and **fal Dia**.
142
- - **Stitch fallback** — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls `generateSpeech()` per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
203
+ `generateConversation` accepts the same options and returns `ConversationWordTimestamp[]`. Every word carries a `turnIndex: number` pointing back into the input `turns[]`. This is what lets you build chat-bubble UIs, speaker-attributed transcripts, and "who's speaking now?" lookups during playback without re-deriving turn boundaries.
143
204
 
144
205
  ```ts
145
- import { generateConversation } from "@speech-sdk/core/conversation";
206
+ import { generateConversation, timestampsToTurns } from '@speech-sdk/core';
146
207
 
147
208
  const result = await generateConversation({
209
+ model: 'elevenlabs/eleven_v3',
148
210
  turns: [
149
- { model: "openai/tts-1", voice: "nova", text: "Hi, I'm hosted by OpenAI." },
150
- { model: "elevenlabs/eleven_multilingual_v2", voice: "JBFqnCBsd6RMkjVDRZzb", text: "And I'm hosted by ElevenLabs." },
151
- { model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
152
- { model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
211
+ { voice: 'rachel', text: 'Hi there.' },
212
+ { voice: 'adam', text: 'Hello!' },
153
213
  ],
214
+ timestamps: true,
154
215
  });
155
216
 
156
- result.audio.uint8Array; // Uint8Array of one combined WAV
157
- result.audio.mediaType; // "audio/wav"
217
+ // Collapse consecutive words from the same turn into per-turn timings:
218
+ const turnTimestamps = timestampsToTurns(result.timestamps ?? []);
158
219
  ```
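
To make the collapsing step concrete, here is a minimal sketch of grouping consecutive words by `turnIndex`. It is not the SDK's `timestampsToTurns` implementation: `ConversationWord`, `TurnSpan`, and `collapseToTurns` are hypothetical names, and the output shape is an assumption based only on the fields mentioned above.

```ts
// Assumed word shape, based on the fields the README mentions.
interface ConversationWord {
  text: string;
  start: number; // seconds from the start of the audio
  end: number; // seconds from the start of the audio
  turnIndex: number;
}

// Hypothetical per-turn result; the real return type may differ.
interface TurnSpan {
  turnIndex: number;
  text: string;
  start: number;
  end: number;
}

function collapseToTurns(words: ConversationWord[]): TurnSpan[] {
  const spans: TurnSpan[] = [];
  for (const word of words) {
    const last = spans[spans.length - 1];
    if (last && last.turnIndex === word.turnIndex) {
      // Same turn as the previous word: extend the open span.
      last.text += ` ${word.text}`;
      last.end = word.end;
    } else {
      // Turn changed (or first word): open a new span.
      spans.push({
        turnIndex: word.turnIndex,
        text: word.text,
        start: word.start,
        end: word.end,
      });
    }
  }
  return spans;
}
```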
159
220
 
160
- The return type is the standard `SpeechResult`, so it composes with everything else in the SDK.
221
+ ### Captions (SRT / WebVTT)
161
222
 
162
- ### Conversation options
223
+ `timestampsToCaptions()` converts word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT.
163
224
 
164
225
  ```ts
165
- generateConversation({
166
- model?: string | ResolvedModel, // default model for all turns
167
- turns: ConversationTurn[], // 1..N turns; any number of unique voices
168
- gapMs?: number, // silence between turns (stitch path), default 300
169
- normalizeVolume?: boolean, // RMS-level the output, default true
170
- volumeDbfs?: number, // RMS target loudness in dBFS (≤0), default -20
171
- maxConcurrency?: number, // cap parallel generateSpeech calls, default 6
172
- maxRetries?: number, // per-turn retries, default 2
173
- apiKey?: string,
174
- providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
175
- abortSignal?: AbortSignal,
176
- headers?: Record<string, string>,
226
+ import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';
227
+
228
+ const { timestamps } = await generateSpeech({
229
+ model: 'elevenlabs/eleven_v3',
230
+ text: 'Hello world. This is a test.',
231
+ voice: 'JBFqnCBsd6RMkjVDRZzb',
232
+ timestamps: true,
177
233
  });
178
234
 
179
- interface ConversationTurn {
180
- voice: Voice; // required
181
- text: string; // required, non-empty
182
- model?: string | ResolvedModel; // per-turn override of the top-level model
183
- providerOptions?: Record<string, unknown>, // stitch path only; see note below
184
- }
235
+ const srt = timestampsToCaptions(timestamps ?? []);
236
+ const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
185
237
  ```
186
238
 
187
- Per-turn `providerOptions` are merged with the top-level `providerOptions` on the stitch path: each turn's underlying `generateSpeech()` call receives `{ ...topLevel, ...turn, ...stitchDefaults }`. On the native-dialogue path the provider renders the whole script in one API call, so per-turn overrides have no well-defined meaning; setting `providerOptions` on any turn throws `ConversationInputError`. Move the options to the top-level `providerOptions` (forwarded once to the dialogue call) instead.
239
+ Cues break on sentence boundaries, then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
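
For intuition, the sentence-boundary step can be sketched as below. This is a simplified illustration, not the SDK's `timestampsToCaptions`: it only breaks on sentence-ending punctuation and ignores the length, duration, and comma rules. `Word`, `Cue`, `srtTime`, and `wordsToSrt` are hypothetical names.

```ts
interface Word {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

interface Cue {
  text: string;
  start: number;
  end: number;
}

// Format seconds as an SRT timecode: HH:MM:SS,mmm.
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Close the current cue after any word ending in '.', '!' or '?'.
function wordsToSrt(words: Word[]): string {
  const cues: Cue[] = [];
  let open: Cue | undefined;
  for (const word of words) {
    if (open) {
      open.text += ` ${word.text}`;
      open.end = word.end;
    } else {
      open = { text: word.text, start: word.start, end: word.end };
    }
    if (/[.!?]$/.test(word.text)) {
      cues.push(open);
      open = undefined;
    }
  }
  if (open) cues.push(open); // trailing words without final punctuation

  return cues
    .map((cue, i) => `${i + 1}\n${srtTime(cue.start)} --> ${srtTime(cue.end)}\n${cue.text}\n`)
    .join('\n');
}
```

A WebVTT variant would differ mainly in the `WEBVTT` header line and in using `.` instead of `,` before the milliseconds.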
188
240
 
189
- ### Volume normalization
241
+ ## Volume normalization
190
242
 
191
- `normalizeVolume: true` (the default) RMS-normalizes the output to an absolute target loudness — broadcast/podcast voice convention — so two `generateConversation` calls produce comparable levels regardless of provider mix or content. The target defaults to **−20 dBFS** (~20 dB of peak headroom), and is configurable via `volumeDbfs` (must be ≤ 0; lower is quieter).
243
+ Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; `-20` is the broadcast/podcast convention).
192
244
 
193
245
  ```ts
194
- await generateConversation({
195
- turns: [...],
196
- volumeDbfs: -16, // a touch louder than the default
246
+ const result = await generateSpeech({
247
+ model: 'openai/gpt-4o-mini-tts',
248
+ text: 'Hello!',
249
+ voice: 'alloy',
250
+ volumeDbfs: -20,
197
251
  });
198
- ```
199
252
 
200
- Normalization runs on **both paths** — stitched multi-provider conversations and single-provider native dialogue. On the native path the SDK transparently asks the provider for its decodable PCM/WAV mode (via `getStitchOptions`), levels the result, and re-encodes as 16-bit mono WAV — so the response `mediaType` becomes `audio/wav` whenever normalization runs. If a native dialogue provider can't emit decodable audio, the request still succeeds but a `warning` is appended explaining that volume normalization was skipped.
253
+ result.audio.mediaType; // "audio/wav" (re-encoded after normalization)
254
+ ```
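For readers unfamiliar with RMS normalization, the arithmetic behind a dBFS target can be sketched as follows. This is an assumption-laden illustration rather than the SDK's code: `rmsDbfs` and `normalizeToDbfs` are hypothetical names, samples are assumed to be floats in [-1, 1], and the decode and WAV re-encode steps are not shown.

```ts
// Hypothetical sketch of RMS normalization on float samples in [-1, 1].
function rmsDbfs(samples: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  return 20 * Math.log10(rms); // -Infinity for silence, NaN for empty input
}

function normalizeToDbfs(samples: Float32Array, targetDbfs: number): Float32Array {
  if (targetDbfs > 0) throw new RangeError('targetDbfs must be <= 0');
  const current = rmsDbfs(samples);
  if (!Number.isFinite(current)) return samples.slice(); // silent or empty: nothing to scale
  // A level difference of d dB corresponds to a linear gain of 10^(d / 20).
  const gain = Math.pow(10, (targetDbfs - current) / 20);
  // Clamp so a large gain cannot push samples outside [-1, 1].
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

Because the target is absolute, two clips with different starting levels end up at the same RMS level, which is what makes separately generated audio comparable.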
201
255
 
202
- Pass `normalizeVolume: false` to skip normalization entirely (zero work) and keep the raw provider audio bytes and `mediaType` untouched.
256
+ `generateConversation` always normalizes; override the target with `volumeDbfs`. A warning is surfaced (and the raw mix passes through) if the provider has no decodable PCM/WAV mode.
203
257
 
204
- ### Errors
258
+ ## Audio tags
205
259
 
206
- Conversation-specific errors (re-exported from `@speech-sdk/core/conversation` alongside `generateConversation`, or importable on their own from `@speech-sdk/core/conversation/errors`):
260
+ Bracket syntax `[tag]` adds expressive cues. Each provider handles tags natively where supported, maps them to its closest equivalent, or strips them and surfaces a warning in `result.warnings`.
207
261
 
208
- | Error | When |
209
- |---|---|
210
- | `ConversationInputError` | Validation failure — empty turns, blank text, or a turn missing a model |
211
- | `DialogueConstraintError` | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
212
- | `StitchUnsupportedError` | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
262
+ ```ts
263
+ await generateSpeech({
264
+ model: 'elevenlabs/eleven_v3',
265
+ text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
266
+ voice: 'voice-id',
267
+ });
268
+ ```
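The strip-and-warn fallback described above can be pictured with a small sketch. The helper below is hypothetical and not part of `@speech-sdk/core`; the warning wording is invented for illustration, and real providers may map tags rather than strip them.

```ts
// Hypothetical helper: NOT part of @speech-sdk/core.
function stripAudioTags(text: string): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const stripped = text
    // Remove every [tag] and record one warning per removed tag.
    .replace(/\[([^\[\]]+)\]/g, (_match: string, tag: string) => {
      warnings.push(`Audio tag [${tag}] is not supported and was removed`);
      return '';
    })
    // Collapse the whitespace left behind by removed tags.
    .replace(/\s+/g, ' ')
    .trim();
  return { text: stripped, warnings };
}
```

Collecting warnings instead of throwing keeps the request usable on providers without tag support while still telling the caller what was dropped.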
213
269
 
214
- ### Native dialogue caps
270
+ ## Voice cloning
215
271
 
216
- | Provider | Native dialogue model | Voice constraints |
217
- |---|---|---|
218
- | ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 total chars |
219
- | Google | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** (API requirement) |
220
- | Hume | `octave-1`, `octave-2` | 1–4 voices |
221
- | Fish Audio | `s2-pro` | 1–4 voices |
222
- | fal | `dia-tts` | 1–2 voices |
223
-
224
- ## Supported Providers
225
-
226
- Use `provider/model` strings. Passing just the provider name uses its default model.
227
-
228
- | Provider | String Prefix | Default Model | Env Var | Docs |
229
- |---|---|---|---|---|
230
- | [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | [API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech) |
231
- | [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | [API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) |
232
- | [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | [API Reference](https://developers.deepgram.com/docs/tts-models) |
233
- | [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | [API Reference](https://docs.cartesia.ai/api-reference/tts/bytes) |
234
- | [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` | [API Reference](https://dev.hume.ai/reference/text-to-speech-tts/synthesize-json) |
235
- | [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` | [API Reference](https://docs.inworld.ai/tts/api-reference) |
236
- | [Google (Gemini TTS)](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | [API Reference](https://ai.google.dev/gemini-api/docs/text-generation) |
237
- | [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | [API Reference](https://docs.fish.audio/developer-guide/core-features/text-to-speech) |
238
- | [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` | [API Reference](https://murf.ai/api/docs/api-reference/text-to-speech/generate) |
239
- | [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` | [API Reference](https://docs.resemble.ai/api-reference/text-to-speech/synthesize) |
240
- | [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` | [API Reference](https://fal.ai/models) |
241
- | [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | [API Reference](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) |
242
- | [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` | [API Reference](https://docs.x.ai/docs/api-reference#text-to-speech) |
272
+ Some providers support reference-audio cloning. Pass a voice object instead of a string.
243
273
 
244
274
  ```ts
245
- generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
246
- generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
247
- generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
248
- generateSpeech({ model: 'inworld/inworld-tts-1.5-max', text: '...', voice: 'Ashley' });
249
- generateSpeech({ model: 'openai', text: '...', voice: 'alloy' }); // uses default model
250
- ```
275
+ import { createFal, createMistral } from '@speech-sdk/core/providers';
276
+
277
+ // Base64 reference:
278
+ await generateSpeech({
279
+ model: createMistral()(),
280
+ text: 'Hello!',
281
+ voice: { audio: 'base64-encoded-audio...' },
282
+ });
251
283
 
252
- Provider-specific API parameters can be passed via `providerOptions` — these are sent directly to the provider's API using the API's own field names.
284
+ // URL reference:
285
+ await generateSpeech({
286
+ model: createFal()('fal-ai/f5-tts'),
287
+ text: 'Hello!',
288
+ voice: { url: 'https://example.com/reference.wav' },
289
+ });
290
+ ```
253
291
 
254
- ## Custom Configuration
292
+ ## Custom configuration
255
293
 
256
- Use factory functions when you need custom API keys, base URLs, or fetch implementations:
294
+ Factory functions give you custom API keys, base URLs, or `fetch` implementations:
257
295
 
258
296
  ```ts
259
297
  import { generateSpeech } from '@speech-sdk/core';
260
- import { createOpenAI } from '@speech-sdk/core/openai';
261
- import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
298
+ import { createOpenAI } from '@speech-sdk/core/providers';
262
299
 
263
300
  const myOpenAI = createOpenAI({
264
301
  apiKey: 'sk-...',
265
302
  baseURL: 'https://my-proxy.com/v1',
266
303
  });
267
304
 
268
- const result = await generateSpeech({
305
+ await generateSpeech({
269
306
  model: myOpenAI('gpt-4o-mini-tts'),
270
307
  text: 'Hello!',
271
308
  voice: 'alloy',
272
309
  });
273
310
  ```
274
311
 
275
- ### API Key Resolution
276
-
277
- When using string models (e.g., `'openai/tts-1'`), API keys are resolved from environment variables (see table above). Factory functions accept an explicit `apiKey` option which takes precedence.
278
-
279
- ## Audio Tags
312
+ ## Public imports
280
313
 
281
- Use bracket syntax `[tag]` to add expressive audio cues like laughter, sighs, or emotions. Provider support varies — unsupported tags are automatically stripped with warnings returned in `result.warnings`.
314
+ The root package exports the main runtime APIs:
282
315
 
283
316
  ```ts
284
- const result = await generateSpeech({
285
- model: 'elevenlabs/eleven_v3',
286
- text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
287
- voice: 'voice-id',
288
- });
289
-
290
- console.log(result.warnings); // undefined — eleven_v3 supports all tags
317
+ import {
318
+ generateSpeech,
319
+ streamSpeech,
320
+ generateConversation,
321
+ timestampsToCaptions,
322
+ ApiError,
323
+ } from '@speech-sdk/core';
291
324
  ```
292
325
 
293
- ### Provider behavior
294
-
295
- | Provider | Behavior |
296
- |---|---|
297
- | OpenAI (`gpt-4o-mini-tts`) | Tags mapped to the `instructions` field for expressive delivery control |
298
- | ElevenLabs (`eleven_v3`) | All `[tag]` passed through natively |
299
- | Google (`gemini-3.1-flash-tts-preview`) | All `[tag]` passed through natively (e.g. `[whispers]`, `[shouting]`, `[sighs]`, `[laugh]`) |
300
- | Cartesia (`sonic-3`) | Emotion tags (`[happy]`, `[sad]`, `[angry]`, etc.) converted to SSML; `[laughter]` passed through; unknown tags stripped |
301
- | All others | Tags stripped and warnings returned |
326
+ Provider and STT factories live under `@speech-sdk/core/providers`:
302
327
 
303
328
  ```ts
304
- // OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
305
- const result = await generateSpeech({
306
- model: 'openai/gpt-4o-mini-tts',
307
- text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
308
- voice: 'alloy',
309
- });
310
- // Sent to OpenAI:
311
- // input: "Hi John how are you? I'm feeling great"
312
- // instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
313
- console.log(result.warnings); // undefined
329
+ import {
330
+ createOpenAI,
331
+ createElevenLabs,
332
+ createCartesia,
333
+ createSpeechGateway,
334
+ } from '@speech-sdk/core/providers';
314
335
  ```
315
336
 
316
- ## Voice Cloning
317
-
318
- Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
337
+ Public types live under `@speech-sdk/core/types`:
319
338
 
320
339
  ```ts
321
- import { createMistral } from '@speech-sdk/core/mistral';
322
-
323
- const mistral = createMistral();
324
-
325
- // Clone from base64 audio
326
- const result = await generateSpeech({
327
- model: mistral(),
328
- text: 'Hello!',
329
- voice: { audio: 'base64-encoded-audio...' },
330
- });
340
+ import type {
341
+ GenerateSpeechOptions,
342
+ SpeechResult,
343
+ ConversationResult,
344
+ Voice,
345
+ WordTimestamp,
346
+ } from '@speech-sdk/core/types';
331
347
  ```
332
348
 
333
- Clone from a URL (fal):
334
-
335
- ```ts
336
- import { createFal } from '@speech-sdk/core/fal-ai';
337
-
338
- const fal = createFal();
339
- const result = await generateSpeech({
340
- model: fal('fal-ai/chatterbox'),
341
- text: 'Hello!',
342
- voice: { url: 'https://example.com/reference.wav' },
343
- });
344
- ```
345
-
346
- ## Options
349
+ ## API reference
347
350
 
348
351
  ```ts
349
352
  generateSpeech({
350
- model: string | ResolvedModel, // required
351
- text: string, // required
352
- voice: Voice, // required
353
- providerOptions?: object, // provider-specific API params
354
- maxRetries?: number, // default: 2 (retries on 5xx/network errors)
355
- abortSignal?: AbortSignal, // cancel the request
356
- headers?: Record<string, string>, // additional HTTP headers
357
- });
358
- ```
359
-
360
- ## Result
353
+ model: string | ResolvedModel, // required
354
+ text: string, // required
355
+ voice: Voice, // required — string | { url } | { audio }
356
+ providerOptions?: object,
357
+ volumeDbfs?: number, // must be ≤ 0
358
+ timestamps?: boolean, // default false
359
+ maxRetries?: number, // default 2
360
+ abortSignal?: AbortSignal,
361
+ headers?: Record<string, string>,
362
+ }): Promise<SpeechResult>
361
363
 
362
- ```ts
363
364
  interface SpeechResult {
364
- audio: {
365
- uint8Array: Uint8Array; // raw audio bytes
366
- base64: string; // base64 encoded (lazy)
367
- mediaType: string; // e.g. "audio/mpeg"
368
- };
365
+ audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
366
+ metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
367
+ timestamps?: WordTimestamp[];
369
368
  providerMetadata?: Record<string, unknown>;
369
+ warnings?: string[];
370
+ }
371
+
372
+ interface WordTimestamp { text: string; start: number; end: number } // seconds
373
+
374
+ // Returned by generateConversation — extends WordTimestamp with turnIndex
375
+ interface ConversationWordTimestamp extends WordTimestamp {
376
+ turnIndex: number; // index into the input turns[] array
370
377
  }
371
378
  ```
372
379
 
373
- ## Error Handling
380
+ ## Error handling
374
381
 
375
382
  ```ts
376
- import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';
383
+ import { generateSpeech, ApiError } from '@speech-sdk/core';
377
384
 
378
385
  try {
379
- const result = await generateSpeech({ ... });
386
+ await generateSpeech({ /* ... */ });
380
387
  } catch (error) {
381
388
  if (error instanceof ApiError) {
382
- console.log(error.statusCode); // 401
383
- console.log(error.model); // "openai/gpt-4o-mini-tts"
384
- console.log(error.responseBody);
389
+ error.statusCode; // 401, 429, 500, ...
390
+ error.responseBody;
391
+ error.code; // stable machine-readable code (optional)
385
392
  }
386
393
  }
387
394
  ```
388
395
 
396
+ `ApiError.code` is populated from the RFC 7807 `application/problem+json` `code` extension when the upstream provides one (currently only the Speech Gateway). Match on `err.code` over `err.message` text — codes are a stable contract, messages aren't.
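As a sketch of what matching on the code rather than the message might look like: the `ApiError` class below is a local stand-in so the example runs on its own, `classify` is a hypothetical helper, and the code values `rate_limited` and `invalid_api_key` are invented placeholders, not documented codes.

```ts
// Local stand-in for the SDK's ApiError so the sketch is self-contained.
class ApiError extends Error {
  statusCode: number;
  code?: string;
  constructor(message: string, statusCode: number, code?: string) {
    super(message);
    this.name = 'ApiError';
    this.statusCode = statusCode;
    this.code = code;
  }
}

type Action = 'retry-later' | 'fix-credentials' | 'give-up';

// Prefer the machine-readable code; fall back to the HTTP status when it is absent.
function classify(error: unknown): Action {
  if (!(error instanceof ApiError)) return 'give-up';
  switch (error.code) {
    case 'rate_limited': // hypothetical code value
      return 'retry-later';
    case 'invalid_api_key': // hypothetical code value
      return 'fix-credentials';
  }
  if (error.statusCode === 429) return 'retry-later';
  if (error.statusCode === 401) return 'fix-credentials';
  return 'give-up';
}
```

The status-code fallback matters because, per the note above, most upstreams do not populate `code` at all.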
397
+
389
398
  | Error | When |
390
399
  |---|---|
391
- | `ApiError` | Provider API returns a non-2xx response |
392
- | `NoSpeechGeneratedError` | Provider returned empty audio |
393
- | `SpeechSDKError` | Base class for all errors |
394
-
395
- ## Retry
400
+ | `ApiError` | Provider returned non-2xx |
401
+ | `MissingApiKeyError` | No `apiKey` passed and the provider's env var is unset |
402
+ | `NoSpeechGeneratedError` | Empty input (after tag stripping) or empty provider response |
403
+ | `StreamingNotSupportedError` | `streamSpeech()` on a non-streaming model |
404
+ | `VolumeAdjustmentUnsupportedError` | `volumeDbfs` with no decodable output mode |
405
+ | `TimestampKeyMissingError` | `timestamps: true` with no native support, no `fallbackSTT` configured, and `OPENAI_API_KEY` not set |
406
+ | `ConversationInputError` / `DialogueConstraintError` / `StitchUnsupportedError` | `generateConversation` validation / native caps / stitch incompatibility |
407
+ | `SpeechSDKError` | Base class |
396
408
 
397
- Built-in retry with exponential backoff via [p-retry](https://github.com/sindresorhus/p-retry). Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
409
+ Retries 5xx and network errors with exponential backoff ([p-retry](https://github.com/sindresorhus/p-retry)); does not retry 4xx. Default 2 retries; override via `maxRetries`.
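The retry policy above can be approximated with a hand-rolled loop. Note that this sketch does not use p-retry and is not the SDK's code: `isRetryable` and `withRetry` are hypothetical names, the 1-second base delay is an assumption, and any error without a `statusCode` is treated as a network failure.

```ts
interface HttpLikeError { statusCode?: number }

// Retry on 5xx, or when there is no status at all (assumed network failure); never on 4xx.
function isRetryable(error: unknown): boolean {
  const status = (error as HttpLikeError | null | undefined)?.statusCode;
  return status === undefined || status >= 500;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 2, // matches the documented default
  baseDelayMs = 1000, // assumed base delay, not taken from the SDK
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up once retries are exhausted or the error is not worth retrying.
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      await sleep(baseDelayMs * Math.pow(2, attempt)); // delay doubles on each attempt
    }
  }
}
```

With `maxRetries = 2` the function is called at most three times in total; the injectable `sleep` parameter exists only so tests can skip the real delays.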
398
410
 
399
411
  ## Development
400
412
 
401
413
  ```bash
402
414
  pnpm install
403
- pnpm test # unit tests
404
- pnpm run test:e2e # e2e tests (requires API keys)
405
- pnpm run typecheck # type-check without emitting
406
- ```
407
-
408
- E2E tests hit real provider APIs. Set the relevant API key environment variables in a `.env` file or export them in your shell.
409
-
410
- Set `SPEECH_SDK_E2E_OUTPUT_DIR` to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
411
-
412
- ```bash
413
- SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
415
+ pnpm test # unit tests
416
+ pnpm run test:e2e # e2e tests (requires provider API keys)
417
+ pnpm run typecheck
418
+ pnpm fix # format + lint
414
419
  ```
415
420
 
416
- ## License
417
-
418
- MIT
421
+ E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.