@speech-sdk/core 0.6.2 → 0.8.0-alpha

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (202)

package/LICENSE +202 -21
package/README.md +267 -264
package/dist/__tests__/e2e/_save-audio.d.ts +5 -24
package/dist/__tests__/e2e/_save-audio.d.ts.map +1 -1
package/dist/__tests__/e2e/_save-audio.js +19 -42
package/dist/__tests__/e2e/_save-audio.js.map +1 -1
package/dist/audio-duration.d.ts +0 -5
package/dist/audio-duration.d.ts.map +1 -1
package/dist/audio-duration.js +3 -10
package/dist/audio-duration.js.map +1 -1
package/dist/audio-utils.d.ts +1 -9
package/dist/audio-utils.d.ts.map +1 -1
package/dist/audio-utils.js +10 -13
package/dist/audio-utils.js.map +1 -1
package/dist/captions.d.ts +29 -0
package/dist/captions.d.ts.map +1 -0
package/dist/captions.js +193 -0
package/dist/captions.js.map +1 -0
package/dist/conversation/attribute-timestamps.d.ts +26 -0
package/dist/conversation/attribute-timestamps.d.ts.map +1 -0
package/dist/conversation/attribute-timestamps.js +276 -0
package/dist/conversation/attribute-timestamps.js.map +1 -0
package/dist/conversation/dispatch.d.ts +5 -5
package/dist/conversation/dispatch.d.ts.map +1 -1
package/dist/conversation/dispatch.js +18 -8
package/dist/conversation/dispatch.js.map +1 -1
package/dist/conversation/errors.d.ts +3 -0
package/dist/conversation/errors.d.ts.map +1 -1
package/dist/conversation/errors.js +6 -0
package/dist/conversation/errors.js.map +1 -1
package/dist/conversation/pcm-concat.d.ts +0 -23
package/dist/conversation/pcm-concat.d.ts.map +1 -1
package/dist/conversation/pcm-concat.js +5 -43
package/dist/conversation/pcm-concat.js.map +1 -1
package/dist/conversation/proportional-fill.d.ts +10 -0
package/dist/conversation/proportional-fill.d.ts.map +1 -0
package/dist/conversation/proportional-fill.js +64 -0
package/dist/conversation/proportional-fill.js.map +1 -0
package/dist/conversation/silence-detection.d.ts +14 -0
package/dist/conversation/silence-detection.d.ts.map +1 -0
package/dist/conversation/silence-detection.js +52 -0
package/dist/conversation/silence-detection.js.map +1 -0
package/dist/conversation/stitch.d.ts +3 -1
package/dist/conversation/stitch.d.ts.map +1 -1
package/dist/conversation/stitch.js +54 -13
package/dist/conversation/stitch.js.map +1 -1
package/dist/conversation/types.d.ts +1 -19
package/dist/conversation/types.d.ts.map +1 -1
package/dist/conversation/validate.d.ts +1 -16
package/dist/conversation/validate.d.ts.map +1 -1
package/dist/conversation/validate.js +29 -29
package/dist/conversation/validate.js.map +1 -1
package/dist/default-stt-fallback.d.ts +3 -0
package/dist/default-stt-fallback.d.ts.map +1 -0
package/dist/default-stt-fallback.js +11 -0
package/dist/default-stt-fallback.js.map +1 -0
package/dist/derive-timestamps.d.ts +10 -0
package/dist/derive-timestamps.d.ts.map +1 -0
package/dist/derive-timestamps.js +24 -0
package/dist/derive-timestamps.js.map +1 -0
package/dist/errors.d.ts +20 -2
package/dist/errors.d.ts.map +1 -1
package/dist/errors.js +28 -2
package/dist/errors.js.map +1 -1
package/dist/generate-conversation.d.ts +5 -4
package/dist/generate-conversation.d.ts.map +1 -1
package/dist/generate-conversation.js +191 -38
package/dist/generate-conversation.js.map +1 -1
package/dist/generate-speech.d.ts +2 -10
package/dist/generate-speech.d.ts.map +1 -1
package/dist/generate-speech.js +111 -33
package/dist/generate-speech.js.map +1 -1
package/dist/index.d.ts +5 -8
package/dist/index.d.ts.map +1 -1
package/dist/index.js +6 -4
package/dist/index.js.map +1 -1
package/dist/logger.d.ts +2 -0
package/dist/logger.d.ts.map +1 -0
package/dist/logger.js +29 -0
package/dist/logger.js.map +1 -0
package/dist/metadata.d.ts +0 -22
package/dist/metadata.d.ts.map +1 -1
package/dist/provider-utils.d.ts +3 -1
package/dist/provider-utils.d.ts.map +1 -1
package/dist/provider-utils.js +36 -39
package/dist/provider-utils.js.map +1 -1
package/dist/providers/cartesia/alignment.d.ts +8 -0
package/dist/providers/cartesia/alignment.d.ts.map +1 -0
package/dist/providers/cartesia/alignment.js +18 -0
package/dist/providers/cartesia/alignment.js.map +1 -0
package/dist/providers/cartesia/index.d.ts +11 -13
package/dist/providers/cartesia/index.d.ts.map +1 -1
package/dist/providers/cartesia/index.js +184 -61
package/dist/providers/cartesia/index.js.map +1 -1
package/dist/providers/deepgram/index.d.ts +7 -8
package/dist/providers/deepgram/index.d.ts.map +1 -1
package/dist/providers/deepgram/index.js +17 -18
package/dist/providers/deepgram/index.js.map +1 -1
package/dist/providers/elevenlabs/alignment.d.ts +10 -0
package/dist/providers/elevenlabs/alignment.d.ts.map +1 -0
package/dist/providers/elevenlabs/alignment.js +47 -0
package/dist/providers/elevenlabs/alignment.js.map +1 -0
package/dist/providers/elevenlabs/index.d.ts +10 -26
package/dist/providers/elevenlabs/index.d.ts.map +1 -1
package/dist/providers/elevenlabs/index.js +216 -154
package/dist/providers/elevenlabs/index.js.map +1 -1
package/dist/providers/fal/index.d.ts +7 -43
package/dist/providers/fal/index.d.ts.map +1 -1
package/dist/providers/fal/index.js +37 -86
package/dist/providers/fal/index.js.map +1 -1
package/dist/providers/fish-audio/index.d.ts +7 -8
package/dist/providers/fish-audio/index.d.ts.map +1 -1
package/dist/providers/fish-audio/index.js +23 -19
package/dist/providers/fish-audio/index.js.map +1 -1
package/dist/providers/gateway/index.d.ts +68 -0
package/dist/providers/gateway/index.d.ts.map +1 -0
package/dist/providers/gateway/index.js +236 -0
package/dist/providers/gateway/index.js.map +1 -0
package/dist/providers/google/index.d.ts +7 -20
package/dist/providers/google/index.d.ts.map +1 -1
package/dist/providers/google/index.js +161 -151
package/dist/providers/google/index.js.map +1 -1
package/dist/providers/hume/alignment.d.ts +33 -0
package/dist/providers/hume/alignment.d.ts.map +1 -0
package/dist/providers/hume/alignment.js +37 -0
package/dist/providers/hume/alignment.js.map +1 -0
package/dist/providers/hume/index.d.ts +11 -13
package/dist/providers/hume/index.d.ts.map +1 -1
package/dist/providers/hume/index.js +105 -41
package/dist/providers/hume/index.js.map +1 -1
package/dist/providers/inworld/alignment.d.ts +11 -0
package/dist/providers/inworld/alignment.d.ts.map +1 -0
package/dist/providers/inworld/alignment.js +24 -0
package/dist/providers/inworld/alignment.js.map +1 -0
package/dist/providers/inworld/index.d.ts +10 -14
package/dist/providers/inworld/index.d.ts.map +1 -1
package/dist/providers/inworld/index.js +55 -38
package/dist/providers/inworld/index.js.map +1 -1
package/dist/providers/mistral/index.d.ts +7 -8
package/dist/providers/mistral/index.d.ts.map +1 -1
package/dist/providers/mistral/index.js +39 -38
package/dist/providers/mistral/index.js.map +1 -1
package/dist/providers/murf/alignment.d.ts +13 -0
package/dist/providers/murf/alignment.d.ts.map +1 -0
package/dist/providers/murf/alignment.js +22 -0
package/dist/providers/murf/alignment.js.map +1 -0
package/dist/providers/murf/index.d.ts +11 -13
package/dist/providers/murf/index.d.ts.map +1 -1
package/dist/providers/murf/index.js +73 -56
package/dist/providers/murf/index.js.map +1 -1
package/dist/providers/openai/index.d.ts +36 -20
package/dist/providers/openai/index.d.ts.map +1 -1
package/dist/providers/openai/index.js +270 -102
package/dist/providers/openai/index.js.map +1 -1
package/dist/providers/resemble/alignment.d.ts +11 -0
package/dist/providers/resemble/alignment.d.ts.map +1 -0
package/dist/providers/resemble/alignment.js +54 -0
package/dist/providers/resemble/alignment.js.map +1 -0
package/dist/providers/resemble/index.d.ts +10 -8
package/dist/providers/resemble/index.d.ts.map +1 -1
package/dist/providers/resemble/index.js +58 -40
package/dist/providers/resemble/index.js.map +1 -1
package/dist/providers/xai/index.d.ts +7 -9
package/dist/providers/xai/index.d.ts.map +1 -1
package/dist/providers/xai/index.js +37 -40
package/dist/providers/xai/index.js.map +1 -1
package/dist/providers.d.ts +29 -0
package/dist/providers.d.ts.map +1 -0
package/dist/providers.js +15 -0
package/dist/providers.js.map +1 -0
package/dist/resolve-provider.d.ts.map +1 -1
package/dist/resolve-provider.js +7 -59
package/dist/resolve-provider.js.map +1 -1
package/dist/speech-provider.d.ts +19 -15
package/dist/speech-provider.d.ts.map +1 -1
package/dist/speech-provider.js +9 -14
package/dist/speech-provider.js.map +1 -1
package/dist/speech-result.d.ts +5 -0
package/dist/speech-result.d.ts.map +1 -1
package/dist/speech-result.js.map +1 -1
package/dist/speech-to-text-provider.d.ts +28 -0
package/dist/speech-to-text-provider.d.ts.map +1 -0
package/dist/speech-to-text-provider.js +2 -0
package/dist/speech-to-text-provider.js.map +1 -0
package/dist/stream-speech.d.ts.map +1 -1
package/dist/stream-speech.js +2 -3
package/dist/stream-speech.js.map +1 -1
package/dist/timestamps.d.ts +9 -0
package/dist/timestamps.d.ts.map +1 -0
package/dist/timestamps.js +2 -0
package/dist/timestamps.js.map +1 -0
package/dist/turns.d.ts +9 -0
package/dist/turns.d.ts.map +1 -0
package/dist/turns.js +21 -0
package/dist/turns.js.map +1 -0
package/dist/types.d.ts +25 -0
package/dist/types.d.ts.map +1 -1
package/dist/volume-adjust.d.ts +0 -6
package/dist/volume-adjust.d.ts.map +1 -1
package/dist/volume-adjust.js +0 -6
package/dist/volume-adjust.js.map +1 -1
package/package.json +12 -63
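
Each entry in the listing above has the shape `path +added -removed`. As a rough aid for reading it, the sketch below totals the added and removed line counts. It assumes that exact shape for every entry; `parseDiffstat` and `sumStats` are hypothetical helper names, not part of this page or of `@speech-sdk/core`, and the sample uses only three entries copied from the listing.

```ts
// Hypothetical helpers for summarizing a "path +added -removed" listing.
// Assumption: one entry per line, in exactly that shape.
interface FileStat {
  path: string;
  added: number;
  removed: number;
}

function parseDiffstat(listing: string): FileStat[] {
  const stats: FileStat[] = [];
  for (const line of listing.split('\n')) {
    // Capture the path, the "+N" count, and the "-M" count.
    const m = line.trim().match(/^(\S+)\s+\+(\d+)\s+-(\d+)$/);
    if (m) {
      stats.push({ path: m[1], added: Number(m[2]), removed: Number(m[3]) });
    }
    // Lines that do not match (blank lines, headings) are skipped.
  }
  return stats;
}

function sumStats(stats: FileStat[]): { added: number; removed: number } {
  let added = 0;
  let removed = 0;
  for (const s of stats) {
    added += s.added;
    removed += s.removed;
  }
  return { added, removed };
}

// Three entries copied from the listing above.
const sample = [
  'package/LICENSE +202 -21',
  'package/README.md +267 -264',
  'package/package.json +12 -63',
].join('\n');

const sampleStats = parseDiffstat(sample);
const sampleTotals = sumStats(sampleStats);
console.log(sampleTotals); // { added: 481, removed: 348 }
```

Filtering `sampleStats` by path prefix (for example `package/dist/providers/`) before calling `sumStats` would give a per-directory view of where most of the churn sits.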

package/README.md CHANGED

@@ -1,31 +1,48 @@
+<div align="center">
+<img src="https://github.com/user-attachments/assets/42d9b528-e507-4162-8120-338bb0c92650" alt="Speech SDK" width="140" />
 # Speech SDK
-[![npm version](https://img.shields.io/npm/v/@speech-sdk/core)](https://www.npmjs.com/package/@speech-sdk/core)
-[![npm downloads](https://img.shields.io/npm/dm/@speech-sdk/core)](https://www.npmjs.com/package/@speech-sdk/core)
-[![license](https://img.shields.io/npm/l/@speech-sdk/core)](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
+**Text-to-speech across 13 providers, one API.**
-The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit designed to help build text-to-speech powered applications using popular providers like OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. Cross-platform (Node.js, Edge, Browser) with minimal dependencies.
+A lightweight, provider-agnostic TypeScript SDK. Zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
-To learn more about the Speech SDK, check out [https://speechsdk.dev/](https://speechsdk.dev/).
+[![npm version](https://img.shields.io/npm/v/@speech-sdk/core?style=flat-square)](https://www.npmjs.com/package/@speech-sdk/core)
+[![npm downloads](https://img.shields.io/npm/dm/@speech-sdk/core?style=flat-square)](https://www.npmjs.com/package/@speech-sdk/core)
+[![license](https://img.shields.io/npm/l/@speech-sdk/core?style=flat-square)](https://github.com/Jellypod-Inc/speech-sdk/blob/main/LICENSE)
+[![Discord](https://img.shields.io/badge/Discord-Join-5865F2?style=flat-square&logo=discord&logoColor=white)](https://discord.gg/xcTQMU3nCV)
+[![Stars](https://img.shields.io/github/stars/Jellypod-Inc/speech-sdk?style=flat-square&logo=github&label=stars)](https://github.com/Jellypod-Inc/speech-sdk/stargazers)
-<img width="1200" height="630" alt="og-3" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
+**[Quick start](#quick-start)** · **[Providers](#supported-providers)** · **[Streaming](#streaming)** · **[Multi-Speaker Conversations](#conversations)** · **[Timestamps](#timestamps)**
+</div>
-## Install
+<br />
-```bash
-npm install @speech-sdk/core
-```
+<img width="1200" height="630" alt="Speech SDK" src="https://github.com/user-attachments/assets/b90c0235-9405-4939-bffa-75fc82be5afb" />
+Learn more at [speechsdk.dev](https://speechsdk.dev/).
-### Using an AI Coding Assistant?
+## Features
-Add the speech-sdk skill to give your AI assistant full knowledge of this library:
+- **Universal** — one `generateSpeech()` call across every supported provider.
+- **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
+- **Conversations** — `generateConversation()` produces multi-speaker audio, picking a gateway, native-dialogue, or local-stitch path automatically.
+- **Word-level timestamps** — `timestamps: true` returns alignment, using the provider's native data or falling back to STT.
+- **Volume normalization** — RMS-level outputs to an absolute loudness target.
+- **Audio tags & voice cloning** — bracket cues like `[laugh]` and reference-audio cloning where supported.
+## Install
 ```bash
-npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
+npm install @speech-sdk/core
 ```
-## Quick Start
+> [!TIP]
+> Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
+## Quick start
 ```ts
 import { generateSpeech } from '@speech-sdk/core';
@@ -36,383 +53,369 @@ const result = await generateSpeech({
   voice: 'alloy',
 });
-// Access the audio
 result.audio.uint8Array;  // Uint8Array
-result.audio.base64;      // string (lazy-computed)
+result.audio.base64;      // string (lazy)
 result.audio.mediaType;   // "audio/mpeg"
 ```
-### Volume normalization
+Pass a `provider/model` string, or just the provider name to use its default model. The string above is enough to get going — set one env var and you're done.
-Pass `volumeDbfs` to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
+## Gateway vs direct provider
-```ts
-const result = await generateSpeech({
-  model: 'openai/gpt-4o-mini-tts',
-  text: 'Hello from speech-sdk!',
-  voice: 'alloy',
-  volumeDbfs: -20,
-});
+The SDK has two ways to reach a provider, and the choice is made by **how you pass `model`**:
-result.audio.mediaType;   // "audio/wav" — re-encoded after normalization
+```ts
+// 1. String → routes through Speech Gateway (https://api.speechgateway.com)
+//    Needs SPEECH_GATEWAY_API_KEY (sign up at https://speechgateway.com).
+await generateSpeech({ model: 'openai/gpt-4o-mini-tts', text: '...', voice: 'alloy' });
+// 2. Factory → calls the provider directly (no proxy hop)
+//    Reads the provider's env var (e.g. OPENAI_API_KEY), or pass apiKey to the factory.
+import { createOpenAI } from '@speech-sdk/core/providers';
+await generateSpeech({ model: createOpenAI()('gpt-4o-mini-tts'), text: '...', voice: 'alloy' });
 ```
-When `volumeDbfs` is set the SDK transparently asks the provider for its decodable PCM/WAV mode, normalizes the samples, and returns 16-bit mono WAV — so the response `mediaType` switches to `audio/wav` regardless of the provider's native default. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable output mode.
+| | Speech Gateway (string) | Direct provider (factory) |
+|---|---|---|
+| When to use | You want a single endpoint and easy provider swaps | You already have provider keys, want zero-hop latency, or need provider features the gateway hasn't surfaced |
+| Setup | `SPEECH_GATEWAY_API_KEY` only | One env var per provider you use |
+| Key resolution | `apiKey` option → `SPEECH_GATEWAY_API_KEY` | `createX({ apiKey })` → `<PROVIDER>_API_KEY` |
+| Endpoint | `api.speechgateway.com` | Provider's own API |
+The gateway also accepts `createSpeechGateway({ apiKey, baseURL })` if you want to construct it explicitly (e.g. for a custom proxy URL).
+## Supported providers
+| Provider | Prefix | Env var |
+|---|---|---|
+| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `OPENAI_API_KEY` |
+| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `ELEVENLABS_API_KEY` |
+| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `DEEPGRAM_API_KEY` |
+| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `CARTESIA_API_KEY` |
+| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `HUME_API_KEY` |
+| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `INWORLD_API_KEY` |
+| [Google Gemini TTS](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `GOOGLE_API_KEY` |
+| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `FISH_AUDIO_API_KEY` |
+| [Murf](https://murf.ai/api/docs) | `murf` | `MURF_API_KEY` |
+| [Resemble](https://docs.resemble.ai) | `resemble` | `RESEMBLE_API_KEY` |
+| [fal](https://fal.ai/models) | `fal-ai` | `FAL_API_KEY` |
+| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `MISTRAL_API_KEY` |
+| [xAI](https://docs.x.ai/docs/models) | `xai` | `XAI_API_KEY` |
+The env var applies when you call the provider directly via its factory. Pass a string `model` like `"openai/tts-1"` to route through Speech Gateway instead, which reads `SPEECH_GATEWAY_API_KEY` — see [Gateway vs direct provider](#gateway-vs-direct-provider). Most providers ship a default model (`createOpenAI()()`); a few (e.g. fal) require an explicit model id. See the linked docs for each provider's full model list.
+Provider-specific parameters pass through via `providerOptions` using each API's native field names.
 ## Streaming
-Use `streamSpeech()` instead of `generateSpeech()` to receive audio bytes incrementally as the provider produces them. The result's `audio` field is a standard `ReadableStream<Uint8Array>` that works in Node, Edge runtimes, and browsers.
+`streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
 ```ts
-import { streamSpeech } from "@speech-sdk/core";
+import { streamSpeech } from '@speech-sdk/core';
 const { audio, mediaType } = await streamSpeech({
-  model: "openai/tts-1",
-  text: "Hello from the speech SDK!",
-  voice: "alloy",
+  model: 'cartesia/sonic-3',
+  text: 'Streaming straight to the client.',
+  voice: 'voice-id',
 });
+// Forward to an HTTP response:
+return new Response(audio, { headers: { 'Content-Type': mediaType } });
 ```
-### Pipe to a file (Node)
+> [!NOTE]
+> Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
-```ts
-import { createWriteStream } from "node:fs";
-import { Readable } from "node:stream";
+## Conversations
-const { audio } = await streamSpeech({
-  model: "elevenlabs/eleven_flash_v2_5",
-  text: "Hello world",
-  voice: "JBFqnCBsd6RMkjVDRZzb",
-});
+`generateConversation()` produces a single multi-voice clip from an ordered array of turns. The path is chosen by what the turns are:
-await new Promise((resolve, reject) => {
-  Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
-});
-```
+- **Gateway** — every turn uses a gateway-routed string model (e.g. `"openai/tts-1"`). One request to Speech Gateway; the server handles rendering, stitching, and normalization. The SDK never stitches locally on this path — clone voices on gateway models throw `StitchUnsupportedError`.
+- **Native dialogue** — every turn uses the same direct-provider model and that model exposes a multi-speaker endpoint. One API call, naturally mixed.
+- **Stitch** — direct-provider conversations that don't qualify for native dialogue (multi-provider, or no dialogue endpoint). Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
-### Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
+Mixing gateway-routed turns with direct-provider turns in one call throws `MixedDispatchError`.
 ```ts
-export async function GET() {
-  const { audio, mediaType } = await streamSpeech({
-    model: "cartesia/sonic-3",
-    text: "Streaming straight to the client.",
-    voice: "voice-id",
-  });
-  return new Response(audio, { headers: { "Content-Type": mediaType } });
-}
-```
-### Read chunks manually
+import { generateConversation } from '@speech-sdk/core';
-```ts
-const reader = audio.getReader();
-while (true) {
-  const { value, done } = await reader.read();
-  if (done) break;
-  // value is a Uint8Array of audio bytes
-}
+const result = await generateConversation({
+  turns: [
+    { model: 'openai/tts-1',                     voice: 'nova',                 text: "Hi, I'm hosted by OpenAI." },
+    { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
+    { model: 'hume/octave-2',                    voice: 'Kora',                 text: "I'm Hume Octave. Thanks for listening." },
+  ],
+});
 ```
-### Capability check
+Options: `gapMs` (default 300), `volumeDbfs` (default `-20`), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native). Native-dialogue models enforce their own voice-count and character limits; violations throw `DialogueConstraintError`.
+## Timestamps
-Check whether a model supports streaming before calling `streamSpeech()`:
+Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
 ```ts
-import { hasFeature } from "@speech-sdk/core";
+const result = await generateSpeech({
+  model: 'elevenlabs/eleven_multilingual_v2',
+  text: 'Hello from speech-sdk!',
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  timestamps: true,
+});
-const model = provider.models.find((m) => m.id === "tts-1");
-if (hasFeature(model, "streaming")) {
-  // safe to call streamSpeech()
-}
+result.timestamps;
+// [
+//   { text: "Hello",  start: 0.00, end: 0.32 },
+//   { text: "from",   start: 0.36, end: 0.55 },
+//   ...
+// ]
 ```
-Calling `streamSpeech()` on a model that doesn't declare the `"streaming"` feature throws `StreamingNotSupportedError`.
+| Value | Behavior |
+|---|---|
+| `true` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
+| `false` *(default)* | Never return timestamps. |
-### Errors and retries
+With `timestamps: true`, models without native alignment require an STT fallback. The SDK automatically uses OpenAI Whisper when `OPENAI_API_KEY` is set in the environment — no extra configuration needed. Gateway-routed models (string model IDs like `"openai/tts-1"`) do not need a fallback — the gateway server provides it.
-Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the `ReadableStream` consumer as a stream error and are not retried. Pass `maxRetries` (default `2`) and an `abortSignal` the same way as `generateSpeech()`.
+**Resolution order:** factory `fallbackSTT` → `OPENAI_API_KEY` env var (automatic Whisper fallback) → throws `TimestampKeyMissingError`.
-## Conversations
+Configure `fallbackSTT` on the factory to use a different key or STT model (set it once, applies to all calls):
+```ts
+import { generateSpeech } from '@speech-sdk/core';
+import { createOpenAI, createElevenLabs } from '@speech-sdk/core/providers';
+const elevenlabs = createElevenLabs({
+  apiKey: process.env.ELEVENLABS_API_KEY,
+  fallbackSTT: createOpenAI({ apiKey: process.env.MY_OPENAI_KEY }).stt('whisper-1'),
+});
+const result = await generateSpeech({
+  model: elevenlabs('eleven_flash_v2'),
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  text: 'Hello, world.',
+  timestamps: true,
+});
+```
-`generateConversation()` produces a single multi-voice audio clip from an ordered array of turns. It picks the best path automatically:
+Whether a given model returns native alignment or transcribes via the STT fallback is a provider detail — both paths produce the same `WordTimestamp[]` shape.
-- **Native dialogue** — when every turn shares one model and that provider has a real multi-speaker dialogue endpoint, the SDK makes a single API call and returns the provider's natural mix. Works with **ElevenLabs v3**, **Google Gemini TTS** (exactly 2 voices), **Hume Octave**, **Fish Audio S2-Pro**, and **fal Dia**.
-- **Stitch fallback** — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls `generateSpeech()` per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
+`generateConversation` accepts the same options and returns `ConversationWordTimestamp[]` — every word carries a `turnIndex: number` pointing back into the input `turns[]`. This is what lets you build chat-bubble UIs, speaker-attributed transcripts, and "who's speaking now?" lookups during playback without re-deriving turn boundaries.
 ```ts
-import { generateConversation } from "@speech-sdk/core/conversation";
+import { generateConversation, timestampsToTurns } from '@speech-sdk/core';
 const result = await generateConversation({
+  model: 'elevenlabs/eleven_v3',
   turns: [
-    { model: "openai/tts-1", voice: "nova", text: "Hi, I'm hosted by OpenAI." },
-    { model: "elevenlabs/eleven_multilingual_v2", voice: "JBFqnCBsd6RMkjVDRZzb", text: "And I'm hosted by ElevenLabs." },
-    { model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
-    { model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
+    { voice: 'rachel', text: 'Hi there.' },
+    { voice: 'adam',   text: 'Hello!' },
   ],
+  timestamps: true,
 });
-result.audio.uint8Array;  // Uint8Array of one combined WAV
-result.audio.mediaType;   // "audio/wav"
+// Collapse consecutive words from the same turn into per-turn timings:
+const turnTimestamps = timestampsToTurns(result.timestamps ?? []);
 ```
-The return type is the standard `SpeechResult`, so it composes with everything else in the SDK.
+### Captions (SRT / WebVTT)
-### Conversation options
+`timestampsToCaptions()` converts word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT.
 ```ts
-generateConversation({
-  model?: string | ResolvedModel,                 // default model for all turns
-  turns: ConversationTurn[],                      // 1..N turns; any number of unique voices
-  gapMs?: number,                                 // silence between turns (stitch path), default 300
-  normalizeVolume?: boolean,                      // RMS-level the output, default true
-  volumeDbfs?: number,                            // RMS target loudness in dBFS (≤0), default -20
-  maxConcurrency?: number,                        // cap parallel generateSpeech calls, default 6
-  maxRetries?: number,                            // per-turn retries, default 2
-  apiKey?: string,
-  providerOptions?: Record<string, unknown>,      // forwarded to every provider; per-turn override available
-  abortSignal?: AbortSignal,
-  headers?: Record<string, string>,
+import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';
+const { timestamps } = await generateSpeech({
+  model: 'elevenlabs/eleven_v3',
+  text: 'Hello world. This is a test.',
+  voice: 'JBFqnCBsd6RMkjVDRZzb',
+  timestamps: true,
 });
-interface ConversationTurn {
-  voice: Voice;                                   // required
-  text: string;                                   // required, non-empty
-  model?: string | ResolvedModel;                 // per-turn override of the top-level model
-  providerOptions?: Record<string, unknown>,      // stitch path only; see note below
-}
+const srt = timestampsToCaptions(timestamps ?? []);
+const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
 ```
-Per-turn `providerOptions` are merged with the top-level `providerOptions` on the stitch path — each turn's underlying `generateSpeech()` call receives `{ ...topLevel, ...turn, ...stitchDefaults }`. On the native-dialogue path the provider renders the whole script in one API call, so per-turn overrides have no well-defined meaning; setting `providerOptions` on any turn throws `ConversationInputError`. Move the options to the top-level `providerOptions` (forwarded once to the dialogue call) instead.
+Cues break on sentence boundaries, then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
-### Volume normalization
+## Volume normalization
-`normalizeVolume: true` (the default) RMS-normalizes the output to an absolute target loudness — broadcast/podcast voice convention — so two `generateConversation` calls produce comparable levels regardless of provider mix or content. The target defaults to **−20 dBFS** (~20 dB of peak headroom), and is configurable via `volumeDbfs` (must be ≤ 0; lower is quieter).
+Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; `-20` is the broadcast/podcast convention).
 ```ts
-await generateConversation({
-  turns: [...],
-  volumeDbfs: -16,           // a touch louder than the default
+const result = await generateSpeech({
+  model: 'openai/gpt-4o-mini-tts',
+  text: 'Hello!',
+  voice: 'alloy',
+  volumeDbfs: -20,
 });
-```
-Normalization runs on **both paths** — stitched multi-provider conversations and single-provider native dialogue. On the native path the SDK transparently asks the provider for its decodable PCM/WAV mode (via `getStitchOptions`), levels the result, and re-encodes as 16-bit mono WAV — so the response `mediaType` becomes `audio/wav` whenever normalization runs. If a native dialogue provider can't emit decodable audio, the request still succeeds but a `warning` is appended explaining that volume normalization was skipped.
+result.audio.mediaType;  // "audio/wav" — re-encoded after normalization
+```
-Pass `normalizeVolume: false` to skip normalization entirely (zero work) and keep the raw provider audio bytes and `mediaType` untouched.
+`generateConversation` always normalizes; override the target with `volumeDbfs`. A warning is surfaced (and the raw mix passes through) if the provider has no decodable PCM/WAV mode.
-### Errors
+## Audio tags
-Conversation-specific errors (re-exported from `@speech-sdk/core/conversation` alongside `generateConversation`, or importable on their own from `@speech-sdk/core/conversation/errors`):
+Bracket syntax `[tag]` adds expressive cues. Each provider handles tags natively where supported, maps them to its closest equivalent, or strips them and surfaces a warning in `result.warnings`.
-| Error | When |
-|---|---|
-| `ConversationInputError` | Validation failure — empty turns, blank text, or a turn missing a model |
-| `DialogueConstraintError` | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
-| `StitchUnsupportedError` | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
+```ts
+await generateSpeech({
+  model: 'elevenlabs/eleven_v3',
+  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
+  voice: 'voice-id',
+});
+```
-### Native dialogue caps
+## Voice cloning
-| Provider | Native dialogue model | Voice constraints |
-|---|---|---|
-| ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 total chars |
-| Google | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts`, `gemini-3.1-flash-tts-preview` | **Exactly 2 voices** (API requirement) |
-| Hume | `octave-1`, `octave-2` | 1–4 voices |
-| Fish Audio | `s2-pro` | 1–4 voices |
-| fal | `dia-tts` | 1–2 voices |
-## Supported Providers
-Use `provider/model` strings. Passing just the provider name uses its default model.
-| Provider | String Prefix | Default Model | Env Var | Docs |
-|---|---|---|---|---|
-| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | [API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech) |
-| [ElevenLabs](https://elevenlabs.io/docs) | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | [API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) |
-| [Deepgram](https://developers.deepgram.com/docs/text-to-speech) | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | [API Reference](https://developers.deepgram.com/docs/tts-models) |
-| [Cartesia](https://docs.cartesia.ai) | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | [API Reference](https://docs.cartesia.ai/api-reference/tts/bytes) |
-| [Hume](https://dev.hume.ai/docs/text-to-speech-tts/overview) | `hume` | `octave-2` | `HUME_API_KEY` | [API Reference](https://dev.hume.ai/reference/text-to-speech-tts/synthesize-json) |
-| [Inworld](https://docs.inworld.ai/tts) | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` | [API Reference](https://docs.inworld.ai/tts/api-reference) |
-| [Google (Gemini TTS)](https://docs.cloud.google.com/text-to-speech/docs/gemini-tts) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | [API Reference](https://ai.google.dev/gemini-api/docs/text-generation) |
-| [Fish Audio](https://docs.fish.audio) | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | [API Reference](https://docs.fish.audio/developer-guide/core-features/text-to-speech) |
-| [Murf](https://murf.ai/api/docs) | `murf` | `GEN2` | `MURF_API_KEY` | [API Reference](https://murf.ai/api/docs/api-reference/text-to-speech/generate) |
-| [Resemble](https://docs.resemble.ai) | `resemble` | `default` | `RESEMBLE_API_KEY` | [API Reference](https://docs.resemble.ai/api-reference/text-to-speech/synthesize) |
-| [fal](https://fal.ai/models) | `fal-ai` | *(user-specified)* | `FAL_API_KEY` | [API Reference](https://fal.ai/models) |
-| [Mistral](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | [API Reference](https://docs.mistral.ai/capabilities/audio/text_to_speech/speech) |
-| [xAI](https://docs.x.ai/docs/models) | `xai` | `grok-tts` | `XAI_API_KEY` | [API Reference](https://docs.x.ai/docs/api-reference#text-to-speech) |
+Some providers support reference-audio cloning. Pass a voice object instead of a string.
 ```ts
-generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
-generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
-generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
-generateSpeech({ model: 'inworld/inworld-tts-1.5-max', text: '...', voice: 'Ashley' });
-generateSpeech({ model: 'openai', text: '...', voice: 'alloy' });  // uses default model
-```
+import { createFal, createMistral } from '@speech-sdk/core/providers';
+// Base64 reference:
+await generateSpeech({
+  model: createMistral()(),
+  text: 'Hello!',
+  voice: { audio: 'base64-encoded-audio...' },
+});
-Provider-specific API parameters can be passed via `providerOptions` — these are sent directly to the provider's API using the API's own field names.
+// URL reference:
+await generateSpeech({
+  model: createFal()('fal-ai/f5-tts'),
+  text: 'Hello!',
+  voice: { url: 'https://example.com/reference.wav' },
+});
+```
-## Custom Configuration
+## Custom configuration
-Use factory functions when you need custom API keys, base URLs, or fetch implementations:
+Factory functions give you custom API keys, base URLs, or `fetch` implementations:
 ```ts
 import { generateSpeech } from '@speech-sdk/core';
-import { createOpenAI } from '@speech-sdk/core/openai';
-import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
+import { createOpenAI } from '@speech-sdk/core/providers';
 const myOpenAI = createOpenAI({
   apiKey: 'sk-...',
   baseURL: 'https://my-proxy.com/v1',
 });
-const result = await generateSpeech({
+await generateSpeech({
   model: myOpenAI('gpt-4o-mini-tts'),
   text: 'Hello!',
   voice: 'alloy',
 });
 ```
-### API Key Resolution
-When using string models (e.g., `'openai/tts-1'`), API keys are resolved from environment variables (see table above). Factory functions accept an explicit `apiKey` option which takes precedence.
-## Audio Tags
+## Public imports
-Use bracket syntax `[tag]` to add expressive audio cues like laughter, sighs, or emotions. Provider support varies — unsupported tags are automatically stripped with warnings returned in `result.warnings`.
+The root package exports the main runtime APIs:
 ```ts
-const result = await generateSpeech({
-  model: 'elevenlabs/eleven_v3',
-  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
-  voice: 'voice-id',
-});
-console.log(result.warnings); // undefined — eleven_v3 supports all tags
+import {
+  generateSpeech,
+  streamSpeech,
+  generateConversation,
+  timestampsToCaptions,
+  ApiError,
+} from '@speech-sdk/core';
 ```
-### Provider behavior
-| Provider | Behavior |
-|---|---|
-| OpenAI (`gpt-4o-mini-tts`) | Tags mapped to the `instructions` field for expressive delivery control |
-| ElevenLabs (`eleven_v3`) | All `[tag]` passed through natively |
-| Google (`gemini-3.1-flash-tts-preview`) | All `[tag]` passed through natively (e.g. `[whispers]`, `[shouting]`, `[sighs]`, `[laugh]`) |
-| Cartesia (`sonic-3`) | Emotion tags (`[happy]`, `[sad]`, `[angry]`, etc.) converted to SSML; `[laughter]` passed through; unknown tags stripped |
-| All others | Tags stripped and warnings returned |
+Provider and STT factories live under `@speech-sdk/core/providers`:
 ```ts
-// OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
-const result = await generateSpeech({
-  model: 'openai/gpt-4o-mini-tts',
-  text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
-  voice: 'alloy',
-});
-// Sent to OpenAI:
-//   input: "Hi John how are you? I'm feeling great"
-//   instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
-console.log(result.warnings); // undefined
+import {
+  createOpenAI,
+  createElevenLabs,
+  createCartesia,
+  createSpeechGateway,
+} from '@speech-sdk/core/providers';
 ```
-## Voice Cloning
-Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
+Public types live under `@speech-sdk/core/types`:
 ```ts
-import { createMistral } from '@speech-sdk/core/mistral';
-const mistral = createMistral();
-// Clone from base64 audio
-const result = await generateSpeech({
-  model: mistral(),
-  text: 'Hello!',
-  voice: { audio: 'base64-encoded-audio...' },
-});
+import type {
+  GenerateSpeechOptions,
+  SpeechResult,
+  ConversationResult,
+  Voice,
+  WordTimestamp,
+} from '@speech-sdk/core/types';
 ```
-Clone from a URL (fal):
-```ts
-import { createFal } from '@speech-sdk/core/fal-ai';
-const fal = createFal();
-const result = await generateSpeech({
-  model: fal('fal-ai/chatterbox'),
-  text: 'Hello!',
-  voice: { url: 'https://example.com/reference.wav' },
-});
-```
-## Options
+## API reference
 ```ts
 generateSpeech({
-  model: string | ResolvedModel,  // required
-  text: string,                   // required
-  voice: Voice,                   // required
-  providerOptions?: object,       // provider-specific API params
-  maxRetries?: number,            // default: 2 (retries on 5xx/network errors)
-  abortSignal?: AbortSignal,      // cancel the request
-  headers?: Record<string, string>, // additional HTTP headers
-});
-```
-## Result
+  model: string | ResolvedModel,          // required
+  text: string,                           // required
+  voice: Voice,                           // required — string | { url } | { audio }
+  providerOptions?: object,
+  volumeDbfs?: number,                    // target level in dBFS, ≤ 0
+  timestamps?: boolean,                   // default false
+  maxRetries?: number,                    // default 2
+  abortSignal?: AbortSignal,
+  headers?: Record<string, string>,
+}): Promise<SpeechResult>
-```ts
 interface SpeechResult {
-  audio: {
-    uint8Array: Uint8Array;   // raw audio bytes
-    base64: string;           // base64 encoded (lazy)
-    mediaType: string;        // e.g. "audio/mpeg"
-  };
+  audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
+  metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
+  timestamps?: WordTimestamp[];
   providerMetadata?: Record<string, unknown>;
+  warnings?: string[];
+}
+interface WordTimestamp { text: string; start: number; end: number }  // seconds
+// Returned by generateConversation — extends WordTimestamp with turnIndex
+interface ConversationWordTimestamp extends WordTimestamp {
+  turnIndex: number;  // index into the input turns[] array
 }
 ```
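Because each `ConversationWordTimestamp` carries a `turnIndex`, word timings can be regrouped per turn. The sketch below redeclares the two interfaces from the reference above so it runs on its own; `groupByTurn` is a hypothetical helper, not an SDK export.

```ts
interface WordTimestamp { text: string; start: number; end: number } // seconds
interface ConversationWordTimestamp extends WordTimestamp { turnIndex: number }

// Hypothetical helper: collect words per turn, preserving their input order.
function groupByTurn(
  words: ConversationWordTimestamp[],
): Map<number, ConversationWordTimestamp[]> {
  const byTurn = new Map<number, ConversationWordTimestamp[]>();
  for (const word of words) {
    const bucket = byTurn.get(word.turnIndex) ?? [];
    bucket.push(word);
    byTurn.set(word.turnIndex, bucket);
  }
  return byTurn;
}

// Illustrative data only; real values come from a conversation result.
const sampleWords: ConversationWordTimestamp[] = [
  { text: 'Hi', start: 0.0, end: 0.3, turnIndex: 0 },
  { text: 'there', start: 0.3, end: 0.7, turnIndex: 0 },
  { text: 'Hello', start: 1.0, end: 1.4, turnIndex: 1 },
];
const grouped = groupByTurn(sampleWords);
console.log(grouped.get(0)?.map((w) => w.text).join(' '));
```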
-## Error Handling
+## Error handling
 ```ts
-import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';
+import { generateSpeech, ApiError } from '@speech-sdk/core';
 try {
-  const result = await generateSpeech({ ... });
+  await generateSpeech({ /* ... */ });
 } catch (error) {
   if (error instanceof ApiError) {
-    console.log(error.statusCode);  // 401
-    console.log(error.model);       // "openai/gpt-4o-mini-tts"
-    console.log(error.responseBody);
+    error.statusCode;    // 401, 429, 500, ...
+    error.responseBody;
+    error.code;          // stable machine-readable code (optional)
   }
 }
 ```
+`ApiError.code` is populated from the RFC 7807 `application/problem+json` `code` extension when the upstream provides one (currently only the Speech Gateway). Match on `error.code` rather than on `error.message` text: codes are a stable contract, messages aren't.
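As a minimal sketch of branching on the code rather than on message text: the snippet uses a local stand-in class instead of the SDK's `ApiError` so that it runs on its own, and the code strings (`rate_limited`, `invalid_api_key`) are hypothetical placeholders, not values documented for the Speech Gateway.

```ts
// Local stand-in for the SDK's ApiError so the sketch is self-contained.
// Only the statusCode and code fields from the README are modelled.
class ApiErrorLike extends Error {
  statusCode: number;
  code?: string;
  constructor(message: string, statusCode: number, code?: string) {
    super(message);
    this.statusCode = statusCode;
    this.code = code;
  }
}

type Action = 'retry-later' | 'fix-credentials' | 'give-up';

// Branch on the machine-readable code first; fall back to the HTTP status
// when the upstream did not supply one. The code strings are hypothetical.
function decideAction(error: ApiErrorLike): Action {
  switch (error.code) {
    case 'rate_limited':
      return 'retry-later';
    case 'invalid_api_key':
      return 'fix-credentials';
  }
  if (error.statusCode === 429 || error.statusCode >= 500) return 'retry-later';
  if (error.statusCode === 401) return 'fix-credentials';
  return 'give-up';
}

console.log(decideAction(new ApiErrorLike('Too many requests', 429)));
```

The message never enters the decision, so a provider rewording its error text would not change the outcome.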
 | Error | When |
 |---|---|
-| `ApiError` | Provider API returns a non-2xx response |
-| `NoSpeechGeneratedError` | Provider returned empty audio |
-| `SpeechSDKError` | Base class for all errors |
-## Retry
+| `ApiError` | Provider returned non-2xx |
+| `MissingApiKeyError` | No `apiKey` passed and the provider's env var is unset |
+| `NoSpeechGeneratedError` | Empty input (after tag stripping) or empty provider response |
+| `StreamingNotSupportedError` | `streamSpeech()` on a non-streaming model |
+| `VolumeAdjustmentUnsupportedError` | `volumeDbfs` with no decodable output mode |
+| `TimestampKeyMissingError` | `timestamps: true` with no native support, no `fallbackSTT` configured, and `OPENAI_API_KEY` not set |
+| `ConversationInputError` / `DialogueConstraintError` / `StitchUnsupportedError` | `generateConversation` validation / native caps / stitch incompatibility |
+| `SpeechSDKError` | Base class |
-Built-in retry with exponential backoff via [p-retry](https://github.com/sindresorhus/p-retry). Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
+Retries 5xx and network errors with exponential backoff ([p-retry](https://github.com/sindresorhus/p-retry)); does not retry 4xx. Default 2 retries; override via `maxRetries`.
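The retry policy above can be illustrated with a rough sketch. It is not the SDK's or p-retry's implementation: `isRetryable` and `backoffDelaysMs` are hypothetical helpers, and the 1000 ms base delay and growth factor of 2 are assumed values chosen for illustration.

```ts
// Hypothetical helper: 5xx responses and network failures (no HTTP status)
// are retried; 4xx and other responses are not.
function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true; // network error, no HTTP response
  return status >= 500 && status <= 599;
}

// Delay before each retry: baseMs * factor^attempt.
// The base delay and factor are assumptions, not the SDK's settings.
function backoffDelaysMs(maxRetries = 2, baseMs = 1000, factor = 2): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * factor ** attempt);
}

console.log(isRetryable(503), isRetryable(404));
console.log(backoffDelaysMs());
```

With the default of 2 retries, a request is attempted at most three times in total.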
 ## Development
 ```bash
 pnpm install
-pnpm test                       # unit tests
-pnpm run test:e2e               # e2e tests (requires API keys)
-pnpm run typecheck              # type-check without emitting
-```
-E2E tests hit real provider APIs. Set the relevant API key environment variables in a `.env` file or export them in your shell.
-Set `SPEECH_SDK_E2E_OUTPUT_DIR` to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
-```bash
-SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
+pnpm test              # unit tests
+pnpm run test:e2e      # e2e tests (requires provider API keys)
+pnpm run typecheck
+pnpm fix               # format + lint
 ```
-## License
-MIT
+E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.